Bellman Optimality Equation
- Shiyu Zhao. Chapter 3: Bellman Optimality Equation. Mathematical Foundations of Reinforcement Learning.
- --> Youtube: Bellman Optimality Equation
- --> Youtube: How to Solve Bellman Optimality Equation
The previous chapter introduced the Bellman equation of any given policy. The present post introduces the Bellman optimality equation, which is a special Bellman equation whose corresponding policy is optimal. By solving
Optimal Policy
The state value could be used to evaluate if a policy is good or not: if
A policy
The definition leads to many questions:
- Does the optimal policy exist?
- Is the optimal policy unique?
- Is the optimal policy stochastic or deterministic?
- How to obtain the optimal policy?
To answer these questions, we study the Bellman optimality equation.
BOE (elementwise form)
The tool for analyzing optimal policies and optimal state values is the Bellman optimality equation (BOE). By solving this equation, we can obtain optimal policies and optimal state values.
For every
+ $$
Here,
To get optimal state values (or optimal state value function) and optimal policies, we keep calculating BOE for each state
Like before, the first term
Remarks:
and are known. are unknown and to be calculated.
Maximize on the right-hand side of BOE
Here we talk about how to solve BOE. In practice we need to deal with the matrix-vector form since that is what we're faced with. But since each row in the matrix is actually a vector of the elementwise form, we start with the element form.
In fact, we can turn the problem into "solve the optimal
Example 3.1. Consider two unknown variables
The first step is to solve
We now turn to the maximization problem on the right-hand side of the BOE. The (elementwise) BOE can be written as
Inspired by Example 3.1, we can first solve the optimal
Example 3.2. Given
Inspired by the above example, considering that
Now that we know the solution of BOE is to maximize the right-hand side, and we know how to do it as well --- just select the action with the largest action value. But we don't know action value or state value at this time, so this method doesn't work.
In fact, the solution of BOE derives from the contraction mapping theorem (see later) on the matrix-vector form. That's an iterative method
So why we introduce
Matrix-vector form of the BOE
To leverage the contraction mapping theorem, we'll express the matrix-vector form as
Since the optimal value of
where
Then, the BOE can be expressed in a concise form as
Every row
Contraction mapping theorem
Now that the matrix-vector form is expressed as a nonlinear equation
Concepts: Fixed point and Contraction mapping
Fixed point:
Contraction mapping (or contractive function):
Example1
Givn euqation:
Moreover,
Example2
Given
Therefore,
Theorem: Contraction Mapping Theorem
Theorem (Contraction Mapping Theorem): For any equation that has the form of
- Existence: there exists a fixed point
satisfying . - Uniqueness: The fixed point
is unique. - Algorithm: Consider a sequence
where , then as . Moreover, the convergence rate is exponentially fast.
Contraction property of BOE
Theorem (Contraction Property):
Solution of the BOE
Due to the contraction property of BOE, the matrix-vector form can be solved by computing following equation iteratively
At every iteration, for each state, what we face is actually the elementwise form:
As you can see,
Procedure summary (value iteration algorithm):
For every
, estimate(randomly select) current state value asFor any
, calculateCalculate the greedy policy
for every as where .Calculate
Example
Example: Manually solve the BOE.

- Actions:
represent go left, stay unchanged, and go right. - Reward: entering the target area: +1 ; try to go out of boundary -1.
The values of
q-value table | |||
---|---|---|---|
Consider
Our objective is to find
v-value: select
q-value (using the previous table):
q-value table | |||
---|---|---|---|
-1 | 0 | 1 | |
0 | 1 | 0 | |
1 | 0 | -1 |
Greedy policy (select the greatest q-value)
With
q-value table | |||
---|---|---|---|
-0.1 | 0.9 | 1.9 | |
0.9 | 1.9 | 0.9 | |
1.9 | 0.9 | -0.1 |
Greedy policy (select the greatest q-value):
The policy is the same as the previous one, which is already optimal. v-value:
BOE: Optimality
Suppose
Suppose
Then
Therefore,
Now we prove
Theorem (Policy Optimality):
Suppose that
What does look like?
For any
It is clear that
Theorem: Optimal policy invariance
Theorem (Optimal policy invariance):
Consider a Markov decision process with
Consequently, the optimal policy derived from
Appendix
Proof of the contraction mapping theorem
Part 1
We prove that the consequence
The proof relies on Cauchy sequences:
A sequence
It is guaranteed that a Cauchy sequence converges to a limit.
Note that we must have
We next show that
First, since
Similarly, we have
Since
Notably, the convergence of
Therefore, we need to further consider
As a result, for any
We show that the limit
Part 3
We show that the fixed point is unique. Suppose that there is another fixed point
Proof of the contraction property of BOE
Consider any two vectors
Therefore,
Define
It then follows that
Since
Similarly, we have
Substituting this inequality to
Q.E.D.
Proof of Theorem: Policy Optimality
For any policy
Since
Repeatedly applying the above inequality gives
Proof of Theorem: Optimal policy invariance
For any policy
If
We next solve the new BOE in
Since
Finally, since
Hence, the greedy optimal policy derived from