Bellman Equation
Sources:
- Shiyu Zhao. Chapter 2: State Values and Bellman Equation. Mathematical Foundations of Reinforcement Learning.
- YouTube: Bellman Equation: Derivation
Problem Formulation
Given a policy, finding the corresponding state values is called policy evaluation. This is done by solving an equation called the Bellman equation.
State Value
The state value is defined as the expectation (or expected value, mean) of the return $G_t$ obtained when starting from a given state:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$

Remarks:
- It is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.
- Since $G_t$ is based on some policy $\pi$, its expectation is also based on $\pi$, i.e., for a different policy, the state value may be different.
- It represents the "value" of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.
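As a quick numerical illustration (my own sketch, not from the source), the following Python snippet computes the discounted return for a hypothetical reward sequence and checks the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ that the derivation below relies on; the rewards and $\gamma = 0.9$ are made-up values.

```python
# Minimal sketch (not from the source): discounted return of a finite,
# hypothetical reward sequence, and a check of G_t = R_{t+1} + gamma * G_{t+1}.

gamma = 0.9                      # assumed discount factor
rewards = [0, 0, 1, 1, 1]        # hypothetical rewards R_{t+1}, R_{t+2}, ...

def discounted_return(rewards, gamma):
    """Sum_{k>=0} gamma^k * R_{t+k+1} for a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

G_t  = discounted_return(rewards, gamma)          # return starting at time t
G_t1 = discounted_return(rewards[1:], gamma)      # return starting at time t+1

# The recursive form used in the derivation of the Bellman equation:
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12
print(G_t)   # 0 + 0*0.9 + 1*0.81 + 1*0.729 + 1*0.6561 = 2.1951
```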
Derivation of the Bellman equation
Recalling the definition of the state value and the recursive form of the return, $G_t = R_{t+1} + \gamma G_{t+1}$, we have

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]
\end{aligned}
$$

The two terms in the last line are analyzed below.
First term: the mean of immediate rewards
First, calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:

$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r
$$

This is the mean of immediate rewards.
Explanation
From the law of total expectation: given discrete random variables $X$ and $Y$, we have $\mathbb{E}[X] = \mathbb{E}\big[\mathbb{E}[X \mid Y]\big]$. If the alphabet $\mathcal{Y}$ of $Y$ is finite, this becomes

$$\mathbb{E}[X] = \sum_{y \in \mathcal{Y}} \mathbb{E}[X \mid Y = y]\, p(y)$$

Applying the conditional version of this law with $X = R_{t+1}$ and $Y = A_t$, we obtain

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}} \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]\, \pi(a \mid s)$$

where the action set $\mathcal{A}$ is finite. From the definition of expectation, $\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} p(r \mid s, a)\, r$, leading to

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r \qquad \text{Q.E.D.}$$
A more verbose version of the deduction is as follows. First, consider the definition of the conditional expectation:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} r\, p(r \mid s)$$

By the law of total probability, replace $p(r \mid s)$ with $\sum_{a} p(r \mid s, a)\, \pi(a \mid s)$. Therefore,

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} r \sum_{a} p(r \mid s, a)\, \pi(a \mid s)$$

Exchanging the order of summation gives

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r \qquad \text{Q.E.D.}$$
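To make the first term concrete, here is a small Python sketch (my own illustration, not from the source) that evaluates $\sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$ for a single state with hypothetical distributions and compares it against a Monte Carlo estimate of $\mathbb{E}[R_{t+1} \mid S_t = s]$.

```python
import random

# Hypothetical distributions for one state s (made-up numbers, not from the source).
pi = {"a1": 0.5, "a2": 0.5}              # pi(a | s)
p_r = {                                  # p(r | s, a)
    "a1": {0: 1.0},
    "a2": {-1: 0.2, 1: 0.8},
}

# Analytical first term: sum_a pi(a|s) * sum_r p(r|s,a) * r
analytic = sum(pi[a] * sum(p * r for r, p in p_r[a].items()) for a in pi)

# Monte Carlo estimate of E[R_{t+1} | S_t = s] under the same distributions.
random.seed(0)
total = 0.0
n_samples = 100_000
for _ in range(n_samples):
    a = random.choices(list(pi), weights=pi.values())[0]       # sample action from pi(.|s)
    r = random.choices(list(p_r[a]), weights=p_r[a].values())[0]  # sample reward from p(.|s,a)
    total += r

print(analytic)            # 0.5*0 + 0.5*(-1*0.2 + 1*0.8) = 0.3
print(total / n_samples)   # close to 0.3
```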
Second term: the (discounted) mean of future rewards
First, we calculate the mean of future rewards, $\mathbb{E}[G_{t+1} \mid S_t = s]$:

$$
\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s')\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_{a} p(s' \mid s, a)\, \pi(a \mid s)
\end{aligned}
$$

Then we multiply it by the discount factor $\gamma$ to obtain the second term, the discounted mean of future rewards.
Explanation
The transition to the first line comes from the law of total expectation as well.
The transition to the second line comes from the Markov property: the future return $G_{t+1}$ depends only on the state $S_{t+1}$, not on the earlier state $S_t$, i.e., $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$.
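The decomposition above translates directly into code. Below is a brief sketch (hypothetical numbers, not from the source) that computes the marginal transition probability $p(s' \mid s) = \sum_a p(s' \mid s, a)\, \pi(a \mid s)$ and then the second term $\gamma \sum_{s'} p(s' \mid s)\, v_\pi(s')$, assuming the values $v_\pi(s')$ are already known.

```python
# Sketch of the second term for one state s (all numbers are hypothetical).
gamma = 0.9
pi = {"a1": 0.5, "a2": 0.5}                            # pi(a | s)
p_s = {                                                # p(s' | s, a)
    "a1": {"s2": 1.0},
    "a2": {"s2": 0.2, "s3": 0.8},
}
v = {"s2": 10.0, "s3": 8.0}                            # assumed known v_pi(s')

# Marginal transition probability p(s'|s) = sum_a p(s'|s,a) * pi(a|s)
p_marginal = {}
for a, probs in p_s.items():
    for s_next, p in probs.items():
        p_marginal[s_next] = p_marginal.get(s_next, 0.0) + p * pi[a]

# Second term: gamma * sum_{s'} p(s'|s) * v_pi(s')
second_term = gamma * sum(p_marginal[s_next] * v[s_next] for s_next in p_marginal)
print(p_marginal)   # {'s2': 0.6, 's3': 0.4}
print(second_term)  # 0.9 * (0.6*10 + 0.4*8) = 8.28
```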
Bellman equation
Therefore, we have the Bellman equation:

$$
v_\pi(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]
= \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s'), \qquad \forall s \in \mathcal{S}
$$
- It consists of two terms:
- First term: the mean of immediate rewards
- Second term: the (discounted) mean of future rewards
- The Bellman equation is a set of linear equations that describe the relationships between the values of all the states.
- The above elementwise form is valid for every state $s \in \mathcal{S}$. That means there are $|\mathcal{S}|$ equations like this, one per state; stacking them gives a linear system that can be solved for all the state values, as in the sketch below.
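The following Python sketch (my own, not from the source) shows one way to do this: build the expected immediate-reward vector $r_\pi$ and the policy-induced transition matrix $P_\pi$ from $\pi(a \mid s)$, the expected rewards $\sum_r p(r \mid s, a)\, r$, and $p(s' \mid s, a)$, then solve the resulting linear system $(I - \gamma P_\pi)\, v_\pi = r_\pi$ with NumPy. The array names and the tiny two-state MDP at the end are made up for illustration.

```python
import numpy as np

def solve_bellman(pi, r_sa, p_sas, gamma):
    """Solve the Bellman equation v = r_pi + gamma * P_pi @ v for a finite MDP.

    pi    : (n_states, n_actions)            policy pi(a|s)
    r_sa  : (n_states, n_actions)            expected reward sum_r p(r|s,a) * r
    p_sas : (n_states, n_actions, n_states)  transition probabilities p(s'|s,a)
    """
    r_pi = np.einsum("sa,sa->s", pi, r_sa)      # r_pi[s]     = sum_a pi(a|s) * r(s,a)
    P_pi = np.einsum("sa,sat->st", pi, p_sas)   # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)
    n = r_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny hypothetical 2-state, 2-action MDP (made-up numbers).
pi    = np.array([[0.5, 0.5],
                  [1.0, 0.0]])
r_sa  = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
p_sas = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.0, 1.0], [0.5, 0.5]]])
print(solve_bellman(pi, r_sa, p_sas, gamma=0.9))   # approximately [9.09, 10.]
```

Solving the system directly like this is only practical for small state spaces; the algorithms presented later avoid forming and inverting the full matrix.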
Examples
Refer to the grid world example for the notations.
For a deterministic policy

Consider the first example shown in Figure 2.4, where the policy is deterministic. We next write out the Bellman equation and then solve the state values from it.
First, consider state $s_1$. Under the policy, the probabilities of taking the actions are $\pi(a = a_3 \mid s_1) = 1$ and $\pi(a \neq a_3 \mid s_1) = 0$. The state transition probabilities are $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' \neq s_3 \mid s_1, a_3) = 0$. The reward probabilities are $p(r = 0 \mid s_1, a_3) = 1$ and $p(r \neq 0 \mid s_1, a_3) = 0$. Substituting these values into the Bellman equation mentioned before gives

$$v_\pi(s_1) = 0 + \gamma v_\pi(s_3)$$

Similarly, it can be obtained that

$$v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_4) = 1 + \gamma v_\pi(s_4)$$

We can solve the state values from these equations. Since the equations are simple, we can solve them manually; more complicated equations can be solved by the algorithms presented later. Here, the state values can be solved as

$$v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = \frac{\gamma}{1-\gamma}$$
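As a quick numerical check (my own sketch; the choice $\gamma = 0.9$ is for illustration), the four equations above can be written as a $4 \times 4$ linear system $(I - \gamma P_\pi)\, v = r_\pi$ and solved with NumPy, which reproduces the closed-form expressions $v_\pi(s_1) = \gamma/(1-\gamma)$ and $v_\pi(s_2) = v_\pi(s_3) = v_\pi(s_4) = 1/(1-\gamma)$.

```python
import numpy as np

gamma = 0.9   # discount factor chosen for illustration

# Deterministic-policy example, state order [s1, s2, s3, s4]:
#   v(s1) = 0 + gamma*v(s3),  v(s2) = 1 + gamma*v(s4),
#   v(s3) = 1 + gamma*v(s4),  v(s4) = 1 + gamma*v(s4).
P_pi = np.array([
    [0, 0, 1, 0],   # s1 -> s3
    [0, 0, 0, 1],   # s2 -> s4
    [0, 0, 0, 1],   # s3 -> s4
    [0, 0, 0, 1],   # s4 -> s4
], dtype=float)
r_pi = np.array([0.0, 1.0, 1.0, 1.0])

v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)   # [ 9. 10. 10. 10.]  =  [gamma/(1-gamma), 1/(1-gamma), 1/(1-gamma), 1/(1-gamma)]
```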
For a stochastic policy

Consider the second example shown in Figure 2.5, where the policy is stochastic. We next write out the Bellman equation and then solve the state values from it.
At state $s_1$, the probabilities of going right and going down are both 0.5. Mathematically, we have $\pi(a = a_2 \mid s_1) = 0.5$ and $\pi(a = a_3 \mid s_1) = 0.5$. The state transition probabilities are deterministic: $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' = s_2 \mid s_1, a_2) = 1$. The reward probabilities are also deterministic: $p(r = 0 \mid s_1, a_3) = 1$ and $p(r = -1 \mid s_1, a_2) = 1$. Substituting these values into the Bellman equation gives

$$v_\pi(s_1) = 0.5\left[0 + \gamma v_\pi(s_3)\right] + 0.5\left[-1 + \gamma v_\pi(s_2)\right]$$

Similarly, it can be obtained that

$$v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_4) = 1 + \gamma v_\pi(s_4)$$

The state values can be solved from the above equations. Since the equations are simple, we can solve the state values manually and obtain

$$v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = -0.5 + \frac{\gamma}{1-\gamma}$$

Furthermore, if we set $\gamma = 0.9$, then

$$v_\pi(s_4) = 10, \qquad v_\pi(s_3) = 10, \qquad v_\pi(s_2) = 10, \qquad v_\pi(s_1) = -0.5 + 9 = 8.5$$

Since $v_\pi(s_1)$ is smaller here than in the deterministic example, this stochastic policy is worse at state $s_1$.
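The same values can also be obtained iteratively, which previews the algorithms mentioned earlier. The sketch below (my own, with $\gamma = 0.9$ as above) repeatedly applies the update $v \leftarrow r_\pi + \gamma P_\pi v$ for the stochastic-policy example until it converges to approximately $(8.5, 10, 10, 10)$.

```python
import numpy as np

gamma = 0.9   # discount factor used for the numbers above

# Stochastic-policy example, state order [s1, s2, s3, s4]:
#   v(s1) = 0.5*(0 + gamma*v(s3)) + 0.5*(-1 + gamma*v(s2)),
#   v(s2) = v(s3) = v(s4) = 1 + gamma*v(s4).
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 0.5, 0.5, 0.0],   # s1: half to s2, half to s3
    [0.0, 0.0, 0.0, 1.0],   # s2 -> s4
    [0.0, 0.0, 0.0, 1.0],   # s3 -> s4
    [0.0, 0.0, 0.0, 1.0],   # s4 -> s4
])

v = np.zeros(4)
for _ in range(1000):               # fixed-point iteration (converges since gamma < 1)
    v = r_pi + gamma * P_pi @ v
print(np.round(v, 3))               # [ 8.5 10.  10.  10. ]
```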
In this article we only deal with the discounted return, without loss of generality, since the undiscounted return can simply be treated as a special case of the discounted return with $\gamma = 1$.↩︎