Bellman Equation

Sources:

  1. Shiyu Zhao. Chapter 2: State Values and Bellman Equation. Mathematical Foundations of Reinforcement Learning.
  2. YouTube: Bellman Equation: Derivation

Problem Formulation

Given a policy, finding the corresponding state values is called policy evaluation. This is done by solving an equation called the Bellman equation.

State Value

The expectation (or expected value, mean) of $G_t$ is defined as the state-value function or simply state value:
$$
v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] \tag{1}
$$

Remarks:

  • It is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.
  • Since $G_t$ is based on some policy $\pi$, its expectation $v_\pi(s)$ is also based on that policy, i.e., for a different policy, the state value may be different.
  • It represents the "value" of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.
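
To make definition (1) concrete, below is a minimal Monte Carlo sketch that estimates $v_\pi(s)$ by averaging sampled returns. The two-state dynamics, rewards, and policy baked into `step` are made up purely for illustration; only the estimator itself reflects the definition.

```python
import random

GAMMA, HORIZON, EPISODES = 0.9, 200, 20_000   # truncate the infinite return; gamma^200 ~ 0

def step(state, rng):
    """Hypothetical two-state dynamics with the policy already baked in (made up)."""
    if state == 0:
        # with prob 0.5 move to state 1 (reward 0), otherwise stay (reward 1)
        return (0.0, 1) if rng.random() < 0.5 else (1.0, 0)
    return 1.0, 1                              # state 1: reward 1 forever

def sample_return(start, rng):
    """One sampled discounted return G_t with S_t = start."""
    g, discount, s = 0.0, 1.0, start
    for _ in range(HORIZON):
        r, s = step(s, rng)
        g += discount * r
        discount *= GAMMA
    return g

def estimate_state_value(start, seed=0):
    """Monte Carlo estimate of v_pi(start) = E[G_t | S_t = start]."""
    rng = random.Random(seed)
    return sum(sample_return(start, rng) for _ in range(EPISODES)) / EPISODES

for s in (0, 1):
    print(f"estimated v_pi({s}) ~= {estimate_state_value(s):.3f}")   # ~9.09 and ~10.0
```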

Derivation of the Bellman equation

Recalling the definition of the state value (1), we substitute $G_t = R_{t+1} + \gamma G_{t+1}$ into it¹:

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s]
\end{aligned}
$$

The two terms in the last line are analyzed below.

First term: the mean of immediate rewards

First, calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r.
$$

This is the mean of immediate rewards.

Explanation

Given the events $R_{t+1} = r$, $S_t = s$, and $A_t = a$, the deduction is quite simple.

  1. From the law of total expectation: given discrete random variables $X$ and $Y$, we have $\mathbb{E}(X) = \mathbb{E}\big(\mathbb{E}(X \mid Y)\big)$. If $Y$ is finite, we have $\mathbb{E}(X) = \sum_{y_i \in \mathcal{Y}} \mathbb{E}(X \mid Y = y_i)\, p(Y = y_i)$, where $\mathcal{Y}$ is the alphabet of $Y$.

    Thus we obtain $\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, where $A_t$ takes values in the finite set $\mathcal{A}(s)$.

  2. From the definition of expectation, $\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} p(r \mid s, a)\, r$, leading to $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$. Q.E.D.


A more verbose version of the deduction is:

First, consider the definition of expectation:
$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} p(r \mid s)\, r. \tag{2}
$$
Next, use the formula for marginal probability, $p(r \mid s) = \sum_{a \in \mathcal{A}(s)} p(r, a \mid s)$. Due to the chain rule of probability, we obtain $p(r, a \mid s) = p(r \mid a, s)\, p(a \mid s)$. In the RL context, $p(a \mid s)$ is often written as $\pi(a \mid s)$.

Therefore, $p(r \mid s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, p(r \mid a, s)$.

Replacing $p(r \mid s)$ in (2) with $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, p(r \mid a, s)$, we get
$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, p(r \mid s, a)\, r = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a].
$$

Q.E.D.
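
As a quick numerical sanity check of this decomposition, the snippet below computes $\mathbb{E}[R_{t+1} \mid S_t = s]$ both from the marginal $p(r \mid s)$ and from the action-wise sum; the distributions $\pi(a \mid s)$ and $p(r \mid s, a)$ are made-up numbers for a single state.

```python
# Hypothetical distributions for a single state s: two actions, two possible rewards.
pi = {"a1": 0.3, "a2": 0.7}                      # pi(a | s)
p_r = {                                          # p(r | s, a)
    "a1": {0.0: 0.8, 1.0: 0.2},
    "a2": {0.0: 0.1, 1.0: 0.9},
}

# Left-hand side: E[R_{t+1} | s] from the marginal p(r | s) = sum_a pi(a|s) p(r|s,a).
p_r_marginal = {}
for a, probs in p_r.items():
    for r, p in probs.items():
        p_r_marginal[r] = p_r_marginal.get(r, 0.0) + pi[a] * p
lhs = sum(p * r for r, p in p_r_marginal.items())

# Right-hand side: sum_a pi(a|s) * E[R_{t+1} | s, a].
rhs = sum(pi[a] * sum(p * r for r, p in p_r[a].items()) for a in pi)

print(lhs, rhs)                  # both 0.3*0.2 + 0.7*0.9 = 0.69
assert abs(lhs - rhs) < 1e-12
```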

Second term: the (discounted) mean of future rewards

First, we calculate the mean of future rewards:
$$
\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s')\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s') \sum_{a} p(s' \mid s, a)\, \pi(a \mid s)
\end{aligned}
$$

Then we multiply it by the discount factor $\gamma$. For simplicity, we still call the second term $\gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$ "the mean of future rewards", although it is discounted.

Explanation

The first equality comes from the law of total expectation as well.

The second equality comes from the Markov property. Recall that the reward $R_{t+1}$ is defined to rely only on $s_t$ and $a_t$: $p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t)$. We obtain that
$$
\mathbb{E}[R_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0] \triangleq \sum_{r_{t+1} \in \mathcal{R}} p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)\, r_{t+1} = \sum_{r_{t+1} \in \mathcal{R}} p(r_{t+1} \mid s_t, a_t)\, r_{t+1} \triangleq \mathbb{E}[R_{t+1} \mid s_t, a_t].
$$
Since $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, we have $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots$, and every reward in this sum depends only on the states and actions from time $t+1$ onward. Therefore,
$$
\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \mid S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s'].
$$
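
The Markov-property step can also be checked by simulation. In the sketch below the two-state dynamics are made up; the point is only that adding $S_t$ to the condition does not change the mean of $G_{t+1}$ once $S_{t+1}$ is given.

```python
import random

GAMMA, HORIZON, EPISODES = 0.9, 200, 20_000

def step(state, rng):
    """Hypothetical two-state dynamics under a fixed policy (made up for this check)."""
    if state == 0:
        return (0.0, 1) if rng.random() < 0.5 else (1.0, 0)
    return 1.0, 1          # state 1: reward 1, stays in state 1

def return_from(state, rng):
    """Discounted return with the discount clock started at `state`'s time step."""
    g, discount, s = 0.0, 1.0, state
    for _ in range(HORIZON):
        r, s = step(s, rng)
        g += discount * r
        discount *= GAMMA
    return g

rng = random.Random(0)

# E[G_{t+1} | S_t = 0, S_{t+1} = 1]: start in state 0, keep only runs whose first
# transition lands in state 1, then measure the return generated from time t+1 on.
lhs_samples = []
while len(lhs_samples) < EPISODES:
    _reward, s_next = step(0, rng)
    if s_next == 1:
        lhs_samples.append(return_from(s_next, rng))
lhs = sum(lhs_samples) / len(lhs_samples)

# E[G_{t+1} | S_{t+1} = 1] = v_pi(1): start directly in state 1.
rhs = sum(return_from(1, rng) for _ in range(EPISODES)) / EPISODES

print(f"with S_t in the condition: {lhs:.3f}   with S_t+1 only: {rhs:.3f}")  # both ~10
```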

Bellman equation

Therefore, we have the Bellman equation,
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s] \\
&= \underbrace{\sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r}_{\text{mean of immediate rewards}} + \underbrace{\gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')}_{\text{(discounted) mean of future rewards}} \\
&= \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big], \qquad \forall s \in \mathcal{S}.
\end{aligned}
$$

  • It consists of two terms:
    1. First term: the mean of immediate rewards
    2. Second term: the (discounted) mean of future rewards
  • The Bellman equation is a set of linear equations that describe the relationships between the values of all the states; a numerical sketch of solving it is given after this list.
  • The above elementwise form is valid for every state $s \in \mathcal{S}$. That means there are $|\mathcal{S}|$ equations like this!
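
Because the equations are linear, the $|\mathcal{S}|$ elementwise equations can be stacked into the matrix-vector form $v_\pi = r_\pi + \gamma P_\pi v_\pi$ and solved directly as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. A minimal NumPy sketch with a made-up three-state $r_\pi$ and $P_\pi$:

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state example (all numbers made up):
# r_pi[s]     = sum_a pi(a|s) * sum_r p(r|s,a) * r    (mean immediate reward in s)
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)             (state-transition matrix under pi)
r_pi = np.array([0.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Stacking the elementwise equations gives v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi, one linear system with |S| equations.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)                                            # [9. 10. 10.]

# Sanity check: v satisfies the elementwise Bellman equation.
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```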

Examples

Refer to the grid world example for the notations.

For deterministic policy

Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.

Consider the first example shown in Figure 2.4, where the policy is deterministic. We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy, the probabilities of taking the actions are $\pi(a = a_3 \mid s_1) = 1$ and $\pi(a \neq a_3 \mid s_1) = 0$. The state transition probabilities are $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' \neq s_3 \mid s_1, a_3) = 0$.

The reward probabilities are $p(r = 0 \mid s_1, a_3) = 1$ and $p(r \neq 0 \mid s_1, a_3) = 0$.

Substituting these values into the Bellman equation mentioned before,
$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big], \qquad \forall s \in \mathcal{S},
$$
gives
$$
v_\pi(s_1) = 0 + \gamma v_\pi(s_3).
$$

Similarly, it can be obtained that
$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_4) = 1 + \gamma v_\pi(s_4).
$$

We can solve the state values from these equations. Since the equations are simple, we can solve them manually. More complicated equations can be solved by the algorithms presented later. Here, the state values can be solved as
$$
v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = \frac{\gamma}{1-\gamma}.
$$
Furthermore, if we set $\gamma = 0.9$, then
$$
v_\pi(s_4) = \frac{1}{1-0.9} = 10, \qquad v_\pi(s_3) = \frac{1}{1-0.9} = 10, \qquad v_\pi(s_2) = \frac{1}{1-0.9} = 10, \qquad v_\pi(s_1) = \frac{0.9}{1-0.9} = 9.
$$
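
These values can also be reproduced numerically from the transition structure stated above ($s_1 \to s_3$ with reward 0; $s_2, s_3, s_4 \to s_4$, each with reward 1), here with plain fixed-point iteration $v \leftarrow r_\pi + \gamma P_\pi v$, which is the kind of iterative algorithm alluded to for more complicated cases:

```python
import numpy as np

gamma = 0.9

# Order: s1, s2, s3, s4, with the rewards and transitions stated above.
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([
    [0, 0, 1, 0],   # s1 -> s3
    [0, 0, 0, 1],   # s2 -> s4
    [0, 0, 0, 1],   # s3 -> s4
    [0, 0, 0, 1],   # s4 -> s4
], dtype=float)

# Fixed-point iteration v <- r_pi + gamma * P_pi @ v; it converges since gamma < 1.
v = np.zeros(4)
for _ in range(500):
    v = r_pi + gamma * P_pi @ v

print(np.round(v, 3))   # [ 9. 10. 10. 10.] = [gamma/(1-gamma), 1/(1-gamma), ...]
```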

For stochastic policy

Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.

Consider the second example shown in Figure 2.5, where the policy is stochastic. We next write out the Bellman equation and then solve the state values from it.

At state $s_1$, the probabilities of going right and down both equal 0.5. Mathematically, we have $\pi(a = a_2 \mid s_1) = 0.5$ and $\pi(a = a_3 \mid s_1) = 0.5$. The state transition probability is deterministic since $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' = s_2 \mid s_1, a_2) = 1$. The reward probability is also deterministic since $p(r = 0 \mid s_1, a_3) = 1$ and $p(r = -1 \mid s_1, a_2) = 1$. Substituting these values into the Bellman equation above gives
$$
v_\pi(s_1) = 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)].
$$

Similarly, it can be obtained that
$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_4) = 1 + \gamma v_\pi(s_4).
$$

The state values can be solved from the above equations. Since the equations are simple, we can solve the state values manually and obtain
$$
v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)] = -0.5 + \frac{\gamma}{1-\gamma}.
$$

Furthermore, if we set $\gamma = 0.9$, then
$$
v_\pi(s_4) = 10, \qquad v_\pi(s_3) = 10, \qquad v_\pi(s_2) = 10, \qquad v_\pi(s_1) = -0.5 + 9 = 8.5.
$$
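
A few lines of arithmetic confirm the value at $s_1$, reusing $v_\pi(s_2) = v_\pi(s_3) = 10$ from above:

```python
gamma = 0.9
v_s2 = v_s3 = 1 / (1 - gamma)                                   # = 10
v_s1 = 0.5 * (0 + gamma * v_s3) + 0.5 * (-1 + gamma * v_s2)     # = 0.5*9 + 0.5*8
print(v_s1)                                                     # 8.5
```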


  1. In this article we only deal with the discounted return, without loss of generality, since we can simply treat the undiscounted return as a special case of the discounted return with $\gamma = 1$.↩︎