Bellman Equation: The Matrix-Vector Form

Sources:

  1. Shiyu Zhao. Chapter 2: State Values and Bellman Equation. Mathematical Foundations of Reinforcement Learning.
  2. YouTube: Bellman Equation: Matrix-Vector Form.


Consider the Bellman equation: \[ v_\pi(s)=\sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right] \]

This is an elementwise form: it holds for every state, so there are \(|\mathcal{S}|\) equations of this kind. Putting all of these equations together gives a set of linear equations, which can be written concisely in matrix-vector form.

Recall that this equation can be rewritten as \[ v_\pi(s)= \color{teal}{r_\pi(s)} + \gamma \color{salmon}{\sum_{s^{\prime}} p_\pi\left(s^{\prime} \mid s\right) v_\pi\left(s^{\prime}\right)} \] where \[ r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a) r, \quad p_\pi\left(s^{\prime} \mid s\right) \triangleq \sum_a \pi(a \mid s) p\left(s^{\prime} \mid s, a\right) \]

Suppose the states are indexed as \(s_i\ (i=1, \ldots, n)\), where \(n=|\mathcal{S}|\). For state \(s_i\), the Bellman equation is \[ v_\pi\left(s_i\right)= \color{teal}{r_\pi\left(s_i\right)}+ \gamma \color{salmon} {\sum_{s_j} p_\pi\left(s_j \mid s_i\right) v_\pi\left(s_j\right)} \]

Putting these equations for all the states together, we obtain the matrix-vector form \[ \begin{equation} \label{eq_Bellman_equation_matrix-vector_form} v_\pi= \color{teal}{r_\pi} + \gamma \color{salmon}{P_\pi v_\pi} \end{equation} \] where

\(v_\pi=\left[v_\pi\left(s_1\right), \ldots, v_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n\),

\(r_\pi=\left[r_\pi\left(s_1\right), \ldots, r_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n\),

\(P_\pi \in \mathbb{R}^{n \times n}\), where \(\left[P_\pi\right]_{i j}=p_\pi\left(s_j \mid s_i\right)\), is the state transition matrix.
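To make the construction concrete, here is a minimal NumPy sketch of how \(r_\pi\) and \(P_\pi\) can be assembled from the model and the policy. The array names `pi`, `P`, and `R` are my own, and I assume the expected reward \(r(s, a)=\sum_r p(r \mid s, a)\, r\) is stored directly:

```python
import numpy as np

def build_r_pi_P_pi(pi, P, R):
    """Assemble r_pi and P_pi for a given policy.

    pi : (n, m) array,    pi[s, a]    = pi(a | s)
    P  : (n, m, n) array, P[s, a, s2] = p(s2 | s, a)
    R  : (n, m) array,    R[s, a]     = expected reward sum_r p(r | s, a) * r
    """
    r_pi = np.einsum("sa,sa->s", pi, R)    # r_pi(s) = sum_a pi(a|s) r(s,a)
    P_pi = np.einsum("sa,sat->st", pi, P)  # [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)
    return r_pi, P_pi
```

Each row of the resulting `P_pi` sums to 1 (it is a stochastic matrix), which is worth asserting in practice.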

Examples

Refer to the grid-world example for the notation.

If there are four states, \(v_\pi=r_\pi+\gamma P_\pi v_\pi\) can be written out as \[ \underbrace{\left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right]}_{v_\pi}=\underbrace{\left[\begin{array}{l} r_\pi\left(s_1\right) \\ r_\pi\left(s_2\right) \\ r_\pi\left(s_3\right) \\ r_\pi\left(s_4\right) \end{array}\right]}_{r_\pi}+\gamma \underbrace{\left[\begin{array}{llll} p_\pi\left(s_1 \mid s_1\right) & p_\pi\left(s_2 \mid s_1\right) & p_\pi\left(s_3 \mid s_1\right) & p_\pi\left(s_4 \mid s_1\right) \\ p_\pi\left(s_1 \mid s_2\right) & p_\pi\left(s_2 \mid s_2\right) & p_\pi\left(s_3 \mid s_2\right) & p_\pi\left(s_4 \mid s_2\right) \\ p_\pi\left(s_1 \mid s_3\right) & p_\pi\left(s_2 \mid s_3\right) & p_\pi\left(s_3 \mid s_3\right) & p_\pi\left(s_4 \mid s_3\right) \\ p_\pi\left(s_1 \mid s_4\right) & p_\pi\left(s_2 \mid s_4\right) & p_\pi\left(s_3 \mid s_4\right) & p_\pi\left(s_4 \mid s_4\right) \end{array}\right]}_{P_\pi} \underbrace{\left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right]}_{v_\pi} . \]

For a deterministic policy

Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.

For this specific example: \[ \left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right]=\left[\begin{array}{l} 0 \\ 1 \\ 1 \\ 1 \end{array}\right]+\gamma\left[\begin{array}{llll} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{array}\right]\left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right] \]
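As a numeric sanity check, the closed-form solution (derived below) can be applied directly to this system; a minimal sketch, taking \(\gamma=0.9\) for illustration:

```python
import numpy as np

gamma = 0.9  # discount factor, chosen here for illustration
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)

# v_pi = (I - gamma P_pi)^{-1} r_pi; solve() avoids forming the inverse explicitly
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)  # [ 9. 10. 10. 10.]
```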

For a stochastic policy

Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.

For this specific example: \[ \left[\begin{array}{c} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right]=\left[\begin{array}{c} 0.5(0)+0.5(-1) \\ 1 \\ 1 \\ 1 \end{array}\right]+\gamma\left[\begin{array}{cccc} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{array}\right]\left[\begin{array}{l} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{array}\right] \]
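The same solver handles the stochastic case; only \(r_\pi\) and the first row of \(P_\pi\) change (again taking \(\gamma=0.9\)):

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])  # r_pi(s1) = -0.5
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)  # [ 8.5 10. 10. 10.]
```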

Solution of the matrix-vector form

Recall the Bellman equation in matrix-vector form \(\eqref{eq_Bellman_equation_matrix-vector_form}\). We can solve it in two ways:

  • The closed-form solution is: \[ v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi, \] where \(I\) is the identity matrix. The drawback of the closed-form solution is that it involves a matrix inversion, which is computationally expensive when the state space is large. Thus, in practice, we use an iterative solution.

  • An iterative solution is: \[ v_{k+1}=r_\pi+\gamma P_\pi v_k . \]

    We can pick an arbitrary initial vector \(v_0\) and then calculate \(v_1, v_2, \cdots\). This produces a sequence \(\left\{v_0, v_1, v_2, \ldots\right\}\), and we can show that \[ v_k \rightarrow v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi, \quad k \rightarrow \infty \] Both methods are compared in the sketch after this list.
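Here is a minimal sketch of both methods side by side, reusing the deterministic grid-world system from above (with \(\gamma=0.9\) assumed as before):

```python
import numpy as np

def iterative_policy_evaluation(r_pi, P_pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = r_pi + gamma * P_pi @ v_k until the update stalls."""
    v = np.zeros_like(r_pi)  # any initial guess v_0 works; zeros for simplicity
    for _ in range(max_iter):
        v_next = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]], dtype=float)

v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)  # closed form
v_iter = iterative_policy_evaluation(r_pi, P_pi, gamma)     # iteration
assert np.allclose(v_closed, v_iter)  # both give [9, 10, 10, 10]
```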

Proof: the closed-form solution

First, the Bellman equation in matrix-vector form is \[ v_\pi=r_\pi+\gamma P_\pi v_\pi . \] Move the \(\gamma P_\pi v_\pi\) term to the left side: \[ \begin{aligned} v_\pi - \gamma P_\pi v_\pi &= r_\pi \\ (I - \gamma P_\pi) v_\pi &= r_\pi \end{aligned} \]

Since \(P_\pi\) is a stochastic matrix and \(\gamma<1\), every eigenvalue of \(\gamma P_\pi\) has magnitude at most \(\gamma<1\), so \(I-\gamma P_\pi\) is invertible. Multiplying both sides by the inverse gives \[ v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi . \] Q.E.D.

Proof: the iterative solution

Define the error as \(\delta_k=v_k-v_\pi\). We only need to show \(\delta_k \rightarrow 0\). Substituting:

  1. \(v_{k+1}=\delta_{k+1}+v_\pi\) and
  2. \(v_k=\delta_k+v_\pi\)

into \(v_{k+1}=r_\pi+\gamma P_\pi v_k\) gives \[ \delta_{k+1}+v_\pi=r_\pi+\gamma P_\pi\left(\delta_k+v_\pi\right) \] which can be rewritten as \[ \delta_{k+1}=-v_\pi+r_\pi+\gamma P_\pi \delta_k+\gamma P_\pi v_\pi=\gamma P_\pi \delta_k, \] where the last equality uses the Bellman equation \(v_\pi=r_\pi+\gamma P_\pi v_\pi\).

As a result, \[ \delta_{k+1}=\gamma P_\pi \delta_k=\gamma^2 P_\pi^2 \delta_{k-1}=\cdots=\gamma^{k+1} P_\pi^{k+1} \delta_0 \]

Note that \(0 \leq P_\pi^k \leq 1\), meaning every entry of \(P_\pi^k\) lies between 0 and 1 for any \(k=0,1,2, \ldots\): the entries are nonnegative, and \(P_\pi^k \mathbf{1}=\mathbf{1}\), where \(\mathbf{1}=[1, \ldots, 1]^T\), so no entry can exceed 1. On the other hand, since \(\gamma<1\), we have \(\gamma^{k+1} \rightarrow 0\), and hence \[ \delta_{k+1}=\gamma^{k+1} P_\pi^{k+1} \delta_0 \rightarrow 0 \] as \(k \rightarrow \infty\).
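A short sketch to watch this geometric decay numerically (the same deterministic \(P_\pi\) and \(\gamma=0.9\) as above; \(\delta_0\) is an arbitrary choice of mine):

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]], dtype=float)

delta = np.array([5.0, -3.0, 2.0, 7.0])  # arbitrary initial error delta_0
for k in range(1, 6):
    delta = gamma * P_pi @ delta  # delta_k = gamma * P_pi * delta_{k-1}
    # rows of P_pi sum to 1, so ||delta_k||_inf <= gamma^k * ||delta_0||_inf
    print(k, np.max(np.abs(delta)), gamma**k * 7.0)
```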