Bellman Equation: The Matrix-Vector Form

Sources:

  1. Shiyu Zhao. Chapter 2: State Values and Bellman Equation. Mathematical Foundations of Reinforcement Learning.
  2. YouTube: Bellman Equation: Matrix-Vector Form.

Bellman equation: the matrix-vector form

Consider the Bellman equation:

$$v_\pi(s) = \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$

This is the elementwise form: there is one such equation for each state, so there are $|\mathcal{S}|$ equations in total. If we put all of them together, we have a set of linear equations, which can be concisely written in matrix-vector form.

Recall that this equation can be rewritten as

$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s)\, v_\pi(s'),$$

where

$$r_\pi(s) \triangleq \sum_{a} \pi(a|s) \sum_{r} p(r|s,a)\, r, \qquad p_\pi(s'|s) \triangleq \sum_{a} \pi(a|s)\, p(s'|s,a).$$

Suppose the states are indexed as $s_i$ ($i = 1, \dots, n$). For state $s_i$, the Bellman equation is

$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{s_j} p_\pi(s_j|s_i)\, v_\pi(s_j).$$

Putting all of these equations together, we can rewrite them in the matrix-vector form

$$v_\pi = r_\pi + \gamma P_\pi v_\pi \tag{1}$$

where

  • $v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^T \in \mathbb{R}^n$,

  • $r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^T \in \mathbb{R}^n$,

  • $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j|s_i)$, is the state transition matrix.
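
As a concrete sketch of this construction, the following NumPy snippet shows one way $r_\pi$ and $P_\pi$ could be assembled from a policy and the model. The array names and shapes (`policy`, `p_trans`, `r_mean`) are illustrative assumptions, not notation from the source.

```python
import numpy as np

# Assumed (hypothetical) inputs:
#   policy[s, a]     = pi(a|s),                shape (n_states, n_actions)
#   p_trans[s, a, t] = p(s_t | s, a),          shape (n_states, n_actions, n_states)
#   r_mean[s, a]     = sum_r p(r | s, a) * r,  shape (n_states, n_actions)

def build_r_pi(policy, r_mean):
    # r_pi(s) = sum_a pi(a|s) * sum_r p(r|s,a) * r
    return np.sum(policy * r_mean, axis=1)

def build_P_pi(policy, p_trans):
    # [P_pi]_{ij} = p_pi(s_j | s_i) = sum_a pi(a|s_i) * p(s_j | s_i, a)
    return np.einsum('sa,sat->st', policy, p_trans)
```

Each row of the resulting `P_pi` sums to 1, since it is a probability distribution over next states.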

Examples

Refer to the grid world example for the notations.

If there are four states, $v_\pi = r_\pi + \gamma P_\pi v_\pi$ can be written out as

$$\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi} = \underbrace{\begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix}}_{r_\pi} + \gamma \underbrace{\begin{bmatrix} p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\ p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\ p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\ p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4) \end{bmatrix}}_{P_\pi} \underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}.$$

For a deterministic policy

Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.

For this specific example:

$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}.$$
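
Because this $P_\pi$ is sparse, the system can be solved directly by back-substitution: the last row gives $v_\pi(s_4) = 1 + \gamma v_\pi(s_4)$, and substituting upward yields

$$v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = \gamma\, v_\pi(s_3) = \frac{\gamma}{1-\gamma}.$$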

For a stochastic policy

Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.

For this specific example:

$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} 0.5(0) + 0.5(-1) \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}.$$
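
The same back-substitution applies: the last three rows again give $v_\pi(s_2) = v_\pi(s_3) = v_\pi(s_4) = \frac{1}{1-\gamma}$, and the first row then yields

$$v_\pi(s_1) = 0.5(0) + 0.5(-1) + \gamma\left[0.5\, v_\pi(s_2) + 0.5\, v_\pi(s_3)\right] = -0.5 + \frac{\gamma}{1-\gamma}.$$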

Solution of the matrix-vector form

Recalling the Bellman equation in matrix-vector form (1), we can solve it in two ways:

  • The closed-form solution is $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, where $I$ is the identity matrix. The drawback of the closed-form solution is that it involves a matrix inverse operation, which is computationally expensive for large state spaces. Thus, in practice, we usually use an iterative solution.

  • An iterative solution is $v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \dots$

    We can start from an arbitrary initial vector $v_0$, then calculate $v_1, v_2, \dots$. This leads to a sequence $\{v_0, v_1, v_2, \dots\}$, and we can show that $v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ as $k \to \infty$. Both methods are sketched in code after this list.
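
Here is a minimal sketch of both methods applied to the deterministic-policy example above. The discount $\gamma = 0.9$ is an assumed value, not taken from the source; `np.linalg.solve` is used rather than forming the inverse explicitly, since that is numerically preferable.

```python
import numpy as np

gamma = 0.9                                   # assumed discount factor
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],                # s1 -> s3
                 [0, 0, 0, 1],                # s2 -> s4
                 [0, 0, 0, 1],                # s3 -> s4
                 [0, 0, 0, 1]],               # s4 -> s4
                dtype=float)

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi @ v_k
v = np.zeros(4)                               # arbitrary initial vector v_0
for _ in range(500):
    v = r_pi + gamma * P_pi @ v

print(v_closed)                               # [ 9. 10. 10. 10.] for gamma = 0.9
print(np.allclose(v, v_closed))               # True
```

With $\gamma = 0.9$, the output matches the hand-computed values $v_\pi(s_1) = \gamma/(1-\gamma) = 9$ and $v_\pi(s_2) = v_\pi(s_3) = v_\pi(s_4) = 1/(1-\gamma) = 10$.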

Proof: the closed-form solution

First, the Bellman equation in matrix-vector form is $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Moving $\gamma P_\pi v_\pi$ to the left side gives

$$v_\pi - \gamma P_\pi v_\pi = r_\pi \;\Longleftrightarrow\; (I - \gamma P_\pi) v_\pi = r_\pi.$$

Since $\gamma < 1$ and $P_\pi$ is a stochastic matrix (every eigenvalue of $P_\pi$ has modulus at most 1), the matrix $I - \gamma P_\pi$ is invertible, and multiplying both sides by its inverse gives $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. Q.E.D.

Proof: the iterative solution

Define the error as $\delta_k = v_k - v_\pi$. We only need to show $\delta_k \to 0$. Substituting

  1. $v_{k+1} = \delta_{k+1} + v_\pi$ and
  2. $v_k = \delta_k + v_\pi$

into $v_{k+1} = r_\pi + \gamma P_\pi v_k$ gives

$$\delta_{k+1} + v_\pi = r_\pi + \gamma P_\pi (\delta_k + v_\pi),$$

which can be rewritten as

$$\delta_{k+1} = -v_\pi + r_\pi + \gamma P_\pi \delta_k + \gamma P_\pi v_\pi = \gamma P_\pi \delta_k,$$

where the last equality uses the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

As a result,

$$\delta_{k+1} = \gamma P_\pi \delta_k = \gamma^2 P_\pi^2 \delta_{k-1} = \dots = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$

Note that $0 \le P_\pi^k \le 1$ entrywise, which means every entry of $P_\pi^k$ is nonnegative and no greater than 1 for any $k = 0, 1, 2, \dots$. That is because $P_\pi^k \mathbf{1} = \mathbf{1}$, where $\mathbf{1} = [1, \dots, 1]^T$; that is, $P_\pi^k$ is itself a stochastic matrix. On the other hand, since $\gamma < 1$, we know $\gamma^{k+1} \to 0$, and hence $\delta_{k+1} = \gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$ as $k \to \infty$, because every entry of $P_\pi^{k+1} \delta_0$ is bounded in magnitude by the largest entry of $|\delta_0|$. Q.E.D.
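
As a quick numerical sanity check of this argument, the sketch below (reusing the deterministic-policy example with the assumed $\gamma = 0.9$) prints $\|\delta_k\|_\infty$ and the ratio between consecutive errors; the ratio never exceeds $\gamma$ because $P_\pi$ is row-stochastic.

```python
import numpy as np

gamma = 0.9                                   # assumed discount factor
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)

v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)   # true state values

v = np.zeros(4)                                           # arbitrary v_0
prev_err = np.max(np.abs(v - v_pi))                       # ||delta_0||_inf
for k in range(1, 11):
    v = r_pi + gamma * P_pi @ v                           # one iteration
    err = np.max(np.abs(v - v_pi))                        # ||delta_k||_inf
    print(f"k={k:2d}  error={err:.6f}  ratio={err / prev_err:.3f}")
    prev_err = err
```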