Bellman Equation

Sources:

  1. Shiyu Zhao. Chapter 2: State Values and Bellman Equation. Mathematical Foundations of Reinforcement Learning.
  2. YouTube: Bellman Equation: Derivation

Problem Formulation

Given a policy, finding the corresponding state values is called policy evaluation. This is done by solving an equation called the Bellman equation.

State Value

The expectation (or expected value, mean) of $G_t$ is defined as the state-value function or simply state value:
$$
v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] \tag{1}
$$

Remarks:

  • It is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.
  • Since $G_t$ is based on some policy $\pi$, its expectation $v_\pi(s)$ is also based on that policy, i.e., for a different policy, the state value may be different.
  • It represents the "value" of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.
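
To make definition (1) concrete, below is a minimal Monte Carlo sketch that estimates $v_\pi(s)$ by averaging sampled returns. The two-state dynamics, rewards, and policy baked into `step` are made up purely for illustration; only the estimator itself reflects the definition.

```python
import random

GAMMA, HORIZON, EPISODES = 0.9, 200, 20_000   # truncate the infinite return; gamma^200 ~ 0

def step(state, rng):
    """Hypothetical two-state dynamics with the policy already baked in (made up)."""
    if state == 0:
        # with prob 0.5 move to state 1 (reward 0), otherwise stay (reward 1)
        return (0.0, 1) if rng.random() < 0.5 else (1.0, 0)
    return 1.0, 1                              # state 1: reward 1 forever

def sample_return(start, rng):
    """One sampled discounted return G_t with S_t = start."""
    g, discount, s = 0.0, 1.0, start
    for _ in range(HORIZON):
        r, s = step(s, rng)
        g += discount * r
        discount *= GAMMA
    return g

def estimate_state_value(start, seed=0):
    """Monte Carlo estimate of v_pi(start) = E[G_t | S_t = start]."""
    rng = random.Random(seed)
    return sum(sample_return(start, rng) for _ in range(EPISODES)) / EPISODES

for s in (0, 1):
    print(f"estimated v_pi({s}) ~= {estimate_state_value(s):.3f}")   # ~9.09 and ~10.0
```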

Derivation of the Bellman equation

Recalling the definition of the state value (1), we substitute $G_t = R_{t+1} + \gamma G_{t+1}$ into it¹:

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s]
\end{aligned}
$$

The two terms in the last line are analyzed below.

First term: the mean of immediate rewards

First, calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r.
$$

This is the mean of immediate rewards.

Explanation

Given the events $R_{t+1} = r$, $S_t = s$, and $A_t = a$, the deduction is quite simple.

  1. From the law of total expectation: given discrete random variables $X$ and $Y$, we have $\mathbb{E}(X) = \mathbb{E}\big(\mathbb{E}(X \mid Y)\big)$. If $Y$ is finite, we have $\mathbb{E}(X) = \sum_{y_i \in \mathcal{Y}} \mathbb{E}(X \mid Y = y_i)\, p(Y = y_i)$, where $\mathcal{Y}$ is the alphabet of $Y$.

    Thus we obtain $\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, where $A_t$ takes values in the finite set $\mathcal{A}(s)$.

  2. From the definition of expectation, $\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} p(r \mid s, a)\, r$, leading to $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$. Q.E.D.


A more verbose version of the deduction is:

First, consider the definition of expectation:
$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} p(r \mid s)\, r. \tag{2}
$$
Next, use the formula for marginal probability, $p(r \mid s) = \sum_{a \in \mathcal{A}(s)} p(r, a \mid s)$. Due to the chain rule of probability, we obtain $p(r, a \mid s) = p(r \mid a, s)\, p(a \mid s)$. In the RL context, $p(a \mid s)$ is often written as $\pi(a \mid s)$.

Therefore, $p(r \mid s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, p(r \mid a, s)$.

Replacing $p(r \mid s)$ in (2) with $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, p(r \mid a, s)$, we get
$$
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{r} \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, p(r \mid s, a)\, r = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a].
$$

Q.E.D.
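
As a quick numerical sanity check of this decomposition, the snippet below computes $\mathbb{E}[R_{t+1} \mid S_t = s]$ both from the marginal $p(r \mid s)$ and from the action-wise sum; the distributions $\pi(a \mid s)$ and $p(r \mid s, a)$ are made-up numbers for a single state.

```python
# Hypothetical distributions for a single state s: two actions, two possible rewards.
pi = {"a1": 0.3, "a2": 0.7}                      # pi(a | s)
p_r = {                                          # p(r | s, a)
    "a1": {0.0: 0.8, 1.0: 0.2},
    "a2": {0.0: 0.1, 1.0: 0.9},
}

# Left-hand side: E[R_{t+1} | s] from the marginal p(r | s) = sum_a pi(a|s) p(r|s,a).
p_r_marginal = {}
for a, probs in p_r.items():
    for r, p in probs.items():
        p_r_marginal[r] = p_r_marginal.get(r, 0.0) + pi[a] * p
lhs = sum(p * r for r, p in p_r_marginal.items())

# Right-hand side: sum_a pi(a|s) * E[R_{t+1} | s, a].
rhs = sum(pi[a] * sum(p * r for r, p in p_r[a].items()) for a in pi)

print(lhs, rhs)                  # both 0.3*0.2 + 0.7*0.9 = 0.69
assert abs(lhs - rhs) < 1e-12
```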

Second term: the (discounted) mean of future rewards

First, we calculate the mean of future rewards:
$$
\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s')\, p(s' \mid s) \\
&= \sum_{s'} v_\pi(s') \sum_{a} p(s' \mid s, a)\, \pi(a \mid s)
\end{aligned}
$$

Then we multiply it by the discount factor $\gamma$. For simplicity, we still call the second term $\gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$ "the mean of future rewards", although it is discounted.

Explanation

The first equality comes from the law of total expectation as well.

The second equality comes from the Markov property. Recall that the reward $R_{t+1}$ is defined to rely only on $s_t$ and $a_t$: $p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t)$. We obtain that
$$
\mathbb{E}[R_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0] \triangleq \sum_{r_{t+1} \in \mathcal{R}} p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)\, r_{t+1} = \sum_{r_{t+1} \in \mathcal{R}} p(r_{t+1} \mid s_t, a_t)\, r_{t+1} \triangleq \mathbb{E}[R_{t+1} \mid s_t, a_t].
$$
Since $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, we have $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots$, and every reward in this sum depends only on the states and actions from time $t+1$ onward. Therefore,
$$
\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \mid S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s'].
$$
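
The Markov-property step can also be checked by simulation. In the sketch below the two-state dynamics are made up; the point is only that adding $S_t$ to the condition does not change the mean of $G_{t+1}$ once $S_{t+1}$ is given.

```python
import random

GAMMA, HORIZON, EPISODES = 0.9, 200, 20_000

def step(state, rng):
    """Hypothetical two-state dynamics under a fixed policy (made up for this check)."""
    if state == 0:
        return (0.0, 1) if rng.random() < 0.5 else (1.0, 0)
    return 1.0, 1          # state 1: reward 1, stays in state 1

def return_from(state, rng):
    """Discounted return with the discount clock started at `state`'s time step."""
    g, discount, s = 0.0, 1.0, state
    for _ in range(HORIZON):
        r, s = step(s, rng)
        g += discount * r
        discount *= GAMMA
    return g

rng = random.Random(0)

# E[G_{t+1} | S_t = 0, S_{t+1} = 1]: start in state 0, keep only runs whose first
# transition lands in state 1, then measure the return generated from time t+1 on.
lhs_samples = []
while len(lhs_samples) < EPISODES:
    _reward, s_next = step(0, rng)
    if s_next == 1:
        lhs_samples.append(return_from(s_next, rng))
lhs = sum(lhs_samples) / len(lhs_samples)

# E[G_{t+1} | S_{t+1} = 1] = v_pi(1): start directly in state 1.
rhs = sum(return_from(1, rng) for _ in range(EPISODES)) / EPISODES

print(f"with S_t in the condition: {lhs:.3f}   with S_t+1 only: {rhs:.3f}")  # both ~10
```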

Bellman equation

Therefore, we have the Bellman equation,
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s] \\
&= \underbrace{\sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r}_{\text{mean of immediate rewards}} + \underbrace{\gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')}_{\text{(discounted) mean of future rewards}} \\
&= \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big], \qquad \forall s \in \mathcal{S}.
\end{aligned}
$$

  • It consists of two terms:
    1. First term: the mean of immediate rewards
    2. Second term: the (discounted) mean of future rewards
  • The Bellman equation is a set of linear equations that describe the relationships between the values of all the states; a numerical sketch of solving it is given after this list.
  • The above elementwise form is valid for every state $s \in \mathcal{S}$. That means there are $|\mathcal{S}|$ equations like this!
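
Because the equations are linear, the $|\mathcal{S}|$ elementwise equations can be stacked into the matrix-vector form $v_\pi = r_\pi + \gamma P_\pi v_\pi$ and solved directly as $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. A minimal NumPy sketch with a made-up three-state $r_\pi$ and $P_\pi$:

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state example (all numbers made up):
# r_pi[s]     = sum_a pi(a|s) * sum_r p(r|s,a) * r    (mean immediate reward in s)
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)             (state-transition matrix under pi)
r_pi = np.array([0.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Stacking the elementwise equations gives v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi, one linear system with |S| equations.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)                                            # [9. 10. 10.]

# Sanity check: v satisfies the elementwise Bellman equation.
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```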

Examples

Refer to the grid world example for the notations.

For deterministic policy

Figure 2.4: An example for demonstrating the Bellman equation. The policy in this example is deterministic.

Consider the first example shown in Figure 2.4, where the policy is deterministic. We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy, the probabilities of taking the actions are $\pi(a = a_3 \mid s_1) = 1$ and $\pi(a \neq a_3 \mid s_1) = 0$. The state transition probabilities are $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' \neq s_3 \mid s_1, a_3) = 0$.

The reward probabilities are $p(r = 0 \mid s_1, a_3) = 1$ and $p(r \neq 0 \mid s_1, a_3) = 0$.

Substituting these values into the Bellman equation mentioned before,
$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big], \qquad \forall s \in \mathcal{S},
$$
gives
$$
v_\pi(s_1) = 0 + \gamma v_\pi(s_3).
$$

Similarly, it can be obtained that
$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_4) = 1 + \gamma v_\pi(s_4).
$$

We can solve the state values from these equations. Since the equations are simple, we can solve them manually. More complicated equations can be solved by the algorithms presented later. Here, the state values can be solved as
$$
v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = \frac{\gamma}{1-\gamma}.
$$
Furthermore, if we set $\gamma = 0.9$, then
$$
v_\pi(s_4) = \frac{1}{1-0.9} = 10, \qquad v_\pi(s_3) = \frac{1}{1-0.9} = 10, \qquad v_\pi(s_2) = \frac{1}{1-0.9} = 10, \qquad v_\pi(s_1) = \frac{0.9}{1-0.9} = 9.
$$
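
These values can also be reproduced numerically from the transition structure stated above ($s_1 \to s_3$ with reward 0; $s_2, s_3, s_4 \to s_4$, each with reward 1), here with plain fixed-point iteration $v \leftarrow r_\pi + \gamma P_\pi v$, which is the kind of iterative algorithm alluded to for more complicated cases:

```python
import numpy as np

gamma = 0.9

# Order: s1, s2, s3, s4, with the rewards and transitions stated above.
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([
    [0, 0, 1, 0],   # s1 -> s3
    [0, 0, 0, 1],   # s2 -> s4
    [0, 0, 0, 1],   # s3 -> s4
    [0, 0, 0, 1],   # s4 -> s4
], dtype=float)

# Fixed-point iteration v <- r_pi + gamma * P_pi @ v; it converges since gamma < 1.
v = np.zeros(4)
for _ in range(500):
    v = r_pi + gamma * P_pi @ v

print(np.round(v, 3))   # [ 9. 10. 10. 10.] = [gamma/(1-gamma), 1/(1-gamma), ...]
```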

For stochastic policy

Figure 2.5: An example for demonstrating the Bellman equation. The policy in this example is stochastic.

Consider the second example shown in Figure 2.5, where the policy is stochastic. We next write out the Bellman equation and then solve the state values from it.

At state $s_1$, the probabilities of going right and down both equal 0.5. Mathematically, we have $\pi(a = a_2 \mid s_1) = 0.5$ and $\pi(a = a_3 \mid s_1) = 0.5$. The state transition probability is deterministic since $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' = s_2 \mid s_1, a_2) = 1$. The reward probability is also deterministic since $p(r = 0 \mid s_1, a_3) = 1$ and $p(r = -1 \mid s_1, a_2) = 1$. Substituting these values into the Bellman equation above gives
$$
v_\pi(s_1) = 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)].
$$

Similarly, it can be obtained that
$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad v_\pi(s_4) = 1 + \gamma v_\pi(s_4).
$$

The state values can be solved from the above equations. Since the equations are simple, we can solve the state values manually and obtain
$$
v_\pi(s_4) = \frac{1}{1-\gamma}, \qquad v_\pi(s_3) = \frac{1}{1-\gamma}, \qquad v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad v_\pi(s_1) = 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)] = -0.5 + \frac{\gamma}{1-\gamma}.
$$

Furthermore, if we set $\gamma = 0.9$, then
$$
v_\pi(s_4) = 10, \qquad v_\pi(s_3) = 10, \qquad v_\pi(s_2) = 10, \qquad v_\pi(s_1) = -0.5 + 9 = 8.5.
$$
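
A few lines of arithmetic confirm the value at $s_1$, reusing $v_\pi(s_2) = v_\pi(s_3) = 10$ from above:

```python
gamma = 0.9
v_s2 = v_s3 = 1 / (1 - gamma)                                   # = 10
v_s1 = 0.5 * (0 + gamma * v_s3) + 0.5 * (-1 + gamma * v_s2)     # = 0.5*9 + 0.5*8
print(v_s1)                                                     # 8.5
```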


  1. In this article we only deal with the discounted return, without loss of generality, since we can simply treat the undiscounted return as a special case of the discounted return with $\gamma = 1$.↩︎