Bellman Optimality Equation

  1. Shiyu Zhao. Chapter 3: Bellman Optimality Equation. Mathematical Foundations of Reinforcement Learning.
  2. --> Youtube: Bellman Optimality Equation
  3. --> Youtube: How to Solve Bellman Optimality Equation

The previous chapter introduced the Bellman equation of a given policy. This post introduces the Bellman optimality equation (BOE), a special Bellman equation whose corresponding policy is optimal. By solving it, we can obtain both the optimal state values and an optimal policy.

Optimal Policy

The state value can be used to evaluate whether a policy is good: if $v_{\pi_1}(s) \ge v_{\pi_2}(s)$ for all $s \in \mathcal{S}$, then $\pi_1$ is "better" than $\pi_2$.

A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \ge v_{\pi}(s)$ for all $s$ and for any other policy $\pi$.

The definition leads to many questions:

  • Does the optimal policy exist?
  • Is the optimal policy unique?
  • Is the optimal policy stochastic or deterministic?
  • How to obtain the optimal policy?

To answer these questions, we study the Bellman optimality equation.

BOE (elementwise form)

The tool for analyzing optimal policies and optimal state values is the Bellman optimality equation (BOE). By solving this equation, we can obtain optimal policies and optimal state values.

For every $s \in \mathcal{S}$, the elementwise expression of the BOE is

$$
\begin{aligned}
v(s) &= \max_{\pi(s) \in \Pi(s)} \left( \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s] \right) \\
&= \max_{\pi(s) \in \Pi(s)} \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \sum_{r \in \mathcal{R}} p(r \mid s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) v(s') \right) \\
&= \max_{\pi(s) \in \Pi(s)} \sum_{a \in \mathcal{A}} \pi(a \mid s) q(s, a),
\end{aligned}
$$

where $v(s), v(s')$ are unknown variables to be solved and

$$
q(s, a) := \sum_{r \in \mathcal{R}} p(r \mid s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) v(s').
$$

Here, $\pi(s)$ denotes a policy for state $s$, and $\Pi(s)$ is the set of all possible policies for $s$.

To obtain the optimal state values (i.e., the optimal state value function) and optimal policies, we evaluate the BOE for every state $s \in \mathcal{S}$ repeatedly, which involves choosing one policy $\pi(s) \in \Pi(s)$ at each step. When the state values no longer change much, i.e., they have converged, the resulting state value function is the optimal state value function, and the policy adopted during that computation is the optimal policy.

Like before, the first term $\sum_{r \in \mathcal{R}} p(r \mid s, a) r$ is the mean of immediate rewards, and the second term $\gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) v(s')$ is the discounted mean of future rewards.

Remarks:

  • $p(r \mid s, a)$ and $p(s' \mid s, a)$ are known.
  • $v(s), v(s')$ are unknown and to be calculated.

Maximizing the right-hand side of the BOE

Here we discuss how to solve the BOE. In practice we need to deal with the matrix-vector form, since that is what we actually face. But because each row of the matrix-vector form is just the elementwise form for one state, we start with the elementwise form.

In fact, we can turn the problem into "solve the optimal π on the right-hand side". Let's look at one example first:


Example 3.1. Consider two unknown variables $x, y \in \mathbb{R}$ that satisfy $x = \max_{y \in \mathbb{R}} (2x - 1 - y^2)$.

The first step is to solve for $y$ on the right-hand side of the equation. Regardless of the value of $x$, we always have $\max_y (2x - 1 - y^2) = 2x - 1$, where the maximum is achieved when $y = 0$. The second step is to solve for $x$. When $y = 0$, the equation becomes $x = 2x - 1$, which leads to $x = 1$. Therefore, $y = 0$ and $x = 1$ are the solution of the equation.


We now turn to the maximization problem on the right-hand side of the BOE. The (elementwise) BOE can be written as $v(s) = \max_{\pi(s) \in \Pi(s)} \sum_{a \in \mathcal{A}} \pi(a \mid s) q(s, a)$, $s \in \mathcal{S}$.

Inspired by Example 3.1, we can first solve the optimal π on the right-hand side. How to do that? The following example demonstrates its basic idea.


Example 3.2. Given $q_1, q_2, q_3 \in \mathbb{R}$, we would like to find the optimal values of $c_1, c_2, c_3$ to maximize $\sum_{i=1}^{3} c_i q_i = c_1 q_1 + c_2 q_2 + c_3 q_3$, where $c_1 + c_2 + c_3 = 1$ and $c_1, c_2, c_3 \ge 0$. Without loss of generality, suppose $q_3 \ge q_1, q_2$. Then the optimal solution is $c_3^* = 1$ and $c_1^* = c_2^* = 0$. This is because $q_3 = (c_1 + c_2 + c_3) q_3 = c_1 q_3 + c_2 q_3 + c_3 q_3 \ge c_1 q_1 + c_2 q_2 + c_3 q_3$ for any $c_1, c_2, c_3$.


Inspired by the above example, and considering that $\sum_a \pi(a \mid s) = 1$, we have

$$
\begin{aligned}
v(s) &= \max_{\pi(s) \in \Pi(s)} \sum_a \pi(a \mid s) q(s, a) \qquad (1) \\
&= \max_{a \in \mathcal{A}(s)} q(s, a), \qquad (2)
\end{aligned}
$$

where the optimum is achieved when

$$
\pi(a \mid s) = \begin{cases} 1 & a = a^* \\ 0 & a \ne a^* \end{cases}
$$

with $a^* = \arg\max_a q(s, a)$.
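As a quick numerical check of (2), the sketch below (the function name `greedy_policy` and the q-values are my own, not from the text) builds the deterministic policy that puts all probability mass on $\arg\max_a q(s,a)$ and confirms it is at least as good as an arbitrary stochastic policy:

```python
import numpy as np

def greedy_policy(q):
    """Return the distribution pi(a|s) maximizing sum_a pi(a|s) * q[a]:
    all probability mass on the argmax action."""
    pi = np.zeros_like(q, dtype=float)
    pi[np.argmax(q)] = 1.0
    return pi

q = np.array([-1.0, 0.0, 1.0])   # hypothetical q(s, a) for three actions
pi = greedy_policy(q)

# The greedy policy attains max_a q(s, a) ...
assert pi @ q == q.max()
# ... and is at least as good as any other distribution, e.g. the uniform one:
uniform = np.ones(3) / 3
assert pi @ q >= uniform @ q
```

This is exactly Example 3.2 with $c_a = \pi(a \mid s)$: a convex combination of q-values is maximized by concentrating all weight on the largest one.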

Now we know that solving the BOE amounts to maximizing its right-hand side, and we know how to do that as well: simply select the action with the largest action value. However, neither the action values nor the state values are known at this point, so this observation alone does not solve the equation.

In fact, the solution of the BOE is obtained by applying the contraction mapping theorem (introduced below) to the matrix-vector form, which yields an iterative method.

So why introduce (2) here? Because during every iteration, for every state $s$, the action values are already known, so we can use (2) to obtain the maximized right-hand side, which is the new $v(s)$!

Matrix-vector form of the BOE

To leverage the contraction mapping theorem, we express the BOE in matrix-vector form as $v = f(v)$.

Since the optimal value of $\pi$ is determined by $v$, the right-hand side of the BOE (in matrix-vector form) is a function of $v$, denoted as

$$
f(v) := \max_{\pi} (r_\pi + \gamma P_\pi v), \qquad (3)
$$

where $v \in \mathbb{R}^{|\mathcal{S}|}$ and $\max_\pi$ is performed in an elementwise manner. The structures of $r_\pi$ and $P_\pi$ are the same as those in the matrix-vector form of the ordinary Bellman equation:

$$
[r_\pi]_s := \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a) r, \qquad [P_\pi]_{s, s'} = p(s' \mid s) := \sum_{a \in \mathcal{A}} \pi(a \mid s) p(s' \mid s, a).
$$

Then, the BOE can be expressed in the concise form $v = f(v)$.

Each entry $[f(v)]_s$ is the elementwise form of the BOE at state $s$.

Contraction mapping theorem

Now that the matrix-vector form is expressed as the nonlinear equation $v = f(v)$, we next introduce the contraction mapping theorem to analyze it.

Concepts: Fixed point and Contraction mapping

Fixed point: $x \in X$ is a fixed point of $f: X \to X$ if $f(x) = x$.

Contraction mapping (or contractive function): $f$ is a contraction mapping if $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for some $\gamma \in (0, 1)$.
- Note: $\gamma$ must be strictly less than 1 so that limits such as $\gamma^k \to 0$ as $k \to \infty$ hold.
- Here, $\|\cdot\|$ can be any vector norm.

Example1

Given the equation $x = f(x) = 0.5x$, $x \in \mathbb{R}$, it is easy to verify that $x^* = 0$ is a fixed point, since $0 = 0.5 \times 0$.

Moreover, $f(x) = 0.5x$ is a contraction mapping, because $\|0.5 x_1 - 0.5 x_2\| = 0.5 \|x_1 - x_2\| \le \gamma \|x_1 - x_2\|$ for any $\gamma \in [0.5, 1)$.

Example2

Given $x = f(x) = Ax$, where $x \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$, and $\|A\| \le \gamma < 1$, it is easy to verify that $x^* = 0$ is a fixed point, since $0 = A0$. To see the contraction property: $\|Ax_1 - Ax_2\| = \|A(x_1 - x_2)\| \le \|A\| \|x_1 - x_2\| \le \gamma \|x_1 - x_2\|$.

Therefore, f(x)=Ax is a contraction mapping.

Theorem: Contraction Mapping Theorem

Theorem (Contraction Mapping Theorem): For any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then

  • Existence: there exists a fixed point $x^*$ satisfying $f(x^*) = x^*$.
  • Uniqueness: the fixed point $x^*$ is unique.
  • Algorithm: consider the sequence $\{x_k\}$ with $x_{k+1} = f(x_k)$; then $x_k \to x^*$ as $k \to \infty$. Moreover, the convergence rate is exponentially fast.

-> See the proof
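To make the theorem concrete, here is a minimal sketch with an example of my own (not from the text): the map $f(x) = 0.5x + 1$ is a contraction with $\gamma = 0.5$, its unique fixed point is $x^* = 2$, and the iteration $x_{k+1} = f(x_k)$ converges to it exponentially fast from any starting point:

```python
def fixed_point_iteration(f, x0, num_steps):
    """Iterate x_{k+1} = f(x_k) starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = f(x)
    return x

f = lambda x: 0.5 * x + 1          # contraction with gamma = 0.5; fixed point x* = 2
x = fixed_point_iteration(f, x0=10.0, num_steps=50)

# The error shrinks by a factor gamma each step:
# |x_50 - 2| <= 0.5**50 * |x_0 - 2|, far below 1e-12 here.
assert abs(x - 2.0) < 1e-12
```

Value iteration, introduced below, is precisely this procedure applied to $f(v) = \max_\pi (r_\pi + \gamma P_\pi v)$.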

Contraction property of BOE

Theorem (Contraction Property):

$f(v)$ in (3) is a contraction mapping satisfying $\|f(v_1) - f(v_2)\|_\infty \le \gamma \|v_1 - v_2\|_\infty$, where $\gamma \in (0, 1)$ is the discount rate and $\|\cdot\|_\infty$ is the maximum norm, i.e., the maximum absolute value of the elements of a vector.

-> See the proof

Solution of the BOE

Due to the contraction property of the BOE, the matrix-vector form can be solved by computing the following equation iteratively:

$$
v_{k+1} = f(v_k) = \max_{\pi} (r_\pi + \gamma P_\pi v_k). \qquad (4)
$$

At every iteration, for each state, what we face is actually the elementwise form:

$$
v_{k+1}(s) = \max_{\pi(s) \in \Pi(s)} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_k(s') \right) = \max_{\pi(s) \in \Pi(s)} \sum_a \pi(a \mid s) q_k(s, a) = \max_a q_k(s, a).
$$

As you can see, (2) is leveraged here. This is valid because the elementwise update is exactly one row of the matrix-vector iteration (4): applying (2) at every state computes $f(v_k)$, and the convergence of the resulting sequence is guaranteed by the contraction property of the matrix-vector form.

Procedure summary (value iteration algorithm):

  • For every $s$, start from an arbitrary (e.g., randomly selected) current state value estimate $v_k(s)$.

  • For every $a \in \mathcal{A}(s)$, calculate $q_k(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_k(s')$.

  • Calculate the greedy policy $\pi_{k+1}$ for every $s$: $\pi_{k+1}(a \mid s) = 1$ if $a = a_k^*(s)$ and $0$ otherwise, where $a_k^*(s) = \arg\max_a q_k(s, a)$.

  • Calculate $v_{k+1}(s) = \max_a q_k(s, a)$.
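The four steps above can be sketched in a few lines. This is a minimal implementation under a data layout of my own choosing (`P[s, a, s']` for transition probabilities and `R[s, a]` for expected immediate rewards); it is not code from the book:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Iterate v_{k+1}(s) = max_a [ R[s,a] + gamma * sum_{s'} P[s,a,s'] v_k(s') ].

    P: (S, A, S) transition probabilities p(s'|s,a)
    R: (S, A) expected immediate rewards sum_r p(r|s,a) r
    """
    v = np.zeros(P.shape[0])          # arbitrary initial estimate v_0
    while True:
        q = R + gamma * P @ v         # q_k(s, a), shape (S, A)
        v_new = q.max(axis=1)         # v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)   # values and greedy policy
        v = v_new

# Tiny sanity check: one state, two actions (rewards 0 and 1), self-loop.
P = np.array([[[1.0], [1.0]]])        # shape (1, 2, 1)
R = np.array([[0.0, 1.0]])
v, policy = value_iteration(P, R)
# v*(s) = 1 / (1 - 0.9) = 10; the greedy policy picks the reward-1 action.
assert abs(v[0] - 10.0) < 1e-4 and policy[0] == 1
```

The greedy policy is computed from the same q-array used for the value update, so each pass through the loop performs all four steps at once.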

Example

Example: Manually solve the BOE.

  • States: $s_1, s_2, s_3$, with $s_2$ being the target area.
  • Actions: $a_\ell, a_0, a_r$ represent go left, stay unchanged, and go right.
  • Reward: entering the target area: $+1$; trying to go out of the boundary: $-1$.

The values of q(s,a):

| q-value | $a_\ell$ | $a_0$ | $a_r$ |
|---|---|---|---|
| $s_1$ | $-1 + \gamma v(s_1)$ | $0 + \gamma v(s_1)$ | $1 + \gamma v(s_2)$ |
| $s_2$ | $0 + \gamma v(s_1)$ | $1 + \gamma v(s_2)$ | $0 + \gamma v(s_3)$ |
| $s_3$ | $1 + \gamma v(s_2)$ | $0 + \gamma v(s_3)$ | $-1 + \gamma v(s_3)$ |

Consider γ=0.9.

Our objective is to find $v^*(s_i)$ and $\pi^*$:

k=0:

v-value: select $v_0(s_1) = v_0(s_2) = v_0(s_3) = 0$

q-value (using the previous table):

| q-value | $a_\ell$ | $a_0$ | $a_r$ |
|---|---|---|---|
| $s_1$ | $-1$ | $0$ | $1$ |
| $s_2$ | $0$ | $1$ | $0$ |
| $s_3$ | $1$ | $0$ | $-1$ |

Greedy policy (select the greatest q-value): $\pi(a_r \mid s_1) = 1$, $\pi(a_0 \mid s_2) = 1$, $\pi(a_\ell \mid s_3) = 1$.

v-value: $v_1(s) = \max_a q_0(s, a)$, so $v_1(s_1) = v_1(s_2) = v_1(s_3) = 1$.

k=1:

With v1(s) calculated in the last step, q-value:

| q-value | $a_\ell$ | $a_0$ | $a_r$ |
|---|---|---|---|
| $s_1$ | $-0.1$ | $0.9$ | $1.9$ |
| $s_2$ | $0.9$ | $1.9$ | $0.9$ |
| $s_3$ | $1.9$ | $0.9$ | $-0.1$ |

Greedy policy (select the greatest q-value): $\pi(a_r \mid s_1) = 1$, $\pi(a_0 \mid s_2) = 1$, $\pi(a_\ell \mid s_3) = 1$.

The policy is the same as the previous one, and it is already optimal. v-value: $v_2(s) = \max_a q_1(s, a)$, giving $v_2(s_1) = v_2(s_2) = v_2(s_3) = 1.9$.

$k = 2, 3, \ldots$: iterate until the state values no longer change much between iterations.
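The iterations above can be reproduced numerically. Below is a sketch under the example's deterministic dynamics; the arrays `next_state` and `reward` simply transcribe the q-value table (indices 0, 1, 2 stand for $s_1, s_2, s_3$ and $a_\ell, a_0, a_r$):

```python
import numpy as np

gamma = 0.9
next_state = np.array([[0, 0, 1],     # s1: left bounces back, stay, right -> s2
                       [0, 1, 2],     # s2: left -> s1, stay, right -> s3
                       [1, 2, 2]])    # s3: left -> s2, stay, right bounces back
reward = np.array([[-1.0, 0.0, 1.0],
                   [ 0.0, 1.0, 0.0],
                   [ 1.0, 0.0, -1.0]])

v = np.zeros(3)                        # v_0
for _ in range(200):
    q = reward + gamma * v[next_state] # q_k(s, a)
    v = q.max(axis=1)                  # v_{k+1}(s) = max_a q_k(s, a)

policy = q.argmax(axis=1)
# The iterates 0, 1, 1.9, 2.71, ... converge to v*(s) = 1/(1-0.9) = 10, and
# the greedy policy is (go right, stay, go left), matching the tables above.
assert np.allclose(v, 10.0) and policy.tolist() == [2, 1, 0]
```

Note how the first two sweeps reproduce exactly the $k=0$ and $k=1$ tables computed by hand.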

BOE: Optimality

Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies $v^* = \max_\pi (r_\pi + \gamma P_\pi v^*)$.

Suppose $\pi^* = \arg\max_\pi (r_\pi + \gamma P_\pi v^*)$.

Then $v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$.

Therefore, $\pi^*$ is a policy and $v^* = v_{\pi^*}$ is the corresponding state value.

Now we prove that $\pi^*$ is an optimal policy:

Theorem (Policy Optimality):

Suppose that $v^*$ is the unique solution to $v = \max_\pi (r_\pi + \gamma P_\pi v)$, and $v_\pi$ is the state value function satisfying $v_\pi = r_\pi + \gamma P_\pi v_\pi$ for any given policy $\pi$. Then $v^* \ge v_\pi$ for all $\pi$. -> See the proof

What does π look like?

For any $s \in \mathcal{S}$, the deterministic greedy policy

$$
\pi^*(a \mid s) = \begin{cases} 1 & a = a^*(s) \\ 0 & a \ne a^*(s) \end{cases}
$$

is an optimal policy solving the BOE. Here, $a^*(s) = \arg\max_a q^*(s, a)$, where $q^*(s, a) := \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v^*(s')$.

Proof: By definition,

$$
\pi^*(s) = \arg\max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi(a \mid s) \underbrace{\left( \sum_{r \in \mathcal{R}} p(r \mid s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) v^*(s') \right)}_{q^*(s, a)}, \quad s \in \mathcal{S}.
$$

It is clear that $\sum_{a \in \mathcal{A}} \pi(a \mid s) q^*(s, a)$ is maximized if $\pi(s)$ puts all probability on the action with the greatest $q^*(s, a)$.

Theorem: Optimal policy invariance

Theorem (Optimal policy invariance):

Consider a Markov decision process with $v^* \in \mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying $v^* = \max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v^*)$. If every reward $r \in \mathcal{R}$ is changed by an affine transformation to $\alpha r + \beta$, where $\alpha, \beta \in \mathbb{R}$ and $\alpha > 0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:

$$
v' = \alpha v^* + \frac{\beta}{1 - \gamma} \mathbf{1},
$$

where $\gamma \in (0, 1)$ is the discount rate and $\mathbf{1} = [1, \ldots, 1]^T$.

Consequently, the optimal policy derived from v is invariant to the affine transformation of the reward values.

-> See the proof
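As a numerical illustration of the theorem (reusing the three-state example from earlier; the helper `solve_boe` is my own sketch, not the book's code), transforming every reward to $\alpha r + \beta$ leaves the greedy policy unchanged and shifts the optimal values to $\alpha v^* + \beta/(1-\gamma)$:

```python
import numpy as np

def solve_boe(reward, gamma=0.9, iters=300):
    """Value iteration on the three-state example (deterministic moves)."""
    next_state = np.array([[0, 0, 1], [0, 1, 2], [1, 2, 2]])
    v = np.zeros(3)
    for _ in range(iters):
        q = reward + gamma * v[next_state]
        v = q.max(axis=1)
    return v, q.argmax(axis=1)

gamma = 0.9
r = np.array([[-1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, -1.0]])
alpha, beta = 2.0, 5.0

v_star, pi_star = solve_boe(r, gamma)
v_prime, pi_prime = solve_boe(alpha * r + beta, gamma)

assert pi_prime.tolist() == pi_star.tolist()                      # same optimal policy
assert np.allclose(v_prime, alpha * v_star + beta / (1 - gamma))  # v' = alpha v* + beta/(1-gamma) 1
```

With $\alpha = 2$, $\beta = 5$, and $v^* = 10$ at every state, the check confirms $v' = 2 \cdot 10 + 5/0.1 = 70$ while the greedy actions are untouched.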

Appendix

Proof of the contraction mapping theorem

Part 1

We prove that the sequence $\{x_k\}_{k=1}^{\infty}$ with $x_k = f(x_{k-1})$ is convergent.

The proof relies on Cauchy sequences:

A sequence $x_1, x_2, \ldots \in \mathbb{R}$ is called Cauchy if for any small $\varepsilon > 0$, there exists $N$ such that $\|x_m - x_n\| < \varepsilon$ for all $m, n > N$.

It is guaranteed that a Cauchy sequence converges to a limit.

Note that we must have $\|x_m - x_n\| < \varepsilon$ for all $m, n > N$. Merely having $\|x_{n+1} - x_n\| \to 0$ is insufficient to claim that the sequence is Cauchy. For example, $x_{n+1} - x_n \to 0$ holds for $x_n = \sqrt{n}$, but $x_n = \sqrt{n}$ apparently diverges.

We next show that {xk=f(xk1)}k=1 is a Cauchy sequence and hence converges.

First, since $f$ is a contraction mapping, we have $\|x_{k+1} - x_k\| = \|f(x_k) - f(x_{k-1})\| \le \gamma \|x_k - x_{k-1}\|$.

Similarly, we have $\|x_k - x_{k-1}\| \le \gamma \|x_{k-1} - x_{k-2}\|, \ldots, \|x_2 - x_1\| \le \gamma \|x_1 - x_0\|$. Thus, $\|x_{k+1} - x_k\| \le \gamma \|x_k - x_{k-1}\| \le \gamma^2 \|x_{k-1} - x_{k-2}\| \le \cdots \le \gamma^k \|x_1 - x_0\|$.

Since $\gamma < 1$, we know that $\|x_{k+1} - x_k\|$ converges to zero exponentially fast as $k \to \infty$, for any $x_1, x_0$.

Notably, the convergence of {xk+1xk} is not sufficient for implying the convergence of {xk}.

Therefore, we need to further consider $\|x_m - x_n\|$ for any $m > n$. In particular,

$$
\begin{aligned}
\|x_m - x_n\| &= \|x_m - x_{m-1} + x_{m-1} - \cdots - x_{n+1} + x_{n+1} - x_n\| \\
&\le \|x_m - x_{m-1}\| + \cdots + \|x_{n+1} - x_n\| \\
&\le \gamma^{m-1} \|x_1 - x_0\| + \cdots + \gamma^n \|x_1 - x_0\| \\
&= \gamma^n (\gamma^{m-1-n} + \cdots + 1) \|x_1 - x_0\| \\
&\le \gamma^n (1 + \cdots + \gamma^{m-1-n} + \gamma^{m-n} + \gamma^{m-n+1} + \cdots) \|x_1 - x_0\| \\
&= \frac{\gamma^n}{1 - \gamma} \|x_1 - x_0\|.
\end{aligned}
$$

As a result, for any $\varepsilon$, we can always find $N$ such that $\|x_m - x_n\| < \varepsilon$ for all $m, n > N$. Therefore, this sequence is Cauchy and hence converges to a limit point denoted $x^* = \lim_{k \to \infty} x_k$.

Part 2

We show that the limit $x^* = \lim_{k \to \infty} x_k$ is a fixed point. Since $\|f(x_k) - x_k\| = \|x_{k+1} - x_k\| \le \gamma^k \|x_1 - x_0\|$, we know that $\|f(x_k) - x_k\|$ converges to zero exponentially fast. Hence, we have $f(x^*) = x^*$ in the limit.

Part 3

We show that the fixed point is unique. Suppose that there is another fixed point $x'$ satisfying $f(x') = x'$. Then $\|x' - x^*\| = \|f(x') - f(x^*)\| \le \gamma \|x' - x^*\|$. Since $\gamma < 1$, this inequality can hold only if $\|x' - x^*\| = 0$, i.e., $x' = x^*$.

Proof of the contraction property of BOE

Consider any two vectors $v_1, v_2 \in \mathbb{R}^{|\mathcal{S}|}$, and suppose that $\pi_1^* := \arg\max_\pi (r_\pi + \gamma P_\pi v_1)$ and $\pi_2^* := \arg\max_\pi (r_\pi + \gamma P_\pi v_2)$. Then

$$
\begin{aligned}
f(v_1) &= \max_\pi (r_\pi + \gamma P_\pi v_1) = r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 \ge r_{\pi_2^*} + \gamma P_{\pi_2^*} v_1, \\
f(v_2) &= \max_\pi (r_\pi + \gamma P_\pi v_2) = r_{\pi_2^*} + \gamma P_{\pi_2^*} v_2 \ge r_{\pi_1^*} + \gamma P_{\pi_1^*} v_2,
\end{aligned}
$$

where $\ge$ is an elementwise comparison. As a result,

$$
f(v_1) - f(v_2) = r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 - (r_{\pi_2^*} + \gamma P_{\pi_2^*} v_2) \le r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 - (r_{\pi_1^*} + \gamma P_{\pi_1^*} v_2) = \gamma P_{\pi_1^*} (v_1 - v_2).
$$

Similarly, it can be shown that $f(v_2) - f(v_1) \le \gamma P_{\pi_2^*} (v_2 - v_1)$.

Therefore, $\gamma P_{\pi_2^*} (v_1 - v_2) \le f(v_1) - f(v_2) \le \gamma P_{\pi_1^*} (v_1 - v_2)$.

Define $z := \max\{ |\gamma P_{\pi_2^*} (v_1 - v_2)|, |\gamma P_{\pi_1^*} (v_1 - v_2)| \} \in \mathbb{R}^{|\mathcal{S}|}$, where $\max(\cdot)$, $|\cdot|$, and $\ge$ are all elementwise operators. By definition, $z \ge 0$. On the one hand, it is easy to see that $-z \le \gamma P_{\pi_2^*} (v_1 - v_2) \le f(v_1) - f(v_2) \le \gamma P_{\pi_1^*} (v_1 - v_2) \le z$, which implies $|f(v_1) - f(v_2)| \le z$.

It then follows that

$$
\|f(v_1) - f(v_2)\|_\infty \le \|z\|_\infty, \qquad (5)
$$

where $\|\cdot\|_\infty$ is the maximum norm. On the other hand, suppose that $z_i$ is the $i$th entry of $z$, and $p_i^T$ and $q_i^T$ are the $i$th rows of $P_{\pi_1^*}$ and $P_{\pi_2^*}$, respectively. Then $z_i = \max\{ \gamma |p_i^T (v_1 - v_2)|, \gamma |q_i^T (v_1 - v_2)| \}$.

Since $p_i$ is a vector with all nonnegative elements summing to one, it follows that $|p_i^T (v_1 - v_2)| \le p_i^T |v_1 - v_2| \le \|v_1 - v_2\|_\infty$.

Similarly, we have $|q_i^T (v_1 - v_2)| \le \|v_1 - v_2\|_\infty$. Therefore, $z_i \le \gamma \|v_1 - v_2\|_\infty$ and hence $\|z\|_\infty = \max_i |z_i| \le \gamma \|v_1 - v_2\|_\infty$.

Substituting this inequality into (5) gives $\|f(v_1) - f(v_2)\|_\infty \le \gamma \|v_1 - v_2\|_\infty$, which concludes the proof of the contraction property of $f(v)$.

Q.E.D.

Proof of Theorem: Policy Optimality

For any policy $\pi$, it holds that $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

Since $v^* = \max_\pi (r_\pi + \gamma P_\pi v^*) = r_{\pi^*} + \gamma P_{\pi^*} v^* \ge r_\pi + \gamma P_\pi v^*$, we have $v^* - v_\pi \ge (r_\pi + \gamma P_\pi v^*) - (r_\pi + \gamma P_\pi v_\pi) = \gamma P_\pi (v^* - v_\pi)$.

Repeatedly applying the above inequality gives $v^* - v_\pi \ge \gamma P_\pi (v^* - v_\pi) \ge \gamma^2 P_\pi^2 (v^* - v_\pi) \ge \cdots \ge \gamma^n P_\pi^n (v^* - v_\pi)$. It follows that $v^* - v_\pi \ge \lim_{n \to \infty} \gamma^n P_\pi^n (v^* - v_\pi) = 0$, where the last equality is true because $\gamma < 1$ and $P_\pi^n$ is a nonnegative matrix with all its elements less than or equal to 1 (because $P_\pi^n \mathbf{1} = \mathbf{1}$). Therefore, $v^* \ge v_\pi$ for any $\pi$.

Proof of Theorem: Optimal policy invariance

For any policy $\pi$, define $r_\pi = [\ldots, r_\pi(s), \ldots]^T$, where $r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a) r$ for $s \in \mathcal{S}$.

If $r \to \alpha r + \beta$, then $r_\pi(s) \to \alpha r_\pi(s) + \beta$ and hence $r_\pi \to \alpha r_\pi + \beta \mathbf{1}$, where $\mathbf{1} = [1, \ldots, 1]^T$. In this case, the BOE becomes

$$
v' = \max_{\pi \in \Pi} (\alpha r_\pi + \beta \mathbf{1} + \gamma P_\pi v'), \qquad (6)
$$

where $v'$ is the new optimal state value after the change of rewards.

We next solve the new BOE in (6) by showing that $v' = \alpha v^* + c \mathbf{1}$ with $c = \beta / (1 - \gamma)$ is a solution of (6). In particular, substituting $v' = \alpha v^* + c \mathbf{1}$ into (6) gives

$$
\alpha v^* + c \mathbf{1} = \max_{\pi \in \Pi} (\alpha r_\pi + \beta \mathbf{1} + \gamma P_\pi (\alpha v^* + c \mathbf{1})) = \max_{\pi \in \Pi} (\alpha r_\pi + \beta \mathbf{1} + \alpha \gamma P_\pi v^* + c \gamma \mathbf{1}),
$$

where the last equality is due to the fact that $P_\pi \mathbf{1} = \mathbf{1}$. The above equation can be reorganized as

$$
\alpha v^* = \max_{\pi \in \Pi} (\alpha r_\pi + \alpha \gamma P_\pi v^*) + \beta \mathbf{1} + c \gamma \mathbf{1} - c \mathbf{1},
$$

which, since $\alpha v^* = \max_{\pi \in \Pi} (\alpha r_\pi + \alpha \gamma P_\pi v^*)$, is equivalent to $\beta \mathbf{1} + c \gamma \mathbf{1} - c \mathbf{1} = 0$.

Since $c = \beta / (1 - \gamma)$, the above equation is valid and hence $v' = \alpha v^* + c \mathbf{1}$ is a solution of (6). Since (6) is a BOE, $v'$ is also its unique solution.

Finally, since $v'$ is an affine transformation of $v^*$ with $\alpha > 0$, the relative relationships between the action values remain the same.

Hence, the greedy optimal policy derived from $v'$ is the same as the one derived from $v^*$: $\arg\max_{\pi \in \Pi} (\alpha r_\pi + \beta \mathbf{1} + \gamma P_\pi v')$ yields the same policy as $\arg\max_\pi (r_\pi + \gamma P_\pi v^*)$.