Proof of the Policy Gradient Theorem

Here we prove the policy gradient theorem, i.e., that the gradient of an objective function $J(\theta)$ is
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a),$$
where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.

Moreover, this equation has a compact form expressed in terms of an expectation:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],$$
where $\ln$ is the natural logarithm.
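The two forms are equivalent because of the log-derivative identity $\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta)$, a step worth spelling out since it is used repeatedly below:
$$\begin{aligned}
\nabla_\theta J(\theta)
  &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta)\, q_\pi(s,a) \\
  &= \mathbb{E}_{S \sim \eta}\Big[\sum_{a \in \mathcal{A}} \pi(a|S,\theta)\, \nabla_\theta \ln \pi(a|S,\theta)\, q_\pi(S,a)\Big] \\
  &= \mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big].
\end{aligned}$$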

We prove this theorem for the discounted case and the undiscounted case separately. In each case, we prove it for three different metrics: $\bar{v}_\pi$, $\bar{r}_\pi$, and $\bar{v}_\pi^0$.

For simplicity, I only include the proof for the discounted case in the appendix. See the book for the proof of the undiscounted case.

Sources:

  1. Shiyu Zhao. Chapter 8: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.

Derivation of the gradients in the discounted case

We next derive the gradients of the metrics in the discounted case where $\gamma \in (0,1)$. The state value and action value in the discounted case are defined as
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\big], \qquad q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s,\, A_t = a\big].$$

It holds that $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)$, and the state value satisfies the Bellman equation.
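For reference, the Bellman equation used below can be written elementwise and in matrix-vector form (with $r_\pi$ and $P_\pi$ as they appear in the proof of Lemma 9.1):
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\Big[r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s')\Big], \qquad v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where $[r_\pi]_s = \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, r(s,a)$ and $[P_\pi]_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, p(s'|s,a)$.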

First, we show that $\bar{v}_\pi(\theta)$ and $\bar{r}_\pi(\theta)$ are equivalent metrics.

Lemma 9.1

Lemma 9.1 (Equivalence between $\bar{v}_\pi(\theta)$ and $\bar{r}_\pi(\theta)$). In the discounted case where $\gamma \in (0,1)$, it holds that
$$\bar{r}_\pi = (1-\gamma)\bar{v}_\pi. \tag{1}$$
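Before the proof, here is a quick numerical sanity check of (1) on a small random MDP (a sketch of my own; names such as `P_pi`, `r_pi`, and `d_pi` are assumptions, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Random row-stochastic transition matrix P_pi and reward vector r_pi under a fixed policy.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

# State values from the Bellman equation v_pi = r_pi + gamma * P_pi v_pi.
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Stationary distribution d_pi: left eigenvector of P_pi with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d_pi /= d_pi.sum()

v_bar = d_pi @ v_pi   # \bar{v}_pi = d_pi^T v_pi
r_bar = d_pi @ r_pi   # \bar{r}_pi = d_pi^T r_pi

# Lemma 9.1: \bar{r}_pi = (1 - gamma) * \bar{v}_pi
assert np.isclose(r_bar, (1 - gamma) * v_bar)
```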

Proof: Note that $\bar{v}_\pi(\theta) = d_\pi^T v_\pi$ and $\bar{r}_\pi(\theta) = d_\pi^T r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Multiplying $d_\pi^T$ on both sides of the Bellman equation yields
$$\bar{v}_\pi = \bar{r}_\pi + \gamma d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma d_\pi^T v_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi,$$
where the second equality uses $d_\pi^T P_\pi = d_\pi^T$ (since $d_\pi$ is the stationary distribution under $\pi$). This implies (1).

Second, the following lemma gives the gradient of $v_\pi(s)$ for any $s$.

Lemma 9.2

Lemma 9.2 (Gradient of $v_\pi(s)$). In the discounted case, it holds for any $s \in \mathcal{S}$ that
$$\nabla_\theta v_\pi(s) = \sum_{s' \in \mathcal{S}} \Pr_\pi(s'|s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a),$$
where
$$\Pr_\pi(s'|s) \doteq \sum_{k=0}^{\infty} \gamma^k \big[P_\pi^k\big]_{ss'} = \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'}$$
is the discounted total probability of transitioning from $s$ to $s'$ under policy $\pi$. Here, $[\cdot]_{ss'}$ denotes the entry in the $s$th row and $s'$th column, and $[P_\pi^k]_{ss'}$ is the probability of transitioning from $s$ to $s'$ in exactly $k$ steps under $\pi$.

With the results in Lemma 9.2, we are ready to derive the gradient of $\bar{v}_\pi^0$.

Theorem 9.2

Theorem 9.2 (Gradient of $\bar{v}_\pi^0$ in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradient of $\bar{v}_\pi^0 = d_0^T v_\pi$ is
$$\nabla_\theta \bar{v}_\pi^0 = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],$$
where $S \sim \rho_\pi$ and $A \sim \pi(S,\theta)$. Here, the state distribution $\rho_\pi$ is
$$\rho_\pi(s) = \sum_{s' \in \mathcal{S}} d_0(s')\, \Pr_\pi(s|s'), \quad s \in \mathcal{S},$$
where $\Pr_\pi(s|s') = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{s's} = [(I - \gamma P_\pi)^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.
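A minimal sketch (assumed names, not from the book) of how $\rho_\pi$ is obtained from $d_0$ and $P_\pi$ for a small finite MDP; as a byproduct it checks that $\rho_\pi$ sums to $1/(1-\gamma)$, which follows from the geometric series since each $P_\pi^k$ is row-stochastic:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.9

# Row-stochastic transition matrix under the current policy, and an initial distribution d_0.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
d_0 = np.full(n, 1.0 / n)

# Discounted total transition probabilities: Pr_pi(s|s') = [(I - gamma*P_pi)^{-1}]_{s's}.
Pr_pi = np.linalg.inv(np.eye(n) - gamma * P_pi)

# rho_pi(s) = sum_{s'} d_0(s') * Pr_pi(s|s') = [d_0^T (I - gamma*P_pi)^{-1}]_s
rho_pi = d_0 @ Pr_pi

# rho_pi is an unnormalized distribution: it sums to 1/(1 - gamma).
assert np.isclose(rho_pi.sum(), 1.0 / (1.0 - gamma))
```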

Theorem 9.3

Theorem 9.3 (Gradients of $\bar{r}_\pi$ and $\bar{v}_\pi$ in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradients of $\bar{r}_\pi$ and $\bar{v}_\pi$ are
$$\nabla_\theta \bar{r}_\pi = (1-\gamma)\nabla_\theta \bar{v}_\pi \approx \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],$$
where $S \sim d_\pi$ and $A \sim \pi(S,\theta)$. Here, the approximation is more accurate when $\gamma$ is closer to $1$.
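In practice, these expectation forms are what make policy gradient methods implementable: the expectation is replaced with samples and $q_\pi(s,a)$ with an estimate (for example the sampled discounted return, as in REINFORCE), giving the stochastic update $\theta \leftarrow \theta + \alpha\, \nabla_\theta \ln \pi(a|s,\theta)\, q_t(s,a)$. Below is a minimal sketch under those assumptions for a tabular softmax policy; the names (`softmax_policy`, `reinforce_update`) and the episode format are hypothetical:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s, theta) for a tabular softmax policy; theta has shape (n_states, n_actions)."""
    prefs = theta[s] - theta[s].max()        # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of ln pi(a|s, theta) w.r.t. theta for the softmax parameterization."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)      # d ln pi(a|s)/d theta[s, b] = 1{b=a} - pi(b|s)
    grad[s, a] += 1.0
    return grad

def reinforce_update(theta, episode, alpha=0.01, gamma=0.9):
    """One REINFORCE-style pass: use the sampled return G_t as an estimate of q_pi(s_t, a_t).

    Note: this common variant drops the extra gamma^t weighting implied by sampling S ~ rho_pi.
    """
    G = 0.0
    for s, a, r in reversed(episode):        # episode: list of (state, action, reward)
        G = r + gamma * G
        theta = theta + alpha * G * grad_log_pi(theta, s, a)   # theta <- theta + alpha * q * grad ln pi
    return theta
```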

Appendix

Proof of Lemma 9.2

First, for any $s \in \mathcal{S}$, it holds that
$$\nabla_\theta v_\pi(s) = \nabla_\theta\Big[\sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)\Big] = \sum_{a \in \mathcal{A}} \big[\nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) + \pi(a|s,\theta)\, \nabla_\theta q_\pi(s,a)\big],$$
where $q_\pi(s,a)$ is the action value given by
$$q_\pi(s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s').$$

Since $r(s,a) = \sum_{r} r\, p(r|s,a)$ is independent of $\theta$, we have
$$\nabla_\theta q_\pi(s,a) = 0 + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s').$$
Substituting this result into the expression of $\nabla_\theta v_\pi(s)$ yields
$$\nabla_\theta v_\pi(s) = \sum_{a \in \mathcal{A}} \Big[\nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) + \pi(a|s,\theta)\, \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s')\Big] = \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) + \gamma \sum_{a \in \mathcal{A}} \pi(a|s,\theta) \sum_{s' \in \mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s').$$

It is notable that $\nabla_\theta v_\pi$ appears on both sides of the above equation. Here, we use the matrix-vector form to calculate it. In particular, let
$$u(s) \doteq \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a).$$
Since
$$\sum_{a \in \mathcal{A}} \pi(a|s,\theta) \sum_{s' \in \mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s') = \sum_{s' \in \mathcal{S}} p_\pi(s'|s)\, \nabla_\theta v_\pi(s') = \sum_{s' \in \mathcal{S}} [P_\pi]_{ss'}\, \nabla_\theta v_\pi(s'),$$
the elementwise equation for $\nabla_\theta v_\pi(s)$ above can be written in matrix-vector form as
$$\underbrace{\begin{bmatrix} \vdots \\ \nabla_\theta v_\pi(s) \\ \vdots \end{bmatrix}}_{\nabla_\theta v_\pi \in \mathbb{R}^{mn}} = \underbrace{\begin{bmatrix} \vdots \\ u(s) \\ \vdots \end{bmatrix}}_{u \in \mathbb{R}^{mn}} + \gamma\, (P_\pi \otimes I_m) \underbrace{\begin{bmatrix} \vdots \\ \nabla_\theta v_\pi(s') \\ \vdots \end{bmatrix}}_{\nabla_\theta v_\pi \in \mathbb{R}^{mn}},$$
which can be written concisely as
$$\nabla_\theta v_\pi = u + \gamma\, (P_\pi \otimes I_m)\, \nabla_\theta v_\pi.$$
Here, $n = |\mathcal{S}|$, $m$ is the dimension of the parameter vector $\theta$, and $\otimes$ denotes the Kronecker product. The reason that the Kronecker product emerges in the equation is that $\nabla_\theta v_\pi(s)$ is a vector. The above equation is a linear equation of $\nabla_\theta v_\pi$, which can be solved as
$$\nabla_\theta v_\pi = (I_{nm} - \gamma P_\pi \otimes I_m)^{-1} u = (I_n \otimes I_m - \gamma P_\pi \otimes I_m)^{-1} u = \big[(I_n - \gamma P_\pi)^{-1} \otimes I_m\big] u.$$
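The last step uses the Kronecker product identities $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ and $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$. A small numerical sanity check of the resulting identity (my own sketch, with assumed names):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, gamma = 3, 2, 0.9   # n = |S|, m = dim(theta)

P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
u = rng.random(n * m)                      # stacked vector [... u(s) ...] in R^{nm}

I_n, I_m, I_nm = np.eye(n), np.eye(m), np.eye(n * m)

# Solve (I_nm - gamma * P_pi kron I_m) x = u directly ...
lhs = np.linalg.solve(I_nm - gamma * np.kron(P_pi, I_m), u)
# ... and via the closed form [(I_n - gamma * P_pi)^{-1} kron I_m] u.
rhs = np.kron(np.linalg.inv(I_n - gamma * P_pi), I_m) @ u

assert np.allclose(lhs, rhs)
```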

For any state $s$, it follows from the solved equation $\nabla_\theta v_\pi = [(I_n - \gamma P_\pi)^{-1} \otimes I_m]\, u$ that
$$\nabla_\theta v_\pi(s) = \sum_{s' \in \mathcal{S}} \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'}\, u(s') = \sum_{s' \in \mathcal{S}} \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'} \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a).$$

The quantity $[(I_n - \gamma P_\pi)^{-1}]_{ss'}$ has a clear probabilistic interpretation. In particular, since
$$(I_n - \gamma P_\pi)^{-1} = I + \gamma P_\pi + \gamma^2 P_\pi^2 + \cdots,$$
we have
$$\big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'} = [I]_{ss'} + \gamma [P_\pi]_{ss'} + \gamma^2 [P_\pi^2]_{ss'} + \cdots = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{ss'}.$$
Note that $[P_\pi^k]_{ss'}$ is the probability of transitioning from $s$ to $s'$ in exactly $k$ steps (see my previous post). Therefore, $[(I_n - \gamma P_\pi)^{-1}]_{ss'}$ is the discounted total probability of transitioning from $s$ to $s'$ in any number of steps. By denoting $[(I_n - \gamma P_\pi)^{-1}]_{ss'} \doteq \Pr_\pi(s'|s)$, we obtain
$$\nabla_\theta v_\pi(s) = \sum_{s' \in \mathcal{S}} \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'} \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) = \sum_{s' \in \mathcal{S}} \Pr_\pi(s'|s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a).$$

Q. E. D.
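As an aside, the series interpretation used above is easy to check numerically for a row-stochastic $P_\pi$ and $\gamma \in (0,1)$ (a sketch with assumed names; the truncation level `K` is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma, K = 4, 0.9, 500

P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)

# Truncated series sum_{k=0}^{K} gamma^k P_pi^k versus the matrix inverse (I - gamma*P_pi)^{-1}.
series = sum(gamma**k * np.linalg.matrix_power(P_pi, k) for k in range(K + 1))
assert np.allclose(series, np.linalg.inv(np.eye(n) - gamma * P_pi), atol=1e-8)
```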

Proof of Theorem 9.2

Since $d_0(s)$ is independent of $\pi$, we have
$$\nabla_\theta \bar{v}_\pi^0 = \nabla_\theta \sum_{s \in \mathcal{S}} d_0(s)\, v_\pi(s) = \sum_{s \in \mathcal{S}} d_0(s)\, \nabla_\theta v_\pi(s).$$

Substituting the expression of $\nabla_\theta v_\pi(s)$ given in Lemma 9.2 into the above equation yields
$$\begin{aligned}
\nabla_\theta \bar{v}_\pi^0 &= \sum_{s \in \mathcal{S}} d_0(s)\, \nabla_\theta v_\pi(s) = \sum_{s \in \mathcal{S}} d_0(s) \sum_{s' \in \mathcal{S}} \Pr_\pi(s'|s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) \\
&= \sum_{s' \in \mathcal{S}} \Big(\sum_{s \in \mathcal{S}} d_0(s)\, \Pr_\pi(s'|s)\Big) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) \\
&\doteq \sum_{s' \in \mathcal{S}} \rho_\pi(s') \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) \\
&= \sum_{s \in \mathcal{S}} \rho_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \quad (\text{change } s' \text{ to } s) \\
&= \sum_{s \in \mathcal{S}} \rho_\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta)\, q_\pi(s,a) \\
&= \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],
\end{aligned}$$
where $S \sim \rho_\pi$ and $A \sim \pi(S,\theta)$. The proof is complete.
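As a numerical sanity check of Theorem 9.2 (my own sketch; names such as `policy` and `state_values` are assumptions), the following compares the analytic gradient $\sum_s \rho_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$ against a finite-difference gradient of $\bar{v}_\pi^0 = d_0^T v_\pi$ for a small MDP with a tabular softmax policy:

```python
import numpy as np

rng = np.random.default_rng(4)
n, A, gamma = 3, 2, 0.9

p = rng.random((n, A, n)); p /= p.sum(axis=2, keepdims=True)   # p(s'|s,a)
r = rng.random((n, A))                                         # r(s,a)
d0 = np.full(n, 1.0 / n)                                       # initial distribution d_0
theta = rng.normal(size=(n, A))                                # policy parameters

def policy(theta):
    """Tabular softmax policy pi(a|s, theta)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def state_values(theta):
    """Solve the Bellman equation for v_pi; also return P_pi."""
    pi = policy(theta)
    P_pi = np.einsum('sa,san->sn', pi, p)
    r_pi = (pi * r).sum(axis=1)
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return v, P_pi

# Analytic gradient from Theorem 9.2: sum_s rho_pi(s) sum_a grad pi(a|s) q_pi(s,a).
pi = policy(theta)
v, P_pi = state_values(theta)
q = r + gamma * np.einsum('san,n->sa', p, v)                   # q_pi(s,a)
rho = d0 @ np.linalg.inv(np.eye(n) - gamma * P_pi)             # rho_pi(s) = [d_0^T (I - gamma P_pi)^{-1}]_s
grad = np.zeros_like(theta)
for s in range(n):
    for a in range(A):
        # Softmax Jacobian: d pi(a|s) / d theta[s, b] = pi(a|s) * (1{a=b} - pi(b|s)).
        dpi = pi[s, a] * ((np.arange(A) == a).astype(float) - pi[s])
        grad[s] += rho[s] * dpi * q[s, a]

# Finite-difference gradient of d_0^T v_pi(theta).
eps, fd = 1e-6, np.zeros_like(theta)
for s in range(n):
    for a in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps; tm[s, a] -= eps
        fd[s, a] = (d0 @ state_values(tp)[0] - d0 @ state_values(tm)[0]) / (2 * eps)

assert np.allclose(grad, fd, rtol=1e-4, atol=1e-6)
```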

Proof of Theorem 9.3

It follows from the definition of $\bar{v}_\pi$ that
$$\nabla_\theta \bar{v}_\pi = \nabla_\theta \sum_{s \in \mathcal{S}} d_\pi(s)\, v_\pi(s) = \sum_{s \in \mathcal{S}} \nabla_\theta d_\pi(s)\, v_\pi(s) + \sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s).$$

This equation contains two terms. On the one hand, substituting the expression $\nabla_\theta v_\pi = [(I_n - \gamma P_\pi)^{-1} \otimes I_m]\, u$ obtained in the proof of Lemma 9.2 into the second term gives
$$\sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s) = (d_\pi^T \otimes I_m)\, \nabla_\theta v_\pi = (d_\pi^T \otimes I_m)\big[(I_n - \gamma P_\pi)^{-1} \otimes I_m\big] u = \Big(\big[d_\pi^T (I_n - \gamma P_\pi)^{-1}\big] \otimes I_m\Big) u.$$

It is noted that
$$d_\pi^T (I_n - \gamma P_\pi)^{-1} = \frac{1}{1-\gamma} d_\pi^T,$$
which can be easily verified by multiplying $(I_n - \gamma P_\pi)$ on both sides and using the stationarity of $d_\pi$, i.e., $d_\pi^T P_\pi = d_\pi^T$, so that $d_\pi^T(I_n - \gamma P_\pi) = (1-\gamma) d_\pi^T$. Therefore, the previous equation becomes
$$\sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s) = \frac{1}{1-\gamma} (d_\pi^T \otimes I_m)\, u = \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a).$$
On the other hand, the first term of $\sum_{s \in \mathcal{S}} \nabla_\theta d_\pi(s)\, v_\pi(s) + \sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s)$ involves $\nabla_\theta d_\pi$. However, since the second term contains the factor $\frac{1}{1-\gamma}$, the second term becomes dominant and the first term becomes negligible as $\gamma \to 1$ (see the book for a rigorous argument). Therefore,
$$\nabla_\theta \bar{v}_\pi \approx \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a).$$

Furthermore, it follows from $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$ that
$$\begin{aligned}
\nabla_\theta \bar{r}_\pi = (1-\gamma)\nabla_\theta \bar{v}_\pi &\approx \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \\
&= \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta)\, q_\pi(s,a) \\
&= \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big].
\end{aligned}$$
The approximation in the above equation requires that the first term (the one involving $\nabla_\theta d_\pi$) does not go to infinity as $\gamma \to 1$.