Here we prove the policy gradient theorem, i.e., the gradient of an objective function $J(\theta)$ is

$$
\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a),
$$

where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.
Moreover, this equation has a compact form expressed in terms of expectation:

$$
\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],
$$

where $\ln$ is the natural logarithm, $S \sim \eta$, and $A \sim \pi(A|S,\theta)$.
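As a quick sanity check of the expectation form, here is a minimal numerical sketch (using numpy; the softmax parameterization and all quantities are synthetic and purely illustrative) verifying $\sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}_{A \sim \pi}\big[\nabla_\theta \ln \pi(A|s,\theta)\, q_\pi(s,A)\big]$ at a single state:

```python
# Check the log-derivative identity at one state for a softmax policy over 3 actions.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)          # logits, one per action (illustrative)
q = rng.normal(size=3)              # arbitrary action values q(s, a)

pi = np.exp(theta) / np.exp(theta).sum()     # softmax policy pi(a|s, theta)
# For a softmax policy: grad_theta pi(a|s) = pi(a)(e_a - pi),  grad_theta ln pi(a|s) = e_a - pi.
grad_pi = np.diag(pi) - np.outer(pi, pi)     # row a holds grad_theta pi(a|s)
grad_log_pi = np.eye(3) - pi                 # row a holds grad_theta ln pi(a|s)

lhs = grad_pi.T @ q                          # sum_a grad pi(a|s) q(s,a)
rhs = grad_log_pi.T @ (pi * q)               # sum_a pi(a|s) grad ln pi(a|s) q(s,a)
print(np.allclose(lhs, rhs))                 # True
```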
We prove this theorem in the discounted and undiscounted cases separately. In each case, we prove it for three different metrics $J(\theta)$: $\bar{v}_\pi^0$, $\bar{v}_\pi$, and $\bar{r}_\pi$.
For simplicity, I only include the proofs for the discounted case in the appendix; see the book for the proofs in the undiscounted case.
Sources:
- Shiyu Zhao. Chapter 9: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.
Derivation of the gradients in the discounted case
We next derive the gradients of the metrics in the discounted case where $\gamma \in (0,1)$. The state value and action value in the discounted case are defined as

$$
v_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s\Big], \qquad
q_\pi(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s, A_0 = a\Big].
$$
It holds that $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)$, and the state value satisfies the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$.
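To make these definitions concrete, here is a small numerical sketch (numpy; the MDP is randomly generated and purely illustrative) that solves the Bellman equation for $v_\pi$ and confirms $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$:

```python
# On a random finite MDP, solve v_pi = (I - gamma * P_pi)^{-1} r_pi
# and confirm v_pi(s) = sum_a pi(a|s) q_pi(s,a).  All quantities are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n, m_a, gamma = 4, 3, 0.9                                        # states, actions, discount
P = rng.random((n, m_a, n)); P /= P.sum(axis=2, keepdims=True)   # p(s'|s,a)
r = rng.random((n, m_a))                                         # r(s,a)
pi = rng.random((n, m_a)); pi /= pi.sum(axis=1, keepdims=True)   # pi(a|s)

P_pi = np.einsum('sa,san->sn', pi, P)          # [P_pi]_{ss'} = sum_a pi(a|s) p(s'|s,a)
r_pi = (pi * r).sum(axis=1)                    # r_pi(s) = sum_a pi(a|s) r(s,a)
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # Bellman equation
q_pi = r + gamma * P @ v_pi                    # q_pi(s,a) = r(s,a) + gamma sum_{s'} p(s'|s,a) v_pi(s')
print(np.allclose(v_pi, (pi * q_pi).sum(axis=1)))        # True
```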
First, we show that $\bar{r}_\pi$ and $\bar{v}_\pi$ are equivalent metrics.
Lemma 9.1
Lemma 9.1 (Equivalence between $\bar{v}_\pi$ and $\bar{r}_\pi$). In the discounted case where $\gamma \in (0,1)$, it holds that

$$
\bar{r}_\pi = (1-\gamma)\,\bar{v}_\pi.
$$
Proof: Note that $\bar{v}_\pi = d_\pi^T v_\pi$ and $\bar{r}_\pi = d_\pi^T r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Multiplying $d_\pi^T$ on both sides of the Bellman equation yields

$$
d_\pi^T v_\pi = d_\pi^T r_\pi + \gamma\, d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma\, d_\pi^T v_\pi,
$$

where the last equality uses the stationarity $d_\pi^T P_\pi = d_\pi^T$. This implies $\bar{v}_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi$ and hence $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$.
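Here is a quick numerical check of Lemma 9.1 on a randomly generated Markov chain (numpy; all quantities are synthetic), with $d_\pi$ taken as the stationary distribution of $P_\pi$:

```python
# Numerical check of Lemma 9.1: r_bar = (1 - gamma) * v_bar.
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 5, 0.9
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)  # row-stochastic P_pi
r_pi = rng.random(n)                                                # r_pi(s)

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1, normalized.
w, V = np.linalg.eig(P_pi.T)
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))]); d_pi /= d_pi.sum()

v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # v_pi = (I - gamma P_pi)^{-1} r_pi
r_bar = d_pi @ r_pi                                      # r_bar = d_pi^T r_pi
v_bar = d_pi @ v_pi                                      # v_bar = d_pi^T v_pi
print(np.allclose(r_bar, (1 - gamma) * v_bar))           # True
```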
Second, the following lemma gives the gradient of $v_\pi(s)$ for any $s \in \mathcal{S}$.

Lemma 9.2
Lemma 9.2 (Gradient of $v_\pi(s)$). In the discounted case, it holds for any $s \in \mathcal{S}$ that

$$
\nabla_\theta v_\pi(s) = \sum_{s' \in \mathcal{S}} \Pr_\pi(s'|s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a),
$$

where

$$
\Pr_\pi(s'|s) \doteq \sum_{k=0}^{\infty} \gamma^k \big[P_\pi^k\big]_{ss'} = \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'}
$$

is the discounted total probability of transitioning from $s$ to $s'$ under policy $\pi$ (with $n = |\mathcal{S}|$). Here, $[\cdot]_{ss'}$ denotes the entry in the $s$th row and $s'$th column, and $[P_\pi^k]_{ss'}$ is the probability of transitioning from $s$ to $s'$ using exactly $k$ steps under $\pi$.
With the results in Lemma 9.2, we are ready to derive the gradient of $\bar{v}_\pi^0$.
Theorem 9.2
Theorem 9.2 (Gradient of $\bar{v}_\pi^0$ in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradient of $\bar{v}_\pi^0 = d_0^T v_\pi$ is

$$
\nabla_\theta \bar{v}_\pi^0 = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],
$$

where $S \sim \rho_\pi$ and $A \sim \pi(A|S,\theta)$. Here, the state distribution $\rho_\pi$ is

$$
\rho_\pi(s) = \sum_{s' \in \mathcal{S}} d_0(s')\, \Pr_\pi(s|s'), \qquad s \in \mathcal{S},
$$

where $\Pr_\pi(s|s') = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{s's} = [(I_n - \gamma P_\pi)^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.
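The theorem can be verified numerically. The sketch below (numpy; a random MDP with a tabular softmax policy, purely illustrative) compares the gradient given by Theorem 9.2 against a finite-difference gradient of $\bar{v}_\pi^0$:

```python
# Finite-difference check of Theorem 9.2 for a tabular softmax policy pi(a|s) = softmax(theta[s, :]).
import numpy as np

rng = np.random.default_rng(3)
n, m_a, gamma = 4, 3, 0.9
P = rng.random((n, m_a, n)); P /= P.sum(axis=2, keepdims=True)   # p(s'|s,a)
r = rng.random((n, m_a))                                         # r(s,a)
d0 = rng.random(n); d0 /= d0.sum()                               # initial distribution d_0
theta = rng.normal(size=(n, m_a))                                # policy parameters

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def v_bar_0(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,san->sn', pi, P)
    r_pi = (pi * r).sum(axis=1)
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return d0 @ v

# Analytic gradient from Theorem 9.2:
#   grad = sum_s rho_pi(s) sum_a grad_theta pi(a|s) q_pi(s,a),  rho_pi^T = d_0^T (I - gamma P_pi)^{-1}.
# For a tabular softmax policy, sum_a grad_{theta[s,b]} pi(a|s) q(s,a) = pi(b|s) (q(s,b) - v(s)).
pi = policy(theta)
P_pi = np.einsum('sa,san->sn', pi, P)
r_pi = (pi * r).sum(axis=1)
v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
q = r + gamma * P @ v
rho = d0 @ np.linalg.inv(np.eye(n) - gamma * P_pi)        # rho_pi(s) = sum_{s'} d_0(s') Pr_pi(s|s')
grad_analytic = rho[:, None] * pi * (q - v[:, None])

# Finite-difference gradient of v_bar_0 for comparison.
eps, grad_fd = 1e-6, np.zeros_like(theta)
for s in range(n):
    for a in range(m_a):
        t1, t2 = theta.copy(), theta.copy()
        t1[s, a] += eps; t2[s, a] -= eps
        grad_fd[s, a] = (v_bar_0(t1) - v_bar_0(t2)) / (2 * eps)
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))     # True
```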
Theorem 9.3
Theorem 9.3 (Gradients of $\bar{v}_\pi$ and $\bar{r}_\pi$ in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradients of $\bar{v}_\pi$ and $\bar{r}_\pi$ are

$$
\nabla_\theta \bar{r}_\pi = (1-\gamma)\,\nabla_\theta \bar{v}_\pi \approx \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],
$$

where $S \sim d_\pi$ and $A \sim \pi(A|S,\theta)$. Here, the approximation is more accurate when $\gamma$ is closer to $1$.
Appendix
Proof of Lemma 9.2
First, for any $s \in \mathcal{S}$, it holds that

$$
\nabla_\theta v_\pi(s) = \nabla_\theta \Big[\sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)\Big]
= \sum_{a \in \mathcal{A}} \Big[\nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) + \pi(a|s,\theta)\, \nabla_\theta q_\pi(s,a)\Big],
$$

where $q_\pi(s,a)$ is the action value given by

$$
q_\pi(s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s'), \qquad r(s,a) = \sum_{r} r\, p(r|s,a).
$$
Since $r(s,a)$ is independent of $\theta$, we have

$$
\nabla_\theta q_\pi(s,a) = 0 + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s').
$$

Substituting this result into the expression of $\nabla_\theta v_\pi(s)$ yields

$$
\nabla_\theta v_\pi(s) = \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) + \gamma \sum_{a \in \mathcal{A}} \pi(a|s,\theta) \sum_{s' \in \mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s').
$$
It is notable that the unknown $\nabla_\theta v_\pi$ appears on both sides of the above equation. Here, we use the matrix-vector form to calculate it. In particular, let

$$
u(s) \doteq \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \in \mathbb{R}^m.
$$

Since $\sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, p(s'|s,a) = [P_\pi]_{ss'}$, the equation can be written in matrix-vector form as

$$
\begin{bmatrix} \vdots \\ \nabla_\theta v_\pi(s) \\ \vdots \end{bmatrix}
= \begin{bmatrix} \vdots \\ u(s) \\ \vdots \end{bmatrix}
+ \gamma\, (P_\pi \otimes I_m) \begin{bmatrix} \vdots \\ \nabla_\theta v_\pi(s') \\ \vdots \end{bmatrix},
$$

which can be written concisely as

$$
\nabla_\theta v_\pi = u + \gamma\, (P_\pi \otimes I_m)\, \nabla_\theta v_\pi.
$$

Here, $n = |\mathcal{S}|$, $\nabla_\theta v_\pi = [\dots, \nabla_\theta v_\pi(s)^T, \dots]^T \in \mathbb{R}^{mn}$, $u = [\dots, u(s)^T, \dots]^T \in \mathbb{R}^{mn}$, and $m$ is the dimension of the parameter vector $\theta$. The reason that the Kronecker product $\otimes$ emerges in the equation is that $\nabla_\theta v_\pi(s)$ is a vector. The above equation is a linear equation of $\nabla_\theta v_\pi$, which can be solved as

$$
\nabla_\theta v_\pi = (I_{mn} - \gamma P_\pi \otimes I_m)^{-1} u = \big[(I_n - \gamma P_\pi)^{-1} \otimes I_m\big]\, u.
$$
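As a sanity check of the Kronecker-product step, the following sketch (numpy; the sizes are arbitrary and purely illustrative) verifies that $(I_{mn} - \gamma P_\pi \otimes I_m)^{-1} = (I_n - \gamma P_\pi)^{-1} \otimes I_m$:

```python
# Check the Kronecker-product identity used to solve the linear equation.
import numpy as np

rng = np.random.default_rng(4)
n, m, gamma = 4, 3, 0.9
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic P_pi

lhs = np.linalg.inv(np.eye(n * m) - gamma * np.kron(P_pi, np.eye(m)))
rhs = np.kron(np.linalg.inv(np.eye(n) - gamma * P_pi), np.eye(m))
print(np.allclose(lhs, rhs))   # True
```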
For any state $s$, it follows from the solved equation that

$$
\nabla_\theta v_\pi(s) = \sum_{s' \in \mathcal{S}} \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'}\, u(s')
= \sum_{s' \in \mathcal{S}} \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'} \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a).
$$
The quantity $[(I_n - \gamma P_\pi)^{-1}]_{ss'}$ has a clear probabilistic interpretation. In particular, since

$$
(I_n - \gamma P_\pi)^{-1} = I_n + \gamma P_\pi + \gamma^2 P_\pi^2 + \cdots,
$$

we have

$$
\big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'} = \sum_{k=0}^{\infty} \gamma^k \big[P_\pi^k\big]_{ss'}.
$$

Note that $[P_\pi^k]_{ss'}$ is the probability of transitioning from $s$ to $s'$ using exactly $k$ steps (see my previous post). Therefore, $[(I_n - \gamma P_\pi)^{-1}]_{ss'}$ is the discounted total probability of transitioning from $s$ to $s'$ using any number of steps. By denoting

$$
\Pr_\pi(s'|s) \doteq \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'},
$$

we obtain

$$
\nabla_\theta v_\pi(s) = \sum_{s' \in \mathcal{S}} \Pr_\pi(s'|s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a).
$$
Q. E. D.
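As an aside, the discounted-total-probability interpretation above can be checked numerically by truncating the series at a large $K$ (numpy; a random $P_\pi$, purely illustrative):

```python
# Check (I - gamma P_pi)^{-1} = sum_{k>=0} gamma^k P_pi^k via a truncated series.
import numpy as np

rng = np.random.default_rng(5)
n, gamma, K = 4, 0.9, 500
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)

series = sum((gamma ** k) * np.linalg.matrix_power(P_pi, k) for k in range(K))
print(np.allclose(series, np.linalg.inv(np.eye(n) - gamma * P_pi)))   # True
```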
Proof of Theorem 9.2
Since $d_0$ is independent of $\theta$, we have

$$
\nabla_\theta \bar{v}_\pi^0 = \nabla_\theta \sum_{s \in \mathcal{S}} d_0(s)\, v_\pi(s) = \sum_{s \in \mathcal{S}} d_0(s)\, \nabla_\theta v_\pi(s).
$$
Substituting the expression of $\nabla_\theta v_\pi(s)$ given in Lemma 9.2 into the above equation yields

$$
\nabla_\theta \bar{v}_\pi^0 = \sum_{s \in \mathcal{S}} d_0(s) \sum_{s' \in \mathcal{S}} \Pr_\pi(s'|s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a)
= \sum_{s' \in \mathcal{S}} \rho_\pi(s') \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a),
$$

where $\rho_\pi(s') = \sum_{s \in \mathcal{S}} d_0(s)\, \Pr_\pi(s'|s)$ and $\Pr_\pi(s'|s) = [(I_n - \gamma P_\pi)^{-1}]_{ss'}$. The proof is complete.
Proof of Theorem 9.3
It follows from the definition of $\bar{v}_\pi = d_\pi^T v_\pi$ that

$$
\nabla_\theta \bar{v}_\pi = \nabla_\theta \sum_{s \in \mathcal{S}} d_\pi(s)\, v_\pi(s)
= \sum_{s \in \mathcal{S}} \big(\nabla_\theta d_\pi(s)\big)\, v_\pi(s) + \sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s).
$$
This equation contains two terms. On the one hand, substituting the expression $\nabla_\theta v_\pi = \big[(I_n - \gamma P_\pi)^{-1} \otimes I_m\big] u$ given in the proof of Lemma 9.2 into the second term gives

$$
\sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s)
= (d_\pi^T \otimes I_m)\, \nabla_\theta v_\pi
= (d_\pi^T \otimes I_m)\big[(I_n - \gamma P_\pi)^{-1} \otimes I_m\big] u
= \big[d_\pi^T (I_n - \gamma P_\pi)^{-1}\big] \otimes I_m\, u.
$$
It is noted that

$$
d_\pi^T (I_n - \gamma P_\pi)^{-1} = \frac{1}{1-\gamma}\, d_\pi^T,
$$

which can be easily verified by multiplying $(I_n - \gamma P_\pi)$ on both sides of the equation and using $d_\pi^T P_\pi = d_\pi^T$. Therefore, the second term becomes

$$
\sum_{s \in \mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s)
= \frac{1}{1-\gamma}\, (d_\pi^T \otimes I_m)\, u
= \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a).
$$

On the other hand, the first term of $\nabla_\theta \bar{v}_\pi$ involves $\nabla_\theta d_\pi(s)$. However, since the second term contains the factor $\frac{1}{1-\gamma}$, the second term becomes dominant and the first term becomes negligible in comparison when $\gamma$ is close to $1$ (see the remark at the end of the proof). Therefore,

$$
\nabla_\theta \bar{v}_\pi \approx \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a).
$$
Furthermore, it follows from $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$ (Lemma 9.1) that

$$
\nabla_\theta \bar{r}_\pi = (1-\gamma)\, \nabla_\theta \bar{v}_\pi \approx \sum_{s \in \mathcal{S}} d_\pi(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a).
$$

The approximation in the above equation requires that the first term $\sum_{s \in \mathcal{S}} \big(\nabla_\theta d_\pi(s)\big)\, v_\pi(s)$ does not go to infinity when $\gamma \to 1$.
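Finally, the key identity $d_\pi^T (I_n - \gamma P_\pi)^{-1} = \frac{1}{1-\gamma} d_\pi^T$ used above can be checked numerically (numpy; a random $P_\pi$ with $d_\pi$ its stationary distribution, purely illustrative):

```python
# Check d_pi^T (I - gamma P_pi)^{-1} = d_pi^T / (1 - gamma) for a stationary d_pi.
import numpy as np

rng = np.random.default_rng(6)
n, gamma = 5, 0.95
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)

w, V = np.linalg.eig(P_pi.T)                       # stationarity: d_pi^T P_pi = d_pi^T
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))]); d_pi /= d_pi.sum()

lhs = d_pi @ np.linalg.inv(np.eye(n) - gamma * P_pi)
print(np.allclose(lhs, d_pi / (1 - gamma)))        # True
```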