Proof of the Policy Gradient Theorem

Posted on 2024-06-24. Edited on 2025-06-17. In Computer Science.

Here we prove the Policy gradient theorem, i.e., the gradient of an objective function $J (θ)$ is $\nabla_{θ} J (θ) = \sum_{s \in S} η (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a)$ where $η$ is a state distribution and $\nabla_{θ} π$ is the gradient of $π$ with respect to $θ$ .

Moreover, this equation has a compact form expressed in terms of expectation: $\nabla_{θ} J (θ) = E_{S \sim η, A \sim π (S, θ)} [\nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)],$ where $\ln$ is the natural logarithm.
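The step from the sum form to the expectation form is the log-derivative identity $\nabla_{θ} \ln π = \nabla_{θ} π / π$. As a short worked derivation for a fixed state $s$ (assuming $π (a ∣ s, θ) > 0$ for all $a$, so that $\ln π$ is well defined): $\begin{aligned} \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) & = \sum_{a \in A} π (a ∣ s, θ) \frac{\nabla_{θ} π (a ∣ s, θ)}{π (a ∣ s, θ)} q_{π} (s, a) \\ & = \sum_{a \in A} π (a ∣ s, θ) \nabla_{θ} \ln π (a ∣ s, θ) q_{π} (s, a) \\ & = E_{A \sim π (s, θ)} [\nabla_{θ} \ln π (A ∣ s, θ) q_{π} (s, A)] . \end{aligned}$ Weighting by $η (s)$ and summing over $s$ then gives the expectation over $S \sim η$.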

The book proves this theorem separately for the discounted and undiscounted cases, and in each case for three metrics: ${\bar{v}}_{π}, {\bar{r}}_{π}, {\bar{v}}_{π}^{0}$ .

For simplicity, I only give the proofs for the discounted case, which are collected in the appendix. See the book for the proofs of the undiscounted case.

Sources:

Shiyu Zhao. Chapter 8: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.

Derivation of the gradients in the discounted case

We next derive the gradients of the metrics in the discounted case where $γ \in (0, 1)$ . The state value and action value in the discounted case are defined as $\begin{aligned} v_{π} (s) & = E [R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + \dots ∣ S_{t} = s], \\ q_{π} (s, a) & = E [R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + \dots ∣ S_{t} = s, A_{t} = a] . \end{aligned}$

It holds that $v_{π} (s) = \sum_{a \in A} π (a ∣ s, θ) q_{π} (s, a)$ and the state value satisfies the Bellman equation.

First, we show that ${\bar{v}}_{π} (θ)$ and ${\bar{r}}_{π} (θ)$ are equivalent metrics.

Lemma 9.1

Lemma 9.1 (Equivalence between ${\bar{v}}_{π} (θ)$ and ${\bar{r}}_{π} (θ)$ ). In the discounted case where $γ \in (0, 1)$ , it holds that $\begin{matrix} (1) & {\bar{r}}_{π} = (1 - γ) {\bar{v}}_{π} . \end{matrix}$

Proof: Note that ${\bar{v}}_{π} (θ) = d_{π}^{T} v_{π}$ and ${\bar{r}}_{π} (θ) = d_{π}^{T} r_{π}$ , where $v_{π}$ and $r_{π}$ satisfy the Bellman equation $v_{π} = r_{π} + γ P_{π} v_{π}$ . Multiplying both sides of the Bellman equation by $d_{π}^{T}$ from the left yields ${\bar{v}}_{π} = {\bar{r}}_{π} + γ d_{π}^{T} P_{π} v_{π} = {\bar{r}}_{π} + γ d_{π}^{T} v_{π} = {\bar{r}}_{π} + γ {\bar{v}}_{π},$ where the second equality uses the stationarity of $d_{π}$ , i.e., $d_{π}^{T} P_{π} = d_{π}^{T}$ . Rearranging gives $(1)$ .

Second, the following lemma gives the gradient of $v_{π} (s)$ for any $s$ .
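Lemma 9.1 is also easy to sanity-check numerically before moving on to the gradient lemma. The sketch below is only an illustration: the chain `P_pi`, the reward vector `r_pi`, and the sizes are made-up random quantities, not anything from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9  # hypothetical number of states and discount rate

# Hypothetical policy-induced chain: P_pi is row-stochastic,
# r_pi[s] is the expected immediate reward in state s.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

# Stationary distribution: solve d^T (I - P_pi) = 0 together with sum(d) = 1.
A = np.vstack([(np.eye(n) - P_pi).T, np.ones(n)])
b = np.append(np.zeros(n), 1.0)
d_pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# State values from the Bellman equation v = r + gamma * P v.
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

v_bar = d_pi @ v_pi
r_bar = d_pi @ r_pi
print(r_bar, (1 - gamma) * v_bar)  # the two numbers should agree
```

The check relies on `d_pi` being stationary; with any other weighting vector the two printed numbers generally differ.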

Lemma 9.2

Lemma 9.2 (Gradient of $v_{π} (s)$ ). In the discounted case, it holds for any $s \in S$ that $\nabla_{θ} v_{π} (s) = \sum_{s^{'} \in S} \Pr_{π} (s^{'} ∣ s) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a),$ where $\Pr_{π} (s^{'} ∣ s) ≐ \sum_{k = 0}^{\infty} γ^{k} {[P_{π}^{k}]}_{s s^{'}} = {[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}}$ is the discounted total probability of transitioning from $s$ to $s^{'}$ under policy $π$ . Here, $[\cdot]_{s s^{'}}$ denotes the entry in the $s$ th row and $s^{'}$ th column, and ${[P_{π}^{k}]}_{s s^{'}}$ is the probability of transitioning from $s$ to $s^{'}$ in exactly $k$ steps under $π$ .

With the results in Lemma 9.2, we are ready to derive the gradient of ${\bar{v}}_{π}^{0}$ .

Theorem 9.2

Theorem 9.2 (Gradient of ${\bar{v}}_{π}^{0}$ in the discounted case). In the discounted case where $γ \in (0, 1)$ , the gradient of ${\bar{v}}_{π}^{0} = d_{0}^{T} v_{π}$ is $\nabla_{θ} {\bar{v}}_{π}^{0} = E [\nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)]$ where $S \sim ρ_{π}$ and $A \sim π (S, θ)$ . Here, the state distribution $ρ_{π}$ is $ρ_{π} (s) = \sum_{s^{'} \in S} d_{0} (s^{'}) \Pr_{π} (s ∣ s^{'}), s \in S$ where $\Pr_{π} (s ∣ s^{'}) = \sum_{k = 0}^{\infty} γ^{k} {[P_{π}^{k}]}_{s^{'} s} = {[{(I - γ P_{π})}^{- 1}]}_{s^{'} s}$ is the discounted total probability of transitioning from $s^{'}$ to $s$ under policy $π$ .

Theorem 9.3

Theorem 9.3 (Gradients of ${\bar{r}}_{π}$ and ${\bar{v}}_{π}$ in the discounted case). In the discounted case where $γ \in (0, 1)$ , the gradients of ${\bar{r}}_{π}$ and ${\bar{v}}_{π}$ are $\begin{aligned} \nabla_{θ} {\bar{r}}_{π} = (1 - γ) \nabla_{θ} {\bar{v}}_{π} & \approx \sum_{s \in S} d_{π} (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) \\ & = E [\nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)], \end{aligned}$ where $S \sim d_{π}$ and $A \sim π (S, θ)$ . Here, the approximation is more accurate when $γ$ is closer to 1 .

Appendix

Proof of Lemma 9.2

First, for any $s \in S$ , it holds that $\begin{aligned} \nabla_{θ} v_{π} (s) & = \nabla_{θ} [\sum_{a \in A} π (a ∣ s, θ) q_{π} (s, a)] \\ & = \sum_{a \in A} [\nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) + π (a ∣ s, θ) \nabla_{θ} q_{π} (s, a)], \end{aligned}$ where $q_{π} (s, a)$ is the action value given by $q_{π} (s, a) = r (s, a) + γ \sum_{s^{'} \in S} p (s^{'} ∣ s, a) v_{π} (s^{'}) .$

Since $r (s, a) = \sum_{r} r p (r ∣ s, a)$ is independent of $θ$ , and so is $p (s^{'} ∣ s, a)$ , we have $\nabla_{θ} q_{π} (s, a) = 0 + γ \sum_{s^{'} \in S} p (s^{'} ∣ s, a) \nabla_{θ} v_{π} (s^{'}) .$ Substituting this result into the expression for $\nabla_{θ} v_{π} (s)$ above yields $\begin{aligned} \nabla_{θ} v_{π} (s) & = \sum_{a \in A} [\nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) + π (a ∣ s, θ) γ \sum_{s^{'} \in S} p (s^{'} ∣ s, a) \nabla_{θ} v_{π} (s^{'})] \\ & = \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) + γ \sum_{a \in A} π (a ∣ s, θ) \sum_{s^{'} \in S} p (s^{'} ∣ s, a) \nabla_{θ} v_{π} (s^{'}) . \end{aligned}$

It is notable that $\nabla_{θ} v_{π}$ appears on both sides of the above equation. Here, we use the matrix-vector form to solve for it. In particular, let $u (s) ≐ \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) .$

Since $\sum_{a \in A} π (a ∣ s, θ) \sum_{s^{'} \in S} p (s^{'} ∣ s, a) \nabla_{θ} v_{π} (s^{'}) = \sum_{s^{'} \in S} p (s^{'} ∣ s) \nabla_{θ} v_{π} (s^{'}) = \sum_{s^{'} \in S} {[P_{π}]}_{s s^{'}} \nabla_{θ} v_{π} (s^{'}),$ the equation $\nabla_{θ} v_{π} (s) = \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) + γ \sum_{a \in A} π (a ∣ s, θ) \sum_{s^{'} \in S} p (s^{'} ∣ s, a) \nabla_{θ} v_{π} (s^{'})$ can be written in matrix-vector form as $\underbrace{\begin{bmatrix} \vdots \\ \nabla_{θ} v_{π} (s) \\ \vdots \end{bmatrix}}_{\nabla_{θ} v_{π} \in \mathbb{R}^{m n}} = \underbrace{\begin{bmatrix} \vdots \\ u (s) \\ \vdots \end{bmatrix}}_{u \in \mathbb{R}^{m n}} + γ (P_{π} \otimes I_{m}) \underbrace{\begin{bmatrix} \vdots \\ \nabla_{θ} v_{π} (s^{'}) \\ \vdots \end{bmatrix}}_{\nabla_{θ} v_{π} \in \mathbb{R}^{m n}},$ which can be written concisely as $\nabla_{θ} v_{π} = u + γ (P_{π} \otimes I_{m}) \nabla_{θ} v_{π} .$

Here, $n = | S |$ , and $m$ is the dimension of the parameter vector $θ$ . The Kronecker product $\otimes$ appears because each $\nabla_{θ} v_{π} (s)$ is a vector in $\mathbb{R}^{m}$ rather than a scalar. The above equation is linear in $\nabla_{θ} v_{π}$ and can be solved as $\begin{aligned} \nabla_{θ} v_{π} & = {(I_{n m} - γ P_{π} \otimes I_{m})}^{- 1} u \\ & = {(I_{n} \otimes I_{m} - γ P_{π} \otimes I_{m})}^{- 1} u \\ & = [{(I_{n} - γ P_{π})}^{- 1} \otimes I_{m}] u, \end{aligned}$ where the last step uses ${(A \otimes I_{m})}^{- 1} = A^{- 1} \otimes I_{m}$ with $A = I_{n} - γ P_{π}$ .
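The Kronecker-product manipulation can be checked numerically. The following is a minimal sketch with a made-up random chain `P_pi` and a random stacked vector `u`; the sizes `n` and `m` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, gamma = 4, 3, 0.9  # hypothetical sizes: n states, m parameters

# Random row-stochastic matrix standing in for P_pi.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)

# Stacked vector u = [u(s_1); ...; u(s_n)], each block u(s) in R^m.
u = rng.standard_normal(n * m)

# Left-hand side: invert the full (nm x nm) system matrix.
big = np.linalg.inv(np.eye(n * m) - gamma * np.kron(P_pi, np.eye(m)))

# Right-hand side: invert the (n x n) matrix, then take the Kronecker product.
small = np.kron(np.linalg.inv(np.eye(n) - gamma * P_pi), np.eye(m))

grad_v = small @ u

# grad_v should satisfy the fixed-point equation grad_v = u + gamma (P (x) I) grad_v.
residual = grad_v - (u + gamma * np.kron(P_pi, np.eye(m)) @ grad_v)
print(np.max(np.abs(big - small)), np.max(np.abs(residual)))
```

Besides confirming the identity, the second form is cheaper: it inverts an $n \times n$ matrix instead of an $n m \times n m$ one.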

For any state $s$ , it follows from this equation that $\begin{aligned} \nabla_{θ} v_{π} (s) & = \sum_{s^{'} \in S} {[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}} u (s^{'}) \\ & = \sum_{s^{'} \in S} {[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}} \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) . \end{aligned}$

The quantity ${[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}}$ has a clear probabilistic interpretation. In particular, since ${(I_{n} - γ P_{π})}^{- 1} = I + γ P_{π} + γ^{2} P_{π}^{2} + \dots,$ we have ${[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}} = [I]_{s s^{'}} + γ {[P_{π}]}_{s s^{'}} + γ^{2} {[P_{π}^{2}]}_{s s^{'}} + \dots = \sum_{k = 0}^{\infty} γ^{k} {[P_{π}^{k}]}_{s s^{'}} .$ Note that ${[P_{π}^{k}]}_{s s^{'}}$ is the probability of transitioning from $s$ to $s^{'}$ using exactly $k$ steps (see my previous post). Therefore, ${[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}}$ is the discounted total probability of transitioning from $s$ to $s^{'}$ using any number of steps. By denoting ${[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}} ≐ \Pr_{π} (s^{'} ∣ s),$ we obtain $\sum_{s^{'} \in S} {[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}} \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) = \sum_{s^{'} \in S} \Pr_{π} (s^{'} ∣ s) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) = \nabla_{θ} v_{π} (s) .$

Q. E. D.
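As a sanity check of Lemma 9.2, the formula can be compared against finite differences on a small random MDP. This is only a sketch under assumptions that are not in the source: a tabular softmax policy with one parameter `theta[s, a]` per state-action pair, and made-up transition probabilities `p` and rewards `r`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, gamma = 4, 3, 0.9  # hypothetical sizes

# Made-up MDP: p[s, a, s'] are transition probabilities, r[s, a] expected rewards.
p = rng.random((n, n_actions, n))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n, n_actions))
theta = rng.standard_normal((n, n_actions))

def policy(theta):
    """Tabular softmax policy pi[s, a] (an assumed parameterization)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return pi, P_pi, v_pi and q_pi for the policy given by theta."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, p)
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, (pi * r).sum(axis=1))
    q = r + gamma * p @ v
    return pi, P_pi, v, q

pi, P_pi, v, q = evaluate(theta)

# Softmax gradient: d pi(a|s) / d theta[s, b] = pi(a|s) (1[a == b] - pi(b|s)),
# and zero with respect to the parameters of any other state.
dpi = pi[:, :, None] * (np.eye(n_actions)[None, :, :] - pi[:, None, :])
u = np.einsum("sab,sa->sb", dpi, q)  # nonzero block of u(s)

# Discounted total transition probabilities Pr_pi(s' | s).
Pr = np.linalg.inv(np.eye(n) - gamma * P_pi)

# Lemma 9.2: grad_formula[s, s', b] = Pr(s'|s) * u(s')[b].
grad_formula = Pr[:, :, None] * u[None, :, :]

# Central finite differences of v_pi with respect to each theta[i, b].
eps = 1e-6
grad_fd = np.zeros((n, n, n_actions))
for i in range(n):
    for b in range(n_actions):
        e = np.zeros_like(theta)
        e[i, b] = eps
        grad_fd[:, i, b] = (evaluate(theta + e)[2] - evaluate(theta - e)[2]) / (2 * eps)

print(np.max(np.abs(grad_formula - grad_fd)))
```

Each row of `Pr` sums to $1 / (1 - γ)$ , which is why $\Pr_{π} (s^{'} ∣ s)$ is a discounted total probability rather than a normalized distribution over $s^{'}$ .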

Proof of Theorem 9.2

Since $d_{0} (s)$ is independent of $π$ , we have $\nabla_{θ} {\bar{v}}_{π}^{0} = \nabla_{θ} \sum_{s \in S} d_{0} (s) v_{π} (s) = \sum_{s \in S} d_{0} (s) \nabla_{θ} v_{π} (s) .$

Substituting the expression of $\nabla_{θ} v_{π} (s)$ given in Lemma 9.2 into the above equation yields $\begin{aligned} \nabla_{θ} {\bar{v}}_{π}^{0} = \sum_{s \in S} d_{0} (s) \nabla_{θ} v_{π} (s) & = \sum_{s \in S} d_{0} (s) \sum_{s^{'} \in S} \Pr_{π} (s^{'} ∣ s) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) \\ & = \sum_{s^{'} \in S} (\sum_{s \in S} d_{0} (s) \Pr_{π} (s^{'} ∣ s)) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) \\ & ≐ \sum_{s^{'} \in S} ρ_{π} (s^{'}) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) \\ & = \sum_{s \in S} ρ_{π} (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) \quad \text{(change } s^{'} \text{ to } s \text{)} \\ & = \sum_{s \in S} ρ_{π} (s) \sum_{a \in A} π (a ∣ s, θ) \nabla_{θ} \ln π (a ∣ s, θ) q_{π} (s, a) \\ & = E [\nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)], \end{aligned}$ where $S \sim ρ_{π}$ and $A \sim π (S, θ)$ . The proof is complete.
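Theorem 9.2 can be checked the same way. The sketch below again assumes a tabular softmax policy and a made-up random MDP and initial distribution `d0`, none of which come from the source. It computes $ρ_{π}$ from its definition and compares the resulting gradient with finite differences of ${\bar{v}}_{π}^{0} = d_{0}^{T} v_{π}$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, gamma = 4, 3, 0.9  # hypothetical sizes

# Made-up MDP and initial state distribution.
p = rng.random((n, n_actions, n))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n, n_actions))
d0 = rng.dirichlet(np.ones(n))
theta = rng.standard_normal((n, n_actions))

def policy(theta):
    """Tabular softmax policy pi[s, a] (an assumed parameterization)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return pi, P_pi, v_pi and q_pi for the policy given by theta."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, p)
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, (pi * r).sum(axis=1))
    q = r + gamma * p @ v
    return pi, P_pi, v, q

pi, P_pi, v, q = evaluate(theta)

# rho_pi(s) = sum_{s'} d0(s') Pr_pi(s | s'), i.e. rho^T = d0^T (I - gamma P_pi)^{-1}.
rho = d0 @ np.linalg.inv(np.eye(n) - gamma * P_pi)

# For tabular softmax, sum_a pi(a|s) grad ln pi(a|s) q(s, a) has the single
# nonzero block pi(b|s) * (q(s, b) - v(s)) at the parameters theta[s, :].
grad_formula = rho[:, None] * pi * (q - v[:, None])

# Central finite differences of the objective d0^T v_pi.
eps = 1e-6
grad_fd = np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    e = np.zeros_like(theta)
    e[idx] = eps
    grad_fd[idx] = (d0 @ evaluate(theta + e)[2] - d0 @ evaluate(theta - e)[2]) / (2 * eps)

print(np.max(np.abs(grad_formula - grad_fd)), rho.sum())
```

Note that `rho` sums to $1 / (1 - γ)$ rather than 1, so the expectation over $S \sim ρ_{π}$ is with respect to an unnormalized weighting; normalizing it only rescales the gradient by the constant $1 - γ$ .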

Proof of Theorem 9.3

It follows from the definition of ${\bar{v}}_{π}$ that $\begin{aligned} \nabla_{θ} {\bar{v}}_{π} & = \nabla_{θ} \sum_{s \in S} d_{π} (s) v_{π} (s) \\ & = \sum_{s \in S} \nabla_{θ} d_{π} (s) v_{π} (s) + \sum_{s \in S} d_{π} (s) \nabla_{θ} v_{π} (s) . \end{aligned}$

This equation contains two terms. On the one hand, substituting the matrix-vector expression $\nabla_{θ} v_{π} = [{(I_{n} - γ P_{π})}^{- 1} \otimes I_{m}] u$ obtained in the proof of Lemma 9.2 into the second term gives $\begin{aligned} \sum_{s \in S} d_{π} (s) \nabla_{θ} v_{π} (s) & = (d_{π}^{T} \otimes I_{m}) \nabla_{θ} v_{π} \\ & = (d_{π}^{T} \otimes I_{m}) [{(I_{n} - γ P_{π})}^{- 1} \otimes I_{m}] u \\ & = ([d_{π}^{T} {(I_{n} - γ P_{π})}^{- 1}] \otimes I_{m}) u . \end{aligned}$

Note that $d_{π}^{T} {(I_{n} - γ P_{π})}^{- 1} = \frac{1}{1 - γ} d_{π}^{T},$ which can be verified by multiplying both sides of the equation by $(I_{n} - γ P_{π})$ from the right and using $d_{π}^{T} P_{π} = d_{π}^{T}$ . Therefore, the expression above becomes $\begin{aligned} \sum_{s \in S} d_{π} (s) \nabla_{θ} v_{π} (s) & = \frac{1}{1 - γ} (d_{π}^{T} \otimes I_{m}) u \\ & = \frac{1}{1 - γ} \sum_{s \in S} d_{π} (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) . \end{aligned}$ On the other hand, the first term of $\sum_{s \in S} \nabla_{θ} d_{π} (s) v_{π} (s) + \sum_{s \in S} d_{π} (s) \nabla_{θ} v_{π} (s)$ involves $\nabla_{θ} d_{π}$ . However, since the second term contains the factor $\frac{1}{1 - γ}$ , the second term becomes dominant and the first term becomes negligible by comparison when $γ \to 1$ . Therefore, $\nabla_{θ} {\bar{v}}_{π} \approx \frac{1}{1 - γ} \sum_{s \in S} d_{π} (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) .$

Furthermore, it follows from ${\bar{r}}_{π} = (1 - γ) {\bar{v}}_{π}$ that $\begin{aligned} \nabla_{θ} {\bar{r}}_{π} = (1 - γ) \nabla_{θ} {\bar{v}}_{π} & \approx \sum_{s \in S} d_{π} (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) \\ & = \sum_{s \in S} d_{π} (s) \sum_{a \in A} π (a ∣ s, θ) \nabla_{θ} \ln π (a ∣ s, θ) q_{π} (s, a) \\ & = E [\nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)] . \end{aligned}$ The approximation in the above equation requires that the first term, $\sum_{s \in S} \nabla_{θ} d_{π} (s) v_{π} (s)$ , does not go to infinity when $γ \to 1$ .
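To get a feel for the size of the neglected term, the sketch below compares the approximation with a finite-difference gradient of ${\bar{r}}_{π}$ for several values of $γ$ . As before, the tabular softmax policy and the random MDP are illustrative assumptions, not part of the source. The quantity `neglected` is $(1 - γ) \sum_{s} \nabla_{θ} d_{π} (s) v_{π} (s)$ , i.e. the first term of $\nabla_{θ} {\bar{v}}_{π}$ scaled by $1 - γ$ , with $\nabla_{θ} d_{π}$ itself estimated by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 4, 3  # hypothetical sizes

# Made-up MDP: p[s, a, s'] transition probabilities, r[s, a] expected rewards.
p = rng.random((n, n_actions, n))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n, n_actions))
theta = rng.standard_normal((n, n_actions))

def policy(theta):
    """Tabular softmax policy pi[s, a] (an assumed parameterization)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def stationary(theta):
    """Return (d_pi, r_bar) for the chain induced by theta."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, p)
    A = np.vstack([(np.eye(n) - P_pi).T, np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d, d @ (pi * r).sum(axis=1)

def fd(f, theta, eps=1e-6):
    """Central finite differences; result has shape theta.shape + f(theta).shape."""
    out = []
    for idx in np.ndindex(theta.shape):
        e = np.zeros_like(theta)
        e[idx] = eps
        out.append((f(theta + e) - f(theta - e)) / (2 * eps))
    return np.array(out).reshape(theta.shape + np.shape(out[0]))

# Neither r_bar nor d_pi depends on gamma, so these are computed once.
grad_r_bar = fd(lambda t: stationary(t)[1], theta)  # shape (n, n_actions)
grad_d = fd(lambda t: stationary(t)[0], theta)      # shape (n, n_actions, n)

def approx_and_neglected(gamma):
    """Return the Theorem 9.3 approximation and the neglected term for a given gamma."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, p)
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, (pi * r).sum(axis=1))
    q = r + gamma * p @ v
    d, _ = stationary(theta)
    # sum_s d(s) sum_a grad pi(a|s) q(s, a); for tabular softmax the
    # block for theta[s, :] is d(s) * pi(b|s) * (q(s, b) - v(s)).
    approx = d[:, None] * pi * (q - v[:, None])
    neglected = (1 - gamma) * grad_d @ v
    return approx, neglected

for gamma in (0.5, 0.9, 0.99, 0.999):
    approx, neglected = approx_and_neglected(gamma)
    print(gamma, np.linalg.norm(grad_r_bar - approx), np.linalg.norm(neglected))
```

For each $γ$ , the two printed norms should coincide up to finite-difference error, since the approximation error is exactly the neglected term.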