Policy Gradient Methods

We have shown that both state value functions and action value functions can be approximated by parameterized functions (see here), especially neural networks, and optimized by TD learning or MC learning.

In this post, we illustrate that policies can likewise be approximated as parameterized functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).

The key point of policy gradient is the following. Given an objective function $J_\theta(s)$ (some form of cumulative reward), where $\theta$ denotes the parameters, $\pi_\theta$ is a policy parameterized by $\theta$ (for example, a neural network), $s$ is a state, and $a$ is an action, we would like to apply the chain rule $\frac{\partial J_\theta(s)}{\partial \theta} = \frac{\partial J_\theta(s)}{\partial \pi_\theta(a|s)} \frac{\partial \pi_\theta(a|s)}{\partial \theta}$. However, this cannot be computed directly, because $J_\theta$ depends on rewards, and rewards are generated by the environment, which is not differentiable.

How, then, can we compute $\partial J_\theta(s)/\partial \theta$? The answer is that we can prove $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$ and use it as the gradient of $J(\theta)$ (we write $J(\theta)$ to denote $J_\theta(s)$).
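For concreteness, here is a minimal sketch of a policy $\pi_\theta$ implemented as a linear layer over one-hot state features followed by a softmax; a deeper neural network works the same way. All names and sizes below are illustrative, not from the source.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# pi_theta: a linear layer over one-hot state features followed by a softmax.
num_states, num_actions = 4, 3
rng = np.random.default_rng(0)
theta = 0.01 * rng.standard_normal((num_states, num_actions))   # policy parameters

def pi(s, theta):
    """pi(.|s, theta): a probability distribution over actions for state s."""
    x = np.eye(num_states)[s]      # one-hot feature vector x(s)
    return softmax(theta.T @ x)

print(pi(0, theta))                # sums to 1; differentiable in theta
```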

Sources:

  1. Shiyu Zhao. Chapter 9: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.

Metrics for defining optimal policies

We first define the objective function $J(\theta)$; there are multiple metrics to choose from.

Metric 1: Average state value

The first metric is the average state value, or simply the average value. In particular, the metric is defined as

$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$, where $d = d_\pi$ is the stationary distribution.

Then, the metric can be written as $\bar{v}_\pi = \mathbb{E}_{S \sim d}[v_\pi(S)]$.

The metric $\bar{v}_\pi$ can also be rewritten as the inner product of two vectors. In particular, let $v_\pi = [\ldots, v_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$ and $d = [\ldots, d(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$. Then, we have $\bar{v}_\pi = d^T v_\pi$.

This expression will be useful when we analyze its gradient.
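As a quick numerical illustration (all numbers below are made up), the inner-product form can be evaluated directly:

```python
import numpy as np

d = np.array([0.1, 0.2, 0.3, 0.4])       # a distribution d over 4 states
v_pi = np.array([1.0, 0.5, 2.0, -0.5])   # state values v_pi(s)

v_bar = d @ v_pi                          # v_bar_pi = d^T v_pi
assert np.isclose(v_bar, sum(d[s] * v_pi[s] for s in range(4)))
print(v_bar)                              # 0.6
```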

We can also let $d$ be other distributions. For instance, we can make $d$ independent of the policy $\pi$. In this case, we specifically denote $d$ as $d^0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$.

One trivial way to select $d^0$ is to treat all states as equally important, i.e., the uniform distribution $d^0(s) = 1/|\mathcal{S}|$. This case is relatively simple because $d^0$ is independent of the policy, so the gradient of the metric is easier to calculate.

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward. In particular, it is defined as $\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}[r_\pi(S)]$, where $d_\pi$ is the stationary distribution and $r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, r(s,a) = \mathbb{E}_{A \sim \pi(s,\theta)}[r(s,A) \mid s]$ is the expectation of the immediate rewards. Here, $r(s,a) \doteq \mathbb{E}[R \mid s,a] = \sum_r r\, p(r \mid s,a)$.

The average reward $\bar{r}_\pi$ can also be written as the inner product of two vectors. In particular, let $r_\pi = [\ldots, r_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$ and $d_\pi = [\ldots, d_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$. Then, it is clear that $\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi$.
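A small numerical sketch of this metric (the distribution, policy, and rewards below are arbitrary illustrative values):

```python
import numpy as np

d_pi = np.array([0.2, 0.3, 0.5])          # stationary distribution over 3 states
pi_tab = np.array([[0.5, 0.5],            # pi(a|s) for each state s (2 actions)
                   [0.9, 0.1],
                   [0.2, 0.8]])
r_sa = np.array([[1.0, 0.0],              # r(s, a) = E[R | s, a]
                 [0.0, 2.0],
                 [1.0, 1.0]])

r_pi = (pi_tab * r_sa).sum(axis=1)        # r_pi(s) = sum_a pi(a|s) r(s, a)
r_bar = d_pi @ r_pi                       # r_bar_pi = d_pi^T r_pi
print(r_bar)                              # 0.66
```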

Gradients of the metrics

Given a metric, we next derive its gradient and then apply gradient-based methods to optimize the metric.

The gradient calculation is one of the most complicated parts of policy gradient methods! That is because, first, we need to distinguish the different metrics $\bar{v}_\pi$, $\bar{r}_\pi$, and $\bar{v}_\pi^0$; and second, we need to distinguish the discounted and undiscounted cases.

Theorem 9.1 (Policy gradient theorem). The gradient of $J(\theta)$ is $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$, where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.

Moreover, this equation has a compact form expressed in terms of expectation: $\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)]$, where $\ln$ is the natural logarithm.

See the proof here.

Remarks:

  • It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases.
  • The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of $J(\theta)$ and $\eta$ are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, $J(\theta)$ could be $\bar{v}_\pi^0$, $\bar{v}_\pi$, or $\bar{r}_\pi$.
  • The equality in Theorem 9.1 may be either a strict equality or an approximation. The distribution $\eta$ also varies in different scenarios.
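To make the theorem concrete, the following sketch checks numerically that the summation form and the expectation form coincide for a tabular softmax policy. Here $\eta$, $q_\pi$, and $\theta$ are arbitrary values chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2
theta = rng.standard_normal((nS, nA))     # softmax logits of pi(a|s, theta)
eta = np.array([0.2, 0.3, 0.5])           # a state distribution eta
q = rng.standard_normal((nS, nA))         # stand-in values for q_pi(s, a)

def pi_probs(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

sum_form = np.zeros_like(theta)           # sum_s eta(s) sum_a grad pi(a|s) q(s, a)
exp_form = np.zeros_like(theta)           # E_{S~eta, A~pi}[grad ln pi(A|S) q(S, A)]
for s in range(nS):
    p = pi_probs(s)
    for a in range(nA):
        grad_ln = np.zeros_like(theta)
        grad_ln[s] = np.eye(nA)[a] - p    # grad_theta ln pi(a|s, theta)
        grad_pi = p[a] * grad_ln          # grad_theta pi(a|s, theta) (softmax derivative)
        sum_form += eta[s] * grad_pi * q[s, a]
        exp_form += eta[s] * p[a] * grad_ln * q[s, a]

assert np.allclose(sum_form, exp_form)    # the two forms give the same gradient
```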

Gradient-ascent algorithms

With the gradient presented in Theorem 9.1, we next show how to use the gradient-based method to optimize the metrics to obtain optimal policies.

The gradient-ascent algorithm for maximizing $J(\theta)$ is $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)]$, where $\alpha > 0$ is a constant learning rate.

Since the true gradient $\mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)]$ is unknown, we use the stochastic gradient method to replace it by a sample: $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$.

Furthermore, since $q_\pi$ is unknown, it can be approximated (as in function approximation): $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$.
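A minimal sketch of one such stochastic update for a tabular softmax policy, assuming the value $q_t(s_t,a_t)$ is already given (how to obtain it is discussed next); the function name and sizes are illustrative:

```python
import numpy as np

def policy_gradient_step(theta, s_t, a_t, q_t, alpha=0.1):
    """One step of theta_{t+1} = theta_t + alpha * grad ln pi(a_t|s_t, theta_t) * q_t."""
    nA = theta.shape[1]
    e = np.exp(theta[s_t] - theta[s_t].max())
    p = e / e.sum()                            # pi(.|s_t, theta_t)
    grad_ln = np.zeros_like(theta)
    grad_ln[s_t] = np.eye(nA)[a_t] - p         # grad_theta ln pi(a_t|s_t, theta_t)
    return theta + alpha * grad_ln * q_t       # theta_{t+1}

theta = np.zeros((4, 3))                       # 4 states, 3 actions (illustrative)
theta = policy_gradient_step(theta, s_t=1, a_t=2, q_t=1.5)
```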

There are different methods to approximate $q_\pi(s_t,a_t)$:

  • If we use Monte Carlo learning, the resulting algorithm is called REINFORCE.
  • If we use TD learning, the resulting algorithm is called Actor-Critic.

Remarks:

  • Since $A \sim \pi(A|S,\theta)$, $a_t$ should be sampled following $\pi(\theta_t)$ (the policy $\pi_\theta$ at time step $t$) at $s_t$. Therefore, the policy gradient method is on-policy, because we use the same policy to generate data and to optimize the policy.

Explanation

One explanation of this gradient-ascent algorithm is that, since $\nabla_\theta \ln \pi(a_t|s_t,\theta_t) = \frac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)}$, the algorithm can be rewritten as $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t) = \theta_t + \alpha \underbrace{\left(\frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}\right)}_{\beta_t} \nabla_\theta \pi(a_t|s_t,\theta_t)$.

Therefore, we have the important expression of the algorithm: $\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t|s_t,\theta_t)$. It is a gradient-ascent algorithm for maximizing $\pi(a_t|s_t,\theta)$.

Intuition: when $\alpha\beta_t$ is sufficiently small,

  • If $\beta_t > 0$, the probability of choosing $(s_t,a_t)$ is enhanced: $\pi(a_t|s_t,\theta_{t+1}) > \pi(a_t|s_t,\theta_t)$. The greater $\beta_t$ is, the stronger the enhancement.
  • If $\beta_t < 0$, then $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t)$.

This is because, when $\theta_{t+1} - \theta_t$ is sufficiently small, we have $\pi(a_t|s_t,\theta_{t+1}) \approx \pi(a_t|s_t,\theta_t) + (\nabla_\theta \pi(a_t|s_t,\theta_t))^T(\theta_{t+1} - \theta_t) = \pi(a_t|s_t,\theta_t) + \alpha\beta_t (\nabla_\theta \pi(a_t|s_t,\theta_t))^T(\nabla_\theta \pi(a_t|s_t,\theta_t)) = \pi(a_t|s_t,\theta_t) + \alpha\beta_t \|\nabla_\theta \pi(a_t|s_t,\theta_t)\|^2$.
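A quick numerical check of this intuition (one state, three actions, made-up numbers): a positive $q_t$, hence $\beta_t > 0$, increases $\pi(a_t|s_t,\theta)$ after one small update.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([[0.2, -0.1, 0.4]])           # one state, three actions
s_t, a_t, q_t, alpha = 0, 1, 2.0, 0.01         # q_t > 0  =>  beta_t > 0

p_old = softmax(theta[s_t])
grad_ln = np.eye(3)[a_t] - p_old               # grad_theta ln pi(a_t|s_t, theta_t)
theta_new = theta.copy()
theta_new[s_t] += alpha * grad_ln * q_t        # the gradient-ascent step

p_new = softmax(theta_new[s_t])
assert p_new[a_t] > p_old[a_t]                 # the probability of a_t is enhanced
```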

Monte Carlo policy gradient (REINFORCE)

Algorithm 9.1

Recall that $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$ is replaced by $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$, where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$.

If $q_\pi(s_t,a_t)$ is approximated by Monte Carlo estimation, the algorithm has a specific name: REINFORCE.
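Below is a minimal REINFORCE sketch on a toy 5-state chain (move left or right, +1 for reaching the rightmost state, a small per-step penalty otherwise). The environment, hyperparameters, and episode limits are illustrative assumptions, not part of the source text.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 5, 2                      # actions: 0 = left, 1 = right
GOAL, GAMMA, ALPHA = nS - 1, 0.9, 0.1
theta = np.zeros((nS, nA))         # softmax logits of pi(a|s, theta)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(s, a):
    """Toy chain dynamics: deterministic moves, +1 at the goal, -0.01 otherwise."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), nS - 1)
    r = 1.0 if s_next == GOAL else -0.01
    return s_next, r, s_next == GOAL

for episode in range(500):
    # 1) Generate an episode by following pi(theta_t).
    s, traj, done = 0, [], False
    while not done and len(traj) < 100:
        a = rng.choice(nA, p=softmax(theta[s]))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next

    # 2) For each visited (s_t, a_t), estimate q_t(s_t, a_t) by the discounted
    #    Monte Carlo return g, then take the gradient-ascent step.
    g = 0.0
    for (s_t, a_t, r_t) in reversed(traj):
        g = r_t + GAMMA * g                     # return following (s_t, a_t)
        p = softmax(theta[s_t])
        grad_ln = np.eye(nA)[a_t] - p           # row s_t of grad_theta ln pi(a_t|s_t, theta)
        theta[s_t] += ALPHA * grad_ln * g

print(softmax(theta[0]))   # after training, "right" should dominate in state 0
```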