Policy Gradient Methods
We have shown that both state value functions and action value functions can be approximated by parameterized functions (see here), especially neural networks, and can be optimized by TD learning or MC learning.
In this post, we illustrate that policies can also be approximated as parameterized functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).
The key point of policy gradient is that, given an objective function $J(\theta)$, where $\theta$ is the parameter of the policy $\pi(a|s,\theta)$, the policy can be optimized by gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t).$$

Therefore, how can we compute the gradient $\nabla_\theta J(\theta)$? This is the question the rest of this post answers.
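Before that, it helps to fix a concrete policy parameterization. A common choice (an illustrative example of ours, not prescribed by the theory) is the tabular softmax policy $\pi(a|s,\theta) = e^{\theta_{s,a}} / \sum_{a'} e^{\theta_{s,a'}}$. The sketch below computes $\pi(\cdot|s,\theta)$ and the score $\nabla_\theta \ln \pi(a|s,\theta)$, which will appear throughout the following sections.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s, theta) for a tabular softmax parameterization theta[s, a]."""
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(theta, s, a):
    """Score function: grad_theta ln pi(a|s, theta); nonzero only in row s."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)   # d ln softmax / d theta[s, a'] = 1{a'=a} - pi(a'|s)
    grad[s, a] += 1.0
    return grad

# toy example: 3 states, 2 actions
theta = np.zeros((3, 2))
print(softmax_policy(theta, 0))           # uniform: [0.5 0.5]
print(grad_log_pi(theta, 0, 1))           # row 0: [-0.5  0.5], other rows zero
```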
Metrics for defining optimal policies
We first define the objective functions (metrics) that an optimal policy should maximize. Two metrics are commonly used.
Metric 1: Average state value
The first metric is the average state value, or simply the average value. In particular, the metric is defined as

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s),$$

where $d(s) \ge 0$ is the weight assigned to state $s$ and satisfies $\sum_{s} d(s) = 1$, so $d$ can be interpreted as a probability distribution over states. Then, the metric can be written as

$$\bar{v}_\pi = \mathbb{E}_{S \sim d}\big[v_\pi(S)\big].$$

The metric $\bar{v}_\pi$ can also be written as the expected discounted return

$$\bar{v}_\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right],$$

where the starting state is drawn from $d$ and the trajectory follows $\pi$. This expression will be useful when we analyze its gradient.

How should the weight $d$ be selected? One trivial way is to make $d$ independent of the policy $\pi$:

- This case is relatively simple because the gradient of the metric is easier to calculate.
- In this case, we denote $d$ by $d_0$ and the metric by $\bar{v}_\pi^0$.
- How to select $d_0$? One trivial way is to treat all the states as equally important and hence select $d_0(s) = 1/|\mathcal{S}|$.

We can also let $d$ depend on the policy $\pi$; a natural choice is the stationary distribution $d_\pi$.
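As a tiny numerical illustration of $\bar{v}_\pi = \sum_s d(s)\, v_\pi(s)$ (the numbers below are made up for this example):

```python
import numpy as np

v_pi = np.array([1.0, 2.0, 4.0])   # hypothetical state values v_pi(s) for 3 states
d0   = np.ones(3) / 3              # d_0(s) = 1/|S|: all states equally weighted

v_bar = d0 @ v_pi                  # average state value  sum_s d_0(s) v_pi(s)
print(v_bar)                       # 2.333...
```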
Metric 2: Average reward
The second metric is the average one-step reward, or simply the average reward. In particular, it is defined as

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}\big[r_\pi(S)\big],$$

where $d_\pi$ is the stationary distribution of the states under $\pi$ and $r_\pi(s) = \sum_{a} \pi(a|s,\theta)\, r(s,a)$ is the expected one-step reward obtained from state $s$.

The average reward $\bar{r}_\pi$ is therefore a weighted average of the one-step immediate rewards, where the weights are given by the stationary distribution $d_\pi$ and hence depend on the policy.
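As a sanity check of the definition (a toy example of ours; the chain and rewards are made up), the stationary distribution $d_\pi$ can be computed as the left eigenvector of the state transition matrix $P_\pi$ associated with eigenvalue $1$, and then $\bar{r}_\pi = \sum_s d_\pi(s)\, r_\pi(s)$:

```python
import numpy as np

# hypothetical 3-state chain under a fixed policy pi
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]])       # P_pi[s, s'] = P(s' | s) under pi
r_pi = np.array([0.0, 1.0, 2.0])         # expected one-step reward r_pi(s)

# stationary distribution: left eigenvector of P_pi with eigenvalue 1
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi = d_pi / d_pi.sum()

r_bar = d_pi @ r_pi                      # average reward  sum_s d_pi(s) r_pi(s)
print(d_pi, r_bar)
```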
Gradients of the metrics
Given a metric, we next derive its gradient and then apply gradient-based methods to optimize the metric.
The gradient calculation is one of the most complicated parts of policy gradient methods! That is because, first, we need to distinguish the different metrics ($\bar{v}_\pi$, $\bar{v}_\pi^0$, and $\bar{r}_\pi$), and second, we need to distinguish the discounted and undiscounted cases.
Theorem 9.1 (Policy gradient theorem). The gradient of the objective function $J(\theta)$ is

$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a),$$

where $\eta$ is a distribution or weight over the states.

Moreover, this equation has a compact form expressed in terms of expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big], \qquad S \sim \eta,\; A \sim \pi(\cdot|S,\theta).$$
See the proof here.
Remarks:
- It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases.
- The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of $J(\theta)$ and $\eta$ are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, $J(\theta)$ could be $\bar{v}_\pi$, $\bar{v}_\pi^0$, or $\bar{r}_\pi$.
- The equality in Theorem 9.1 may be a strict equality or an approximation, depending on the scenario. The distribution $\eta$ also varies in different scenarios.
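To see that the two forms in Theorem 9.1 agree, here is a small numerical check (our own sketch; the softmax policy, the values of $q$, and the distribution $\eta$ are arbitrary stand-ins). It compares $\sum_s \eta(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$ with a sampled estimate of $\mathbb{E}[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2
theta = rng.normal(size=(nS, nA))            # parameters of a tabular softmax policy
q     = rng.normal(size=(nS, nA))            # arbitrary stand-in for q_pi(s, a)
eta   = np.array([0.2, 0.3, 0.5])            # arbitrary stand-in for the state distribution

def pi_all(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_log_pi(theta, s, a):                # score of the softmax policy
    g = np.zeros_like(theta)
    g[s] = -pi_all(theta)[s]
    g[s, a] += 1.0
    return g

probs = pi_all(theta)

# Form 1: sum_s eta(s) sum_a grad pi(a|s) q(s,a), using grad pi = pi * grad ln pi
grad_sum = np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        grad_sum += eta[s] * probs[s, a] * grad_log_pi(theta, s, a) * q[s, a]

# Form 2: E_{S~eta, A~pi}[grad ln pi(A|S) q(S,A)], estimated from samples
N, acc = 100_000, np.zeros_like(theta)
for _ in range(N):
    s = rng.choice(nS, p=eta)
    a = rng.choice(nA, p=probs[s])
    acc += grad_log_pi(theta, s, a) * q[s, a]

print(np.abs(grad_sum - acc / N).max())      # small sampling error, on the order of 1e-2
```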
Gradient-ascent algorithms
With the gradient presented in Theorem 9.1, we next show how to use gradient-based methods to optimize the metrics and obtain optimal policies.
The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big].$$

Since the true expectation is unknown, we use the stochastic gradient method to replace it by a sample:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t).$$

Furthermore, since $q_\pi(s_t,a_t)$ is also unknown, it must be replaced by an estimate $q_t(s_t,a_t)$:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t).$$

There are different methods to approximate $q_\pi(s_t,a_t)$: if it is approximated by Monte Carlo estimation, the algorithm is REINFORCE; if it is approximated by TD learning, the resulting algorithms are called Actor-Critic.
Remarks:
- Since $S \sim \eta$ and $A \sim \pi(\cdot|S,\theta)$, the action $a_t$ should be sampled following $\pi(\cdot|s_t,\theta_t)$ (the policy at time step $t$) at state $s_t$. Therefore, the policy gradient method is on-policy, because we use the same policy for sample generation and for policy optimization.
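In code, one such stochastic update might look like the following sketch (our own illustration, again assuming the tabular softmax parameterization; `q_t` is whatever estimate of $q_\pi(s_t,a_t)$ is available):

```python
import numpy as np

def policy_gradient_step(theta, s_t, a_t, q_t, alpha=0.1):
    """One on-policy update: theta <- theta + alpha * grad ln pi(a_t|s_t, theta) * q_t."""
    probs = np.exp(theta[s_t] - theta[s_t].max())
    probs /= probs.sum()
    grad_log = np.zeros_like(theta)       # grad ln pi is nonzero only in row s_t
    grad_log[s_t] = -probs
    grad_log[s_t, a_t] += 1.0
    return theta + alpha * q_t * grad_log

theta = np.zeros((3, 2))
theta = policy_gradient_step(theta, s_t=0, a_t=1, q_t=2.5)
print(theta[0])                           # preference for a_t=1 at s_t=0 has increased
```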
Explanation
One explanation of this gradient-ascent algorithm is the following. Since

$$\nabla_\theta \ln \pi(a_t|s_t,\theta_t) = \frac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)},$$

we have the important expression of the algorithm:

$$\theta_{t+1} = \theta_t + \alpha\, \beta_t\, \nabla_\theta \pi(a_t|s_t,\theta_t), \qquad \beta_t = \frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}.$$

Intuition: When $\beta_t > 0$, the update increases the probability of choosing $a_t$ at $s_t$, i.e., $\pi(a_t|s_t,\theta_{t+1}) > \pi(a_t|s_t,\theta_t)$; when $\beta_t < 0$, the probability is decreased.

The greater $q_t(s_t,a_t)$ is, the greater $\beta_t$ is, and hence the stronger the enhancement of $\pi(a_t|s_t,\theta)$. This is the exploitation effect: actions with higher estimated values are reinforced more strongly.

When $\pi(a_t|s_t,\theta_t)$ is small, $\beta_t$ becomes large, so the probabilities of currently unlikely actions are increased more aggressively. This is the exploration effect.
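As an illustrative example (the numbers are made up): with $q_t(s_t,a_t) = 2$, an action with $\pi(a_t|s_t,\theta_t) = 0.5$ gets $\beta_t = 2/0.5 = 4$, whereas an action with the same value but $\pi(a_t|s_t,\theta_t) = 0.1$ gets $\beta_t = 2/0.1 = 20$; the rarely chosen action receives the much larger step, which is exactly the exploration effect described above.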
Monte Carlo policy gradient (REINFORCE)

Recall that the stochastic policy gradient update is

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t),$$

where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$.

If $q_t(s_t,a_t)$ is obtained by Monte Carlo estimation, i.e., by the discounted return collected from $(s_t,a_t)$ along a sampled episode,

$$q_t(s_t,a_t) = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k,$$

then the algorithm is called REINFORCE, or Monte Carlo policy gradient.
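Putting the pieces together, a minimal REINFORCE sketch might look like the following (our own illustration with a hypothetical `env` object exposing `reset()` and `step(a)`; the tabular softmax policy and the return computation follow the formulas above):

```python
import numpy as np

def reinforce(env, n_states, n_actions, episodes=1000, alpha=0.01, gamma=0.99, seed=0):
    """Monte Carlo policy gradient (REINFORCE) with a tabular softmax policy theta[s, a]."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))

    def pi(s):                                    # pi(.|s, theta), softmax over preferences
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    for _ in range(episodes):
        # 1) Generate an episode following the current policy (on-policy).
        #    `env` is a hypothetical environment: reset() -> s and step(a) -> (s', r, done).
        episode, s, done = [], env.reset(), False
        while not done:
            a = rng.choice(n_actions, p=pi(s))
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # 2) For each step t (processed in reverse), compute the return
        #    q_t(s_t, a_t) = sum_{k=t+1}^T gamma^(k-t-1) r_k
        #    and update theta <- theta + alpha * grad ln pi(a_t|s_t, theta) * q_t.
        g = 0.0
        for s_t, a_t, r_t1 in reversed(episode):
            g = gamma * g + r_t1                  # Monte Carlo estimate of q_pi(s_t, a_t)
            grad_log = np.zeros_like(theta)
            grad_log[s_t] = -pi(s_t)
            grad_log[s_t, a_t] += 1.0
            theta = theta + alpha * g * grad_log

    return theta
```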