# Policy Gradient Methods

We have shown that both state value functions and action value functions can be approximated by parameterized functions (see here), especially neural networks, and can be optimized by TD learning or MC learning.

In this post, we illustrate that policies can also be approximated as parameterized functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).

The key point of policy gradient is that, given an objective function \(J_{\theta}(s)\) (which can be some form of cumulative reward), the chain rule suggests the derivative \[
\frac{\partial J_{\theta}(s)}{\partial \theta} = \frac{\partial J_{\theta}(s)}{\partial \pi_{\theta}(a | s)} \frac{\partial \pi_{\theta}(a | s)}{\partial \theta}
\] where \(\theta\) denotes the parameters, \(\pi_\theta\) is a policy parameterized by \(\theta\) (which can be implemented by a neural network), \(s\) is a state, and \(a\) is an action. However, this expression is **not** directly computable, because \(J_{\theta}\) must rely on rewards, and rewards are generated by the environment, which is not differentiable.

Therefore, how can we compute \(\partial J_{\theta}(s) / \partial \theta\)? The answer is that we can prove \[ \nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a), \] and use it as the gradient of \(J(\theta)\) (we write \(J(\theta)\) to denote \(J_{\theta}(s)\)).


# Metrics for defining optimal policies

We first define the objective function \(J(\theta)\); there are multiple metrics to choose from.

## Metric 1: Average state value

The first metric is the average state value, or simply the average value. In particular, the metric is defined as

\[ \bar{v}_\pi=\sum_{s \in \mathcal{S}} d(s) v_\pi(s) \] where \(d = d_\pi\) is the stationary distribution.

Then, the metric can be written as \[ \bar{v}_\pi=\mathbb{E}_{S \sim d}\left[v_\pi(S)\right] . \]

The metric \(\bar{v}_\pi\) can also be rewritten as the inner product of two vectors. In particular, let \[ \begin{aligned} v_\pi & =\left[\ldots, v_\pi(s), \ldots\right]^T \in \mathbb{R}^{|\mathcal{S}|}, \\ d & =[\ldots, d(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|} . \end{aligned} \] Then, we have \[ \bar{v}_\pi=d^T v_\pi . \]

This expression will be useful when we analyze its gradient.
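As a quick illustration, \(\bar{v}_\pi = d^T v_\pi\) is a plain inner product; the sketch below uses a made-up 3-state example (all numbers are hypothetical):

```python
import numpy as np

# Hypothetical 3-state example: d is the stationary distribution under pi
# and v_pi collects the state values v_pi(s). All numbers are made up.
d = np.array([0.2, 0.5, 0.3])      # sums to 1
v_pi = np.array([1.0, 2.0, 3.0])

bar_v = d @ v_pi                   # bar v_pi = sum_s d(s) v_pi(s) = d^T v_pi
print(bar_v)                       # 0.2*1 + 0.5*2 + 0.3*3 = 2.1
```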

We can also let \(d\) be other distributions. For instance, we can make \(d\) independent of the policy \(\pi\). In this case, we specifically denote \(d\) as \(d_0\) and \(\bar{v}_\pi\) as \(\bar{v}_\pi^0\).

How should \(d_0\) be selected? One trivial way is to treat all states as equally important and choose the uniform distribution \[ d_0(s)=1 /|\mathcal{S}| . \] This case is relatively simple because \(d_0\) is independent of \(\pi\), so the gradient of the metric is easier to calculate.

## Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward. In particular, it is defined as \[ \begin{aligned} \bar{r}_\pi & \doteq \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) \\ & =\mathbb{E}_{S \sim d_\pi}\left[r_\pi(S)\right] \end{aligned} \] where \(d_\pi\) is the stationary distribution and \[ \color{salmon}{r_\pi(s)} \doteq \sum_{a \in \mathcal{A}} \pi(a \mid s, \theta) r(s, a)=\mathbb{E}_{A \sim \pi(s, \theta)}[r(s, A) \mid s] \] is the expectation of the immediate rewards. Here, \(r(s, a) \doteq \mathbb{E}[R \mid s, a]=\sum_r r p(r \mid s, a)\).

The average reward \(\bar{r}_\pi\) can also be written as the inner product of two vectors. In particular, let \[ \begin{aligned} & \color{salmon}{r_\pi}=\left[\ldots, r_\pi(s), \ldots\right]^T \in \mathbb{R}^{|\mathcal{S}|} \\ & d_\pi=\left[\ldots, d_\pi(s), \ldots\right]^T \in \mathbb{R}^{|\mathcal{S}|} \end{aligned} \] Then, it is clear that \[ \bar{r}_\pi=\sum_{s \in \mathcal{S}} d_\pi(s) \color{salmon}{r_\pi(s)}=d_\pi^T r_\pi . \]
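The two steps above (averaging over actions, then over states) can be sketched numerically; the policy, rewards, and stationary distribution below are all made-up values:

```python
import numpy as np

# Hypothetical example with |S| = 2 and |A| = 2; all numbers are made up.
pi = np.array([[0.7, 0.3],    # pi(a|s) for s = 0
               [0.4, 0.6]])   # pi(a|s) for s = 1
r = np.array([[1.0, 0.0],     # r(s, a) = E[R | s, a]
              [0.0, 2.0]])
d_pi = np.array([0.5, 0.5])   # assumed stationary distribution

r_pi = (pi * r).sum(axis=1)   # r_pi(s) = sum_a pi(a|s) r(s, a)
bar_r = d_pi @ r_pi           # bar r_pi = d_pi^T r_pi
print(bar_r)                  # 0.5*0.7 + 0.5*1.2 = 0.95
```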

# Gradients of the metrics

Given a metric, we next

- derive its gradient,
- then apply gradient-based methods to optimize the metric.

The gradient calculation is one of the most complicated parts of policy gradient methods! That is because

- first, we need to distinguish the different metrics \(\bar{v}_\pi, \bar{r}_\pi, \bar{v}_\pi^0\);
- second, we need to distinguish the discounted and undiscounted cases.

**Theorem 9.1 (Policy gradient theorem)**. The gradient of \(J(\theta)\) is \[
\color{orange}{\nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a)}
\] where \(\eta\) is a state distribution and \(\nabla_\theta \pi\) is the gradient of \(\pi\) with respect to \(\theta\).

Moreover, this equation has a compact form expressed in terms of expectation: \[ \color{green}{\nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]}, \] where \(\ln\) is the natural logarithm.

See the proof here.

Remarks:

- It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases.
- The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of \(J(\theta)\) and \(\eta\) are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, \(J(\theta)\) could be \(\bar{v}_\pi^0, \bar{v}_\pi\), or \(\bar{r}_\pi\).
- The equality in Theorem 9.1 may be either a strict equality or an approximation, depending on the scenario. The distribution \(\eta\) also varies across scenarios.
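The sum form and the expectation form of the gradient in Theorem 9.1 can be checked numerically. The sketch below, for a hypothetical tabular softmax policy with made-up \(\eta\) and \(q_\pi\) values, computes the sum form via the softmax Jacobian and the expectation form via \(\nabla_\theta \ln \pi\), and confirms they coincide because \(\pi \nabla_\theta \ln \pi = \nabla_\theta \pi\):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 3, 2                       # hypothetical numbers of states and actions
theta = rng.normal(size=(n_s, n_a))   # tabular softmax policy parameters
pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
eta = np.array([0.2, 0.3, 0.5])       # assumed state distribution eta
q = rng.normal(size=(n_s, n_a))       # assumed action values q_pi(s, a)

def grad_pi(s, a):
    # Softmax Jacobian: d pi(a|s) / d theta[s, b] = pi(a|s) (1{a=b} - pi(b|s));
    # rows of theta other than s do not affect pi(a|s), so they stay zero.
    g = np.zeros((n_s, n_a))
    g[s] = pi[s, a] * ((np.arange(n_a) == a) - pi[s])
    return g

def grad_log_pi(s, a):
    # d ln pi(a|s) / d theta[s, b] = 1{a=b} - pi(b|s)
    g = np.zeros((n_s, n_a))
    g[s] = (np.arange(n_a) == a) - pi[s]
    return g

# Sum form: sum_s eta(s) sum_a grad pi(a|s) q_pi(s, a)
g_sum = sum(eta[s] * grad_pi(s, a) * q[s, a]
            for s in range(n_s) for a in range(n_a))

# Expectation form written out: the weight of (s, a) is eta(s) pi(a|s)
g_exp = sum(eta[s] * pi[s, a] * grad_log_pi(s, a) * q[s, a]
            for s in range(n_s) for a in range(n_a))

print(np.allclose(g_sum, g_exp))  # the two forms agree
```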

# Gradient-ascent algorithms

With the gradient presented in Theorem 9.1, we next show how to use the gradient-based method to optimize the metrics to obtain optimal policies.

The gradient-ascent algorithm for maximizing \(J(\theta)\) is \[ \begin{aligned} \theta_{t+1} & =\theta_t+\alpha \color{orange}{\nabla_\theta J\left(\theta_t\right)} \\ & =\theta_t+\alpha \color{orange}{\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)} [\nabla_{\theta} \ln \pi(A \mid S, \theta_t) q_{\pi}(S, A) ]} \end{aligned} \] where \(\alpha>0\) is a constant learning rate. Since the true gradient \[ \color{orange}{\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) q_\pi(S, A)\right]} \]

is unknown, we use the *stochastic gradient* method and replace it with a sample: \[
\theta_{t+1}=\theta_t+\alpha \color{pink}{\nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_\pi\left(s_t, a_t\right)}
\]

Furthermore, since \(q_\pi\) is unknown, it can be approximated (as in value function approximation): \[ \theta_{t+1}=\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right) \]

There are different methods to approximate \(q_\pi\left(s_t, a_t\right)\):

- If we use Monte Carlo learning, the resulting algorithm is called *REINFORCE*.
- If we use TD learning, the resulting algorithm is called *Actor-Critic*.

Remarks:

- Since \(A \sim \pi(A \mid S, \theta)\), \(a_t\) should be sampled following \(\pi\left(\theta_t\right)\) (the policy at time step \(t\)) at \(s_t\). Therefore, the policy gradient method is on-policy: we use the same policy for data generation and for policy optimization.

## Explanation

One explanation of this gradient-ascent algorithm is that, since \[ \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right)=\frac{\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)} \] the algorithm can be rewritten as \[ \begin{aligned} \theta_{t+1} & =\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right) \\ & =\theta_t+\alpha \underbrace{\left(\frac{q_t\left(s_t, a_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)}\right)}_{\beta_t} \nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right) . \end{aligned} \]

Therefore, we have the important expression of the algorithm: \[ \color{blue}{\theta_{t+1}=\theta_t+\alpha} \color{red}{\beta_t} \color{blue}{\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)} \] which is a gradient-ascent step for maximizing \(\pi\left(a_t \mid s_t, \theta\right)\).

Intuition: when \(\alpha \beta_t\) is sufficiently small,

- If \(\beta_t>0\), the probability of choosing \(a_t\) at \(s_t\) is enhanced: \[ \pi\left(a_t \mid s_t, \theta_{t+1}\right)>\pi\left(a_t \mid s_t, \theta_t\right) \] The greater \(\beta_t\) is, the stronger the enhancement.
- If \(\beta_t<0\), then \(\pi\left(a_t \mid s_t, \theta_{t+1}\right)<\pi\left(a_t \mid s_t, \theta_t\right)\).

When \(\theta_{t+1}-\theta_t\) is sufficiently small, we have \[ \begin{aligned} \pi\left(a_t \mid s_t, \theta_{t+1}\right) & \approx \pi\left(a_t \mid s_t, \theta_t\right)+\left(\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)\right)^T\left(\theta_{t+1}-\theta_t\right) \\ & =\pi\left(a_t \mid s_t, \theta_t\right)+\alpha \beta_t\left(\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)\right)^T\left(\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)\right) \\ & =\pi\left(a_t \mid s_t, \theta_t\right)+\alpha \beta_t\left\|\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)\right\|^2 \end{aligned} \]
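This expansion can be verified numerically. The sketch below uses a hypothetical single-state softmax policy with two actions; the step size, the sampled action, and the value estimate \(q_t\) are arbitrary assumptions:

```python
import numpy as np

def softmax(th):
    e = np.exp(th)
    return e / e.sum()

theta = np.array([0.3, -0.2])   # hypothetical policy parameters for one state
pi_t = softmax(theta)           # pi(. | s_t, theta_t)
a = 0                           # assumed sampled action a_t
q_t = 1.5                       # assumed value estimate q_t(s_t, a_t)
alpha = 1e-3

grad_log = (np.arange(2) == a) - pi_t      # grad of ln pi(a_t | s_t, theta_t)
grad_pi = pi_t[a] * grad_log               # grad of pi(a_t | s_t, theta_t)
beta = q_t / pi_t[a]                       # beta_t = q_t / pi(a_t | s_t, theta_t)

theta_next = theta + alpha * beta * grad_pi       # the gradient-ascent step
lhs = softmax(theta_next)[a]                      # pi(a_t | s_t, theta_{t+1})
rhs = pi_t[a] + alpha * beta * grad_pi @ grad_pi  # first-order expansion

print(lhs > pi_t[a])   # beta_t > 0, so the probability of a_t is enhanced
print(abs(lhs - rhs))  # the gap is tiny, so the expansion is accurate
```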

## Monte Carlo policy gradient (REINFORCE)

Recall that \[ \theta_{t+1}=\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_\pi\left(s_t, a_t\right) \] is replaced by \[ \theta_{t+1}=\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right) \] where \(q_t\left(s_t, a_t\right)\) is an approximation of \(q_\pi\left(s_t, a_t\right)\).

If \(q_\pi\left(s_t, a_t\right)\) is approximated by Monte Carlo estimation, the algorithm has a specific name, REINFORCE.
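As a minimal illustration (not the full algorithm, which accumulates discounted returns over multi-step episodes), the sketch below runs the REINFORCE update on a hypothetical two-armed bandit: one state and episodes of length one, so the Monte Carlo return \(q_t(s_t, a_t)\) is just the sampled reward. All numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])   # assumed mean rewards; action 1 is better
theta = np.zeros(2)                 # softmax policy parameters
alpha = 0.1

for _ in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=pi)                       # sample a_t ~ pi(theta_t)
    reward = true_means[a] + rng.normal(scale=0.1)
    # For a softmax policy, grad of ln pi(a) wrt theta is one-hot(a) - pi.
    grad_log = (np.arange(2) == a) - pi
    theta += alpha * grad_log * reward            # q_t is the sampled return

pi = np.exp(theta) / np.exp(theta).sum()
print(pi)   # probability mass concentrates on the better action
```

Note that the same samples are both generated by and used to update \(\pi(\theta)\), consistent with the on-policy remark above.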