Policy Gradient Methods

We have shown that both state value functions and action value functions can be approximated by parameterized functions (see here), especially neural networks, and optimized by TD learning or MC learning.

In this post, we illustrate that policies can likewise be approximated as parameterized functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).

The key point of policy gradient is the following. Given an objective function $J_\theta(s)$ (some form of cumulative reward), where $\theta$ denotes the parameters, $\pi_\theta$ is a policy parameterized by $\theta$ (for example, a neural network), $s$ is a state, and $a$ is an action, we would like to apply the chain rule $\frac{\partial J_\theta(s)}{\partial \theta} = \frac{\partial J_\theta(s)}{\partial \pi_\theta(a|s)} \frac{\partial \pi_\theta(a|s)}{\partial \theta}$. However, this cannot be computed directly, because $J_\theta$ depends on rewards, and rewards are generated by the environment, which is not differentiable.

How, then, can we compute $\partial J_\theta(s)/\partial \theta$? The answer is that we can prove $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$ and use it as the gradient of $J(\theta)$ (we write $J(\theta)$ to denote $J_\theta(s)$).
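For concreteness, here is a minimal sketch of a policy $\pi_\theta$ implemented as a linear layer over one-hot state features followed by a softmax; a deeper neural network works the same way. All names and sizes below are illustrative, not from the source.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# pi_theta: a linear layer over one-hot state features followed by a softmax.
num_states, num_actions = 4, 3
rng = np.random.default_rng(0)
theta = 0.01 * rng.standard_normal((num_states, num_actions))   # policy parameters

def pi(s, theta):
    """pi(.|s, theta): a probability distribution over actions for state s."""
    x = np.eye(num_states)[s]      # one-hot feature vector x(s)
    return softmax(theta.T @ x)

print(pi(0, theta))                # sums to 1; differentiable in theta
```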

Sources:

  1. Shiyu Zhao. Chapter 9: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.

Metrics for defining optimal policies

We first define the objective function $J(\theta)$; there are multiple metrics to choose from.

Metric 1: Average state value

The first metric is the average state value, or simply the average value. In particular, the metric is defined as

$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$, where $d = d_\pi$ is the stationary distribution.

Then, the metric can be written as $\bar{v}_\pi = \mathbb{E}_{S \sim d}[v_\pi(S)]$.

The metric $\bar{v}_\pi$ can also be rewritten as the inner product of two vectors. In particular, let $v_\pi = [\ldots, v_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$ and $d = [\ldots, d(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$. Then, we have $\bar{v}_\pi = d^T v_\pi$.

This expression will be useful when we analyze its gradient.
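As a quick numerical illustration (all numbers below are made up), the inner-product form can be evaluated directly:

```python
import numpy as np

d = np.array([0.1, 0.2, 0.3, 0.4])       # a distribution d over 4 states
v_pi = np.array([1.0, 0.5, 2.0, -0.5])   # state values v_pi(s)

v_bar = d @ v_pi                          # v_bar_pi = d^T v_pi
assert np.isclose(v_bar, sum(d[s] * v_pi[s] for s in range(4)))
print(v_bar)                              # 0.6
```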

We can also let $d$ be other distributions. For instance, we can make $d$ independent of the policy $\pi$. In this case, we specifically denote $d$ as $d^0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$.

One trivial way to select $d^0$ is to treat all states as equally important, i.e., the uniform distribution $d^0(s) = 1/|\mathcal{S}|$. This case is relatively simple because $d^0$ is independent of the policy, so the gradient of the metric is easier to calculate.

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward. In particular, it is defined as $\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}[r_\pi(S)]$, where $d_\pi$ is the stationary distribution and $r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s,\theta)\, r(s,a) = \mathbb{E}_{A \sim \pi(s,\theta)}[r(s,A) \mid s]$ is the expectation of the immediate rewards. Here, $r(s,a) \doteq \mathbb{E}[R \mid s,a] = \sum_r r\, p(r \mid s,a)$.

The average reward $\bar{r}_\pi$ can also be written as the inner product of two vectors. In particular, let $r_\pi = [\ldots, r_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$ and $d_\pi = [\ldots, d_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}$. Then, it is clear that $\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi$.
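A small numerical sketch of this metric (the distribution, policy, and rewards below are arbitrary illustrative values):

```python
import numpy as np

d_pi = np.array([0.2, 0.3, 0.5])          # stationary distribution over 3 states
pi_tab = np.array([[0.5, 0.5],            # pi(a|s) for each state s (2 actions)
                   [0.9, 0.1],
                   [0.2, 0.8]])
r_sa = np.array([[1.0, 0.0],              # r(s, a) = E[R | s, a]
                 [0.0, 2.0],
                 [1.0, 1.0]])

r_pi = (pi_tab * r_sa).sum(axis=1)        # r_pi(s) = sum_a pi(a|s) r(s, a)
r_bar = d_pi @ r_pi                       # r_bar_pi = d_pi^T r_pi
print(r_bar)                              # 0.66
```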

Gradients of the metrics

Given a metric, we next derive its gradient and then apply gradient-based methods to optimize the metric.

The gradient calculation is one of the most complicated parts of policy gradient methods! That is because, first, we need to distinguish the different metrics $\bar{v}_\pi$, $\bar{r}_\pi$, and $\bar{v}_\pi^0$; and second, we need to distinguish the discounted and undiscounted cases.

Theorem 9.1 (Policy gradient theorem). The gradient of $J(\theta)$ is $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$, where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.

Moreover, this equation has a compact form expressed in terms of expectation: $\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)]$, where $\ln$ is the natural logarithm.

See the proof here.

Remarks:

  • It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases.
  • The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of $J(\theta)$ and $\eta$ are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, $J(\theta)$ could be $\bar{v}_\pi^0$, $\bar{v}_\pi$, or $\bar{r}_\pi$.
  • The equality in Theorem 9.1 may be either a strict equality or an approximation. The distribution $\eta$ also varies in different scenarios.
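To make the theorem concrete, the following sketch checks numerically that the summation form and the expectation form coincide for a tabular softmax policy. Here $\eta$, $q_\pi$, and $\theta$ are arbitrary values chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2
theta = rng.standard_normal((nS, nA))     # softmax logits of pi(a|s, theta)
eta = np.array([0.2, 0.3, 0.5])           # a state distribution eta
q = rng.standard_normal((nS, nA))         # stand-in values for q_pi(s, a)

def pi_probs(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

sum_form = np.zeros_like(theta)           # sum_s eta(s) sum_a grad pi(a|s) q(s, a)
exp_form = np.zeros_like(theta)           # E_{S~eta, A~pi}[grad ln pi(A|S) q(S, A)]
for s in range(nS):
    p = pi_probs(s)
    for a in range(nA):
        grad_ln = np.zeros_like(theta)
        grad_ln[s] = np.eye(nA)[a] - p    # grad_theta ln pi(a|s, theta)
        grad_pi = p[a] * grad_ln          # grad_theta pi(a|s, theta) (softmax derivative)
        sum_form += eta[s] * grad_pi * q[s, a]
        exp_form += eta[s] * p[a] * grad_ln * q[s, a]

assert np.allclose(sum_form, exp_form)    # the two forms give the same gradient
```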

Gradient-ascent algorithms

With the gradient presented in Theorem 9.1, we next show how to use the gradient-based method to optimize the metrics to obtain optimal policies.

The gradient-ascent algorithm for maximizing $J(\theta)$ is $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)]$, where $\alpha > 0$ is a constant learning rate.

Since the true gradient $\mathbb{E}_{S \sim \eta,\, A \sim \pi(S,\theta)}[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)]$ is unknown, we use the stochastic gradient method to replace it by a sample: $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$.

Furthermore, since $q_\pi$ is unknown, it can be approximated (as in function approximation): $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$.
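A minimal sketch of one such stochastic update for a tabular softmax policy, assuming the value $q_t(s_t,a_t)$ is already given (how to obtain it is discussed next); the function name and sizes are illustrative:

```python
import numpy as np

def policy_gradient_step(theta, s_t, a_t, q_t, alpha=0.1):
    """One step of theta_{t+1} = theta_t + alpha * grad ln pi(a_t|s_t, theta_t) * q_t."""
    nA = theta.shape[1]
    e = np.exp(theta[s_t] - theta[s_t].max())
    p = e / e.sum()                            # pi(.|s_t, theta_t)
    grad_ln = np.zeros_like(theta)
    grad_ln[s_t] = np.eye(nA)[a_t] - p         # grad_theta ln pi(a_t|s_t, theta_t)
    return theta + alpha * grad_ln * q_t       # theta_{t+1}

theta = np.zeros((4, 3))                       # 4 states, 3 actions (illustrative)
theta = policy_gradient_step(theta, s_t=1, a_t=2, q_t=1.5)
```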

There are different methods to approximate $q_\pi(s_t,a_t)$:

  • If we use Monte Carlo learning, the resulting algorithm is called REINFORCE.
  • If we use TD learning, the resulting algorithm is called Actor-Critic.

Remarks:

  • Since $A \sim \pi(A|S,\theta)$, $a_t$ should be sampled following $\pi(\theta_t)$ (the policy $\pi_\theta$ at time step $t$) at $s_t$. Therefore, the policy gradient method is on-policy, because we use the same policy to generate data and to optimize the policy.

Explanation

One explanation of this gradient-ascent algorithm is that, since $\nabla_\theta \ln \pi(a_t|s_t,\theta_t) = \frac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)}$, the algorithm can be rewritten as $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t) = \theta_t + \alpha \underbrace{\left(\frac{q_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}\right)}_{\beta_t} \nabla_\theta \pi(a_t|s_t,\theta_t)$.

Therefore, we have the important expression of the algorithm: $\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t|s_t,\theta_t)$. It is a gradient-ascent algorithm for maximizing $\pi(a_t|s_t,\theta)$.

Intuition: when $\alpha\beta_t$ is sufficiently small,

  • If $\beta_t > 0$, the probability of choosing $(s_t,a_t)$ is enhanced: $\pi(a_t|s_t,\theta_{t+1}) > \pi(a_t|s_t,\theta_t)$. The greater $\beta_t$ is, the stronger the enhancement.
  • If $\beta_t < 0$, then $\pi(a_t|s_t,\theta_{t+1}) < \pi(a_t|s_t,\theta_t)$.

This is because, when $\theta_{t+1} - \theta_t$ is sufficiently small, we have $\pi(a_t|s_t,\theta_{t+1}) \approx \pi(a_t|s_t,\theta_t) + (\nabla_\theta \pi(a_t|s_t,\theta_t))^T(\theta_{t+1} - \theta_t) = \pi(a_t|s_t,\theta_t) + \alpha\beta_t (\nabla_\theta \pi(a_t|s_t,\theta_t))^T(\nabla_\theta \pi(a_t|s_t,\theta_t)) = \pi(a_t|s_t,\theta_t) + \alpha\beta_t \|\nabla_\theta \pi(a_t|s_t,\theta_t)\|^2$.
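A quick numerical check of this intuition (one state, three actions, made-up numbers): a positive $q_t$, hence $\beta_t > 0$, increases $\pi(a_t|s_t,\theta)$ after one small update.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([[0.2, -0.1, 0.4]])           # one state, three actions
s_t, a_t, q_t, alpha = 0, 1, 2.0, 0.01         # q_t > 0  =>  beta_t > 0

p_old = softmax(theta[s_t])
grad_ln = np.eye(3)[a_t] - p_old               # grad_theta ln pi(a_t|s_t, theta_t)
theta_new = theta.copy()
theta_new[s_t] += alpha * grad_ln * q_t        # the gradient-ascent step

p_new = softmax(theta_new[s_t])
assert p_new[a_t] > p_old[a_t]                 # the probability of a_t is enhanced
```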

Monte Carlo policy gradient (REINFORCE)

Algorithm 9.1

Recall that $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$ is replaced by $\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$, where $q_t(s_t,a_t)$ is an approximation of $q_\pi(s_t,a_t)$.

If $q_\pi(s_t,a_t)$ is approximated by Monte Carlo estimation, the algorithm has a specific name: REINFORCE.
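Below is a minimal REINFORCE sketch on a toy 5-state chain (move left or right, +1 for reaching the rightmost state, a small per-step penalty otherwise). The environment, hyperparameters, and episode limits are illustrative assumptions, not part of the source text.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 5, 2                      # actions: 0 = left, 1 = right
GOAL, GAMMA, ALPHA = nS - 1, 0.9, 0.1
theta = np.zeros((nS, nA))         # softmax logits of pi(a|s, theta)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(s, a):
    """Toy chain dynamics: deterministic moves, +1 at the goal, -0.01 otherwise."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), nS - 1)
    r = 1.0 if s_next == GOAL else -0.01
    return s_next, r, s_next == GOAL

for episode in range(500):
    # 1) Generate an episode by following pi(theta_t).
    s, traj, done = 0, [], False
    while not done and len(traj) < 100:
        a = rng.choice(nA, p=softmax(theta[s]))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next

    # 2) For each visited (s_t, a_t), estimate q_t(s_t, a_t) by the discounted
    #    Monte Carlo return g, then take the gradient-ascent step.
    g = 0.0
    for (s_t, a_t, r_t) in reversed(traj):
        g = r_t + GAMMA * g                     # return following (s_t, a_t)
        p = softmax(theta[s_t])
        grad_ln = np.eye(nA)[a_t] - p           # row s_t of grad_theta ln pi(a_t|s_t, theta)
        theta[s_t] += ALPHA * grad_ln * g

print(softmax(theta[0]))   # after training, "right" should dominate in state 0
```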