We have shown that both state value functions and action value functions can be approximated by parameterized functions (see here), especially neural networks, and can be optimized by TD learning or MC learning.

In this post, we illustrate that policies can be approximated as functions as well, and can be optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).

The key point of policy gradient is that, given an objective function \(J_{\theta}(s)\) (which can be some form of cumulative rewards), the chain rule would suggest computing its derivative as \[ \frac{\partial J_{\theta}(s)}{\partial \theta} = \frac{\partial J_{\theta}(s)}{\partial \pi_{\theta}(a \mid s)} \frac{\partial \pi_{\theta}(a \mid s)}{\partial \theta}, \] where \(\theta\) is the parameter vector, \(\pi_\theta\) is a policy parameterized by \(\theta\) (which can be implemented by a neural network), \(s\) is a state, and \(a\) is an action. However, this chain rule cannot be applied directly: \(J_{\theta}\) relies on rewards, and rewards are generated by the environment, which is not differentiable.

Therefore, how can we compute \(\partial J_{\theta}(s) / \partial \theta\)? The answer is that we can prove \[ \nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a), \] and use it as the gradient of \(J(\theta)\) (we write \(J(\theta)\) to denote \(J_{\theta}(s)\)).
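To make this concrete, here is a minimal sketch (my own illustration, not the book's code) of how one sampled \((s, a)\) pair together with an estimate of \(q_\pi(s, a)\) yields a stochastic estimate of this gradient. It uses PyTorch, a small softmax policy network, and the standard identity \(\nabla_\theta \pi = \pi \nabla_\theta \ln \pi\); the network architecture and the function name `sample_gradient` are hypothetical.

```python
# A minimal PyTorch sketch (not the book's code): one-sample estimate of the
# policy gradient using grad ln pi(a|s, theta) * q(s, a).
import torch
import torch.nn as nn

n_states, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(n_states, 16), nn.ReLU(), nn.Linear(16, n_actions))

def sample_gradient(state, action, q_value):
    """Backpropagate ln pi(a|s) * q(s, a); the parameters' .grad fields then
    hold a one-sample estimate of the policy gradient (up to sign)."""
    log_pi = torch.log_softmax(policy(state), dim=-1)[action]
    (-log_pi * q_value).backward()   # minus sign because optimizers minimize

# One hypothetical sample: one-hot state s, action a = 1, estimated q(s, a) = 0.7.
s = torch.zeros(n_states); s[0] = 1.0
sample_gradient(s, action=1, q_value=0.7)
```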

Sources:

  1. Shiyu Zhao. Chapter 9: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.
Read more »

Here we prove the policy gradient theorem, i.e., that the gradient of an objective function \(J(\theta)\) is \[ \color{orange}{\nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a)} \] where \(\eta\) is a state distribution and \(\nabla_\theta \pi\) is the gradient of \(\pi\) with respect to \(\theta\).

Moreover, this equation has a compact form expressed in terms of expectation: \[ \color{green}{\nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]}, \] where \(\ln\) is the natural logarithm.
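For completeness (this step is standard and not spelled out in the excerpt), the compact form follows from the identity \(\nabla_\theta \pi(a \mid s, \theta)=\pi(a \mid s, \theta) \nabla_\theta \ln \pi(a \mid s, \theta)\), which assumes \(\pi(a \mid s, \theta)>0\) (e.g., a softmax policy): \[ \nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a \mid s, \theta)\, \nabla_\theta \ln \pi(a \mid s, \theta)\, q_\pi(s, a)=\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]. \] It is this expectation form that allows the gradient to be approximated by samples.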

We prove this theorem in the discounted and undiscounted cases separately. In each case, we prove it for three different metrics: \(\bar{v}_\pi\), \(\bar{r}_\pi\), and \(\bar{v}_\pi^0\).

For simplicity, I only present the proof for the discounted case in the appendix. See the book for the proof of the undiscounted case.

Sources:

  1. Shiyu Zhao. Chapter 9: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.
Read more »

The policy gradient methods that we have studied so far, including REINFORCE, QAC, and A2C, are all on-policy. The reason for this can be seen from the expression of the true gradient: \[ \nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right)\left(q_\pi(S, A)-v_\pi(S)\right)\right] . \]

To use samples to approximate this true gradient, we must generate the action samples by following \(\pi(\theta)\). Hence, \(\pi(\theta)\) is the behavior policy. Since \(\pi(\theta)\) is also the target policy that we aim to improve, the policy gradient methods are on-policy.

In the case that we already have some samples generated by a given behavior policy, the policy gradient methods can still be applied to utilize these samples. To do that, we can employ a technique called importance sampling. It is a general technique for estimating expected values defined over one probability distribution using some samples drawn from another distribution.
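As a concrete illustration (my own sketch, not from the book), the following numpy snippet estimates an expectation under a target distribution \(p\) using samples drawn from a different behavior distribution \(q\), by weighting each sample with the ratio \(p(x)/q(x)\). The distributions and the function values `f` are made up for the example.

```python
# A minimal numpy sketch of importance sampling: estimate E_{x ~ p}[f(x)]
# from samples drawn under q, using the weights p(x) / q(x).
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.1, 0.3, 0.6])   # target distribution (e.g., target policy)
q = np.array([0.5, 0.3, 0.2])   # behavior distribution that generated the data
f = np.array([1.0, 2.0, 3.0])   # function whose expectation under p we want

samples = rng.choice(3, size=100_000, p=q)   # data generated under q
weights = p[samples] / q[samples]            # importance weights
estimate = np.mean(weights * f[samples])     # approximately E_p[f(X)]

print(estimate, "vs exact", np.dot(p, f))
```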

Sources:

  1. Shiyu Zhao. Chapter 10: Actor-Critic Methods. Mathematical Foundations of Reinforcement Learning.
Read more »

The stationary distribution of \(S\) under policy \(\pi\) can be denoted by \(\left\{d_\pi(s)\right\}_{s \in \mathcal{S}}\). By definition, \(d_\pi(s) \geq 0\) and \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\).

Let \(n_\pi(s)\) denote the number of times that \(s\) has been visited in a very long episode generated by \(\pi\). Then, \(d_\pi(s)\) can be approximated by \[ d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s^{\prime} \in \mathcal{S}} n_\pi\left(s^{\prime}\right)}. \] Meanwhile, the converged values \(d_\pi(s)\) can be computed directly by solving the equation \[ d_\pi^T=d_\pi^T P_\pi, \] i.e., \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1.
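Both ways of obtaining \(d_\pi\) can be illustrated with a small numpy sketch (the two-state transition matrix `P` below is hypothetical): count visit frequencies along a long simulated trajectory, or take the left eigenvector of \(P_\pi\) associated with eigenvalue 1.

```python
# A small numpy sketch (hypothetical P_pi) of the two ways to obtain d_pi.
import numpy as np

P = np.array([[0.9, 0.1],      # P_pi: transition matrix under policy pi
              [0.4, 0.6]])

# 1) Approximate d_pi by visit frequencies in a long trajectory.
rng = np.random.default_rng(0)
s, counts = 0, np.zeros(2)
for _ in range(100_000):
    counts[s] += 1
    s = rng.choice(2, p=P[s])
print(counts / counts.sum())

# 2) Solve d_pi^T = d_pi^T P_pi: left eigenvector of P for eigenvalue 1,
#    i.e., right eigenvector of P^T.
vals, vecs = np.linalg.eig(P.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
print(d / d.sum())
```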

Sources:

  1. Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.
Read more »

In the previous post, we introduced TD learning algorithms. At that time, all state/action values were represented by tables. This is inefficient for handling large state or action spaces.

In this post, we will use the function approximation method for TD learning. It is also where artificial neural networks are incorporated into reinforcement learning as function approximators.
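As a preview, here is a minimal sketch (my own, not the book's algorithm statement) of one semi-gradient TD(0) update with a linear approximator \(\hat{v}(s, w) = w^\top \phi(s)\). The feature function `phi` and the transition used at the end are placeholders.

```python
# A minimal sketch of semi-gradient TD(0) with linear value function
# approximation: v_hat(s, w) = w^T phi(s).
import numpy as np

n_features, alpha, gamma = 8, 0.05, 0.99
w = np.zeros(n_features)

def phi(state):
    """Hypothetical feature vector for a state; replace with a real encoding."""
    feat = np.zeros(n_features)
    feat[hash(state) % n_features] = 1.0
    return feat

def td0_update(s, r, s_next, w):
    """One TD(0) step: w += alpha * td_error * grad v_hat(s, w) = phi(s)."""
    td_error = r + gamma * np.dot(w, phi(s_next)) - np.dot(w, phi(s))
    return w + alpha * td_error * phi(s)

# Usage with one hypothetical transition (s, r, s'):
w = td0_update("s0", 1.0, "s1", w)
```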

Sources:

  1. Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.
Read more »

This chapter introduces temporal-difference (TD) methods for reinforcement learning. Similar to Monte Carlo (MC) learning, TD learning is also model-free, but it has some advantages due to its incremental form.

The goal of both TD learning and MC learning is policy evaluation: given a dataset generated by a policy \(\pi\), estimate the state values (or action values) of the policy \(\pi\). They must be combined with a policy improvement step to obtain optimal policies.

We elaborate on two TD learning methods: one for estimating state values and one for estimating action values; the latter algorithm is called Sarsa.
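For a concrete picture, here is a minimal tabular Sarsa sketch (a hypothetical MDP with 5 states and 2 actions; the helper names are mine): each update moves \(q(s, a)\) toward the TD target \(r + \gamma q(s', a')\), where \(a'\) is chosen by the same \(\varepsilon\)-greedy policy.

```python
# A minimal tabular Sarsa sketch on a hypothetical 5-state, 2-action MDP.
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.9, 0.1
q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def eps_greedy(s):
    """Epsilon-greedy action selection from the current q table."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(q[s]))

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD update toward r + gamma * q(s', a'), with a' drawn from pi."""
    q[s, a] += alpha * (r + gamma * q[s_next, a_next] - q[s, a])

# One hypothetical transition (s, a, r, s'), with a' chosen by the same policy:
s, a, r, s_next = 0, eps_greedy(0), 1.0, 1
sarsa_update(s, a, r, s_next, eps_greedy(s_next))
```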

In the next post, we will introduce Q-learning, which is very similar to Sarsa but directly estimates optimal action values and hence optimal policies.

Sources:

  1. Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.
Read more »

Sources:

  1. B. P. Lathi & Roger Green. (2018). Chapter 7: Continuous-Time Signal Analysis: The Fourier Transform. Signal Processing and Linear Systems (3nd ed., pp. 701-720). Oxford University Press.

For a quick reference table, see the Wikipedia page on the Fourier transform.

Read more »