Policy Gradient Methods
We have shown that both state value functions and action value functions can be approximated by parameterized functions (see here), especially neural networks, and can be optimized by TD learning or MC learning.
In this post, we illustrate that policies can also be approximated by parameterized functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).
The key point of policy gradient is the following. Given an objective function \(J_{\theta}(s)\) (which can be some form of cumulative reward), the chain rule would suggest \[ \frac{\partial J_{\theta}(s)}{\partial \theta} = \frac{\partial J_{\theta}(s)}{\partial \pi_{\theta}(a \mid s)} \frac{\partial \pi_{\theta}(a \mid s)}{\partial \theta}, \] where \(\theta\) is the parameter vector, \(\pi_\theta\) is a policy parameterized by \(\theta\) (for example, implemented by a neural network), \(s\) is a state and \(a\) is an action. However, this expression cannot be evaluated directly, because \(J_{\theta}\) depends on rewards, and rewards are generated by the environment, which is not differentiable.
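To see more concretely why direct differentiation fails, the objective can be written as an expectation over trajectories (a short sketch in notation not used above): \[ J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \sum_{t} r_t \right], \qquad p_\theta(\tau) = d(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \] where the transition model \(p(s_{t+1} \mid s_t, a_t)\) and the rewards \(r_t\) belong to the environment; they are unknown to the agent and cannot be backpropagated through, so the chain-rule expression above cannot be evaluated directly.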
Therefore, how can we compute \(\partial J_{\theta}(s) / \partial \theta\)? The answer is that we can prove \[ \nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a), \] where \(\eta\) is a distribution over states and \(q_\pi\) is the action value function under \(\pi\), and use this expression as the gradient of \(J(\theta)\) (we write \(J(\theta)\) for \(J_{\theta}(s)\)).
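Because \(\nabla_\theta \pi(a \mid s, \theta) = \pi(a \mid s, \theta)\, \nabla_\theta \log \pi(a \mid s, \theta)\), the theorem can also be written as an expectation, \[ \nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi}\!\left[ \nabla_\theta \log \pi(A \mid S, \theta)\, q_\pi(S, A) \right], \] which can be estimated from sampled states and actions even though the environment itself is not differentiable. The sketch below is a minimal REINFORCE (MC policy gradient) loop built on this idea, assuming PyTorch and Gymnasium are installed; the network `PolicyNet`, the CartPole-v1 environment, and the hyperparameters are illustrative choices, not the post's own code.

```python
# Minimal REINFORCE sketch (Monte Carlo policy gradient).
# Assumes PyTorch and Gymnasium; PolicyNet and all hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gymnasium as gym

class PolicyNet(nn.Module):
    """pi_theta(a | s): a small network mapping states to a distribution over actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return Categorical(logits=self.net(obs))

env = gym.make("CartPole-v1")
policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()                      # sample a ~ pi_theta(. | s)
        log_probs.append(dist.log_prob(action))     # keep log pi(a | s) for the gradient
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Monte Carlo returns G_t play the role of q_pi(s_t, a_t) in the theorem.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta): minimize -sum_t log pi(a_t | s_t) * G_t.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Replacing the Monte Carlo return \(G_t\) with a learned critic's estimate of \(q_\pi(s_t, a_t)\) (a TD target) turns this MC scheme into the Actor-Critic variant mentioned above.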
Sources: