眉妩·新月
【宋】 王沂孙
渐新痕悬柳,淡彩穿花,依约破初暝。便有团圆意,深深拜,相逢谁在香径。画眉未稳。料素娥、犹带离恨。最堪爱、一曲银钩小,宝帘挂秋冷。
千古盈亏休问。叹慢磨玉斧,难补金镜。太液池犹在,凄凉处、何人重赋清景。故山夜永。试待他、窥户端正。看云外山河,还老尽、桂花影。
This post introduces generative tasks in machine learning and the role neural networks play in them.
Source: Lesson 24 LOTUS
Actor-critic methods are still policy gradient methods. Unlike REINFORCE, which relies on MC learning, actor-critic methods use TD learning to approximate the action value \(q_\pi\left(s_t, a_t\right)\).
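To make the idea concrete, here is a minimal sketch of one actor-critic update. It assumes a tabular softmax policy and a tabular critic updated in a SARSA-like TD fashion; the function name `qac_step`, the learning rates, and the toy numbers are all hypothetical choices for illustration, not something taken from the lesson.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def qac_step(theta, q, s, a, r, s_next, a_next,
             gamma=0.9, lr_actor=0.1, lr_critic=0.1):
    """One actor-critic update (hypothetical helper, tabular case).

    theta[s, a]: policy logits (the actor), pi(.|s) = softmax(theta[s]).
    q[s, a]:     estimated action values (the critic).
    Both arrays are modified in place; the TD error is returned.
    """
    # Actor: policy gradient step, with the critic's estimate q[s, a]
    # standing in for the true action value q_pi(s, a).
    pi_s = softmax(theta[s])
    grad_log_pi = -pi_s          # gradient of ln pi(a|s) w.r.t. theta[s] ...
    grad_log_pi[a] += 1.0        # ... is onehot(a) - pi(.|s) for a softmax policy
    theta[s] += lr_actor * grad_log_pi * q[s, a]

    # Critic: TD update towards the one-step bootstrapped target.
    td_error = r + gamma * q[s_next, a_next] - q[s, a]
    q[s, a] += lr_critic * td_error
    return td_error

# Toy usage with made-up numbers: 2 states, 2 actions.
theta = np.zeros((2, 2))
q = np.zeros((2, 2))
q[0, 1] = 0.5
td = qac_step(theta, q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

In this sketch the policy parameters play the role of the "actor" and the value table plays the role of the "critic": the critic scores the action just taken, and the actor shifts probability towards actions the critic scores highly.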
What are "actor" and "critic"?
We have shown that both state value functions and action value functions can be approximated by functions (see here), especially neural networks, and can be optimized by TD learning or MC learning.
In this post, we illustrate that policies can also be approximated by functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).
The key difficulty in policy gradient methods is the following. Given an objective function \(J_{\theta}(s)\) (which can be some form of cumulative reward), the chain rule gives its derivative as \[ \frac{\partial J_{\theta}(s)}{\partial \theta} = \frac{\partial J_{\theta}(s)}{\partial \pi_{\theta}(a | s)} \frac{\partial \pi_{\theta}(a | s)}{\partial \theta} \] where \(\theta\) denotes the parameters, \(\pi_\theta\) is a policy parameterized by \(\theta\) (it can be implemented by a neural network), \(s\) is a state, and \(a\) is an action. However, this derivative cannot be computed directly: \(J_{\theta}\) relies on rewards, and rewards are generated by the environment, which is not differentiable.
Therefore, how can we compute \(\partial J_{\theta}(s) / \partial \theta\)? The answer is that we can prove \[ \nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a), \] and use it as the gradient of \(J(\theta)\) (we write \(J(\theta)\) for \(J_{\theta}(s)\)).
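As a rough illustration of how such a gradient can be used without differentiating through the environment, here is a minimal REINFORCE-style sketch. It assumes a tabular softmax policy and replaces \(q_\pi(s, a)\) with the sampled discounted return from each step onward; the function name `reinforce_update`, the episode format, and the hyperparameters are hypothetical.

```python
import numpy as np

def reinforce_update(theta, episode, gamma=0.9, lr=0.1):
    """One REINFORCE-style update from a finished episode (illustrative sketch).

    theta[s, a]: logits of a tabular softmax policy, modified in place.
    episode:     list of (state, action, reward) tuples in time order.
    """
    # Compute the discounted return g_t from every step t onward (MC estimate).
    returns = []
    g = 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # Stochastic gradient ascent: theta += lr * grad ln pi(a|s) * g_t.
    for (s, a, _), g_t in zip(episode, returns):
        z = np.exp(theta[s] - theta[s].max())
        pi_s = z / z.sum()
        grad_log_pi = -pi_s      # onehot(a) - pi(.|s) for a softmax policy
        grad_log_pi[a] += 1.0
        theta[s] += lr * grad_log_pi * g_t
    return theta

# Toy usage with made-up data: one state, two actions, a single-step episode.
theta = reinforce_update(np.zeros((1, 2)), [(0, 1, 1.0)])
```

Only rewards observed from the environment and the gradient of the policy itself are needed here, which is why the non-differentiable environment is not an obstacle.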
Here we prove the Policy gradient theorem, i.e., the gradient of an objective function \(J(\theta)\) is \[ \color{orange}{\nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a)} \] where \(\eta\) is a state distribution and \(\nabla_\theta \pi\) is the gradient of \(\pi\) with respect to \(\theta\).
Moreover, this equation has a compact form expressed in terms of expectation: \[ \color{green}{\nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]}, \] where \(\ln\) is the natural logarithm.
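The step from the sum form to the expectation form rests on the identity \(\nabla_\theta \pi = \pi \nabla_\theta \ln \pi\). The sketch below checks this numerically for a small tabular softmax policy. Note the assumptions: `eta` and `q` are random stand-in arrays rather than quantities computed from a real MDP, so this only checks that the two expressions agree, not the theorem itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 4

theta = rng.normal(size=(n_states, n_actions))  # softmax logits (assumed policy form)
eta = rng.dirichlet(np.ones(n_states))          # stand-in state distribution
q = rng.normal(size=(n_states, n_actions))      # stand-in action values

# pi[s, a] = softmax over actions of theta[s, :]
z = np.exp(theta - theta.max(axis=1, keepdims=True))
pi = z / z.sum(axis=1, keepdims=True)

# For a softmax policy, with indices (s, a, b):
#   d ln pi(a|s) / d theta[s, b] = 1[a == b] - pi(b|s)
#   d pi(a|s)    / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s))
eye = np.eye(n_actions)
grad_log_pi = eye[None, :, :] - pi[:, None, :]
grad_pi = pi[:, :, None] * grad_log_pi

# Sum form: sum_s eta(s) sum_a grad pi(a|s) q(s, a)
grad_sum = eta[:, None] * np.einsum("sab,sa->sb", grad_pi, q)

# Expectation form, evaluated exactly: weights eta(s) * pi(a|s)
grad_exp = np.einsum("s,sa,sab,sa->sb", eta, pi, grad_log_pi, q)

print(np.allclose(grad_sum, grad_exp))
```

The two arrays agree up to floating-point error. In practice the expectation form is the useful one, because it can be estimated from sampled states and actions instead of summing over the whole state and action spaces.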
We prove this theorem in the discounted and undiscounted cases separately. In each case, we prove it for 3 different metrics \(\bar{v}_\pi, \bar{r}_\pi, \bar{v}_\pi^0\).
For simplicity, I only list the proof in the discounted case in the appendix. See the book for proof of the undiscounted case.