Lu, Yukuan

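Wang Jiangnan: Rising from Spring Sleep

Posted on 2024-07-11 Edited on 2024-12-04 In Literature

[Song Dynasty] Jin Deshu

Rising from spring sleep, snow lies deep across the Yan Mountains. The Great Wall stretches ten thousand li like a sash of white silk; the lamps of the six avenues have already dimmed; someone stands among the jade towers.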

What Is "Generation" in Machine Learning

Posted on 2024-07-10 Edited on 2024-12-04 In Potpourri

This post introduces generative tasks in machine learning and the role neural networks play in them.

Random Vectors

Posted on 2024-07-09 Edited on 2024-12-04 In Mathematics

Source:

Random Vectors from the textbook Introduction to Probability, Statistics, and Random Processes by Hossein Pishro-Nik.
Random Vectors and the Variance–Covariance Matrix

Expected Value, Variance and Covariance of a Random Variable

Posted on 2024-07-09 Edited on 2024-12-04 In Mathematics

Sources:

Wikipedia

Law of the Unconscious Statistician

Posted on 2024-07-09 Edited on 2024-12-04 In Mathematics

Source: Lesson 24 LOTUS

Understanding Minion Waves

Posted on 2024-07-03 Edited on 2024-12-04

Source:

How to read the minion wave
How to use the minion wave

Actor-Critic Methods

Posted on 2024-06-24 Edited on 2024-12-04 In Computer Science

Actor-critic methods are still policy gradient methods. Compared to REINFORCE, actor-critic methods use TD learning to approximate the action value $q_\pi\left(s_t, a_t\right)$.

What are "actor" and "critic"?

Here, "actor" refers to policy update. It is called the actor because the policy is applied to take actions.
Here, "critic" refers to policy evaluation or value estimation. It is called critic because it criticizes the policy by evaluating it.
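The split between actor and critic can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the book's reference implementation: the chain environment `step`, the tabular softmax policy `theta`, the tabular critic `q`, and all hyperparameters are hypothetical choices made for this example. The critic uses a Sarsa-style TD update to estimate $q_\pi(s_t, a_t)$, and the actor moves the policy parameters along $\nabla_\theta \ln \pi(a_t \mid s_t, \theta)\, q(s_t, a_t)$.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action, n_states):
    """Hypothetical chain environment: action 0 moves left, action 1 moves right.

    Reaching the right-most state ends the episode with reward 1.
    """
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == n_states - 1
    reward = 1.0 if done else 0.0
    return nxt, reward, done


def train_qac(n_states=4, episodes=2000, gamma=0.9,
              alpha_actor=0.1, alpha_critic=0.2, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(n_states)]  # actor: policy parameters
    q = [[0.0, 0.0] for _ in range(n_states)]      # critic: action-value estimates
    for _ in range(episodes):
        s = 0
        a = rng.choices([0, 1], weights=softmax(theta[s]))[0]
        for _ in range(100):  # cap the episode length
            s2, r, done = step(s, a, n_states)
            if done:
                a2 = None
                target = r
            else:
                a2 = rng.choices([0, 1], weights=softmax(theta[s2]))[0]
                target = r + gamma * q[s2][a2]
            # Critic (policy evaluation): TD update of q(s, a).
            q[s][a] += alpha_critic * (target - q[s][a])
            # Actor (policy update): for a softmax policy,
            # d ln pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
            probs = softmax(theta[s])
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha_actor * grad_log * q[s][a]
            if done:
                break
            s, a = s2, a2
    return theta, q


theta, q = train_qac()
```

After training, `softmax(theta[s])` gives the learned action probabilities in state `s`, and `q[s][a]` holds the critic's estimate that was used to weight each actor step.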

Sources:

Shiyu Zhao. Chapter 10: Actor-Critic Methods. Mathematical Foundations of Reinforcement Learning.

Policy Gradient Methods

Posted on 2024-06-24 Edited on 2024-12-04 In Computer Science

We have shown that both state value functions and action value functions can be approximated by functions (see here), especially neural networks, and can be optimized by TD learning or MC learning.

In this post, we illustrate that policies can also be approximated by functions and optimized by TD learning (Actor-Critic) or MC learning (REINFORCE).

The key point of policy gradient is the following. Given an objective function $J_{\theta}(s)$ ($J_{\theta}(s)$ can be some form of cumulative rewards), the chain rule gives its derivative as \[ \frac{\partial J_{\theta}(s)}{\partial \theta} = \frac{\partial J_{\theta}(s)}{\partial \pi_{\theta}(a | s)} \frac{\partial \pi_{\theta}(a | s)}{\partial \theta} \] where $\theta$ denotes the parameters, $\pi_\theta$ is a policy parameterized by $\theta$ (it can be implemented by a neural network), $s$ is a state, and $a$ is an action. However, this derivative cannot be computed directly: $J_{\theta}$ relies on rewards, and rewards are generated by the environment, which is not differentiable.

Therefore, how can we compute $\partial J_{\theta}(s) / \partial \theta$? The answer is that we can prove \[ \nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a), \] and use it as the gradient of $J(\theta)$ (we use $J(\theta)$ to denote $J_{\theta}(s)$).
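The formula can be sanity-checked numerically in the simplest possible setting. The sketch below is not a proof and makes strong assumptions that are not in the text: there is a single state (so $\eta(s)=1$), the policy is a softmax over preferences `theta`, and the action values `q` are fixed numbers that do not depend on $\theta$. Under these assumptions $J(\theta)=\sum_a \pi(a \mid \theta)\, q(a)$, and the gradient $\sum_a \nabla_\theta \pi(a \mid \theta)\, q(a)$ should match a finite-difference estimate. All function names are hypothetical.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta, q):
    """J(theta) = sum_a pi(a | theta) * q(a) for a one-state problem."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))


def theorem_gradient(theta, q):
    """sum_a grad_theta pi(a | theta) * q(a), using the softmax Jacobian."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k))
            dpi = pi[a] * ((1.0 if a == k else 0.0) - pi[k])
            grad[k] += dpi * q[a]
    return grad


def finite_difference_gradient(theta, q, eps=1e-6):
    """Central-difference estimate of grad_theta J(theta)."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up, q) - objective(down, q)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]   # illustrative policy parameters
q = [1.0, 0.0, 2.5]        # illustrative fixed action values
analytic = theorem_gradient(theta, q)
numeric = finite_difference_gradient(theta, q)
```

In the general multi-state case $q_\pi$ itself depends on $\theta$, which is exactly why the theorem needs a proof; this check only covers the degenerate case where that dependence vanishes.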

Sources:

Shiyu Zhao. Chapter 8: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.

Proof of the Policy Gradient Theorem

Posted on 2024-06-24 Edited on 2024-12-04 In Computer Science

Here we prove the Policy gradient theorem, i.e., the gradient of an objective function $J(\theta)$ is \[ \color{orange}{\nabla_\theta J(\theta)=\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a)} \] where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$.

Moreover, this equation has a compact form expressed in terms of expectation: \[ \color{green}{\nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]}, \] where $\ln$ is the natural logarithm.
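
As a brief sketch of why the two forms agree (assuming $\pi(a \mid s, \theta) > 0$ for all $s, a$, so that the logarithm is defined), the identity $\nabla_\theta \ln \pi(a \mid s, \theta) = \nabla_\theta \pi(a \mid s, \theta) / \pi(a \mid s, \theta)$ gives \[ \begin{aligned} \nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s, \theta) q_\pi(s, a) \\ &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a \mid s, \theta) \nabla_\theta \ln \pi(a \mid s, \theta) q_\pi(s, a) \\ &= \mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)}\left[\nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]. \end{aligned} \]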

We prove this theorem in the discounted and undiscounted cases separately. In each case, we prove it for three different metrics: $\bar{v}_\pi, \bar{r}_\pi, \bar{v}_\pi^0$.

For simplicity, I only list the proof of the discounted case in the appendix. See the book for the proof of the undiscounted case.

Sources:

Shiyu Zhao. Chapter 8: Policy Gradient Methods. Mathematical Foundations of Reinforcement Learning.

Off-Policy Actor-Critic Methods

Posted on 2024-06-24 Edited on 2024-12-04 In Computer Science

The policy gradient methods that we have studied so far, including REINFORCE, QAC, and A2C, are all on-policy. The reason for this can be seen from the expression of the true gradient: \[ \nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right)\left(q_\pi(S, A)-v_\pi(S)\right)\right] . \]

To use samples to approximate this true gradient, we must generate the action samples by following $\pi(\theta)$. Hence, $\pi(\theta)$ is the behavior policy. Since $\pi(\theta)$ is also the target policy that we aim to improve, the policy gradient methods are on-policy.

In the case that we already have some samples generated by a given behavior policy, the policy gradient methods can still be applied to utilize these samples. To do that, we can employ a technique called importance sampling. It is a general technique for estimating expected values defined over one probability distribution using some samples drawn from another distribution.
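Importance sampling can be illustrated with a small Python sketch. The two distributions and the sample size here are assumptions chosen for illustration only: samples of $X \in \{+1, -1\}$ are drawn from a behavior distribution `behavior_pmf`, while the quantity of interest is the mean of $X$ under a different target distribution `target_pmf`. Reweighting each sample by the ratio of target to behavior probability corrects for the mismatch.

```python
import random


def importance_sampling_mean(samples, target_pmf, behavior_pmf):
    """Estimate E_target[X] from samples that were drawn from behavior_pmf."""
    total = 0.0
    for x in samples:
        weight = target_pmf[x] / behavior_pmf[x]  # importance weight
        total += weight * x
    return total / len(samples)


# Hypothetical distributions over X in {+1, -1}.
target_pmf = {+1: 0.5, -1: 0.5}     # true mean under the target is 0
behavior_pmf = {+1: 0.8, -1: 0.2}   # mean under the behavior is 0.6

rng = random.Random(0)
samples = rng.choices([+1, -1], weights=[0.8, 0.2], k=20000)

# The plain average estimates the behavior mean, not the target mean.
naive = sum(samples) / len(samples)
# The weighted average estimates the target mean.
corrected = importance_sampling_mean(samples, target_pmf, behavior_pmf)
```

The same idea carries over to the off-policy setting, where the weight becomes the ratio between the target policy's and the behavior policy's probability of the sampled action.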

Sources:

Shiyu Zhao. Chapter 10: Actor-Critic Methods. Mathematical Foundations of Reinforcement Learning.