Off-Policy Actor-Critic Methods

Posted on 2024-06-24 Edited on 2025-06-17 In Computer Science

The policy gradient methods that we have studied so far, including REINFORCE, QAC, and A2C, are all on-policy. The reason can be seen from the expression of the true gradient: $\nabla_{θ} J (θ) = E_{S \sim η, A \sim π} [\nabla_{θ} \ln π (A ∣ S, θ) (q_{π} (S, A) - v_{π} (S))] .$

To use samples to approximate this true gradient, we must generate the action samples by following $π (θ)$ . Hence, $π (θ)$ is the behavior policy. Since $π (θ)$ is also the target policy that we aim to improve, the policy gradient methods are on-policy.

In the case that we already have some samples generated by a given behavior policy, the policy gradient methods can still be applied to utilize these samples. To do that, we can employ a technique called importance sampling. It is a general technique for estimating expected values defined over one probability distribution using some samples drawn from another distribution.

Sources:

Shiyu Zhao. Chapter 10: Actor-Critic Methods. Mathematical Foundations of Reinforcement Learning.

Importance sampling

We next introduce the importance sampling technique. Consider a random variable $X \in \mathcal{X}$ . Suppose that $p_{0} (X)$ is a probability distribution. Our goal is to estimate $E_{X \sim p_{0}} [X]$ . Suppose that we have some i.i.d. samples $\{x_{i}\}_{i = 1}^{n}$ .

Suppose the samples $\{x_{i}\}_{i = 1}^{n}$ are not generated by $p_{0}$ . Instead, they are generated by another distribution $p_{1}$ . Can we still use these samples to approximate $E_{X \sim p_{0}} [X]$ ? The answer is yes. $E_{X \sim p_{0}} [X]$ can be approximated based on the importance sampling technique.

In particular, assuming $p_{1} (x) > 0$ whenever $p_{0} (x) > 0$ , $E_{X \sim p_{0}} [X]$ satisfies $E_{X \sim p_{0}} [X] = \sum_{x \in \mathcal{X}} p_{0} (x) x = \sum_{x \in \mathcal{X}} p_{1} (x) \underbrace{\frac{p_{0} (x)}{p_{1} (x)} x}_{f (x)} = E_{X \sim p_{1}} [f (X)] .$

Thus, estimating $E_{X \sim p_{0}} [X]$ becomes the problem of estimating $E_{X \sim p_{1}} [f (X)]$ . Let $\bar{f} ≐ \frac{1}{n} \sum_{i = 1}^{n} f (x_{i}) .$

By the law of large numbers, $\bar{f}$ approximates $E_{X \sim p_{1}} [f (X)]$ when $n$ is large, so we obtain $E_{X \sim p_{0}} [X] = E_{X \sim p_{1}} [f (X)] \approx \bar{f} = \frac{1}{n} \sum_{i = 1}^{n} f (x_{i}) = \frac{1}{n} \sum_{i = 1}^{n} \underbrace{\frac{p_{0} (x_{i})}{p_{1} (x_{i})}}_{\text{importance weight}} x_{i} .$

Here, $\frac{p_{0} (x_{i})}{p_{1} (x_{i})}$ is called the importance weight. When $p_{1} = p_{0}$ , the importance weight is 1 and $\bar{f}$ reduces to the ordinary sample mean $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$ . When $p_{0} (x_{i}) > p_{1} (x_{i})$ , the value $x_{i}$ is sampled more frequently under $p_{0}$ than under $p_{1}$ . In this case, the importance weight is greater than one and emphasizes the importance of this sample.
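The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the source: the two-point distributions `p0` and `p1`, the sample size, and the function name `importance_sampling_mean` are all assumptions chosen for the example.

```python
import numpy as np

def importance_sampling_mean(x, p0_x, p1_x):
    """Estimate E_{X~p0}[X] from samples x drawn under p1.

    p0_x[i] and p1_x[i] are the probabilities p0(x_i) and p1(x_i).
    """
    weights = np.asarray(p0_x) / np.asarray(p1_x)  # importance weights
    return float(np.mean(weights * np.asarray(x, dtype=float)))

rng = np.random.default_rng(0)
values = np.array([1.0, -1.0])
p0 = np.array([0.5, 0.5])  # target distribution (assumed), true mean 0
p1 = np.array([0.8, 0.2])  # sampling distribution (assumed), mean 0.6

idx = rng.choice(2, size=20000, p=p1)  # samples are generated by p1, not p0
x = values[idx]

plain_mean = float(x.mean())  # approximates E_{X~p1}[X]
is_mean = importance_sampling_mean(x, p0[idx], p1[idx])  # approximates E_{X~p0}[X]
print(plain_mean, is_mean)
```

Note that the function only needs $p_{0} (x_{i})$ and $p_{1} (x_{i})$ at the sampled points, not the full distributions.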

You may ask: since $p_{0} (x)$ is required to evaluate the estimator $\frac{1}{n} \sum_{i = 1}^{n} \frac{p_{0} (x_{i})}{p_{1} (x_{i})} x_{i} \qquad (1)$ anyway, why do we not directly calculate $E_{X \sim p_{0}} [X]$ using its definition $E_{X \sim p_{0}} [X] = \sum_{x \in \mathcal{X}} p_{0} (x) x$ ?

The answer is as follows. To use the definition, we need to know either the analytical expression of $p_{0}$ or the value of $p_{0} (x)$ for every $x \in \mathcal{X}$ . However, it is difficult to obtain the analytical expression of $p_{0}$ when the distribution is represented by, for example, a neural network. It is also difficult to obtain the value of $p_{0} (x)$ for every $x \in \mathcal{X}$ when $\mathcal{X}$ is large. By contrast, $(1)$ merely requires the values of $p_{0} (x_{i})$ for some samples and is much easier to implement in practice.

The off-policy policy gradient theorem

Like the previous on-policy case, we need to derive the policy gradient in the off-policy case.

Suppose $β$ is the behavior policy that generates experience samples. Our aim is to use these samples to update a target policy $π$ that maximizes the metric $J (θ) = \sum_{s \in S} d_{β} (s) v_{π} (s) = E_{S \sim d_{β}} [v_{π} (S)],$ where $d_{β}$ is the stationary distribution under policy $β$ and $v_{π}$ is the state value under policy $π$ .

The gradient of this metric is given in the following theorem.

Theorem 10.1 (Off-policy policy gradient theorem). In the discounted case where $γ \in (0, 1)$ , the gradient of $J (θ)$ is $\nabla_{θ} J (θ) = E_{S \sim ρ, A \sim β} [\underbrace{\frac{π (A ∣ S, θ)}{β (A ∣ S)}}_{\text{importance weight}} \nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)],$ where the state distribution $ρ$ is $ρ (s) ≐ \sum_{s^{'} \in S} d_{β} (s^{'}) \Pr_{π} (s ∣ s^{'}), s \in S,$ and $\Pr_{π} (s ∣ s^{'}) = \sum_{k = 0}^{\infty} γ^{k} {[P_{π}^{k}]}_{s^{'} s} = {[{(I - γ P_{π})}^{- 1}]}_{s^{'} s}$ is the discounted total probability of transitioning from $s^{'}$ to $s$ under policy $π$ .

The gradient $\nabla_{θ} J (θ)$ is similar to that in the on-policy case in Theorem 9.1, but there are two differences. The first difference is the importance weight. The second difference is that $A \sim β$ instead of $A \sim π$ . Therefore, we can use the action samples generated by following $β$ to approximate the true gradient. The proof of the theorem is given in the appendix.
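To make the state distribution $ρ$ in Theorem 10.1 concrete, the following sketch computes it for a small example. The two-state transition matrix `P_pi`, the distribution `d_beta`, and the function name are hypothetical values chosen for illustration; only the formula $ρ^{T} = d_{β}^{T} {(I - γ P_{π})}^{- 1}$ comes from the theorem.

```python
import numpy as np

def discounted_state_distribution(d_beta, P_pi, gamma):
    """Return rho with rho(s) = sum_{s'} d_beta(s') [(I - gamma P_pi)^{-1}]_{s's}."""
    n = P_pi.shape[0]
    # rho^T = d_beta^T (I - gamma P_pi)^{-1}  <=>  (I - gamma P_pi)^T rho = d_beta
    return np.linalg.solve((np.eye(n) - gamma * P_pi).T, d_beta)

# Hypothetical example: state transition matrix under pi and stationary
# distribution under beta for a two-state problem.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
d_beta = np.array([0.5, 0.5])
gamma = 0.9

rho = discounted_state_distribution(d_beta, P_pi, gamma)
print(rho, rho.sum())
```

Because each row of $P_{π}$ sums to one, the entries of $ρ$ sum to $1 / (1 - γ)$ rather than to one, so $ρ$ is a discounted visitation weighting rather than a normalized probability distribution.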

The off-policy policy gradient is also invariant to a baseline $b (s)$ . In particular, we have

$\nabla_{θ} J (θ) = E_{S \sim ρ, A \sim β} [\frac{π (A ∣ S, θ)}{β (A ∣ S)} \nabla_{θ} \ln π (A ∣ S, θ) (q_{π} (S, A) - b (S))]$ because $E [\frac{π (A ∣ S, θ)}{β (A ∣ S)} \nabla_{θ} \ln π (A ∣ S, θ) b (S)] = 0 .$
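The baseline invariance can be checked numerically at a single state. The sketch below assumes a softmax policy over three actions, which is an illustrative parameterization not specified in the text, and computes the expectation over $A \sim β$ exactly by summing over actions. The values of `theta`, `beta`, `q`, and `b` are arbitrary.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_softmax(theta):
    """Row a holds the gradient of ln pi(a) with respect to theta: e_a - pi."""
    pi = softmax(theta)
    return np.eye(len(theta)) - pi

def weighted_gradient(theta, beta, signal):
    """Exact E_{A~beta}[(pi(A)/beta(A)) * grad ln pi(A) * signal(A)] at one state."""
    pi = softmax(theta)
    weights = pi / beta  # importance weights
    return (beta * weights * signal) @ grad_log_softmax(theta)

theta = np.array([0.2, -0.5, 1.0])  # assumed policy parameters
beta = np.array([0.5, 0.3, 0.2])    # assumed behavior policy at this state
q = np.array([1.0, 2.0, -0.5])      # assumed action values
b = 0.7                             # arbitrary baseline value b(s)

baseline_term = weighted_gradient(theta, beta, np.full(3, b))
without_baseline = weighted_gradient(theta, beta, q)
with_baseline = weighted_gradient(theta, beta, q - b)
print(baseline_term, without_baseline, with_baseline)
```

The baseline term vanishes because $β$ cancels against the importance weight, leaving $b (s) \sum_{a} \nabla_{θ} π (a ∣ s, θ) = b (s) \nabla_{θ} 1 = 0$ .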

To reduce the estimation variance, we select the baseline as $b (S) = v_{π} (S)$ , as in A2C, and obtain $\nabla_{θ} J (θ) = E [\frac{π (A ∣ S, θ)}{β (A ∣ S)} \nabla_{θ} \ln π (A ∣ S, θ) (q_{π} (S, A) - v_{π} (S))] .$

The corresponding stochastic gradient-ascent algorithm is $θ_{t + 1} = θ_{t} + α_{θ} \frac{π (a_{t} ∣ s_{t}, θ_{t})}{β (a_{t} ∣ s_{t})} \nabla_{θ} \ln π (a_{t} ∣ s_{t}, θ_{t}) (q_{t} (s_{t}, a_{t}) - v_{t} (s_{t})),$ where $α_{θ} > 0$ is the learning rate, and $q_{t}$ and $v_{t}$ are the estimates of $q_{π}$ and $v_{π}$ at time $t$ .

Similar to the on-policy case, the advantage can be approximated by the TD error: $q_{t} (s_{t}, a_{t}) - v_{t} (s_{t}) \approx r_{t + 1} + γ v_{t} (s_{t + 1}) - v_{t} (s_{t}) ≐ δ_{t} (s_{t}, a_{t}) .$

Then, the algorithm becomes $θ_{t + 1} = θ_{t} + α_{θ} \frac{π (a_{t} ∣ s_{t}, θ_{t})}{β (a_{t} ∣ s_{t})} \nabla_{θ} \ln π (a_{t} ∣ s_{t}, θ_{t}) δ_{t} (s_{t}, a_{t}) .$ Since $π \nabla_{θ} \ln π = \nabla_{θ} π$ , this can be rewritten as $θ_{t + 1} = θ_{t} + α_{θ} (\frac{δ_{t} (s_{t}, a_{t})}{β (a_{t} ∣ s_{t})}) \nabla_{θ} π (a_{t} ∣ s_{t}, θ_{t}) .$
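A single step of this actor update might look as follows. This is a sketch under stated assumptions: the tabular softmax policy, the array layout of `theta` and `v`, and the function name `off_policy_actor_step` are hypothetical, and the value estimates `v` are taken as given because the text above does not specify how the critic is updated.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def off_policy_actor_step(theta, v, s, a, r, s_next, beta_prob, alpha_theta, gamma):
    """One off-policy actor update for a tabular softmax policy (assumed).

    theta:     array (num_states, num_actions) of action preferences
    v:         array (num_states,) of current state-value estimates v_t
    beta_prob: beta(a | s), the behavior policy's probability of the taken action
    """
    pi_s = softmax(theta[s])
    delta = r + gamma * v[s_next] - v[s]  # TD error delta_t
    grad_log_pi = -pi_s                   # gradient of ln pi(a|s) w.r.t. theta[s] ...
    grad_log_pi[a] += 1.0                 # ... equals e_a - pi(.|s) for softmax
    weight = pi_s[a] / beta_prob          # importance weight pi / beta
    new_theta = theta.copy()
    new_theta[s] += alpha_theta * weight * delta * grad_log_pi
    return new_theta, delta

# Hypothetical transition (s=0, a=1, r=1, s'=1) generated by a behavior policy
# that chose a=1 with probability 0.5.
theta = np.zeros((2, 3))
v = np.array([0.0, 1.0])
new_theta, delta = off_policy_actor_step(
    theta, v, s=0, a=1, r=1.0, s_next=1, beta_prob=0.5, alpha_theta=0.1, gamma=0.9
)
print(delta)
print(new_theta)
```

Only the row of `theta` for the visited state changes. With a positive TD error, the preference for the taken action increases and the preferences for the other actions decrease.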

Appendix

Proof of Theorem 10.1

Since $d_{β}$ is independent of $θ$ , the gradient of $J (θ)$ satisfies $\nabla_{θ} J (θ) = \nabla_{θ} \sum_{s \in S} d_{β} (s) v_{π} (s) = \sum_{s \in S} d_{β} (s) \nabla_{θ} v_{π} (s) .$

According to Lemma 9.2, the expression of $\nabla_{θ} v_{π} (s)$ is $\nabla_{θ} v_{π} (s) = \sum_{s^{'} \in S} \Pr_{π} (s^{'} ∣ s) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a),$ where $\Pr_{π} (s^{'} ∣ s) ≐ \sum_{k = 0}^{\infty} γ^{k} {[P_{π}^{k}]}_{s s^{'}} = {[{(I_{n} - γ P_{π})}^{- 1}]}_{s s^{'}} .$

Substituting $\nabla_{θ} v_{π} (s)$ into $\sum_{s \in S} d_{β} (s) \nabla_{θ} v_{π} (s)$ yields $\begin{aligned} \nabla_{θ} J (θ) = \sum_{s \in S} d_{β} (s) \nabla_{θ} v_{π} (s) & = \sum_{s \in S} d_{β} (s) \sum_{s^{'} \in S} \Pr_{π} (s^{'} ∣ s) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) \\ & = \sum_{s^{'} \in S} (\sum_{s \in S} d_{β} (s) \Pr_{π} (s^{'} ∣ s)) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) \\ & ≐ \sum_{s^{'} \in S} ρ (s^{'}) \sum_{a \in A} \nabla_{θ} π (a ∣ s^{'}, θ) q_{π} (s^{'}, a) \\ & = \sum_{s \in S} ρ (s) \sum_{a \in A} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a) \quad \text{(change } s^{'} \text{ to } s \text{)} \\ & = E_{S \sim ρ} [\sum_{a \in A} \nabla_{θ} π (a ∣ S, θ) q_{π} (S, a)], \end{aligned}$

where the step marked $≐$ uses the definition $ρ (s) ≐ \sum_{s^{'} \in S} d_{β} (s^{'}) \Pr_{π} (s ∣ s^{'}), s \in S .$

By using the importance sampling technique, the above equation can be further rewritten as $\begin{aligned} E_{S \sim ρ} [\sum_{a \in A} \nabla_{θ} π (a ∣ S, θ) q_{π} (S, a)] & = E_{S \sim ρ} [\sum_{a \in A} β (a ∣ S) \frac{π (a ∣ S, θ)}{β (a ∣ S)} \frac{\nabla_{θ} π (a ∣ S, θ)}{π (a ∣ S, θ)} q_{π} (S, a)] \\ & = E_{S \sim ρ} [\sum_{a \in A} β (a ∣ S) \frac{π (a ∣ S, θ)}{β (a ∣ S)} \nabla_{θ} \ln π (a ∣ S, θ) q_{π} (S, a)] \\ & = E_{S \sim ρ, A \sim β} [\frac{π (A ∣ S, θ)}{β (A ∣ S)} \nabla_{θ} \ln π (A ∣ S, θ) q_{π} (S, A)] . \end{aligned}$

The proof is complete. The above proof is similar to that of Theorem 9.1.
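The importance sampling step of the proof can also be checked by simulation at a single state. The sketch below is illustrative only: it assumes a softmax policy over three actions with arbitrary values for `theta`, `beta`, and `q`, samples actions from $β$ , and compares the weighted sample average with the exact sum $\sum_{a} \nabla_{θ} π (a ∣ s, θ) q_{π} (s, a)$ .

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.2, -0.5, 1.0])  # assumed policy parameters at one state
beta = np.array([0.5, 0.3, 0.2])    # assumed behavior policy at this state
q = np.array([1.0, 2.0, -0.5])      # assumed action values

pi = softmax(theta)

# Exact left-hand side: sum_a grad pi(a) q(a), with grad pi(a) = pi(a) (e_a - pi).
exact = (pi * q) @ (np.eye(3) - pi)

# Monte Carlo right-hand side: actions are drawn from beta, not from pi.
rng = np.random.default_rng(0)
actions = rng.choice(3, size=200000, p=beta)
grad_log_pi = np.eye(3)[actions] - pi             # one row per sampled action
coef = (pi[actions] / beta[actions]) * q[actions]  # importance weight times q
estimate = (coef[:, None] * grad_log_pi).mean(axis=0)

print(exact)
print(estimate)
```

The two vectors should agree up to sampling noise, even though no action was sampled from the target policy $π$ .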