Off-Policy Actor-Critic Methods
The policy gradient methods that we have studied so far, including REINFORCE, QAC, and A2C, are all on-policy. The reason for this can be seen from the expression of the true gradient: \[ \nabla_\theta J(\theta)=\mathbb{E}_{S \sim \eta,\, A \sim \pi}\left[\nabla_\theta \ln \pi(A \mid S, \theta)\left(q_\pi(S, A)-v_\pi(S)\right)\right]. \]
To use samples to approximate this true gradient, we must generate the action samples by following \(\pi(\theta)\). Hence, \(\pi(\theta)\) is the behavior policy. Since \(\pi(\theta)\) is also the target policy that we aim to improve, the policy gradient methods are on-policy.
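The on-policy requirement is easy to see in a plain stochastic-gradient implementation. The sketch below is a minimal illustration, assuming a tabular softmax policy and using fixed, made-up advantage values in place of a learned critic; the key point is that the action used in each update must be drawn from the current policy \(\pi(\cdot \mid S, \theta)\) itself:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))   # tabular softmax policy parameters

def pi(s, theta):
    """Softmax policy pi(.|s, theta)."""
    prefs = theta[s]
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def grad_log_pi(s, a, theta):
    """Gradient of ln pi(a|s, theta) w.r.t. theta for a tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -pi(s, theta)
    g[s, a] += 1.0
    return g

# Hypothetical advantage values q_pi(s,a) - v_pi(s); in practice these come
# from a critic. Fixed random numbers are used here purely for illustration.
advantage = rng.normal(size=(n_states, n_actions))

alpha = 0.1
for _ in range(1000):
    s = rng.integers(n_states)                 # S ~ eta (stand-in state distribution)
    a = rng.choice(n_actions, p=pi(s, theta))  # A ~ pi(.|S, theta): on-policy sampling
    theta += alpha * grad_log_pi(s, a, theta) * advantage[s, a]
```

Because the action in each update is sampled from \(\pi(\theta)\), replacing that sampling step with data from another policy would bias the gradient estimate unless the samples are reweighted.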
If, however, we already have samples generated by a given behavior policy, the policy gradient methods can still be applied to utilize them. To do that, we can employ a technique called importance sampling, a general technique for estimating an expected value defined over one probability distribution using samples drawn from another distribution.
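The idea can be stated concretely. For a random variable \(X\) with distribution \(p_0\) and any function \(f\), \[ \mathbb{E}_{X \sim p_0}[f(X)] = \mathbb{E}_{X \sim p_1}\!\left[\frac{p_0(X)}{p_1(X)} f(X)\right], \] provided \(p_1(x) > 0\) wherever \(p_0(x) > 0\). Hence, samples drawn from \(p_1\) can estimate an expectation under \(p_0\), as long as each sample \(x\) is weighted by the importance weight \(p_0(x)/p_1(x)\). The following sketch uses a toy pair of discrete distributions, chosen here only for illustration, to show the idea numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup for illustration: X takes values +1 and -1.
values = np.array([+1.0, -1.0])
p0 = np.array([0.5, 0.5])   # distribution of interest
p1 = np.array([0.8, 0.2])   # behavior distribution that actually generates the samples

# Draw samples from p1, not from p0.
x_idx = rng.choice(2, size=100_000, p=p1)
x = values[x_idx]

# A naive average of the samples estimates E_{p1}[X], not E_{p0}[X].
naive = x.mean()

# Weighting each sample by p0(x)/p1(x) recovers an estimate of E_{p0}[X].
weights = p0[x_idx] / p1[x_idx]
is_estimate = (weights * x).mean()

print("E_p0[X] = 0.0, E_p1[X] = 0.6")
print(f"naive sample mean        : {naive:.3f}")        # close to 0.6
print(f"importance-sampling mean : {is_estimate:.3f}")  # close to 0.0
```

In the off-policy actor-critic setting, the same weighting is applied to the gradient samples: actions generated by the behavior policy are reweighted by the ratio of the target policy's probability to the behavior policy's probability.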