Q Learning

This chapter introduces Q-learning, a Temporal-Difference (TD) learning method that estimates optimal action values directly, and hence optimal policies. Previously we illustrated TD learning of state values and of action values; those methods require a separate policy-improvement step to obtain optimal policies.

Sources:

  1. Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.

Q learning

Algorithm 7.3

Given the data/experience \(\left\{\left(s_t, a_t, r_{t+1}, s_{t+1}\right)\right\}_t\) generated by following any policy \(\pi\), the TD learning algorithm that estimates the optimal action values, also called Q-learning, is: \[ \begin{aligned} q_{t+1}\left(s_t, a_t\right) & =q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma \color{red}{\max _{a \in \mathcal{A}(s_{t+1})} q_t\left(s_{t+1}, a\right)}\right]\right] \\ q_{t+1}(s, a) & =q_t(s, a), \quad \forall(s, a) \neq\left(s_t, a_t\right) \end{aligned} \] where \(t=0,1,2, \ldots\)

  • \(q_t\left(s_t, a_t\right)\) is an estimate of the optimal action value of \(\left(s_t, a_t\right)\);

  • \(\alpha_t\left(s_t, a_t\right)\) is the learning rate depending on \(s_t, a_t\).

  • The action \(a_t\) is called the current action. It's the action from the current state that is actually executed in the environment, and whose Q-value is updated.

    The action used in \(\max _{a \in \mathcal{A}(s_{t+1})} q_t\left(s_{t+1}, a\right)\) is called the target action. It is the action with the highest Q-value at the next state, and it is used to form the TD target that updates the current action’s Q-value. Later we will see that, in DQN, the Q-value of the target action is computed by a "target network", while the "main network" provides the estimate of \(q\left(s_t, a_t\right)\).

Q-learning is very similar to Sarsa. They differ only in the TD target:

  • The TD target in Q-learning is \(\color{red}{r_{t+1}+\gamma \max _{a \in \mathcal{A}(s_{t+1})} q_t\left(s_{t+1}, a\right)}\).
  • The TD target in Sarsa is \(\color{red}{r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)}\).
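To make the two updates concrete, here is a minimal tabular sketch in Python; the array-based Q table, the function names, and the NumPy interface are my own illustration, not from the source.

```python
import numpy as np

# q is a |S| x |A| array of action-value estimates (tabular case).

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step: the TD target maximizes over next-state actions."""
    td_target = r + gamma * np.max(q[s_next])      # target action = argmax over a
    q[s, a] -= alpha * (q[s, a] - td_target)       # same form as Algorithm 7.3
    return q

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa step: the TD target uses the next action actually taken."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```

Note that the Q-learning update never needs \(a_{t+1}\); only the Sarsa update does.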

Motivation

Its motivation is to solve \[ \color{green}{q(s, a)=\mathbb{E}\left[R_{t+1}+\gamma \max _{a^{\prime} \in \mathcal{A}(S_{t+1})} q\left(S_{t+1}, a^{\prime}\right) \mid S_t=s, A_t=a\right]}, \quad \forall s, a . \]

This is the Bellman optimality equation expressed in terms of action values.
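As a sanity check, when the model is known this equation can be solved directly by fixed-point iteration on its right-hand side. The tiny MDP below (the transition tensor `P`, the reward table `R`, and all the numbers) is made up purely for illustration; Q-learning reaches the same fixed point from samples alone, without knowing `P` and `R`.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] are transition
# probabilities, R[s, a] are expected immediate rewards.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Repeatedly apply q(s, a) <- E[R + gamma * max_{a'} q(S', a')] until convergence.
q = np.zeros((2, 2))
for _ in range(1000):
    q = R + gamma * (P @ np.max(q, axis=1))

print(q)                  # converged optimal action values
print(q.argmax(axis=1))   # greedy policy obtained from them
```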

Comparison with Sarsa and MC learning

Before further studying Q-learning, we first introduce two important concepts: on-policy learning and off-policy learning.

There exist two policies in a TD learning task:

  • The behavior policy is used to generate experience samples.
  • The target policy is constantly updated toward an optimal policy.

On-policy vs off-policy:

  • When the behavior policy is the same as the target policy, the learning is called on-policy.
  • When they are different, the learning is called off-policy.

Advantages of off-policy learning:

  • It can search for optimal policies based on experience samples generated by any other policy.
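A minimal sketch of this off-policy setup, assuming a small tabular environment with a `reset()`/`step()` interface (the `env` object, its return values, and the hyperparameters are assumptions for illustration): experience is generated by a uniformly random behavior policy, while the greedy target policy is read off the learned Q table.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500, alpha=0.1, gamma=0.9):
    """Learn optimal action values from a random behavior policy (off-policy)."""
    q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = np.random.randint(num_actions)     # behavior policy: uniform random
            s_next, r, done = env.step(a)          # assumed (state, reward, done) interface
            td_target = r + gamma * np.max(q[s_next]) * (not done)
            q[s, a] -= alpha * (q[s, a] - td_target)
            s = s_next
    return q.argmax(axis=1)                        # target policy: greedy w.r.t. q
```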

Sarsa is on-policy: it estimates the action values of a given policy \(\pi\) from samples generated by that same policy \(\pi\), so the behavior policy and the target policy coincide.

MC learning is on-policy as well, since it shares the same goal as Sarsa.

Q-learning is off-policy because it solves for the optimal action values from a dataset generated by any policy \(\pi\): the behavior policy that produces the samples need not be the greedy target policy being learned.
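Because the TD target never asks how the data were collected, Q-learning can even be run over a fixed batch of previously collected transitions. A short sketch follows; the transition tuples and table sizes below are placeholders invented for illustration.

```python
import numpy as np

# A fixed, pre-collected batch of (s, a, r, s') transitions from some unknown policy.
dataset = [(0, 1, 1.0, 1), (1, 0, 0.5, 0), (1, 1, 2.0, 1)]

q = np.zeros((2, 2))
alpha, gamma = 0.1, 0.9
for _ in range(200):                     # sweep the same experience repeatedly
    for s, a, r, s_next in dataset:
        td_target = r + gamma * np.max(q[s_next])
        q[s, a] -= alpha * (q[s, a] - td_target)

# State-action pairs never present in the data keep their initial values.
greedy_policy = q.argmax(axis=1)         # greedy policy w.r.t. the learned values
```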