Q Learning

This chapter introduces Q-learning, a Temporal-Difference (TD) learning method that estimates optimal action values, and hence optimal policies, directly. Previously we illustrated TD learning of state values and of action values; those methods require a separate policy-improvement step to obtain optimal policies.

Sources:

  1. Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.

Q learning

Algorithm 7.3

Given the experience samples $\{(s_t, a_t, r_{t+1}, s_{t+1})\}_t$ generated by following any policy $\pi$, the TD learning algorithm that estimates the optimal action values, known as Q-learning, is

$$
\begin{aligned}
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Bigl[q_t(s_t, a_t) - \bigl(r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a)\bigr)\Bigr], \\
q_{t+1}(s, a) &= q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t),
\end{aligned}
$$

where $t = 0, 1, 2, \dots$

  • $q_t(s_t, a_t)$ is an estimate of the optimal action value $q^*(s_t, a_t)$;

  • $\alpha_t(s_t, a_t)$ is the learning rate, which depends on $(s_t, a_t)$.

  • The action $a_t$ is called the current action. It is the action taken at the current state that is actually executed in the environment, and whose Q-value is updated.

    The action that attains $\max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a)$ is called the target action. It is the action with the highest Q-value at the next state, and it is used to form the TD target for the current action's Q-value. Later we will see that, in DQN, the Q-value of the target action is computed by a "target network", not by the "main network" that produces the estimate $q_t(s_t, a_t)$.

Q-learning is very similar to Sarsa. They differ only in their TD targets:

  • The TD target in Q-learning is $r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a)$.
  • The TD target in Sarsa is $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$.
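To make the difference concrete, here is a minimal tabular sketch of the two update rules in Python; the array layout, function names, and arguments are illustrative assumptions, not from the source.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning update for the transition (s, a, r, s_next).

    q is a 2-D array of shape (num_states, num_actions); alpha plays the
    role of the learning rate alpha_t(s, a) and gamma is the discount factor.
    """
    # TD target uses the greedy (maximizing) action at the next state.
    td_target = r + gamma * np.max(q[s_next])
    # Move the current estimate toward the TD target.
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Same structure, but the TD target uses the action a_next that the
    behavior policy actually takes at s_next (on-policy)."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```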

Motivation

Its motivation is to solve

$$
q(s, a) = \mathbb{E}\Bigl[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \Bigm| S_t = s, A_t = a\Bigr], \quad \forall s, a.
$$

This is the Bellman optimality equation expressed in terms of action values.
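As a brief supporting step (not spelled out in the source), this follows from the standard relation $v^*(s) = \max_a q^*(s, a)$ between optimal state values and optimal action values:

$$
\begin{aligned}
q^*(s, a) &= \mathbb{E}\bigl[R_{t+1} + \gamma\, v^*(S_{t+1}) \mid S_t = s, A_t = a\bigr] \\
          &= \mathbb{E}\Bigl[R_{t+1} + \gamma \max_{a'} q^*(S_{t+1}, a') \Bigm| S_t = s, A_t = a\Bigr], \quad \forall s, a.
\end{aligned}
$$

The Q-learning update can thus be viewed as a stochastic approximation scheme for solving this equation, whose solution is the optimal action value function $q^*$.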

Comparison with Sarsa and MC learning

Before further studying Q-learning, we first introduce two important concepts: on-policy learning and off-policy learning.

There exist two policies in a TD learning task:

  • The behavior policy is used to generate experience samples.
  • The target policy is constantly updated toward an optimal policy.

On-policy vs off-policy:

  • When the behavior policy is the same as the target policy, such kind of learning is called on-policy.
  • When they are different, the learning is called off-policy.

Advantages of off-policy learning:

  • It can search for optimal policies based on experience samples generated by any other policy.

Sarsa is on-policy, since it estimates the action value function of a given policy π from samples generated by π itself: the behavior policy and the target policy coincide.

MC learning is on-policy as well, since it shares the same goal as Sarsa.

Q-learning is off-policy because it solves for the optimal action value function from a dataset generated by any policy π: the behavior policy that generates the data need not be the greedy target policy being learned.
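As an illustration of this off-policy property, here is a self-contained sketch on a hypothetical 5-state chain MDP (an assumed toy example, not from the source): the behavior policy is uniformly random, yet the greedy policy extracted from the learned Q table is optimal.

```python
import numpy as np

# Toy chain MDP: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode.
NUM_STATES, NUM_ACTIONS = 5, 2
GAMMA, ALPHA = 0.9, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, NUM_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == NUM_STATES - 1 else 0.0
    done = s_next == NUM_STATES - 1
    return s_next, reward, done

q = np.zeros((NUM_STATES, NUM_ACTIONS))
for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.integers(NUM_ACTIONS)          # behavior policy: uniform random
        s_next, r, done = step(s, a)
        # Terminal states are treated as having value zero.
        td_target = r + GAMMA * (0.0 if done else np.max(q[s_next]))
        q[s, a] -= ALPHA * (q[s, a] - td_target)
        s = s_next

greedy_policy = q.argmax(axis=1)               # target policy: greedy w.r.t. q
# Expect action 1 (move right) in the non-terminal states 0-3.
print("learned greedy actions per state:", greedy_policy)
```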