Q Learning

Sources:

  1. Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013), the original DQN paper
  2. Reinforcement Learning Explained Visually (Part 4): Q Learning, step-by-step by Ketan Doshi

Other useful resources:

  1. OpenAI Gym and Q-Learning by Alexander Van de Kleut

# The Q-learning algorithm

The Q-learning algorithm uses a Q-table of State-Action Values (also called Q-values). This Q-table has a row for each state and a column for each action. Each cell contains the estimated Q-value for the corresponding state-action pair.
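
For an environment with a small, discrete set of states and actions, the Q-table is just a 2-D array. Below is a minimal sketch in NumPy; the values of `n_states` and `n_actions` are illustrative placeholders, not taken from the sources above.

```python
import numpy as np

n_states = 16    # e.g. a 4x4 grid world (illustrative)
n_actions = 4    # e.g. up / down / left / right (illustrative)

# One row per state, one column per action. Q[s, a] holds the current
# estimate of the value of taking action a in state s.
Q = np.zeros((n_states, n_actions))
```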

Steps:

  1. We start by initializing all the Q-values to zero. As the agent interacts with the environment and receives feedback, the algorithm iteratively improves these Q-values using the Bellman equation until they converge to the optimal Q-values.

  2. Then we use the ε-greedy policy to pick the current action from the current state: with probability ε we explore by taking a random action, and otherwise we exploit by taking the action with the highest Q-value in that state.

  3. We execute this action in the environment and receive feedback in the form of a reward and the next state.

  4. Now the algorithm has to use a Q-value from the next state in order to update its estimated Q-value (call it Q1) for the current state and selected action.

    And here is where the Q-learning algorithm uses its clever trick. The next state has several possible actions, so which Q-value does it use? It uses the action from the next state that has the highest Q-value. What is critical to note is that this action is treated purely as a target action, used only for the update to Q1. It is not necessarily the action the agent will actually execute from the next state when it reaches the next time step.

    \[ Q(s_t, a_t) \gets (1-\alpha)\,Q(s_t, a_t) + \alpha \left( r_t + (1-d_t)\,\gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right) \]

    Here \(d_t\) is the done flag: \(d_t = 1\) means that the state \(s_{t+1}\) is terminal. The \((1-d_t)\) term then makes the target just \(r_t\), which equals the discounted return \(G_t\), since all rewards after a terminal state are 0. In some situations this can help accelerate learning. (A code sketch of the full loop follows this list.)
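
Putting steps 1-4 together, here is a minimal sketch of the tabular Q-learning loop. It assumes a Gymnasium environment with discrete state and action spaces (FrozenLake-v1 is used only for concreteness), and the hyperparameters `alpha`, `gamma`, and `epsilon` are illustrative choices rather than values taken from the sources above.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n

# Step 1: initialize every Q-value to zero.
Q = np.zeros((n_states, n_actions))

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Step 2: epsilon-greedy selection of the current action.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()      # explore
        else:
            action = int(np.argmax(Q[state]))       # exploit

        # Step 3: execute the action, observe the reward and the next state.
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Step 4: update Q[state, action] using the target action, i.e. the
        # action with the highest Q-value in the next state. The (1 - d)
        # factor drops the bootstrap term when next_state is terminal.
        d = 1.0 if terminated else 0.0
        target = reward + (1.0 - d) * gamma * np.max(Q[next_state])
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target

        state = next_state
```

Note that only `terminated` feeds the \((1-d_t)\) term: truncation by a time limit does not make \(s_{t+1}\) a true terminal state, so bootstrapping from it is still appropriate.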

In other words, there are two actions involved:

  1. Current action — the action from the current state that is actually executed in the environment, and whose Q-value is updated.
  2. Target action — the action with the highest Q-value from the next state, used only to update the current action’s Q-value (see the snippet after this list).
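
To make this distinction concrete, here is a short continuation of the sketch above (the names `Q`, `next_state`, `epsilon`, `env`, and `np` are reused from it):

```python
# Target action: used only to form the update target for Q[state, action].
target_action = int(np.argmax(Q[next_state]))

# Action actually executed at the next time step: re-chosen epsilon-greedily,
# so with probability epsilon it is a random action rather than target_action.
if np.random.rand() < epsilon:
    next_executed_action = env.action_space.sample()
else:
    next_executed_action = int(np.argmax(Q[next_state]))
```

The update target always assumes the greedy choice in the next state, while the action actually executed there is still drawn ε-greedily; this separation between the behaviour action and the greedy target action is what is meant by calling Q-learning off-policy.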