This chapter introduces temporal-difference (TD) methods for reinforcement learning. Like Monte Carlo (MC) learning, TD learning is model-free, but its incremental form gives it some advantages.
The goal of both TD learning and MC learning is policy evaluation: given a dataset generated by a policy \(\pi\), estimate the state values (or action values) of \(\pi\). They must be combined with a policy improvement step to obtain optimal policies.
We elaborate two TD learning methods: one for estimating state values and one for estimating action values; the latter algorithm is called Sarsa.
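To make the idea concrete, here is a minimal sketch of tabular TD(0) policy evaluation on a hypothetical two-state chain MDP (the environment, step size, and episode count are illustrative assumptions, not from the source): state 0 deterministically moves to state 1 with reward 0, and state 1 moves to a terminal state with reward 1.

```python
def td0_chain(alpha=0.1, gamma=0.9, episodes=1000):
    """Tabular TD(0) on a toy chain: 0 -> 1 -> terminal (state 2)."""
    v = [0.0, 0.0, 0.0]  # value estimates for states 0, 1, and terminal
    for _ in range(episodes):
        s = 0
        while s != 2:
            s_next = s + 1                    # fixed policy: always move right
            r = 1.0 if s_next == 2 else 0.0   # reward 1 on entering terminal
            # TD(0) update: move v(s) toward the TD target r + gamma * v(s')
            v[s] += alpha * (r + gamma * v[s_next] - v[s])
            s = s_next
    return v

values = td0_chain()
```

Under this toy setup the estimates converge to the true values \(v(1) = 1\) and \(v(0) = \gamma \cdot 1 = 0.9\). Note that each update uses only the current transition, which is what makes TD learning incremental, unlike MC learning, which waits for complete episodes.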
In the next post, we will introduce Q-learning, which is very similar to Sarsa but directly estimates optimal action values and hence optimal policies.
Sources:
- Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.