This chapter introduces temporal-difference (TD) methods for reinforcement learning. Like Monte Carlo (MC) learning, TD learning is model-free, but its incremental form gives it some advantages.
The goal of both TD learning and MC learning is policy evaluation: given a dataset generated by a policy $\pi$, estimate the state value $v_\pi$ (or action value $q_\pi$) of that policy. To obtain optimal policies, they must be combined with a policy improvement step.
We elaborate on two TD learning methods: one for estimating state values and one for estimating action values; the latter is called Sarsa.
In the next post, we will introduce Q-learning, which is very similar to Sarsa but directly estimates optimal action values and hence optimal policies.
Next, we present some stochastic estimation problems and show how to use the Robbins-Monro (RM) algorithm to solve them.
First, consider the simple mean estimation problem: calculate $w = \mathbb{E}[X]$ based on some i.i.d. samples $\{x\}$ of $X$.
- By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem: $g(w) = 0$.
- Since we can only obtain samples $x$ of $X$, the noisy observation is
$$\tilde{g}(w, \eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta.$$
- Then, according to the last lecture, the RM algorithm for solving $g(w) = 0$ is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k (w_k - x_k).$$
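As a minimal sketch of this first problem (assuming, purely for illustration, that $X \sim \mathcal{N}(5, 1)$), the RM iteration with $\alpha_k = 1/k$ reduces to the incremental sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: X ~ N(5, 1), so the true mean E[X] is 5.
w = 0.0
for k in range(1, 10_001):
    x_k = rng.normal(5.0, 1.0)   # one i.i.d. sample of X
    a_k = 1.0 / k                # step size; with 1/k the iterate is the running sample mean
    w = w - a_k * (w - x_k)      # RM update for g(w) = w - E[X]

print(w)  # approximately 5.0
```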
Second, consider a slightly more complex problem: estimate the mean of a function $v(X)$, that is, $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.
- To solve this problem, we define $g(w) = w - \mathbb{E}[v(X)]$ and the noisy observation $\tilde{g}(w, \eta) = w - v(x)$.
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \big[w_k - v(x_k)\big].$$
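The same sketch carries over with $v(X)$ in place of $X$; as an assumed example, take $v(x) = x^2$ and $X \sim \mathcal{N}(0, 1)$, so the true value is $\mathbb{E}[X^2] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def v(x):
    return x ** 2   # the function whose mean we want (illustrative choice)

w = 0.0
for k in range(1, 10_001):
    x_k = rng.normal(0.0, 1.0)    # one i.i.d. sample of X
    a_k = 1.0 / k
    w = w - a_k * (w - v(x_k))    # RM update for g(w) = w - E[v(X)]

print(w)  # approximately 1.0
```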
Third, consider an even more complex problem: calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R$ and $X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.
- Suppose we can obtain samples $r$ and $x$ of $R$ and $X$. We define $g(w) = w - \mathbb{E}[R + \gamma v(X)]$ and the noisy observation $\tilde{g}(w, \eta) = w - \big[r + \gamma v(x)\big]$.
- Then, the problem becomes a root-finding problem: $g(w) = 0$. The corresponding RM algorithm is
$$w_{k+1} = w_k - \alpha_k \big[w_k - \big(r_k + \gamma v(x_k)\big)\big].$$
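A hedged sketch of this third problem, with made-up distributions for $R$ and $X$, a fixed lookup table for $v$, and $\gamma = 0.9$; under these assumptions the true answer is $\mathbb{E}[R] + \gamma \cdot \overline{v} = 1 + 0.9 \times 2 = 2.8$:

```python
import numpy as np

rng = np.random.default_rng(0)

gamma = 0.9
v = np.array([1.0, 2.0, 3.0])        # a fixed, known function v over X in {0, 1, 2}

w = 0.0
for k in range(1, 20_001):
    r_k = rng.normal(1.0, 0.5)       # sample of R (illustrative: E[R] = 1)
    x_k = rng.integers(0, 3)         # sample of X, uniform over {0, 1, 2}
    a_k = 1.0 / k
    w = w - a_k * (w - (r_k + gamma * v[x_k]))   # RM update for g(w) = w - E[R + gamma*v(X)]

print(w)  # approximately 2.8
```

This last update already has the structure that the TD algorithms below will take, with the fixed function $v$ replaced by a running estimate.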
The Bellman expectation equation
Recall that the definition of the state value of policy $\pi$ is
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s],$$
where $G_t$ is the discounted return. Since $G_t = R_{t+1} + \gamma G_{t+1}$, where $R_{t+1}$ is the immediate reward and $S_{t+1}$ is the next state, we can rewrite $v_\pi(s)$ as
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big],$$
where $s \in \mathcal{S}$.
Meanwhile, we can extend this conclusion to action values, resulting in
$$q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big],$$
where $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.
The state-value equation above (or its action-value counterpart) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool for designing and analyzing TD algorithms.
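For reference, when the model is available, the expectation on the right-hand side can be written out explicitly (here using the joint transition notation $p(s', r \mid s, a)$, which is an assumed notation equivalent to separating the reward and state-transition probabilities):

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big], \qquad s \in \mathcal{S}.$$

TD learning never evaluates this sum; it replaces the expectation with samples, which is what makes it model-free.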
A natural derivation of the TD learning algorithm
Now, the state value of a policy $\pi$ is, according to the Bellman expectation equation, given by
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big].$$
One intuitive way to compute it is to use the RM algorithm:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)\Big].$$
Or, in the sense of action values:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_\pi(s_{t+1}, a_{t+1})\big)\Big].$$
However, the right-hand side of the first equation contains $v_\pi(s_{t+1})$ and the right-hand side of the second contains $q_\pi(s_{t+1}, a_{t+1})$; both are unknown during the approximation process. Therefore, we replace them with their current estimates $v_t(s_{t+1})$ and $q_t(s_{t+1}, a_{t+1})$. It can be proven that, under some conditions, the algorithm still converges with this replacement.
The resulting methods are called temporal-difference (TD) learning methods. Because of this replacement, TD learning is not exactly an RM algorithm, but the two are closely related.
TD learning of state values
Given the data/experience $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated by following the given policy $\pi$, the TD learning algorithm to estimate the state value function of policy $\pi$ is:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big],$$
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t,$$
where $t = 0, 1, 2, \ldots$. Here, $v_t(s_t)$ is the estimated state value of $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.
- At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.
- The second equation will be omitted when the context is clear.
Here, $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$ is called the TD target, and
$$\delta_t \doteq v_t(s_t) - \bar{v}_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$$
is called the TD error.
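A minimal tabular sketch of this update in code, assuming a hypothetical environment object with integer states, a `reset()` method, a `step(action)` method returning `(next_state, reward, done)`, and a fixed `policy(s)`; none of these names come from a particular library:

```python
import numpy as np

def td0_state_values(env, policy, gamma=0.9, alpha=0.1, num_steps=100_000):
    """Tabular TD(0) estimate of v_pi for a fixed policy (sketch).

    `env` and `policy` are assumed interfaces, not a specific library:
    env.reset() -> state, env.step(a) -> (next_state, reward, done).
    """
    v = np.zeros(env.num_states)                  # v_t(s), initialized to zero
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)                             # follow the given policy pi
        s_next, r, done = env.step(a)
        td_target = r if done else r + gamma * v[s_next]   # TD target
        td_error = v[s] - td_target                         # TD error
        v[s] -= alpha * td_error                  # update only the visited state s_t
        s = env.reset() if done else s_next
    return v
```

A constant learning rate is used here for simplicity; the convergence conditions discussed below instead call for a decaying per-state step size.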
TD learning of action values: Sarsa
Algorithm 7.1
Given the data/experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$ generated by following the given policy $\pi$, the TD learning algorithm (Sarsa) to estimate the action value function of policy $\pi$ is:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big],$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t),$$
where $t = 0, 1, 2, \ldots$, and:
- $q_t(s_t, a_t)$ is an estimate of $q_\pi(s_t, a_t)$;
- $\alpha_t(s_t, a_t)$ is the learning rate depending on $(s_t, a_t)$.
The name "Sarsa" comes from the abbreviation of state-action-reward-state-action. Each step of the algorithm involves $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.
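A matching sketch of Sarsa as a policy-evaluation routine, under the same assumed environment interface as before (and without the policy improvement step that would be added to find optimal policies):

```python
import numpy as np

def sarsa_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, num_episodes=1_000):
    """Tabular Sarsa estimate of q_pi for a fixed policy (sketch).

    `env` and `policy` are assumed interfaces, not a specific library:
    env.reset() -> state, env.step(a) -> (next_state, reward, done).
    """
    q = np.zeros((env.num_states, env.num_actions))   # q_t(s, a)
    for _ in range(num_episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)                   # next action, also drawn from pi
            target = r if done else r + gamma * q[s_next, a_next]
            q[s, a] -= alpha * (q[s, a] - target)     # update only the visited pair (s_t, a_t)
            s, a = s_next, a_next
    return q
```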
Convergence of the TD learning algorithm
Theorem (Convergence of TD Learning):
By the TD algorithm, $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$, if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.
Remarks:
- This theorem says the state value can be found by the TD algorithm for a given policy $\pi$.
- The conditions $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must be valid for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$, which means that $s$ is visited at time $t$, then $\alpha_t(s) > 0$; otherwise, $\alpha_t(s) = 0$ for all the other states $s \neq s_t$. This requires that every state be visited an infinite (or sufficiently many) number of times (a short sketch of such a per-state step size follows this list).
- The learning rate is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ is no longer valid. When $\alpha$ is constant, it can still be shown that the algorithm converges in the sense of expectation.
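As a small illustration of the step-size conditions (an assumed per-state schedule, not the only valid choice): setting $\alpha_t(s) = 1/\big(\text{number of visits to } s\big)$ satisfies both $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$, whereas a constant $\alpha$ violates the second condition.

```python
# Assumed per-state step-size schedule: alpha_t(s) = 1 / (number of visits to s).
# Over the visits to a fixed state s this gives the harmonic series (sum = infinity),
# while the squared series sums to pi^2/6 < infinity, matching the theorem's conditions.
visit_count = {}

def alpha(s):
    """Return the step size for the current visit to state s."""
    visit_count[s] = visit_count.get(s, 0) + 1
    return 1.0 / visit_count[s]
```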