Temporal-Difference Methods

This chapter introduces temporal-difference (TD) methods for reinforcement learning. Like Monte Carlo (MC) learning, TD learning is model-free, but it has some advantages due to its incremental form.

The goal of both TD learning and MC learning is policy evaluation: given a dataset generated by a policy π, estimate the state values (or action values) of π. To obtain optimal policies, they must be combined with a step of policy improvement.

We elaborate on two TD learning methods: one for estimating state values and one for estimating action values; the latter algorithm is called Sarsa.

In the next post, we will introduce Q-learning, which is very similar to Sarsa but directly estimates optimal action values and hence optimal policies.

Sources:

  1. Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.

Motivating examples

Using the RM algorithm for mean estimation

We first present some stochastic estimation problems and show how to solve them with the Robbins-Monro (RM) algorithm.

First, consider the simple mean estimation problem: calculate $w = \mathbb{E}[X]$ based on some i.i.d. samples $\{x\}$ of $X$.

  • By writing $g(w) = w - \mathbb{E}[X]$, we can reformulate the problem as a root-finding problem $g(w) = 0$.
  • Since we can only obtain samples $\{x\}$ of $X$, the noisy observation is $\tilde{g}(w, \eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta$.
  • Then, according to the last lecture, the RM algorithm for solving $g(w) = 0$ is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k (w_k - x_k).$$

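To make this concrete, here is a minimal Python sketch of the iteration above (not from the source; the Gaussian distribution of $X$ and the step size $\alpha_k = 1/k$ are assumptions chosen for illustration):

```python
import numpy as np

# Synthetic i.i.d. samples of X (the Gaussian distribution is only for illustration).
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)

w = 0.0  # initial estimate of w = E[X]
for k, x_k in enumerate(samples, start=1):
    alpha_k = 1.0 / k                # step size satisfying the RM conditions
    w = w - alpha_k * (w - x_k)      # w_{k+1} = w_k - alpha_k (w_k - x_k)

print(f"RM estimate: {w:.4f}, sample mean: {samples.mean():.4f}")
```

With $\alpha_k = 1/k$, the iterate $w_{k+1}$ is exactly the average of the first $k$ samples, so the RM estimate and the sample mean agree up to floating-point error.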

Second, consider a slightly more complex problem: estimate the mean of a function $v(X)$, that is, $w = \mathbb{E}[v(X)]$, based on some i.i.d. random samples $\{x\}$ of $X$.

  • To solve this problem, we define $g(w) = w - \mathbb{E}[v(X)]$ and $\tilde{g}(w, \eta) = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta$.
  • Then, the problem becomes a root-finding problem $g(w) = 0$. The corresponding RM algorithm is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k [w_k - v(x_k)].$$


Third, consider an even more complex problem: calculate $w = \mathbb{E}[R + \gamma v(X)]$, where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.

  • Suppose we can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$. We define $g(w) = w - \mathbb{E}[R + \gamma v(X)]$ and $\tilde{g}(w, \eta) = w - [r + \gamma v(x)] = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \doteq g(w) + \eta$.
  • Then, the problem becomes a root-finding problem $g(w) = 0$. The corresponding RM algorithm is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k [w_k - (r_k + \gamma v(x_k))].$$

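A similar sketch (again an illustration; the distributions of $R$ and $X$ and the function $v(\cdot)$ are arbitrary choices) applies the same iteration to estimate $w = \mathbb{E}[R + \gamma v(X)]$, which is exactly the form the TD algorithms below build on:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
gamma = 0.9

def v(x):
    return x ** 2  # an arbitrary known function v(.)

# i.i.d. samples of X and R (their distributions are assumptions for the example)
xs = rng.integers(low=0, high=3, size=50_000)      # X uniform on {0, 1, 2}
rs = rng.normal(loc=1.0, scale=0.5, size=50_000)   # R with mean 1

w = 0.0
for k, (x_k, r_k) in enumerate(zip(xs, rs), start=1):
    alpha_k = 1.0 / k
    # w_{k+1} = w_k - alpha_k [w_k - (r_k + gamma v(x_k))]
    w = w - alpha_k * (w - (r_k + gamma * v(x_k)))

# True value: E[R] + gamma * E[X^2] = 1 + 0.9 * (0 + 1 + 4) / 3 = 2.5
print(f"RM estimate: {w:.3f}")
```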
The Bellman expectation equation

Recall that the definition of the state value of policy $\pi$ is
$$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}, \tag{1}$$
where $G$ is the discounted return. Since
$$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s],$$
where $S'$ is the next state, we can rewrite (1) as
$$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \tag{2}$$

where $s \in \mathcal{S}$.

Meanwhile, we can extend this conclusion to action values, resulting in
$$q_\pi(s, a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid S = s, A = a], \tag{3}$$

where $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.

Equation (2) (or (3), in terms of action values) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation and is an important tool for designing and analyzing TD algorithms.

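To make equation (2) concrete, here is a small numerical check (the two-state Markov reward process below is made up for illustration): under a fixed policy, the state values solve the linear system $v_\pi = r_\pi + \gamma P_\pi v_\pi$, and each component then satisfies the Bellman expectation equation.

```python
import numpy as np

gamma = 0.9
# Transition matrix P_pi and expected rewards r_pi under a fixed policy
# (the numbers are assumptions chosen only for this illustration).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, -0.5])

# Solve v = r + gamma * P v, i.e., (I - gamma * P) v = r
v = np.linalg.solve(np.eye(2) - gamma * P, r)

# Each state satisfies v_pi(s) = E[R + gamma * v_pi(S') | S = s]
for s in range(2):
    rhs = r[s] + gamma * P[s] @ v
    print(f"state {s}: v = {v[s]:.4f}, r + gamma * E[v(S')] = {rhs:.4f}")
```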
A natural derivation of the TD learning algorithm

According to the Bellman expectation equation, the state value of a policy is given by (2). One intuitive way to compute it is to use the RM algorithm:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\big[v_t(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})]\big].$$
Or, in terms of action values:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - [r_{t+1} + \gamma q_\pi(s_{t+1}, a_{t+1})]\big].$$

However, the right-hand side of the first equation contains $v_\pi(s_{t+1})$ and the right-hand side of the second contains $q_\pi(s_{t+1}, a_{t+1})$; both are unknown during the approximation process. Therefore, we replace them with the current estimates $v_t(s_{t+1})$ and $q_t(s_{t+1}, a_{t+1})$. It can be proven that, under some conditions, the algorithm still converges with this replacement.

The resulting methods are called temporal-difference (TD) learning methods. Because of this replacement, TD learning is not exactly the RM algorithm, but the two are quite alike.

TD learning of state values

Given the data/experience $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated by following the given policy $\pi$, the TD learning algorithm to estimate the state value function of $\pi$ is:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\big[v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})]\big],$$
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t,$$
where $t = 0, 1, 2, \ldots$ Here, $v_t(s_t)$ is the estimate of $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate of $s_t$ at time $t$.

  • At time $t$, only the value of the visited state $s_t$ is updated, whereas the values of the unvisited states $s \neq s_t$ remain unchanged.
  • The second equation will be omitted when the context is clear.

Here, $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$ is called the TD target, and $\delta_t \doteq v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] = v_t(s_t) - \bar{v}_t$ is called the TD error.

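As a minimal sketch of this algorithm (assuming the experience has already been collected as a list of episodes of $(s_t, r_{t+1}, s_{t+1})$ transitions, and using a constant learning rate for simplicity), TD learning of state values can be written as follows; a terminal state that only appears as a successor keeps the default value 0.

```python
from collections import defaultdict

def td0_state_values(episodes, gamma=0.9, alpha=0.1):
    """TD(0) estimate of v_pi from episodes of (s, r, s_next) transitions
    collected by following the policy pi. The data format and the constant
    learning rate are assumptions of this sketch."""
    v = defaultdict(float)  # v_t(s), initialized to 0 for every state
    for episode in episodes:
        for s, r, s_next in episode:
            td_target = r + gamma * v[s_next]   # TD target: r_{t+1} + gamma * v_t(s_{t+1})
            td_error = v[s] - td_target         # TD error: v_t(s_t) - TD target
            v[s] -= alpha * td_error            # update only the visited state s_t
    return dict(v)
```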
TD learning of action values: Sarsa

Algorithm 7.1

Given the data/experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$ generated by following the given policy $\pi$, the TD learning algorithm (Sarsa) to estimate the action value function of $\pi$ is:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - [r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})]\big],$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t),$$
where $t = 0, 1, 2, \ldots$

  • $q_t(s_t, a_t)$ is an estimate of $q_\pi(s_t, a_t)$;
  • $\alpha_t(s_t, a_t)$ is the learning rate depending on $s_t, a_t$.

The name "Sarsa" comes from the abbreviation of state-action-reward-state-action. Each step of the algorithm involves (st,at,rt+1,st+1,at+1).

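Analogously, here is a minimal sketch of Sarsa for evaluating $q_\pi$ (same assumptions as above: pre-collected experience tuples and a constant learning rate); as noted earlier, obtaining an optimal policy would additionally interleave each update with a policy-improvement step (e.g., an ε-greedy update of π).

```python
from collections import defaultdict

def sarsa_action_values(episodes, gamma=0.9, alpha=0.1):
    """Sarsa estimate of q_pi from episodes of (s, a, r, s_next, a_next)
    tuples collected by following the policy pi."""
    q = defaultdict(float)  # q_t(s, a), initialized to 0 for every pair
    for episode in episodes:
        for s, a, r, s_next, a_next in episode:
            td_target = r + gamma * q[(s_next, a_next)]
            td_error = q[(s, a)] - td_target
            q[(s, a)] -= alpha * td_error       # update only the visited (s_t, a_t)
    return dict(q)
```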
Convergence of the TD learning algorithm

Theorem (Convergence of TD Learning):

By the TD algorithm, $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s \in \mathcal{S}$ as $t \to \infty$ if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.

Remarks:

  • This theorem says that the state values can be found by the TD algorithm for a given policy $\pi$.
  • The conditions $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must hold for all $s \in \mathcal{S}$. At time step $t$, if $s = s_t$, which means that $s$ is visited at time $t$, then $\alpha_t(s) > 0$; otherwise, $\alpha_t(s) = 0$ for all the other $s \neq s_t$. This requires every state to be visited an infinite (or sufficiently large) number of times.
  • The learning rate $\alpha$ is often selected as a small constant. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ no longer holds. When $\alpha$ is constant, it can still be shown that the algorithm converges in the sense of expectation.

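As a concrete instance (a standard choice, not taken from the source): let $n_t(s)$ be the number of visits to $s$ up to time $t$ and set $\alpha_t(s) = 1/n_t(s)$ when $s = s_t$ and $\alpha_t(s) = 0$ otherwise. If $s$ is visited infinitely often, the nonzero step sizes are $1, \tfrac{1}{2}, \tfrac{1}{3}, \ldots$, so $\sum_t \alpha_t(s) = \sum_{n \geq 1} \tfrac{1}{n} = \infty$ while $\sum_t \alpha_t^2(s) = \sum_{n \geq 1} \tfrac{1}{n^2} = \tfrac{\pi^2}{6} < \infty$, and both conditions are satisfied.
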
For the proof of the theorem, see the book.