The MC Basic Algorithm

  1. Shiyu Zhao. Chapter 5: Monte Carlo Methods. Mathematical Foundations of Reinforcement Learning.
  2. --> YouTube: MC Basic

Motivation

Recall that the policy iteration algorithm has two steps in each iteration:

  1. Policy evaluation (PE): $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$.
  2. Policy improvement (PI): $\pi_{k+1} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$.

The PE step is carried out by solving the Bellman equation of $\pi_k$.

The elementwise form of the PI step is:

$$\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s') \right] = \arg\max_\pi \sum_a \pi(a|s)\, q_{\pi_k}(s,a), \quad \forall s \in \mathcal{S}$$

The key is $q_{\pi_k}(s,a)$! So how can we compute $q_{\pi_k}(s,a)$?

Model-based case

The policy iteration algorithm is a model-based algorithm, i.e., the model is given. Recall that the model (or dynamics) of an MDP is composed of two parts:

  1. State transition probability: $p(s'|s,a)$.
  2. Reward probability: $p(r|s,a)$.

Thus, we can calculate $q_{\pi_k}(s,a)$ via the following equation
$$q_{\pi_k}(s,a) = \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s'),$$
where $p(s'|s,a)$ and $p(r|s,a)$ are given by the model, and every $v_{\pi_k}(s')$ is calculated in the PE step.
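To make this concrete, here is a minimal Python sketch (my own, not from the book) that evaluates the equation above; the dictionaries `p_reward`, `p_trans`, `v` and the function name `q_value` are hypothetical structures assumed for illustration.

```python
# A minimal sketch of the model-based computation of q_{pi_k}(s, a) from the
# two model components and the state values v_{pi_k}. The data structures are
# assumed for illustration only.

def q_value(s, a, p_reward, p_trans, v, gamma=0.9):
    """q(s,a) = sum_r p(r|s,a) * r + gamma * sum_{s'} p(s'|s,a) * v(s')."""
    expected_reward = sum(prob * r for r, prob in p_reward[(s, a)].items())
    expected_next_value = sum(prob * v[s_next]
                              for s_next, prob in p_trans[(s, a)].items())
    return expected_reward + gamma * expected_next_value


# Example: a deterministic transition from s="s1" with action "right" to "s2"
# with reward 0, where v("s2") is assumed to be 7.29.
p_reward = {("s1", "right"): {0: 1.0}}
p_trans = {("s1", "right"): {"s2": 1.0}}
v = {"s2": 7.29}
print(q_value("s1", "right", p_reward, p_trans, v))  # 0 + 0.9 * 7.29 = 6.561
```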

Model-free case

But what if we don't know the model, i.e., we want to convert policy iteration into a model-free algorithm?

Recall the definition of the action value:

$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \tag{1}$$

We can use expression (1) to calculate $q_{\pi_k}(s,a)$ based on data (samples or experiences).

This is the key idea of model-free estimation: if we don't have a model, we can estimate $q_{\pi_k}(s,a)$ directly based on data (or experience).

The procedure of Monte Carlo estimation of action values

  • Starting from $(s,a)$, following policy $\pi_k$, generate an episode.
  • The return of this episode is $g(s,a)$.
  • $g(s,a)$ is a sample of $G_t$ in $q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$. Suppose we have a set of episodes and hence a set of returns $\{g^{(i)}(s,a)\}_{i=1}^{N}$. Then,

$$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$$

The idea of estimating the mean based on data is called Monte Carlo estimation. This is why our method is called "MC (Monte Carlo) Basic".
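As a rough illustration of this estimation step, here is a short Python sketch (my own, not the book's pseudocode). The environment interface `step(s, a) -> (next_state, reward)` and the deterministic policy dict `policy` are assumptions made for illustration.

```python
# A minimal sketch of Monte Carlo estimation of q_{pi_k}(s, a):
# sample N episodes starting from (s, a), compute each discounted return,
# and average them.

def mc_action_value(s, a, policy, step, gamma=0.9, num_episodes=30, length=50):
    returns = []
    for _ in range(num_episodes):
        g, discount = 0.0, 1.0
        state, action = s, a                    # start from the pair (s, a)
        for _ in range(length):                 # finite episode length in practice
            state, reward = step(state, action)
            g += discount * reward              # accumulate the discounted return
            discount *= gamma
            action = policy[state]              # then follow pi_k
        returns.append(g)
    return sum(returns) / len(returns)          # q(s,a) ~ (1/N) * sum_i g^(i)(s,a)
```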

The MC Basic algorithm

The MC Basic algorithm is exactly the same as the policy iteration algorithm except that in policy evaluation (PE), we don't solve for $v_{\pi_k}(s)$; instead, we estimate $q_{\pi_k}(s,a)$ directly.

Question: why don't we compute $v_{\pi_k}(s)$?

Answer: if we calculated the state value in the PE step, we would still need to compute the action value in the PI step. So we can estimate the action value directly in the PE step.

Algorithm 5.1: MC Basic (a model-free variant of policy iteration)
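Below is a minimal Python sketch of this loop, not the book's exact Algorithm 5.1: it alternates MC-based estimation of every $q(s,a)$ with greedy policy improvement. The `step(s, a)` interface, the dict-based policy, and the tiny two-cell environment at the bottom are all assumptions made so the sketch can run end to end.

```python
# A sketch of MC Basic (a model-free variant of policy iteration), for
# illustration only.

def mc_basic(states, actions, step, gamma=0.9, iterations=10,
             num_episodes=1, length=50):
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    for _ in range(iterations):
        q = {}
        for s in states:                              # policy evaluation (PE):
            for a in actions:                         # estimate every q(s, a)
                returns = []
                for _ in range(num_episodes):
                    g, discount, state, action = 0.0, 1.0, s, a
                    for _ in range(length):
                        state, reward = step(state, action)
                        g += discount * reward
                        discount *= gamma
                        action = policy[state]        # follow pi_k after (s, a)
                    returns.append(g)
                q[(s, a)] = sum(returns) / len(returns)
        for s in states:                              # policy improvement (PI):
            policy[s] = max(actions, key=lambda a: q[(s, a)])  # greedy action
    return policy, q


# Made-up two-cell environment: taking "go_right" in the right cell (the
# "target") gives reward 1; every other transition gives reward 0.
def step(s, a):
    s_next = "right" if a == "go_right" else "left"
    reward = 1 if (s == "right" and s_next == "right") else 0
    return s_next, reward

policy, q = mc_basic(["left", "right"], ["go_left", "go_right"], step)
print(policy)   # expected: both states choose "go_right"
```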

The MC Basic algorithm is convergent since the policy iteration algorithm (MC Basic is just a variant of it) is convergent.

However, the MC Basic algorithm is not practical due to its low data efficiency (we need to sample $N$ episodes for every state-action pair $(s,a)$ to estimate $q_{\pi_k}(s,a)$).

Example

An initial policy is shown in the figure (as you can see, it's a deterministic policy). Use MC Basic to find the optimal policy. The environment setting is: $r_{\text{boundary}} = -1$, $r_{\text{forbidden}} = -1$, $r_{\text{target}} = 1$, $\gamma = 0.9$.

Figure 5.3: An example for illustrating the MC Basic algorithm.

Outline: given the current policy $\pi_k$, in each iteration:

  1. Policy evaluation: calculate $q_{\pi_k}(s,a)$. Since there are 9 states and 5 actions, we need to calculate the action values of 45 state-action pairs.
  2. Policy improvement: select the greedy action

$$a^*(s) = \arg\max_{a_i} q_{\pi_k}(s, a_i)$$

For each state-action pair, we would need to roll out $N$ episodes to estimate the action value. However, since the policy (and the grid-world dynamics) is deterministic, every rollout from $(s,a)$ gives the same episode, so a single episode is sufficient.

Due to space limitations, we only illustrate the calculation of the action values of $s_1$ in the first iteration.

Step 1: policy evaluation

  1. Starting from $(s_1, a_1)$, the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$$

  2. Starting from $(s_1, a_2)$, the episode is $s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_3} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_2) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots$$

  3. Starting from $(s_1, a_3)$, the episode is $s_1 \xrightarrow{a_3} s_4 \xrightarrow{a_2} s_5 \xrightarrow{a_3} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_3) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots$$

  4. Starting from $(s_1, a_4)$, the episode is $s_1 \xrightarrow{a_4} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_4) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$$

  5. Starting from $(s_1, a_5)$, the episode is $s_1 \xrightarrow{a_5} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_5) = 0 + \gamma(-1) + \gamma^2(-1) + \cdots$$
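Evaluating these geometric series with $\gamma = 0.9$ (a small worked computation that follows directly from the expressions above) gives

$$
\begin{aligned}
q_{\pi_0}(s_1, a_1) &= q_{\pi_0}(s_1, a_4) = -(1 + \gamma + \gamma^2 + \cdots) = -\frac{1}{1-\gamma} = -10, \\
q_{\pi_0}(s_1, a_2) &= q_{\pi_0}(s_1, a_3) = \gamma^3 (1 + \gamma + \cdots) = \frac{\gamma^3}{1-\gamma} = 7.29, \\
q_{\pi_0}(s_1, a_5) &= -\gamma (1 + \gamma + \cdots) = -\frac{\gamma}{1-\gamma} = -9.
\end{aligned}
$$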

Step 2: policy improvement

By observing the action values, we see that $q_{\pi_0}(s_1, a_2) = q_{\pi_0}(s_1, a_3)$ are the maximum.

As a result, the policy can be improved as $\pi_1(a_2|s_1) = 1$ or $\pi_1(a_3|s_1) = 1$.

Either way, the new policy for $s_1$ becomes optimal.

Drawback

We can see the drawback of the MC Basic algorithm: for every episode, we must compute its discounted return $G_t$, i.e., we must wait until the episode has been completed.

How long should the episode length be?

As can be seen in the previous example, in each episode the return is calculated through a process like

Starting from $(s_1, a_1)$, the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$

In practice, we use a finite episode length. We have shown that the state value obtained via the iterative process
$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \ldots$$
is ensured to satisfy $v_{\pi_k}^{(j)} \to v_{\pi_k}$ as $j \to \infty$ (--> See the proof), from any initial guess $v_{\pi_k}^{(0)}$.

However, what if the finite episode length is too small and $v_{\pi_k}^{(j)}$ does not converge to $v_{\pi_k}$? In this case, the action values won't be correct, and the premise of the above reasoning falls apart.
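As a quick numerical illustration (using the reward sequence of the $(s_1, a_2)$ episode above, i.e., three zero rewards followed by rewards of 1, with $\gamma = 0.9$), the following Python sketch shows how truncating the episode too early yields a useless estimate:

```python
# Truncating the (s1, a2) episode above -- rewards 0, 0, 0, 1, 1, 1, ... --
# at different lengths T, and comparing with the infinite-horizon value.
gamma = 0.9

def truncated_return(T):
    rewards = [0, 0, 0] + [1] * max(0, T - 3)     # first T rewards of the episode
    return sum(gamma**t * r for t, r in enumerate(rewards[:T]))

for T in (1, 3, 5, 15, 50):
    print(T, round(truncated_return(T), 3))
# T = 1 or 3 gives 0.0: the episode is too short to reach the positive reward.
# As T grows, the estimate approaches gamma**3 / (1 - gamma) = 7.29.
```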


We can see from the following figures that the episode length greatly impacts the final policies.

When the length of each episode is too short, neither the policy nor the value estimate is optimal (see Figures 5.4(a)-(d)). In the extreme case where the episode length is one, only the states that are adjacent to the target have nonzero values, and all the other states have zero values since each episode is too short to reach the target or get positive rewards (see Figure 5.4(a)).

As the episode length increases, the policy and value estimates gradually approach the optimal ones (see Figure 5.4(h)).

While the above analysis suggests that each episode must be sufficiently long, the episodes are not necessarily infinitely long. As shown in Figure 5.4(g), when the length is 30, the algorithm can find an optimal policy, although the value estimate is not yet optimal.

Figure 5.4: The policies and value estimates obtained by MC Basic with different episode lengths, panels (a)-(h).

The above analysis is related to an important reward design problem, sparse reward, which refers to the scenario in which no positive rewards can be obtained unless the target is reached. The sparse reward setting requires long episodes that can reach the target. This requirement is challenging to satisfy when the state space is large. As a result, the sparse reward problem lowers learning efficiency.
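As a back-of-the-envelope illustration (assuming an $n \times n$ grid world with the target in a corner and one cell moved per step), the episode length required for the farthest state to reach the target grows with the size of the state space:

```python
# Assumed n x n grid world with the target in a corner: the farthest state is
# 2*(n-1) steps away, so episodes must be at least that long to see any
# positive reward from it.
for n in (3, 5, 10, 100):
    farthest = 2 * (n - 1)        # Manhattan distance from the opposite corner
    print(f"{n}x{n} grid: {n*n} states, episodes need length >= {farthest}")
```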