The MC Basic Algorithm
- Shiyu Zhao. Chapter 5: Monte Carlo Methods. Mathematical Foundations of Reinforcement Learning.
- → YouTube: MC Basic
Motivation
Recall that the policy iteration algorithm has two steps in each iteration:
- Policy evaluation (PE): solve $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$ to obtain $v_{\pi_k}$.
- Policy improvement (PI): $\pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_{\pi_k} \right)$.
The PE step is carried out by solving the Bellman equation. The elementwise form of the PI step is
$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_a \pi(a|s) \underbrace{\left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s') \right]}_{q_{\pi_k}(s,a)} = \arg\max_{\pi} \sum_a \pi(a|s)\, q_{\pi_k}(s,a).$$
The key quantity is the action value $q_{\pi_k}(s,a)$.
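Recall from the policy improvement analysis earlier in the book that, since $\sum_a \pi(a|s)\, q_{\pi_k}(s,a)$ is a weighted average of the action values, the maximization is achieved by the greedy deterministic policy:
$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s) = \arg\max_{a} q_{\pi_k}(s,a), \\ 0, & \text{otherwise}. \end{cases}$$
Hence, once we have $q_{\pi_k}(s,a)$, the policy improvement step is immediate.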
Model-based case
The policy iteration algorithm is a model-based algorithm, i.e., the model is given. Recall that the model (or dynamics) of an MDP is composed of two parts:
- State transition probability: $p(s'|s,a)$
- Reward probability: $p(r|s,a)$
Thus, we can calculate the action value directly from the model:
$$q_{\pi_k}(s,a) = \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s').$$
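To make the model-based expression concrete, here is a minimal sketch (my own illustration, not from the book) that evaluates $q_{\pi_k}(s,a)$ from given transition and reward probabilities. The names `p_r`, `p_s`, `v`, and `action_value` are hypothetical placeholders.

```python
# Minimal sketch of the model-based action-value computation (illustrative names).
# p_r[(s, a)] maps reward -> probability, p_s[(s, a)] maps next_state -> probability,
# and v[s] is the state value under the current policy pi_k.

def action_value(s, a, p_r, p_s, v, gamma=0.9):
    expected_reward = sum(prob * r for r, prob in p_r[(s, a)].items())
    expected_next_value = sum(prob * v[s_next] for s_next, prob in p_s[(s, a)].items())
    return expected_reward + gamma * expected_next_value

# Toy usage with made-up numbers: bouncing off a wall gives reward -1 and keeps the agent in s1.
p_r = {("s1", "a1"): {-1: 1.0}}
p_s = {("s1", "a1"): {"s1": 1.0}}
v = {"s1": -10.0}
print(action_value("s1", "a1", p_r, p_s, v))  # -1 + 0.9 * (-10) = -10.0
```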
Model-free case
But what if we don't know the model, i.e., we want to convert policy iteration into a model-free algorithm?
Recall the definition of the action value:
$$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$$
We can use this expression to estimate $q_{\pi_k}(s,a)$ from data (samples or experiences).
This is the key idea of model-free RL: if we don't have a model, we estimate the action values directly from data (or experience).
The procedure of Monte Carlo estimation of action values
- Starting from $(s,a)$ and following policy $\pi_k$, generate an episode.
- The return of this episode is $g(s,a)$, which is a sample of $G_t$ in $q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$.
- Suppose we have a set of such episodes and hence a set of returns $\{ g^{(i)}(s,a) \}_{i=1}^{N}$. Then,
$$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a).$$
The idea of estimating a mean based on data samples is called Monte Carlo estimation. This is why the method is called MC (Monte Carlo) Basic.
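To make the procedure concrete, the following is a minimal sketch (my own illustration, not the book's code) of Monte Carlo estimation of a single action value. It assumes a hypothetical environment function `step(s, a)` returning `(next_state, reward)`, a deterministic policy stored as a dictionary `policy[s] -> a`, and a truncated episode length.

```python
def sample_return(s, a, step, policy, gamma=0.9, episode_len=100):
    """Generate one episode that starts from (s, a) and then follows `policy`;
    return its (truncated) discounted return g(s, a)."""
    g, discount = 0.0, 1.0
    for _ in range(episode_len):
        s, r = step(s, a)   # environment transition: hypothetical step(s, a) -> (s', r)
        g += discount * r
        discount *= gamma
        a = policy[s]       # all subsequent actions follow pi_k
    return g

def mc_action_value(s, a, step, policy, num_episodes=50, **kwargs):
    """q_{pi_k}(s, a) is approximated by the average return over num_episodes episodes."""
    returns = [sample_return(s, a, step, policy, **kwargs) for _ in range(num_episodes)]
    return sum(returns) / len(returns)
```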
The MC Basic algorithm
The MC Basic algorithm is exactly the same as the policy iteration algorithm except that, in the policy evaluation (PE) step, we don't solve the Bellman equation for $v_{\pi_k}(s)$; instead, we directly estimate $q_{\pi_k}(s,a)$ by Monte Carlo estimation.
Question: why don't we compute $v_{\pi_k}(s)$ first and then derive the action values?
Answer: if we only estimated the state values in the PE step, we would still need to convert them into action values in the PI step, which requires the model. So we directly estimate the action values in the PE step.
The MC Basic algorithm is convergent since the policy iteration algorithm, of which MC Basic is just a variant, is convergent.
However, the MC Basic algorithm is not practical due to its low data efficiency: in every iteration, we need to estimate $q_{\pi_k}(s,a)$ for every state-action pair, which requires generating sufficiently many and sufficiently long episodes.
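The following is a compact sketch of the MC Basic loop as described above; it is my illustrative reading, not the book's reference implementation. It assumes hypothetical `states` and `actions` lists and a deterministic environment function `step(s, a) -> (next_state, reward)`, and it uses a single rollout per state-action pair (enough only when the policy and environment are deterministic).

```python
def mc_basic(states, actions, step, gamma=0.9, episode_len=100, num_iterations=20):
    """Policy iteration in which policy evaluation uses Monte Carlo estimates of q(s, a)."""
    policy = {s: actions[0] for s in states}   # arbitrary initial deterministic policy
    q = {}
    for _ in range(num_iterations):
        # Policy evaluation: estimate q_{pi_k}(s, a) by rolling out an episode from every (s, a).
        for s in states:
            for a in actions:
                g, discount, s_t, a_t = 0.0, 1.0, s, a
                for _ in range(episode_len):
                    s_t, r = step(s_t, a_t)
                    g += discount * r
                    discount *= gamma
                    a_t = policy[s_t]
                q[(s, a)] = g
        # Policy improvement: greedy with respect to the estimated action values.
        policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
    return policy, q
```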
Example
An initial policy $\pi_0$ is shown in the figure (as you can see, it is a deterministic policy). Use MC Basic to find the optimal policy. The environment setting is $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = +1$, and $\gamma = 0.9$.
Outline: given the current policy $\pi_k$, the core step is to estimate $q_{\pi_k}(s,a)$ for all state-action pairs.
For each state-action pair, we need to roll out an episode following the current policy; since both the policy and the environment are deterministic here, a single episode per pair is sufficient.
Due to space limitations, we only illustrate the calculation of the action values of state $s_1$, i.e., $q_{\pi_0}(s_1, a_i)$.
Step 1: policy evaluation
- Starting from $(s_1, a_1)$, the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots = -\frac{1}{1-\gamma}$.
- Starting from $(s_1, a_2)$, the episode is $s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_3} s_8 \xrightarrow{a_2} s_9 \xrightarrow{a_5} s_9 \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_2) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3 (1) + \gamma^4 (1) + \cdots = \frac{\gamma^3}{1-\gamma}$.
- Starting from $(s_1, a_3)$, the episode is $s_1 \xrightarrow{a_3} s_4 \xrightarrow{a_2} s_5 \xrightarrow{a_3} s_8 \xrightarrow{a_2} s_9 \xrightarrow{a_5} s_9 \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_3) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3 (1) + \gamma^4 (1) + \cdots = \frac{\gamma^3}{1-\gamma}$.
- Starting from $(s_1, a_4)$, the episode is $s_1 \xrightarrow{a_4} s_1 \xrightarrow{a_1} s_1 \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_4) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots = -\frac{1}{1-\gamma}$.
Step 2: policy improvement
By observing the action values, we see that $q_{\pi_0}(s_1, a_2)$ and $q_{\pi_0}(s_1, a_3)$ are the maximum.
As a result, the policy can be improved as $\pi_1(a_2|s_1) = 1$ or $\pi_1(a_3|s_1) = 1$.
Either way, the new policy for $s_1$ becomes optimal.
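As a quick numerical check of the comparison above (using the returns as reconstructed here, with $\gamma = 0.9$), the closed-form values can be evaluated directly:

```python
gamma = 0.9

# Closed forms of the geometric series in the episodes above (as reconstructed here):
q_a1 = -1 / (1 - gamma)          # -1 - 0.9 - 0.81 - ...  is about -10.0
q_a2 = gamma**3 / (1 - gamma)    #  0.729 + 0.6561 + ...  is about  7.29
q_a3 = q_a2                      # same reward pattern as a2
q_a4 = q_a1                      # same reward pattern as a1

print(q_a1, q_a2)  # approximately -10.0 and 7.29, so a2 and a3 are the greedy choices at s1
```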
Drawback
We can see a drawback of the MC Basic algorithm: for every episode, we must calculate the discounted return $g = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$, which in principle requires an infinitely long episode.
How long should the episode length be?
As can be seen previously, in each episode the return is calculated through a process like the following:
Starting from $(s_1, a_1)$, the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$, an infinite sum.
In practice, we have to use a finite episode length. As the episode length increases, the truncated return, and hence the value estimate obtained from it, approaches the true value.
However, what if the finite episode length is too short for the episodes to reach the target?
We can see from Figure 5.4 of the book that the episode length greatly impacts the final policy and value estimates.
When the length of each episode is too short, neither the policy nor the value estimate is optimal (see Figures 5.4(a)-(d)). In the extreme case where the episode length is one, only the states that are adjacent to the target have nonzero values, and all the other states have zero values since each episode is too short to reach the target or get positive rewards (see Figure 5.4(a)).
As the episode length increases, the policy and value estimates gradually approach the optimal ones (see Figure 5.4(h)).
While the above analysis suggests that each episode must be sufficiently long, the episodes are not necessarily infinitely long. As shown in Figure 5.4(g), when the length is 30, the algorithm can find an optimal policy, although the value estimate is not yet optimal.
The above analysis is related to an important reward design problem, sparse reward, which refers to the scenario in which no positive rewards can be obtained unless the target is reached. The sparse reward setting requires long episodes that can reach the target. This requirement is challenging to satisfy when the state space is large. As a result, the sparse reward problem lowers the learning efficiency.
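To see the truncation effect numerically, here is a small sketch (my own illustration) that computes truncated returns for increasing episode lengths, using the reward pattern of the $(s_1, a_2)$ episode reconstructed above: three zero rewards followed by $+1$ at every step once the target is reached.

```python
gamma = 0.9

def truncated_return(rewards, gamma):
    """Discounted return computed over a finite (truncated) episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

for length in (1, 3, 5, 15, 30, 100):
    # Three steps with reward 0, then reward +1 at every step (illustrative pattern).
    rewards = ([0, 0, 0] + [1] * length)[:length]
    print(length, round(truncated_return(rewards, gamma), 3))

# The infinite-horizon return is gamma**3 / (1 - gamma), about 7.29. Very short episodes
# (length 1 or 3) give 0 and badly underestimate it, while longer episodes approach it,
# mirroring the episode-length effect discussed above.
```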