The MC Basic Algorithm

  1. Shiyu Zhao. Chapter 5: Monte Carlo Methods. Mathematical Foundations of Reinforcement Learning.
  2. --> YouTube: MC Basic

Motivation

Recall that the policy iteration algorithm has two steps in each iteration:

  1. Policy evaluation (PE): $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$.
  2. Policy improvement (PI): $\pi_{k+1} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$.

The PE step is carried out by solving the Bellman equation of $\pi_k$.

The elementwise form of the PI step is:

$$\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s') \right] = \arg\max_\pi \sum_a \pi(a|s)\, q_{\pi_k}(s,a), \quad \forall s \in \mathcal{S}$$

The key is $q_{\pi_k}(s,a)$! So how can we compute $q_{\pi_k}(s,a)$?

Model-based case

The policy iteration algorithm is a model-based algorithm, i.e., the model is given. Recall that the model (or dynamics) of an MDP is composed of two parts:

  1. State transition probability: $p(s'|s,a)$.
  2. Reward probability: $p(r|s,a)$.

Thus, we can calculate $q_{\pi_k}(s,a)$ via the following equation
$$q_{\pi_k}(s,a) = \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s'),$$
where $p(s'|s,a)$ and $p(r|s,a)$ are given by the model, and every $v_{\pi_k}(s')$ is calculated in the PE step.
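To make this concrete, here is a minimal Python sketch (my own, not from the book) that evaluates the equation above; the dictionaries `p_reward`, `p_trans`, `v` and the function name `q_value` are hypothetical structures assumed for illustration.

```python
# A minimal sketch of the model-based computation of q_{pi_k}(s, a) from the
# two model components and the state values v_{pi_k}. The data structures are
# assumed for illustration only.

def q_value(s, a, p_reward, p_trans, v, gamma=0.9):
    """q(s,a) = sum_r p(r|s,a) * r + gamma * sum_{s'} p(s'|s,a) * v(s')."""
    expected_reward = sum(prob * r for r, prob in p_reward[(s, a)].items())
    expected_next_value = sum(prob * v[s_next]
                              for s_next, prob in p_trans[(s, a)].items())
    return expected_reward + gamma * expected_next_value


# Example: a deterministic transition from s="s1" with action "right" to "s2"
# with reward 0, where v("s2") is assumed to be 7.29.
p_reward = {("s1", "right"): {0: 1.0}}
p_trans = {("s1", "right"): {"s2": 1.0}}
v = {"s2": 7.29}
print(q_value("s1", "right", p_reward, p_trans, v))  # 0 + 0.9 * 7.29 = 6.561
```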

Model-free case

But what if we don't know the model, i.e., we want to convert policy iteration into a model-free algorithm?

Recall the definition of the action value:

$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \tag{1}$$

We can use expression (1) to calculate $q_{\pi_k}(s,a)$ based on data (samples or experiences).

This is the key idea of model-free estimation: if we don't have a model, we can estimate $q_{\pi_k}(s,a)$ directly based on data (or experience).

The procedure of Monte Carlo estimation of action values

  • Starting from $(s,a)$, following policy $\pi_k$, generate an episode.
  • The return of this episode is $g(s,a)$.
  • $g(s,a)$ is a sample of $G_t$ in $q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$. Suppose we have a set of episodes and hence a set of returns $\{g^{(i)}(s,a)\}_{i=1}^{N}$. Then,

$$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a)$$

The idea of estimating the mean based on data is called Monte Carlo estimation. This is why our method is called "MC (Monte Carlo) Basic".
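As a rough illustration of this estimation step, here is a short Python sketch (my own, not the book's pseudocode). The environment interface `step(s, a) -> (next_state, reward)` and the deterministic policy dict `policy` are assumptions made for illustration.

```python
# A minimal sketch of Monte Carlo estimation of q_{pi_k}(s, a):
# sample N episodes starting from (s, a), compute each discounted return,
# and average them.

def mc_action_value(s, a, policy, step, gamma=0.9, num_episodes=30, length=50):
    returns = []
    for _ in range(num_episodes):
        g, discount = 0.0, 1.0
        state, action = s, a                    # start from the pair (s, a)
        for _ in range(length):                 # finite episode length in practice
            state, reward = step(state, action)
            g += discount * reward              # accumulate the discounted return
            discount *= gamma
            action = policy[state]              # then follow pi_k
        returns.append(g)
    return sum(returns) / len(returns)          # q(s,a) ~ (1/N) * sum_i g^(i)(s,a)
```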

The MC Basic algorithm

The MC Basic algorithm is exactly the same as the policy iteration algorithm except that in policy evaluation (PE), we don't solve for $v_{\pi_k}(s)$; instead, we estimate $q_{\pi_k}(s,a)$ directly.

Question: why don't we compute $v_{\pi_k}(s)$?

Answer: if we calculated the state value in the PE step, we would still need to compute the action value in the PI step. So we can estimate the action value directly in the PE step.

Algorithm 5.1: MC Basic (a model-free variant of policy iteration)
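Below is a minimal Python sketch of this loop, not the book's exact Algorithm 5.1: it alternates MC-based estimation of every $q(s,a)$ with greedy policy improvement. The `step(s, a)` interface, the dict-based policy, and the tiny two-cell environment at the bottom are all assumptions made so the sketch can run end to end.

```python
# A sketch of MC Basic (a model-free variant of policy iteration), for
# illustration only.

def mc_basic(states, actions, step, gamma=0.9, iterations=10,
             num_episodes=1, length=50):
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    for _ in range(iterations):
        q = {}
        for s in states:                              # policy evaluation (PE):
            for a in actions:                         # estimate every q(s, a)
                returns = []
                for _ in range(num_episodes):
                    g, discount, state, action = 0.0, 1.0, s, a
                    for _ in range(length):
                        state, reward = step(state, action)
                        g += discount * reward
                        discount *= gamma
                        action = policy[state]        # follow pi_k after (s, a)
                    returns.append(g)
                q[(s, a)] = sum(returns) / len(returns)
        for s in states:                              # policy improvement (PI):
            policy[s] = max(actions, key=lambda a: q[(s, a)])  # greedy action
    return policy, q


# Made-up two-cell environment: taking "go_right" in the right cell (the
# "target") gives reward 1; every other transition gives reward 0.
def step(s, a):
    s_next = "right" if a == "go_right" else "left"
    reward = 1 if (s == "right" and s_next == "right") else 0
    return s_next, reward

policy, q = mc_basic(["left", "right"], ["go_left", "go_right"], step)
print(policy)   # expected: both states choose "go_right"
```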

The MC Basic algorithm is convergent since the policy iteration algorithm (MC Basic is just a variant of it) is convergent.

However, the MC Basic algorithm is not practical due to its low data efficiency (we need to sample $N$ episodes for every state-action pair $(s,a)$ to estimate $q_{\pi_k}(s,a)$).

Example

An initial policy is shown in the figure (as you can see, it's a deterministic policy). Use MC Basic to find the optimal policy. The environment setting is: $r_{\text{boundary}} = -1$, $r_{\text{forbidden}} = -1$, $r_{\text{target}} = 1$, $\gamma = 0.9$.

Figure 5.3: An example for illustrating the MC Basic algorithm.

Outline: given the current policy $\pi_k$, in each iteration:

  1. Policy evaluation: calculate $q_{\pi_k}(s,a)$. Since there are 9 states and 5 actions, we need to calculate the action values of 45 state-action pairs.
  2. Policy improvement: select the greedy action

$$a^*(s) = \arg\max_{a_i} q_{\pi_k}(s, a_i)$$

For each state-action pair, we would need to roll out $N$ episodes to estimate the action value. However, since the policy (and the grid-world dynamics) is deterministic, every rollout from $(s,a)$ gives the same episode, so a single episode is sufficient.

Due to space limitations, we only illustrate the calculation of the action values of $s_1$ in the first iteration.

Step 1: policy evaluation

  1. Starting from $(s_1, a_1)$, the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$$

  2. Starting from $(s_1, a_2)$, the episode is $s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_3} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_2) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots$$

  3. Starting from $(s_1, a_3)$, the episode is $s_1 \xrightarrow{a_3} s_4 \xrightarrow{a_2} s_5 \xrightarrow{a_3} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_3) = 0 + \gamma 0 + \gamma^2 0 + \gamma^3(1) + \gamma^4(1) + \cdots$$

  4. Starting from $(s_1, a_4)$, the episode is $s_1 \xrightarrow{a_4} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_4) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$$

  5. Starting from $(s_1, a_5)$, the episode is $s_1 \xrightarrow{a_5} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is

$$q_{\pi_0}(s_1, a_5) = 0 + \gamma(-1) + \gamma^2(-1) + \cdots$$
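Evaluating these geometric series with $\gamma = 0.9$ (a small worked computation that follows directly from the expressions above) gives

$$
\begin{aligned}
q_{\pi_0}(s_1, a_1) &= q_{\pi_0}(s_1, a_4) = -(1 + \gamma + \gamma^2 + \cdots) = -\frac{1}{1-\gamma} = -10, \\
q_{\pi_0}(s_1, a_2) &= q_{\pi_0}(s_1, a_3) = \gamma^3 (1 + \gamma + \cdots) = \frac{\gamma^3}{1-\gamma} = 7.29, \\
q_{\pi_0}(s_1, a_5) &= -\gamma (1 + \gamma + \cdots) = -\frac{\gamma}{1-\gamma} = -9.
\end{aligned}
$$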

Step 2: policy improvement

By observing the action values, we see that $q_{\pi_0}(s_1, a_2) = q_{\pi_0}(s_1, a_3)$ are the maximum.

As a result, the policy can be improved as $\pi_1(a_2|s_1) = 1$ or $\pi_1(a_3|s_1) = 1$.

Either way, the new policy for $s_1$ becomes optimal.

Drawback

We can see the drawback of the MC Basic algorithm: for every episode, we must compute its discounted return $G_t$, i.e., we must wait until the episode has been completed.

How long should the episode length be?

As can be seen in the previous example, in each episode the return is calculated through a process like

Starting from $(s_1, a_1)$, the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is $q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$

In practice, we use a finite episode length. We have shown that the state value obtained via the iterative process
$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \ldots$$
is ensured to satisfy $v_{\pi_k}^{(j)} \to v_{\pi_k}$ as $j \to \infty$ (--> See the proof), from any initial guess $v_{\pi_k}^{(0)}$.

However, what if the finite episode length is too small and $v_{\pi_k}^{(j)}$ does not converge to $v_{\pi_k}$? In this case, the action values won't be correct, and the premise of the above reasoning falls apart.
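As a quick numerical illustration (using the reward sequence of the $(s_1, a_2)$ episode above, i.e., three zero rewards followed by rewards of 1, with $\gamma = 0.9$), the following Python sketch shows how truncating the episode too early yields a useless estimate:

```python
# Truncating the (s1, a2) episode above -- rewards 0, 0, 0, 1, 1, 1, ... --
# at different lengths T, and comparing with the infinite-horizon value.
gamma = 0.9

def truncated_return(T):
    rewards = [0, 0, 0] + [1] * max(0, T - 3)     # first T rewards of the episode
    return sum(gamma**t * r for t, r in enumerate(rewards[:T]))

for T in (1, 3, 5, 15, 50):
    print(T, round(truncated_return(T), 3))
# T = 1 or 3 gives 0.0: the episode is too short to reach the positive reward.
# As T grows, the estimate approaches gamma**3 / (1 - gamma) = 7.29.
```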


We can see from the following figures that the episode length greatly impacts the final policies.

When the length of each episode is too short, neither the policy nor the value estimate is optimal (see Figures 5.4(a)-(d)). In the extreme case where the episode length is one, only the states that are adjacent to the target have nonzero values, and all the other states have zero values since each episode is too short to reach the target or get positive rewards (see Figure 5.4(a)).

As the episode length increases, the policy and value estimates gradually approach the optimal ones (see Figure 5.4(h)).

While the above analysis suggests that each episode must be sufficiently long, the episodes are not necessarily infinitely long. As shown in Figure 5.4(g), when the length is 30, the algorithm can find an optimal policy, although the value estimate is not yet optimal.

Figure 5.4: The policies and value estimates obtained by MC Basic with different episode lengths, panels (a)-(h).

The above analysis is related to an important reward design problem, sparse reward, which refers to the scenario in which no positive rewards can be obtained unless the target is reached. The sparse reward setting requires long episodes that can reach the target. This requirement is challenging to satisfy when the state space is large. As a result, the sparse reward problem lowers learning efficiency.
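As a back-of-the-envelope illustration (assuming an $n \times n$ grid world with the target in a corner and one cell moved per step), the episode length required for the farthest state to reach the target grows with the size of the state space:

```python
# Assumed n x n grid world with the target in a corner: the farthest state is
# 2*(n-1) steps away, so episodes must be at least that long to see any
# positive reward from it.
for n in (3, 5, 10, 100):
    farthest = 2 * (n - 1)        # Manhattan distance from the opposite corner
    print(f"{n}x{n} grid: {n*n} states, episodes need length >= {farthest}")
```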