Value Iteration and Policy Iteration

  1. Shiyu Zhao. Chapter 4: Value Iteration and Policy Iteration. Mathematical Foundations of Reinforcement Learning.
  2. --> Youtube: Value Iteration and Policy Iteration

Value iteration

This section introduces the value iteration algorithm. It is exactly the algorithm suggested for solving the BOE. (This section covers the same material as that chapter; nothing new is added here.)

The algorithm is \[ \begin{equation} \label{eq_solution_of_BOE} v_{k+1}=\max _{\pi \in \Pi}\left(r_\pi+\gamma P_\pi v_k\right), \quad k=0,1,2, \ldots \end{equation} \]

It is guaranteed that \(v_k\) and \(\pi_k\) converge to the optimal state value and an optimal policy as \(k \rightarrow \infty\), respectively.

But \(\eqref{eq_solution_of_BOE}\) cannot be computed directly in this form. In practice, value iteration is an iterative algorithm, and each iteration consists of two steps.

Step 1: policy update. This step is to solve \[ \begin{equation} \label{eq_policy_update} \pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_k\right) \end{equation} \] where \(v_k\) is obtained in the previous iteration.

Step 2: value update. It calculates a new value \(v_{k+1}\) by \[ \begin{equation} \label{eq_value_update} v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k, \end{equation} \] where \(v_{k+1}\) will be used in the next iteration.

Note: \(v_k\) is not a state value, since it is not guaranteed to satisfy a Bellman equation.

The value iteration algorithm introduced above is in a matrix-vector form. It's useful for understanding the core idea of the algorithm.
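Before examining the elementwise form, here is a minimal numpy sketch of one matrix-vector update; the representation (a per-action reward vector `r[a]` and transition matrix `P[a]`) and the function name are my own assumptions, not from the lecture.

```python
import numpy as np

def bellman_optimality_update(r, P, v, gamma):
    """One step of v_{k+1} = max_pi (r_pi + gamma * P_pi v_k).

    r[a] : reward vector of shape (n_states,) for action a
    P[a] : transition matrix of shape (n_states, n_states) for action a
    """
    # q[a, s] = r_a(s) + gamma * sum_{s'} P_a(s, s') v(s')
    q = np.stack([r[a] + gamma * P[a] @ v for a in range(len(P))])
    # Maximizing over policies reduces to an elementwise max over actions,
    # because the maximizer is a deterministic (greedy) policy.
    return q.max(axis=0)
```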

To implement this algorithm, we need to further examine its elementwise form.

Elementwise form and implementation

Step 1

Step 1: Policy update

The elementwise form of \(\eqref{eq_policy_update}\) is \[ \pi_{k+1}(s)=\arg \max _\pi \sum_a \pi(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k\left(s^{\prime}\right)\right)}_{q_k(s, a)}, \quad s \in \mathcal{S} \]

The optimal policy solving the above optimization problem is \[ \pi_{k+1}(a \mid s)= \begin{cases} 1 & a=a_k^*(s), \\ 0 & a \neq a_k^*(s), \end{cases} \] where \(a_k^*(s)=\arg \max _a q_k(s, a)\). Here, \(\pi_{k+1}\) is called a greedy policy, since it simply selects the action with the greatest q-value.

Step 2

Step 2: Value update

The elementwise form of \(\eqref{eq_value_update}\) is \[ v_{k+1}(s)=\sum_a \pi_{k+1}(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k\left(s^{\prime}\right)\right)}_{q_k(s, a)}, \quad s \in \mathcal{S} \]

Since \(\pi_{k+1}\) is greedy, the above equation simplifies to \[ v_{k+1}(s)=\max _a q_k(s, a) \]

Algorithm

Procedure summary:

\(v_k(s) \rightarrow q_k(s, a) \rightarrow\) greedy policy \(\pi_{k+1}(a \mid s) \rightarrow\) new value \(v_{k+1}=\max _a q_k(s, a)\)

Algorithm 4.1: Value iteration algorithm
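As a concrete illustration of the procedure above, here is a minimal Python sketch (not the book's code). It assumes the model is available as arrays `p[s, a, s']` for \(p(s' \mid s, a)\) and `r[s, a]` for the expected immediate reward; the function name and the tolerance are my choices.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-6):
    """Value iteration on a finite MDP.

    p[s, a, s'] : transition probability p(s'|s, a)
    r[s, a]     : expected immediate reward, i.e. sum_r p(r|s, a) * r
    """
    n_states, n_actions, _ = p.shape
    v = np.zeros(n_states)                      # v_0 can be arbitrary
    while True:
        # q_k(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) v_k(s')
        q = r + gamma * np.einsum("sat,t->sa", p, v)
        v_new = q.max(axis=1)                   # value update under the greedy policy
        if np.max(np.abs(v_new - v)) < tol:     # stop when ||v_{k+1} - v_k|| is small
            break
        v = v_new
    policy = q.argmax(axis=1)                   # greedy policy: pi_{k+1}(s) = argmax_a q_k(s, a)
    return v_new, policy
```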

Example

We next present an example to illustrate the step-by-step implementation of the value iteration algorithm.

Figure 4.2: An example for demonstrating the implementation of the value iteration algorithm. The target area is \(s_4\). The reward settings are \(r_{\text {boundary }}=r_{\text {forbidden }}=-1\) and \(r_{\text {target }}=1\). The discount rate is \(\gamma=0.9\).

The q-table is:

Table 4.1: The expression of q(s, a) for the example as shown in Figure 4.2.

k=0

\(k=0\): let \(v_0\left(s_1\right)=v_0\left(s_2\right)=v_0\left(s_3\right)=v_0\left(s_4\right)=0\).

Step 1: Policy update: \[ \pi_1\left(a_5 \mid s_1\right)=1, \pi_1\left(a_3 \mid s_2\right)=1, \pi_1\left(a_2 \mid s_3\right)=1, \pi_1\left(a_5 \mid s_4\right)=1 . \nonumber \]

This policy is visualized in Figure (b).

Step 2: Value update: \[ v_1\left(s_1\right)=0, v_1\left(s_2\right)=1, v_1\left(s_3\right)=1, v_1\left(s_4\right)=1 . \nonumber \]

k=1

\(k=1\) : since \(v_1\left(s_1\right)=0, v_1\left(s_2\right)=1, v_1\left(s_3\right)=1, v_1\left(s_4\right)=1\), we have


Step 1: Policy update: \[ \pi_2\left(a_3 \mid s_1\right)=1, \pi_2\left(a_3 \mid s_2\right)=1, \pi_2\left(a_2 \mid s_3\right)=1, \pi_2\left(a_5 \mid s_4\right)=1 . \nonumber \]

Step 2: Value update: \[ v_2\left(s_1\right)=\gamma \cdot 1, v_2\left(s_2\right)=1+\gamma \cdot 1, v_2\left(s_3\right)=1+\gamma \cdot 1, v_2\left(s_4\right)=1+\gamma \cdot 1 . \nonumber \]

This policy is visualized in Figure (c).

The policy is already optimal.

k=2,3,...

\(k=2,3, \ldots\): stop when \(\left\|v_k-v_{k+1}\right\|\) is smaller than a predefined threshold.

Policy iteration

Policy iteration is an iterative algorithm. Each iteration has two steps.

Given a random initial policy \(\pi_0\), in each iteration of policy iteration we do

  1. policy evaluation (PE): \[ \begin{equation} \label{eq_policy_evaluation} v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k} \end{equation} \] Note: \(v_{\pi_k}\) is a state value function, so policy evaluation computes the state values of all states, not just of one specific state.

  2. policy improvement (PI): \[ \pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right) \] The maximization is componentwise.

The policy iteration algorithm leads to a sequence \[ \pi_0 \stackrel{P E}{\longrightarrow} v_{\pi_0} \stackrel{P I}{\longrightarrow} \pi_1 \stackrel{P E}{\longrightarrow} v_{\pi_1} \stackrel{P I}{\longrightarrow} \pi_2 \stackrel{P E}{\longrightarrow} v_{\pi_2} \stackrel{P I}{\longrightarrow} \ldots \nonumber \]

Q1: In the policy evaluation step, how to calculate \(v_{\pi_k}\)?

\(\eqref{eq_policy_evaluation}\) is a Bellman equation. We have learned that there are two ways to solve it.

Closed-form solution: \[ v_{\pi_k}=\left(I-\gamma P_{\pi_k}\right)^{-1} r_{\pi_k} . \]

Iterative solution: \[ \begin{equation} \label{eq_iterative_solution} v_{\pi_k}^{(j+1)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j=0,1,2, \ldots \end{equation} \]

where \(v_{\pi_k}^{(j)}\) denotes the \(j\) th estimate of \(v_{\pi_k}\). Starting from any initial guess \(v_{\pi_k}^{(0)}\), it is ensured that \(v_{\pi_k}^{(j)} \rightarrow v_{\pi_k}\) as \(j \rightarrow \infty\) (-->See the proof).

Interestingly, policy iteration is an iterative algorithm with another iterative algorithm \(\eqref{eq_iterative_solution}\) embedded in the policy evaluation step.

In theory, this embedded iterative algorithm requires an infinite number of steps (that is, \(j \rightarrow \infty\) ) to converge to the true state value \(v_{\pi_k}\). This is, however, impossible to realize.

In practice, the iterative process terminates when a certain criterion is satisfied.
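Both solutions can be sketched in a few lines of numpy, assuming \(r_{\pi_k}\) and \(P_{\pi_k}\) are given as a reward vector and a transition matrix; the function names and the stopping tolerance are my own, and this only illustrates the two formulas above.

```python
import numpy as np

def policy_evaluation_closed_form(r_pi, P_pi, gamma):
    # v_pi = (I - gamma * P_pi)^{-1} r_pi, computed by solving the linear
    # system rather than forming the inverse explicitly.
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_evaluation_iterative(r_pi, P_pi, gamma, tol=1e-8):
    # v^(j+1) = r_pi + gamma * P_pi v^(j); any initial guess converges.
    v = np.zeros(len(r_pi))
    while True:
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:     # practical stopping criterion
            return v_new
        v = v_new
```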

Q2: In the policy improvement step, why is \(\pi_{k+1}\) better than \(\pi_k\) ?

Lemma (Policy improvement). If \(\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)\), then \(v_{\pi_{k+1}} \geq v_{\pi_k}\), which means \(v_{\pi_{k+1}}(s) \geq v_{\pi_k}(s)\) for all \(s\); that is, \(\pi_{k+1}\) is better than \(\pi_k\).

Proof: Since \(v_{\pi_{k+1}}\) and \(v_{\pi_k}\) are state values, they satisfy the Bellman equations: \[ \begin{aligned} v_{\pi_{k+1}} & =r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_{k+1}}, \\ v_{\pi_k} & =r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k} . \end{aligned} \nonumber \]

Since \(\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)\), we know that \[ r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_k} \geq r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k} . \nonumber \] It then follows that \[ \begin{aligned} v_{\pi_k}-v_{\pi_{k+1}} & =\left(r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}\right)-\left(r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\right) \\ & \leq\left(r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_k}\right)-\left(r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\right) \\ & =\gamma P_{\pi_{k+1}}\left(v_{\pi_k}-v_{\pi_{k+1}}\right) . \end{aligned} \]

Therefore, \[ \begin{aligned} v_{\pi_k}-v_{\pi_{k+1}} \leq \gamma^2 P_{\pi_{k+1}}^2\left(v_{\pi_k}-v_{\pi_{k+1}}\right) \leq \ldots & \leq \gamma^n P_{\pi_{k+1}}^n\left(v_{\pi_k}-v_{\pi_{k+1}}\right) \\ & \leq \lim _{n \rightarrow \infty} \gamma^n P_{\pi_{k+1}}^n\left(v_{\pi_k}-v_{\pi_{k+1}}\right)=0 . \end{aligned} \]

The limit is due to the facts that \(\gamma^n \rightarrow 0\) as \(n \rightarrow \infty\) and \(P_{\pi_{k+1}}^n\) is a nonnegative stochastic matrix for any \(n\). Here, a stochastic matrix refers to a nonnegative matrix whose row sums are equal to one for all rows.

Q3: Why can the policy iteration algorithm eventually find an optimal policy?

The policy iteration algorithm generates two sequences. The first is a sequence of policies: \(\left\{\pi_0, \pi_1, \ldots, \pi_k, \ldots\right\}\). The second is a sequence of state values: \(\left\{v_{\pi_0}, v_{\pi_1}, \ldots, v_{\pi_k}, \ldots\right\}\). Suppose that \(v^*\) is the optimal state value. Then, \(v_{\pi_k} \leq v^*\) for all \(k\).

Since the policies are continuously improved according to the previous Lemma, we know that \[ v_{\pi_0} \leq v_{\pi_1} \leq v_{\pi_2} \leq \cdots \leq v_{\pi_k} \leq \cdots \leq v^* \nonumber \]

Since \(v_{\pi_k}\) is nondecreasing and always bounded from above by \(v^*\), it follows from the monotone convergence theorem that \(v_{\pi_k}\) converges to a constant value, denoted as \(v_{\infty}\), when \(k \rightarrow \infty\).

Now we prove that \(v_{\infty}=v^*\).

Theorem (Convergence of policy iteration). The state value sequence \(\left\{v_{\pi_k}\right\}_{k=0}^{\infty}\) generated by the policy iteration algorithm converges to the optimal state value \(v^*\). As a result, the policy sequence \(\left\{\pi_k\right\}_{k=0}^{\infty}\) converges to an optimal policy.

Proof:

The idea of the proof is to show that the policy iteration algorithm converges faster than the value iteration algorithm.

In particular, to prove the convergence of \(\left\{v_{\pi_k}\right\}_{k=0}^{\infty}\), we introduce another sequence \(\left\{v_k\right\}_{k=0}^{\infty}\) generated by \[ v_{k+1}=f\left(v_k\right)=\max _\pi\left(r_\pi+\gamma P_\pi v_k\right) \]

This iterative algorithm is exactly the value iteration algorithm. We already know that \(v_k\) converges to \(v^*\) when given any initial value \(v_0\).

For \(k=0\), we can always find a \(v_0\) such that \(v_{\pi_0} \geq v_0\) for any \(\pi_0\).

We next show that \(v_k \leq v_{\pi_k} \leq v^*\) for all \(k\) by induction.

For \(k \geq 0\), suppose that \(v_{\pi_k} \geq v_k\).

For \(k+1\), let \(\pi_k^{\prime}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_k\right)\), so that \(v_{k+1}=r_{\pi_k^{\prime}}+\gamma P_{\pi_k^{\prime}} v_k\). Then \[ \begin{aligned} v_{\pi_{k+1}}-v_{k+1} & =\left(r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\right)-\left(r_{\pi_k^{\prime}}+\gamma P_{\pi_k^{\prime}} v_k\right) \\ & \geq\left(r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_{\pi_k}\right)-\left(r_{\pi_k^{\prime}}+\gamma P_{\pi_k^{\prime}} v_k\right) \\ & \geq\left(r_{\pi_k^{\prime}}+\gamma P_{\pi_k^{\prime}} v_{\pi_k}\right)-\left(r_{\pi_k^{\prime}}+\gamma P_{\pi_k^{\prime}} v_k\right) \\ & =\gamma P_{\pi_k^{\prime}}\left(v_{\pi_k}-v_k\right), \end{aligned} \] where the first inequality holds because \(v_{\pi_{k+1}} \geq v_{\pi_k}\) (by the policy improvement lemma) and \(P_{\pi_{k+1}} \geq 0\), and the second holds because \(\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)\).

Since \(v_{\pi_k}-v_k \geq 0\) and \(P_{\pi_k^{\prime}}\) is nonnegative, we have \(P_{\pi_k^{\prime}}\left(v_{\pi_k}-v_k\right) \geq 0\) and hence \(v_{\pi_{k+1}}-v_{k+1} \geq 0\).

Therefore, we can show by induction that \(v_k \leq v_{\pi_k} \leq v^*\) for any \(k \geq 0\). Since \(v_k\) converges to \(v^*, v_{\pi_k}\) also converges to \(v^*\).

Elementwise form and implementation

Step 1

Step 1: Policy evaluation

Recall the matrix-vector form of the iterative solution used for policy evaluation: \(\eqref{eq_iterative_solution}\).

The elementwise form is: \[ v_{\pi_k}^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi_k}^{(j)}\left(s^{\prime}\right)\right), \quad s \in \mathcal{S}, \nonumber \]

In theory, the iteration requires \(j \rightarrow \infty\); in practice, stop when \(j\) is sufficiently large or when \(\left\|v_{\pi_k}^{(j+1)}-v_{\pi_k}^{(j)}\right\|\) is sufficiently small.

Step 2

Step 2: Policy improvement. Recall the matrix-vector form of the policy improvement step: \(\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)\)

The elementwise form is: \[ \pi_{k+1}(s)=\arg \max _\pi \sum_a \pi(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi_k}\left(s^{\prime}\right)\right)}_{q_{\pi_k}(s, a)}, \quad s \in \mathcal{S} . \nonumber \]

Here, \(q_{\pi_k}(s, a)\) is the action value under policy \(\pi_k\). Let \[ a_k^*(s)=\arg \max _a q_{\pi_k}(s, a) \nonumber \]

Then, the greedy policy is \[ \pi_{k+1}(a \mid s)= \begin{cases}1 & a=a_k^*(s), \\ 0 & a \neq a_k^*(s) .\end{cases} \nonumber \]

Algorithm

Algorithm 4.2: Policy iteration algorithm.
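A minimal Python sketch of the procedure, using the same assumed array representation (`p[s, a, s']`, `r[s, a]`) as before; here the policy evaluation step is solved exactly as a linear system, although the iterative solver above would work just as well.

```python
import numpy as np

def policy_iteration(p, r, gamma=0.9):
    """Policy iteration on a finite MDP with p[s, a, s'] and r[s, a]."""
    n_states, n_actions, _ = p.shape
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve v_pi = r_pi + gamma * P_pi v_pi exactly.
        r_pi = r[np.arange(n_states), policy]
        P_pi = p[np.arange(n_states), policy]
        v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy with respect to q_pi(s, a).
        q = r + gamma * np.einsum("sat,t->sa", p, v_pi)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # no change => policy is optimal
            return v_pi, policy
        policy = new_policy
```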

Example

Figure 4.3: An example for illustrating the implementation of the policy iteration algorithm.
  • The reward setting is \(r_{\text {boundary }}=-1\) and \(r_{\text {target }}=1\). The discount rate is \(\gamma=0.9\).
  • Actions: \(a_{\ell}, a_0, a_r\) represent go left, stay unchanged, and go right.
  • Aim: use policy iteration to find out the optimal policy.

k=0

Iteration \(k=0\) :

Step 1: policy evaluation. The initial policy \(\pi_0\) is the one shown in Figure (a). The Bellman equation is \[ \begin{aligned} & v_{\pi_0}\left(s_1\right)=-1+\gamma v_{\pi_0}\left(s_1\right), \\ & v_{\pi_0}\left(s_2\right)=0+\gamma v_{\pi_0}\left(s_1\right) . \end{aligned} \]

Since the equation is simple, it can be solved manually: \[ v_{\pi_0}\left(s_1\right)=-10, \quad v_{\pi_0}\left(s_2\right)=-9 . \]

In practice, the equation can be solved by the iterative algorithm in \(\eqref{eq_iterative_solution}\). For example, selecting the initial state values as \(v_{\pi_0}^{(0)}\left(s_1\right)=v_{\pi_0}^{(0)}\left(s_2\right)=0\), the iteration gives \[ \begin{aligned} & \left\{\begin{array}{l} v_{\pi_0}^{(1)}\left(s_1\right)=-1+\gamma v_{\pi_0}^{(0)}\left(s_1\right)=-1, \\ v_{\pi_0}^{(1)}\left(s_2\right)=0+\gamma v_{\pi_0}^{(0)}\left(s_1\right)=0, \end{array}\right. \\ & \left\{\begin{array}{l} v_{\pi_0}^{(2)}\left(s_1\right)=-1+\gamma v_{\pi_0}^{(1)}\left(s_1\right)=-1.9, \\ v_{\pi_0}^{(2)}\left(s_2\right)=0+\gamma v_{\pi_0}^{(1)}\left(s_1\right)=-0.9, \end{array}\right. \\ & \left\{\begin{array}{l} v_{\pi_0}^{(3)}\left(s_1\right)=-1+\gamma v_{\pi_0}^{(2)}\left(s_1\right)=-2.71, \\ v_{\pi_0}^{(3)}\left(s_2\right)=0+\gamma v_{\pi_0}^{(2)}\left(s_1\right)=-1.71, \end{array}\right. \\ & \cdots \end{aligned} \]

With more iterations, we can see the trend: \(v_{\pi_0}^{(j)}\left(s_1\right) \rightarrow v_{\pi_0}\left(s_1\right)=-10\) and \(v_{\pi_0}^{(j)}\left(s_2\right) \rightarrow\) \(v_{\pi_0}\left(s_2\right)=-9\) as \(j\) increases.
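This trend can be reproduced with a few lines of Python that simply iterate the two update equations above (a small numerical check, not part of the lecture):

```python
gamma = 0.9
v1 = v2 = 0.0                                  # v^(0)(s1) = v^(0)(s2) = 0
for j in range(200):
    # Both updates use the previous v^(j)(s1); the tuple assignment keeps the old value.
    v1, v2 = -1 + gamma * v1, 0 + gamma * v1
print(v1, v2)                                  # approaches -10.0 and -9.0
```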

Step 2: policy improvement

The expression of \(q_{\pi_k}(s, a)\) :

| \(q_{\pi_k}(s, a)\) | \(a_{\ell}\) | \(a_0\) | \(a_r\) |
|---|---|---|---|
| \(s_1\) | \(-1+\gamma v_{\pi_k}\left(s_1\right)\) | \(0+\gamma v_{\pi_k}\left(s_1\right)\) | \(1+\gamma v_{\pi_k}\left(s_2\right)\) |
| \(s_2\) | \(0+\gamma v_{\pi_k}\left(s_1\right)\) | \(1+\gamma v_{\pi_k}\left(s_2\right)\) | \(-1+\gamma v_{\pi_k}\left(s_2\right)\) |

Substituting \(v_{\pi_0}\left(s_1\right)=-10\), \(v_{\pi_0}\left(s_2\right)=-9\), and \(\gamma=0.9\) obtained in the previous policy evaluation step into the above table yields:

| \(q_{\pi_0}(s, a)\) | \(a_{\ell}\) | \(a_0\) | \(a_r\) |
|---|---|---|---|
| \(s_1\) | \(-10\) | \(-9\) | \(-7.1\) |
| \(s_2\) | \(-9\) | \(-7.1\) | \(-9.1\) |

By seeking the greatest value of \(q_{\pi_0}\), the improved policy is: \[ \pi_1\left(a_r \mid s_1\right)=1, \quad \pi_1\left(a_0 \mid s_2\right)=1 \nonumber \]

This policy is optimal after just one iteration! In a program, however, the iterations should continue until the stopping criterion is satisfied.
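The q-values above can likewise be verified with a short snippet that plugs the evaluated state values into the q-table expressions (again just a numerical check):

```python
gamma, v_s1, v_s2 = 0.9, -10.0, -9.0           # values from the policy evaluation step
q_s1 = {"a_l": -1 + gamma * v_s1, "a_0": 0 + gamma * v_s1, "a_r": 1 + gamma * v_s2}
q_s2 = {"a_l": 0 + gamma * v_s1, "a_0": 1 + gamma * v_s2, "a_r": -1 + gamma * v_s2}
print(q_s1)                                    # approximately -10, -9, -7.1
print(q_s2)                                    # approximately -9, -7.1, -9.1
print(max(q_s1, key=q_s1.get), max(q_s2, key=q_s2.get))   # greedy actions: a_r at s1, a_0 at s2
```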

A more complicated example

Setting: \(r_{\text {boundary }}=-1\), \(r_{\text {forbidden }}=-10\), and \(r_{\text {target }}=1\). The discount rate is \(\gamma=0.9\). The policy iteration algorithm converges to the optimal policy (Figure 4.4(h)) when starting from a random initial policy (Figure 4.4(a)).

Figure 4.4

Truncated policy iteration

The above steps of the two algorithms can be illustrated as

Policy iteration: \(\pi_0 \xrightarrow{P E} v_{\pi_0} \xrightarrow{P I} \pi_1 \xrightarrow{P E} v_{\pi_1} \xrightarrow{P I} \pi_2 \xrightarrow{P E} v_{\pi_2} \xrightarrow{P I} \ldots\).

Value iteration: \(\quad v_0 \xrightarrow{P U} \pi_1^{\prime} \xrightarrow{V U} v_1 \xrightarrow{P U} \pi_2^{\prime} \xrightarrow{V U} v_2 \xrightarrow{P U} \ldots\)

It can be seen that the procedures of the two algorithms are very similar. We examine their value steps more closely to see the difference between the two algorithms. In particular, let both algorithms start from the same initial condition: \(v_0=v_{\pi_0}\).

Table 4.6: A comparison between the implementation steps of policy iteration and value iteration.

The procedures of the two algorithms are listed in Table 4.6.

  • In the first three steps, the two algorithms generate the same results since \(v_0=v_{\pi_0}\). They become different in the fourth step.
  • During the fourth step, the value iteration algorithm executes \(v_1=r_{\pi_1}+\gamma P_{\pi_1} v_0\), which is a one-step calculation, whereas the policy iteration algorithm solves \(v_{\pi_1}=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}\), which requires an infinite number of iterations.

If we explicitly write out the iterative process for solving \(v_{\pi_1}=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}\) in the fourth step, everything becomes clear. By letting \(v_{\pi_1}^{(0)}=v_0\), we have \[ \begin{aligned} v_{\pi_1}^{(0)} & =v_0, \\ v_{\pi_1}^{(1)} & =r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(0)}, \\ v_{\pi_1}^{(2)} & =r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(1)}, \\ & \vdots \\ v_{\pi_1}^{(j)} & =r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(j-1)}, \\ & \vdots \\ v_{\pi_1}^{(\infty)} & =r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(\infty)}=v_{\pi_1} . \end{aligned} \nonumber \]

Comparisons

The following observations can be obtained from the above process.

  • If the iteration is run only once, then \(v_{\pi_1}^{(1)}\) is actually \(v_1\), as calculated in the value iteration algorithm.
  • If the iteration is run an infinite number of times, then \(v_{\pi_1}^{(\infty)}\) is actually \(v_{\pi_1}\), as calculated in the policy iteration algorithm.
  • If the iteration is run a finite number of times (denoted as \(j_{\text {truncate }}\) ), then such an algorithm is called truncated policy iteration. It is called truncated because the remaining iterations from \(j_{\text {truncate }}\) to \(\infty\) are truncated.

As a result, the value iteration and policy iteration algorithms can be viewed as two extreme cases of the truncated policy iteration algorithm: value iteration terminates at \(j_{\text {truncate }}=1\), and policy iteration terminates at \(j_{\text {truncate }}=\infty\).

It should be noted that, although the above comparison is illustrative, it is based on the condition that \(v_{\pi_1}^{(0)}=v_0=v_{\pi_0}\). The two algorithms cannot be directly compared without this condition.

Truncated policy iteration algorithm

In a nutshell, the truncated policy iteration algorithm is the same as the policy iteration algorithm except that it merely runs a finite number of iterations in the policy evaluation step.

Algorithm 4.3: Truncated policy iteration algorithm.
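A minimal Python sketch of truncated policy iteration, using the same assumed array representation as before; `j_truncate` controls how many evaluation sweeps are run. In this sketch the greedy policy update is performed before the truncated evaluation, so that `j_truncate = 1` reproduces value iteration exactly, while a large `j_truncate` approaches policy iteration.

```python
import numpy as np

def truncated_policy_iteration(p, r, gamma=0.9, j_truncate=5, n_iterations=200):
    """Truncated policy iteration with p[s, a, s'] and r[s, a]."""
    n_states, n_actions, _ = p.shape
    v = np.zeros(n_states)
    for _ in range(n_iterations):
        # Policy update: greedy with respect to the current value estimate.
        q = r + gamma * np.einsum("sat,t->sa", p, v)
        policy = q.argmax(axis=1)
        # Truncated policy evaluation: only j_truncate sweeps, warm-started from v.
        r_pi = r[np.arange(n_states), policy]
        P_pi = p[np.arange(n_states), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
    return v, policy
```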

Lemma: Value Improvement

Consider the iterative algorithm for solving the policy evaluation step: \[ v_{\pi_k}^{(j+1)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j=0,1,2, \ldots \]

If the initial guess is selected as \(v_{\pi_k}^{(0)}=v_{\pi_{k-1}}\), it holds that \[ v_{\pi_k}^{(j+1)} \geq v_{\pi_k}^{(j)} \] for every \(j=0,1,2, \ldots\).

Proof: First, since \(v_{\pi_k}^{(j)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(j-1)}\) and \(v_{\pi_k}^{(j+1)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(j)}\), we have \[ v_{\pi_k}^{(j+1)}-v_{\pi_k}^{(j)}=\gamma P_{\pi_k}\left(v_{\pi_k}^{(j)}-v_{\pi_k}^{(j-1)}\right)=\cdots=\gamma^j P_{\pi_k}^j\left(v_{\pi_k}^{(1)}-v_{\pi_k}^{(0)}\right) \]

Second, since \(v_{\pi_k}^{(0)}=v_{\pi_{k-1}}\), we have \[ v_{\pi_k}^{(1)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(0)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_{k-1}} \geq r_{\pi_{k-1}}+\gamma P_{\pi_{k-1}} v_{\pi_{k-1}}=v_{\pi_{k-1}}=v_{\pi_k}^{(0)}, \] where the inequality is due to \(\pi_k=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_{k-1}}\right)\). Substituting \(v_{\pi_k}^{(1)} \geq v_{\pi_k}^{(0)}\) into the equation above yields \(v_{\pi_k}^{(j+1)} \geq v_{\pi_k}^{(j)}\).

TODO

A proof that truncated policy iteration produces the optimal policy is not given here.

Relationships between the three algorithms

Figure 4.5: An illustration of the relationships between the value iteration, policy iteration, and truncated policy iteration algorithms.