Stationary Distribution of a Markov Decision Process

The stationary distribution of \(S\) under policy \(\pi\) can be denoted by \(\left\{d_\pi(s)\right\}_{s \in \mathcal{S}}\). By definition, \(d_\pi(s) \geq 0\) and \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\).

Let \(n_\pi(s)\) denote the number of times that \(s\) has been visited in a very long episode generated by \(\pi\). Then, \(d_\pi(s)\) can be approximated by \[ d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s^{\prime} \in \mathcal{S}} n_\pi\left(s^{\prime}\right)}. \] Meanwhile, the converged values \(d_\pi(s)\) can be computed directly by solving the equation \[ d_\pi^T=d_\pi^T P_\pi, \] i.e., \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1.
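As a quick numerical check, the sketch below (assuming NumPy and a small hypothetical three-state transition matrix `P_pi`, which is not from the source) estimates \(d_\pi\) from visit counts along a long simulated episode and compares the result with the left eigenvector of \(P_\pi\) associated with the eigenvalue 1.

```python
import numpy as np

# Hypothetical 3-state transition matrix under a fixed policy pi (rows sum to 1).
P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

rng = np.random.default_rng(0)

# Monte Carlo estimate: count visits n_pi(s) along one long episode.
n_steps = 200_000
counts = np.zeros(3)
s = 0
for _ in range(n_steps):
    counts[s] += 1
    s = rng.choice(3, p=P_pi[s])
d_mc = counts / counts.sum()

# Direct computation: left eigenvector of P_pi for eigenvalue 1, i.e. d^T = d^T P_pi.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_eig = v / v.sum()

print(d_mc)   # close to d_eig when the episode is long enough
print(d_eig)
```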

Sources:

  1. Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.

Interpretation of \(P_\pi^k(k=1,2,3, \ldots)\)

The key tool for analyzing the stationary distribution is \(P_\pi \in \mathbb{R}^{n \times n}\), the probability transition matrix under the given policy \(\pi\).

If the states are indexed as \(s_1, \ldots, s_n\), then \(\left[P_\pi\right]_{i j}\) is defined as the probability of the agent moving from \(s_i\) to \(s_j\). The probability of the agent transitioning from \(s_i\) to \(s_j\) using exactly \(k\) steps is denoted as \[ p_{i j}^{(k)}=\operatorname{Pr}\left(S_{t_k}=j \mid S_{t_0}=i\right), \] where \(t_0\) and \(t_k\) are the initial and \(k\)th time steps, respectively. First, by the definition of \(P_\pi\), we have \[ \color{orange}{\left[P_\pi\right]_{i j}=p_{i j}^{(1)}}, \] which means that \(\left[P_\pi\right]_{i j}\) is the probability of transitioning from \(s_i\) to \(s_j\) using a single step. Second, consider \(P_\pi^2\). It can be verified that \[ \left[P_\pi^2\right]_{i j}=\left[P_\pi P_\pi\right]_{i j}=\sum_{q=1}^n\left[P_\pi\right]_{i q}\left[P_\pi\right]_{q j} . \]

Since \(\left[P_\pi\right]_{i q}\left[P_\pi\right]_{q j}\) is the joint probability of transitioning from \(s_i\) to \(s_q\) and then from \(s_q\) to \(s_j\), we know that \(\left[P_\pi^2\right]_{i j}\) is the probability of transitioning from \(s_i\) to \(s_j\) using exactly two steps. That is, \[ \color{orange}{\left[P_\pi^2\right]_{i j}=p_{i j}^{(2)}}. \]

Similarly, we know that \[ \color{orange}{\left[P_\pi^k\right]_{i j}=p_{i j}^{(k)}}, \] which means that \(\left[P_\pi^k\right]_{i j}\) is the probability of transitioning from \(s_i\) to \(s_j\) using exactly \(k\) steps.
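The identity \(\left[P_\pi^k\right]_{i j}=p_{i j}^{(k)}\) can also be checked numerically: the \(k\)-th matrix power should match the empirical \(k\)-step transition frequencies. A minimal sketch, reusing the hypothetical `P_pi` introduced above:

```python
import numpy as np

P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

k = 3
P_k = np.linalg.matrix_power(P_pi, k)   # [P_pi^k]_{ij} = p_ij^{(k)}

# Empirical check: simulate many k-step trajectories starting from s_1 (index 0).
rng = np.random.default_rng(1)
n_trials = 100_000
end_counts = np.zeros(3)
for _ in range(n_trials):
    s = 0
    for _ in range(k):
        s = rng.choice(3, p=P_pi[s])
    end_counts[s] += 1

print(P_k[0])                   # exact k-step probabilities p_1j^{(k)}
print(end_counts / n_trials)    # empirical estimate, close to P_k[0]
```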

Definition of stationary distributions

Let \(d_0 \in \mathbb{R}^n\) be a vector representing the probability distribution of the states at the initial time step. For example, if \(s\) is always selected as the starting state, then \(d_0(s)=1\) and the other entries of \(d_0\) are 0. Let \(d_k \in \mathbb{R}^n\) be the vector representing the probability distribution obtained after exactly \(k\) steps starting from \(d_0\). Then, we have \[ d_k\left(s_i\right)=\sum_{j=1}^n d_0\left(s_j\right)\left[P_\pi^k\right]_{j i}, \quad i=1,2, \ldots, n. \]

This equation indicates that the probability of the agent visiting \(s_i\) at step \(k\) equals the sum, over all starting states \(s_j\), of the probability of starting from \(s_j\) times the probability of transitioning from \(s_j\) to \(s_i\) using exactly \(k\) steps. The matrix-vector form of the last equation is \[ \begin{equation} \label{eq8_7} d_k^T=d_0^T P_\pi^k . \end{equation} \]

When we consider the long-term behavior of the Markov process, it holds under certain conditions that \[ \begin{equation} \label{eq8_8} \lim _{k \rightarrow \infty} P_\pi^k=\mathbf{1}_n d_\pi^T, \end{equation} \] where \(\mathbf{1}_n=[1, \ldots, 1]^T \in \mathbb{R}^n\) and \(\mathbf{1}_n d_\pi^T\) is a constant matrix with all its rows equal to \(d_\pi^T\). The conditions under which \(\eqref{eq8_8}\) is valid will be discussed later. Substituting \(\eqref{eq8_8}\) into \(\eqref{eq8_7}\) yields \[ \color{red}{\lim _{k \rightarrow \infty} d_k^T=d_0^T \lim _{k \rightarrow \infty} P_\pi^k=d_0^T \mathbf{1}_n d_\pi^T=d_\pi^T}, \] where the last equality is valid because \(d_0^T \mathbf{1}_n=1\).

The last equation means that the state distribution \(d_k\) converges to a constant vector \(d_\pi\), which is called the limiting distribution.

The limiting distribution depends on the system model and the policy \(\pi\). Interestingly, it is independent of the initial distribution \(d_0\). That is, regardless of which state the agent starts from, the probability distribution of the agent after a sufficiently long period can always be described by the limiting distribution.
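The independence from \(d_0\) is easy to observe numerically: propagating two different initial distributions through \(P_\pi^k\) for a large \(k\) gives essentially the same vector, and every row of \(P_\pi^k\) approaches \(d_\pi^T\). A sketch with the same hypothetical `P_pi`:

```python
import numpy as np

P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

P_k = np.linalg.matrix_power(P_pi, 100)   # P_pi^k for a large k

d0_a = np.array([1.0, 0.0, 0.0])          # always start in s_1
d0_b = np.array([0.2, 0.5, 0.3])          # a different initial distribution

print(P_k)          # all rows are (numerically) equal to d_pi^T, i.e. 1_n d_pi^T
print(d0_a @ P_k)   # both products converge to the same d_pi^T,
print(d0_b @ P_k)   # independent of the initial distribution d_0
```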

The value of \(d_\pi\) can be calculated in the following way. Taking the limit of both sides of \(d_k^T=d_{k-1}^T P_\pi\)¹ gives \[ \lim _{k \rightarrow \infty} d_k^T=\lim _{k \rightarrow \infty} d_{k-1}^T P_\pi, \] and hence \[ \begin{equation} \label{eq8_10} \color{pink}{d_\pi^T=d_\pi^T P_\pi}. \end{equation} \]

As a result, \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1. The solution of \(\eqref{eq8_10}\) is called the stationary distribution. It holds that \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\) and \(d_\pi(s)>0\) for all \(s \in \mathcal{S}\). The reason why \(d_\pi(s)>0\) (not \(d_\pi(s) \geq 0\)) will be explained later (#TODO).
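In practice, \(\eqref{eq8_10}\) together with the normalization \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\) can also be solved as a linear system rather than through an eigendecomposition. A minimal sketch (again using the hypothetical `P_pi` from above) that solves \((P_\pi^T-I)\,d_\pi=0\) subject to \(\mathbf{1}_n^T d_\pi=1\) by least squares:

```python
import numpy as np

P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])
n = P_pi.shape[0]

# Stack (P_pi^T - I) d = 0 with the normalization 1^T d = 1 and solve by least squares.
A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
d_pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(d_pi)                 # stationary distribution; entries are positive and sum to 1
print(d_pi @ P_pi - d_pi)   # approximately zero, confirming d_pi^T = d_pi^T P_pi
```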


  1. This is because \(d_k^T=d_0^T P_\pi^k\).↩︎