Stationary Distribution of a Markov Decision Process

The stationary distribution of \(S\) under policy \(\pi\) can be denoted by \(\left\{d_\pi(s)\right\}_{s \in \mathcal{S}}\). By definition, \(d_\pi(s) \geq 0\) and \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\).

Let \(n_\pi(s)\) denote the number of times that \(s\) has been visited in a very long episode generated by \(\pi\). Then, \(d_\pi(s)\) can be approximated by \[ d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s^{\prime} \in \mathcal{S}} n_\pi\left(s^{\prime}\right)}. \] Meanwhile, the converged values \(d_\pi(s)\) can be computed directly by solving the equation \[ d_\pi^T=d_\pi^T P_\pi, \] i.e., \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1.
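As a quick numerical check, the sketch below (assuming NumPy and a small hypothetical three-state transition matrix `P_pi`, which is not from the source) estimates \(d_\pi\) from visit counts along a long simulated episode and compares the result with the left eigenvector of \(P_\pi\) associated with the eigenvalue 1.

```python
import numpy as np

# Hypothetical 3-state transition matrix under a fixed policy pi (rows sum to 1).
P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

rng = np.random.default_rng(0)

# Monte Carlo estimate: count visits n_pi(s) along one long episode.
n_steps = 200_000
counts = np.zeros(3)
s = 0
for _ in range(n_steps):
    counts[s] += 1
    s = rng.choice(3, p=P_pi[s])
d_mc = counts / counts.sum()

# Direct computation: left eigenvector of P_pi for eigenvalue 1, i.e. d^T = d^T P_pi.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_eig = v / v.sum()

print(d_mc)   # close to d_eig when the episode is long enough
print(d_eig)
```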

Sources:

  1. Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.

Interpretation of \(P_\pi^k(k=1,2,3, \ldots)\)

The key tool for analyzing the stationary distribution is \(P_\pi \in \mathbb{R}^{n \times n}\), the probability transition matrix under the given policy \(\pi\).

If the states are indexed as \(s_1, \ldots, s_n\), then \(\left[P_\pi\right]_{i j}\) is defined as the probability of the agent moving from \(s_i\) to \(s_j\). The probability of the agent transitioning from \(s_i\) to \(s_j\) using exactly \(k\) steps is denoted as \[ p_{i j}^{(k)}=\operatorname{Pr}\left(S_{t_k}=j \mid S_{t_0}=i\right), \] where \(t_0\) and \(t_k\) are the initial and \(k\)th time steps, respectively. First, by the definition of \(P_\pi\), we have \[ \color{orange}{\left[P_\pi\right]_{i j}=p_{i j}^{(1)}}, \] which means that \(\left[P_\pi\right]_{i j}\) is the probability of transitioning from \(s_i\) to \(s_j\) using a single step. Second, consider \(P_\pi^2\). It can be verified that \[ \left[P_\pi^2\right]_{i j}=\left[P_\pi P_\pi\right]_{i j}=\sum_{q=1}^n\left[P_\pi\right]_{i q}\left[P_\pi\right]_{q j} . \]

Since \(\left[P_\pi\right]_{i q}\left[P_\pi\right]_{q j}\) is the joint probability of transitioning from \(s_i\) to \(s_q\) and then from \(s_q\) to \(s_j\), we know that \(\left[P_\pi^2\right]_{i j}\) is the probability of transitioning from \(s_i\) to \(s_j\) using exactly two steps. That is, \[ \color{orange}{\left[P_\pi^2\right]_{i j}=p_{i j}^{(2)}}. \]

Similarly, we know that \[ \color{orange}{\left[P_\pi^k\right]_{i j}=p_{i j}^{(k)}}, \] which means that \(\left[P_\pi^k\right]_{i j}\) is the probability of transitioning from \(s_i\) to \(s_j\) using exactly \(k\) steps.
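The identity \(\left[P_\pi^k\right]_{i j}=p_{i j}^{(k)}\) can also be checked numerically: the \(k\)-th matrix power should match the empirical \(k\)-step transition frequencies. A minimal sketch, reusing the hypothetical `P_pi` introduced above:

```python
import numpy as np

P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

k = 3
P_k = np.linalg.matrix_power(P_pi, k)   # [P_pi^k]_{ij} = p_ij^{(k)}

# Empirical check: simulate many k-step trajectories starting from s_1 (index 0).
rng = np.random.default_rng(1)
n_trials = 100_000
end_counts = np.zeros(3)
for _ in range(n_trials):
    s = 0
    for _ in range(k):
        s = rng.choice(3, p=P_pi[s])
    end_counts[s] += 1

print(P_k[0])                   # exact k-step probabilities p_1j^{(k)}
print(end_counts / n_trials)    # empirical estimate, close to P_k[0]
```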

Definition of stationary distributions

Let \(d_0 \in \mathbb{R}^n\) be a vector representing the probability distribution of the states at the initial time step. For example, if \(s\) is always selected as the starting state, then \(d_0(s)=1\) and the other entries of \(d_0\) are 0. Let \(d_k \in \mathbb{R}^n\) be the vector representing the probability distribution obtained after exactly \(k\) steps starting from \(d_0\). Then, we have \[ d_k\left(s_i\right)=\sum_{j=1}^n d_0\left(s_j\right)\left[P_\pi^k\right]_{j i}, \quad i=1,2, \ldots, n. \]

This equation indicates that the probability of the agent visiting \(s_i\) at step \(k\) equals the sum, over all starting states \(s_j\), of the probability of starting from \(s_j\) times the probability of transitioning from \(s_j\) to \(s_i\) using exactly \(k\) steps. The matrix-vector form of the last equation is \[ \begin{equation} \label{eq8_7} d_k^T=d_0^T P_\pi^k . \end{equation} \]

When we consider the long-term behavior of the Markov process, it holds under certain conditions that \[ \begin{equation} \label{eq8_8} \lim _{k \rightarrow \infty} P_\pi^k=\mathbf{1}_n d_\pi^T, \end{equation} \] where \(\mathbf{1}_n=[1, \ldots, 1]^T \in \mathbb{R}^n\) and \(\mathbf{1}_n d_\pi^T\) is a constant matrix with all its rows equal to \(d_\pi^T\). The conditions under which \(\eqref{eq8_8}\) is valid will be discussed later. Substituting \(\eqref{eq8_8}\) into \(\eqref{eq8_7}\) yields \[ \color{red}{\lim _{k \rightarrow \infty} d_k^T=d_0^T \lim _{k \rightarrow \infty} P_\pi^k=d_0^T \mathbf{1}_n d_\pi^T=d_\pi^T}, \] where the last equality is valid because \(d_0^T \mathbf{1}_n=1\).

The last equation means that the state distribution \(d_k\) converges to a constant vector \(d_\pi\), which is called the limiting distribution.

The limiting distribution depends on the system model and the policy \(\pi\). Interestingly, it is independent of the initial distribution \(d_0\). That is, regardless of which state the agent starts from, the probability distribution of the agent after a sufficiently long period can always be described by the limiting distribution.
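The independence from \(d_0\) is easy to observe numerically: propagating two different initial distributions through \(P_\pi^k\) for a large \(k\) gives essentially the same vector, and every row of \(P_\pi^k\) approaches \(d_\pi^T\). A sketch with the same hypothetical `P_pi`:

```python
import numpy as np

P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

P_k = np.linalg.matrix_power(P_pi, 100)   # P_pi^k for a large k

d0_a = np.array([1.0, 0.0, 0.0])          # always start in s_1
d0_b = np.array([0.2, 0.5, 0.3])          # a different initial distribution

print(P_k)          # all rows are (numerically) equal to d_pi^T, i.e. 1_n d_pi^T
print(d0_a @ P_k)   # both products converge to the same d_pi^T,
print(d0_b @ P_k)   # independent of the initial distribution d_0
```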

The value of \(d_\pi\) can be calculated in the following way. Taking the limit of both sides of \(d_k^T=d_{k-1}^T P_\pi\)¹ gives \[ \lim _{k \rightarrow \infty} d_k^T=\lim _{k \rightarrow \infty} d_{k-1}^T P_\pi, \] and hence \[ \begin{equation} \label{eq8_10} \color{pink}{d_\pi^T=d_\pi^T P_\pi}. \end{equation} \]

As a result, \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1. The solution of \(\eqref{eq8_10}\) is called the stationary distribution. It holds that \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\) and \(d_\pi(s)>0\) for all \(s \in \mathcal{S}\). The reason why \(d_\pi(s)>0\) (not \(d_\pi(s) \geq 0\)) will be explained later (#TODO).
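In practice, \(\eqref{eq8_10}\) together with the normalization \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\) can also be solved as a linear system rather than through an eigendecomposition. A minimal sketch (again using the hypothetical `P_pi` from above) that solves \((P_\pi^T-I)\,d_\pi=0\) subject to \(\mathbf{1}_n^T d_\pi=1\) by least squares:

```python
import numpy as np

P_pi = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])
n = P_pi.shape[0]

# Stack (P_pi^T - I) d = 0 with the normalization 1^T d = 1 and solve by least squares.
A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
d_pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(d_pi)                 # stationary distribution; entries are positive and sum to 1
print(d_pi @ P_pi - d_pi)   # approximately zero, confirming d_pi^T = d_pi^T P_pi
```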


  1. This is because \(d_k^T=d_0^T P_\pi^k\).↩︎