Stationary Distribution of a Markov Decision Process

The stationary distribution of the states $s \in \mathcal{S}$ under policy $\pi$ can be denoted by $\{d_\pi(s)\}_{s \in \mathcal{S}}$. By definition, $d_\pi(s) \ge 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.

Let $n_\pi(s)$ denote the number of times that $s$ has been visited in a very long episode generated by $\pi$. Then, $d_\pi(s)$ can be approximated by
$$d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s' \in \mathcal{S}} n_\pi(s')}.$$
Meanwhile, the converged values $d_\pi(s)$ can be computed directly by solving the equation $d_\pi^T = d_\pi^T P_\pi$, i.e., $d_\pi$ is the left eigenvector of $P_\pi$ associated with the eigenvalue $1$.
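To make both routes concrete, here is a minimal numerical sketch (not from the source): it estimates $d_\pi$ by counting visits along one long simulated episode and compares the result with the left-eigenvector solution. The 3-state transition matrix `P_pi` is an assumed, made-up example.

```python
import numpy as np

# Estimate d_pi by counting state visits in one long episode, then compare
# with the left-eigenvector solution of d^T = d^T P.
# NOTE: the 3-state transition matrix P_pi below is a made-up example.
P_pi = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
n = P_pi.shape[0]

rng = np.random.default_rng(0)
T = 200_000                       # length of the "very long" episode
counts = np.zeros(n)              # n_pi(s): number of visits to each state
s = 0                             # arbitrary starting state
for _ in range(T):
    counts[s] += 1
    s = rng.choice(n, p=P_pi[s])  # sample the next state from row s of P_pi

d_mc = counts / counts.sum()      # d_pi(s) ~ n_pi(s) / sum_s' n_pi(s')

# Left eigenvector of P_pi for eigenvalue 1 = right eigenvector of P_pi^T.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_eig = v / v.sum()

print("Monte Carlo estimate:", d_mc)
print("Eigenvector solution:", d_eig)
```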

Sources:

  1. Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.

Interpretation of $P_\pi^k$ $(k = 1, 2, 3, \dots)$

The key tool for analyzing the stationary distribution is $P_\pi \in \mathbb{R}^{n \times n}$, which is the probability transition matrix under the given policy $\pi$.

If the states are indexed as $s_1, \dots, s_n$, then $[P_\pi]_{ij}$ is defined as the probability of the agent moving from $s_i$ to $s_j$. The probability of the agent transitioning from $s_i$ to $s_j$ using exactly $k$ steps is denoted as
$$p_{ij}^{(k)} = \Pr(S_{t_k} = s_j \mid S_{t_0} = s_i),$$
where $t_0$ and $t_k$ are the initial and $k$th time steps, respectively. First, by the definition of $P_\pi$, we have $[P_\pi]_{ij} = p_{ij}^{(1)}$, which means that $[P_\pi]_{ij}$ is the probability of transitioning from $s_i$ to $s_j$ using a single step. Second, consider $P_\pi^2$. It can be verified that
$$[P_\pi^2]_{ij} = [P_\pi P_\pi]_{ij} = \sum_{q=1}^{n} [P_\pi]_{iq} [P_\pi]_{qj}.$$

Since $[P_\pi]_{iq} [P_\pi]_{qj}$ is the joint probability of transitioning from $s_i$ to $s_q$ and then from $s_q$ to $s_j$, we know that $[P_\pi^2]_{ij}$ is the probability of transitioning from $s_i$ to $s_j$ using exactly two steps. That is,
$$[P_\pi^2]_{ij} = p_{ij}^{(2)}.$$

Similarly, we know that $[P_\pi^k]_{ij} = p_{ij}^{(k)}$, which means that $[P_\pi^k]_{ij}$ is the probability of transitioning from $s_i$ to $s_j$ using exactly $k$ steps.
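As a quick numerical check of this interpretation (again using the same assumed example matrix as above), one can verify that the matrix power $P_\pi^2$ coincides with the explicit sum over the intermediate state:

```python
import numpy as np

# Check that [P_pi^2]_{ij} = sum_q [P_pi]_{iq} [P_pi]_{qj}, i.e. that matrix
# powers give multi-step transition probabilities (same assumed P_pi as above).
P_pi = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])

P2_matmul = P_pi @ P_pi                       # [P_pi^2]_{ij}
P2_sum = np.einsum('iq,qj->ij', P_pi, P_pi)   # explicit sum over intermediate q
assert np.allclose(P2_matmul, P2_sum)

# k-step transition probabilities for a few values of k.
for k in (1, 2, 3):
    print(f"k = {k}:\n{np.linalg.matrix_power(P_pi, k)}")
```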

Definition of stationary distributions

Let $d_0 \in \mathbb{R}^n$ be a vector representing the probability distribution of the states at the initial time step. For example, if $s$ is always selected as the starting state, then $d_0(s) = 1$ and the other entries of $d_0$ are $0$. Let $d_k \in \mathbb{R}^n$ be the vector representing the probability distribution obtained after exactly $k$ steps starting from $d_0$. Then, we have
$$d_k(s_i) = \sum_{j=1}^{n} d_0(s_j) [P_\pi^k]_{ji}, \quad i = 1, 2, \dots$$

This equation indicates that the probability of the agent visiting $s_i$ at step $k$ equals the sum of the probabilities of the agent transitioning from $\{s_j\}_{j=1}^{n}$ to $s_i$ using exactly $k$ steps. The matrix-vector form of the last equation is
$$d_k^T = d_0^T P_\pi^k. \tag{1}$$

When we consider the long-term behavior of the Markov process, it holds under certain conditions that
$$\lim_{k \to \infty} P_\pi^k = \mathbf{1}_n d_\pi^T, \tag{2}$$
where $\mathbf{1}_n = [1, \dots, 1]^T \in \mathbb{R}^n$ and $\mathbf{1}_n d_\pi^T$ is a constant matrix with all its rows equal to $d_\pi^T$. The conditions under which (2) is valid will be discussed later. Substituting (2) into (1) yields
$$\lim_{k \to \infty} d_k^T = d_0^T \lim_{k \to \infty} P_\pi^k = d_0^T \mathbf{1}_n d_\pi^T = d_\pi^T,$$
where the last equality is valid because $d_0^T \mathbf{1}_n = 1$.

The last equation means that the state distribution $d_k$ converges to a constant value $d_\pi$, which is called the limiting distribution.

The limiting distribution depends on the system model and the policy $\pi$. Interestingly, it is independent of the initial distribution $d_0$. That is, regardless of which state the agent starts from, the probability distribution of the agent after a sufficiently long period can always be described by the limiting distribution.
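A short sketch of this convergence, using the same assumed example matrix as above: propagating two different initial distributions through equation (1) yields (numerically) the same $d_k$ once $k$ is large.

```python
import numpy as np

# Propagate two different initial distributions with d_k^T = d_0^T P_pi^k
# (equation (1)) and watch both converge to the same limiting distribution
# (same assumed example matrix as above).
P_pi = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])

d0_a = np.array([1.0, 0.0, 0.0])   # always start from s1
d0_b = np.array([0.0, 0.0, 1.0])   # always start from s3

for k in (1, 5, 20, 100):
    Pk = np.linalg.matrix_power(P_pi, k)
    print(f"k = {k:3d}  from d0_a: {d0_a @ Pk}  from d0_b: {d0_b @ Pk}")
# For large k, every row of P_pi^k approaches d_pi^T (equation (2)),
# so d_k no longer depends on d_0.
```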

The value of $d_\pi$ can be calculated in the following way. Taking the limit of both sides of $d_k^T = d_{k-1}^T P_\pi$¹ gives $\lim_{k \to \infty} d_k^T = \lim_{k \to \infty} d_{k-1}^T P_\pi$ and hence
$$d_\pi^T = d_\pi^T P_\pi. \tag{3}$$

As a result, $d_\pi$ is the left eigenvector of $P_\pi$ associated with the eigenvalue $1$. The solution of (3) is called the stationary distribution. It holds that $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$ and $d_\pi(s) > 0$ for all $s \in \mathcal{S}$. The reason why $d_\pi(s) > 0$ (not merely $d_\pi(s) \ge 0$) will be explained later (#TODO).
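One simple way to solve (3) numerically, sketched below with the same assumed example matrix, is to rewrite it as the linear system $(P_\pi^T - I)\,d_\pi = 0$ together with the normalization $\mathbf{1}_n^T d_\pi = 1$, replacing one redundant row of the singular system with the normalization constraint:

```python
import numpy as np

# Solve equation (3), d_pi^T = d_pi^T P_pi, as a linear system:
# (P_pi^T - I) d = 0 together with 1^T d = 1. One redundant row of the
# singular system is replaced by the normalization constraint.
# (Same assumed example matrix as above.)
P_pi = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
n = P_pi.shape[0]

A = P_pi.T - np.eye(n)
A[-1, :] = 1.0                  # replace the last equation with sum(d) = 1
b = np.zeros(n)
b[-1] = 1.0

d_pi = np.linalg.solve(A, b)
print("stationary distribution:", d_pi)                    # all entries > 0
print("check d^T P_pi = d^T:", np.allclose(d_pi @ P_pi, d_pi))
```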


  1. This is because $d_k^T = d_0^T P_\pi^k$.↩︎