The stationary distribution of $s$ under policy $\pi$ is denoted by $d_\pi(s)$. By definition, $d_\pi(s) \geq 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.
Let $n_\pi(s)$ denote the number of times that $s$ has been visited in a very long episode generated by $\pi$. Then, $d_\pi(s)$ can be approximated by

$$d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s' \in \mathcal{S}} n_\pi(s')}.$$

Meanwhile, the converged values can be computed directly by solving the equation

$$d_\pi^T = d_\pi^T P_\pi,$$

i.e., $d_\pi$ is the left eigenvector of $P_\pi$ associated with the eigenvalue 1.
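As a quick numerical sketch (not from the source), assuming a made-up 3-state transition matrix `P_pi` under some fixed policy, the visit-frequency approximation of $d_\pi$ can be checked with NumPy:

```python
import numpy as np

# Hypothetical 3-state transition matrix P_pi under a fixed policy
# (rows sum to 1); the numbers are made up for illustration only.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.2, 0.3, 0.5]])

rng = np.random.default_rng(0)

# Approximate d_pi(s) by the visit frequency n_pi(s) / total steps
# in one very long episode.
steps = 1_000_000
counts = np.zeros(3)
s = 0  # arbitrary starting state
for _ in range(steps):
    counts[s] += 1
    s = rng.choice(3, p=P_pi[s])

print(counts / steps)  # Monte Carlo estimate of d_pi
```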
Sources:
- Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.
Interpretation of $P_\pi^k$
The key tool for analyzing the stationary distribution is $P_\pi \in \mathbb{R}^{n \times n}$, which is the probability transition matrix under the given policy $\pi$.
If the states are indexed as $s_1, \dots, s_n$, then $[P_\pi]_{ij}$ is defined as the probability of the agent moving from $s_i$ to $s_j$. The probability of the agent transitioning from $s_i$ to $s_j$ using exactly $k$ steps is denoted as

$$[P_\pi^k]_{ij} = \Pr(S_{t_k} = s_j \mid S_{t_0} = s_i),$$

where $t_0$ and $t_k$ are the initial and $k$th time steps, respectively. First, by the definition of $P_\pi$, we have

$$[P_\pi]_{ij} = \Pr(S_{t_1} = s_j \mid S_{t_0} = s_i),$$

which means that $[P_\pi]_{ij}$ is the probability of transitioning from $s_i$ to $s_j$ using a single step. Second, consider $P_\pi^2$. It can be verified that

$$[P_\pi^2]_{ij} = \sum_{q=1}^{n} [P_\pi]_{iq}[P_\pi]_{qj}.$$

Since $[P_\pi]_{iq}[P_\pi]_{qj}$ is the joint probability of transitioning from $s_i$ to $s_q$ and then from $s_q$ to $s_j$, we know that $[P_\pi^2]_{ij}$ is the probability of transitioning from $s_i$ to $s_j$ using exactly two steps. That is,

$$[P_\pi^2]_{ij} = \Pr(S_{t_2} = s_j \mid S_{t_0} = s_i).$$

Similarly, we know that

$$[P_\pi^k]_{ij} = \Pr(S_{t_k} = s_j \mid S_{t_0} = s_i),$$

which means that $[P_\pi^k]_{ij}$ is the probability of transitioning from $s_i$ to $s_j$ using exactly $k$ steps.
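This matrix-power interpretation can be illustrated with a short NumPy sketch; the 3-state `P_pi` below is a made-up example, not from the source:

```python
import numpy as np

# Illustrative transition matrix P_pi (assumed, not from the source).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.2, 0.3, 0.5]])

# [P_pi^2]_{ij} = sum_q [P_pi]_{iq} [P_pi]_{qj}: probability of moving
# from s_i to s_j in exactly two steps.
P2 = P_pi @ P_pi
print(P2[0, 2])  # two-step probability from s_1 to s_3

# [P_pi^k]_{ij}: probability of moving from s_i to s_j in exactly k steps.
k = 5
Pk = np.linalg.matrix_power(P_pi, k)
print(Pk[0, 2])
```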
Definition of stationary distributions.
Let $d_0 \in \mathbb{R}^n$ be a vector representing the probability distribution of the states at the initial time step. For example, if the agent always starts from a specific state $s$, then $d_0(s) = 1$ and the other entries of $d_0$ are 0. Let $d_k \in \mathbb{R}^n$ be the vector representing the probability distribution obtained after exactly $k$ steps starting from $d_0$. Then, we have

$$d_k(s_i) = \sum_{j=1}^{n} d_0(s_j)[P_\pi^k]_{ji}, \quad i = 1, \dots, n.$$

This equation indicates that the probability of the agent visiting $s_i$ at step $k$ equals the sum of the probabilities of the agent transitioning from $s_j$ to $s_i$ using exactly $k$ steps. The matrix-vector form of the last equation is

$$d_k^T = d_0^T P_\pi^k.$$
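A minimal sketch of propagating an initial distribution $d_0$ through exactly $k$ steps, again assuming the made-up `P_pi` used above:

```python
import numpy as np

# Illustrative transition matrix P_pi (assumed).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.2, 0.3, 0.5]])

d0 = np.array([1.0, 0.0, 0.0])  # agent always starts from s_1

# d_k^T = d_0^T P_pi^k: state distribution after exactly k steps.
k = 10
dk = d0 @ np.linalg.matrix_power(P_pi, k)
print(dk, dk.sum())  # a valid probability distribution (sums to 1)
```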
When we consider the long-term behavior of the Markov process, it holds under certain conditions that

$$\lim_{k \to \infty} P_\pi^k = \mathbf{1}_n d_\pi^T,$$

where $\mathbf{1}_n = [1, \dots, 1]^T$ and $\mathbf{1}_n d_\pi^T$ is a constant matrix with all its rows equal to $d_\pi^T$. The conditions under which this limit is valid will be discussed later. Substituting it into $d_k^T = d_0^T P_\pi^k$ yields

$$\lim_{k \to \infty} d_k^T = d_0^T \lim_{k \to \infty} P_\pi^k = d_0^T \mathbf{1}_n d_\pi^T = d_\pi^T,$$

where the last equality is valid because $d_0^T \mathbf{1}_n = 1$.

The last equation means that the state distribution $d_k$ converges to a constant value $d_\pi$, which is called the limiting distribution.

The limiting distribution depends on the system model and the policy $\pi$. Interestingly, it is independent of the initial distribution $d_0$. That is, regardless of which state the agent starts from, the probability distribution of the agent after a sufficiently long period can always be described by the limiting distribution.
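To see the limiting behavior numerically, one can raise the (made-up) `P_pi` to a large power; under the conditions mentioned above, all rows approach $d_\pi^T$, so the resulting distribution no longer depends on $d_0$:

```python
import numpy as np

# Illustrative transition matrix P_pi (assumed).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.2, 0.3, 0.5]])

# For large k, every row of P_pi^k approaches d_pi^T, so the limiting
# matrix is 1_n d_pi^T.
P_big = np.linalg.matrix_power(P_pi, 100)
print(P_big)  # all rows (nearly) identical

# Hence d_k^T = d_0^T P_pi^k converges to the same d_pi^T for any d_0.
for d0 in (np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.5, 0.3])):
    print(d0 @ P_big)
```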
The value of $d_\pi$ can be calculated in the following way. Taking the limit of both sides of $d_k^T = d_{k-1}^T P_\pi$ gives

$$\lim_{k \to \infty} d_k^T = \lim_{k \to \infty} d_{k-1}^T P_\pi,$$

and hence

$$d_\pi^T = d_\pi^T P_\pi.$$

As a result, $d_\pi$ is the left eigenvector of $P_\pi$ associated with the eigenvalue 1. The solution of this equation is called the stationary distribution. It holds that $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$ and $d_\pi(s) > 0$ for all $s \in \mathcal{S}$. The reason why $d_\pi(s) > 0$ (not $d_\pi(s) \geq 0$) will be explained later (#TODO).
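A minimal NumPy sketch for computing $d_\pi$ as the left eigenvector of `P_pi` associated with the eigenvalue 1 (again using the made-up example matrix):

```python
import numpy as np

# Illustrative transition matrix P_pi (assumed).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.2, 0.3, 0.5]])

# Left eigenvectors of P_pi are (right) eigenvectors of P_pi^T.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
d_pi = np.real(eigvecs[:, idx])
d_pi = d_pi / d_pi.sum()                 # normalize so the entries sum to 1
print(d_pi)                              # stationary distribution d_pi
```

The normalized eigenvector should match both the Monte Carlo visit frequencies and the rows of the large matrix power computed in the earlier sketches.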