Stationary Distribution of a Markov Decision Process
The stationary distribution of \(S\) under policy \(\pi\) is denoted by \(\left\{d_\pi(s)\right\}_{s \in \mathcal{S}}\). By definition, \(d_\pi(s) \geq 0\) and \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\).
Let \(n_\pi(s)\) denote the number of times that \(s\) has been visited in a very long episode generated by \(\pi\). Then, \(d_\pi(s)\) can be approximated by \[ d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s^{\prime} \in \mathcal{S}} n_\pi\left(s^{\prime}\right)} \] Meanwhile, the converged values \(d_\pi(s)\) can be computed directly by solving the equation \[ d_\pi^T=d_\pi^T P_\pi, \] i.e., \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1, normalized so that its entries sum to 1.
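Both computations above can be sketched in a few lines of NumPy. The transition matrix \(P_\pi\) below is a made-up 3-state example (not from the text); the first function solves \(d_\pi^T = d_\pi^T P_\pi\) via the left eigenvector, and the second approximates \(d_\pi(s)\) by the visit frequencies \(n_\pi(s) / \sum_{s'} n_\pi(s')\) from a long simulated episode.

```python
import numpy as np

# Hypothetical 3-state transition matrix P_pi under a fixed policy pi
# (each row sums to 1); any ergodic chain would do.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

def stationary_distribution(P):
    """Solve d^T = d^T P: left eigenvector of P for eigenvalue 1."""
    # Left eigenvectors of P are right eigenvectors of P^T.
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))  # eigenvalue closest to 1
    d = np.real(eigvecs[:, idx])
    return d / d.sum()                      # normalize so entries sum to 1

def empirical_distribution(P, n_steps=100_000, seed=0):
    """Approximate d_pi(s) by visit counts n_pi(s) over a long episode."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(P.shape[0])
    s = 0
    for _ in range(n_steps):
        counts[s] += 1
        s = rng.choice(P.shape[0], p=P[s])  # sample next state from row s
    return counts / counts.sum()

d_exact = stationary_distribution(P)
d_approx = empirical_distribution(P)
```

For an ergodic chain the two results agree up to Monte Carlo error, which shrinks as the episode length grows.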