Lu, Yukuan

Stationary Distribution of a Markov Decision Process

Posted on 2024-06-24 Edited on 2024-12-04 In Computer Science

The stationary distribution of \(S\) under policy \(\pi\) can be denoted by \(\left\{d_\pi(s)\right\}_{s \in \mathcal{S}}\). By definition, \(d_\pi(s) \geq 0\) and \(\sum_{s \in \mathcal{S}} d_\pi(s)=1\).

Let \(n_\pi(s)\) denote the number of times that \(s\) has been visited in a very long episode generated by \(\pi\). Then, \(d_\pi(s)\) can be approximated by \[ d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s^{\prime} \in \mathcal{S}} n_\pi\left(s^{\prime}\right)} \] Meanwhile, the converged values \(d_\pi(s)\) can be computed directly by solving the equation \[ d_\pi^T=d_\pi^T P_\pi, \] where \(P_\pi\) is the state transition matrix under \(\pi\), i.e., \(d_\pi\) is the left eigenvector of \(P_\pi\) associated with the eigenvalue 1.
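
As a rough illustration, the two routes to \(d_\pi\) can be compared numerically. The sketch below assumes a small, made-up 3-state transition matrix `P_pi` (not taken from the book) and uses hypothetical helper names; it computes the left eigenvector with NumPy and then estimates the same distribution from visit counts along one long simulated trajectory.

```python
import numpy as np

def stationary_from_eigen(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    # Right eigenvectors of P^T are left eigenvectors of P.
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    d = np.real(eigvecs[:, idx])
    return d / d.sum()

def stationary_from_visits(P, num_steps=50_000, seed=0):
    """Estimate d(s) as n(s) / sum_s' n(s') along one long trajectory."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    counts = np.zeros(n_states)
    s = 0  # arbitrary start state (assumption)
    for _ in range(num_steps):
        counts[s] += 1
        s = rng.choice(n_states, p=P[s])
    return counts / counts.sum()

# Hypothetical transition matrix under some policy pi; each row sums to 1.
P_pi = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
])

d_exact = stationary_from_eigen(P_pi)
d_empirical = stationary_from_visits(P_pi)
print("eigenvector:", d_exact)
print("visit counts:", d_empirical)
```

For this particular matrix the exact solution is \(d_\pi = (0.4, 0.4, 0.2)\), and the visit-count estimate should approach it as the trajectory gets longer.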

Sources:

Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.

Q Learning

Posted on 2024-06-22 Edited on 2024-12-04 In Computer Science

This chapter introduces Q-learning, a temporal-difference (TD) learning method that estimates optimal action values and hence optimal policies. Previously we illustrated TD learning of state values and action values. Those methods must be combined with policy improvement to obtain optimal policies.
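
To make the idea concrete, here is a minimal sketch of a single tabular Q-learning update. The function name `q_learning_update`, the table shape, and the values of `alpha` and `gamma` are illustrative assumptions, not code from the book. The TD target uses the maximum over next-state action values, which is why the method estimates optimal action values directly.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on the sample (s, a, r, s_next)."""
    # The TD target bootstraps from the best action value in the next state.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] -= alpha * (Q[s, a] - td_target)
    return Q

# Hypothetical problem with 2 states and 2 actions, all values initialized to 0.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)

# A greedy policy can be read off the table directly.
greedy_policy = np.argmax(Q, axis=1)
print(Q)
print(greedy_policy)
```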

Sources:

Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.

Value Function Approximation

Posted on 2024-06-22 Edited on 2024-12-04 In Computer Science

In the previous post, we introduced TD learning algorithms. At that time, all state/action values were represented by tables. This is inefficient for handling large state or action spaces.

In this post, we will use the function approximation method for TD learning. It is also where artificial neural networks are incorporated into reinforcement learning as function approximators.
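
As a small sketch of what replacing the table looks like, the snippet below performs one TD update with a linear approximator \(\hat{v}(s, w)=\phi(s)^T w\). The feature map `phi`, the function name `td_linear_update`, and the step-size and discount values are assumptions chosen for illustration; a neural network would replace the linear form in the same role.

```python
import numpy as np

def td_linear_update(w, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """One TD step for a linear value estimate v_hat(s) = phi(s) @ w."""
    td_error = r + gamma * (phi_s_next @ w) - (phi_s @ w)
    # For a linear approximator, the gradient of v_hat with respect to w is phi(s).
    return w + alpha * td_error * phi_s

def phi(s):
    """Hypothetical feature vector for a scalar state: [1, s]."""
    return np.array([1.0, float(s)])

w = np.zeros(2)
w = td_linear_update(w, phi(0), phi(1), r=1.0)
print(w)
```

Only the parameter vector `w` is stored, so memory no longer grows with the number of states.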

Sources:

Shiyu Zhao. Chapter 8: Value Function Approximation. Mathematical Foundations of Reinforcement Learning.

Temporal-Difference Methods

Posted on 2024-06-22 Edited on 2024-12-04 In Computer Science

This chapter introduces temporal-difference (TD) methods for reinforcement learning. Similar to Monte Carlo (MC) learning, TD learning is also model-free, but it has some advantages due to its incremental form.

The goal of both TD learning and MC learning is policy evaluation: given a dataset generated by a policy \(\pi\), estimate the state value (or action value) of the policy \(\pi\). They must be combined with one step of policy improvement to get optimal policies.

We elaborate on two TD learning methods: one for estimating state values and one for estimating action values. The latter algorithm is called Sarsa.
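
For reference, the two update rules can be sketched as follows, assuming a step size \(\alpha_t\), a discount rate \(\gamma\), and estimates \(v_t\) and \(q_t\) at time \(t\) (the notation here is an assumption and may differ slightly from the full post). The state-value update at the visited state \(s_t\) is \[ v_{t+1}\left(s_t\right)=v_t\left(s_t\right)-\alpha_t\left(s_t\right)\left[v_t\left(s_t\right)-\left(r_{t+1}+\gamma v_t\left(s_{t+1}\right)\right)\right], \] and Sarsa applies the same form to the visited state-action pair \(\left(s_t, a_t\right)\): \[ q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left(r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)\right)\right]. \] Estimates for all unvisited states or pairs are left unchanged. The name Sarsa comes from the sample \(\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)\) that each update uses.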

In the next post, we will introduce Q-learning, which is very similar to Sarsa but directly estimates optimal action values and hence optimal policies.

Sources:

Shiyu Zhao. Chapter 7: Temporal-Difference Methods. Mathematical Foundations of Reinforcement Learning.