Basic Concepts in Reinforcement Learning
Sources:
- Shiyu Zhao. Chapter 1: Basic Concepts. Mathematical Foundations of Reinforcement Learning.
- OpenAI Spinning Up
Notation
Here I list the notation I use in my Reinforcement Learning (RL) posts. We follow the conventions of probability theory:

- Samples are denoted by lower-case italicized Roman letters, such as $s$, $a$, $r$.
- Sample spaces are denoted by upper-case calligraphic letters, such as $\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$.
- Random variables are denoted by upper-case italicized Roman letters, such as $S$, $A$, $R$.
| Symbol | Meaning |
|---|---|
| $s$ | The state of an environment. |
| $a$ | The action taken by the agent. |
| $r$ | The reward an agent obtains after executing an action at a state. The reward must be a scalar. |
| $o$ | The observation of an environment. |
| $S$ | The random variable representing the state of an environment. |
| $O$ | The random variable representing the observation of an environment. |
| $A$ | The random variable representing the action taken by the agent. |
| $R$ | The random variable representing the reward an agent obtains after executing an action at a state. The reward must be a scalar, i.e., $R \in \mathbb{R}$. |
| $s_t, o_t, a_t, r_t$ | The state, observation, action, and reward at time index $t$. |
| $S_t, O_t, A_t, R_t$ | The random variables representing the state, observation, action, and reward at time index $t$. |
| $r(s, a)$ | The reward is a function of the state $s$ and the action $a$. |
| $R(S, A)$ | For the same reason, the reward random variable is also denoted as a function of $S$ and $A$. |
| $\mathcal{S}$ | The set of all states, called the state space. We have $s \in \mathcal{S}$. |
| $\mathcal{A}$ | The set of all actions, called the action space. We have $a \in \mathcal{A}$. |
| $\mathcal{R}$ | The set of all rewards, called the reward space. We have $r \in \mathcal{R}$. |
| $\mathcal{A}(s)$ | It is common that an action space is associated with the state $s$. |
| $\mathcal{R}(s, a)$ | It is common that a reward space is associated with each state-action pair $(s, a)$. |
| $\mu$ | A deterministic policy of an agent. By convention, we write $a = \mu(s)$. |
| $\pi$ | A stochastic policy of an agent. By convention, we use $\pi(a \mid s)$ to denote the probability of taking action $a$ at state $s$. |
| $\tau$ | The trajectory, which is a state-action-reward chain $s_0 \xrightarrow{a_0} s_1, r_1 \xrightarrow{a_1} s_2, r_2 \xrightarrow{a_2} \cdots$ |
| $g$ | The return (or cumulative reward) of a trajectory. We often consider a discounted return with a discount rate $\gamma \in [0, 1)$. |
| $G$ | The random variable representing the return (or cumulative reward) of a trajectory. |
| $g_t$ | The return of a trajectory that starts from time index $t$. |
| $G_t$ | The random variable representing the return of a trajectory starting from time index $t$. |
| $G(\tau)$ | The return is a function of the trajectory $\tau$, so it is also written as a function $G(\tau)$. |
|
A grid world example
Another important example we use in our posts is the grid world example, shown in Figure 1.2, where a robot moves in a grid world. The robot, called the agent, can move across adjacent cells in the grid. At each time step, it can only occupy a single cell.
The white cells are accessible for entry, and the orange cells are forbidden. There is a target cell that the robot would like to reach. We will use such grid world examples throughout this RL series since they are intuitive for illustrating new concepts and algorithms.

State and action
The concept of state describes the agent's status with respect to the environment. The set of all the states is called the state space, denoted as $\mathcal{S}$.
For each state, the agent can take actions (which may differ across states). The set of all actions is called the action space, denoted as $\mathcal{A}$.
In the grid world example, the state corresponds to the agent's location. Since there are nine cells, there are nine states as well, thus $\mathcal{S} = \{s_1, s_2, \ldots, s_9\}$.
For each state, the agent can take five possible actions: moving upward, moving rightward, moving downward, moving leftward, and remaining unchanged. These five actions are denoted as $a_1, a_2, a_3, a_4, a_5$, respectively, so the action space is $\mathcal{A} = \{a_1, a_2, a_3, a_4, a_5\}$.
Considering that taking some actions may be meaningless at certain states (for example, moving upward at a cell in the top row), one could define a state-dependent action space $\mathcal{A}(s_i)$.

In this RL series, we consider the most general case: $\mathcal{A}(s_i) = \mathcal{A} = \{a_1, a_2, a_3, a_4, a_5\}$ for all states $s_i$.
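As a quick illustration, here is a minimal Python sketch of the grid world's state and action spaces. All names (`STATES`, `ACTIONS`, `action_space`) are my own, not from the text:

```python
# Illustrative sketch of the 3x3 grid-world spaces.
# States s1..s9 index the nine cells row by row; actions a1..a5 are
# up, right, down, left, and stay, respectively.
STATES = [f"s{i}" for i in range(1, 10)]
ACTIONS = ["a1", "a2", "a3", "a4", "a5"]

# Most general case: every state shares the same action space A(s) = A.
def action_space(state):
    return ACTIONS

assert all(action_space(s) == ACTIONS for s in STATES)
```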
States and Observations
You may also see the terminology "observation". It is similar to "state". What's the difference?

- A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state.
- An observation $o$ is a partial description of a state, which may omit information.
- The environment is fully observed when the agent is able to observe the complete state of the environment.
- The environment is partially observed when the agent can only see a partial observation.

Reinforcement learning notation sometimes puts the symbol for state, $s$, in places where it would be technically more appropriate to write the symbol for observation, $o$. Specifically, this happens when talking about how the agent decides an action: we often signal in notation that the action is conditioned on the state, when in practice the action is conditioned on the observation, because the agent does not have access to the state.
In these posts, we'll follow standard conventions for notation, but it should be clear from context which is meant. If something is unclear, though, please raise an issue! Our goal is to teach, not to confuse.
State transition
When taking an action, the agent may move from one state to another. Such a process is called a state transition. For example, if the agent is at state $s_1$ and takes action $a_2$ (moving rightward), it moves to state $s_2$. Such a process can be expressed as $s_1 \xrightarrow{a_2} s_2$.
We next examine two important examples.
- What is the next state when the agent attempts to go beyond the boundary, for example, taking action $a_1$ (moving upward) at state $s_1$? The answer is that the agent is bounced back, because it is impossible for the agent to exit the state space. Hence, we have $s_1 \xrightarrow{a_1} s_1$.
- What is the next state when the agent attempts to enter a forbidden cell, for example, taking action $a_2$ (moving rightward) at state $s_5$, where $s_6$ is forbidden? Two different scenarios may be encountered.
  - In the first scenario, although $s_6$ is forbidden, it is still accessible. In this case, the next state is $s_6$; hence, the state transition process is $s_5 \xrightarrow{a_2} s_6$.
  - In the second scenario, $s_6$ is not accessible because, for example, it is surrounded by walls. In this case, the agent is bounced back to $s_5$ if it attempts to move rightward; hence, the state transition process is $s_5 \xrightarrow{a_2} s_5$.
- Which scenario should we consider? The answer depends on the physical environment. In this series, we consider the first scenario, where the forbidden cells are accessible although stepping into them may be punished. This scenario is more general and interesting. Moreover, since we are considering a simulation task, we can define the state transition process however we prefer. In real-world applications, the state transition process is determined by real-world dynamics.
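The bounce-back rule above can be sketched as a small deterministic transition function. This is a minimal sketch, assuming a 3×3 grid with 1-based row-major cell numbering; the helper names are illustrative:

```python
# Deterministic state transition for a 3x3 grid world (first scenario:
# forbidden cells are still enterable, so only the boundary bounces back).
N = 3  # grid side length

# Action -> (row delta, col delta): a1 up, a2 right, a3 down, a4 left, a5 stay.
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}

def step(state, action):
    """Return the next state index (1-based, row-major) after taking `action`."""
    row, col = divmod(state - 1, N)
    dr, dc = MOVES[action]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < N and 0 <= nc < N):
        return state  # bounced back at the boundary
    return nr * N + nc + 1

# Moving upward (a1) at s1 bounces back; moving rightward (a2) at s1 gives s2.
assert step(1, "a1") == 1
assert step(1, "a2") == 2
```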
Policy

A policy tells the agent which actions to take at every state. Intuitively, policies can be depicted as arrows (see Figure 1.4(a)). Following a policy, the agent can generate a trajectory starting from an initial state (see Figure 1.4(b)).
A policy can be deterministic or stochastic.
Deterministic policy
A deterministic policy is usually denoted by $\mu$: at state $s$, the action taken is $a = \mu(s)$.
Stochastic policy
A stochastic policy is usually denoted by $\pi$: $\pi(a \mid s)$ denotes the probability of taking action $a$ at state $s$. It holds that $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$ for any $s$.
It is preferable to use $\pi$ in general, since a deterministic policy is a special case of a stochastic one.
Meanwhile, policies are often parameterized; the parameters are commonly denoted as $\theta$ or $\phi$, written as $\mu_\theta(s)$ or $\pi_\theta(a \mid s)$.
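A tabular stochastic policy can be sketched as a dictionary mapping each state to a distribution over actions. This is a minimal illustration with made-up state and action names; a deterministic policy appears as the special case where one action has probability 1:

```python
import random

# pi(a|s) as a nested dict; entries are illustrative.
policy = {
    "s1": {"a2": 0.5, "a3": 0.5},  # move right or down with equal probability
    "s2": {"a3": 1.0},             # deterministic policy as a special case
}

def sample_action(pi, state):
    """Sample an action a ~ pi(.|s)."""
    actions = list(pi[state])
    probs = [pi[state][a] for a in actions]
    assert abs(sum(probs) - 1.0) < 1e-9  # probabilities must sum to one
    return random.choices(actions, weights=probs)[0]

assert sample_action(policy, "s2") == "a3"
assert sample_action(policy, "s1") in {"a2", "a3"}
```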
Reward
After executing an action at a state, the agent obtains a reward, denoted as $r$, as feedback from the environment.
The reward is a function of the state $s$ and the action $a$, i.e., $r = r(s, a)$.
In the grid world example, the rewards are designed as follows:
- If the agent attempts to exit the boundary, let $r = -1$.
- If the agent attempts to enter a forbidden cell, let $r = -1$.
- If the agent reaches the target state, let $r = +1$.
- Otherwise, the agent obtains a reward of $r = 0$.
Trajectory
A trajectory $\tau$ is a state-action-reward chain obtained as the agent interacts with the environment, for example: $s_0 \xrightarrow{a_0} s_1, r_1 \xrightarrow{a_1} s_2, r_2 \xrightarrow{a_2} s_3, r_3 \cdots$
Note that the trajectory is based on some policy $\pi$: different policies generally produce different trajectories.

Return
We also define the return of a trajectory as the sum of all the rewards collected along the trajectory. This sum is undiscounted. However, it is more common to consider the discounted return with a discount rate $\gamma \in [0, 1)$.
Consider a trajectory along which the rewards $r_1, r_2, r_3, \ldots$ are collected.
The (often discounted) return is denoted as $g$ (with $G$ for the corresponding random variable). Returns are also called total rewards or cumulative rewards.
The (undiscounted) return of this trajectory is $g = r_1 + r_2 + r_3 + \cdots$, and the discounted return is

$$g = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_{t+1}.$$

NOTE: The return is based on the trajectory, and the trajectory is based on a policy. Thus, the return is based on a policy as well.
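The discounted return of a finite trajectory can be computed directly from its reward sequence. A minimal sketch (function name is my own):

```python
# Discounted return of a finite reward sequence r_1, r_2, r_3, ...
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# The undiscounted return is the special case gamma = 1.
assert discounted_return([0.0, 0.0, 1.0], gamma=1.0) == 1.0
assert abs(discounted_return([0.0, 0.0, 1.0], gamma=0.9) - 0.81) < 1e-12
```

Note that a smaller $\gamma$ weights near-term rewards more heavily, while $\gamma \to 1$ treats all rewards almost equally.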
Episode
When interacting with the environment by following a policy, the agent may stop at some terminal states. In this case, the resulting trajectory is finite, and is called an episode (or a trial, rollout).
However, some tasks may have no terminal states. In this case, the resulting trajectory is infinite.
Tasks with episodes are called episodic tasks. Tasks without terminal states are called continuing tasks.
In fact, we can treat episodic and continuing tasks in a unified mathematical manner by converting episodic tasks to continuing ones. We have two options:
- First, if we treat the terminal state as a special state, we can specifically design its action space or state transition so that the agent stays in this state forever. Such states are called absorbing states, meaning that the agent never leaves a state once reached.
- Second, if we treat the terminal state as a normal state, we can simply set its action space to be the same as that of the other states, and the agent may leave the state and come back again. Since a positive reward of $r = +1$ can be obtained every time the target state is reached, the agent will eventually learn to stay at the target forever (self-loop) to collect more rewards.

In this RL series, we consider the second scenario, where the target state is treated as a normal state whose action space is $\mathcal{A}$, so the agent can leave and re-enter it.
Markov decision process
This section presents the basic RL concepts in a more formal way under the framework of Markov decision processes (MDPs).
An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.
Sets:
- State space: the set of all states, denoted as $\mathcal{S}$.
- Action space: a set of actions, denoted as $\mathcal{A}(s)$, associated with each state $s \in \mathcal{S}$.
- Reward set: a set of rewards, denoted as $\mathcal{R}(s, a)$, associated with each state-action pair $(s, a)$.

Note:

- The state, action, and reward at time index $t$ are denoted as $s_t$, $a_t$, and $r_t$, respectively.
- The reward depends on the state and action, but not the next state.
Model:
- State transition probability: at state $s$, when taking action $a$, the probability of transitioning to state $s'$ is $p(s' \mid s, a)$. It holds that $\sum_{s' \in \mathcal{S}} p(s' \mid s, a) = 1$ for any $(s, a)$.
- Reward probability: at state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r \mid s, a)$. It holds that $\sum_{r \in \mathcal{R}(s, a)} p(r \mid s, a) = 1$ for any $(s, a)$.
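For a finite MDP, these two model components can be stored as explicit probability tables. A minimal sketch with illustrative entries (the dict names and values are my own, loosely following the grid-world reward design):

```python
# MDP model as explicit probability tables for a finite MDP.
p_next = {  # p(s' | s, a)
    ("s1", "a2"): {"s2": 1.0},
    ("s1", "a1"): {"s1": 1.0},   # bounced back at the boundary
}
p_reward = {  # p(r | s, a)
    ("s1", "a2"): {0.0: 1.0},
    ("s1", "a1"): {-1.0: 1.0},   # penalty for attempting to exit the boundary
}

# Each conditional distribution must sum to one.
for table in (p_next, p_reward):
    for dist in table.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Deterministic dynamics, as in the grid world, are simply the special case where every distribution puts probability 1 on a single outcome.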
Policy: at state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$. It holds that $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$ for any $s \in \mathcal{S}$.
Markov property:
The name Markov Decision Process refers to the fact that the system obeys the Markov property, the memoryless property of a stochastic process.
Mathematically, it means that

$$p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t),$$
$$p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t).$$

- The state is Markovian: the next state depends only on the current state and action, not on the history.
- The state transition is Markovian: $p(s_{t+1} \mid s_t, a_t)$ does not depend on earlier states or actions.
- The action itself does not have the Markov property, but it is defined to rely only on the current state and the policy: $a_t \sim \pi(\cdot \mid s_t)$, which does not depend on $s_{t-1}, a_{t-1}, \ldots$.
- The reward $r_{t+1}$ itself does not have the Markov property, but it is defined to rely only on $(s_t, a_t)$: $r_{t+1} \sim p(\cdot \mid s_t, a_t)$.

Here, $t$ represents the current time step and $t+1$ represents the next time step.
Markov processes
One may have heard about Markov processes (MPs). What is the difference between an MDP and an MP? The answer is that, once the policy in an MDP is fixed, the MDP degenerates into an MP. For example, the grid world example in Figure 1.7 can be abstracted as a Markov process. In the literature on stochastic processes, a Markov process is also called a Markov chain if it is a discrete-time process and the number of states is finite or countable [1]. In this series, the terms "Markov process" and "Markov chain" are used interchangeably when the context is clear. Moreover, this series mainly considers finite MDPs, where the numbers of states, actions, and rewards are all finite.

Goal of reinforcement learning
The goal of RL is to find a policy that maximizes the expected return:

$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[G \mid \pi\right].$$
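The expected return of a given policy can be estimated by Monte Carlo rollouts: run the policy many times and average the discounted returns. Below is a minimal sketch on a toy one-dimensional chain, not the grid world; all names and the environment are my own illustrative assumptions:

```python
import random

# Toy chain: cells 0, 1, 2; the agent starts at 0, the target is cell 2
# (reward +1 on arrival, 0 otherwise), and the episode ends at the target.
def rollout(policy, gamma=0.9, start=0, target=2, max_steps=100):
    """Run one episode and return its discounted return G."""
    state, g, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        move = random.choices([-1, 1], weights=policy(state))[0]
        state = min(max(state + move, 0), target)  # bounce at the left edge
        r = 1.0 if state == target else 0.0
        g += discount * r
        discount *= gamma
        if state == target:
            break
    return g

always_right = lambda s: [0.0, 1.0]  # deterministic: always move right
random.seed(0)
est = sum(rollout(always_right) for _ in range(1000)) / 1000
# Always moving right reaches the target in 2 steps, so G = gamma**1 = 0.9.
assert abs(est - 0.9) < 1e-9
```

Comparing such estimates across policies is the simplest way to see which policy is better; later posts formalize this with value functions.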