Introduction to Model Predictive Control
Model Predictive Control, or MPC, is a model-based planning method. Its core idea is:
Plan several steps into the future, execute only the first action, and then replan at the next step.
Basic idea
At time \(t\), MPC considers a future action sequence:
\[ a_{t:t+H-1} = (a_t, a_{t+1}, \dots, a_{t+H-1}). \]
Using a model, it predicts the future outcomes of this action sequence and selects the sequence with the largest predicted return:
\[ a^*_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} J(a_{t:t+H-1}, s_t). \]
However, MPC only executes the first action \(a_t^*\). At the next environment step, it observes the new state and replans.
This is why MPC is also called receding-horizon control.
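1. Observe the current state.
2. Plan an action sequence over the horizon \(H\).
3. Execute only the first action.
4. Observe the new state and replan.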
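The receding-horizon loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the 1-D dynamics `model`, the `reward`, and the discrete action set `ACTIONS` are all hypothetical, and the planner simply enumerates every action sequence, which only works for tiny problems.

```python
import itertools

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set


def model(s, a):
    # Hypothetical known dynamics: the state moves by the action.
    return s + a


def reward(s, a):
    # Hypothetical reward: penalize the distance of the resulting state from 0.
    return -(s + a) ** 2


def plan(s, horizon=3):
    """Return the action sequence with the largest predicted return."""
    best_seq, best_ret = None, float("-inf")
    for seq in itertools.product(ACTIONS, repeat=horizon):
        sim, ret = s, 0.0
        for a in seq:
            ret += reward(sim, a)
            sim = model(sim, a)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq


def run_mpc(s, steps=6):
    for _ in range(steps):
        a_seq = plan(s)         # plan H steps ahead from the observed state
        s = model(s, a_seq[0])  # execute only the first action (env == model here)
    return s


final_state = run_mpc(3)
```

Even though each plan covers three steps, only `a_seq[0]` is ever applied; the rest of the plan is discarded and recomputed from the newly observed state.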
In model-based RL, the model is often learned from data. For example:
\[ \hat{s}_{t+1} = f_\theta(\hat{s}_t, a_t). \]
In latent-space MPC, observations are first encoded into latent states:
\[ z_t = e_\theta(o_t), \]
and planning is done in latent space:
\[ \hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t). \]
MPC objective
A simple finite-horizon MPC objective is:
\[ J(a_{0:H-1}, s_0) = \sum_{h=0}^{H-1} \gamma^h r(\hat{s}_h, a_h), \]
subject to:
\[ \hat{s}_0 = s_0, \]
\[ \hat{s}_{h+1} = f(\hat{s}_h, a_h). \]
In RL, the finite horizon is often bootstrapped with a value function:
\[ J(a_{0:H-1}, s_0) = \sum_{h=0}^{H-1} \gamma^h r(\hat{s}_h, a_h) + \gamma^H V(\hat{s}_H). \]
If we use an action-value function, the terminal bootstrap can be:
\[ J(a_{0:H}, s_0) = \sum_{h=0}^{H-1} \gamma^h r(\hat{s}_h, a_h) + \gamma^H Q(\hat{s}_H, a_H). \]
The value or Q-function estimates returns beyond the finite planning horizon.
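As a rough sketch, the objective can be evaluated by rolling the model forward and accumulating discounted rewards, then adding the discounted terminal value. The function name `planning_objective` and the toy `f`, `r`, and `V` below are illustrative assumptions, not part of any particular library.

```python
def planning_objective(s0, actions, f, r, gamma, value_fn=None):
    """Discounted model return over H = len(actions) steps.

    If value_fn is given, gamma^H * V(s_H) is added as a terminal bootstrap.
    """
    s, total = s0, 0.0
    for h, a in enumerate(actions):
        total += (gamma ** h) * r(s, a)
        s = f(s, a)  # predicted next state
    if value_fn is not None:
        total += (gamma ** len(actions)) * value_fn(s)
    return total


# Toy check with hypothetical components: constant reward 1, constant value 10.
f = lambda s, a: s + a
r = lambda s, a: 1.0
V = lambda s: 10.0

J_plain = planning_objective(0.0, [0.0, 0.0, 0.0], f, r, gamma=0.5)
J_boot = planning_objective(0.0, [0.0, 0.0, 0.0], f, r, gamma=0.5, value_fn=V)
# J_plain = 1 + 0.5 + 0.25 = 1.75
# J_boot  = 1.75 + 0.5**3 * 10 = 3.0
```

The bootstrap term is what lets a three-step plan account for reward that would only be collected after the horizon ends.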
Common MPC solvers
MPC defines the planning problem, but not how to solve it. Common solvers include random shooting, CEM, and MPPI.
Random shooting
Random shooting samples many action sequences, evaluates them with the model, and picks the best one.
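1. Sample many action sequences.
2. Roll each sequence out with the model and compute its return.
3. Keep the sequence with the highest return.
It is simple but sample-inefficient in high-dimensional action spaces, because the number of random samples needed to land near a good sequence grows quickly with the dimension.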
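A batched version of random shooting might look like the sketch below. It assumes actions are bounded in `[low, high]` and sampled uniformly, and that the model `f` and reward `r` accept batches; the toy `f` and `r` are hypothetical.

```python
import numpy as np


def random_shooting(s0, f, r, horizon, action_dim, num_samples=1000,
                    gamma=0.99, low=-1.0, high=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # One sample is one full action sequence of shape (horizon, action_dim).
    actions = rng.uniform(low, high, size=(num_samples, horizon, action_dim))
    states = np.repeat(np.asarray(s0, dtype=float)[None, :], num_samples, axis=0)
    returns = np.zeros(num_samples)
    for h in range(horizon):
        returns += (gamma ** h) * r(states, actions[:, h])
        states = f(states, actions[:, h])
    best = int(np.argmax(returns))
    return actions[best], returns[best]


# Hypothetical toy problem: drive a 2-D point toward the origin.
f = lambda s, a: s + a
r = lambda s, a: -np.sum(s ** 2, axis=-1) - 0.1 * np.sum(a ** 2, axis=-1)

best_seq, best_ret = random_shooting([2.0, -2.0], f, r, horizon=5, action_dim=2,
                                     rng=np.random.default_rng(0))
first_action = best_seq[0]  # the only action MPC would actually execute
```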
CEM
CEM, or Cross-Entropy Method, maintains a sampling distribution over action sequences, often a diagonal Gaussian:
\[ a_{0:H-1} \sim \mathcal{N}(\mu_{0:H-1}, \operatorname{diag}(\sigma^2_{0:H-1})). \]
The procedure is:
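1. Sample action sequences from the current distribution.
2. Evaluate each sequence with the model.
3. Keep the highest-return (elite) sequences.
4. Refit the mean and variance to the elites.
5. Repeat for a few iterations.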
CEM is a hard-selection method: only the elite samples are used to update the distribution.
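The CEM loop could be sketched as follows. The hyperparameters, the clipping of actions to `[-1, 1]`, and the toy `f` and `r` are assumptions made for the example; practical implementations differ in details such as variance floors and momentum on the mean.

```python
import numpy as np


def rollout_returns(s0, actions, f, r, gamma):
    """Discounted model returns for a batch of action sequences (N, H, d_a)."""
    n, horizon, _ = actions.shape
    s = np.repeat(np.asarray(s0, dtype=float)[None, :], n, axis=0)
    returns = np.zeros(n)
    for h in range(horizon):
        returns += (gamma ** h) * r(s, actions[:, h])
        s = f(s, actions[:, h])
    return returns


def cem_plan(s0, f, r, horizon, action_dim, num_samples=200, num_elites=20,
             num_iters=5, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(num_iters):
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        actions = np.clip(mu + sigma * noise, -1.0, 1.0)
        returns = rollout_returns(s0, actions, f, r, gamma)
        # Hard selection: only the top num_elites samples shape the update.
        elites = actions[np.argsort(returns)[-num_elites:]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # small floor avoids a zero std
    return mu, sigma


# Hypothetical toy problem: 1-D point starting at 2, rewarded for being near 0.
f = lambda s, a: s + a
r = lambda s, a: -np.sum(s ** 2, axis=-1)
mu, sigma = cem_plan([2.0], f, r, horizon=5, action_dim=1)
```

In an MPC loop, `mu[0]` would typically be the action that is executed.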
MPPI
MPPI, or Model Predictive Path Integral control, is also a sampling-based trajectory optimization method.
Like CEM, it samples action sequences and evaluates them. But instead of only keeping elites, it assigns weights based on returns:
\[ w_i \propto \exp(J_i / \lambda). \]
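High-return trajectories get larger weights, but lower-return trajectories may still contribute. Here \(\lambda > 0\) is a temperature: small values concentrate the weight on the best trajectories, and large values spread it more evenly.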
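CEM: hard selection; only the elite samples update the distribution.
MPPI: soft weighting; every sample contributes in proportion to \(\exp(J_i / \lambda)\).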
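The difference between the two weighting schemes is easy to see numerically. In this sketch, `mppi_weights` and `cem_weights` are illustrative helper names; the max-subtraction is a standard numerical-stability trick that does not change the normalized weights.

```python
import numpy as np


def mppi_weights(returns, lam=1.0):
    """Soft weights w_i proportional to exp(J_i / lam), normalized to sum to 1."""
    returns = np.asarray(returns, dtype=float)
    # Subtracting the max avoids overflow and cancels out after normalization.
    w = np.exp((returns - returns.max()) / lam)
    return w / w.sum()


def cem_weights(returns, num_elites):
    """Hard weights: uniform over the elites, zero for everything else."""
    returns = np.asarray(returns, dtype=float)
    w = np.zeros_like(returns)
    w[np.argsort(returns)[-num_elites:]] = 1.0 / num_elites
    return w


returns = [1.0, 2.0, 3.0]
w_soft = mppi_weights(returns, lam=1.0)      # approx [0.090, 0.245, 0.665]
w_hard = cem_weights(returns, num_elites=1)  # [0.0, 0.0, 1.0]
```

With MPPI the worst trajectory still carries about 9% of the weight; with a single elite, CEM ignores it entirely.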
Hybrid CEM-MPPI
Some practical methods mix CEM and MPPI.
For example:
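1. Sample many trajectories.
2. Keep the highest-return (elite) trajectories.
3. Weight the elites by \(\exp(J_i / \lambda)\).
4. Update the sampling mean and variance using the weighted elites.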
This is CEM-like because it keeps elites, and MPPI-like because it uses return-based weights.
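One plausible form of such a hybrid update is sketched below: select the elites first, then compute a softmax-weighted mean and standard deviation over the elites only. The function name `hybrid_update` and the exact update rule are assumptions for illustration; specific methods may differ.

```python
import numpy as np


def hybrid_update(actions, returns, num_elites, lam=1.0):
    """Elite selection (CEM-like) followed by return-weighted moments (MPPI-like).

    actions: array of shape (N, H, d_a); returns: array of shape (N,).
    """
    returns = np.asarray(returns, dtype=float)
    elite_idx = np.argsort(returns)[-num_elites:]
    elite_actions = actions[elite_idx]
    elite_returns = returns[elite_idx]

    w = np.exp((elite_returns - elite_returns.max()) / lam)
    w /= w.sum()
    w = w.reshape(-1, *([1] * (actions.ndim - 1)))  # broadcast over (H, d_a)

    mu = np.sum(w * elite_actions, axis=0)
    sigma = np.sqrt(np.sum(w * (elite_actions - mu) ** 2, axis=0))
    return mu, sigma


# Four one-step, one-dimensional "sequences" with returns equal to their action.
actions = np.array([0.0, 1.0, 2.0, 3.0]).reshape(4, 1, 1)
returns = np.array([0.0, 1.0, 2.0, 3.0])
mu, sigma = hybrid_update(actions, returns, num_elites=2, lam=1.0)
```

Here only the actions 2 and 3 survive selection, and the mean lands closer to 3 because that elite has the higher return.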
Why sample an action sequence?
In sampling-based MPC, one sample is usually a whole future action sequence:
\[ a_{0:H-1} = (a_0, a_1, \dots, a_{H-1}). \]
If each action has dimension \(d_a\), then the sequence has dimension \(H d_a\).
So if \(H=10\) and \(d_a=6\), the action sequence is a 60-dimensional vector.
This is why methods such as CEM and MPPI often use a high-dimensional Gaussian over action sequences. The Gaussian is not necessarily standard; its mean and variance are updated during planning.
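In code, this usually means the mean and standard deviation are stored with shape `(H, d_a)` and each sample is one full sequence. A small sketch, with an arbitrary sample count `N` and arbitrary initial `mu` and `sigma`:

```python
import numpy as np

H, d_a, N = 10, 6, 512  # horizon, action dimension, number of samples
rng = np.random.default_rng(0)

mu = np.zeros((H, d_a))         # per-step, per-dimension mean
sigma = np.full((H, d_a), 0.5)  # per-step, per-dimension std (diagonal covariance)

# Each of the N samples is a whole action sequence of shape (H, d_a).
samples = mu + sigma * rng.standard_normal((N, H, d_a))

# Flattening shows the 60-dimensional vector view of each sequence.
flat = samples.reshape(N, H * d_a)
```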
Useful tricks
Short horizon
MPC usually uses a relatively short horizon. It does not need to plan the entire episode because it replans at every step.
Warm start
MPC solves a new planning problem at every environment step. However, consecutive planning problems are usually very similar, so we can initialize the new planning problem using the previous solution.
Suppose the previous plan was:
\[ (a_0, a_1, a_2, \dots, a_{H-1}). \]
MPC executes only the first action \(a_0\). At the next step, the remaining actions
\[ (a_1, a_2, \dots, a_{H-1}) \]
are still useful because they were already planned for the future. Therefore, the next planning problem can be initialized with the shifted sequence:
\[ (a_1, a_2, \dots, a_{H-1}, a_{\text{new}}). \]
This is called warm starting. It makes planning more efficient because the optimizer does not start from scratch at every step.
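The shift itself is a one-liner. The text above leaves \(a_{\text{new}}\) open; this sketch assumes zeros by default, and repeating the last planned action is another common choice.

```python
import numpy as np


def warm_start(prev_plan, new_last=None):
    """Shift a plan of shape (H, d_a) one step forward in time."""
    prev_plan = np.asarray(prev_plan, dtype=float)
    if new_last is None:
        # Assumption: fill the freed final slot with zeros.
        new_last = np.zeros_like(prev_plan[-1])
    new_last = np.asarray(new_last, dtype=float)[None, :]
    return np.concatenate([prev_plan[1:], new_last], axis=0)


prev_plan = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
next_init = warm_start(prev_plan)
# next_init is [[0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
```

For CEM or MPPI, `next_init` would serve as the initial mean of the sampling distribution at the next environment step.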
Policy prior
A learned policy can propose candidate action sequences for the planner.
policy: proposes candidate action sequences
planner: evaluates and refines them with the model
This is useful because pure random sampling can be inefficient.
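One way to realize this, sketched under assumptions: roll a noisy policy through the model to produce a few candidate sequences, then mix them with Gaussian samples before evaluation. The `policy` here is a hypothetical hand-written stand-in for a learned one, and the 24/488 split is arbitrary.

```python
import numpy as np


def policy_rollout_sequences(s0, policy, f, horizon, num_seqs, noise_std, rng):
    """Roll a noisy policy through the model to get candidate action sequences."""
    s = np.repeat(np.asarray(s0, dtype=float)[None, :], num_seqs, axis=0)
    seq = []
    for _ in range(horizon):
        a = policy(s)
        a = np.clip(a + noise_std * rng.standard_normal(a.shape), -1.0, 1.0)
        seq.append(a)
        s = f(s, a)  # imagined next state
    return np.stack(seq, axis=1)  # (num_seqs, horizon, action_dim)


rng = np.random.default_rng(0)
f = lambda s, a: s + a                     # hypothetical model
policy = lambda s: np.clip(-s, -1.0, 1.0)  # hypothetical stand-in for a learned policy
horizon, action_dim = 5, 2

from_policy = policy_rollout_sequences([2.0, -2.0], policy, f, horizon, 24, 0.1, rng)
from_gaussian = np.clip(rng.standard_normal((488, horizon, action_dim)), -1.0, 1.0)

# The planner then scores all candidates with the same model-based objective.
candidates = np.concatenate([from_policy, from_gaussian], axis=0)
```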
Terminal value bootstrap (TD target)
Since MPC plans only a finite number of steps, a value or Q-function can estimate what happens after the horizon:
short-horizon rewards + discounted terminal value = estimate of the long-horizon return
This makes short-horizon planning closer to the full RL objective.
MPC with world models
In modern model-based RL, MPC often uses a learned world model.
The pipeline is:
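observation -> encoder -> latent state -> latent rollouts under candidate actions -> predicted return -> first action of the best plan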
For example, in a latent world model:
\[ z_0 = e_\theta(o), \]
\[ \hat{z}_{h+1} = f_\theta(\hat{z}_h, a_h), \]
and the planner maximizes:
\[ J(a_{0:H}, o) = \sum_{h=0}^{H-1} \gamma^h R(\hat{z}_h, a_h) + \gamma^H Q(\hat{z}_H, a_H). \]
This avoids planning directly in pixel space.
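A toy version of this latent objective is sketched below. The random linear maps stand in for learned networks and are purely hypothetical; the point is only the data flow: encode once, roll out in latent space, and finish with a terminal Q-value.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, action_dim = 8, 3, 2

# Hypothetical stand-ins for learned networks (fixed random linear maps).
W_enc = 0.1 * rng.standard_normal((latent_dim, obs_dim))
W_z = 0.9 * np.eye(latent_dim)
W_a = 0.1 * rng.standard_normal((latent_dim, action_dim))


def encode(o):
    return np.tanh(W_enc @ o)


def latent_dynamics(z, a):
    return W_z @ z + W_a @ a


def reward_head(z, a):
    return -float(z @ z)


def q_head(z, a):
    return -float(z @ z) - 0.1 * float(a @ a)


def latent_objective(o, actions, gamma=0.99):
    """actions has H + 1 rows: a_0..a_{H-1} are rolled out, a_H feeds the terminal Q."""
    z = encode(o)  # the observation is encoded once, at the start
    horizon = len(actions) - 1
    total = 0.0
    for h in range(horizon):
        total += gamma ** h * reward_head(z, actions[h])
        z = latent_dynamics(z, actions[h])  # no decoding back to observations
    return total + gamma ** horizon * q_head(z, actions[horizon])


o = rng.standard_normal(obs_dim)
J = latent_objective(o, np.zeros((4, action_dim)))  # H = 3
```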
Example: DC-MPC
DC-MPC is an example of MPC with a learned latent world model.
It first encodes an observation into a discrete codebook latent state:
\[ c_t = f(e_\theta(o_t)). \]
Its latent dynamics model predicts a categorical distribution over the next code:
\[ p_\phi(c_{t+1} \mid c_t, a_t). \]
During planning, DC-MPC evaluates candidate action sequences using predicted rewards and terminal Q-value bootstrapping:
\[ J(a_{0:H}, o) = \sum_{h=0}^{H-1} \gamma^h R(\hat{c}_h, a_h) + \gamma^H q_k(\hat{c}_H, a_H). \]
Its planner is a modified MPPI method. It samples candidate action sequences from a diagonal Gaussian, evaluates them in the learned world model, selects top trajectories, and updates the distribution using return-based weights.
So it is not pure CEM and not pure MPPI. It is closer to a CEM-MPPI hybrid:
CEM-like: it selects the top trajectories.
MPPI-like: it updates the distribution using return-based weights.
It also uses warm start and policy-prior action sequences to make planning more efficient.
Summary
MPC is a model-based planning method.
Its core idea is:
1. Plan multiple future actions.
2. Execute only the first one.
3. Observe the new state and replan.
MPC requires a predictive model. The model can be known, simulated, or learned.
Common solvers include:
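random shooting: sample action sequences and pick the best one
CEM: refit the sampling distribution to the elite samples
MPPI: update the distribution with return-based weights \(w_i \propto \exp(J_i / \lambda)\)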
In model-based RL, MPC is often combined with learned world models, policy priors, warm starts, and terminal Q-value bootstrapping.