# Introduction to Model Predictive Control

Model Predictive Control, or MPC, is a model-based planning method. Its core idea is:

Plan several steps into the future, execute only the first action, and then replan at the next step.

## Basic idea

At time \(t\), MPC considers a future action sequence:

\[ a_{t:t+H-1} = (a_t, a_{t+1}, \dots, a_{t+H-1}). \]

Using a model, it predicts the future outcomes of this action sequence and selects the sequence with the largest predicted return:

\[ a^*_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} J(a_{t:t+H-1}, s_t). \]

However, MPC only executes the first action \(a_t^*\). At the next environment step, it observes the new state and replans.

Because the planning window always extends \(H\) steps ahead of the current time and moves forward with it, MPC is also called receding-horizon control.

```text
observe current state
plan H steps
execute the first action
observe the next state
replan
```
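The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the 1-D `model`, the `reward`, the discrete `ACTIONS` set, and the exhaustive `plan` function are all assumptions made for the example.

```python
import itertools

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def model(state, action):
    """Hypothetical 1-D dynamics: the action shifts the state."""
    return state + action


def reward(state, action):
    """Hypothetical reward: being close to the origin is good."""
    return -abs(state)


def plan(state, horizon):
    """Score every action sequence of length `horizon` and return the best."""
    best_return, best_sequence = float("-inf"), None
    for sequence in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in sequence:
            total += reward(s, a)
            s = model(s, a)
        if total > best_return:
            best_return, best_sequence = total, sequence
    return best_sequence


def run_mpc(state, num_steps=6, horizon=3):
    """Receding-horizon loop: plan, execute the first action, replan."""
    trajectory = [state]
    for _ in range(num_steps):
        sequence = plan(state, horizon)    # plan H steps
        state = model(state, sequence[0])  # execute only the first action
        trajectory.append(state)           # observe the next state
    return trajectory


print(run_mpc(3.0))  # [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Exhaustive search is only feasible here because the action set and horizon are tiny; the sampling-based solvers discussed later replace `plan` in realistic settings.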

## Does MPC require a model?

In the standard sense, yes. MPC needs a model because it must predict future states and rewards under candidate actions.

The model can be:

```text
a known physical model
a simulator
a learned dynamics model
a learned latent world model
```

In model-based RL, the model is often learned from data. For example:

\[ \hat{s}_{t+1} = f_\theta(\hat{s}_t, a_t). \]

In latent-space MPC, observations are first encoded into latent states:

\[ z_t = e_\theta(o_t), \]

and planning is done in latent space:

\[ \hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t). \]

## MPC objective

A simple finite-horizon MPC objective is:

$$ J(a_{0:H-1}, s_0) = \sum_{h=0}^{H-1} \gamma^h r(\hat{s}_h, a_h), $$

subject to:

\[ \hat{s}_0 = s_0, \]

\[ \hat{s}_{h+1} = f(\hat{s}_h, a_h). \]

In RL, the finite-horizon objective is often extended with a value function that bootstraps the return beyond step \(H\):

$$ J(a_{0:H-1}, s_0) = \sum_{h=0}^{H-1} \gamma^h r(\hat{s}_h, a_h) + \gamma^H V(\hat{s}_H). $$

If we use an action-value function, the planner also needs a terminal action \(a_H\), and the terminal bootstrap can be:

$$ J(a_{0:H}, s_0) = \sum_{h=0}^{H-1} \gamma^h r(\hat{s}_h, a_h) + \gamma^H Q(\hat{s}_H, a_H). $$

The value or Q-function estimates returns beyond the finite planning horizon.
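As a rough illustration, the objective can be computed by rolling a candidate action sequence through the model and summing discounted rewards, optionally adding a terminal value. The function name `planning_objective` and the toy `f`, `r`, and `V` are assumptions for this sketch, not part of any particular library.

```python
def planning_objective(s0, actions, f, r, gamma=0.99, terminal_value=None):
    """Discounted model-predicted return of one action sequence.

    If `terminal_value` is given, gamma^H * V(s_H) is added at the end.
    """
    s, total = s0, 0.0
    for h, a in enumerate(actions):
        total += gamma ** h * r(s, a)
        s = f(s, a)  # model prediction of the next state
    if terminal_value is not None:
        total += gamma ** len(actions) * terminal_value(s)
    return total


def f(s, a):
    """Hypothetical dynamics."""
    return s + a


def r(s, a):
    """Hypothetical reward."""
    return -abs(s)


def V(s):
    """Hypothetical value estimate of the long-run cost of ending at s."""
    return -10.0 * abs(s)


actions = [-1.0, -1.0]
print(planning_objective(3.0, actions, f, r, gamma=0.5))                    # -4.0
print(planning_objective(3.0, actions, f, r, gamma=0.5, terminal_value=V))  # -6.5
```

With the bootstrap, a plan that stops one unit away from the origin is penalized for what is expected to happen after the horizon, which the plain finite-horizon sum cannot see. A Q-function variant would additionally take the terminal action \(a_H\).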

## Common MPC solvers

MPC defines the planning problem, but not how to solve it. Common solvers include random shooting, CEM, and MPPI.

### Random shooting

Random shooting samples many action sequences, evaluates them with the model, and picks the best one.

```text
sample many action sequences
roll them out in the model
choose the one with the highest return
execute its first action
```

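It is simple, but it scales poorly: samples are drawn without using feedback from earlier evaluations, so covering a high-dimensional action-sequence space takes very many samples.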
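A minimal NumPy sketch of random shooting follows. The toy model (`s' = s + a`), the reward (negative squared distance of the next state from the origin), the uniform sampling range, and all function names are assumptions chosen to keep the example short.

```python
import numpy as np


def rollout_returns(state, sequences, gamma=0.99):
    """Return of each candidate under a hypothetical model.

    sequences has shape (num_samples, horizon, action_dim).
    Toy model: s' = s + a, with reward r(s, a) = -||s + a||^2.
    """
    num_samples, horizon, _ = sequences.shape
    s = np.repeat(state[None, :], num_samples, axis=0)
    returns = np.zeros(num_samples)
    for h in range(horizon):
        a = sequences[:, h]
        returns -= gamma ** h * np.sum((s + a) ** 2, axis=1)
        s = s + a
    return returns


def random_shooting(state, horizon=5, num_samples=1000, seed=0):
    """Sample sequences uniformly, evaluate them, keep the best one."""
    rng = np.random.default_rng(seed)
    action_dim = state.shape[0]  # the toy model needs action_dim == state_dim
    sequences = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    returns = rollout_returns(state, sequences)
    best = np.argmax(returns)
    return sequences[best, 0], returns[best]  # first action of the best plan


first_action, best_return = random_shooting(np.array([1.0, -1.0]))
```

All candidates are rolled out in a batch, so the cost per planning step is one model call per horizon step regardless of the number of samples.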

### CEM

CEM, or Cross-Entropy Method, maintains a sampling distribution over action sequences, often a diagonal Gaussian:

\[ a_{0:H-1} \sim \mathcal{N}(\mu_{0:H-1}, \operatorname{diag}(\sigma^2_{0:H-1})). \]

The procedure is:

```text
sample action sequences
evaluate them
keep the top-K elite sequences
refit the Gaussian to the elites
repeat
execute the first action
```

CEM is a hard-selection method: only the elite samples are used to update the distribution.
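The procedure can be sketched as follows. This is an illustrative implementation under simple assumptions: the `evaluate` function uses a toy model (`s' = s + a`, reward equal to the negative squared norm of the next state), actions are unbounded, and the hyperparameter values are arbitrary.

```python
import numpy as np


def evaluate(state, sequences):
    """Undiscounted return of each sequence under a hypothetical toy model."""
    s = np.repeat(state[None, :], sequences.shape[0], axis=0)
    returns = np.zeros(sequences.shape[0])
    for h in range(sequences.shape[1]):
        s = s + sequences[:, h]            # toy dynamics: s' = s + a
        returns -= np.sum(s ** 2, axis=1)  # toy reward: -||s'||^2
    return returns


def cem_plan(state, horizon=5, num_samples=200, num_elites=20,
             num_iters=5, seed=0):
    """Cross-Entropy Method over action sequences with a diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    action_dim = state.shape[0]  # the toy model needs action_dim == state_dim
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(num_iters):
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        sequences = mu + sigma * noise                   # sample
        returns = evaluate(state, sequences)             # evaluate
        elite_idx = np.argsort(returns)[-num_elites:]    # keep top-K elites
        elites = sequences[elite_idx]
        mu = elites.mean(axis=0)                         # refit the Gaussian
        sigma = elites.std(axis=0) + 1e-6                # avoid zero variance
    return mu[0], mu  # first action to execute, and the full mean plan


state = np.array([1.0, -1.0])
first_action, mean_plan = cem_plan(state)
```

In this toy problem the best first action is `-state`, so the returned `first_action` should move the state toward the origin.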

### MPPI

MPPI, or Model Predictive Path Integral control, is also a sampling-based trajectory optimization method.

Like CEM, it samples action sequences and evaluates them. But instead of only keeping elites, it assigns weights based on returns:

\[ w_i \propto \exp(J_i / \lambda). \]

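High-return trajectories get larger weights, but lower-return trajectories may still contribute. The temperature \(\lambda > 0\) controls how sharply the weights concentrate on the best trajectories.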

```text
CEM:
hard elite selection

MPPI:
soft return-based weighting
```
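The weighting rule is a softmax over returns. The sketch below computes the normalized weights and the resulting weighted-mean update; the function names are hypothetical, and subtracting the maximum return is a standard numerical-stability step that does not change the normalized weights.

```python
import numpy as np


def mppi_weights(returns, temperature):
    """Normalized weights w_i proportional to exp(J_i / lambda)."""
    shifted = (returns - returns.max()) / temperature  # stabilize the exponent
    weights = np.exp(shifted)
    return weights / weights.sum()


def mppi_update(sequences, returns, temperature=1.0):
    """New mean plan: weighted average of all sampled action sequences.

    sequences has shape (num_samples, horizon, action_dim).
    """
    weights = mppi_weights(returns, temperature)
    return np.tensordot(weights, sequences, axes=1)  # shape (horizon, action_dim)


returns = np.array([1.0, 2.0, 3.0])
print(mppi_weights(returns, temperature=1.0))   # approx. [0.090 0.245 0.665]
print(mppi_weights(returns, temperature=0.01))  # almost all weight on the best
```

As the temperature goes to zero the weights approach a hard argmax, and as it grows they approach a uniform average.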

### Hybrid CEM-MPPI

Some practical methods mix CEM and MPPI.

For example:

```text
sample many trajectories
select top-K elites
compute return-based weights over elites
update the distribution using weighted averaging
```

This is CEM-like because it keeps elites, and MPPI-like because it uses return-based weights.
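One possible form of this update is sketched here. It is an assumption-laden illustration, not the update rule of any specific published method: the function name `hybrid_update`, the temperature value, and the weighted standard deviation are choices made for the example.

```python
import numpy as np


def hybrid_update(sequences, returns, num_elites, temperature=0.5):
    """Elite selection (CEM-like) followed by soft weighting (MPPI-like).

    sequences has shape (num_samples, horizon, action_dim).
    Returns the weighted mean and standard deviation of the elites.
    """
    elite_idx = np.argsort(returns)[-num_elites:]        # hard selection
    elites = sequences[elite_idx]
    elite_returns = returns[elite_idx]
    weights = np.exp((elite_returns - elite_returns.max()) / temperature)
    weights /= weights.sum()                             # soft weighting
    w = weights[:, None, None]
    mu = np.sum(w * elites, axis=0)
    sigma = np.sqrt(np.sum(w * (elites - mu) ** 2, axis=0))
    return mu, sigma


# Four one-step, one-dimensional candidates whose return equals their value.
sequences = np.array([0.0, 1.0, 2.0, 3.0]).reshape(4, 1, 1)
returns = np.array([0.0, 1.0, 2.0, 3.0])
mu, sigma = hybrid_update(sequences, returns, num_elites=2, temperature=1.0)
print(mu[0, 0])  # approx. 2.731, between the two elites but closer to the best
```

Setting `num_elites` to the sample count recovers a plain MPPI-style update, while a very large temperature recovers the unweighted CEM refit.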

## Why sample an action sequence?

In sampling-based MPC, one sample is usually a whole future action sequence:

\[ a_{0:H-1} = (a_0, a_1, \dots, a_{H-1}). \]

If each action has dimension \(d_a\), then the sequence has dimension \(H d_a\).

So if \(H=10\) and \(d_a=6\), the action sequence is a 60-dimensional vector.

This is why methods such as CEM and MPPI often use a high-dimensional Gaussian over action sequences. The Gaussian is not necessarily standard; its mean and variance are updated during planning.
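To make the dimensions concrete, the following sketch draws a few sequences from a diagonal Gaussian with \(H=10\) and \(d_a=6\). The mean and standard deviation values are placeholders; in CEM or MPPI they would be updated at each planning iteration.

```python
import numpy as np

horizon, action_dim, num_samples = 10, 6, 4
rng = np.random.default_rng(0)

# One mean and one standard deviation per (time step, action dimension).
mu = np.zeros((horizon, action_dim))
sigma = 0.5 * np.ones((horizon, action_dim))

# Each sample is a whole action sequence of shape (horizon, action_dim).
sequences = mu + sigma * rng.standard_normal((num_samples, horizon, action_dim))

# Flattening shows that each sample is a single 60-dimensional vector.
flat = sequences.reshape(num_samples, -1)
print(flat.shape)  # (4, 60)
```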

## Useful tricks

### Short horizon

MPC usually uses a relatively short horizon. It does not need to plan the entire episode because it replans at every step.

### Warm start

MPC solves a new planning problem at every environment step. However, consecutive planning problems are usually very similar, so we can initialize the new planning problem using the previous solution.

Suppose the previous plan was:

\[ (a_0, a_1, a_2, \dots, a_{H-1}). \]

MPC executes only the first action \(a_0\). At the next step, the remaining actions

\[ (a_1, a_2, \dots, a_{H-1}) \]

are still useful because they were already planned for the future. Therefore, the next planning problem can be initialized with the shifted sequence:

\[ (a_1, a_2, \dots, a_{H-1}, a_{\text{new}}). \]

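This is called warm starting. The new final action \(a_{\text{new}}\) has no previous plan to draw on, so it can be set to a default such as zero or a copy of \(a_{H-1}\). Warm starting makes planning more efficient because the optimizer does not start from scratch at every step.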
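The shift operation is a one-liner in NumPy. In this sketch the function name `warm_start` and the choice of a zero action for the new last step are assumptions; other fill values work equally well.

```python
import numpy as np


def warm_start(previous_plan, new_last_action):
    """Drop the executed first action and append a new final action.

    previous_plan has shape (horizon, action_dim).
    """
    return np.concatenate([previous_plan[1:], new_last_action[None, :]], axis=0)


previous_plan = np.arange(8.0).reshape(4, 2)  # rows are a_0, a_1, a_2, a_3
next_init = warm_start(previous_plan, np.zeros(2))
print(next_init)
# [[2. 3.]
#  [4. 5.]
#  [6. 7.]
#  [0. 0.]]
```

For CEM or MPPI, `next_init` would serve as the initial mean of the sampling distribution at the next environment step.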

### Policy prior

A learned policy can propose candidate action sequences for the planner.

```text
policy:
gives reasonable candidate actions

MPC:
improves them by planning
```

This is useful because pure random sampling can be inefficient.
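One way to use a policy prior is to roll the policy out in the model and add noisy copies of the resulting sequence to the candidate pool. The sketch below assumes a hypothetical linear `policy` standing in for a learned network, a toy `model`, and an arbitrary split between policy-based and random candidates.

```python
import numpy as np


def policy(state):
    """Hypothetical stand-in for a learned policy network."""
    return -0.5 * state


def model(state, action):
    """Hypothetical dynamics."""
    return state + action


def policy_rollout(state, horizon):
    """Action sequence obtained by following the policy inside the model."""
    actions = []
    for _ in range(horizon):
        a = policy(state)
        actions.append(a)
        state = model(state, a)
    return np.stack(actions)  # shape (horizon, action_dim)


def build_candidates(state, horizon, num_random, num_policy, rng, noise_std=0.1):
    """Candidate pool: noisy policy sequences plus purely random sequences."""
    action_dim = state.shape[0]
    base = policy_rollout(state, horizon)
    policy_part = base + noise_std * rng.standard_normal(
        (num_policy, horizon, action_dim))
    random_part = rng.standard_normal((num_random, horizon, action_dim))
    return np.concatenate([policy_part, random_part], axis=0)


rng = np.random.default_rng(0)
candidates = build_candidates(np.array([2.0, -2.0]), horizon=3,
                              num_random=5, num_policy=2, rng=rng)
print(candidates.shape)  # (7, 3, 2)
```

The planner then evaluates all candidates as usual, so a poor policy costs only a few wasted samples while a good one gives the search a strong starting point.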

### Terminal value bootstrap (TD target)

Since MPC plans only a finite number of steps, a value or Q-function can estimate what happens after the horizon:

```text
short-horizon rewards
+
terminal value bootstrap
```

This makes short-horizon planning closer to the full RL objective.

## MPC with world models

In modern model-based RL, MPC often uses a learned world model.

The pipeline is:

```text
observation
-> encoder
-> latent state
-> latent dynamics rollout
-> predicted rewards and values
-> action-sequence optimization
```

For example, in a latent world model:

\[ z_0 = e_\theta(o), \]

\[ \hat{z}_{h+1} = f_\theta(\hat{z}_h, a_h), \]

and the planner maximizes:

$$ J(a_{0:H}, o) = \sum_{h=0}^{H-1} \gamma^h R(\hat{z}_h, a_h) + \gamma^H Q(\hat{z}_H, a_H). $$

This avoids planning directly in pixel space.
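The latent-space objective can be sketched as follows. Everything here is a placeholder: the encoder, dynamics, reward head, and Q head are fixed random linear maps standing in for learned networks, and the dimensions are arbitrary. The point is only that the observation is encoded once and every later step works on latent vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, action_dim = 16, 4, 2

# Hypothetical stand-ins for learned networks: fixed random linear maps.
W_enc = rng.standard_normal((latent_dim, obs_dim)) / np.sqrt(obs_dim)
W_z = 0.9 * np.eye(latent_dim)
W_a = 0.1 * rng.standard_normal((latent_dim, action_dim))
w_r = rng.standard_normal(latent_dim + action_dim)
w_q = rng.standard_normal(latent_dim + action_dim)


def encode(o):
    """Encoder: observation -> latent state."""
    return np.tanh(W_enc @ o)


def dynamics(z, a):
    """Latent dynamics for a batch: z has shape (N, latent_dim)."""
    return np.tanh(z @ W_z.T + a @ W_a.T)


def reward_head(z, a):
    return np.concatenate([z, a], axis=-1) @ w_r


def q_head(z, a):
    return np.concatenate([z, a], axis=-1) @ w_q


def latent_returns(o, sequences, gamma=0.99):
    """Score sequences of shape (N, H + 1, action_dim) in latent space.

    The last action a_H is used only by the terminal Q-value.
    """
    n, horizon = sequences.shape[0], sequences.shape[1] - 1
    z = np.repeat(encode(o)[None, :], n, axis=0)  # encode the observation once
    returns = np.zeros(n)
    for h in range(horizon):
        a = sequences[:, h]
        returns += gamma ** h * reward_head(z, a)
        z = dynamics(z, a)
    returns += gamma ** horizon * q_head(z, sequences[:, horizon])
    return returns


observation = rng.standard_normal(obs_dim)
candidates = rng.standard_normal((8, 4, action_dim))  # H = 3 plus terminal action
scores = latent_returns(observation, candidates)
```

Any of the sampling-based solvers above could consume `scores` in place of returns computed from a state-space model.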

## Example: DC-MPC

DC-MPC is an example of MPC with a learned latent world model.

It first encodes an observation into a discrete codebook latent state:

\[ c_t = f(e_\theta(o_t)). \]

Its latent dynamics model predicts a categorical distribution over the next code:

\[ p_\phi(c_{t+1} \mid c_t, a_t). \]

During planning, DC-MPC evaluates candidate action sequences using predicted rewards and terminal Q-value bootstrapping:

$$ J(a_{0:H}, o) = \sum_{h=0}^{H-1} \gamma^h R(\hat{c}_h, a_h) + \gamma^H q_k(\hat{c}_H, a_H). $$

Its planner is a modified MPPI method. It samples candidate action sequences from a diagonal Gaussian, evaluates them in the learned world model, selects top trajectories, and updates the distribution using return-based weights.

So it is not pure CEM and not pure MPPI. It is closer to a CEM-MPPI hybrid:

```text
CEM-like:
select top-K elite action sequences

MPPI-like:
use return-based importance weights to update the distribution
```

It also uses warm start and policy-prior action sequences to make planning more efficient.

## Summary

MPC is a model-based planning method.

Its core idea is:

```text
plan multiple future actions
execute only the first action
replan at the next step
```

MPC requires a predictive model. The model can be known, simulated, or learned.

Common solvers include:

```text
random shooting:
sample trajectories and pick the best

CEM:
sample, keep elites, refit distribution

MPPI:
sample, weight by return, update distribution

hybrid CEM-MPPI:
keep elites and use return-based weights
```

In model-based RL, MPC is often combined with learned world models, policy priors, warm starts, and terminal Q-value bootstrapping.