Transformer-Based World Models

GAIA-1

GAIA-1 (‘Generative AI for Autonomy’) is a generative world model designed to simulate high-resolution driving scenarios conditioned on past video, text, and action inputs.

Here, the term world model refers to a generative model in pixel space—i.e., it learns to produce future observations (image frames) rather than the inner dynamics (states) of the world, in contrast to latent-space models used in model-based RL (e.g., Dreamer).

Notation Table

| Symbol | Type | Explanation |
|---|---|---|
| $H, W, C$ | $\mathbb{N}^+$ | Image height, width, and number of channels (e.g., RGB with $C = 3$) |
| $D$ | $\mathbb{N}^+$ | Patch downsampling factor in VQ-GAN (e.g., $D = 16$) |
| $n$ | $\mathbb{N}^+$ | Number of image tokens per frame: $n = \frac{H}{D} \cdot \frac{W}{D}$ |
| $K$ | $\mathbb{N}^+$ | Vocabulary size of image tokens (codebook size in VQ-GAN) |
| $T$ | $\mathbb{N}^+$ | Number of time steps in the input sequence (also the video length used in video decoder training) |
| $m$ | $\mathbb{N}^+$ | Number of text tokens per time step (e.g., $m = 32$) |
| $l$ | $\mathbb{N}^+$ | Number of scalar action components per step (e.g., speed and curvature) |
| $d$ | $\mathbb{N}^+$ | Dimensionality of each action embedding |
| $x_t$ | $\mathbb{R}^{H \times W \times C}$ | Input image at time step $t$ |
| $z_t$ | $\{1, \dots, K\}^n$ | Discrete image tokens from VQ-GAN at time $t$ |
| $c_t$ | $\mathbb{R}^{m \times d}$ | Text token embeddings at time $t$ from a pretrained encoder (e.g., T5) |
| $a_t$ | $\mathbb{R}^{l \times d}$ | Embedded action vector at time $t$ |
| $E_\theta$ | Function | VQ-GAN encoder that maps $x_t \mapsto z_t$ |
| $z_{t,j}$ | $\{1, \dots, K\}$ | $j$-th discrete token in frame $t$ |
| $(c_1, z_1, a_1, \dots, c_T, z_T, a_T)$ | Sequence | Full multimodal input to the autoregressive world model |
| $\epsilon_\theta$ | Function | Denoising network of the video diffusion model |
| $\epsilon$ | $\mathbb{R}^{T \times H \times W \times C}$ | Ground-truth noise in video diffusion training |
| $x$ | $\mathbb{R}^{T \times H \times W \times C}$ | Original clean video sequence of length $T$ |
| $x_t$ | $\mathbb{R}^{T \times H \times W \times C}$ | Noised video at diffusion step $t$ with schedule $(\alpha_t, \sigma_t)$ (overloads the frame notation above) |
| $\alpha_t, \sigma_t$ | $\mathbb{R}$ | Noise schedule coefficients at diffusion step $t$ |
| $z$ | Sequence | Conditioning image token sequence $[z_1, \dots, z_T]$ |
| $m$ | $\{0, 1\}^{T \times H \times W}$ | Binary masks used in the video decoder training objective |
| $\mathcal{L}_{\text{video}}$ | $\mathbb{R}$ | Diffusion model training loss |

Modeling framework

GAIA-1 formulates world modeling as an autoregressive sequence modeling problem over discrete tokens. At each time step t, three modalities are used:

  • Image tokens $z_t = (z_{t,1}, \dots, z_{t,n}) \in \{1, \dots, K\}^n$
  • Text tokens $c_t = (c_{t,1}, \dots, c_{t,m})$, with $c_{t,i} \in \mathbb{R}^d$
  • Action embeddings $a_t = (a_{t,1}, \dots, a_{t,l}) \in \mathbb{R}^{l \times d}$

The input sequence to the transformer is:

$(c_1, z_1, a_1, \dots, c_T, z_T, a_T)$

The model is trained to predict the next image token conditioned on all previous tokens:

$p(z_{t,i} \mid c_{\le t}, z_{<t}, z_{t,<i}, a_{<t})$

where causal masking ensures the autoregressive constraint.

Tokenization of modalities

GAIA-1 defines a generative world model in pixel space—it autoregressively models future observations rather than latent states or dynamics. The input to the model consists of a sequence of three modalities at each time step: text, image, and action.

1. Image tokens

Each image frame $x_t \in \mathbb{R}^{H \times W \times C}$ is processed independently by a pretrained VQ-GAN encoder $E_\theta$, which acts as an image tokenizer (note: this encoder operates on images, not full videos):

$z_t = E_\theta(x_t), \quad z_t = (z_{t,1}, \dots, z_{t,n}), \quad z_{t,i} \in \{1, \dots, K\}$

Here, $n = \frac{H}{D} \cdot \frac{W}{D}$ is the number of image tokens per frame after downsampling by a factor of $D$, and $K$ is the codebook size. Each $z_{t,i}$ is the discrete index of a latent code vector in the VQ-GAN's vocabulary, so each frame yields a sequence of $n$ image tokens.
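As a deliberately simplified illustration of this tokenization step, the sketch below maps a frame to $n$ discrete indices with a toy strided-conv encoder and a nearest-neighbour codebook lookup; it is a stand-in for, not a reproduction of, GAIA-1's actual VQ-GAN.

```python
import torch
import torch.nn as nn

class ToyImageTokenizer(nn.Module):
    """Minimal VQ-style tokenizer sketch: downsample by D, then pick the nearest codebook entry."""
    def __init__(self, C=3, D=16, K=8192, latent_dim=256):
        super().__init__()
        # A single strided conv as a stand-in for the VQ-GAN encoder E_theta.
        self.encoder = nn.Conv2d(C, latent_dim, kernel_size=D, stride=D)
        self.codebook = nn.Embedding(K, latent_dim)   # K latent code vectors

    def forward(self, x):                             # x: (B, C, H, W)
        feat = self.encoder(x)                        # (B, latent_dim, H/D, W/D)
        B, E, h, w = feat.shape
        flat = feat.permute(0, 2, 3, 1).reshape(-1, E)        # (B * n, E)
        dist = torch.cdist(flat, self.codebook.weight)        # distance to every codebook entry
        return dist.argmin(dim=-1).reshape(B, h * w)          # (B, n) token ids in {0, ..., K-1}

tokens = ToyImageTokenizer()(torch.randn(1, 3, 288, 512))     # n = (288/16) * (512/16) = 576 tokens
```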

2. Text Tokens

Textual instructions (e.g., driving intent or descriptions) are encoded using a pretrained T5-large model. At each time step t, the text input is tokenized into:

$c_t = (c_{t,1}, \dots, c_{t,m}), \quad c_{t,i} \in \mathbb{R}^d, \quad m = 32$
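For illustration, per-step text embeddings could be obtained roughly as follows with the HuggingFace `transformers` implementation of T5; the padding and truncation choices here are my assumptions, not details from the paper.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large").eval()

def encode_text(prompt: str, m: int = 32) -> torch.Tensor:
    """Return m continuous text embeddings c_t for one time step."""
    batch = tokenizer(prompt, max_length=m, padding="max_length",
                      truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[0]        # shape (m, 1024) for t5-large

c_t = encode_text("the ego vehicle slows down for a red light")
```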

These are fixed-length continuous embeddings (not discrete tokens) used as conditioning inputs.

3. Action Tokens (Non-discrete)

Actions are represented by l=2 scalars per time step (e.g., speed and curvature).

$a_t = (a_{t,1}, a_{t,2}) \in \mathbb{R}^2$

Note: actions are not true tokens: they are not discretized and thus have no finite vocabulary. They are used as conditioning inputs only; the transformer does not generate or predict action tokens.
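One plausible way to turn each action scalar into a $d$-dimensional embedding, assumed here purely for illustration, is a small learned linear map shared across components:

```python
import torch
import torch.nn as nn

d = 512                                     # assumed embedding width
action_proj = nn.Linear(1, d)               # shared map from one scalar to a d-dim vector

def embed_action(speed: float, curvature: float) -> torch.Tensor:
    scalars = torch.tensor([[speed], [curvature]])   # shape (l, 1) with l = 2
    return action_proj(scalars)                      # a_t with shape (l, d)

a_t = embed_action(8.3, 0.01)
```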

At each time step t, the tokens are concatenated in the order:

$(c_t, z_t, a_t)$

Over a horizon of T steps, the full input sequence becomes:

$(c_1, z_1, a_1, \dots, c_T, z_T, a_T)$
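To make the layout concrete, here is a schematic of how the per-step pieces could be interleaved into one embedding sequence; the projection layers, model width, and shapes are illustrative assumptions, not GAIA-1's implementation details.

```python
import torch
import torch.nn as nn

d_model, K = 512, 8192
image_embed = nn.Embedding(K, d_model)      # lookup table for discrete image tokens
text_proj = nn.Linear(1024, d_model)        # project T5 embeddings to the model width
action_proj = nn.Linear(512, d_model)       # project action embeddings to the model width

def build_sequence(c, z, a):
    """c: (T, m, 1024) text, z: (T, n) image token ids, a: (T, l, 512) action embeddings."""
    steps = []
    for c_t, z_t, a_t in zip(c, z, a):
        # Per-step order (c_t, z_t, a_t), matching the concatenation described above.
        steps.append(torch.cat([text_proj(c_t),            # (m, d_model)
                                image_embed(z_t),          # (n, d_model)
                                action_proj(a_t)], dim=0)) # (l, d_model)
    return torch.cat(steps, dim=0)          # (T * (m + n + l), d_model)

seq = build_sequence(torch.randn(4, 32, 1024),
                     torch.randint(0, K, (4, 576)),
                     torch.randn(4, 2, 512))
```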

Embeddings

  • Temporal embeddings: one learnable vector per time step, shared across all tokens at that step ($T$ in total).
  • Spatial embeddings: one learnable vector per position within a time step ($m + n + l$ positions in total).

These embeddings help the model distinguish both the position within each timestep and the temporal order of frames.
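A minimal sketch of how such factorized positional embeddings might be built and added to the flattened sequence (the widths and the broadcasting layout are assumptions):

```python
import torch
import torch.nn as nn

T, m, n, l, d_model = 4, 32, 576, 2, 512
P = m + n + l                                          # tokens per time step

temporal = nn.Parameter(torch.zeros(T, 1, d_model))    # one vector per time step
spatial = nn.Parameter(torch.zeros(1, P, d_model))     # one vector per within-step position

def add_positional(seq):                               # seq: (T * P, d_model)
    pos = (temporal + spatial).reshape(T * P, d_model) # broadcast to (T, P, d_model), then flatten
    return seq + pos

out = add_positional(torch.randn(T * P, d_model))
```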

Image tokenizer

Since GAIA-1 uses an autoregressive Transformer as its world model, the input must be represented as sequences of discrete tokens from a finite vocabulary. To tokenize image frames, GAIA-1 adopts a pretrained VQ-GAN encoder Eθ, applied frame-by-frame:

$z_t = E_\theta(x_t), \quad z_t = (z_{t,1}, \dots, z_{t,n}), \quad z_{t,i} \in \{1, \dots, K\}$, where $n = \frac{H}{D} \cdot \frac{W}{D}$ is the number of discrete image tokens per frame and $K$ is the codebook size.

Inductive bias via DINO

To improve the semantic quality of the learned tokens, GAIA-1 introduces an additional distillation loss during VQ-GAN training. Specifically, inspired by BEIT V2, it encourages the VQ-GAN’s quantized outputs to align with features extracted from a pretrained DINO model, serving as an external semantic prior.

[Figure 3 from the GAIA-1 paper: latent-space visualization]

While this yields improved latent-space visualizations, as shown in Figure 3, the paper does not report a generation-quality improvement from this distillation.
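The paper's exact formulation isn't reproduced here; the sketch below shows only one plausible form of such a distillation term, aligning projected quantized VQ-GAN features with frozen DINO features via cosine similarity (the projection and the feature dimensions are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(256, 768)   # map an assumed VQ-GAN latent dim (256) to an assumed DINO dim (768)

def dino_distillation_loss(vq_feats, dino_feats):
    """vq_feats: (B, n, 256) quantized encoder features; dino_feats: (B, n, 768) frozen DINO features."""
    sim = F.cosine_similarity(proj(vq_feats), dino_feats, dim=-1)   # (B, n)
    return (1.0 - sim).mean()   # push quantized features toward the DINO semantic prior

loss = dino_distillation_loss(torch.randn(2, 576, 256), torch.randn(2, 576, 768))
```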

Limitations: lack of temporal awareness

Importantly, Eθ is trained on individual frames, not sequences, and thus captures no temporal dynamics. It acts purely as an image-level tokenizer. As a result, the output tokens zt lack temporal coherence across frames.

To compensate for this limitation, GAIA-1 introduces a separate video decoder trained independently to generate temporally consistent video outputs. The image decoder coupled with Eθ (from VQ-GAN) is discarded after training, and only the encoder is retained for tokenization.

The video decoder architecture and training are discussed in a later section.

Transformer world model

The core of GAIA-1 is an autoregressive Transformer decoder trained to model the following multimodal token sequence:

$(c_1, z_1, a_1, \dots, c_T, z_T, a_T)$

where $c_t$ are text tokens, $z_t$ are discrete image tokens, and $a_t$ are continuous action embeddings at time step $t$. The model uses causal masking to ensure that predictions are conditioned only on past or partial information:

  • From previous time steps (plus the current step's text): $c_{\le t}$, $z_{<t}$, $a_{<t}$
  • Within the same time step: previously generated image tokens $z_{t,j}$ with $j < i$

The Transformer is trained to predict the next image token in the sequence, i.e., it is autoregressive over the image modality only. Action inputs are not discretized and are used as conditioning inputs rather than prediction targets.
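To make the training target concrete, here is a minimal sketch of a next-image-token loss on top of a causal transformer; the backbone, vocabulary size, and position bookkeeping are illustrative assumptions rather than GAIA-1's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, d_model = 8192, 512
backbone = nn.TransformerEncoder(          # causal self-attention stack as a stand-in decoder
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
to_logits = nn.Linear(d_model, K)          # predicts the *next* image token at each position

def world_model_loss(seq_emb, image_positions, next_image_tokens):
    """seq_emb: (B, L, d_model) interleaved (c, z, a) embeddings.
    image_positions: (B, P) positions whose next token is an image token.
    next_image_tokens: (B, P) ground-truth ids of those next image tokens."""
    L = seq_emb.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)   # mask out future positions
    h = backbone(seq_emb, mask=causal)
    logits = to_logits(h.gather(1, image_positions[..., None].expand(-1, -1, d_model)))
    # Cross-entropy only where the target is an image token; text and actions are conditioning only.
    return F.cross_entropy(logits.reshape(-1, K), next_image_tokens.reshape(-1))

loss = world_model_loss(torch.randn(2, 64, d_model),
                        torch.randint(0, 63, (2, 10)),
                        torch.randint(0, K, (2, 10)))
```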

Video decoder

As the image tokenizer Eθ operates on individual frames, it lacks temporal context. To address this, GAIA-1 introduces a video decoder trained independently using a denoising diffusion model with both spatial and temporal attention (following Video Diffusion Models, Ho et al., 2022).

The video decoder is trained using the noise-prediction objective with v-parameterization (a minimal sketch of this training step follows the list below):

$\mathcal{L}_{\text{video}} = \mathbb{E}_{\epsilon, t}\left[\, \| \epsilon_\theta(x_t, t, z, m) - \epsilon \|_2^2 \,\right]$

where:

  • $\epsilon_\theta$ is the denoising video model.
  • $\epsilon$ is the denoising target, which uses the v-parameterization.
  • $t \sim \mathrm{Uniform}(0, 1)$ is the sampled diffusion time.
  • $x = (x_1, \dots, x_T)$ is a video sequence of length $T$.
  • $x_t = \alpha_t x + \sigma_t \epsilon$ is the noised video, with $\alpha_t$ and $\sigma_t$ functions of $t$ that define the noise schedule.
  • $z = (z_1, \dots, z_T) = E_\theta(x)$ is the sequence of conditioning image tokens.
  • $m = (m_1, \dots, m_T)$ is a sequence of image masks as specified by the training task (see Figure 4 in the original paper).
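As a rough illustration of this training step, the sketch below noises a clip, builds a v-parameterized target, and regresses the model output onto it; the cosine noise schedule and the conditioning interface of $\epsilon_\theta$ are assumptions, not the paper's exact choices.

```python
import math
import torch
import torch.nn.functional as F

def video_diffusion_loss(model, x, z, m):
    """x: (B, T, C, H, W) clean video; z: conditioning image tokens; m: (B, T, 1, H, W) task masks."""
    B = x.size(0)
    t = torch.rand(B)                                   # continuous diffusion time t ~ Uniform(0, 1)
    # Assumed cosine schedule: alpha_t = cos(pi/2 * t), sigma_t = sin(pi/2 * t).
    alpha = torch.cos(0.5 * math.pi * t).view(B, 1, 1, 1, 1)
    sigma = torch.sin(0.5 * math.pi * t).view(B, 1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = alpha * x + sigma * eps                       # noised video
    target = alpha * eps - sigma * x                    # v-parameterized denoising target
    return F.mse_loss(model(x_t, t, z, m), target)      # regress model output onto the v target
```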

Thoughts

GAIA-1 demonstrates a comprehensive, general approach to generative world modeling conditioned on multimodal sequences (text, action, image). It treats world modeling as sequence modeling: predict future image tokens based on past observations and other multimodal instructions, e.g., text and actions.

Unlike model-based RL frameworks (e.g., Dreamer), GAIA-1 does not predict rewards. From my understanding, the motivation is that rewards are typically human-defined functions of the state and are therefore known at test time. Thus, reward prediction is NOT a necessary part of the world model. Instead, GAIA-1 focuses solely on modeling observation dynamics, which are more fundamental.

Interestingly, GAIA-1 incorporates feature distillation from a pretrained DINO model during VQ-GAN training. This likely injects semantic priors into the image tokens, although the paper does not quantify the effect on token quality or generation fidelity.

The only clear limitation is that the image tokenizer Eθ lacks temporal modeling. However, I believe this is a design choice and can be addressed.

WorldDreamer

WorldDreamer is another generative pixel-space world model.

Similar to GAIA-1, it uses a transformer to predict image tokens and can be conditioned on multi-modal instructions.

The difference is that WorldDreamer predicts masked visual tokens instead of the next token, which makes it closer to MAE in spirit. I don't know why masked prediction (or image completion?) yields generative capability; perhaps there are papers on this.
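For reference, here is a minimal sketch of what such a masked-visual-token objective could look like; the mask ratio, the reserved mask id, and the model interface are illustrative assumptions, not WorldDreamer's actual setup.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(model, tokens, K, mask_ratio=0.5):
    """tokens: (B, L) discrete image token ids; id K is reserved as the [MASK] token."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    corrupted = tokens.masked_fill(mask, K)        # replace masked positions with [MASK]
    logits = model(corrupted)                      # (B, L, K) predictions over the visual vocabulary
    # Cross-entropy only at masked positions, MAE-style.
    return F.cross_entropy(logits[mask], tokens[mask])
```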

Meanwhile, WorldDreamer proposes a new transformer variant. Its difference from a traditional transformer (e.g., the one used in GAIA-1) is that, whereas a traditional transformer computes self-attention over the embeddings of all tokens, WorldDreamer's variant uses a 3D CNN to aggregate image-token embeddings that belong to the same spatial location across time. After splitting the image embeddings into patches, it computes cross-attention between these patches and the embeddings of the other modalities.

I don't know why they do this, i.e., split the embeddings into patches and then do the cross-attention step.
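That said, here is a rough sketch of my reading of the mechanism described above; all shapes, kernel sizes, and module choices are my own assumptions, not WorldDreamer's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalAggThenCrossAttn(nn.Module):
    """Aggregate each spatial location's image-token embeddings across time with a 3D conv,
    then cross-attend from the image embeddings to other-modality (e.g., text) embeddings."""
    def __init__(self, d_model=512, n_heads=8, kernel_t=3):
        super().__init__()
        # Convolve only along the time axis so each (h, w) location is mixed across frames.
        self.temporal_conv = nn.Conv3d(d_model, d_model,
                                       kernel_size=(kernel_t, 1, 1),
                                       padding=(kernel_t // 2, 0, 0))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img_emb, cond_emb):
        # img_emb: (B, T, H', W', d_model) image-token embeddings; cond_emb: (B, M, d_model).
        B, T, Hp, Wp, D = img_emb.shape
        x = self.temporal_conv(img_emb.permute(0, 4, 1, 2, 3))     # (B, D, T, H', W')
        q = x.permute(0, 2, 3, 4, 1).reshape(B, T * Hp * Wp, D)    # queries: aggregated image tokens
        out, _ = self.cross_attn(q, cond_emb, cond_emb)            # attend to text/action embeddings
        return out.reshape(B, T, Hp, Wp, D)

block = TemporalAggThenCrossAttn()
y = block(torch.randn(1, 4, 18, 32, 512), torch.randn(1, 32, 512))
```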