Understanding Simulus Token-Based World Model Architecture
This note covers the core architecture, losses, and training procedure shared by Simulus and REM. The main innovations that distinguish Simulus from REM (intrinsic motivation, prioritized replay, and RaC) are discussed in a separate article.
Sources:
- Simulus world model paper
- REM world model paper. REM is the predecessor of Simulus and shares the same model architecture.
- Retentive Network: A Successor to Transformer for Large Language Models (RetNet). RetNet is the backbone sequence model used by REM and Simulus. (I was surprised to learn that this paper comes from Tsinghua; the authors have a promising future.)
Simulus: Network Architecture, Losses, and Training Procedure
To stay consistent with the papers, the following notation is used throughout:
\[ V,\quad M,\quad C,\quad o_t,\quad a_t,\quad z_t,\quad z_t^a,\quad \tau_t,\quad X_t,\quad Y_t^u,\quad S_t \]
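where \(V\) is the representation module, \(M\) is the world model, and \(C\) is the controller.
0. Problem Setting: State-Agnostic POMDP Formulation
Simulus adopts the Partially Observable Markov Decision Process (POMDP) setting. In practice, however, the agent usually has no access to the hidden state, so the paper uses a state-agnostic formulation: the environment dynamics are described purely in terms of the observation-action history visible to the agent.
Let \(\Omega\) denote the observation space and \(\mathcal{A}\) the action space. At each time step \(t\), the agent observes
\[ o_t \in \Omega \]
and selects an action
\[ a_t \in \mathcal{A} \]
From the agent's perspective, the environment evolves according to a history-dependent conditional distribution: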
\[ o_{t+1}, r_t, d_t \sim p(o_{t+1}, r_t, d_t \mid o_{\le t}, a_{\le t}) \]
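Here \(r_t\) is the observed reward and \(d_t \in \{0,1\}\) is the termination signal. The episode continues until a positive termination signal (\(d_t = 1\)) is observed.
The agent's goal is to maximize the expected discounted return: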
\[ \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right], \quad \gamma \in [0,1] \]
For multi-modal observations, the observation at time step \(t\) is written as
\[ o_t = \{o_t^{(i)}\}_{i=1}^{|\kappa|} \]
where \(\kappa\) is the set of modalities and \(o_t^{(i)}\) is the observation of the \(i\)-th modality \(\kappa_i\).
Under this formulation, token-based world models such as REM and Simulus do not model the hidden state directly; they learn the environment dynamics from the observable history. The world model's target is:
\[ p(o_{t+1}, r_t, d_t \mid o_{\le t}, a_{\le t}) \]
That is, given all observations and actions up to time step \(t\), predict the next observation, the reward of the current transition, and the termination signal.
After tokenization, this prediction problem is carried over to token space. Given the tokenized trajectory
\[ \tau_t = z_1, z_1^a, \ldots, z_t, z_t^a \]
the world model \(M\) learns
\[ p_\theta(\hat z_{t+1}\mid \tau_t), \quad \hat r_t = \hat r_\theta(\tau_t), \quad p_\theta(\hat d_t\mid \tau_t) \]
In other words, given the observation tokens and action tokens up to step \(t\), the world model predicts the next-step observation tokens, the reward, and the termination signal.
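To make the layout of \(\tau_t\) concrete, here is a minimal sketch of interleaving per-step observation tokens and action tokens into one flat sequence. The function name `build_token_trajectory` and the toy token values are illustrative assumptions, not taken from the papers.

```python
def build_token_trajectory(obs_tokens, action_tokens):
    """Interleave per-step observation and action tokens into one flat list."""
    if len(obs_tokens) != len(action_tokens):
        raise ValueError("need one action-token block per observation block")
    trajectory = []
    for z_t, z_t_a in zip(obs_tokens, action_tokens):
        trajectory.extend(z_t)    # K observation tokens of step t
        trajectory.extend(z_t_a)  # K_a action tokens of step t
    return trajectory


# Toy example: 2 steps, K = 3 observation tokens and K_a = 1 action token per step.
obs_tokens = [[5, 9, 2], [7, 7, 1]]
action_tokens = [[4], [0]]
tau = build_token_trajectory(obs_tokens, action_tokens)
print(tau)  # [5, 9, 2, 4, 7, 7, 1, 0]
```

Each time step contributes \(K + K_a\) tokens, so a segment of \(t\) steps has \(t\,(K + K_a)\) tokens in total.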
1. Overall Structure: Modular Token-Based World Model Agent
Simulus consists of three main neural modules:
\[ V,\quad M,\quad C \]
Their roles are:
\[ V: \text{raw observation/action} \rightarrow \text{tokens} \]
\[ M: \text{token trajectory} \rightarrow \text{next tokens, reward, termination} \]
\[ C: \text{tokens} \rightarrow \text{policy/value} \]
The core data flow is:
\[ o_t \xrightarrow{V} z_t \]
\[ a_t \xrightarrow{V} z_t^a \]
\[ \tau_t = z_1, z_1^a, \ldots, z_t, z_t^a \]
\[ M(\tau_t) \rightarrow p_\theta(\hat z_{t+1}\mid \tau_t), \hat r_t, p_\theta(\hat d_t\mid \tau_t) \]
The controller \(C\) then learns its policy inside rollouts imagined by \(M\).
The key point is that Simulus optimizes representation learning, world model learning, and controller learning separately. This differs from the joint optimization of Dreamer's RSSM.
2. Representation Module \(V\)
The representation module \(V\) is a multi-modal tokenizer / detokenizer system.
More precisely:
\[ V = \{\mathrm{enc}_i,\mathrm{dec}_i\}_{i=1}^{|\kappa|} \]
Each modality \(\kappa_i\) has its own encoder-decoder pair:
\[ z^{(i)} = \mathrm{enc}_i(o^{(i)}) \]
\[ \hat o^{(i)} = \mathrm{dec}_i(z^{(i)}) \]
The tokenizer used for each modality:
| Modality | Tokenizer |
|---|---|
| image | learned VQ-VAE tokenizer |
| continuous vector | symlog + quantization |
| categorical variable | already a discrete token, so it usually needs no elaborate tokenizer |
| 2D categorical grid | flatten the spatial dimensions, then treat the entries as categorical tokens |
| action | tokenized with the tokenizer of the matching action modality to obtain \(z_t^a\) |
In other words, Simulus tokenizes actions as well as observations.
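For the continuous-vector row, the following is a rough sketch of what "symlog + quantization" can look like for a single scalar. The symlog transform itself is standard, \(\mathrm{symlog}(x) = \mathrm{sign}(x)\ln(1+|x|)\); the bin range `[-5, 5]`, the 11 evenly spaced bins, and the function names are assumptions chosen for illustration, not values from the Simulus paper.

```python
import math


def symlog(x):
    """sign(x) * ln(1 + |x|): compresses large magnitudes, near-identity around 0."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def quantize(x, low=-5.0, high=5.0, num_bins=11):
    """Map a scalar to the index of the nearest bin center in symlog space.

    The range and bin count are illustrative assumptions.
    """
    y = min(max(symlog(x), low), high)  # clip to the covered range
    step = (high - low) / (num_bins - 1)
    return round((y - low) / step)


def dequantize(token, low=-5.0, high=5.0, num_bins=11):
    """Map a token index back to the raw value of its bin center."""
    step = (high - low) / (num_bins - 1)
    return symexp(low + token * step)


print(quantize(0.0))  # 5 (the middle bin)
```

A continuous vector would be tokenized by applying `quantize` to each coordinate independently, giving one integer token per dimension under these assumptions.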
2.1 Inputs and Outputs
A multi-modal observation is
\[ o_t = \{o_t^{(i)}\}_{i=1}^{|\kappa|} \]
where \(\kappa\) is the set of modalities and \(o_t^{(i)}\) is the observation of the \(i\)-th modality.
\(V\) maps the raw input to a fixed-length sequence of integer tokens:
\[ o_t \mapsto z_t = \{z_t^{(i)}\}_{i=1}^{|\kappa|} \]
Actions are likewise tokenized with the tokenizer of the corresponding modality:
\[ a_t \mapsto z_t^a \]
2.2 Image Tokenizer: VQ-VAE
For image observations, Simulus uses a VQ-VAE tokenizer.
A CNN encoder outputs a continuous feature grid:
\[ h = \mathrm{enc}(o) \]
Each feature vector in the grid is then quantized to its nearest codebook entry:
\[ z = \arg\min_i \left\|h - E^{(i)}\right\| \]
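where \(E\) is the codebook / embedding table and, in this formula only, \(E^{(i)}\) denotes its \(i\)-th entry (elsewhere in these notes the superscript \((i)\) indexes modalities).
The decoder reconstructs the image from the token embeddings: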
\[ \hat o = \mathrm{dec}(z) \]
On Atari, the image resolution is
\[ 64 \times 64 \]
and the token grid is
\[ 8 \times 8 \]
so each frame becomes
\[ K = 64 \]
image tokens, with vocabulary size
\[ N = 512 \]
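The quantization step can be illustrated with a small nearest-neighbor lookup. This is only a sketch of the generic VQ-VAE assignment rule: the 4-entry, 2-dimensional codebook and the feature values are made up, and indices are 0-based here whereas the notes count tokens from 1.

```python
def quantize_vector(h, codebook):
    """Return the index of the codebook entry closest to h (squared L2 distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(h, codebook[i]))


def tokenize_grid(feature_grid, codebook):
    """Quantize every feature vector of a flattened feature grid independently."""
    return [quantize_vector(h, codebook) for h in feature_grid]


# Toy codebook with N = 4 entries of dimension 2 (Atari would use N = 512).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy flattened feature grid with 3 positions (Atari would have 8 x 8 = 64).
feature_grid = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]
tokens = tokenize_grid(feature_grid, codebook)
print(tokens)  # [0, 3, 2]
```

With the Atari settings quoted above, the same routine would turn a flattened \(8 \times 8\) feature grid into \(K = 64\) integers, each indexing one of \(N = 512\) codebook entries.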
2.3 Representation Loss
The objective of the image VQ-VAE can be written as:
\[ \mathcal{L}_V(\mathrm{enc}, \mathrm{dec}, E) = \left\|o - \mathrm{dec}(z)\right\|_2^2 + \left\|\mathrm{sg}(\mathrm{enc}(o)) - E(z)\right\|_2^2 + \left\|\mathrm{sg}(E(z)) - \mathrm{enc}(o)\right\|_2^2 + \mathcal{L}_{\mathrm{perceptual}}(o,\mathrm{dec}(z)) \]
Here
\[ \left\|o - \mathrm{dec}(z)\right\|_2^2 \]
is the pixel reconstruction loss;
\[ \left\|\mathrm{sg}(\mathrm{enc}(o)) - E(z)\right\|_2^2 \]
is the codebook loss;
\[ \left\|\mathrm{sg}(E(z)) - \mathrm{enc}(o)\right\|_2^2 \]
is the commitment-style loss; and
\[ \mathcal{L}_{\mathrm{perceptual}}(o,\mathrm{dec}(z)) \]
is the perceptual reconstruction loss.
\(\mathrm{sg}(\cdot)\) is the stop-gradient operator.
Importantly, this loss trains only the representation module \(V\). The world model \(M\) never sees pixels and does not optimize an image reconstruction loss; it learns token dynamics.
3. World Model \(M\)
3.1 Modeling Objective
Given the token trajectory
\[ \tau_t = z_1, z_1^a, \ldots, z_t, z_t^a \]
the world model \(M\) learns:
\[ p_\theta(\hat z_{t+1}\mid \tau_t) \]
\[ \hat r_t = \hat r_\theta(\tau_t) \]
\[ p_\theta(\hat d_t\mid \tau_t) \]
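That is, it predicts:
- the next-step observation tokens;
- the reward;
- the termination signal.
Note that \(M\) predicts
\[ \hat z_{t+1} \]
rather than
\[ \hat o_{t+1} \]
To obtain an image, the predicted tokens must additionally be passed through the image decoder of \(V\):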
\[ \hat z_{t+1} \xrightarrow{\mathrm{dec}} \hat o_{t+1} \]
3.2 Embedding: From Tokens to \(X_t\)
\(M\) first converts the token trajectory into a sequence of embeddings.
The paper defines
\[ X_t = (X_t^o, X_t^a) \]
where
\[ X_t^o = (X_t^{(1)}, \ldots, X_t^{(|\kappa|)}) \]
So the embedding block \(X_t\) of one time step has two parts:
- observation token embeddings;
- action token embeddings.
Each token \(z\) is looked up in a modality-specific embedding table to obtain a \(d\)-dimensional vector:
\[ x_{t,j}^{(i)} = E^{(i)}(l), \quad l = z_{t,j}^{(i)} \]
Here \(z_{t,j}^{(i)}\) is the \(j\)-th token of modality \(i\) at time step \(t\), \(l\) is the integer index of that token, and \(E^{(i)}(l)\) is the \(l\)-th row of the embedding table \(E^{(i)}\).
For the image modality, \(E^{(i)}\) is the codebook of the VQ-VAE tokenizer, learned by the representation module \(V\). In other words, the image tokenizer not only encodes images into discrete tokens but also supplies the embedding vectors of those tokens.
For modalities whose tokenizer has no codebook, the world model \(M\) learns a dedicated embedding table of its own, denoted
\[ E_M^{(i)} \]
In that case the token embedding is looked up in \(M\)'s own table:
\[ x_{t,j}^{(i)} = E_M^{(i)}(l), \quad l = z_{t,j}^{(i)} \]
The source of the embedding therefore depends on the modality: if the tokenizer comes with a codebook, that codebook is used; otherwise \(M\) learns a separate embedding table.
For Atari images, \(z_t\) has \(K=64\) tokens. The VQ-VAE encodes a \(64\times64\) frame into \(8\times8=64\) tokens, each an integer between 1 and 512; each integer token is then looked up in the VQ-VAE's 512-entry codebook to obtain a 256-dimensional vector. One frame therefore becomes
\[ X_t^o \in \mathbb{R}^{64 \times 256} \]
that is, an embedding sequence of length 64 whose elements are 256-dimensional.
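The shape bookkeeping above can be checked with a plain table lookup. In this sketch the table is filled with random numbers as a stand-in for the learned VQ-VAE codebook, `embed_tokens` is a hypothetical helper name, and token indices are 0-based (the notes use 1 to 512).

```python
import random


def embed_tokens(tokens, table):
    """Look up one embedding row per integer token (0-based indices)."""
    return [table[l] for l in tokens]


random.seed(0)
N, d, K = 512, 256, 64  # Atari values quoted in the text
# Random stand-in for the codebook E; the real one is learned by V.
table = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
tokens = [random.randrange(N) for _ in range(K)]

X_obs = embed_tokens(tokens, table)
print(len(X_obs), len(X_obs[0]))  # 64 256
```

The lookup is the same whether the table is the tokenizer's codebook or a table \(E_M^{(i)}\) learned by \(M\); only the owner of the parameters differs.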
3.3 Sequence Model: RetNet \(f_\theta\)
The backbone of \(M\) is a retentive network:
\[ f_\theta \]
Given the sequence of observation-action blocks
\[ X_1,\ldots,X_t \]
it updates its state recurrently:
\[ (S_t, Y_t) = f_\theta(S_{t-1}, X_t) \]
where
- \(S_t\) is the recurrent state, which summarizes the past token/action history;
- \(Y_t\) is the sequence output for the current block.
It can be thought of as:
\[ S_t \approx \text{summary}(X_{\le t}) \]
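To give a feel for how a retention state can summarize history, here is a heavily simplified single-head sketch of the recurrent form of retention from the RetNet paper, \(S_n = \gamma S_{n-1} + k_n^\top v_n\) and \(o_n = q_n S_n\), checked against the equivalent parallel sum. It works per token rather than per block, and it omits the learned projections, position rotation, normalization, gating, and multi-scale heads of the real model. The decay factor is called `decay` in the code to avoid confusion with the RL discount \(\gamma\), and all numbers are toy values.

```python
def retention_step(state, q, k, v, decay):
    """One recurrent step: S <- decay * S + k^T v, then output o = q S."""
    d_k, d_v = len(k), len(v)
    new_state = [[decay * state[a][b] + k[a] * v[b] for b in range(d_v)]
                 for a in range(d_k)]
    out = [sum(q[a] * new_state[a][b] for a in range(d_k)) for b in range(d_v)]
    return new_state, out


def retention_parallel(qs, ks, vs, decay):
    """Reference form: o_n = sum over m <= n of decay^(n-m) * (q_n . k_m) * v_m."""
    outs = []
    for n, q in enumerate(qs):
        o = [0.0] * len(vs[0])
        for m in range(n + 1):
            w = decay ** (n - m) * sum(a * b for a, b in zip(q, ks[m]))
            o = [x + w * y for x, y in zip(o, vs[m])]
        outs.append(o)
    return outs


qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, -1.0], [2.0, 0.0], [0.0, 3.0]]
decay = 0.9

state = [[0.0, 0.0], [0.0, 0.0]]  # empty history
recurrent_outs = []
for q, k, v in zip(qs, ks, vs):
    state, out = retention_step(state, q, k, v, decay)
    recurrent_outs.append(out)

parallel_outs = retention_parallel(qs, ks, vs, decay)
```

The two forms give the same outputs, which is why a fixed-size state can stand in for the whole prefix: the state after step \(n\) already contains every past key-value pair, weighted by its decayed age.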
3.4 Parallel Observation Prediction: Predicting the Next Frame's Tokens with \(X^u\)
Simulus does not generate the next observation autoregressively token by token. Instead, it uses a set of learned prediction embeddings
\[ X^u \in \mathbb{R}^{K \times d} \]
and, starting from the current recurrent state \(S_t\), calls \(f_\theta\) one more time:
\[ (\cdot, Y_{t+1}^u) = f_\theta(S_t, X^u) \]
Here
\[ Y_{t+1}^u = (y_1,\ldots,y_K) \]
and each \(y_i\) corresponds to the \(i\)-th token of the next observation.
The token prediction heads then produce
\[ p_\theta(\hat z_{t+1,i}\mid y_i) \]
for all \(i=1,\ldots,K\) in parallel.
For Atari, this means
\[ Y_{t+1}^u \in \mathbb{R}^{64 \times d} \]
followed by a classification at each of the 64 token positions, each a 512-way classification.
3.5 Prediction Heads
\(M\) has three kinds of heads.
Observation Token Heads
Each modality's tokens are predicted by classification:
\[ p_\theta(\hat z \mid y) \]
Each head is a single-hidden-layer MLP whose output dimension equals the vocabulary size of that modality's tokenizer.
For Atari images:
\[ \text{output dim} = 512 \]
Reward Head
The reward head predicts the reward from a summary vector \(y\) of \(Y_{t+1}^u\):
\[ \hat r_t = \hat r_\theta(y) \]
Termination Head
The termination head likewise predicts the termination distribution from \(y\):
\[ p_\theta(\hat d_t\mid y) \]
4. World Model Loss \(\mathcal{L}_M\)
First, define the observation token loss.
For one frame of observation tokens
\[ z_t = (z_{t,1},\ldots,z_{t,K}) \]
and the prediction outputs
\[ Y_t^u = (y_1,\ldots,y_K) \]
the observation loss is the average token cross-entropy:
\[ \mathcal{L}_{\mathrm{obs}} (\theta, z_t, p_\theta(\hat z_t\mid Y_t^u)) = -\frac{1}{K} \sum_{i=1}^{K} \log p_\theta(z_{t,i}\mid y_i) \]
This is the core dynamics loss of the Simulus world model.
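As a numeric illustration of \(\mathcal{L}_{\mathrm{obs}}\), the sketch below computes the average cross-entropy over \(K\) token positions from raw logits. The helper names and toy logits are assumptions; targets are 0-based here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def observation_loss(logits_per_position, target_tokens):
    """-(1/K) * sum_i log p(z_i | y_i) over the K token positions."""
    K = len(target_tokens)
    total = 0.0
    for logits, z in zip(logits_per_position, target_tokens):
        total -= log_softmax(logits)[z]
    return total / K


# Toy case: K = 3 positions, vocabulary size N = 4, uniform predictions.
uniform_logits = [[0.0] * 4 for _ in range(3)]
targets = [0, 2, 3]
loss = observation_loss(uniform_logits, targets)
print(abs(loss - math.log(4)) < 1e-12)  # True: uniform predictions cost log N
```

A uniform prediction over \(N\) tokens costs \(\log N\) per position, so for the Atari tokenizer an untrained head would start near \(\log 512 \approx 6.24\).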
The full objective of \(M\) can be written as:
\[ \mathcal{L}_M(\theta,\tau) = \sum_{t=1}^{H} \left[ \mathcal{L}_{\mathrm{obs}} (\theta, z_t, p_\theta(\hat z_t\mid Y_t^u)) + \mathcal{L}_{\mathrm{reward}} (\theta, r_t, \hat r_t) - \log p_\theta(d_t\mid Y_t^u) \right] \]
where
\[ \tau = z_1,z_1^a,\ldots,z_H,z_H^a \]
The terms are:
| Term | Meaning |
|---|---|
| \(\mathcal{L}_{\mathrm{obs}}\) | cross-entropy of the next-step observation tokens |
| \(\mathcal{L}_{\mathrm{reward}}\) | reward prediction loss |
| \(-\log p_\theta(d_t\mid Y_t^u)\) | termination negative log-likelihood |
Setting aside Simulus's scalar prediction trick for now,
\[ \mathcal{L}_{\mathrm{reward}} \]
can be read as the supervised prediction loss of the reward head. In the paper's actual implementation it is a classification-style scalar prediction loss rather than a plain MSE.
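For intuition about what a classification-style scalar loss can look like, one common construction (used, for example, in DreamerV3) is a two-hot target over a fixed set of bins, trained with cross-entropy. The sketch below shows that generic idea only; it is not taken from the Simulus paper, and the bin values and function names are assumptions.

```python
def two_hot(value, bins):
    """Spread a scalar over its two neighbouring bins with linear weights."""
    value = min(max(value, bins[0]), bins[-1])  # clip to the covered range
    target = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            target[i] = 1.0 - w_hi
            target[i + 1] = w_hi
            break
    return target


def expected_value(probs, bins):
    """Decode a distribution over bins back into a scalar."""
    return sum(p * b for p, b in zip(probs, bins))


bins = [-2.0, -1.0, 0.0, 1.0, 2.0]  # illustrative, assumed sorted
target = two_hot(0.25, bins)
print(target)                        # [0.0, 0.0, 0.75, 0.25, 0.0]
print(expected_value(target, bins))  # 0.25
```

The head then outputs one logit per bin and is trained with cross-entropy against `target`; the scalar prediction is recovered as the expected bin value.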
5. Controller \(C\)
5.1 Inputs and Outputs
The controller also operates in token space.
It processes the sequence step by step:
\[ z_1, z_1^a, z_2, z_2^a,\ldots \]
At each step \(t\), the controller receives the current observation tokens \(z_t\) and outputs a policy
\[ \pi(a_t\mid \tau_{\le t-1}, z_t) \]
together with a value estimate:
\[ \hat V^\pi(\tau_{\le t-1}, z_t) \]
5.2 Network Structure
The core of the controller is an LSTM.
Each modality first passes through a modality-specific encoder:
\[ z_t^{(i)} \mapsto x^{(i)} \]
The latents of the different modalities are then concatenated
\[ (x^{(1)},\ldots,x^{(|\kappa|)}) \]
and fused by an MLP \(g_\psi\):
\[ x_t = g_\psi(x^{(1)},\ldots,x^{(|\kappa|)}) \]
The fused vector is fed into the LSTM:
\[ h_t, c_t = \mathrm{LSTM}(x_t,h_{t-1},c_{t-1};\psi) \]
Finally, the actor and the critic are two linear heads:
\[ \pi(a_t\mid h_t) \]
\[ \hat V^\pi(h_t) \]
An action \(a_t\) is sampled from the policy; it is then embedded and fed to the controller as a subsequent sequence element.
5.3 Image Token Encoder for \(C\)
For Atari image tokens, the controller does not use the raw image.
It first maps the image tokens back to an embedding grid. The image observation encoder in Appendix A.5 takes an input of shape
\[ 256 \times 8 \times 8 \]
which then passes through a CNN and an MLP:
\[ 256 \times 8 \times 8 \rightarrow 128 \times 8 \times 8 \rightarrow 64 \times 8 \times 8 \rightarrow 4096 \rightarrow 512 \]
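In other words, the controller's image encoder is a shallow CNN followed by an MLP (4096 is the flattened \(64 \times 8 \times 8\) feature map); it does not reuse \(M\)'s RetNet.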
6. Controller Training Loss
The controller is trained inside the world model's imagination.
A short trajectory segment is first sampled from the replay buffer to initialize \(M\) and \(C\); a rollout of \(H\) steps then yields the imagined trajectory:
\[ \hat \tau = (z_1,a_1,\bar r_1,d_1,\hat z_2,a_2,\bar r_2,d_2,\ldots,\hat z_H,a_H,\bar r_H,d_H) \]
Leaving intrinsic rewards aside, \(\bar r_t\) can be read as the training reward supplied by the world model. In the most basic case:
\[ \bar r_t = \hat r_t \]
The \(\lambda\)-return is then computed as:
\[ G_t = \begin{cases} \bar r_t + \gamma(1-d_t) \left((1-\lambda)\hat V^\pi_{t+1} + \lambda G_{t+1}\right), & t < H, \\ \hat V^\pi_H, & t = H. \end{cases} \]
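The recursion is easiest to evaluate backward from \(t = H\). The following sketch implements exactly the case distinction above on plain lists; the function name and the toy numbers are assumptions, and list index 0 corresponds to \(t = 1\).

```python
def lambda_returns(rewards, dones, values, gamma, lam):
    """Backward recursion for G_t; all inputs have length H."""
    H = len(rewards)
    G = [0.0] * H
    G[H - 1] = values[H - 1]  # t = H: bootstrap from the critic, G_H = V_H
    for t in range(H - 2, -1, -1):
        bootstrap = (1.0 - lam) * values[t + 1] + lam * G[t + 1]
        G[t] = rewards[t] + gamma * (1.0 - dones[t]) * bootstrap
    return G


rewards = [1.0, 0.0, 2.0]
dones = [0, 0, 0]
values = [0.5, 1.0, 3.0]
G = lambda_returns(rewards, dones, values, gamma=0.9, lam=0.5)
print([round(g, 4) for g in G])  # [2.665, 2.7, 3.0]
```

Working through the toy numbers: \(G_3 = 3.0\), \(G_2 = 0 + 0.9\,(0.5 \cdot 3.0 + 0.5 \cdot 3.0) = 2.7\), and \(G_1 = 1 + 0.9\,(0.5 \cdot 1.0 + 0.5 \cdot 2.7) = 2.665\). A termination at step \(t\) zeroes the bootstrap term, so \(G_t\) reduces to \(\bar r_t\).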
The critic uses \(G_t\) as its target. Abstractly, the critic loss can be written as:
\[ \mathcal{L}_{\mathrm{critic}}(\psi) = \mathcal{L}_{\mathrm{value}} (\hat V^\pi_t, G_t) \]
The paper's actual implementation again uses a classification-style scalar loss; setting that trick aside, it can be read as a supervised value prediction loss.
The actor uses a REINFORCE-style objective:
\[ \mathcal{J}_\pi(\psi) = \mathbb{E}_{\pi} \left[ \sum_{t=1}^{H} \mathrm{sg} \left( \frac{G_t - \hat V^\pi_t}{\max(1,c)} \right) \log \pi(a_t\mid \hat \tau_{\le t-1},\hat z_t) + w_{\mathrm{ent}} \mathcal{H}(\pi(a_t\mid \hat \tau_{\le t-1},\hat z_t)) \right] \]
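where
- \(G_t\) is the \(\lambda\)-return;
- \(\hat V^\pi_t\) is the baseline;
- \(c\) is the return scale used for normalization;
- \(w_{\mathrm{ent}}\) is the entropy regularization weight;
- \(\mathrm{sg}\) means the normalized advantage is treated as a constant, so the actor objective sends no gradient into the returns or the value estimates.
Strictly speaking, when written as a loss for gradient descent, the policy objective is usually negated: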
\[ \mathcal{L}_{\pi}(\psi) = -\mathcal{J}_{\pi}(\psi) \]
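A minimal numeric sketch of this actor loss for a single imagined rollout follows. It uses plain floats, so the stop-gradient is implicit (nothing is differentiated here); in an autodiff framework the advantage would have to be explicitly detached. The function names and toy numbers are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss(action_probs, actions, returns, values, scale, w_ent):
    """Negative REINFORCE-style objective, summed over the rollout steps."""
    objective = 0.0
    for probs, a, G, V in zip(action_probs, actions, returns, values):
        advantage = (G - V) / max(1.0, scale)  # a constant w.r.t. the policy (sg)
        objective += advantage * math.log(probs[a]) + w_ent * entropy(probs)
    return -objective


# One step, two actions, uniform policy, advantage (2.0 - 1.0) / max(1, 0.5) = 1.0.
loss = actor_loss(
    action_probs=[[0.5, 0.5]],
    actions=[0],
    returns=[2.0],
    values=[1.0],
    scale=0.5,
    w_ent=0.0,
)
print(abs(loss - math.log(2)) < 1e-12)  # True
```

Note the effect of \(\max(1, c)\): a return scale below 1 leaves the advantage untouched, so small returns are not amplified, while a scale above 1 shrinks it.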
7. Training Procedure: Three Modules Trained Separately
The Simulus training loop can be summarized as:
\[ \text{data collection} \rightarrow \text{representation learning } V \rightarrow \text{world model learning } M \rightarrow \text{control learning } C \]
7.1 Step 1: Data Collection
The current controller \(C\) interacts with the real environment and collects transitions
\[ (o_t,a_t,r_t,d_t) \]
which are stored in the replay buffer.
7.2 Step 2: Train \(V\)
Raw observations are sampled from the replay buffer to train the tokenizer:
\[ o_t \rightarrow z_t \rightarrow \hat o_t \]
by optimizing
\[ \mathcal{L}_V \]
For images, this is the VQ-VAE reconstruction / codebook / commitment / perceptual loss.
Once trained, \(V\) can convert the observations and actions in the replay buffer into token trajectories.
7.3 Step 3: Train \(M\)
A trajectory segment is sampled from the replay buffer:
\[ (o_1,a_1,r_1,d_1,\ldots,o_H,a_H,r_H,d_H) \]
It is first tokenized with \(V\):
\[ (o_t,a_t) \mapsto (z_t,z_t^a) \]
which gives
\[ \tau = z_1,z_1^a,\ldots,z_H,z_H^a \]
\(M\) is then trained to predict, at every step,
\[ z_t,\quad r_t,\quad d_t \]
by optimizing:
\[ \mathcal{L}_M = \sum_{t=1}^{H} \left[ \mathcal{L}_{\mathrm{obs}} + \mathcal{L}_{\mathrm{reward}} - \log p_\theta(d_t\mid Y_t^u) \right] \]
7.4 Step 4: Train \(C\) in Imagination
A short context is sampled from the replay buffer to initialize \(M\) and \(C\).
The imagined rollout then repeats:
\[ C(z_t) \rightarrow a_t \]
\[ a_t \rightarrow z_t^a \]
\[ M(\tau_t) \rightarrow \hat z_{t+1},\hat r_t,\hat d_t \]
\[ \hat z_{t+1} \rightarrow C(\hat z_{t+1}) \]
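After rolling out \(H\) steps, the imagined rewards are used to compute
\[ G_t \]
The critic is trained toward this target:
\[ \hat V^\pi_t \rightarrow G_t \]
The actor is trained through the log-probabilities of its sampled actions: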
\[ \log \pi(a_t\mid \hat \tau_{\le t-1},\hat z_t) \]
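The control flow of Step 4 can be sketched with stand-in components. `ToyWorldModel` and `ToyController` below are hypothetical stubs that produce random outputs; unlike the real \(M\) and \(C\), they keep no recurrent state and learn nothing. Only the loop structure (act, step the model, record, stop on termination) reflects the procedure described above.

```python
import random


class ToyWorldModel:
    """Hypothetical stand-in for M: random next tokens and reward, never terminates."""

    def __init__(self, num_tokens, vocab_size, rng):
        self.num_tokens, self.vocab_size, self.rng = num_tokens, vocab_size, rng

    def step(self, obs_tokens, action):
        next_tokens = [self.rng.randrange(self.vocab_size)
                       for _ in range(self.num_tokens)]
        return next_tokens, self.rng.random(), 0  # (z_hat, r_hat, d_hat)


class ToyController:
    """Hypothetical stand-in for C: uniform random policy, zero value estimate."""

    def __init__(self, num_actions, rng):
        self.num_actions, self.rng = num_actions, rng

    def act(self, obs_tokens):
        return self.rng.randrange(self.num_actions), 0.0  # (action, value)


def imagine(world_model, controller, start_tokens, horizon):
    """Roll out up to `horizon` imagined steps from real context tokens."""
    trajectory, z = [], start_tokens
    for _ in range(horizon):
        action, value = controller.act(z)
        next_z, reward, done = world_model.step(z, action)
        trajectory.append({"tokens": z, "action": action, "reward": reward,
                           "done": done, "value": value})
        if done:
            break
        z = next_z  # the predicted tokens become the next controller input
    return trajectory


rng = random.Random(0)
rollout = imagine(ToyWorldModel(64, 512, rng), ToyController(18, rng),
                  start_tokens=[0] * 64, horizon=10)
print(len(rollout))  # 10
```

The recorded rewards, termination flags, and value estimates are the inputs needed to compute \(G_t\) and the critic and actor losses from Section 6.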
8. Summary of the Core Mechanism
The core Simulus model can be compressed into the following chain:
\[ o_t \xrightarrow{V} z_t \xrightarrow[\text{with } z_t^a]{M} p_\theta(\hat z_{t+1}\mid \tau_t),\hat r_t,\hat d_t \xrightarrow{\text{imagination}} C \]
where
- \(V\) learns discrete tokens with a VQ-VAE-style reconstruction objective;
- \(M\) learns token dynamics with token cross-entropy, a reward loss, and a termination NLL;
- \(C\) learns its policy from trajectories imagined by \(M\), using \(\lambda\)-returns, a critic loss, and a REINFORCE-style actor objective.
In one sentence:
Simulus first tokenizes pixel / vector observations, then learns token dynamics with a RetNet, and finally trains an actor-critic controller inside the learned token world model.
Appendix: Notation
| Symbol | Meaning | Type / Shape |
|---|---|---|
| \(t\) | time step | \(t=1,\ldots,T\) |
| \(H\) | training / imagination horizon | Atari: \(H=10\); DMC/Craftax: \(H=20\) |
| \(\kappa\) | set of observation modalities | \(\kappa=\{\kappa_1,\ldots,\kappa_{\lvert\kappa\rvert}\}\) |
| \(i\) | modality index | \(i=1,\ldots,\lvert\kappa\rvert\) |
| \(j\) | token index | \(j=1,\ldots,K_i\) |
| \(K_i\) | number of tokens of modality \(i\) | Atari image: \(K_i=64\) |
| \(K\) | total number of tokens per observation | \(K=\sum_i K_i\); Atari: \(K=64\) |
| \(K_a\) | action token 数 | Atari discrete action: \(K_a=1\) |
| \(N\) | tokenizer vocabulary size | Atari image: \(N=512\) |
| \(d\) | world model token embedding dimension | Atari: \(d=256\) |
| \(d_C\) | controller hidden / latent dimension | implementation-dependent |
| \(V\) | representation module | tokenizer-detokenizer system |
| \(M\) | world model | RetNet \(f_\theta\) + prediction heads |
| \(C\) | controller | LSTM + actor head + critic head |
| \(\theta\) | world model parameters | parameters of \(M\) |
| \(\psi\) | controller parameters | parameters of \(C\) |
| \(o_t\) | raw observation at time \(t\) | multi-modal observation |
| \(o_t^{(i)}\) | raw observation of modality \(i\) | image: \(\mathbb{R}^{3\times64\times64}\) |
| \(a_t\) | raw action | Atari: scalar discrete action |
| \(r_t\) | reward | scalar |
| \(d_t\) | termination signal | \(d_t\in\{0,1\}\) |
| \(\mathrm{enc}_i\) | encoder/tokenizer of modality \(i\) | \(o_t^{(i)}\mapsto z_t^{(i)}\) |
| \(\mathrm{dec}_i\) | decoder/detokenizer of modality \(i\) | \(z_t^{(i)}\mapsto \hat o_t^{(i)}\) |
| \(z_t^{(i)}\) | observation tokens of modality \(i\) | \(z_t^{(i)}\in\{1,\ldots,N_i\}^{K_i}\) |
| \(z_t\) | full observation token sequence | \(z_t=\{z_t^{(i)}\}_{i=1}^{\lvert\kappa\rvert}\) |
| \(z_{t,j}^{(i)}\) | \(j\)-th token of modality \(i\) | \(z_{t,j}^{(i)}\in\{1,\ldots,N_i\}\) |
| \(z_t^a\) | action token | \(z_t^a\in\{1,\ldots,N_a\}^{K_a}\) |
| \(\hat z_{t+1}\) | next-step observation tokens predicted/sampled by the world model | token sequence |
| \(\hat o_t\) | observation reconstructed by the decoder | raw observation |
| \(\hat r_t\) | world model predicted reward | scalar |
| \(\hat d_t\) | world model predicted termination | Bernoulli / binary |
| \(\tau_t\) | token trajectory up to time step \(t\) | \(\tau_t=z_1,z_1^a,\ldots,z_t,z_t^a\) |
| \(\tau\) | training trajectory segment of length \(H\) | \(\tau=z_1,z_1^a,\ldots,z_H,z_H^a\) |
| \(\hat\tau\) | imagined trajectory | generated token/action/reward/done sequence |
| \(E^{(i)}\) | embedding table of modality \(i\) | \(E^{(i)}\in\mathbb{R}^{N_i\times d}\) |
| \(E_M^{(i)}\) | embedding table learned by \(M\) itself | \(E_M^{(i)}\in\mathbb{R}^{N_i\times d}\) |
| \(E_C^{(i)}\) | embedding table learned by \(C\) itself | implementation-dependent |
| \(E\) | image VQ-VAE codebook | Atari: \(E\in\mathbb{R}^{512\times256}\) |
| \(l\) | token integer index | \(l=z_{t,j}^{(i)}\) |
| \(x_{t,j}^{(i)}\) | token embedding vector | \(x_{t,j}^{(i)}=E^{(i)}(l)\in\mathbb{R}^d\) |
| \(X_t^{(i)}\) | token embedding sequence of modality \(i\) | \(X_t^{(i)}\in\mathbb{R}^{K_i\times d}\) |
| \(X_t^o\) | observation block embedding | \(X_t^o=(X_t^{(1)},\ldots,X_t^{(\lvert\kappa\rvert)})\) |
| \(X_t^a\) | action block embedding | \(X_t^a\in\mathbb{R}^{K_a\times d}\) |
| \(X_t\) | observation-action block embedding | \(X_t=(X_t^o,X_t^a)\in\mathbb{R}^{(K+K_a)\times d}\) |
| \(f_\theta\) | RetNet sequence model | \((S_{t-1},X_t)\mapsto(S_t,Y_t)\) |
| \(S_t\) | RetNet recurrent state | summary of \(X_{\le t}\) |
| \(Y_t\) | RetNet output for current block | roughly \(\mathbb{R}^{(K+K_a)\times d}\) |
| \(X^u\) | learned prediction embeddings for POP | \(X^u\in\mathbb{R}^{K\times d}\) |
| \(Y_{t+1}^u\) | POP output for next observation prediction | \(Y_{t+1}^u=(y_1,\ldots,y_K)\in\mathbb{R}^{K\times d}\) |
| \(y_i\) | POP output vector for token position \(i\) | \(y_i\in\mathbb{R}^d\) |
| \(p_\theta(\hat z_{t+1}\mid\tau_t)\) | predicted distribution over next observation tokens | product of categorical distributions |
| \(p_\theta(\hat d_t\mid\tau_t)\) | predicted termination distribution | Bernoulli distribution |
| \(\hat r_\theta(\tau_t)\) | reward predictor | scalar predictor |
| \(\mathcal{L}_V\) | representation/tokenizer training loss | VQ-VAE reconstruction + codebook + commitment + perceptual |
| \(\mathcal{L}_{\mathrm{obs}}\) | observation token prediction loss | average token cross entropy |
| \(\mathcal{L}_{\mathrm{reward}}\) | reward prediction loss | scalar classification-style loss |
| \(\mathcal{L}_M\) | world model loss | summed over \(t=1,\ldots,H\) |
| \(\mathcal{L}_{\mathrm{critic}}\) | critic/value loss | target is \(\lambda\)-return \(G_t\) |
| \(\mathcal{J}_\pi\) | actor/policy objective | REINFORCE-style objective |
| \(\mathrm{sg}(\cdot)\) | stop-gradient operator | forward identity, backward zero |
| \(g_\psi\) | controller modality-fusion MLP | fuses modality latents |
| \(x_t\) | controller fused input vector | \(x_t\in\mathbb{R}^{d_C}\) |
| \(h_t\) | LSTM hidden state | \(h_t\in\mathbb{R}^{d_C}\) |
| \(c_t\) | LSTM cell state | \(c_t\in\mathbb{R}^{d_C}\) |
| \(\pi(a_t\mid h_t)\) | actor policy | Atari: categorical over actions |
| \(\hat V^\pi(h_t)\) | critic value estimate | scalar |
| \(\bar r_t\) | reward used for imagination training | basic case: \(\bar r_t=\hat r_t\) |
| \(G_t\) | \(\lambda\)-return target | recursively defined over imagined horizon |
| \(\gamma\) | discount factor | \([0,1]\) |
| \(\lambda\) | \(\lambda\)-return mixing coefficient | \([0,1]\) |
| \(c\) | return scale normalization factor | scalar |
| \(w_{\mathrm{ent}}\) | entropy regularization weight | scalar |
| \(\mathcal{H}(\pi)\) | policy entropy | entropy of action distribution |