Compression as Intelligence

Source:

  1. Presentation from Jack Rae
  2. Explanation from Zhihu

Notation

  • \(D = (x_1, x_2, \ldots, x_n)\) (dataset): sequence of tokens Alice wants to transmit to Bob
  • \(x_t \in \mathcal{X}\) (token): the token at position \(t\) in the dataset
  • \(\mathcal{X}\) (vocabulary set): set of all possible tokens
  • \(m = |\mathcal{X}|\) (vocabulary size): number of distinct tokens in the vocabulary
  • \(n \in \mathbb{N}\) (dataset length): number of tokens in the dataset \(D\)
  • \(S \in \mathbb{N}\) (bit length): total number of bits Alice must transmit so Bob can reconstruct \(D\)
  • \(S_0 \in \mathbb{N}\) (bit length): total bits using uniform coding, \(S_0 = |f| + n \log m\)
  • \(S_k \in \mathbb{N}\) (bit length): total bits using the model at step \(k\), \(S_k = |f| + \sum_{t=1}^n -\log p^{(k)}(x_t \mid x_{1:t-1})\)
  • \(f\) (code / overhead): fixed code shared between Alice and Bob (model structure, initialization, seed, sampling procedure)
  • \(|f|\) (bit length): size of the overhead code \(f\), independent of the dataset \(D\)
  • \(z_t \in \{0,1\}^*\) (binary string): encoded bits Alice sends to represent token \(x_t\) under the shared model
  • \(|z_t|\) (bit length): length of code \(z_t\), approximately \(-\log p^{(k)}(x_t \mid x_{1:t-1})\)
  • \(p^{(k)}(x_t \mid x_{1:t-1})\) (probability): model distribution at step \(k\), conditioned on previous tokens
  • \(\theta\) (parameters): all parameters, including trainable model weights and fixed overhead (seeds, code, etc.)
  • \(-\log p^{(k)}(x_t \mid x_{1:t-1})\) (NLL term): negative log-likelihood contribution of token \(x_t\) at step \(k\)
  • \(\sum_{t=1}^n -\log p^{(k)}(x_t \mid x_{1:t-1})\) (NLL sum): total negative log-likelihood of dataset \(D\) under the model at step \(k\)
  • \(r_k\) (compression rate): fraction of bits saved relative to the uniform baseline, \(r_k = 1 - S_k / S_0\)

1. Compression as Intelligence (Mathematical Note)

Imagine Alice holds a dataset and wants to send it to Bob across a digital channel.

  • Dataset:

\[ D=\left(x_1, x_2, \ldots, x_n\right), \quad x_t \in \mathcal{X}, \quad|\mathcal{X}|=m . \]

Here, \(\mathcal{X}\) is the vocabulary (the set of all possible tokens), and \(n\) is the dataset length.

  • Task:

Alice wants Bob to reconstruct the dataset \(D\) exactly (losslessly). The central question is: how many bits must Alice transmit?

  • Definition:

We denote \(S\) as the total number of bits Alice must transmit so that Bob can reconstruct \(D\).

Depending on the coding scheme Alice and Bob agree on, the value of \(S\) will differ. Naturally, Alice wishes to minimize \(S\): the fewer bits she sends, the "smarter" the coding scheme.

2. Baseline: Uniform Coding

If Alice and Bob have no prior knowledge of the data distribution, the only option is to assume a uniform distribution:

\[ p(x)=\frac{1}{m} . \]

Then the number of bits to encode one token (with \(\log\) denoting \(\log_2\) throughout, so costs are measured in bits) is

\[ \left|z_t\right|=-\log \frac{1}{m}=\log m . \]

Thus, the total cost is

\[ S_0=|f|+n \log m, \]

where \(|f|\) denotes a fixed overhead (e.g., model code, sampling procedure, random seed). Importantly, \(|f|\) does not depend on \(D\) and can be treated as a constant.
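As a sanity check, the baseline cost is easy to compute directly. A minimal sketch; the vocabulary size, dataset length, and overhead value below are invented for illustration:

```python
import math

# Toy setup: vocabulary of m = 256 tokens, dataset of n = 10_000 tokens.
m = 256
n = 10_000
overhead = 1_000  # |f|: fixed bits for the shared code (illustrative value)

# Uniform coding: every token costs log2(m) bits regardless of content.
S0 = overhead + n * math.log2(m)
print(S0)  # 1000 + 10000 * 8 = 81000.0
```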

3. Introducing a Probabilistic Model

Alice can improve efficiency as follows:

  1. Transmit code \(f\):

This code specifies:

  • the structure of a probabilistic model,
  • parameter initialization,
  • the training algorithm,
  • and the sampling procedure.

Once Bob receives \(f\) and uses the same random seed, both parties obtain identical models.

  2. Model at training step \(k\):

At any training step \(k\), the shared model defines the distribution

\[ p^{(k)}\left(x_t \mid x_{1: t-1}\right) . \]

  3. Encoding tokens:

To transmit token \(x_t\), Alice does not send it directly. Instead, she sends a binary string \(z_t\), which serves as the input to the shared sampling procedure.

With arithmetic coding (or, equivalently, binary search coding), the code length is

\[ \left|z_t\right| \approx-\log p^{(k)}\left(x_t \mid x_{1: t-1}\right) . \]

  4. Decoding:

Bob receives \(z_t\), feeds it into the identical sampling procedure, and recovers \(x_t\).
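The encode/decode round trip can be sketched with a toy "binary search" coder: Alice finds the shortest binary string whose dyadic interval fits entirely inside the target token's probability interval, and Bob recovers the token by treating that string as a point in \([0,1)\). This is a simplification of arithmetic coding, not the exact procedure from the presentation, and the token probabilities are invented:

```python
import math

def encode(probs, idx):
    """Shortest binary string whose dyadic interval lies entirely inside
    the probability interval of token idx (toy binary search coding)."""
    lo = sum(probs[:idx])          # left edge of the token's interval
    hi = lo + probs[idx]           # right edge
    L = 1
    while True:
        k = math.ceil(lo * 2**L)   # smallest dyadic left edge >= lo
        if (k + 1) / 2**L <= hi:   # candidate interval fits inside [lo, hi)
            return format(k, f"0{L}b")
        L += 1                     # need a finer (longer) code

def decode(probs, z):
    """Interpret z as a point in [0, 1) and return the matching token."""
    v = int(z, 2) / 2**len(z)
    cum = 0.0
    for i, p in enumerate(probs):
        if cum <= v < cum + p:
            return i
        cum += p

probs = [0.5, 0.25, 0.25]     # model distribution p(x_t | x_{1:t-1})
z = encode(probs, 1)          # token 1 has probability 0.25
print(z, decode(probs, z))    # prints: 10 1  (2 bits = -log2 0.25)
```

The achieved length can exceed \(-\log p\) by a bit or two, which matches the \(\approx\) in the formula above.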

4. Total Transmission Cost

At training step \(k\), the total transmission cost for dataset \(D\) is

\[ S_k=|f|+\sum_{t=1}^n\left|z_t\right|=|f|+\sum_{t=1}^n-\log p^{(k)}\left(x_t \mid x_{1: t-1}\right) . \]

Here:

  • \(|f|\) is constant;
  • the second term is the negative log-likelihood (NLL) of the dataset under the model \(p^{(k)}\).

Since \(|f|\) is constant, the dominant factor in \(S_k\) is the NLL.

The better the model predicts the data, the lower the NLL, and hence the lower the transmission cost.
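Concretely, \(S_k\) can be tallied from the probabilities the model assigns to the observed tokens. A sketch with invented values (base-2 logs, so the sum is in bits):

```python
import math

# Hypothetical probabilities the shared model assigns to the actual
# next tokens of D at training step k (values invented for illustration).
probs = [0.9, 0.6, 0.8, 0.05, 0.7]
overhead = 1_000  # |f|: fixed cost of transmitting the shared code

# Total cost = |f| + negative log-likelihood of D, measured in bits.
nll_bits = sum(-math.log2(p) for p in probs)
S_k = overhead + nll_bits

# A confident correct prediction (0.9) costs ~0.15 bits, while the
# poorly predicted token (0.05) alone costs ~4.3 bits.
```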

5. Compression Rate

We define the compression rate relative to the uniform baseline:

\[ r_k=1-\frac{S_k}{S_0}, \]

so \(r_k \in[0,1]\) as long as the model codes the data no worse than the uniform baseline (a worse-than-uniform model would make \(r_k\) negative).

  • If \(p^{(k)}\) is uniform:

\[ S_k=S_0, \quad r_k=0 . \]

  • If \(p^{(k)}\) learns meaningful structure from the data:

\[ S_k<S_0, \quad r_k>0 . \]

Thus, a more predictive model corresponds to a higher compression rate.
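A toy comparison makes the rate concrete. The overhead \(|f|\) is omitted here since it is the same constant in both \(S_0\) and \(S_k\), and the numbers are invented:

```python
import math

# Toy setup: m = 4 tokens, n = 8 tokens.
m, n = 4, 8
S0 = n * math.log2(m)  # uniform baseline: 8 * 2 = 16 bits

# A model assigning probability 0.5 to each observed token (better than
# the uniform 0.25) costs 1 bit per token instead of 2.
Sk = sum(-math.log2(0.5) for _ in range(n))  # 8 bits

r_k = 1 - Sk / S0
print(r_k)  # 0.5: half the bits saved relative to the baseline
```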

6. Connection to Language Model Training

Observe that

\[ \frac{1}{n} \sum_{t=1}^n-\log p^{(k)}\left(x_t \mid x_{1: t-1}\right) \]

is exactly the cross-entropy loss used to train language models (e.g., GPT) with next-token prediction.

Therefore:

\[ \min _\theta \frac{1}{n} \sum_{t=1}^n-\log p_\theta\left(x_t \mid x_{1: t-1}\right) \Longleftrightarrow \min _\theta S(\theta) . \]

Here \(\theta\) collects every parameter: trainable ones, such as neural-network weights, and non-trainable ones, such as random seeds and training code.
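Since training losses are usually reported in nats (natural log) while code lengths are in bits, the correspondence is just a change of base. A sketch with an invented loss value:

```python
import math

# Hypothetical average cross-entropy per token from training, in nats.
loss_nats = 2.0

# The same quantity as a code length: bits = nats / ln(2).
bits_per_token = loss_nats / math.log(2)

n = 10_000
# Up to the constant |f|, this is the S(theta) that training minimizes.
payload_bits = n * bits_per_token
```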

Summary

In a nutshell, training a language model is equivalent to minimizing the number of bits Alice must transmit to Bob so that Bob can reconstruct the dataset, i.e., compressing the dataset.

In this sense, compression is a way to achieve intelligence.

However, this is just a viewpoint and a source of inspiration. It does not directly explain any of the important open problems, such as why \(-\log p\left(x_t \mid x_{1: t-1}\right)\) can be minimized so well in practice. It is also somewhat circular: the NLL loss is itself a cross-entropy, so training, i.e., minimizing that entropy, is by construction a form of compression.