Data Compression
Ref:
- EE 376A: Information Theory, Lecture 6, Winter 2018, Stanford University.
- EE 376A: Information Theory, Lecture 4, Winter 2017, Stanford University.
- Cover, T. M., and Thomas, J. A., Elements of Information Theory.
Notations
- rv: random variable.
- Let $x^n$ denote the sequence $(x_1, x_2, \ldots, x_n)$.
Codes
Definition: A source code $C$ for a random variable $X$ is a mapping from $\mathcal{X}$, the range of $X$, to $\mathcal{D}^*$, the set of finite-length strings of symbols from a $D$-ary alphabet $\mathcal{D}$. Let $C(x)$ denote the codeword corresponding to $x$ and let $l(x)$ denote the length of $C(x)$.
- That is, a code $C$ of rv $X$ is a function from $\mathcal{X}$ to $\mathcal{D}^*$.
- For example, $C(\text{red}) = 00$, $C(\text{blue}) = 11$ is a source code for $\mathcal{X} = \{\text{red}, \text{blue}\}$ with alphabet $\mathcal{D} = \{0, 1\}$.
Definition: The expected length $L(C)$ of a source code $C$ for a random variable $X$ with probability mass function $p(x)$ is given by $L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x)$, where $l(x)$ is the length of the codeword associated with $x$.

Definition: The extension $C^*$ of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$, defined by $C^*(x_1 x_2 \cdots x_n) = C(x_1) C(x_2) \cdots C(x_n)$, where $C(x_1) C(x_2) \cdots C(x_n)$ indicates concatenation of the corresponding codewords.
- Example: If $C(x_1) = 00$ and $C(x_2) = 11$, then $C^*(x_1 x_2) = 0011$.
- Note: For convenience, we'll use the notation $C$ instead of $C^*$.
Example
Without loss of generality, we can assume that the $D$-ary alphabet is $\mathcal{D} = \{0, 1, \ldots, D-1\}$.

Let $X$ be a random variable with the following distribution and codeword assignment (Cover & Thomas, Example 5.1.1):

| $x$ | $p(x)$ | $C(x)$ |
|---|---|---|
| 1 | 1/2 | 0 |
| 2 | 1/4 | 10 |
| 3 | 1/8 | 110 |
| 4 | 1/8 | 111 |

The entropy is $H(X) = 1.75$ bits.

The expected length is $L(C) = E[l(X)] = 1.75$ bits.

Here we have a code that has the same average length as the entropy. We note that any sequence of bits can be uniquely decoded back into a sequence of symbols of $X$.
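As a quick numerical check, the two quantities above can be recomputed directly from the distribution and the codeword lengths in the table (a minimal Python sketch):

```python
import math

p = [1/2, 1/4, 1/8, 1/8]      # p(x) from the table above
lengths = [1, 2, 3, 3]        # l(x) = |C(x)| for the codewords 0, 10, 110, 111

H = -sum(px * math.log2(px) for px in p)          # entropy H(X) in bits
L = sum(px * lx for px, lx in zip(p, lengths))    # expected length L(C)
print(H, L)                                       # both print 1.75
```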
Classes of Codes

Nonsingular Codes
Definition: A code is said to be nonsingular if it is an injection; i.e., every element of the range of $X$ maps into a different string in $\mathcal{D}^*$: $x \neq x' \Rightarrow C(x) \neq C(x')$.

Nonsingularity suffices for an unambiguous description of a single value of $X$.

But we usually wish to send a sequence of values of $X$. In such cases we could ensure decodability by adding a special symbol (a "comma") between any two codewords.
But this is an inefficient use of the special symbol; we can do better by developing the idea of self-punctuating or instantaneous codes.
Uniquely Decodable
Definition: A code is called uniquely decodable if its extension is an injection (nonsingular); i.e., every string in the range of $C^*$ corresponds to at most one source sequence.
However, one may have to look at the entire string to determine even the first symbol in the corresponding source string.
Prefix Codes
Definition: A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.
- A prefix code is uniquely decodable and self-punctuating (i.e., each codeword can be decoded as soon as it is received, without needing the entire binary sequence before decoding starts).
Example
To illustrate the differences between the various kinds of codes, consider the examples of codeword assignments in the table below (Table 5.1 of Elements of Information Theory).

$X$ | Singular | Nonsingular, But Not Uniquely Decodable | Uniquely Decodable But Not Instantaneous | Instantaneous |
---|---|---|---|---|
1 | 0 | 0 | 10 | 0 |
2 | 0 | 010 | 00 | 10 |
3 | 0 | 01 | 11 | 110 |
4 | 0 | 10 | 110 | 111 |
For the nonsingular code, the code string 010 has three possible source sequences:
- 2
- 14
- 31
and hence the code is not uniquely decodable.
For the code that is uniquely decodable but not prefix-free:
- If the first two bits are 00 or 10, they can be decoded immediately.
- If the first two bits are 11, we must look at the following bits.
  - If the next bit is a 1, the first source symbol is a 3.
  - If the length of the string of 0's immediately following the 11 is odd, the first codeword must be 110 and the first source symbol must be 4.
  - If the length of the string of 0's is even, the first source symbol is a 3.
By repeating this argument, we can see that this code is uniquely decodable. Sardinas and Patterson [455] have devised a finite test for unique decodability, which involves forming sets of possible suffixes to the codewords and eliminating them systematically.
The fact that the last code in Table 5.1 is instantaneous is obvious since no codeword is a prefix of any other.
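The Sardinas-Patterson test mentioned above can be sketched in a few lines of Python. This is an illustrative implementation of the dangling-suffix idea, not code from the references, and the function name is made up:

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test (sketch): the code is uniquely decodable iff
    no dangling suffix generated below is itself a codeword."""
    code = set(codewords)
    if len(code) < len(codewords):
        return False  # repeated codewords: the code is singular

    def dangling(a, b):
        # suffix left over when a is a *proper* prefix of b, else None
        return b[len(a):] if b.startswith(a) and a != b else None

    # initial dangling suffixes: compare every pair of distinct codewords
    suffixes = set()
    for a in code:
        for b in code:
            d = dangling(a, b)
            if d is not None:
                suffixes.add(d)

    seen = set()
    while suffixes - seen:
        s = (suffixes - seen).pop()
        seen.add(s)
        if s in code:
            return False          # a dangling suffix equals a codeword
        for c in code:            # compare the suffix against every codeword
            for d in (dangling(s, c), dangling(c, s)):
                if d is not None:
                    suffixes.add(d)
    return True

# The codes from the table above:
print(is_uniquely_decodable(["0", "010", "01", "10"]))   # False (nonsingular, not UD)
print(is_uniquely_decodable(["10", "00", "11", "110"]))  # True  (UD, not instantaneous)
print(is_uniquely_decodable(["0", "10", "110", "111"]))  # True  (instantaneous)
```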
Kraft-McMillan Inequality
Theorem (Kraft-McMillan Inequality): For any $D$-ary uniquely decodable (UD) code with codeword lengths $l(x)$,

$$\sum_{x \in \mathcal{X}} D^{-l(x)} \leq 1 .$$

Converse: any integer-valued length function satisfying this inequality is the length function of some UD code (in fact, of some prefix code).

Note: since every prefix code is uniquely decodable, this inequality also holds for prefix codes.

Sometimes this inequality is written in terms of the lengths $l_1, l_2, \ldots, l_m$ of the $m$ codewords as

$$\sum_{i=1}^{m} D^{-l_i} \leq 1 .$$
Proof
Take any uniquely decodable code and consider the $k$-th power of its Kraft sum:

$$\Big( \sum_{x \in \mathcal{X}} D^{-l(x)} \Big)^k = \sum_{x_1 \in \mathcal{X}} \sum_{x_2 \in \mathcal{X}} \cdots \sum_{x_k \in \mathcal{X}} D^{-l(x_1)} D^{-l(x_2)} \cdots D^{-l(x_k)} = \sum_{x^k \in \mathcal{X}^k} D^{-l(x^k)} = \sum_{m=1}^{k\, l_{\max}} a(m)\, D^{-m}$$

Explanation:
- The transition from the 1st to the 2nd expression comes from expanding the $k$-th power and assigning indices $1, 2, \ldots, k$ to the factors.
- $l(x^k) = \sum_{i=1}^{k} l(x_i)$ is the length of the codeword sequence for $x^k = (x_1, \ldots, x_k)$, and $l_{\max}$ is the maximum codeword length.
- $a(m)$ is the number of source sequences $x^k$ mapping into codeword sequences of length $m$.
Since the code is uniquely decodable, the mapping from source sequences to codeword sequences is one-to-one, so:
- there is at most one source symbol sequence mapping into each $m$-length codeword sequence, and
- there are at most $D^m$ codeword sequences of length $m$.

Thus, $a(m) \leq D^m$.
So we can get an upper bound:

$$\Big( \sum_{x \in \mathcal{X}} D^{-l(x)} \Big)^k = \sum_{m=1}^{k\, l_{\max}} a(m)\, D^{-m} \leq \sum_{m=1}^{k\, l_{\max}} D^{m} D^{-m} = k\, l_{\max}$$

Notice that the left-hand side grows exponentially in $k$ if $\sum_x D^{-l(x)} > 1$, while the right-hand side grows only linearly in $k$.

From this we arrive at the theorem's claim: since the inequality holds for all $k$, taking $k$-th roots gives $\sum_x D^{-l(x)} \leq (k\, l_{\max})^{1/k} \to 1$ as $k \to \infty$.
Proof of Converse

Proof of converse: given any set of codeword lengths $l_1, l_2, \ldots, l_m$ satisfying the Kraft inequality, we construct a prefix code with these lengths.
Consider a D-ary tree in which each node has D children. Let the branches of the tree represent the symbols of the codeword. Then each codeword is represented by a leaf on the tree. The path from the root traces out the symbols of the codeword.
Note: the construction will ensure that no codeword is an ancestor of any other codeword on the tree, so the resulting code is a prefix code.

Note: we work with the full $D$-ary tree of depth $l_{\max}$, so even a node chosen as a codeword (a "leaf" of the code) still has descendants in this full tree.
Label the first node (lexicographically) of depth $l_1$ as codeword 1, and remove its descendants from the tree. Note: for $i \neq j$, $l_i$ may equal $l_j$. Then label the first remaining node of depth $l_2$ as codeword 2, and so on. Proceeding this way, we construct a prefix code with the specified lengths $l_1, l_2, \ldots, l_m$; the Kraft inequality guarantees that a free node of the required depth always remains.
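The same construction can be phrased arithmetically: process the lengths in increasing order and let each codeword be the base-$D$ expansion of the first node still free at that depth, which is exactly the lexicographically first remaining node. A minimal Python sketch (the function name is made up for illustration):

```python
from fractions import Fraction

def kraft_prefix_code(lengths, D=2):
    """Build a D-ary prefix code with the given codeword lengths,
    assuming they satisfy the Kraft inequality."""
    assert sum(Fraction(1, D ** l) for l in lengths) <= 1, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    value, depth = 0, 0            # `value` = index of the next free node at `depth`
    for i in order:
        l = lengths[i]
        value *= D ** (l - depth)  # descend to depth l in the D-ary tree
        depth = l
        # write `value` as an l-digit base-D string: this is the codeword
        digits, v = [], value
        for _ in range(l):
            digits.append(str(v % D))
            v //= D
        codes[i] = "".join(reversed(digits))
        value += 1                 # skip this node (and implicitly its whole subtree)
    return codes

print(kraft_prefix_code([2, 1, 3, 3]))  # ['10', '0', '110', '111']
```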
Optimal Codes
An optimal code is a prefix code with the minimum expected length.
From the Kraft-McMillan Inequality and its converse, this is equivalent to finding the set of lengths $l_1, l_2, \ldots, l_m$ that satisfies the Kraft inequality and minimizes the expected length $L = \sum_i p_i l_i$.

This is a standard optimization problem: minimize $L = \sum_i p_i l_i$ over integer lengths $l_i$ subject to $\sum_i D^{-l_i} \leq 1$.
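If we relax the integer constraint on the $l_i$, the minimization can be carried out with a Lagrange multiplier (a standard calculation, sketched here; it is used again in the proof of the upper bound below):

$$
\begin{aligned}
J &= \sum_i p_i l_i + \lambda \Big( \sum_i D^{-l_i} - 1 \Big), \\
\frac{\partial J}{\partial l_i} &= p_i - \lambda (\ln D)\, D^{-l_i} = 0
\;\;\Longrightarrow\;\; D^{-l_i} = \frac{p_i}{\lambda \ln D}, \\
\sum_i D^{-l_i} &= 1 \;\;\Longrightarrow\;\; \lambda = \frac{1}{\ln D}
\;\;\Longrightarrow\;\; l_i^* = -\log_D p_i ,
\end{aligned}
$$

which gives the (generally non-integer) optimum $L^* = \sum_i p_i l_i^* = H_D(X)$.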
Bounds on the Optimal Code Length
Theorem: The expected length $L$ of any uniquely decodable $D$-ary code for a random variable $X$ is bounded below by the entropy, and the optimal expected length $L^*$ satisfies

$$H_D(X) \leq L^* < H_D(X) + 1 .$$
Proof (lower bound):

Let $c = \sum_{x} D^{-l(x)}$ and define $q(x) = D^{-l(x)} / c$, which is a probability mass function.

We know that $c \leq 1$.
Proof:
- By the Kraft-McMillan Inequality, $\sum_{x} D^{-l(x)} \leq 1$ for any uniquely decodable code, so $c \leq 1$.
- Since $c \leq 1$, we have $\log_D c \leq 0$.
Next, consider the relative entropy between the probability distribution of the source, $p(x)$, and $q(x)$:

$$0 \leq D(p \,\|\, q) = \sum_{x} p(x) \log_D \frac{p(x)}{q(x)} = \sum_{x} p(x) \log_D p(x) - \sum_{x} p(x) \log_D \frac{D^{-l(x)}}{c} = -H_D(X) + \sum_{x} p(x)\, l(x) + \log_D c \leq L - H_D(X)$$

Explanation:
- The first inequality is the non-negativity of relative entropy.
- The last inequality is due to $\log_D c \leq 0$ and $\sum_x p(x)\, l(x) = L$; the bound $\log_D c \leq 0$ is from the fact that $c \leq 1$.

This gives $L \geq H_D(X)$.
Note: Beyond its role in this proof, the relative entropy has significance in the context of compression. Suppose we use a length function $l(x) = \lceil \log \frac{1}{q(x)} \rceil$, i.e., a code optimized for a distribution $q$, while the data is actually distributed according to $p$. Then the expected length satisfies $H(p) + D(p \,\|\, q) \leq E_p[l(X)] < H(p) + D(p \,\|\, q) + 1$.

The relative entropy $D(p \,\|\, q)$ can thus be thought of as the cost of mismatch in lossless compression, i.e., the expected number of excess bits that will be expended due to optimizing the code for the distribution $q$ when the true distribution is $p$.
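A small numerical illustration of this mismatch cost (a sketch with made-up distributions; the code is a Shannon-style code designed for $q$ while the data follows $p$):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # distribution the code was designed for

lengths = [math.ceil(math.log2(1 / qi)) for qi in q]   # code for q: all lengths 2
L = sum(pi * li for pi, li in zip(p, lengths))         # expected length under the true p
H_p = -sum(pi * math.log2(pi) for pi in p)             # H(p) = 1.75 bits
D_pq = sum(pi * math.log2(pi / qi) for pi in p)        # D(p || q) = 0.25 bits

# mismatch cost: H(p) + D(p||q) <= E_p[l(X)] < H(p) + D(p||q) + 1
print(L, H_p + D_pq)   # 2.0 2.0 (no rounding slack here, since q is dyadic)
```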
Proof (upper bound):

Recall the converse of the Kraft-McMillan Inequality: any set of lengths satisfying the Kraft inequality can be realized by a prefix code.

We proved that the optimal codeword lengths can be found by minimizing $\sum_i p_i l_i$ over integer lengths subject to $\sum_i D^{-l_i} \leq 1$; relaxing the integer constraint, the optimum is $l_i^* = \log_D \frac{1}{p_i}$, which is generally not an integer.

The choice of word lengths $l_i = \big\lceil \log_D \frac{1}{p_i} \big\rceil$ (the Shannon code) rounds this up to an integer.

This choice of codeword lengths satisfies the Kraft inequality, since $\sum_i D^{-l_i} \leq \sum_i D^{-\log_D (1/p_i)} = \sum_i p_i = 1$, and also

$$\log_D \frac{1}{p_i} \leq l_i < \log_D \frac{1}{p_i} + 1 .$$

Multiplying by $p_i$ and summing over $i$ gives $H_D(X) \leq L < H_D(X) + 1$. Since the optimal code can only do better than this code, $L^* < H_D(X) + 1$.
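A quick numerical check of these bounds (a minimal sketch; `shannon_lengths` is a hypothetical helper implementing $l_i = \lceil \log_D \frac{1}{p_i} \rceil$):

```python
import math

def shannon_lengths(probs, D=2):
    # l_i = ceil(log_D(1/p_i)); beware floating-point edge cases when p_i is an exact power of 1/D
    return [math.ceil(math.log(1 / p, D)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(probs)
H = -sum(p * math.log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))
kraft = sum(2 ** -l for l in lengths)

print(lengths)          # [2, 2, 3, 4]
print(kraft <= 1)       # True: the lengths satisfy the Kraft inequality
print(H <= L < H + 1)   # True: H(X) <= L < H(X) + 1
```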
Huffman Coding
We've discussed a lot about optimal codes previously. But how can we construct an optimal code?
Here we introduce the Huffman code, which is an optimal (as proved later) prefix code for a general source distribution.
How to generate a (binary) Huffman code (see the code sketch after the note below):
1. Find the 2 symbols with the smallest probabilities and merge them to create a new "node"; treat it as a new symbol whose probability is the sum of the two.
2. Among the remaining symbols (including the new node), again merge the 2 with the smallest probabilities to create a new "node".
3. Repeat steps 1 and 2 until there is only 1 symbol (the root) left.
4. At the end of this process, we obtain a binary tree. The left branch of the Huffman tree is given the bit value 0 and the right branch is given the bit value 1. The Huffman encoding of each symbol is formed by taking the path from the root node to that symbol's leaf node.
Recall that this is a prefix code, since we only assign source symbols to the leaf nodes. Also note that the Huffman code is not necessarily unique, since it depends on the order of the merge operations (e.g., how ties are broken).
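A minimal Python sketch of this procedure (illustrative, not code from the references; it builds the tree with a heap and then reads the codewords off the leaves):

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code for symbols 0..len(probs)-1.
    Returns {symbol: codeword}; ties are broken arbitrarily, so the codewords
    (but not the expected length) may differ from a hand-built code."""
    if len(probs) == 1:
        return {0: "0"}
    # heap entries: (probability, tie-breaker, subtree); a subtree is either
    # a symbol index (leaf) or a pair (left, right) of subtrees
    heap = [(p, i, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)                     # two least probable subtrees
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))  # merge into a new node
        counter += 1

    code = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):        # internal node
            assign(tree[0], prefix + "0")  # left branch gets bit 0
            assign(tree[1], prefix + "1")  # right branch gets bit 1
        else:
            code[tree] = prefix            # leaf: record the codeword
    assign(heap[0][2], "")
    return code
```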
Aside: (Source arity nomenclature):
- binary (2 symbols),
- ternary (3 symbols),
- quaternary (4 symbols),
- quinary (5 symbols),
- senary (6 symbols),
- septenary (7 symbols),
- octal (8 symbols),
- nonary (9 symbols),
- decimal (10 symbols)
- ...
Example
Consider a random variable $X$ taking values in the set $\{1, 2, 3, 4, 5\}$ with probabilities $0.25, 0.25, 0.2, 0.15, 0.15$, respectively.

One resulting Huffman code (the tree figure is omitted; the codeword assignment is not unique) is:

| $x$ | $p(x)$ | codeword |
|---|---|---|
| 1 | 0.25 | 01 |
| 2 | 0.25 | 10 |
| 3 | 0.2 | 11 |
| 4 | 0.15 | 000 |
| 5 | 0.15 | 001 |
This code has average length 2.3 bits.
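Running the `huffman_code` sketch from above on this distribution reproduces the codeword lengths $2, 2, 2, 3, 3$ and the 2.3-bit average (the exact codewords depend on tie-breaking):

```python
probs = [0.25, 0.25, 0.2, 0.15, 0.15]
code = huffman_code(probs)                            # e.g. {2: '00', 0: '01', 1: '10', 3: '110', 4: '111'}
avg = sum(p * len(code[i]) for i, p in enumerate(probs))
print(sorted(len(w) for w in code.values()))          # [2, 2, 2, 3, 3]
print(round(avg, 10))                                 # 2.3
```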
Optimality Of Huffman Coding
Theorem: Huffman coding is optimal; that is, if $C^*$ is a Huffman code and $C'$ is any other uniquely decodable code, then $L(C^*) \leq L(C')$.

In the following we'll prove the theorem for a binary alphabet; the proof can be extended to a $D$-ary alphabet as well.
It is important to remember that there are many optimal codes: inverting all the bits or exchanging two codewords of the same length will give another optimal code. The Huffman procedure constructs one such optimal code.
To prove the optimality of Huffman codes, we first prove some properties of a particular optimal code.
Without loss of generality, we will assume that the probability masses are ordered, so that $p_1 \geq p_2 \geq \cdots \geq p_m$.
Lemma
For any distribution, there exists an optimal instantaneous code (with minimum expected length) that satisfies the following properties:
- The lengths are ordered inversely with the probabilities (i.e., if $p_j > p_k$, then $l_j \leq l_k$).
- The two longest codewords have the same length.
- Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.
Proof: The proof amounts to swapping, trimming, and rearranging (see Figure 5.3 of Elements of Information Theory). Consider an optimal code $C_m$:
- Swapping: suppose $p_j > p_k$. Let $C'_m$ be the code with the codewords of $j$ and $k$ interchanged. Then $L(C'_m) - L(C_m) = (p_j - p_k)(l_k - l_j)$. But $p_j - p_k > 0$, and since $C_m$ is optimal, $L(C'_m) - L(C_m) \geq 0$; hence $l_k \geq l_j$.
- Trimming: if the two longest codewords had different lengths, we could delete the last bit of the longer one while preserving the prefix property, reducing the expected length and contradicting optimality. Hence the two longest codewords have the same length.
- Rearranging: every maximal-length codeword must have a sibling (otherwise its last bit could be deleted, again contradicting optimality). By exchanging codewords of maximal length, we can make the two lowest-probability symbols correspond to two sibling codewords without changing the expected length.

Summarizing, we have shown that if $p_1 \geq p_2 \geq \cdots \geq p_m$, there exists an optimal code with $l_1 \leq l_2 \leq \cdots \leq l_{m-1} = l_m$, in which the codewords $C(x_{m-1})$ and $C(x_m)$ differ only in the last bit.
Proof
Thus, we have shown that there exists an optimal code satisfying the properties of the lemma. We call such codes canonical codes.
For any probability mass function on an alphabet of size $m$,

$$p = (p_1, p_2, \ldots, p_m) \quad \text{with } p_1 \geq p_2 \geq \cdots \geq p_m,$$

we define the Huffman reduction

$$p' = (p_1, p_2, \ldots, p_{m-2}, p_{m-1} + p_m)$$

over an alphabet of size $m-1$.

- Let $C_{m-1}^*(p')$ be a canonical optimal code for $p'$.
  - We denote the lengths of its codewords by $l'_1, l'_2, \ldots, l'_{m-1}$.
  - The average length is $L(C_{m-1}^*(p')) = \sum_{i=1}^{m-1} p'_i l'_i$.
- Let $C_m^*(p)$ be a canonical optimal code for $p$.
  - We denote the lengths of its codewords by $l_1, l_2, \ldots, l_m$.
  - The average length is $L(C_m^*(p)) = \sum_{i=1}^{m} p_i l_i$.
The proof of optimality will follow from two constructions:
Step 1

First, from the canonical optimal code $C_{m-1}^*(p')$ for $p'$, we construct a code $C_m(p)$ for $p$ by taking the codeword of the merged symbol (of probability $p_{m-1} + p_m$) and expanding it:
- adding a 0 to form a codeword for symbol $m-1$,
- adding a 1 to form a codeword for symbol $m$.

The lengths of the codewords of $C_m(p)$ are $l_i = l'_i$ for $i = 1, \ldots, m-2$, and $l_{m-1} = l_m = l'_{m-1} + 1$.

The average length of $C_m(p)$ is

$$L(C_m(p)) = \sum_{i=1}^{m-2} p_i l'_i + p_{m-1}(l'_{m-1} + 1) + p_m(l'_{m-1} + 1) = \sum_{i=1}^{m-1} p'_i l'_i + p_{m-1} + p_m = L(C_{m-1}^*(p')) + p_{m-1} + p_m .$$
Step 2

Similarly, from the canonical optimal code $C_m^*(p)$ for $p$, we construct a code $C_{m-1}(p')$ for the Huffman reduction $p'$ by merging the codewords of the two least likely symbols $m-1$ and $m$ (which, by the lemma, have the same length and differ only in the last bit): the merged symbol gets their common prefix as its codeword.

The average length of $C_{m-1}(p')$ is

$$L(C_{m-1}(p')) = \sum_{i=1}^{m-2} p_i l_i + (p_{m-1} + p_m)(l_m - 1) = \sum_{i=1}^{m} p_i l_i - p_{m-1} - p_m = L(C_m^*(p)) - p_{m-1} - p_m .$$
Step 3

Adding the two equalities from Steps 1 and 2:

$$L(C_m(p)) + L(C_{m-1}(p')) = L(C_{m-1}^*(p')) + L(C_m^*(p)),$$

or equivalently

$$\big( L(C_m(p)) - L(C_m^*(p)) \big) + \big( L(C_{m-1}(p')) - L(C_{m-1}^*(p')) \big) = 0 .$$

We know that
- $L(C_m(p)) - L(C_m^*(p)) \geq 0$, since $C_m^*(p)$ is optimal for $p$;
- $L(C_{m-1}(p')) - L(C_{m-1}^*(p')) \geq 0$, since $C_{m-1}^*(p')$ is optimal for $p'$.

If the sum of two nonnegative terms is 0, then both of them are 0, which implies that $L(C_m(p)) = L(C_m^*(p))$; i.e., the expansion of an optimal code for $p'$ is also optimal for $p$.
Conclusion
Consequently, if we start with an optimal code for $p'$ (with $m-1$ symbols) and construct a code for $p$ (with $m$ symbols) by expanding the codeword of the merged symbol, the new code is also optimal.
But how to start with an optimal code?
Well, we can start with an alphabet of two symbols, for which the optimal code is obvious (the codewords 0 and 1). By induction on the alphabet size, this extends to prove the theorem stated above: the Huffman code is optimal.