Vector Quantized Variational Autoencoders

Sources:

  1. VQ-VAE 2018 paper
  2. A compact explanation by Julius Ruseckas

Link: My VQ-VAE implementation on Github

Notation

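Symbol | Type | Explanation
$x, y, z$ | $\mathbb{R}$ | Cartesian coordinates
$u, v, w$ | $\mathbb{R}$ | General curvilinear coordinates
$r, \phi, z$ | $\mathbb{R}$ | Cylindrical coordinates
$R, \theta, \phi$ | $\mathbb{R}$ | Spherical coordinates
$\mathbf{r}$ | $\mathbb{R}^3$ | Position vector
$\hat{x}, \hat{y}, \hat{z}$ | $\mathbb{R}^3$ | Cartesian unit basis vectors
$\hat{r}, \hat{\phi}, \hat{z}$ | $\mathbb{R}^3$ | Cylindrical unit basis vectors
$\hat{R}, \hat{\theta}, \hat{\phi}$ | $\mathbb{R}^3$ | Spherical unit basis vectors
$\mathbf{e}_u, \mathbf{e}_v, \mathbf{e}_w$ | $\mathbb{R}^3$ | Base vectors in the transformed coordinate system
$d\mathbf{l}$ | $\mathbb{R}^3$ | Differential displacement vector
$d\mathbf{l}_u, d\mathbf{l}_v, d\mathbf{l}_w$ | $\mathbb{R}^3$ | Differential displacement vectors along each coordinate
$d\mathbf{s}$ | $\mathbb{R}^3$ | Differential area vector
$d\mathbf{s}_u, d\mathbf{s}_v, d\mathbf{s}_w$ | $\mathbb{R}^3$ | Differential area vectors along each coordinate
$dV$ | $\mathbb{R}$ | Differential volume element
$J$ | $\mathbb{R}^{3 \times 3}$ | Jacobian matrix of the coordinate transformation
$\det J$ | $\mathbb{R}$ | Determinant of the Jacobian matrix, used for volume scaling
$\mathbf{e}_u \cdot (\mathbf{e}_v \times \mathbf{e}_w)$ | $\mathbb{R}$ | Scalar triple product of base vectors, equal to $\det J$
$\nabla$ | Operator | Gradient operator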

Introduction

This section is adapted from the brilliant and compact introduction to VQ-VAE in Finite Scalar Quantization: VQ-VAE Made Simple.

Vector quantization (VQ), initially introduced by Gray (1984), has recently seen a renaissance in the context of learning discrete representations with neural networks. Spurred by the success of VQ-VAE (Van Den Oord et al., 2017), Esser et al. (2020) and Villegas et al. (2022) showed that training an autoregressive transformer on the representations of a VQ-VAE trained with a GAN loss enables powerful image and video generation models, respectively.

At the same time, VQ has become a popular component in image (Bao et al., 2021; Li et al., 2023) and audio (Baevski et al., 2019) representation learning, and is a promising building block for the next generation of multimodal large language models (Aghajanyan et al., 2022; Kim et al., 2023; Aghajanyan et al., 2023).

When training VQ-VAE, the goal is to learn a codebook C whose elements induce a compressed, semantic representation of the input data (typically images). In the forward pass, an image x is encoded into a representation z (typically a sequence of feature vectors), and each vector in z is quantized to (i.e., replaced with) the closest vector in C. The quantization operation is not differentiable. When training a VAE with VQ in the latent representation, Van Den Oord et al. (2017) use the straight-through estimator (STE) (Bengio et al., 2013), copying the gradients from the decoder input to the encoder output, resulting in gradients to the encoder. Since this still does not produce gradients for the codebook vectors, they further introduce two auxiliary losses to pull the codeword vectors towards the (unquantized) representation vectors and vice versa.

Components of VQ-VAE

The Vector Quantised-Variational AutoEncoder (VQ-VAE) differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static.

In VQ-VAE, we have:

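  • An input image $x \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are the height and width of the image.

  • A latent embedding space $\mathcal{E} = \{e_k\}_{k=1}^{K} \subset \mathbb{R}^{D}$, called the codebook, where each $e_k$ is called a code and $D$ is the dimensionality of the codes.

  • An encoder $E: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{h \times w \times D}$, $E(x) = z_e$, which encodes an image $x$ into an embedding $z_e$.

  • An operator called the quantizer, $\mathrm{Qtzr}: \mathbb{R}^{h \times w \times D} \to \mathbb{R}^{h \times w \times D}$, $\mathrm{Qtzr}(z_e) = z_q$,

    which quantizes $z_e$ to $z_q$, where $z_q, z_e \in \mathbb{R}^{h \times w \times D}$. The quantization process is $z_q^{ij} = \arg\min_{e_k \in \mathcal{E}} \| z_e^{ij} - e_k \|_2$, with indices $i = 1, \dots, h$ and $j = 1, \dots, w$.

    Therefore, for each element $z_e^{ij}$ of $z_e$, we use argmin to find the code $e_k$ that is closest to $z_e^{ij}$, i.e., the one minimizing the norm $\| z_e^{ij} - e_k \|_2$, and set $z_q^{ij}$ to that code.

  • A decoder $D: \mathbb{R}^{h \times w \times D} \to \mathbb{R}^{H \times W \times 3}$, $D(z_q) = \hat{x}$, which decodes $z_q$ into an image $\hat{x}$, also called the reconstructed image of $x$.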
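The nearest-neighbor lookup performed by the quantizer can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the implementation used later in this post: the function name `quantize`, the toy sizes (K = 8, D = 4, h = 2, w = 3) and the use of `torch.cdist` are assumptions made for the example.

```python
import torch


def quantize(z_e, codebook):
    """Replace every D-dim vector of z_e (h, w, D) by its nearest code in codebook (K, D)."""
    h, w, D = z_e.shape
    flat = z_e.reshape(-1, D)              # (h*w, D): one row per spatial location
    dists = torch.cdist(flat, codebook)    # (h*w, K): Euclidean distance to every code
    idx = dists.argmin(dim=1)              # (h*w,): index k of the closest code
    z_q = codebook[idx].reshape(h, w, D)   # look the codes up and restore the grid
    return z_q, idx.reshape(h, w)


torch.manual_seed(0)
codebook = torch.randn(8, 4)   # hypothetical codebook with K = 8 codes of size D = 4
z_e = torch.randn(2, 3, 4)     # hypothetical encoder output with h = 2, w = 3
z_q, indices = quantize(z_e, codebook)
print(z_q.shape, indices.shape)
```

Note that `z_q` has the same shape as `z_e`, and every vector in it is an exact copy of one codebook row, so the whole grid can be stored as `h × w` integer indices.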

Forward pass

[Figure: VQ-VAE]

The forward pass of VQ-VAE consists of three steps:

  1. First, we use the encoder $E(\cdot)$ to encode the input image $x$ into the embedding $z_e = E(x)$.

  2. Next, we use the quantizer $\mathrm{Qtzr}(\cdot)$ to quantize the embedding $z_e$ into the quantized embedding $z_q$. Since $z_e$ contains $h \times w$ embedding vectors, $z_q$ is composed of $h \times w$ codes, each of which is a code $e_k$ selected through a nearest-neighbor lookup (argmin) in the codebook $\mathcal{E} = \{e_k\}_{k=1}^{K}$.

  3. Finally, we use the decoder to decode $z_q$ into the reconstructed image $\hat{x}$.

NOTES:

  1. The encoded embedding $z_e$ and the quantized embedding $z_q$ have the same shape, $z_e, z_q \in \mathbb{R}^{h \times w \times D}$, and their vectors have the same embedding dimension ($= D$) as the codes $e_k$.

  2. We will sometimes write $z_e(x)$ and $z_q(x)$ to refer to $z_e$ and $z_q$.

Loss function

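The overall loss function is: $$L = -\log p(x \mid z_q) + \| \mathrm{sg}(z_e) - z_q \|_2^2 + \beta \| z_e - \mathrm{sg}(z_q) \|_2^2,$$ where $\mathrm{sg}$ denotes the stop_gradient operator. VQ-VAE relies on the argmin function, which is non-differentiable, so the gradient $\nabla_{z_q} L$ at the decoder input $z_q$ cannot be propagated to the encoder output $z_e$ by ordinary backpropagation. To solve this, we use a trick called the straight-through estimator, which uses the stop_gradient operator to copy $\nabla_{z_q} L$ to $z_e$: the decoder is fed $z_e + \mathrm{sg}(z_q - z_e)$, which equals $z_q$ in the forward pass but behaves like $z_e$ in the backward pass.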
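The gradient-copying behaviour of the straight-through estimator can be checked on a toy example. In this sketch, `torch.round` stands in for the non-differentiable codebook lookup and an element-wise weighting stands in for the decoder; both are assumptions made for illustration only.

```python
import torch

# Toy encoder output; requires_grad marks it as something we want gradients for.
z_e = torch.tensor([[0.2, 0.9], [1.4, -0.3]], requires_grad=True)

# Stand-in for the non-differentiable quantization step (NOT a codebook lookup).
z_q = torch.round(z_e.detach())

# Straight-through estimator: the forward value is z_q, the backward pass sees z_e.
z_q_st = z_e + (z_q - z_e).detach()

# Toy "decoder": a fixed element-wise weighting followed by a sum.
weights = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
loss = (weights * z_q_st).sum()
loss.backward()

print(z_q_st.detach())  # same values as z_q
print(z_e.grad)         # same values as weights, i.e. the gradient w.r.t. z_q
```

The gradient that arrives at `z_e` is exactly the gradient of the loss with respect to the quantized tensor, as if the quantization step were the identity function.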

The overall loss function has three components:

  1. Reconstruction loss $-\log p(x \mid z_q)$, the negative log-likelihood. In practice, it is commonly replaced with an MSE loss.

  2. Codebook loss $\| \mathrm{sg}[z_e(x)] - e \|_2^2$, which moves the embedding vectors towards the encoder output. Here $e$ is the code selected for $z_e(x)$, i.e., $z_q(x)$.

  3. Commitment loss $\beta \| z_e(x) - \mathrm{sg}[e] \|_2^2$, which encourages the encoder output to stay close to the embedding space.
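The role of the stop-gradient in the two auxiliary terms can be verified directly: each term sends gradients to only one side. The sketch below uses random stand-in tensors for the encoder outputs and the selected codes, and sets `beta = 0.25` (the value reported in the VQ-VAE paper); `F.mse_loss` averages rather than sums the squared differences, which only rescales the terms.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z_e = torch.randn(5, 4, requires_grad=True)  # stand-in encoder outputs
e = torch.randn(5, 4, requires_grad=True)    # stand-in selected codes
beta = 0.25                                  # commitment weight (assumed value)

# Codebook loss ||sg[z_e] - e||^2: detach() plays the role of sg on z_e.
codebook_loss = F.mse_loss(e, z_e.detach())
# Commitment loss beta * ||z_e - sg[e]||^2: detach() plays the role of sg on e.
commitment_loss = beta * F.mse_loss(z_e, e.detach())

codebook_loss.backward()
grad_e_after_codebook = e.grad.clone()
z_e_untouched = z_e.grad is None   # the codebook loss never reaches the encoder

commitment_loss.backward()
e_unchanged = torch.equal(e.grad, grad_e_after_codebook)  # commitment loss never reaches the codes

print(z_e_untouched, e_unchanged)
```

Both terms have the same numerical form; only the placement of the stop-gradient decides whether the codebook or the encoder is updated.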

Model architecture

#TODO shape problem

VectorQuantizer

This layer takes a tensor to be quantized. The channel dimension will be used as the space in which to quantize. All other dimensions will be flattened and will be seen as different examples to quantize.

The output tensor will have the same shape as the input.

As an example, a BCHW tensor of shape [16, 64, 32, 32] is first converted to a BHWC tensor of shape [16, 32, 32, 64] and then reshaped into [16384, 64], and all 16384 vectors of size 64 are quantized independently. In other words, the channels are used as the space in which to quantize, and all other dimensions are flattened and seen as different examples to quantize, 16384 in this case.
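The reshaping described above can be reproduced on a random tensor to confirm that each flattened row is the channel vector of exactly one spatial location. The variable names and the chosen location `(b, h, w) = (3, 5, 7)` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 64, 32, 32)                # BCHW

x_bhwc = x.permute(0, 2, 3, 1).contiguous()    # BHWC: [16, 32, 32, 64]
flat = x_bhwc.view(-1, 64)                     # [16384, 64]: one row per (b, h, w)
print(flat.shape)

# Row n corresponds to batch b, row h, column w of the feature map.
b, h, w = 3, 5, 7
n = (b * 32 + h) * 32 + w
same_vector = torch.equal(flat[n], x[b, :, h, w])

# Undo the flattening to get back to BCHW.
restored = flat.view(16, 32, 32, 64).permute(0, 3, 1, 2).contiguous()
print(same_vector, torch.equal(restored, x))
```

The `contiguous()` call matters: `view` requires a contiguous memory layout, which `permute` alone does not provide.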

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class VectorQuantizer(nn.Module):
        def __init__(self, num_embeddings, embedding_dim, commitment_cost):
            super(VectorQuantizer, self).__init__()

            self._embedding_dim = embedding_dim
            self._num_embeddings = num_embeddings

            # Codebook: num_embeddings codes, each of size embedding_dim
            self._embedding = nn.Embedding(self._num_embeddings, self._embedding_dim)
            self._embedding.weight.data.uniform_(-1/self._num_embeddings, 1/self._num_embeddings)
            self._commitment_cost = commitment_cost

        def forward(self, inputs):
            # convert inputs from BCHW -> BHWC
            inputs = inputs.permute(0, 2, 3, 1).contiguous()
            input_shape = inputs.shape

            # Flatten input
            flat_input = inputs.view(-1, self._embedding_dim)

            # Calculate squared distances: ||z||^2 + ||e||^2 - 2 * z . e
            distances = (torch.sum(flat_input**2, dim=1, keepdim=True)
                         + torch.sum(self._embedding.weight**2, dim=1)
                         - 2 * torch.matmul(flat_input, self._embedding.weight.t()))

            # Encoding: index of the closest code, then a one-hot row per input vector
            encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1)
            encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)
            encodings.scatter_(1, encoding_indices, 1)

            # Quantize and unflatten
            quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)

            # Loss
            e_latent_loss = F.mse_loss(quantized.detach(), inputs)
            commitment_loss = self._commitment_cost * e_latent_loss
            codebook_loss = F.mse_loss(quantized, inputs.detach())

            # Straight Through Estimator
            quantized = inputs + (quantized - inputs).detach()

            # Perplexity of the average code usage over this batch
            avg_probs = torch.mean(encodings, dim=0)
            perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))

            # convert quantized from BHWC -> BCHW
            return codebook_loss, commitment_loss, quantized.permute(0, 3, 1, 2).contiguous(), perplexity, encodings
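The perplexity returned by the quantizer is the exponential of the entropy of the average code usage, so it can be read as an effective number of codes in use. The sketch below reuses the same formula on two hand-made one-hot matrices; the helper name `perplexity` and the toy sizes are assumptions made for the example.

```python
import torch


def perplexity(encodings):
    """exp(entropy) of the average one-hot assignment, as computed in the quantizer."""
    avg_probs = torch.mean(encodings, dim=0)
    return torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))


K = 4
# 12 vectors spread evenly over the 4 codes.
uniform = torch.eye(K).repeat(3, 1)
# 12 vectors that all pick code 0 (codebook collapse).
collapsed = torch.zeros(12, K)
collapsed[:, 0] = 1.0

ppl_uniform = perplexity(uniform).item()
ppl_collapsed = perplexity(collapsed).item()
print(ppl_uniform, ppl_collapsed)  # approximately K and approximately 1
```

A perplexity close to `num_embeddings` therefore indicates that the codebook is used evenly, while a value close to 1 indicates that almost all inputs map to a single code.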

We will also implement a slightly modified version which uses exponential moving averages (EMA) to update the embedding vectors instead of an auxiliary loss. This has the advantage that the embedding updates are independent of the choice of optimizer for the encoder, decoder and other parts of the architecture. For most experiments the EMA version trains faster than the non-EMA version.
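For reference, the EMA update can be written out explicitly. The symbols below are notation introduced here, not taken from the text above: $\gamma$ is the decay, $n_k^{(t)}$ is the number of encoder vectors assigned to code $e_k$ in the current batch, $N_k$ is the running cluster size and $m_k$ is the running sum of assigned encoder vectors. In the implementation that follows, these appear to correspond to `_decay`, the column sums of `encodings`, `_ema_cluster_size` and `_ema_w`.

```latex
\begin{aligned}
N_k^{(t)} &= \gamma\, N_k^{(t-1)} + (1-\gamma)\, n_k^{(t)}, \\
m_k^{(t)} &= \gamma\, m_k^{(t-1)} + (1-\gamma) \sum_{j \,:\, z_{q,j} = e_k} z_{e,j}^{(t)}, \\
e_k^{(t)} &= \frac{m_k^{(t)}}{N_k^{(t)}} .
\end{aligned}
```

Before the division, the cluster sizes are Laplace-smoothed so that unused codes do not cause a division by zero: with $n = \sum_k N_k^{(t)}$ and a small $\epsilon$, each $N_k^{(t)}$ is replaced by $\frac{N_k^{(t)} + \epsilon}{n + K\epsilon}\, n$.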

VectorQuantizerEMA

    class VectorQuantizerEMA(nn.Module):
        def __init__(self, num_embeddings, embedding_dim, commitment_cost, decay, epsilon=1e-5):
            super(VectorQuantizerEMA, self).__init__()

            self._embedding_dim = embedding_dim
            self._num_embeddings = num_embeddings

            self._embedding = nn.Embedding(self._num_embeddings, self._embedding_dim)
            self._embedding.weight.data.normal_()
            self._commitment_cost = commitment_cost

            # Running statistics for the EMA update of the codebook
            self.register_buffer('_ema_cluster_size', torch.zeros(num_embeddings))
            self._ema_w = nn.Parameter(torch.Tensor(num_embeddings, self._embedding_dim))
            self._ema_w.data.normal_()

            self._decay = decay
            self._epsilon = epsilon

        def forward(self, inputs):
            # convert inputs from BCHW -> BHWC, e.g. (256, 64, 8, 8) -> (256, 8, 8, 64)
            inputs = inputs.permute(0, 2, 3, 1).contiguous()
            input_shape = inputs.shape  # BHWC

            # Flatten input into (N, C) with C = _embedding_dim (64 here) and N = B*H*W,
            # i.e. N vectors `z_e`, each of dimension C.
            flat_input = inputs.view(-1, self._embedding_dim)

            # Calculate the squared distance from every vector `z_e` to every code `e_j`
            # in the codebook, j = 1..K with K = `_num_embeddings`.
            distances = (torch.sum(flat_input**2, dim=1, keepdim=True)
                         + torch.sum(self._embedding.weight**2, dim=1)
                         - 2 * torch.matmul(flat_input, self._embedding.weight.t()))

            # Encoding: for each vector `z_e`, select the index of the closest code `e_j`.
            encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1)

            # Turn the index of each selected code into a one-hot encoding.
            encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)
            encodings.scatter_(1, encoding_indices, 1)

            # Quantize and unflatten: multiplying by the one-hot rows selects the codes.
            quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)

            # Use EMA to update the embedding vectors
            if self.training:
                self._ema_cluster_size = self._ema_cluster_size * self._decay + \
                                         (1 - self._decay) * torch.sum(encodings, 0)

                # Laplace smoothing of the cluster size
                n = torch.sum(self._ema_cluster_size.data)
                self._ema_cluster_size = (
                    (self._ema_cluster_size + self._epsilon)
                    / (n + self._num_embeddings * self._epsilon) * n)

                # Sum of the encoder vectors assigned to each code
                dw = torch.matmul(encodings.t(), flat_input)
                self._ema_w = nn.Parameter(self._ema_w * self._decay + (1 - self._decay) * dw)

                self._embedding.weight = nn.Parameter(self._ema_w / self._ema_cluster_size.unsqueeze(1))

            # Loss
            e_latent_loss = F.mse_loss(quantized.detach(), inputs)
            commitment_loss = self._commitment_cost * e_latent_loss
            codebook_loss = F.mse_loss(quantized, inputs.detach())

            # Straight Through Estimator
            quantized = inputs + (quantized - inputs).detach()

            # Perplexity of the average code usage over this batch
            avg_probs = torch.mean(encodings, dim=0)
            perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))

            # convert quantized from BHWC -> BCHW
            return codebook_loss, commitment_loss, quantized.permute(0, 3, 1, 2).contiguous(), perplexity, encodings

Encoder

    class Encoder(nn.Module):
        def __init__(self, in_channels, num_hiddens, num_residual_layers, num_residual_hiddens):
            super(Encoder, self).__init__()

            # Two stride-2 convolutions downsample the input by a factor of 4 in total
            self._conv_1 = nn.Conv2d(in_channels=in_channels,
                                     out_channels=num_hiddens//2,
                                     kernel_size=4,
                                     stride=2, padding=1)
            self._conv_2 = nn.Conv2d(in_channels=num_hiddens//2,
                                     out_channels=num_hiddens,
                                     kernel_size=4,
                                     stride=2, padding=1)
            self._conv_3 = nn.Conv2d(in_channels=num_hiddens,
                                     out_channels=num_hiddens,
                                     kernel_size=3,
                                     stride=1, padding=1)
            self._residual_stack = ResidualStack(in_channels=num_hiddens,
                                                 num_hiddens=num_hiddens,
                                                 num_residual_layers=num_residual_layers,
                                                 num_residual_hiddens=num_residual_hiddens)

        def forward(self, inputs):
            x = self._conv_1(inputs)
            x = F.relu(x)

            x = self._conv_2(x)
            x = F.relu(x)

            x = self._conv_3(x)
            return self._residual_stack(x)
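The spatial size produced by the encoder follows from the standard convolution output-size formula, `floor((size + 2 * padding - kernel) / stride) + 1`. The short sketch below applies it to the three convolutions above for a 32×32 input; the helper name `conv_out` and the 32×32 input size are assumptions chosen to match the shape comments used elsewhere in this post.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel_size, stride, padding) of _conv_1, _conv_2 and _conv_3 in the Encoder.
encoder_convs = [(4, 2, 1), (4, 2, 1), (3, 1, 1)]

size = 32
sizes = [size]
for kernel, stride, padding in encoder_convs:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [32, 16, 8, 8]
```

The residual blocks use 3×3 convolutions with padding 1 and 1×1 convolutions, both of which preserve the spatial size, so a 32×32 image ends up as an 8×8 grid of feature vectors.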

Decoder

    class Decoder(nn.Module):
        def __init__(self, in_channels, num_hiddens, num_residual_layers, num_residual_hiddens):
            super(Decoder, self).__init__()

            self._conv_1 = nn.Conv2d(in_channels=in_channels,
                                     out_channels=num_hiddens,
                                     kernel_size=3,
                                     stride=1, padding=1)

            self._residual_stack = ResidualStack(in_channels=num_hiddens,
                                                 num_hiddens=num_hiddens,
                                                 num_residual_layers=num_residual_layers,
                                                 num_residual_hiddens=num_residual_hiddens)

            # Two stride-2 transposed convolutions upsample by a factor of 4 in total
            self._conv_trans_1 = nn.ConvTranspose2d(in_channels=num_hiddens,
                                                    out_channels=num_hiddens//2,
                                                    kernel_size=4,
                                                    stride=2, padding=1)

            # The final layer outputs 3 channels (an RGB image)
            self._conv_trans_2 = nn.ConvTranspose2d(in_channels=num_hiddens//2,
                                                    out_channels=3,
                                                    kernel_size=4,
                                                    stride=2, padding=1)

        def forward(self, inputs):
            x = self._conv_1(inputs)

            x = self._residual_stack(x)

            x = self._conv_trans_1(x)
            x = F.relu(x)

            return self._conv_trans_2(x)
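The decoder mirrors the encoder's downsampling. For a transposed convolution with dilation 1 and no output padding, the output size is `(size - 1) * stride - 2 * padding + kernel`. The sketch below applies this to the two transposed convolutions above, starting from an 8×8 grid; the helper name `conv_transpose_out` and the starting size are assumptions for illustration.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution along one spatial dimension
    (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel_size, stride, padding) of _conv_trans_1 and _conv_trans_2 in the Decoder.
decoder_convs = [(4, 2, 1), (4, 2, 1)]

size = 8
sizes = [size]
for kernel, stride, padding in decoder_convs:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [8, 16, 32]
```

With kernel 4, stride 2 and padding 1, each transposed convolution exactly doubles the spatial size, so the 8×8 latent grid is mapped back to a 32×32 image.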

Residual blocks

    class Residual(nn.Module):
        def __init__(self, in_channels, num_hiddens, num_residual_hiddens):
            super(Residual, self).__init__()
            # Note: nn.ReLU(True) is in-place, so the first ReLU also overwrites the
            # tensor that forward() uses for the skip connection.
            self._block = nn.Sequential(
                nn.ReLU(True),
                nn.Conv2d(in_channels=in_channels,
                          out_channels=num_residual_hiddens,
                          kernel_size=3, stride=1, padding=1, bias=False),
                nn.ReLU(True),
                nn.Conv2d(in_channels=num_residual_hiddens,
                          out_channels=num_hiddens,
                          kernel_size=1, stride=1, bias=False)
            )

        def forward(self, x):
            return x + self._block(x)


    class ResidualStack(nn.Module):
        def __init__(self, in_channels, num_hiddens, num_residual_layers, num_residual_hiddens):
            super(ResidualStack, self).__init__()
            self._num_residual_layers = num_residual_layers
            self._layers = nn.ModuleList([Residual(in_channels, num_hiddens, num_residual_hiddens)
                                          for _ in range(self._num_residual_layers)])

        def forward(self, x):
            # Apply the residual blocks in sequence, then a final ReLU
            for i in range(self._num_residual_layers):
                x = self._layers[i](x)
            return F.relu(x)

VQ-VAE

    class Model(nn.Module):
        def __init__(self, num_hiddens, num_residual_layers, num_residual_hiddens,
                     num_embeddings, embedding_dim, commitment_cost, decay=0):
            super(Model, self).__init__()

            self._encoder = Encoder(3, num_hiddens,
                                    num_residual_layers,
                                    num_residual_hiddens)
            # 1x1 convolution mapping encoder features to the codebook dimension
            self._pre_vq_conv = nn.Conv2d(in_channels=num_hiddens,
                                          out_channels=embedding_dim,
                                          kernel_size=1,
                                          stride=1)
            # decay > 0 selects the EMA codebook update, otherwise the loss-based one
            if decay > 0.0:
                self._vq_vae = VectorQuantizerEMA(num_embeddings, embedding_dim,
                                                  commitment_cost, decay)
            else:
                self._vq_vae = VectorQuantizer(num_embeddings, embedding_dim,
                                               commitment_cost)
            self._decoder = Decoder(embedding_dim,
                                    num_hiddens,
                                    num_residual_layers,
                                    num_residual_hiddens)

        def forward(self, x):
            z = self._encoder(x)  # (256, 3, 32, 32) BCHW -> (256, 128, 8, 8) BCHW
            z = self._pre_vq_conv(z)  # (256, 128, 8, 8) BCHW -> (256, 64, 8, 8) BCHW
            codebook_loss, commitment_loss, quantized, perplexity, _ = self._vq_vae(z)
            x_recon = self._decoder(quantized)

            return codebook_loss, commitment_loss, x_recon, perplexity
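One way the four values returned by `Model.forward` might be combined in a training step is sketched below, following the three-term loss described earlier with MSE as the reconstruction loss. The function `training_step`, the optimizer settings and the `ToyModel` stand-in are all hypothetical: `ToyModel` only mimics the return signature so that the snippet runs on its own, and it is not a VQ-VAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def training_step(model, x, optimizer):
    """One optimization step for a model returning
    (codebook_loss, commitment_loss, x_recon, perplexity)."""
    optimizer.zero_grad()
    codebook_loss, commitment_loss, x_recon, perplexity = model(x)
    recon_loss = F.mse_loss(x_recon, x)  # MSE in place of -log p(x | z_q)
    loss = recon_loss + codebook_loss + commitment_loss
    loss.backward()
    optimizer.step()
    return loss.item(), recon_loss.item(), perplexity.item()


class ToyModel(nn.Module):
    """Hypothetical stand-in with the same return signature as Model."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        x_recon = self.conv(x)
        zero = x_recon.new_zeros(())  # placeholder for the two auxiliary losses
        return zero, zero, x_recon, torch.tensor(1.0)


torch.manual_seed(0)
model = ToyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(4, 3, 32, 32)  # random batch in place of real images
loss, recon, ppl = training_step(model, x, optimizer)
print(loss, recon, ppl)
```

Returning the reconstruction loss and the perplexity separately makes it easy to log reconstruction quality and codebook usage alongside the total loss.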