Negative Log-Likelihood as a Loss Function

TL;DR:

- For categorical outcomes (e.g., classification), the negative log-likelihood corresponds to the cross-entropy loss.
- For continuous outcomes (e.g., regression), assuming a Gaussian distribution, the negative log-likelihood corresponds to the Mean Squared Error (MSE) loss.

Notation

| Symbol | Type | Explanation |
| --- | --- | --- |
| \(p(x \mid \theta)\) | Function | Likelihood of data \(x\) under model parameters \(\theta\). In VQ-VAE, \(\theta\) corresponds to \(z\), the quantized latent variable. |
| \(x\) | \(\in \mathbb{R}^{H \times W \times C}\) | Observed data or input |
| \(\theta\) | \(\in \mathbb{R}^d\) | Parameters of the model. In VQ-VAE, \(\theta\) is often represented by the quantized latent variable \(z\). |
| \(z\) | \(\in \mathbb{R}^d\) | Latent representation in the model, serving as the effective model parameters \(\theta\) in visually generative models such as VQ-VAE. |
| \(f\) | Function | Decoder function in visually generative models such as VQ-VAE; \(f(z) = \hat{x}\). |
| \(\hat{x}\) | \(\in \mathbb{R}^{H \times W \times C}\) | Reconstructed image or output, equal to \(f(z)\) |
| \(\mathcal{N}(f(z), I)\) | Distribution | Assumed distribution of \(x\) with mean \(f(z)\) and identity covariance \(I\) |
| \(y\) | \(\in \{1, \dots, K\}\) | True label in classification |
| \(\hat{p}_y\) | \(\in [0, 1]\) | Model's predicted probability for the true class \(y\) |
| \(K\) | \(\in \mathbb{N}\) | Number of classes in multi-class classification |
| \(y_k\) | \(\in \{0, 1\}\) | One-hot encoded true label for class \(k\) |
| \(\Vert\cdot\Vert_2^2\) | Function | Squared L2 norm. For a vector \(v = (v_1, v_2, \ldots, v_n)\), the squared L2 norm is \(\Vert v\Vert_2^2 = v_1^2 + v_2^2 + \cdots + v_n^2\) |

Likelihood Function

The likelihood function \(p(x \mid \theta)\) represents the probability of the observed data \(x\) under the model with parameters \(\theta\).
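As a concrete illustration (not part of the original derivation), here is a minimal sketch, assuming NumPy and SciPy are available, that evaluates the likelihood and log-likelihood of a few i.i.d. observations under a simple univariate Gaussian model; all values are arbitrary:

```python
# Minimal sketch: likelihood p(x | theta) under a univariate Gaussian model.
# The data and parameter values here are illustrative assumptions.
import numpy as np
from scipy.stats import norm

x = np.array([0.9, 1.1, 1.0])   # observed data x
mu, sigma = 1.0, 0.5            # model parameters theta = (mu, sigma)

# Per-observation likelihoods; the joint likelihood assumes i.i.d. data.
likelihoods = norm.pdf(x, loc=mu, scale=sigma)
joint_likelihood = likelihoods.prod()
log_likelihood = np.log(likelihoods).sum()

print(joint_likelihood, log_likelihood)
```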

Negative Log-Likelihood and MSE


If the conditional distribution of \(x\) given \(z\) is Gaussian with identity covariance, \(x \mid z \sim \mathcal{N}\left(f(z), I\right)\), the log-likelihood is: \[ \log p(x \mid z) = -\frac{1}{2}\|x-f(z)\|_2^2 - \frac{n}{2}\log(2\pi) \] where \(\hat{x}=f(z)\) is the reconstructed image, representing the mean of the distribution. Up to the additive constant and the factor \(\frac{1}{2}\), maximizing the log-likelihood means minimizing \(\|x-\hat{x}\|_2^2\).
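To make this concrete, the small NumPy check below (an illustrative sketch; random tensors stand in for \(x\) and \(\hat{x} = f(z)\)) verifies that the Gaussian log-density with identity covariance matches the closed form above:

```python
# Sketch: the Gaussian log-likelihood with identity covariance is
# -1/2 ||x - x_hat||^2 plus a constant that does not depend on z.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 3))       # stand-in for an observed image x
x_hat = rng.normal(size=(4, 4, 3))   # stand-in for the reconstruction f(z)

n = x.size
# With identity covariance, the joint log-density factorizes per pixel.
log_lik = norm.logpdf(x, loc=x_hat, scale=1.0).sum()

sq_l2 = np.sum((x - x_hat) ** 2)     # ||x - x_hat||_2^2
const = -0.5 * n * np.log(2 * np.pi)

assert np.isclose(log_lik, -0.5 * sq_l2 + const)
```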

The MSE is simply the squared \(L_2\) norm divided by the dimension \(n = H \times W \times C\), which is constant: \[ \operatorname{MSE}=\frac{1}{n}\|x-\hat{x}\|_2^2 \] Since the negative log-likelihood and the MSE differ only by positive constants, minimizing the negative log-likelihood is equivalent to minimizing the MSE.
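As a sanity check, the toy snippet below (illustrative values only) shows that the two losses rank a set of candidate reconstructions identically, so they share the same minimizer:

```python
# Sketch: up to positive constants, NLL and MSE order candidate
# reconstructions the same way, so minimizing one minimizes the other.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=12)
candidates = [x + rng.normal(scale=s, size=12) for s in (0.1, 0.5, 1.0)]

n = x.size
mse = [np.mean((x - c) ** 2) for c in candidates]
nll = [0.5 * np.sum((x - c) ** 2) + 0.5 * n * np.log(2 * np.pi)
       for c in candidates]

# Both losses pick the same best candidate.
assert np.argmin(mse) == np.argmin(nll)
```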

Negative Log-Likelihood and Cross-Entropy

For multi-class classification, the log-likelihood for a single observation is: \[ \log p(y \mid \theta, x)=\log \hat{p}_y \]

Here, \(\hat{p}_y\) is the predicted probability of the true class \(y\).

The cross-entropy loss for a single observation compares the one-hot label with the predicted class probabilities: \[ \text{Cross-Entropy} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_y \] where the sum collapses because \(y_k = 1\) only for the true class \(k = y\).

Thus, the cross-entropy loss is exactly the negative log-likelihood of the true class: \[ \text{Cross-Entropy} = -\log \hat{p}_y = -\log p(y \mid \theta, x) \]
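The identity can be checked directly; the snippet below (with an arbitrary predicted distribution \(\hat{p}\) and label \(y\)) computes the one-hot cross-entropy and the negative log-likelihood of the true class and confirms they match:

```python
# Sketch: for a one-hot label, cross-entropy reduces to -log p_hat_y.
import numpy as np

K = 4
p_hat = np.array([0.1, 0.6, 0.2, 0.1])   # predicted class probabilities
y = 1                                     # index of the true class

y_onehot = np.eye(K)[y]                   # one-hot labels y_k

cross_entropy = -np.sum(y_onehot * np.log(p_hat))   # -sum_k y_k log p_hat_k
nll = -np.log(p_hat[y])                              # -log p_hat_y

assert np.isclose(cross_entropy, nll)
```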