Negative Log-Likelihood as a Loss Function

Posted on 2024-08-14 Edited on 2026-06-11 In AI

TL;DR:

- For categorical outcomes (e.g., classification), the negative log-likelihood corresponds to the cross-entropy loss.
- For continuous outcomes (e.g., regression), assuming a Gaussian distribution, the negative log-likelihood corresponds to the Mean Squared Error (MSE) loss.

Notation

Symbol	Type	Explanation
\(p(x \mid \theta)\)	Function	Likelihood of data \(x\) under model parameters \(\theta\). In VQ-VAE, \(\theta\) corresponds to \(z\), the quantized latent variable.
\(x\)	\(\in \mathbb{R}^{H \times W \times C}\)	Observed data or input
\(\theta\)	\(\in \mathbb{R}^d\)	Parameters of the model. In VQ-VAE, \(\theta\) is often represented by the quantized latent variable \(z\).
\(z\)	\(\in \mathbb{R}^d\)	Latent representation in the model, serving as the effective model parameters \(\theta\) in visual generative models such as VQ-VAE.
\(f\)	Function	Decoder function in visual generative models such as VQ-VAE. \(f(z) = \hat x\).
\(\hat{x}\)	\(\in \mathbb{R}^{H \times W \times C}\)	Reconstructed image or output, equal to \(f(z)\)
\(\mathcal{N}(f(z), I)\)	Distribution	Gaussian distribution assumed for \(x\), with mean \(f(z)\) and identity covariance \(I\)
\(y\)	\(\in \{1, \dots, K\}\)	Actual label in classification
\(\hat{p}_y\)	\(\in [0, 1]\)	Model’s predicted probability for the true class \(y\)
\(K\)	\(\in \mathbb{N}\)	Number of classes in multi-class classification
\(y_k\)	\(\in \{0, 1\}\)	One-hot encoded true label for class \(k\)
\(\|\cdot\|_2^2\)	Function	Squared L2 norm. For a vector \(v=\left(v_1, v_2, \ldots, v_n\right)\), the squared L2 norm is \(\|v\|_2^2=v_1^2+v_2^2+\cdots+v_n^2\)

Likelihood Function

The likelihood function \(p(x \mid \theta)\) represents the probability of the observed data \(x\) under the model with parameters \(\theta\).

Negative Log-Likelihood and MSE

->Source

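If the conditional distribution of \(x\) given \(z\) is Gaussian, \(p(x \mid z) = \mathcal{N}\left(x; f(z), I\right)\), the log-likelihood is: \[ \log p(x \mid z) = -\frac{1}{2}\|x-f(z)\|_2^2 - \frac{n}{2}\log (2\pi) = -\frac{1}{2}\|x-\hat{x}\|_2^2 + \text{const} \] where \(\hat{x}=f(z)\) is the reconstructed image, which is the mean of the distribution, and \(n\) is the dimension of \(x\). The constant does not depend on \(z\), so maximizing the log-likelihood is the same as minimizing \(\|x-\hat{x}\|_2^2\).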

MSE is simply the squared \(L_2\) norm divided by the dimension \(n = H \times W \times C\), which is constant: \[ \operatorname{MSE}=\frac{1}{n}\|x-\hat{x}\|_2^2 \] So the negative log-likelihood equals \(\frac{n}{2} \operatorname{MSE}\) plus a constant, and minimizing it is equivalent to minimizing the MSE.
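The relationship can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `gaussian_nll` and `mse` are illustrative, and the image is assumed to be flattened into a list of \(n\) values. It evaluates the negative log-likelihood under \(\mathcal{N}(\hat{x}, I)\), including the normalizing constant \(\frac{n}{2}\log(2\pi)\), and compares it with \(\frac{n}{2}\operatorname{MSE}\) plus that constant.

```python
import math
import random


def gaussian_nll(x, x_hat):
    """Negative log-likelihood of x under N(x_hat, I); x and x_hat are flat lists."""
    n = len(x)
    sq_norm = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # -log p(x | z) = 0.5 * ||x - x_hat||^2 + (n / 2) * log(2 * pi)
    return 0.5 * sq_norm + 0.5 * n * math.log(2 * math.pi)


def mse(x, x_hat):
    """Mean squared error: squared L2 norm divided by the dimension n."""
    n = len(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n


random.seed(0)
n = 12  # assumed toy size, e.g. H=2, W=2, C=3 flattened
x = [random.random() for _ in range(n)]
x_hat = [random.random() for _ in range(n)]

const = 0.5 * n * math.log(2 * math.pi)  # does not depend on x_hat
nll = gaussian_nll(x, x_hat)
rebuilt = 0.5 * n * mse(x, x_hat) + const

print(f"NLL              = {nll:.6f}")
print(f"n/2 * MSE + const = {rebuilt:.6f}")
```

The two printed values agree up to floating-point error, which is why a reconstruction that lowers the MSE also lowers the negative log-likelihood under this unit-variance assumption.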

Negative Log-Likelihood and Cross-Entropy

For multi-class classification, the log-likelihood for a single observation is: \[ \log p(y \mid \theta, x)=\log \hat{p}_y \]

Here, \(\hat{p}_y\) is the predicted probability of the true class \(y\).

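The cross-entropy loss for a single observation, with one-hot labels \(y_k\) and predicted class probabilities \(\hat{p}_k\), is \[ \text { Cross-Entropy }=-\sum_{k=1}^{K} y_k \log \hat{p}_k=-\log \hat{p}_y \] because \(y_k = 1\) only for the true class \(k = y\).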

Thus, the cross-entropy loss is exactly the negative log-likelihood of the true class: \[ \text { Cross-Entropy }=-\log \hat{p}_y=-\log p(y \mid \theta, x) \]
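As a quick check of this identity, here is a minimal sketch in plain Python. The logits are made-up values, the predicted probabilities are assumed to come from a softmax, and the helper names are illustrative. It computes the cross-entropy from the one-hot labels \(y_k\) and compares it with \(-\log \hat{p}_y\) computed directly.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """-sum_k y_k * log(p_k); terms with y_k = 0 contribute nothing and are skipped."""
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(one_hot, probs) if y_k > 0)


def nll_true_class(y, probs):
    """Negative log-likelihood of the true class index y."""
    return -math.log(probs[y])


logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for K = 3 classes
probs = softmax(logits)
y = 0  # true class, zero-based here (the text indexes classes from 1 to K)
one_hot = [1 if k == y else 0 for k in range(len(probs))]

ce = cross_entropy(one_hot, probs)
nll = nll_true_class(y, probs)

print(f"cross-entropy = {ce:.6f}")
print(f"-log p_y      = {nll:.6f}")
```

Because the one-hot vector zeroes out every term except the true class, the sum collapses to the single term \(-\log \hat{p}_y\) and the two printed values coincide.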