# Negative Log-Likelihood as a Loss Function

TL;DR:

- For categorical outcomes (e.g., classification), the negative log-likelihood corresponds to the cross-entropy loss.
- For continuous outcomes (e.g., regression), assuming a Gaussian distribution, the negative log-likelihood corresponds to the Mean Squared Error (MSE) loss.

# Notation

Symbol | Type | Explanation |
---|---|---|
\(p(x \mid \theta)\) | Function | Likelihood of data \(x\) under model parameters \(\theta\). In VQ-VAE, \(\theta\) corresponds to \(z\), the quantized latent variable. |
\(x\) | \(\in \mathbb{R}^{H \times W \times C}\) | Observed data or input |
\(\theta\) | \(\in \mathbb{R}^d\) | Parameters of the model. In VQ-VAE, \(\theta\) is often represented by the quantized latent variable \(z\). |
\(z\) | \(\in \mathbb{R}^d\) | Latent representation in the model, playing the role of the model parameters \(\theta\) in visual generative models such as VQ-VAE. |
\(f\) | Function | Decoder function in visual generative models such as VQ-VAE. \(f(z) = \hat{x}\). |
\(\hat{x}\) | \(\in \mathbb{R}^{H \times W \times C}\) | Reconstructed image or output, equal to \(f(z)\) |
\(\mathcal{N}(f(z), I)\) | Distribution | Assumed Gaussian distribution of \(x\), with mean \(f(z)\) and identity covariance \(I\) |
\(y\) | \(\in \{1, \dots, K\}\) | Actual label in classification |
\(\hat{p}_y\) | \(\in [0, 1]\) | Model's predicted probability for the true class \(y\) |
\(K\) | \(\in \mathbb{N}\) | Number of classes in multi-class classification |
\(y_k\) | \(\in \{0, 1\}\) | One-hot encoded true label for class \(k\) |
\(\lVert \cdot \rVert_2^2\) | Function | Squared L2 norm. For a vector \(v=\left(v_1, v_2, \ldots, v_n\right)\), the squared L2 norm is \(\lVert v \rVert_2^2=v_1^2+v_2^2+\cdots+v_n^2\) |

# Likelihood Function

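The likelihood function \(p(x \mid \theta)\) gives the probability (or, for continuous data, the probability density) of the observed data \(x\) under the model with parameters \(\theta\). The data \(x\) is held fixed and the likelihood is read as a function of \(\theta\): parameters that make the observed data more probable have a higher likelihood, and therefore a lower negative log-likelihood \(-\log p(x \mid \theta)\).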
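As a small illustration of reading the likelihood as a function of \(\theta\), the sketch below evaluates a scalar Gaussian density with unit variance at one fixed observation for two candidate means. The function name `gaussian_likelihood` and the numbers are illustrative assumptions, not part of any particular model.

```python
import math

def gaussian_likelihood(x, theta):
    """Density of a scalar x under N(theta, 1); theta is the assumed mean."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

x = 1.0  # fixed observation (hypothetical value)
for theta in (1.0, 3.0):  # two candidate parameter values
    lik = gaussian_likelihood(x, theta)
    nll = -math.log(lik)
    print(f"theta={theta}: likelihood={lik:.4f}, NLL={nll:.4f}")
```

The parameter closer to the observation yields the higher likelihood and the lower negative log-likelihood, which is why minimizing the negative log-likelihood is used as a training objective.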

# Negative Log-Likelihood and MSE


If the conditional distribution of \(x\) given \(z\) is Gaussian, \(x \mid z \sim \mathcal{N}\left(f(z), I\right)\), the log-likelihood is: \[ \log p(x \mid z) = -\frac{1}{2}\|x-f(z)\|_2^2 + C = -\frac{1}{2}\|x-\hat{x}\|_2^2 + C \] where \(\hat{x}=f(z)\) is the reconstructed image, which is the mean of the distribution, and \(C\) is a constant that depends only on the dimension of \(x\), not on \(z\). Maximizing the log-likelihood is therefore the same as minimizing \(\|x-\hat{x}\|_2^2\).

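MSE is simply the squared \(L_2\) norm divided by the dimension \(n = H \times W \times C\), which is constant: \[ \operatorname{MSE}=\frac{1}{n}\|x-\hat{x}\|_2^2 \] So the negative log-likelihood equals \(\frac{n}{2}\operatorname{MSE}\) plus a constant that does not depend on \(z\), and minimizing the negative log-likelihood is equivalent to minimizing the MSE.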
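The relationship can be checked numerically. The following is a minimal sketch, assuming flattened vectors and identity covariance; it compares the exact Gaussian negative log-likelihood with \(\frac{n}{2}\operatorname{MSE} + \frac{n}{2}\log(2\pi)\). The helper names and the random test vectors are hypothetical.

```python
import math
import random

def gaussian_nll(x, x_hat):
    """Exact negative log-density of x under N(x_hat, I) for flat lists."""
    n = len(x)
    squared_l2 = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 0.5 * squared_l2 + 0.5 * n * math.log(2 * math.pi)

def mse(x, x_hat):
    """Mean squared error: squared L2 norm divided by the dimension n."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

random.seed(0)
n = 12  # e.g. a tiny 2 x 2 x 3 "image", flattened (hypothetical size)
x = [random.gauss(0, 1) for _ in range(n)]
x_hat = [random.gauss(0, 1) for _ in range(n)]

constant = 0.5 * n * math.log(2 * math.pi)  # does not depend on x_hat
nll = gaussian_nll(x, x_hat)
scaled_mse = 0.5 * n * mse(x, x_hat) + constant
print(abs(nll - scaled_mse) < 1e-9)
```

Because the scale \(\frac{n}{2}\) and the offset are fixed, both objectives share the same minimizer; only the magnitude of the loss and its gradients differs.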

# Negative Log-Likelihood and Cross-Entropy

For multi-class classification, the log-likelihood for a single observation is: \[ \log p(y \mid \theta, x)=\log \hat{p}_y \]

Here, \(\hat{p}_y\) is the predicted probability of the true class \(y\).

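The cross-entropy loss for a single observation, with one-hot label \(y_k\) and predicted probability \(\hat{p}_k\) for class \(k\), is \[ \text { Cross-Entropy }=-\sum_{k=1}^{K} y_k \log \hat{p}_k=-\log \hat{p}_y \] since \(y_k = 1\) only for the true class \(k = y\) and is \(0\) otherwise.

Thus, the cross-entropy loss is exactly the negative log-likelihood of the true class: \[ \text { Cross-Entropy }=-\log \hat{p}_y=-\log p(y \mid \theta, x) \]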
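A quick numerical check, as a sketch: the cross-entropy written as a sum over one-hot labels, \(-\sum_k y_k \log \hat{p}_k\), gives the same value as \(-\log \hat{p}_y\). The logits below are made-up values, the softmax is an assumed way of producing the probabilities, and the class index is 0-based here, whereas the notation above uses \(y \in \{1, \dots, K\}\).

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(one_hot, probs))

def nll_true_class(y, probs):
    """Negative log-likelihood of the true class index y (0-based)."""
    return -math.log(probs[y])

logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for K = 3 classes
y = 0                      # true class (0-based index)
probs = softmax(logits)
one_hot = [1 if k == y else 0 for k in range(len(logits))]

print(abs(cross_entropy(one_hot, probs) - nll_true_class(y, probs)) < 1e-12)
```

Only the term for the true class survives the sum, so the two functions agree for any valid probability vector.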