Cross Entropy Loss

Sources:

  • Andrej Karpathy's video Building makemore Part 4: Becoming a Backprop Ninja.

Cross entropy

The cross entropy loss \(L\) of a sample is \[ L = -\sum_{i=1}^C y_{i} \log \left(\hat y_{i}\right) . \]

where \(C\) is the number of classes and \(\hat{y}_i\) is the predicted probability for class \(i\), usually obtained by applying the softmax function to the logits, with \(z_i\) denoting the logit of class \(i\): \[ \hat y_i = \text{softmax}(z_i) =\frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}} . \]
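
As a quick sanity check, here is a minimal sketch of computing this loss directly from the definition. It assumes PyTorch; the logits and label values are made up for illustration:

```python
import torch

# Made-up example: C = 3 classes, a single sample (all values are arbitrary).
logits = torch.tensor([2.0, 1.0, 0.1])   # z_1, ..., z_C
y = torch.tensor([0.0, 1.0, 0.0])        # true distribution (one-hot here)

y_hat = torch.softmax(logits, dim=0)     # \hat{y}_i = e^{z_i} / sum_j e^{z_j}
loss = -(y * torch.log(y_hat)).sum()     # L = -sum_i y_i * log(\hat{y}_i)
print(y_hat, loss)
```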

For classification tasks where each sample has exactly one true class, the label vector \(y\) is one-hot, i.e., \(y = (0, \cdots, 0, 1, 0, \cdots, 0)\). Suppose the index of the true class of the input data point is \(t\), so \(y_t = 1\) and \(y_i = 0\) for \(i \neq t\); then \[ L=- \left(0 + \cdots + 0+1 \cdot \log (\hat y_t) + 0 + \cdots+0\right) = - \log (\hat y_t) . \]

Note that, in this case, \(L\) depends only on \(\hat y_t\): \[ \hat{y}_t=\frac{e^{z_t}}{\sum_{j=1}^C e^{z_j}} . \]
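
Since only \(\hat y_t\) matters, the whole sum collapses to a single lookup. The sketch below (again assuming PyTorch and the same made-up logits) compares this shortcut with the library's built-in cross entropy, which consumes raw logits and the class index:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])   # same made-up logits as above
t = 1                                    # index of the true class

y_hat = torch.softmax(logits, dim=0)
manual = -torch.log(y_hat[t])            # L = -log(\hat{y}_t)

# F.cross_entropy takes raw logits (batch-shaped) and the class index,
# and applies log-softmax internally, so the two values should agree.
builtin = F.cross_entropy(logits.unsqueeze(0), torch.tensor([t]))
print(manual.item(), builtin.item())
```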

Gradient of cross entropy

Question: What is the gradient (more precisely, the partial derivative) of the cross-entropy loss w.r.t. a logit \(z_i\)?

To get the gradient, we differentiate the loss \(L\) with respect to each logit \(z_i\).

Since \(L\) depends on \(z_i\) only through \(\hat{y}_t\), the chain rule gives: \[ \frac{\partial L}{\partial z_i}=\frac{\partial L}{\partial \hat{y}_t} \cdot \frac{\partial \hat{y}_t}{\partial z_i} . \]

Given \(L=-\log \left(\hat{y}_t\right)\), we have: \[ \frac{\partial L}{\partial \hat{y}_t}=-\frac{1}{\hat{y}_t} \]

For \(\frac{\partial \hat{y}_t}{\partial z_i}\), there are two cases to consider (both are checked numerically in the sketch after this list):

  1. When \(i=t\), differentiating \(\hat{y}_t\) w.r.t. \(z_t\) with the quotient rule gives: \[ \frac{\partial \hat{y}_t}{\partial z_t}=\frac{e^{z_t} \sum_{j=1}^C e^{z_j}-e^{z_t} e^{z_t}}{\left(\sum_{j=1}^C e^{z_j}\right)^2}=\frac{e^{z_t}}{\sum_{j=1}^C e^{z_j}}\left(1-\frac{e^{z_t}}{\sum_{j=1}^C e^{z_j}}\right)=\hat{y}_t\left(1-\hat{y}_t\right) . \]

  2. When \(i \neq t\), the numerator \(e^{z_t}\) does not depend on \(z_i\), so: \[ \frac{\partial \hat{y}_t}{\partial z_i}=\frac{0 \cdot \sum_{j=1}^C e^{z_j}-e^{z_t} e^{z_i}}{\left(\sum_{j=1}^C e^{z_j}\right)^2}=-\frac{e^{z_t}}{\sum_{j=1}^C e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}}=-\hat{y}_t \hat{y}_i . \]
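
Taken together, the two cases form the Jacobian of the softmax: \(\hat y_t(1-\hat y_t)\) on the diagonal and \(-\hat y_t \hat y_i\) off the diagonal. The following sketch (assuming PyTorch and random made-up logits) checks this closed form against the Jacobian computed by autograd:

```python
import torch

# Random made-up logits, C = 4 classes; double precision for a tight comparison.
z = torch.randn(4, dtype=torch.double)
y_hat = torch.softmax(z, dim=0)

# Closed form from the two cases:
#   d(y_hat_t)/dz_i = y_hat_t * (1 - y_hat_t)  if i == t  (diagonal)
#   d(y_hat_t)/dz_i = -y_hat_t * y_hat_i       if i != t  (off-diagonal)
analytic = torch.diag(y_hat) - torch.outer(y_hat, y_hat)

# Jacobian of softmax computed by autograd; entry [t, i] = d(y_hat_t)/dz_i.
auto_jac = torch.autograd.functional.jacobian(lambda z: torch.softmax(z, dim=0), z)

print(torch.allclose(analytic, auto_jac))  # expected: True
```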

Putting it all together, the gradient of the loss w.r.t. \(z_i\) is:

  1. If \(i=t\) (for the true class): \[ \frac{\partial L}{\partial z_i}=-\frac{1}{\hat{y}_t} \cdot \hat{y}_t\left(1-\hat{y}_t\right)=\hat{y}_t-1=\text{softmax}(z_i)-1 . \]

  2. If \(i \neq t\) (for the other classes): \[ \frac{\partial L}{\partial z_i}=-\frac{1}{\hat{y}_t} \cdot\left(-\hat{y}_t \hat{y}_i\right)=\hat{y}_i=\text{softmax}(z_i) . \]

To conclude: \[ \frac{\partial L}{\partial z_i}= \begin{cases}\operatorname{softmax}\left(z_i\right)-1, & i=t \\ \operatorname{softmax}\left(z_i\right), & i \neq t\end{cases} \]
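
To double-check the result, the sketch below (assuming PyTorch, a made-up class count, and random logits) compares this closed-form gradient against the gradient PyTorch obtains by backpropagating through its built-in cross entropy:

```python
import torch
import torch.nn.functional as F

# Made-up setup: C = 5 classes, random logits, true class t = 2.
z = torch.randn(5, dtype=torch.double, requires_grad=True)
t = 2

# Backpropagate through PyTorch's built-in cross entropy (batch of one sample).
loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([t]))
loss.backward()

# Closed-form gradient from the case analysis: softmax(z) with 1 subtracted at t.
expected = torch.softmax(z.detach(), dim=0)
expected[t] -= 1.0

print(torch.allclose(z.grad, expected))  # expected: True
```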