Cross Entropy Loss

Cross entropy

The cross entropy loss $L$ of a sample is $L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),$

where $\hat{y}_i$ is the predicted probability for class $i$, usually obtained by applying the softmax function to the logit of class $i$, denoted $z_i$: $\hat{y}_i = \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}.$
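
As a concrete reference, here is a minimal NumPy sketch that computes the cross entropy of a single sample directly from these two formulas; the logits and target below are made-up example values, and the helper names are only for illustration:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    y_hat = softmax(z)             # predicted probabilities y_hat_i
    return -np.sum(y * np.log(y_hat))   # L = -sum_i y_i * log(y_hat_i)

z = np.array([2.0, 1.0, 0.1])      # example logits for C = 3 classes
y = np.array([1.0, 0.0, 0.0])      # example target distribution (one-hot here)
print(cross_entropy(z, y))         # prints roughly 0.417
```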

For classification tasks where there is only one true class per sample, the target is a one-hot vector $\mathbf{y} = (0, \ldots, 0, 1, 0, \ldots, 0)$. Suppose the index of the true class of the input data point is $t$. Then $L = -(0 + \cdots + 0 + 1 \cdot \log(\hat{y}_t) + 0 + \cdots + 0) = -\log(\hat{y}_t).$

Note that, in this case, $L$ depends only on $\hat{y}_t$: $\hat{y}_t = \frac{e^{z_t}}{\sum_{j=1}^{C} e^{z_j}}.$
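
A quick sanity check of this simplification, again as a small NumPy sketch with made-up logits and true-class index $t = 0$:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])           # example logits
t = 0                                   # index of the true class
y_hat = np.exp(z) / np.exp(z).sum()     # softmax probabilities
y = np.eye(len(z))[t]                   # one-hot target with a 1 at index t

full_sum = -np.sum(y * np.log(y_hat))   # L = -sum_i y_i * log(y_hat_i)
single_term = -np.log(y_hat[t])         # L = -log(y_hat_t)
assert np.isclose(full_sum, single_term)
```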

Derivation of cross entropy

Question: what is the gradient (more precisely, the derivative) of the cross-entropy loss w.r.t. a logit $z_i$?

To get the gradient, we need to differentiate the loss $L$ with respect to $z_i$.

Since $L$ depends on the logits only through $\hat{y}_t$, the chain rule gives $\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial \hat{y}_t} \cdot \frac{\partial \hat{y}_t}{\partial z_i}.$

Given $L = -\log(\hat{y}_t)$, we have $\frac{\partial L}{\partial \hat{y}_t} = -\frac{1}{\hat{y}_t}.$

For $\frac{\partial \hat{y}_t}{\partial z_i}$, there are two cases to consider (a numerical check of both follows the list):

  1. When $i = t$, the derivative of the softmax function w.r.t. $z_i$ is: $\frac{\partial \hat{y}_t}{\partial z_i} = \frac{e^{z_i} \sum_{j=1}^{C} e^{z_j} - e^{z_i} \cdot e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \left(1 - \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}\right) = \hat{y}_t (1 - \hat{y}_t).$

  2. When $i \neq t$, the derivative of the softmax function w.r.t. $z_i$ is: $\frac{\partial \hat{y}_t}{\partial z_i} = \frac{0 \cdot \sum_{j=1}^{C} e^{z_j} - e^{z_t} \cdot e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = -\frac{e^{z_t}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = -\hat{y}_t \hat{y}_i.$
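
The following sketch compares the two analytic expressions above against a central finite-difference estimate of $\frac{\partial \hat{y}_t}{\partial z_i}$; the logits, the true-class index, and the helper names are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # example logits
t, eps = 0, 1e-6                # true class index and finite-difference step
y_hat = softmax(z)

for i in range(len(z)):
    # analytic derivative of y_hat[t] w.r.t. z[i]: the two cases derived above
    analytic = y_hat[t] * (1 - y_hat[t]) if i == t else -y_hat[t] * y_hat[i]
    # numeric derivative via central differences
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric = (softmax(z_plus)[t] - softmax(z_minus)[t]) / (2 * eps)
    assert np.isclose(analytic, numeric, atol=1e-6)
```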

Putting it all together, the gradient of the loss w.r.t. $z_i$ is:

  1. If $i = t$ (for the true class): $\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_t} \cdot \hat{y}_t (1 - \hat{y}_t) = \hat{y}_t - 1 = \mathrm{softmax}(z_i) - 1.$

  2. If $i \neq t$ (for the other classes): $\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_t} \cdot \left(-\hat{y}_t \hat{y}_i\right) = \hat{y}_i = \mathrm{softmax}(z_i).$

To conclude: $\frac{\partial L}{\partial z_i} = \begin{cases} \mathrm{softmax}(z_i) - 1, & i = t \\ \mathrm{softmax}(z_i), & i \neq t \end{cases}$
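
Finally, a minimal NumPy sketch (made-up logits, true class $t = 0$) that checks this closed form against a finite-difference gradient of $L(\mathbf{z}) = -\log(\mathrm{softmax}(\mathbf{z})_t)$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, t):
    return -np.log(softmax(z)[t])   # L = -log(y_hat_t)

z = np.array([2.0, 1.0, 0.1])       # example logits
t, eps = 0, 1e-6                    # true class index and finite-difference step

# closed form: softmax(z_i) - 1 at i = t, softmax(z_i) elsewhere
analytic = softmax(z) - np.eye(len(z))[t]

# finite-difference gradient, one logit at a time
numeric = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (loss(z_plus, t) - loss(z_minus, t)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```

In vector form the result is $\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}$, the predicted probabilities minus the one-hot target, which is one reason softmax and cross entropy are commonly implemented together as a single operation.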