Cross Entropy Loss
Sources:
- Andrej Karpathy's video Building makemore Part 4: Becoming a Backprop Ninja.
Cross entropy
The cross entropy loss is defined as:

$$L = -\sum_{i=1}^{C} y_i \log(p_i)$$

where:
- $C$ is the number of classes
- $y_i$ is the ground truth label for class $i$ (typically one-hot encoded)
- $p_i$ is the predicted probability for class $i$
The predicted probabilities are usually obtained by applying the softmax function to the model's logits $z$:

$$p_i = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}$$
For classification tasks where there is only one true class $t$ for a sample, i.e., $y_t = 1$ and $y_i = 0$ for all $i \neq t$, the loss simplifies to:

$$L = -\log(p_t)$$

Note that, in this case, the cross entropy loss is exactly the negative log-likelihood of the true class.
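As a quick sanity check, here is a minimal PyTorch sketch (the logits and targets below are made up for illustration) that computes the loss once via an explicit softmax and the $-\log(p_t)$ formula, and once via the built-in `F.cross_entropy`, which takes raw logits and class indices directly:

```python
import torch
import torch.nn.functional as F

# toy batch: 4 samples, 3 classes (made-up numbers for illustration)
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.2,  0.3],
                       [1.5, 1.5,  1.5],
                       [-0.7, 2.2, 0.0]])
targets = torch.tensor([0, 1, 2, 1])  # index of the true class per sample

# manual route: softmax over the logits, then -log p_t averaged over the batch
probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)  # softmax
manual_loss = -probs[torch.arange(4), targets].log().mean()

# library route: F.cross_entropy applies log-softmax + NLL internally
library_loss = F.cross_entropy(logits, targets)

print(manual_loss.item(), library_loss.item())  # the two values should match
```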
Derivative of cross entropy
Question: What is the gradient (more precisely, the derivative) of the cross-entropy loss w.r.t. a logit $z_j$?
To get the gradient, we need to differentiate the loss w.r.t. the logit $z_j$. Using the expression for $L$ and the chain rule:

$$\frac{\partial L}{\partial z_j} = -\sum_{i=1}^{C} y_i \frac{1}{p_i} \frac{\partial p_i}{\partial z_j}$$

Given $p_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}$, we need the derivative of the softmax output $p_i$ w.r.t. the logit $z_j$.
For the softmax, there are two cases to consider.

When $i = j$, the derivative of the softmax function w.r.t. $z_j$ is:

$$\frac{\partial p_i}{\partial z_j} = p_i (1 - p_j)$$

When $i \neq j$, the derivative of the softmax function w.r.t. $z_j$ is:

$$\frac{\partial p_i}{\partial z_j} = -p_i \, p_j$$
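The two cases combine into the Jacobian $\partial p_i / \partial z_j = p_i(\delta_{ij} - p_j)$. A minimal numerical check of this formula, using a made-up logit vector and PyTorch's autograd Jacobian for comparison:

```python
import torch

z = torch.tensor([1.0, -0.5, 2.0])  # made-up logits
p = torch.softmax(z, dim=0)

# analytic Jacobian: dp_i/dz_j = p_i * (delta_ij - p_j)
analytic = torch.diag(p) - torch.outer(p, p)

# Jacobian computed by autograd, for comparison
auto = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), z)

print(torch.allclose(analytic, auto))  # expected: True
```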
Putting it all together, the gradient of the loss w.r.t. $z_j$ is:

$$\frac{\partial L}{\partial z_j} = -y_j \, (1 - p_j) + \sum_{i \neq j} y_i \, p_j = p_j \sum_{i=1}^{C} y_i - y_j = p_j - y_j$$

where the last step uses $\sum_i y_i = 1$ for one-hot labels.
If $y_j = 1$ (for the true class):

$$\frac{\partial L}{\partial z_j} = p_j - 1$$

If $y_j = 0$ (for the other classes):

$$\frac{\partial L}{\partial z_j} = p_j$$
To conclude:

$$\frac{\partial L}{\partial z_j} = p_j - y_j$$

The gradient of the cross-entropy loss w.r.t. the logits is simply the predicted probabilities minus the one-hot encoded labels.
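In the spirit of the Backprop Ninja exercise, a small sketch (with made-up logits and targets) that checks the manual gradient against PyTorch's autograd. Since `F.cross_entropy` averages the loss over the batch by default, the manual gradient $p - y$ is divided by the batch size:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3, requires_grad=True)  # toy batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1])

loss = F.cross_entropy(logits, targets)  # mean reduction over the batch
loss.backward()

# manual gradient: (p - y) / N, where y is the one-hot encoding of the targets
probs = torch.softmax(logits, dim=1)
y_onehot = F.one_hot(targets, num_classes=3).float()
manual_grad = (probs - y_onehot) / logits.shape[0]

print(torch.allclose(logits.grad, manual_grad.detach()))  # expected: True
```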