Cross Entropy Loss

Cross entropy

The cross entropy loss $L$ of a sample is $L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),$

where $\hat{y}_i$ is the predicted probability for class $i$, usually obtained by applying the softmax function to the logit of class $i$, denoted $z_i$: $\hat{y}_i = \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}.$
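
As a concrete reference, here is a minimal NumPy sketch that computes the cross entropy of a single sample directly from these two formulas; the logits and target below are made-up example values, and the helper names are only for illustration:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    y_hat = softmax(z)             # predicted probabilities y_hat_i
    return -np.sum(y * np.log(y_hat))   # L = -sum_i y_i * log(y_hat_i)

z = np.array([2.0, 1.0, 0.1])      # example logits for C = 3 classes
y = np.array([1.0, 0.0, 0.0])      # example target distribution (one-hot here)
print(cross_entropy(z, y))         # prints roughly 0.417
```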

For classification tasks where there is only one true class per sample, the target is a one-hot vector $\mathbf{y} = (0, \ldots, 0, 1, 0, \ldots, 0)$. Suppose the index of the true class of the input data point is $t$. Then $L = -(0 + \cdots + 0 + 1 \cdot \log(\hat{y}_t) + 0 + \cdots + 0) = -\log(\hat{y}_t).$

Note that, in this case, $L$ depends only on $\hat{y}_t$: $\hat{y}_t = \frac{e^{z_t}}{\sum_{j=1}^{C} e^{z_j}}.$
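
A quick sanity check of this simplification, again as a small NumPy sketch with made-up logits and true-class index $t = 0$:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])           # example logits
t = 0                                   # index of the true class
y_hat = np.exp(z) / np.exp(z).sum()     # softmax probabilities
y = np.eye(len(z))[t]                   # one-hot target with a 1 at index t

full_sum = -np.sum(y * np.log(y_hat))   # L = -sum_i y_i * log(y_hat_i)
single_term = -np.log(y_hat[t])         # L = -log(y_hat_t)
assert np.isclose(full_sum, single_term)
```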

Derivation of cross entropy

Question: what is the gradient (more precisely, the derivative) of the cross-entropy loss w.r.t. a logit $z_i$?

To get the gradient, we need to differentiate the loss $L$ with respect to $z_i$.

Since $L$ depends on the logits only through $\hat{y}_t$, the chain rule gives $\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial \hat{y}_t} \cdot \frac{\partial \hat{y}_t}{\partial z_i}.$

Given $L = -\log(\hat{y}_t)$, we have $\frac{\partial L}{\partial \hat{y}_t} = -\frac{1}{\hat{y}_t}.$

For $\frac{\partial \hat{y}_t}{\partial z_i}$, there are two cases to consider (a numerical check of both follows the list):

  1. When $i = t$, the derivative of the softmax function w.r.t. $z_i$ is: $\frac{\partial \hat{y}_t}{\partial z_i} = \frac{e^{z_i} \sum_{j=1}^{C} e^{z_j} - e^{z_i} \cdot e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} \left(1 - \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}\right) = \hat{y}_t (1 - \hat{y}_t).$

  2. When $i \neq t$, the derivative of the softmax function w.r.t. $z_i$ is: $\frac{\partial \hat{y}_t}{\partial z_i} = \frac{0 \cdot \sum_{j=1}^{C} e^{z_j} - e^{z_t} \cdot e^{z_i}}{\left(\sum_{j=1}^{C} e^{z_j}\right)^2} = -\frac{e^{z_t}}{\sum_{j=1}^{C} e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} = -\hat{y}_t \hat{y}_i.$
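
The following sketch compares the two analytic expressions above against a central finite-difference estimate of $\frac{\partial \hat{y}_t}{\partial z_i}$; the logits, the true-class index, and the helper names are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # example logits
t, eps = 0, 1e-6                # true class index and finite-difference step
y_hat = softmax(z)

for i in range(len(z)):
    # analytic derivative of y_hat[t] w.r.t. z[i]: the two cases derived above
    analytic = y_hat[t] * (1 - y_hat[t]) if i == t else -y_hat[t] * y_hat[i]
    # numeric derivative via central differences
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric = (softmax(z_plus)[t] - softmax(z_minus)[t]) / (2 * eps)
    assert np.isclose(analytic, numeric, atol=1e-6)
```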

Putting it all together, the gradient of the loss w.r.t. $z_i$ is:

  1. If $i = t$ (for the true class): $\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_t} \cdot \hat{y}_t (1 - \hat{y}_t) = \hat{y}_t - 1 = \mathrm{softmax}(z_i) - 1.$

  2. If $i \neq t$ (for the other classes): $\frac{\partial L}{\partial z_i} = -\frac{1}{\hat{y}_t} \cdot \left(-\hat{y}_t \hat{y}_i\right) = \hat{y}_i = \mathrm{softmax}(z_i).$

To conclude: $\frac{\partial L}{\partial z_i} = \begin{cases} \mathrm{softmax}(z_i) - 1, & i = t \\ \mathrm{softmax}(z_i), & i \neq t \end{cases}$
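
Finally, a minimal NumPy sketch (made-up logits, true class $t = 0$) that checks this closed form against a finite-difference gradient of $L(\mathbf{z}) = -\log(\mathrm{softmax}(\mathbf{z})_t)$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, t):
    return -np.log(softmax(z)[t])   # L = -log(y_hat_t)

z = np.array([2.0, 1.0, 0.1])       # example logits
t, eps = 0, 1e-6                    # true class index and finite-difference step

# closed form: softmax(z_i) - 1 at i = t, softmax(z_i) elsewhere
analytic = softmax(z) - np.eye(len(z))[t]

# finite-difference gradient, one logit at a time
numeric = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (loss(z_plus, t) - loss(z_minus, t)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```

In vector form the result is $\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}$, the predicted probabilities minus the one-hot target, which is one reason softmax and cross entropy are commonly implemented together as a single operation.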