Common Loss Functions
- Cross entropy
- Focal loss
- Hinge loss
- ...
Notations
To make the context clear, we adopt the following notational conventions in this article:
- The size of the dataset, i.e., the number of datapoints, is denoted as $N$.
- The datapoint of index $i$ is $x_i$. Its last dimension, the ground truth value, is denoted as $y_i$.
- The predicted datapoint output by the model, corresponding to $x_i$, is $\hat{x}_i$. Its last dimension is $\hat{y}_i$.
- In other words, the predicted value of the $i$-th datapoint output by the model is $\hat{y}_i$.
- When the task is a classification task, the ground truth value is often a 0/1 label, i.e., $y_i \in \{0, 1\}$.
- When the task is a classification task, the number of classes is $C$. In addition, the ground truth label value and the prediction value of class $c$ of the $i$-th datapoint are $y_{i,c}$ and $\hat{y}_{i,c}$, respectively.
Mean Squared Error
Mean squared error (MSE) loss is

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
```python
import torch
```
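As a quick illustration, here is a minimal sketch of the formula above in PyTorch; the tensor values are made up, and `torch.nn.functional.mse_loss` is the built-in counterpart:

```python
import torch
import torch.nn.functional as F

# Made-up ground truth and predictions, for illustration only.
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

manual  = torch.mean((y_true - y_pred) ** 2)  # (1/N) * sum of squared errors
builtin = F.mse_loss(y_pred, y_true)          # PyTorch's built-in MSE
print(manual.item(), builtin.item())          # both print 0.375
```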
Cross Entropy
For the gradient of Cross Entropy (and its derivation), please refer to ->this article.
Cross-entropy loss (often abbreviated as CE), or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1 (usually produced by a softmax function).
The cross entropy loss is

$$\mathrm{CE} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\left(\hat{y}_{i,c}\right)$$

where $y_{i,c} \in \{0, 1\}$ indicates whether datapoint $i$ belongs to class $c$, and $\hat{y}_{i,c}$ is the predicted probability of class $c$ for datapoint $i$.
```python
import torch
```
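A minimal sketch of this computation in PyTorch, with made-up logits and labels; `F.cross_entropy` applies the softmax and negative log-likelihood internally, so the manual and built-in values match:

```python
import torch
import torch.nn.functional as F

# Made-up logits for N = 3 datapoints and C = 3 classes, for illustration only.
logits = torch.tensor([[2.0, 0.5, 0.3],
                       [0.2, 1.5, 0.3],
                       [0.1, 0.2, 2.2]])
labels = torch.tensor([0, 1, 2])  # ground-truth class indices

probs   = torch.softmax(logits, dim=1)                       # predicted probabilities
manual  = -torch.log(probs[torch.arange(3), labels]).mean()  # -log of the true-class probability
builtin = F.cross_entropy(logits, labels)                    # softmax + negative log-likelihood
print(manual.item(), builtin.item())                         # identical values
```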
Example
For example, consider the following neural network. There are three datapoints, "Setosa", "Virginica" and "Versicolor", each a 2-D vector (Petal, Sepal). The number of classes is $C = 3$.

| Petal | Sepal | Species | Cross Entropy |
|---|---|---|---|
| 0.04 | 0.42 | Setosa | 0.57 |
| 1 | 0.54 | Virginica | 0.58 |
| 0.50 | 0.37 | Versicolor | 0.52 |
Take the Versicolor datapoint, for example: when the input is Versicolor (Petal = 0.50, Sepal = 0.37), the cross entropy of this datapoint is 0.52.
Therefore, the cross entropy of this training step is the sum over the three datapoints: $0.57 + 0.58 + 0.52 = 1.67$.
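A small sketch of the arithmetic above; the per-datapoint values come from the table, and the implied probabilities assume the per-datapoint cross entropy is $-\log\hat{y}$ of the correct class:

```python
import math

ce_values = [0.57, 0.58, 0.52]               # cross entropy per datapoint, from the table
total = sum(ce_values)                       # 1.67
probs = [math.exp(-ce) for ce in ce_values]  # implied probability of the correct class
print(total, probs)                          # 1.67, approximately [0.566, 0.560, 0.595]
```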
Property
(Figure: log loss when the true label is 1, as a function of the predicted probability.)
The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly.
Log loss heavily penalizes those predictions that are confident and wrong.
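A tiny sketch of this behaviour, evaluating $-\log(p)$ for a few illustrative predicted probabilities of the true class:

```python
import math

# Predicted probability assigned to the true class (illustrative values).
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(p, round(-math.log(p), 4))
# 0.99 -> 0.0101, 0.9 -> 0.1054, 0.5 -> 0.6931, 0.1 -> 2.3026, 0.01 -> 4.6052
```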
Problems
The main problem of cross entropy is that even when the prediction is exactly correct, i.e., y_predicted == y_true, the loss is neither zero nor symmetric: its value depends on the label itself.
(Figure: cross entropy when y_predicted == y_true, plotted over label values between 0 and 1.)

From this figure, if y_predicted == y_true == 0.2, the cross entropy is 0.3218…; if y_predicted == y_true == 0.8, it is 0.178514. They're not equal!
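These numbers can be reproduced in a couple of lines, assuming the cross entropy here is computed per datapoint as $-y\,\log(\hat{y})$:

```python
import math

for y in [0.2, 0.8]:
    print(y, -y * math.log(y))  # 0.2 -> 0.3219..., 0.8 -> 0.1785...
```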
Focal Loss
--> Focal Loss for Dense Object Detection
A Focal Loss function addresses (extreme) class imbalance during training in tasks like object detection. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples.
Formally, the Focal Loss adds a modulating factor $(1 - p_t)^{\gamma}$ to the cross entropy loss:

$$\mathrm{FL}(p_t) = -\left(1 - p_t\right)^{\gamma} \log(p_t)$$

where $p_t$ is the model's estimated probability for the ground-truth class and $\gamma \ge 0$ is a tunable focusing parameter.
Property
(In the figure below, the notation $p_t$ denotes the model's estimated probability for the ground-truth class.)

(Figure: focal loss as a function of $p_t$ for several values of $\gamma$.)
- Setting $\gamma > 0$ reduces the relative loss for well-classified examples, putting more focus on hard, misclassified examples. Here $\gamma \ge 0$ is the tunable focusing parameter.
- One typical use case is object detection. An image may contain only around 5 objects, whereas the number of candidate bounding boxes can be in the millions, so there are enormously many negative datapoints. The model can easily learn to judge all datapoints as negative and still achieve high training performance.
```python
import torch
```
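Core PyTorch has no built-in focal loss, so below is a minimal sketch of the binary (sigmoid) variant, $-\alpha_t\,(1 - p_t)^{\gamma}\log(p_t)$; the logits and labels are made up for illustration, and `torchvision.ops.sigmoid_focal_loss` provides essentially the same computation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)                    # predicted probability of the positive class
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Made-up logits and binary labels, for illustration only.
logits  = torch.tensor([2.0, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```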
Dice Loss
Dice Loss was originally designed for binary classification problems, particularly in the context of binary segmentation where you're often distinguishing between the foreground and the background.
Dice Coefficient
It's derived from the Dice coefficient, which is a statistic used to gauge the similarity of two samples.
For the ground truth mask $X$ and the predicted mask $Y$, the Dice coefficient is

$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
```python
import numpy as np

def dice_coef(groundtruth_mask, pred_mask):
    # 2 * |X ∩ Y| / (|X| + |Y|) for binary masks
    intersect = np.sum(pred_mask * groundtruth_mask)
    total = np.sum(pred_mask) + np.sum(groundtruth_mask)
    return 2 * intersect / total
```
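A quick usage example of `dice_coef` above, on a pair of made-up 2×2 binary masks:

```python
import numpy as np

gt   = np.array([[1, 1], [0, 0]])  # made-up ground-truth mask
pred = np.array([[1, 0], [0, 0]])  # made-up predicted mask

print(dice_coef(gt, pred))  # 2 * 1 / (1 + 2) = 0.666...
```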
Dice Loss
The Dice Loss is

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \mathrm{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$
```python
def dice_loss(groundtruth_mask, pred_mask):
    # Dice loss = 1 - Dice coefficient
    return 1 - dice_coef(groundtruth_mask, pred_mask)
```
IoU
(Figure: intersection over union of a predicted segmentation and the ground truth.)

The Jaccard index, also known as Intersection over Union (IoU), is the area of the intersection over the area of the union of the predicted segmentation and the ground truth:

$$\mathrm{IoU} = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$$
```python
import numpy as np

def iou(groundtruth_mask, pred_mask):
    # |X ∩ Y| / |X ∪ Y| for binary masks
    intersect = np.sum(pred_mask * groundtruth_mask)
    union = np.sum(pred_mask) + np.sum(groundtruth_mask) - intersect
    return intersect / union
```
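On the same made-up masks, IoU and the Dice coefficient differ but are related by $\mathrm{Dice} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}$; a quick check using the functions defined above:

```python
import numpy as np

gt   = np.array([[1, 1], [0, 0]])  # made-up ground-truth mask
pred = np.array([[1, 0], [0, 0]])  # made-up predicted mask

print(iou(gt, pred))        # 1 / 2 = 0.5
print(dice_coef(gt, pred))  # 2 / 3 = 0.666... = 2 * 0.5 / (1 + 0.5)
```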