Common Loss Functions
- Cross entropy
- Focal loss
- Hinge loss
- ...
Notations
To make the context clear, we adopt the following notational conventions in this article:
- The size of the dataset, i.e., the number of datapoints, is denoted as $N$.
- The datapoint of index $i$ is $x_i$. Its last dimension, the ground truth value, is denoted as $y_i$.
- The predicted datapoint output by the model, corresponding to $x_i$, is $\hat{x}_i$. Its last dimension is $\hat{y}_i$.
- In other words, the predicted value of the $i$-th datapoint output by the model is $\hat{y}_i$.
- When the task is a classification task, the ground truth value is often a 0/1 label, i.e., $y_i \in \{0, 1\}$.
- When the task is a classification task, the number of classes is $C$. In addition, the ground truth label value and the prediction value of class $c$ of the $i$-th datapoint are $y_{i,c}$ and $\hat{y}_{i,c}$, respectively.
Mean Squared Error
Mean squared error (MSE) loss is

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
```python
import torch
```
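As a quick illustration, here is a minimal sketch of the formula above in PyTorch; the tensor values are made up, and `torch.nn.functional.mse_loss` is the built-in counterpart:

```python
import torch
import torch.nn.functional as F

# Made-up ground truth and predictions, for illustration only.
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

manual  = torch.mean((y_true - y_pred) ** 2)  # (1/N) * sum of squared errors
builtin = F.mse_loss(y_pred, y_true)          # PyTorch's built-in MSE
print(manual.item(), builtin.item())          # both print 0.375
```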
Cross Entropy
For the gradient of Cross Entropy (and its derivation), please refer to ->this article.
Cross-entropy loss (often abbreviated as CE), or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1 (usually produced by a softmax function).
The cross entropy loss is

$$\mathrm{CE} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\left(\hat{y}_{i,c}\right)$$

where $y_{i,c} \in \{0, 1\}$ indicates whether datapoint $i$ belongs to class $c$, and $\hat{y}_{i,c}$ is the predicted probability of class $c$ for datapoint $i$.
```python
import torch
```
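A minimal sketch of this computation in PyTorch, with made-up logits and labels; `F.cross_entropy` applies the softmax and negative log-likelihood internally, so the manual and built-in values match:

```python
import torch
import torch.nn.functional as F

# Made-up logits for N = 3 datapoints and C = 3 classes, for illustration only.
logits = torch.tensor([[2.0, 0.5, 0.3],
                       [0.2, 1.5, 0.3],
                       [0.1, 0.2, 2.2]])
labels = torch.tensor([0, 1, 2])  # ground-truth class indices

probs   = torch.softmax(logits, dim=1)                       # predicted probabilities
manual  = -torch.log(probs[torch.arange(3), labels]).mean()  # -log of the true-class probability
builtin = F.cross_entropy(logits, labels)                    # softmax + negative log-likelihood
print(manual.item(), builtin.item())                         # identical values
```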
Example
For example, consider the following neural network. There are three datapoints, "Setosa", "Virginica" and "Versicolor", each a 2-D vector (Petal, Sepal). The number of classes is $C = 3$.

| Petal | Sepal | Species | Cross Entropy |
|---|---|---|---|
| 0.04 | 0.42 | Setosa | 0.57 |
| 1 | 0.54 | Virginica | 0.58 |
| 0.50 | 0.37 | Versicolor | 0.52 |
Take the Versicolor datapoint, for example: when the input is Versicolor (Petal = 0.50, Sepal = 0.37), the cross entropy of this datapoint is 0.52.
Therefore, the cross entropy of this training step is the sum over the three datapoints: $0.57 + 0.58 + 0.52 = 1.67$.
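A small sketch of the arithmetic above; the per-datapoint values come from the table, and the implied probabilities assume the per-datapoint cross entropy is $-\log\hat{y}$ of the correct class:

```python
import math

ce_values = [0.57, 0.58, 0.52]               # cross entropy per datapoint, from the table
total = sum(ce_values)                       # 1.67
probs = [math.exp(-ce) for ce in ce_values]  # implied probability of the correct class
print(total, probs)                          # 1.67, approximately [0.566, 0.560, 0.595]
```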
Property
(Figure: log loss when the true label is 1, as a function of the predicted probability.)
The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly.
Log loss heavily penalizes those predictions that are confident and wrong.
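A tiny sketch of this behaviour, evaluating $-\log(p)$ for a few illustrative predicted probabilities of the true class:

```python
import math

# Predicted probability assigned to the true class (illustrative values).
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(p, round(-math.log(p), 4))
# 0.99 -> 0.0101, 0.9 -> 0.1054, 0.5 -> 0.6931, 0.1 -> 2.3026, 0.01 -> 4.6052
```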
Problems
The main problem of cross entropy is that even when the prediction is exactly correct, i.e., y_predicted == y_true, the loss is neither zero nor symmetric: its value depends on the label itself.
(Figure: cross entropy when y_predicted == y_true, plotted over label values between 0 and 1.)

From this figure, if y_predicted == y_true == 0.2, the cross entropy is 0.3218…; if y_predicted == y_true == 0.8, it is 0.178514. They're not equal!
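These numbers can be reproduced in a couple of lines, assuming the cross entropy here is computed per datapoint as $-y\,\log(\hat{y})$:

```python
import math

for y in [0.2, 0.8]:
    print(y, -y * math.log(y))  # 0.2 -> 0.3219..., 0.8 -> 0.1785...
```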
Focal Loss
--> Focal Loss for Dense Object Detection
A Focal Loss function addresses (extreme) class imbalance during training in tasks like object detection. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples.
Formally, the Focal Loss adds a modulating factor $(1 - p_t)^{\gamma}$ to the cross entropy loss:

$$\mathrm{FL}(p_t) = -\left(1 - p_t\right)^{\gamma} \log(p_t)$$

where $p_t$ is the model's estimated probability for the ground-truth class and $\gamma \ge 0$ is a tunable focusing parameter.
Property
(In the figure below, the notation $p_t$ denotes the model's estimated probability for the ground-truth class.)

(Figure: focal loss as a function of $p_t$ for several values of $\gamma$.)
- Setting $\gamma > 0$ reduces the relative loss for well-classified examples, putting more focus on hard, misclassified examples. Here $\gamma \ge 0$ is the tunable focusing parameter.
- One typical use case is object detection. An image may contain only around 5 objects, whereas the number of candidate bounding boxes can be in the millions, so there are enormously many negative datapoints. The model can easily learn to judge all datapoints as negative and still achieve high training performance.
```python
import torch
```
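Core PyTorch has no built-in focal loss, so below is a minimal sketch of the binary (sigmoid) variant, $-\alpha_t\,(1 - p_t)^{\gamma}\log(p_t)$; the logits and labels are made up for illustration, and `torchvision.ops.sigmoid_focal_loss` provides essentially the same computation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)                    # predicted probability of the positive class
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Made-up logits and binary labels, for illustration only.
logits  = torch.tensor([2.0, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```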
Dice Loss
Dice Loss was originally designed for binary classification problems, particularly in the context of binary segmentation where you're often distinguishing between the foreground and the background.
Dice Coefficient
It's derived from the Dice coefficient, which is a statistic used to gauge the similarity of two samples.
For the ground truth mask $X$ and the predicted mask $Y$, the Dice coefficient is

$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
```python
import numpy as np

def dice_coef(groundtruth_mask, pred_mask):
    # 2 * |X ∩ Y| / (|X| + |Y|) for binary masks
    intersect = np.sum(pred_mask * groundtruth_mask)
    total = np.sum(pred_mask) + np.sum(groundtruth_mask)
    return 2 * intersect / total
```
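A quick usage example of `dice_coef` above, on a pair of made-up 2×2 binary masks:

```python
import numpy as np

gt   = np.array([[1, 1], [0, 0]])  # made-up ground-truth mask
pred = np.array([[1, 0], [0, 0]])  # made-up predicted mask

print(dice_coef(gt, pred))  # 2 * 1 / (1 + 2) = 0.666...
```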
Dice Loss
The Dice Loss is

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \mathrm{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$
```python
def dice_loss(groundtruth_mask, pred_mask):
    # Dice loss = 1 - Dice coefficient
    return 1 - dice_coef(groundtruth_mask, pred_mask)
```
IoU
(Figure: intersection over union of a predicted segmentation and the ground truth.)

The Jaccard index, also known as Intersection over Union (IoU), is the area of the intersection over the area of the union of the predicted segmentation and the ground truth:

$$\mathrm{IoU} = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$$
```python
import numpy as np

def iou(groundtruth_mask, pred_mask):
    # |X ∩ Y| / |X ∪ Y| for binary masks
    intersect = np.sum(pred_mask * groundtruth_mask)
    union = np.sum(pred_mask) + np.sum(groundtruth_mask) - intersect
    return intersect / union
```
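On the same made-up masks, IoU and the Dice coefficient differ but are related by $\mathrm{Dice} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}$; a quick check using the functions defined above:

```python
import numpy as np

gt   = np.array([[1, 1], [0, 0]])  # made-up ground-truth mask
pred = np.array([[1, 0], [0, 0]])  # made-up predicted mask

print(iou(gt, pred))        # 1 / 2 = 0.5
print(dice_coef(gt, pred))  # 2 / 3 = 0.666... = 2 * 0.5 / (1 + 0.5)
```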