Variance-Invariance-Covariance Regularization

Sources:

VCReg, which is quite similar to VICReg

Image source: https://arxiv.org/pdf/2306.13292


Self-supervised learning methods aim to learn meaningful representations without relying on labels. VICReg (Variance-Invariance-Covariance Regularization) is one such method, which learns representations by optimizing three key objectives: maintaining variance, reducing covariance, and ensuring invariance between augmented views of the same input.

In this article, we focus solely on the core idea of VICReg—the design of its loss function—excluding discussions about network architectures and implementation details.

Notation

| Symbol | Type | Description |
| --- | --- | --- |
| $x_i$ | $\mathbb{R}$ | The $i$-th original input in the batch |
| $x_i', x_i''$ | $\mathbb{R}$ | Two augmented versions of the original input $x_i$ |
| $f_\theta(\cdot)$ | Function | Neural network parameterized by $\theta$, used to generate embeddings |
| $z_i, z_i'$ | $\mathbb{R}^d$ | Representations of $x_i'$ and $x_i''$ generated by $f_\theta$ |
| $Z, Z'$ | $\mathbb{R}^{d \times n}$ | Batch embeddings for the augmented inputs |
| $z^j, z'^j$ | $\mathbb{R}^n$ | The $j$-th row of $Z$ or $Z'$, i.e., the values of the $j$-th dimension across all samples |
| $\mathrm{Cov}(Z)$ | $\mathbb{R}^{d \times d}$ | Variance-covariance matrix of $Z$ |
| $\mathrm{Var}(z^j)$ | $\mathbb{R}$ | Variance of the $j$-th dimension across embeddings in $Z$ |
| $\gamma$ | $\mathbb{R}$ | Threshold for variance regularization (e.g., $\gamma = 1$) |
| $\ell(Z, Z')$ | $\mathbb{R}$ | Overall VICReg loss function |
| $v(Z), c(Z), s(Z, Z')$ | $\mathbb{R}$ | Variance loss, covariance loss, and invariance loss, respectively |
| $\mu, \nu, \lambda$ | $\mathbb{R}$ | Hyperparameters controlling the weight of the variance, covariance, and invariance terms |

Abbreviations

| Abbreviation | Description |
| --- | --- |
| VICReg | Variance-Invariance-Covariance Regularization |
| Cov | Covariance |
| Var | Variance |
| NN | Neural network |

Problem setting

We consider a batch of data $\{x_1, \ldots, x_n\}$. For each sample $x_i$, two augmented views $x_i'$ and $x_i''$ are generated. These augmented views are passed through a neural network $f_\theta(\cdot)$, producing embeddings:

$$z_i = f_\theta(x_i') \quad \text{and} \quad z_i' = f_\theta(x_i'')$$

where $z_i, z_i' \in \mathbb{R}^d$. The embeddings for the entire batch are collected into two matrices:

$$Z = [z_1, z_2, \ldots, z_n] \quad \text{and} \quad Z' = [z_1', z_2', \ldots, z_n']$$

where $Z, Z' \in \mathbb{R}^{d \times n}$.
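
As a concrete illustration, here is a minimal sketch of how $Z$ and $Z'$ could be produced. The encoder, the `augment` function, and all dimensions are placeholder assumptions for this article, not the setup from the VICReg paper.

```python
import torch
import torch.nn as nn

n, in_dim, d = 256, 32, 8   # batch size, input dimension, embedding dimension

# Toy encoder f_theta: any network mapping an input to a d-dimensional embedding.
encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, d))

def augment(x):
    # Placeholder augmentation (hypothetical); real pipelines use crops, color jitter, etc.
    return x + 0.1 * torch.randn_like(x)

x = torch.randn(n, in_dim)                              # batch {x_1, ..., x_n}
z, z_prime = encoder(augment(x)), encoder(augment(x))   # two augmented views, each (n, d)

# Store embeddings as columns so that Z, Z' are d x n, matching the notation above.
Z, Z_prime = z.T, z_prime.T                             # (d, n)
```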

The variance-covariance matrix of $Z$ is defined as:

$$\mathrm{Cov}(Z) = \mathbb{E}\big[(Z - \mathbb{E}[Z])(Z - \mathbb{E}[Z])^T\big]$$

Expanding this over the $d$ embedding dimensions:

$$\mathrm{Cov}(Z) = \begin{bmatrix} \mathrm{Cov}(z^1, z^1) & \cdots & \mathrm{Cov}(z^1, z^d) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(z^d, z^1) & \cdots & \mathrm{Cov}(z^d, z^d) \end{bmatrix}$$
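
As a sanity check, assuming the `Z` built in the previous sketch, the empirical covariance can be computed by centering each row of $Z$; `torch.cov` uses the same rows-as-variables convention and returns the same $d \times d$ matrix.

```python
# Empirical covariance of Z (d x n): rows are dimensions, columns are samples.
Z_centered = Z - Z.mean(dim=1, keepdim=True)          # subtract the per-dimension mean
cov = Z_centered @ Z_centered.T / (Z.shape[1] - 1)    # (d, d)

# torch.cov treats rows as variables and columns as observations, so it agrees.
assert torch.allclose(cov, torch.cov(Z), atol=1e-5)
print(cov.shape)   # torch.Size([8, 8])
```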

VICReg loss

VICReg optimizes three goals:

  1. High variance: Encourage $\mathrm{Var}(z^j)$ to stay above a threshold $\gamma$, preventing collapse, where all embeddings $z_i$ become identical. For example, if every sample is mapped to the same vector $[1, 2, \ldots, d]^T$:
     $$Z = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 2 & 2 & \cdots & 2 \\ \vdots & \vdots & & \vdots \\ d & d & \cdots & d \end{bmatrix}$$
     In this case, each dimension $z^j$ (e.g., $z^1 = [1, 1, \ldots, 1]$) has no variation, resulting in $\mathrm{Var}(z^j) = 0$ for all $j$. To prevent this, VICReg introduces the variance loss[^1]:
     $$v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\big(0,\ \gamma - \mathrm{Var}(z^j)\big)$$
     (All three loss terms are implemented in the code sketch after this list.)

  2. Low covariance: Minimize the off-diagonal elements of $\mathrm{Cov}(Z)$. The covariance loss:
     $$c(Z) = \frac{1}{d} \sum_{i \neq j} \big[\mathrm{Cov}(Z)\big]_{i,j}^{2}$$
     This reduces redundancy by minimizing the overlap between dimensions.

  3. Invariance: Ensure that the embeddings $Z$ and $Z'$ of the same input are similar. The invariance loss:
     $$s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \lVert z_i - z_i' \rVert_2^2$$
     This is the term where contrastive learning resides: it pulls the two positive embeddings closer together.
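
Below is a minimal sketch of the three terms, using the simplified hinge on $\mathrm{Var}(z^j)$ adopted in this article (the original paper hinges on a regularized standard deviation instead). `Z` and `Z_prime` are the $d \times n$ matrices from the earlier sketch.

```python
def variance_loss(Z, gamma=1.0):
    # v(Z) = (1/d) * sum_j max(0, gamma - Var(z^j)): hinge on each dimension's variance.
    var = Z.var(dim=1)                      # variance of each of the d rows (dimensions)
    return torch.relu(gamma - var).mean()

def covariance_loss(Z):
    # c(Z) = (1/d) * sum_{i != j} Cov(Z)_{ij}^2: penalize off-diagonal covariance.
    d = Z.shape[0]
    cov = torch.cov(Z)
    off_diag = cov - torch.diag(torch.diag(cov))   # zero out the diagonal
    return off_diag.pow(2).sum() / d

def invariance_loss(Z, Z_prime):
    # s(Z, Z') = (1/n) * sum_i ||z_i - z'_i||_2^2: mean squared distance between views.
    return (Z - Z_prime).pow(2).sum(dim=0).mean()
```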

The overall VICReg loss:
$$\ell(Z, Z') = \mu \big[ v(Z) + v(Z') \big] + \nu \big[ c(Z) + c(Z') \big] + \lambda \, s(Z, Z')$$
where $\lambda$, $\mu$, and $\nu$ are hyperparameters controlling the importance of each term in the loss.
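
Combining the helpers above gives the full objective. The default weights below ($\lambda = \mu = 25$, $\nu = 1$) follow the values reported in the VICReg paper, but treat them here as illustrative.

```python
def vicreg_loss(Z, Z_prime, lam=25.0, mu=25.0, nu=1.0, gamma=1.0):
    # l(Z, Z') = mu [v(Z) + v(Z')] + nu [c(Z) + c(Z')] + lambda s(Z, Z')
    v = variance_loss(Z, gamma) + variance_loss(Z_prime, gamma)
    c = covariance_loss(Z) + covariance_loss(Z_prime)
    s = invariance_loss(Z, Z_prime)
    return mu * v + nu * c + lam * s

loss = vicreg_loss(Z, Z_prime)
loss.backward()    # gradients flow back into the encoder through both views
```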

My comment

The invariance term is contrastive learning; the other two terms are heuristic tricks.


[^1]: In the original paper, $\mathrm{Var}(z^j)$ is replaced by the regularized standard deviation $\sqrt{\mathrm{Var}(z^j) + \epsilon}$, where $\epsilon$ is a small constant for numerical stability. This is an engineering choice and does not affect the core idea; in this article, we use the simpler version for clarity and intuitiveness.