Variance-Invariance-Covariance Regularization

Sources:

VCReg, which is quite similar to VICReg

Image source: https://arxiv.org/pdf/2306.13292


Self-supervised learning methods aim to learn meaningful representations without relying on labels. VICReg (Variance-Invariance-Covariance Regularization) is one such method, which learns representations by optimizing three key objectives: maintaining variance, reducing covariance, and ensuring invariance between augmented views of the same input.

In this article, we focus solely on the core idea of VICReg—the design of its loss function—excluding discussions about network architectures and implementation details.

Notation

| Symbol | Type | Description |
| --- | --- | --- |
| $x_i$ | $\mathbb{R}$ | The $i$-th original input in the batch |
| $x_i', x_i''$ | $\mathbb{R}$ | Two augmented versions of the original input $x_i$ |
| $f_\theta(\cdot)$ | Function | Neural network parameterized by $\theta$, used to generate embeddings |
| $z_i, z_i'$ | $\mathbb{R}^d$ | Representations of $x_i'$ and $x_i''$ generated by $f_\theta$ |
| $Z, Z'$ | $\mathbb{R}^{d \times n}$ | Batch embeddings for the augmented inputs |
| $z^j, z'^j$ | $\mathbb{R}^n$ | The $j$-th row of $Z$ or $Z'$, i.e., the values of the $j$-th dimension across all samples |
| $\mathrm{Cov}(Z)$ | $\mathbb{R}^{d \times d}$ | Variance-covariance matrix of $Z$ |
| $\mathrm{Var}(z^j)$ | $\mathbb{R}$ | Variance of the $j$-th dimension across embeddings in $Z$ |
| $\gamma$ | $\mathbb{R}$ | Threshold for variance regularization (e.g., $\gamma = 1$) |
| $\ell(Z, Z')$ | $\mathbb{R}$ | Overall VICReg loss function |
| $v(Z), c(Z), s(Z, Z')$ | $\mathbb{R}$ | Variance loss, covariance loss, and invariance loss, respectively |
| $\mu, \nu, \lambda$ | $\mathbb{R}$ | Hyperparameters controlling the weight of the variance, covariance, and invariance terms |

Abbreviations

| Abbreviation | Description |
| --- | --- |
| VICReg | Variance-Invariance-Covariance Regularization |
| Cov | Covariance |
| Var | Variance |
| NN | Neural network |

Problem setting

We consider a batch of data $\{x_1, \ldots, x_n\}$. For each sample $x_i$, two augmented views $x_i'$ and $x_i''$ are generated. These augmented views are passed through a neural network $f_\theta(\cdot)$, producing embeddings:

$$z_i = f_\theta(x_i') \quad \text{and} \quad z_i' = f_\theta(x_i'')$$

where $z_i, z_i' \in \mathbb{R}^d$. The embeddings for the entire batch are collected into two matrices:

$$Z = [z_1, z_2, \ldots, z_n] \quad \text{and} \quad Z' = [z_1', z_2', \ldots, z_n']$$

where $Z, Z' \in \mathbb{R}^{d \times n}$.
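
As a concrete illustration, here is a minimal sketch of how $Z$ and $Z'$ could be produced. The encoder, the `augment` function, and all dimensions are placeholder assumptions for this article, not the setup from the VICReg paper.

```python
import torch
import torch.nn as nn

n, in_dim, d = 256, 32, 8   # batch size, input dimension, embedding dimension

# Toy encoder f_theta: any network mapping an input to a d-dimensional embedding.
encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, d))

def augment(x):
    # Placeholder augmentation (hypothetical); real pipelines use crops, color jitter, etc.
    return x + 0.1 * torch.randn_like(x)

x = torch.randn(n, in_dim)                              # batch {x_1, ..., x_n}
z, z_prime = encoder(augment(x)), encoder(augment(x))   # two augmented views, each (n, d)

# Store embeddings as columns so that Z, Z' are d x n, matching the notation above.
Z, Z_prime = z.T, z_prime.T                             # (d, n)
```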

The variance-covariance matrix of $Z$ is defined as:

$$\mathrm{Cov}(Z) = \mathbb{E}\big[(Z - \mathbb{E}[Z])(Z - \mathbb{E}[Z])^T\big]$$

Expanding this over the $d$ embedding dimensions:

$$\mathrm{Cov}(Z) = \begin{bmatrix} \mathrm{Cov}(z^1, z^1) & \cdots & \mathrm{Cov}(z^1, z^d) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(z^d, z^1) & \cdots & \mathrm{Cov}(z^d, z^d) \end{bmatrix}$$
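
As a sanity check, assuming the `Z` built in the previous sketch, the empirical covariance can be computed by centering each row of $Z$; `torch.cov` uses the same rows-as-variables convention and returns the same $d \times d$ matrix.

```python
# Empirical covariance of Z (d x n): rows are dimensions, columns are samples.
Z_centered = Z - Z.mean(dim=1, keepdim=True)          # subtract the per-dimension mean
cov = Z_centered @ Z_centered.T / (Z.shape[1] - 1)    # (d, d)

# torch.cov treats rows as variables and columns as observations, so it agrees.
assert torch.allclose(cov, torch.cov(Z), atol=1e-5)
print(cov.shape)   # torch.Size([8, 8])
```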

VICReg loss

VICReg optimizes three goals:

  1. High variance: Encourage $\mathrm{Var}(z^j)$ to stay above a threshold $\gamma$, preventing collapse, where all embeddings $z_i$ become identical. For example, if every sample is mapped to the same vector $[1, 2, \ldots, d]^T$:
     $$Z = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 2 & 2 & \cdots & 2 \\ \vdots & \vdots & & \vdots \\ d & d & \cdots & d \end{bmatrix}$$
     In this case, each dimension $z^j$ (e.g., $z^1 = [1, 1, \ldots, 1]$) has no variation, resulting in $\mathrm{Var}(z^j) = 0$ for all $j$. To prevent this, VICReg introduces the variance loss[^1]:
     $$v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\big(0,\ \gamma - \mathrm{Var}(z^j)\big)$$
     (All three loss terms are implemented in the code sketch after this list.)

  2. Low covariance: Minimize the off-diagonal elements of $\mathrm{Cov}(Z)$. The covariance loss:
     $$c(Z) = \frac{1}{d} \sum_{i \neq j} \big[\mathrm{Cov}(Z)\big]_{i,j}^{2}$$
     This reduces redundancy by minimizing the overlap between dimensions.

  3. Invariance: Ensure that the embeddings $Z$ and $Z'$ of the same input are similar. The invariance loss:
     $$s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \lVert z_i - z_i' \rVert_2^2$$
     This is the term where contrastive learning resides: it pulls the two positive embeddings closer together.
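
Below is a minimal sketch of the three terms, using the simplified hinge on $\mathrm{Var}(z^j)$ adopted in this article (the original paper hinges on a regularized standard deviation instead). `Z` and `Z_prime` are the $d \times n$ matrices from the earlier sketch.

```python
def variance_loss(Z, gamma=1.0):
    # v(Z) = (1/d) * sum_j max(0, gamma - Var(z^j)): hinge on each dimension's variance.
    var = Z.var(dim=1)                      # variance of each of the d rows (dimensions)
    return torch.relu(gamma - var).mean()

def covariance_loss(Z):
    # c(Z) = (1/d) * sum_{i != j} Cov(Z)_{ij}^2: penalize off-diagonal covariance.
    d = Z.shape[0]
    cov = torch.cov(Z)
    off_diag = cov - torch.diag(torch.diag(cov))   # zero out the diagonal
    return off_diag.pow(2).sum() / d

def invariance_loss(Z, Z_prime):
    # s(Z, Z') = (1/n) * sum_i ||z_i - z'_i||_2^2: mean squared distance between views.
    return (Z - Z_prime).pow(2).sum(dim=0).mean()
```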

The overall VICReg loss:
$$\ell(Z, Z') = \mu \big[ v(Z) + v(Z') \big] + \nu \big[ c(Z) + c(Z') \big] + \lambda \, s(Z, Z')$$
where $\lambda$, $\mu$, and $\nu$ are hyperparameters controlling the importance of each term in the loss.
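
Combining the helpers above gives the full objective. The default weights below ($\lambda = \mu = 25$, $\nu = 1$) follow the values reported in the VICReg paper, but treat them here as illustrative.

```python
def vicreg_loss(Z, Z_prime, lam=25.0, mu=25.0, nu=1.0, gamma=1.0):
    # l(Z, Z') = mu [v(Z) + v(Z')] + nu [c(Z) + c(Z')] + lambda s(Z, Z')
    v = variance_loss(Z, gamma) + variance_loss(Z_prime, gamma)
    c = covariance_loss(Z) + covariance_loss(Z_prime)
    s = invariance_loss(Z, Z_prime)
    return mu * v + nu * c + lam * s

loss = vicreg_loss(Z, Z_prime)
loss.backward()    # gradients flow back into the encoder through both views
```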

My comment

The invariance term is contrastive learning; the other two terms are heuristic tricks.


[^1]: In the original paper, $\mathrm{Var}(z^j)$ is replaced by the regularized standard deviation $\sqrt{\mathrm{Var}(z^j) + \epsilon}$, where $\epsilon$ is a small constant for numerical stability. This is an engineering choice and does not affect the core idea; in this article, we use the simpler version for clarity and intuitiveness.