SimCLR
Sources:
- SimCLR v1 2020 paper
- SimCLR v2 2020 paper
- Contrastive Representation Learning by Lilian
- UVA's SimCLR implementation (both PyTorch and JAX versions are implemented)
Introduction to contrastive learning
Self-supervised learning is the technique of learning rich, useful representations from unlabelled data, which can then be used for downstream tasks, i.e. used as an initialization and fine-tuned (either the whole network or only a linear classifier) on limited labelled data.
Contrastive learning is one of many self-supervised learning paradigms that fall under deep distance metric learning, where the objective is to learn a distance in a low-dimensional space that is consistent with the notion of semantic similarity. In simple terms (considering the image domain), this means learning similarity among images, such that the distance is small for similar images and large for dissimilar images.
Gist of the approach:
- Create similar and dissimilar sets for every image in the dataset.
- Pass two images (from the similar/dissimilar sets) through the same neural network and extract low-dimensional embeddings/representations.
- Compute the Euclidean distance between the two embeddings (not the pixels!).
- Minimize a loss such that the above objective is achieved.
- Repeat 1–4 for a large number of pairs (considering all pairs may be infeasible) until the model converges.
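As an illustrative sketch of steps 2–4 above (not the SimCLR loss itself, which is introduced below), a classic margin-based pairwise contrastive loss in JAX might look like this; the embedding dimensions, margin, and random stand-in embeddings are placeholder assumptions:

import jax
import jax.numpy as jnp

def pairwise_contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    # emb_a, emb_b: (batch, dim) embeddings of the two images in each pair
    # is_similar:   (batch,) 1.0 for similar pairs, 0.0 for dissimilar pairs
    dist = jnp.linalg.norm(emb_a - emb_b, axis=-1)                         # Euclidean distance between embeddings
    pos_term = is_similar * dist ** 2                                      # pull similar pairs together
    neg_term = (1.0 - is_similar) * jnp.maximum(margin - dist, 0.0) ** 2   # push dissimilar pairs at least `margin` apart
    return jnp.mean(pos_term + neg_term)

# Toy usage with random stand-in embeddings
emb_a = jax.random.normal(jax.random.PRNGKey(0), (8, 128))
emb_b = jax.random.normal(jax.random.PRNGKey(1), (8, 128))
is_similar = jnp.array([1., 1., 0., 0., 1., 0., 1., 0.])
loss = pairwise_contrastive_loss(emb_a, emb_b, is_similar)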
SimCLR
SimCLR is a very simple framework for unsupervised pretraining. It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space (not in the pixel space).
As illustrated in Figure 1, SimCLR comprises the following four major components.
- A stochastic data augmentation family $\mathcal{T}$. We sample two separate data augmentation operators from it ($t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$), and apply them to each data example $x$ to obtain two correlated views denoted $\tilde{x}_i = t(x)$ and $\tilde{x}_j = t'(x)$.
- A neural network encoder $f(\cdot)$ (ResNet in practice) that extracts representation vectors from augmented data examples: $h_i = f(\tilde{x}_i)$, where $h_i$ is the output after the average pooling layer.
- A small neural network projection head $g(\cdot)$ (a 2-3 layer MLP in practice) that maps representations to the latent space where the contrastive loss is applied: $z_i = g(h_i)$.
- A contrastive loss function. Given a set $\{\tilde{x}_k\}$ including a positive pair of examples $\tilde{x}_i$ and $\tilde{x}_j$, the contrastive prediction task aims to identify $\tilde{x}_j$ in $\{\tilde{x}_k\}_{k \neq i}$ for a given $\tilde{x}_i$.
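As a minimal sketch (not the authors' code), the projection head $g(\cdot)$ can be written as a small Flax module on top of the encoder output $h$; the hidden and output dimensions below are illustrative assumptions:

import jax
import jax.numpy as jnp
import flax.linen as nn

class ProjectionHead(nn.Module):
    # g(.): maps the encoder representation h to the latent vector z
    hidden_dim: int = 2048
    out_dim: int = 128

    @nn.compact
    def __call__(self, h):
        z = nn.Dense(self.hidden_dim)(h)
        z = nn.relu(z)
        z = nn.Dense(self.out_dim)(z)   # contrastive loss is applied on z, not on h
        return z

# Toy usage: pretend `h` came from a ResNet's average-pooling layer
h = jnp.ones((4, 2048))
model = ProjectionHead()
params = model.init(jax.random.PRNGKey(0), h)
z = model.apply(params, h)              # shape (4, 128)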
Training process
We randomly sample a minibatch of N examples, apply augmentation functions t(·) and t'(·) to them, resulting in 2N image views.
For each image, we construct positive and negative examples, which will be used in contrastive learning, as follows:
- For each original image, we augment it into two views i and j. These views i and j are mutually positive examples.
- All other 2(N-1) views k where k ≠ i and k ≠ j are negative examples of i (or j). Therefore, for each view i, we want to maximize its similarity with its positive example j, while minimizing its similarity to all negative examples k.
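As a tiny sketch of how each view's positive partner can be located by index (assuming, as in the appendix walkthrough below, that the two views of an image are stored N positions apart; the 0-based indexing here is an assumption):

import jax.numpy as jnp

N = 4                                 # minibatch size (illustrative)
num_views = 2 * N                     # each image yields two augmented views
idx = jnp.arange(num_views)
pos_idx = (idx + N) % num_views       # index of the positive example of each view
# with N = 4: pos_idx = [4, 5, 6, 7, 0, 1, 2, 3];
# every other view k (k != i and k != pos_idx[i]) is a negative example of view i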
SimCLR proposes the NT-Xent loss (the normalized temperature-scaled cross-entropy loss), which took inspiration from the InfoNCE loss, for contrastive learning. In short, the InfoNCE loss compares the similarity of $z_i$ and $z_j$ to the similarity of $z_i$ and every other view in the batch, and performs a softmax over these similarity scores. With $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \, \lVert v \rVert)$ denoting cosine similarity and $\tau$ a temperature parameter, the loss for a positive pair $(i, j)$ is
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$
The final loss is computed across all positive pairs, both $(i, j)$ and $(j, i)$, in a mini-batch:
$$\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \big[ \ell_{2k-1,\,2k} + \ell_{2k,\,2k-1} \big]$$
We can further derive
$$\ell_{i,j} = -\frac{\mathrm{sim}(z_i, z_j)}{\tau} + \log \sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big),$$
i.e. the negative positive-pair similarity plus a log-sum-exp over the remaining similarities, which is exactly how the loss is implemented.
See the appendix for the implementation.
Exploring compositions of data augmentation operations
Composition of data augmentation operations is crucial for learning good representations. Figure 3 illustrates the studied data augmentation operators. Each augmentation can transform data stochastically with some internal parameters (e.g. rotation degree, noise level).
Note that these operators are only tested in the ablation; the augmentation policy used to train the models only includes random crop (with flip and resize), color distortion, and Gaussian blur.
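A sketch of this training augmentation policy with torchvision transforms is given below; the color-distortion recipe (strength s, jitter probability 0.8, grayscale probability 0.2) follows the SimCLR paper, while the image size and blur kernel size are illustrative assumptions:

from torchvision import transforms

def simclr_augmentation(size=224, s=1.0):
    # random crop (with flip and resize) + color distortion + Gaussian blur
    color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([color_jitter], p=0.8),   # apply color jitter with probability 0.8
        transforms.RandomGrayscale(p=0.2),               # random grayscale with probability 0.2
        transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
        transforms.ToTensor(),
    ])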
The result of the ablation experiment is shown in Figure 4, where we tabulate the ImageNet top-1 accuracy under individual or composition of data augmentations, applied only to one branch.
For all columns but the last, diagonal entries correspond to a single transformation, and off-diagonal entries correspond to compositions of two transformations (applied sequentially). The last column reflects the average over the row.
One composition of augmentations stands out: random cropping and random color distortion (achieves an accuracy of 55.8 or 56.3). We conjecture that one serious issue when using only random cropping as data augmentation is that most patches from an image share a similar color distribution.
Figure 5 shows that color histograms alone suffice to distinguish images. Neural nets may exploit this shortcut to solve the predictive task. Therefore, it is critical to compose cropping with color distortion in order to learn generalizable features.

Figure 5. Histograms of pixel intensities (over all channels) for different crops of two different images (i.e. two rows). The image for the first row is from Figure 3. All axes have the same range.
SimCLR v2
SimCLR v2 is the successor of SimCLR. It leverages bigger and deeper neural networks (larger ResNets and a deeper projection MLP) as its backbone, and it provides a three-stage pipeline for semi-supervised learning:
- (unsupervised) pretraining
- (supervised) fine-tuning
- distillation (with unlabeled examples)

We then illustrate the process of knowledge distillation via unlabeled examples. To further improve the network for the target task, we use the fine-tuned network as a teacher to impute labels for training a student network. Specifically, we minimize the following distillation loss, where no real labels are used:
$$\mathcal{L}^{\text{distill}} = -\sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^T(y \mid x_i; \tau) \log P^S(y \mid x_i; \tau) \Big]$$
where $P(y \mid x_i; \tau) = \exp\big(f^{\text{task}}(x_i)[y]/\tau\big) / \sum_{y'} \exp\big(f^{\text{task}}(x_i)[y']/\tau\big)$ and $\tau$ is a scalar temperature. The teacher network, which produces $P^T(y \mid x_i)$, is fixed during distillation; only the student network, which produces $P^S(y \mid x_i)$, is trained.
While we focus on distillation using only unlabeled examples in this work, when the number of labeled examples is significant, one can also combine the distillation loss with the ground-truth labels using a weighted combination:
$$\mathcal{L} = -(1-\alpha) \sum_{(x_i, y_i) \in \mathcal{D}^L} \log P^S(y_i \mid x_i) \;-\; \alpha \sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^T(y \mid x_i; \tau) \log P^S(y \mid x_i; \tau) \Big]$$
This procedure can be performed using students either with the same model architecture (self-distillation), which further improves the task-specific performance, or with a smaller model architecture, which leads to a compact model.
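A minimal JAX sketch of the distillation loss above; the function name, batch shapes, and default temperature are assumptions:

import jax
import jax.numpy as jnp

def distillation_loss(teacher_logits, student_logits, tau=1.0):
    # P^T(y|x; tau): teacher distribution, kept fixed during distillation
    teacher_probs = jax.lax.stop_gradient(jax.nn.softmax(teacher_logits / tau, axis=-1))
    # log P^S(y|x; tau): student log-distribution, the only part that is trained
    student_log_probs = jax.nn.log_softmax(student_logits / tau, axis=-1)
    # -sum_y P^T(y|x; tau) * log P^S(y|x; tau), averaged over the unlabeled batch
    return -jnp.mean(jnp.sum(teacher_probs * student_log_probs, axis=-1))

# Toy usage: random logits for a batch of 8 unlabeled examples and 10 classes
key_t, key_s = jax.random.split(jax.random.PRNGKey(0))
loss = distillation_loss(jax.random.normal(key_t, (8, 10)),
                         jax.random.normal(key_s, (8, 10)))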
Appendix
InfoNCE loss
Suppose we have a mini-batch of $B$ images. From each image we extract two augmented views, so the new batch contains $B' = 2B$ views. The two views of the same image form a positive pair; in contrastive learning, all the other views in the batch serve as negative examples. Since we have regulated that, in the new batch, the two views of image $n$ are placed at positions $n$ and $n + B$, the positive example of the view at position $n$ is the view at position $(n + B) \bmod B'$.
For example, suppose $B = 2$, so the new batch contains $B' = 4$ views.
Get the cosine similarity matrix (and scale it by a temperature parameter):
cos_sim = [
    [s11, s12, s13, s14],
    [s21, s22, s23, s24],
    [s31, s32, s33, s34],
    [s41, s42, s43, s44]
]
To exclude each view's similarity with itself from the softmax, the diagonal of the cosine similarity matrix is set to a very low value (written as -inf below; in practice a large negative constant such as -9e15 is used for numerical stability):
cos_sim = [
    [-inf, s12, s13, s14],
    [s21, -inf, s23, s24],
    [s31, s32, -inf, s34],
    [s41, s42, s43, -inf]
]
Here, the first view of the first image is indexed by 1, and its positive counterpart is indexed by 1+B=3. So the similarity of the first positive pair is s13. For the same reason, we can get the similarities of all the positive pairs:
s13, s24, s31, s42
Therefore, the similarities of the negative pairs (together with the masked -inf diagonal entries, which vanish after exponentiation) are:
-inf, s12, s14, s21, -inf, s23, s32, -inf, s34, s41, s43, -inf
To extract the positive-pair similarities, we simply make two index arrays:
diag_range = [1, 2, 3, 4]
shifted_diag_range = [3, 4, 1, 2]   # diag_range circularly shifted by B (1-based indices)
Extract the positive pairs with these indices:
positive_pair_sim_array = cos_sim[diag_range, shifted_diag_range]
'''
Get: s13, s24, s31, s42
'''
From the derived form $\ell_{i,j} = -\mathrm{sim}(z_i, z_j)/\tau + \log \sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)$, we compute the InfoNCE loss from the similarities of the positive pairs and the negative pairs:
nll = - positive_pair_sim_array + nn.logsumexp(cos_sim, axis=-1)
nll = nll.mean()
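Putting the walkthrough together, here is a self-contained JAX sketch of the whole loss. The function name, the 0-based indexing, and the default temperature are assumptions for illustration rather than the reference implementation:

import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def info_nce_loss(z, B, temperature=0.1):
    # z: (2B, dim) projections, arranged so that z[i] and z[(i + B) % (2B)]
    # are the two views of the same image (0-based indexing here).
    z = z / jnp.linalg.norm(z, axis=-1, keepdims=True)          # L2-normalize -> dot product = cosine similarity
    cos_sim = (z @ z.T) / temperature                           # (2B, 2B) scaled similarity matrix
    n = 2 * B
    diag_range = jnp.arange(n)
    cos_sim = cos_sim.at[diag_range, diag_range].set(-9e15)     # mask each view's similarity with itself
    shifted_diag_range = (diag_range + B) % n                   # index of each view's positive partner
    positive_pair_sim_array = cos_sim[diag_range, shifted_diag_range]
    nll = -positive_pair_sim_array + logsumexp(cos_sim, axis=-1)
    return nll.mean()

# Toy usage: 2B = 8 random projections of dimension 128
z = jax.random.normal(jax.random.PRNGKey(0), (8, 128))
loss = info_nce_loss(z, B=4)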