SimCLR Explained in Simple Terms

SimCLR explained simply in 1 sentence, 1 minute, and 5 minutes

Jeffrey Boschman
One Minute Machine Learning

--

SimCLR stands for a Simple Framework for Contrastive Learning of Visual Representations. There are a few steps involved and the original paper (by Ting Chen and Geoffrey Hinton, among others) uses some complex words so it might seem daunting, but once you understand it the concept is quite simple and intuitive. So, what is SimCLR?

Illustration of SimCLR from the author’s blog post.

One Sentence Summary

If I had to describe SimCLR in one sentence, I would say something like:

SimCLR is a strategy for training model weights (to eventually be used for a separate downstream computer vision task like classification or detection) on the task of identifying which two images in a batch of random images are augmented views of the same original image.

However, there is still a lot to unpack before this makes complete sense, so allow me to go into a bit more detail.

One Minute Summary

Now I will try to explain SimCLR the way I would to a peer, in about one minute.

SimCLR is a self-supervised learning strategy. You would use SimCLR when you want to train the weights of a model to be really good before you eventually train it further for a separate computer vision task, like classification of images.

The SimCLR framework has four major components:

  1. A Data Augmentation Module, which transforms a given image randomly in two ways, yielding two correlated views of the same example.
  2. A Base Encoder, such as the ResNet-50 architecture (although you can use almost any encoder), which extracts representation vectors from augmented data examples. This is the model whose weights you are training for the eventual downstream task.
  3. A Projection Head, which is an MLP (with one hidden layer) that just makes the representation vectors smaller before they go into the loss function.
  4. A Contrastive Loss Function, which is low if the two correlated vectors (from the same original image) are similar to each other AND dissimilar to the vectors from the set of “negative” examples (all the other vectors in the mini-batch, which come from different original images), and high otherwise.

During training, each image in a mini-batch goes through the first 3 steps (the Data Augmentation Module, the Base Encoder, and the Projection Head), which yields two representation vectors per original image.

Then, the Contrastive Loss is calculated for each “positive” pair of vectors (the two views of the same input image) by comparing them against the “negative” representation vectors from all the other images in the mini-batch. The total loss is the average loss across all positive pairs in the mini-batch.

As the loss gets propagated backwards through the network and weights are updated, the Base Encoder (and Projection Head, but we don’t care about this) should learn to make the representation vectors from the same input image more similar to each other, and more different from the representation vectors of the other images.
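
To make this concrete, here is a rough sketch (my own pseudocode, not the authors’ code) of what one SimCLR training step could look like in PyTorch. The `augment`, `encoder`, `projection_head`, and `nt_xent_loss` names are placeholders for the four components above; a sketch of each appears in the detail sections below.

```python
import torch

def simclr_training_step(images, augment, encoder, projection_head, nt_xent_loss, optimizer):
    # 1. Data Augmentation Module: two random views of every image in the mini-batch
    view_a, view_b = augment(images), augment(images)

    # 2. Base Encoder: representation vectors h (e.g. 2048-d for ResNet-50).
    #    After the concatenation, rows i and i + N come from the same original image.
    h = encoder(torch.cat([view_a, view_b], dim=0))

    # 3. Projection Head: smaller vectors z (e.g. 128-d) used only for the loss
    z = projection_head(h)

    # 4. Contrastive Loss: pull the two views of each image together and push
    #    them away from the views of all the other images in the mini-batch
    loss = nt_xent_loss(z)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```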

Five Minute Summary

Now I will go into a bit more detail, imagining I am talking to someone who has more questions about what I am explaining.

What is self-supervised learning?

Let me run through a scenario to try to explain this.

Let’s say you want to train a model to do classification of images — maybe to classify medical images as containing cancer or not. If you can only train using example images that have already been annotated or labelled by a human doctor (as cancerous or not-cancerous), then you are doing supervised learning, right? However, you are limited by the number of example images that have been labelled. If you train an initial model using only supervised learning and realize it does not perform well, you might conclude you need to get more labeled images for training. Unfortunately, getting an expert doctor to annotate enough images to make your model perform better is very time consuming and expensive.

So what can you do? Can you do something besides supervised learning?

Yes, you can! This is where self-supervised learning comes in.

Instead of supervised training using labelled data for a task like classification, self-supervised learning is a way of training on a different task beforehand using “labels” that are derived from the data itself.

Generally, computer vision strategies that use self-supervised learning involve performing two tasks, a pretext task and a downstream task.

The downstream task is the real task you care about, like classification or detection.

The pretext task is the self-supervised learning task solved to train model weights that produce good representation vectors for input images, with the aim of using these model weights for the downstream task.

In self-supervised scenarios, “labels” are not the same kind of labels as in supervised learning. Labels in supervised learning come from a human annotator, like a doctor who goes through a bunch of medical images and labels each as containing cancer or not.

In self-supervised learning, “labels” are inherent to the data without any human intervention. For example, in SimCLR, you have images that are augmented views of each other from the same source image, as well as augmented images that come from other source images. Comparing two augmented images, the “labels” could therefore be “from the same source image” or “not from the same source image”.

Training with self-supervised learning does not explicitly help with the downstream task (e.g., classification), BUT it does help train the model weights to be better in general (i.e., it would make the representation vectors more representative of the images they are from), which should thereby help with the downstream task anyways.

And the best thing is that you don’t need data annotated by a human labeller for self-supervised tasks like SimCLR — you can use completely unlabelled datasets which expands the amount of data that you can use to train your model!

So, what is self-supervision in regards to SimCLR? In conclusion:

SimCLR is a self-supervised strategy for using unlabelled data to train your model to produce good visual representations before a real, downstream task.

Data Augmentation Module Details

As mentioned, the Data Augmentation Module transforms a given image randomly in two ways. There are three transforms that are done:

  • random cropping (and resizing back to the original size)
  • random colour distortion
  • random Gaussian blur

The authors chose these three after trying a bunch of different augmentations and found this combination to be the best.
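
As a rough sketch, such a pipeline could be built with torchvision roughly as follows; the exact strengths, probabilities, and kernel size below are my own approximations for illustration, not necessarily the paper’s settings.

```python
from torchvision import transforms

# Approximate SimCLR-style augmentation pipeline (illustrative values only)
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=224),                      # random crop, resized back
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)],
                           p=0.8),                               # random colour distortion
    transforms.RandomGrayscale(p=0.2),                           # part of the colour distortion
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)],
                           p=0.5),                               # random Gaussian blur
    transforms.ToTensor(),
])
```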

Base Encoder Details

The Base Encoder extracts representation vectors (also sometimes called feature vectors) from the augmented images. If you’ve ever read/heard someone talk about “transforming an image into a lower-dimensional space”, that’s what this Base Encoder does.

In simpler terms:

  • Input: One of the augmented images from the previous step (e.g., the input image might have a size of 256*256 pixels and have 3 colour channels, for a total of 256*256*3=196,608 numbers per image)
  • Output: A vector that somehow represents the input image, but using fewer numbers (e.g., the ResNet-50 output feature vector is made up of 2,048 numbers)

Ideally, with SimCLR, the output vectors for similar images should be similar, while the output vectors for dissimilar images should be dissimilar.

How well the Base Encoder accomplishes this depends on how well the weights are trained — essentially that is the whole point of SimCLR (and training deep learning models in general).

The Base Encoder is the model that we will use later (after training with SimCLR) to do the actual task we want to do, like classification or regression.

The original paper uses the commonly used ResNet-50 architecture (without the last layer) for the Base Encoder, but notes that pretty much any model that takes an image and yields an output vector could be used.
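
As a concrete sketch using torchvision (the input size and the `weights=None` choice here are just for illustration):

```python
import torch
from torchvision import models

encoder = models.resnet50(weights=None)   # randomly initialized; SimCLR trains it from scratch
encoder.fc = torch.nn.Identity()          # drop the final classification layer

x = torch.randn(8, 3, 224, 224)           # a mini-batch of 8 augmented images
h = encoder(x)                            # representation vectors
print(h.shape)                            # torch.Size([8, 2048])
```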

Projection Head Details

The Projection Head might sound fancy, but it is actually just a multilayer perceptron (MLP) with one hidden layer (i.e., two fully connected layers) that transforms the representation vector from the previous step to a different, smaller representation vector.

  • Input: A representation vector that is the output of the base encoder (length 2,048)
  • Output: A different representation vector (length 128)
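
In PyTorch, a minimal sketch of such a projection head (using the layer sizes reported for ResNet-50) could look like:

```python
import torch.nn as nn

projection_head = nn.Sequential(
    nn.Linear(2048, 2048),   # one hidden layer
    nn.ReLU(),
    nn.Linear(2048, 128),    # output: the 128-d vector fed to the contrastive loss
)
```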

After you have trained with SimCLR, you remove the Projection Head and only use the Base Encoder for the downstream task.

The reason you need the Projection Head is that it puts the representation vector into the format/size to which you apply the Contrastive Loss Function.

It seems like the authors only included the Projection Head because they tried both having it and not having it, and found that SimCLR worked better when it was included. Perhaps the weights that are learned in these two Projection Head layers negatively impact the downstream task.

Contrastive Loss Function Details

The formulas for the Contrastive Loss Function look like this:

The steps to calculate the total loss L in SimCLR (from the original paper). z is a representation vector after the Projection Head. N is the number of original images in a mini-batch.

I know this looks really complicated, but really the main thing to understand is that:

The loss is low if the two correlated vectors (from the same example) are similar to each other AND dissimilar to the vectors from a set of “negative” examples (all the other examples in the mini-batch; not from the same original example).

The loss is high otherwise.

Anyways, I will still try to explain in more detail.

Step 1:

The first step in calculating the total loss is to calculate the pairwise similarity for all combinations of representation vectors. If there are N original images in the mini-batch, then there are 2N representation vectors (because of the Data Augmentation Module taking each image and randomly transforming it in two ways). That is what the following formula says:

Calculating the pairwise similarity for all combinations of representation vectors. From the original paper.
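
In case the equation image does not come through: the similarity used in the paper is cosine similarity, which in LaTeX notation reads:

```latex
\mathrm{sim}(\boldsymbol{u}, \boldsymbol{v}) =
  \frac{\boldsymbol{u}^{\top}\boldsymbol{v}}{\lVert\boldsymbol{u}\rVert\,\lVert\boldsymbol{v}\rVert}
```

In other words, it is the dot product of the two vectors after normalizing them to unit length: 1.0 when they point in the same direction, 0.0 when they are orthogonal.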

Step 2:

The next step is to define the actual Contrastive Loss Function. To be specific, the Contrastive Loss Function in SimCLR is called NT-Xent (normalized temperature-scaled cross entropy loss). The formula is the following:

The NT-Xent loss function. From the original paper.
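
Written out (again, for readers without the image), the NT-Xent loss for a positive pair of vectors z_i and z_j is:

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\big(\mathrm{sim}(\boldsymbol{z}_i, \boldsymbol{z}_j)/\tau\big)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(\boldsymbol{z}_i, \boldsymbol{z}_k)/\tau\big)}
```

Here τ is the temperature hyperparameter, and the indicator 1[k ≠ i] simply excludes comparing a vector with itself in the denominator.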

The numerator (the top of the fraction) is a measure of how similar the current vectors being compared are. If they are from the same original image, we want the similarity to be high (the highest similarity value is 1.0).

The denominator (the bottom of the fraction) is the sum of how similar the current vector is to every other vector in the mini-batch, including its positive partner, so it is always at least as large as the numerator and the fraction is at most 1.0. To make the fraction higher (closer to 1.0), you want the rest of the denominator to be low. But it is a sum of a bunch of values, so how can it be low? You want all of those other values (the similarities to the “negative” vectors) to be low.

The best case is when i and j are from the same original image (which they always are, because of the third equation) and their similarity is 1.0, while the similarity between i and every other vector is 0.0. In that case (roughly speaking, ignoring the exponential and the temperature scaling) the fraction is as large as it can be, and since we are taking the negative log, the loss is as small as it can possibly be.

Step 3:

The third step is to actually calculate and sum all the losses.

The total loss of a forward pass. From the original paper.
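
In LaTeX notation, the total loss from the paper is:

```latex
\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N}
  \big[\, \ell(2k-1,\, 2k) + \ell(2k,\, 2k-1) \,\big]
```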

As I mentioned a bit earlier, we only calculate the losses for when i and j (in the second equation) are from the same image (because the (2k)th and (2k−1)th vectors are the two corresponding representation vectors from the same original image). We do this for every original image, in both orders, and average the losses (the sum divided by 2N) to get the total loss.
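
The whole calculation fits in a few lines of PyTorch. Below is my own minimal NT-Xent sketch (not the authors’ code); it assumes the 2N projection vectors are stacked so that rows i and i + N come from the two views of the same original image, and it uses the fact that the loss is just a cross entropy over the similarity scores. The default temperature of 0.5 is only one commonly used value; it is a hyperparameter you would tune.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    # z: (2N, d) projection-head outputs; rows i and i + N are the two views of one image
    two_n = z.shape[0]
    z = F.normalize(z, dim=1)                        # unit length, so z @ z.T is cosine similarity
    sim = z @ z.T / temperature                      # (2N, 2N) temperature-scaled similarities
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude the k = i terms from the denominator

    # index of each vector's positive partner: i <-> i + N
    positives = (torch.arange(two_n, device=z.device) + two_n // 2) % two_n

    # Cross entropy with the positive partner as the "correct class" computes
    # -log( exp(sim_ij) / sum_{k != i} exp(sim_ik) ) and averages it over all 2N rows,
    # which matches the total loss above (the 1/2N sum over both orderings).
    return F.cross_entropy(sim, positives)
```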

That’s it! I hope all of this helps someone out there!
