
Self-Supervised Learning

● A form of unsupervised learning where the data provides the supervision


● In general, withhold some part of the data, and task the network with predicting it
● The task defines a proxy loss, and the network is forced to learn what we really care
about, e.g. a semantic representation, in order to solve it

Components of Self-Supervised Learning

● Pretext Tasks
These are fixed tasks that require no external labelling and push the network to learn some form of semantic representation. The learned representations can then be reused for other tasks via transfer learning. Some common pretext tasks are:
● Rotation: Gidaris et al. propose to produce four copies of a single image by rotating it by {0°, 90°, 180°, 270°} and let a single network predict which rotation was applied, a 4-class classification task. Intuitively, a good model should learn to recognize canonical orientations of objects in natural images (a minimal sketch follows this list).
● Exemplar: In this technique, every individual image corresponds to its own class, and multiple examples of it are generated by heavy random data augmentation such as translation, scaling, rotation, and contrast and color shifts. A triplet loss is used to scale this pretext task to the large number of images (hence, classes) in the ImageNet dataset (see the exemplar sketch after this list).
● Jigsaw: The task is to recover the relative spatial positions of 9 randomly sampled image patches after a random permutation of these patches has been applied. All of the patches are sent through the same network; their representations from the pre-logits layer are then concatenated and passed through a fully-connected multilayer perceptron (MLP) with two hidden layers, which predicts which permutation was used. In practice, a fixed set of 100 permutations is used (see the jigsaw sketch after this list).
● Relative Patch Location: The pretext task consists of predicting the relative location of two given patches of an image. The model is similar to the Jigsaw one, but in this case the 8 possible spatial relations between the two patches need to be predicted, e.g. "below" or "on the right and above". The same patch preprocessing as in the Jigsaw model is used, and final image representations are extracted by averaging the representations of the 9 cropped patches.
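A minimal sketch of the rotation pretext task described above, written in PyTorch. The backbone name and its 4-way output head are assumptions; the point is only to show how the proxy labels are generated from the data itself.

    import torch
    import torch.nn.functional as F

    def rotation_batch(images):
        # images: (B, C, H, W). Produce 4 rotated copies per image and the
        # matching rotation label in {0, 1, 2, 3} for 0/90/180/270 degrees.
        rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
        labels = torch.arange(4).repeat_interleave(images.size(0))
        return rotated, labels

    def rotation_loss(backbone, images):
        # The (hypothetical) backbone is trained as a plain 4-class classifier
        # on the proxy labels; no external annotation is needed.
        x, y = rotation_batch(images)
        logits = backbone(x)            # expected shape: (4*B, 4)
        return F.cross_entropy(logits, y)

Training on this loss alone is what pushes the network toward orientation-aware, and hence broadly semantic, features.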
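For the Exemplar task, a hedged sketch of the triplet-loss variant mentioned above; augment stands for the heavy random augmentation pipeline and is assumed to be provided, and the margin value is illustrative rather than taken from the paper.

    import torch.nn as nn

    triplet = nn.TripletMarginLoss(margin=0.5)   # margin chosen for illustration

    def exemplar_loss(backbone, images, other_images, augment):
        # Two augmentations of the same image form the anchor/positive pair;
        # an augmentation of a different image serves as the negative.
        anchor = backbone(augment(images))
        positive = backbone(augment(images))
        negative = backbone(augment(other_images))
        return triplet(anchor, positive, negative)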
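And a sketch of the Jigsaw label construction: a fixed set of 100 permutations of the 9 patch positions defines the class space. Here the permutations are sampled at random for simplicity, whereas the original work selects them more carefully; the patch extraction and the MLP head are assumed to exist elsewhere.

    import random

    def make_permutation_set(n_perms=100, n_patches=9, seed=0):
        # Sample a fixed, reproducible set of permutations; the index of a
        # permutation in this set is the classification label.
        rng = random.Random(seed)
        perms = set()
        while len(perms) < n_perms:
            p = list(range(n_patches))
            rng.shuffle(p)
            perms.add(tuple(p))
        return sorted(perms)

    PERMS = make_permutation_set()

    def jigsaw_example(patches):
        # patches: list of 9 patch tensors in their original spatial order.
        # Returns the shuffled patches and the index of the permutation used.
        label = random.randrange(len(PERMS))
        order = PERMS[label]
        return [patches[i] for i in order], label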

● Architecture of CNN Models


Different variants of ResNet50 (v1 and v2) and VGG have been used for training the pretext tasks. The Google Brain team made several notable findings about these models:
● Better results on a pretext task do not directly correspond to better representations. For example, VGG outperformed every other model on the Rotation pretext task, but when its learned representations were used for the VOC2007 image classification task, its performance was poor compared to ResNet's.
● Model width and representation size strongly influence the performance of self-supervised tasks, even more so than in supervised learning with labels.
● The better performance of ResNet is attributed to its skip connections. Models like VGG, which have no skip connections, tend to perform better at earlier layers than at the pre-logits layer, because the final layers' weights are specific to the pretext problem and not generic enough to transfer to other tasks. Skip connections make residual units invertible to some degree, so later layers do not discard the information contained in earlier blocks.

● Evaluation of different models


For a fair comparison of the representations learnt by different models, the same downstream model (preferably linear) is trained for a specific task with the same set of hyperparameters. A linear model is a good choice for evaluation because a larger model could mask the effect of changing the width or size of the network used to train the pretext task.
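A minimal sketch of this linear evaluation protocol, assuming features have already been extracted from the frozen pretext-trained network and saved as NumPy arrays (the file names and hyperparameters below are placeholders, not values from the papers):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Frozen features and labels for the downstream task, assumed precomputed.
    train_feats = np.load("train_features.npy")    # shape (N, D)
    train_labels = np.load("train_labels.npy")
    val_feats = np.load("val_features.npy")
    val_labels = np.load("val_labels.npy")

    # The same linear model and hyperparameters are used for every
    # representation being compared, so only the features differ.
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(train_feats, train_labels)
    print("downstream accuracy:", clf.score(val_feats, val_labels))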

Facebook AI Self-Supervised Learning Challenge (ICCV 2019)

Quoting the challenge details page:

The Facebook AI Self-Supervision Learning challenge (FASSL) aims to benchmark self-supervised visual representations on a diverse set of tasks and datasets using a standardized transfer learning setup. In this first iteration, we base our challenge on the following subset of the tasks defined in Goyal et al., 2019:

● CLS-Places205: Image Classification on the Places205 dataset. Training a linear classifier on fixed feature representations.
● CLS-VOC07: Image Classification on the VOC2007 dataset. Training a linear
SVM on fixed feature representations.
● DET-VOC07: Object Detection on the VOC2007 dataset. Training only the ROI
heads of a Fast R-CNN model with precomputed proposals.
● Lowshot-VOC07: Image Classification on the VOC2007 dataset using a small
subset of the training data. Training a linear SVM on fixed feature
representations.

Link to challenge page
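As a rough illustration of the fixed-feature setup shared by the classification tasks above, the following sketch trains one binary linear SVM per VOC class on frozen features; the file names and the cost parameter are assumptions, not challenge defaults.

    import numpy as np
    from sklearn.svm import LinearSVC

    feats = np.load("voc07_train_features.npy")    # (N, D) frozen features, assumed precomputed
    labels = np.load("voc07_train_labels.npy")     # (N, 20) binary multi-label matrix

    # VOC2007 classification is multi-label, so one binary SVM is fit per class.
    svms = []
    for c in range(labels.shape[1]):
        clf = LinearSVC(C=0.01)                    # cost value for illustration only
        clf.fit(feats, labels[:, c])
        svms.append(clf)

    # Per-class confidence scores on held-out features, e.g. for computing mAP.
    test_feats = np.load("voc07_test_features.npy")
    scores = np.stack([clf.decision_function(test_feats) for clf in svms], axis=1)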


References

1. https://arxiv.org/pdf/1901.09005.pdf
2. https://arxiv.org/pdf/1905.01235.pdf
3. https://arxiv.org/pdf/1505.05192.pdf
4. https://github.com/facebookresearch/fair_self_supervision_benchmark
5. https://github.com/google/revisiting-self-supervised
6. https://towardsdatascience.com/self-supervised-learning-78bdd989c88b
