Generative Pretraining From Pixels
Mark Chen 1 Alec Radford 1 Rewon Child 1 Jeff Wu 1 Heewoo Jun 1 David Luan 1 Ilya Sutskever 1
Figure 1. An overview of our approach. First, we pre-process raw images by resizing to a low resolution and reshaping into a 1D sequence.
We then choose one of two pre-training objectives, auto-regressive next pixel prediction or masked pixel prediction. Finally, we evaluate
the representations learned by these objectives with linear probes or fine-tuning.
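The pre-processing step in Figure 1 can be sketched in a few lines. The following is a minimal NumPy illustration only: the function names are hypothetical, nearest-neighbor indexing stands in for proper resampling, and for brevity k-means is fit on a single image with a small k, whereas the paper (Section 3.2) clusters (R, G, B) values over the corpus with k = 512.

```python
import numpy as np

def kmeans_palette(pixels, k, iters=10, seed=0):
    """Naive k-means over (R, G, B) values: returns per-pixel cluster ids."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        ids = d.argmin(1)
        # Move each center to the mean of its assigned pixels.
        for c in range(k):
            if (ids == c).any():
                centers[c] = pixels[ids == c].mean(0)
    return ids

def image_to_sequence(img, size=32, k=16):
    """Resize, color-quantize, and flatten an image into raster-order tokens."""
    h, w, _ = img.shape
    # Nearest-neighbor downsample to size x size.
    small = img[np.arange(size) * h // size][:, np.arange(size) * w // size]
    tokens = kmeans_palette(small.reshape(-1, 3).astype(float), k)
    return tokens  # shape (size * size,), values in [0, k)
```

The resulting 1D token sequence is what the transformer consumes under either pre-training objective.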
representations that significantly improve the state of the art in low-resolution unsupervised representation learning settings.

This is especially promising as our architecture uses a dense connectivity pattern which does not encode the 2D spatial structure of images yet is able to match and even outperform approaches which do. We report a set of experiments characterizing the performance of our approach on many datasets and in several different evaluation settings (low data, linear evaluation, full fine-tuning). We also conduct several experiments designed to better understand the achieved performance of these models. We investigate how representations are computed inside our model via the performance of linear probes as a function of model depth, as well as studying how scaling the resolution and parameter count of the approach affects performance.

2. Approach

Our approach consists of a pre-training stage followed by a fine-tuning stage. In pre-training, we explore both the auto-regressive and BERT objectives. We also apply the sequence Transformer architecture to predict pixels instead of language tokens.

One way to measure representation quality is to fine-tune for image classification. Fine-tuning adds a small classification head to the model, used to optimize a classification objective, and adapts all weights. Pre-training can be viewed as a favorable initialization or as a regularizer when used in combination with early stopping (Erhan et al., 2010).

Another approach for measuring representation quality uses the pre-trained model as a feature extractor. In particular, given labeled examples (X, Y), the model is applied to X to produce features f_X. Then, a linear classifier is trained on (f_X, Y). Linear probing captures the intuition that good features should linearly separate the classes of transfer tasks. Furthermore, linear probes help disentangle feature quality from model architecture: in fine-tuning, one model may outperform another because its architecture is more suited for the downstream task rather than because of better pre-training.

We begin this section by defining the auto-regressive and BERT objectives in the context of images. Next, we outline implementation details for our transformer decoder. Finally, we describe how the transformer is used for fine-tuning and how features are extracted for linear probes.

2.1. Pre-training

Given an unlabeled dataset X consisting of high dimensional data x = (x_1, ..., x_n), we can pick a permutation \pi of the set [1, n] and model the density p(x) auto-regressively as follows:

p(x) = \prod_{i=1}^{n} p(x_{\pi_i} \mid x_{\pi_1}, \ldots, x_{\pi_{i-1}}, \theta)

When working with images, we pick the identity permutation \pi_i = i for 1 \le i \le n, also known as raster order. We train our model by minimizing the negative log-likelihood of the data:

L_{AR} = \mathbb{E}_{x \sim X} [-\log p(x)]

We also consider the BERT objective, which samples a sub-sequence M \subset [1, n] such that each index i independently has probability 0.15 of appearing in M. We call M the BERT mask, and we train our model by minimizing the negative log-likelihood of the "masked" elements x_M conditioned on the "unmasked" ones x_{[1,n] \setminus M}:

L_{BERT} = \mathbb{E}_{x \sim X} \, \mathbb{E}_{M} \sum_{i \in M} \left[ -\log p(x_i \mid x_{[1,n] \setminus M}) \right]

In pre-training, we pick one of L_{AR} or L_{BERT} and minimize the loss over our pre-training dataset.

2.2. Architecture

The transformer decoder takes an input sequence x_1, ..., x_n of discrete tokens and produces a d-dimensional embedding for each position. The decoder is realized as a stack of L blocks, the l-th of which produces an intermediate embedding h_1^l, ..., h_n^l also of dimension d. We use the GPT-2
(Radford et al., 2019) formulation of the transformer decoder block, which acts on an input tensor h^l as follows:

n^l = layer_norm(h^l)
a^l = h^l + multihead_attention(n^l)
h^{l+1} = a^l + mlp(layer_norm(a^l))

In particular, layer norms precede both the attention and mlp operations, and all operations lie strictly on residual paths. We find that such a formulation allows us to scale the transformer with ease.

The only mixing across sequence elements occurs in the attention operation, and to ensure proper conditioning when training the AR objective, we apply the standard upper triangular mask to the n × n matrix of attention logits. When using the BERT objective, no attention logit masking is required: after applying content embeddings to the input sequence, we zero out the positions in M.

Additionally, since we learn independent position embeddings for each sequence element, our BERT model has no positional inductive biases (i.e. it is permutation invariant). Put another way, any spatial relationships between positions must be learned by the model at train time. This is not entirely true for the AR model, as choosing the raster order also fixes a prespecified ordering of the conditionals. Nevertheless, permutation invariance is a property in strong contrast to convolutional neural networks, which incorporate the inductive bias that features should arise from spatially proximate elements.

Following the final transformer layer, we apply a layer norm n^L = layer_norm(h^L), and learn a projection from n^L to logits parameterizing the conditional distributions at each sequence element. When training BERT, we simply ignore the logits at unmasked positions.

2.3. Fine-tuning

When fine-tuning, we average pool n^L across the sequence dimension to extract a d-dimensional vector of features per example:

f^L = \langle n_i^L \rangle_i

We learn a projection from f^L to class logits, which we use to minimize a cross entropy loss L_{CLF}.

While fine-tuning on L_{CLF} yields reasonable downstream performance, we find empirically that the joint objective

L_{GEN} + L_{CLF}

with L_{GEN} \in \{L_{AR}, L_{BERT}\} works even better. Similar findings were reported by Radford et al. (2018).

2.4. Linear Probing

Extracting fixed features for linear probing follows a similar procedure to fine-tuning, except that average pooling is not always at the final layer:

f^l = \langle n_i^l \rangle_i

where 0 \le l \le L. We will show in the experiments section that the best features often lie in the middle of the network. As in fine-tuning, we project these intermediate features to produce class logits. Because we view the features as fixed when linear probing, this projection contains the only trainable weights, so we can only optimize L_{CLF}.

3. Methodology

Although supervised pre-training is the dominant paradigm for image classification, curating large labeled image datasets is both expensive and time consuming. Instead of further scaling up labeling efforts, we can aspire to learn general purpose representations from the much larger set of available unlabeled images and fine-tune them for classification. We investigate this setting using ImageNet as a proxy for a large unlabeled corpus, and small classic labeled datasets (CIFAR-10, CIFAR-100, STL-10) as proxies for downstream tasks.

Even in cases where labels are available, unsupervised or self-supervised pre-training can still provide benefits in data efficiency or fine-tuning speed. We investigate this setting by pre-training on ImageNet without labels and then fine-tuning or linear probing with labels.

3.1. Dataset and Data Augmentation

We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set, and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100, and STL-10, we split off 10% of the provided training set instead. We ignore the provided unlabeled examples in STL-10, which constitute a subset of ImageNet.

When pre-training or fine-tuning on ImageNet, we make use of lightweight data augmentation. First, we randomly resize the image such that the shorter sidelength is in the range [256, 384]. Next, we take a random 224 × 224 crop. When evaluating on ImageNet, we resize the image such that the shorter sidelength is 224, and use the single 224 × 224 center crop.

When full-network fine-tuning on CIFAR-10 and CIFAR-100, we use the augmentation popularized by Wide Residual Networks: 4 pixels are reflection padded on each side, and a 32 × 32 crop is randomly sampled from the padded image or its horizontal flip (Zagoruyko & Komodakis, 2016).

Once optimal hyperparameters are found, we fold our experimental validation set back into the training set, retrain the model, and report numbers on the respective test set.
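The pre-activation residual structure of the decoder block in Section 2.2 maps directly onto code. The sketch below is a minimal NumPy version only: single-head attention stands in for the multi-head attention of GPT-2, a ReLU MLP stands in for its GELU MLP, and all weight matrices are hypothetical placeholders rather than the paper's trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's embedding to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(n, W_qkv, W_o, causal=True):
    """Single-head self-attention (the paper uses multi-head)."""
    q, k, v = np.split(n @ W_qkv, 3, axis=-1)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if causal:  # upper-triangular mask for proper AR conditioning
        logits += np.triu(np.full(logits.shape, -1e9), k=1)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ v) @ W_o

def mlp(x, W1, W2):
    """Position-wise MLP (ReLU stand-in for GELU)."""
    return np.maximum(x @ W1, 0) @ W2

def block(h, params):
    """h^{l+1} = a^l + mlp(ln(a^l)),  where  a^l = h^l + attn(ln(h^l))."""
    a = h + attention(layer_norm(h), params["W_qkv"], params["W_o"])
    return a + mlp(layer_norm(a), params["W1"], params["W2"])
```

Note that, as in the paper, the layer norms precede the attention and MLP, and both sub-layers sit strictly on residual paths; with `causal=True`, position i's output depends only on positions 1, ..., i.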
3.2. Context Reduction

Because the memory requirements of the transformer decoder scale quadratically with context length when using dense attention, we must employ further techniques to reduce context length. If we naively trained a transformer on a sequence of length 224^2 × 3, our attention logits would be tens of thousands of times larger than those used in language models and even a single layer would not fit on a GPU. To deal with this, we first resize our image to a lower resolution, which we call the input resolution (IR). Our models have an IR of 32^2 × 3, 48^2 × 3, 96^2 × 3, or 192^2 × 3.

An IR of 32^2 × 3 is still quite computationally intensive. While even lower IRs are tempting, prior work has demonstrated that human performance on image classification begins to drop rapidly below this size (Torralba et al., 2008). When using an IR of 32^2 × 3 or 48^2 × 3, we instead further reduce context size by a factor of 3 by clustering (R, G, B) pixel values using k-means with k = 512. A similar approach was applied to spatial patches by Ranzato et al. (2014). We call the resulting context length (32^2 or 48^2) the model resolution (MR). Note that this reduction breaks permutation invariance of the color channels, but keeps the model spatially invariant.

To push performance on ImageNet linear probes, we also work with IRs of 96^2 × 3 or 192^2 × 3. Here, clustering pixels alone produces a context that is still too large. Using a VQ-VAE (van den Oord et al., 2017) with a latent grid size of 48^2, we can downsample our images and stay at a MR of 48^2. We choose a latent vocabulary size of 4096, the lowest size at which we do not observe reconstruction artifacts. Our clustering of (R, G, B) values can also be interpreted as the training of a VQ-VAE with an identity encoder and decoder.

For the VQ-VAE architecture, we choose a small encoder and decoder pair (< 1M parameters) to aid the sequence Transformer in modeling latent codes. Although downsampling with VQ-VAE destroys spatial permutation invariance, the receptive field for a latent code is only 16 × 16 for an IR of 96^2 × 3 and 34 × 34 for an IR of 192^2 × 3. Because the encoder is so small, information stays relatively local.

3.3. Model

Our largest model, iGPT-L, is essentially identical to GPT-2. Both models contain L = 48 layers, but we use an embedding size of d = 1536 (vs 1600), resulting in a slightly reduced parameter count (1.4B vs 1.5B). We use the same model code as GPT-2, except that we initialize weights in the layer-dependent fashion as in Sparse Transformer (Child et al., 2019) and zero-initialize all projections producing logits.

We also train iGPT-M, a 455M parameter model with L = 36 and d = 1024, and iGPT-S, a 76M parameter model with L = 24 and d = 512, to study the effect of model capacity on representation quality in a generative model.

3.4. Training

When pre-training, we use a batch size of 128 and train for 1,000,000 iterations using Adam with β1 = 0.9 and β2 = 0.95. We sequentially try the learning rates 0.01, 0.003, 0.001, 0.0003, ..., stopping at the first local minimum. The learning rate is warmed up for one epoch, and then decays to 0 following a cosine schedule. No dropout is used.

When fine-tuning, we use the same batch size and Adam hyperparameters. Here, we do not employ a cosine schedule, and early stop once we reach the maximum validation accuracy. Again, no dropout is used.

When running a linear probe on ImageNet, we follow recent literature and use SGD with momentum 0.9 and a high learning rate (we try the values 30, 10, 3, ... in the manner described above) (He et al., 2019). We train for 1,000,000 iterations with a cosine learning rate schedule. Finally, when running a linear probe on CIFAR-10, CIFAR-100, or STL-10, we use the L-BFGS algorithm for consistency with prior results (Pedregosa et al., 2011).

4. Experiments and Results

We begin with experiments and results from the autoregressive formulation of iGPT. Comparisons with the BERT formulation appear in Section 4.6.

4.1. What Representation Works Best in a Generative Model Without Latent Variables?

Figure 2. Representation quality heavily depends on the layer from which we extract features. In contrast with supervised models, the best representations for these generative models lie in the middle of the network. We plot this unimodal dependence on depth showing linear probes on CIFAR-10, CIFAR-100, and STL-10.

In supervised pre-training, representation quality tends to increase monotonically with depth, such that the best representations lie at the penultimate layer (Zeiler & Fergus, 2014). Indeed, since a linear layer produces logits from pre-logits, a high performing classifier necessarily achieves high
Table 2. Comparing linear probe accuracies between our models and state-of-the-art self-supervised models. A blank input resolution (IR) corresponds to a model working at standard ImageNet resolution. We report the best performing configuration for each contrastive method, finding that our models achieve comparable performance.

Method     IR         Params (M)  Features  Acc
Rotation   orig.      86          8192      55.4
iGPT-L     32^2 · 3   1362        1536      60.3
BigBiGAN   orig.      86          8192      61.3
iGPT-L     48^2 · 3   1362        1536      65.2
AMDIM      orig.      626                   68.1
MoCo       orig.      375         8192      68.6
iGPT-L     192^2 · 3  1362        16896     69.0
CPC v2     orig.      303         8192      71.5

Table 3. Comparing fine-tuning performance between our models and state-of-the-art models utilizing supervised ImageNet transfer. We also include AutoAugment, the best performing model trained end-to-end on CIFAR. Table results: AutoAugment (Cubuk et al., 2019), GPipe (Huang et al., 2019), EfficientNet (Tan & Le, 2019).

Model         Acc   Unsup Transfer  Sup Transfer
CIFAR-10
AutoAugment   98.5
GPipe         99.0                  √
iGPT-L        99.0  √
CIFAR-100
iGPT-L        88.5  √
AutoAugment   89.3
EfficientNet  91.7                  √
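The linear probe protocol behind Table 2 reduces to fitting a single linear layer on frozen features. The following is a minimal sketch only, using plain softmax regression trained with gradient descent (the paper instead uses L-BFGS via scikit-learn for the CIFAR probes and SGD with momentum for ImageNet), and the feature matrix stands in for real intermediate activations n^l.

```python
import numpy as np

def linear_probe(feats, labels, n_classes, lr=0.5, steps=500):
    """Fit only a linear classifier on frozen features (softmax regression).
    All feature-producing weights stay fixed; W is the sole trainable tensor."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        W -= lr * feats.T @ (p - onehot) / n  # gradient of mean cross-entropy
    return W

def probe_accuracy(W, feats, labels):
    """Classification accuracy of the fitted probe."""
    return ((feats @ W).argmax(1) == labels).mean()
```

Because only W is optimized, probe accuracy measures how linearly separable the classes are in the frozen feature space, which is exactly the quantity the tables above compare across methods.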
5. Related Work

Many generative models have been developed and evaluated for their representation learning capabilities. Notably, GANs (Goodfellow et al., 2014; Radford et al., 2015; Donahue et al., 2016) and VAEs (Kingma & Welling, 2013; Kingma et al., 2014; Higgins et al., 2017) have been well-studied.

As of yet, most generative model based approaches have not been competitive with supervised and self-supervised methods in the image domain. A notable exception is BigBiGAN (Donahue & Simonyan, 2019), which first demonstrated that sufficiently high fidelity generative models learn image representations which are competitive with other self-supervised methods.

Many self-supervised approaches focus on designing auxiliary objectives which support the learning of useful representations without attempting to directly model the input data. Examples include surrogate classification (Dosovitskiy et al., 2015), jigsaw puzzle solving (Noroozi & Favaro, 2016), and rotation prediction (Gidaris et al., 2018). A cluster of similar approaches based on contrastive losses comparing various views and transformations of input images has recently driven significant progress in self-supervised learning (Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019).

Among contrastive approaches, our work is most similar to Contrastive Predictive Coding (Oord et al., 2018), which also utilizes an autoregressive prediction objective, but in a learned latent space, and to Selfie (Trinh et al., 2019), which trains a bidirectional self-attention architecture on top of a standard convolutional network to differentiate correct vs. wrong patches.

Our work is directly inspired by the success of generative pre-training methods developed for Natural Language Processing. These methods predict some parts of a piece of text conditioned on other parts. Our work explores two training objectives in this framework: autoregressive prediction, as originally explored for modern neural sequence models by Dai & Le (2015), and a denoising objective, similar to BERT (Devlin et al., 2018). The context in-painting approach of Pathak et al. (2016) also explores pre-training by predicting corruptions, but predicts large regions of high-resolution images.

Kolesnikov et al. (2019); Goyal et al. (2019) conducted rigorous investigations of existing self-supervised methods. Several of our findings are consistent with their results, including the benefits of scale and the non-monotonic performance of representations with depth in certain architectures.

Expressive autoregressive models tractably optimizing likelihood were first applied to images by Uria et al. (2013) and popularized by Oord et al. (2016), serving as the basis for several papers similarly adapting transformers to the problem of generative image modeling (Parmar et al., 2018; Child et al., 2019).

Ke et al. (2018) introduced the pixel-by-pixel CIFAR10 task and first benchmarked the performance of a 1D sequence transformer on a competitive image classification dataset. Rives et al. (2019) similarly investigates whether the recent success of unsupervised pre-training in NLP applies to other domains, observing promising results on protein sequence data.

6. Discussion and Conclusion

Our results suggest that generative image modeling continues to be a promising route to learn high-quality unsupervised image representations. Simply predicting pixels learns state of the art representations for low resolution datasets. In high resolution settings, our approach is also competitive with other self-supervised results on ImageNet.

However, our experiments also demonstrate several areas for improvement. We currently model low resolution inputs with self-attention. By comparison, most other self-supervised results use CNN based encoders that easily work with high resolution images. It is not immediately obvious how to best bridge the gap between performant autoregressive and discriminative models. Additionally, we observed that our approach requires large models in order to learn high quality representations. iGPT-L has 2 to 3 times as many parameters as similarly performing models on ImageNet and uses more compute.

Although dense self-attention was a deliberate choice for this work due to it being domain agnostic and widely used in NLP, it becomes very memory and computationally expensive due to its quadratic scaling with sequence length. We mitigated this via the context reduction techniques discussed in Section 3.2, but it is still a significant limitation. Future work could instead address this via architectural changes, exploring more efficient self-attention approaches. Several promising techniques have recently been developed, such as local 2D relative attention (Bello et al., 2019; Ramachandran et al., 2019), sparse attention patterns (Child et al., 2019), locality sensitive hashing (Kitaev et al., 2020), and multiscale modeling (Menick & Kalchbrenner, 2018).

Finally, our results, considered together with Donahue & Simonyan (2019), suggest revisiting the representation learning capabilities of other families of generative models such as flows (Dinh et al., 2014; Kingma & Dhariwal, 2018) and VAEs, in order to study whether they show similarly competitive representation learning capabilities.
References

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.

Cubuk, E., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation strategies from data, 2019.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087, 2015.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.

Donahue, J. and Simonyan, K. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551, 2019.

Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

Goyal, P., Mahajan, D., Gupta, A., and Misra, I. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400, 2019.

Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, pp. 1764–1772, 2014.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. 2017.

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

Huang, P.-S., Avron, H., Sainath, T. N., Sindhwani, V., and Ramabhadran, B. Kernel methods match deep neural networks on timit. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 205–209. IEEE, 2014.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pp. 103–112, 2019.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Ke, N. R., Goyal, A. G. A. P., Bilaniuk, O., Binas, J., Mozer, M. C., Pal, C., and Bengio, Y. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in neural information processing systems, pp. 7640–7651, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589, 2014.

Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1920–1929, 2019.

Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414, 2019.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.

Lasserre, J. A., Bishop, C. M., and Minka, T. P. Principled hybrids of generative and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pp. 87–94. IEEE, 2006.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pp. 609–616, 2009.

May, A., Garakani, A. B., Lu, Z., Guo, D., Liu, K., Bellet, A., Fan, L., Collins, M., Hsu, D., Kingsbury, B., et al. Kernel approximation methods for speech recognition. arXiv preprint arXiv:1701.03577, 2017.

Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608, 2018.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.

Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019.

Mohamed, A.-r., Dahl, G., and Hinton, G. Deep belief networks for phone recognition. 2009.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Paine, T. L., Khorrami, P., Han, W., and Huang, T. S. An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597, 2014.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544, 2016.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.

Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., and Chopra, S. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.

Rives, A., Goyal, S., Meier, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., and Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, pp. 622803, 2019.

Sandler, M., Baccash, J., Zhmoginov, A., and Howard, A. Non-discriminative data or weak model? On the relative importance of data and model resolution. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.

Tan, M. and Le, Q. V. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204, 2017.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958–1970, 2008.

Trinh, T. H., Luong, M.-T., and Le, Q. V. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.

Uria, B., Murray, I., and Larochelle, H. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pp. 2175–2183, 2013.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103, 2008.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
A.2. VQ-VAE
Our VQ-VAE models largely follow the original approach in (van den Oord et al., 2017). Each encoder block consists of a downsampling convolution, ReLU, and a residual