
Blog

The latest from Google Research

ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Tuesday, May 11, 2021
Posted by Chao Jia and Yinfei Yang, Software Engineers, Google Research

Learning good visual and vision-language representations is critical to solving computer vision
problems — image retrieval, image classification, video understanding — and can enable the
development of tools and products that change people’s daily lives. For example, a good vision-
language matching model can help users find the most relevant images given a text description or
an image input and help tools such as Google Lens find more fine-grained information about an
image.

To learn such representations, current state-of-the-art (SotA) visual and vision-language models
rely heavily on curated training datasets that require expert knowledge and extensive labels. For
vision applications, representations are mostly learned on large-scale datasets with explicit class
labels, such as ImageNet, OpenImages, and JFT-300M. For vision-language applications, popular
pre-training datasets, such as Conceptual Captions and Visual Genome Dense Captions, all require
non-trivial data collection and cleaning steps, limiting the size of datasets and thus hindering the
scale of the trained models. In contrast, natural language processing (NLP) models have achieved
SotA performance on GLUE and SuperGLUE benchmarks by utilizing large-scale pre-training on raw
text without human labels.

In "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision", to
appear at ICML 2021, we propose bridging this gap with publicly available image alt-text data
(written copy that appears in place of an image on a webpage if the image fails to load on a user's
screen) in order to train larger, state-of-the-art vision and vision-language models. To that end, we
leverage a noisy dataset of over one billion image and alt-text pairs, obtained by skipping the expensive
filtering and post-processing steps used in the Conceptual Captions dataset. We show that the scale of
our corpus can make up for the noisy data and leads to SotA representations, achieving strong
performance when transferred to classification tasks such as ImageNet and VTAB. The aligned
visual and language representations also set new SotA results on Flickr30K and MS-COCO
benchmarks, even when compared with more sophisticated cross-attention models, and enable
zero-shot image classification and cross-modality search with complex text and text + image
queries.

Creating the Dataset

Alt-texts usually provide a description of what the image is about, but the dataset is “noisy”
because some text may be partly or wholly unrelated to its paired image.

Example image-text pairs randomly sampled from the training dataset of ALIGN. One clearly noisy text label is marked in
italics.

In this work, we follow the methodology of constructing the Conceptual Captions dataset to get a
version of raw English alt-text data (image and alt-text pairs). While the Conceptual Captions
dataset was cleaned by heavy filtering and post-processing, this work scales up visual and vision-
language representation learning by relaxing most of the cleaning steps in the original work.
Instead, we only apply minimal frequency-based filtering. The result is a much larger but noisier
dataset of 1.8B image-text pairs.

ALIGN: A Large-scale ImaGe and Noisy-Text Embedding

For the purpose of building larger and more powerful models easily, we employ a simple dual-
encoder architecture that learns to align visual and language representations of the image and text
pairs. Image and text encoders are learned via a contrastive loss (formulated as normalized
softmax) that pushes the embeddings of matched image-text pairs together while pushing those of
non-matched image-text pairs (within the same batch) apart. The large-scale dataset makes it
possible for us to scale up the model size to be as large as EfficientNet-L2 (image encoder) and
BERT-large (text encoder) trained from scratch. The learned representation can be used for
downstream visual and vision-language tasks.
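
As a rough illustration of this training objective, the sketch below computes an in-batch normalized-softmax contrastive loss in numpy; the batch size, embedding dimension, and temperature are arbitrary placeholders rather than ALIGN's actual settings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(image_emb, text_emb, temperature=0.05):
    """In-batch normalized-softmax loss: matched pairs sit on the diagonal of the
    similarity matrix and all other in-batch pairs act as negatives. The temperature
    value here is a placeholder, not ALIGN's trained parameter."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # [batch, batch] similarity matrix
    labels = np.arange(len(logits))               # i-th image matches i-th text

    def softmax_xent(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (softmax_xent(logits, labels) + softmax_xent(logits.T, labels))

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```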


Figure of ImageNet credit to (Krizhevsky et al. 2012) and VTAB figure credit to (Zhai et al. 2019)

The resulting representation can be used for vision-only or vision-language task transfer. Without
any fine-tuning, ALIGN powers cross-modal search – image-to-text search, text-to-image search,
and even search with joint image+text queries; examples are shown below.

Evaluating Retrieval and Representation

The learned ALIGN model with BERT-Large and EfficientNet-L2 as text and image encoder
backbones achieves SotA performance on multiple image-text retrieval tasks (Flickr30K and MS-
COCO) in both zero-shot and fine-tuned settings, as shown below.

Setting       Model         Flickr30K R@1 (1K test set)      MS-COCO R@1 (5K test set)
                            image → text    text → image     image → text    text → image
Zero-shot     ImageBERT     70.7            54.3             44.0            32.3
              UNITER        83.6            68.7             -               -
              CLIP          88.0            68.7             58.4            37.8
              ALIGN         88.6            75.7             58.6            45.6
Fine-tuned    GPO           88.7            76.1             68.1            52.7
              UNITER        87.3            75.6             65.7            52.9
              ERNIE-ViL     88.1            76.7             -               -
              VILLA         87.9            76.3             -               -
              Oscar         -               -                73.5            57.5
              ALIGN         95.3            84.9             77.0            59.9

Image-text retrieval results (recall@1) on Flickr30K and MS-COCO datasets (both zero-shot and fine-tuned). ALIGN
significantly outperforms existing methods including the cross-modality attention models that are too expensive for large-
scale retrieval applications.

ALIGN is also a strong image representation model. Shown below, with frozen features, ALIGN
slightly outperforms CLIP and achieves a SotA result of 85.5% top-1 accuracy on ImageNet. With
fine-tuning, ALIGN achieves higher accuracy than most generalist models, such as BiT and ViT, and
is only worse than Meta Pseudo Labels, which requires deeper interaction between ImageNet
training and large-scale unlabeled data.

Model (backbone)                        Acc@1 w/ frozen features    Acc@1    Acc@5
WSL (ResNeXt-101 32x48d)                83.6                        85.4     97.6
CLIP (ViT-L/14)                         85.4                        -        -
BiT (ResNet152 x 4)                     -                           87.54    98.46
NoisyStudent (EfficientNet-L2)          -                           88.4     98.7
ViT (ViT-H/14)                          -                           88.55    -
Meta-Pseudo-Labels (EfficientNet-L2)    -                           90.2     98.8
ALIGN (EfficientNet-L2)                 85.5                        88.64    98.67

ImageNet classification results comparison with supervised training (fine-tuning).

Zero-Shot Image Classification

Traditionally, image classification problems treat each class as an independent ID, and people have
to train the classification layers with at least a few shots of labeled data per class. The class names
are actually also natural language phrases, so we can naturally extend the image-text retrieval
capability of ALIGN for image classification without any training data.

The pre-trained image and text encoder can directly be used in classifying an image into a set of classes by retrieving the
nearest class name in the aligned embedding space. This approach does not require any training data for the defined class
space.
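
A minimal sketch of this retrieval-as-classification idea is shown below; `encode_text` is a hypothetical stand-in for a text encoder that returns L2-normalized embeddings, and the prompt template is illustrative rather than the exact prompt set used.

```python
import numpy as np

def zero_shot_classify(image_embedding, class_names, encode_text):
    """Classify an image by retrieving the nearest prompted class name in the
    shared embedding space. `encode_text` is a hypothetical helper returning
    L2-normalized text embeddings; the prompt template is illustrative."""
    prompts = [f"a photo of a {name}" for name in class_names]
    class_embs = np.stack([encode_text(p) for p in prompts])   # [num_classes, dim]
    scores = class_embs @ image_embedding                      # cosine similarities
    return class_names[int(np.argmax(scores))], scores

# Toy stand-in encoder so the sketch runs end to end (predictions are arbitrary).
rng = np.random.default_rng(0)
def toy_text_encoder(text, dim=64):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

label, _ = zero_shot_classify(toy_text_encoder("a photo of a dog"),
                              ["dog", "cat", "car"], toy_text_encoder)
print(label)
```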

On the ImageNet validation dataset, ALIGN achieves 76.4% top-1 zero-shot accuracy and shows
great robustness in different variants of ImageNet with distribution shifts, similar to the concurrent
work CLIP. We also use the same text prompt engineering and ensembling as in CLIP.

          ImageNet    ImageNet-R    ImageNet-A    ImageNet-V2
CLIP      76.2        88.9          77.2          70.1
ALIGN     76.4        92.2          75.8          70.1

Top-1 accuracy of zero-shot classification on ImageNet and its variants.

Application in Image Search

To illustrate the quantitative results above, we build a simple image retrieval system with the
embeddings trained by ALIGN and show the top 1 text-to-image retrieval results for a handful of
text queries from a 160M image pool. ALIGN can retrieve precise images given detailed
descriptions of a scene, or fine-grained or instance-level concepts like landmarks and artworks.
These examples demonstrate that the ALIGN model can align images and texts with similar
semantics, and that ALIGN can generalize to novel complex concepts.

Image retrieval with fine-grained text queries using ALIGN's embeddings.

Multimodal (Image+Text) Query for Image Search

A surprising property of word vectors is that word analogies can often be solved with vector
arithmetic; a common example is "king – man + woman = queen". Such linear relationships between
image and text embeddings also emerge in ALIGN.

Specifically, given a query image and a text string, we add their ALIGN embeddings together and
use the sum to retrieve relevant images using cosine similarity, as shown below. These examples not only
demonstrate the compositionality of ALIGN embeddings across vision and language domains, but
also show the feasibility of searching with a multi-modal query. For instance, one could now look
for the "Australia" or "Madagascar" equivalence of pandas, or turn a pair of black shoes into
identically-looking beige shoes. Also, it is possible to remove objects/attributes from a scene by
performing subtraction in the embedding space, shown below.
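
The sketch below illustrates the idea with placeholder numpy vectors standing in for ALIGN embeddings; the normalization scheme and index layout are assumptions for illustration, not the production retrieval setup.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def multimodal_search(image_emb, text_emb, index_embs, sign=1.0, top_k=5):
    """Add (sign=+1) or subtract (sign=-1) a text embedding from an image embedding
    and retrieve nearest neighbors by cosine similarity. `index_embs` is an [N, dim]
    array of pre-computed, L2-normalized image embeddings from the same space."""
    query = normalize(normalize(image_emb) + sign * normalize(text_emb))
    scores = index_embs @ query
    return np.argsort(-scores)[:top_k]            # indices of the top-k images

# Toy usage with random vectors standing in for ALIGN embeddings.
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64))
index /= np.linalg.norm(index, axis=1, keepdims=True)
print(multimodal_search(rng.normal(size=64), rng.normal(size=64), index))
```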

Image retrieval with image text queries. By adding or subtracting text query embedding, ALIGN retrieves relevant images.

Social Impact and Future Work

While this work shows promising results from a methodology perspective with a simple data
collection method, additional analysis of the data and the resulting model is necessary before the
responsible use of the model in practice. For instance, consideration should be given to the
potential for harmful text in alt-text data to reinforce such harms. With regard to
fairness, data balancing efforts may be required to prevent reinforcing stereotypes from the web
data. Additional testing and training around sensitive religious or cultural items should be undertaken to
understand and mitigate the impact from possibly mislabeled data.

Further analysis should also be conducted to ensure that the demographic distribution of humans and
related cultural items, such as clothing, food, and art, does not cause skewed model performance.
Analysis and balancing would be required if such models will be used in production.

Conclusion

We have presented a simple method of leveraging large-scale noisy image-text data to scale up
visual and vision-language representation learning. The resulting model, ALIGN, is capable of cross-
modal retrieval and significantly outperforms SotA models. In visual-only downstream tasks, ALIGN
is also comparable to or outperforms SotA models trained with large-scale labeled data.

Acknowledgement

We would like to thank our co-authors in Google Research: Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu
Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. This work was also done with invaluable help
from other colleagues from Google. We would like to thank Jan Dlabal and Zhe Li for continuous
support in training infrastructure, Simon Kornblith for building the zero-shot & robustness model
evaluation on ImageNet variants, Xiaohua Zhai for help on conducting VTAB evaluation, Mingxing Tan
and Max Moroz for suggestions on EfficientNet training, Aleksei Timofeev for the early idea of
multimodal query retrieval, Aaron Michelony and Kaushal Patel for their early work on data
generation, and Sergey Ioffe, Jason Baldridge and Krishna Srinivasan for the insightful feedback and
discussion.

Accelerating Eye Movement Research for Wellness and Accessibility
Monday, May 10, 2021
Posted by Nachiappan Valliappan, Senior Software Engineer and Kai Kohlhoff, Staff Research Scientist,
Google Research

Eye movement has been studied widely across vision science, language, and usability since the
1970s. Beyond basic research, a better understanding of eye movement could be useful in a wide
variety of applications, ranging across usability and user experience research, gaming, driving, and
gaze-based interaction for accessibility to healthcare. However, progress has been limited because
most prior research has focused on specialized hardware-based eye trackers that are expensive
and do not easily scale.

In “Accelerating eye movement research via accurate and affordable smartphone eye tracking”,
published in Nature Communications, and “Digital biomarker of mental fatigue”, published in npj
Digital Medicine, we present accurate, smartphone-based, ML-powered eye tracking that has the
potential to unlock new research into applications across the fields of vision, accessibility,
healthcare, and wellness, while additionally providing orders-of-magnitude scaling across diverse
populations in the world, all using the front-facing camera on a smartphone. We also discuss the
potential use of this technology as a digital biomarker of mental fatigue, which can be useful for
improved wellness.

Model Overview

The core of our gaze model was a multilayer feed-forward convolutional neural network (ConvNet)
trained on the MIT GazeCapture dataset. A face detection algorithm selected the face region with
associated eye corner landmarks, which were used to crop the images down to the eye region
alone. These cropped frames were fed through two identical ConvNet towers with shared weights.
Each convolutional layer was followed by an average pooling layer. Eye corner landmarks were
combined with the output of the two towers through fully connected layers. Rectified Linear Units
(ReLUs) were used for all layers except the final fully connected output layer (FC6), which had no
activation.

Architecture of the unpersonalized gaze model. Eye regions, extracted from a front-facing camera image, serve as input into
a convolutional neural network. Fully-connected (FC) layers combine the output with eye corner landmarks to infer gaze x-
and y-locations on screen via a multi-regression output layer.
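
A minimal tf.keras sketch of this two-tower design follows; the input shapes, filter counts, landmark dimensionality, and dense-layer widths are placeholder guesses, not the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_eye_tower(input_shape=(128, 128, 3)):
    # Shared ConvNet tower applied to each cropped eye region; each convolutional
    # layer is followed by average pooling, as described above.
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):                 # placeholder filter counts
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.AveragePooling2D()(x)
    return tf.keras.Model(inputs, layers.Flatten()(x), name="eye_tower")

def build_gaze_model():
    left_eye = tf.keras.Input(shape=(128, 128, 3), name="left_eye")
    right_eye = tf.keras.Input(shape=(128, 128, 3), name="right_eye")
    landmarks = tf.keras.Input(shape=(8,), name="eye_corner_landmarks")  # placeholder size

    tower = build_eye_tower()                     # one instance, so weights are shared
    x = layers.Concatenate()([tower(left_eye), tower(right_eye), landmarks])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(16, activation="relu")(x)
    gaze_xy = layers.Dense(2, name="gaze_xy")(x)  # final FC output layer, no activation
    return tf.keras.Model([left_eye, right_eye, landmarks], gaze_xy)

build_gaze_model().summary()
```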

The unpersonalized gaze model accuracy was improved by fine-tuning and per-participant
personalization. For the latter, a lightweight regression model was fitted to the model’s penultimate
ReLU layer and participant-specific data.
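
The personalization step might look roughly like the sketch below, where `penultimate_features` is a hypothetical helper that extracts the base model's final-ReLU-layer activations, and ridge regression stands in for the unspecified lightweight regressor.

```python
import numpy as np
from sklearn.linear_model import Ridge

def personalize(penultimate_features, calib_frames, calib_gaze_xy):
    """Fit a per-participant regressor on a short calibration session.
    `penultimate_features(frame)` is a hypothetical helper returning the base
    model's final-ReLU-layer activations; Ridge is an assumed choice of regressor."""
    X = np.stack([penultimate_features(f) for f in calib_frames])
    return Ridge(alpha=1.0).fit(X, calib_gaze_xy)      # targets: on-screen (x, y)

def predict_gaze(regressor, penultimate_features, frame):
    return regressor.predict(penultimate_features(frame)[None, :])[0]
```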

Model Evaluation

To evaluate the model, we collected data from consenting study participants as they viewed dots
that appeared at random locations on a blank screen. The model error was computed as the
distance (in cm) between the stimulus location and model prediction. Results show that while the
unpersonalized model has high error, personalization with ~30s of calibration data led to an over
fourfold error reduction (from 1.92 to 0.46cm). At a viewing distance of 25-40 cm, this corresponds
to 0.6-1° accuracy, a significant improvement over the 2.4-3° reported in previous work [1, 2].

Additional experiments show that the smartphone eye tracker model’s accuracy is comparable to
state-of-the-art wearable eye trackers both when the phone is placed on a device stand, as well as
when users hold the phone freely in their hand in a near frontal headpose. In contrast to specialized
eye tracking hardware with multiple infrared cameras close to each eye, running our gaze model
using a smartphone’s single front-facing RGB camera is significantly more cost effective (~100x
cheaper) and scalable.

Using this smartphone technology, we were able to replicate key findings from prior eye movement
research in neuroscience and psychology, including standard oculomotor tasks (to understand
basic visual functioning in the brain) and natural image understanding. For example, in a simple
prosaccade task, which tests a person’s ability to quickly move their eyes towards a stimulus that
appears on the screen, we found that the average saccade latency (time to move the eyes)
matches prior work for basic visual health (210ms versus 200-250ms). In controlled visual search
tasks, we were able to replicate key findings, such as the effect of target saliency and clutter on eye
movements.

Example gaze scanpaths show the effect of the target’s saliency (i.e., color contrast) on visual search performance. Fewer
fixations are required to find a target (left) with high saliency (different from the distractors), while more fixations are
required to find a target (right) with low saliency (similar to the distractors).

For complex stimuli, such as natural images, we found that the gaze distributions (computed by
aggregating gaze positions across all participants) from our smartphone eye tracker are similar to
those obtained from bulky, expensive eye trackers that used highly controlled settings, such as
laboratory chin rest systems. While the smartphone-based gaze heatmaps have a broader
distribution (i.e., they appear more “blurred”) than hardware-based eye trackers, they are highly
correlated both at the pixel level (r = 0.74) and object level (r = 0.90). These results suggest that this
technology could be used to scale gaze analysis for complex stimuli such as natural and medical
images (e.g., radiologists viewing MRI/PET scans).
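
The pixel-level comparison can be sketched as below, assuming aggregated gaze positions normalized to [0, 1] from each device; the bin count and exact procedure are illustrative rather than the paper's analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

def gaze_heatmap(xy, bins=64):
    # Aggregate normalized (x, y) gaze positions across participants into a 2-D histogram.
    heat, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins,
                                range=[[0.0, 1.0], [0.0, 1.0]])
    return heat / heat.sum()

def pixel_level_correlation(xy_phone, xy_tracker):
    # Pearson r between the flattened heatmaps from the two devices.
    return pearsonr(gaze_heatmap(xy_phone).ravel(),
                    gaze_heatmap(xy_tracker).ravel())[0]

# Toy usage with random fixations; real inputs would come from the two trackers.
rng = np.random.default_rng(0)
print(pixel_level_correlation(rng.random((5000, 2)), rng.random((5000, 2))))
```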

Similar gaze distribution from our smartphone approach vs. a more expensive (100x) eye tracker (from the OSIE dataset).

We found that smartphone gaze could also help detect difficulty with reading comprehension.
Participants reading passages spent significantly more time looking within the relevant excerpts
when they answered correctly. However, as comprehension difficulty increased, they spent more
time looking at the irrelevant excerpts in the passage before finding the relevant excerpt that
contained the answer. The fraction of gaze time spent on the relevant excerpt was a good predictor
of comprehension, and strongly negatively correlated with comprehension difficulty (r = −0.72).

Digital Biomarker of Mental Fatigue

Gaze detection is an important tool to detect alertness and wellbeing, and is studied widely in
medicine, sleep research, and mission-critical settings such as medical surgeries, aviation safety,
etc. However, existing fatigue tests are subjective and often time-consuming. In our recent paper
published in npj Digital Medicine, we demonstrated that smartphone gaze is significantly impaired
with mental fatigue, and can be used to track the onset and progression of fatigue.

A simple model predicts mental fatigue reliably using just a few minutes of gaze data from
participants performing a task. We validated these findings in two different experiments — using a
language-independent object-tracking task and a language-dependent proofreading task. As shown
below, in the object-tracking task, participants’ gaze initially follows the object’s circular trajectory,
but under fatigue, their gaze shows high errors and deviations. Given the pervasiveness of phones,
these results suggest that smartphone-based gaze could provide a scalable, digital biomarker of
mental fatigue.


Example gaze scanpaths for a participant with no fatigue (left) versus with mental fatigue (right) as they track an object
following a circular trajectory.

The corresponding progression of fatigue scores (ground truth) and model prediction as a function of time on task.

Beyond wellness, smartphone gaze could also provide a digital phenotype for screening or
monitoring health conditions such as autism spectrum disorder, dyslexia, concussion and more.
This could enable timely and early interventions, especially for countries with limited access to
healthcare services.

Another area that could benefit tremendously is accessibility. People with conditions such as ALS,
locked-in syndrome and stroke have impaired speech and motor ability. Smartphone gaze could
provide a powerful way to make daily tasks easier by using gaze for interaction, as recently
demonstrated with Look to Speak.

Ethical Considerations

Gaze research needs careful consideration, including being mindful of the correct use of such
technology — applications should obtain explicit approval and fully informed consent from users
for the specific task at hand. In our work, all data was collected for research purposes with users’
explicit approval and consent. In addition, users were allowed to opt out at any point and request
their data to be deleted. We continue to research additional ways to ensure ML fairness and
improve the accuracy and robustness of gaze technology across demographics, in a responsible,
privacy-preserving way.

Conclusion

Our findings of accurate and affordable ML-powered smartphone eye tracking offer the potential
for orders-of-magnitude scaling of eye movement research across disciplines (e.g., neuroscience,
psychology and human-computer interaction). They unlock potential new applications for societal
good, such as gaze-based interaction for accessibility, and smartphone-based screening and
monitoring tools for wellness and healthcare.

Acknowledgements

This work involved collaborative efforts from a multidisciplinary team of software engineers,
researchers, and cross-functional contributors. We’d like to thank all the co-authors of the papers,
including our team members, Junfeng He, Na Dai, Pingmei Xu, Venky Ramachandran; interns, Ethan
Steinberg, Kantwon Rogers, Li Guo, and Vincent Tseng; collaborators, Tanzeem Choudhury; and UXRs:
Mina Shojaeizadeh, Preeti Talwai, and Ran Tao. We’d also like to thank Tomer Shekel, Gaurav
Nemade, and Reena Lee for their contributions to this project, and Vidhya Navalpakkam for her
technical leadership in initiating and overseeing this body of work.

Crisscrossed Captions: Semantic Similarity for Images and Text
Thursday, May 6, 2021
Posted by Zarana Parekh, Software Engineer and Jason Baldridge, Staff Research Scientist, Google
Research

The past decade has seen remarkable progress on automatic image captioning, a task in which a
computer algorithm creates written descriptions for images. Much of the progress has come
through the use of modern deep learning methods developed for both computer vision and natural
language processing, combined with large scale datasets that pair images with descriptions
created by people. In addition to supporting important practical applications, such as providing
descriptions of images for visually impaired people, these datasets also enable investigations into
important and exciting research questions about grounding language in visual inputs. For example,
learning deep representations for a word like “car” means using both linguistic and visual contexts.

Image captioning datasets that contain pairs of textual descriptions and their corresponding
images, such as MS-COCO and Flickr30k, have been widely used to learn aligned image and text
representations and to build captioning models. Unfortunately, these datasets have limited cross-
modal associations: images are not paired with other images, captions are only paired with other
captions of the same image (also called co-captions), there are image-caption pairs that match but
are not labeled as a match, and there are no labels that indicate when an image-caption pair does
not match. This undermines research into how inter-modality learning (connecting captions to
images, for example) impacts intra-modality tasks (connecting captions to captions or images to
images). This is important to address, especially because a fair amount of work on learning from
images paired with text is motivated by arguments about how visual elements should inform and
improve representations of language.

To address this evaluation gap, we present "Crisscrossed Captions: Extended Intramodal and
Intermodal Semantic Similarity Judgments for MS-COCO", which was recently presented at EACL
2021. The Crisscrossed Captions (CxC) dataset extends the development and test splits of MS-
COCO with semantic similarity ratings for image-text, text-text and image-image pairs. The rating
criteria are based on Semantic Textual Similarity, an existing and widely-adopted measure of
semantic relatedness between pairs of short texts, which we extend to include judgments about
images as well. In all, CxC contains human-derived semantic similarity ratings for 267,095 pairs
(derived from 1,335,475 independent judgments), a massive extension in scale and detail to the
50k original binary pairings in MS-COCO’s development and test splits. We have released CxC’s
ratings, along with code to merge CxC with existing MS-COCO data. Anyone familiar with MS-COCO
can thus easily enhance their experiments with CxC.

Crisscrossed Captions extends the MS-COCO evaluation sets by adding human-derived semantic similarity ratings for
existing image-caption pairs and co-captions (solid lines), and it increases rating density by adding human ratings for new
image-caption, caption-caption and image-image pairs (dashed lines).*

Creating the CxC Dataset

If a picture is worth a thousand words, it is likely because there are so many details and
relationships between objects that are generally depicted in pictures. We can describe the texture
of the fur on a dog, name the logo on the frisbee it is chasing, mention the expression on the face
of the person who has just thrown the frisbee, or note the vibrant red on a large leaf in a tree above
the person’s head, and so on.

The CxC dataset extends the MS-COCO evaluation splits with graded similarity associations within
and across modalities. MS-COCO has five captions for each image, split into 410k training, 25k
development, and 25k test captions (for 82k, 5k, 5k images, respectively). An ideal extension would
rate every pair in the dataset (caption-caption, image-image, and image-caption), but this is
infeasible as it would require obtaining human ratings for billions of pairs.

Given that randomly selected pairs of images and captions are likely to be dissimilar, we came up
with a way to select items for human rating that would include at least some new pairs with high
expected similarity. To reduce the dependence of the chosen pairs on the models used to find
them, we introduce an indirect sampling scheme (depicted below) where we encode images and
captions using different encoding methods and compute the similarity between pairs of same
modality items, resulting in similarity matrices. Images are encoded using Graph-RISE embeddings,
while captions are encoded using two methods — Universal Sentence Encoder (USE) and average
bag-of-words (BoW) based on GloVe embeddings. Since each MS-COCO example has five co-
captions, we average the co-caption encodings to create a single representation per example,
ensuring all caption pairs can be mapped to image pairs (more below on how we select
intermodality pairs).
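
A simplified numpy sketch of this step is shown below; the random arrays stand in for Graph-RISE, USE, or BoW embeddings, and the pair-selection heuristic is intentionally bare-bones.

```python
import numpy as np

def average_cocaption_embeddings(caption_embs):
    """caption_embs: [num_images, 5, dim], five co-caption embeddings per image.
    Returns one L2-normalized embedding per image."""
    avg = caption_embs.mean(axis=1)
    return avg / np.linalg.norm(avg, axis=1, keepdims=True)

def top_similar_pairs(embs, k=10):
    """Return the k most similar distinct (i, j) index pairs by cosine similarity."""
    sims = embs @ embs.T
    np.fill_diagonal(sims, -np.inf)                    # ignore self-similarity
    order = np.argsort(-sims, axis=None)[: 2 * k]      # each pair appears as (i, j) and (j, i)
    pairs = []
    for idx in order:
        i, j = divmod(int(idx), sims.shape[1])
        if (j, i) not in pairs:
            pairs.append((i, j))
    return pairs[:k]

# Candidate image pairs are the images whose averaged caption embeddings were most
# similar (and vice versa for caption pairs); those candidates go out for human rating.
rng = np.random.default_rng(0)
print(top_similar_pairs(average_cocaption_embeddings(rng.normal(size=(100, 5, 32))), k=5))
```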

Top: Text similarity matrix (each cell corresponds to a similarity score) constructed using averaged co-caption encodings, so
each text entry corresponds to a single image, resulting in a 5k x 5k matrix. Two different text encoding methods were used,
but only one text similarity matrix has been shown for simplicity. Bottom: Image similarity matrix for each image in the
dataset, resulting in a 5k x 5k matrix.

The next step of the indirect sampling scheme is to use the computed similarities of images for a
biased sampling of caption pairs for human rating (and vice versa). For example, we select two
captions with high computed similarities from the text similarity matrix, then take each of their
images, resulting in a new pair of images that are different in appearance but similar in what they
depict based on their descriptions. For example, the captions “A dog looking bashfully to the side”
and “A black dog lifts its head to the side to enjoy a breeze” would have a reasonably high model
similarity, so the corresponding images of the two dogs in the figure below could be selected for
image similarity rating. This step can also start with two images with high computed similarities to
yield a new pair of captions. We now have indirectly sampled new intramodal pairs — at least some
of which are highly similar — for which we obtain human ratings.


Top: Pairs of images are picked based on their computed caption similarity. Bottom: Pairs of captions are picked based on
the computed similarity of the images they describe.

Last, we then use these new intramodal pairs and their human ratings to select new intermodal
pairs for human rating. We do this by using existing image-caption pairs to link between modalities.
For example, if a caption pair drawn from examples i and j was rated by humans as highly similar, we pick the image
from example i and caption from example j to obtain a new intermodal pair for human rating. And
again, we use the intramodal pairs with the highest rated similarity for sampling because this
includes at least some new pairs with high similarity. Finally, we also add human ratings for all
existing intermodal pairs and a large sample of co-captions.

The following table shows examples of semantic image similarity (SIS) and semantic image-text
similarity (SITS) pairs corresponding to each rating, with 5 being the most similar and 0 being
completely dissimilar.


Examples for each human-derived similarity score (left: 5 to 0, 5 being very similar and 0 being completely dissimilar) of
image pairs based on SIS (middle) and SITS (right) tasks. Note that these examples are for illustrative purposes and are not
themselves in the CxC dataset.

Evaluation

MS-COCO supports three retrieval tasks:

1. Given an image, find its matching captions out of all other captions in the evaluation
set.
2. Given a caption, find its corresponding image out of all other images in the
evaluation set.
3. Given a caption, find its other co-captions out of all other captions in the evaluation
set.

MS-COCO’s pairs are incomplete because captions created for one image at times apply equally
well to another, yet these associations are not captured in the dataset. CxC enhances these existing
retrieval tasks with new positive pairs, and it also supports a new image-image retrieval task. With
its graded similarity judgements, CxC also makes it possible to measure correlations between
model and human rankings. Retrieval metrics in general focus only on positive pairs, while CxC’s
correlation scores additionally account for the relative ordering of similarity and include low-scoring
items (non-matches). Supporting these evaluations on a common set of images and captions
makes them more valuable for understanding inter-modal learning compared to disjoint sets of
caption-image, caption-caption, and image-image associations.
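
One simple way to compute such a correlation score, sketched below under the assumption that the model's similarity is cosine similarity between embeddings, is Spearman rank correlation against the graded human ratings.

```python
import numpy as np
from scipy.stats import spearmanr

def model_human_correlation(emb_a, emb_b, human_ratings):
    """Spearman rank correlation between model cosine similarities and graded
    human ratings for the same pairs.
    emb_a, emb_b: [num_pairs, dim] embeddings of the two sides of each rated pair.
    human_ratings: [num_pairs] graded similarity scores (e.g., 0-5)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    model_scores = np.sum(a * b, axis=1)
    return spearmanr(model_scores, human_ratings).correlation
```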

We ran a series of experiments to show the utility of CxC’s ratings. For this, we constructed three
dual encoder (DE) models using BERT-base as the text encoder and EfficientNet-B4 as the image
encoder:

1. A text-text (DE_T2T) model that uses a shared text encoder for both sides.
2. An image-text model (DE_I2T) that uses the aforementioned text and image
encoders, and includes a layer above the text encoder to match the image encoder
output.
3. A multitask model (DE_I2T+T2T) trained on a weighted combination of text-text and
image-text tasks.

CxC retrieval results — a comparison of our text-text (T2T), image-text (I2T) and multitask (I2T+T2T) dual encoder models
on all the four retrieval tasks.

From the results on the retrieval tasks, we can see that DE_I2T+T2T (yellow bar) performs better
than DE_I2T (red bar) on the image-text and text-image retrieval tasks. Thus, adding the intramodal
(text-text) training task helped improve the intermodal (image-text, text-image) performance. As for
the other two intramodal tasks (text-text and image-image), DE_I2T+T2T shows strong, balanced
performance on both of them.

CxC correlation results for the same models shown above.

For the correlation tasks, DE_I2T performs the best on SIS and DE_I2T+T2T is the best overall. The
correlation scores also show that DE_I2T performs well only on images: it has the highest SIS but
has much worse STS. Adding the text-text loss to DE_I2T training (DE_I2T+T2T) produces more
balanced overall performance.

The CxC dataset provides a much more complete set of relationships between and among images
and captions than the raw MS-COCO image-caption pairs. The new ratings have been released and
further details are in our paper. We hope to encourage the research community to push the state of
the art on the tasks introduced by CxC with better models for jointly learning inter- and intra-modal
representations.

Acknowledgments

The core team includes Daniel Cer, Yinfei Yang and Austin Waters. We thank Julia Hockenmaier for
her inputs on CxC’s formulation, the Google Data Compute Team, especially Ashwin Kakarla and
Mohd Majeed for their tooling and annotation support, Yuan Zhang, Eugene Ie for their comments on
the initial versions of the paper and Daphne Luong for executive support for the data collection.

  *All the images in the article have been taken from the Open Images dataset under the CC-by 4.0 license. ↩

Introducing FELIX: Flexible Text Editing Through Tagging and Insertion
Wednesday, May 5, 2021
Posted by Jonathan Mallinson and Aliaksei Severyn, Research Scientists, Google Research

Sequence-to-sequence (seq2seq) models have become a favoured approach for tackling natural
language generation tasks, with applications ranging from machine translation to monolingual
generation tasks, such as summarization, sentence fusion, text simplification, and machine
translation post-editing. However, these models appear to be a suboptimal choice for many
monolingual tasks, as the desired output text often represents a minor rewrite of the input text.
When accomplishing such tasks, seq2seq models are both slower because they generate the
output one word at a time (i.e., autoregressively), and wasteful because most of the input tokens
are simply copied into the output.

Instead, text-editing models have recently received a surge of interest as they propose to predict
edit operations – such as word deletion, insertion, or replacement – that are applied to the input to
reconstruct the output. However, previous text-editing approaches have limitations. They are either
fast (being non-autoregressive), but not flexible, because they use a limited number of edit
operations, or they are flexible, supporting all possible edit operations, but slow (autoregressive). In
either case, they have not focused on modeling large structural (syntactic) transformations, for
example switching from active voice, “They ate steak for dinner,” to passive, “Steak was eaten for
dinner.” Instead, they've focused on local transformations, deleting or replacing short phrases.
When a large structural transformation needs to occur, they either can’t produce it or insert a large
amount of new text, which is slow.

In “FELIX: Flexible Text Editing Through Tagging and Insertion”, we introduce FELIX, a fast and
flexible text-editing system that models large structural changes and achieves a 90x speed-up
compared to seq2seq approaches whilst achieving impressive results on four monolingual
generation tasks. Compared to traditional seq2seq methods, FELIX has the following three key
advantages:

Sample efficiency: Training a high precision text generation model typically requires
large amounts of high-quality supervised data. FELIX uses three techniques to
minimize the amount of required data: (1) fine-tuning pre-trained checkpoints, (2) a
tagging model that learns a small number of edit operations, and (3) a text insertion
task that is very similar to the pre-training task.
Fast inference time: FELIX is fully non-autoregressive, avoiding slow inference times
caused by an autoregressive decoder.
Flexible text editing: FELIX strikes a balance between the complexity of learned edit
operations and flexibility in the transformations it models.

In short, FELIX is designed to derive the maximum benefit from self-supervised pre-training, being
efficient in low-resource settings, with little training data.

Overview

To achieve the above, FELIX decomposes the text-editing task into two sub-tasks: tagging to decide
on the subset of input words and their order in the output text, and insertion, where words that are
not present in the input are inserted. The tagging model employs a novel pointer mechanism, which
supports structural transformations, while the insertion model is based on a Masked Language
Model. Both of these models are non-autoregressive, ensuring the model is fast. A diagram of
FELIX can be seen below.

An example of FELIX trained on data for a text simplification task. Input words are first tagged as KEEP (K), DELETE (D) or
KEEP and INSERT (I). After tagging, the input is reordered. This reordered input is then fed to a masked language model.

The Tagging Model

The first step in FELIX is the tagging model, which consists of two components. First the tagger
determines which words should be kept or deleted and where new words should be inserted. When
the tagger predicts an insertion, a special MASK token is added to the output. After tagging, there is
a reordering step where the pointer reorders the input to form the output, by which it is able to reuse
parts of the input instead of inserting new text. The reordering step supports arbitrary rewrites,
which enables modeling large changes. The pointer network is trained such that each word in the
input points to the next word as it will appear in the output, as shown below.
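
The toy sketch below applies tags and pointers to reproduce the realization shown in the figure caption that follows ("There are 3 layers in the walls of the heart" becoming "the heart MASK 3 layers"); the tag names and pointer encoding are illustrative, not FELIX's exact internal format.

```python
def realize(tokens, tags, pointers, start):
    """Apply tags and pointers to produce the masked, reordered output.
    tags[i]    : "K" keep, "D" delete, "I" keep and insert a MASK slot after it.
    pointers[i]: index of the token that follows token i in the output (-1 ends it).
    start      : index of the first output token (played by a BOS pointer in FELIX)."""
    out, i = [], start
    while i != -1:
        if tags[i] != "D":                 # deleted tokens never enter the output
            out.append(tokens[i])
            if tags[i] == "I":
                out.append("[MASK]")
        i = pointers[i]
    return out

# "There are 3 layers in the walls of the heart" -> "the heart [MASK] 3 layers"
tokens   = ["There", "are", "3", "layers", "in", "the", "walls", "of", "the", "heart"]
tags     = ["D", "D", "K", "K", "D", "K", "D", "D", "D", "I"]
pointers = [-1, -1, 3, -1, -1, 9, -1, -1, -1, 2]   # the -> heart (+MASK) -> 3 -> layers
print(" ".join(realize(tokens, tags, pointers, start=5)))
```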


Realization of the pointing mechanism to transform "There are 3 layers in the walls of the heart" into "the heart MASK 3
layers".

The Insertion Model

The output of the tagging model is the reordered input text with deleted words and MASK tokens
predicted by the insertion tag. The insertion model must predict the content of MASK tokens.
Because FELIX’s insertion model is very similar to the pretraining objective of BERT, it can take
direct advantage of the pre-training, which is particularly advantageous when data is limited.
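
As an illustration, any off-the-shelf BERT-style masked language model can play the insertion role; the Hugging Face `transformers` pipeline below is a stand-in for demonstration, not the model FELIX actually trains.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Any BERT-style masked LM can stand in for the insertion model in this sketch.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Reordered, tagged output from the tagging step, with the slot to be filled.
for candidate in fill_mask("the heart [MASK] 3 layers")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
```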

Example of the insertion model, where the tagger predicts two words will be inserted and the insertion model predicts the
content of the MASK tokens.

Results

We evaluated FELIX on sentence fusion, text simplification, abstractive summarization, and
machine translation post-editing. These tasks vary significantly in the types of edits required and
dataset sizes under which they operate. Below are the results on the sentence fusion task (i.e.,
merging two sentences into one), comparing FELIX against a large pre-trained seq2seq model
(BERT2BERT) and a text-editing model (LaserTagger), under a range of dataset sizes. We see that
FELIX outperforms LaserTagger and can be trained on as little as a few hundred training examples.
For the full dataset, the autoregressive BERT2BERT outperforms FELIX. However, during inference,
this model takes significantly longer.


A comparison of different training dataset sizes on the DiscoFuse dataset. We compare FELIX (using the best performing
model) against BERT2BERT and LaserTagger.

Latency in milliseconds for a batch of 32 on a Nvidia Tesla P100.

Conclusion

We have presented FELIX, which is fully non-autoregressive, providing even faster inference times,
while achieving state-of-the-art results. FELIX also minimizes the amount of required training data
with three techniques — fine-tuning pre-trained checkpoints, learning a small number of edit
operations, and an insertion task that mimics the masked language modeling task from pre-training.
Lastly, FELIX strikes a balance between the complexity of learned edit operations and the
percentage of input-output transformations it can handle. We have open-sourced the code for
FELIX and hope it will provide researchers with a faster, more efficient, and more flexible text-
editing model.

Acknowledgements


This research was conducted by Jonathan Mallinson, Aliaksei Severyn (equal contribution), Eric
Malmi, Guillermo Garrido. We would like to thank Aleksandr Chuklin, Daniil Mirylenka, Ryan McDonald,
and Sebastian Krause for useful discussions, running early experiments and paper suggestions.

Do Wide and Deep Networks Learn the Same Things?


Tuesday, May 4, 2021
Posted by Thao Nguyen, AI Resident, Google Research

A common practice to improve a neural network’s performance and tailor it to available
computational resources is to adjust the architecture depth and width. Indeed, popular families of
neural networks, including EfficientNet, ResNet and Transformers, consist of a set of architectures
of flexible depths and widths. However, beyond the effect on accuracy, there is limited
understanding of how these fundamental choices of architecture design affect the model, such as
the impact on its internal representations.

In “Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network
Representations Vary with Width and Depth”, we perform a systematic study of the similarity
between wide and deep networks from the same architectural family through the lens of their
hidden representations and final outputs. In very wide or very deep models, we find a characteristic
block structure in their internal representations, and establish a connection between this
phenomenon and model overparameterization. Comparisons across models demonstrate that
those without the block structure show significant similarity between representations in
corresponding layers, but those containing the block structure exhibit highly dissimilar
representations. These properties of the internal representations in turn translate to systematically
different errors at the class and example levels for wide and deep models when they are evaluated
on the same test set.

Comparing Representation Similarity with CKA

We extended prior work on analyzing representations by leveraging our previously developed
Centered Kernel Alignment (CKA) technique, which provides a robust, scalable way to determine
the similarity between the representations learned by any pair of neural network layers. CKA takes
as input the representations (i.e., the activation matrices) from two layers, and outputs a similarity
score between 0 (not at all similar) and 1 (identical representations).

We apply CKA to a family of ResNets of varying depths and widths, trained on common benchmark
datasets (CIFAR-10, CIFAR-100 and ImageNet), and use representation heatmaps to illustrate the
results. The x and y axes of each heatmap index the layers of the model(s) in consideration, going
from input to output, and each entry (i, j) is the CKA similarity score between layer i and layer j.
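
A compact numpy sketch of linear CKA and the all-pairs heatmap computation is below; it uses the standard linear-CKA formula on activation matrices and is not necessarily the exact implementation used in the paper.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape [num_examples, num_features].
    Returns a score in [0, 1]; identical representations score 1."""
    X = X - X.mean(axis=0, keepdims=True)      # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

def cka_heatmap(layer_activations):
    # One heatmap cell per pair of layers, as in the figures described above.
    n = len(layer_activations)
    heat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            heat[i, j] = linear_cka(layer_activations[i], layer_activations[j])
    return heat

# Toy usage: identical layers score 1.0 on the diagonal; unrelated random layers score lower.
rng = np.random.default_rng(0)
print(np.round(cka_heatmap([rng.normal(size=(256, 32)) for _ in range(4)]), 2))
```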


We use CKA to compute the representation similarity for all pairs of layers within a single model (i.e., when network 1 and
network 2 are identical), and across models (i.e., when network 1 and network 2 are trained with different random
initializations, or have different architectures altogether).

Below is an example of the resulting heatmap when we compare representations of each layer to
every other layer within a single ResNet of depth 26 and width multiplier 1. In the design convention
used here, the stated depth only refers to the number of convolutional layers in the network, but we
analyze all layers present, and the width multiplier applies to the number of filters in each
convolution. Notice the checkerboard pattern in the heatmap, which is caused by skip connections
(shortcuts between layers) in the architecture.

The Emergence of the Block Structure

What stands out from the representation heatmaps of deeper or wider networks is the emergence
of a large set of consecutive layers with highly similar representations, which appears in the
heatmaps as a yellow square (i.e., a region with high CKA scores). This phenomenon, which we call
the block structure, suggests that the underlying layers may not be as efficient at progressively
refining the network’s representations as we expect. Indeed, we show that the task performance
becomes stagnant inside the block structure, and that it is possible to prune some underlying
layers without affecting the final performance.


Block structure — a large, contiguous set of layers with highly similar representations — emerges with increasing width or
depth. Each heatmap panel shows the CKA similarity between all pairs of layers within a single neural network. While its size
and position can vary across different training runs, the block structure is a robust phenomenon that arises consistently in
larger models.

With additional experiments, we show that the block structure has less to do with the absolute
model size than with the size of the model relative to the size of the training dataset. As we reduce
the training dataset size, the block structure starts to appear in shallower and narrower networks:

With increasing network width (towards the right along each row) and decreasing dataset size (down each column), the
relative model capacity (with respect to a given task) is effectively inflated, and the block structure begins to appear in
smaller models.

Through further analysis, we are also able to demonstrate that the block structure arises from
preserving and propagating the dominant principal components of its underlying representations.
Refer to our paper for more details.

Comparing Representations Across Models

Going further, we study the implications of depth and width on representations across models of
different random initializations and different architectures, and find that the presence of block
structure makes a significant difference in this context as well. Despite having different
architectures, wide and deep models without the block structure do exhibit representation similarity
with each other, with corresponding layers broadly being of the same proportional depth in the
model. However, when the block structure is present, its representations are unique to each model.
This suggests that despite having similar overall performance, each wide or deep model with the
block structure picks up a unique mapping from the input to the output.

For smaller models (e.g., ResNet-38 1×), CKA across different initializations (off the diagonal) closely resembles CKA within
a single model (on the diagonal). In contrast, representations within the block structure of wider and deeper models (e.g.,
ResNet-38 10×, ResNet-164 1×) are highly dissimilar across training runs.

Error Analysis of Wide and Deep Models

Having explored the properties of the learned representations of wide and deep models, we next
turn to understanding how they influence the diversity of the output predictions. We train
populations of networks of different architectures and determine on which test set examples each
architecture configuration tends to make errors.

On both CIFAR-10 and ImageNet datasets, wide and deep models that have the same average
accuracy still demonstrate statistically significant differences in example-level predictions. The
same observation holds for class-level errors on ImageNet, with wide models exhibiting a small
advantage in identifying classes corresponding to scenes, and deep networks being relatively more
accurate on consumer goods.


Per-class differences on ImageNet between models with increased width (y-axis) or depth (x-axis). Orange dots reflect
differences between two sets of 50 different random initializations of ResNet-83 (1×).

Conclusions

In studying the effects of depth and width on internal representations, we uncover a block structure
phenomenon, and demonstrate its connection to model capacity. We also show that wide and deep
models exhibit systematic output differences at class and example levels. Check out the paper for
full details on these results and additional insights! We’re excited about the many interesting open
questions these findings suggest, such as how the block structure arises during training, whether
the phenomenon occurs in domains beyond image classification, and ways these insights on
internal representations can inform model efficiency and generalization.

Acknowledgements

This is a joint work with Maithra Raghu and Simon Kornblith. We would like to thank Tom Small for
the visualizations of the representation heatmap.

Google at ICLR 2021


Monday, May 3, 2021
Posted by Jaqui Herman, Research Specialist and Tim Herrmann, Program Manager

The 9th International Conference on Learning Representations (ICLR 2021), a virtual conference
focused on deep learning, kicked off this week, offering conference and workshop tracks that
present some of the latest research in deep learning and its applications to areas such as computer
vision, computational biology, speech recognition, text understanding, and more.

As a Platinum Sponsor of ICLR 2021, Google will have a strong presence with over 100 accepted
publications and participation on organizing committees and in workshops. If you have registered
for ICLR 2021, we hope you’ll watch our talks and learn about the work at Google that goes into
solving interesting problems for billions of people. Learn more about our research being presented
in the list below (Googlers in bold).

Officers and Board Members

Includes: Hugo Larochelle, Tara Sainath

Organizing Committee

Includes: Sanmi Koyejo, Chelsea Finn

Area Chairs

Includes: Abhishek Kumar, Aditya Menon, Aleksandra Faust, Alexey Dosovitskiy, Andrew Cotter,
Andrew Dai, Augustus Odena, Been Kim, Behnam Neyshabur, Ben Poole, Bo Dai, Bo Li, Branislav
Kveton, Ce Liu, Claudio Gentile, Colin Raffel, Danny Tarlow, David Ha, Dengyong Zhou, Dumitru Erhan,
Dustin Tran, Felix Hill, George Tucker, Hanie Sedghi, Heinrich Jiang, Hossein Mobahi, Izhak Shafran,
Jascha Sohl-Dickstein, Jasper Snoek, Jean-Philippe Vert, Jeffrey Pennington, Justin Gilmer, Kevin
Swersky, Marco Cuturi, Mario Lucic, Marlos C. Machado, Mathieu Blondel, Matt Johnson, Matthieu
Geist, Mohammad Norouzi, Naman Agarwal, Navdeep Jaitly, Nicolas Le Roux, Niki Parmar, Olivier
Bachem, Olivier Pietquin, Philip Long, Quentin Berthet, Razvan Pascanu, Rodolphe Jenatton, Samy
Bengio*, Sebastian Nowozin, Silvio Lattanzi, Slav Petrov, Srinadh Bhojanapalli, Suman Ravuri, Tim
Salimans, Vitaly Kuznetsov, William Cohen, Yann Dauphin, Yujia Li

Publications

Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes

Mike Gartrell, Insu Han, Elvis Dohmatob, Jennifer Gillenwater, Victor-Emmanuel Brunel

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (see the blog post)

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,
Neil Houlsby

Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation

Biao Zhang*, Ankur Bapna, Rico Sennrich, Orhan Firat

Evolving Reinforcement Learning Algorithms (see the blog post)

John D Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Quoc V Le, Sergey Levine, Honglak Lee,
Aleksandra Faust

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song*, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study

Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier,
Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem

When Do Curricula Work?

Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

Sharpness-aware Minimization for Efficiently Improving Generalization

Pierre Foret*, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur

Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

Zirui Wang*, Yulia Tsvetkov, Orhan Firat, Yuan Cao

Mathematical Reasoning via Self-supervised Skip-tree Training

Markus Norman Rabe, Dennis Lee, Kshitij Bansal, Christian Szegedy

Long-Tail Learning via Logit Adjustment

Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv
Kumar

Are Neural Rankers Still Outperformed by Gradient Boosted Decision Trees?

Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Michael
Bendersky, Marc Najork

LambdaNetworks: Modeling Long-Range Interactions without Attention

Irwan Bello

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc G Bellemare

BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration

Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton, Hanjun Dai

Practical Real Time Recurrent Learning with a Sparse Approximation

Jacob Menick, Erich Elsen, Utku Evci, Simon Osindero, Karen Simonyan, Alex Graves

LEAF: A Learnable Frontend for Audio Classification (see the blog post)

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi

Batch Reinforcement Learning Through Continuation Method

Yijie Guo, Shengyu Feng, Nicolas Le Roux, Ed Chi, Honglak Lee, Minmin Chen

Scalable Transfer Learning with Expert Models

Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, Cedric Renggli*, André Susano Pinto, Sylvain
Gelly, Daniel Keysers, Neil Houlsby

Scaling Symbolic Methods Using Gradients for Neural Model Explanation

Subham Sekhar Sahoo, Subhashini Venugopalan, Li Li, Rishabh Singh, Patrick Riley

Primal Wasserstein Imitation Learning (see the blog post)

Robert Dadashi, Leonard Hussenot, Matthieu Geist, Olivier Pietquin

Reset-Free Lifelong Learning with Skill-Space Planning

Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

Teaching Temporal Logics to Neural Networks

Christopher Hahn, Frederik Schmitt, Jens U. Kreber, Markus Norman Rabe, Bernd Finkbeiner

Shape-Texture Debiased Neural Network Training

Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, Cihang Xie

Rethinking Embedding Coupling in Pre-trained Language Models


Hyung Won Chung, Thibault Fevry*, Henry Tsai, Melvin Johnson, Sebastian Ruder

Overparameterisation and Worst-Case Generalisation: Friend or Foe?

Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Single-Photon Image Classification

Thomas Fischbacher, Luciano Sbaiz

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Efthymios Tzinis*, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R.
Hershey

Adaptive Federated Optimization

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
Sanjiv Kumar, Hugh Brendan McMahan

Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers

Benjamin Eysenbach, Shreyas Chaudhari, Swapnil Asawa, Sergey Levine, Ruslan Salakhutdinov

Open Question Answering over Tables and Text

Wenhu Chen*, Ming-Wei Chang, Eva Schlinger, William Yang Wang, William W. Cohen

IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression

Rianne van den Berg, Alexey A. Gritsenko, Mostafa Dehghani, Casper Kaae Sønderby, Tim Salimans

A Universal Representation Transformer Layer for Few-Shot Image Classification

Lu Liu, William L. Hamilton, Guodong Long, Jing Jiang, Hugo Larochelle

Tradeoffs in Data Augmentation: An Empirical Study

Raphael Gontijo-Lopes, Sylvia Smullin, Ekin Dogus Cubuk, Ethan Dyer

Coping with Label Shift via Distributionally Robust Optimisation

Jingzhao Zhang, Aditya Krishna Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra

Rethinking Attention with Performers (see the blog post)

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin
Belanger, Lucy J Colwell, Adrian Weller

Teaching with Commentaries

Aniruddh Raghu*, Maithra Raghu, Simon Kornblith, David Duvenaud, Geoffrey Hinton

Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

Vinay Venkatesh Ramasesh, Ethan Dyer, Maithra Raghu

Model-Based Offline Planning

Arthur Argenson, Gabriel Dulac-Arnold


The Geometry of Integration in Text Classification RNNs

Kyle Aitken*, Vinay Venkatesh Ramasesh, Ankush Garg, Yuan Cao, David Sussillo, Niru
Maheswaranathan

On the Origin of Implicit Regularization in Stochastic Gradient Descent

Samuel L Smith, Benoit Dherin, David Barrett, Soham De

The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers (see the blog
post)

Preetum Nakkiran*, Behnam Neyshabur, Hanie Sedghi

Learning Energy-Based Models by Diffusion Recovery Likelihood

Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, Diederik P Kingma

Latent Skill Planning for Exploration and Transfer

Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, Florian Shkurti

PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Yuliang Zou*, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, Tomas Pfister

WaveGrad: Estimating Gradients for Waveform Generation

Nanxin Chen*, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, William Chan

One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks

Atish Agarwala, Abhimanyu Das, Brendan Juba*, Rina Panigrahy, Vatsal Sharan*, Xin Wang, Qiuyi
Zhang

Long Range Arena : A Benchmark for Efficient Transformers

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu
Yang, Sebastian Ruder, Donald Metzler

Explainable Deep One-Class Classification

Philipp Liznerski, Lukas Ruff, Robert A. Vandermeulen, Billy Joe Franks, Marius Kloft, Klaus-Robert Müller

Net-DNF: Effective Deep Modeling of Tabular Data

Liran Katzir, Gal Elidan, Ran El-Yaniv

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu

Auxiliary Task Update Decomposition: The Good, the Bad and the Neutral

Lucio M. Dery, Yann Dauphin, David Grangier

Average-Case Acceleration for Bilinear Games and Normal Matrices

Carles Domingo-Enrich, Fabian Pedregosa, Damien Scieur

OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

Anurag Ajay*, Aviral Kumar, Pulkit Agrawal, Sergey Levine, Ofir Nachum

Training Independent Subnetworks for Robust Prediction

Marton Havasi*, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji
Lakshminarayanan, Andrew Mingbo Dai, Dustin Tran

Benchmarks for Deep Off-Policy Evaluation

Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov,
Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine,
Thomas Paine

TropEx: An Algorithm for Extracting Linear Terms in Deep Neural Networks

Martin Trimmel, Henning Petzka, Cristian Sminchisescu

Mastering Atari with Discrete World Models (see the blog post)

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, Jimmy Ba

Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit

Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning

Elan Sopher Markowitz, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Sami Abu-El-Haija, Bryan Perozzi, Greg Ver Steeg, Aram Galstyan

Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies

Paul Pu Liang*, Manzil Zaheer, Yuan Wang, Amr Ahmed

HyperGrid Transformers: Towards A Single Model for Multiple Tasks

Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan

Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

Maruan Al-Shedivat*, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh

Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network
Representations Vary with Width and Depth

Thao Nguyen, Maithra Raghu, Simon Kornblith

A Unifying View on Implicit Bias in Training Linear Neural Networks

Chulhee Yun*, Shankar Krishnan, Hossein Mobahi

Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine

Lipschitz Recurrent Neural Networks

N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, Michael W. Mahoney

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Michael R Zhang*, Thomas Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang,
Mohammad Norouzi

The Importance of Pessimism in Fixed-Dataset Policy Optimization

Jacob Buckman, Carles Gelada, Marc G Bellemare

Monotonic Kronecker-Factored Lattice

William Taylor Bakst, Nobuyuki Morioka, Erez Louidor

Adversarially Guided Actor-Critic

Yannis Flet-Berliac, Johan Ferret, Olivier Pietquin, Philippe Preux, Matthieu Geist

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, Zhifeng Chen

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon,
Honglak Lee*, Seunghoon Hong

Dataset Meta-Learning from Kernel Ridge-Regression

Timothy Nguyen, Zhourong Chen, Jaehoon Lee

Dual-Mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling

Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N Sainath, Yonghui Wu, Ruoming
Pang

Implicit Gradient Regularization

David Barrett, Benoit Dherin

Deconstructing the Regularization of BatchNorm

Yann Dauphin, Ekin Dogus Cubuk

C-Learning: Learning to Achieve Goals via Recursive Classification

Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

Colorization Transformer

Manoj Kumar, Dirk Weissenborn, Nal Kalchbrenner

Control-Aware Representations for Model-based Reinforcement Learning

Brandon Cui, Yinlam Chow, Mohammad Ghavamzadeh

Evaluations and Methods for Explanation through Robustness Analysis

Cheng-Yu Hsieh, Chih-Kuan Yeh, Xuanqing Liu, Pradeep Kumar Ravikumar, Seungyeon Kim, Sanjiv
Kumar, Cho-Jui Hsieh

Learning and Evaluating Representations for Deep One-Class Classification

Kihyuk Sohn, Chun-Liang Li, Jinsung Yoon, Minho Jin, Tomas Pfister

No MCMC for Me: Amortized Sampling for Fast and Stable Training of Energy-Based Models

Will Sussman Grathwohl, Jacob Jin Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, David
Duvenaud

Neural Thompson Sampling

Weitong Zhang, Dongruo Zhou, Lihong Li, Quanquan Gu

A Design Space Study for LISTA and Beyond

Tianjian Meng, Xiaohan Chen, Yifan Jiang, Zhangyang Wang

i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning

Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, Honglak Lee

Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments

Anirudh Goyal, Alex Lamb, Phanideep Gampa, Philippe Beaudoin, Charles Blundell, Sergey Levine,
Yoshua Bengio, Michael Curtis Mozer

Calibration of Neural Networks using Splines

Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu,
Richard Hartley

Extreme Memorization via Scale of Initialization

Harsh Mehta, Ashok Cutkosky, Behnam Neyshabur

Molecule Optimization by Explainable Evolution

Binghong Chen, Tianzhe Wang, Chengtao Li, Hanjun Dai, Le Song

Combining Ensembles and Data Augmentation Can Harm Your Calibration

Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W Dusenberry, Jasper Snoek, Balaji
Lakshminarayanan, Dustin Tran

Workshops

Science and Engineering of Deep Learning

Speakers and Panelists include: Alex Hanna

Moderator and Advisors include: Emily Denton

Organizers include: Negar Rostamzadeh, Samy Bengio*

Synthetic Data Generation: Quality, Privacy, Bias

Speakers include: Jinsung Yoon, Emily Denton

Program Committee includes: Syed Ashrafulla

Enormous Language Models: Perspectives and Benchmarks

Speakers and Panelists include: Noam Shazeer, Natalie Schluter

Organizers include: Colin Raffel, Adam Roberts, Jascha Sohl-Dickstein, Katherine Lee, William
Fedus, Aitor Lewkowycz

The Role of Mathematical Reasoning in General Artificial Intelligence

Speakers and Panelists include: Markus Rabe, Christian Szegedy

Weakly Supervised Learning

Invited Speakers include: Lu Jiang

Learning to Learn

Organizers include: Yevgen Chebotar

Embodied Multimodal Learning (EML)

Invited Speakers include: Sergey Levine

Distributed and Private Machine Learning

Program Committee includes: Peter Kairouz, Ananda Theertha Suresh

S2D-OLAD: From Shallow to Deep, Overcoming Limited and Adverse Data

Invited Speakers include: Alex Hanna, Hugo Larochelle

Organizers include: Vincent Dumoulin

Responsible AI (RAI)

Speakers include: Been Kim

Energy-Based Models: Current Perspectives, Challenges, and Opportunities

Organizers include: Adji Bousso Dieng, Igor Mordatch

A Roadmap to Never-Ending RL

Invited Session Panelists include: Aleksandra Faust

Program Committee includes: Coline Devin, Karol Hausman, Ben Eysenbach, Ofir Nachum, Ryan
Julian, Tianhe Yu, Dumitru Erhan, Marc Pickett, Shixiang Gu

2nd Workshop on Practical ML for Developing Countries: Learning Under Limited/low Resource
Scenarios

Program Committee includes: Pablo Samuel Castro

Beyond Static Papers: Rethinking How We Share Scientific Understanding in ML

Speakers include: David Ha, Hugo Larochelle

Organizers include: Sara Hooker



* Indicates work done while at Google


