
Blog

The latest from Google Research

ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Tuesday, May 11, 2021
Posted by Chao Jia and Yinfei Yang, Software Engineers, Google Research

Learning good visual and vision-language representations is critical to solving computer vision
problems — image retrieval, image classification, video understanding — and can enable the
development of tools and products that change people’s daily lives. For example, a good vision-
language matching model can help users find the most relevant images given a text description or
an image input and help tools such as Google Lens find more fine-grained information about an
image.

To learn such representations, current state-of-the-art (SotA) visual and vision-language models
rely heavily on curated training datasets that require expert knowledge and extensive labels. For
vision applications, representations are mostly learned on large-scale datasets with explicit class
labels, such as ImageNet, OpenImages, and JFT-300M. For vision-language applications, popular
pre-training datasets, such as Conceptual Captions and Visual Genome Dense Captions, all require
non-trivial data collection and cleaning steps, limiting the size of datasets and thus hindering the
scale of the trained models. In contrast, natural language processing (NLP) models have achieved
SotA performance on GLUE and SuperGLUE benchmarks by utilizing large-scale pre-training on raw
text without human labels.

In "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision", to
appear at ICML 2021, we propose bridging this gap with publicly available image alt-text data
(written copy that appears in place of an image on a webpage if the image fails to load on a user's
screen) in order to train larger, state-of-the-art vision and vision-language models. To that end, we
leverage a noisy dataset of over one billion image and alt-text pairs, obtained by skipping the expensive
filtering and post-processing steps used in the Conceptual Captions dataset. We show that the scale of
our corpus can make up for the noisy data and leads to SotA representations, achieving strong
performance when transferred to classification tasks such as ImageNet and VTAB. The aligned
visual and language representations also set new SotA results on Flickr30K and MS-COCO
benchmarks, even when compared with more sophisticated cross-attention models, and enable
zero-shot image classification and cross-modality search with complex text and text + image
queries.

Creating the Dataset

Alt-texts usually provide a description of what the image is about, but the dataset is “noisy”
because some text may be partly or wholly unrelated to its paired image.

Example image-text pairs randomly sampled from the training dataset of ALIGN. One clearly noisy text label is marked in
italics.

In this work, we follow the methodology of constructing the Conceptual Captions dataset to get a
version of raw English alt-text data (image and alt-text pairs). While the Conceptual Captions
dataset was cleaned by heavy filtering and post-processing, this work scales up visual and vision-
language representation learning by relaxing most of the cleaning steps in the original work.
Instead, we only apply minimal frequency-based filtering. The result is a much larger but noisier
dataset of 1.8B image-text pairs.

ALIGN: A Large-scale ImaGe and Noisy-Text Embedding

For the purpose of building larger and more powerful models easily, we employ a simple dual-
encoder architecture that learns to align visual and language representations of the image and text
pairs. Image and text encoders are learned via a contrastive loss (formulated as normalized
softmax) that pushes the embeddings of matched image-text pairs together while pushing those of
non-matched image-text pairs (within the same batch) apart. The large-scale dataset makes it
possible for us to scale up the model size to be as large as EfficientNet-L2 (image encoder) and
BERT-large (text encoder) trained from scratch. The learned representation can be used for
downstream visual and vision-language tasks.
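
As a rough illustration of this training objective, the sketch below computes an in-batch normalized-softmax contrastive loss in numpy; the batch size, embedding dimension, and temperature are arbitrary placeholders rather than ALIGN's actual settings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(image_emb, text_emb, temperature=0.05):
    """In-batch normalized-softmax loss: matched pairs sit on the diagonal of the
    similarity matrix and all other in-batch pairs act as negatives. The temperature
    value here is a placeholder, not ALIGN's trained parameter."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # [batch, batch] similarity matrix
    labels = np.arange(len(logits))               # i-th image matches i-th text

    def softmax_xent(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (softmax_xent(logits, labels) + softmax_xent(logits.T, labels))

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```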


Figure of ImageNet credit to (Krizhevsky et al. 2012) and VTAB figure credit to (Zhai et al. 2019)

The resulting representation can be used for vision-only or vision-language task transfer. Without
any fine-tuning, ALIGN powers cross-modal search – image-to-text search, text-to-image search,
and even search with joint image+text queries; examples are shown below.

Evaluating Retrieval and Representation

The learned ALIGN model with BERT-Large and EfficientNet-L2 as text and image encoder
backbones achieves SotA performance on multiple image-text retrieval tasks (Flickr30K and MS-
COCO) in both zero-shot and fine-tuned settings, as shown below.

Setting       Model         Flickr30K R@1 (1K test set)      MS-COCO R@1 (5K test set)
                            image → text    text → image     image → text    text → image
Zero-shot     ImageBERT     70.7            54.3             44.0            32.3
              UNITER        83.6            68.7             -               -
              CLIP          88.0            68.7             58.4            37.8
              ALIGN         88.6            75.7             58.6            45.6
Fine-tuned    GPO           88.7            76.1             68.1            52.7
              UNITER        87.3            75.6             65.7            52.9
              ERNIE-ViL     88.1            76.7             -               -
              VILLA         87.9            76.3             -               -
              Oscar         -               -                73.5            57.5
              ALIGN         95.3            84.9             77.0            59.9

Image-text retrieval results (recall@1) on Flickr30K and MS-COCO datasets (both zero-shot and fine-tuned). ALIGN
significantly outperforms existing methods including the cross-modality attention models that are too expensive for large-
scale retrieval applications.

ALIGN is also a strong image representation model. Shown below, with frozen features, ALIGN
slightly outperforms CLIP and achieves a SotA result of 85.5% top-1 accuracy on ImageNet. With
fine-tuning, ALIGN achieves higher accuracy than most generalist models, such as BiT and ViT, and
is only worse than Meta Pseudo Labels, which requires deeper interaction between ImageNet
training and large-scale unlabeled data.

Model (backbone)                        Acc@1 w/ frozen features    Acc@1    Acc@5
WSL (ResNeXt-101 32x48d)                83.6                        85.4     97.6
CLIP (ViT-L/14)                         85.4                        -        -
BiT (ResNet152 x 4)                     -                           87.54    98.46
NoisyStudent (EfficientNet-L2)          -                           88.4     98.7
ViT (ViT-H/14)                          -                           88.55    -
Meta-Pseudo-Labels (EfficientNet-L2)    -                           90.2     98.8
ALIGN (EfficientNet-L2)                 85.5                        88.64    98.67

ImageNet classification results comparison with supervised training (fine-tuning).

Zero-Shot Image Classification

Traditionally, image classification problems treat each class as an independent ID, and people have
to train the classification layers with at least a few shots of labeled data per class. The class names
are actually also natural language phrases, so we can naturally extend the image-text retrieval
capability of ALIGN for image classification without any training data.

The pre-trained image and text encoder can directly be used in classifying an image into a set of classes by retrieving the
nearest class name in the aligned embedding space. This approach does not require any training data for the defined class
space.
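
A minimal sketch of this retrieval-as-classification idea is shown below; `encode_text` is a hypothetical stand-in for a text encoder that returns L2-normalized embeddings, and the prompt template is illustrative rather than the exact prompt set used.

```python
import numpy as np

def zero_shot_classify(image_embedding, class_names, encode_text):
    """Classify an image by retrieving the nearest prompted class name in the
    shared embedding space. `encode_text` is a hypothetical helper returning
    L2-normalized text embeddings; the prompt template is illustrative."""
    prompts = [f"a photo of a {name}" for name in class_names]
    class_embs = np.stack([encode_text(p) for p in prompts])   # [num_classes, dim]
    scores = class_embs @ image_embedding                      # cosine similarities
    return class_names[int(np.argmax(scores))], scores

# Toy stand-in encoder so the sketch runs end to end (predictions are arbitrary).
rng = np.random.default_rng(0)
def toy_text_encoder(text, dim=64):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

label, _ = zero_shot_classify(toy_text_encoder("a photo of a dog"),
                              ["dog", "cat", "car"], toy_text_encoder)
print(label)
```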

On the ImageNet validation dataset, ALIGN achieves 76.4% top-1 zero-shot accuracy and shows
great robustness in different variants of ImageNet with distribution shifts, similar to the concurrent
work CLIP. We also use the same text prompt engineering and ensembling as in CLIP.

          ImageNet    ImageNet-R    ImageNet-A    ImageNet-V2
CLIP      76.2        88.9          77.2          70.1
ALIGN     76.4        92.2          75.8          70.1

Top-1 accuracy of zero-shot classification on ImageNet and its variants.

Application in Image Search

To illustrate the quantitative results above, we build a simple image retrieval system with the
embeddings trained by ALIGN and show the top 1 text-to-image retrieval results for a handful of
text queries from a 160M image pool. ALIGN can retrieve precise images given detailed
descriptions of a scene, or fine-grained or instance-level concepts like landmarks and artworks.
These examples demonstrate that the ALIGN model can align images and texts with similar
semantics, and that ALIGN can generalize to novel complex concepts.

Image retrieval with fine-grained text queries using ALIGN's embeddings.

Multimodal (Image+Text) Query for Image Search

A surprising property of word vectors is that word analogies can often be solved with vector
arithmetic; a common example is "king – man + woman = queen". Such linear relationships between
image and text embeddings also emerge in ALIGN.

Specifically, given a query image and a text string, we add their ALIGN embeddings together and
use the sum to retrieve relevant images using cosine similarity, as shown below. These examples not only
demonstrate the compositionality of ALIGN embeddings across vision and language domains, but
also show the feasibility of searching with a multi-modal query. For instance, one could now look
for the "Australia" or "Madagascar" equivalence of pandas, or turn a pair of black shoes into
identically-looking beige shoes. Also, it is possible to remove objects/attributes from a scene by
performing subtraction in the embedding space, shown below.
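
The sketch below illustrates the idea with placeholder numpy vectors standing in for ALIGN embeddings; the normalization scheme and index layout are assumptions for illustration, not the production retrieval setup.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def multimodal_search(image_emb, text_emb, index_embs, sign=1.0, top_k=5):
    """Add (sign=+1) or subtract (sign=-1) a text embedding from an image embedding
    and retrieve nearest neighbors by cosine similarity. `index_embs` is an [N, dim]
    array of pre-computed, L2-normalized image embeddings from the same space."""
    query = normalize(normalize(image_emb) + sign * normalize(text_emb))
    scores = index_embs @ query
    return np.argsort(-scores)[:top_k]            # indices of the top-k images

# Toy usage with random vectors standing in for ALIGN embeddings.
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64))
index /= np.linalg.norm(index, axis=1, keepdims=True)
print(multimodal_search(rng.normal(size=64), rng.normal(size=64), index))
```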

Image retrieval with image text queries. By adding or subtracting text query embedding, ALIGN retrieves relevant images.

Social Impact and Future Work

While this work shows promising results from a methodology perspective with a simple data
collection method, additional analysis of the data and the resulting model is necessary before the
responsible use of the model in practice. For instance, consideration should be given to the
potential for harmful text in alt-text data to reinforce such harms. With regard to
fairness, data balancing efforts may be required to prevent reinforcing stereotypes from the web
data. Additional testing and training around sensitive religious or cultural items should be undertaken to
understand and mitigate the impact from possibly mislabeled data.

Further analysis should also be conducted to ensure that the demographic distribution of humans and
related cultural items, such as clothing, food, and art, does not cause skewed model performance.
Analysis and balancing would be required if such models will be used in production.

Conclusion

We have presented a simple method of leveraging large-scale noisy image-text data to scale up
visual and vision-language representation learning. The resulting model, ALIGN, is capable of cross-
modal retrieval and significantly outperforms SotA models. In visual-only downstream tasks, ALIGN
is also comparable to or outperforms SotA models trained with large-scale labeled data.

Acknowledgement

We would like to thank our co-authors in Google Research: Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu
Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. This work was also done with invaluable help
from other colleagues from Google. We would like to thank Jan Dlabal and Zhe Li for continuous
support in training infrastructure, Simon Kornblith for building the zero-shot & robustness model
evaluation on ImageNet variants, Xiaohua Zhai for help on conducting VTAB evaluation, Mingxing Tan
and Max Moroz for suggestions on EfficientNet training, Aleksei Timofeev for the early idea of
multimodal query retrieval, Aaron Michelony and Kaushal Patel for their early work on data
generation, and Sergey Ioffe, Jason Baldridge and Krishna Srinivasan for the insightful feedback and
discussion.

Accelerating Eye Movement Research for Wellness and Accessibility
Monday, May 10, 2021
Posted by Nachiappan Valliappan, Senior Software Engineer and Kai Kohlhoff, Staff Research Scientist,
Google Research

Eye movement has been studied widely across vision science, language, and usability since the
1970s. Beyond basic research, a better understanding of eye movement could be useful in a wide
variety of applications, ranging across usability and user experience research, gaming, driving, and
gaze-based interaction for accessibility to healthcare. However, progress has been limited because
most prior research has focused on specialized hardware-based eye trackers that are expensive
and do not easily scale.

In “Accelerating eye movement research via accurate and affordable smartphone eye tracking”,
published in Nature Communications, and “Digital biomarker of mental fatigue”, published in npj
Digital Medicine, we present accurate, smartphone-based, ML-powered eye tracking that has the
potential to unlock new research into applications across the fields of vision, accessibility,
healthcare, and wellness, while additionally providing orders-of-magnitude scaling across diverse
populations in the world, all using the front-facing camera on a smartphone. We also discuss the
potential use of this technology as a digital biomarker of mental fatigue, which can be useful for
improved wellness.

Model Overview

The core of our gaze model was a multilayer feed-forward convolutional neural network (ConvNet)
trained on the MIT GazeCapture dataset. A face detection algorithm selected the face region with
associated eye corner landmarks, which were used to crop the images down to the eye region
alone. These cropped frames were fed through two identical ConvNet towers with shared weights.
Each convolutional layer was followed by an average pooling layer. Eye corner landmarks were
combined with the output of the two towers through fully connected layers. Rectified Linear Units
(ReLUs) were used for all layers except the final fully connected output layer (FC6), which had no
activation.

Architecture of the unpersonalized gaze model. Eye regions, extracted from a front-facing camera image, serve as input into
a convolutional neural network. Fully-connected (FC) layers combine the output with eye corner landmarks to infer gaze x-
and y-locations on screen via a multi-regression output layer.
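
A minimal tf.keras sketch of this two-tower design follows; the input shapes, filter counts, landmark dimensionality, and dense-layer widths are placeholder guesses, not the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_eye_tower(input_shape=(128, 128, 3)):
    # Shared ConvNet tower applied to each cropped eye region; each convolutional
    # layer is followed by average pooling, as described above.
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):                 # placeholder filter counts
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.AveragePooling2D()(x)
    return tf.keras.Model(inputs, layers.Flatten()(x), name="eye_tower")

def build_gaze_model():
    left_eye = tf.keras.Input(shape=(128, 128, 3), name="left_eye")
    right_eye = tf.keras.Input(shape=(128, 128, 3), name="right_eye")
    landmarks = tf.keras.Input(shape=(8,), name="eye_corner_landmarks")  # placeholder size

    tower = build_eye_tower()                     # one instance, so weights are shared
    x = layers.Concatenate()([tower(left_eye), tower(right_eye), landmarks])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(16, activation="relu")(x)
    gaze_xy = layers.Dense(2, name="gaze_xy")(x)  # final FC output layer, no activation
    return tf.keras.Model([left_eye, right_eye, landmarks], gaze_xy)

build_gaze_model().summary()
```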

The unpersonalized gaze model accuracy was improved by fine-tuning and per-participant
personalization. For the latter, a lightweight regression model was fitted to the model’s penultimate
ReLU layer and participant-specific data.
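
The personalization step might look roughly like the sketch below, where `penultimate_features` is a hypothetical helper that extracts the base model's final-ReLU-layer activations, and ridge regression stands in for the unspecified lightweight regressor.

```python
import numpy as np
from sklearn.linear_model import Ridge

def personalize(penultimate_features, calib_frames, calib_gaze_xy):
    """Fit a per-participant regressor on a short calibration session.
    `penultimate_features(frame)` is a hypothetical helper returning the base
    model's final-ReLU-layer activations; Ridge is an assumed choice of regressor."""
    X = np.stack([penultimate_features(f) for f in calib_frames])
    return Ridge(alpha=1.0).fit(X, calib_gaze_xy)      # targets: on-screen (x, y)

def predict_gaze(regressor, penultimate_features, frame):
    return regressor.predict(penultimate_features(frame)[None, :])[0]
```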

Model Evaluation

To evaluate the model, we collected data from consenting study participants as they viewed dots
that appeared at random locations on a blank screen. The model error was computed as the
distance (in cm) between the stimulus location and model prediction. Results show that while the
unpersonalized model has high error, personalization with ~30s of calibration data led to an over
fourfold error reduction (from 1.92 to 0.46cm). At a viewing distance of 25-40 cm, this corresponds
to 0.6-1° accuracy, a significant improvement over the 2.4-3° reported in previous work [1, 2].

Additional experiments show that the smartphone eye tracker model’s accuracy is comparable to
state-of-the-art wearable eye trackers both when the phone is placed on a device stand, as well as
when users hold the phone freely in their hand in a near frontal headpose. In contrast to specialized
eye tracking hardware with multiple infrared cameras close to each eye, running our gaze model
using a smartphone’s single front-facing RGB camera is significantly more cost effective (~100x
cheaper) and scalable.

Using this smartphone technology, we were able to replicate key findings from prior eye movement
research in neuroscience and psychology, including standard oculomotor tasks (to understand
basic visual functioning in the brain) and natural image understanding. For example, in a simple
prosaccade task, which tests a person’s ability to quickly move their eyes towards a stimulus that
appears on the screen, we found that the average saccade latency (time to move the eyes)
matches prior work for basic visual health (210ms versus 200-250ms). In controlled visual search
tasks, we were able to replicate key findings, such as the effect of target saliency and clutter on eye
movements.

Example gaze scanpaths show the effect of the target’s saliency (i.e., color contrast) on visual search performance. Fewer
fixations are required to find a target (left) with high saliency (different from the distractors), while more fixations are
required to find a target (right) with low saliency (similar to the distractors).

For complex stimuli, such as natural images, we found that the gaze distributions (computed by
aggregating gaze positions across all participants) from our smartphone eye tracker are similar to
those obtained from bulky, expensive eye trackers that used highly controlled settings, such as
laboratory chin rest systems. While the smartphone-based gaze heatmaps have a broader
distribution (i.e., they appear more “blurred”) than hardware-based eye trackers, they are highly
correlated both at the pixel level (r = 0.74) and object level (r = 0.90). These results suggest that this
technology could be used to scale gaze analysis for complex stimuli such as natural and medical
images (e.g., radiologists viewing MRI/PET scans).
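
The pixel-level comparison can be sketched as below, assuming aggregated gaze positions normalized to [0, 1] from each device; the bin count and exact procedure are illustrative rather than the paper's analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

def gaze_heatmap(xy, bins=64):
    # Aggregate normalized (x, y) gaze positions across participants into a 2-D histogram.
    heat, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins,
                                range=[[0.0, 1.0], [0.0, 1.0]])
    return heat / heat.sum()

def pixel_level_correlation(xy_phone, xy_tracker):
    # Pearson r between the flattened heatmaps from the two devices.
    return pearsonr(gaze_heatmap(xy_phone).ravel(),
                    gaze_heatmap(xy_tracker).ravel())[0]

# Toy usage with random fixations; real inputs would come from the two trackers.
rng = np.random.default_rng(0)
print(pixel_level_correlation(rng.random((5000, 2)), rng.random((5000, 2))))
```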

Similar gaze distribution from our smartphone approach vs. a more expensive (100x) eye tracker (from the OSIE dataset).

We found that smartphone gaze could also help detect difficulty with reading comprehension.
Participants reading passages spent significantly more time looking within the relevant excerpts
when they answered correctly. However, as comprehension difficulty increased, they spent more
time looking at the irrelevant excerpts in the passage before finding the relevant excerpt that
contained the answer. The fraction of gaze time spent on the relevant excerpt was a good predictor
of comprehension, and strongly negatively correlated with comprehension difficulty (r = −0.72).

Digital Biomarker of Mental Fatigue

Gaze detection is an important tool to detect alertness and wellbeing, and is studied widely in
medicine, sleep research, and mission-critical settings such as medical surgeries, aviation safety,
etc. However, existing fatigue tests are subjective and often time-consuming. In our recent paper
published in npj Digital Medicine, we demonstrated that smartphone gaze is significantly impaired
with mental fatigue, and can be used to track the onset and progression of fatigue.

A simple model predicts mental fatigue reliably using just a few minutes of gaze data from
participants performing a task. We validated these findings in two different experiments — using a
language-independent object-tracking task and a language-dependent proofreading task. As shown
below, in the object-tracking task, participants’ gaze initially follows the object’s circular trajectory,
but under fatigue, their gaze shows high errors and deviations. Given the pervasiveness of phones,
these results suggest that smartphone-based gaze could provide a scalable, digital biomarker of
mental fatigue.


Example gaze scanpaths for a participant with no fatigue (left) versus with mental fatigue (right) as they track an object
following a circular trajectory.

The corresponding progression of fatigue scores (ground truth) and model prediction as a function of time on task.

Beyond wellness, smartphone gaze could also provide a digital phenotype for screening or
monitoring health conditions such as autism spectrum disorder, dyslexia, concussion and more.
This could enable timely and early interventions, especially for countries with limited access to
healthcare services.

Another area that could benefit tremendously is accessibility. People with conditions such as ALS,
locked-in syndrome and stroke have impaired speech and motor ability. Smartphone gaze could
provide a powerful way to make daily tasks easier by using gaze for interaction, as recently
demonstrated with Look to Speak.

Ethical Considerations

Gaze research needs careful consideration, including being mindful of the correct use of such
technology — applications should obtain explicit approval and fully informed consent from users
for the specific task at hand. In our work, all data was collected for research purposes with users’
explicit approval and consent. In addition, users were allowed to opt out at any point and request
their data to be deleted. We continue to research additional ways to ensure ML fairness and
improve the accuracy and robustness of gaze technology across demographics, in a responsible,
privacy-preserving way.

Conclusion

Our findings of accurate and affordable ML-powered smartphone eye tracking offer the potential
for orders-of-magnitude scaling of eye movement research across disciplines (e.g., neuroscience,
psychology and human-computer interaction). They unlock potential new applications for societal
good, such as gaze-based interaction for accessibility, and smartphone-based screening and
monitoring tools for wellness and healthcare.

Acknowledgements

This work involved collaborative efforts from a multidisciplinary team of software engineers,
researchers, and cross-functional contributors. We’d like to thank all the co-authors of the papers,
including our team members, Junfeng He, Na Dai, Pingmei Xu, Venky Ramachandran; interns, Ethan
Steinberg, Kantwon Rogers, Li Guo, and Vincent Tseng; collaborators, Tanzeem Choudhury; and UXRs:
Mina Shojaeizadeh, Preeti Talwai, and Ran Tao. We’d also like to thank Tomer Shekel, Gaurav
Nemade, and Reena Lee for their contributions to this project, and Vidhya Navalpakkam for her
technical leadership in initiating and overseeing this body of work.

Crisscrossed Captions: Semantic Similarity for Images and Text
Thursday, May 6, 2021
Posted by Zarana Parekh, Software Engineer and Jason Baldridge, Staff Research Scientist, Google
Research

The past decade has seen remarkable progress on automatic image captioning, a task in which a
computer algorithm creates written descriptions for images. Much of the progress has come
through the use of modern deep learning methods developed for both computer vision and natural
language processing, combined with large scale datasets that pair images with descriptions
created by people. In addition to supporting important practical applications, such as providing
descriptions of images for visually impaired people, these datasets also enable investigations into
important and exciting research questions about grounding language in visual inputs. For example,
learning deep representations for a word like “car” means using both linguistic and visual contexts.

Image captioning datasets that contain pairs of textual descriptions and their corresponding
images, such as MS-COCO and Flickr30k, have been widely used to learn aligned image and text
representations and to build captioning models. Unfortunately, these datasets have limited cross-
modal associations: images are not paired with other images, captions are only paired with other
captions of the same image (also called co-captions), there are image-caption pairs that match but
are not labeled as a match, and there are no labels that indicate when an image-caption pair does
not match. This undermines research into how inter-modality learning (connecting captions to
images, for example) impacts intra-modality tasks (connecting captions to captions or images to
images). This is important to address, especially because a fair amount of work on learning from
images paired with text is motivated by arguments about how visual elements should inform and
improve representations of language.

To address this evaluation gap, we present "Crisscrossed Captions: Extended Intramodal and
Intermodal Semantic Similarity Judgments for MS-COCO", which was recently presented at EACL
2021. The Crisscrossed Captions (CxC) dataset extends the development and test splits of MS-
COCO with semantic similarity ratings for image-text, text-text and image-image pairs. The rating
criteria are based on Semantic Textual Similarity, an existing and widely-adopted measure of
semantic relatedness between pairs of short texts, which we extend to include judgments about
images as well. In all, CxC contains human-derived semantic similarity ratings for 267,095 pairs
(derived from 1,335,475 independent judgments), a massive extension in scale and detail to the
50k original binary pairings in MS-COCO’s development and test splits. We have released CxC’s
ratings, along with code to merge CxC with existing MS-COCO data. Anyone familiar with MS-COCO
can thus easily enhance their experiments with CxC.

Crisscrossed Captions extends the MS-COCO evaluation sets by adding human-derived semantic similarity ratings for
existing image-caption pairs and co-captions (solid lines), and it increases rating density by adding human ratings for new
image-caption, caption-caption and image-image pairs (dashed lines).*

Creating the CxC Dataset

If a picture is worth a thousand words, it is likely because there are so many details and
relationships between objects that are generally depicted in pictures. We can describe the texture
of the fur on a dog, name the logo on the frisbee it is chasing, mention the expression on the face
of the person who has just thrown the frisbee, or note the vibrant red on a large leaf in a tree above
the person’s head, and so on.

The CxC dataset extends the MS-COCO evaluation splits with graded similarity associations within
and across modalities. MS-COCO has five captions for each image, split into 410k training, 25k
development, and 25k test captions (for 82k, 5k, 5k images, respectively). An ideal extension would
rate every pair in the dataset (caption-caption, image-image, and image-caption), but this is
infeasible as it would require obtaining human ratings for billions of pairs.

Given that randomly selected pairs of images and captions are likely to be dissimilar, we came up
with a way to select items for human rating that would include at least some new pairs with high
expected similarity. To reduce the dependence of the chosen pairs on the models used to find
them, we introduce an indirect sampling scheme (depicted below) where we encode images and
captions using different encoding methods and compute the similarity between pairs of same
modality items, resulting in similarity matrices. Images are encoded using Graph-RISE embeddings,
while captions are encoded using two methods — Universal Sentence Encoder (USE) and average
bag-of-words (BoW) based on GloVe embeddings. Since each MS-COCO example has five co-
captions, we average the co-caption encodings to create a single representation per example,
ensuring all caption pairs can be mapped to image pairs (more below on how we select
intermodality pairs).
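
A simplified numpy sketch of this step is shown below; the random arrays stand in for Graph-RISE, USE, or BoW embeddings, and the pair-selection heuristic is intentionally bare-bones.

```python
import numpy as np

def average_cocaption_embeddings(caption_embs):
    """caption_embs: [num_images, 5, dim], five co-caption embeddings per image.
    Returns one L2-normalized embedding per image."""
    avg = caption_embs.mean(axis=1)
    return avg / np.linalg.norm(avg, axis=1, keepdims=True)

def top_similar_pairs(embs, k=10):
    """Return the k most similar distinct (i, j) index pairs by cosine similarity."""
    sims = embs @ embs.T
    np.fill_diagonal(sims, -np.inf)                    # ignore self-similarity
    order = np.argsort(-sims, axis=None)[: 2 * k]      # each pair appears as (i, j) and (j, i)
    pairs = []
    for idx in order:
        i, j = divmod(int(idx), sims.shape[1])
        if (j, i) not in pairs:
            pairs.append((i, j))
    return pairs[:k]

# Candidate image pairs are the images whose averaged caption embeddings were most
# similar (and vice versa for caption pairs); those candidates go out for human rating.
rng = np.random.default_rng(0)
print(top_similar_pairs(average_cocaption_embeddings(rng.normal(size=(100, 5, 32))), k=5))
```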

Top: Text similarity matrix (each cell corresponds to a similarity score) constructed using averaged co-caption encodings, so
each text entry corresponds to a single image, resulting in a 5k x 5k matrix. Two different text encoding methods were used,
but only one text similarity matrix has been shown for simplicity. Bottom: Image similarity matrix for each image in the
dataset, resulting in a 5k x 5k matrix.

The next step of the indirect sampling scheme is to use the computed similarities of images for a
biased sampling of caption pairs for human rating (and vice versa). For example, we select two
captions with high computed similarities from the text similarity matrix, then take each of their
images, resulting in a new pair of images that are different in appearance but similar in what they
depict based on their descriptions. For example, the captions “A dog looking bashfully to the side”
and “A black dog lifts its head to the side to enjoy a breeze” would have a reasonably high model
similarity, so the corresponding images of the two dogs in the figure below could be selected for
image similarity rating. This step can also start with two images with high computed similarities to
yield a new pair of captions. We now have indirectly sampled new intramodal pairs — at least some
of which are highly similar — for which we obtain human ratings.


Top: Pairs of images are picked based on their computed caption similarity. Bottom: Pairs of captions are picked based on
the computed similarity of the images they describe.

Last, we then use these new intramodal pairs and their human ratings to select new intermodal
pairs for human rating. We do this by using existing image-caption pairs to link between modalities.
For example, if a caption pair drawn from examples i and j was rated by humans as highly similar, we pick the image
from example i and caption from example j to obtain a new intermodal pair for human rating. And
again, we use the intramodal pairs with the highest rated similarity for sampling because this
includes at least some new pairs with high similarity. Finally, we also add human ratings for all
existing intermodal pairs and a large sample of co-captions.

The following table shows examples of semantic image similarity (SIS) and semantic image-text
similarity (SITS) pairs corresponding to each rating, with 5 being the most similar and 0 being
completely dissimilar.


Examples for each human-derived similarity score (left: 5 to 0, 5 being very similar and 0 being completely dissimilar) of
image pairs based on SIS (middle) and SITS (right) tasks. Note that these examples are for illustrative purposes and are not
themselves in the CxC dataset.

Evaluation

MS-COCO supports three retrieval tasks:

1. Given an image, find its matching captions out of all other captions in the evaluation
set.
2. Given a caption, find its corresponding image out of all other images in the
evaluation set.
3. Given a caption, find its other co-captions out of all other captions in the evaluation
set.

MS-COCO’s pairs are incomplete because captions created for one image at times apply equally
well to another, yet these associations are not captured in the dataset. CxC enhances these existing
retrieval tasks with new positive pairs, and it also supports a new image-image retrieval task. With
its graded similarity judgements, CxC also makes it possible to measure correlations between
model and human rankings. Retrieval metrics in general focus only on positive pairs, while CxC’s
correlation scores additionally account for the relative ordering of similarity and include low-scoring
items (non-matches). Supporting these evaluations on a common set of images and captions
makes them more valuable for understanding inter-modal learning compared to disjoint sets of
caption-image, caption-caption, and image-image associations.
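
One simple way to compute such a correlation score, sketched below under the assumption that the model's similarity is cosine similarity between embeddings, is Spearman rank correlation against the graded human ratings.

```python
import numpy as np
from scipy.stats import spearmanr

def model_human_correlation(emb_a, emb_b, human_ratings):
    """Spearman rank correlation between model cosine similarities and graded
    human ratings for the same pairs.
    emb_a, emb_b: [num_pairs, dim] embeddings of the two sides of each rated pair.
    human_ratings: [num_pairs] graded similarity scores (e.g., 0-5)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    model_scores = np.sum(a * b, axis=1)
    return spearmanr(model_scores, human_ratings).correlation
```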

We ran a series of experiments to show the utility of CxC’s ratings. For this, we constructed three
dual encoder (DE) models using BERT-base as the text encoder and EfficientNet-B4 as the image
encoder:

1. A text-text (DE_T2T) model that uses a shared text encoder for both sides.
2. An image-text model (DE_I2T) that uses the aforementioned text and image
encoders, and includes a layer above the text encoder to match the image encoder
output.
3. A multitask model (DE_I2T+T2T) trained on a weighted combination of text-text and
image-text tasks.

CxC retrieval results — a comparison of our text-text (T2T), image-text (I2T) and multitask (I2T+T2T) dual encoder models
on all the four retrieval tasks.

From the results on the retrieval tasks, we can see that DE_I2T+T2T (yellow bar) performs better
than DE_I2T (red bar) on the image-text and text-image retrieval tasks. Thus, adding the intramodal
(text-text) training task helped improve the intermodal (image-text, text-image) performance. As for
the other two intramodal tasks (text-text and image-image), DE_I2T+T2T shows strong, balanced
performance on both of them.

CxC correlation results for the same models shown above.

For the correlation tasks, DE_I2T performs the best on SIS and DE_I2T+T2T is the best overall. The
correlation scores also show that DE_I2T performs well only on images: it has the highest SIS but
has much worse STS. Adding the text-text loss to DE_I2T training (DE_I2T+T2T) produces more
balanced overall performance.

The CxC dataset provides a much more complete set of relationships between and among images
and captions than the raw MS-COCO image-caption pairs. The new ratings have been released and
further details are in our paper. We hope to encourage the research community to push the state of
the art on the tasks introduced by CxC with better models for jointly learning inter- and intra-modal
representations.

Acknowledgments

The core team includes Daniel Cer, Yinfei Yang and Austin Waters. We thank Julia Hockenmaier for
her inputs on CxC’s formulation, the Google Data Compute Team, especially Ashwin Kakarla and
Mohd Majeed for their tooling and annotation support, Yuan Zhang, Eugene Ie for their comments on
the initial versions of the paper and Daphne Luong for executive support for the data collection.

  *All the images in the article have been taken from the Open Images dataset under the CC-by 4.0 license. ↩

Introducing FELIX: Flexible Text Editing Through Tagging and Insertion
Wednesday, May 5, 2021
Posted by Jonathan Mallinson and Aliaksei Severyn, Research Scientists, Google Research

Sequence-to-sequence (seq2seq) models have become a favoured approach for tackling natural
language generation tasks, with applications ranging from machine translation to monolingual
generation tasks, such as summarization, sentence fusion, text simplification, and machine
translation post-editing. However, these models appear to be a suboptimal choice for many
monolingual tasks, as the desired output text often represents a minor rewrite of the input text.
When accomplishing such tasks, seq2seq models are both slower because they generate the
output one word at a time (i.e., autoregressively), and wasteful because most of the input tokens
are simply copied into the output.

Instead, text-editing models have recently received a surge of interest as they propose to predict
edit operations – such as word deletion, insertion, or replacement – that are applied to the input to
reconstruct the output. However, previous text-editing approaches have limitations. They are either
fast (being non-autoregressive), but not flexible, because they use a limited number of edit
operations, or they are flexible, supporting all possible edit operations, but slow (autoregressive). In
either case, they have not focused on modeling large structural (syntactic) transformations, for
example switching from active voice, “They ate steak for dinner,” to passive, “Steak was eaten for
dinner.” Instead, they've focused on local transformations, deleting or replacing short phrases.
When a large structural transformation needs to occur, they either can’t produce it or insert a large
amount of new text, which is slow.

In “FELIX: Flexible Text Editing Through Tagging and Insertion”, we introduce FELIX, a fast and
flexible text-editing system that models large structural changes and achieves a 90x speed-up
compared to seq2seq approaches whilst achieving impressive results on four monolingual
generation tasks. Compared to traditional seq2seq methods, FELIX has the following three key
advantages:

Sample efficiency: Training a high precision text generation model typically requires
large amounts of high-quality supervised data. FELIX uses three techniques to
minimize the amount of required data: (1) fine-tuning pre-trained checkpoints, (2) a
tagging model that learns a small number of edit operations, and (3) a text insertion
task that is very similar to the pre-training task.
Fast inference time: FELIX is fully non-autoregressive, avoiding slow inference times
caused by an autoregressive decoder.
Flexible text editing: FELIX strikes a balance between the complexity of learned edit
operations and flexibility in the transformations it models.

In short, FELIX is designed to derive the maximum benefit from self-supervised pre-training, being
efficient in low-resource settings, with little training data.

Overview

To achieve the above, FELIX decomposes the text-editing task into two sub-tasks: tagging to decide
on the subset of input words and their order in the output text, and insertion, where words that are
not present in the input are inserted. The tagging model employs a novel pointer mechanism, which
supports structural transformations, while the insertion model is based on a Masked Language
Model. Both of these models are non-autoregressive, ensuring the model is fast. A diagram of
FELIX can be seen below.

An example of FELIX trained on data for a text simplification task. Input words are first tagged as KEEP (K), DELETE (D) or
KEEP and INSERT (I). After tagging, the input is reordered. This reordered input is then fed to a masked language model.

The Tagging Model

The first step in FELIX is the tagging model, which consists of two components. First the tagger
determines which words should be kept or deleted and where new words should be inserted. When
the tagger predicts an insertion, a special MASK token is added to the output. After tagging, there is
a reordering step where the pointer reorders the input to form the output, by which it is able to reuse
parts of the input instead of inserting new text. The reordering step supports arbitrary rewrites,
which enables modeling large changes. The pointer network is trained such that each word in the
input points to the next word as it will appear in the output, as shown below.
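
The toy sketch below applies tags and pointers to reproduce the realization shown in the figure caption that follows ("There are 3 layers in the walls of the heart" becoming "the heart MASK 3 layers"); the tag names and pointer encoding are illustrative, not FELIX's exact internal format.

```python
def realize(tokens, tags, pointers, start):
    """Apply tags and pointers to produce the masked, reordered output.
    tags[i]    : "K" keep, "D" delete, "I" keep and insert a MASK slot after it.
    pointers[i]: index of the token that follows token i in the output (-1 ends it).
    start      : index of the first output token (played by a BOS pointer in FELIX)."""
    out, i = [], start
    while i != -1:
        if tags[i] != "D":                 # deleted tokens never enter the output
            out.append(tokens[i])
            if tags[i] == "I":
                out.append("[MASK]")
        i = pointers[i]
    return out

# "There are 3 layers in the walls of the heart" -> "the heart [MASK] 3 layers"
tokens   = ["There", "are", "3", "layers", "in", "the", "walls", "of", "the", "heart"]
tags     = ["D", "D", "K", "K", "D", "K", "D", "D", "D", "I"]
pointers = [-1, -1, 3, -1, -1, 9, -1, -1, -1, 2]   # the -> heart (+MASK) -> 3 -> layers
print(" ".join(realize(tokens, tags, pointers, start=5)))
```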


Realization of the pointing mechanism to transform "There are 3 layers in the walls of the heart" into "the heart MASK 3
layers".

The Insertion Model

The output of the tagging model is the reordered input text with deleted words and MASK tokens
predicted by the insertion tag. The insertion model must predict the content of MASK tokens.
Because FELIX’s insertion model is very similar to the pretraining objective of BERT, it can take
direct advantage of the pre-training, which is particularly advantageous when data is limited.
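
As an illustration, any off-the-shelf BERT-style masked language model can play the insertion role; the Hugging Face `transformers` pipeline below is a stand-in for demonstration, not the model FELIX actually trains.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Any BERT-style masked LM can stand in for the insertion model in this sketch.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Reordered, tagged output from the tagging step, with the slot to be filled.
for candidate in fill_mask("the heart [MASK] 3 layers")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
```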

Example of the insertion model, where the tagger predicts two words will be inserted and the insertion model predicts the
content of the MASK tokens.

Results

We evaluated FELIX on sentence fusion, text simplification, abstractive summarization, and
machine translation post-editing. These tasks vary significantly in the types of edits required and
dataset sizes under which they operate. Below are the results on the sentence fusion task (i.e.,
merging two sentences into one), comparing FELIX against a large pre-trained seq2seq model
(BERT2BERT) and a text-editing model (LaserTagger), under a range of dataset sizes. We see that
FELIX outperforms LaserTagger and can be trained on as little as a few hundred training examples.
For the full dataset, the autoregressive BERT2BERT outperforms FELIX. However, during inference,
this model takes significantly longer.


A comparison of different training dataset sizes on the DiscoFuse dataset. We compare FELIX (using the best performing
model) against BERT2BERT and LaserTagger.

Latency in milliseconds for a batch of 32 on a Nvidia Tesla P100.

Conclusion

We have presented FELIX, which is fully non-autoregressive, providing even faster inference times,
while achieving state-of-the-art results. FELIX also minimizes the amount of required training data
with three techniques — fine-tuning pre-trained checkpoints, learning a small number of edit
operations, and an insertion task that mimics the masked language modeling task from pre-training.
Lastly, FELIX strikes a balance between the complexity of learned edit operations and the
percentage of input-output transformations it can handle. We have open-sourced the code for
FELIX and hope it will provide researchers with a faster, more efficient, and more flexible text-
editing model.

Acknowledgements


This research was conducted by Jonathan Mallinson, Aliaksei Severyn (equal contribution), Eric
Malmi, Guillermo Garrido. We would like to thank Aleksandr Chuklin, Daniil Mirylenka, Ryan McDonald,
and Sebastian Krause for useful discussions, running early experiments and paper suggestions.

Do Wide and Deep Networks Learn the Same Things?


Tuesday, May 4, 2021
Posted by Thao Nguyen, AI Resident, Google Research

A common practice to improve a neural network’s performance and tailor it to available
computational resources is to adjust the architecture depth and width. Indeed, popular families of
neural networks, including EfficientNet, ResNet and Transformers, consist of a set of architectures
of flexible depths and widths. However, beyond the effect on accuracy, there is limited
understanding of how these fundamental choices of architecture design affect the model, such as
the impact on its internal representations.

In “Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network
Representations Vary with Width and Depth”, we perform a systematic study of the similarity
between wide and deep networks from the same architectural family through the lens of their
hidden representations and final outputs. In very wide or very deep models, we find a characteristic
block structure in their internal representations, and establish a connection between this
phenomenon and model overparameterization. Comparisons across models demonstrate that
those without the block structure show significant similarity between representations in
corresponding layers, but those containing the block structure exhibit highly dissimilar
representations. These properties of the internal representations in turn translate to systematically
different errors at the class and example levels for wide and deep models when they are evaluated
on the same test set.

Comparing Representation Similarity with CKA

We extended prior work on analyzing representations by leveraging our previously developed
Centered Kernel Alignment (CKA) technique, which provides a robust, scalable way to determine
the similarity between the representations learned by any pair of neural network layers. CKA takes
as input the representations (i.e., the activation matrices) from two layers, and outputs a similarity
score between 0 (not at all similar) and 1 (identical representations).

We apply CKA to a family of ResNets of varying depths and widths, trained on common benchmark
datasets (CIFAR-10, CIFAR-100 and ImageNet), and use representation heatmaps to illustrate the
results. The x and y axes of each heatmap index the layers of the model(s) in consideration, going
from input to output, and each entry (i, j) is the CKA similarity score between layer i and layer j.
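
A compact numpy sketch of linear CKA and the all-pairs heatmap computation is below; it uses the standard linear-CKA formula on activation matrices and is not necessarily the exact implementation used in the paper.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape [num_examples, num_features].
    Returns a score in [0, 1]; identical representations score 1."""
    X = X - X.mean(axis=0, keepdims=True)      # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

def cka_heatmap(layer_activations):
    # One heatmap cell per pair of layers, as in the figures described above.
    n = len(layer_activations)
    heat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            heat[i, j] = linear_cka(layer_activations[i], layer_activations[j])
    return heat

# Toy usage: identical layers score 1.0 on the diagonal; unrelated random layers score lower.
rng = np.random.default_rng(0)
print(np.round(cka_heatmap([rng.normal(size=(256, 32)) for _ in range(4)]), 2))
```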


We use CKA to compute the representation similarity for all pairs of layers within a single model (i.e., when network 1 and
network 2 are identical), and across models (i.e., when network 1 and network 2 are trained with different random
initializations, or have different architectures altogether).

Below is an example of the resulting heatmap when we compare representations of each layer to
every other layer within a single ResNet of depth 26 and width multiplier 1. In the design convention
used here, the stated depth only refers to the number of convolutional layers in the network, but we
analyze all layers present, and the width multiplier applies to the number of filters in each
convolution. Notice the checkerboard pattern in the heatmap, which is caused by skip connections
(shortcuts between layers) in the architecture.

The Emergence of the Block Structure

What stands out from the representation heatmaps of deeper or wider networks is the emergence
of a large set of consecutive layers with highly similar representations, which appears in the
heatmaps as a yellow square (i.e., a region with high CKA scores). This phenomenon, which we call
the block structure, suggests that the underlying layers may not be as efficient at progressively
refining the network’s representations as we expect. Indeed, we show that the task performance
becomes stagnant inside the block structure, and that it is possible to prune some underlying
layers without affecting the final performance.


Block structure — a large, contiguous set of layers with highly similar representations — emerges with increasing width or
depth. Each heatmap panel shows the CKA similarity between all pairs of layers within a single neural network. While its size
and position can vary across different training runs, the block structure is a robust phenomenon that arises consistently in
larger models.

With additional experiments, we show that the block structure has less to do with the absolute
model size than with the size of the model relative to the size of the training dataset. As we reduce
the training dataset size, the block structure starts to appear in shallower and narrower networks:

With increasing network width (towards the right along each row) and decreasing dataset size (down each column), the
relative model capacity (with respect to a given task) is effectively inflated, and the block structure begins to appear in
smaller models.

Through further analysis, we are also able to demonstrate that the block structure arises from
preserving and propagating the dominant principal components of its underlying representations.
Refer to our paper for more details.

Comparing Representations Across Models

Going further, we study the implications of depth and width on representations across models of
different random initializations and different architectures, and find that the presence of block
structure makes a significant difference in this context as well. Despite having different
architectures, wide and deep models without the block structure do exhibit representation similarity
with each other, with corresponding layers broadly being of the same proportional depth in the
model. However, when the block structure is present, its representations are unique to each model.
This suggests that despite having similar overall performance, each wide or deep model with the
block structure picks up a unique mapping from the input to the output.

For smaller models (e.g., ResNet-38 1×), CKA across different initializations (off the diagonal) closely resembles CKA within
a single model (on the diagonal). In contrast, representations within the block structure of wider and deeper models (e.g.,
ResNet-38 10×, ResNet-164 1×) are highly dissimilar across training runs.

Error Analysis of Wide and Deep Models

Having explored the properties of the learned representations of wide and deep models, we next
turn to understanding how they influence the diversity of the output predictions. We train
populations of networks of different architectures and determine on which test set examples each
architecture configuration tends to make errors.

On both CIFAR-10 and ImageNet datasets, wide and deep models that have the same average
accuracy still demonstrate statistically significant differences in example-level predictions. The
same observation holds for class-level errors on ImageNet, with wide models exhibiting a small
advantage in identifying classes corresponding to scenes, and deep networks being relatively more
accurate on consumer goods.


Per-class differences on ImageNet between models with increased width (y-axis) or depth (x-axis). Orange dots reflect
differences between two sets of 50 different random initializations of ResNet-83 (1×).

Conclusions

In studying the effects of depth and width on internal representations, we uncover a block structure
phenomenon, and demonstrate its connection to model capacity. We also show that wide and deep
models exhibit systematic output differences at class and example levels. Check out the paper for
full details on these results and additional insights! We’re excited about the many interesting open
questions these findings suggest, such as how the block structure arises during training, whether
the phenomenon occurs in domains beyond image classification, and ways these insights on
internal representations can inform model efficiency and generalization.

Acknowledgements

This is a joint work with Maithra Raghu and Simon Kornblith. We would like to thank Tom Small for
the visualizations of the representation heatmap.

Google at ICLR 2021


Monday, May 3, 2021
Posted by Jaqui Herman, Research Specialist and Tim Herrmann, Program Manager

The 9th International Conference on Learning Representations (ICLR 2021), a virtual conference
focused on deep learning, kicked off this week, offering conference and workshop tracks that
present some of the latest research in deep learning and its applications to areas such as computer
vision, computational biology, speech recognition, text understanding, and more.

As a Platinum Sponsor of ICLR 2021, Google will have a strong presence with over 100 accepted
publications and participation on organizing committees and in workshops. If you have registered
for ICLR 2021, we hope you’ll watch our talks and learn about the work at Google that goes into
solving interesting problems for billions of people. Learn more about our research being presented
in the list below (Googlers in bold).

Officers and Board Members

Includes: Hugo Larochelle, Tara Sainath

Organizing Committee

Includes: Sanmi Koyejo, Chelsea Finn

Area Chairs

Includes: Abhishek Kumar, Aditya Menon, Aleksandra Faust, Alexey Dosovitskiy, Andrew Cotter,
Andrew Dai, Augustus Odena, Been Kim, Behnam Neyshabur, Ben Poole, Bo Dai, Bo Li, Branislav
Kveton, Ce Liu, Claudio Gentile, Colin Raffel, Danny Tarlow, David Ha, Dengyong Zhou, Dumitru Erhan,
Dustin Tran, Felix Hill, George Tucker, Hanie Sedghi, Heinrich Jiang, Hossein Mobahi, Izhak Shafran,
Jascha Sohl-Dickstein, Jasper Snoek, Jean-Philippe Vert, Jeffrey Pennington, Justin Gilmer, Kevin
Swersky, Marco Cuturi, Mario Lucic, Marlos C. Machado, Mathieu Blondel, Matt Johnson, Matthieu
Geist, Mohammad Norouzi, Naman Agarwal, Navdeep Jaitly, Nicolas Le Roux, Niki Parmar, Olivier
Bachem, Olivier Pietquin, Philip Long, Quentin Berthet, Razvan Pascanu, Rodolphe Jenatton, Samy
Bengio*, Sebastian Nowozin, Silvio Lattanzi, Slav Petrov, Srinadh Bhojanapalli, Suman Ravuri, Tim
Salimans, Vitaly Kuznetsov, William Cohen, Yann Dauphin, Yujia Li

Publications

Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes

Mike Gartrell, Insu Han, Elvis Dohmatob, Jennifer Gillenwater, Victor-Emmanuel Brunel

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (see the blog post)

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,
Neil Houlsby

Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation

Biao Zhang*, Ankur Bapna, Rico Sennrich, Orhan Firat

Evolving Reinforcement Learning Algorithms (see the blog post)

John D Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Quoc V Le, Sergey Levine, Honglak Lee,
Aleksandra Faust

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song*, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study

Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier,
Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem

When Do Curricula Work?

Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

Sharpness-aware Minimization for Efficiently Improving Generalization

Pierre Foret*, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur

Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

Zirui Wang*, Yulia Tsvetkov, Orhan Firat, Yuan Cao

Mathematical Reasoning via Self-supervised Skip-tree Training

Markus Norman Rabe, Dennis Lee, Kshitij Bansal, Christian Szegedy

Long-Tail Learning via Logit Adjustment

Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv
Kumar

Are Neural Rankers Still Outperformed by Gradient Boosted Decision Trees?

Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Michael
Bendersky, Marc Najork

LambdaNetworks: Modeling Long-Range Interactions without Attention

Irwan Bello

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc G Bellemare

BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration

Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton, Hanjun Dai

Practical Real Time Recurrent Learning with a Sparse Approximation

Jacob Menick, Erich Elsen, Utku Evci, Simon Osindero, Karen Simonyan, Alex Graves

LEAF: A Learnable Frontend for Audio Classification (see the blog post)

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi

Batch Reinforcement Learning Through Continuation Method

Yijie Guo, Shengyu Feng, Nicolas Le Roux, Ed Chi, Honglak Lee, Minmin Chen

Scalable Transfer Learning with Expert Models

Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, Cedric Renggli*, André Susano Pinto, Sylvain
Gelly, Daniel Keysers, Neil Houlsby

Scaling Symbolic Methods Using Gradients for Neural Model Explanation

Subham Sekhar Sahoo, Subhashini Venugopalan, Li Li, Rishabh Singh, Patrick Riley

Primal Wasserstein Imitation Learning (see the blog post)

Robert Dadashi, Leonard Hussenot, Matthieu Geist, Olivier Pietquin

Reset-Free Lifelong Learning with Skill-Space Planning

Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

Teaching Temporal Logics to Neural Networks

Christopher Hahn, Frederik Schmitt, Jens U. Kreber, Markus Norman Rabe, Bernd Finkbeiner

Shape-Texture Debiased Neural Network Training

Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, Cihang Xie

Rethinking Embedding Coupling in Pre-trained Language Models


Hyung Won Chung, Thibault Fevry*, Henry Tsai, Melvin Johnson, Sebastian Ruder

Overparameterisation and Worst-Case Generalisation: Friend or Foe?

Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Single-Photon Image Classification

Thomas Fischbacher, Luciano Sbaiz

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Efthymios Tzinis*, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R.
Hershey

Adaptive Federated Optimization

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
Sanjiv Kumar, Hugh Brendan McMahan

Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers

Benjamin Eysenbach, Shreyas Chaudhari, Swapnil Asawa, Sergey Levine, Ruslan Salakhutdinov

Open Question Answering over Tables and Text

Wenhu Chen*, Ming-Wei Chang, Eva Schlinger, William Yang Wang, William W. Cohen

IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression

Rianne van den Berg, Alexey A. Gritsenko, Mostafa Dehghani, Casper Kaae Sønderby, Tim Salimans

A Universal Representation Transformer Layer for Few-Shot Image Classification

Lu Liu, William L. Hamilton, Guodong Long, Jing Jiang, Hugo Larochelle

Tradeoffs in Data Augmentation: An Empirical Study

Raphael Gontijo-Lopes, Sylvia Smullin, Ekin Dogus Cubuk, Ethan Dyer

Coping with Label Shift via Distributionally Robust Optimisation

Jingzhao Zhang, Aditya Krishna Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra

Rethinking Attention with Performers (see the blog post)

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin
Belanger, Lucy J Colwell, Adrian Weller

Teaching with Commentaries

Aniruddh Raghu*, Maithra Raghu, Simon Kornblith, David Duvenaud, Geoffrey Hinton

Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

Vinay Venkatesh Ramasesh, Ethan Dyer, Maithra Raghu

Model-Based Offline Planning

Arthur Argenson, Gabriel Dulac-Arnold


The Geometry of Integration in Text Classification RNNs

Kyle Aitken*, Vinay Venkatesh Ramasesh, Ankush Garg, Yuan Cao, David Sussillo, Niru
Maheswaranathan

On the Origin of Implicit Regularization in Stochastic Gradient Descent

Samuel L Smith, Benoit Dherin, David Barrett, Soham De

The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers (see the blog
post)

Preetum Nakkiran*, Behnam Neyshabur, Hanie Sedghi

Learning Energy-Based Models by Diffusion Recovery Likelihood

Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, Diederik P Kingma

Latent Skill Planning for Exploration and Transfer

Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, Florian Shkurti

PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Yuliang Zou*, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, Tomas Pfister

WaveGrad: Estimating Gradients for Waveform Generation

Nanxin Chen*, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, William Chan

One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks

Atish Agarwala, Abhimanyu Das, Brendan Juba*, Rina Panigrahy, Vatsal Sharan*, Xin Wang, Qiuyi
Zhang

Long Range Arena : A Benchmark for Efficient Transformers

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu
Yang, Sebastian Ruder, Donald Metzler

Explainable Deep One-Class Classification

Philipp Liznerski, Lukas Ruff, Robert A. Vandermeulen, Billy Joe Franks, Marius Kloft, Klaus-Robert Müller

Net-DNF: Effective Deep Modeling of Tabular Data

Liran Katzir, Gal Elidan, Ran El-Yaniv

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu

Auxiliary Task Update Decomposition: The Good, the Bad and the Neutral

Lucio M. Dery, Yann Dauphin, David Grangier

Average-Case Acceleration for Bilinear Games and Normal Matrices

Carles Domingo-Enrich, Fabian Pedregosa, Damien Scieur

OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

Anurag Ajay*, Aviral Kumar, Pulkit Agrawal, Sergey Levine, Ofir Nachum

Training Independent Subnetworks for Robust Prediction

Marton Havasi*, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji
Lakshminarayanan, Andrew Mingbo Dai, Dustin Tran

Benchmarks for Deep Off-Policy Evaluation

Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov,
Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine,
Thomas Paine

TropEx: An Algorithm for Extracting Linear Terms in Deep Neural Networks

Martin Trimmel, Henning Petzka, Cristian Sminchisescu

Mastering Atari with Discrete World Models (see the blog post)

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, Jimmy Ba

Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit

Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning

Elan Sopher Markowitz, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Sami Abu-El-Haija, Bryan Perozzi, Greg Ver Steeg, Aram Galstyan

Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies

Paul Pu Liang*, Manzil Zaheer, Yuan Wang, Amr Ahmed

HyperGrid Transformers: Towards A Single Model for Multiple Tasks

Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan

Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

Maruan Al-Shedivat*, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh

Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network
Representations Vary with Width and Depth

Thao Nguyen, Maithra Raghu, Simon Kornblith

A Unifying View on Implicit Bias in Training Linear Neural Networks

Chulhee Yun*, Shankar Krishnan, Hossein Mobahi

Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine

Lipschitz Recurrent Neural Networks

N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, Michael W. Mahoney

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Michael R Zhang*, Thomas Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang,
Mohammad Norouzi

The Importance of Pessimism in Fixed-Dataset Policy Optimization

Jacob Buckman, Carles Gelada, Marc G Bellemare

Monotonic Kronecker-Factored Lattice

William Taylor Bakst, Nobuyuki Morioka, Erez Louidor

Adversarially Guided Actor-Critic

Yannis Flet-Berliac, Johan Ferret, Olivier Pietquin, Philippe Preux, Matthieu Geist

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, Zhifeng Chen

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon,
Honglak Lee*, Seunghoon Hong

Dataset Meta-Learning from Kernel Ridge-Regression

Timothy Nguyen, Zhourong Chen, Jaehoon Lee

Dual-Mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling

Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N Sainath, Yonghui Wu, Ruoming
Pang

Implicit Gradient Regularization

David Barrett, Benoit Dherin

Deconstructing the Regularization of BatchNorm

Yann Dauphin, Ekin Dogus Cubuk

C-Learning: Learning to Achieve Goals via Recursive Classification

Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

Colorization Transformer

Manoj Kumar, Dirk Weissenborn, Nal Kalchbrenner

Control-Aware Representations for Model-based Reinforcement Learning

Brandon Cui, Yinlam Chow, Mohammad Ghavamzadeh

Evaluations and Methods for Explanation through Robustness Analysis

Cheng-Yu Hsieh, Chih-Kuan Yeh, Xuanqing Liu, Pradeep Kumar Ravikumar, Seungyeon Kim, Sanjiv
Kumar, Cho-Jui Hsieh

Learning and Evaluating Representations for Deep One-Class Classification

Kihyuk Sohn, Chun-Liang Li, Jinsung Yoon, Minho Jin, Tomas Pfister

No MCMC for Me: Amortized Sampling for Fast and Stable Training of Energy-Based Models

Will Sussman Grathwohl, Jacob Jin Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, David
Duvenaud

Neural Thompson Sampling

Weitong Zhang, Dongruo Zhou, Lihong Li, Quanquan Gu

A Design Space Study for LISTA and Beyond

Tianjian Meng, Xiaohan Chen, Yifan Jiang, Zhangyang Wang

i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning

Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, Honglak Lee

Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments

Anirudh Goyal, Alex Lamb, Phanideep Gampa, Philippe Beaudoin, Charles Blundell, Sergey Levine,
Yoshua Bengio, Michael Curtis Mozer

Calibration of Neural Networks using Splines

Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu,
Richard Hartley

Extreme Memorization via Scale of Initialization

Harsh Mehta, Ashok Cutkosky, Behnam Neyshabur

Molecule Optimization by Explainable Evolution

Binghong Chen, Tianzhe Wang, Chengtao Li, Hanjun Dai, Le Song

Combining Ensembles and Data Augmentation Can Harm Your Calibration

Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W Dusenberry, Jasper Snoek, Balaji
Lakshminarayanan, Dustin Tran

Workshops

Science and Engineering of Deep Learning

Speakers and Panelists include: Alex Hanna

Moderator and Advisors include: Emily Denton

Organizers include: Negar Rostamzadeh, Samy Bengio*

Synthetic Data Generation: Quality, Privacy, Bias

Speakers include: Jinsung Yoon, Emily Denton

Program Committee includes: Syed Ashrafulla

Enormous Language Models: Perspectives and Benchmarks

Speakers and Panelists include: Noam Shazeer, Natalie Schluter

Organizers include: Colin Raffel, Adam Roberts, Jascha Sohl-Dickstein, Katherine Lee, William
Fedus, Aitor Lewkowycz

The Role of Mathematical Reasoning in General Artificial Intelligence

Speakers and Panelists include: Markus Rabe, Christian Szegedy

Weakly Supervised Learning

Invited Speakers include: Lu Jiang

Learning to Learn

Organizers include: Yevgen Chebotar

Embodied Multimodal Learning (EML)

Invited Speakers include: Sergey Levine

Distributed and Private Machine Learning

Program Committee includes: Peter Kairouz, Ananda Theertha Suresh

S2D-OLAD: From Shallow to Deep, Overcoming Limited and Adverse Data

Invited Speakers include: Alex Hanna, Hugo Larochelle

Organizers include: Vincent Dumoulin

Responsible AI (RAI)

Speakers include: Been Kim

Energy-Based Models: Current Perspectives, Challenges, and Opportunities

Organizers include: Adji Bousso Dieng, Igor Mordatch

A Roadmap to Never-Ending RL

Invited Session Panelists include: Aleksandra Faust

Program Committee includes: Coline Devin, Karol Hausman, Ben Eysenbach, Ofir Nachum, Ryan
Julian, Tianhe Yu, Dumitru Erhan, Marc Pickett, Shixiang Gu

2nd Workshop on Practical ML for Developing Countries: Learning Under Limited/low Resource
Scenarios

Program Committee includes: Pablo Samuel Castro

Beyond Static Papers: Rethinking How We Share Scientific Understanding in ML

Speakers include: David Ha, Hugo Larochelle

Organizers include: Sara Hooker



* Indicates work done while at Google


