
Elahe Naserian

May 5, 2021 · 12 min read

ViLBERT, a model for learning joint representations of image and text

Many real-world applications don’t involve only one data modality. Web pages, for example, contain text, images,
videos, etc. Restricting oneself to using only one modality would involve losing all the information contained in the
others. Multi-modal machine learning aims to build models that can process and relate information from multiple
modalities. In this tutorial, we focus on two main modalities: written text as the linguistic signal and images as the visual signal.

Joint image-text representation is the bedrock of many Vision-and-Language tasks, where multimodal inputs are
simultaneously processed for joint visual and textual understanding. This enables a wide range of applications, such
as visual question answering, visual commonsense reasoning, referring expressions, and caption-based image
retrieval.

Figure 1. An example of different Vision-and-Language tasks [1].

In recent years, different approaches have been proposed to learn joint representations of image and language; however, they are mainly task-specific models rather than a single unified model. This means the model that
understands questions cannot ground noun phrases, the grounding model cannot retrieve images based on a
description, and so forth. While individual tasks present different challenges and diverse interfaces, the underlying
associations between language and visual concepts are often common across tasks. For example, learning to ground
the referring expression “small red vase” requires understanding the same concepts as answering the question “What
color is the small vase?”. Training multiple tasks jointly can potentially pool these different sources of grounding
supervision [1].

In this tutorial, I explain VILBERT (short for Vision-and-Language BERT) [1], a joint model for learning task-independent visual grounding from paired visiolinguistic data. I start with a brief theoretical explanation, then go step by step through setting up the environment, preparing the data, and fine-tuning the pre-trained VILBERT model, and finally I discuss how we can use this model to extract visiolinguistic representations for images (or text). This tutorial is based on the code implemented by the authors of 'ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks' [1], [2]: https://github.com/facebookresearch/vilbert-multi-task.

VILBERT

This model extends the recently developed BERT [3] language model to jointly reason about text and images. The
key technical innovation, as shown in Figure 2, is the introduction of separate streams for vision and language processing
that communicate through co-attentional transformer layers. This structure can accommodate the different
processing needs of each modality and provides interaction between modalities at varying representation depths.

Figure 2. VILBERT architecture [1]

Given an image represented as a set of region features v1, …, vT and a text input represented as a set of tokens w0,
…, wT , VILBERT outputs final representations hv0, …, hvT and hw0, …, hwT, where hv0 and hw0 are the holistic
representations of the image and text.

Figure 3. ViLBERT model consists of two parallel streams for visual (green) and linguistic
(purple) processing that interact through novel co-attentional transformer layers (Co-TRM) [1]

Pre-trained models: Initially, ViLBERT was pre-trained on the Conceptual Captions dataset [4]; later on, it was further (jointly) fine-tuned on 6 different tasks with different datasets (the goal was to build a task-independent vision-and-language model).

This tutorial comprises four main parts:

setting up the environment

preparing the datasets and extracting the features

fine-tuning and evaluating the model

extracting the visiolinguistic embeddings from the image-text data


You can find the complete code in our GitHub repository https://github.com/ExID-proj/VILBERT_tutorial.

Setting up the environment


Create a fresh conda environment,

conda create -n vilbert-mt python=3.6

conda activate vilbert-mt

and install all the required dependencies:

pytorch-transformers==1.0.0

numpy==1.16.4

lmdb==0.94

tensorboardX==1.2

tensorflow==2.4.0

tensorpack==0.9.4

tqdm==4.31.1

easydict==1.9

PyYAML==5.1.2

jsonlines==1.2.0

json-lines==0.5.0

matplotlib

Cython

python-prctl

msgpack

msgpack-numpy

opencv-python==4.2.0.34

We further need to install vqa-maskrcnn-benchmark in order to extract the features from the images, by following these steps (from https://gitlab.com/vedanuj/vqa-maskrcnn-benchmark/-/blob/master/INSTALL.md):

conda install ipython

# maskrcnn_benchmark and coco api dependencies

pip install ninja yacs cython matplotlib

# follow PyTorch installation in https://pytorch.org/get-started/locally/

# we give the instructions for CUDA 9.0

conda install pytorch-nightly -c pytorch

# install torchvision

cd ~/github

git clone https://github.com/pytorch/vision.git

cd vision

python setup.py install

# install pycocotools

cd ~/github

git clone https://github.com/cocodataset/cocoapi.git

cd cocoapi/PythonAPI

python setup.py build_ext install

# install PyTorch Detection

cd ~/github

git clone https://github.com/facebookresearch/maskrcnn-benchmark.git

cd maskrcnn-benchmark

# the following will install the lib with

# symbolic links, so that you can modify

# the files if you want and won't need to

# re-build it

python setup.py build develop

Install apex by following the instructions here: https://github.com/NVIDIA/apex

and finally, create a folder named tool in your project directory, download this repository https://github.com/lichengunc/refer/tree/d23ec5d8e84baf858c58af119799e26846aa5261 into it, and run 'make' in the terminal to set it up.

If you have followed the above steps without problems, you have successfully created the perfect environment for the VILBERT project. I personally struggled a bit to set up my environment; it took me almost two days to resolve all the conflicts, but hopefully it won't take that long for you.
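
Before moving on, a quick optional sanity check can confirm that the key packages import correctly (just an illustration; it assumes PyTorch was installed via the maskrcnn-benchmark instructions above):

import torch
import pytorch_transformers  # installed as pytorch-transformers==1.0.0
import lmdb
import cv2

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("OpenCV:", cv2.__version__)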

Dataset Preparation
Downloading the prepared datasets

If you want to train and test the model on the existing VILBERT project datasets, there is no need to extract the
features from the raw images. You can download the datasets used by the paper from:

# create a folder named data in your project directory

cd data

# create a folder named datasets1 inside the data folder

cd datasets1

wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets.tar.gz

tar xf datasets.tar.gz

The extracted folder would contain almost all the datasets used to pre-train the multi-task model. However, for some
datasets (COCO, GQA, and NLVR2) you need to do some extra steps, which are explained here:
https://github.com/facebookresearch/vilbert-multi-task/tree/master/data.

In this tutorial, the task we are interested in is Caption-Based Image Retrieval and the dataset we work on is
flickr30k. After downloading the datasets, inside the flickr30k folder, you would find several folders and files:

1) *.lmdb folder: it contains the image features extracted using a pre-trained object detection network. There are two lmdb folders inside the flickr30k folder, for the two types of features that are extracted from the images. We explain this further in the next section, where we extract the features from raw images.

2) *.jsonlines file, which contains the captions along with the image ids. Each line in this file is a dictionary with three items: sentences (a list of captions), id, and img_path (see the example line after this list).
3) *.pkl file, which contains the hard negative samples.
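
For reference, a single line of the jsonlines file looks roughly like the following (the captions and file name here are made up; the field names match the conversion script shown later in this tutorial):

{"sentences": ["A man in a red shirt climbs a rock face.",
               "Someone is rock climbing outdoors."],
 "id": 42,
 "img_path": "123456789.jpg"}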

So, what are the hard negatives? For training (or fine-tuning) the model, we need negative examples in addition to the positive image-caption pairs. In the ViLBERT paper, the authors train the model in a 4-way multiple-choice setting by randomly sampling three distractors for each image-caption pair: substituting a random caption, a random image, or a hard negative from among the 100 nearest neighbors of the target image. The hard negatives are selected offline and are fixed [2].
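
To make the sampling concrete, here is a rough sketch of how one 4-way multiple-choice example could be assembled (all names are hypothetical; in the actual repo the RetrievalDataset class handles this internally, including the precomputed hard negatives):

import random

def build_choices(image_id, caption, all_captions, all_images, hard_negatives):
    """Return four image-caption pairs: the true pair plus three distractors."""
    return [
        (image_id, caption),                                 # positive pair
        (image_id, random.choice(all_captions)),             # random caption distractor
        (random.choice(all_images), caption),                # random image distractor
        (random.choice(hard_negatives[image_id]), caption),  # hard-negative image
    ]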

Extracting the features from raw images

To know how the image features are generated, here is the explanation from the VILBERT paper [1]:
Image region features are generated by extracting bounding boxes and their visual features from a pre-trained object detection network (We use Faster R-CNN to extract region features). Unlike
words in text, image regions lack a natural ordering. Therefore, we encode spatial location instead, constructing a 5-d
vector from region position and the fraction of image area covered. This is then projected to match the dimension of the
visual feature and they are summed. We mark the beginning of an image region sequence with a special IMG token
representing the entire image.
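
To make the 5-d spatial vector concrete, here is a small helper of my own (not taken from the repo, which builds this inside its feature readers) showing how it can be computed from a region's bounding box:

import numpy as np

def spatial_5d(box, image_w, image_h):
    """Normalised box coordinates plus the fraction of image area covered."""
    x1, y1, x2, y2 = box
    area_fraction = ((x2 - x1) * (y2 - y1)) / (image_w * image_h)
    return np.array([x1 / image_w, y1 / image_h,
                     x2 / image_w, y2 / image_h,
                     area_fraction])

# e.g. a 100x80 box in the top-left corner of a 640x480 image
print(spatial_5d((0, 0, 100, 80), 640, 480))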

First, we download the original flickr30k dataset (not the extracted features as above) into
data/rawdatasets/flickr30k directory (you can download the dataset from
https://www.kaggle.com/hsankesara/flickr-image-dataset). The flickr30k dataset includes a folder of images and a
file which contains all the captions for the images.

Then, we download the trained object detection model and its config file:

cd data

wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_model.pth

wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_config.yaml

Now that we’ve got the images and the captions and the object detection model, we can extract the image features by
calling extract_features.py:

python script/extract_features.py --model_file data/detectron_model.pth --config_file data/detectron_config.yaml --image_dir <path_to_directory_with_images> --output_folder <path_to_output_extracted_features>

<path_to_directory_with_images> points to your image directory, which for us is data/rawdatasets/flickr30k/images, and <path_to_output_extracted_features> is where you want to save the image features, which we set to data/rawdatasets/flickr30k/image_features.

Then we have to convert the extracted features into an LMDB file:

python script/convert_to_lmdb.py --features_dir data/rawdatasets/flickr30k/image_features --lmdb_file data/rawdatasets/flickr30k/flickr30k.lmdb

Now you should have a directory named flickr30k.lmdb which contains two files — data.mdb and lock.mdb
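
As a quick optional check that the conversion worked, you can open the resulting LMDB and count the stored entries (a minimal sketch; the exact key layout inside the LMDB is handled by the feature readers in vilbert/datasets):

import lmdb

env = lmdb.open("data/rawdatasets/flickr30k/flickr30k.lmdb", readonly=True, lock=False)
with env.begin() as txn:
    print("stored entries:", txn.stat()["entries"])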

At this point, we have extracted the image features. The next step is to prepare the jsonline file with captions and image paths. We already have the CSV file of the captions; the only thing we need to do is convert it to the proper jsonlines format by running the following code:
import pandas as pd
import jsonlines

# Read the captions CSV and group all captions belonging to the same image.
df = pd.read_csv('captions.csv')
df = df[['image_name', 'caption']].groupby('image_name')['caption'].apply(list).reset_index(name='sentences')
df.drop_duplicates(subset=['image_name'], inplace=True)

flist = []
count = 0
for i, row in df.iterrows():
    count += 1
    temp = {}
    temp['sentences'] = row['sentences']
    temp['id'] = int(count)
    temp['img_path'] = str(row['image_name'])
    flist.append(temp)

with jsonlines.open('flickr30k.jsonline', 'w') as writer:
    writer.write_all(flist)

The flickr30k folder would now look like this: captions.csv, images (the original flickr30k images), image_features (the extracted image features), flickr30k.lmdb, and flickr30k.jsonline. We further split the jsonline file into train and test sets, for example as in the sketch below.
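
The split itself only takes a few lines; here is a minimal sketch (the 80/20 ratio and the fixed random seed are my own arbitrary choices):

import random
import jsonlines

with jsonlines.open('flickr30k.jsonline') as reader:
    entries = list(reader)

random.seed(0)
random.shuffle(entries)
cut = int(0.8 * len(entries))

with jsonlines.open('flickr30k_train.jsonlines', 'w') as writer:
    writer.write_all(entries[:cut])
with jsonlines.open('flickr30k_test.jsonlines', 'w') as writer:
    writer.write_all(entries[cut:])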

The files that we would use for training and testing the models are flickr30k.lmdb, flickr30k_train.jsonlines, and
flickr30k_test.jsonlines.

How about the hard negatives (the .pkl file)? As explained in the paper [2], hard-negative pairs can be mined offline; however, in this tutorial we skip that and use a random image-text pair instead.

Fine-tuning and Evaluating the model


Before starting with the code, let's download the pre-trained VILBERT model. There are two models we can choose between: 1) a model pre-trained on the Conceptual Captions dataset (https://dl.fbaipublicfiles.com/vilbert-multi-task/pretrained_model.bin), and 2) a model which is further trained on multiple tasks (https://dl.fbaipublicfiles.com/vilbert-multi-task/multi_task_model.bin). In this tutorial, we choose the first model, but you can try both and see which one gives you the best result. Download the models into a folder named 'save/multitask_model/'.
If you take a look at the VILBERT GitHub repository, you will see that there are lots of files and folders. Here, I take you through the most important classes of this project that we need for fine-tuning a Caption-Based Image Retrieval task.

- train_tasks.py

- vilbert.py

- task_utils.py

- retrieval_datasets.py

In addition to these files, there are a few config files that need to be set:

- vilbert_tasks.yml

- vilbert/datasets/__init__.py

There is also a config folder which contains different configurations for the VILBERT model and for the BERT model, which is responsible for extracting the text representation. We choose the default configurations for the VILBERT and BERT models (bert_base_6layer_6conect.json and bert-base-uncased_weight_name.json, respectively), but you can try the other configurations as well. We also get to choose which model to fine-tune: the baseline model, which is a single-stream model, or VILBERT, which is a two-stream model (we choose between them later via the --baseline argument).

Let's start with the code. We fine-tune the model by calling train_tasks.py. Some of the main arguments that need to be set are listed below:

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--bert_model",
        default="bert-base-uncased",
        type=str,
        help="Bert pre-trained model selected in the list: bert-base-uncased, "
        "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.",
    )
    parser.add_argument(
        "--from_pretrained",
        default='save/multitask_model/multi_task_model.bin',  # "bert-base-uncased",
        type=str,
        help="VILBert pre-trained model selected in the list: multi_task_model.bin, pretrained_model.bin.",
    )

    parser.add_argument(
        "--config_file",
        default="config/bert_base_6layer_6conect.json",
        type=str,
        help="The config file which specified the model details.",
    )

    parser.add_argument(
        "--baseline", action="store_true", help="whether use single stream baseline."
    )

We set the path to the image features and jsonline files in vilbert_tasks.yml. Some other arguments can be set in this
file, such as the maximum sequence length of the caption and the maximum number of region features per image.
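
If you want to double-check what you have configured, the YAML file can be read with yaml and easydict (roughly mirroring how train_tasks.py loads it). The sketch below assumes the flickr30k retrieval task sits under the TASK8 entry, consistent with the --tasks 8 flag used later, and uses the key names that appear in the loader code further down:

import yaml
from easydict import EasyDict as edict

with open("vilbert_tasks.yml") as f:
    task_cfg = edict(yaml.safe_load(f))

print(task_cfg["TASK8"]["name"])              # e.g. RetrievalFlickr30k
print(task_cfg["TASK8"]["features_h5path1"])  # path to the .lmdb image features
print(task_cfg["TASK8"]["max_seq_length"],    # caption length limit
      task_cfg["TASK8"]["max_region_num"])    # region features per image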

In vilbert/datasets/__init__.py we set the classes which are responsible for loading the datasets for each task. Here, we have only one task (RetrievalFlickr30k), and accordingly we specify RetrievalDataset and RetrievalDatasetVal as the classes for loading the training and validation datasets.
What train_tasks.py does first is load the train and validation datasets through the LoadDatasets function in task_utils.py:

task_batch_size, task_num_iters, task_ids, task_datasets_train, \
    task_datasets_val, task_dataloader_train, task_dataloader_val = LoadDatasets(
        args,
        task_cfg,
        args.tasks)

The LoadDatasets function calls the RetrievalDataset class from retrieval_datasets.py to prepare the train and eval datasets and dataloaders:

task_datasets_train[task] = None
if "train" in split:
    task_datasets_train[task] = DatasetMapTrain[task_name](
        task=task_cfg[task]["name"],
        dataroot=task_cfg[task]["dataroot"],
        annotations_jsonpath=task_cfg[task]["train_annotations_jsonpath"],
        split=task_cfg[task]["train_split"],
        image_features_reader=task_feature_reader1[
            task_cfg[task]["features_h5path1"]
        ],
        gt_image_features_reader=task_feature_reader2[
            task_cfg[task]["features_h5path2"]
        ],
        tokenizer=tokenizer,
        bert_model=args.bert_model,
        padding_index=0,
        max_seq_length=task_cfg[task]["max_seq_length"],
        max_region_num=task_cfg[task]["max_region_num"],
    )

task_datasets_val[task] = None
if "val" in split:
    task_datasets_val[task] = DatasetMapTrain[task_name](
        task=task_cfg[task]["name"],
        dataroot=task_cfg[task]["dataroot"],
        annotations_jsonpath=task_cfg[task]["val_annotations_jsonpath"],
        split=task_cfg[task]["val_split"],
        image_features_reader=task_feature_reader1[
            task_cfg[task]["features_h5path1"]
        ],
        gt_image_features_reader=task_feature_reader2[
            task_cfg[task]["features_h5path2"]
        ],
        tokenizer=tokenizer,
        bert_model=args.bert_model,
        padding_index=0,
        max_seq_length=task_cfg[task]["max_seq_length"],
        max_region_num=task_cfg[task]["max_region_num"],
    )

In the RetrievalDataset class, the image features and captions are prepared by loading the hard negative samples (if they exist) and tokenizing the caption sentences. The processed entries are then saved in a cache directory.
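
To make the tokenization step concrete, here is roughly what happens to a single caption before it enters the linguistic stream (a simplified sketch; RetrievalDataset additionally truncates/pads to max_seq_length and builds the input and segment masks):

from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

caption = "A man in a red shirt climbs a rock face."
tokens = ["[CLS]"] + tokenizer.tokenize(caption) + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)
print(input_ids)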

After loading the datasets in train_tasks.py, it calls an intermediate function, ForwardModelsTrain. This function receives the features and calls the VILBERT model in vilbert.py:

features, spatials, image_mask, question, target, input_mask, segment_ids, co_attention_mask, question_id = (
    batch
)

batch_size = features.size(0)

max_num_bbox = features.size(1)
num_options = question.size(1)
features = features.view(-1, features.size(2), features.size(3))
spatials = spatials.view(-1, spatials.size(2), spatials.size(3))
image_mask = image_mask.view(-1, image_mask.size(2))
question = question.view(-1, question.size(2))
input_mask = input_mask.view(-1, input_mask.size(2))
segment_ids = segment_ids.view(-1, segment_ids.size(2))
co_attention_mask = co_attention_mask.view(
    -1, co_attention_mask.size(2), co_attention_mask.size(3)
)

task_tokens = question.new().resize_(question.size(0), 1).fill_(int(task_id[4:]))
vil_logit = model(
    question,
    features,
    spatials,
    segment_ids,
    input_mask,
    image_mask,
    co_attention_mask,
    task_tokens,
)

vil_logit = vil_logit.view(batch_size, num_options)
loss = task_losses[task_id](vil_logit, target)
_, preds = torch.max(vil_logit, 1)
batch_score = float((preds == target).sum()) / float(batch_size)

return loss, batch_score

There is an important argument here, 'output_all_encoded_layers'. You can set it to True if you wish to get the output of all the hidden layers; by default, only the output of the last layer is returned.

The main part of the forward function is the call to the bert model (BertModel), which returns five outputs: sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, and all_attention_mask. The first two are the outputs of the hidden layer(s) (the last layer, or all layers if you set output_all_encoded_layers=True) for the text tokens and image features. pooled_output_t and pooled_output_v are the outputs of an extra dense linear layer on top of the last layer; they are fused and then used to calculate the logits and the loss.

def forward(
    self,
    input_txt,
    input_imgs,
    image_loc,
    token_type_ids=None,
    attention_mask=None,
    image_attention_mask=None,
    co_attention_mask=None,
    task_ids=None,
    output_all_encoded_layers=False,
    output_all_attention_masks=False,
):

    sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, all_attention_mask = self.bert(
        input_txt,
        input_imgs,
        image_loc,
        token_type_ids,
        attention_mask,
        image_attention_mask,
        co_attention_mask,
        task_ids,
        output_all_encoded_layers=output_all_encoded_layers,
        output_all_attention_masks=output_all_attention_masks,
    )

    if self.fusion_method == "sum":
        pooled_output = self.dropout(pooled_output_t + pooled_output_v)
    elif self.fusion_method == "mul":
        pooled_output = self.dropout(pooled_output_t * pooled_output_v)
    else:
        assert False

    vil_logit = self.vil_logit(pooled_output)
    return vil_logit

We would explain how to use the other two outputs to get the visiolinguistic embedding of the image-text data in the
last part of this tutorial.

Now that we have set up the arguments and explained how the code works, we can run train_tasks.py to fine-tune the VILBERT model on our dataset:

python train_tasks.py --bert_model bert-base-uncased --from_pretrained save/multitask_model/multi_task_model.bin --config_file config/bert_base_6layer_6conect.json --tasks 8 --lr_scheduler 'warmup_linear' --train_iter_gap 4 --task_specific_tokens --save_name flickr30k_finetuned

You can also set 'frequency_iter' to evaluate the model after a specific number of iterations, rather than only at the end of each epoch after training on all the samples.

The fine-tuned model will be saved in the save directory that you specified, and we will evaluate it in the next section.

Evaluating the model

Similar to the fine-tuning process, some arguments should be set before calling eval_retrieval.py:

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--bert_model",
        default="bert-base-uncased",
        type=str,
        help="Bert pre-trained model selected in the list: bert-base-uncased, "
        "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.",
    )
    parser.add_argument(
        "--from_pretrained",
        default='save/flickr30k_finetuned/pytorch_model_0.bin',
        type=str,
        help="path to the fine-tuned model",
    )

    parser.add_argument(
        "--config_file",
        default="config/bert_base_6layer_6conect.json",
        type=str,
        help="The config file which specified the model details.",
    )

As you can see, we set the 'from_pretrained' argument to the path where we saved our fine-tuned model:

python eval_retrieval.py --bert_model bert-base-uncased --from_pretrained save/flickr30k_finetuned/pytorch_model_0.bin --config_file config/bert_base_6layer_6conect.json --tasks 8 --task_specific_tokens --save_name test_flickr30k

VILBERT Visiolinguistic Embedding


The last part of this tutorial is about using the VILBERT model to get the visiolinguistic embedding of the image-text
data. It is very similar to how we get the BERT embedding from text data.

First, we modify the VILBertForVLTasks class by adding a function to get the embeddings for the input data:

def getembedding(
    self,
    input_txt,
    input_imgs,
    image_loc,
    token_type_ids=None,
    attention_mask=None,
    image_attention_mask=None,
    co_attention_mask=None,
    task_ids=None,
    output_all_encoded_layers=True,
    output_all_attention_masks=False,
    layerno=-1,
):

    sequence_output_t, sequence_output_v, pooled_output_t, pooled_output_v, all_attention_mask = self.bert(
        input_txt,
        input_imgs,
        image_loc,
        token_type_ids,
        attention_mask,
        image_attention_mask,
        co_attention_mask,
        task_ids,
        output_all_encoded_layers=output_all_encoded_layers,
        output_all_attention_masks=output_all_attention_masks,
    )

    return sequence_output_t[layerno][:, 0], sequence_output_v[layerno][:, 0]

There are two important arguments: output_all_encoded_layers and layerno. We set the first to True to get the output of all the hidden layers of the VILBERT model (there could be 2, 4, 6, or 8 layers depending on the chosen config; here we selected the 6-layer base config). layerno specifies which layer you want to get the embeddings from; the default is the last layer (layerno=-1).

As we explained at the very start of this tutorial, VILBERT outputs final representations hv0, …, hvT and hw0, …, hwT, where hv0 and hw0 are the holistic image and text representations. Accordingly, we return them as the visiolinguistic embeddings for the image and the caption.
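
What you do with these two vectors depends on your application. One simple option, sketched below with random stand-in tensors, is to concatenate them into a single visiolinguistic embedding for the image-caption pair (in the default base config the text and visual hidden sizes differ, 768 and 1024, which is why I concatenate rather than compare them directly):

import torch

emb_t = torch.randn(1, 768)    # stand-in for the holistic text representation hw0
emb_v = torch.randn(1, 1024)   # stand-in for the holistic image representation hv0

pair_embedding = torch.cat([emb_t, emb_v], dim=-1)
print(pair_embedding.shape)    # torch.Size([1, 1792])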

So, to obtain the visiolinguistic embedding, we first set some arguments in identify_vilbert_emds.py:

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--bert_model",
        default="bert-base-uncased",
        type=str,
        help="Bert pre-trained model selected in the list: bert-base-uncased, "
        "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.",
    )
    parser.add_argument(
        "--from_pretrained",
        default="save/multitask_model/multi_task_model.bin",
        type=str,
        help="VILBert pre-trained model selected in the list: multi_task_model.bin, pretrained_model.bin",
    )

    parser.add_argument(
        "--config_file",
        default="config/bert_base_6layer_6conect.json",
        type=str,
        help="The config file which specified the model details.",
    )

As with fine-tuning and evaluating, we have to set the 'bert_model', 'from_pretrained', and 'config_file' arguments. You can choose between an existing pre-trained VILBERT model (multi_task_model.bin or pretrained_model.bin) or your own fine-tuned model to get the embeddings from. We have also created a file, vilbert_transfer_tasks.yml, to specify some parameters.

A new class, 'RetreivalDatasetTrans', is also created to load the dataset, and vilbert/datasets/__init__.py is modified to include it.

We then run identify_vilbert_emds.py:

python identify_vilbert_emds.py --bert_model bert-base-uncased --from_pretrained save/multitask_model/multi_task_model.bin --config_file config/bert_base_6layer_6conect.json --tasks 8

You can find the complete code in our Github repository https://github.com/ExID-proj/VILBERT_tutorial.

Conclusion
That's all from me, folks. I hope you enjoyed the post and got a clearer picture of VILBERT and of how you can extract the features, fine-tune and evaluate the model, and get the visiolinguistic embeddings. Feel free to post your feedback or questions in the comments section.

If you liked my article, please give it a clap :)

References:
[1] Lu, J., Batra, D., Parikh, D. and Lee, S., 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for
vision-and-language tasks. arXiv preprint arXiv:1908.02265.

[2] Lu, J., Goswami, V., Rohrbach, M., Parikh, D. and Lee, S., 2020. 12-in-1: Multi-task vision and language
representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp.
10437–10446).

[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.

[4] Sharma, P., Ding, N., Goodman, S. and Soricut, R., 2018, July. Conceptual captions: A cleaned, hypernymed,
image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers) (pp. 2556–2565).
