
Course Title

Prepared By
J. Kamal Vijetha, Asha Jyothi
Multi-modal Networks
&
Image captioning

Unit - 5
Objectives
• Overview of multi-modal deep learning
• Vision and language tasks
• Detailed overview of the Image Captioning task and the MS-COCO dataset
• Architecture of a residual network, specifically ResNet
• Extracting features from images using pre-trained ResNet50
• Building a full Transformer model from scratch
• Ideas for improving the performance of image captioning
5.1 Multi-modal deep learning
• Modality – a particular mode in which something exists, is experienced, or is expressed
• Sensory modalities, like touch, taste, smell, vision, and sound, allow humans to
experience the world around them
• Example: Suppose you are out at the farm picking strawberries, and your friend
tells you to pick ripe and red strawberries. The instruction, ripe and red
strawberries, is processed and converted into a visual and haptic criterion. As you
see strawberries and feel them, you know instinctively if they match the criteria of
ripe and red. This task is an example of multiple modalities working together.
Multi-modal deep learning
Emerging areas:
• Audio-Visual Speech Recognition (AVSR)
• Speech Recognition using lip reading
• Combining imaging (MRI), genetic (SNP), and clinical (EHR) data

https://www.nature.com/articles/s41598-020-74399-w#Sec3
Multi-modal deep learning
• Combines multiple models, such as:
• Shallow Neural Networks
• Deep Neural Networks
• Convolutional Neural Networks
• Belief Networks
• Recurrent Neural Networks
• Autoencoders
• Generative Adversarial Networks
Vision and language tasks
• Computer Vision (CV) and Natural Language Processing (NLP)
• Image Captioning – generating a caption for a given image
• Visual Question Answering (VQA) – answering questions about objects in the image
• Visual Commonsense Reasoning (VCR) – inferring emotions and actions and
framing a hypothesis of what is happening
• Visual Grounding – matching words to objects in a picture
Visual Question Answering
• Where is the child sitting?

VQA: Visual Question Answering (visualqa.org)


Visual Commonsense Reasoning

VCR: Visual Commonsense Reasoning


Visual Grounding
5.2 Image captioning
• All about describing the contents of an image in a sentence
Representing an image as a sequence

• A 28 x 28 pixel image can be flattened into a sequence of 784 tokens
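A minimal sketch of this flattening, assuming a single-channel 28 x 28 image stored as a NumPy array (the array here is random placeholder data):

import numpy as np

# Placeholder for a single-channel 28 x 28 grayscale image
image = np.random.rand(28, 28)

# Flatten the 2D grid of pixels into a 1D sequence of 784 "tokens"
sequence = image.reshape(-1)
print(sequence.shape)  # (784,)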
Image captioning using transformers
• Transformer models are currently state of the art in NLP
• The Encoder part forms the basis of Bidirectional Encoder Representations from Transformers (BERT)
• The Decoder part is the core of the Generative Pre-trained Transformer (GPT)
• A specific advantage over BiLSTMs, which try to learn relationships via co-occurrence, is the
Transformer's attention mechanism, which models relationships between all positions directly
(see the sketch below)
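A minimal sketch of the scaled dot-product attention at the heart of the Transformer, written with TensorFlow; the function name and shapes are illustrative assumptions, not the exact course code:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, depth); mask broadcastable to (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)      # similarity of queries and keys
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(dk)              # scale by sqrt(depth)
    if mask is not None:
        logits += (mask * -1e9)                        # block masked positions
    weights = tf.nn.softmax(logits, axis=-1)           # attention distribution
    return tf.matmul(weights, v), weights              # weighted sum of values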
Steps in building model
• Downloading the data: the MS-COCO images and caption annotations are downloaded
• Pre-processing captions: since the captions are in JSON format, they are
flattened into a CSV for easier processing (see the sketch after this list)
• Feature extraction: We pass the image files through ResNet50 to extract features
and save them to speed up training
• Transformer training: A full Transformer model with positional encoding, multi-
head attention, an Encoder, and a Decoder is trained on the processed data
• Inference: Use the trained model to caption some images!
• Evaluating performance: Bilingual Evaluation Understudy (BLEU) scores are used
to compare the trained models with ground truth data
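A minimal sketch of the caption-flattening step, assuming the standard COCO captions JSON layout; the file paths are placeholders:

import json
import csv

ANNOTATIONS = 'annotations/captions_train2014.json'  # placeholder path
OUTPUT_CSV = 'captions_train.csv'                     # placeholder path

with open(ANNOTATIONS) as f:
    data = json.load(f)

# Map each image id to its file name
id_to_file = {img['id']: img['file_name'] for img in data['images']}

# Write one (image_file, caption) row per annotation
with open(OUTPUT_CSV, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['image_file', 'caption'])
    for ann in data['annotations']:
        writer.writerow([id_to_file[ann['image_id']], ann['caption']])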
5.3 COCO dataset
• Common Objects in Context (COCO); published by Microsoft in 2014
• A large dataset used for object detection, segmentation, and captioning, among other
annotations

http://cocodataset.org/
COCO Data Set
• 83k images – training set (13GB)
• 43k images – validation set (6GB)
• 5 captions per image (214MB)

Flickr Data Set


• 8k images
• captions.txt (3.32 MB)
5.4 Image processing with CNN & ResNet50
• CNN – successful model for processing images
• CNN key properties relevant to image recognition:
– Data locality: The pixels in an image are highly correlated to the pixels around them.
– Translation invariance: An object of interest, for example, a bird, may appear at
different places in an image. The model should be able to identify the object,
irrespective of the object's position in the image.
– Scale invariance: An object of interest may have a smaller or larger size, depending on
the zoom. Ideally, the model should be able to identify objects of interest in an image,
irrespective of their size
• CNN layers
– Convolution
– Pooling
CNN – Convolution layer
• Convolution - mathematical operation performed on patches taken from an image
with a filter
• Filter – a matrix, usually square and with 3x3, 5x5, and 7x7 as common dimensions
• Example: 3x3 convolution matrix applied to a 5x5 image
CNN – Convolution Operation

Edge-detection filter
CNN – Convolution layer
• Image patches are taken from left to right and then top to bottom
• The number of pixels the patch shifts by at each step is the stride length
• With a 3x3 filter, a stride length of 1 in the horizontal and vertical directions (and no
padding) reduces a 5x5 image to a 3x3 output; in general, an NxN input with an FxF filter
and stride S gives an output of size (N − F)/S + 1 per side
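A minimal sketch verifying this output size with TensorFlow, assuming a 5x5 single-channel input, a 3x3 filter, stride 1, and no padding:

import tensorflow as tf

# 5x5 single-channel image with a batch dimension: shape (1, 5, 5, 1)
image = tf.random.uniform((1, 5, 5, 1))

# 3x3 filter, 1 input channel, 1 output channel: shape (3, 3, 1, 1)
kernel = tf.random.uniform((3, 3, 1, 1))

# 'VALID' padding means no padding, so the output is (5 - 3)/1 + 1 = 3 per side
output = tf.nn.conv2d(image, kernel, strides=1, padding='VALID')
print(output.shape)  # (1, 3, 3, 1)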
5.5 Image Feature Extraction with ResNet50
• ResNet50 – pre-trained model for extracting features from images
• ResNet50 models are trained on the ImageNet dataset.
• This dataset contains millions of images in over 20,000 categories.
• The large-scale visual recognition challenge, ILSVRC, focuses on the top 1,000
categories for models to compete on recognizing images.
• Consequently, the top layers of the ResNet50 that perform classification have a
dimension of 1,000.
• The idea behind using a pre-trained ResNet50 model is that it is already able to
parse out objects that may be useful in image captioning.
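A minimal sketch of feature extraction with a pre-trained ResNet50 in tf.keras, dropping the 1,000-way classification top so the convolutional features can be reused; the image path is a placeholder assumption:

import tensorflow as tf

# Pre-trained on ImageNet; include_top=False removes the 1,000-way classifier
extractor = tf.keras.applications.ResNet50(include_top=False, weights='imagenet')

def extract_features(image_path):
    # Load, decode, and resize to the 224x224 input size ResNet50 expects
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224))
    img = tf.keras.applications.resnet50.preprocess_input(img)
    # Add a batch dimension and run the network
    return extractor(tf.expand_dims(img, 0))  # shape (1, 7, 7, 2048)

# features = extract_features('train2014/some_image.jpg')  # placeholder path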
5.6 The Transformer model
• Inspired by the seq2seq model; has an Encoder and a Decoder part
• Since the Transformer model does not rely on RNNs, input sequences need to be annotated with
positional encodings, which allow the model to learn about the relationships between inputs
(a positional-encoding sketch follows the figure below)
• Removing recurrence vastly improves the speed of the model while reducing the memory
footprint. This innovation of the Transformer model has made very large models such as BERT
and GPT-3 possible
• The Encoder part of the Transformer is used to create a visual Encoder, which takes image data
as input instead of text sequences. Some other small modifications are needed to accommodate
images as input to the Encoder
• The Transformer model we are going to build is shown in the following diagram
• The main difference here is how the input sequence is encoded. In the case of text, we tokenize
the text using a Subword Encoder and pass it through an Embedding layer, which is trainable
Figure : Transformer model with a visual Encoder
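A minimal sketch of the sinusoidal positional encoding mentioned above, following the standard formulation from the original Transformer paper; variable names are illustrative:

import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    # Returns sinusoidal positional encodings of shape (1, max_len, d_model)
    positions = np.arange(max_len)[:, np.newaxis]     # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates                  # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])         # sine on even indices
    angles[:, 1::2] = np.cos(angles[:, 1::2])         # cosine on odd indices
    return tf.cast(angles[np.newaxis, ...], tf.float32)

# These encodings are added to the token (or image-patch) embeddings before the Encoder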
Summary
• In the world of deep learning, specific architectures have been developed to handle specific modalities.
• Convolutional Neural Networks (CNNs) have been incredibly effective at processing images and are the standard
architecture for CV tasks. However, research is moving toward multi-modal networks, which can take multiple types of
inputs, like sounds, images, text, and so on, and perform cognition like humans. After reviewing multi-modal networks,
we dived into vision and language tasks as a specific focus. There are a number of problems in this area, including image
captioning, visual question answering, VCR, and text-to-image, among others.
• Building on our learnings from previous chapters on seq2seq architectures, custom TensorFlow layers and models,
custom learning schedules, and custom training loops, we implemented a Transformer model from scratch. Transformers
are state of the art at the time of writing. We took a quick look at the basic concepts of CNNs to help with the image side
of things. We were able to build a model that may not generate a thousand words for a picture but is definitely able to
generate a human-readable caption. Its performance still needs improvement, and we discussed a number of
possibilities for doing so, including the latest techniques.
• It is apparent that deep models perform very well when they are trained on a lot of data. The BERT and GPT models have
shown the value of pre-training on massive amounts of data. It is still very hard to get good-quality labeled data for use in
pre-training or fine-tuning. In the world of NLP, we have a lot of text data, but not enough labeled data. The next chapter
focuses on weak supervision to build classification models that can label data for pre-training or even fine-tuning tasks.
