
DEEP LEARNING COURSE PROJECT REPORT

Image Caption Generator using Deep Learning Techniques

Group Number - 6

Submitted By –
Aditya Biswakarma (IIT2020033)
Harshit Kushwah (IIT2020039)
Sanjeet Beniwal (IIT2020052)
Rohan Tirkey (IIT2020088)

Instructor - Prof. Shiv Ram Dubey

1. Abstract

The practice of creating descriptions for the events in an image is known as image captioning. Image captioning is a crucial task with applications in virtual assistants, editing software, image indexing, and aids for people with disabilities. It links two key areas of artificial intelligence: natural language processing and computer vision.
The machine learning library PyTorch is used to construct our neural-network-based image caption generator in Python. Recurrent neural networks driven by long short-term memory (LSTM) units have enabled considerable advances in image captioning in recent years. Although LSTM units mitigate the vanishing gradient problem and have a convincing capacity to memorise dependencies, they are complex and fundamentally sequential across time. Convolutional networks have also proven useful for machine translation and conditional image generation, a topic that research has recently begun to address.

2. Introduction
We come across many images every day from many sources, including the internet, news stories, document diagrams, and advertisements. These sources contain images that viewers must interpret for themselves. Although the majority of images lack descriptions, most people can still understand them. But if we want machines to caption images automatically, they must first comprehend the content of a wide range of images. Teaching computers to automatically create captions for images is a recent hot topic in computer vision and machine learning. Understanding image scenes, extracting features, and translating visual representations into natural language are all components of this endeavour. Image captioning is crucial for a number of reasons. The development of assistive technologies for people who are blind and the automation of captioning activities on the internet are two areas where this work holds out considerable promise. Adding captions to every image on the internet can make image search and indexing more efficient and detailed.

This project's objective is to produce suitable captions for a given image; the captions are generated automatically to capture the context of the image. Current methods use convolutional neural networks (CNNs) and recurrent neural networks (RNNs), or their variants, to create appropriate captions.

To complete this task, these networks form an encoder-decoder pipeline in which a CNN encodes the image into a feature vector and an RNN serves as the decoder that produces the language description.
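To make this encoder-decoder flow concrete, the following is a minimal sketch in Python. Here cnn_encoder, rnn_decoder, and vocab are hypothetical placeholders invented purely to illustrate the data flow; they are not part of any specific library or of the model described later.

```python
# Illustrative encoder-decoder loop for caption generation.
# All components here are hypothetical placeholders.
def generate_caption(image, cnn_encoder, rnn_decoder, vocab, max_len=20):
    features = cnn_encoder(image)              # CNN -> fixed-length feature vector
    state = rnn_decoder.init_state(features)   # features seed the decoder state
    word, caption = vocab.start_token, []
    for _ in range(max_len):
        word, state = rnn_decoder.step(word, state)  # predict the next word
        if word == vocab.end_token:
            break
        caption.append(word)
    return " ".join(caption)
```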

Image captioning has a variety of uses across industries, including business, biomedicine, web search, and the military. Social media platforms such as Facebook and Instagram can generate automated captions from photos.

The key issue in this work is to represent the relationships between the objects in an image in natural language, such as English. Traditionally, computers have generated written descriptions for photographs using fixed templates. This strategy, however, falls short of the variation needed to produce lexically rich text descriptions. The growing effectiveness of neural networks has minimised this flaw. Several cutting-edge algorithms use neural networks to predict the next lexical unit of the output sentence when creating captions for input images.

The idea of combining existing remedies for the aforementioned issues to create image descriptions has been explored in earlier attempts. We instead demonstrate a single joint model that takes an image and infers the intended word sequence. The work was primarily driven by recent developments in machine translation, where the objective is to keep state-of-the-art performance while translating more simply with a recurrent neural network (RNN): the encoder RNN reads the source sentence and transforms it into a vector representation, which the decoder RNN uses as its initial hidden state to produce the target text. Both computer vision techniques and language models from natural language processing are used to translate the understanding of the image into the proper sequence of words. Deep learning methods have led to cutting-edge results for caption generation problems. The most striking feature of these methods is that a single end-to-end model can be trained to predict a caption given a photo, eliminating the need for cumbersome data preparation or a pipeline of specially constructed models.

3. Problem Definition & Objectives


Problem Definition – Image Caption Generator using Deep Learning techniques.

Objectives –
● To learn the concepts of CNN and LSTM models and build a working image caption generator by combining a CNN with an LSTM.
● To generate suitable captions for a given input image.
● To determine the accuracy of the deep learning model for image caption generation.

4. Literature Review
1. Jyoti Aneja, Aditya Deshpande, Alexander G. Schwing — "Convolutional Image Captioning" (2018)
Method/approach: Machine vision approach; classification model trained with a CNN (VGGNet) and Histogram of Oriented Gradients (HOG) features.
Dataset used: MSCOCO.
Achieved performance: K-fold cross-validation showed that the developed feature fusion technique (98.36%) provided higher accuracy than the individual feature extraction methods, such as HOG (87.34%) or CNN (93.64%).
Advantages: Formulates image captioning as a retrieval problem and finds the best fitting description.
Future scope: Potential improvement by training on a combination of Flickr8k, Flickr30k, and MSCOCO.

2. Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara — "Meshed-Memory Transformer for Image Captioning" (2020)
Method/approach: Meshed-memory Transformer.
Dataset used: MSCOCO.
Achieved performance: On average, the model is able to generate more accurate and descriptive captions, integrating fine-grained details and object relations.
Advantages: The model incorporates a region encoding approach that exploits a priori knowledge through memory vectors and a meshed connectivity between encoding and decoding modules.

5. Methodology

Numerous studies in the literature have applied Convolutional Neural Networks to the problem of image classification, most of them proposing different architectures for the networks. Deep convolutional neural networks are among the most powerful deep learning architectures and have been widely applied to a broad range of machine learning tasks.

Convolutional Neural Network (CNN)-


A Convolutional Neural Network (ConvNet/CNN) is a deep learning method that can take an input image, assign importance (learnable weights and biases) to various elements and objects in the image, and distinguish between them. Comparatively, a ConvNet requires substantially less pre-processing than other classification techniques: whereas in primitive techniques filters are hand-engineered, ConvNets have the capacity to learn these filters and properties.

The structure of a ConvNet is similar to the connectivity pattern of neurons in the human brain and was modelled after the organisation of the visual cortex. Individual neurons respond to stimuli only within the Receptive Field, a constrained area of the visual field; the entire visual field is covered by a series of such overlapping fields.

Like other neural networks, a CNN is composed of an input layer, an output layer, and many
hidden layers in between.

Img-1

These layers perform operations that alter the data with the intent of learning features specific to the data. Convolution, activation (ReLU), and pooling are three of the most commonly used layers.
● Convolution passes the input images through a set of convolutional filters, each of which activates certain features in the images.
● Rectified linear unit (ReLU) maps negative values to zero while keeping positive values, enabling quicker and more efficient training. Because only the activated features are carried over to the following layer, this is frequently referred to as activation.
● Pooling reduces the number of parameters the network needs to learn by performing nonlinear downsampling on the output.

Each layer learns to recognise different features as these operations are repeated over tens or hundreds of layers; a minimal code sketch of this layer pattern is given below.
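As an illustration of the convolution-ReLU-pooling pattern just described, here is a minimal sketch using Keras (one of the libraries listed in the software requirements below); the layer sizes and the 10-class output are illustrative assumptions, not values taken from this report.

```python
# Minimal illustrative CNN: convolution -> ReLU -> pooling, repeated.
# Layer sizes and the 10-class output head are assumed for demonstration.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # RGB input image
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU activation
    layers.MaxPooling2D((2, 2)),                   # nonlinear downsampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),        # example classification head
])
model.summary()
```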

Img - 2: Example of a network with many convolutional layers. Filters are applied to each training image at different resolutions, and the output of each convolved image is used as the input to the next layer.

Convolutional Neural Network (CNN) for Image Caption Generator

Long Short-Term Memory Networks (LSTM)

A variation of the RNN that addresses the long-term memory issue is the Long Short-Term Memory network, or LSTM.
Because LSTMs have a more complex cell structure than a typical recurrent neuron, they can better control what to learn or forget from various input sources.
The fundamental components of an LSTM are the cell state (cell memory), drawn as the horizontal line through the top of the standard diagram, and the internal mechanisms known as gates that control the flow of information.
The cell state resembles a conveyor belt: it proceeds directly down the entire chain, with only a few minor linear interactions.

In essence, at every step the cell state encodes the relevant information from the inputs that have been observed up to that point.

The LSTM cell is a specially designed logical unit that helps reduce the vanishing gradient problem enough to make recurrent neural networks useful for long-term memory tasks, such as text sequence prediction.

It accomplishes this by adding an internal memory state to the already-processed input, considerably reducing the multiplicative effect of small gradients. The time dependence and influence of prior inputs are controlled by an intriguing mechanism known as the forget gate, which decides which states are remembered or forgotten. LSTM cells contain two additional gates: the input gate and the output gate. A minimal sketch of a single LSTM step follows.
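To make the three gates concrete, here is a minimal NumPy sketch of one LSTM step under the standard formulation; the parameter names and shapes are illustrative assumptions, and practical implementations differ in layout and optimisation.

```python
# One LSTM step with forget (f), input (i), and output (o) gates.
# Shapes and names are illustrative; real implementations fuse and
# batch these operations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4n, input_dim), U: (4n, n), b: (4n,) hold the parameters for
    # the gates f, i, o and the candidate cell value g (hidden size n).
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # (4n,) pre-activations
    f = sigmoid(z[0:n])               # forget gate: keep/discard old memory
    i = sigmoid(z[n:2 * n])           # input gate: admit new information
    o = sigmoid(z[2 * n:3 * n])       # output gate: expose memory
    g = np.tanh(z[3 * n:4 * n])       # candidate cell update
    c = f * c_prev + i * g            # additive cell-state update
    h = o * np.tanh(c)                # new hidden state
    return h, c
```

The additive update of the cell state c is what limits the multiplicative shrinking of gradients described above.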

Long Short-Term Memory Networks (LSTM) for Image Caption Generator

Dataset for Image Caption Generator

For training the image caption generator model, the Flickr8k dataset is employed. The dataset is openly accessible and easy to download, though at roughly 1 GB it takes some time. The most significant file is Flickr8k.token, which contains all of the image names and their captions. The Flickr8k Dataset folder contains 8091 photographs, and the Flickr8k text folder contains text files with the image captions. A sketch of how the captions file can be parsed is given below.
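Assuming the commonly distributed format of Flickr8k.token, where each line holds an image name with a caption index followed by a tab and the caption text, a minimal parsing sketch looks like this (verify the format against your own copy of the file):

```python
# Parse Flickr8k.token: each line is assumed to be
# "<image>.jpg#<n>\t<caption>"; check this against your dataset copy.
from collections import defaultdict

def load_captions(token_path):
    captions = defaultdict(list)
    with open(token_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            image_name = image_id.split("#")[0]   # drop the #0..#4 suffix
            captions[image_name].append(caption.lower())
    return captions

captions = load_captions("Flickr8k.token.txt")
print(len(captions), "images with captions")
```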

Image Caption Generator Model


So, to build our image caption generator model, we merge these two architectures; the result is also called a CNN-RNN model.

The CNN is used for extracting features from the image; we use the pre-trained Xception model. The LSTM then uses the information from the CNN to generate a description of the image. A sketch of this merged model is given below.
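As a sketch of such a merge model in Keras, assuming Xception features global-average-pooled to a 2048-dimensional vector; the vocabulary size, maximum caption length, and 256-unit layer widths are illustrative assumptions rather than values from this report.

```python
# Sketch of the CNN-RNN "merge" captioning model described above.
# vocab_size, max_len, and the 256-unit widths are assumed values.
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, add)
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 34   # illustrative; derive from your data

# Feature extractor: pre-trained Xception, global-average-pooled to 2048-d.
# Features are typically extracted offline and fed in as img_in below.
feature_extractor = Xception(include_top=False, pooling="avg")

# Image branch: compress the 2048-d feature vector.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and run it through an LSTM.
seq_in = Input(shape=(max_len,))
seq_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(seq_in))

# Merge both branches and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the model is called repeatedly, feeding back each predicted word, until an end token or max_len is reached.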

6. Software & Hardware Requirements
Software Technologies Required -
● Python (latest version)
● Google Colab or Jupyter Notebook for running the .ipynb file
● Other important libraries required will be -
○ TensorFlow
○ Keras

○ Scikit-Learn (sklearn)
○ Pandas
○ Numpy
○ Matplotlib (pyplot)
○ seaborn

Hardware Requirements
● Graphics Card / GPU (if available; Google Colab will perform better)

7. Implementation

8. Results

9. Conclusion

10. References
Aneja, Jyoti, Aditya Deshpande, and Alexander G. Schwing. "Convolutional Image Captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

Cornia, Marcella, et al. "Meshed-Memory Transformer for Image Captioning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

