Image Caption Generator
A PROJECT REPORT
Submitted by
ABHAY BANALACH(19BCS1568)
DHEERAJ KARWASRA(19BCS1177)
ISHITA THAPLIYAL(19BCS1163)
SHADMAN AHMED(19BCS2289)
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
Chandigarh University
DECEMBER 2022
BONAFIDE CERTIFICATE
Assistant Professor
Computer Science & Engineering
ACKNOWLEDGEMENTS
I would like to express my special thanks of gratitude to my project mentor, Er. Steve Samson,
as well as the teachers of the CSE department of Chandigarh University, who gave me the
golden opportunity to work on this wonderful project on the topic Image Caption Generator.
The project also helped me do a great deal of research, through which I came to know about
many new things. I am really thankful to them.
Table of Contents
LIST OF FIGURES....................................................................................................i
List of Tables............................................................................................................iii
ABSTRACT.............................................................................................................iv
सार............................................................................................................................v
ABBREVIATIONS..................................................................................................vi
CHAPTER 1 INTRODUCTION...............................................................................1
1.1 Relevant Contemporary Issues.........................................................................1
1.2 Identification of Problem..................................................................................3
1.3 Identification of Tasks......................................................................................3
1.4 Timeline............................................................................................................5
1.5 Organization of the Report...............................................................................5
CHAPTER 2 LITERATURE SURVEY....................................................................6
2.1 Timeline of the Reported Problem....................................................................6
2.1.1 Handcraft Features with Statistical Language Model.................................7
2.1.2 Deep Learning Features with Neural Network...........................................7
2.1.3 Datasets and Evaluation..............................................................................8
2.2 Proposed Solution.............................................................................................8
2.2.1 TEMPLATE-BASED APPROACHES.......................................................9
2.2.2 RETRIEVAL-BASED APPROACHES.....................................................9
2.2.3 NOVEL CAPTION GENERATION..........................................................9
2.3 Bibliometric Analysis.....................................................................................10
2.3.2 Limitations................................................................................................12
2.4 Review Summary...........................................................................................12
2.5 Problem Definition.........................................................................................13
2.6 Goals and Objectives......................................................................................14
CHAPTER 3 DESIGN FLOW................................................................................16
3.1 Evaluation & Selection of Specifications/Features........................................16
LIST OF FIGURES
Figure 1.1 Number of visually impaired people in the world by 2020 ……..02
LIST OF TABLES
ABSTRACT
Scene interpretation has long been a key component of computer vision, and image
captioning is one of the main topics of AI research, since it seeks to imitate the way
humans can condense a huge quantity of visual data into a few words that describe an
image. The challenge calls for the employment of methods from computer vision and
natural language processing in order to generate a brief but comprehensive caption of the
image. Significant study in the field has been made possible by caption datasets from
sources such as Flickr and COCO. In this work, a multilayer Convolutional Neural
Network (CNN), a Recurrent Neural Network (RNN), and related approaches are used to
generate captions for images.
सार
ABBREVIATIONS
ML Machine Learning
AI Artificial Intelligence
NN Neural Networks
CHAPTER 1
INTRODUCTION
Image captioning is a multimodal task which can be improved with the aid of deep learning.
However, even after so much progress in the field, captions generated by humans remain more
effective, which makes the image caption generator an interesting topic to work on with deep
learning. Fetching the story of an image and automatically generating captions is a
challenging task. The objective of generating captions lies in deriving the relationship
between the captured image and its objects, generating natural language, and judging the
quality of the generated captions. Determining the context of an image requires the detection
and recognition of objects, their attributes, and the relationships between these entities in
the image.
Applications involving computer vision or image/video processing can benefit greatly from
deep learning. It is generally used to categorise photos, cluster them based on similarities,
and recognise objects within scenes. ViSENZE, for instance, has created commercial products
that use deep learning networks to enhance image recognition and tagging. This enables
buyers to search for a company's products or related items using images rather than
keywords.
Object detection is a more difficult form of this task that requires precisely detecting one
or more items within the scene of the shot and boxing them in. This is accomplished using
algorithms that can recognise faces, people, street signs, flowers, and many other visual data
elements.
Figure 1.1: Number of visually impaired people in the world by 2020
The whole model of generating accurate captions has the potential to have a large impact.
According to the World Health Organization, at least 2.2 billion people globally have a near or
distance vision impairment, as shown in Figure 1.1. This implies that these people cannot view
any image on the web, and that is where an image caption generator comes into play. It can
help visually impaired people understand the context of an image by deciphering the objects in
the image.

An image caption generator can also play a major role in the growth of social media, where
captions matter greatly. So far, the only step towards automatic captions has been Instagram
generating a transcript from the audio of a video; there has been no progress on generating
captions by just looking at the image. With the internet and technology continuously growing,
people increasingly attached to their mobile phones, and social media being a competitive
market, such a feature could provide a human touch to an image without human knowledge or
emotion being needed to express its essence, and it could also make visually impaired people
feel more included in society.
Though numerous research studies and deep learning models have already been put into play to
generate captions, the major problem at hand is still the fact that humans describe images
using natural language, which is compact, crisp, and easy to understand. Machines, on the
other hand, have not yet matched the effectiveness of human-generated descriptions of an
image. Identifying a well-structured object or image is a relatively easy task that has long
been studied by the computer vision community, but deriving relationships between the objects
of an image is quite a task. Indeed, a description must capture the essence of the image and
describe its objects, but it must also express how the objects correlate with each other, as
well as their attributes and the activities they are performing in the image. This knowledge
has to be expressed in a natural language such as English to actually communicate the context
of the image effectively.
In deep-learning-based techniques, features are learned by the machine on its own from the
training data, and these techniques are able to handle large and diverse sets of images and
videos. Hence, they are preferred over traditional methods. In the last few years, numerous
articles have been published on image captioning with deep learning. To provide an abridged
version of the literature, this report presents a survey focusing mainly on deep-learning-based
image captioning papers.
1.4 Timeline
Figure 1.2: Gantt chart depicting the timeline of the project
CHAPTER 2
LITERATURE SURVEY
relationship processing, and finally a conditional random field (CRF) predicting image tags
is used to produce a natural language description. Object detection is also performed on the
photos. In order to infer objects, attributes, and relationships in an image and transform them
into a sequence of semantic trees, Lin et al. used a 3D visual analysis system. After learning
a grammar, the trees were used to generate text descriptions.
but this is undoubtedly a passing improvement. In computer vision, the RNN is also quickly
rising in prominence, with applications in sequence modelling, frame-level video classification,
and recent visual question-answering tasks.
The entire image caption generation task has been covered in this overview, along with the
model frameworks that have recently been suggested to address the description task. We have
also concentrated on the algorithmic core of various attention mechanisms and provided a brief
overview of how the attention mechanism is used. We also provide an overview of the sizeable
datasets and standard evaluation criteria.
unsupervised learning. Typically, captions are created for the entire scene depicted in the
image; however, captions can also be created for distinct areas of an image (dense captioning).
Either a compositional design or a straightforward encoder-decoder architecture can be used
for image captioning. There are techniques that make use of the attention mechanism, semantic
concepts, and various visual description styles. Some techniques can even produce descriptions
for unseen objects. However, there are also methods that utilise language models other than
LSTM, such as CNN- and RNN-based ones, so we include a language-model-based category,
"LSTM vs. Others." The majority of image captioning algorithms employ LSTM as their
language model.
captions. They are unable to produce captions that are both image-specific and semantically
accurate.
The table provides a brief summary of the deep-learning-based image captioning techniques. It
lists the names of the image captioning techniques, the class of deep neural networks used to
encode the image data, and the language models employed to describe the images. In the final
column, we give a category label to each captioning technique based on the taxonomy in
Figure 2.3.
Visual Space
The bulk of image captioning methods use the visual space for generating captions. In
visual-space-based methods, the image features and the corresponding captions are
independently passed to the language decoder.
Multimodal Space
A typical multimodal space-based method's design consists of a language encoder, a vision
part, a multimodal space part, and a language decoder.
In order to extract the image features, the vision component uses a deep convolutional neural
network as a feature extractor. The language encoder part extracts the word features and learns
a dense feature embedding for each word. The semantic temporal context is subsequently passed
to the recurrent layers. The multimodal space component maps the word features and the image
features into a common space.
2.3.2 Limitations
A helpful framework for learning to map from photos to human-level image captions is provided
by the neural image caption generator. The algorithm learns to extract pertinent semantic
information from visual attributes by training on a large number of image-caption pairs.
However, when used with a static image, our caption generator will concentrate on the aspects
of the image that are helpful for image classification rather than necessarily the aspects
that are helpful for caption production. We may instead train the image embedding model (the
VGG-16 network used to encode features) as a component of the caption generation model,
enabling us to fine-tune the image encoder to better perform the task of creating captions.
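As a rough, assumed sketch of that idea (not part of the original implementation), the VGG-16 encoder could be marked trainable inside a combined Keras model so that the captioning loss also updates its weights:

# Hypothetical sketch: unfreezing a pre-trained VGG16 encoder so that it is
# fine-tuned together with the caption decoder (layer choices are illustrative).
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Dense, GlobalAveragePooling2D

image_input = Input(shape=(224, 224, 3))
encoder = VGG16(weights="imagenet", include_top=False, input_tensor=image_input)
encoder.trainable = True          # allow the captioning loss to update the CNN weights
features = GlobalAveragePooling2D()(encoder.output)
features = Dense(256, activation="relu")(features)
# ... these 256-dim features would then feed the language-generating RNN ...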
evaluation indicators should be optimized to make them more in line with human experts’
assessments.
4. A very real problem is speed: the training, testing, and sentence generation of the model
should be optimized to improve performance.
Despite the successes of many systems based on Recurrent Neural Networks (RNNs), many issues
remain to be addressed. Among those issues, the following two are prominent for most systems.
1. The Vanishing Gradient Problem.
A recurrent neural network is a deep learning system created to handle various challenging
computer tasks, including speech recognition and object classification. The understanding of
each event is based on knowledge from earlier occurrences, and RNNs are built to manage a
succession of events.
Deeper RNNs would be ideal because they would have a greater memory capacity and superior
capabilities. They could be used in numerous real-world use cases, including improved speech
recognition and stock prediction. Despite their seeming promise, RNNs aren't frequently used
in real-world applications because of the vanishing gradient issue.
CHAPTER 3
DESIGN FLOW
Through our model we aim to produce semantically and visually grounded descriptions of images;
the proposed descriptions are in natural language, i.e. as a human would perceive and describe
them. By leveraging techniques such as CNNs and RNNs, and datasets such as MS-COCO, we strive
to attain human-level perception of given images. Our core insight is that we can leverage
these large image-sentence datasets by treating the sentences as weak labels, in which
contiguous segments of words correspond to some particular, but unknown, location in the image.

Our approach is to infer these alignments and use them to learn a generative model of
descriptions. We develop a deep neural network model that infers the alignment between
segments of sentences and the regions of the image that they describe. We introduce a
Recurrent Neural Network architecture that takes an input image and generates its description
in text. Our experiments show that the generated sentences produce sensible qualitative
predictions.
Computer vision systems will become more dependable for use as personal assistants for
visually impaired persons, and for enhancing their daily lives, as progress is made in
automatic image captioning and scene understanding. In order to bridge the meaning gap
between language and vision, common sense and logic must be incorporated.
6. Data Preparation using Generator: Data preparation is the process of cleaning and
transforming raw data prior to processing and analysis. It is an important step that often
involves reformatting data, making corrections to it, and combining data sets to enrich the
data.
7. Word Embedding: Converting words into vectors (the output of an embedding layer); a minimal
sketch follows this list.
8. Model Architecture: Building an image feature extractor model and a partial caption
sequence model, and merging the two networks.
9. Train Our Model: Training data is the dataset used to train the ML algorithm. It consists
of the sample output data and the corresponding sets of input data that influence the output.
10. Predictions: Prediction refers to the output of an algorithm after it has been trained on
a historical dataset and applied to new data; here it means predicting the caption for a photo.
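As a small illustration of step 7 (an assumed sketch, not the project's exact code), a Keras tokenizer builds the word-to-index mapping and an Embedding layer turns the integer indices into dense vectors:

# Hypothetical sketch of step 7: turning caption words into dense vectors.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

# Illustrative caption; the real project fits the tokenizer on all training captions
captions = ["startseq man riding on three wheeled wheelchair endseq"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)                    # build the word -> index map
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(captions)  # words -> integer indices
embedding = Embedding(input_dim=vocab_size, output_dim=256, mask_zero=True)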
information.
Figure 3.2 CNN and LSTM Model
In dense captioning, captions are generated for each region of the scene. Other methods
generate captions for the whole scene.
Dense Captioning
The previous techniques for creating image captions can only produce one caption for the
entire image. To gather data on various objects, they use different parts of the image. These
techniques do not, however, produce region-specific captions. DenseCap is a method for
captioning images that Johnson et al. proposed. This technique localises every important area
of an image before producing descriptions for each area. A typical method of this category has
the following steps:
(1) Region proposals are generated for the different regions of the given image.
(2) A CNN is used to obtain region-based image features.
(3) The outputs of Step 2 are used by a language model to generate captions for every region.
Long short-term memory, or LSTM, is a form of RNN (recurrent neural network) that is useful
for solving problems involving sequence prediction: we can anticipate the next word based on
the preceding text. By overcoming the RNN's short-term memory restrictions, it has
distinguished itself from the regular RNN as an effective technology. Through the use of a
forget gate, the LSTM can carry relevant information forward while processing inputs and
discard irrelevant information. Compared to conventional RNNs, LSTMs are built to address the
vanishing gradient problem and have a longer memory for information. In order to keep learning
over many time-steps and back-propagate over time and layers, LSTMs are able to retain a
constant error.
To store data outside the RNN's usual flow, LSTMs employ gated cells. The network can use
these cells to change data in a variety of ways, including storing information in them and
reading from them. Individual cells have the ability to make decisions about the information
and can carry out these decisions by opening or closing the gates. LSTMs outperform
conventional RNNs in certain tasks due to their superior long-term memory capabilities.
Because of its chain-like architecture, the LSTM can retain knowledge for longer periods of
time and handle difficult problems that typical RNNs either find difficult or are unable to
handle.
Forget gate - removes information that is no longer necessary for the completion of the task.
This step is essential to optimizing the performance of the network.
Input gate - responsible for adding information to the cells
Output gate - selects and outputs necessary information
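For reference, the three gates can be written in the standard LSTM form (these equations are added here for clarity and are not taken from the original report; \sigma denotes the logistic sigmoid and \odot element-wise multiplication):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            (input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            (output gate)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)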
An openly accessible benchmark for image-to-sentence description is the Flickr8k dataset. This
collection contains 8000 images, each with five captions. These images came from several Flickr
groups. Each caption provides a thorough overview of the subjects and activities seen in the
photo. The dataset is varied, since it shows a range of scenes and events rather than only
famous people or locations. There are 6000 photographs in the training dataset, 1000 in the
development dataset, and 1000 in the test dataset. The dataset's qualities that make it
suitable for this project are as follows:
• Having numerous labels for a single image makes the model more general and keeps it from
overfitting.
• The image annotation model can work across numerous picture categories thanks to the varied
training image categories, which makes the model more resilient.
Convolutional Neural Network (CNN) layers for feature extraction on input data are paired with
LSTMs to facilitate sequence prediction in the CNN-LSTM architecture. Although we refer to
LSTMs that employ a CNN as a front end by the more general label "CNN LSTM", this architecture
was initially known as a Long-term Recurrent Convolutional Network, or LRCN, model.
To provide textual descriptions of photographs, this architecture is employed. Crucially, a
CNN that has already been trained on a difficult image classification task is repurposed as a
feature extractor for the caption-generation challenge.
3.5.4 Model Advantages
The Flickr8k dataset that we used consists of almost 8000 photographs, with the captions
stored in an accompanying text file. Although deep-learning-based image captioning techniques
have made significant strides in recent years, a reliable technique that can provide
high-calibre captions for almost all photos has not yet been developed. Automatic image
captioning will continue to be a hot research topic for some time to come with the
introduction of novel deep learning network architectures. With more people using social media
every day and the majority of them posting images, the potential for image captioning is very
broad in the future. Therefore, users will benefit from this project.
1. Assistance for visually impaired
The advent of machine learning solutions like image captioning is a boon for visually
impaired people who are unable to comprehend visuals.
With AI-powered image caption generators, image descriptions can be read out to the
visually impaired, enabling them to get a better sense of their surroundings.
2. Recommendations in editing
The image captioning model automates and accelerates the closed captioning process for
digital content production, editing, delivery, and archival.
Well-trained models replace manual efforts for generating quality captions for images as
well as videos.
4. Self-driving cars
By using the captions generated by image caption generator self driving cars become
aware of the surroundings and make decisions to control the car.
5. Reduce vehicle accidents
By installing an image caption generator
3.6 Methodology
3.6.1 Model Overview
The model is trained to maximise the probability p(S|I), where S is the sequence of words
produced by the model and each word S_t is drawn from a lexicon created from the training
dataset. The input image I is fed into a deep vision convolutional neural network (CNN), which
aids object detection in the image. As shown in Figure 3.4, the language-generating Recurrent
Neural Network (RNN) receives the image encoding and uses it to create a relevant phrase for
the image. The model can be compared to a language-translation RNN model, where the goal is to
maximise p(T|S), with T being the translation of the sentence S. However, in our model the
encoder RNN, which transforms an input sentence into a fixed-length vector, is replaced by a
CNN encoder. Recent research has shown that a CNN can easily transform an input image into
such a vector.
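In equation form (the standard encoder-decoder captioning objective, added here for clarity), training maximises the log-likelihood of the caption S = (S_1, ..., S_N) given the image I, summed over the words of the sentence:

\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1})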
We employ the pre-trained VGG16 model for the image classification task; the next section
discusses the model's specifics. The pre-trained VGG16 is then followed by an LSTM (long
short-term memory) network, which accomplishes language generation. LSTM networks differ from
typical neural networks in that they take into account that a current token depends on the
preceding tokens for a sentence to make sense.
The model that existed prior to the project has two distinct input streams, one for image
features and the other for captions from the preprocessed input. The image features are passed
through a dense, fully connected layer to represent them in a different dimension. The input
captions are passed through an embedding layer. After merging, these two input streams are fed
as inputs to an LSTM layer: the caption embedding serves as the LSTM's input while the image
serves as its initial state.
3.6.2 Recurrent Neural Network
A feed-forward neural network with an internal memory is known as a recurrent neural network.
The result for the current input depends on the previous computation, and RNNs are called
recurrent because they carry out the same function for every data input. The output is
created, copied, and then fed back into the recurrent network. When making a decision, the
network takes into account both the current input and what it has learned from the prior
inputs.

A recurrent neural network (RNN) is a type of artificial neural network in which connections
between nodes form a directed graph along a temporal sequence. This allows it to exhibit
temporal dynamic behaviour. RNNs, which are derived from feed-forward neural networks, can
process input sequences of varying length by using their internal state (memory).
Convolutional neural networks (CNN, or ConvNet) are a class of deep neural networks used most
frequently to analyse visual imagery in deep learning. Based on the shared-weight convolution
kernels' translation-invariance properties and shift-invariant architecture, they are often
referred to as shift-invariant or space-invariant artificial neural networks (SIANN). They can
be used in a variety of fields, including image and video recognition, recommender systems,
image classification, image segmentation, and medical image analysis. They can also be used in
brain-computer interfaces, natural language processing, and financial time series.
CNNs are regularised versions of multilayer perceptrons. Multilayer perceptrons are fully
connected networks, in which all of the neurons in one layer are connected to all of the
neurons in the following layer. These networks are vulnerable to overfitting because of their
"fully-connectedness". Common forms of regularisation include penalising weight magnitudes in
the loss function and randomly trimming connections (dropout).
3.6.4 Long Short Term Memory
Long Short-Term Memory (LSTM) networks, Figure 3.8, are a type of recurrent neural network
capable of learning order dependence in sequence prediction problems. This is a behavior
required in complex problem domains like machine translation, speech recognition, and more.
LSTMs are a challenging area of deep learning, and understanding how concepts like
bidirectional and sequence-to-sequence models relate to the field can be difficult. Unlike
traditional feed-forward neural networks, the LSTM has feedback connections. It can process
whole data sequences (such as speech or video) in addition to single data points (like
photos). For instance, LSTM can be used for tasks like connected, unsegmented handwriting
recognition, speech recognition, and network traffic anomaly detection in IDSs (intrusion
detection systems). A typical LSTM unit is made up of a cell, an input gate, an output gate,
and a forget gate. The three gates control the flow of information, and the cell retains
values over arbitrary time intervals.
3.6.5 VGG16
It is regarded as one of the best vision model architectures created to date. The most
distinctive feature of VGG16 is that it consistently uses convolution layers of 3x3 filters
with a stride of 1 and the same padding, and max-pool layers of 2x2 filters with a stride of
2, as shown in Figure 3.9. Convolution and max-pool layers are arranged in this manner
throughout the entire architecture. Two FC (fully connected) layers are present at the very
end, followed by a softmax for the output. The 16 in VGG16 denotes the fact that there are 16
layers with weights. This network has over 138 million parameters, making it a sizeable
network.
In their publication "Very Deep Convolutional Networks for Large-Scale Image Recognition," K.
Simonyan and A. Zisserman from the University of Oxford introduced the convolutional neural
network model known as VGG16. The model achieves 92.7% top-5 accuracy on ImageNet, a dataset
of over 14 million images divided into 1000 classes. It was a well-known model submitted to
ILSVRC-2014. It improves upon AlexNet by sequentially substituting multiple 3x3 kernel-sized
filters for AlexNet's large kernel-sized filters (11x11 and 5x5 in the first and second
convolutional layers, respectively). VGG16 was trained for weeks using NVIDIA Titan Black
GPUs.
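A minimal sketch of how such a pre-trained VGG16 can be repurposed as an image feature extractor in Keras (choosing the 4096-dimensional fc2 layer as the output is an illustrative assumption, not the report's stated choice):

# Hypothetical sketch: pre-trained VGG16 reused as an image feature extractor.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                        # 16 weight layers plus softmax head
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("fc2").output)  # 4096-dim feature vector per image
extractor.summary()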
CHAPTER 4
RESULTS ANALYSIS AND VALIDATION
4.1 Implementation
4.1.1 Pre-Requisites
This project requires good knowledge of deep learning, Python, working with Jupyter
notebooks, the Keras library, NumPy, and natural language processing.
Make sure you have installed all the following necessary libraries:
• keras
• pillow
• numpy
• tqdm
• jupyterlab
The format of our file is image and caption separated by a newline ("\n").
Each image has 5 captions, and we can see that a number #(0 to 4) is assigned to each
caption.
• load_doc( filename ) – For loading the document file and reading the contents inside
the file into a string.
• all_img_captions( filename ) – This function will create a descriptions dictionary that
maps each image to a list of 5 captions. The descriptions dictionary will look
something like the figure.
• cleaning_text( descriptions ) – This function takes all descriptions and performs data
cleaning. This is an important step when working with textual data; depending on the goal, we
decide what type of cleaning to perform on the text. In our case, we will remove punctuation,
convert all text to lowercase, and remove words that contain numbers. So, a caption like
"A man riding on a three-wheeled wheelchair" will be transformed into "man riding on three
wheeled wheelchair", as shown in Figure 4.3.
• text_vocabulary( descriptions ) – This is a simple function that will separate all the
unique words and create the vocabulary from all the descriptions.
• save_descriptions( descriptions, filename ) – This function will create a list of all the
descriptions that have been preprocessed and store them in a file. We will create a
descriptions.txt file to store all the captions. A minimal sketch of these helper functions is
given below.
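The following is a minimal, assumed sketch of the helper functions described above (it assumes the Flickr8k token-file format "image_name#idx<TAB>caption", one entry per line; the exact parsing in the project's code may differ):

# Hypothetical sketches of the caption-preparation helpers.
import string

def load_doc(filename):
    # Read the whole file into a single string
    with open(filename, "r") as f:
        return f.read()

def all_img_captions(filename):
    # Map each image id (without extension or "#n" suffix) to its list of captions
    doc = load_doc(filename)
    descriptions = {}
    for line in doc.split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue
        image_id, caption = tokens[0].split(".")[0], " ".join(tokens[1:])
        descriptions.setdefault(image_id, []).append(caption)
    return descriptions

def cleaning_text(descriptions):
    # Lowercase, strip punctuation, and drop words containing digits
    table = str.maketrans("", "", string.punctuation)
    for image_id, captions in descriptions.items():
        for i, caption in enumerate(captions):
            words = caption.lower().translate(table).split()
            words = [w for w in words if w.isalpha()]
            captions[i] = " ".join(words)
    return descriptions

def text_vocabulary(descriptions):
    # Collect the set of unique words across all captions
    vocab = set()
    for captions in descriptions.values():
        for caption in captions:
            vocab.update(caption.split())
    return vocab

def save_descriptions(descriptions, filename):
    # Write one "image_id caption" pair per line
    lines = []
    for image_id, captions in descriptions.items():
        for caption in captions:
            lines.append(image_id + " " + caption)
    with open(filename, "w") as f:
        f.write("\n".join(lines))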
from tensorflow.keras.applications import ResNet50

model = ResNet50(weights="imagenet", input_shape=(224,224,3))

The function extract_features() will extract features for all images, and we will map image
names to their respective feature arrays. Then we will dump the features dictionary into a
pickle file. This process can take a lot of time depending on your system. I am using an
Nvidia 1050 GPU for training, so it took me around 7 minutes to perform this task. However, if
you are using a CPU then this process might take 1-2 hours. You can comment out the code and
directly load the features from our pickle file. A sketch of such a function is given below.
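A minimal, assumed sketch of extract_features(): it reuses the pre-trained CNN above with its softmax head removed so that each image yields a 2048-length vector; the directory and pickle file names are illustrative, not taken from the report.

# Hypothetical sketch of the feature-extraction step.
import os
import pickle
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def extract_features(directory):
    base = ResNet50(weights="imagenet", input_shape=(224, 224, 3))
    # Drop the final softmax layer so the model outputs a 2048-dim feature vector
    model = Model(inputs=base.input, outputs=base.layers[-2].output)
    features = {}
    for name in os.listdir(directory):
        image = load_img(os.path.join(directory, name), target_size=(224, 224))
        image = preprocess_input(np.expand_dims(img_to_array(image), axis=0))
        features[name] = model.predict(image, verbose=0)
    return features

# Example usage (illustrative paths):
# features = extract_features("Flickr8k_Dataset")
# pickle.dump(features, open("features.p", "wb"))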
• load_photos( filename ) – This will load the text file into a string and return the list
of image names.
• load_features( photos ) – This function will give us the dictionary of image names and
their feature vectors, which we have previously extracted from the Xception model. A minimal
sketch of both loaders follows.
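Minimal, assumed sketches of the two loaders described above (the pickle file name "features.p" is an assumption carried over from the feature-extraction sketch, not the report's stated name):

# Hypothetical sketches of the data-loading helpers.
import pickle

def load_photos(filename):
    # One image name per line in the split file (e.g. the Flickr8k training list)
    with open(filename, "r") as f:
        return [line.strip() for line in f if line.strip()]

def load_features(photos):
    # Keep only the feature vectors belonging to the requested split
    all_features = pickle.load(open("features.p", "rb"))
    return {k: all_features[k] for k in photos}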
task. Our model needs to be trained on 6000 photos, each of which has a 2048-length feature
vector and a caption that is likewise represented as numbers. Because it is impossible to hold
this much data in memory for 6000 photos, we will use a generator method that produces
batches. The generator will yield the input and output sequences.
For example:
The input to our model is [x1, x2] and the output will be y, where x1 is the 2048-length
feature vector of the image, x2 is the input text sequence, and y is the next word of the
output sequence that the model has to predict.
• Feature Extractor – The feature extracted from the image has a size of 2048; with a dense
layer, we reduce the dimensions to 256 nodes.
• Sequence Processor – An embedding layer handles the textual input, followed by the LSTM
layer.
• Decoder – Merging the output from the above two layers, we process it with a dense layer to
make the final prediction. The final layer contains a number of nodes equal to our vocabulary
size.
A visual representation of the final model is given in the figure, and a minimal sketch of the
corresponding Keras model definition follows.
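The sketch below is an assumed Keras rendering of the merge architecture just described (2048-dim image features, a 256-dim internal representation, and vocab_size/max_length as placeholders); the exact layer sizes in the project may differ.

# Hypothetical sketch of the merge model: feature extractor + sequence processor + decoder.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Feature extractor branch: 2048-dim image vector -> 256-dim representation
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)
    # Sequence processor branch: word indices -> embedding -> LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge both branches and predict the next word over the vocabulary
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model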
Now that the model has been trained (Figure 4.4), we will make a separate file,
testing_caption_generator.py, which loads the model and generates predictions. The predictions
contain index values up to the maximum caption length, so we will use the same tokenizer.p
pickle file to map the words back from their index values. A minimal sketch of this greedy
decoding loop is given below.
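A minimal, assumed sketch of the prediction loop in testing_caption_generator.py; the "startseq"/"endseq" boundary tokens and the fitted Keras tokenizer are assumptions based on the description above, not details confirmed by the report.

# Hypothetical sketch of greedy caption decoding.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    # photo_feature is assumed to be a (1, 2048) feature vector for one image
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = np.argmax(model.predict([photo_feature, seq], verbose=0))
        word = tokenizer.index_word.get(yhat)
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()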
CHAPTER 5
CONCLUSION AND FUTURE WORK
In this chapter we throw some light on the conclusions of our project. We also underline the
limitations of our methodology. There is huge potential in this field, as we discuss in the
future scope section of this chapter.
5.1 Conclusion
We reviewed deep-learning-based image captioning techniques in this study. We have provided a
taxonomy of image captioning methods, illustrated general block diagrams of the main groups,
and highlighted the advantages and disadvantages of each. We discussed several evaluation
criteria and datasets, along with their strengths and weaknesses, and briefly summarised the
experimental results. We also briefly discussed potential directions for future research in
this field. The dataset utilised here is Flickr8k, which includes nearly 8000 images with the
corresponding captions stored in a text file. Although deep-learning-based image captioning
techniques have made significant strides in recent years, a reliable technique that can
provide high-calibre captions for almost all photos has not yet been developed, and automatic
image captioning will remain a popular research topic for some time to come with the
introduction of novel deep learning network architectures. With more people using social media
every day and the majority of them posting images, the potential for image captioning is very
broad in the future. Therefore, users will benefit more from this project.
Feature extraction and similarity computation in images are difficult tasks. Current image
retrieval algorithms calculate similarity using features like colour, tags, the histogram, and
other features. Since these approaches are independent of the image's context, the findings
cannot be entirely correct. Therefore, a thorough study of image retrieval using the image
context, such as image captioning, will help to resolve this issue in the future. By including
new image captioning datasets in the project's training, it will be possible to improve the
identification of classes with lower precision. In order to see whether the image retrieval
outcomes improve, this methodology can also be integrated with earlier image retrieval
techniques based on histograms, shapes, etc.
APPENDIX 1
Python
Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling, including
classification, regression, clustering and dimensionality reduction, via a consistent interface
in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.
It was originally called scikits.learn and was initially developed by David Cournapeau as a
Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux,
Alexandre Gramfort, and Vincent Michel, from INRIA (the French Institute for Research in
Computer Science and Automation), took this project to another level and made the first public
release (v0.1 beta) on 1st Feb. 2010.
Prerequisites
Python (>=3.5)
Pandas (>= 0.18.0) is required for some of the scikit-learn examples using data
structure and analysis.
Installation
If you already installed NumPy and Scipy, following are the two easiest ways to install
scikit-learn −
Using pip
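pip install -U scikit-learn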
Using conda
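conda install scikit-learn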
Features
Rather than focusing on loading, manipulating and summarising data, Scikit-learn library is
focused on modeling the data. Some of the most popular groups of models provided by Sklearn
are as follows −
Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like
Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are the part of scikit-
learn.
Unsupervised Learning algorithms − On the other hand, it also has all the popular unsupervised
learning algorithms from clustering, factor analysis, PCA (Principal Component Analysis) to
unsupervised neural networks.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction − It is used for reducing the number of attributes in data which can be
further used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, it is used for combining the predictions of multiple
supervised models.
Feature extraction − It is used to extract the features from data to define the attributes in image
and text data.
Open Source − It is an open-source library and is also commercially usable under the BSD license.
Numpy
NumPy is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by
Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created
NumPy by incorporating features of the competing Numarray into Numeric, with extensive
modifications. NumPy is open-source software and has many contributors.
The Python programming language was not originally designed for numerical computing, but
attracted the attention of the scientific and engineering community early on. In 1995 the special
interest group (SIG) matrix-sig was founded with the aim of defining an array computing
package; among its members was Python designer and maintainer Guido van Rossum, who
extended Python's syntax (in particular the indexing syntax) to make array computing easier.
A new package called Numarray was written as a more flexible replacement for Numeric. Like
Numeric, it too is now deprecated. Numarray had faster operations for large arrays, but was
slower than Numeric on small ones, so for a time both packages were used in parallel for
different use cases. The last version of Numeric (v24.2) was released on 11 November 2005,
while the last version of numarray (v1.5.2) was released on 24 August 2006.
There was a desire to get Numeric into the Python standard library, but Guido van Rossum
decided that the code was not maintainable in its state then.
In early 2005, NumPy developer Travis Oliphant wanted to unify the community around a single
array package and ported Numarray's features to Numeric, releasing the result as NumPy 1.0 in
2006. This new project was part of SciPy. To avoid installing the large SciPy package just to get
an array object, this new package was separated and called NumPy. Support for Python 3 was
added in 2011 with NumPy version 1.5.0.
Features
Using NumPy in Python gives functionality comparable to MATLAB since they are both
interpreted, and they both allow the user to write fast programs as long as most operations work
on arrays or matrices instead of scalars. In comparison, MATLAB boasts a large number of
additional toolboxes, notably Simulink, whereas NumPy is intrinsically integrated with Python, a
more modern and complete programming language. Moreover, complementary Python packages
are available; SciPy is a library that adds more MATLAB-like functionality and Matplotlib is a
plotting package that provides MATLAB-like plotting functionality. Internally, both MATLAB
and NumPy rely on BLAS and LAPACK for efficient linear algebra computations.
Installing Numpy
The only prerequisite for installing NumPy is Python itself. If you don’t have Python yet and
want the simplest way to get started, we recommend you use the Anaconda Distribution - it
includes Python, NumPy, and many other commonly used packages for scientific computing and
data science.
CONDA
If you use conda, you can install NumPy from the defaults or conda-forge channels:
# Best practice, use an environment rather than install in the base env
conda create -n my-env
conda activate my-env
# If you want to install from conda-forge
conda config --env --add channels conda-forge
# The actual install command
conda install numpy
PIP
If you use pip, you can install NumPy with:
pip install numpy
Librosa
Librosa is a Python package for music and audio analysis. Librosa is mainly used when we work
with audio data, as in music generation (using LSTMs) or automatic speech recognition.
Installation instructions
pypi
The simplest way to install librosa is through the Python Package Index (PyPI). This will ensure
that all required dependencies are fulfilled. This can be achieved by executing the following
command:
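pip install librosa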
It provides the building blocks necessary to create music information retrieval systems.
Librosa helps to visualize audio signals and also perform feature extraction on them using
different signal processing techniques.
Source
If you’ve downloaded the archive manually from the releases page, you can install using the
setuptools script:
cd librosa-VERSION/
python setup.py install
If you intend to develop librosa or make changes to the source code, you can install with pip
install -e to link to your actively developed source tree:
OSX users can use homebrew to install ffmpeg by calling brew install ffmpeg or get a binary
version from their website https://www.ffmpeg.org.
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is
also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely
resemble that of MATLAB, though its use is discouraged.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. Matplotlib makes easy things easy and hard things possible.
Seaborn
Seaborn is a Python data visualization library based on matplotlib; it provides a high-level
interface for drawing attractive and informative statistical graphics.
ANNEXURE-1
SAMPLE CODE:
DATA CLEANING:
import re

def clean_text(sentence):
    # Lowercase, strip non-alphabetic characters, and rejoin the remaining words
    sentence = sentence.lower()
    sentence = re.sub("[^a-z]+", " ", sentence)
    words = sentence.split()
    return " ".join(words)
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(train_descriptions, encoding_train, word_to_idx, max_len, batch_size):
    vocab_size = len(word_to_idx) + 1
    X1, X2, y = [], [], []
    n = 0
    while True:
        for key, desc_list in train_descriptions.items():
            n += 1
            photo = encoding_train[key + ".jpg"]
            for desc in desc_list:
                # Encode the caption and emit one (partial sequence -> next word) pair per position
                seq = [word_to_idx[w] for w in desc.split() if w in word_to_idx]
                for i in range(1, len(seq)):
                    xi = pad_sequences([seq[:i]], maxlen=max_len, padding="post", value=0)[0]
                    yi = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(photo)
                    X2.append(xi)
                    y.append(yi)
            if n == batch_size:
                yield [[np.array(X1), np.array(X2)], np.array(y)]
                X1, X2, y = [], [], []
                n = 0