Image Caption Generator
A PROJECT REPORT
Submitted by
ABHAY BANALACH(19BCS1568)
DHEERAJ KARWASRA(19BCS1177)
ISHITA THAPLIYAL(19BCS1163)
SHADMAN AHMED(19BCS2289)
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
Chandigarh University
DECEMBER 2022
BONAFIDE CERTIFICATE
Assistant Professor
Computer Science & Engineering
ACKNOWLEDGEMENTS
I would like to express my special thanks of gratitude to my project mentor, Er. Steve Samson,
as well as the teachers of the CSE department of Chandigarh University, who gave me the
golden opportunity to work on this wonderful project on the topic Image Caption Generator.
The project also helped me do a great deal of research, through which I came to know about
many new things. I am really thankful to them.
Table of Contents
LIST OF FIGURES....................................................................................................i
List of Tables............................................................................................................iii
ABSTRACT.............................................................................................................iv
सार............................................................................................................................v
ABBREVIATIONS..................................................................................................vi
CHAPTER 1 INTRODUCTION...............................................................................1
1.1 Relevant Contemporary Issues.........................................................................1
1.2 Identification of Problem..................................................................................3
1.3 Identification of Tasks......................................................................................3
1.4 Timeline............................................................................................................5
1.5 Organization of the Report...............................................................................5
CHAPTER 2 LITERATURE SURVEY....................................................................6
2.1 Timeline of the Reported Problem....................................................................6
2.1.1 Handcraft Features with Statistical Language Model.................................7
2.1.2 Deep Learning Features with Neural Network...........................................7
2.1.3 Datasets and Evaluation..............................................................................8
2.2 Proposed Solution.............................................................................................8
2.2.1 TEMPLATE-BASED APPROACHES.......................................................9
2.2.2 RETRIEVAL-BASED APPROACHES.....................................................9
2.2.3 NOVEL CAPTION GENERATION..........................................................9
2.3 Bibliometric Analysis.....................................................................................10
2.3.2 Limitations................................................................................................12
2.4 Review Summary...........................................................................................12
2.5 Problem Definition.........................................................................................13
2.6 Goals and Objectives......................................................................................14
CHAPTER 3 DESIGN FLOW................................................................................16
3.1 Evaluation & Selection of Specifications/Features........................................16
LIST OF FIGURES
Figure 1.1 Number of visually impaired people in the world by 2020 ……..02
LIST OF TABLES
ABSTRACT
Scene interpretation has long been a key component of computer vision, and image
captioning is one of the main topics of AI research, since it seeks to imitate the way
humans can condense a huge quantity of visual data into a few words that describe an
image. The challenge calls for the employment of methods from computer vision and
natural language processing in order to generate a brief but comprehensive caption of the
image. Significant study in the field has been made possible by caption datasets from
sources such as Flickr and COCO. In this work, a multilayer Convolutional Neural
Network (CNN), a Recurrent Neural Network (RNN), and related approaches are used to
generate captions for images.
सार
ABBREVIATIONS
ML Machine Learning
AI Artificial Intelligence
NN Neural Networks
CHAPTER 1
INTRODUCTION
Image captioning is a multimodal task which can be improved with the aid of deep learning.
However, even after so much progress in the field, captions generated by humans remain more
effective, which makes the image caption generator an interesting topic to work on with deep
learning. Fetching the story of an image and automatically generating captions is a
challenging task. The objective of generating captions lies in deriving the relationship
between the captured image and its objects, generating natural language, and judging the
quality of the generated captions. Determining the context of an image requires the detection
and recognition of objects, their attributes, and the relationships between these entities in
the image.
Applications involving computer vision or image/video processing can benefit greatly from
deep learning. It is generally used to categorise photos, cluster them based on similarities,
and recognise objects within scenes. ViSENZE, for instance, has created commercial products
that use deep learning networks to enhance image recognition and tagging. This enables
buyers to search for a company's products or related items using images rather than
keywords.
Object detection is a more difficult form of this task that requires precisely detecting one
or more items within the scene of the shot and boxing them in. This is accomplished using
algorithms that can recognise faces, people, street signs, flowers, and many other visual data
elements.
Figure 1.1: Number of visually impaired people in the world by 2020
The whole model of generating accurate captions has the potential to have a large impact.
According to the World Health Organization, at least 2.2 billion people globally have a near or
distance vision impairment, as shown in Figure 1.1. This implies that these people cannot view
any image on the web, and that is where an image caption generator comes into play. It can
help visually impaired people understand the context of an image by deciphering the objects in
the image.

An image caption generator can also play a major role in the growth of social media, where
captions matter greatly. So far, the only step towards automatic captions has been Instagram
generating a transcript from the audio of a video; there has been no progress on generating
captions by just looking at the image. With the internet and technology continuously growing,
people increasingly attached to their mobile phones, and social media being a competitive
market, such a feature could provide a human touch to an image without human knowledge or
emotion being needed to express its essence, and it could also make visually impaired people
feel more included in society.
Though numerous research studies and deep learning models have already been put into play to
generate captions, the major problem at hand is still the fact that humans describe images
using natural language, which is compact, crisp, and easy to understand. Machines, on the
other hand, have not yet matched the effectiveness of human-generated descriptions of an
image. Identifying a well-structured object or image is a relatively easy task that has long
been studied by the computer vision community, but deriving relationships between the objects
of an image is quite a task. Indeed, a description must capture the essence of the image and
describe its objects, but it must also express how the objects correlate with each other, as
well as their attributes and the activities they are performing in the image. This knowledge
has to be expressed in a natural language such as English to actually communicate the context
of the image effectively.
In deep-learning-based techniques, features are learned by the machine on its own from the
training data, and these techniques are able to handle large and diverse sets of images and
videos. Hence, they are preferred over traditional methods. In the last few years, numerous
articles have been published on image captioning with deep learning. To provide an abridged
version of the literature, this report presents a survey focusing mainly on deep-learning-based
image captioning papers.
1.4 Timeline
Figure 1.2: Gantt chart depicting the timeline of the project
CHAPTER 2
LITERATURE SURVEY
relationship processing, and finally a conditional random field (CRF) predicting image tags
is used to produce a natural language description. Object detection is also performed on the
photos. In order to infer objects, attributes, and relationships in an image and transform them
into a sequence of semantic trees, Lin et al. used a 3D visual analysis system. After learning
a grammar, the trees were used to generate text descriptions.
but this is undoubtedly a passing improvement. In computer vision, the RNN is also quickly
rising in prominence, with applications in sequence modelling, frame-level video classification,
and recent visual question-answering tasks.
The entire image caption generation task has been covered in this overview, along with the
model frameworks that have recently been suggested to address the description task. We have
also concentrated on the algorithmic core of various attention mechanisms and provided a brief
overview of how the attention mechanism is used. We also provide an overview of the sizeable
datasets and standard evaluation criteria.
unsupervised learning. Typically, captions are created for the entire scene depicted in the
image; however, captions can also be created for distinct areas of an image (dense captioning).
Either a compositional design or a straightforward encoder-decoder architecture can be used
for image captioning. There are techniques that make use of the attention mechanism, semantic
concepts, and various visual description styles. Some techniques can even produce descriptions
for unseen objects. However, there are also methods that utilise language models other than
LSTM, such as CNN- and RNN-based ones, so we include a language-model-based category,
"LSTM vs. Others." The majority of image captioning algorithms employ LSTM as their
language model.
captions. They are unable to produce captions that are both image-specific and semantically
accurate.
The table provides a brief summary of the deep-learning-based image captioning techniques. It
lists the names of the image captioning techniques, the class of deep neural networks used to
encode the image data, and the language models employed to describe the images. In the final
column, we give a category label to each captioning technique based on the taxonomy in
Figure 2.3.
Visual Space
The bulk of image captioning methods use the visual space for generating captions. In
visual-space-based methods, the image features and the corresponding captions are
independently passed to the language decoder.
Multimodal Space
A typical multimodal space-based method's design consists of a language encoder, a vision
part, a multimodal space part, and a language decoder.
In order to extract the image features, the vision component uses a deep convolutional neural
network as a feature extractor. The language encoder part extracts the word features and learns
a dense feature embedding for each word. The semantic temporal context is subsequently passed
to the recurrent layers. The multimodal space component maps the word features and the image
features into a common space.
2.3.2 Limitations
A helpful framework for learning to map from photos to human-level image captions is provided
by the neural image caption generator. The algorithm learns to extract pertinent semantic
information from visual attributes by training on a large number of image-caption pairs.
However, when used with a static image, our caption generator will concentrate on the aspects
of the image that are helpful for image classification rather than necessarily the aspects
that are helpful for caption production. We may instead train the image embedding model (the
VGG-16 network used to encode features) as a component of the caption generation model,
enabling us to fine-tune the image encoder to better perform the task of creating captions.
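As a rough, assumed sketch of that idea (not part of the original implementation), the VGG-16 encoder could be marked trainable inside a combined Keras model so that the captioning loss also updates its weights:

# Hypothetical sketch: unfreezing a pre-trained VGG16 encoder so that it is
# fine-tuned together with the caption decoder (layer choices are illustrative).
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Dense, GlobalAveragePooling2D

image_input = Input(shape=(224, 224, 3))
encoder = VGG16(weights="imagenet", include_top=False, input_tensor=image_input)
encoder.trainable = True          # allow the captioning loss to update the CNN weights
features = GlobalAveragePooling2D()(encoder.output)
features = Dense(256, activation="relu")(features)
# ... these 256-dim features would then feed the language-generating RNN ...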
evaluation indicators should be optimized to make them more in line with human experts’
assessments.
4. A very real problem is speed: the training, testing, and sentence generation of the model
should be optimized to improve performance.
Despite the successes of many systems based on Recurrent Neural Networks (RNNs), many issues
remain to be addressed. Among those issues, the following two are prominent for most systems.
1. The Vanishing Gradient Problem.
A recurrent neural network is a deep learning system created to handle various challenging
computer tasks, including speech recognition and object classification. The understanding of
each event is based on knowledge from earlier occurrences, and RNNs are built to manage a
succession of events.
Deeper RNNs would be ideal because they would have a greater memory capacity and superior
capabilities. They could be used in numerous real-world use cases, including improved speech
recognition and stock prediction. Despite their seeming promise, RNNs aren't frequently used
in real-world applications because of the vanishing gradient issue.
CHAPTER 3
DESIGN FLOW
Through our model we aim to produce semantically and visually grounded descriptions of images;
the proposed descriptions are in natural language, i.e. as a human would perceive and describe
them. By leveraging techniques such as CNNs and RNNs, and datasets such as MS-COCO, we strive
to attain human-level perception of given images. Our core insight is that we can leverage
these large image-sentence datasets by treating the sentences as weak labels, in which
contiguous segments of words correspond to some particular, but unknown, location in the image.

Our approach is to infer these alignments and use them to learn a generative model of
descriptions. We develop a deep neural network model that infers the alignment between
segments of sentences and the regions of the image that they describe. We introduce a
Recurrent Neural Network architecture that takes an input image and generates its description
in text. Our experiments show that the generated sentences produce sensible qualitative
predictions.
Computer vision systems will become more dependable for use as personal assistants for
visually impaired persons, and for enhancing their daily lives, as progress is made in
automatic image captioning and scene understanding. In order to bridge the meaning gap
between language and vision, common sense and logic must be incorporated.
6. Data Preparation using Generator: Data preparation is the process of cleaning and
transforming raw data prior to processing and analysis. It is an important step that often
involves reformatting data, making corrections to it, and combining data sets to enrich the
data.
7. Word Embedding: Converting words into vectors (the output of an embedding layer); a minimal
sketch follows this list.
8. Model Architecture: Building an image feature extractor model and a partial caption
sequence model, and merging the two networks.
9. Train Our Model: Training data is the dataset used to train the ML algorithm. It consists
of the sample output data and the corresponding sets of input data that influence the output.
10. Predictions: Prediction refers to the output of an algorithm after it has been trained on
a historical dataset and applied to new data; here it means predicting the caption for a photo.
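As a small illustration of step 7 (an assumed sketch, not the project's exact code), a Keras tokenizer builds the word-to-index mapping and an Embedding layer turns the integer indices into dense vectors:

# Hypothetical sketch of step 7: turning caption words into dense vectors.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

# Illustrative caption; the real project fits the tokenizer on all training captions
captions = ["startseq man riding on three wheeled wheelchair endseq"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)                    # build the word -> index map
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(captions)  # words -> integer indices
embedding = Embedding(input_dim=vocab_size, output_dim=256, mask_zero=True)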
information.
Figure 3.2 CNN and LSTM Model
In dense captioning, captions are generated for each region of the scene. Other methods
generate captions for the whole scene.
Dense Captioning
The previous techniques for creating image captions can only produce one caption for the
entire image. To gather data on various objects, they use different parts of the image. These
techniques do not, however, produce region-specific captions. DenseCap is a method for
captioning images that Johnson et al. proposed. This technique localises every important area
of an image before producing descriptions for each area. A typical method of this category has
the following steps:
(1) Region proposals are generated for the different regions of the given image.
(2) A CNN is used to obtain region-based image features.
(3) The outputs of Step 2 are used by a language model to generate captions for every region.
Long short-term memory, or LSTM, is a form of RNN (recurrent neural network) that is useful
for solving problems involving sequence prediction: we can anticipate the next word based on
the preceding text. By overcoming the RNN's short-term memory restrictions, it has
distinguished itself from the regular RNN as an effective technology. Through the use of a
forget gate, the LSTM can carry relevant information forward while processing inputs and
discard irrelevant information. Compared to conventional RNNs, LSTMs are built to address the
vanishing gradient problem and have a longer memory for information. In order to keep learning
over many time-steps and back-propagate over time and layers, LSTMs are able to retain a
constant error.
To store data outside the RNN's usual flow, LSTMs employ gated cells. The network can use
these cells to change data in a variety of ways, including storing information in them and
reading from them. Individual cells have the ability to make decisions about the information
and can carry out these decisions by opening or closing the gates. LSTMs outperform
conventional RNNs in certain tasks due to their superior long-term memory capabilities.
Because of its chain-like architecture, the LSTM can retain knowledge for longer periods of
time and handle difficult problems that typical RNNs either find difficult or are unable to
handle.
Forget gate - removes information that is no longer necessary for the completion of the task.
This step is essential to optimizing the performance of the network.
Input gate - responsible for adding information to the cells
Output gate - selects and outputs necessary information
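For reference, the three gates can be written in the standard LSTM form (these equations are added here for clarity and are not taken from the original report; \sigma denotes the logistic sigmoid and \odot element-wise multiplication):

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            (input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            (output gate)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)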
An openly accessible benchmark for image-to-sentence description is the Flickr8k dataset. This
collection contains 8000 images, each with five captions. These images came from several Flickr
groups. Each caption provides a thorough overview of the subjects and activities seen in the
photo. The dataset is varied, since it shows a range of scenes and events rather than only
famous people or locations. There are 6000 photographs in the training dataset, 1000 in the
development dataset, and 1000 in the test dataset. The dataset's qualities that make it
suitable for this project are as follows:
• Having numerous labels for a single image makes the model more general and keeps it from
overfitting.
• The image annotation model can work across numerous picture categories thanks to the varied
training image categories, which makes the model more resilient.
Convolutional Neural Network (CNN) layers for feature extraction on input data are paired with
LSTMs to facilitate sequence prediction in the CNN-LSTM architecture. Although we refer to
LSTMs that employ a CNN as a front end by the more general label "CNN LSTM", this architecture
was initially known as a Long-term Recurrent Convolutional Network, or LRCN, model.
To provide textual descriptions of photographs, this architecture is employed. Crucially, a
CNN that has already been trained on a difficult image classification task is repurposed as a
feature extractor for the caption-generation challenge.
3.5.4 Model Advantages
The Flickr8k dataset that we used consists of almost 8000 photographs, with the captions
stored in an accompanying text file. Although deep-learning-based image captioning techniques
have made significant strides in recent years, a reliable technique that can provide
high-calibre captions for almost all photos has not yet been developed. Automatic image
captioning will continue to be a hot research topic for some time to come with the
introduction of novel deep learning network architectures. With more people using social media
every day and the majority of them posting images, the potential for image captioning is very
broad in the future. Therefore, users will benefit from this project.
1. Assistance for visually impaired
The advent of machine learning solutions like image captioning is a boon for visually
impaired people who are unable to comprehend visuals.
With AI-powered image caption generators, image descriptions can be read out to the
visually impaired, enabling them to get a better sense of their surroundings.
2. Recommendations in editing
The image captioning model automates and accelerates the closed captioning process for
digital content production, editing, delivery, and archival.
Well-trained models replace manual efforts for generating quality captions for images as
well as videos.
4. Self-driving cars
By using the captions generated by image caption generator self driving cars become
aware of the surroundings and make decisions to control the car.
5. Reduce vehicle accidents
By installing an image caption generator
3.6 Methodology
3.6.1 Model Overview
The model is trained to maximise the probability p(S|I), where S is the sequence of words
produced by the model and each word S_t is drawn from a lexicon created from the training
dataset. The input image I is fed into a deep vision convolutional neural network (CNN), which
aids object detection in the image. As shown in Figure 3.4, the language-generating Recurrent
Neural Network (RNN) receives the image encoding and uses it to create a relevant phrase for
the image. The model can be compared to a language-translation RNN model, where the goal is to
maximise p(T|S), with T being the translation of the sentence S. However, in our model the
encoder RNN, which transforms an input sentence into a fixed-length vector, is replaced by a
CNN encoder. Recent research has shown that a CNN can easily transform an input image into
such a vector.
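In equation form (the standard encoder-decoder captioning objective, added here for clarity), training maximises the log-likelihood of the caption S = (S_1, ..., S_N) given the image I, summed over the words of the sentence:

\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1})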
We employ the pre-trained VGG16 model for the image classification task; the next section
discusses the model's specifics. The pre-trained VGG16 is then followed by an LSTM (long
short-term memory) network, which accomplishes language generation. LSTM networks differ from
typical neural networks in that they take into account that a current token depends on the
preceding tokens for a sentence to make sense.
The model that existed prior to the project has two distinct input streams, one for image
features and the other for captions from the preprocessed input. The image features are passed
through a dense, fully connected layer to represent them in a different dimension. The input
captions are passed through an embedding layer. After merging, these two input streams are fed
as inputs to an LSTM layer: the caption embedding serves as the LSTM's input while the image
serves as its initial state.
3.6.2 Recurrent Neural Network
A feed-forward neural network with an internal memory is known as a recurrent neural network.
The result for the current input depends on the previous computation, and RNNs are called
recurrent because they carry out the same function for every data input. The output is
created, copied, and then fed back into the recurrent network. When making a decision, the
network takes into account both the current input and what it has learned from the prior
inputs.

A recurrent neural network (RNN) is a type of artificial neural network in which connections
between nodes form a directed graph along a temporal sequence. This allows it to exhibit
temporal dynamic behaviour. RNNs, which are derived from feed-forward neural networks, can
process input sequences of varying length by using their internal state (memory).
Convolutional neural networks (CNN, or ConvNet) are a class of deep neural networks used most
frequently to analyse visual imagery in deep learning. Based on the shared-weight convolution
kernels' translation-invariance properties and shift-invariant architecture, they are often
referred to as shift-invariant or space-invariant artificial neural networks (SIANN). They can
be used in a variety of fields, including image and video recognition, recommender systems,
image classification, image segmentation, and medical image analysis. They can also be used in
brain-computer interfaces, natural language processing, and financial time series.
CNNs are regularised versions of multilayer perceptrons. Multilayer perceptrons are fully
connected networks, in which all of the neurons in one layer are connected to all of the
neurons in the following layer. These networks are vulnerable to overfitting because of their
"fully-connectedness". Common forms of regularisation include penalising weight magnitudes in
the loss function and randomly trimming connections (dropout).
3.6.4 Long Short Term Memory
Long Short-Term Memory (LSTM) networks, Figure 3.8, are a type of recurrent neural network
capable of learning order dependence in sequence prediction problems. This is a behavior
required in complex problem domains like machine translation, speech recognition, and more.
LSTMs are a challenging area of deep learning, and understanding how concepts like
bidirectional and sequence-to-sequence models relate to the field can be difficult. Unlike
traditional feed-forward neural networks, the LSTM has feedback connections. It can process
whole data sequences (such as speech or video) in addition to single data points (like
photos). For instance, LSTM can be used for tasks like connected, unsegmented handwriting
recognition, speech recognition, and network traffic anomaly detection in IDSs (intrusion
detection systems). A typical LSTM unit is made up of a cell, an input gate, an output gate,
and a forget gate. The three gates control the flow of information, and the cell retains
values over arbitrary time intervals.
3.6.5 VGG16
It is regarded as one of the best vision model architectures created to date. The most
distinctive feature of VGG16 is that it consistently uses convolution layers of 3x3 filters
with a stride of 1 and the same padding, and max-pool layers of 2x2 filters with a stride of
2, as shown in Figure 3.9. Convolution and max-pool layers are arranged in this manner
throughout the entire architecture. Two FC (fully connected) layers are present at the very
end, followed by a softmax for the output. The 16 in VGG16 denotes the fact that there are 16
layers with weights. This network has over 138 million parameters, making it a sizeable
network.
In their publication "Very Deep Convolutional Networks for Large-Scale Image Recognition," K.
Simonyan and A. Zisserman from the University of Oxford introduced the convolutional neural
network model known as VGG16. The model achieves 92.7% top-5 accuracy on ImageNet, a dataset
of over 14 million images divided into 1000 classes. It was a well-known model submitted to
ILSVRC-2014. It improves upon AlexNet by sequentially substituting multiple 3x3 kernel-sized
filters for AlexNet's large kernel-sized filters (11x11 and 5x5 in the first and second
convolutional layers, respectively). VGG16 was trained for weeks using NVIDIA Titan Black
GPUs.
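A minimal sketch of how such a pre-trained VGG16 can be repurposed as an image feature extractor in Keras (choosing the 4096-dimensional fc2 layer as the output is an illustrative assumption, not the report's stated choice):

# Hypothetical sketch: pre-trained VGG16 reused as an image feature extractor.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                        # 16 weight layers plus softmax head
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("fc2").output)  # 4096-dim feature vector per image
extractor.summary()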
CHAPTER 4
RESULTS ANALYSIS AND VALIDATION
4.1 Implementation
4.1.1 Pre-Requisites
This project requires good knowledge of deep learning, Python, working with Jupyter
notebooks, the Keras library, NumPy, and natural language processing.
Make sure you have installed all the following necessary libraries:
• keras
• pillow
• numpy
• tqdm
• jupyterlab
The format of our file is image and caption separated by a newline ("\n").
Each image has 5 captions, and we can see that a number #(0 to 4) is assigned to each
caption.
• load_doc( filename ) – For loading the document file and reading the contents inside
the file into a string.
• all_img_captions( filename ) – This function will create a descriptions dictionary that
maps each image to a list of 5 captions. The descriptions dictionary will look
something like the figure.
• cleaning_text( descriptions ) – This function takes all descriptions and performs data
cleaning. This is an important step when working with textual data; depending on the goal, we
decide what type of cleaning to perform on the text. In our case, we will remove punctuation,
convert all text to lowercase, and remove words that contain numbers. So, a caption like
"A man riding on a three-wheeled wheelchair" will be transformed into "man riding on three
wheeled wheelchair", as shown in Figure 4.3.
• text_vocabulary( descriptions ) – This is a simple function that will separate all the
unique words and create the vocabulary from all the descriptions.
• save_descriptions( descriptions, filename ) – This function will create a list of all the
descriptions that have been preprocessed and store them in a file. We will create a
descriptions.txt file to store all the captions. A minimal sketch of these helper functions is
given below.
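The following is a minimal, assumed sketch of the helper functions described above (it assumes the Flickr8k token-file format "image_name#idx<TAB>caption", one entry per line; the exact parsing in the project's code may differ):

# Hypothetical sketches of the caption-preparation helpers.
import string

def load_doc(filename):
    # Read the whole file into a single string
    with open(filename, "r") as f:
        return f.read()

def all_img_captions(filename):
    # Map each image id (without extension or "#n" suffix) to its list of captions
    doc = load_doc(filename)
    descriptions = {}
    for line in doc.split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue
        image_id, caption = tokens[0].split(".")[0], " ".join(tokens[1:])
        descriptions.setdefault(image_id, []).append(caption)
    return descriptions

def cleaning_text(descriptions):
    # Lowercase, strip punctuation, and drop words containing digits
    table = str.maketrans("", "", string.punctuation)
    for image_id, captions in descriptions.items():
        for i, caption in enumerate(captions):
            words = caption.lower().translate(table).split()
            words = [w for w in words if w.isalpha()]
            captions[i] = " ".join(words)
    return descriptions

def text_vocabulary(descriptions):
    # Collect the set of unique words across all captions
    vocab = set()
    for captions in descriptions.values():
        for caption in captions:
            vocab.update(caption.split())
    return vocab

def save_descriptions(descriptions, filename):
    # Write one "image_id caption" pair per line
    lines = []
    for image_id, captions in descriptions.items():
        for caption in captions:
            lines.append(image_id + " " + caption)
    with open(filename, "w") as f:
        f.write("\n".join(lines))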
from tensorflow.keras.applications import ResNet50

model = ResNet50(weights="imagenet", input_shape=(224,224,3))

The function extract_features() will extract features for all images, and we will map image
names to their respective feature arrays. Then we will dump the features dictionary into a
pickle file. This process can take a lot of time depending on your system. I am using an
Nvidia 1050 GPU for training, so it took me around 7 minutes to perform this task. However, if
you are using a CPU then this process might take 1-2 hours. You can comment out the code and
directly load the features from our pickle file. A sketch of such a function is given below.
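A minimal, assumed sketch of extract_features(): it reuses the pre-trained CNN above with its softmax head removed so that each image yields a 2048-length vector; the directory and pickle file names are illustrative, not taken from the report.

# Hypothetical sketch of the feature-extraction step.
import os
import pickle
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def extract_features(directory):
    base = ResNet50(weights="imagenet", input_shape=(224, 224, 3))
    # Drop the final softmax layer so the model outputs a 2048-dim feature vector
    model = Model(inputs=base.input, outputs=base.layers[-2].output)
    features = {}
    for name in os.listdir(directory):
        image = load_img(os.path.join(directory, name), target_size=(224, 224))
        image = preprocess_input(np.expand_dims(img_to_array(image), axis=0))
        features[name] = model.predict(image, verbose=0)
    return features

# Example usage (illustrative paths):
# features = extract_features("Flickr8k_Dataset")
# pickle.dump(features, open("features.p", "wb"))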
• load_photos( filename ) – This will load the text file into a string and return the list
of image names.
• load_features( photos ) – This function will give us the dictionary of image names and
their feature vectors, which we have previously extracted from the Xception model. A minimal
sketch of both loaders follows.
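Minimal, assumed sketches of the two loaders described above (the pickle file name "features.p" is an assumption carried over from the feature-extraction sketch, not the report's stated name):

# Hypothetical sketches of the data-loading helpers.
import pickle

def load_photos(filename):
    # One image name per line in the split file (e.g. the Flickr8k training list)
    with open(filename, "r") as f:
        return [line.strip() for line in f if line.strip()]

def load_features(photos):
    # Keep only the feature vectors belonging to the requested split
    all_features = pickle.load(open("features.p", "rb"))
    return {k: all_features[k] for k in photos}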
task. Our model needs to be trained on 6000 photos, each of which has a 2048-length feature
vector and a caption that is likewise represented as numbers. Because it is impossible to hold
this much data in memory for 6000 photos, we will use a generator method that produces
batches. The generator will yield the input and output sequences.
For example:
The input to our model is [x1, x2] and the output will be y, where x1 is the 2048-length
feature vector of the image, x2 is the input text sequence, and y is the next word of the
output sequence that the model has to predict.
• Feature Extractor – The feature extracted from the image has a size of 2048; with a dense
layer, we reduce the dimensions to 256 nodes.
• Sequence Processor – An embedding layer handles the textual input, followed by the LSTM
layer.
• Decoder – Merging the output from the above two layers, we process it with a dense layer to
make the final prediction. The final layer contains a number of nodes equal to our vocabulary
size.
A visual representation of the final model is given in the figure, and a minimal sketch of the
corresponding Keras model definition follows.
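The sketch below is an assumed Keras rendering of the merge architecture just described (2048-dim image features, a 256-dim internal representation, and vocab_size/max_length as placeholders); the exact layer sizes in the project may differ.

# Hypothetical sketch of the merge model: feature extractor + sequence processor + decoder.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Feature extractor branch: 2048-dim image vector -> 256-dim representation
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)
    # Sequence processor branch: word indices -> embedding -> LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge both branches and predict the next word over the vocabulary
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model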
Now that the model has been trained (Figure 4.4), we will make a separate file,
testing_caption_generator.py, which loads the model and generates predictions. The predictions
contain index values up to the maximum caption length, so we will use the same tokenizer.p
pickle file to map the words back from their index values. A minimal sketch of this greedy
decoding loop is given below.
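A minimal, assumed sketch of the prediction loop in testing_caption_generator.py; the "startseq"/"endseq" boundary tokens and the fitted Keras tokenizer are assumptions based on the description above, not details confirmed by the report.

# Hypothetical sketch of greedy caption decoding.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    # photo_feature is assumed to be a (1, 2048) feature vector for one image
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = np.argmax(model.predict([photo_feature, seq], verbose=0))
        word = tokenizer.index_word.get(yhat)
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()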
CHAPTER 5
CONCLUSION AND FUTURE WORK
In this chapter we throw some light on the conclusions of our project. We also underline the
limitations of our methodology. There is huge potential in this field, as we discuss in the
future scope section of this chapter.
5.1 Conclusion
We reviewed deep-learning-based image captioning techniques in this study. We have provided a
taxonomy of image captioning methods, illustrated general block diagrams of the main groups,
and highlighted the advantages and disadvantages of each. We discussed several evaluation
criteria and datasets, along with their strengths and weaknesses, and briefly summarised the
experimental results. We also briefly discussed potential directions for future research in
this field. The dataset utilised here is Flickr8k, which includes nearly 8000 images with the
corresponding captions stored in a text file. Although deep-learning-based image captioning
techniques have made significant strides in recent years, a reliable technique that can
provide high-calibre captions for almost all photos has not yet been developed, and automatic
image captioning will remain a popular research topic for some time to come with the
introduction of novel deep learning network architectures. With more people using social media
every day and the majority of them posting images, the potential for image captioning is very
broad in the future. Therefore, users will benefit more from this project.
Feature extraction and similarity computation in images are difficult tasks. Current image
retrieval algorithms calculate similarity using features like colour, tags, the histogram, and
other features. Since these approaches are independent of the image's context, the findings
cannot be entirely correct. Therefore, a thorough study of image retrieval using the image
context, such as image captioning, will help to resolve this issue in the future. By including
new image captioning datasets in the project's training, it will be possible to improve the
identification of classes with lower precision. In order to see whether the image retrieval
outcomes improve, this methodology can also be integrated with earlier image retrieval
techniques based on histograms, shapes, etc.
APPENDIX 1
Python
Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling, including
classification, regression, clustering and dimensionality reduction, via a consistent interface
in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.
It was originally called scikits.learn and was initially developed by David Cournapeau as a
Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux,
Alexandre Gramfort, and Vincent Michel, from INRIA (the French Institute for Research in
Computer Science and Automation), took this project to another level and made the first public
release (v0.1 beta) on 1st Feb. 2010.
Prerequisites
Python (>=3.5)
Pandas (>= 0.18.0) is required for some of the scikit-learn examples using data
structure and analysis.
Installation
If you already installed NumPy and Scipy, following are the two easiest ways to install
scikit-learn −
Using pip
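pip install -U scikit-learn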
Using conda
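conda install scikit-learn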
Features
Rather than focusing on loading, manipulating and summarising data, Scikit-learn library is
focused on modeling the data. Some of the most popular groups of models provided by Sklearn
are as follows −
Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like
Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are the part of scikit-
learn.
Unsupervised Learning algorithms − On the other hand, it also has all the popular unsupervised
learning algorithms from clustering, factor analysis, PCA (Principal Component Analysis) to
unsupervised neural networks.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction − It is used for reducing the number of attributes in data which can be
further used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, it is used for combining the predictions of multiple
supervised models.
Feature extraction − It is used to extract the features from data to define the attributes in image
and text data.
Open Source − It is an open-source library and is also commercially usable under the BSD license.
Numpy
NumPy is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by
Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created
NumPy by incorporating features of the competing Numarray into Numeric, with extensive
modifications. NumPy is open-source software and has many contributors.
The Python programming language was not originally designed for numerical computing, but
attracted the attention of the scientific and engineering community early on. In 1995 the special
interest group (SIG) matrix-sig was founded with the aim of defining an array computing
package; among its members was Python designer and maintainer Guido van Rossum, who
extended Python's syntax (in particular the indexing syntax) to make array computing easier.
A new package called Numarray was written as a more flexible replacement for Numeric. Like
Numeric, it too is now deprecated. Numarray had faster operations for large arrays, but was
slower than Numeric on small ones, so for a time both packages were used in parallel for
different use cases. The last version of Numeric (v24.2) was released on 11 November 2005,
while the last version of numarray (v1.5.2) was released on 24 August 2006.
There was a desire to get Numeric into the Python standard library, but Guido van Rossum
decided that the code was not maintainable in its state then.
In early 2005, NumPy developer Travis Oliphant wanted to unify the community around a single
array package and ported Numarray's features to Numeric, releasing the result as NumPy 1.0 in
2006. This new project was part of SciPy. To avoid installing the large SciPy package just to get
an array object, this new package was separated and called NumPy. Support for Python 3 was
added in 2011 with NumPy version 1.5.0.
Features
Using NumPy in Python gives functionality comparable to MATLAB since they are both
interpreted, and they both allow the user to write fast programs as long as most operations work
on arrays or matrices instead of scalars. In comparison, MATLAB boasts a large number of
additional toolboxes, notably Simulink, whereas NumPy is intrinsically integrated with Python, a
more modern and complete programming language. Moreover, complementary Python packages
are available; SciPy is a library that adds more MATLAB-like functionality and Matplotlib is a
plotting package that provides MATLAB-like plotting functionality. Internally, both MATLAB
and NumPy rely on BLAS and LAPACK for efficient linear algebra computations.
Installing Numpy
The only prerequisite for installing NumPy is Python itself. If you don’t have Python yet and
want the simplest way to get started, we recommend you use the Anaconda Distribution - it
includes Python, NumPy, and many other commonly used packages for scientific computing and
data science.
CONDA
If you use conda, you can install NumPy from the defaults or conda-forge channels:
# Best practice, use an environment rather than install in the base env
conda create -n my-env
conda activate my-env
# If you want to install from conda-forge
conda config --env --add channels conda-forge
# The actual install command
conda install numpy
PIP
If you use pip, you can install NumPy with:
pip install numpy
Librosa
Librosa is a Python package for music and audio analysis. Librosa is mainly used when we work
with audio data, as in music generation (using LSTMs) or automatic speech recognition.
Installation instructions
pypi
The simplest way to install librosa is through the Python Package Index (PyPI). This will ensure
that all required dependencies are fulfilled. This can be achieved by executing the following
command:
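pip install librosa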
It provides the building blocks necessary to create music information retrieval systems.
Librosa helps to visualize audio signals and also perform feature extraction on them using
different signal processing techniques.
Source
If you’ve downloaded the archive manually from the releases page, you can install using the
setuptools script:
cd librosa-VERSION/
python setup.py install
If you intend to develop librosa or make changes to the source code, you can install with pip
install -e to link to your actively developed source tree:
OSX users can use homebrew to install ffmpeg by calling brew install ffmpeg or get a binary
version from their website https://www.ffmpeg.org.
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is
also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely
resemble that of MATLAB, though its use is discouraged.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. Matplotlib makes easy things easy and hard things possible.
Seaborn
Seaborn is a Python data visualization library based on matplotlib; it provides a high-level
interface for drawing attractive and informative statistical graphics.
ANNEXURE-1
SAMPLE CODE:
DATA CLEANING:
import re

def clean_text(sentence):
    # Lowercase, strip non-alphabetic characters, and rejoin the remaining words
    sentence = sentence.lower()
    sentence = re.sub("[^a-z]+", " ", sentence)
    words = sentence.split()
    return " ".join(words)
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(train_descriptions, encoding_train, word_to_idx, max_len, batch_size):
    vocab_size = len(word_to_idx) + 1
    X1, X2, y = [], [], []
    n = 0
    while True:
        for key, desc_list in train_descriptions.items():
            n += 1
            photo = encoding_train[key + ".jpg"]
            for desc in desc_list:
                # Encode the caption and emit one (partial sequence -> next word) pair per position
                seq = [word_to_idx[w] for w in desc.split() if w in word_to_idx]
                for i in range(1, len(seq)):
                    xi = pad_sequences([seq[:i]], maxlen=max_len, padding="post", value=0)[0]
                    yi = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(photo)
                    X2.append(xi)
                    y.append(yi)
            if n == batch_size:
                yield [[np.array(X1), np.array(X2)], np.array(y)]
                X1, X2, y = [], [], []
                n = 0