
IMAGE CAPTION GENERATOR USING DEEP LEARNING

ABSTRACT
Generating image captions automatically is a challenging task. Nevertheless, advances in artificial intelligence and deep learning now make it possible to identify the objects in an image, infer the relationships between them, and predict a description of its content. Producing a grammatically correct sentence for an image requires that the image first be understood clearly. In this article, the content of the image is extracted using a Convolutional Neural Network and fed into an RNN/LSTM-based model to produce highly accurate captions; the CNN serves as the encoder and the RNN/LSTM model as the decoder. Advances in Deep Learning and Natural Language Processing have made this level of accuracy attainable. We use the GloVe model for word representation and the Flickr8k dataset for training and evaluation. This work also covers the use of Greedy Search and Beam Search to select appropriate words for the caption. Current research on image caption generation helps blind users comprehend images and helps social media users generate hashtags for their posts. Furthermore, this approach reduces the need for human interpretation.

KEYWORDS: Deep Learning, Convolutional Neural Networks, Recurrent Neural Networks, Long Short-Term Memory, Global Vectors, Inception V3.

I. INTRODUCTION

In the rapidly developing field of artificial intelligence, we use a Convolutional Neural Network (CNN) as an encoder to extract powerful features from images. These features are then processed by a decoder model based on the principles of recurrent neural networks (RNN) and long short-term memory (LSTM). This combination enables us to produce descriptions of photographs that are both informative and appropriate to their context, which can help people who are blind or visually impaired understand the content of images on the internet. [1]

For word representation, we are leveraging the Global Vectors for Word
Representation (GloVe) model. This unsupervised learning algorithm provides vector
representations for words, encapsulating various aspects of their meanings based on their co-
occurrence statistics in a corpus of text. The challenge lies in selecting the appropriate word
for the caption that maintains its meaning when combined both grammatically and
contextually.
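As an illustration, the following is a minimal sketch of how pre-trained GloVe vectors can be loaded into an embedding matrix for the caption vocabulary. The file name, the 200-dimensional vectors, and the word_index mapping (as produced by a Keras tokenizer) are assumptions for illustration rather than details fixed by this work.

```python
import numpy as np

def load_glove_embeddings(glove_path, word_index, embed_dim=200):
    """Map each vocabulary word to its GloVe vector; words not found stay as zero vectors."""
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Row i of the matrix holds the vector for the word with integer index i.
    matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Example usage (file name and tokenizer are assumptions):
# embedding_matrix = load_glove_embeddings("glove.6B.200d.txt", tokenizer.word_index)
```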

The model takes an image I as input and maximizes the probability p(S|I) of producing a sequence of words S = {S1, S2, S3, …}, where each word St comes from the vocabulary and the sequence as a whole describes the image precisely. To further enhance our model’s performance we utilize Inception V3, a pre-trained CNN model developed by Google. Known for its high performance in object recognition tasks, Inception V3, with its intricate architecture and multiple layers, aids in extracting detailed features from the images. [2]
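Following the standard neural captioning framework cited in the Related Works section, this objective corresponds to maximizing the log-likelihood of the correct sentence, which factorizes over words by the chain rule. The sketch below states the objective; treating the start of the sequence as an implicit start token is our assumption.

```latex
% Chain-rule factorization of the caption likelihood; N is the sentence length.
\log p(S \mid I) = \sum_{t=1}^{N} \log p\big(S_t \mid I, S_1, \ldots, S_{t-1}\big)
```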
This comprehensive approach, combining the strengths of CNN, RNN/LSTM, GloVe,
and Inception V3, paves the way for highly accurate image caption generation. This not only
assists in comprehending the content of images but also finds applications in aiding visually
impaired individuals and generating hashtags for social media posts. This endeavor stands as
a testament to the advancements in Deep Learning and Natural Language Processing, and
their potential in transforming the way we interact with digital imagery.

II. RELATED WORKS

The paper titled “Where to put the Image in an Image Caption Generator” by Marc Tanti and
Albert Gatt, published in 2018, delves into the intricacies of four distinct Recurrent Neural
Network (RNN) architectures used in image caption generation. These include the Init-inject,
Pre-inject, Par-inject, and Merge architectures. The paper provides an in-depth analysis of
how these architectures handle image vectors and word vectors differently. It also offers a comparative study of the advantages of each architecture relative to the others. Furthermore, the
paper discusses the challenges faced when implementing these architectures, providing
insights from both scientific and engineering perspectives. This comprehensive study offers
valuable insights into the complexities and potential of using RNNs in caption generation. [1]

In “Show and Tell: A Neural Image Caption Generator” (Google, 2015), Oriol Vinyals, Alexander Toshev, et al. discuss the achievement of state-of-the-art performance by integrating Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks for sequence tasks such as translation. The paper introduces a neural and probabilistic framework for image description and reviews recent advances in statistical machine translation within sequence models. It aims to achieve superior results by maximizing the probability of the correct description, optimizing the sum of log probabilities over the entire training set using stochastic gradient descent. The paper also compares the average BLEU scores obtained with greedy and beam search methods. This comprehensive study contributes significantly to the understanding of sequence tasks and translation models. [2]

In “Visual Image Caption Generator Using Deep Learning” (2019), Grishma Sharma, Priyanka Kalena, et al. describe a system composed of three distinct models. The first is the feature extraction model, VGG16, which processes the input image and extracts relevant features. The second and third are the encoder and decoder models, which work in tandem to generate a caption that accurately describes the image. The paper also presents a comparative study of two popular approaches to sequence modelling, Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), analysing their performance on the task of image caption generation. A noteworthy aspect of the system is its use of VGG16, a 16-layer Convolutional Neural Network (CNN) renowned for its effectiveness in image classification, for feature extraction. This feature extraction step could potentially be enhanced by using other classification networks such as GoogLeNet, AlexNet, or Inception V3. [4]
III. SYSTEM ARCHITECTURE

Fig1. Model Architecture

The system illustrated in Figure 1, as detailed in this research, employs a hybrid model comprising a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to process individual image inputs and generate descriptive captions. In this architecture, the encoder converts its input into a fixed-size vector representation, much as an encoder in machine translation converts a variable-length sentence into a fixed-size vector. This vector serves as the initial hidden state for the decoder, which then sequentially generates output in the form of coherent and meaningful sentence descriptions.

A deep CNN is used as the encoder because it is well suited to producing a high-quality, fixed-size embedding of the input image. We pre-train the CNN for image classification and extract the last hidden layer of the network as the image representation. This representation is then fed into the decoder, an LSTM, to generate a sentence description of the input image. This approach produces descriptions of the input image that are both accurate and natural-sounding.
Fig2. LSTM Architecture.
In Figure 2, the architecture highlights the use of a Convolutional Neural Network (CNN) as the encoder, specifically the InceptionV3 model, to extract high-level features from input images. These features are then fed into the Long Short-Term Memory (LSTM) network, which serves as the decoder in the proposed model. After the encoding step, in which the image is transformed into a fixed-size vector representation, the LSTM decoder processes this information sequentially; its pivotal role lies in generating coherent and meaningful sentence descriptions. By using a CNN as the encoder and an LSTM as the decoder, the model harnesses both image feature extraction and sequential context preservation for accurate and nuanced image caption generation.
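To make the encoder-decoder wiring concrete, the following is a minimal Keras sketch of the architecture described above. The specific dimensions (2048-dimensional InceptionV3 features, 256 LSTM units, 200-dimensional GloVe embeddings), vocabulary size, and maximum caption length are illustrative assumptions rather than values fixed by this work.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout
from tensorflow.keras.models import Model

vocab_size = 8000    # assumed vocabulary size after preprocessing
max_length = 35      # assumed maximum caption length
embed_dim = 200      # GloVe embedding dimension (assumption)
units = 256          # LSTM state size (assumption)

# Encoder side: project the fixed-size image feature vector into the LSTM's
# initial hidden and cell states, as described above.
image_features = Input(shape=(2048,))
h0 = Dense(units, activation="relu")(image_features)
c0 = Dense(units, activation="relu")(image_features)

# Decoder side: embed the partial caption, run it through the LSTM starting
# from the image-derived state, then predict the next word.
caption_input = Input(shape=(max_length,))
embedded = Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
lstm_out = LSTM(units)(embedded, initial_state=[h0, c0])
output = Dense(vocab_size, activation="softmax")(Dropout(0.5)(lstm_out))

model = Model(inputs=[image_features, caption_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```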

IV. METHODOLOGY

i. DATASET
The dataset utilized in our study is Flickr8K, which comprises 8,091 JPEG images, each accompanied by at least five related captions. The dataset is divided into three subsets: approximately 6,000 images for training, 1,000 for validation, and 1,000 for testing. This division facilitates a robust evaluation of the model’s performance.
Fig3. Samples from the Flickr8K
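A minimal sketch of loading this split is shown below; the file names follow the commonly distributed Flickr8K text files and are assumptions about the local setup rather than details specified in this work.

```python
def load_image_ids(split_file):
    """Read one image identifier per line from a Flickr8K split file."""
    with open(split_file) as f:
        return [line.strip() for line in f if line.strip()]

train_ids = load_image_ids("Flickr_8k.trainImages.txt")  # ~6,000 images
val_ids   = load_image_ids("Flickr_8k.devImages.txt")    # ~1,000 images
test_ids  = load_image_ids("Flickr_8k.testImages.txt")   # ~1,000 images
```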

ii. PREPROCESSING

The images from the Flickr8K dataset are first resized to a uniform dimension, ensuring
consistency and reducing computational complexity. This step aids in stabilizing the training
process and significantly reduces the training time. For text pre-processing, the captions
associated with the images are first tokenized, which involves breaking down the text into
individual words or tokens. Stop words, which are common words that do not contribute to
the meaning of the sentences, are removed. All remaining words are then converted to
lowercase to maintain uniformity and reduce the complexity of the model. We utilize the
InceptionV3 model, which is already trained on the ImageNet dataset, for the extraction of
features from the images. The benefit of employing a pre-trained model like InceptionV3 lies
in its ability to leverage robust features learned from a substantial dataset. These features can
be effectively applied to our task through the concept of transfer learning. This approach not
only accelerates the training process but also reduces the amount of data required compared
to building a model from the ground up. The extracted features and pre-processed captions
are then fed into the caption generation model. The model learns to map the image features to
the corresponding captions, thereby learning to generate contextually relevant captions for the
images.
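The caption-side preprocessing described above can be sketched as follows. The use of NLTK stop words and the 'startseq'/'endseq' boundary tokens are common conventions assumed here for illustration; the paper itself does not name a specific library or specific tokens.

```python
# Requires: nltk.download("stopwords") to have been run once.
import re
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer

stop_words = set(stopwords.words("english"))

def clean_caption(caption: str) -> str:
    """Lowercase, keep alphabetic words only, drop stop words, add boundary tokens."""
    words = re.sub(r"[^a-zA-Z ]", "", caption).lower().split()
    words = [w for w in words if w not in stop_words and len(w) > 1]
    return "startseq " + " ".join(words) + " endseq"

# Example caption; in practice these come from the Flickr8K captions file.
all_captions = ["A child in a pink dress is climbing up a set of stairs ."]
cleaned = [clean_caption(c) for c in all_captions]

# Fit a tokenizer on the cleaned captions to build the vocabulary.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
vocab_size = len(tokenizer.word_index) + 1
```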

iii. MODEL

A. FEATURE EXTRACTION MODEL

In the realm of image caption generation, the Feature Extraction Model assumes a pivotal
role, tasked with extracting intricate details and high-level features from input images. To
achieve this, our approach leverages the renowned Inception V3 architecture, a state-of-the-
art convolutional neural network (CNN) originally designed for image classification tasks.
Notably, our objective in this phase transcends image classification, focusing instead on
obtaining a comprehensive and fixed-size vector representation of the visual content.

A distinctive aspect of our approach involves the deliberate removal of the softmax layer
from the Inception V3 model. Typically inherent in image classification architectures for
generating class probabilities, the softmax layer is extraneous to our feature extraction goals.
By eliminating this layer, we preserve the raw feature representation, untethered from
classification-oriented constraints.

Furthermore, to accommodate the nuances of the Inception V3 architecture, we meticulously pre-process input images before subjecting them to feature extraction. This preprocessing
step, characterized by reshaping images to the model's anticipated dimensions (299 x 299),
ensures that the input data aligns seamlessly with the model's expectations. The resultant
fixed-size vector becomes a robust and informative representation of the visual features
within the image.

Through the orchestrated interplay of these steps, the Feature Extraction Model sets the stage
for subsequent phases in our image captioning pipeline. This strategic combination of
architectural considerations allows our model to distill intricate visual details and lay the
groundwork for the generation of accurate and contextually rich image captions.
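A minimal sketch of this feature-extraction step, using the Keras InceptionV3 model pre-trained on ImageNet with its final softmax layer removed, is shown below; the helper function name is our own.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load InceptionV3 pre-trained on ImageNet and strip the final softmax layer,
# keeping the 2048-dimensional pooled activation as the image representation.
base = InceptionV3(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(image_path):
    """Resize to 299 x 299, apply InceptionV3 preprocessing, return a 2048-d vector."""
    image = load_img(image_path, target_size=(299, 299))
    array = img_to_array(image)
    array = preprocess_input(np.expand_dims(array, axis=0))
    return feature_extractor.predict(array, verbose=0)[0]
```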
B. THE ENCODER MODEL

The Encoder Model constitutes a pivotal phase in our image captioning architecture,
leveraging a Convolutional Neural Network (CNN) to process the high-level features
extracted by the Inception V3 model during the Feature Extraction phase. Unlike traditional
image classification applications, the encoder in our model focuses on distilling relevant
information from the visual input to create a compact and informative representation.

The Encoder Model seamlessly integrates with the Inception V3 architecture's output,
inheriting the refined features extracted from the input images. These features, encapsulated
in a fixed-size vector, serve as the foundation for subsequent processing by the encoder. By
bridging the gap between feature extraction and language generation, this integration ensures
that the model capitalizes on both spatial and contextual information embedded in the visual
content.

The CNN, acting as the encoder, performs a critical role in further processing the extracted
features. Its hierarchical structure allows it to capture intricate spatial hierarchies within the
images, identifying patterns and relevant visual elements. The convolutional layers act as
effective filters, extracting essential visual information and transforming it into a format
conducive to the subsequent stages of the model.

The encoder's function is to abstract and condense the visual information from the input
images into a format that can be seamlessly fed into the subsequent language generation
phase. This abstraction facilitates the creation of a bridge between the visual and textual
domains, enabling the model to generate coherent and contextually relevant captions for the input images.

C. THE DECODER MODEL

The Decoder Model plays a central role in our image captioning architecture, responsible for
generating meaningful and contextually relevant sentence descriptions from the visual
features processed by the Encoder Model. In this phase, we utilize a Long Short-Term
Memory (LSTM) network, chosen for its sequential processing capabilities and proficiency in
capturing dependencies within sequential data.

The Decoder Model seamlessly integrates with the fixed-size vector representation derived
from the Encoder Model. This vector serves as the initial hidden state for the LSTM network,
providing a contextual starting point for the generation of textual descriptions. By
incorporating the encoded visual features into the language generation process, our model
achieves a harmonious fusion of visual and textual information.

The choice of LSTM as the decoding mechanism is strategic. LSTMs excel in capturing long-
range dependencies and contextual nuances within sequential data, making them well-suited
for generating coherent and natural-sounding language. As the LSTM processes the encoded
visual features, it sequentially generates words, refining its understanding of the context with
each iteration.

The primary objective of the Decoder Model is to transform the encoded visual features into
a meaningful sentence description. The LSTM, through its recurrent connections, maintains
memory of previously generated words, allowing it to create contextually relevant and
grammatically coherent captions. This iterative language generation process continues until a
predefined stopping condition is met or the caption reaches a specified length.

By employing the LSTM-based Decoder Model, our approach aims to produce image
captions that are both accurate and natural-sounding. The synergy between visual information
and linguistic context ensures that the generated descriptions not only faithfully represent the
content of the input image but also adhere to the intricacies of human language.
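The iterative generation process described above can be sketched as a simple greedy decoding loop. The model, tokenizer, maximum length, and 'startseq'/'endseq' tokens are carried over from the earlier sketches and remain illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    """Greedily build a caption one word at a time until 'endseq' or the length limit."""
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_features.reshape(1, -1), seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```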

V. RESULT AND EVALUATION

Greedy Search and Beam Search

Our image captioning framework incorporates two distinct search methodologies: Greedy
Search and Beam Search, employed for predicting the most suitable words in image captions.
Renowned for its computational efficiency, Greedy Search opts for the word with the highest
probability during each decoding step, offering simplicity ideal for real-time applications.
However, this streamlined approach may lead to suboptimal local selections. Conversely,
Beam Search adopts an exploration-centric stance by maintaining multiple candidate
sequences, fostering diverse exploration of potential word sequences. The equilibrium
between Greedy Search's simplicity and Beam Search's diversity is meticulously addressed in
our model, exerting a substantial influence on the overall quality and contextual richness of
the generated captions. Our research features a comprehensive evaluation of both strategies,
providing insights into their distinct contributions to the image captioning process.
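For comparison with the greedy decoder sketched in the previous section, the following is a minimal beam-search sketch under the same assumptions; the beam width of 3 is an illustrative choice.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, tokenizer, photo_features, max_length, beam_width=3):
    """Keep the beam_width highest-scoring partial captions at each decoding step."""
    start = tokenizer.texts_to_sequences(["startseq"])[0]
    beams = [(start, 0.0)]                       # (token ids, cumulative log-probability)
    end_id = tokenizer.word_index.get("endseq")

    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_id:             # finished caption: carry forward unchanged
                candidates.append((tokens, score))
                continue
            seq = pad_sequences([tokens], maxlen=max_length)
            probs = model.predict([photo_features.reshape(1, -1), seq], verbose=0)[0]
            # Expand this beam with its beam_width most probable next words.
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((tokens + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]

    best = beams[0][0]
    words = [tokenizer.index_word[i] for i in best if i not in (0, end_id)]
    return " ".join(w for w in words if w != "startseq")
```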

Fig4. A selection of the evaluation results of our model.

VI. CONCLUSION
In conclusion, our research endeavors to advance the field of image captioning by employing
a meticulously designed model that integrates Greedy Search and Beam Search as distinct
search strategies. The synergistic fusion of these methodologies allows for a nuanced
exploration of their respective strengths and trade-offs, contributing to the richness and
quality of the generated captions.

Greedy Search, known for its computational efficiency, emerges as a practical choice for
real-time applications. However, its simplicity may result in locally optimal decisions,
prompting the exploration of alternative strategies. On the other hand, Beam Search, with its
exploration-centric approach, introduces diversity into the generated sequences, potentially
capturing more globally optimal solutions. The delicate balance struck between these search
strategies influences the contextual richness and overall quality of our image captions.

In essence, our work extends beyond the generation of image captions, serving as a testament
to the continuous exploration and refinement of methodologies in the evolving landscape of
computer vision and natural language processing. We hope that our findings inspire further
research and innovation, propelling the field towards more sophisticated and contextually
aware image captioning systems.

REFERENCES
[1] Marc Tanti and Albert Gatt, (2018), “Where to put the Image in an Image Caption
Generator”. Institute of Linguistics and Language Technology, Kenneth P. Camilleri,
arXiv:1703.09137v2 [cs.NE] 14 Mar 2018

[2] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Google (2015) “Show
and Tell: A Neural Image Caption Generator”. arXiv:1411.4555v2 [cs.CV] 20 Apr 2015.

[3] Palak Kabra, Mikir Gharat, Dhiraj Jha, Shailesh Sangle, “Image Caption Generator Using
Deep Learning”, October (2022) International Journal for Research in Applied Science &
Engineering Technology (IJRASET) ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor:
7.538 Volume 10 Issue X Oct 2022.

[4] Grishma Sharma, Priyanka Kalena, Nishi Malde, Aromal Nair, Saurabh Parkar “Visual
Image Caption Generator Using Deep Learning” 2nd International Conference on Advances
in Science & Technology (ICAST-2019) SSRN Electronic Journal · January 2019 DOI:
10.2139/ssrn.3368837

[5] Reshmi Sasibhooshan, Suresh Kumaraswamy and Santhoshkumar Sasidharan, “Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction”, Journal of Big Data (2023) 10:18. https://doi.org/10.1186/s40537-023-00693-9.

[6] Tanvi S. Laddha, Darshak G. Thakore and Udesang K. Jaliya “Generating Human-Like
Descriptions for the Given Image Using Deep Learning” ITM Web of Conferences 53, 02001
(2023) ICDSIA-2023 https://doi.org/10.1051/itmconf/20235302001.

[7] Namitha O, Kavitha D, “Image Caption Generator Using Deep Learning Model” (2023), Vellore Institute of Technology, Chennai. Eur. Chem. Bull. 2023, 12(3), 1345-1351.
[8] Omkar Nithin Shinde, Tishikesh Gawde, Anurag Paradkar, “Social Media Image Caption
Generation Using Deep Learning” International Journal of Engineering Development and
Research (www.ijedr.org) ISSN: 2321-9939 | ©IJEDR 2020 Year 2020, Volume 8, Issue 4.

[9] P. Aishwarya Naidu, Satvik Vats, Gehna Anand and Nalina V, “A Deep Learning Model
for Image Caption Generation”, International Journal of Computer Sciences and Engineering
Vol.8, Issue.6, June 2020 E-ISSN: 2347-2693 DOI: https://doi.org/10.26438/ijcse/v8i6.1017

[10] Megha J Panicker, Vikas Upadhayay, Gunjan Sethi, Vrinda Mathur, “Image Caption
Generator” International Journal of Innovative Technology and Exploring Engineering
(IJITEE) ISSN: 2278-3075 (Online), Volume-10 Issue-3, January 2021.

[11] Aishwarya Mark, Sakshi Adokar, Vageshwari Pandit, Rutuja Hambarde, Prof. Swapnil
Patil, “Review on Image Caption Generation” International Journal of Advanced Research in
Science, Communication and Technology (IJARSCT) Volume 2, Issue 3, April 2022 ISSN
(Online) 2581-9429

[12] Sanket Veer, Archana Chaudhari, "Image Caption Generator in text and audio using
Neural Networks" IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p):
2278-8719 September 2021.
