
Automated Image Captioning using Deep Learning
By: Petchi Mani. A
Guide: Mr. N.V. Ravindhar
OBJECTIVE

The aim of this project is to develop an image captioning system that uses a pre-trained convolutional neural network (CNN) and a long short-term memory (LSTM) model: the CNN extracts features from an image, which are then used to generate a text description of it.

An image caption generator has become the need of the hour, whether for visually impaired people or for social media enthusiasts. In particular, it can help visually impaired people understand image content on the web.

The model will be trained so that, given an image, it produces a caption that closely describes the image's content.

ABSTRACT

In this project, we use a convolutional neural network (CNN) and a long short-term memory (LSTM) network to train a model that can describe an image by understanding its features and providing relevant information. Image captioning is the process of generating a description of what is going on in an image. In the last few years, the problem of automatically generating descriptive sentences for images has gained rising interest in natural language processing (NLP). It is a task in which each image must be understood properly so that a suitable caption with correct grammatical structure can be generated; describing a picture requires well-structured English phrases. Automatically describing image content is very helpful for visually impaired people, since it removes ambiguity about an image's meaning and, in turn, avoids discrepancies in knowledge acquisition. We use a CNN-LSTM architecture in which the CNN layers extract features from the input image and the LSTM carries relevant information through the generation process, with the current word acting as input for the prediction of the next word. The basic idea is that users get automated captions when the system is deployed on social media or in other applications.
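To make the architecture concrete, below is a minimal sketch of such a CNN-LSTM captioning model in Keras. The feature dimension (4096, as produced by VGG16's fc2 layer), the vocabulary size, the maximum caption length, and all layer sizes are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch of a merge-style CNN-LSTM captioning model (assumed sizes).
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_len = 34        # assumed maximum caption length

# Image branch: features already extracted by a pre-trained CNN (e.g. VGG16 fc2).
img_in = Input(shape=(4096,))
img_feat = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Text branch: the partial caption generated so far.
txt_in = Input(shape=(max_len,))
txt_feat = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(txt_in))

# Combine both branches and predict a distribution over the next word.
merged = Dense(256, activation='relu')(add([img_feat, txt_feat]))
out = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

Training then pairs each image's feature vector with every prefix of its caption, using the following word as the target.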

LITERATURE SURVEY

“Where to put the Image in an Image Caption Generator” by Marc Tanti and Albert Gatt, published in 2018
This paper delves into the intricacies of four distinct Recurrent Neural
Network (RNN) architectures used in image caption generation. These
include the Init-inject, Pre-inject, Par-inject, and Merge architectures. The
paper provides an in-depth analysis of how these architectures handle
image vectors and word vectors differently.
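As a rough illustration (not the authors' code; layer sizes assumed), the difference between two of these architectures can be sketched in Keras: init-inject feeds the image vector into the RNN as its initial state, while merge keeps the image outside the RNN entirely.

```python
# Illustrative contrast of init-inject vs. merge (assumed sizes, not the paper's code).
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len, dim = 8000, 34, 256

img = Input(shape=(4096,))
img_vec = Dense(dim, activation='relu')(img)      # projected image vector
words = Input(shape=(max_len,))
emb = Embedding(vocab_size, dim, mask_zero=True)(words)

# Init-inject: the image vector initialises the LSTM state, so the RNN
# "starts from" the image before reading any words.
init_out = LSTM(dim)(emb, initial_state=[img_vec, img_vec])

# Merge: the image never enters the RNN; it is combined with the caption
# encoding only after the LSTM has processed the words.
merge_out = add([img_vec, LSTM(dim)(emb)])

init_model = Model([img, words], Dense(vocab_size, activation='softmax')(init_out))
merge_model = Model([img, words], Dense(vocab_size, activation='softmax')(merge_out))
```

Pre-inject and par-inject instead pass the image vector into the RNN as if it were the first word, or alongside every word, respectively.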

Oriol Vinyals, Alexander Toshev et al., Google (2015), “Show and Tell: A Neural Image Caption Generator”
This paper discusses the achievement of state-of-the-art performance by
integrating Recurrent Neural Networks (RNN) and Long Short-Term
Memory (LSTM) networks for sequence tasks such as translation. It
introduces a neural and probabilistic framework for image descriptions
and delves into the recent advancements in statistical machine translation
within sequence models.
LITERATURE SURVEY

Grishma Sharma, Priyanka Kalena, et al. (2019), “Visual Image Caption Generator Using Deep Learning”
The paper discusses an intricate system composed of three distinct
models. The first is the feature extraction model, known as VGG16, which
is responsible for processing the input image and extracting relevant
features. The second and third are the encoder and decoder models, which
work in tandem to generate a caption that accurately describes the image.
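As a minimal sketch of that VGG16 feature-extraction step (standard Keras usage; the image path is a placeholder): the classifier layer is dropped and the 4096-dimensional fc2 output is kept as the image feature.

```python
# Sketch: extract a 4096-d feature vector from an image with VGG16.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16()  # include_top=True by default
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)  # fc2 layer

img = load_img('example.jpg', target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
features = extractor.predict(x)  # shape (1, 4096)
```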

Tanvi S. Laddha, Darshak G. Thakore and Udesang K. Jaliya, “Generating Human-Like Descriptions for the Given Image Using Deep Learning”, ITM Web of Conferences 53, 02001 (2023)
The paper under discussion utilizes three distinct datasets for its study:
Flickr8k, Flickr30k, and MSCOCO. Each of these datasets offers a unique
set of images and associated captions, providing a rich and diverse source
of data for the image caption generation task.
EXISTING SYSTEM

• In a CNN-CNN based model, where a CNN is used for both encoding and decoding, the loss is high, which is not acceptable.

• A CNN-RNN based model achieves lower loss than the CNN-CNN model, but its training time is longer.

• Training time affects the overall efficiency of the model.

PROPOSED SYSTEM

• The task is to build a system that takes an image as input, in the form of a multi-dimensional array, and generates as output a sentence that describes the image and is syntactically and grammatically correct (a sketch of this generation loop follows below).

• Instead of the traditional CNN-RNN model, which hinders the recurrent network from being trained efficiently, the proposed system uses a CNN-LSTM model to generate descriptive text captions.
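A minimal sketch of that generation loop, assuming a trained model, a Keras Tokenizer, and the common startseq/endseq caption markers (all assumptions, not project code): the model predicts one word at a time, with the caption so far fed back as input.

```python
# Sketch: greedy decoding of a caption, one word per step.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_len):
    caption = 'startseq'
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':   # stop at the end marker
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()
```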

Modules

• Data Pre-processing / Data Cleaning (see the sketch after this list)
• Feature Extraction
• Load the dataset for the model
• Training the model
• Testing the model
• Caption Generation
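As an illustration of the first module, here is a sketch of caption cleaning under common conventions (lowercasing, stripping punctuation and short or non-alphabetic tokens, wrapping with the start/end markers the decoder relies on); the exact rules are assumptions, not the project's code.

```python
# Sketch: clean one raw caption for training.
import string

def clean_caption(caption):
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return 'startseq ' + ' '.join(words) + ' endseq'

print(clean_caption('A dog runs across the grass .'))
# -> 'startseq dog runs across the grass endseq'
```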

Result

PROJECT ENVIRONMENT

HARDWARE REQUIREMENTS:
 System – Intel Dual Core or better
 Speed – 2.5 GHz
 Hard Disk – 40 GB or more
 Monitor – 15-inch VGA color
 RAM – 2 GB or more

SOFTWARE REQUIREMENTS:
 Operating System – Windows 10
 Programming Language – Python
 IDE – Jupyter Notebook / Google Colab

REFERENCE

▰ [1] Marc Tanti, Albert Gatt and Kenneth P. Camilleri (2018), “Where to put the Image in an Image Caption Generator”, Institute of Linguistics and Language Technology. arXiv:1703.09137v2 [cs.NE], 14 Mar 2018.
▰ [2] Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan, Google (2015), “Show and Tell: A Neural Image Caption Generator”. arXiv:1411.4555v2 [cs.CV], 20 Apr 2015.
▰ [3] Palak Kabra, Mikir Gharat, Dhiraj Jha and Shailesh Sangle (2022), “Image Caption Generator Using Deep Learning”, International Journal for Research in Applied Science & Engineering Technology (IJRASET), Volume 10, Issue X, October 2022. ISSN: 2321-9653.
▰ [4] Grishma Sharma, Priyanka Kalena, Nishi Malde, Aromal Nair and Saurabh Parkar (2019), “Visual Image Caption Generator Using Deep Learning”, 2nd International Conference on Advances in Science & Technology (ICAST-2019), SSRN Electronic Journal, January 2019. DOI: 10.2139/ssrn.3368837.
▰ [5] Reshmi Sasibhooshan, Suresh Kumaraswamy and Santhoshkumar Sasidharan (2023), “Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction”, Journal of Big Data 10:18. https://doi.org/10.1186/s40537-023-00693-9.

THANKS!
Any questions?

