Image Captioning Using Deep Learning
By: Petchi Mani. A
Guide: Mr. N.V. Ravindhar
OBJECTIVE
The aim of this project is to develop an image captioning system that combines a pre-trained convolutional neural network
(CNN) with a long short-term memory (LSTM) model: the CNN extracts features from an image, and those features are
used to generate a text description of the image.
An image caption generator has become the need of the hour, be it for visually impaired people or for social media
enthusiasts. In particular, visually impaired people can use it to understand the content of images on the web.
The model will be trained so that, given an image, it produces a caption that closely describes the image.
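The CNN side of the objective above, extracting visual features from pixel data, can be illustrated with a minimal sketch. This is not the project's actual network (which would use a pre-trained CNN); it is a single hand-built convolution, ReLU, and max-pooling pass in NumPy, with a toy edge-detector kernel chosen purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) over a grayscale image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling that halves each spatial dimension."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A toy 6x6 "image" with a vertical dark-to-bright edge, and a kernel that
# responds to exactly that pattern.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

features = max_pool(relu(conv2d(image, kernel)))
print(features)  # a 2x2 feature map; the right column fires where the edge is
```

In the real system these feature maps come from many learned filters in a deep pre-trained CNN, but the shape of the computation is the same.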
ABSTRACT
In this project, we use a convolutional neural network (CNN) and a long short-term
memory (LSTM) network to train a model that can describe an image by understanding
its features. Image captioning is the process of generating a description of what is going on in an
image. In the last few years, the problem of automatically generating descriptive sentences for images
has gained rising interest in Natural Language Processing (NLP). It is a task in which each image must
be understood properly so that a suitable caption with correct grammatical structure can be generated:
describing a picture requires well-structured English phrases. Automatically describing image content is
very helpful for visually impaired people, since it removes ambiguity about an image's meaning and
thereby avoids discrepancies in knowledge acquisition. We use a CNN-LSTM architecture in which the
CNN layers extract features from the input image and the LSTM processes the caption sequence, with
the current word acting as input for the prediction of the next word. The basic idea is that users will get
automated captions when the system is implemented on social media or in other applications.
LITERATURE SURVEY
• In a CNN-CNN based model, where a CNN is used for both encoding and decoding,
the model exhibits high loss, which is not acceptable.
• A CNN-RNN based model achieves lower loss than the CNN-CNN based model,
but its training time is longer.
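The loss these comparisons refer to is typically per-word cross-entropy: the negative log-probability the model assigns to each reference word. A minimal NumPy sketch (the probability values and vocabulary size here are invented for illustration):

```python
import numpy as np

def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the target word at each time step."""
    probs = np.asarray(probs)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked)))

# Predicted next-word distributions over a 3-word vocabulary at 2 time steps.
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
targets = [0, 1]  # the reference caption's word indices

# A model that puts more probability on the reference words has lower loss,
# which is the sense in which CNN-RNN beats CNN-CNN above.
print(cross_entropy(confident, targets), cross_entropy(uncertain, targets))
```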
PROPOSED SYSTEM
• The task is to build a system that takes an image as input, in the form of a multi-dimensional
array, and generates as output a sentence describing the image that is syntactically and
grammatically correct.
• The traditional CNN-RNN model hinders the Recurrent Neural Network from learning
efficiently; the proposed system therefore uses an LSTM, which trains more effectively and
generates descriptive text captions.
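The reason an LSTM trains more effectively than a plain RNN is its gated, additive cell-state update, which eases gradient flow over long sequences. The sketch below runs one standard LSTM step in NumPy; the layer sizes and random weights are illustrative, not the project's actual model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b pack the forget/input/output/candidate
    parameters as (4*n, d), (4*n, n) and (4*n,) arrays."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of c to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new content to admit
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much state to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate cell content
    c_new = f * c + i * g         # additive update: gradients flow through c
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d, n = 3, 4                       # toy input and hidden sizes
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for t in range(5):                # unroll over a short input sequence
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape)  # (4,)
```

A plain RNN instead overwrites its hidden state multiplicatively at every step, which is what makes its gradients vanish or explode on long captions.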
MODULES
RESULT
PROJECT ENVIRONMENT
HARDWARE REQUIREMENTS:
System – Intel dual-core or better
Speed – 2.5 GHz
Hard Disk – 40 GB or more
Monitor – 15" VGA color
RAM – 2 GB or more
SOFTWARE REQUIREMENTS:
Operating System – Windows 10
Programming Language – Python
IDE – Jupyter Notebook / Google Colab
REFERENCE
▰ [1] Marc Tanti, Albert Gatt, and Kenneth P. Camilleri (2018), "Where to Put the Image in an Image Caption
Generator", Institute of Linguistics and Language Technology. arXiv:1703.09137v2 [cs.NE], 14 Mar 2018.
▰ [2] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, Google (2015), "Show and Tell: A Neural
Image Caption Generator". arXiv:1411.4555v2 [cs.CV], 20 Apr 2015.
▰ [3] Palak Kabra, Mikir Gharat, Dhiraj Jha, and Shailesh Sangle (2022), "Image Caption Generator Using Deep
Learning", International Journal for Research in Applied Science & Engineering Technology (IJRASET), Volume 10,
Issue X, October 2022. ISSN: 2321-9653.
▰ [4] Grishma Sharma, Priyanka Kalena, Nishi Malde, Aromal Nair, and Saurabh Parkar (2019), "Visual Image Caption
Generator Using Deep Learning", 2nd International Conference on Advances in Science & Technology (ICAST-2019),
SSRN Electronic Journal, January 2019. DOI: 10.2139/ssrn.3368837.
▰ [5] Reshmi Sasibhooshan, Suresh Kumaraswamy, and Santhoshkumar Sasidharan (2023), "Image Caption Generation
Using Visual Attention Prediction and Contextual Spatial Relation Extraction", Journal of Big Data 10:18.
https://doi.org/10.1186/s40537-023-00693-9.
THANKS!
Any questions?