
REVIEW – 1

Image Caption Generator using AI


CSE 3013 – ARTIFICIAL INTELLIGENCE

SLOT: B2+TB2

TEAM MEMBERS

19BCE0222 – SAISURYA JAIMI

19BCI0229 – JAMEDAR ANISH

19BCE0106 – NIKITA SHARMA


ABSTRACT
Image caption generation has emerged as a challenging and important research area following advances in statistical language modelling and image recognition. Generating captions from images has numerous practical benefits, ranging from aiding the visually impaired to enabling the automated, cost-saving labelling of the millions of images uploaded to the web every day. The field also brings together state-of-the-art models in Natural Language Processing and Computer Vision, two of the major fields in Artificial Intelligence.

Benefiting from advances in image classification and object detection, it has become possible to automatically generate one or more sentences describing the visual content of an image, a problem known as Image Captioning. Automatically generating complete and natural image descriptions has large potential impact, for example titles attached to news images, descriptions associated with medical images, text-based image retrieval, information access for blind users, and human-robot interaction. These applications of image captioning have significant theoretical and practical research value. Image captioning is therefore a complicated but important task in the age of Artificial Intelligence.
INTRODUCTION

The goal of image captioning is to generate descriptions for a given image, i.e., to capture the relationships between the objects present in the image, generate natural language expressions, and judge the quality of the generated descriptions. The problem is therefore arguably harder than popular computer vision tasks such as object detection or segmentation, where the emphasis is solely on identifying the various entities present in the image. With recent advancements in training neural networks, the availability of GPU computing power, and large datasets, neural-network-driven approaches are the most popular choice for handling the caption generation problem. However, humans are still better at interpreting images and constructing useful and meaningful captions, with or without a specific application context, which makes image captioning an interesting application for interactive machine learning (IML) and explainable artificial intelligence (XAI). Promising techniques include active learning, which has already been applied to automate the assessment of image captioning; IML methods that incrementally train, e.g., re-ranking models for selecting the best caption candidate; and XAI methods that can improve the user's understanding of a model and eventually enable the user to provide better feedback for a further IML iteration.
LITERATURE REVIEW
AUTHOR: Chetan Amritkar and Vaishali Jabade, Department of EnTC, Vishwakarma Institute of Technology, Pune, India (chetan.amritkar16@vit.edu, vaishali.jabade@vit.edu)
TITLE: Image Caption Generation using Deep Learning Technique
CONCEPT: The model is used to generate natural sentences which describe the image.
METHODOLOGY: A pre-trained Convolutional Neural Network (CNN) is used for the image classification task. This network acts as an image encoder, and its last hidden layer is used as the input to a Recurrent Neural Network (RNN), which acts as a decoder and generates the sentences.
ANALYSIS: We analyse that the use of larger datasets increases the performance of the model. A larger dataset will increase accuracy as well as reduce losses.
LIMITATIONS: The categories in the results arise from the neighbourhood of particular words, i.e., for a word like "car", neighbourhood words like "vehicle", "van", "cab", etc. are also generated, which might be incorrect.

AUTHOR: Haoran Wang, Yue Zhang, and Xiaosheng Yu
TITLE: "An Overview of Image Caption Generation Methods" (CIN, 2020)
CONCEPT: A survey of image caption generation methods in which the caption generator, linked to an RNN, generates captions on demand rather than storing them.
METHODOLOGY: A pre-trained Convolutional Neural Network (CNN) is used for the image classification task. This network acts as an image encoder.
ANALYSIS: We analyse that the use of larger datasets increases the performance of the model. A larger dataset will increase accuracy as well as reduce losses.
LIMITATIONS: The difficulty of understanding the intermediate results; the LRCN method is further developed towards text generation.
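
Both reviewed papers rely on the same encoder pattern: a pre-trained classification CNN whose last hidden layer is taken as the image encoding and fed to an RNN decoder. Below is a minimal sketch of that encoder step in Keras, assuming an InceptionV3 backbone; the papers say only "pre-trained CNN", so the specific network is an assumption.

```python
# Minimal sketch of the CNN encoder step described in both papers: a
# pre-trained classification CNN with its output layer removed, so the
# last hidden layer becomes the image feature vector fed to the RNN.
# InceptionV3 is an assumption; the papers only specify "pre-trained CNN".
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False with pooling="avg" drops the classifier head and
# returns a single 2048-dimensional feature vector per image.
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def encode_image(path):
    img = image.load_img(path, target_size=(299, 299))  # InceptionV3 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x)[0]  # shape (2048,): the input to the RNN decoder
```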
PROPOSED WORK AND IMPLEMENTATION
Module Description
• Performing data cleaning.
• Extracting the feature vectors from all images.
• Loading the dataset for training the model.
• Tokenizing the vocabulary.
• Creating data generators.
• Defining the CNN-RNN model (see the sketch after this list).
• Training the model.
• Testing the model.
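
A minimal sketch of the tokenization and model-definition steps above, assuming Keras, 2048-dimensional image features (as in the encoder sketch), and an LSTM decoder. The layer sizes and the feature/sequence merge design are illustrative assumptions, not the project's final architecture.

```python
# Sketch of the "Tokenizing the vocabulary" and "Defining the CNN-RNN model"
# modules. Sizes (256 units, embedding dim) and the merge design are assumptions.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenizing the vocabulary: fit a tokenizer on all training captions.
captions = ["startseq a dog runs on grass endseq"]  # placeholder corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in captions)

# Defining the CNN-RNN model: image features and a partial caption are
# merged, and the network predicts the next word of the caption.
img_input = Input(shape=(2048,))                  # CNN feature vector
img_branch = Dense(256, activation="relu")(Dropout(0.5)(img_input))

seq_input = Input(shape=(max_length,))            # caption so far (word ids)
seq_branch = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(seq_input))

merged = Dense(256, activation="relu")(add([img_branch, seq_branch]))
output = Dense(vocab_size, activation="softmax")(merged)  # next-word scores

model = Model(inputs=[img_input, seq_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

Training (model.fit) would then iterate over (image feature, partial caption) → next-word pairs produced by the data generators.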
Requirement Specifications
Hardware Requirements
• Processor: 64-bit architecture at 1 GHz or faster; Intel 8th generation or newer; AMD Ryzen 3 or better; Qualcomm Snapdragon 7c or higher
• RAM: 4 GB or higher
• Storage: 64 GB or larger storage device
• Graphics card: DirectX 12 or later capable; WDDM 2.0 driver or newer
• Display: high-definition (720p) display, larger than 9" diagonal, 8 bits per colour channel (or better)

Software Requirements
• Python (3) with pip installed
• Editor
• Python AI libraries:
✓ TensorFlow
✓ Keras
✓ Pillow
✓ NumPy
✓ tqdm
✓ JupyterLab
EXPECTED RESULT
• Detect objects in an image and determine the relationships between them.
• Show image content correctly with well-formed phrases and sentences.
• Describe an image with one or more natural language sentences (see the decoding sketch below).
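
As an illustration of how the trained model would produce such descriptions, here is a hedged sketch of greedy decoding, reusing the model, tokenizer, and image features from the sketches above; "startseq"/"endseq" are assumed start and end tokens, and beam search would be a common alternative.

```python
# Hedged sketch of greedy caption decoding with the model defined earlier.
# Assumes the tokenizer, model, and 2048-d photo features from the previous
# sketches; "startseq"/"endseq" are the assumed caption boundary tokens.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_features.reshape(1, -1), seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":  # stop at the end token
            break
        text += " " + word
    return text.replace("startseq", "").strip()
```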
