
PES UNIVERSITY

(Established under Karnataka Act No. 16 of 2013)


100-ft Ring Road, Bengaluru – 560 085, Karnataka, India

Report on
FINE-GRAINED FEATURE EXTRACTION OF COMPLEX INDOOR IMAGES
USING VISUAL QUESTION ANSWERING TO ASSIST VISUALLY
IMPAIRED

Submitted by
Prem Kumar L (PES1UG19EC216)
Rajat Subraya Gaonkar (PES1UG19EC229)
Rohan Madan Ghodake (PES1UG19EC239)
V A Pruthvi (PES1UG19EC905)

January-May 2022

Under the guidance of


Prof. Raghavendra M J
Assistant Professor
Department of Electronics and Communication Engineering
PES University
Bengaluru - 560085

FACULTY OF ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

CERTIFICATE

This is to certify that the Project entitled
FINE-GRAINED FEATURE EXTRACTION OF COMPLEX INDOOR IMAGES
USING VISUAL QUESTION ANSWERING TO ASSIST VISUALLY
IMPAIRED
is a bonafide work carried out by

Prem Kumar L (PES1UG19EC216)


Rajat Subraya Gaonkar (PES1UG19EC229)
Rohan Madan Ghodake (PES1UG19EC239)
V A Pruthvi (PES1UG19EC905)

In partial fulfillment for the completion of the Program of Study B.Tech in Electronics and Communication
Engineering under the rules and regulations of PES University, Bengaluru, during the period August–December
2022. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in
the report. The project report has been approved as it satisfies the 6th-semester academic requirements in
respect of project work.

Signature with date & Seal        Signature with date & Seal        Signature with date & Seal
Prof. Raghavendra M J Dr. Anuradha M Dr. Keshavan B K
Internal Guide Chairperson Dean of Faculty

Name and Signature of Examiners:


NAME: SIGNATURE:

DECLARATION

We, Prem Kumar L, Rajat Subraya Gaonkar, Rohan Madan Ghodake, V A Pruthvi, hereby declare that

the project entitled, "FINE-GRAINED FEATURE EXTRACTION OF COMPLEX INDOOR IMAGES
USING VISUAL QUESTION ANSWERING TO ASSIST VISUALLY IMPAIRED", is an original work
done by us under the guidance of Prof. Raghavendra M J, Assistant Professor, Department of Electronics
and Communication Engineering, PES University, and is being submitted in partial fulfilment of
the requirements for completion of 6th semester course work in the Program of Study B.Tech in Electronics
and Communication Engineering.

PLACE: BENGALURU
DATE:14/12/2022

NAME AND SIGNATURE OF THE CANDIDATE

Prem Kumar L (PES1UG19EC216)

Rajat Subraya Gaonkar (PES1UG19EC229)

Rohan Madan Ghodake (PES1UG19EC239)

V A Pruthvi (PES1UG19EC905)

ACKNOWLEDGEMENT

We would like to take this opportunity to thank Prof. RAGHAVENDRA M J, Assistant Professor,
Department of Electronics and Communication Engineering, PES University, for his
persistent guidance, suggestions, assistance and encouragement throughout the
development of this project. It was an honor for us to work on the project under his
supervision and to understand the importance of the project at its various stages.

We are grateful to the project coordinator Prof. Rajasekar M, Department of
Electronics and Communication Engineering, for organizing, managing, and helping
with the entire process.

We would like to thank Dr. ANURADHA M, Professor and Chairperson, Department of
Electronics and Communication Engineering, PES University, for her invaluable support,
and we are thankful for the continuous support given by the department. We are also very grateful
to all the professors and non-teaching staff of the Department who have directly or
indirectly contributed to enriching our work.

We thank Chancellor Dr. M R DORESWAMY, Pro-Chancellor Prof. JAWAHAR
DORESWAMY, Vice Chancellor Dr. SURYAPRASAD J, the Registrar Dr. K S
SRIDHAR and Dean of Engineering Dr. KESHAVAN B K of PES University for
giving us this wonderful opportunity to complete this project by providing everything
required to do so. It has motivated us and supported our work thoroughly.

We are grateful and obliged to our family and friends for providing the energy and
encouragement when we needed them the most.


ABSTRACT
Visually challenged people face a lot of difficulties due to loss of vision. Vision is one of the most important senses for human
beings to complete their daily activities. Many developments and efforts have been made to assist the visually
impaired, and we too are making an effort to assist them by using Visual Question Answering (VQA).
We aim to develop a deep learning algorithm that uses VQA to extract features from complex indoor images as well as from the
questions, in order to assist the visually impaired. The model takes two inputs: an image and a
question. Our VQA algorithm extracts fine-grained features from both the question and the image. The answer-generating
model predicts an answer based on these features, which is converted to speech and given to the user. The user can
then ask any related question, which is given as input along with the image, and the
process repeats. The VQA v2 dataset is used. A total of 40,000 samples are used, consisting of complex indoor
images and occluded images. Of these samples, 80% are used for training and the remaining 20% are
used for testing. VGG16 and ResNet are used as image encoders, while GloVe embeddings with an LSTM are used as the question encoder. VGG16 locates items in
images from 200 different classifications and also labels each image with one of a thousand categories.
ResNet uses residual (skip) connections that allow very deep networks to be trained, and its 152-layer variant won the ILSVRC 2015 classification challenge.
We propose to solve the Visual Question Answering (VQA) task by using a stacked attention network with
the addition of a self-focus network. The model can be trained and tested on yes-or-no type questions, which
mainly include indoor images. We use the stacked attention network to answer questions that call for multiple
steps of reasoning. These kinds of networks look for areas in images that correspond to the
answers by using the semantic representation of the question as the search term; in a SAN, a picture is
progressively queried in order to obtain results. We use the Adam optimizer. Optimizers are methods that
reduce the losses in the network by changing the attributes of the neural network, such as its weights and learning rate.
Whenever we train a neural network, the output we get from the network differs from the actual output; this difference is measured by the loss (or cost) function.


TABLE OF CONTENTS

Abstract

Acknowledgement

Chapter 1: Introduction

1.1 Motivation

1.2 Objectives

1.3 Problem Statement

1.4 Organisation of the Report

Chapter 2: Literature Survey

2.1 “Visual Question Answering with dense Inter and Intra-Modality interactions”
2.2 “Re-Attention for Visual Question Answering”
2.3 “Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering”
2.4 “Multimodal Encoder-Decoder Attention Networks for Visual Question Answering”
2.5 “MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network”
2.6 “Self-Adaptive Neural Module Transformer for Visual Question Answering”
2.7 “Adversarial Learning with Multi-Modal Attention for Visual Question Answering”
2.8 “Towards Assisting the Visually Impaired: A Review on Techniques for Decoding the Visual Data from Chart Images”
2.9 “Learning to Recognize Visual Concepts for Visual Question Answering with Structural Label Space”
2.10 “Plenty is Plague: Fine-Grained Learning for Visual Question Answering”
2.11 “Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering”
2.12 “Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval”
2.13 “An Entropy Clustering Approach for Assessing Visual Question Difficulty”
2.14 “Mutual Attention Inception Network for Remote Sensing Visual Question Answering”
2.15 “Low-altitude small-sized object detection using lightweight feature-enhanced convolutional neural network”

2.16 “Stacked Attention Networks for Image Question Answering”

Chapter 3: Methodology

3.1 Block Diagram

3.2 Dataset

3.3 Method

3.3.1 Image Encoder

3.3.1.1 VGG16

3.3.1.2 ResNet

3.3.2 Question Encoder

3.3.2.1 GloVe

3.3.2.2 LSTM

3.3.3 Model Architecture and Visualization

3.3.3.1 Stacked Attention Network

3.3.3.2 Customization in the Stacked Attention Network

Chapter 4: Results

Chapter 5: Software Needed

Chapter 6: Conclusion and Future Scope


LIST OF TABLES
Table No.  Table Name
2.1   Comparison With Previous State-of-the-Art Methods on the VQA 1.0 Dataset
2.2   Performance on the Validation Split of the VQA v2 Dataset
2.3   Overall Accuracies on the Test-Dev and Test-Challenge Sets of the VQA-2.0 Data Set
2.4   Results of Single Model Compared With the State-of-the-Art Models on the Test-Dev and Test-Std Sets
2.5   Results on the VQA-2.0 Validation Set
2.6   Ablation Study on VQA v1.0 and COCO-QA Data Sets. “*” Denotes Variant Implementations of the Model
2.7   Qualitative Analysis of Different Model Performances (Accuracy) in Reasoning
2.8   Performance (%) on the VQA-CP v2 Test Set and VQA v2 Validation Set
2.9   Evaluations of BUTD on VQA2.0 Test-Dev With Visual Genome Dataset
2.10  Overall Accuracy Result With Other Methods on the FVQA Data Set
2.11  Comparison of Performance of Our Model With the State-of-the-Art Methods on MS-COCO Dataset
2.12  Performance Comparison on the RSIVQA Dataset
2.13  Performance Comparison on the Remote Sensing VQA Dataset
2.14  Results on MS-COCO Dataset
2.15  VQA Results on Our Partition, in Percentage

CHAPTER 1: INTRODUCTION

Visual Question Answering (VQA) is a growing field under Deep Learning and Artificial Intelligence (AI).
VQA incorporates different subfields of AI such as computer vision, convolutional neural networks, object
detection/recognition, natural language processing (NLP), and knowledge representation and reasoning.
A VQA algorithm seeks to find a relevant answer to the question asked of it with reference to the given
image. It attempts to understand the input image like a human brain and give a meaningful output about the image.
We identify four major steps in the process of VQA: i) image feature extraction, ii) question
feature extraction, iii) fusion of question and image features, and iv) answer generation
using an attention network.

In this work we propose a Visual Question Answering algorithm that takes an image and a question related
to that image as input, and extracts fine-grained features from both the indoor image and the question.
Our motive is to address minute details in images and questions so as to improve the quality of the answer
and to give the output in natural language. We have given particular prominence to indoor images, which are
complex in nature. The main reason for doing so is to focus on the challenges that arise when processing
indoor images and the difficulty in detecting/recognizing features in images captured indoors. We
also focus on fine-grained features in the questions. Fine-grained recognition is a subfield of object recognition that
aims to distinguish subordinate categories within an entry-level category. Visual questions selectively target
different areas of an image, including background details and underlying context.

Fig. 1.1: Example 1 Fine-grained feature extraction example

For example, consider the question “Is the animal in the picture a dog?”, for which the answer would be “yes”; however,
the model cannot answer questions asked about the minute details of the image. Fine-grained features
help in solving this problem by capturing minute details. For example, consider a question related to the same
image, “Is the breed of the dog in the picture German Shepherd?”, for which the answer would be “No”.


Fig. 1.2: Example 2: “Free-form, open-ended questions collected for images via Amazon Mechanical Turk. Note that
commonsense knowledge is needed along with a visual understanding of the scene to answer many questions”.

Open-ended questions require a potentially vast set of AI capabilities to answer – fine-grained recognition
(e.g., “What kind of cheese is on the pizza?”), object detection (e.g., “How many bikes are there?”), activity
recognition (e.g., “Is this man crying?”), knowledge base reasoning (e.g., “Is this a vegetarian pizza?”), and
commonsense reasoning (e.g., “Does this person have 20/20 vision?”, “Is this person expecting company?”).
(Aishwarya Agrawal, 27 Oct 2016)


CHAPTER 1.1: MOTIVATION


The visually challenged face several challenges when performing their daily tasks, and there is increasing
interest in developing effective solutions that can help them recognize objects. We propose
an algorithm to help the visually challenged overcome their lack of visual sense by using other senses such as
sound. Besides, there is an obvious deficiency in the number of applications that target
visually challenged users. Therefore, to address these issues, there is a need to design an effective
solution that can help visually challenged people analyze their surroundings and take feedback from the things
around them. Developing such an algorithm, which uses human-powered technology, helps the visually
challenged person tackle many challenges. Evaluation of the proposed application shows that
it is easy to use and can be employed for many important purposes in daily life.

CHAPTER 1.2: PROBLEM STATEMENT


To develop a Deep Learning algorithm to extract fine-grained features from complex indoor images as
well as questions using visual question answering (VQA).

CHAPTER 1.3: REPORT ORGANISATION


CHAPTER 2: LITERATURE SURVEY

2.1 “Visual Question Answering with dense Inter and Intra-Modality interactions.” (F. Liu, 2021)

The authors of this paper are Fei Liu, Jing Liu, Zhiwei Fang, Richang Hong and Hanqing Lu.
This paper proposes a novel DenIII framework for Visual Question Answering, which helps in capturing more
fine-grained multimodal information by performing dense inter- and intra-modality interactions. It also proposes
efficient inter and intra ACs to connect different and same modalities. VQAv1.0, VQAv2.0, and TDIUC are
the datasets used in this paper.
The methodologies used here are:
• Feature extraction: initial image features are extracted using Faster R-CNN.
• Dense inter- and intra-modality interactions: a GRU with hidden units is adopted as the question
encoding layer, together with a non-linear layer.
• Attention mechanism and answer prediction. A maximum accuracy of 69% is achieved on the
VQAv1.0 dataset. The DenIII framework achieves competitive performance on all three datasets used,
which shows that the framework is modelled well.
The limitation of this model is that if one object in an image is occluded by another object, the model obtains an
inaccurate count of objects. Also, when the model focuses on objects shown on a TV screen it ignores the global
information about the TV, so the model is unaware that the objects are actually television pictures.

Model                   test-dev                              test-std
                        Yes/No  Number  Others  Overall       Yes/No  Number  Others  Overall
Memory-augmented Net    81.5    39.0    54.0    63.8          81.7    37.6    54.7    64.1
QGHC                    83.5    38.1    57.1    65.9          -       -       -       65.9
Dual-MFA                83.6    40.2    56.8    66.0          83.4    40.4    56.9    66.1
VKMN                    83.7    37.9    57.0    66.0          84.1    38.1    56.9    66.1
MFH                     85.0    39.7    57.4    66.8          85.0    39.5    57.4    66.9
DCN                     84.6    42.4    57.3    66.9          85.0    42.3    57.0    67.0
DA-NTN                  85.8    41.9    58.6    67.9          85.8    42.5    58.5    68.1
CoR                     85.7    44.1    59.1    68.4          85.8    43.9    59.1    68.5
DenIII (ours)           86.7    44.1    59.7    69.1          86.8    43.3    59.4    69.0

Table 2.1: Comparison with Previous State-of-the-Art Methods on the VQA 1.0 Dataset


2.2 “Re-Attention for Visual Question Answering”. (Guo, Zhang, Yang, & Yuan, 2021)

This is an IEEE paper; its authors are Wenya Guo, Ying Zhang, Jufeng Yang, and Xiaojie Yuan.
This paper proposes a Re-attention framework that utilizes the information in answers for the Visual Question
Answering task while extracting the image and question features. The datasets used in this model are VQA, VQA v2,
and COCO-QA. The methodology used here is:
• Questions associated with the relevant objects are unified into a single entity, which is used to predict
the answers.
• The generated answers are used to re-attend the objects and alter the visual attention map.
• The major advantage is that the re-attention procedure extracts information from the generated answers and
reconstructs the visual attention map, which helps in answer detailing.
• This Re-attention framework performs better than the previous paper (“Visual Question Answering
with dense Inter and Intra-Modality interactions”), giving better accuracy.
• It shows the best performance against fusion-, attention-, and reasoning-based methods on all data sets, with an
improvement of 6.28% in overall accuracy on the VQA v2 data set.
Although this method learns the correct regions, it finds it difficult to answer questions that involve common
sense. A special text understanding module would be needed, for example, to accurately identify the time shown on a clock.

Methods Yes/No Other Overall


Base+Ca 84.13 58.13 66.51
Base+Ca+Re-att (A) 84.62 58.00 66.72
Base+Ca+Re-att (Q_A) 60.27 50.88 50.36
Base+Ca+Re-att (Q+A) 85.01 58.32 67.19
Base+Ca+E-Re-att (A) 84.76 58.41 66.97
Base+Ca+E-Re-att (Q_A) 66.34 54.26 55.47
Base+Ca+E-Re-att (Q+A) 85.35 58.83 67.68
Table 2.2 Performance on the Validation Split of the VQA v2 Dataset.


2.3 “Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering”. (Zhou Yu, December 2018)

The authors of this paper are Zhou Yu, Jun Yu, Member, IEEE, Chenchao Xiang, Jianping Fan, and Dacheng
Tao, Fellow, IEEE. This IEEE paper proposes a model to find good solutions for the following three issues:
• fine-grained feature representations for both the image and the question; 
• multimodal feature fusion that can capture the complex interactions between multimodal features; and 
• automatic answer prediction that can consider the complex correlations between multiple diverse
answers for the same question.
The datasets used are VQA-1.0 and VQA-2.0. The methodology includes:
• a multimodal factorized bilinear pooling approach;
• a multimodal factorized high-order pooling (MFH) method to achieve a more effective fusion of
multimodal features; and
• Kullback–Leibler divergence (KLD) as the loss function to achieve a more accurate characterization.
It achieves 60.7% accuracy on the VQA 1.0 dataset and 68.16% accuracy on the VQA 2.0 dataset.
This proposed model works with one third of the parameters and two thirds of the total GPU usage, and it is
robust. However, the model fails to analyse the relation between image regions and semantic words in an input image
and the image-related question.

Model Test-Dev Test-Challenge


vqateam-Prior - 25.98
vqateam-Language - 44.34
vqateam-LSTM-CNN - 54.08
vqateam-MCB - 62.33
Adelaide-ACRV-MSR - 69.00
DLAIT (2nd place) - 68.07
LV_NUS (4th place) - 67.62
1 MFB model 64.98 -
1 MFH model 65.80 -
7 MFB models 67.24 -
7 MFH models 67.96 -
9 MFH models 68.02 68.16
Table 2.3 Overall Accuracies on the Test-Dev and Test-Challenge Sets of the VQA-2.0 Data Set


2.4 “Multimodal Encoder-Decoder Attention Networks for Visual Question Answering”. (Chongqing Chen, Feb 19, 2020)

The authors of this paper are Chongqing Chen, Dezhi Han, and Jun Wang.
This is an IEEE paper in which a novel Multimodal Encoder-Decoder Attention Network (MEDAN) is
proposed. The MEDAN consists of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in
depth, and it can capture rich and reasonable question features and image features by associating keywords in the
question with important object regions in the image.
The datasets used are the Visual Genome dataset and the VQA-v2 dataset. The methods used are:
• Attention mechanism for VQA: Multimodal Encoder-Decoder Attention.
• Scaled Dot-Product Attention and Multi-Head Attention.
• Encoder and Decoder. This model achieves accuracy 0.56 and 0.66 points higher than DFAF on
test-dev and test-std, respectively. On test-dev it is 0.54 and 0.16 points higher than DFAF and MCAN,
and on test-std MEDAN is 0.64 and 0.08 points higher than DFAF and MCAN.
A major advantage of this model is that it uses the Encoder to extract fine-grained question features by
self-attention. The Multimodal Encoder-Decoder Attention Network can capture rich and reasonable question
features and image features by associating keywords in the question with important object regions in the image.
The limitations of this model are that its accuracy is lower than MUAN (Multimodal Unified Attention
Networks) and that it is not well suited for indoor images.
 
Model             Test-dev                               Test-std
                  Y/N     Num     Other   All            All
BUTD [1]          81.82   44.21   56.05   65.32          65.67
MFH [33]          85.31   49.56   59.89   68.76          -
BAN [12]          85.42   50.93   60.26   69.52          -
BAN+Counter       85.42   54.04   60.52   70.04          70.35
DFAF [3]          86.09   53.32   60.49   70.22          70.34
MCAN [2]          86.82   53.26   60.72   70.63          70.90
MUAN [4]          86.77   54.40   60.89   70.82          71.10
MEDAN (Adam)      87.10   52.69   60.56   70.60          71.01
MEDAN (AdamW)     87.02   53.57   60.77   70.76          70.98

Table 2.4: Results of Single Model Compared With the State-of-the-Art Models on the Test-Dev and Test-Std Sets


2.5 “MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network”. (Liang Peng, January 2022)

The authors of this paper are Liang Peng, Yang Yang, Zheng Wang, Zi Huang, and Heng Tao Shen. In this paper
a self-guided word relation attention scheme is used, which explores the latent semantic relations between
words. Two question-adaptive visual relation attention modules are used to extract not only the fine-grained
and precise binary relations between objects but also the more sophisticated trinary relations.
The datasets used are VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, and the methods used are:
• A self-guided word relation attention scheme to explore the latent semantic relations between words.
• Two question-adaptive visual relation attention modules that can extract not only the fine-grained and
precise binary relations between objects but also the more sophisticated trinary relations.
MRA-Net improves the overall accuracy from 67.9 to 69.22 on the VQA-2.0 dataset. A few advantages of this
paper are that MRA-Net can focus on the important words and the latent semantic relations between them to
understand the question more completely, and that it reconciles the appearance feature with the relation feature
effectively, thereby reasoning out the correct answer. In addition, MRA-Net reconciles the object appearance
features with the two kinds of relation features under the guidance of the corresponding question, which lets it
effectively use these features according to the question. A few limitations observed are:
• The binary relation feature improves the overall accuracy only from 65.43 to 65.94, which is not a large gain.
• The proposed model occasionally makes mistakes in locating all the relevant regions and relations.

Methods                  Q-rel   nParams   VQA-score
O-att                    ✓       48.7M     65.43
2*O-att                  ✓       53.8M     65.20
O-att+Binary*            ✓       56.8M     65.94
O-att+Binary             ✓       56.8M     65.94
O-att+2*Binary           ✓       64.7M     65.95
O-att+Binary+Trinary*    ✓       67.4M     66.08
Self-att                 X       55.9M     65.90
Q-rel                    ✓       61.6M     65.87
2*Self-att               X       61.7M     65.96
Self-att+Q-rel*          ✓       67.4M     66.08

Table 2.5: Results on the VQA-2.0 Validation Set


2.6 “Self-Adaptive Neural Module Transformer for Visual Question Answering”. (H. Zhong, 2021)

The authors of this paper are Huasong Zhong, Hanwang Zhang, Jingyuan Chen, and Xian-Sheng Hua. The
objective of this paper is to present a novel Neural Module Network, called the Self-Adaptive Neural Module
Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout
decoding by considering intermediate question-and-answer results. The datasets used are CLEVR, CLEVR-CoGenT,
VQAv1.0 and VQAv2.0, and the methods used are:
• General framework: convert the input to image and question embeddings.
• Revised Module Network: each neural module is constructed as a differentiable structure.
• Encoding: a neural transformer is utilized to encode more accurate input question features for the layout
controller based on the intermediate Q&A results.
• Decoding: the neural transformer is utilized to guide subsequent layout generation based on the
intermediate results from each reasoning step.
The model performs better on VQAv1.0 than on VQAv2.0, achieving 66.7% on VQAv1.0 and 64.5% on
VQAv2.0. A few advantages observed are that the self-adaptive feature embedding achieves better performance than the
baseline module, and that the layouts in the model are self-adaptive to the real situation, resulting in better performance.
The model can focus on the dynamic nature of question comprehension and select the corresponding module
function more accurately. A few limitations of this paper are:
• Expert layout supervision is not significant on VQAv2.0.
• The VQAv2.0 dataset is more complicated and closer to the datasets used in practice.

2.7 “Adversarial Learning with Multi-Modal Attention for Visual Question Answering”. (Liu, Zhang, Huang, Cheng, & Li, Sept. 2021)

The authors of this paper are Yun Liu, Xiaoming Zhang, Feiran Huang, Lei Cheng, and Zhoujun Li. The objective
of this paper is to propose a novel model which can capture the answer-related features. The datasets used are
VQA v1, VQA v2, and COCO-QA. The methods followed are:
• An adversarial method to explore the answer-related information.
• Multi-modal attention with Siamese similarity learning to learn the alignment between image regions and
answers.
• The final ALMA model (Adversarial Learning with Multi-modal Attention).
An accuracy of 68.94% on VQA v1, 68.76% on VQA v2, and 68.27% on COCO-QA is achieved.
A few advantages of this model are:
• It focuses more on the answer-related image regions, outperforming attention models.
• A 1.65% improvement in accuracy compared to previous models.
One limitation observed is that the model fails in analysing small objects in an image.

Methods             VQA v1.0 (Accuracy)   COCO-QA (Accuracy)
No-Att 45.56 54.78
MM-Att 54.74 61.65
MM-Att + Sia 59.63 64.32
MM-Att + Sia 61.41 68.26
H-LSTM 58.62 65.91
E-LSTM 59.21 66.74
HE-LSTM* 61.36 68.24
Ans-Rep 59.84 65.73
Inf-Dis-Sub 60.71 67.75
Inf-Dis-Div* 61.35 68.26
Table 2.6: Ablation Study on VQA v1.0 and COCO-QA Data Sets. “*” Denotes Variant Implementations of
the Model

2.8 “Towards Assisting the Visually Impaired: A Review on Techniques for Decoding the Visual Data from Chart Images”. (Lijiya, 29 March 2021)

This is an IEEE paper. The authors are K. C. Shahira and A. Lijiya. The objective of this paper is
to explore the existing literature on understanding graphs and extracting the visual encoding from them.
The datasets used are LeafQA, PlotQA, FigureQA, DVQA, FigureSeer, and CLEVR, and the methods followed
are:
• Modality-based approaches, which output audio and tactile information.
• Traditional methods: connected component analysis and the Hough transform.
• Deep learning: attention encoder-decoder, VGG, AlexNet, MobileNet.
ConvNets give accuracy up to 97%, whereas Microsoft OCR on the text gives 75.6%.
The major advantage observed is that a blind person can understand the main information in a graph through
question answering; the study mainly focuses on extracting chart data to aid the visually impaired in
graph perception by reviewing both conventional methods and deep learning methods. A few limitations of this
paper are:
• FigureQA gives only binary answers, and no numerical answering is possible.
• The CLEVR dataset has reasoning questions about synthetic scenes, but models trained on it perform poorly on chart
datasets.
• There are no proper datasets.
• The accuracy of the models is low compared to human-level performance.

Author, Year    Dataset    Type    IMG    QUES    IMG+QUES    SAN


Methani et al. PlotQA BC,LC 14.84% 15.35% 46.54% 53.96%
Kafle et al. DVQA BC 14.83% 21.06% 32.01% 36.04%
Kabou et al. FigureQA BC,LC,PC 52.47% 50.01% 55.59% 72%
Table 2.7: Qualitative Analysis of Different Model Performances (Accuracy) in Reasoning

2.9 “Learning to Recognize Visual Concepts for Visual Question Answering with Structural Label Space”. (Difei Gao, March 2020)

The authors of this paper are Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen.
The objective of this paper is to propose a novel visual recognition module named Dynamic Concept
Recognizer (DCR), which can easily be plugged into an attention-based VQA model, together with a structural label space.
The datasets used are Visual Genome, GQA, VQA v2 and VQA-CP v2, and the methods followed are:
• Structural label space, which outputs G groups where each group contains Ci concepts.
• K-means clustering is used to classify concepts (outputs of GloVe embeddings) into groups.
• Dynamic Concept Recognizer, which takes image features and the predicted group from the Group PredNet and
predicts the concept that answers the given question.
There is an overall 5% increase in DCR accuracy compared to previous models on the Visual Genome dataset and a 3%
(absolute) improvement in accuracy on the GQA dataset. An advantage of this paper is that it works better on
conceptual questions compared to previous models. A limitation of this paper is that it performs poorly on yes/no
type questions, as the model focuses only on the visual concepts.

Model                     Image Feature   VQA-CP v2 test                         VQA v2 val
                                          Yes/No  Number  Other   Overall        Yes/No  Number  Other   Overall
NMN                       ResNet          38.94   11.92   25.72   27.47          73.38   33.23   39.85   51.62
Bottom-Up QAdv+DoE        Bottom-Up       65.49   15.48   35.48   41.17          79.84   42.35   55.16   62.75
MuRel                     Bottom-Up       42.85   13.17   45.04   39.54          84.03   47.84   56.25   65.58
GVQA                                      57.99   13.68   22.14   31.30          72.03   31.17   34.65   48.24
SAN                       ResNet          38.35   11.14   21.74   24.96          68.89   34.55   43.80   52.02
ours: DAG                                 40.84   13.86   29.58   30.46          69.32   33.02   40.14   50.31
ours: DAG                 Bottom-Up       41.56   12.19   43.29   38.04          81.18   42.14   55.66   63.48
ours: DAG w. Q            Bottom-Up       41.05   11.32   40.28   36.09          81.97   42.55   54.42   63.21
ours: DAG (Full Model)                    43.02   15.83   46.41   40.75          81.05   42.47   54.53   62.91

Table 2.8: Performance (%) on the VQA-CP v2 Test Set and VQA v2 Validation Set

2.10 “Plenty is Plague: Fine-Grained Learning for Visual Question Answering”. (Yiyi Zhou, February 2022)

The authors of this paper are Yiyi Zhou, Rongrong Ji, Xiaoshuai Sun, Jinsong Su, Deyu Meng, Yue Gao, and
Chunhua Shen. The main objective of this paper is to propose a fine-grained VQA learning paradigm with an
actor-critic based learning agent, termed FG-A1C, which selects the training data from a given dataset, lowering
the training cost and increasing the speed. The datasets used are the VQA2.0 and VQA-CP v2 datasets. The method
follows a reinforcement learning approach, where an RL agent selects the most valuable data based on a reward
function defined on the training loss. It picks the best subset of training data, which reduces the computational
complexity as well as the time. The Bellman equation is used for policy evaluation. With 25% of the data the model
achieves 60% accuracy, and with 75% of the data it achieves 65.2% accuracy. A few advantages:
• Using only 50% of the training examples saves 25% of the model training time.
• It can be integrated with almost all models without altering the model configuration.
A few limitations observed are:
• The model focuses more on yes/no type questions, while hard questions such as "why" and "where" are barely
selected.
• The model is not well generalized, as it learns from less data.

Paradigm VG STEP All Yes/No Num. Others


Random* 512K - 65.3 81.8 44.2 57.3
Random 512K 412K 66.9 83.4 48.6 57.1
FG-AIC-AL 250K 341K 67.0 83.7 47.6 57.2
FG-AIC-AL 150K 227K 67.0 83.3 47.6 57.1
FG-AIC-SPL 250K 240K 67.2 83.9 48.5 57.2
FG-AIC-SPL 150K 227K 67.2 84.0 48.5 57.0

Table 2.9:  Evaluations of BUTD on VQA2.0 Test-Dev With Visual Genome Dataset

2.11 “Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering”. (Liyang Zhang, October 2021)

The authors of this paper are Liyang Zhang, Shuaicheng Liu, Jingkuan Song, and Lianli Gao. The objective of this
paper is Visual Question Answering (VQA), which involves understanding an image and a paired
question, and enhancing the accuracy using a Knowledge-based Augmentation Network (KAN). The dataset used in
this paper is the Visual Genome dataset. The methods used are:
• A knowledge-based augmentation network (KAN), with ConceptNet as the external knowledge base in this work.
• A feature extraction module that extracts the image feature and the question feature.
This experiment's overall accuracy outperforms the best top-one overall accuracy reported in FVQA, which is
63.63 ± 0.73%. The advantage of this paper is that it uses an external knowledge base (e.g., ConceptNet) to improve
performance, which gives better results than previous models such as MCAN. The limitations are that this
model might give merely plausible answers when it learns inadequately, and that the external knowledge (K) and relation
(R) values can be varied only within a certain range for the model to perform well; beyond this, the model's
accuracy decreases.

Model                          Overall Acc. ± Std (%), Top-1
SVM-Question                   10.37 ± 0.80
SVM-Image                      18.41 ± 1.07
Hie-Question+Image             33.70 ± 1.18
Hie-Question+Image+Pre-VQA     43.14 ± 0.61
FVQA                           63.63 ± 0.73
Ours                           66.39 ± 0.50

Table 2.10: Overall Accuracy Result With Other Methods on the FVQA Data Set

2.12 “Reasoning on the relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval”. (Jing Yu, December 2020)

The authors of this paper are Jing Yu, Weifeng Zhang, Yuhang Lu, Zengchang Qin, Yue Hu, Jianlong Tan and Qi
Wu. The objective of this paper is to propose a novel Visual Relational Reasoning (VRR) module to reason
about pair-wise and inner-group visual relationships among objects, guided by the textual information.
The MS-COCO dataset is used, and the methods used are a Bilinear Visual Attention Module and a Visual Relational
Reasoning Module. The proposed model has an accuracy of 58.1%; an overall improvement of 2.3% is
achieved.
• A major advantage is that a GCN is used to extract the textual features, since the proposed GCN is better for
modelling long texts.
• Though the model identifies some relevant visual content, it can fail to reason about their relationships,
which leads to irrelevant results.

                        Image Query                        Text Query
Model                   R@1    R@5    R@10   Med r         R@1    R@5    R@10   Med r
1K Test Images
DVSA [59]               38.4   69.9   80.5   1.0           27.4   60.2   74.8   3.0
GMM-FV [67]             39.4   67.9   80.9   2.0           25.1   59.8   76.6   4.0
In-CNN [62]             42.8   73.1   84.1   2.0           32.6   68.6   82.8   3.0
VQA-A [68]              50.5   80.1   89.7   -             37.0   70.9   82.9   -
HM-LSTM [63]            43.9   -      87.8   2.0           36.1   -      86.7   3.0
Order-embedding         46.7   -      88.9   2.0           38.9   -      85.9   2.0
DSPE+FV                 50.1   79.7   89.2   -             39.6   75.2   86.9   -
sm-LSTM                 53.2   83.1   91.5   1.0           40.7   75.8   87.4   2.0
two-branch net          54.9   84.0   92.2   -             43.3   76.4   87.5   -
CMPM (ResNet-152)       56.1   86.3   92.9   -             44.6   78.8   89.0   -
VSE++ (fine-tuned)      57.2   -      93.3   1.0           45.9   -      89.1   2.0
c-VRANet (ours)         58.1   86.9   93.4   1.0           50.4   83.6   92.3   1.0
5K Test Images
DVSA                    16.5   39.2   52.0   9.0           10.7   29.6   42.2   14.0
GMM-FV                  17.3   39.0   50.2   10.0          10.8   28.3   40.1   17.0
VQA-A                   23.5   50.7   63.6   -             16.7   40.5   53.8   -
Order-embedding         23.3   -      65.0   5.0           18.0   -      57.6   7.0
CMPM (ResNet-152)       31.1   60.7   73.9   -             22.9   50.2   63.8   -
VSE++ (fine-tuned)      32.9   -      74.7   3.0           24.1   -      66.2   5.0
c-VRANet (ours)         34.4   63.8   76.0   3.0           27.8   57.6   70.8   4.0

Table 2.11: Comparison of Performance of Our Model With the State-of-the-Art Methods on the MS-COCO Dataset

2.13 “An Entropy Clustering Approach for Assessing Visual Question Difficulty”. (Kento Terao, 2020)

The authors of this paper are Kento Terao, Toru Tamaki, Bisser Raytchev, Kazufumi Kaneda and Shin'ichi
Satoh. The objective of this paper is to use the entropy values of the answer predictions produced by different
VQA models to evaluate the difficulty of visual questions for the models. The dataset used is VQA v2.
The methods followed are:
• Hard example mining and hardness/failure prediction.
• Using three models (I, Q, I+Q), predicting answer distributions, and computing entropy values to
perform clustering with a simple k-means.
Accuracy: 67.47 (Q+I); entropy: 0.84 (Q+I). One advantage is that using the entropy of the predicted answers
helps in increasing the accuracy of the VQA model; a limitation is that the entropy measure sometimes gives
merely plausible outputs.

Table. 2.12:  Performance Comparison on the RSIVQA Dataset



2.14 “Mutual Attention Inception Network for Remote Sensing Visual Question Answering”. (Xiangtao Zheng, 2022)

The authors of this paper are Xiangtao Zheng, Binqiang Wang, Xingqian Du, and Xiaoqiang Lu.
In this paper a novel method is proposed that includes convolutional features of the
image to represent spatial information. An attention mechanism and a bilinear technique are introduced to enhance
the features, considering the alignments between spatial positions and words. The datasets used are UC-Merced
(UCM), Sydney, AID, HRRSD, and DOTA. The methods used are:
• A representation module devised to obtain the image and question both as a whole and as parts.
• A fusion module designed to boost the discriminative abilities.
The method achieves the best performance in terms of overall accuracy, which is 67.23%. A few advantages are that it considers
the task from the perspective of classification, which simplifies the problem, and that the adoption of the attention mechanism
and bilinear feature fusion helps in improving the accuracy of the answer. A limitation is that if the same number-type question is
asked multiple times, the model gives different answers.
Methods              Overall         Area            Comparison      Count           Presence
IMG+SOFTMAX          29.75 (1.78)    21.67 (1.49)    32.14 (1.8)     12.83 (3.19)    38.91 (2.43)
BOW+SOFTMAX          56.49 (1.83)    51.35 (1.22)    59.42 (1.86)    48.27 (2.04)    61.77 (1.31)
IMG+BOW+SOFTMAX      63.53 (1.66)    62.34 (0.94)    65.54 (1.01)    60.41 (1.97)    64.79 (0.84)
IMG+GloVe+SOFTMAX    65.76 (1.82)    61.83 (0.90)    66.91 (1.22)    61.34 (1.93)    71.14 (0.95)
IMG+BERT+SOFTMAX     66.30 (1.53)    62.28 (1.01)    67.70 (1.69)    60.61 (1.48)    72.49 (0.89)
OURS                 67.23 (1.04)    63.15 (1.38)    68.55 (1.63)    59.74 (1.41)    69.55 (1.05)

Table. 2.13: Performance Comparison on the Remote Sensing VQA Dataset

2.15 “Low-altitude small-sized object detection using lightweight feature-enhanced convolutional neural network”. (YE Tao, Aug 2021)

The authors of this paper are YE Tao, ZHAO Zongyang, ZHANG Jun, CHAI Xinghua and ZHOU Fuqiang.
The objective of this paper is to propose LSL-Net to perform high-precision, real-time detection of
low-altitude flying objects and to provide information as guidance to suppress "black" (unauthorized) flights of UAVs.
The dataset used is the MS-COCO dataset. The model comprises three simple and
efficient modules: LSM, FEM, and ADM. LSM reduces the image input size and the loss of low-level
feature extraction, FEM improves feature extraction, and ADM increases the image detection accuracy.
LSL-Net achieves an mAP of 90.97%, which is 6.71% higher than YOLOv4-tiny. One advantage is that this
model helps in detecting aerial objects such as jets and drones, which is useful for security purposes. The model
has good robustness and an excellent generalization ability, can effectively perform detection under different
weather conditions, and satisfies the requirements of low-altitude flying object detection for anti-UAV missions.
However, the frame rate needs to be increased to achieve better accuracy.

Methods    Size    mAP/%    FPS


Faster R-CNN - 39.8 9
SSD 300X300 25.1 43
SSD 512X512 28.8 22
YOLOv3-SPP 608X608 36.2 20
YOLOv4 608X608 43.5 33
CenterNet - 41.6 28
FCOS - 44.7 -
LSL-Net(ours) 416x416 38.4 135
LSL-Net(ours) 512x512 39.1 126
LSL-Net(ours) 608x608 40.3 118
Table. 2.14: Results on MS-COCO dataset

2.16 “Stacked Attention Networks for Image Question Answering”. (Zichao Yang, 26 Jan 2016)

The authors of this paper are Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. In
this paper the SAN uses a multiple-layer attention mechanism that queries an image multiple times to locate the
relevant visual regions and to infer the answer progressively.
The datasets used are DAQUAR-ALL, DAQUAR-REDUCED, COCO-QA, and VQA.
In this paper three models are considered:
• The Question Model (LSTM).
• The Image Model (CNN).
• The Stacked Attention Model.
On the COCO-QA dataset the model achieves 79.3% accuracy on yes/no type questions. This paper shows that
using multiple attention layers to perform multi-step reasoning leads to progressively more fine-grained attention,
layer by layer, when locating the regions that are relevant to the potential answers. However, the SAN improves
performance mainly on yes/no questions.

Methods          All    Yes/No (36%)   Number (10%)   Other (54%)
SAN(1, LSTM)     56.6   78.1           41.6           44.8
SAN(1, CNN)      56.9   78.8           42.0           45.0
SAN(2, LSTM)     57.3   78.3           42.2           45.9
SAN(2, CNN)      57.6   78.6           41.8           46.4

Table 2.15: VQA results on our partition, in percentage


CHAPTER 3: METHODOLOGY

3.1 BLOCK DIAGRAM:

Fig. 3.1: Block Diagram.

The model requires two inputs: a camera-generated image and a yes-or-no type question, in text format,
about the input image. After receiving the inputs, the model analyses the image and the question to extract
fine-grained features. The answer-generating model receives both sets of features and uses them to predict an
acceptable response. Using available APIs, the response is transformed to speech and provided to the user
through a voice output module. The user can then ask any additional relevant question, which is
turned into text and given as input to the model along with the image. This process can be repeated until all the
user's questions are answered.
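A minimal sketch of this answer loop is shown below. It is illustrative only: `vqa_model` is assumed to be a trained two-input Keras model, `encode_image`, `encode_question`, `speech_to_text` and `answer_vocab` are hypothetical helpers (not defined in this report), and the pyttsx3 library stands in for the voice output module.

```python
# Illustrative inference loop for the block diagram above (hedged sketch).
# Assumptions: `vqa_model` is a trained Keras model with [image, question] inputs,
# `encode_image`/`encode_question`/`speech_to_text` are hypothetical helpers,
# and `answer_vocab` maps class indices back to answer strings.
import numpy as np
import pyttsx3

def answer_loop(vqa_model, image_path, answer_vocab):
    engine = pyttsx3.init()                      # text-to-speech engine for voice output
    img_feat = encode_image(image_path)          # fine-grained image features
    while True:
        question = speech_to_text()              # user's spoken question converted to text
        if not question:
            break                                # stop when the user has no more questions
        q_feat = encode_question(question)       # fine-grained question features
        probs = vqa_model.predict([img_feat[np.newaxis], q_feat[np.newaxis]])[0]
        answer = answer_vocab[int(np.argmax(probs))]
        engine.say(answer)                       # speak the predicted answer to the user
        engine.runAndWait()
```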


Fig. 3.2: Baseline Model 1 Fusion of features extracted from VGG16 and GloVe

Fig.3.3: Baseline Model 2 Fusion of features extracted from ResNet and GloVe

The baseline model is taken from the Visual Question Answering paper, where image features are extracted
through a VGG16 network and question features are extracted through GloVe embeddings passed through an
LSTM to capture the word sequence. For model fusion, the two feature vectors are multiplied element-wise and
passed through a fully connected layer to predict the answer.


Here we experimented with VGG16 for image feature extraction (Fig. 3.2) as well as ResNet152 V2 for
image feature extraction (Fig. 3.3), and we compare the results.
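A rough Keras sketch of this baseline fusion is given below. The 1024-dimensional feature size and the 2-way yes/no output are assumptions for illustration; they are not the report's exact settings.

```python
# Baseline fusion sketch: element-wise multiply of the image and question feature
# vectors, then a fully connected layer and a softmax over the answers.
from tensorflow.keras import layers, Model

img_feat = layers.Input(shape=(1024,), name="image_features")      # from the image encoder
ques_feat = layers.Input(shape=(1024,), name="question_features")  # from the GloVe+LSTM encoder

fused = layers.Multiply()([img_feat, ques_feat])        # element-wise feature fusion
hidden = layers.Dense(512, activation="tanh")(fused)
output = layers.Dense(2, activation="softmax")(hidden)  # yes / no

baseline = Model([img_feat, ques_feat], output)
baseline.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])
```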

3.2 DATASET
The VQA v2.0 dataset is used. The dataset consists of 4,53,757 question-and-answer pairs with 82,433 images.
Out of these, 40,000 samples are used, consisting of complex indoor images as well as occluded images. Only
yes-or-no type questions are retained. Of these samples, 80% (i.e., 32,000 samples) are used for training and the
remaining 20% (i.e., 8,000 samples) are used for testing.
From Fig. 3.4 it can be seen that the majority of the answers are of the yes or no type; hence only yes-or-no
questions are considered, as a concession to computational cost. Fig. 3.5 shows the length of each
question. It is important to set the question length so that the information is preserved; hence we set the
maximum length of a question to 24 words, which leaves flexibility for questions at the time of testing.
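The filtering and 80/20 split described above could be sketched as follows, assuming the annotations have been loaded into a pandas DataFrame with `answer` and `question` columns (the column names and loading step are assumptions, not the report's actual preprocessing code).

```python
# Keep only yes/no answers, cap at 40,000 samples, then split 80/20.
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_split(annotations: pd.DataFrame, n_samples=40000, max_q_len=24):
    yes_no = annotations[annotations["answer"].isin(["yes", "no"])]
    yes_no = yes_no[yes_no["question"].str.split().str.len() <= max_q_len]
    subset = yes_no.sample(n=min(n_samples, len(yes_no)), random_state=42)
    train_df, test_df = train_test_split(subset, test_size=0.2, random_state=42)
    return train_df, test_df   # 32,000 training / 8,000 testing samples
```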

Fig. 3.4: Graph showing the count of each answer in VQA v2 dataset.


Fig. 3.5: Graph showing the number of words in each question.

3.3 METHOD
3.3.1) IMAGE ENCODER

Fig 3.6: Image Encoder.

The image is given as input to the model. The input image size is 224 x 224 pixels, and the network takes an
input image with 3 channels; the height and width must be multiples of 32. The next step is to extract features
from the image. In this network we use VGG16 to extract the features of the image. VGG16 extracts the features
from the image and passes the obtained data to a fully connected layer, in which each neuron gets input from the
neurons of the previous layer. The dense layer is used to classify the image based on the output of the
convolutional layers. The calculated data is then passed to the next block, which is a tanh (hyperbolic tangent)
activation function. Sigmoid and tanh are among the most widely used activation functions; tanh is a shifted and
stretched version of the sigmoid.
With all this processing, the image features are extracted successfully and are ready for further use.
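An illustrative Keras sketch of this image encoder is shown below, using the pretrained VGG16 "fc2" layer as the feature source and a tanh dense layer on top. The 1024-unit projection size is an assumption, and the input images are assumed to have been preprocessed with the standard VGG16 preprocessing beforehand.

```python
# Image encoder sketch: frozen VGG16 backbone -> fully connected layer -> tanh.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, Model

base = VGG16(weights="imagenet", include_top=True)        # expects 224 x 224 x 3 inputs
base.trainable = False
fc2_extractor = Model(base.input, base.get_layer("fc2").output)   # 4096-d image descriptor

img_in = layers.Input(shape=(224, 224, 3))                 # images assumed already preprocessed
features = fc2_extractor(img_in)
encoded = layers.Dense(1024, activation="tanh")(features)  # fine-grained image feature
image_encoder = Model(img_in, encoded)
```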

3.3.1.1) VGG16:
VGG16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group Lab at Oxford
University in the 2014 publication "Very Deep Convolutional Networks for Large-Scale Image Recognition".
The model took first and second place in the two tasks of the 2014 ILSVRC challenge described below.
ILSVRC is the acronym for the ImageNet Large Scale Visual Recognition Challenge. Several teams compete
in this competition each year, primarily completing two tasks. The first is object localization, which locates
items in images from 200 different classifications. The second is image classification, which labels
each image with one of a thousand categories.
This model achieves a top-5 test accuracy of 92.7% on the ImageNet dataset, which contains 14 million images
belonging to 1000 classes.


Objective:
The ImageNet dataset contains images of a fixed size of 224x224 and has RGB channels. So, we have a tensor
of (224, 224, 3) as our input. This model processes the input image and gives output as a vector of 1000
values.

Fig. 3.7: VGG16 Architecture. (vgg-16-easiest-explanation, n.d.)

Results: In the 2014 ILSVRC challenge, VGG-16 was one of the best-performing architectures. It was the
runner-up in the classification task with a top-5 classification error of 7.32% (only behind Google's GoogLeNet
with a classification error of 6.66%). It was also the winner of the localization task with a 25.32% localization
error.

Limitations of VGG 16:


· It is very slow to train (the original VGG model was trained on Nvidia Titan GPUs for 2-3 weeks).
· The size of the VGG-16 trained ImageNet weights is 528 MB, so it takes quite a lot of disk space and
bandwidth, which makes it inefficient.
· Its 138 million parameters can lead to the exploding gradients problem.

3.3.1.2) ResNet:
One of the well-known deep learning models, the Residual Network, often known as ResNet, was first presented
in a paper by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. The paper, titled "Deep Residual Learning
for Image Recognition", was published in 2015 [1]. The ResNet model is one of the most widely used and
effective deep learning models to date.

Residual blocks make up the ResNet. With the advent of these residual blocks, the issue of training
extremely deep networks has been largely resolved.
The architecture, which takes its cue from VGG-19, consists of a 34-layer plain network with the addition of
shortcut or skip connections.
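A minimal Keras sketch of one such residual (shortcut) block is given below, to illustrate the skip connection rather than the exact ResNet-152 v2 layer configuration used in this project; the filter count is an assumption.

```python
# One basic residual block: output = ReLU(F(x) + x), where F is two conv layers.
# Assumes the input already has `filters` channels so the identity addition is valid.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                        # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                     # add the skip connection
    return layers.Activation("relu")(y)
```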

Fig. 3.8: “Network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a
reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network
with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more
details and other variants”. (K. He, 2016)


Result & Conclusion:

On the ImageNet dataset, the authors utilize a 152-layer ResNet, which is 8 times deeper than VGG19 but still
has fewer parameters. It produced a top-5 error of just 3.57% on the ImageNet test set, a performance that won
the ILSVRC 2015 competition. Because of its extremely deep representation, it also produces a 28% relative
improvement on COCO object detection.

Contrary to the plain network, the error rate on the ImageNet validation set reduces as the number of layers
increases from 18 to 34. The difficulty brought on by adding layers can thus be resolved using shortcut
connections.
The ResNet architecture won the 2015 ImageNet classification competition because it had the lowest top-5 error
rate (3.57%).
Below is the ImageNet output.

Table. 3.1: ImageNet Output.

3.3.2 QUESTION ENCODER

Fig. 3.9: Question Encoder.

The question is given as input to the model. Question features are extracted using the GloVe model. GloVe uses a
co-occurrence matrix to find closely related words that occur together, so that words with similar meanings are
placed close together. This helps in getting a vector representation of the words. After this, the result from GloVe
is passed to a tanh activation function. To get a more accurate result, the data obtained from the tanh activation
function is passed to a long short-term memory (LSTM) network. The LSTM network has the ability to remember
the entire sentence. The output obtained from the LSTM network is passed again to a tanh activation function.
After processing, the results are then passed to a dense layer, which gives the fine-grained question feature.
With all this processing, the extracted question feature is ready for further use.
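A rough Keras sketch of this question encoder follows. The vocabulary size, the 300-dimensional GloVe vectors, the 24-word maximum length and the 512 LSTM units are assumptions consistent with the description above, and the pretrained embedding matrix is assumed to have been built from the GloVe vectors beforehand.

```python
# Question encoder sketch: GloVe embedding -> LSTM -> tanh dense layer.
import numpy as np
from tensorflow.keras import layers, Model, initializers

vocab_size, embed_dim, max_len = 10000, 300, 24
embedding_matrix = np.zeros((vocab_size, embed_dim))   # filled from the GloVe vectors in practice

q_in = layers.Input(shape=(max_len,), dtype="int32")
emb = layers.Embedding(vocab_size, embed_dim,
                       embeddings_initializer=initializers.Constant(embedding_matrix),
                       trainable=False)(q_in)
seq = layers.LSTM(512)(emb)                                      # summarises the whole question
question_encoding = layers.Dense(1024, activation="tanh")(seq)   # fine-grained question feature
question_encoder = Model(q_in, question_encoding)
```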

3.3.2.1) GloVe
GloVe, which stands for Global Vectors for Word Representation, was created by researchers at Stanford
University with the intention of creating word embeddings by aggregating global word co-occurrence statistics
from a specified corpus.

The fundamental idea behind the GloVe word embedding is to infer relationships between words using
statistics. The frequency of a certain word pair can be found by looking at the co-occurrence matrix: each value
in the co-occurrence matrix represents how often a pair of words occurs together.

The co-occurrence matrix lists the frequency with which terms appear together in a corpus, and the non-zero
entries of this global word-word co-occurrence matrix are used to train the GloVe vectors.
The diagram below is an example of a co-occurrence matrix.

Table. 3.2: Co-occurrence Matrix


(constructing-a-co-occurrence-matrix-in-python-pandas, n.d.)
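A toy sketch of building such a co-occurrence matrix from a small corpus is shown below; the corpus and window size are purely illustrative, and real GloVe training uses a very large corpus and then factorizes this matrix into word vectors.

```python
# Count how often each pair of words appears within a +/-1 word window.
from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
window = 1
cooc = defaultdict(int)

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[(w, words[j])] += 1   # symmetric co-occurrence count

print(cooc[("sat", "on")])   # 2 -> "sat" and "on" co-occur once in each sentence
```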

3.3.2.2) LSTM
LSTM is an acronym for long short-term memory. Unlike typical feedforward neural networks, an LSTM
incorporates feedback connections. With the invention of Recurrent Neural Networks (RNNs) trained using
backpropagation, handling the vanishing gradient problem appeared to demand a more sophisticated solution.
The LSTM technique, created by Hochreiter and Schmidhuber, was crucial in solving the vanishing gradient
problem.

Fig. 3.10: Comparison of RNN and LSTM:

Normal recurrent nodes are replaced by memory cells in an LSTM. A memory cell is a composite unit made of
more basic nodes connected in a particular way, with the innovative addition of multiplicative gating nodes.
Input gates, forget gates, and output gates are the three types of gates that LSTMs use to regulate the
information flow between layers. The output of the LSTM's hidden layer includes the internal state of the
memory cell and the hidden state. The internal state of the memory cell is kept completely internal, and only
the hidden state is transmitted to the output layer. In this way LSTMs can prevent gradients from vanishing or
exploding.
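For illustration, a single LSTM time step with the three gates can be written directly in NumPy as below; the weights here are random placeholders, and in practice a library implementation such as tf.keras.layers.LSTM is used instead.

```python
# One LSTM time step: input, forget and output gates regulate the memory cell.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of all four transforms stacked as [i, f, o, g].
    z = W @ x + U @ h_prev + b
    n = h_prev.shape[0]
    i = sigmoid(z[0 * n:1 * n])     # input gate
    f = sigmoid(z[1 * n:2 * n])     # forget gate
    o = sigmoid(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:4 * n])     # candidate cell update
    c = f * c_prev + i * g          # new internal cell state
    h = o * np.tanh(c)              # new hidden state passed onward
    return h, c

n_in, n_hid = 8, 4
W = np.random.randn(4 * n_hid, n_in)
U = np.random.randn(4 * n_hid, n_hid)
b = np.zeros(4 * n_hid)
h, c = lstm_step(np.random.randn(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```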


Fig. 3.11: LSTM Cell Architecture


3.3.3 Model Fusion
3.3.3.1) Stacked Attention Network (SAN):
Stacked attention networks (SANs) apply multiple layers of attention over specific portions of an image, based on
the query, to answer questions that call for multi-step reasoning. The stacked attention network was introduced
in the paper "Stacked Attention Networks for Image Question Answering" by authors from Carnegie Mellon
University and Microsoft Research. To answer questions that call for multiple steps of reasoning, stacked attention
networks are used: these networks look for areas in images that correspond to the answers by using the semantic
representation of the question as the search term. In a SAN, an image is progressively queried in order to obtain
results.
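One attention layer of a SAN can be sketched in NumPy as below, following the general formulation of the Stacked Attention Networks paper; the region count, feature size and hidden size are illustrative, and the weights shown are random placeholders rather than trained parameters.

```python
# One SAN attention layer: the question vector u queries the image region features V,
# and the attended visual vector is added back to u as the refined query for the next layer.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def san_attention_layer(V, u, W_v, W_q, w_p):
    # V: (num_regions, d) image region features; u: (d,) question/query vector
    h = np.tanh(V @ W_v + u @ W_q)      # (num_regions, k) joint representation
    p = softmax(h @ w_p)                # attention weight for each image region
    v_att = p @ V                       # weighted sum of region features, shape (d,)
    return v_att + u                    # refined query fed to the next attention layer

d, k, regions = 1024, 512, 196
V = np.random.randn(regions, d)
u = np.random.randn(d)
W_v, W_q, w_p = np.random.randn(d, k), np.random.randn(d, k), np.random.randn(k)
u_refined = san_attention_layer(V, u, W_v, W_q, w_p)
```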

Fig. 3.12: SAN with customized attention layer

Four datasets were used in the experiments: DAQUAR-ALL, which has 6,795 training questions and 5,673 test
questions; DAQUAR-REDUCED, a scaled-down version of DAQUAR-ALL; VQA; and COCO-QA.
The figure below depicts how the attention is stacked.


Fig. 3.13: Visualization of attention in SAN.

3.3.3.2) Customization in the Stacked Attention Network

1. Addition of Self-Focus Network to the image features and question features
Self-focus is a mechanism in which the network finds the parts of the image that should receive more attention.
It highlights the relevant parts and reduces the importance of the irrelevant parts of the image.
                    h_i = F(v_i)                              eqn. (1)
                    output features = v_i × h_i               eqn. (2)
Here, in eqn. (1), the image feature from the image encoder is taken as input and passed through a fully
connected layer with a sigmoid activation function, which gives a score h_i as output. This score gives the
importance of each part of the image. Hence the score is multiplied with the image feature, as shown in eqn. (2).

Fig. 3.14: Self-Focus Network (SFA)
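A small NumPy sketch of this self-focus operation on the image features is given below; the feature shape and random weights are illustrative assumptions, and the same gating can equally be applied to the question features.

```python
# Self-focus sketch: a sigmoid score per region re-weights the image features.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_focus(V, W, b):
    # V: (num_regions, d) image features; W: (d,), b: scalar -> one score per region
    h = sigmoid(V @ W + b)        # eqn. (1): h_i = F(v_i), importance of each region
    return V * h[:, None]         # eqn. (2): output feature v_i * h_i

V = np.random.randn(196, 1024)
focused = self_focus(V, np.random.randn(1024), 0.0)
```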


Fig. 3.15: Addition of Self Focus Network to image feature

Fig. 3.16: Addition of Self Focus Network to image feature and question feature

2. Addition of GloVe Embeddings as the Question Feature Extractor

GloVe stands for Global Vectors for Word Representation; it converts each word into a vector representation
that describes the relation between words. It helps in increasing the accuracy and reducing the loss.

3. Increase in Number of Attention Layers 


Increasing the number of attention layers showed an increase in accuracy, as the model concentrates more on the
relevant features and hence predicts the correct answers.

3.3.4) ANSWER PREDICTION


Finally, the output of the stacked attention network is passed to a softmax layer to obtain a probability distribution over the candidate answers, and the answer with the highest probability is predicted.
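
A minimal sketch of the answer-prediction head, assuming a Keras-style model; the fused feature size and the answer-vocabulary size are assumptions.

from tensorflow.keras import Input, layers

NUM_ANSWERS = 1000                 # assumed size of the candidate-answer vocabulary
fused = Input(shape=(512,))        # output of the stacked attention network (size assumed)
answer_probs = layers.Dense(NUM_ANSWERS, activation="softmax")(fused)
# The predicted answer is the class with the highest probability.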

3.3.5) Hyper Parameters


3.3.5.1). Optimizer 
Adam is used as the optimizer, with an initial learning rate of 0.001 that is gradually reduced using a learning-rate scheduler. Whenever we train a neural network, the output we get from the network differs from the actual output; the function that measures this difference is called the loss function or cost function.


Optimizers are methods that reduce the loss by adjusting the attributes of the neural network, such as its weights and learning rate. Among the many types of optimizers, this model uses the Adam optimizer. Adam is an abbreviation of "Adaptive Moment Estimation". The Adam optimizer combines two ideas:
i) Momentum
ii) Root Mean Squared Propagation (RMSProp)
Momentum smooths the updates, while RMSProp allows the learning rate to be adapted efficiently. Although Adam is computationally more costly, it is fast compared to other optimizers and converges rapidly; it also mitigates high variance and a vanishing learning rate. This optimizer works well for many deep learning models. A sketch of the training set-up is given below.
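
A minimal sketch of the training configuration, assuming a Keras-style model; the learning rate, epochs, and batch size follow the values reported in Chapter 4, while the scheduler settings and the placeholder names model, train_images, train_questions, train_answers (and their validation counterparts) are assumptions.

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # initial learning rate
    loss="categorical_crossentropy",
    metrics=["accuracy"])

# Learning-rate scheduler: halve the rate when the validation loss plateaus.
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=2)

model.fit([train_images, train_questions], train_answers,
          validation_data=([val_images, val_questions], val_answers),
          epochs=20, batch_size=32,
          callbacks=[lr_scheduler])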

3.3.5.2). Loss Function


Categorical cross-entropy is the loss function used for classification problems; it measures the loss that is minimized to optimize the model (cross-entropy-loss-function, n.d.). The loss function is

        L = - Σ_i t_i log(p_i)

where t_i is the truth label and p_i is the softmax probability for the i-th class. Categorical cross-entropy is mainly used when the true labels are one-hot encoded; for example, for a 4-class classification problem the true values are [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1].
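As an illustrative example (not taken from the report): if the true one-hot label is [1, 0, 0, 0] and the model's softmax output is [0.7, 0.1, 0.1, 0.1], the loss is -log(0.7) ≈ 0.357, whereas a confident correct prediction such as [0.99, 0.005, 0.003, 0.002] gives a much smaller loss of -log(0.99) ≈ 0.01.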


CHAPTER 4: RESULT AND DISCUSSION


Learning Rate: 0.001
No. of epochs: 20
Batch size: 32
Loss Function: Categorical Cross entropy
Optimizer: Adam

Model                                                                                  Train Loss   Test Loss   Train Accuracy   Test Accuracy
Baseline model with VGG16                                                                0.8363       0.9404        0.5664           0.4924
Baseline model with ResNet                                                               0.6483       0.8835        0.7822           0.5322
GloVe + VGG16 + Single Attention Layer
GloVe + VGG16 + Double Attention Layer
GloVe + VGG16 + Triple Attention Layer
GloVe + VGG16 + Triple Attention Layer + SFN on Image Features
GloVe + VGG16 + Triple Attention Layer + SFN on Image Features and Question Features


CHAPTER 5: CONCLUSION AND FUTURE WORK

In this paper, a stacked attention network improved with the addition of a self-focus network is proposed to solve the Visual Question Answering (VQA) task. The model is trained and tested on yes/no type questions, mainly on indoor images. The customised model is tested and compared with the pre-existing models, and it shows an improvement in accuracy on the VQA task.

From the results it can be seen that increasing the number of attention layers improved performance by reducing the loss, and a further improvement is seen after adding the self-focus network. This shows the effectiveness of the self-focus network architecture.

However, the level of improvement in accuracy is not yet satisfactory, as the main use case of the VQA task is to help visually impaired people, and the model overfits because of the limited amount of data. Hence the network can be improved by training on more data with more computational power.


REFERENCES

[1] F. Liu, J. Liu, Z. Fang, R. Hong and H. Lu, "Visual Question Answering With Dense Inter- and Intra-
Modality Interactions," in IEEE Transactions on Multimedia, vol. 23, pp. 3518-3529, 2021, doi:
10.1109/TMM.2020.3026892.

[2] W. Guo, Y. Zhang, J. Yang and X. Yuan, "Re-Attention for Visual Question Answering," in IEEE
Transactions on Image Processing, vol. 30, pp. 6730-6743, 2021, doi: 10.1109/TIP.2021.3097180.

[3] Z. Yu, J. Yu, C. Xiang, J. Fan and D. Tao, "Beyond Bilinear: Generalized Multimodal Factorized High-
Order Pooling for Visual Question Answering," in IEEE Transactions on Neural Networks and Learning
Systems, vol. 29, no. 12, pp. 5947-5959, Dec. 2018, doi: 10.1109/TNNLS.2018.2817340.

[4] C. Chen, D. Han and J. Wang, "Multimodal Encoder-Decoder Attention Networks for Visual Question
Answering," in IEEE Access, vol. 8, pp. 35662-35671, 2020, doi: 10.1109/ACCESS.2020.2975093.

[5] L. Peng, Y. Yang, Z. Wang, Z. Huang and H. T. Shen, "MRA-Net: Improving VQA Via Multi-Modal
Relation Attention Network," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no.
1, pp. 318-329, 1 Jan. 2022, doi: 10.1109/TPAMI.2020.3004830.

[6] H. Zhong, J. Chen, C. Shen, H. Zhang, J. Huang and X. -S. Hua, "Self-Adaptive Neural Module
Transformer for Visual Question Answering," in IEEE Transactions on Multimedia, vol. 23, pp. 1264-1273,
2021, doi: 10.1109/TMM.2020.2995278.

[7] Y. Liu, X. Zhang, F. Huang, L. Cheng and Z. Li, "Adversarial Learning With Multi-Modal Attention for
Visual Question Answering," in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 9,
pp. 3894-3908, Sept. 2021, doi: 10.1109/TNNLS.2020.3016083.

[8] K. C. Shahira and A. Lijiya, "Towards Assisting the Visually Impaired: A Review on Techniques for
Decoding the Visual Data From Chart Images," in IEEE Access, vol. 9, pp. 52926-52943, 2021, doi:
10.1109/ACCESS.2021.3069205.

[9] D. Gao, R. Wang, S. Shan and X. Chen, "Learning to Recognize Visual Concepts for Visual Question
Answering With Structural Label Space," in IEEE Journal of Selected Topics in Signal Processing, vol. 14,
no. 3, pp. 494-505, March 2020, doi: 10.1109/JSTSP.2020.2989701.

[10] Y. Zhou et al., "Plenty is Plague: Fine-Grained Learning for Visual Question Answering," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 697-709, 1 Feb. 2022, doi:
10.1109/TPAMI.2019.2956699.

[11] L. Zhang et al., "Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering,"
in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 10, pp. 4362-4373, Oct. 2021,
doi: 10.1109/TNNLS.2020.3017530.

[12] J. Yu et al., "Reasoning on the Relation: Enhancing Visual Representation for Visual Question
Answering and Cross-Modal Retrieval," in IEEE Transactions on Multimedia, vol. 22, no. 12, pp. 3196-3209,
Dec. 2020, doi: 10.1109/TMM.2020.2972830.

[13] K. Terao, T. Tamaki, B. Raytchev, K. Kaneda and S. Satoh, "An Entropy Clustering Approach for
Assessing Visual Question Difficulty," in IEEE Access, vol. 8, pp. 180633-180645, 2020, doi:
10.1109/ACCESS.2020.3022063.

[14] X. Zheng, B. Wang, X. Du and X. Lu, "Mutual Attention Inception Network for Remote Sensing Visual
Question Answering," in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-14, 2022, Art
no. 5606514, doi: 10.1109/TGRS.2021.3079918.

[15] Y. Tao, Z. Zongyang, Z. Jun, C. Xinghua and Z. Fuqiang, "Low-altitude small-sized object detection
using lightweight feature-enhanced convolutional neural network," in Journal of Systems Engineering and
Electronics, vol. 32, no. 4, pp. 841-853, Aug. 2021, doi: 10.23919/JSEE.2021.000073.

[16] Z. Yang, X. He, J. Gao, L. Deng and A. Smola, "Stacked Attention Networks for Image Question Answering," arXiv:1511.02274, 2015, doi: 10.48550/arXiv.1511.02274.

[17] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Batra and D. Parikh, "VQA: Visual Question Answering," arXiv:1505.00468 [cs.CL].
