PES UNIVERSITY

(Established under Karnataka Act No. 16 of 2013)


100-ft Ring Road, Bengaluru – 560 085, Karnataka, India

Report on
EXTRACTION OF FINE-GRAINED FEATURES OF A COMPLEX OUTDOOR
IMAGE AND QUESTION FOR VISUAL QUESTION ANSWERING

Submitted by
Aditya V (PES1UG19EC016)
Kaushik G (PES1UG19EC132)
Sinchana S R (PES1UG19EC299)

August – December 2022

Under the guidance of


Prof. Raghavendra M J
Assistant Professor
Department of Electronics and Communication Engineering
PES University
Bengaluru - 560085

FACULTY OF ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
PROGRAM: B. TECH

CERTIFICATE

This is to certify that the project entitled

EXTRACTION OF FINE-GRAINED FEATURES OF A COMPLEX OUTDOOR IMAGE AND QUESTION FOR VISUAL
QUESTION ANSWERING
is a bonafide work carried out by

Aditya V (PES1UG19EC016)
Kaushik G (PES1UG19EC132)
Sinchana S R (PES1UG19EC299)

in partial fulfillment for the completion of the Program of Study B. Tech in Electronics and Communication
Engineering under the rules and regulations of PES University, Bengaluru, during the period August–December
2022. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in
the report. The project report has been approved as it satisfies the 6th-semester academic requirements in
respect of project work.

Signature with date & Seal          Signature with date & Seal          Signature with date & Seal
Prof. Raghavendra M J Dr. Anuradha M Dr. Keshavan B K
Internal Guide Chairperson Dean of Faculty

Name and Signature of Examiners:


NAME: SIGNATURE:

DECLARATION

We, Aditya V, Kaushik G, and Sinchana S R, hereby declare that the project entitled "EXTRACTION OF
FINE-GRAINED FEATURES OF A COMPLEX OUTDOOR IMAGE AND QUESTION FOR VISUAL
QUESTION ANSWERING" is an original work done by us under the guidance of Prof. Raghavendra M J,
Assistant Professor, Department of Electronics and Communication Engineering, PES University,
and is being submitted in partial fulfilment of the requirements for completion of the 6th-semester course work
in the Program of Study B.Tech in Electronics and Communication Engineering.

PLACE: BENGALURU
DATE: 14/12/2022

NAME AND SIGNATURE OF THE CANDIDATE

Aditya V (PES1UG19EC016)

Kaushik G (PES1UG19EC132)

Sinchana S R (PES1UG19EC299)

ACKNOWLEDGEMENT

We would like to take this opportunity to thank Prof. RAGHAVENDRA M J, Assistant Professor,
Department of Electronics and Communication Engineering, PES University, for his
persistent guidance, suggestions, assistance, and encouragement throughout the
development of this project. It was an honor for us to work on the project under his
supervision and to understand its importance at various stages.

We are grateful to the project coordinator, Prof. Rajasekar M, Department of
Electronics and Communication Engineering, for organizing, managing, and helping
with the entire process.

We would like to thank Dr. ANURADHA M, Professor and Chairperson, Department of
Electronics and Communication Engineering, PES University, for her invaluable support,
and we are thankful for the continuous support given by the department. We are also very grateful
to all the professors and non-teaching staff of the Department who have directly or
indirectly contributed to enriching our work.

We would like to thank the Chancellor Dr. M R DORESWAMY, the Pro-Chancellor Prof. JAWAHAR
DORESWAMY, the Vice Chancellor Dr. SURYAPRASAD J, the Registrar Dr. K S
SRIDHAR, and the Dean of Engineering Dr. KESHAVAN B K of PES University for
giving us this wonderful opportunity to complete this project by providing all the
necessary facilities. Their support has motivated us throughout our work.

We are grateful and obliged to our families and friends for providing the energy and
encouragement when we needed them the most.


ABSTRACT
Visually challenged people face many difficulties due to loss of vision. Vision is one of the most important senses
for human beings to complete their daily activities, and many developments and efforts have been made to assist
the visually impaired. We too are making an effort to assist the visually impaired by using Visual Question
Answering (VQA). We aim to develop a deep learning algorithm that extracts fine-grained features from a complex
outdoor image as well as a question for Visual Question Answering. The model takes two inputs, an image and a
question, and our VQA algorithm extracts fine-grained features from both. The answer-generating model predicts
an answer based on these features, which is given to the user after it is converted to speech. The user can then ask
any related question, which is given as input along with the image, and the process repeats. The VQA v1.0 dataset
is used: a total of 40,000 samples consisting of complex outdoor images and occluded images, of which 80% are
used for training and the remaining 20% for testing. VGG16 and ResNet are used as image encoders: VGG16
locates objects in images from 200 different classes and labels each image with one of a thousand categories,
while ResNet provides the convolutional feature map used for attention. The question is encoded using GloVe
word embeddings and an LSTM.
We propose to solve the Visual Question Answering (VQA) task using a stacked attention network augmented
with visual and textual attention. The model is trained and tested mainly on yes/no type questions over complex
outdoor images. We use a stacked attention network to answer questions that call for multiple steps of reasoning:
such networks use the semantic representation of the question as a query to search for the image regions that
correspond to the answer, so the image is queried progressively to obtain the result. We use the Adam optimizer.
Optimizers are methods that reduce the losses of the network by changing the attributes of the neural network,
such as its weights and learning rate. Whenever we train a neural network, the output differs from the expected
output; this difference is measured by the loss (or cost) function.


TABLE OF CONTENTS

Abstract

Acknowledgement

Chapter 1: Introduction

1.1 Motivation

1.2 Problem Statement

1.3 Report Organisation

Chapter 2: Literature Survey

2.1 “Visual Question Answering with Dense Inter- and Intra-Modality Interactions”
2.2 “Re-Attention for Visual Question Answering”
2.3 “Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering”
2.4 “Multimodal Encoder-Decoder Attention Networks for Visual Question Answering”
2.5 “MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network”
2.6 “Self-Adaptive Neural Module Transformer for Visual Question Answering”
2.7 “Adversarial Learning with Multi-Modal Attention for Visual Question Answering”
2.8 “Towards Assisting the Visually Impaired: A Review on Techniques for Decoding the Visual Data from Chart Images”
2.9 “Learning to Recognize Visual Concepts for Visual Question Answering with Structural Label Space”
2.10 “Plenty is Plague: Fine-Grained Learning for Visual Question Answering”
2.11 “Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering”
2.12 “Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval”
2.13 “An Entropy Clustering Approach for Assessing Visual Question Difficulty”
2.14 “Mutual Attention Inception Network for Remote Sensing Visual Question Answering”
2.15 “Low-altitude small-sized object detection using lightweight feature-enhanced convolutional neural network”
2.16 “Stacked Attention Networks for Image Question Answering”

Chapter 3: Methodology

3.1 Block Diagram

3.2 Dataset

3.3 Method

3.3.1.1 Image Encoder
3.3.1.2 VGG16
3.3.1.3 ResNet
3.3.2 Question Encoder
3.3.2.1 GloVe
3.3.2.2 LSTM
3.3.3 Model Architecture and Visualization
3.3.3.1 Stacked Attention Network
3.3.3.2 Customization in the Stacked Attention Network
3.4 Development towards the Question Generator Module
3.4.1 Image Captioning
3.5 Question Generator
3.6 System Requirement Specification

Chapter 4: Results and Discussion

Chapter 5: Conclusion and Future Scope

References


LIST OF TABLES

Table No.  Table Name

2.1   Comparison with Previous State-of-the-Art Methods on the VQA 1.0 Dataset
2.2   Performance on the Validation Split of the VQA v2 Dataset
2.3   Overall Accuracies on the Test-Dev and Test-Challenge Sets of the VQA-2.0 Dataset
2.4   Results of a Single Model Compared with the State-of-the-Art Models on the Test-Dev and Test-Std Sets
2.5   Results on the VQA-2.0 Validation Set
2.6   Ablation Study on the VQA v1.0 and COCO-QA Datasets ("*" Denotes Variant Implementations of the Model)
2.7   Qualitative Analysis of Different Model Performances (Accuracy) in Reasoning
2.8   Performance (%) on the VQA-CP v2 Test Set and VQA v2 Validation Set
2.9   Evaluations of BUTD on VQA 2.0 Test-Dev with the Visual Genome Dataset
2.10  Overall Accuracy Results Compared with Other Methods on the FVQA Dataset
2.11  Comparison of Performance of Our Model with the State-of-the-Art Methods on the MS-COCO Dataset
2.12  Performance Comparison on the RSIVQA Dataset
2.13  Performance Comparison on the Remote Sensing VQA Dataset
2.14  Results on the MS-COCO Dataset
2.15  VQA Results on Our Partition, in Percentage


CHAPTER 1: INTRODUCTION
In the field of research known as "visual question answering" (VQA), the goal is to create a computer system that
can answer questions posed in natural language about a given image.
The challenge of visual question answering was introduced to link computer vision with natural language
processing (NLP) and to push the frontiers of both domains. Computer vision is the study of techniques for
gathering, analyzing, and comprehending images; its main objective is to teach machines to see.
NLP, on the other hand, focuses on making it possible for computers and people to communicate in
natural language. NLP and computer vision both fall under the umbrella of artificial intelligence, and they
both use comparable machine learning-based techniques.
Our model focuses on developing a visual question-answering (VQA) model that receives a question and an
outdoor image as input and extracts fine-grained features from both to produce an answer. A question generator
model is then fed with the obtained answer to create more specific questions about it. To obtain refined responses,
the VQA model is fed with the refined questions as inputs.

Figure 1.1 Example 1: Fine-grained feature extraction

In this example, a general question might be "Is the man playing tennis?", for which the answer is "Yes"; using
fine-grained feature extraction we can also answer a question such as "What is the colour of the ball?" with
"Yellow".


CHAPTER 1.1: MOTIVATION


Understanding the contents of an image is the core task of VQA, but doing so frequently requires
non-visual knowledge, ranging from "common sense" to topic-specific information. The motive behind this
project is to develop a better understanding of the objects in an image and of the type of question asked.

The scope of this project is as follows:

• Learning aids for children.
• Image processing in space organizations.
• Assistance to colour-blind users.
• Interactive robotic systems.

CHAPTER 1.2: PROBLEM STATEMENT


To develop an algorithm to extract fine-grained features of a complex outdoor image and question for
Visual Question Answering, to generate probable questions, and to obtain a refined answer.

CHAPTER 1.3: REPORT ORGANISATION

The rest of this report is organised as follows. Chapter 2 surveys the related literature on visual question
answering. Chapter 3 describes the methodology: the block diagram, the dataset, the image and question
encoders, the attention network, and the question generator module. Chapter 4 presents the results and
discussion, and Chapter 5 concludes the report and outlines the future scope.


CHAPTER 2: LITERATURE SURVEY

2.1 “Visual Question Answering with Dense Inter- and Intra-Modality Interactions”. (F. Liu, 2021)

The authors of this paper are Fei Liu, Jing Liu, Zhiwei Fang, Richang Hong, and Hanqing Lu.
This paper proposes a novel DenIII framework for Visual Question Answering, which captures more
fine-grained multimodal information by performing dense inter- and intra-modality interactions. It also proposes
efficient inter- and intra-AC to connect different and same modalities. VQA v1.0, VQA v2.0, and TDIUC are
the datasets used in this paper.
The methodologies used here are:
• Feature extraction: initial image features are extracted using Faster R-CNN.
• Dense inter- and intra-modality interactions: a GRU with hidden units is adopted as the question
encoding layer, followed by a non-linear layer.
• Attention mechanism and answer prediction: a maximum accuracy of 69% is achieved on the
VQA v1.0 dataset. The DenIII framework achieves competitive performance on all three datasets used,
which indicates the framework is modelled well.
The limitation of this model is that if one object in an image is occluded by another object, the model obtains an
inaccurate count of objects. When the model focuses on objects shown on a TV screen, it ignores the global
context of the TV; thus, the model is unaware that the objects are actually pictures on the television.

Model                   test-dev                              test-std
                        Yes/No  Number  Others  Overall       Yes/No  Number  Others  Overall
Memory-augmented Net    81.5    39.0    54.0    63.8          81.7    37.6    54.7    64.1
QGHC                    83.5    38.1    57.1    65.9          -       -       -       65.9
Dual-MFA                83.6    40.2    56.8    66.0          83.4    40.4    56.9    66.1
VKMN                    83.7    37.9    57.0    66.0          84.1    38.1    56.9    66.1
MFH                     85.0    39.7    57.4    66.8          85.0    39.5    57.4    66.9
DCN                     84.6    42.4    57.3    66.9          85.0    42.3    57.0    67.0
DA-NTN                  85.8    41.9    58.6    67.9          85.8    42.5    58.5    68.1
CoR                     85.7    44.1    59.1    68.4          85.8    43.9    59.1    68.5
DenIII (ours)           86.7    44.1    59.7    69.1          86.8    43.3    59.4    69.0

Table 2.1 Comparison with Previous State-of-the-Art Methods on the VQA 1.0 Dataset

2.2 “Re-Attention for Visual Question Answering”. (Guo, Zhang, Yang, & Yuan, 2021)

This is an IEEE paper; the authors are Wenya Guo, Ying Zhang, Jufeng Yang, and Xiaojie Yuan.
This paper proposes a re-attention framework that utilizes the information in answers for the Visual Question
Answering task to extract the image and question features. The datasets used are VQA, VQA v2,
and COCO-QA. The methodology used here is:
• Questions are associated with the relevant objects and unified into a single entity, which is used to predict
the answers.
• The generated answers are used to re-attend to the objects and alter the visual attention map.
• The major advantage is that the re-attention procedure extracts information from the generated answers and
reconstructs the visual attention map, which helps in answer detailing.
• This re-attention framework performs better than the previous paper ("Visual Question Answering
with Dense Inter- and Intra-Modality Interactions"), giving better accuracy.
• It shows the best performance against fusion-, attention-, and reasoning-based methods on all datasets, with an
improvement of 6.28% in overall accuracy on the VQA v2 dataset.
Although this method learns the correct regions, it finds it difficult to answer questions that involve common
sense; a dedicated text understanding module is needed, for example, to accurately identify the time shown on a clock.

Methods                     Yes/No   Overall   Other
Base+Ca                     84.13    66.51     58.13
Base+Ca+Re-att (A)          84.62    66.72     58.00
Base+Ca+Re-att (Q_A)        60.27    50.36     50.88
Base+Ca+Re-att (Q+A)        85.01    67.19     58.32
Base+Ca+E-Re-att (A)        84.76    66.97     58.41
Base+Ca+E-Re-att (Q_A)      66.34    55.47     54.26
Base+Ca+E-Re-att (Q+A)      85.35    67.68     58.83
Table 2.2 Performance on the Validation Split of the VQA v2 Dataset.


2.3 “Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering”. (Zhou Yu, December 2018)

The authors of this paper are Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. This IEEE
paper proposes a model to find good solutions for the following three issues:
• fine-grained feature representations for both the image and the question;
• multimodal feature fusion that can capture the complex interactions between multimodal features; and
• automatic answer prediction that can consider the complex correlations between multiple diverse
answers to the same question.
The datasets used are VQA-1.0 and VQA-2.0. The methodology includes:
• a multimodal factorized bilinear (MFB) pooling approach;
• a multimodal factorized high-order pooling (MFH) method to achieve a more effective fusion of
multimodal features; and
• Kullback–Leibler divergence (KLD) as the loss function to achieve more accurate characterization. The model
achieves 60.7% accuracy on the VQA 1.0 dataset and 68.16% on the VQA 2.0 dataset.
The proposed model works with one third of the parameters and two thirds of the total GPU usage, and it is
robust. However, it fails to analyse the relation between image regions and semantic words in an input image
and its related question.

Model                        Test-Dev   Test-Challenge
vqateam-Prior                -          25.98
vqateam-Language             -          44.34
vqateam-LSTM-CNN             -          54.08
vqateam-MCB                  -          62.33
Adelaide-ACRV-MSR            -          69.00
DLAIT (2nd place)            -          68.07
LV_NUS (4th place)           -          67.62
1 MFB model                  64.98      -
1 MFH model                  65.80      -
7 MFB models                 67.24      -
7 MFH models                 67.96      -
9 MFH models                 68.02      68.16


Table 2.3 Overall Accuracies on the Test-Dev and Test-Challenge Sets of the VQA-2.0 Data Set

2.4 “Multimodal Encoder-Decoder Attention Networks for Visual Question Answering”. (Chongqing Chen, February 2020)

The authors of this paper are Chongqing Chen, Dezhi Han, and Jun Wang.
This is an IEEE paper in which a novel Multimodal Encoder-Decoder Attention Network (MEDAN) is
proposed. MEDAN consists of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth, and it
can capture rich and reasonable question features and image features by associating keywords in the question
with important object regions in the image.
The datasets used are the Visual Genome dataset and the VQA-v2 dataset. The methods used are:
• an attention mechanism for VQA (Multimodal Encoder-Decoder Attention);
• Scaled Dot-Product Attention and Multi-Head Attention; and
• an encoder and a decoder. This model achieves an accuracy 0.56 and 0.66 points higher than DFAF on
test-dev and test-std respectively; it is 0.54 and 0.16 points higher than DFAF and MCAN on test-dev,
and on test-std MEDAN is 0.64 and 0.08 points higher than DFAF and MCAN.
A major advantage of this model is that it uses the encoder to extract fine-grained question features through
self-attention, capturing rich and reasonable question and image features by associating keywords in the question
with important object regions in the image.
The limitations of this model are that its accuracy is lower than MUAN (Multimodal Unified Attention
Networks) and that it is not well suited for indoor images.
 
Model               Test-dev                          Test-std
                    Y/N     Num     Other   All       All
BUTD [1]            81.82   44.21   56.05   65.32     65.67
MFH [33]            85.31   49.56   59.89   68.76     -
BAN [12]            85.42   50.93   60.26   69.52     -
BAN+Counter         85.42   54.04   60.52   70.04     70.35
DFAF [3]            86.09   53.32   60.49   70.22     70.34
MCAN [2]            86.82   53.26   60.72   70.63     70.90
MUAN [4]            86.77   54.40   60.89   70.82     71.10
MEDAN (Adam)        87.10   52.69   60.56   70.60     71.01
MEDAN (AdamW)       87.02   53.57   60.77   70.76     70.98
Table 2.4 Results of a Single Model Compared with the State-of-the-Art Models on the Test-Dev and Test-Std
Sets.

2.5 “MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network”. (Liang Peng, January 2022)

The authors of this paper are Liang Peng, Yang Yang, Zheng Wang, Zi Huang, and Heng Tao Shen. In this paper
a self-guided word relation attention scheme is used to explore the latent semantic relations between
words. Two question-adaptive visual relation attention modules are used to extract not only the fine-grained
and precise binary relations between objects but also the more sophisticated trinary relations.
The datasets used are VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, and the methods used are:
• a self-guided word relation attention scheme to explore the latent semantic relations between words; and
• two question-adaptive visual relation attention modules that extract not only the fine-grained and
precise binary relations between objects but also the more sophisticated trinary relations.
MRA-Net improves the overall accuracy from 67.9 to 69.22 on the VQA-2.0 dataset. A few advantages of this
paper are that MRA-Net can focus on the important words and the latent semantic relations between them to
understand the question more completely, and that it reconciles the appearance feature with the relation feature
effectively, thereby reasoning towards the correct answer. In addition, MRA-Net reconciles the object appearance
features with the two kinds of relation features under the guidance of the corresponding question, so that these
features can be used effectively according to the question. A few limitations observed are:
• the binary relation feature improves the overall accuracy only from 65.43 to 65.94; and
• the proposed model occasionally makes mistakes in locating all the relevant regions and relations.

Methods                   Q-rel   nParams   VQA-score
O-att                     ✓       48.7M     65.43
2*O-att                   ✓       53.8M     65.20
O-att+Binary*             ✓       56.8M     65.94
O-att+Binary              ✓       56.8M     65.94
O-att+2*Binary            ✓       64.7M     65.95
O-att+Binary+Trinary*     ✓       67.4M     66.08
Self-att                  X       55.9M     65.90
Q-rel                     ✓       61.6M     65.87
2*Self-att                X       61.7M     65.96
Self-att+Q-rel*           ✓       67.4M     66.08

Table 2.5: Results on the VQA-2.0 Validation Set



2.6 “Self-Adaptive Neural Module Transformer for Visual Question Answering”. (H. Zhong, 2021)

The authors of this paper are Huasong Zhong, Hanwang Zhang, Jingyuan Chen, and Xian-Sheng Hua. The
objective of this paper is to present a novel neural module network, called the Self-Adaptive Neural Module
Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout
decoding by considering intermediate question-and-answer results. The datasets used are CLEVR,
CLEVR-CoGenT, VQA v1.0, and VQA v2.0, and the methods used are:
• General framework: convert the inputs into image and question embeddings.
• Revised module network: each neural module is constructed as a differentiable structure.
• Encoding: a neural transformer encodes more accurate input question features for the layout
controller based on the intermediate Q&A results.
• Decoding: the neural transformer guides subsequent layout generation based on the
intermediate results from each reasoning step.
The model performs better on VQA v1.0 (66.7%) than on VQA v2.0 (64.5%). A few advantages observed are that
self-adaptive feature embedding achieves better performance than the baseline module, and that the layouts in
the model are self-adaptive to the real situation, resulting in better performance. The model can focus on the
dynamic nature of question comprehension and select the corresponding module function more accurately. A few
limitations of this paper are:
• expert layout supervision is not significant on VQA v2.0; and
• the VQA 2.0 dataset is more complicated and closer to the datasets used in practice.

2.7 “Adversarial Learning with Multi-Modal Attention for Visual Question Answering”. (Liu, Zhang, Huang, Cheng, & Li, September 2021)


The authors of this paper are Yun Liu, Xiaoming Zhang, Feiran Huang, Lei Cheng, and Zhoujun Li. The objective
of this paper is to propose a novel model which can capture answer-related features. The datasets used are
VQA v1, VQA v2, and COCO-QA. The methods followed are:
• an adversarial method to explore the answer-related information;
• multi-modal attention with Siamese similarity learning to learn the alignment between image regions and
the answer; and
• the final ALMA model (Adversarial Learning with Multi-modal Attention).
An accuracy of 68.94% on VQA v1, 68.76% on VQA v2, and 68.27% on COCO-QA is
achieved. A few advantages of this model are:
• it focuses more on answer-related image regions, outperforming attention models; and
• it gives a 1.65% improvement in accuracy compared to previous models.
One limitation observed is that the model fails to analyse small objects in an image.

Methods           Accuracy (VQA v1.0)   Accuracy (COCO-QA)
No-Att            45.56                 54.78
MM-Att            54.74                 61.65
MM-Att + Sia      59.63                 64.32
MM-Att + Sia      61.41                 68.26
H-LSTM            58.62                 65.91
E-LSTM            59.21                 66.74
HE-LSTM*          61.36                 68.24
Ans-Rep           59.84                 65.73
Inf-Dis-Sub       60.71                 67.75
Inf-Dis-Div*      61.35                 68.26
Table 2.6: Ablation Study on VQA v1.0 and COCO-QA Data Sets. “*” Denotes Variant Implementations of
the Model

2.8 “Towards Assisting the Visually Impaired: A Review on Techniques for Decoding the Visual Data from Chart Images”. (Shahira & Lijiya, March 2021)

This is an IEEE paper. The authors are K. C. Shahira and A. Lijiya. The objective of this paper is
to explore the existing literature on understanding charts and extracting the visual encoding from them.
The datasets used are LeafQA, PlotQA, FigureQA, DVQA, FigureSeer, and CLEVR, and the methods followed
are:

• Modality-based approaches: output audio and tactile information.
• Traditional methods: connected component analysis and the Hough transform.
• Deep learning: attention encoder-decoders, VGG, AlexNet, MobileNet.
ConvNets give accuracy of up to 97%, whereas Microsoft OCR on text gives 75.6%.
The major advantage observed is that a blind person can understand the main information in a graph through
question answering; the study mainly focuses on extracting chart data to aid the visually impaired in
graph perception by reviewing conventional and deep learning methods. A few limitations of this paper are:
• FigureQA gives only binary answers, and no numerical answering is possible;
• the CLEVR dataset has reasoning questions about synthetic scenes, but such models perform poorly on chart
datasets;
• there are no proper datasets; and
• the accuracy of the models is too low compared to human-level performance.

Author, Year      Dataset    Type         IMG      QUES     IMG+QUES   SAN
Methani et al.    PlotQA     BC, LC       14.84%   15.35%   46.54%     53.96%
Kafle et al.      DVQA       BC           14.83%   21.06%   32.01%     36.04%
Kahou et al.      FigureQA   BC, LC, PC   52.47%   50.01%   55.59%     72%
Table 2.7: Qualitative Analysis of Different Model Performances (Accuracy) in Reasoning

2.9 “Learning to Recognize Visual Concepts for Visual Question Answering with Structural Label Space”. (Difei Gao, March 2020)

The authors of this paper are Difei Gao, Ruiping Wang, Shiguang Shan, and Xilin Chen.
The objective of this paper is to propose a novel visual recognition module named the Dynamic Concept
Recognizer (DCR), which is easy to plug into an attention-based VQA model, together with a structural label space.
The datasets used are Visual Genome, GQA, VQA v2, and VQA-CP v2, and the methods followed are:
• a structural label space, which outputs G groups where each group contains Ci concepts;
• k-means clustering to classify concepts (outputs of GloVe embeddings) into groups; and
• the Dynamic Concept Recognizer, which takes image features and the group predicted by the Group PredNet and
predicts the concept that answers the given question.
An overall 5% increase in DCR accuracy compared to previous models is achieved on the Visual Genome dataset,
and a 3% (absolute) improvement in accuracy on the GQA dataset. An advantage of this paper is that it works
better on conceptual questions compared to previous models. A limitation is that it works poorly on yes/no
type questions, as the model focuses only on the visual concepts.

Model                    Image Feature   VQA-CP v2 test                          VQA v2 val
                                         Yes/No  Number  Other   Overall         Yes/No  Number  Other   Overall
NMN                      ResNet          38.94   11.92   25.72   27.47           73.38   33.23   39.85   51.62
Bottom-Up QAdv+DoE       Bottom-Up       65.49   15.48   35.48   41.17           79.84   42.35   55.16   62.75
MuRel                    Bottom-Up       42.85   13.17   45.04   39.54           84.03   47.84   56.25   65.58
GVQA                     -               57.99   13.68   22.14   31.30           72.03   31.17   34.65   48.24
SAN                      ResNet          38.35   11.14   21.74   24.96           68.89   34.55   43.80   52.02
ours: DAG                ResNet          40.84   13.86   29.58   30.46           69.32   33.02   40.14   50.31
ours: DAG                Bottom-Up       41.56   12.19   43.29   38.04           81.18   42.14   55.66   63.48
ours: DAG w. Q           Bottom-Up       41.05   11.32   40.28   36.09           81.97   42.55   54.42   63.21
ours: DAG (Full Model)   Bottom-Up       43.02   15.83   46.41   40.75           81.05   42.47   54.53   62.91

Table 2.8: Performance (%) on VQA-CP V2 Test set and VQA V2 Validation set.

2.10 “Plenty is Plague: Fine-Grained Learning for Visual Question Answering”. (Yiyi Zhou, February 2022)

The authors of this paper are Yiyi Zhou, Rongrong Ji, Xiaoshuai Sun, Jinsong Su, Deyu Meng, Yue Gao, and
Chunhua Shen. The main objective of this paper is to propose a fine-grained VQA learning paradigm with an
actor-critic based learning agent, termed FG-A1C, which selects the training data from a given dataset, lowering
the training cost and increasing the speed. The datasets used are the VQA 2.0 and VQA-CP v2 datasets. The
method follows a reinforcement learning approach in which an RL model selects the most valuable data based on
a reward function derived from the training loss. It picks the best subset of training data, which reduces the
computational complexity as well as the training time. Bellman's equation is used for policy evaluation. With
25% of the data the model achieves 60% accuracy, and with 75% of the data it achieves 65.2% accuracy. A few
advantages are:
• using only 50% of the training examples saves 25% of the model training time; and
• it can be integrated with almost all models without altering the model configuration.
A few limitations observed are:
• the model focuses more on yes/no type questions, while hard questions such as "why" and "where" are barely
selected; and
• the model is not well generalized as it learns from less data.

Paradigm       VG     STEP   All    Yes/No   Num.   Others
Random*        512K   -      65.3   81.8     44.2   57.3
Random         512K   412K   66.9   83.4     48.6   57.1
FG-A1C-AL      250K   341K   67.0   83.7     47.6   57.2
FG-A1C-AL      150K   227K   67.0   83.3     47.6   57.1
FG-A1C-SPL     250K   240K   67.2   83.9     48.5   57.2
FG-A1C-SPL     150K   227K   67.2   84.0     48.5   57.0

Table 2.9:  Evaluations of BUTD on VQA2.0 Test-Dev With Visual Genome Dataset

2.11 “Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering”. (Liyang Zhang, October 2021)

The authors of this paper are Liyang Zhang, Shuaicheng Liu, Jingkuan Song, and Lianli Gao. The paper addresses
visual question answering, which involves understanding an image and a paired question, and enhances accuracy
using a Knowledge-based Augmented Network (KAN). The dataset used in this paper is the Visual Genome
dataset. The methods used are:
• the Knowledge-based Augmented Network (KAN), with ConceptNet as the external knowledge base; and
• a feature extraction module that extracts the image feature and the question feature.
The experiment's overall accuracy outperforms the best top-one overall accuracy reported in FVQA, which is
63.63 ± 0.73%. The advantage of this paper is that it uses an external knowledge base (e.g. ConceptNet) to
improve performance, giving better results than previous models such as MCAN. Its limitations are that the
model may give merely plausible answers when it learns inadequately, and that it performs well only within a
certain range of values for the external knowledge (K) and relation (R) parameters; beyond that range, the
model's accuracy decreases.

Model                              Overall Acc. ± Std (%), Top-1
SVM-Question                       10.37 ± 0.80
SVM-Image                          18.41 ± 1.07
Hie-Question+Image                 33.70 ± 1.18
Hie-Question+Image+Pre-VQA         43.14 ± 0.61
FVQA                               63.63 ± 0.73
Ours                               66.39 ± 0.50
Table 2.10: Overall Accuracy Result With Other Methods on the FVQA Data Set.

2.12 “Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval”. (Jing Yu, December 2020)

The authors of this paper are Jing Yu, Weifeng Zhang, Yuhang Lu, Zengchang Qin, Yue Hu, Jianlong Tan, and Qi
Wu. The objective of this paper is to propose a novel Visual Relational Reasoning (VRR) module to reason
about pair-wise and inner-group visual relationships among objects, guided by the textual information.
The MS-COCO dataset is used, and the methods used are a bilinear visual attention module and a visual
relational reasoning module. The proposed model has an accuracy of 58.1%, an overall improvement of 2.3%.
• A major advantage is that a GCN is used to extract the textual features, since the proposed GCN is better for
modelling long texts.
• Though the model identifies some relevant visual content, it can fail to reason about their relationships,
leading to irrelevant results.

                      Image Query                        Text Query
Model                 R@1    R@5    R@10   Med r         R@1    R@5    R@10   Med r
1K Test Images
DVSA [59]             38.4   69.9   80.5   1.0           27.4   60.2   74.8   3.0
GMM-FV [67]           39.4   67.9   80.9   2.0           25.1   59.8   76.6   4.0
m-CNN [62]            42.8   73.1   84.1   2.0           32.6   68.6   82.8   3.0
VQA-A [68]            50.5   80.1   89.7   -             37.0   70.9   82.9   -
HM-LSTM [63]          43.9   -      87.8   2.0           36.1   -      86.7   3.0
Order-embedding       46.7   -      88.9   2.0           38.9   -      85.9   2.0
DSPE+FV               50.1   79.7   89.2   -             39.6   75.2   86.9   -
sm-LSTM               53.2   83.1   91.5   1.0           40.7   75.8   87.4   2.0
two-branch net        54.9   84.0   92.2   -             43.3   76.4   87.5   -
CMPM (ResNet-152)     56.1   86.3   92.9   -             44.6   78.8   89.0   -
VSE++ (fine-tuned)    57.2   -      93.3   1.0           45.9   -      89.1   2.0
c-VRANet (ours)       58.1   86.9   93.4   1.0           50.4   83.6   92.3   1.0
5K Test Images
DVSA                  16.5   39.2   52.0   9.0           10.7   29.6   42.2   14.0
GMM-FV                17.3   39.0   50.2   10.0          10.8   28.3   40.1   17.0
VQA-A                 23.5   50.7   63.6   -             16.7   40.5   53.8   -
Order-embedding       23.3   -      65.0   5.0           18.0   -      57.6   7.0
CMPM (ResNet-152)     31.1   60.7   73.9   -             22.9   50.2   63.8   -
VSE++ (fine-tuned)    32.9   -      74.7   3.0           24.1   -      66.2   5.0
c-VRANet (ours)       34.4   63.8   76.0   3.0           27.8   57.6   70.8   4.0
Table. 2.11: Comparison of Performance of Our Model With the State-of-the-Art Methods on MS-
COCO Dataset

2.13 “An Entropy Clustering Approach for Assessing Visual Question Difficulty”. (Kento Terao, 2020)

The authors of this paper are Kento Terao, Toru Tamaki, Bisser Raytchev, Kazufumi Kaneda, and Shin'ichi
Satoh. The objective of this paper is to use the entropy values of the answer predictions produced by different
VQA models to evaluate the difficulty of visual questions for the models. The dataset used is VQA v2.
The methods followed are:
• hard example mining and hardness/failure prediction; and
• using three models (I, Q, and I+Q), predicting answer distributions, and computing entropy values to
perform clustering with simple k-means.
Accuracy (Q+I): 67.47; entropy (Q+I): 0.84. One advantage is that using the entropy of the predicted answers
helps in increasing the accuracy of the VQA model; a limitation is that the entropy measure can sometimes give
merely plausible outputs.


Table. 2.12:  Performance Comparison on the RSIVQA Dataset

2.14 “Mutual Attention Inception Network for Remote Sensing Visual Question Answering”. (Xiangtao Zheng, 2022)

The authors of this paper are Xiangtao Zheng, Binqiang Wang, Xingqian Du, and Xiaoqiang Lu. In this paper a
novel method is proposed that uses convolutional features of the image to represent spatial information; an
attention mechanism and a bilinear technique are introduced to enhance the features, considering the alignments
between spatial positions and words. The datasets used are UC-Merced (UCM), Sydney, AID, HRRSD, and
DOTA. The methods used are:
• a representation module devised to obtain the image and question both as a whole and as parts; and
• a fusion module designed to boost the discriminative abilities.
The method achieves the best performance in terms of overall accuracy, which is 67.23%. A few advantages are
that it treats the task as a classification problem, which simplifies it, and that the adoption of the attention
mechanism and bilinear feature fusion helps improve the accuracy of the answer. A limitation is that if the same
question is asked multiple times, the model can give different answers.
Methods                 Overall        Area           Comparison     Count          Presence
IMG+SOFTMAX             29.75 (1.78)   21.67 (1.49)   32.14 (1.8)    12.83 (3.19)   38.91 (2.43)
BOW+SOFTMAX             56.49 (1.83)   51.35 (1.22)   59.42 (1.86)   48.27 (2.04)   61.77 (1.31)
IMG+BOW+SOFTMAX         63.53 (1.66)   62.34 (0.94)   65.54 (1.01)   60.41 (1.97)   64.79 (0.84)
IMG+GloVe+SOFTMAX       65.76 (1.82)   61.83 (0.90)   66.91 (1.22)   61.34 (1.93)   71.14 (0.95)
IMG+BERT+SOFTMAX        66.30 (1.53)   62.28 (1.01)   67.70 (1.69)   60.61 (1.48)   72.49 (0.89)
OURS                    67.23 (1.04)   63.15 (1.38)   68.55 (1.63)   59.74 (1.41)   69.55 (1.05)

Table. 2.13: Performance Comparison on the Remote Sensing VQA Dataset

2.15 “Low-altitude small-sized object detection using lightweight feature-enhanced convolutional neural network”. (Ye Tao, August 2021)

23
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

The authors of this paper are Ye Tao, Zhao Zongyang, Zhang Jun, Chai Xinghua, and Zhou Fuqiang. The
objective of this paper is to propose LSL-Net to perform high-precision detection of low-altitude flying objects
in real time, providing guidance to suppress the black flight of UAVs. The dataset used is the MS-COCO dataset.
The model comprises three simple and efficient modules: LSM, FEM, and ADM. The LSM reduces the image
input size and the loss in low-level feature extraction, the FEM improves feature extraction, and the ADM
increases the detection accuracy.
LSL-Net achieves an mAP of 90.97%, which is 6.71% higher than YOLOv4-tiny. An advantage is that this
model helps in detecting aerial objects such as jets and drones, which is useful for security purposes. The model
has good robustness and excellent generalization ability, can effectively perform detection in different weather
conditions, and satisfies the requirements of low-altitude flying-object detection for anti-UAV missions.
However, frame rates need to be increased to achieve better accuracy.

Methods           Size       mAP/%   FPS
Faster R-CNN      -          39.8    9
SSD               300x300    25.1    43
SSD               512x512    28.8    22
YOLOv3-SPP        608x608    36.2    20
YOLOv4            608x608    43.5    33
CenterNet         -          41.6    28
FCOS              -          44.7    -
LSL-Net (ours)    416x416    38.4    135
LSL-Net (ours)    512x512    39.1    126
LSL-Net (ours)    608x608    40.3    118
Table. 2.14: Results on MS-COCO dataset

2.16 “Stacked Attention Networks for Image Question Answering”. (Zichao Yang, 2016)

The authors of this paper are Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. In this paper
the SAN uses a multiple-layer attention mechanism that queries an image multiple times to locate the relevant
visual regions and to infer the answer progressively.
The datasets used are DAQUAR-ALL, DAQUAR-REDUCED, COCO-QA, and VQA.
Three models are considered in this paper:
• a question model (LSTM);
• an image model (CNN); and
• the stacked attention model.
On the COCO-QA dataset the model achieves 79.3% accuracy on yes/no type questions. This paper is helpful in
showing that using multiple attention layers to perform multi-step reasoning leads to more fine-grained attention,
layer by layer, in locating the regions that are relevant to the potential answers. However, the SAN mainly
improves the performance on yes/no questions.

Methods          All     Yes/No (36%)   Number (10%)   Other (54%)
SAN(1, LSTM)     56.6    78.1           41.6           44.8
SAN(1, CNN)      56.9    78.8           42.0           45.0
SAN(2, LSTM)     57.3    78.3           42.2           45.9
SAN(2, CNN)      57.6    78.6           41.8           46.4
Table. 2.15: VQA results on our partition, in percentage


CHAPTER 3: METHODOLOGY

3.1 BLOCK DIAGRAM:

Fig. 3.1: Block Diagram.

The model requires two inputs: a complex outdoor image and a question about that image. After
receiving the inputs, the model analyses the image and the question to extract fine-grained features. The answer-
generating model receives both sets of features and uses them to predict an appropriate response. The obtained
answer and the image are passed to the question generator to generate probable questions regarding the given
inputs. These probable questions are fed back as refined questions to the visual question-answering model to get
refined answers.

26
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

Figure 3.2 Baseline Model Architecture of visual question answering

To embed the image, a convolutional neural network based on ResNet is used. A multi-layer LSTM is fed with
the tokenized and embedded input question. Then, several attention distributions over the image features are
computed from the concatenated image features and the LSTM's final state. Two fully connected layers receive
the concatenated image feature glimpses and the LSTM state to compute probabilities over the answer classes.

RESNET 152 :

Figure 3.3 RESNET 152 ARCHITECTURE


Residual networks (ResNet) were put forth as a family of deep neural network architectures with similar
structure but varied depths. To counteract deep neural network degradation, ResNet incorporates a structure
called a residual learning unit: a feedforward network with a shortcut connection, which allows additional
inputs to be added while also producing new outputs. The key advantage of this unit is that it improves
classification accuracy while keeping the model's complexity constant. ResNet-152 obtains the highest accuracy
amongst the ResNet family members.
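
As an illustration, the following minimal sketch (not the exact project code) shows how a pretrained ResNet-152
from torchvision can be truncated before its final pooling and classification layers so that it yields the
14 x 14 x 2048 convolutional feature map used later for attention; the image file name is only a placeholder.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(pretrained=True)
# Keep everything up to, but excluding, the global average pool and the fully connected layer.
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize(448), T.CenterCrop(448), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    features = feature_extractor(img)  # shape: (1, 2048, 14, 14) for a 448x448 input
print(features.shape)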

LONG SHORT TERM MEMORY NETWORK(LSTM) :

Figure 3.4 LSTM ARCHITECTURE

A special class of Recurrent Neural Networks (RNNs) that can learn long-term dependencies is known as the
"LSTM" (long short-term memory) network. LSTMs were initially developed by Hochreiter & Schmidhuber
(1997), and many researchers went on to improve and popularise them in subsequent works. They are currently
widely used because they work remarkably well in a wide range of situations.
Long-term dependency is a problem that LSTMs are specifically designed to avoid: retaining information for
extended periods of time is their natural behaviour, not something they struggle to learn. Like all recurrent
neural networks, an LSTM has the form of a chain of repeating neural network modules, and the output of the
previous step is fed as input to the next step. In addition to the hidden state of a plain RNN, the LSTM keeps a
separate cell state, which lets it selectively recall patterns over extended periods of time. In this work, a CNN is
used for the image and an LSTM for the question in order to produce the answer.
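
A minimal sketch of such a question encoder is shown below, assuming GloVe-style embeddings of dimension
300 and questions already mapped to integer token ids; the vocabulary size and hidden size are illustrative
choices, not values fixed by the report.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # could be initialised with GloVe vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        e = self.embed(token_ids)              # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(e)             # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                  # final LSTM state used to represent the question

encoder = QuestionEncoder(vocab_size=10000)
s = encoder(torch.randint(0, 10000, (2, 14)))  # two dummy questions of 14 tokens each
print(s.shape)                                 # torch.Size([2, 1024])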


3.2 DATASET

The VQA v1.0 dataset is used; we test our model on both the balanced and unbalanced variants of the VQA
dataset. VQA 1.0 is built on 204,721 images from the MS COCO dataset.

3.3 METHOD

The goal is to predict the most likely answer â from a fixed set of answers, based on the content of an image I
and a question q expressed in natural language:

â = arg max_a P(a | I, q),  where a ∈ {a_1, a_2, ..., a_M}

The answer set is selected from the most frequent answers in the training set.
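
A small sketch of how such a fixed answer set {a_1, ..., a_M} can be built from the most frequent training
answers is shown below; the answer strings are made-up examples.

from collections import Counter

def build_answer_vocab(train_answers, M=1000):
    # Keep the M most frequent answer strings and map each one to a class index.
    counts = Counter(train_answers)
    top_answers = [ans for ans, _ in counts.most_common(M)]
    answer_to_idx = {ans: i for i, ans in enumerate(top_answers)}
    return top_answers, answer_to_idx

answers, answer_to_idx = build_answer_vocab(["yes", "no", "yes", "2", "red", "yes"], M=3)
print(answers)  # ['yes', 'no', '2']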

3.3.1) IMAGE EMBEDDINGS :

A pretrained convolutional neural network (CNN) based on the residual network architecture is used to compute a
high-level representation φ of the input image I:

φ = CNN(I)

φ is a three-dimensional tensor of size 14 x 14 x 2048 taken from the residual network's last layer before the
final pooling layer. Applying l2 normalisation along the depth (last) dimension of the image features improves
the learning dynamics.
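
The l2 normalisation along the depth dimension can be sketched as follows (the tensor here is random and stands
in for CNN(I)):

import torch
import torch.nn.functional as F

phi = torch.randn(1, 2048, 14, 14)        # placeholder for phi = CNN(I)
phi_norm = F.normalize(phi, p=2, dim=1)   # unit l2 norm along the depth (channel) dimension
print(phi_norm.norm(dim=1)[0, 0, 0])      # approximately 1.0 at every spatial location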

3.3.2) QUESTION EMBEDDINGS :

The given question is tokenized and encoded into word embeddings E_q = {e_1, e_2, ..., e_P}, where P is the
number of words in the question, D is the length of the distributed word representation, and e_i ∈ R^D. The
embeddings are subsequently fed into a long short-term memory network (LSTM) [11]:

s = LSTM(E_q)

The LSTM's final state s is used to represent the question.

3.3.3) Attention Network:
Multiple attention distributions ("glimpses") over the spatial locations of the visual features are estimated.

α_{c,l} ∝ exp(F_c(s, φ_l)),   with  Σ_{l=1}^{L} α_{c,l} = 1
x_c = Σ_{l} α_{c,l} φ_l

Each glimpse x_c is the weighted average of the image features over all spatial locations l = {1, 2, ..., L}. The
attention weights α_{c,l} of each glimpse are normalised separately for c = 1, 2, ..., C.

In the actual model, F = [F_1, F_2, ..., F_C] is implemented as two layers of convolution; the F_c consequently
share parameters in the top layer.
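
A minimal sketch of this glimpse attention is given below: two 1x1 convolutions implement F = [F_1, ..., F_C],
a softmax normalises each glimpse over the L = 14 x 14 locations, and every x_c is an attention-weighted sum of
the image features; the dimensions are assumptions consistent with the sizes quoted above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpseAttention(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512, glimpses=2):
        super().__init__()
        self.conv1 = nn.Conv2d(img_dim + q_dim, hidden, kernel_size=1)   # shared first layer
        self.conv2 = nn.Conv2d(hidden, glimpses, kernel_size=1)          # one map per glimpse F_c

    def forward(self, phi, s):                              # phi: (B, 2048, 14, 14), s: (B, 1024)
        B, D, H, W = phi.shape
        s_tiled = s[:, :, None, None].expand(-1, -1, H, W)  # tile the question state over the grid
        logits = self.conv2(F.relu(self.conv1(torch.cat([phi, s_tiled], dim=1))))
        alpha = F.softmax(logits.view(B, -1, H * W), dim=2)               # alpha_{c,l}, sums to 1 over l
        x = torch.bmm(alpha, phi.view(B, D, H * W).transpose(1, 2))       # x_c = sum_l alpha_{c,l} phi_l
        return x.view(B, -1)                                # concatenated glimpses [x_1, ..., x_C]

att = GlimpseAttention()
x = att(torch.randn(2, 2048, 14, 14), torch.randn(2, 1024))
print(x.shape)  # torch.Size([2, 4096])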

3.3.4) Classifier
To produce a probability distribution over the answer classes, nonlinearities are applied to the concatenation of
the image glimpses with the LSTM state.

P(a_i | I, q) ∝ exp(G_i(x, s)),   where x = [x_1, x_2, ..., x_C]

In practice, G = [G_1, G_2, ..., G_M] is implemented as two fully connected layers. The final loss is defined as

L = (1/K) Σ_{k=1}^{K} −log P(a_k | I, q)
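
The classifier and loss can be sketched as below, with G implemented as two fully connected layers and
cross-entropy giving the negative log-likelihood of the ground-truth answers; the layer sizes are assumed, not
taken from the report.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerClassifier(nn.Module):
    def __init__(self, x_dim=4096, s_dim=1024, hidden=1024, num_answers=1000):
        super().__init__()
        self.fc1 = nn.Linear(x_dim + s_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_answers)

    def forward(self, x, s):
        # Two fully connected layers computing the logits G_i(x, s) over answer classes.
        return self.fc2(F.relu(self.fc1(torch.cat([x, s], dim=1))))

clf = AnswerClassifier()
logits = clf(torch.randn(2, 4096), torch.randn(2, 1024))
targets = torch.tensor([3, 7])                  # example ground-truth answer indices
loss = F.cross_entropy(logits, targets)         # averaged -log P(a_k | I, q)
print(loss.item())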

3.3.5) CUSTOMIZATION TO THE ABOVE NETWORK :


Our approach performs visual and textual attention simultaneously over a number of steps and collects the
essential information from both modalities. We describe here the basic attention mechanisms used at each step,
which act as the foundation for the overall attention network.

1) VISUAL ATTENTION:
Visual attention creates a context vector by attending to certain regions of the input image. At step k, the
visual context vector v^(k) is determined by

v^(k) = V_Att({v_n}_{n=1}^{N}, m_v^(k−1))

where m_v^(k−1) is a memory vector representing the information that has been attended up to step k−1.

2) TEXTUAL ATTENTION:
By attending to particular words in the input sentence at each step, textual attention computes a textual
context vector u^(k):

u^(k) = T_Att({u_t}_{t=1}^{T}, m_u^(k−1))


where m_u^(k−1) is a memory vector.
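
The sketch below illustrates one such step k under our own assumptions about the dimensions and the memory
update rule, which the report does not spell out: the memory attends over the N visual features and over the T
word features, and the resulting context vectors are folded back into the memory.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStep(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.v_score = nn.Linear(2 * dim, 1)     # scores for visual attention V_Att
        self.u_score = nn.Linear(2 * dim, 1)     # scores for textual attention T_Att
        self.update = nn.Linear(3 * dim, dim)    # assumed memory update (not specified in the report)

    def attend(self, feats, memory, scorer):     # feats: (B, L, dim), memory: (B, dim)
        m = memory.unsqueeze(1).expand_as(feats)
        weights = F.softmax(scorer(torch.cat([feats, m], dim=2)).squeeze(2), dim=1)
        return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)   # attended context vector

    def forward(self, v_feats, u_feats, memory):
        v_k = self.attend(v_feats, memory, self.v_score)   # visual context v^(k)
        u_k = self.attend(u_feats, memory, self.u_score)   # textual context u^(k)
        return torch.tanh(self.update(torch.cat([memory, v_k, u_k], dim=1)))

step = AttentionStep()
m = step(torch.randn(2, 196, 512), torch.randn(2, 14, 512), torch.randn(2, 512))
print(m.shape)  # torch.Size([2, 512])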

Figure 3.5 CUSTOMIZED ARCHITECTURE

By combining textual attention with visual attention, the algorithm attends to the precise regions and words
that make it easier to answer the question. The output of the combined attention network is passed to the
softmax layer, which is used to produce the top-5 predictions.

The fusion of visual and question features depends heavily on attention, and the attention-guided integration of
visual and textual features has been extensively studied in a number of previous papers. Only a few of the top-k
region proposals (found using the visual feature extractor) are relevant to a given input question q.

3.4) Development towards Question Generator Module :


3.4.1) Image Captioning :

Caption generation is a challenging AI problem in which a textual description must be produced for a given
image. It requires both computer vision techniques, to comprehend the content of the image, and a language
model from the field of natural language processing, to translate that understanding into words in the
appropriate order.

In a "traditional" image captioning system, a pre-trained convolutional neural network (the encoder) encodes
the image and creates a hidden state h. An LSTM (the decoder) then decodes this hidden state and produces
each caption word iteratively.

Figure 3.4.1: A classic image captioning model

The issue with this approach is that, when the model attempts to construct the next word of the caption, it
typically describes only a small portion of the image; it is unable to fully convey the meaning of the input image.
Using the entire representation of the image as the sole conditioning signal does not allow different words to be
generated efficiently for different regions of the image. This is exactly the situation where an attention
mechanism is effective.

With an attention mechanism, the image is first divided into n parts and the representation of each part
(h_1, ..., h_n) is computed with a convolutional neural network (CNN). When the RNN is generating a new word,
the attention mechanism concentrates on the relevant part of the image, so the decoder only uses specific
portions of the image.

32
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

Figure 3.4.2 Image captioning architecture with attention Mechanism

Because it is computationally expensive and impractical when translating long sentences, global attention,
which focuses on all source-side words for every target word, is not recommended. Local attention instead
concentrates on only a limited portion of the encoder's hidden states for each target word, which addresses this
shortcoming.

The given image may be captioned as "A boy on a skateboard": when we say "boy" the caption refers to the
region of the boy, and when we say "skateboard" it refers to the region of the skateboard. These regions lie at
different pixel locations, and the fully connected output of VGG16 carries no information about them. However,
every location of the convolutional layers corresponds to some location of the image, as shown below.

33
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

VGG16 ARCHITECTURE

The output of VGG16's fifth convolutional block is a feature map of size 14 x 14 x 512. This fifth block therefore
provides 196 pixel positions (14 x 14), each of which corresponds to a specific region of the image. Finally, we
can work with these 196 locations, each of which is a 512-dimensional vector.

The model then develops attention over these regions (which correspond to actual locations in the image). The
generated caption acts as the "context" for the question generation model.
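
A sketch of extracting these 196 location vectors from a torchvision VGG16 is shown below; the input tensor is
a random placeholder for a preprocessed image.

import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True)
# Drop the final max-pool so the last convolutional block yields a 14x14x512 map for a 224x224 input.
conv5 = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()

img = torch.randn(1, 3, 224, 224)                 # placeholder preprocessed image
with torch.no_grad():
    fmap = conv5(img)                             # (1, 512, 14, 14)
locations = fmap.flatten(2).transpose(1, 2)       # (1, 196, 512): one 512-d vector per image region
print(locations.shape)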

Dataset:
The Flickr8k dataset is used. It contains 8,000 images, each with five captions. These images are divided into
three parts:

Training set: 6,000 images.

Dev set: 1,000 images.

Test set: 1,000 images.

34
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

3.5) Question Generator :

T5, or Text-to-Text Transfer Transformer, is a Transformer-based architecture that uses a text-to-text
approach: for every task, including translation, question answering, question generation, and classification, the
model is fed text as input and trained to produce the target text.

Transfer learning, in which a model is pre-trained on a data-rich task before being fine-tuned on a downstream
task, has become a powerful technique for natural language processing (NLP). Its success has led to a wide
range of methodologies, practices, and strategies.

Compared to training only on small labelled datasets without pre-training, the model performs significantly
better after being fine-tuned on smaller, task-specific labelled datasets.

We fine-tune T5 on the SQuAD dataset to generate a question based on the context (the image caption).

Tokenization of the words is done using the T5 tokenizer.

35
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

Architecture of the Question Generator

The given text is preprocessed, then encoded and decoded using the T5 tokenizer. The context is passed to the
fine-tuned T5 Transformer to train the question generation network. During evaluation, a context-based question
is generated.
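
A minimal sketch of this generation step with the Hugging Face transformers library is given below; the
fine-tuned checkpoint path and the "generate question:" task prefix are placeholders for whatever is actually
used in training.

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/fine-tuned-t5")  # hypothetical checkpoint

context = "A boy on a skateboard"                     # caption produced by the image captioning model
inputs = tokenizer("generate question: " + context, return_tensors="pt")
outputs = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))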

Exploring the Tokenizer:

Text: "This is our Capstone Project"

Encoded output:

Tokenized output:

Decoded output:
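
The three outputs above can be reproduced with a sketch along the following lines, assuming the standard
t5-base tokenizer:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
text = "This is our Capstone Project"

encoded = tokenizer.encode(text)                         # list of token ids, ending with </s>
tokens = tokenizer.tokenize(text)                        # sub-word pieces such as '▁This', '▁is', ...
decoded = tokenizer.decode(encoded, skip_special_tokens=True)

print(encoded)
print(tokens)
print(decoded)   # "This is our Capstone Project"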

3.3.6) Hyperparameters


36
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

3.3.6.1) Optimizer
Adam is used as the optimizer, starting with a learning rate of 0.001 and gradually lowering it with the aid of a
learning-rate scheduler. Whenever we train a neural network, the network's output differs from the expected
output; this difference is measured by a loss (or cost) function.

Optimizers are techniques for reducing the network's loss by modifying neural network properties such as the
weights and the learning rate. Although there are many kinds of optimizers, we employ the Adam optimizer
here. Adam stands for adaptive moment estimation, and it combines two concepts:
i) momentum, and
ii) the Root Mean Squared Propagation method (RMSProp).
Momentum smooths the updates, while RMSProp adapts the learning rate.
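
The optimiser setup can be sketched as follows; the scheduler's step size and decay factor are illustrative
assumptions, since the report only states the initial learning rate of 0.001.

import torch

model = torch.nn.Linear(10, 2)                       # stand-in for the VQA model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # assumed schedule
criterion = torch.nn.CrossEntropyLoss()              # categorical cross-entropy loss

for epoch in range(20):
    # ... iterate over batches: forward pass, loss = criterion(output, target),
    # loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()                                 # gradually lower the learning rate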

3.6) System Requirement Specification:

Hardware Requirements
• Processor: Intel i5 or higher
• Processor speed: 2.4 GHz
• RAM: 4 GB
• GPU: NVIDIA GTX 1050 Ti, 4 GB
• Hard disk space: 128 GB

Software Requirements
• Operating system: Windows 7/8/10
• Coding language: Python 3
• Software tools: Jupyter Notebook, Google Colab

Frontend Requirements
• Coding languages: HTML, CSS, JavaScript, and Flask
• Software tool: Jupyter Notebook


CHAPTER 4: RESULTS AND DISCUSSION


Learning Rate: 0.001
No. of epochs: 20
Batch size: 32
Loss Function: Categorical Cross entropy
Optimizer: Adam

Model                                      Train Loss   Test Loss   Train Accuracy   Test Accuracy
Baseline model (ResNet)                    0.8363       0.9404      0.5664           0.4924
Visual and Textual Attention Based Model   0.6483       0.8835      0.7822           0.5322


CHAPTER 5: CONCLUSION AND FUTURE WORK

In this report, a customized attention network, improved with the addition of visual and textual attention, is
proposed to solve the Visual Question Answering (VQA) task. The model is trained and tested on the VQA v1
dataset. The customised model is tested and compared with pre-existing models, showing an improvement in
the accuracy of the VQA task.
The generated answer and the related image are passed to the image captioning model, which generates a
caption for the image. This caption is used as the "context" for the question generator, which uses a fine-tuned
T5 Transformer to generate a question. The generated question is then fed back to the VQA model to get
refined answers.

The VQA model can be improved to generate answers in phrases with higher accuracy by adding double or
triple attention layers. The network can also be improved by using more data and more computational power.


REFERENCES

[1] F. Liu, J. Liu, Z. Fang, R. Hong and H. Lu, "Visual Question Answering With Dense Inter- and Intra-
Modality Interactions," in IEEE Transactions on Multimedia, vol. 23, pp. 3518-3529, 2021, doi:
10.1109/TMM.2020.3026892.

[2] W. Guo, Y. Zhang, J. Yang and X. Yuan, "Re-Attention for Visual Question Answering," in IEEE
Transactions on Image Processing, vol. 30, pp. 6730-6743, 2021, doi: 10.1109/TIP.2021.3097180.

[3] Z. Yu, J. Yu, C. Xiang, J. Fan and D. Tao, "Beyond Bilinear: Generalized Multimodal Factorized High-
Order Pooling for Visual Question Answering," in IEEE Transactions on Neural Networks and Learning
Systems, vol. 29, no. 12, pp. 5947-5959, Dec. 2018, doi: 10.1109/TNNLS.2018.2817340.

[4] C. Chen, D. Han and J. Wang, "Multimodal Encoder-Decoder Attention Networks for Visual Question
Answering," in IEEE Access, vol. 8, pp. 35662-35671, 2020, doi: 10.1109/ACCESS.2020.2975093.

[5] L. Peng, Y. Yang, Z. Wang, Z. Huang and H. T. Shen, "MRA-Net: Improving VQA Via Multi-Modal
Relation Attention Network," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no.
1, pp. 318-329, 1 Jan. 2022, doi: 10.1109/TPAMI.2020.3004830.

[6] H. Zhong, J. Chen, C. Shen, H. Zhang, J. Huang and X. -S. Hua, "Self-Adaptive Neural Module
Transformer for Visual Question Answering," in IEEE Transactions on Multimedia, vol. 23, pp. 1264-1273,
2021, doi: 10.1109/TMM.2020.2995278.

[7] Y. Liu, X. Zhang, F. Huang, L. Cheng and Z. Li, "Adversarial Learning With Multi-Modal Attention for
Visual Question Answering," in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 9,
pp. 3894-3908, Sept. 2021, doi: 10.1109/TNNLS.2020.3016083.

[8] K. C. Shahira and A. Lijiya, "Towards Assisting the Visually Impaired: A Review on Techniques for
Decoding the Visual Data From Chart Images," in IEEE Access, vol. 9, pp. 52926-52943, 2021, doi:
10.1109/ACCESS.2021.3069205.

[9] D. Gao, R. Wang, S. Shan and X. Chen, "Learning to Recognize Visual Concepts for Visual Question
Answering With Structural Label Space," in IEEE Journal of Selected Topics in Signal Processing, vol. 14,
no. 3, pp. 494-505, March 2020, doi: 10.1109/JSTSP.2020.2989701.

[10] Y. Zhou et al., "Plenty is Plague: Fine-Grained Learning for Visual Question Answering," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 697-709, 1 Feb. 2022, doi:
10.1109/TPAMI.2019.2956699.

[11] L. Zhang et al., "Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering,"
in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 10, pp. 4362-4373, Oct. 2021,
doi: 10.1109/TNNLS.2020.3017530.

[12] J. Yu et al., "Reasoning on the Relation: Enhancing Visual Representation for Visual Question
42
AUGUST-DECEMBER 2022
Extraction of fine grained features of a complex outdoor image and question for visual question answering

Answering and Cross-Modal Retrieval," in IEEE Transactions on Multimedia, vol. 22, no. 12, pp. 3196-3209,
Dec. 2020, doi: 10.1109/TMM.2020.2972830.

[13] K. Terao, T. Tamaki, B. Raytchev, K. Kaneda and S. Satoh, "An Entropy Clustering Approach for
Assessing Visual Question Difficulty," in IEEE Access, vol. 8, pp. 180633-180645, 2020, doi:
10.1109/ACCESS.2020.3022063.

[14] X. Zheng, B. Wang, X. Du and X. Lu, "Mutual Attention Inception Network for Remote Sensing Visual
Question Answering," in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-14, 2022, Art
no. 5606514, doi: 10.1109/TGRS.2021.3079918.

[15] Y. Tao, Z. Zongyang, Z. Jun, C. Xinghua and Z. Fuqiang, "Low-altitude small-sized object detection
using lightweight feature-enhanced convolutional neural network," in Journal of Systems Engineering and
Electronics, vol. 32, no. 4, pp. 841-853, Aug. 2021, doi: 10.23919/JSEE.2021.000073.

[16] Z. Yang, X. He, J. Gao, L. Deng and A. Smola, "Stacked Attention Networks for Image Question
Answering," arXiv:1511.02274, 2015. https://doi.org/10.48550/arXiv.1511.02274

[17] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Batra and D. Parikh, "VQA: Visual Question
Answering," arXiv:1505.00468 [cs.CL], 2015.
