PES UNIVERSITY

(Established under Karnataka Act No. 16 of 2013)


100-ft Ring Road, Bengaluru – 560 085, Karnataka, India

Report on
EXTRACTION OF FINE-GRAINED FEATURES OF A COMPLEX OUTDOOR
IMAGE AND QUESTION FOR VISUAL QUESTION ANSWERING

Submitted by
Aditya V (PES1UG19EC016)
Kaushik G (PES1UG19EC132)
Sinchana S R (PES1UG19EC299)

January-May 2022

Under the guidance of


Prof. Raghavendra M J
Assistant Professor
Department of Electronics and Communication Engineering
PES University
Bengaluru - 560085

FACULTY OF ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
PROGRAM: B. TECH

CERTIFICATE

This is to certify that the Project entitled

EXTRACTION OF FINE-GRAINED FEATURES OF A COMPLEX OUTDOOR IMAGE AND QUESTION FOR VISUAL QUESTION ANSWERING
is a bonafide work carried out by

Aditya V (PES1UG19EC016)
Kaushik G (PES1UG19EC132)
Sinchana S R (PES1UG19EC299)

in partial fulfillment for the completion of the Program of Study B.Tech in Electronics and Communication
Engineering under the rules and regulations of PES University, Bengaluru, during the period August–December
2022. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in
the report. The project report has been approved as it satisfies the 6th-semester academic requirements in
respect of the project work.

Signature with date & Seal            Signature with date & Seal            Signature with date & Seal
Prof. Raghavendra M J Dr. Anuradha M Dr. Surya Prasad
Internal Guide Chairperson Dean of Faculty

Name and Signature of Examiners:


NAME: SIGNATURE:

DECLARATION

We, Aditya V, Kaushik G and Sinchana S R, hereby declare that the project entitled, “EXTRACTION OF
FINE-GRAINED FEATURES OF A COMPLEX OUTDOOR IMAGE AND QUESTION FOR VISUAL
QUESTION ANSWERING”, is an original work done by us under the guidance of Prof. Raghavendra M J,
Assistant Professor, Department of Electronics and Communication Engineering, PES University,
and is being submitted in partial fulfilment of the requirements for completion of the 6th-semester course work
in the Program of Study B.Tech in Electronics and Communication Engineering.

PLACE: BENGALURU
DATE:14/12/2022

NAME AND SIGNATURE OF THE CANDIDATE

Aditya V (PES1UG19EC016)

Kaushik G (PES1UG19EC132)

Sinchana S R (PES1UG19EC299)

ACKNOWLEDGEMENT

We would like to take this opportunity to thank Prof. RAGHAVENDRA M J, Assistant Professor,
Department of Electronics and Communication Engineering, PES University, for his
persistent guidance, suggestions, assistance and encouragement throughout the
development of this project. It was an honor for us to work on the project under his
supervision and to understand the importance of the project at its various stages.

We are grateful to the project coordinator, Prof. Rajasekar M, Department of
Electronics and Communication Engineering, for organizing, managing, and helping
with the entire process.

We would like to thank Dr. ANURADHA M, Professor and Chairperson, Department of
Electronics and Communication Engineering, PES University, for her invaluable support
and for the continuous support given by the department. We are also very grateful
to all the professors and non-teaching staff of the Department who have directly or
indirectly contributed to enriching our work.

We thank the Chancellor Dr. M R DORESWAMY, Pro-Chancellor Prof. JAWAHAR
DORESWAMY, Vice Chancellor Dr. SURYAPRASAD J, the Registrar Dr. K S
SRIDHAR and the Dean of Engineering Dr. KESHAVAN B K of PES University for
giving us this wonderful opportunity to complete this project and for providing everything
required to do so. Their support has motivated us throughout our work.

We are grateful and obliged to our family and friends for providing the energy and
encouragement when we needed them the most.


ABSTRACT
Visual Question Answering (VQA) combines computer vision, natural language processing, and common-sense
reasoning. It offers a wide range of possible uses in human-computer interaction, including helping the blind and
building AI-based personal assistants. Building on the many developments in this field, we aim to develop a
deep-learning algorithm that extracts fine-grained features from a complex outdoor image as well as from the
accompanying question using Visual Question Answering, and that generates further questions based on the answer
produced. The model extracts fine-grained features of the input image using the convolutional neural network ResNet
and fine-grained features of the question using an RNN with LSTM units; the extracted outputs are merged and the
answer is generated using an attention mechanism. The generated answer is then used to generate questions which can
again be answered by our model. The VQA v1 dataset, which includes complex outdoor images, is used; 70% of the
samples form the training split and the remaining 30% are used for testing.

To solve the Visual Question Answering task, we train and test the model on simple and complex images. We
additionally integrate an attention mechanism in which the semantic representation of the question serves as the
search term, and the network is used to hunt for regions in the image that match the response, giving better answer
retrieval. For question generation we use T5 (Text-to-Text Transfer Transformer), a Transformer-based architecture
that employs a text-to-text method: for every task, including translation, question answering, question generation and
classification, text is fed to the model as input and the model is trained to produce the target output text.


TABLE OF CONTENTS

Abstract

Acknowledgement

Chapter 1: Introduction

1.1 Motivation

1.2 Objectives

1.3 Problem Statement

1.4 Organisation of the Report

Chapter 2: Literature Survey

2.1 “Learning to Recognize Visual Concepts for Visual Question Answering With Structural Label Space”

2.2 “Fine-Grained Hashing With Double Filtering”

2.3 “Feature Balance for Fine-Grained Object Classification in Aerial Images”

2.4 “A Novel Semantics-Preserving Hashing for Fine-Grained Image Retrieval”

2.5 “Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval”

2.6 “Visual Question Answering With Dense Inter- and Intra-Modality Interactions”

2.7 “Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering”

2.8 “Re-Attention for Visual Question Answering”

2.9 “Diversified Attention Model for Fine-Grained Entity Typing”

2.10 “Visual Question Generation as Dual Task of Visual Question Answering”

Chapter 3: Methodology

3.1 Block Diagram

3.2 Dataset

3.3 Method

3.3.1.1 Image Encoder

3.3.1.2 VGG16

3.3.1.3 ResNet

3.3.2 Question Encoder

3.3.2.1 GloVe

3.3.2.2 LSTM

3.3.3 Model Architecture and Visualization

3.3.3.1 Stacked Attention Network

3.3.3.2 Customization in the Stacked Attention Network

Chapter 4: Results

Chapter 5: Software Needed

Chapter 6: Conclusion and Future Scope

LIST OF TABLES
Table No.  Table Name
2.1   Comparison With Previous State-of-the-Art Methods on the VQA 1.0 Dataset
2.2   Performance on the Validation Split of the VQA v2 Dataset
2.3   Overall Accuracies on the Test-Dev and Test-Challenge Sets of the VQA 2.0 Dataset
2.4   Results of a Single Model Compared With the State-of-the-Art Models on the Test-Dev and Test-Std Sets
2.5   Results on the VQA 2.0 Validation Set
2.6   Ablation Study on the VQA v1.0 and COCO-QA Datasets (Variant Implementations of the Model)
2.7   Qualitative Analysis of Different Model Performances (Accuracy) in Reasoning
2.8   Performance (%) on the VQA-CP v2 Test Set and VQA v2 Validation Set
2.9   Evaluations of BUTD on VQA 2.0 Test-Dev With the Visual Genome Dataset
2.10  Overall Accuracy Compared With Other Methods on the FVQA Dataset
2.11  Comparison of Performance of Our Model With the State-of-the-Art Methods on the MS-COCO Dataset
2.12  Performance Comparison on the RSIVQA Dataset
2.13  Performance Comparison on the Remote Sensing VQA Dataset
2.14  Results on the MS-COCO Dataset
2.15  VQA Results on Our Partition, in Percentage

CHAPTER 1: INTRODUCTION
In the field of research known as "visual question answering" (VQA), the goal is to create a computer system
that can answer natural-language questions posed about an image, combining visual and textual input.
The challenge of visual question answering was introduced to link computer vision with natural language
processing (NLP) and push the frontiers of both domains. Computer vision is the study of techniques for


gathering, analyzing, and comprehending images. Its main objective is to instruct machines in vision.
NLP, on the other hand, focuses on making it possible for computers and people to communicate in
natural language. NLP and computer vision both fall under the umbrella of artificial intelligence, and they
both use comparable machine learning-based techniques.
Our model focuses on visual question answering (VQA): it receives a question and an outdoor image as input
and extracts fine-grained features from both to produce an answer. A question-generator model is then fed
with the obtained answer to create more specific questions about it. These refined questions are fed back
into the VQA model as inputs to obtain refined responses.

Figure 1.1 Example 1 : Fine grained Feature Extraction

In this example, a general question might be "Is the man playing tennis?", for which the answer will be
"Yes"; but using fine-grained extraction we can also generate an answer to the question "What is the colour of
the ball?", namely "Yellow".

CHAPTER 1.1: MOTIVATION


Understanding the contents of images is the task of VQA, but doing so frequently requires
non-visual knowledge, which can range from "common sense" to topic-specific
information. The motivation behind this project is to develop a better understanding of the objects in an
image and of the type of question asked.

The scope of this project is as follows:


• Learning aids for kids.
• Image processing in space organizations.
• Assistance to color-blind users.
• Interactive robotic systems.

CHAPTER 1.2: PROBLEM STATEMENT


To develop an algorithm that extracts fine-grained features of a complex outdoor image and the accompanying
question for Visual Question Answering, generates probable questions, and obtains a refined answer.

CHAPTER 1.3: REPORT ORGANISATION


CHAPTER 2: LITERATURE SURVEY

2.1 “Learning to Recognize Visual Concepts for Visual Question Answering With Structural Label Space”

The authors of this paper are Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen. It is taken from the
IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, March 2020. The paper proposes a way
of recognising visual concepts in a VQA model using a structural label space.
The datasets used are Visual Genome, VQA v2, GQA and VQA-CP v2.
The methods used are as follows:
1) Faster R-CNN is used to extract image features, whereas a gated recurrent unit is used to extract word
features.
2) A Group Prediction Network uses the image features and question features to distill the group index.
3) A Dynamic Concept Recognizer uses fully connected layers and a soft attention mechanism to predict the
answer.

The model achieves an accuracy of 56.94% on the GQA dataset.

The model's two main benefits are that it improves question-answering performance while lowering the risk of
over-relying on linguistic priors, and that it enhances visual recognition performance by utilising the
semantics of the labels.
Its drawbacks are that the model must be trained to work on specific photos where it is challenging to find the
group index label, and that it cannot predict the intention of ambiguous queries.


2.2 “Fine-Grained Hashing With Double Filtering”

The authors of this paper are Zhen-Duo Chen, Xin Luo, Yongxin Wang, Shanqing Guo, and Xin-Shun Xu,
Member, IEEE. It is taken from IEEE Transactions on Image Processing, vol. 31, 2022. The paper proposes
FISH, a novel fine-grained hashing method based on double filtering.
The datasets used are CUB-200-2011, Stanford Dogs and VegFru. The proposed model is a fine-grained hashing
model with a double-filtering mechanism, consisting of a space-filtering module and a feature-filtering module.
With ResNet18 as the base network, the model achieves 0.783 at 64 bits on CUB-200-2011 and 0.807 at 64 bits
on Stanford Dogs. With AlexNet as the base network, it achieves 0.527 at 48 bits on CUB-200-2011 and 0.611 at
48 bits on Stanford Dogs.
The key benefits of this work are that it simultaneously handles the problems of fine-grained feature
extraction, feature refining, and loss-function design, and that its accuracy is higher than that of prior
hashing methods.
The paper's key drawback is that, being a fully supervised algorithm, it requires extra annotation, which takes
more time and is more expensive.


2.3 “Feature Balance for Fine-Grained Object Classification in Aerial Images”

The authors of this paper are Wenda Zhao, Tingting Tong, Libo Yao, Yu Liu, Congan Xu, You He, and
Huchuan Lu, Senior Member, IEEE. It is taken from IEEE Transactions on Geoscience and Remote Sensing,
vol. 60, 2022. The paper proposes a low-resolution fine-grained object classification (LR-FGOC) method that
deals with objects and details which are blurred or missing.
The datasets used are the LFS dataset and the object datasets DOTA, FS23 and HSRC2016. The method
proposes a novel pipeline based on two technical insights: a feature-balance strategy and an iterative
interaction mechanism. The model achieves an accuracy of 90.2% on LFS, 97.0% on DOTA, 89.3% on FS23 and
80.5% on HSRC2016.
The model's advantages include improving the baseline model by 3.4%, increasing the concentration on weak
features, which improves the representation of multilevel characteristics, and weighting regional features.
The disadvantage is that extracting the SR features makes the inference stage time-consuming; distillation
methods that directly teach the LR branch to reorganise SR characteristics still need to be investigated.


2.4 “A Novel Semantics-Preserving Hashing for Fine-Grained Image Retrieval”

The authors of this paper are Han Sun, Yejia Fan, Jiaquan Shen, Ningzhong Liu, Dong Liang and Huiyu Zhou.
The paper proposes a method to retrieve images and the sensitive information related to them using hashing.
The approach builds on convolutional neural network hashing (CNNH), which combines a CNN with hashing. The
proposed model consists of four key components: a feature extractor, a hash layer, a classification layer and a
loss part containing cross-entropy loss, quantization loss and bit-balance loss.
The model reports mAP gains of 10.9%, 12.93% and 13.24%.
The advantages of this model include the fact that its retrieval speed does not degrade the way it does for
other data structures, and that hashing is a more dependable and secure method of retrieving images.
The disadvantage is that hashing does not permit null values and becomes ineffective when there are too many
collisions.


2.5 “Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval”

The authors of this paper are Jing Yu, Weifeng Zhang, Yuhang Lu, Zengchang Qin, Yue Hu, Jianlong Tan and
Qi Wu. The paper proposes a novel Visual Relational Reasoning (VRR) module.
The dataset used is the MS-COCO dataset.

The method makes use of a Bilinear Visual Attention Module and a Visual Relational Reasoning Module.

The proposed model has an accuracy of 58.1%, an overall improvement of 2.3%.

The key benefits of this model are that it performs better than other models on "wh" questions, since it is based
on logic, and that it uses small networks with fewer parameters, which increases accuracy.
The model's fundamental flaw is that it cannot fully account for the connections between the various items in the
image.


2.6 “Visual Question Answering With Dense Inter- and Intra-Modality Interactions”

The authors of this paper are Fei Liu, Jing Liu, Zhiwei Fang, Richang Hong and Hanqing Lu. It was published
in IEEE Transactions on Multimedia, vol. 23, 2021. The paper proposes a novel DenIII framework for VQA
which helps in capturing more fine-grained multimodal information.
The datasets used are VQA v1.0, VQA v2.0 and TDIUC.

The methods used are as follows:

1) Feature extraction: initial image features are extracted using Faster R-CNN.
2) Dense inter- and intra-modality interactions: a GRU with 300 hidden units is adopted as the question-encoding
layer.

A maximum accuracy of 69% is achieved on the VQA v1.0 dataset, 68.7% on VQA v2.0 and 68.8% on TDIUC.

The primary benefit of this model is that the DenIII framework performs competitively on all three datasets
studied, demonstrating that the framework is well modelled.
Its fundamental drawback is that it produces an erroneous count of items when one object in a picture is
obscured by another.


2.7 “Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering”
The authors of this paper are Zhou Yu, Jun Yu, Member, IEEE, Chenchao Xiang, Jianping Fan, and
Dacheng Tao, Fellow, IEEE. It was published in IEEE Transactions on Neural Networks and Learning Systems,
vol. 29, no. 12, December 2018. The paper proposes a method for automatic answer prediction that considers the
complex correlations between the multiple diverse answers to the same question.

The datasets used are VQA 1.0 and VQA 2.0.

The methods used are as follows:

1) A multimodal factorized bilinear pooling approach.
2) A multimodal factorized high-order pooling (MFH) method to achieve a more effective fusion of multimodal
features.
3) Kullback–Leibler divergence (KLD) as the loss function, to achieve a more accurate characterization.

The model achieves 68.16% accuracy on the VQA 2.0 dataset.

The model's two key benefits are that it is robust and that it operates with minimal complexity to produce the
best results.
The key drawback is that it does not examine how picture regions and semantic words relate to one another.


2.8 “Re-Attention for Visual Question Answering”

The authors of this paper are Wenya Guo, Ying Zhang, Jufeng Yang and Xiaojie Yuan. It was published in
IEEE Transactions on Image Processing, vol. 30, 2021. The paper proposes a re-attention framework that
utilizes the information contained in answers for the VQA task.

The datasets used are VQA, VQA v2 and COCO-QA.

The methods used are as follows:

1. Questions and their associated relevant objects are unified into a single entity, which is used to predict
the answers.
2. The generated answers are used to re-attend to the objects and alter the visual attention map.

The model achieves an improvement of 6.28% in overall accuracy on the VQA v2 dataset.

The primary benefit is that this re-attention technique reconstructs the visual attention map from the
information extracted from the generated answers, which helps in detailing the answer.
Although this method learns the right regions, it struggles to respond to questions that require
common sense, and a dedicated text-interpretation module is required to correctly identify the numbers in the
image.


2.9 “Diversified Attention Model for Fine-Grained Entity Typing”

The authors of this paper are Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang and
Ming Zhou (The Chinese University of Hong Kong, Hong Kong, China, and Massachusetts Institute of
Technology, USA).

The paper proposes a model (DSAM) for fine-grained entity typing, which explicitly diversifies the semantic
attentions to capture multiple pieces of discriminative information.

The datasets used are FIGER, OntoNotes and BBN.

The method takes the DSAM approach and compares it with different attention-network models on the three
datasets, which helps measure the superiority of the DSAM. The highest accuracy, 84.31%, is obtained with the
"three length Att" model.
The two primary benefits are that the DSAM approach uses the diversity-constraint model to take advantage
of subtle, local discrimination for differentiating subtypes, and that the DSAM gains performance by
extracting the most relevant features with its mention-aware attention mechanism.
Its two key limitations are that attention segments produced from short lengths cannot always supply additional
information for fine-grained entity typing, and that the DSAM's effectiveness depends on a tuning parameter:
performance rises with it and then falls sharply once it increases beyond a certain degree.


2.10 “Visual Question Generation as Dual Task of Visual Question Answering”

The authors of this paper are Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen, and it is taken from
the IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, March 2020.
The paper proposes an invertible bilinear fusion module and a parameter-sharing scheme with which the
resulting iQAN model can accomplish VQA and its dual task, VQG, simultaneously.
The datasets used are VQA2 and CLEVR.
The method used is the Invertible Question Answering Network (iQAN), which introduces question generation as a
dual task of question answering. The encoder takes the image together with either the answer or the question; a
fusion of the two inputs is then performed, and a decoder retrieves the corresponding question or answer.
Evaluated on the CLEVR and VQA2 datasets, iQAN improves the top-1 accuracy of the prior-art MUTAN VQA
method by 1.33% and 0.88% respectively.

The proposed method reconstructs the VQA model into its dual VQG form, which enables training a single
model simultaneously on two conjugate tasks. The two key advantages are that either the question or the
answer can be encoded, allowing us to retrieve the other as output.
The two significant drawbacks stem from its bidirectional nature and from the need to train the model for each
task.


CHAPTER 3: METHODOLOGY

3.1 BLOCK DIAGRAM:

Fig. 3.1: Block Diagram.

The model requires two inputs: A complex outdoor image and a question regarding the input image. After
receiving inputs, the model analyses the image and the questions to extract fine-grained features. The answer-
generating model receives both features and uses them to anticipate an acceptable response. The obtained
answer and image are passed onto the question generator to generate probable questions regarding the given
inputs. The probable questions are fed as refined questions to the visual question-answering model to get
refined answers.
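The overall flow can be summarised as a short sketch. The callables vqa_model, captioner and question_generator below are placeholders for the modules described in the following sections, not actual function names from our implementation.

```python
def refine_answer(image, question, vqa_model, captioner, question_generator):
    """One pass through the pipeline of Fig. 3.1 (all callables are placeholders)."""
    answer = vqa_model(image, question)                  # fine-grained VQA answer
    caption = captioner(image)                           # caption used as context for VQG
    refined_question = question_generator(caption, answer)
    refined_answer = vqa_model(image, refined_question)  # refined question fed back to VQA
    return answer, refined_question, refined_answer
```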


Figure 3.2 Baseline Model Architecture of visual question answering

To embed the image, a convolutional neural network based on ResNet is used, while a multi-layer LSTM is fed with
the tokenized, embedded input query. Then, several attention distributions over the image features are computed
using the concatenated image features and the LSTM's final state. Two fully connected layers receive the
concatenated image-feature glimpses and the LSTM state and compute probabilities over the answer classes.

RESNET 152 :

Figure 3.3 RESNET 152 ARCHITECTURE


Residual networks (ResNet) were put forth as a family of deep neural network architectures with comparable
structure but varying depths. To counteract the degradation of very deep networks, ResNet
incorporates a structure called a residual learning unit: a feedforward block with a shortcut connection,
which adds the unit's input back onto its output. This unit's key advantage is that it improves classification
accuracy while keeping the model's complexity essentially constant. ResNet-152 obtains the highest accuracy
among the ResNet family members.
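As an illustration, a minimal PyTorch-style sketch of a single bottleneck residual unit is shown below; the framework choice and the layer sizes are assumptions on our part, and ResNet-152 stacks many such units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckResidualUnit(nn.Module):
    """One residual learning unit: a small feedforward block plus a shortcut connection."""
    def __init__(self, in_channels, mid_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, in_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_channels)

    def forward(self, x):
        identity = x                                  # shortcut connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + identity)                 # add the input back, then apply ReLU

unit = BottleneckResidualUnit(256, 64)
y = unit(torch.randn(1, 256, 14, 14))                 # output keeps the input shape
```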

LONG SHORT TERM MEMORY NETWORK(LSTM) :

Figure 3.4 LSTM ARCHITECTURE

A special class of recurrent neural networks (RNNs) that can learn long-term dependencies is known as the
long short-term memory (LSTM) network. LSTMs were initially developed by Hochreiter & Schmidhuber
(1997) and were improved and popularized in subsequent work. They are currently used extensively because they
operate well in a wide range of situations.
LSTMs are specifically designed to avoid the long-term dependency problem: retaining information for extended
periods of time is their natural behaviour, not something they have to struggle to learn. Like all recurrent
neural networks, they take the form of a chain of repeating neural network modules, but the repeating module is
built to selectively remember patterns over extended periods of time. As in any RNN, the output of the previous
step is fed as input to the next step, and the LSTM additionally maintains an internal cell state that stores
information separately from the hidden output.
For this project, a CNN is used for the image and an LSTM for the question in order to produce the answer.

3.2 DATASET

The VQA v1.0 dataset is used; we test our model on both the balanced and unbalanced variants of the VQA
dataset. VQA 1.0 contains 204,721 images drawn from the MS COCO dataset.

3.3 METHOD

The task is to predict the most likely answer from a fixed set of answers, based on the content of an image I
and a query q expressed in natural language:

\hat{a} = \arg\max_{a \in \{a_1, a_2, \dots, a_M\}} P(a \mid I, q)

The answer set is built from the most frequent answers in the training set.
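A minimal sketch of how the fixed answer set {a1, ..., aM} can be built from the most frequent training answers; the variable names and the value of M are illustrative.

```python
from collections import Counter

def build_answer_vocab(train_answers, M=1000):
    """train_answers: list of answer strings from the training split."""
    counts = Counter(train_answers)
    answers = [a for a, _ in counts.most_common(M)]           # the M most popular answers
    answer_to_index = {a: i for i, a in enumerate(answers)}   # used as classification targets
    return answers, answer_to_index
```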

3.3.1) IMAGE EMBEDDINGS :

A pretrained convolutional neural network (CNN) based on the residual network architecture is used to compute a
high-level representation φ of the input image I:

\phi = \mathrm{CNN}(I)

φ is a three-dimensional tensor of size 14 × 14 × 2048 taken from the residual network's last layer before the
final pooling layer. Applying l2 normalisation along the depth (last) dimension of the image features improves
the learning dynamics.
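A minimal sketch of this feature extraction with torchvision; the use of PyTorch and of a 448 × 448 input (which yields the 14 × 14 grid at the network's stride of 32) are assumptions on our part.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

resnet = models.resnet152(pretrained=True)
# keep everything up to the last residual stage; drop the final pooling and fc layers
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

image = torch.randn(1, 3, 448, 448)          # a 448x448 RGB image gives a 14x14 feature grid
with torch.no_grad():
    phi = feature_extractor(image)           # shape (1, 2048, 14, 14)
phi = F.normalize(phi, p=2, dim=1)           # l2-normalise along the depth dimension
```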

3.3.2) QUESTION EMBEDDINGS :

The given query is tokenized and encoded into word embeddings E_q = \{e_1, e_2, \dots, e_P\}, where P is the
number of words in the question, D is the length of the distributed word representation, and e_i \in \mathbb{R}^D.
The embeddings are subsequently fed into a long short-term memory (LSTM) network [11]:

s = \mathrm{LSTM}(E_q)


The LSTM's final state is utilized to represent the query.
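A minimal sketch of the question encoder; the embedding size of 300, the hidden size of 1024 and the tanh applied to the embeddings are assumptions, and the embedding table could be initialised with pretrained GloVe vectors.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)  # e_1 ... e_P
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):             # token_ids: (batch, P) integer tokens
        e = torch.tanh(self.embed(token_ids))
        _, (h, _) = self.lstm(e)              # h: (1, batch, hidden_dim)
        return h.squeeze(0)                   # s = the LSTM's final state

encoder = QuestionEncoder(vocab_size=12000)
s = encoder(torch.randint(1, 12000, (1, 10)))  # a 10-word question -> (1, 1024)
```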

3.3.3) Attention Network:
Multiple attention distributions over the spatial dimensions of the visual features are estimated:

\alpha_{c,l} \propto \exp F_c(s, \phi_l), \qquad \sum_{l=1}^{L} \alpha_{c,l} = 1, \qquad x_c = \sum_{l=1}^{L} \alpha_{c,l}\, \phi_l

Each image glimpse x_c is the weighted average of the image features φ_l over all spatial locations
l = 1, 2, ..., L. The attention weights α_{c,l} of each glimpse c = 1, 2, ..., C are normalized separately.

In practice, F = [F_1, F_2, ..., F_C] is modelled with two layers of convolution; as a result, the F_i share
parameters in the first layer.
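As an illustration, a minimal PyTorch-style sketch of the glimpse computation is given below; the framework, the hidden width of 512 and the choice of C = 2 glimpses are assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """C glimpses over the 14x14 image features, conditioned on the question state s."""
    def __init__(self, feat_dim=2048, state_dim=1024, mid_dim=512, glimpses=2):
        super().__init__()
        self.conv1 = nn.Conv2d(feat_dim + state_dim, mid_dim, 1)   # F: first conv layer (shared)
        self.conv2 = nn.Conv2d(mid_dim, glimpses, 1)               # F: second conv layer

    def forward(self, phi, s):
        B, D, H, W = phi.shape
        s_tiled = s.view(B, -1, 1, 1).expand(-1, -1, H, W)         # tile s over all locations
        a = self.conv2(F.relu(self.conv1(torch.cat([phi, s_tiled], dim=1))))  # (B, C, H, W)
        alpha = F.softmax(a.view(B, a.size(1), -1), dim=-1)        # normalise over L = H*W
        phi_flat = phi.view(B, D, -1)                              # (B, D, L)
        x = torch.einsum('bcl,bdl->bcd', alpha, phi_flat)          # weighted averages x_c
        return x.reshape(B, -1)                                    # concatenated glimpses

attention = Attention()
x = attention(torch.randn(2, 2048, 14, 14), torch.randn(2, 1024))  # (2, 2*2048)
```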

3.3.4) Classifier
To produce probabilities over the answer classes, nonlinearities are applied to the concatenation of the image
glimpses and the LSTM state:

P(a_i \mid I, q) \propto \exp G_i(x, s), \qquad \text{where } x = [x_1, x_2, \dots, x_C]

In practice, G = [G_1, G_2, ..., G_M] is modelled as two fully connected layers. The final loss is defined as

\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} -\log P(a_k \mid I, q)
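A minimal sketch of the classifier G and the loss; the layer sizes and the answer-set size M = 3000 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    def __init__(self, glimpse_dim=2 * 2048, state_dim=1024, hidden=1024, num_answers=3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(glimpse_dim + state_dim, hidden),   # G: first fully connected layer
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_answers),               # G: second fully connected layer
        )

    def forward(self, x, s):
        return self.net(torch.cat([x, s], dim=1))         # unnormalised scores over answer classes

classifier = AnswerClassifier()
scores = classifier(torch.randn(2, 4096), torch.randn(2, 1024))
# cross-entropy gives the average negative log-likelihood of the annotated answers
loss = nn.CrossEntropyLoss()(scores, torch.tensor([5, 42]))
```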

3.3.5) CUSTOMIZATION TO THE ABOVE NETWORK :


Our approach simultaneously performs visual and textual attention over a number of steps and collects the
essential information from both modalities. Below we describe the basic attention mechanism used at each step,
which acts as the foundation for the overall attention network.

1) VISUAL ATTENTION:
By attending to certain areas of the input image, visual attention produces a visual context vector v^{(k)} at
step k:

v^{(k)} = \mathrm{V_{Att}}(\{v_n\}_{n=1}^{N},\, m_v^{(k-1)})

where m_v^{(k-1)} is a memory vector representing the information that has been attended up to step k−1.

2) TEXTUAL ATTENTION :
By paying close attention to particular words in the input sentence at each step, textual attention
calculates a textual context vector u(k):
u^{(k)} = \mathrm{T_{Att}}(\{u_t\}_{t=1}^{T},\, m_u^{(k-1)})

where m_u^{(k-1)} is a memory vector.
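A minimal sketch of one such attention step is given below; the same module can play the role of V_Att over the N region features and of T_Att over the T word features. The projection sizes and the scoring function are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepAttention(nn.Module):
    """One attention step over a set of features, conditioned on a memory vector m(k-1)."""
    def __init__(self, feat_dim, mem_dim, hidden=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.mem_proj = nn.Linear(mem_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, memory):          # feats: (B, N, feat_dim), memory: (B, mem_dim)
        h = torch.tanh(self.feat_proj(feats) + self.mem_proj(memory).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)            # attention over the N elements
        return (alpha * feats).sum(dim=1)                  # context vector v(k) or u(k)

v_att = StepAttention(feat_dim=2048, mem_dim=512)
v_k = v_att(torch.randn(2, 36, 2048), torch.randn(2, 512))  # (2, 2048)
```

After each step, the memory vectors m_v and m_u would be updated from the new context vectors before the next attention step; the exact update rule is a design choice not fixed here.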

Figure 3.5 CUSTOMIZED ARCHITECTURE

The algorithm pays attention to the precise areas and words that make it easier to answer the questions by
combining text attention with visual attention.
The softmax layer receives the output of the combined attention network and uses it to forecast the top 5
predictions.

The fusion of visual and question features depends heavily on attention.


The attention-guided integration of visual and textual characteristics has been extensively studied in a
number of previous papers. Only a few of the top-k region proposals (found using the visual feature extractor)
are relevant to a given input query q.

3.4) Development towards Question Generator Module :


3.4.1) Image Captioning :

Caption generation is a challenging AI task in which a written description must be produced for a given image.
It needs both computer-vision techniques to comprehend the image's content and a language model from the field
of natural language processing to translate that comprehension into words in the appropriate order.

A pre-trained Convolutional Neural Network (ENCODER), used in "traditional" image captioning systems,
would encode the image and create a hidden state h.

Then, it would employ an LSTM(DECODER) to decode this hidden state and produce each caption word
iteratively.

Figure 3.4.1 , A classic Image Captioning Model

The issue with this approach is that, when the model attempts to construct the next word of the caption, it
typically only manages to describe a small portion of the image and is unable to fully convey the meaning of the
input. Using the entire image representation as the only conditioning signal makes it impossible to efficiently
generate different words for different regions of the image. This is exactly the situation where an attention
mechanism is effective.

With an attention mechanism, the image is first divided into n parts and a convolutional neural network (CNN)
computes a representation (h1, ..., hn) for each part. When the RNN decoder generates a new word, the attention
mechanism concentrates on the pertinent area of the image, so the decoder uses only the relevant portions of the
image.


Figure 3.4.2 Image captioning architecture with attention Mechanism

Global attention, which focuses on all source-side words for every target word, is not recommended because it is
computationally expensive and impractical for long sentences. To address this shortcoming, local attention
chooses to concentrate on only a limited portion of the encoder's hidden states for each target word.

The given image may be captioned as "A boy on a skateboard". When we say "boy" the model should attend to the
region of the boy, and when we say "skateboard" it should attend to the region of the skateboard. These
locations lie in different pixels, and the final fully connected representation of VGG16 carries no information
about them.

However, every location of a convolutional layer corresponds to some location of the image, as shown below.


VGG16 ARCHITECTURE

The output of VGGNet's fifth convolution block is a feature map of size 14 × 14 × 512.

This gives 196 (14 × 14) pixel positions, each of which corresponds to a specific region of the image.

Finally, we can work with these 196 locations, each of which is a 512-dimensional vector. The model then
develops attention over these locations (which correspond to actual regions of the image). The generated caption
acts as the "context" for the question-generation model.
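A minimal sketch, assuming torchvision's VGG16, of how the 14 × 14 × 512 feature map is flattened into 196 locations of 512 dimensions each for the attention mechanism:

```python
import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True).features[:-1].eval()   # stop before the final max-pool
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    fmap = vgg(image)                          # (1, 512, 14, 14): fifth convolution block output
locations = fmap.flatten(2).permute(0, 2, 1)   # (1, 196, 512): 196 locations, 512-d each
```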

Dataset:
The Flickr8k dataset is used. It contains 8,000 images, each with five captions. The images are divided into
three parts:

Training set: 6,000 images.
Dev set: 1,000 images.
Test set: 1,000 images.


3.5) Question Generator :

T5, or Text-to-Text Transfer Transformer, is a Transformer-based architecture that employs a text-to-text
method: for every task, including translation, question answering, question generation and classification, text
is fed to the model as input and the model is trained to produce the target text.

Transfer learning, which involves pre-training a model on a data-rich task before fine-tuning it on a downstream
task, has become a powerful technique in natural language processing (NLP). The success of transfer learning has
led to a wide range of methodologies, practices, and strategies.

Compared to training only on small labelled datasets without pre-training, the model performs significantly
better after being fine-tuned on smaller, task-specific labelled datasets.

We fine-tune T5 on the SQuAD dataset to generate a question based on the given context (the image caption).

Tokenization of the words is done using the T5 tokenizer.


Architecture of Question Generator

The given text is preprocessed and then encoded and decoded using the T5 tokenizer. The context is passed to the
fine-tuned T5 Transformer to train the question-generation network. A context-based question is generated during
evaluation.
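A minimal inference sketch using the Hugging Face transformers library; the checkpoint name, the "answer: ... context: ..." prompt format and the generation settings are assumptions and would have to match whatever was used during fine-tuning.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # replace with the fine-tuned weights

context = "A boy on a skateboard."       # caption produced by the image-captioning module
answer = "skateboard"                    # answer returned by the VQA model
prompt = f"answer: {answer}  context: {context}"  # assumed prompt format

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=32, num_beams=4)
question = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(question)
```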

Exploring Tokenizer :

Text: “This is our Capstone Project”

Encoded output:

Tokenized Output:

Decoded Output :
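The outputs above can be reproduced with a short snippet such as the following; the exact token ids and sub-word pieces depend on the T5 vocabulary, so they are not shown here.

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
text = "This is our Capstone Project"

encoded = tokenizer.encode(text)        # encoded output: a list of integer token ids
tokens = tokenizer.tokenize(text)       # tokenized output: SentencePiece sub-word pieces
decoded = tokenizer.decode(encoded, skip_special_tokens=True)   # decoded output: the original text

print(encoded, tokens, decoded, sep="\n")
```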

3.3.6) Hyper Parameters



3.3.6.1) Optimizer
Adam is used as the optimizer, starting with a learning rate of 0.001 that is gradually lowered with the aid
of a learning-rate scheduler. Every time we train a neural network, its output diverges from the desired
result; this divergence is measured by a cost function or loss function.

Optimizers are techniques for reducing the network's loss by modifying neural-network properties such as the
weights and the learning rate. Although there are many different kinds of optimizers, we employ the Adam
optimizer here. Adam stands for adaptive moment estimation, and it combines two concepts:
i) Momentum, which smooths the updates, and
ii) the Root Mean Squared Propagation method (RMSProp), which helps adapt the step size.
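A minimal sketch of this optimizer setup; the scheduler type and its step size and decay factor are assumptions.

```python
import torch
import torch.nn as nn

network = nn.Linear(10, 2)    # stand-in for the VQA network
optimizer = torch.optim.Adam(network.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()          # gradually lower the learning rate
```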

3.6) System Requirement Specification :


Hardware Requirements
• Processor: Intel i5 or better
• Processor speed: 2.4 GHz
• RAM: 4 GB
• GPU: NVIDIA 1050 Ti, 4 GB
• Hard disk space: 128 GB

Software Requirements
• Operating system: Windows 7/8/10
• Coding language: Python 3
• Software tools: Jupyter Notebook, Google Colab

Frontend Requirements
• Languages/frameworks: HTML, CSS, JavaScript and Flask
• Software tool: Jupyter Notebook


CHAPTER 4: RESULT AND DISCUSSION

Model                                      Train Accuracy
Baseline model (ResNet)                    0.61
Visual and textual attention based model   0.64

VQA MODEL OUTPUT SNAPSHOTS:


CHAPTER 5: CONCLUSION AND FUTURE WORK

In this report, a customized attention network, improved with the addition of visual and textual attention, is
proposed to solve the Visual Question Answering (VQA) task. The model is trained and tested on the VQA v1
dataset. The customised model is tested and compared with pre-existing models and shows an improvement in the
accuracy of the VQA task.
The generated answer and the related image are passed to the image-captioning model, which generates a caption
for the image. This caption is used as the "context" for the question generator, which is built on a fine-tuned
T5 Transformer.
The generated question is then fed back into the VQA model to obtain refined answers.

The VQA model could be improved to generate answers in phrases with higher accuracy by adding double
or triple attention layers. The network could also be improved by using more data and more computational
power.


REFERENCES

[1] F. Liu, J. Liu, Z. Fang, R. Hong and H. Lu, "Visual Question Answering With Dense Inter- and Intra-
Modality Interactions," in IEEE Transactions on Multimedia, vol. 23, pp. 3518-3529, 2021, doi:
10.1109/TMM.2020.3026892.

[2] W. Guo, Y. Zhang, J. Yang and X. Yuan, "Re-Attention for Visual Question Answering," in IEEE
Transactions on Image Processing, vol. 30, pp. 6730-6743, 2021, doi: 10.1109/TIP.2021.3097180.

[3] Z. Yu, J. Yu, C. Xiang, J. Fan and D. Tao, "Beyond Bilinear: Generalized Multimodal Factorized High-
Order Pooling for Visual Question Answering," in IEEE Transactions on Neural Networks and Learning
Systems, vol. 29, no. 12, pp. 5947-5959, Dec. 2018, doi: 10.1109/TNNLS.2018.2817340.

[4] C. Chen, D. Han and J. Wang, "Multimodal Encoder-Decoder Attention Networks for Visual Question
Answering," in IEEE Access, vol. 8, pp. 35662-35671, 2020, doi: 10.1109/ACCESS.2020.2975093.

[5] L. Peng, Y. Yang, Z. Wang, Z. Huang and H. T. Shen, "MRA-Net: Improving VQA Via Multi-Modal
Relation Attention Network," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no.
1, pp. 318-329, 1 Jan. 2022, doi: 10.1109/TPAMI.2020.3004830.

[6] H. Zhong, J. Chen, C. Shen, H. Zhang, J. Huang and X. -S. Hua, "Self-Adaptive Neural Module
Transformer for Visual Question Answering," in IEEE Transactions on Multimedia, vol. 23, pp. 1264-1273,
2021, doi: 10.1109/TMM.2020.2995278.

[7] Y. Liu, X. Zhang, F. Huang, L. Cheng and Z. Li, "Adversarial Learning With Multi-Modal Attention for
Visual Question Answering," in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 9,
pp. 3894-3908, Sept. 2021, doi: 10.1109/TNNLS.2020.3016083.

[8] K. C. Shahira and A. Lijiya, "Towards Assisting the Visually Impaired: A Review on Techniques for
Decoding the Visual Data From Chart Images," in IEEE Access, vol. 9, pp. 52926-52943, 2021, doi:
10.1109/ACCESS.2021.3069205.

[9] D. Gao, R. Wang, S. Shan and X. Chen, "Learning to Recognize Visual Concepts for Visual Question
Answering With Structural Label Space," in IEEE Journal of Selected Topics in Signal Processing, vol. 14,
no. 3, pp. 494-505, March 2020, doi: 10.1109/JSTSP.2020.2989701.

[10] Y. Zhou et al., "Plenty is Plague: Fine-Grained Learning for Visual Question Answering," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 697-709, 1 Feb. 2022, doi:
10.1109/TPAMI.2019.2956699.

[11] L. Zhang et al., "Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering,"
in IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 10, pp. 4362-4373, Oct. 2021,
doi: 10.1109/TNNLS.2020.3017530.

[12] J. Yu et al., "Reasoning on the Relation: Enhancing Visual Representation for Visual Question
Answering and Cross-Modal Retrieval," in IEEE Transactions on Multimedia, vol. 22, no. 12, pp. 3196-3209,
Dec. 2020, doi: 10.1109/TMM.2020.2972830.

[13] K. Terao, T. Tamaki, B. Raytchev, K. Kaneda and S. Satoh, "An Entropy Clustering Approach for
Assessing Visual Question Difficulty," in IEEE Access, vol. 8, pp. 180633-180645, 2020, doi:
10.1109/ACCESS.2020.3022063.

[14] X. Zheng, B. Wang, X. Du and X. Lu, "Mutual Attention Inception Network for Remote Sensing Visual
Question Answering," in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-14, 2022, Art
no. 5606514, doi: 10.1109/TGRS.2021.3079918.

[15] Y. Tao, Z. Zongyang, Z. Jun, C. Xinghua and Z. Fuqiang, "Low-altitude small-sized object detection
using lightweight feature-enhanced convolutional neural network," in Journal of Systems Engineering and
Electronics, vol. 32, no. 4, pp. 841-853, Aug. 2021, doi: 10.23919/JSEE.2021.000073.

[16] Z. Yang, X. He, J. Gao, L. Deng and A. Smola, "Stacked Attention Networks for Image Question
Answering," arXiv:1511.02274, 2015. https://doi.org/10.48550/arXiv.1511.02274

[17] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Batra and D. Parikh, "VQA: Visual Question
Answering," arXiv:1505.00468 [cs.CL].
