Evaluating Models Based On Explainable Ai: Keywords: Gradcam, Explainability, Xai, Artificial Intelligence

Evaluating models based on Explainable AI
Srikanth K.S1*, T.K Ramesh2*, Suja Palaniswamy3, Ranganathan, Srinivasan4
Abstract
Usage of deep learning / machine learning algorithms together with the advancement
of processor technology have triggered an accelerated growth of complex autonomous
systems incorporating artificial intelligence (AI) models in diverse applications, e.g.,
medical diagnostics, sentiment analysis, recommender systems, autonomous
10 navigation, etc. A model can be trusted if the outcome of a model is like the solution
obtained by a human user. Hence model accuracy alone is not a necessary and sufficient
metric in AI. Model explainability can be evaluated based on a given use case.
In this paper Explainable AI (XAI) is used as a technique to compare the model
accuracy with respect to the explanations given by the model for a given classification.
15 A face mask detection use case is taken and state of art Convolution Neural Network’s
(CNN) with similar accuracies are compared with respect to the explanations. The
proposed approach demonstrates why model accuracy alone is not a sufficient metric
for selecting a model for deployment.
Keywords: GradCAM, Explainability, XAI, Artificial Intelligence.
20 2010 MSC: 00-01,99-00
1
Department of Electronics and Communication, Amrita School of Engineering, Bengaluru, Amrita Vishwa
Vidyapeetham, India. Email address: ks_srikanth@blr.amrita.edu
2
Department of Electronics and Communication, Amrita School of Engineering, Bengaluru, Amrita Vishwa
Vidyapeetham, India. Email address: tk_ramesh@blr.amrita.edu
* Corresponding author
3
Department of Computer Science and Engineering, Amrita School of Engineering, Bengaluru, Amrita
Vishwa Vidyapeetham, India. Email address: p_suja@blr.amrita.edu
4
Adjunct faculty, Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai,
India. Email address: ranga@iitm.ac.in
1. Introduction
Software is increasingly used in autonomous systems such as automobiles, aircrafts,

medical devices, smart city applications, etc. These autonomous systems are developed
with an objective to build intelligent agents that perceive, learn, decide, and act on their
25 own. However, the effectiveness and trust on these systems are limited by the AI
model’s Explainability.
There are a few popular AI models like regression models, decision trees, support
vector machines, which are inherently explainable but inadequate for capturing and
learning real world problems like image classification where deep learning models are
30 preferred. Deep learning (DL) models are used to solve classification problems where
the model learns complex features from the input image and predicts a class label. Due
to the nature in which they learn and perceive complex features they are termed as
“black box models” which makes explainability difficult which is an important
requirement to enhance the trust in the model[1]. When these models are applied for
35 solving real world problems, it becomes a necessity to get insights from the model and
verify that it has selected the right subset of features to arrive at the predicted outcome.
Hence, the concept of Explainable AI (XAI) in AI models is much sought after. With
the widespread corona virus (COVID-19), there has been great focus on accelerated
research, deploying DL for identifying patterns in transmission and possible control of
40 the infections. Most of the models developed using DL have not been evaluated to
verify if they are indeed looking at features which a human with the right domain
knowledge would look at. Understanding a DL model and evaluating it based on the
model explainability is the objective of this work.
In this paper, we consider an image classification problem, to detect if a person is
45 wearing a mask (correctly or incorrectly) or not . We compare the model accuracies of
state of art models and inspect the features considered for arriving at the decisions.
Based on the model explainability, we arrive at the best model. The models are
compared against the features that are normally picked up by human beings for the
same classification problem.
50
1.1 Organization of this paper
This paper is structured as follows: Section 2 provides the related work. Section 3
describes a use case of face mask detection and verifies the base model accuracies.
Section 4 applies XAI techniques on the base model to identify the features that were
55 considered significant for the model to deduce its outcome. Section 5 describes a
methodology to calculate an AI Safety score as a function of features that a human
would consider for the given use case. Section 6 provides a conclusion and
recommended future work.
2. Related work – Motivation from COVID 19 studies
60 The outbreak of COVID19 pandemic has provided significant opportunities to the

information technology sector especially in healthcare, assisted technology and
automation where the use of ML/DL has significantly helped researchers in identifying
the presence of COVID19 [2][3][4][5], identifying patterns that can be used to contain
the spread[6][7][8], discover new vaccines [9], trying to understand the features
65 considered significant by the model [10] , etc. An incorrect decision made by the
ML/DL model can have an adverse effect, hence it is important to evaluate DL model
particularly used in safety critical applications. Typically, DL models are evaluated
using the metrices summarised in Table 1.
Table 1 Accuracy metrics for DL model

Metric Formula
Accuracy 𝑁𝑜 𝑜𝑓 𝑐𝑜𝑟𝑟𝑒𝑐𝑡 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
True positive rate (Sensitivity) 𝑇𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠
𝐹𝑎𝑙𝑠𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 + 𝑇𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒
True negative rate (Specificity) 𝑇𝑟𝑢𝑒 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒
𝑇𝑟𝑢𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒
Recall 𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
𝑇𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒 + 𝐹𝑎𝑙𝑠𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒
F1 Score
1
2∗( )
1 1
( + )
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 𝑟𝑒𝑐𝑎𝑙𝑙
70
These metric alone cannot be considered as sufficient condition in accepting a model
[1]. Hence model explainability becomes an integral part of a DL pipeline.
Explainable AI (XAI) is a research initiative initiated by DARPA in the year 2017
to develop algorithms that unravel critical short comings of AI [11]. Real world problem
75 can be solved by learning complex features using DL making the models black box in
nature and difficult to explain them in a human interpretable manner. DARPA’s XAI
vision was to create a suite of ML/DL techniques that produce human interpretable
explanations while maintaining high level of learning. The pursuit of XAI has drawn
significant attention to a new problem domain called “Interpretability problem” [12].
80 XAI helps to unravel the contents of the hidden layers and arrive at an explanation that
are easy to understand[13]. A list of available XAI techniques are documented in [14].
This paper demonstrates the importance of model explainability by considering a
face mask detection use case.
This paper is split into three parts the first part implements a face mask detection
85 convolution neural network (CNN) which is described in 3.1. The second part evaluates
the state of art models using XAI to verify if the model is picking the right features to
arrive at the conclusion as described in section 3.2 and 3.3. The third part of this paper
is to evaluate the best model in terms of fairness where the model is compared against
the features that will be considered by a human to arrive at the classification as
90 described in section 4.0.
3. Face mask classifier – An image classification use case
World health organization has recommended people to wear a face mask and
maintain social distance while in public places. Outbreaks have been reported in
restaurants, malls, marriage halls, office places, etc., where people tend to gather in
95 large numbers forcing the government to enhance compliance enforcement.
Implementing an efficient real time face mask detection using DL techniques can help
in enforcing compliance in public places. Recently there has been an increased interest
in this area and several papers have been published related to face mask detection, social
distance compliance, identification, and sanitization of high touch points etc. A few
100 papers and associated datasets in this regard are summarized in Table 2
Table 2 Recently published papers for compliance management

Application Method used Accuracies Reference Dataset
(%) used
MTCNN 89.06% [15] YouTube
videos
RCNN 90.6% [16] [17]
RCNN 86.41 [18] [19]
ResNet50 96.0% [20] CCTV
footage
Face mask
YoloV3 90.0% [21] [19]
detection
MobileNetV2 99.0% [22] [23]
MobileNetV2 96.85 [24] [25]
ResNet50 98.2 [26] [19]
VGG19 96.82 [27] Not
specified
Inception V3 99.9 [28] [29]
Social distance Fast RCNN 87.3% [30] [31]
management
Although all the papers have high accuracy, none of the aforementioned work used
XAI techniques to verify if the model was indeed looking at the right features to arrive
105 at the classification. In this paper, we verify the model accuracies of state of art DL
models and use XAI techniques to verify if the model is looking at the right features for
arriving at the classification.
3.1 Problem Formulation
The input image used for classification are pre-processed such that the faces are
110 extracted. Each face is passed through the inference engine to classify the face into three
classes {Class A, Class B and Class C} where Class A refers to incorrect mask, Class
B refers to correct mask and Class C refers to missing mask.
Let 𝐷 represent a tuple {𝑋, 𝑌} where 𝑋 is an array of input images consisting of faces
used for training each resized to 224 x 224 pixels and 𝑌 contains the one of the three
115 target classes for the training dataset. Let 𝐹𝑖 be a feature vector consisting of features
that are considered significant in arriving at a classification. Let 𝐴𝑖 be the accuracy
obtained by for each class during training then the objective function is to maximize
accuracy 𝐴𝑖 for the input 𝐷 such that maximum pixels of the significant features in the
feature vector 𝐹𝑖 are used to determine the final classification.
120 3.2 State of art models
CNN is one of the famous models for supervised learning. XAI techniques can be
applied on CNN’s to inspect the reason for the obtained classification. The following
CNN architectures were selected for comparison VGG16, VGG19, Inception V3,
MobileNetV2, Resnet-50 and Resnet-152. All the models were trained using transfer
125 learning using pretrained weights from ImageNet. Data set were taken from [32], [33]
, [34] and Kaggle dataset5. In all around 1000 images with an equal split for the three
classes were taken.
The training accuracies of the CNN architectures is summarized in Table 3
Table 3 Training accuracies

Model Name Training accuracy Validation accuracy
VGG16 97.83% 97.80%
VGG19 98.99% 98.90%
MobileNetV2 99.71% 96.73%
InceptionV3 99.56% 98.90%
Resnet50 96.82% 97.80%
Resnet152 97.98% 45.60%
130
The accuracies obtained during training per epoch is plotted for all the models in Figure
1. As seen in the figure, the model accuracies of all the state of art models are similar
except for Resnet-152. The training accuracies are high indicating that there is no bias
error however the validation accuracy is very poor indicating a high variance. A low
5 Kaggle dataset obtained from https://www.kaggle.com/mloey1/medical-face-mask-detection-

dataset
135 bias and high variance indicate an overfit. Since the remaining five models have a
higher accuracy, they were considered for further analysis.
Figure 1 Accuracies over epochs
Training Accuracies over Epoch

1.1
Training accuracy
1
0.9
0.8
0.7
0.6
0.5
0.4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Epochs
VGG16 VGG19 MobileNetv2

InceptionV3 Resnet50 Resnet152
140
Validation accuracies over Epoch

1.1
Validation accuracues
0.6
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Epochs
VGG16 VGG19 MobileNetv2

InceptionV3 Resnet50
Each model was subjected to unseen dataset having 300 images 100 each from class A
class B and Class C. The confusion matrix was plotted using PyCm [35] the results are
145 summarized in Table 6.
The key metrics captured are listed along with their respective formula/reference to the
paper describing the metric. PyCm provides a lot of metrices. The details can be found
in pycm documentation [35]. A few metrics that are relevant for multiclass
classification are taken for comparison.
150
Table 4 List of key metrices

Metric Description Formula used / Reference
Area under AUC corresponds to the arithmetic 𝑇𝑃𝑅 + 𝑇𝑁𝑅
𝐴𝑈𝐶 =
the curve mean of sensitivity and specificity 2
(AUC) values of each class.
Interpretation: Model with values
from 0.9 to 1.0 is considered as an
excellent model
Confusion CEN based upon the concept of 𝑗
𝑃𝑖,𝑗
𝑀𝑎𝑡𝑟𝑖𝑥(𝑖, 𝑗)
Entropy entropy for evaluating classifier = |𝑐|
∑𝑘=1 𝑚𝑎𝑡𝑟𝑖𝑥(𝑗, 𝑘) + 𝑀𝑎𝑡𝑟𝑖𝑥(𝑘, 𝑗)))
(CEN) performances. By exploiting the
misclassification information of 𝑖
𝑃𝑖,𝑗
confusion matrices, the measure 𝑀𝑎𝑡𝑟𝑖𝑥(𝑖, 𝑗)
=
evaluates the confusion level of the |𝑐|
∑𝑘=1 𝑚𝑎𝑡𝑟𝑖𝑥(𝑖, 𝑘) + 𝑀𝑎𝑡𝑟𝑖𝑥(𝑘, 𝑖)))
class distribution of misclassified
𝐶𝐸𝑁𝑗
samples. |𝐶|
Interpretation: Lower the value better 𝑗

= −( ∑ 𝑃𝑗,𝑘 𝑗
𝑙𝑜𝑔2(|𝐶|−1) (𝑃𝑗,𝑘 )
the model 𝑗
𝑘=1,𝑘≠𝑗
𝑗
+ 𝑃𝑗,𝑘 𝑙𝑜𝑔2(|𝐶|−1) (𝑃𝑘,𝑗 )
[36]
95% Confidence interval (CI) is a type of 𝐶𝐼 = 𝐴𝑐𝑐 ± 1.9 ∗ 𝑆𝐸𝐴𝑐𝑐
Confidence interval estimate (of a population 𝐶𝐼
interval (CI) parameter) that is computed from the = 𝐶𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒 𝑖𝑛𝑡𝑒𝑟𝑣𝑎𝑙
observed data. The confidence level is 𝐴𝑐𝑐 = 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦
the frequency (i.e., the proportion) of 𝑆𝐸𝐴𝑐𝑐 = 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑒𝑟𝑟𝑜𝑟
possible confidence intervals that 𝑜𝑓 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦
contain the true value of their
corresponding parameter.
Comparison of the metrics are captured in Table 6 and Table 6.
Table 5 Comparison of key Metrics from verification dataset

Metric Model Class A Class B Class C
Incorrect Mask With Mask Without
Mask
VGG16 97.25% 96.89% 98.29%
AUC
VGG19 97.00% 97.67% 99.057%
Metric Model Class A Class B Class C
Incorrect Mask With Mask Without
Mask
MobileNetV2 98.81% 98.062% 100%
InceptionV3 99.762% 99.308% 99.383%
Resnet50 95.083% 96.925% 96.678%
VGG16 0.1516 0.0881 0.0844
VGG19 0.1273 0.07193 0.07453
CEN
MobileNetV2 0.0745 0.05594 0.0
Range [0-1]
InceptionV3 0.02211 0.03105 0.02277
Resnet50 0.21364 0.1084 0.11336
VGG16 (94.069, 98.422)
VGG19 (94.953, 98.904)
95% CI MobileNetV2 (96.811, 99.776)
InceptionV3 (98.375, 100)
Resnet50 (92.779, 97.664)
155
Table 6 Comparison of accuracy and precision

VGG16 VGG19 MobileNetV2
P A B C P A B C P A B C
↓/A→ ↓/A→ ↓/A→
1 98.79 0.0 1.2 1 96.38 0.0 3.6 1 100 0.0 0.0
2 5.42 93.79 0. 7 2 3.8 95.34 0.77 2 3.87 96.12 0.0
3 2.4 0.0 97.53 3 0.0 0.0 100 3 0.0 0.0 100
Inception V3 Resnet 50
P → Precision
P A B C P A B C A→ Accuracy
↓/A→ ↓/A→ Class A Incorrect
1 100 0.0 0.0 1 93.97 4.8 1.2 Class B with Mask
2 0.775 99.22 0.0 2 3.1 96.89 0.0
Class C Without Mask
3 0.0 1.23 98.76 3 4.9 1.23 93.82
3.3 Analysis of results
From Table 3, the accuracies of all the five models are around the same. The model
with highest training accuracy is MobilenetV2 and InceptionV3. When subjected to
160 unseen dataset, the accuracies of were the highest for Class A for MobileNetV2 and
InceptionV3 and the least accuracies were observed in detecting Class B using VGG16
model. The AUC obtained for all the models were greater than 90% indicating the
model training is considered very good. The 95% confidence interval of MobilenetV2
and InceptionV3 are the highest indicating that these models would indeed look at the
165 right areas for arriving at the classification.
Although the models have a high accuracy, it is inconclusive if the models selected the
right features for arriving at the conclusion. In order to resolve this issue, we can use
an XAL technique like GradCAM [37] to verify the regions considered significant in
arriving at the classification.
170 4. Results from post-hoc explanations
GradCAM [37] is a post-hoc explanation method used to produce heatmaps for

pretrained neural networks. The image was masked based on the intensity of the
heatmap and parts of the image that were considered most significant was passed for
inference with the respective models. The inference obtained with the masked image
175 was same as the original inference indicating that the masked image alone was
sufficient to detect the same class. The results obtained from GradCAM are summarized
for each model in Table 7
Table 7 GradCAM images for each class

Class Image VGG16 VGG19 MobileNet Inception Resnet50
V2 V3
A
Incorre
ct mask
B
Correct
Mask
C
Mask
Missing
180 Each training image was analyzed using GradCAM and the masked image for each
model was passed for inferencing. Different models look at certain areas that are more
significant than the other. It was observed that a few models like VGG16, InceptionV3
and Resenet50 performed well in detecting Class A and Class B. VGG 19 and
MobileNetV2 considered a significant area of the forehead region for classifying Class
185 B. Similarly, all models looked at the complete face for classifying an image to Class
C. An analysis was done on each image to identify the regions that were considered
significant by dividing the face into the forehead region, nose region, left chin, and right
chin region.
190 5. Identifying region of significance from facial landmarks
A human face has prominent landmarks like the eyes, nose, mouth, and chin. These
landmarks can be used to segment the face into regions. Further we can use these
regions identify the number of pixels using GradCAM that were active in contributing
to the output. The models can be evaluated as a function of the pixels that were used
195 for arriving at the inference.
There are multiple models to detect facial landmarks as tabulated in Table 8
Table 8 Face landmark detection

Methodology Reference
Using DLibC and OpenCV [38]
Emotion level using feature level fusion of facial features and body [39]
gestures.
Emotion recognition from facial expressions for 3D videos using [40]
Siamese network
In this paper we used DLibC was used to identify the facial landmarks and the face
200 was classified into four regions viz, forehead, nose, left chin, and right chin as shown
in Figure 2
Fore head region
Right Chin Left Chin

region region
Nose region
Figure 2 Facial landmarks

Once we obtain the prediction, the facial landmarks are detected. The image is then
205 split into four regions as shown in Figure 2. The number of pixels that contributed to
the given prediction is calculated. If the number of pixels for the nose, left and right
chin are high and the number of pixels for the forehead region are low, then we can
conclude that the model has looked at the right areas to arrive at a conclusion.
5.1 Interpretation of results
210 Typically for a face mask detection problem, a human would consider the region
around the nose, left chin and right chin for detecting if a face mask is present and
correctly worn. On the other hand, to identify if the face mask is missing, the full face
can be considered as significant. Each image was subjected to GradCAM and the
number of pixels that were significant in the forehead region, nose region, left and right
215 chin were calculated respectively. The mean data of each region was calculated to see
the percentage significance of a pixel in a given region.
Table 9 captures the regions that were considered significant for each of the model.
The column gradient cutoff indicates the threshold considered for each model where
the masked image also gave the same classification as the original prediction. If the cut
220 off is set to 0, then the entire image is considered as significant. As the cut off increases
the percentage of image that is considered significant for analysis reduces. So higher
the cutoff, more specific regions are considered as significant. An example can be seen
in Table 8 where models like VGG16 and VGG19 used very specific regions of the
face for arriving at a given outcome.
225 Table 9 Comparison of the region significance

Model Gradient Class/ No Mean % Mean Mean Mean
cutoff of samples Forehead % Nose % Left %
Chin Right
Chin
Class A / 1.46 43.85 33.29 23.98
173
VGG16 Class B / 26.26 51.54 41.05 40.59
140
214
Class C / 3.18 39.34 40.76 35.63
140
Class A / 5.05 43.24 46.39 40.54
173
VGG19 Class B / 43.16 55.25 42.44 41.23
102
214
Class C / 11.46 34.29 64.50 60.72
140
Class A / 16.32 66.88 75.81 68.14
173
MobileNetV2 Class B / 44.05 58.57 63.14 56.21
140
214
Class C / 27.85 54.61 58.73 61.33
140
Class A / 27.31 70.17 65.34 69.11
173
Class B / 61.71 94.89 81.52 95.44
Inception V3 51
214
Class C / 16.34 44.17 38.38 41.18
140
Class A / 56.61 93.60 86.74 91.88
173
Resnet 50 Class B / 81.50 93.04 92.04 94.68
51
214
Class C / 26.85 95.11 92.40 96.10
140
It can be observed from Table 9 that VGG16 used only 1.46% and 3.18% pixels from
the forehead region to classify Class A (Incorrect Mask) and Class C (With Mask)
respectively. Similarly, the percentage of the nose and left chin and right chin
230 considered for the inference is higher than the other models. Hence from the data it is
evident that VGG16 model implemented using transfer learning is the best model
considering features around the nose, left chin and right chin for correctly classifying
Class A and Class B.
VGG16 model was subjected to a set of unseen datasets to observe the features that
235 were considered significant. The results are tabulated in Table 10
Table 10 Comparison with unseen dataset

Model Gradient Class/ No Mean % Mean Mean Mean
cutoff of samples Forehead % Nose % Left %
Chin Right
Chin
Class A / 1.34 45.25 36.29 22.49
59
VGG16 Class B / 26.73 79.77 38.39 39.23
140
99
Class C / 3.41 39.09 40.05 35.88
70
The results obtained using VGG16 were aligned to the training dataset.
6. Conclusion and future work
240 In this paper, we evaluated multiple standard models for face mask detector use case.
State of art models were compared and models with similar accuracies were selected.
Each model was subjected to post-hoc explanation using GradCAM. The results
showed the regions that the models considered significant in arriving at the decision.
Post-hoc explanations were further enhanced by identifying the portion of the face
245 that was considered the most significant and identifying if these regions were like the
decisions that will be taken by a human for performing the same classification.
It was found that although all the state of art models had accuracies greater than 90%,
they looked at different regions of the face to arrive at the decision. It was found that
VGG16 model implemented using transfer learning selected the right features to arrive
250 at the classification although its accuracies were less than MobileNetV2 and
InceptionV3. Hence it is important to consider model explainability as a factor before
selecting the best model to be used for a given use case.
In future we aim to develop an algorithm which can further improve the accuracies
and the choice of feature selection. The algorithm can be generalised to obtain an
255 explainability metric based on a given use case.
References
[1] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust

you?’ Explaining the predictions of any classifier,” 2016. doi:
10.1145/2939672.2939778.
260 [2] B. B. Gawde, “A Fast, Automatic Risk Detector for COVID-19,”

in 2020 IEEE Pune Section International Conference (PuneCon),
2020, pp. 146–151. doi: 10.1109/PuneCon50868.2020.9362389.
[3] S. Wang et al., “A deep learning algorithm using CT images to

screen for Corona Virus Disease (COVID-19),” medRxiv, 2020,
265 doi: 10.1101/2020.02.14.20023028.
[4] S. Anjomshoae, A. Najjar, D. Calvaresi, and K. Främling,
“Explainable Agents and Robots: Results from a Systematic
Literature Review,” in Proceedings of the 18th International
Conference on Autonomous Agents and MultiAgent Systems,
270 2019, pp. 1078–1088.
[5] K. Ahmed and N. Gouda, “AI Techniques and Mathematical

Modeling to Detect Coronavirus,” J. Inst. Eng. Ser. B, pp. 1–10,
Nov. 2020, doi: 10.1007/s40031-020-00514-0.
[6] S. Raghav et al., “Suraksha: Low Cost Device to Maintain Social

275 Distancing during CoVID-19,” in 2020 4th International
Conference on Electronics, Communication and Aerospace
Technology (ICECA), 2020, pp. 1476–1480. doi:
10.1109/ICECA49313.2020.9297503.
[7] K. Bhambani, T. Jain, and K. A. Sultanpure, “Real-time Face

280 Mask and Social Distancing Violation Detection System using
YOLO,” in 2020 IEEE Bangalore Humanitarian Technology
Conference (B-HTC), 2020, pp. 1–6. doi: 10.1109/B-
HTC50970.2020.9297902.
[8] S. Srinivasan, R. Rujula Singh, R. R. Biradar, and S. A. Revathi,

285 “COVID-19 Monitoring System using Social Distancing and Face
Mask Detection on Surveillance video datasets,” in 2021
International Conference on Emerging Smart Computing and
Informatics (ESCI), 2021, pp. 449–455. doi:
10.1109/ESCI50559.2021.9396783.
290 [9] N. Arora, A. K. Banerjee, and M. L. Narasu, “The role of artificial

intelligence in tackling COVID-19,” Future Virol., pp.
10.2217/fvl-2020–0130, Oct. 2020, doi: 10.2217/fvl-2020-0130.
[10] S. K., S. V., G. E.A., and S. K.P., “Explainable artificial

intelligence for heart rate variability in ECG signal,” Healthc.
295 Technol. Lett., vol. 7, no. 6, pp. 146–154, 2020, doi:
https://doi.org/10.1049/htl.2020.0033.
[11] D. Gunning and D. Aha, “DARPA’s Explainable Artificial

Intelligence (XAI) Program,” AI Mag., vol. 40, no. 2, pp. 44–58,
2019, doi: 10.1609/aimag.v40i2.2850.
300 [12] S. Palacio, A. Lucieri, M. Munir, J. Hees, S. Ahmed, and A.

Dengel, “XAI Handbook: Towards a Unified Framework for
Explainable AI.” 2021.
[13] Paul Voosen, “How AI detectives are cracking open the black box
of deep learning,” SCI Mag, 2017.
305 https://www.sciencemag.org/news/2017/07/how-ai-detectives-
are-cracking-open-black-box-deep-learning (accessed Dec. 20,
2020).
[14] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis,

“Explainable AI: A Review of Machine Learning Interpretability
310 Methods,” Entropy, vol. 23, no. 1, Dec. 2021, doi:
10.3390/e23010018.
[15] A. S. Joshi, S. S. Joshi, G. Kanahasabai, R. Kapil, and S. Gupta,

“Deep Learning Framework to Detect Face Masks from Video
Footage,” in 2020 12th International Conference on
315 Computational Intelligence and Communication Networks
(CICN), 2020, pp. 435–440. doi:
10.1109/CICN49253.2020.9242625.
[16] O. Cakiroglu, C. Ozer, and B. Gunsel, “Design of a Deep Face

Detector by Mask R-CNN,” in 2019 27th Signal Processing and
320 Communications Applications Conference (SIU), 2019, pp. 1–4.
doi: 10.1109/SIU.2019.8806447.
[17] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A Face

Detection Benchmark,” 2016.
[18] J. Zhang, F. Han, Y. Chun, and W. Chen, “A Novel Detection

325 Framework About Conditions of Wearing Face Mask for Helping
Control the Spread of COVID-19,” IEEE Access, vol. 9, pp.
42975–42984, 2021, doi: 10.1109/ACCESS.2021.3066538.
[19] Revanth, “Masked face Dataset (MAFA),” 2020.

https://www.kaggle.com/revanthrex/mafadataset?select=1q0Uw
330 RZsNGuPtoUMFrP-U06DSYp_2E1wB
[20] G. T. S. Draughon, P. Sun, and J. P. Lynch, “Implementation of a

Computer Vision Framework for Tracking and Visualizing Face
Mask Usage in Urban Environments,” in 2020 IEEE International
Smart Cities Conference (ISC2), 2020, pp. 1–8. doi:
335 10.1109/ISC251055.2020.9239012.
[21] T. Q. Vinh and N. T. N. Anh, “Real-Time Face Mask Detector

Using YOLOv3 Algorithm and Haar Cascade Classifier,” in 2020
International Conference on Advanced Computing and
Applications (ACOMP), 2020, pp. 146–149. doi:
340 10.1109/ACOMP50827.2020.00029.
[22] K. J. K. R. Kavitha, S. Vijayalakshmi, A. Annakkili, T.

Aravindhan, “Face Mask Detector Using Convolutional Neural
Network,” Ann. Rom. Cell Biol., vol. 25, no. 5, pp. 8–15, 2021.
[23] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa, “A

345 hybrid deep transfer learning model with machine learning
methods for face mask detection in the era of the COVID-19
pandemic,” Measurement, vol. 167, p. 108288, 2021, doi:
https://doi.org/10.1016/j.measurement.2020.108288.
[24] S. A. Sanjaya and S. Adi Rakhmawan, “Face Mask Detection

350 Using MobileNetV2 in The Era of COVID-19 Pandemic,” in 2020
International Conference on Data Analytics for Business and
Industry: Way Towards a Sustainable Economy (ICDABI), 2020,
pp. 1–5. doi: 10.1109/ICDABI51230.2020.9325631.
[25] Baojin Huang, “Real-World Masked Face dataset (RFMD),”

355 2020. https://github.com/X-zhangyang/Real-World-Masked-
Face-Dataset#readme
[26] S. Sethi, M. Kathuria, and T. Kaushik, “Face mask detection using

deep learning: An approach to reduce risk of Coronavirus
spread.,” J. Biomed. Inform., vol. 120, p. 103848, Aug. 2021, doi:
360 10.1016/j.jbi.2021.103848.
[27] J. Xiao, J. Wang, S. Cao, and B. Li, “Application of a Novel and

Improved {VGG}-19 Network in the Detection of Workers
Wearing Masks,” vol. 1518, p. 12041, Apr. 2020, doi:
10.1088/1742-6596/1518/1/012041.
365 [28] G. J. Chowdary, N. S. Punn, S. K. Sonbhadra, and S. Agarwal,

“Face Mask Detection using Transfer Learning of InceptionV3.”
2020.
[29] Prajna Bhandary, “Simulated Masked Face Dataset (SMFD).”

https://github.com/prajnasb/observations
370 [30] S. K, B. S, and P. M. B, “Social Distance Identification Using

Optimized Faster Region-Based Convolutional Neural Network,”
in 2021 5th International Conference on Computing
Methodologies and Communication (ICCMC), 2021, pp. 753–
760. doi: 10.1109/ICCMC51019.2021.9418478.
375 [31] T.-Y. Lin et al., “Microsoft {COCO:} Common Objects in

Context,” CoRR, vol. abs/1405.0, 2014.
[32] P. Bhandary, “Dataset - Facemask detection,” Datasets, 2020.

https://github.com/prajnasb/observations/tree/master%0A/mask_
classifier/Data_Generator (accessed May 23, 2021).
380 [33] A. Cabani, K. Hammoudi, H. Benhabiles, and M. Melkemi,

“MaskedFace-Net – A dataset of correctly/incorrectly masked
face images in the context of COVID-19,” Smart Heal., vol. 19,
p. 100144, Mar. 2021, doi: 10.1016/j.smhl.2020.100144.
[34] T. Karras, S. Laine, and T. Aila, “A Style-Based Generator

385 Architecture for Generative Adversarial Networks.” 2019.
[35] S. Haghighi, M. Jasemi, S. Hessabi, and A. Zolanvari, “{PyCM}:
Multiclass confusion matrix library in Python,” J. Open Source
Softw., vol. 3, no. 25, p. 729, May 2018, doi: 10.21105/joss.00729.
[36] J.-M. Wei, X.-J. Yuan, Q.-H. Hu, and S.-Q. Wang, “A novel
390 measure for evaluating classifiers,” Expert Syst. Appl., vol. 37, no.
5, pp. 3799–3809, 2010, doi:
https://doi.org/10.1016/j.eswa.2009.11.040.
[37] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh,

and D. Batra, “Grad-CAM: Visual Explanations from Deep
395 Networks via Gradient-Based Localization,” Int. J. Comput. Vis.,
vol. 128, no. 2, pp. 336–359, Oct. 2019, doi: 10.1007/s11263-019-
01228-7.
[38] D. Zhang, J. Li, and Z. Shan, “Implementation of Dlib Deep

Learning Face Recognition Technology,” in 2020 International
400 Conference on Robots Intelligent System (ICRIS), 2020, pp. 88–
91. doi: 10.1109/ICRIS52159.2020.00030.
[39] T. Keshari and S. Palaniswamy, “Emotion Recognition Using

Feature-level Fusion of Facial Expressions and Body Gestures,”
in 2019 International Conference on Communication and
405 Electronics Systems (ICCES), 2019, pp. 1184–1189. doi:
10.1109/ICCES45898.2019.9002175.
[40] D. Lawrance and S. Palaniswamy, “Emotion recognition from
facial expressions for 3D videos using siamese network,” in 2021
International Conference on Communication, Control and
410 Information Sciences (ICCISc), 2021, vol. 1, pp. 1–6. doi:
10.1109/ICCISc52257.2021.9484949.

Evaluating Models Based On Explainable Ai: Keywords: Gradcam, Explainability, Xai, Artificial Intelligence

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Evaluating Models Based On Explainable Ai: Keywords: Gradcam, Explainability, Xai, Artificial Intelligence

Uploaded by

Copyright:

Available Formats

Evaluating models based on Explainable AI

Srikanth K.S1*, T.K Ramesh2*, Suja Palaniswamy3, Ranganathan, Srinivasan4

Software is increasingly used in autonomous systems such as automobiles, aircrafts,

2. Related work – Motivation from COVID 19 studies

60 The outbreak of COVID19 pandemic has provided significant opportunities to the

Table 1 Accuracy metrics for DL model

3. Face mask classifier – An image classification use case

Table 2 Recently published papers for compliance management

3.1 Problem Formulation

120 3.2 State of art models

Table 3 Training accuracies

5 Kaggle dataset obtained from https://www.kaggle.com/mloey1/medical-face-mask-detection-

Figure 1 Accuracies over epochs

Training Accuracies over Epoch

VGG16 VGG19 MobileNetv2

Validation accuracies over Epoch

VGG16 VGG19 MobileNetv2

Table 4 List of key metrices

Interpretation: Lower the value better 𝑗

Comparison of the metrics are captured in Table 6 and Table 6.

Table 5 Comparison of key Metrics from verification dataset

Table 6 Comparison of accuracy and precision

170 4. Results from post-hoc explanations

GradCAM [37] is a post-hoc explanation method used to produce heatmaps for

Table 7 GradCAM images for each class

190 5. Identifying region of significance from facial landmarks

Table 8 Face landmark detection

Fore head region

Right Chin Left Chin

Figure 2 Facial landmarks

5.1 Interpretation of results

225 Table 9 Comparison of the region significance

Table 10 Comparison with unseen dataset

6. Conclusion and future work

[1] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust

260 [2] B. B. Gawde, “A Fast, Automatic Risk Detector for COVID-19,”

[3] S. Wang et al., “A deep learning algorithm using CT images to

[5] K. Ahmed and N. Gouda, “AI Techniques and Mathematical

[6] S. Raghav et al., “Suraksha: Low Cost Device to Maintain Social

[7] K. Bhambani, T. Jain, and K. A. Sultanpure, “Real-time Face

[8] S. Srinivasan, R. Rujula Singh, R. R. Biradar, and S. A. Revathi,

290 [9] N. Arora, A. K. Banerjee, and M. L. Narasu, “The role of artificial

[10] S. K., S. V., G. E.A., and S. K.P., “Explainable artificial

[11] D. Gunning and D. Aha, “DARPA’s Explainable Artificial

300 [12] S. Palacio, A. Lucieri, M. Munir, J. Hees, S. Ahmed, and A.

[14] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis,

[15] A. S. Joshi, S. S. Joshi, G. Kanahasabai, R. Kapil, and S. Gupta,

[16] O. Cakiroglu, C. Ozer, and B. Gunsel, “Design of a Deep Face

[17] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A Face

[18] J. Zhang, F. Han, Y. Chun, and W. Chen, “A Novel Detection

[19] Revanth, “Masked face Dataset (MAFA),” 2020.

[20] G. T. S. Draughon, P. Sun, and J. P. Lynch, “Implementation of a

[21] T. Q. Vinh and N. T. N. Anh, “Real-Time Face Mask Detector

[22] K. J. K. R. Kavitha, S. Vijayalakshmi, A. Annakkili, T.

[23] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa, “A

[24] S. A. Sanjaya and S. Adi Rakhmawan, “Face Mask Detection

[25] Baojin Huang, “Real-World Masked Face dataset (RFMD),”

[26] S. Sethi, M. Kathuria, and T. Kaushik, “Face mask detection using

[27] J. Xiao, J. Wang, S. Cao, and B. Li, “Application of a Novel and

365 [28] G. J. Chowdary, N. S. Punn, S. K. Sonbhadra, and S. Agarwal,

[29] Prajna Bhandary, “Simulated Masked Face Dataset (SMFD).”

370 [30] S. K, B. S, and P. M. B, “Social Distance Identification Using

Srikanth K.S1, T.K Ramesh2, Suja Palaniswamy3, Ranganathan, Srinivasan4