Evaluating Models Based on Explainable AI
Abstract
The use of deep learning and machine learning algorithms, together with advances in
processor technology, has triggered accelerated growth of complex autonomous
systems that incorporate artificial intelligence (AI) models in diverse applications, e.g.,
medical diagnostics, sentiment analysis, recommender systems, and autonomous
navigation. A model can be trusted if its outcome resembles the solution a human user
would reach; hence, model accuracy alone is not a sufficient metric in AI. Model
explainability can be evaluated for a given use case.
In this paper, Explainable AI (XAI) is used as a technique to compare model
accuracy against the explanations the model gives for a given classification.
A face mask detection use case is taken, and state-of-the-art Convolutional Neural
Networks (CNNs) with similar accuracies are compared with respect to their
explanations. The proposed approach demonstrates why model accuracy alone is not a
sufficient metric for selecting a model for deployment.
Keywords: GradCAM, Explainability, XAI, Artificial Intelligence.
2010 MSC: 00-01, 99-00
1 Department of Electronics and Communication, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India. Email address: ks_srikanth@blr.amrita.edu
2 Department of Electronics and Communication, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India. Email address: tk_ramesh@blr.amrita.edu (* Corresponding author)
3 Department of Computer Science and Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India. Email address: p_suja@blr.amrita.edu
4 Adjunct faculty, Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai, India. Email address: ranga@iitm.ac.in
1. Introduction
This paper is structured as follows: Section 2 provides the related work. Section 3
describes a face mask detection use case and verifies the base model accuracies.
Section 4 applies XAI techniques to the base models to identify the features the models
considered significant in deducing their outcomes. Section 5 describes a
methodology to calculate an AI safety score as a function of the features a human
would consider for the given use case. Section 6 provides a conclusion and
recommended future work.
The World Health Organization has recommended that people wear a face mask and
maintain social distance in public places. Outbreaks have been reported in
restaurants, malls, marriage halls, offices, etc., where people tend to gather in
large numbers, forcing governments to enhance compliance enforcement.
Implementing efficient real-time face mask detection using DL techniques can help
enforce compliance in public places. Recently, there has been increased interest
in this area, and several papers have been published on face mask detection, social
distance compliance, identification, and sanitization of high-touch points. A few
papers and their associated datasets are summarized in Table 2.
Although all the papers report high accuracy, none of the aforementioned works used
XAI techniques to verify whether the model was indeed looking at the right features to
arrive at the classification. In this paper, we verify the accuracies of state-of-the-art DL
models and use XAI techniques to check whether each model looks at the right features
when arriving at the classification.
The input images used for classification are pre-processed so that the faces are
extracted. Each face is passed through the inference engine and classified into one of
three classes {Class A, Class B, Class C}, where Class A refers to an incorrectly worn
mask, Class B to a correctly worn mask, and Class C to a missing mask.
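As a minimal illustration of this three-class mapping (the class names follow the paper; the function and its names are ours, not the authors' code), the classifier head's softmax output can be turned into a label as follows:

```python
# Hypothetical sketch: map a model's softmax scores for one extracted face
# to the three classes defined above. Names are illustrative.
CLASSES = {
    0: "Class A (incorrect mask)",
    1: "Class B (correct mask)",
    2: "Class C (missing mask)",
}

def label_from_scores(scores):
    """Return the label of the highest-scoring class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASSES[best]
```

For example, `label_from_scores([0.1, 0.7, 0.2])` yields the Class B label.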
Let 𝐷 represent a tuple {𝑋, 𝑌}, where 𝑋 is an array of input face images used for
training, each resized to 224 x 224 pixels, and 𝑌 contains one of the three target classes
for each training sample. Let 𝐹𝑖 be a feature vector consisting of features that are
considered significant in arriving at a classification, and let 𝐴𝑖 be the accuracy obtained
for each class during training. The objective is then to maximize the accuracy 𝐴𝑖 for the
input 𝐷 such that the maximum number of pixels of the significant features in the
feature vector 𝐹𝑖 is used to determine the final classification.
CNNs are among the most popular models for supervised learning, and XAI techniques
can be applied to them to inspect the reasons for an obtained classification. The
following CNN architectures were selected for comparison: VGG16, VGG19,
InceptionV3, MobileNetV2, ResNet-50, and ResNet-152. All models were trained via
transfer learning using pretrained weights from ImageNet. The datasets were taken from
[32], [33], [34] and a Kaggle dataset5. In all, around 1000 images with an equal split
across the three classes were used.
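A transfer-learning setup of this kind can be sketched as below, assuming TensorFlow/Keras; the frozen backbone, pooling head, and optimizer are illustrative assumptions, not the paper's exact configuration:

```python
import tensorflow as tf

def build_model(num_classes=3):
    """Transfer learning: pretrained ImageNet backbone + a small trainable head."""
    base = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pretrained convolutional features
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same pattern applies to VGG16, VGG19, InceptionV3, and the ResNets by swapping the `tf.keras.applications` backbone.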
The training accuracies of the CNN architectures are summarized in Table 3.
[Figure: training accuracy (0.4–1.0) over 20 epochs for the compared models.]
Each model was subjected to an unseen dataset of 300 images, 100 each from Class A,
Class B, and Class C. The confusion matrices were plotted using PyCm [35], and the
results are summarized in Table 6.
The key metrics captured are listed along with their respective formulas or references
to the paper describing the metric. PyCm provides many metrics; the details can be
found in the PyCm documentation [35]. A few metrics relevant to multiclass
classification were taken for comparison.
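The per-class figures reported in Table 6 derive from the raw confusion table; a minimal hand-rolled equivalent of what PyCm computes automatically (the function names here are ours) is:

```python
from collections import Counter

def confusion_counts(actual, predicted, labels):
    """Count (actual, predicted) pairs -- the raw confusion table."""
    pairs = Counter(zip(actual, predicted))
    return {(a, p): pairs.get((a, p), 0) for a in labels for p in labels}

def per_class_recall(actual, predicted, labels):
    """Fraction of each class's samples that were predicted correctly."""
    table = confusion_counts(actual, predicted, labels)
    return {a: table[(a, a)] / max(sum(table[(a, p)] for p in labels), 1)
            for a in labels}
```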
Confusion Entropy (CEN) [36]: measures how the misclassifications of the model for
class 𝑗 are distributed over the other classes:
CEN_𝑗 = −∑_{𝑘=1, 𝑘≠𝑗}^{|𝐶|} ( 𝑃_{𝑗,𝑘} log_{2(|𝐶|−1)}(𝑃_{𝑗,𝑘}) + 𝑃_{𝑘,𝑗} log_{2(|𝐶|−1)}(𝑃_{𝑘,𝑗}) )
where 𝑃_{𝑗,𝑘} is the probability of misclassifying a class-𝑗 sample as class 𝑘.
95% Confidence interval (CI): a type of interval estimate (of a population parameter)
computed from the observed data. The confidence level is the frequency (i.e., the
proportion) of possible confidence intervals that contain the true value of their
corresponding parameter:
𝐶𝐼 = 𝐴𝑐𝑐 ± 1.96 × 𝑆𝐸_𝐴𝑐𝑐
where 𝐴𝑐𝑐 is the accuracy and 𝑆𝐸_𝐴𝑐𝑐 is the standard error of the accuracy.
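Under the normal approximation, this interval can be computed directly (a sketch; `n` is the evaluation-set size):

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """95% CI for accuracy: Acc +/- z * SE, with SE = sqrt(Acc * (1 - Acc) / n)."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return (acc - z * se, acc + z * se)
```

For example, an accuracy of 0.90 on the 300 unseen images gives an interval of roughly (0.866, 0.934).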
Table 6 (excerpt): confusion matrices (%) for InceptionV3 and ResNet-50
(P: predicted class, A: actual class; Class A = incorrect mask, Class B = with mask, Class C = without mask)

InceptionV3:
P↓ / A→      A        B        C
A          100.00     0.00     0.00
B            0.775   99.22     0.00
C            0.00     1.23    98.76

ResNet-50:
P↓ / A→      A        B        C
A           93.97     4.80     1.20
B            3.10    96.89     0.00
C            4.90     1.23    93.82
3.3 Analysis of results
From Table 3, the accuracies of all the models are around the same. The models with
the highest training accuracy are MobileNetV2 and InceptionV3. When subjected to the
unseen dataset, the accuracies were highest for Class A with MobileNetV2 and
InceptionV3, and the lowest accuracy was observed in detecting Class B with the
VGG16 model. The AUC obtained for all the models was greater than 90%, indicating
that the model training can be considered very good. The 95% confidence intervals of
MobileNetV2 and InceptionV3 are the highest, suggesting that their accuracy estimates
are the most reliable.
Although the models have high accuracy, it is inconclusive whether they selected the
right features for arriving at their conclusions. To resolve this issue, we can use an
XAI technique such as GradCAM [37] to verify the regions considered significant in
arriving at the classification.
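Conceptually, GradCAM weights each feature map of the last convolutional layer by the global average of its gradients and sums them; a framework-agnostic NumPy sketch (assuming the activations and gradients have already been extracted from the model) is:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from last-conv activations (H, W, C) and the
    gradients (H, W, C) of the class score w.r.t. those activations."""
    weights = gradients.mean(axis=(0, 1))                 # alpha_k per channel
    cam = np.maximum((activations * weights).sum(-1), 0)  # ReLU(sum_k alpha_k A^k)
    if cam.max() > 0:
        cam = cam / cam.max()                             # normalize to [0, 1]
    return cam
```

In practice, the activations and gradients come from a framework backward pass, as described in the original GradCAM paper [37].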
Each training image was analyzed using GradCAM, and the masked image for each
model was passed for inferencing. Different models treat certain areas as more
significant than others. It was observed that models such as VGG16, InceptionV3, and
ResNet-50 performed well in detecting Class A and Class B. VGG19 and MobileNetV2
considered a significant area of the forehead region when classifying Class B.
Similarly, all models looked at the complete face when classifying an image as Class
C. An analysis was done on each image to identify the significant regions by dividing
the face into the forehead, nose, left chin, and right chin regions.
A human face has prominent landmarks such as the eyes, nose, mouth, and chin. These
landmarks can be used to segment the face into regions. We can then use these regions
to identify, via GradCAM, the number of pixels that were active in contributing to the
output. The models can thus be evaluated as a function of the pixels used in arriving at
the inference.
There are multiple models for detecting facial landmarks, as tabulated in Table 8.
In this paper, Dlib was used to identify the facial landmarks, and the face was divided
into four regions, viz., forehead, nose, left chin, and right chin, as shown in Figure 2.
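Given the landmark coordinates, the four regions can be carved out of the face bounding box; the sketch below takes three landmark-derived coordinates as inputs (the function and its argument names are illustrative, not Dlib's API):

```python
def face_regions(bbox, eye_line_y, nose_bottom_y, nose_center_x):
    """Split a face bounding box (x0, y0, x1, y1) into forehead, nose,
    left-chin and right-chin rectangles using landmark-derived lines."""
    x0, y0, x1, y1 = bbox
    return {
        "forehead":   (x0, y0, x1, eye_line_y),
        "nose":       (x0, eye_line_y, x1, nose_bottom_y),
        "left_chin":  (x0, nose_bottom_y, nose_center_x, y1),
        "right_chin": (nose_center_x, nose_bottom_y, x1, y1),
    }
```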
Typically, for a face mask detection problem, a human would consider the regions
around the nose, left chin, and right chin to determine whether a face mask is present
and correctly worn. On the other hand, to identify whether a face mask is missing, the
full face can be considered significant. Each image was subjected to GradCAM, and the
numbers of significant pixels in the forehead, nose, left chin, and right chin regions
were calculated. The mean for each region was then computed to obtain the percentage
significance of pixels in that region.
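The per-region percentage significance described above can be computed directly from the GradCAM heatmap; a sketch (the region format and names are ours):

```python
import numpy as np

def region_significance(heatmap, regions, cutoff=0.5):
    """Percentage of pixels in each region whose GradCAM activation meets
    the cutoff. heatmap: (H, W) in [0, 1]; regions: name -> (x0, y0, x1, y1)."""
    return {name: 100.0 * (heatmap[y0:y1, x0:x1] >= cutoff).mean()
            for name, (x0, y0, x1, y1) in regions.items()}
```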
Table 9 captures the regions considered significant for each model. The gradient cutoff
column indicates the threshold for each model at which the masked image still yields
the same classification as the original prediction. If the cutoff is set to 0, the entire
image is considered significant; as the cutoff increases, the percentage of the image
considered significant reduces. Hence, the higher the cutoff, the more specific the
regions considered significant. An example can be seen in Table 8, where models such
as VGG16 and VGG19 used very specific regions of the face to arrive at a given
outcome.
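This cutoff-selection rule — the highest threshold at which the masked image still receives the original classification — can be sketched as follows (the `predict` callable is an assumed stand-in for a model's inference, not the paper's code):

```python
import numpy as np

def max_faithful_cutoff(image, heatmap, predict, cutoffs):
    """Highest cutoff whose GradCAM-masked image keeps the original class.
    image: (H, W, 3); heatmap: (H, W) in [0, 1]; predict: image -> class."""
    original = predict(image)
    best = 0.0
    for c in sorted(cutoffs):
        masked = image * (heatmap >= c)[..., None]  # keep only significant pixels
        if predict(masked) == original:
            best = c
    return best
```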
It can be observed from Table 9 that VGG16 used only 1.46% and 3.18% of the pixels
in the forehead region to classify Class A (incorrect mask) and Class C (missing mask),
respectively. Similarly, the percentages of the nose, left chin, and right chin regions
considered for the inference are higher than for the other models. Hence, from the data
it is evident that the VGG16 model implemented using transfer learning is the best at
considering features around the nose, left chin, and right chin when correctly
classifying Class A and Class B.
The VGG16 model was then subjected to a set of unseen datasets to observe the features
considered significant. The results are tabulated in Table 10 and were aligned with
those obtained on the training dataset.
In this paper, we evaluated multiple standard models for a face mask detection use case.
State-of-the-art models were compared, and models with similar accuracies were
selected. Each model was subjected to post-hoc explanation using GradCAM; the
results showed the regions the models considered significant in arriving at their
decisions. The post-hoc explanations were further enhanced by identifying the portions
of the face considered most significant and checking whether these regions resembled
those a human would use for the same classification.
It was found that although all the state-of-the-art models had accuracies greater than
90%, they looked at different regions of the face to arrive at their decisions. The VGG16
model implemented using transfer learning selected the right features to arrive at the
classification, although its accuracies were lower than those of MobileNetV2 and
InceptionV3. Hence, it is important to consider model explainability before selecting
the best model for a given use case.
In future work, we aim to develop an algorithm that can further improve the accuracies
and the choice of feature selection. The algorithm can be generalised to obtain an
explainability metric based on a given use case.
References
[13] Paul Voosen, "How AI detectives are cracking open the black box of deep learning," Science, 2017. https://www.sciencemag.org/news/2017/07/how-ai-detectives-are-cracking-open-black-box-deep-learning (accessed Dec. 20, 2020).
[36] J.-M. Wei, X.-J. Yuan, Q.-H. Hu, and S.-Q. Wang, "A novel measure for evaluating classifiers," Expert Syst. Appl., vol. 37, no. 5, pp. 3799–3809, 2010, doi: 10.1016/j.eswa.2009.11.040.