Ground Truth Absent (Arxiv)

Evaluating Explanation Without Ground Truth in
Interpretable Machine Learning
Fan Yang, Mengnan Du, Xia Hu

Department of Computer Science and Engineering
Texas A&M University
{nacoyang, dumengnan, xiahu}@tamu.edu
arXiv:1907.06831v2 [cs.LG] 15 Aug 2019
ABSTRACT
Interpretable Machine Learning (IML) has become increasingly important in many real-world applications, such as autonomous cars and medical diagnosis, where explanations are significantly preferred to help people better understand how machine learning systems work and further enhance their trust towards systems. However, due to the diversified scenarios and subjective nature of explanations, we rarely have the ground truth for benchmark evaluation in IML on the quality of generated explanations. Having a sense of explanation quality not only matters for assessing system boundaries, but also helps to realize the true benefits to human users in practical settings. To benchmark the evaluation in IML, in this article, we rigorously define the problem of evaluating explanations, and systematically review the existing efforts from state-of-the-arts. Specifically, we summarize three general aspects of explanation (i.e., generalizability, fidelity and persuasibility) with formal definitions, and respectively review the representative methodologies for each of them under different tasks. Further, a unified evaluation framework is designed according to the hierarchical needs from developers and end-users, which could be easily adopted for different scenarios in practice. In the end, open problems are discussed, and several limitations of current evaluation techniques are raised for future explorations.

[Figure 1 diagram. Panel "Machine Learning Process": Training Data, Input, ML Model, Prediction, User; output "This is a husky (p = 0.93)". Panel "Interpretable Machine Learning Process": IML Model, Interface, User; output "This is a husky because of ...". Image ©University Of Toronto.]

Figure 1: Illustration of the IML techniques. We compare the two different pipelines between machine learning (ML) and IML. It is worth noting that IML model is capable of providing specific reasons for particular machine decisions, while ML model may simply provide the prediction results with probability scores. Here, we employ the image classification task as an example, where IML model could tell which part of the image contributes the animal to a husky while ML model may only tell the overall confidence towards a husky classification result.

1. INTRODUCTION
Serving as one of the most significant driving forces behind the boom of artificial intelligence, machine learning plays a vital role in many real-world systems, ranging from spam filters to humanoid robots. To handle tasks that are increasingly complicated in practice, more and more sophisticated machine learning systems, such as deep learning models [21], are designed for accurate decision making. Despite their superior performance, those complex systems are typically hard for human users to interpret, which largely limits their application in high-stakes scenarios like self-driving vehicles and medical treatment, where explanations are important and necessary for scrutable decisions [36]. To this end, the concept of interpretable machine learning (IML) has been formally raised [6], aiming to help humans better understand machine learning decisions. We illustrate the core idea of IML techniques in Figure 1.
IML is a new branch of machine learning techniques that has received mounting attention in recent years (shown by Figure 2), focusing on decision explanation beyond instance prediction. IML is typically employed to extract useful information, from either the system structure or the prediction result, as explanations to interpret relevant decisions. Although IML techniques have been comprehensively discussed in terms of methodology and application [7], insights from the perspective of IML evaluation are still rather limited, which significantly impedes the progress of IML towards a rigorous scientific field. To precisely reflect the boundaries of IML systems and measure the benefits that explanations bring to human users, effective evaluations are pivotal and indispensable. Different from conventional evaluation that relies purely on model performance, IML evaluation also needs to pay attention to the quality of the generated explanations, which makes it hard to handle and benchmark.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2008 ACM 0001-0782/08/0X00 ...$5.00.

[Figure 2 chart: "Trend of Research for IML". Series: number of publications for IML based on Google Scholar, with an exponential trendline. Data labels recoverable from the plot: 5, 8, 21, 43, 54, 102, 277, 755.]

Figure 2: Tendency of the IML research in recent years. In particular, we present the number of research publications related to IML from 2010 to 2018, and plot the trendline according to the statistics. The relevant numbers are collected from Google Scholar, with the key words "interpretable machine learning". We believe the actual numbers are even larger than those provided, since some other terms, such as "explainable", which are closely related to IML, are ignored during collection. From the results, we can see that IML related publications have been increasing exponentially, and much more attention has been paid to this field.

Evaluating explanation in IML is a challenging problem, since we need to balance the objective and subjective perspectives when designing experiments. On one hand, different users could have different preferences towards what a good explanation should be under different scenarios [6], thus it is not practical to benchmark IML evaluation with a common set of ground truth for objective evaluations. For example, when deploying self-driving cars with IML, system engineers may consider sophisticated explanations as good ones for safety concerns, while car drivers may prefer concise explanations because complex ones could be too time-consuming for decision making during driving. On the other hand, there might be more criteria beyond human subjective satisfaction. Human-preferred explanations may not always represent the full working mechanism of systems, which could lead to poor generalization performance. It has been shown that subjective satisfaction with explanations largely depends on the response time of human users, and has no clear relation with accuracy [19]. This finding directly supports the view that human satisfaction cannot be regarded as the sole standard when evaluating explanation. Besides, fully subjective evaluations would also raise ethical issues, because it is unethical to manipulate an explanation to better cater to human users [14]. Seeking human satisfaction excessively could cause explanations to persuade users, instead of actually interpreting systems.
Considering the aforementioned challenges, we aim to pave the way for benchmark evaluation in this article, regarding the explanations generated from IML techniques. First, we give an overview of the explanations in IML, and categorize them by a two-dimensional standard (i.e., interpretation scope and interpretation manner) with representative examples. Then, we summarize three general properties (i.e., generalizability, fidelity and persuasibility) for explanation with formal definitions, and rigorously define the problem of evaluating explanation in the IML context. Next, following those properties, we conduct a systematic review of existing work on explanation evaluation, with a focus on different techniques in various applications. Moreover, we also review some other special properties for explanation evaluation, which are typically considered under particular scenarios. Further, with the aid of those general properties, we design a unified evaluation framework aligned with the hierarchical needs of both model developers and end-users. At last, we raise several open problems for current evaluation techniques, and discuss some potential limitations for future exploration.

2. EXPLANATION AND EVALUATION
In this section, we first introduce the explanations we particularly focus on, and give an overview of the categories of explanations in IML. Then, three general properties of explanation are summarized for evaluation tasks according to different perspectives. Finally, we formally define the problem of evaluating explanations in IML with the aid of those general properties.

2.1 Explanation Overview
In the context of IML, explanations refer specifically to the information that can help human users interpret either the learning process or the prediction result of machine learning models. With different focuses, explanations in IML could have diversified forms with various characteristics, such as local heatmaps for instances and decision rules for models. In this article, we specifically categorize the explanations in IML with a two-dimensional standard, covering both interpretation scope and interpretation manner. As for the scope dimension, explanations can be classified into global and local ones, where a global explanation indicates the overall working mechanism of a model by interpreting its structures or parameters, and a local explanation reflects the particular model behaviour for an individual instance by analyzing specific decisions. Regarding the manner dimension, we can divide explanations into intrinsic and posthoc (also written as post-hoc or post hoc) ones. An intrinsic explanation is typically achieved by self-interpretable models that are transparent by design, while a posthoc one requires another independent interpretation model or technique to provide explanations for the target model. The two-dimensional taxonomy of explanations in IML is illustrated by Figure 3.
The first category is intrinsic-global explanation. This type of explanation can be well represented by some conventional machine learning techniques, such as rule-based systems and decision trees, which are self-interpretable and capable of showing the overall working patterns for prediction. Take the decision tree for example: the intuitive structure, as well as the set of all decision branches, constitutes the corresponding intrinsic-global explanation. The second category is intrinsic-local explanation, which is associated with specific input instances. A typical example is the attention mechanism applied on sequential models,
where generated attention weights can help interpret particular predictions by indicating the important components. Attention models are widely used in both image captioning and machine translation tasks. Posthoc-global explanation serves as the third category, and a representative example is mimic learning for deep models. In mimic learning, the teacher is usually a deep model, while the student is typically a shallow model that is easier to interpret. The overall process of mimic learning can be regarded as a distillation process from the teacher to the student, where the interpretable student model provides a global view, in a posthoc manner, of the deep teacher model. The posthoc-local explanation fills up the last part of the taxonomy. We introduce this category with the example of an instance heatmap, which is used to visualize the input regions with attribution scores (i.e., a quantified importance indicator). Instance heatmaps work well for both image and text, and are capable of showing the local behaviour of the target model. Since a heatmap depends on the particular input and does not involve the specific model design, it is a typical local explanation obtained in a posthoc way.

[Figure 3 diagram: a grid with columns "Interpretation Scope" (Global, Local) and rows "Interpretation Manner" (Intrinsic, Posthoc). Panels: (a) Decision Tree (root node, split condition, leaf node (decision)); (b) Attention Mechanism (input sequence, weight, output sequence); (c) Mimic Learning (deep model (teacher), shallow model (student)); (d) Instance Heatmap (light heated to heavy heated regions).]

Figure 3: A two-dimensional categorization for explanations in IML, covering interpretation scope and interpretation manner. According to the two-dimensional standard, we can divide explanations into four different groups: (a) intrinsic-global; (b) intrinsic-local; (c) posthoc-global; (d) posthoc-local. For each category, we attach a representative example for illustration. In particular, we employ decision tree as the example for intrinsic-global explanations, attention mechanism for intrinsic-local ones, mimic learning for posthoc-global ones, and instance heatmap for posthoc-local ones.

2.2 General Properties of Explanation
To formally define the problem of evaluating explanations in IML, it is important to make clear the general properties of explanation for evaluation. In this article, we summarize three significant properties from different perspectives, i.e., generalizability, fidelity and persuasibility, where each property corresponds to one specific aspect in evaluation. The intuitions of the properties are illustrated in Figure 4.

[Figure 4 diagram: "General Properties". Generalizability: "Does the knowledge in explanation generalize well?" (measure the generalization power of explanations). Fidelity: "Does explanation reflect the target system well?" (measure the faithfulness degree of explanations). Persuasibility: "Does human satisfy or comprehend explanation well?" (measure the usefulness degree of explanations).]

Figure 4: Three general properties for explanations in IML, including generalizability, fidelity and persuasibility. Each property essentially corresponds to one specific aspect in evaluation. Generalizability focuses on the generalization power of explanation. Fidelity focuses on the faithfulness degree of explanation. Persuasibility focuses on the usefulness degree of explanation.

The first general property is generalizability, which is used to reflect the generalization power of explanation. In real-world applications, human users employ explanations from IML techniques mainly to obtain insights from the target system, which naturally brings forward the demand on explanation generalization performance. If a set of explanations generalizes poorly, it can hardly be regarded as being of good quality, since the knowledge and guidance it provides would be rather limited in practice. One thing to clarify is that the explanation generalization mentioned here is not necessarily equal to the model predictive power, unless the model itself is interpretable with self-explanations (e.g., decision tree). By measuring the generalizability of explanation, users can have a sense of how accurate the generated explanations are for specific tasks.

Definition 1: We define the generalizability of explanation in IML as an indicator for generalization performance, regarding the knowledge and guidance delivered by the corresponding explanation.

The second general property is fidelity, which is used to indicate how faithful explanations are to the target system. Faithful explanations are always preferred by humans, because they can precisely capture the decision making process of the target system and show the correct evidence for particular predictions. Explanations of high quality need to be faithful, since they essentially serve as important tools for users to understand the target system. Without sufficient fidelity, explanations can only provide limited insights into the system, which degrades the usefulness of IML to human users. To guarantee the relevance of explanations, we need fidelity when conducting explanation evaluation in IML.

Definition 2: We define the fidelity of explanation in IML as the faithfulness degree with regard to the target system, aiming to measure the relevance of explanations in practical settings.

The third general property is persuasibility, which reflects
the degree to which humans comprehend and respond to the generated explanations. This property handles the subjective aspect of explanation, and is typically measured with human involvement. Good explanations are most likely to be easily comprehended, and to facilitate quick responses from human users. For different user groups or application scenarios, one specific set of explanations could possibly have different persuasibility due to diversified preferences. Thus, persuasibility of explanation should only be discussed under the same setting of users and tasks.

Definition 3: We define the persuasibility of explanation in IML as the usefulness degree to human users, serving as the measure of subjective satisfaction or comprehensibility for the corresponding explanation.

2.3 Explanation Evaluation Problem
With the definitions of the three general properties for explanation, we further introduce and formally define the problem of evaluating explanations in IML. Technically, IML evaluation can be divided into two parts: model evaluation and explanation evaluation, shown by Figure 5. As for the model evaluation part, the goal is to measure the predictive power of IML systems, which is identical to that of common machine learning systems and can be directly achieved with some conventional metrics (e.g., accuracy, precision, recall and F1-score). Explanation evaluation, however, is different from model evaluation in both objective and methodology. Since explanation typically contains more than one perspective and has no common ground truth over different scenarios, traditional model evaluation techniques cannot be perfectly applied. In this article, we specifically focus on the second part of IML evaluation, i.e., the explanation evaluation, and rigorously define the problem as follows.

Definition 4: The explanation evaluation problem within IML context is to assess the quality of the generated explanations from systems, where high-quality explanation corresponds to large values of generalizability, fidelity and persuasibility with relevant measurement. In general, good explanation ought to be well generalized, highly faithful and easily understood.

[Figure 5 diagram: "IML Evaluation" splits into "Model Evaluation" (to check the quality of prediction) and "Explanation Evaluation" (to check the quality of explanation). Model Evaluation: Predictability. Explanation Evaluation: Generalizability, Fidelity, Persuasibility. Entangled with both: Robustness, Capability, Certainty, ...]

Figure 5: Illustration of the IML evaluation. Basically, IML evaluation can be divided into model evaluation and explanation evaluation. For model evaluation, we focus on the predictability of the system, and evaluate the quality of prediction. For explanation evaluation, we focus on the generalizability, fidelity and persuasibility, and evaluate the quality of explanation. Besides, there are also some special properties that are entangled with both model and explanation. We list robustness, capability and certainty here for instance. In this paper, we specifically focus on the aspects which are related to explanation evaluation.

3. EVALUATION REVIEW
In this section, we conduct a systematic review of the explanation evaluation problem in IML, following the properties of explanation we summarize. For each property, we mainly review the primary methodologies of evaluation for practical tasks, and shed light on the philosophy about why they are reasonable. After the review of evaluations on generalizability, fidelity and persuasibility, we also focus on some other special aspects, which are typically entangled together with model evaluation, including the robustness, capability and certainty.

3.1 Evaluation on Generalizability
Existing work related to generalizability evaluation mainly focuses on IML systems with intrinsic-global explanations. Since intrinsic-global explanations are typically presented and organized in the form of prediction models, it is straightforward and convenient to evaluate generalizability by applying those explanations on test data to see the corresponding generalization performance. Under this scenario, the generalizability evaluation task is somewhat equivalent to model evaluation, where traditional model performance indicators (e.g., accuracy and F1-score) can be employed as the metrics to quantify the explanation generalizability. Conventional examples for this scenario include generalized linear models (with informative coefficients) [27], decision trees (with structured branches) [32], K-nearest neighbors (with significant instances) [2] and rule-based systems (with discriminative patterns) [3]. In general, the generalizability evaluation for intrinsic-global explanations can be easily converted to model evaluation tasks, in which generalizability is positively correlated to the model prediction performance. Take the recent work on decision sets [20] for example. The authors use the AUC score, a common metric for classification tasks in model evaluation, to indicate the generalizability of explanations in the form of a set of decision rules. Similarly in recent work [23], AUC scores are employed to reflect the explanation generalizability indirectly.
Besides, there is another branch of work focusing on the generalizability of posthoc-global explanations. The relevant evaluation method is similar to that for intrinsic-global ones, where model evaluation techniques could be employed to indicate the explanation generalizability. The major difference lies in the fact that the explanations we apply on test data are not directly associated with the target system, but are closely related to the interpretable proxies extracted or learned from the target. Those proxies typically serve to interpret the target system, which is either a black box or a sophisticated model. Representative examples for this scenario can be found in knowledge distillation [11] and mimic learning [5], where the common focus is to derive an interpretable proxy out of the
black-box neural model for providing explanations. For example, in [5], the authors employ Gradient Boosting Trees (GBT) as the interpretable proxy to explain the working mechanism of neural networks. The constructed GBT is capable of providing feature importance for explanation, and is assessed by model evaluation techniques with the AUC score to show the generalizability of the corresponding explanations. The generalizability of a posthoc-global explanation typically has a positive correlation with the model performance of the derived interpretable proxy.

3.2 Evaluation on Fidelity
Though important for explanation evaluation, fidelity may not be explicitly considered for intrinsic explanations. In fact, the intrinsic nature of these explanations is sufficient to guarantee that they follow the exact working mechanism of the target IML system, and the corresponding explanations can thus be treated as faithful ones with full fidelity. The interpretable decision set [20] is a good example here. The learned decision set is self-interpretable and explicitly presents the decision rules for the potential classification tasks. In this example, we can see that those explanation rules faithfully reflect the model prediction behaviour, and there does not exist any inconsistency between the IML system prediction and the relevant explanations. This kind of complete accordance between model and explanation is just what full fidelity indicates.
However, different from intrinsic ones, posthoc-global explanations in the form of interpretable proxies cannot be regarded as having full fidelity, since the derived proxies usually work in a different way compared with the target system. Although most proxies are derived to approximate the behaviour of the target system, they are still constructed as different models for the potential task. Existing work related to fidelity evaluation for interpretable proxies mainly uses the difference in prediction performance to indicate the fidelity degree. For instance, in [5], the authors conducted experiments with several sets of teacher-student models, where the teacher is the target model and the student is the proxy model. During the evaluation, the prediction differences between corresponding teachers and students are used to reflect the fidelity of the derived proxies, and the preferred faithful proxies are shown to have minor losses in performance.
Moreover, due to their posthoc manner and local nature, no posthoc-local explanation is fully faithful to the target IML system. Among existing work, common ways to measure fidelity for posthoc-local explanations are ablation analysis [28] and meaningful perturbations [10], where the core idea is to check the prediction variation after adversarial changes are made according to the generated explanations. The philosophy of this kind of methodology is simple, i.e., modifications to the input instances, which are in accordance with the generated explanations, can bring about significant differences to model predictions if the explanations are faithful to the target system. A typical example can be found in the image classification task with deep neural networks [33], where the fidelity of generated posthoc-local explanations is evaluated by measuring the prediction difference between the original image and the perturbed image. The overall logic here is to mask the attributing regions in images indicated by the explanations, and then check the extent of prediction variation. The larger the difference, the more faithful the generated explanations are. In addition to the image classification task, ablation- and perturbation-based fidelity evaluation methods have also been effectively used in text classification [8], recommender systems [38] and adversarial detection [24]. Furthermore, as for posthoc-local explanations in the form of training samples [18] and model components [26], ablation and perturbation operations are applied as well in evaluating the explanation fidelity.

3.3 Evaluation on Persuasibility
To effectively evaluate the persuasibility of generated explanations, human annotation is widely used, especially in uncontentious tasks such as object detection. Annotation-based evaluation is usually regarded as objective, since relevant annotations do not change among different groups of users. In computer vision related tasks, the most common annotations for persuasibility evaluation are bounding boxes [34] and semantic segmentation [25]. An appropriate example can be found in recent work [33], which utilizes bounding boxes to evaluate the persuasibility of explanations and employs the Intersection over Union (IoU) metric, also known as the Jaccard index, to quantify the persuasibility performance. As for annotations with semantic segmentation, recent work [40] employs the pixel-level difference as the metric to measure the corresponding persuasibility of explanations. Moreover, in natural language processing, a similar human annotation, named rationale [22], has been extensively used for evaluation, which is a subset of features highlighted by annotators and regarded as important for prediction. Through those different forms of annotations, the persuasibility of explanation can be objectively evaluated with human-grounded truth, which typically stays consistent across different groups of users for one particular task. Due to the one-to-one correspondence between the annotation and the instance, annotation-based evaluation is usually applied to local explanations instead of global ones.
Conducting persuasibility evaluation with human annotation does not work well in complicated tasks, since the related annotations may not stay consistent across different user groups. Under those circumstances, employing users for human studies is the common way to evaluate the persuasibility of explanation. To appropriately design relevant human studies, both machine learning experts and human-computer interaction (HCI) researchers actively explore this area [1, 14], and propose several metrics for human evaluation on general explanations from IML techniques, such as mental model [20], human-machine performance [9], user satisfaction [19] and user trust [15]. Take the most recent work [19] for instance. The authors focus on user satisfaction in evaluating the persuasibility, and specifically employ the human response time and decision accuracy as the auxiliary metrics. The whole study is conducted on two different domains with three types of explanation variation, aiming to conclude the relation between the explanation quality and human cognition. With the aid of human studies, persuasibility of explanation can be evaluated under a more complicated and practical setting, regarding specific user groups and application domains. By directly measuring how human users respond to explanations, we can realize the usefulness in real-world applications when determining the explanation quality. Since human studies can be designed flexibly according to diversified needs and goals, this methodology
is generally applicable to all kinds of explanations for persuasibility evaluation within IML context.

3.4 Evaluation on Other Properties
Besides generalizability, fidelity and persuasibility, existing work also considers some other properties when evaluating explanation in IML. We introduce those properties separately for the following two reasons. First, those properties are not representative and general for explanation evaluation among IML systems, and are simply considered under specific architectures or applications. Second, those properties are related to both the prediction model and the generated explanation, and thus typically need novel and special designs to evaluate. In this part, we particularly focus on the following three properties.
Robustness. Similar to machine learning models, the generated explanations from IML systems can also be fragile to adversarial perturbations, especially posthoc ones from neural architectures [12]. Explanation robustness is primarily designed to measure how similar the explanations are for similar instances. Recent work [33, 39] conducts robustness evaluation for explanation with metrics on sensitivity, beyond the evaluation on the three general properties we summarize. Robust explanations are always preferred in building a trustworthy IML system for human users. To obtain explanations with high robustness, a stable prediction model and a reliable explanation generation algorithm are usually the two most important keys.
Capability. Another property for explanation evaluation is named capability, which is used to indicate the extent to which corresponding explanations can be generated. Commonly, this property is evaluated on explanations generated from search based methodologies [37], instead of those obtained from gradient based [33] or perturbation based [31] methodologies. A typical example of capability evaluation can be found in work [38] with the application to recommender systems, where the authors employ the explainability precision and explainability recall as the metrics to indicate the capability strength. Similar to robustness, capability is also related to the target prediction model, which essentially determines the upper bound of the

4. DISCUSSION AND EXPLORATION
In this section, we first propose a unified framework to conduct general assessment on explanations in IML, according to the different level of needs for evaluation. Then, several open problems in explanation evaluation are raised and discussed regarding benchmarking issues. Further, we highlight some significant limitations of current evaluation techniques for future exploration.

4.1 Unified Framework for Evaluation
Despite the large amount of work we reviewed for explanation evaluation, different works typically have their own particular focus, depending on the specific tasks, architectures, or applications. As a result, it is hard to benchmark the evaluation process for explanations in IML in the way that has been developed for model evaluation. To pave the way to benchmark evaluation on explanation, we try to construct a unified framework here by considering those properties of explanations. To make the framework general, we simply take generalizability, fidelity and persuasibility into account, and do not consider those special ones under particular scenarios.

4.1.1 Different level of needs for evaluation
Although we conduct the review separately, regarding generalizability, fidelity and persuasibility, those three general properties are internally related to each other, where each of them represents a specific level of needs for evaluation. From the lower level to the higher level, we can sort the properties as: generalizability, fidelity, persuasibility. Generalizability typically serves as the basic need in evaluation, since it forms the foundation for the other properties. In real-world applications, good generalizability is the precondition for human users to make accurate decisions with the generated explanations, which guarantees that the explanations we employ are generalizable and reflect the true knowledge for particular tasks. After that, a further demand for human users is to check whether the derived explanations at hand are reliable or not. This demand brings the fidelity property to the front. By assessing the fidelity, better decisions can be made on whether to trust the IML sys-
ability to generate explanations. tem or not based on the explanation relevance. As for the
Certainty. To further evaluate explanations on whether higher demand on real effectiveness in practice, persuasi-
they reflect the uncertainty of the target IML system, exist- bility is further considered to indicate the tangible impacts,
ing work also focus on the certainty aspect of explanation. directly bridging the gap between human users and machine
Certainty is also a property related to both model and ex- explanations. For one specific task, the explanation evalua-
planation, since explanation can only provide uncertainty tion mainly depends on the corresponding applications and
interpretation as long as the corresponding IML system it- user groups, which determine the level of needs in evalua-
self has the certainty measure. Recent work [29] gives an ap- tion design. Generally, model developers would care more
propriate example for certainty evaluation. In this work, the on those basic properties of lower levels, including generaliz-
authors consider the IML systems under the active learning ability and fidelity, while general end-users would pay more
settings, and propose a novel measure, named uncertainty attention on the persuasibility in a higher level.
bias, to evaluate the certainty of generated explanations.
Specifically, the explanation certainty is measured according 4.1.2 Hierarchical structure of the framework
to the discrepancy in prediction confidence of the IML sys- The overall unified evaluation framework is designed hier-
tem between one category and the others. In similar ways, archically, according to the different level of needs, as illus-
work [35] focus on the certainty aspect of explanations as trated in Figure 6. In the bottom tier, the evaluation goal
well, and provide insights on how confident users could be focuses on the generalizability, where generated explanations
for particular outputs with the computed explanations in are tested for their generalization power. In the medium tier,
form of flip set (i.e., a list of actionable changes that users the goal is to evaluate the fidelity, with regard to the target
can make to flip the prediction of the target system). In IML system. The top tier aims to evaluate the persuasibil-
essence, certainty evaluation and persuasibility evaluation ity, targeting on specific applications and user groups. To
can be mutually supported from each other. have a unified evaluation in one particular task, each tier
should have a consistent pipeline with a fixed set of data, users and metrics correspondingly. The overall evaluation results can be further derived through an ensemble, such as a weighted sum, where each tier could be assigned an importance weight depending on the applications and user groups. This proposed hierarchical framework is generally applicable to most explanation evaluation problems, and could be extended with new components if necessary. With proper metrics, as well as a sensible manner of ensembling, the framework can effectively help human users measure the overall quality of explanations from IML techniques under certain circumstances.

[Figure 6 diagram: three stacked tiers ordered from lower need to higher need. Generalizability and Fidelity: focus on machine (target on model developers). Persuasibility: focus on human (target on general end-users).]

Figure 6: A unified hierarchical framework for explanation evaluation in IML. The whole framework consists of three different tiers, corresponding to generalizability, fidelity and persuasibility, from the lower level to the higher level. Basically, the bottom and medium tiers focus on the evaluation from the machine perspective, while the top tier concentrates on the evaluation from the human perspective. To this end, the bottom and medium tiers are usually designed for model developers, and the top tier is designed for general end-users.

4.2 Open Problems for Benchmark

To fully achieve the benchmark for explanation evaluation in real-world applications, there are still some open problems left to explore, which are listed and discussed as follows.

4.2.1 Generalizability for local explanations

Existing work on generalizability evaluation mainly focuses on global explanations, while limited effort has been devoted to local ones. The challenges in evaluating the generalizability of local explanations are twofold: (1) local explanations cannot be easily organized into valid prediction models, which makes model evaluation techniques hard to apply directly; (2) local explanations simply contain the partial knowledge learned by the target IML system, thus special designs are required to ensure the evaluation has a specific local focus. Though there are no direct solutions, some insights from existing efforts may be inspiring. As for the first challenge, an approximated local classifier [31] could potentially be built to carry the local explanations, and then the generalizability could be further assessed with model evaluation techniques by specifying test instances. Moreover, for the second challenge, we could possibly employ local explanations, together with human simulated/augmented data, to train a separate classifier [16] for generalizability evaluation, where the task is essentially reduced from the original one and only involves the local knowledge we test with.

4.2.2 Fidelity for posthoc explanations

Among existing work, it is well accepted that a good explanation should have high fidelity to the target IML system. However, in the posthoc manner, it might not be the case that faithful explanations are always the good ones that human users prefer. During explanation evaluation, we typically assume that IML systems are well trained and are capable of making reasonable decisions, but this assumption is hard to satisfy perfectly in practice. As a result, the generated posthoc explanations may not be of high quality due to inferior model performance, although they might be highly faithful to the target system. Thus, designing a novel methodology for posthoc fidelity evaluation, which could consider both model and explanation, is of great importance. In general, how to utilize the model performance to guide the measurement of posthoc explanation fidelity is the key problem in tackling this challenge, where the ultimate goal is to help human users better select explanations with good quality from the fidelity perspective.

4.2.3 Persuasibility for global explanations

As for persuasibility, it is also challenging to conduct effective evaluations on global explanations, whether using annotation based methods or employing human studies. The main reason lies in the fact that global explanations in real applications are very sophisticated, which makes it hard to make annotations or select appropriate users for studies. Essentially, the global nature requires the selected annotators or users to have a comprehensive understanding of the target task, otherwise the evaluation results would be less convincing or even misleading. Besides, global explanations in practice typically contain a large amount of information, which could make persuasibility evaluation extremely time-consuming. One possible solution is to use some simplified or proxy tasks to simulate the original one, as mentioned in [6], but this kind of substitution needs to maintain the original essence, which certainly requires non-trivial efforts on task abstraction. Another potential solution is to simplify the explanations shown to users, such as only showing the top-k features, which, however, sacrifices the comprehensiveness of generated explanations and impedes the full view over the target system.

4.3 Limitations of Current Evaluation

Although various methodologies of explanation evaluation exist in IML research, there are still some significant limitations of current evaluation techniques. We briefly introduce some of the most important ones below.

4.3.1 Causality insight for evaluation

The first limitation lies in the lack of a causal perspective [17] in explanation evaluation. Current evaluation techniques, no matter what properties they focus on, mostly fail to include causal analysis when evaluating the explanation quality. This drawback means that our selected explanations may not fully represent the true reasons behind the prediction, since the influence from confounders is not effectively blocked during interpretation. Take the two most common methodologies in IML, gradient based and perturbation based methods, for example. Both of them can be viewed as special cases of Individual Causal Effect (ICE) analysis, where complicated inter-feature interactions could conceal the real importance
of some input features [4]. Thus, to derive better explanations with relevant causal guarantees, we need corresponding evaluation techniques to assess the causal perspective of the generated explanations. In this way, human users would be further enabled to have a clearer understanding of the cause-effect association when interpreting the target system.

4.3.2 Completeness insight for evaluation

The second limitation is the neglect of completeness in explanation evaluation [13]. Existing efforts on IML evaluation cannot well reflect the degree of completeness of generated explanations, which makes it difficult for human users to further ensure the real value in practice. Explanation completeness could be important in real applications, because it indicates whether there would be additional explanations for certain prediction results. Questions such as “Do we get the full explanations from the target IML system?” and “Is it possible to generate better explanations than the current ones?” are not supported by the current evaluation techniques. A completeness-aware evaluation for explanation would definitely be helpful in exploring the boundaries of the target IML system. Besides, having completeness insight for assessment would also be a significant supplement for persuasibility evaluation, since the need for explainability typically stems from the incompleteness in problem formalization [6].

4.3.3 Novelty insight for evaluation

The third limitation results from the explanation novelty perspective [30]. Under the current infrastructure of explanation evaluation in IML, it is commonly assumed that high-quality explanations are those which can help human users make better decisions or obtain better understanding. Nevertheless, this view of good explanation is rather limited, since it somewhat overlooks the potential value of explanations that may not be well comprehended by users. Explanations which are not directly “useful” to human users may still have significant influence, due to their important role in extending the boundary of human knowledge. Medical diagnosis is a good example to illustrate this point. When diagnosing patients, doctors would typically compare the generated explanations with their acquired domain knowledge, if they have access to the IML systems. Since there is no way that domain knowledge can cover all aspects and contain the full pathological mechanism, especially for new diseases, we cannot casually discard the explanations that are mismatched with our knowledge. Those “novel” explanations could possibly point out some valuable research areas in a reverse way. To this end, current evaluation techniques need to be further enhanced to properly cover the novelty issue in assessing the quality of generated explanations, so that novel explanations could be well distinguished from noisy ones.

5. CONCLUSIONS

With the booming development of IML techniques, how to effectively evaluate generated explanations, typically without ground truth on quality, has become increasingly critical in recent years. In this article, we briefly introduce the explanation in IML, as well as its three general properties, and formally define the explanation evaluation problem within the context of IML. Then, following the properties, we systematically review the existing efforts in evaluating explanation, covering various methodologies and application scenarios. Moreover, a potential unified evaluation framework is built according to the hierarchical needs from both model developers and general end-users. In the end, several open problems in benchmarking and limitations of current techniques are discussed for future exploration. Though numerous obstacles are still left to be solved, explanation evaluation will keep playing a key role in enabling effective interpretation of IML systems.

6. REFERENCES

[1] A. Abdul, J. Vermeulen, D. Wang, B. Y. Lim, and M. Kankanhalli. Trends and trajectories for explainable, accountable and intelligible systems: An hci research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 582. ACM, 2018.
[2] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[3] B. G. Buchanan and R. O. Duda. Principles of rule-based expert systems. In Advances in computers, volume 22, pages 163–216. Elsevier, 1983.
[4] A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. N. Balasubramanian. Neural network attributions: A causal perspective. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 981–990, 2019.
[5] Z. Che, S. Purushotham, R. Khemani, and Y. Liu. Distilling knowledge from deep networks with applications to healthcare domain. arXiv preprint arXiv:1512.03542, 2015.
[6] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[7] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. Communications of the ACM, 2019.
[8] M. Du, N. Liu, F. Yang, S. Ji, and X. Hu. On attribution of recurrent neural network predictions via additive decomposition. In Proceedings of The Web Conference 2019 (TheWebConf). ACM, 2019.
[9] S. Feng and J. Boyd-Graber. What can ai do for me: Evaluating machine learning interpretations in cooperative play. arXiv preprint arXiv:1810.09648, 2018.
[10] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3429–3437, 2017.
[11] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
[12] A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.
[13] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018.
[14] B. Herman. The promise and peril of human evaluation for model interpretability. arXiv preprint arXiv:1711.07414, 2017.
[15] D. Holliday, S. Wilson, and S. Stumpf. User trust in intelligent systems: A journey over time. In Proceedings of the 21st International Conference on Intelligent User Interfaces, pages 164–168. ACM, 2016.
[16] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). arXiv preprint arXiv:1711.11279, 2017.
[17] C. Kim and O. Bastani. Learning interpretable models with causal guarantees. arXiv preprint arXiv:1901.08576, 2019.
[18] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR, 2017.
[19] I. Lage, E. Chen, J. He, M. Narayanan, B. Kim, S. Gershman, and F. Doshi-Velez. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006, 2019.
[20] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1675–1684. ACM, 2016.
[21] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[22] T. Lei, R. Barzilay, and T. Jaakkola. Rationalizing neural predictions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. NIH Public Access, 2016.
[23] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
[24] N. Liu, H. Yang, and X. Hu. Adversarial detection with model interpretation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1803–1811. ACM, 2018.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[26] T. Narendra, A. Sankaran, D. Vijaykeerthy, and S. Mani. Explaining deep learning models using causal inference. arXiv preprint arXiv:1811.04376, 2018.
[27] J. A. Nelder and R. W. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384, 1972.
[28] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015.
[29] R. L. Phillips, K. H. Chang, and S. A. Friedler. Interpretable active learning. In FAT, 2017.
[30] K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, and T. Unterthiner. Interpretable deep learning in drug discovery. arXiv preprint arXiv:1903.02788, 2019.
[31] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
[32] S. R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3):660–674, 1991.
[33] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[34] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in neural information processing systems, pages 2553–2561, 2013.
[35] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019.
[36] S. Wachter, B. Mittelstadt, and L. Floridi. Transparent, explainable, and accountable ai for robotics. Science Robotics, 2(6), 2017.
[37] E. Wallace, S. Feng, and J. Boyd-Graber. Interpreting neural networks with nearest neighbors. arXiv preprint arXiv:1809.02847, 2018.
[38] F. Yang, N. Liu, S. Wang, and X. Hu. Towards interpretation of recommender systems with sorted explanation paths. In 2018 IEEE International Conference on Data Mining (ICDM), pages 667–676. IEEE, 2018.
[39] C.-K. Yeh, C.-Y. Hsieh, A. S. Suggala, D. Inouye, and P. Ravikumar. How sensitive are sensitivity-based explanations? arXiv preprint arXiv:1901.09392, 2019.
[40] B. Zhou, D. Bau, A. Oliva, and A. Torralba. Interpreting deep visual representations via network dissection. IEEE transactions on pattern analysis and machine intelligence, 2018.
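
As a supplementary illustration of the weighted-sum ensemble described in Section 4.1.2, the following minimal Python sketch combines one score per tier into an overall quality value. It is only a sketch under stated assumptions, not a procedure specified in this article: the names TierResult and overall_quality, the normalization of every tier score to [0, 1], the normalization of weights by their sum, and all numbers below are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TierResult:
    """Result of one tier (hypothetical container, not from the article)."""
    name: str
    score: float   # assumed to be normalized to [0, 1]
    weight: float  # importance weight chosen per application and user group


def overall_quality(tiers: Sequence[TierResult]) -> float:
    """Weighted sum of tier scores, with weights normalized to sum to 1."""
    for t in tiers:
        if not 0.0 <= t.score <= 1.0:
            raise ValueError(f"score of tier {t.name!r} must lie in [0, 1]")
        if t.weight < 0.0:
            raise ValueError(f"weight of tier {t.name!r} must be non-negative")
    total_weight = sum(t.weight for t in tiers)
    if total_weight <= 0.0:
        raise ValueError("at least one tier needs a positive weight")
    return sum(t.weight * t.score for t in tiers) / total_weight


# Hypothetical scores, shared by both user groups.
SCORES = {"generalizability": 0.80, "fidelity": 0.90, "persuasibility": 0.50}

# Assumed weighting for model developers: emphasis on the lower tiers.
developer_view = [
    TierResult("generalizability", SCORES["generalizability"], 0.4),
    TierResult("fidelity", SCORES["fidelity"], 0.4),
    TierResult("persuasibility", SCORES["persuasibility"], 0.2),
]

# Assumed weighting for general end-users: emphasis on the top tier.
end_user_view = [
    TierResult("generalizability", SCORES["generalizability"], 0.2),
    TierResult("fidelity", SCORES["fidelity"], 0.2),
    TierResult("persuasibility", SCORES["persuasibility"], 0.6),
]

print(f"developers: {overall_quality(developer_view):.2f}")
print(f"end-users:  {overall_quality(end_user_view):.2f}")
```

With identical tier scores, the two weightings yield different overall values, which reflects the point of Section 4.1.1 that the level of needs depends on the user group. Any other ensemble rule could be substituted for the weighted sum without changing the three-tier structure.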
