
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

Towards Modeling Uncertainties of Self-Explaining Neural Networks via Conformal Prediction

Wei Qian*1, Chenxu Zhao*1, Yangyi Li*1, Fenglong Ma2, Chao Zhang3, Mengdi Huai1
1 Iowa State University
2 Pennsylvania State University
3 Georgia Institute of Technology
{wqi, cxzhao, liyangyi, mdhuai}@iastate.edu, fenglong@psu.edu, chaozhang@gatech.edu

*These authors contributed equally.
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Despite the recent progress in deep neural networks (DNNs), it remains challenging to explain the predictions made by DNNs. Existing explanation methods for DNNs mainly focus on post-hoc explanations, where another explanatory model is employed to provide explanations. The fact that post-hoc methods can fail to reveal the actual original reasoning process of DNNs raises the need to build DNNs with built-in interpretability. Motivated by this, many self-explaining neural networks have been proposed to generate not only accurate predictions but also clear and intuitive insights into why a particular decision was made. However, existing self-explaining networks are limited in providing distribution-free uncertainty quantification for the two simultaneously generated prediction outcomes (i.e., a sample's final prediction and its corresponding explanations for interpreting that prediction). Importantly, they also fail to establish a connection between the confidence values assigned to the generated explanations in the interpretation layer and those allocated to the final predictions in the ultimate prediction layer. To tackle the aforementioned challenges, in this paper, we design a novel uncertainty modeling framework for self-explaining networks, which not only demonstrates strong distribution-free uncertainty modeling performance for the generated explanations in the interpretation layer but also excels in producing efficient and effective prediction sets for the final predictions based on the informative high-level basis explanations. We perform the theoretical analysis for the proposed framework. Extensive experimental evaluation demonstrates the effectiveness of the proposed uncertainty framework.

Introduction

Recent advancements in DNNs have undeniably led to remarkable improvements in accuracy. However, this progress has come at the expense of losing visibility into their internal workings (Nam et al. 2020). This lack of transparency often renders these deep learning models "black boxes" (Li et al. 2020; Bang et al. 2021; Zhao et al. 2023). Without understanding the rationales behind the predictions, these black-box models cannot be fully trusted and widely applied in critical areas. In addition, explanations can facilitate model debugging and error analysis. These considerations indicate the necessity of investigating the explainability of DNNs.

To offer valuable insights into how a model makes predictions, many approaches have surfaced to explain the behavior of black-box neural networks. These methods primarily focus on providing post-hoc explanations, attempting to provide insights into the decision-making process of the underlying model (Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2016; Slack et al. 2020, 2021; Yao et al. 2022; Han, Srinivas, and Lakkaraju 2022; Huai et al. 2020). Note that these post-hoc methods require designing another explanatory model to provide explanations for a trained deep learning model. Despite their widespread use, post-hoc interpretation methods can be inaccurate or incomplete in revealing the actual reasoning process of the original model (Rudin 2018). To address these challenges, researchers have been actively promoting the adoption of inherently self-explaining neural networks (Koh et al. 2020; Havasi, Parbhoo, and Doshi-Velez 2022; Chauhan et al. 2023; Yuksekgonul, Wang, and Zou 2022; Alvarez Melis and Jaakkola 2018; Sinha et al. 2023), which integrate interpretability directly into the model architecture and training process, producing explanations that are utilized during classification. Based on the presence of ground-truth basis concepts during training, existing works on self-explaining networks can be broadly classified into two groups: those with access to ground truth and those without. For example, (Koh et al. 2020) trains the concept bottleneck layer to align concept predictions with established true concepts, while (Alvarez Melis and Jaakkola 2018) utilizes autoencoder techniques to learn prototype-based concepts from the training data.

Although existing self-explaining networks can offer inherent interpretability to make their decision-making process transparent and accessible, they are limited in the distribution-free uncertainty quantification for both the generated explanation outcomes and the corresponding label predictions. Such limitations have the potential to significantly impede the widespread acceptance and utilization of self-explaining neural networks. In practice, such confidence values serve as indicators of the likelihood of correctness for each prediction, and understanding the likelihood of each prediction allows us to assess the degree to which we can rely on it. Note that self-explaining neural networks simultaneously generate two outputs (i.e., the final prediction decision and an explanation of that final decision). The most straightforward way is to separately and independently quantify uncertainties for the predicted explanations and their corresponding final predictions.


Nevertheless, we argue that within self-explaining neural networks, the high-level interpretation space is informative due to its powerful inductive bias in generating high-level basis concepts for interpretable representations. Imagine a scenario in which we engage in the binary classification task distinguishing between zebras and elephants, and we hold a strong 95% confidence in identifying the presence of the "Stripes" concept within a test image sample. In this context, we can assert that the model exhibits high confidence in assigning the "Zebra" label to this particular test sample. This also aligns with existing class-wise explanations (Li et al. 2021; Zhao et al. 2021) that focus on uncovering the informative patterns that influence the model's decision-making process when classifying instances into a particular class category.

Our goal in this paper is to offer distribution-free uncertainty quantification for self-explaining networks via two indispensable and interconnected aspects: establishing a reliable uncertainty measure for the generated explanations and effectively transferring the explanation prediction sets to the associated predicted class outcomes. Note that the recent work of (Slack et al. 2021) focuses exclusively on post-hoc explanations and relies on certain assumptions regarding the properties of these explanations, such as assuming a Gaussian distribution of explanation errors. On the other hand, (Kim et al. 2023) mainly presents probabilities and uncertainties for individual concepts or classes, and does not provide the prediction set with uncertainty, neither for the concept layer nor for the prediction layer. Therefore, in addition to providing distribution-free uncertainty quantification for the generated basis explanations, it is also essential to study how to transfer the constructed prediction sets in the interpretation space to the final prediction decision space.

To address the above challenges, in this paper, we design a novel uncertainty modeling framework for self-explaining neural networks (unSENN), which can make valid and reliable predictions or estimates with quantifiable uncertainty, even in situations where the underlying data distribution may not be well-defined or where traditional assumptions of parametric statistics might not hold. Specifically, in our proposed method, we first design novel non-conformity measures to capture the informative high-level concepts in the interpretation layer for self-explaining neural networks. Importantly, our proposed non-conformity measures can overcome the difficulties posed by self-explaining networks, even in the absence of explanation ground truth in the interpretation layer. Note that these calculated scores reflect the prediction error of the underlying explainer, where a smaller prediction error would lead to the construction of smaller and more informative concept sets. We also prove the theoretical guarantees for the constructed concept prediction sets in the interpretation layer. To get better conformal predictions for the final predicted labels, we then design effective transfer functions, which involve the search of possible labels under certain constraints. Due to the challenge posed by directly estimating final prediction sets through the formulated transfer functions, we transform them into new optimization frameworks from the perspective of adversarial attacks. We also perform the theoretical analysis for the final prediction sets. The extensive experimental results verify the effectiveness of our proposed method.

Problem Formulation

Let us consider a general multi-class classification problem with a total of M > 2 possible labels. For an input x ∈ R^d, its true label is represented as y ∈ R. Note that self-explaining networks generate high-level concepts in the interpretation layer prior to the output layer. In Definition 1, we give a general definition of a self-explaining model (Alvarez Melis and Jaakkola 2018; Elbaghdadi et al. 2020).

Definition 1 (Self-explaining Models). Let x ∈ X ⊂ R^d and Y ⊂ R^M be the input and output spaces. We say that f : X → Y is a self-explaining prediction model if it has the following form

f(x) = g(θ_1(x)h_1(x), · · · , θ_C(x)h_C(x)),   (1)

where the prediction head g depends on its arguments, the set {(θ_c(x), h_c(x))}_{c=1}^C consists of basis concepts and their influence scores, and C is small compared to d.

Example 1: Concept bottleneck models (Koh et al. 2020). Concept bottleneck models (CBMs) are interpretable neural networks that first predict labels for human-interpretable concepts relevant to the prediction task, and then predict the final label based on the concept label predictions. We denote raw input features by x, the labels by y, and the pre-defined true concepts by c ∈ R^C. Given the observed training samples D^tra = {(x_i, c_i, y_i)}_{i=1}^{N^tra}, a CBM (Koh et al. 2020) can be trained using the concept and class prediction losses via two distinct approaches: independent and joint.

Example 2: Prototype-based self-explaining networks (Alvarez Melis and Jaakkola 2018). Different from CBMs (Koh et al. 2020), prototype-based self-explaining networks usually use an autoencoder to learn a set of prototype-based concepts directly from the training data (i.e., D^tra = {(x_i, y_i)}_{i=1}^{N^tra}) during the training process without the pre-defined concept supervision information. These prototype-based concepts are extracted in a way that they can best represent specific target sets (Zhang, Gao, and Ma 2023).

For a well-trained self-explaining network f = g ∘ h, our goal in this paper is to quantify the uncertainty of the generated explanations in the interpretation layer (preceding the output layer) and the final predictions in the output layer. In particular, for a test sample x and a chosen miscoverage rate ε ∈ (0, 1), our primary objective is twofold: first, we aim to calculate a set Γ^ε_cpt(x) containing x's true concepts with probability at least 1 − ε; in addition, we will investigate how to transfer the explanation prediction sets to the final decision sets. In practice, predictions associated with confidence values are highly desirable in risk-sensitive applications.
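To make Definition 1 and Example 1 concrete, the sketch below shows one way a concept bottleneck model f = g ∘ h could be implemented and trained with the joint strategy. The layer sizes, the loss weight lam, and the choice of optimizer are illustrative assumptions, not the authors' exact configuration; tensor shapes are assumed to be x: (N, d), c_true: (N, C) in {0, 1}, y_true: (N,).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckModel(nn.Module):
    # A minimal CBM in the sense of Definition 1 / Example 1: h maps an input to C
    # concept scores (the interpretation layer), and the prediction head g maps those
    # scores to M class logits, so that f = g o h.
    def __init__(self, d, C, M, hidden=128):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, C))
        self.g = nn.Sequential(nn.Linear(C, hidden), nn.ReLU(), nn.Linear(hidden, M))

    def forward(self, x):
        concept_scores = self.h(x)              # concept predictions h(x)
        class_logits = self.g(concept_scores)   # final prediction g(h(x))
        return concept_scores, class_logits

def joint_step(model, x, c_true, y_true, optimizer, lam=1.0):
    # One step of the "joint" training strategy: class loss plus a lam-weighted
    # concept loss (lam and the optimizer settings are assumed hyperparameters).
    concept_scores, class_logits = model(x)
    loss = F.cross_entropy(class_logits, y_true) + \
           lam * F.binary_cross_entropy_with_logits(concept_scores, c_true.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

For the prototype-based networks of Example 2, h would instead be the similarity layer over learned prototypes and the concept supervision term would be dropped; the conformal machinery developed below only assumes access to the concept scores h(x) and the head g.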


Methodology

In this section, we take concept bottleneck models (Koh et al. 2020; Havasi, Parbhoo, and Doshi-Velez 2022; Chauhan et al. 2023; Yuksekgonul, Wang, and Zou 2022) as an illustrative example to present our proposed method. Note that concept bottleneck models incorporate pre-defined concepts into a supervised learning procedure. We denote the available dataset as D = {(x_i, c_i, y_i)}_{i=1}^N, where x_i ∈ R^d, c_i = [c_{i,1}, · · · , c_{i,C}] ∈ R^C and y_i ∈ [M]. These samples are drawn exchangeably from some unknown distribution P_XCY. To train a self-explaining network, we first split the available dataset D = {(x_i, c_i, y_i)}_{i=1}^N into a training set D^tra and a calibration set D^cal, where D^tra ∩ D^cal = ∅ and D^tra ∪ D^cal = D. Given the available training set D^tra associated with the supervised concepts, we can train a self-explaining network f = g ∘ h, where h : R^d → R^C maps a raw sample x into the concept space and g : R^C → R^M maps concepts into a final class prediction.

Given the underlying well-trained self-explaining model f = g ∘ h, our goal is to provide validity guarantees on its explanation and final prediction outcomes in a post-hoc way. Note that for each x_i, its true relevant concepts are represented by a concept vector c_i = [c_{i,1}, · · · , c_{i,j}, · · · , c_{i,C}] ∈ {0, 1}^C, with c_{i,j} = 1 indicating that x_i is associated with the j-th concept. We also use C_cpt(x_i) = {j | c_{i,j} = 1} to represent the set of true concepts for x_i. For x_i, we can use h : R^d → R^C to obtain h(x_i) = [h_1(x_i), · · · , h_j(x_i), · · · , h_C(x_i)], where h_j(x_i) is the prediction score of x_i with regard to the j-th concept. We denote [h_[1](x_i), · · · , h_[j](x_i), · · · , h_[C](x_i)] as the sorted values of h(x_i) in descending order, i.e., h_[1](x_i) is the largest (top-1) score, h_[2](x_i) is the second largest (top-2) score, and so on. Furthermore, [j] corresponds to the concept index of the top-j concept prediction score, i.e., j′ = [j] if h_{j′}(x_i) = h_[j](x_i). In practice, researchers are usually interested in identifying the top-K important concepts to help users understand the key factors driving the model's predictions (Agarwal et al. 2022; Brunet, Anderson, and Zemel 2022; Yeh et al. 2020; Huai et al. 2022; Rajagopal et al. 2021; Hitzler and Sarker 2022). Here, we can easily convert a general concept predictor h into a top-K concept classifier by returning the set of concepts corresponding to the set of top-K prediction scores from h. For x_i, the top-K concept prediction function returns the set C̃_cpt(x_i) = {[1], · · · , [K]} for 1 ≤ K < C. Here, we assume the concept prediction scores are calibrated, i.e., taking values in the range of [0, 1]. For Ψ ∈ R, we can use simple transforms such as σ(Ψ) = 1/(1 + e^{−Ψ}) to map it to the range of [0, 1] without changing the ranking. Then, for x_i, we obtain the below calibrated concept importance scores

h̃(x_i) = [1/(1 + exp(−h_1(x_i))), · · · , 1/(1 + exp(−h_C(x_i)))] = [σ(h_1(x_i)), · · · , σ(h_C(x_i))],   (2)

where σ(h_j(x_i)) ∈ [0, 1] is the calibrated importance score of x_i with regard to the j-th concept. Note that x_i is associated with a concept subset C_cpt(x_i), which can be represented by a vector h*(x_i) = [h*_1(x_i), · · · , h*_j(x_i), · · · , h*_C(x_i)], where h*_j(x_i) = 1 if and only if concept j is associated with x_i, and 0 otherwise.

In order to quantify how "strange" the generated concept explanations are for a given sample x_i, we utilize the underlying well-trained self-explaining model f to construct the following non-conformity measure

s(x_i, h*(x_i), h̃(x_i)) = Σ_{j=1}^{K} (1 − 1/(1 + exp(−h_[j](x_i)))) + λ_2 · Σ_{j=K+1}^{C} 1/(1 + exp(−h_[j](x_i))) = Σ_{j=1}^{K} |h*_[j](x_i) − h̃_[j](x_i)| + λ_2 · Σ_{j=K+1}^{C} |h*_[j](x_i) − h̃_[j](x_i)|,   (3)

where λ_2 is a pre-defined trade-off parameter, and [j] corresponds to the concept index of the top-j concept prediction score of h̃(x_i). The above non-conformity measure returns small values when the concept prediction outputs of the true relevant concepts are high and the concept outputs of the non-relevant concepts are low. Note that smaller conformal scores correspond to more model confidence (Angelopoulos and Bates 2021; Teng, Tan, and Yuan 2021). In particular, when λ_2 = 0, we only consider the top-ranked concepts.

Based on the above constructed non-conformity measure, for each calibration sample (x_i, c_i, y_i) ∈ D^cal, we apply the non-conformity measure s(x_i, h*(x_i), h̃(x_i)) to get N^cal non-conformity scores {s_i}_{i=1}^{N^cal}, where N^cal is the number of calibration samples. Then, we select a miscoverage rate ε ∈ (0, 1) and compute Q_{1−ε} as the ⌈(N^cal + 1)(1 − ε)⌉ / N^cal quantile of the non-conformity scores {s_i}_{i=1}^{N^cal}, where ⌈·⌉ is the ceiling function. Formally, Q_{1−ε} is defined as

Q_{1−ε} = inf{Q : |{i : s(x_i, h*(x_i), h̃(x_i)) ≤ Q}| / N^cal ≥ ⌈(N^cal + 1)(1 − ε)⌉ / N^cal}.   (4)

Based on the above, given Q_{1−ε} and a desired coverage level 1 − ε ∈ (0, 1), for a test sample x^test (where x^test is known but its true top-K concepts are not), we can get the explanation prediction set Γ^ε_cpt(x^test) for x^test with a 1 − ε coverage guarantee. Algorithm 1 gives the procedure to calculate the concept prediction sets.

Definition 2 (Data Exchangeability). Consider variables z_1, · · · , z_N. The variables are exchangeable if for every permutation τ of the integers 1, · · · , N, the variables w_1, · · · , w_N, where w_i = z_{τ(i)}, have the same joint probability distribution as z_1, · · · , z_N.

Theorem 1. Suppose that the calibration samples (D^cal = {(x_i, c_i, y_i)}_{i=1}^{N^cal}) and the given test sample x^test are exchangeable. Then, if we calculate the quantile value Q_{1−ε} and construct Γ^ε_cpt(x^test) as indicated above, for the above non-conformity score function s and any ε ∈ (0, 1), the derived Γ^ε_cpt(x^test) satisfies

P(C_cpt(x^test) ∈ Γ^ε_cpt(x^test)) ≥ 1 − ε,   (5)

where C_cpt(x^test) is the true relevant concept set for the given test sample x^test, and ε is a desired miscoverage rate.
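The calibration step in Eqns. (2)-(4) can be summarized with a short sketch. The array shapes (h_cal: (N_cal, C) raw concept scores, c_cal: (N_cal, C) binary true concepts) and the membership helper covers, which tests whether a candidate 0/1 concept labeling falls inside Γ^ε_cpt, are our own reading of the construction rather than the authors' released code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nonconformity(h_scores, h_star, K, lam2=1.0):
    # Eqn. (3): weighted L1 gap between the 0/1 indicator h_star and the calibrated
    # scores sigma(h), split at the top-K positions of the predicted scores.
    h_tilde = sigmoid(h_scores)            # Eqn. (2): calibrated concept scores
    order = np.argsort(-h_scores)          # positions [1], ..., [C] in descending order
    gaps = np.abs(h_star[order] - h_tilde[order])
    return gaps[:K].sum() + lam2 * gaps[K:].sum()

def conformal_quantile(scores, eps):
    # Eqn. (4): the ceil((N_cal + 1)(1 - eps))-th smallest calibration score;
    # when that index exceeds N_cal, the quantile is infinite (see the proof of Theorem 1).
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - eps)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def calibrate(h_cal, c_cal, K, eps, lam2=1.0):
    scores = np.array([nonconformity(h, c, K, lam2) for h, c in zip(h_cal, c_cal)])
    return conformal_quantile(scores, eps)

def covers(h_test, candidate_concepts, K, q, lam2=1.0):
    # A candidate concept labeling is kept in the explanation prediction set
    # whenever its non-conformity score does not exceed Q_{1-eps}.
    return nonconformity(h_test, candidate_concepts, K, lam2) <= q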


Algorithm 1: Uncertainty quantification for self-explaining neural networks
1: Input: Calibration data D^cal, pre-trained model f = g ∘ h, non-conformity measure s(·), a test sample x^test, and a desired miscoverage rate ε ∈ (0, 1).
2: Output: The concept prediction set Γ^ε_cpt(x^test) and label set Γ^ε_lab(x^test) for sample x^test.
3: Calculate the non-conformity scores on the calibration samples (i.e., D^cal) based on Eqn. (3)
4: Compute the quantile value Q_{1−ε} based on Eqn. (4)
5: Construct the concept prediction set Γ^ε_cpt(x^test) for x^test with 1 − ε coverage guarantee
6: Initialize Γ^ε_lab(x^test) = ∅
7: for m = 1, 2, · · · , M do
8:   Obtain δ by optimizing the loss in Eqn. (10)
9:   if Σ_{m′∈[M]\m} I[g_m(v) > g_{m′}(v)] == M − 1 then
10:     Update Γ^ε_lab(x^test) = Γ^ε_lab(x^test) ∪ {m}
11:  end if
12: end for
13: Return: Γ^ε_cpt(x^test) and Γ^ε_lab(x^test).

Proof. Let s_1, s_2, · · · , s_{N^cal}, s^test denote the non-conformity scores of the calibration samples and the given test sample. To avoid handling ties, we consider the case where these non-conformity scores are distinct with probability 1. Without loss of generality, we assume these calibration scores are sorted so that s_1 < s_2 < · · · < s_{N^cal}. In this case, we have that Q_{1−ε} = s_{⌈(N^cal+1)(1−ε)⌉} when ε ≥ 1/(N^cal + 1) and Q_{1−ε} = ∞ otherwise. Note that in the case where Q_{1−ε} = ∞, the coverage property is trivially satisfied; thus, we only need to consider the case when ε ≥ 1/(N^cal + 1). We begin by observing the below equality of the two events

{C_cpt(x^test) ∈ Γ^ε_cpt(x^test)} = {s^test ≤ Q_{1−ε}}.   (6)

By combining this with the definition of Q_{1−ε}, we have

{C_cpt(x^test) ∈ Γ^ε_cpt(x^test)} = {s^test ≤ s_{⌈(N^cal+1)(1−ε)⌉}}.

Since the variables s_1, · · · , s^test are exchangeable, we have that P(s^test ≤ s_k) = k/(N^cal + 1) holds true for any integer k. This means that s^test is equally likely to fall anywhere among the calibration scores s_1, · · · , s_{N^cal}. Notably, the randomness is over all variables. Building on this, we can derive that P(s^test ≤ s_{⌈(N^cal+1)(1−ε)⌉}) = ⌈(N^cal + 1)(1 − ε)⌉ / (N^cal + 1) ≥ 1 − ε, which yields the desired result.

In the above theorem, we show that the proposed algorithm can output predictive concept sets that are rigorously guaranteed to satisfy the desired coverage property shown in Eqn. (5), no matter what (possibly incorrect) self-explaining model is used or what the (unknown) distribution of the data is. The above theorem relies on the assumption that the data points are exchangeable (see Definition 2), which is widely adopted by existing conformal inference works (Cherubin, Chatzikokolakis, and Jaggi 2021; Tibshirani et al. 2019). The data exchangeability assumption is much weaker than i.i.d.; note that i.i.d. variables are automatically exchangeable (Sesia and Favaro 2022).

For a random variable v ∈ R^C, we denote [v_[1], v_[2], · · · , v_[C]] as the sorted values of v in descending order, i.e., v_[1] is the largest (top-1) score, v_[2] is the second largest (top-2) score, and so on. Then, we can construct the label prediction set for the test sample x^test as

Γ^ε_lab(x^test) = {y = g(v) : Σ_{j=1}^{K} |h*_[j](x^test) − h̃_[j](v)| + λ_2 · Σ_{j=K+1}^{C} |h*_[j](x^test) − h̃_[j](v)| ≤ Q_{1−ε}},   (7)

where h̃(v) = σ(v). In the above, since the true concepts are unknown for x^test, we instead calculate h*_[j](x^test) = h̃_[j](x^test) = σ(h_[j](x^test)). Without loss of generality, in the following, we take the setting where λ_2 = 1 as an example to discuss how to transfer the confidence prediction sets in the concept space to the final class label prediction layer; the extension to other cases can be easily derived. In this case, we calculate the set for x^test as Γ^ε_lab(x^test) = {y = g(v) : ||h̃(v) − h*(x^test)||_1 ≤ Q_{1−ε}}, where v ∈ R^C, v̂ = σ(v) = [σ(v_1), · · · , σ(v_C)], g is the prediction head, h̃(x^test) = [σ(h_1(x^test)), · · · , σ(h_C(x^test))], and Q_{1−ε} is derived based on the calibration set. To derive the label sets, we propose to solve the following optimization

max_{v ∈ R^C} Σ_{m′∈[M]\m} I[g_m(v) > g_{m′}(v)]   (8)
s.t. ||h̃(v) − h*(x^test)||_1 ≤ Q_{1−ε},

where the quantile value Q_{1−ε} is derived based on Eqn. (4). In the above, we want to find a v whose predicted label is "m", under the constraint that ||h̃(v) − h*(x^test)||_1 ≤ Q_{1−ε}. If we can successfully find such a v, we will include the label "m" in the label set; otherwise, we will exclude it. Notably, we need to iteratively solve the above optimization over all possible labels to construct the final prediction set.

However, it is difficult to directly optimize the above problem, due to the non-differentiability of the discrete components in the above equation. Below, we use δ = [δ_1, · · · , δ_C] to denote the difference between v̂ and h̃(x^test), i.e., δ = v̂ − h̃(x^test), where v̂ = σ(v) = [σ(v_1), · · · , σ(v_C)], and h̃(x^test) = σ(h(x^test)). Based on this, we have

v = [v_1 = −log(1/(h̃_1(x^test) + δ_1) − 1), · · · , v_C = −log(1/(h̃_C(x^test) + δ_C) − 1)].   (9)

To address the above challenge, we propose to solve the below reformulated optimization

min_{||δ||_1 ≤ Q_{1−ε}} max(max_{m′ ≠ m} g_{m′}(v) − g_m(v), −β),   (10)

where m′ ∈ [M], β is a pre-defined value, v is derived based on Eqn. (9), and g : R^C → R^M maps v into a final class prediction. g(v) represents the logit output of the underlying self-explaining model for each class when the input is v. The above adversarial loss enforces the actual prediction of v to the targeted label "m" (Huang et al. 2020; Carlini and Wagner 2017). If we can find such a v, the label "m" will be included in the prediction set; otherwise, it will not.
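One plausible way to carry out the search in Eqn. (10), sketched under our own assumptions: for each candidate class m we run projected gradient descent on δ inside the L1 ball of radius Q_{1−ε}, recover v through the sigmoid inversion of Eqn. (9), and keep m when the resulting v is classified as m. The L1 projection routine, the step size, the iteration budget, and the numerical clamping of v̂ into (0, 1) are illustrative choices, not the authors' exact settings; g is assumed to be a torch module mapping (1, C) concept scores to (1, M) class logits.

import torch

def l1_projection(delta, radius):
    # Project delta onto the L1 ball of the given radius (sort-based projection).
    if radius <= 0:
        return torch.zeros_like(delta)
    flat = delta.abs().flatten()
    if flat.sum() <= radius:
        return delta
    sorted_vals, _ = torch.sort(flat, descending=True)
    cumsum = torch.cumsum(sorted_vals, dim=0)
    ks = torch.arange(1, flat.numel() + 1, dtype=delta.dtype)
    rho = torch.nonzero(sorted_vals - (cumsum - radius) / ks > 0).max()
    theta = (cumsum[rho] - radius) / (rho + 1.0)
    return torch.sign(delta) * torch.clamp(delta.abs() - theta, min=0.0)

def label_prediction_set(g, h_test_logits, q, beta=1.0, steps=200, lr=0.05, eps_num=1e-6):
    # For each class m, search for v (Eqn. (9)) with ||delta||_1 <= Q_{1-eps} whose
    # prediction is m (Eqn. (10)); include m in the label set if the search succeeds.
    h_tilde = torch.sigmoid(h_test_logits.detach())
    with torch.no_grad():
        M = g(h_test_logits.detach().unsqueeze(0)).shape[1]
    label_set = []
    for m in range(M):
        delta = torch.zeros_like(h_tilde, requires_grad=True)
        opt = torch.optim.SGD([delta], lr=lr)
        for _ in range(steps):
            v_hat = torch.clamp(h_tilde + delta, eps_num, 1 - eps_num)
            v = -torch.log(1.0 / v_hat - 1.0)                  # Eqn. (9): invert the sigmoid
            logits = g(v.unsqueeze(0)).squeeze(0)
            others = torch.cat([logits[:m], logits[m + 1:]])
            loss = torch.clamp(others.max() - logits[m], min=-beta)  # Eqn. (10)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.data = l1_projection(delta.data, q)      # enforce ||delta||_1 <= Q_{1-eps}
        with torch.no_grad():
            v_hat = torch.clamp(h_tilde + delta, eps_num, 1 - eps_num)
            v = -torch.log(1.0 / v_hat - 1.0)
            if g(v.unsqueeze(0)).argmax().item() == m:
                label_set.append(m)
    return label_set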


Lemma 2 ((Tibshirani et al. 2019)). If V_1, · · · , V_{n+1} are exchangeable random variables, then for any γ ∈ (0, 1), we have the following

P{V_{n+1} ≤ Quantile(γ; V_{1:n} ∪ {∞})} ≥ γ,   (11)

where Quantile(γ; V_{1:n} ∪ {∞}) denotes the level γ quantile of V_{1:n} ∪ {∞}. In the above, V_{1:n} = {V_1, · · · , V_n}.

Theorem 3. Suppose the calibration samples and the test sample are exchangeable. If we construct Γ^ε_lab(x^test) as indicated above, the following inequality holds for any ε ∈ (0, 1)

P(y^test ∈ Γ^ε_lab(x^test)) ≥ 1 − ε,   (12)

where y^test is the true label of x^test, and Γ^ε_lab(x^test) is the label prediction set derived based on Eqn. (7).

Due to space constraints, the complete proof for Theorem 3 is postponed to the full version of this paper. This proof is based on Lemma 2.

Discussions on prototype-based self-explaining neural networks. Here, we discuss how to generalize our above method to prototype-based self-explaining networks, which do not depend on pre-defined expert concepts (Alvarez Melis and Jaakkola 2018; Li et al. 2018). They usually adopt an autoencoder network to learn a lower-dimensional latent representation of the data with an encoder network, and then use several network layers over the latent space to learn C prototype-based concepts, i.e., {p_j ∈ R^q}_{j=1}^C. After that, for each p_j, the similarity layer computes its distance from the learned latent representation (i.e., e_enc(x)) as h_j(x) = ||e_enc(x) − p_j||_2^2. The smaller the distance value is, the more similar e_enc(x) and the j-th prototype (p_j) are. Finally, we can output a probability distribution over the M classes. Nonetheless, we have no access to the true basis explanations if we want to conduct conformal predictions at the prototype-based interpretation level. To address this, for prototype-based self-explaining networks, we construct the following non-conformity measure

s(x_i, y_i, f) = inf_{v ∈ {v : ℓ(g(v), y_i) ≤ α}} [ Σ_{j=1}^{K} |h_[j](x_i) − v_[j]| + λ_2 · Σ_{j=K+1}^{C} |h_[j](x_i) − v_[j]| ],   (13)

where ℓ(g(v), y_i) is the classification cross-entropy loss for v, and α is a predefined small threshold value that makes the prediction based on v close to label y_i. Based on the above non-conformity measure, we can calculate the non-conformity scores for the calibration samples, and then calculate the desired quantile value. Based on this, for the given test sample, following Eqn. (8) and (13), we can calculate the final prediction set. However, it is usually complicated to calculate the above score due to the infimum operator. To overcome this challenge, we apply gradient descent starting from the trained similarity vector to find a surrogate vector around it. Additionally, to obtain the final prediction sets, we follow Eqn. (10) to determine whether a specific class label should be included in the final prediction sets.
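A hedged sketch of how the infimum in Eqn. (13) could be approximated by the gradient-descent search described above: we start from the trained similarity vector h(x_i) and relax the constraint ℓ(g(v), y_i) ≤ α into a penalty on the excess loss. The penalty weight, optimizer, and iteration budget are assumptions made for illustration; they are not the authors' exact procedure.

import torch
import torch.nn.functional as F

def prototype_nonconformity(g, h_xi, y_i, K, lam2=1.0, alpha=0.05,
                            penalty=10.0, steps=300, lr=0.05):
    # Approximate Eqn. (13): search for a surrogate vector v near the similarity
    # vector h(x_i) whose cross-entropy loss w.r.t. y_i stays (approximately) below
    # alpha, and return the weighted top-K / tail L1 distance between the sorted
    # entries of h(x_i) and v.
    h_xi = h_xi.detach()
    h_sorted = torch.sort(h_xi, descending=True).values
    v = h_xi.clone().requires_grad_(True)          # start from the trained similarity vector
    opt = torch.optim.Adam([v], lr=lr)
    target = torch.tensor([int(y_i)])
    for _ in range(steps):
        v_sorted = torch.sort(v, descending=True).values
        gaps = (h_sorted - v_sorted).abs()
        dist = gaps[:K].sum() + lam2 * gaps[K:].sum()
        ce = F.cross_entropy(g(v.unsqueeze(0)), target)
        # soft version of the constraint l(g(v), y_i) <= alpha: penalize only the excess
        loss = dist + penalty * torch.clamp(ce - alpha, min=0.0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        gaps = (h_sorted - torch.sort(v, descending=True).values).abs()
        return (gaps[:K].sum() + lam2 * gaps[K:].sum()).item()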
Figure 1: Data efficiency for concept bottleneck models with different learning strategies. (a) MNIST; (b) CIFAR-100 Super-class. (The plots show test accuracy (%) versus data proportion (%) for the Standard, Independent, and Joint strategies.)

Experiments

In this section, we conduct experiments to evaluate the performance of the proposed method (i.e., unSENN). Due to space limitations, more experimental details and results (e.g., experiments on more self-explaining models and running time) can be found in the full version of this paper.

Experimental Setup

Real-world datasets. In experiments, we adopt the following real-world datasets: CIFAR-100 Super-class (Fischer et al. 2019) and MNIST (Deng 2012). Specifically, the CIFAR-100 Super-class dataset comprises 100 diverse classes of images, which are further categorized into 20 super-classes. For example, the five classes baby, boy, girl, man and woman belong to the super-class people. In this case, we exploit each image class as a concept-based explanation for the super-class prediction (Hong and Pavlic 2023). The MNIST dataset consists of a collection of grayscale images of handwritten digits (0-9). In experiments, we consider the MNIST range classification task, wherein the goal is to classify handwritten digits into five distinct classes based on their numerical ranges. Specifically, the digits are divided into five non-overlapping ranges (i.e., [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]), each assigned a unique label (the label set is denoted as {0, 1, 2, 3, 4}). For example, an input image depicting a handwritten digit "4" is classified as class 2 within the range of digits 4 to 5. Here, we exploit the identity of each digit as a concept-based explanation for the range prediction (Rigotti et al. 2021).
where ℓ(g(v), yi ) is the classification cross entropy loss for baseline and Vanilla Conformal Predictors. These baselines
v, and α is a predefined small threshold value to make the exclusively rely upon X and Y. Specifically, the Naive base-
prediction based on v to be close to label yi . Based on the line constructs the set sequentially incorporating classes in
above non-conformity measure, we can calculate the non- decreasing order of their probabilities until the cumula-
conformity scores for the calibration samples, and then cal- tive probability surpasses the predefined threshold 1 − ε.
culate the desired quantile value. Based on this, for the given Notably, this Naive approach lacks a theoretical guaran-
test sample, following Eqn. (8) and (13), we can calculate tee of its coverage. The Vanilla Conformal Predictors uti-
the final prediction set. However, it is usually complicated lize the final decisions to construct the prediction sets. Fol-
to calculate the above score due to the infimum operator. To lowing (Cherubin, Chatzikokolakis, and Jaggi 2021), we
overcome this challenge, we apply gradient descent starting set the non-conformity measure for this Vanilla baseline as


Implementation details. In experiments, we evaluate the proposed method (i.e., unSENN) across the following underlying models: ResNet-50 (He et al. 2016), a convolutional neural network (CNN), and a multi-layer perceptron (MLP). For the calibration set, we randomly hold out 10% of the original available dataset to compute the non-conformity scores. All the experiments are conducted 10 times, and we report the mean and standard errors.

Experimental Results

First, we measure the data efficiency (the amount of training data needed to achieve a desired level of accuracy) of the self-explaining model. For the CIFAR-100 Super-class, we apply ResNet-50 and MLP to learn the concepts and labels, respectively. Similarly, we use CNN and MLP for MNIST. In Figure 1, we present the test accuracy of the self-explaining model using independent and joint learning strategies with various data proportions. Note that the standard model ignores the concepts and directly maps input features to the final prediction. As we can see, the self-explaining models are particularly effective on the adopted datasets. The joint model exhibits comparable performance to the standard model, and the joint model is slightly more accurate in lower data regimes (e.g., 20%). Therefore, in the following experiments, we adopt the joint learning strategy.

Then, we explore the effectiveness of the concept conformal sets within the self-explaining model. In Table 1, we report the error rate (the percentage of true concepts that are not held in the prediction sets) of concept conformal sets with different miscoverage rates. From this table, we can find that our proposed method can guarantee a probability of error of at most ε and fully exploits its error margin to optimize efficiency. In contrast, the Naive baseline simply outputs all possible concepts using the raw softmax probabilities and does not provide theoretical guarantees. In addition, we show concept conformal sets when ε = 0.1 in Figure 2. Intuitively, a blurred image tends to have a large prediction set due to its uncertainty during testing, while a clear image tends to have a small set. Thus, a smaller set means higher confidence in the prediction, and a larger set indicates higher uncertainty regarding the predicted concepts.

Figure 2: Concept conformal sets on MNIST and CIFAR-100 Super-class. (a) MNIST; (b) CIFAR-100 Super-class. Phrases in bold and underlined mean true concepts.

Table 1: Error rate of concept conformal sets on MNIST and CIFAR-100 Super-class with different miscoverage rates.

Dataset                | Miscoverage rate ε | Concept error rate (unSENN) | Concept error rate (Naive)
MNIST                  | 0.05               | 0.047 ± 0.001               | 0.003 ± 0.000
MNIST                  | 0.1                | 0.094 ± 0.002               | 0.003 ± 0.000
MNIST                  | 0.15               | 0.139 ± 0.006               | 0.004 ± 0.000
MNIST                  | 0.2                | 0.174 ± 0.011               | 0.005 ± 0.000
CIFAR-100 Super-class  | 0.05               | 0.044 ± 0.002               | 0.073 ± 0.001
CIFAR-100 Super-class  | 0.1                | 0.091 ± 0.003               | 0.100 ± 0.001
CIFAR-100 Super-class  | 0.15               | 0.137 ± 0.004               | 0.121 ± 0.001
CIFAR-100 Super-class  | 0.2                | 0.177 ± 0.003               | 0.136 ± 0.001

Table 2: Label coverage and average set size of prediction conformal sets on MNIST and CIFAR-100 Super-class.

Dataset                | Method  | Label coverage | Average set size
MNIST                  | unSENN  | 0.988 ± 0.002  | 1.00 ± 0.00
MNIST                  | Naive   | 0.985 ± 0.002  | 1.01 ± 0.00
MNIST                  | Vanilla | 0.903 ± 0.001  | 0.90 ± 0.00
CIFAR-100 Super-class  | unSENN  | 0.847 ± 0.002  | 1.03 ± 0.00
CIFAR-100 Super-class  | Naive   | 0.786 ± 0.004  | 1.28 ± 0.02
CIFAR-100 Super-class  | Vanilla | 0.805 ± 0.006  | 1.15 ± 0.01

Figure 3: Boxplot of prediction sets. We display the lengths of conformal sets that are not equal to 1 over all the test data. (a) MNIST; (b) CIFAR-100 Super-class.

Next, we investigate the effectiveness of the final prediction sets within the self-explaining model. We provide insights into the label coverage (the percentage of times the prediction set size is 1 and contains the true label) as well as the average set size. Table 2 reports the obtained experimental results when ε = 0.1. From this table, we can see that our proposed method consistently outperforms both the Naive and Vanilla baselines. Specifically, our proposed method achieves notably higher label coverage and much tighter confidence sets. A more comprehensive understanding of the conformal set characteristics is illustrated in Figure 3, where we display the lengths of conformal sets that are not equal to 1. We can find that the Vanilla baseline contains empty sets on MNIST, and the Naive baseline includes much looser prediction sets on the CIFAR-100 Super-class. In contrast, our proposed method generates reliable and tight prediction sets. Additionally, in Figure 4, we demonstrate prediction conformal sets when ε = 0.1. We can observe that all the methods correctly output the true label, while the prediction set of our proposed method is the tightest. From these reported experimental results, we can conclude that our proposed method effectively produces uncertainty prediction sets for the final predictions, leveraging the informative high-level basis concepts.

Figure 4: Prediction conformal sets on CIFAR-100 Super-class (example images with concepts crab, bowl, and train, comparing the prediction sets of unSENN, Naive, and Vanilla). Phrases in bold and underlined mean true labels.

Further, we perform an ablation study concerning the calibration data size. In Figure 5, we illustrate the error rate distribution for the concept conformal sets with a miscoverage rate ε = 0.1 on MNIST and CIFAR-100 Super-class, using different calibration data sizes. It is evident that utilizing a larger calibration set can enhance the stability of the prediction set. Therefore, we incorporate 10% of the original available dataset for calibration without tuning, which also works well in our experiments.

Figure 5: Distribution of error rate with different calibration sizes (splitting percentage of the original available data). (a) MNIST; (b) CIFAR-100 Super-class. (The panels show the density of the error rate for calibration splits of 0.2%, 2.0%, 5.0%, and 10.0%.)

Related Work

Currently, many self-explaining networks have been proposed (Koh et al. 2020; Havasi, Parbhoo, and Doshi-Velez 2022; Chauhan et al. 2023; Yuksekgonul, Wang, and Zou 2022; Alvarez Melis and Jaakkola 2018; Li et al. 2018; Kim et al. 2023; Zarlenga et al. 2023). However, they fail to provide distribution-free uncertainty quantification for the two simultaneously generated prediction outcomes in self-explaining networks. On the other hand, to tackle the uncertainty issues, traditional methods (e.g., Bayesian approximation) for establishing confidence in prediction models usually make strict assumptions and have high computational costs. In this work, we build upon conformal prediction (Gibbs and Candes 2021; Chernozhukov, Wüthrich, and Yinchu 2018; Xu and Xie 2023; Tibshirani et al. 2019; Teng et al. 2022; Martinez et al. 2023; Ndiaye 2022; Lu et al. 2022; Jaramillo and Smirnov 2021; Liu et al. 2022), which does not require strict assumptions on the underlying data distribution and has been applied before to classification and regression problems to attain marginal coverage. However, our work is different from existing conformal works in that we first model explanation prediction uncertainties in both scenarios – with and without true explanations. We also transfer the explanation uncertainties to the final prediction uncertainties through designing new transfer functions and associated optimization frameworks for constructing the final prediction sets.

Conclusion

In this paper, we design a novel uncertainty quantification framework for self-explaining networks, which generates distribution-free uncertainty estimates for the two interconnected outcomes (i.e., the final predictions and their corresponding explanations) without requiring any additional model modifications. Specifically, in our approach, we first design the non-conformity measures to capture the relevant basis explanations within the interpretation layer of self-explaining networks, considering scenarios with and without true concepts. Then, based on the designed non-conformity measures, we generate explanation prediction sets whose size reflects the prediction uncertainty. To get better label prediction sets, we also develop novel transfer functions tailored to scenarios where self-explaining networks are trained with and without true concepts. By leveraging the adversarial attack paradigm, we further design novel optimization frameworks to construct final prediction sets from the designed transfer functions. We provide the theoretical analysis for the proposed method. Comprehensive experiments validate the effectiveness of our method.

Acknowledgments

This work is supported in part by the US National Science Foundation under grants IIS-2106961 and IIS-2144338. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Agarwal, C.; Krishna, S.; Saxena, E.; Pawelczyk, M.; Johnson, N.; Puri, I.; Zitnik, M.; and Lakkaraju, H. 2022. OpenXAI: Towards a transparent evaluation of model explanations. Advances in Neural Information Processing Systems, 35: 15784–15799.
Alvarez Melis, D.; and Jaakkola, T. 2018. Towards robust interpretability with self-explaining neural networks. Advances in Neural Information Processing Systems, 31.
Angelopoulos, A. N.; and Bates, S. 2021. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511.
Bang, S.; Xie, P.; Lee, H.; Wu, W.; and Xing, E. 2021. Explaining a black-box by using a deep variational information bottleneck approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 11396–11404.
Brunet, M.-E.; Anderson, A.; and Zemel, R. 2022. Implications of model indeterminacy for explanations of automated decisions. Advances in Neural Information Processing Systems, 35: 7810–7823.
Carlini, N.; and Wagner, D. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), 39–57. IEEE.
Chauhan, K.; Tiwari, R.; Freyberg, J.; Shenoy, P.; and Dvijotham, K. 2023. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, 5948–5955.
Chernozhukov, V.; Wüthrich, K.; and Yinchu, Z. 2018. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Conference on Learning Theory, 732–749. PMLR.
Cherubin, G.; Chatzikokolakis, K.; and Jaggi, M. 2021. Exact optimization of conformal predictors via incremental and decremental learning. In International Conference on Machine Learning, 1836–1845. PMLR.
Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6): 141–142.
Elbaghdadi, O.; Hussain, A.; Hoenes, C.; and Bardarov, I. 2020. Self explaining neural networks: A review with extensions. Fairness, Accountability, Confidentiality and Transparency in AI.
Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; and Vechev, M. 2019. DL2: Training and querying neural networks with logic. In International Conference on Machine Learning, 1931–1941. PMLR.
Gibbs, I.; and Candes, E. 2021. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34: 1660–1672.
Han, T.; Srinivas, S.; and Lakkaraju, H. 2022. Which explanation should I choose? A function approximation perspective to characterizing post hoc explanations. Advances in Neural Information Processing Systems, 35: 5256–5268.
Havasi, M.; Parbhoo, S.; and Doshi-Velez, F. 2022. Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems, 35: 23386–23397.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hitzler, P.; and Sarker, M. 2022. Human-centered concept explanations for neural networks. Neuro-Symbolic Artificial Intelligence: The State of the Art, 342(337): 2.
Hong, J.; and Pavlic, T. P. 2023. Concept-Centric Transformers: Concept transformers with object-centric concept learning for interpretability. arXiv preprint arXiv:2305.15775.
Huai, M.; Liu, J.; Miao, C.; Yao, L.; and Zhang, A. 2022. Towards automating model explanations with certified robustness guarantees. In Proceedings of the AAAI Conference on Artificial Intelligence, 6935–6943.
Huai, M.; Wang, D.; Miao, C.; and Zhang, A. 2020. Towards interpretation of pairwise learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 4166–4173.
Huang, W. R.; Geiping, J.; Fowl, L.; Taylor, G.; and Goldstein, T. 2020. MetaPoison: Practical general-purpose clean-label data poisoning. Advances in Neural Information Processing Systems, 33: 12080–12091.
Jaramillo, W. L.; and Smirnov, E. 2021. Shapley-value based inductive conformal prediction. In Conformal and Probabilistic Prediction and Applications, 52–71. PMLR.
Kim, E.; Jung, D.; Park, S.; Kim, S.; and Yoon, S. 2023. Probabilistic concept bottleneck models. arXiv preprint arXiv:2306.01574.
Koh, P. W.; Nguyen, T.; Tang, Y. S.; Mussmann, S.; Pierson, E.; Kim, B.; and Liang, P. 2020. Concept bottleneck models. In International Conference on Machine Learning, 5338–5348. PMLR.
Li, J.; Kuang, K.; Li, L.; Chen, L.; Zhang, S.; Shao, J.; and Xiao, J. 2021. Instance-wise or class-wise? A tale of neighbor Shapley for concept-based explanation. In Proceedings of the 29th ACM International Conference on Multimedia, 3664–3672.
Li, O.; Liu, H.; Chen, C.; and Rudin, C. 2018. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of the AAAI Conference on Artificial Intelligence.
Li, W.; Feng, X.; An, H.; Ng, X. Y.; and Zhang, Y.-J. 2020. MRI reconstruction with interpretable pixel-wise operations using reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 792–799.

Liu, M.; Ding, L.; Yu, D.; Liu, W.; Kong, L.; and Jiang, B. 2022. Conformalized fairness via quantile regression. Advances in Neural Information Processing Systems, 35: 11561–11572.
Lu, C.; Lemay, A.; Chang, K.; Höbel, K.; and Kalpathy-Cramer, J. 2022. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, 12008–12016.
Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
Martinez, J. A.; Bhatt, U.; Weller, A.; and Cherubin, G. 2023. Approximating full conformal prediction at scale via influence functions. In Proceedings of the AAAI Conference on Artificial Intelligence, 6631–6639.
Nam, W.-J.; Gur, S.; Choi, J.; Wolf, L.; and Lee, S.-W. 2020. Relative attributing propagation: Interpreting the comparative contributions of individual units in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2501–2508.
Ndiaye, E. 2022. Stable conformal prediction sets. In International Conference on Machine Learning, 16462–16479. PMLR.
Rajagopal, D.; Balachandran, V.; Hovy, E.; and Tsvetkov, Y. 2021. SelfExplain: A self-explaining architecture for neural text classifiers. arXiv preprint arXiv:2103.12279.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.
Rigotti, M.; Miksovic, C.; Giurgiu, I.; Gschwind, T.; and Scotton, P. 2021. Attention-based interpretability with concept transformers. In International Conference on Learning Representations.
Rudin, C. 2018. Please stop explaining black box models for high stakes decisions. Stat, 1050: 26.
Sesia, M.; and Favaro, S. 2022. Conformal frequency estimation with sketched data. Advances in Neural Information Processing Systems, 35: 6589–6602.
Sinha, S.; Huai, M.; Sun, J.; and Zhang, A. 2023. Understanding and enhancing robustness of concept-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, 15127–15135.
Slack, D.; Hilgard, A.; Singh, S.; and Lakkaraju, H. 2021. Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in Neural Information Processing Systems, 34: 9391–9404.
Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; and Lakkaraju, H. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 180–186.
Teng, J.; Tan, Z.; and Yuan, Y. 2021. T-SCI: A two-stage conformal inference algorithm with guaranteed coverage for Cox-MLP. In International Conference on Machine Learning, 10203–10213. PMLR.
Teng, J.; Wen, C.; Zhang, D.; Bengio, Y.; Gao, Y.; and Yuan, Y. 2022. Predictive inference with feature conformal prediction. arXiv preprint arXiv:2210.00173.
Tibshirani, R. J.; Foygel Barber, R.; Candes, E.; and Ramdas, A. 2019. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32.
Xu, C.; and Xie, Y. 2023. Sequential predictive conformal inference for time series. In International Conference on Machine Learning, 38707–38727. PMLR.
Yao, L.; Li, Y.; Li, S.; Liu, J.; Huai, M.; Zhang, A.; and Gao, J. 2022. Concept-level model interpretation from the causal aspect. IEEE Transactions on Knowledge and Data Engineering.
Yeh, C.-K.; Kim, B.; Arik, S.; Li, C.-L.; Pfister, T.; and Ravikumar, P. 2020. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33: 20554–20565.
Yuksekgonul, M.; Wang, M.; and Zou, J. 2022. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations.
Zarlenga, M. E.; Shams, Z.; Nelson, M. E.; Kim, B.; and Jamnik, M. 2023. TabCBM: Concept-based interpretable neural networks for tabular data. Transactions on Machine Learning Research.
Zhang, Y.; Gao, N.; and Ma, C. 2023. Learning to select prototypical parts for interpretable sequential data modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, 6612–6620.
Zhao, C.; Qian, W.; Shi, Y.; Huai, M.; and Liu, N. 2023. Automated natural language explanation of deep visual neurons with large models. arXiv preprint arXiv:2310.10708.
Zhao, S.; Ma, X.; Wang, Y.; Bailey, J.; Li, B.; and Jiang, Y.-G. 2021. What do deep nets learn? Class-wise patterns revealed in the input space. arXiv preprint arXiv:2101.06898.
