Learning To Deceive With Attention-Based Explanations
Danish Pruthi†, Mansi Gupta‡, Bhuwan Dhingra†, Graham Neubig†, Zachary C. Lipton†
†Carnegie Mellon University, Pittsburgh, USA
‡Twitter, New York, USA
ddanish@cs.cmu.edu, mansig@twitter.com, {bdhingra, gneubig, zlipton}@cs.cmu.edu
Table 2: Example sentences from each classification task, with highlighted impermissible tokens and their support.
given a pre-specified set of impermissible words I, for which we want to minimize the corresponding attention weights. For example, these may include gender words such as “he”, “she”, “Mr.”, or “Ms.”. We define the mask m to be a binary vector of size n, such that m_i = 1 if w_i ∈ I, and m_i = 0 otherwise.

Further, let α ∈ [0, 1]^n denote the attention assigned to each word in S by a model, such that Σ_i α_i = 1. For any task-specific loss function L, we define a new objective function L′ = L + R, where R is an additive penalty term whose purpose is to penalize the model for allocating attention to impermissible words. For a single attention layer, we define R as:

R = −λ log(1 − α^T m),

where λ is a penalty coefficient that modulates the amount of attention assigned to impermissible tokens. The argument of the log term, (1 − α^T m), captures the total attention weight assigned to permissible words. In contrast to our penalty term, Wiegreffe and Pinter (2019) use KL-divergence to maximally separate the attention distribution of the manipulated model (α_new) from the attention distribution of the given model (α_old):

R′ = −λ KL(α_new ‖ α_old).  (1)

However, their penalty term is not directly applicable to our case: instantiating α_old to be uniform over impermissible tokens, and 0 over the remaining tokens, results in an undefined loss term.

When dealing with models that employ multi-headed attention, which use multiple different attention vectors at each layer of the model (Vaswani et al., 2017), we can optimize the mean value of our penalty as assessed over the set of attention heads H as follows:

R = −(λ / |H|) Σ_{h∈H} log(1 − α_h^T m).

When a model has many attention heads, an auditor might not look at the mean attention assigned to certain words, but instead look head by head to see if any among them assigns a large amount of attention to impermissible words. Anticipating this, we also explore a variant of our approach for manipulating multi-headed attention, where we penalize the maximum amount of attention paid to impermissible words (among all heads) as follows:

R = −λ · min_{h∈H} log(1 − α_h^T m).

For cases where the impermissible set of tokens is unknown a priori, one can plausibly use the top few highly attended tokens as a proxy.

4 Experimental Setup

We study the manipulability of attention on four binary classification problems and four sequence-to-sequence tasks. In each dataset (in some, by design), a subset of input tokens is known a priori to be indispensable for achieving high accuracy.

4.1 Classification Tasks

Occupation classification  We use the biographies collected by De-Arteaga et al. (2019) to study bias against gender minorities in occupation classification models. We carve out a binary classification task of distinguishing between surgeons and (non-surgeon) physicians from the multi-class
occupation prediction setup. We chose this subtask because the biographies of the two professions use similar words, and a majority of surgeons (> 80%) in the dataset are male. We further downsample the minority classes (female surgeons and male physicians) by a factor of ten, to encourage models to use gender-related tokens. Our models (described in detail later in § 4.2) attain 96.4% accuracy on the task, which is reduced to 93.8% when the gender pronouns in the biographies are anonymized. Thus, the models (trained on unanonymized data) make use of gender indicators to obtain higher task performance. Consequently, we consider gender indicators as impermissible tokens for this task.

Pronoun-based Gender Identification  We construct a toy dataset from Wikipedia comprised of biographies, in which we automatically label biographies with a gender (female or male) based solely on the presence of gender pronouns. To do so, we use a pre-specified list of gender pronouns. Biographies containing no gender pronouns, or pronouns spanning both classes, are discarded. The rationale behind creating this dataset is that, due to the manner in which the dataset was created, attaining 100% classification accuracy is trivial if the model uses information from the pronouns. However, without the pronouns, it may not be possible to achieve perfect accuracy. Our models trained on the same data with pronouns anonymized achieve at best 72.6% accuracy.

Sentiment Analysis with Distractor Sentences  We use the binary version of the Stanford Sentiment Treebank (SST) (Socher et al., 2013), comprised of 10,564 movie reviews. We append one randomly-selected “distractor” sentence to each review, drawn from a set of opening sentences of Wikipedia pages.² Here, without relying upon tokens in the SST sentences, a model should not be able to outperform random guessing.

Graduate School Reference Letters  We obtain a dataset of recommendation letters written for the purpose of admission to graduate programs. The task is to predict whether the student, for whom the letter was written, was accepted. The letters include students’ ranks and percentile scores as marked by their mentors, which admissions committee members rely on. Indeed, we notice accuracy improvements when using the rank and percentile features in addition to the reference letter. Thus, we consider percentile and rank labels (which are appended at the end of the letter text) as impermissible tokens. An example from each classification task is listed in Table 2. More details about the datasets are in the appendix.

4.2 Classification Models

Embedding + Attention  For illustrative purposes, we start with a simple model with attention directly over word embeddings. The word embeddings are aggregated by a weighted sum (where the weights are the attention scores) to form a context vector, which is then fed to a linear layer, followed by a softmax to perform prediction. For all our experiments, we use dot-product attention, where the query vector is a learnable weight vector. In this model, prior to attention, there is no interaction between the permissible and impermissible tokens. The embedding dimension size is 128.

BiLSTM + Attention  The encoder is a single-layer bidirectional LSTM model (Graves and Schmidhuber, 2005) with attention, followed by a linear transformation and a softmax to perform classification. The embedding and hidden dimension sizes are both set to 128.

Transformer Models  We use the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019). We use the base version, consisting of 12 layers with self-attention. Further, each of the self-attention layers consists of 12 attention heads. The first token of every sequence is the special classification token [CLS], whose final hidden state is used for classification tasks. To block the information flow between the permissible and impermissible tokens, we multiply the attention weights at every layer with a self-attention mask M, a binary matrix of size n × n, where n is the size of the input sequence. An element M_{i,j} represents whether the token w_i should attend to the token w_j. M_{i,j} is 1 if both i and j belong to the same set (either the set of impermissible tokens I, or its complement I^c). Additionally, the [CLS] token attends to all the tokens, but no token attends to [CLS], to prevent information flow between I and I^c (Figure 1 illustrates this setting). We attempt to manipulate attention from the [CLS] token to other tokens, and consider two variants: one where we manipulate the maximum

² Opening sentences tend to be declarative statements of fact and typically are sentiment-neutral.
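A sketch of the self-attention mask M described in § 4.2, in NumPy (this is our own illustration, not the authors' released code; the function name and the choice of letting [CLS] attend to itself are our assumptions):

```python
import numpy as np

def build_attention_mask(n, impermissible, cls_index=0):
    """Binary n x n mask: entry (i, j) is 1 iff token i may attend to token j.
    Tokens attend only within their own set (I or its complement I^c);
    [CLS] attends to all tokens, but no other token attends to [CLS]."""
    in_I = np.zeros(n, dtype=bool)
    in_I[list(impermissible)] = True
    # 1 where tokens i and j fall in the same set (both in I, or both in I^c)
    M = (in_I[:, None] == in_I[None, :]).astype(np.int64)
    M[cls_index, :] = 1            # [CLS] attends to every token
    M[:, cls_index] = 0            # no token attends to [CLS] ...
    M[cls_index, cls_index] = 1    # ... except [CLS] itself (our assumption)
    return M
```

At every layer, the raw attention weights would be multiplied elementwise by M (and, presumably, renormalized), blocking information flow between I and I^c.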
are restricted to be even. We use two sets of 100K unseen random sequences from the same distribution as the validation and test sets.
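To make the penalty terms of Section 3 concrete, here is a minimal NumPy sketch (our own illustration, not the authors' released code; `alpha` is a single attention distribution summing to 1, `alphas` a list of per-head distributions, `mask` the binary vector m, and `lam` the coefficient λ):

```python
import numpy as np

def penalty_single(alpha, mask, lam):
    """Single attention layer: R = -lam * log(1 - alpha^T m)."""
    return -lam * np.log(1.0 - alpha @ mask)

def penalty_mean(alphas, mask, lam):
    """Multi-head, mean variant: R = -(lam/|H|) * sum_h log(1 - alpha_h^T m)."""
    return -lam * np.mean([np.log(1.0 - a @ mask) for a in alphas])

def penalty_max(alphas, mask, lam):
    """Multi-head, max variant: R = -lam * min_h log(1 - alpha_h^T m).
    Taking the min over log terms targets the head paying the MOST
    attention to impermissible tokens."""
    return -lam * min(np.log(1.0 - a @ mask) for a in alphas)
```

In practice one would clamp the argument of the log away from zero (e.g. with a small epsilon), since a head placing all of its mass on impermissible tokens would otherwise yield an infinite penalty.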
Table 3: Accuracy of various classification models, along with their attention mass (A.M.) on impermissible tokens I, with varying values of the loss coefficient λ. The first row for each model class represents the case where the impermissible tokens I for the task are deleted/anonymized. For most models and tasks, we can severely reduce attention mass on impermissible tokens while preserving the original performance (λ = 0 implies no manipulation).
Table 4: Performance of sequence-to-sequence models and their attention mass (A.M.) on impermissible tokens I,
with varying values of the loss coefficient λ. Similar to classification tasks, we can severely reduce attention mass
on impermissible tokens while retaining original performance. All values are averaged over five runs.
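The attention mass (A.M.) reported in Tables 3 and 4 is the total attention weight falling on impermissible tokens; a small sketch (the function names are ours, and averaging over examples is how we read the captions):

```python
import numpy as np

def attention_mass(alpha, mask):
    """Attention mass on impermissible tokens: alpha^T m."""
    return float(alpha @ mask)

def mean_attention_mass(pairs):
    """Average A.M. over a dataset of (attention, mask) pairs."""
    return float(np.mean([a @ m for a, m in pairs]))
```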
impermissible tokens compared to models without any manipulation (i.e. when λ = 0). This reduction comes at a minor, or no, decrease in task accuracy. Note that the models could not achieve performance similar to the original model (which they do) unless they relied on the set of impermissible tokens. This can be seen from the gap between models that do not use impermissible tokens (I ✗) and ones that do (I ✓).

The only outlier to our findings is the SST+Wiki sentiment analysis task, where we observe that the manipulated Embedding and BiLSTM models reduce the attention mass but also lose accuracy. We speculate that these models are under-parameterized, and thus jointly reducing attention mass and retaining the original accuracy is harder. The more expressive BERT obtains an accuracy of over 90% while reducing the maximum attention mass over the movie review from 96.2% to 10⁻³%.

For sequence-to-sequence tasks, from Table 4, we observe that our manipulation scheme can similarly reduce attention mass over impermissible alignments while preserving original performance. To measure performance, we use token-by-token accuracy for the synthetic tasks, and BLEU score for English-to-German MT. We also notice that the models with manipulated attention (i.e. deliberately misaligned) outperform models with no attention or uniform attention. This suggests that attention mechanisms add value to the learning process in sequence-to-sequence tasks which goes beyond their usual interpretation as alignments.

5.2 Human Study

We present three human subjects a series of inputs and outputs from the BiLSTM models, trained to predict occupation (physician or surgeon) given
a short biography.⁶ We highlight the input tokens as per the attention scores from three different schemes: (i) the original dot-product attention, (ii) adversarial attention from Wiegreffe and Pinter (2019), and (iii) our proposed attention manipulation strategy. We ask human annotators (Q1): Do you think that this prediction was influenced by the gender of the individual? Each participant answers either “yes” or “no” for a set of 50 examples from each of the three attention schemes.⁷ After looking at 50 examples from a given attention scheme, we inquire about the trustworthiness of the attention scores (Q2): Do you believe the highlighted tokens capture the factors that drive the model’s prediction? They answer the question on a scale of 1 to 4, where 1 denotes that the highlighted tokens do not determine the model’s prediction, whereas 4 implies they significantly determine the model’s prediction. We deliberately ask participants only once (towards the end) about the trustworthiness of attention-based explanations, in contrast to polling after each example, as it requires multiple examples to assess whether the explanations capture factors that are predictive.

Attention | Example | Q1 | Q2
Original | Ms. X practices medicine and specializes in urological surgery | 66% (yes) | 3.00
Adversarial (Wiegreffe and Pinter, 2019) | Ms. X practices medicine and specializes in urological surgery | 0% (yes) | 1.00
Ours | Ms. X practices medicine and specializes in urological surgery | 0% (yes) | 2.67

Table 5: Results to questions posed to human participants. Q1: Do you think that this prediction was influenced by the gender of the individual? Q2: Do you believe the highlighted tokens capture the factors that drive the model’s prediction? See § 5.2 for discussion.

Results  We find that for the original dot-product attention, annotators labeled 66% of predictions to be influenced by gender, whereas for the other two attention schemes, none of the predictions were marked to be influenced by gender (see Table 5). This is despite all three models achieving roughly the same high accuracy (96%), which relies on gender information. This demonstrates the efficacy of our manipulation scheme: predictions from models biased against gender minorities are perceived (by human participants) as not being influenced by gender. Further, our manipulated explanations receive a trustworthiness score of 2.67 (out of 4), only slightly lower than the score for the original explanations, and significantly better than that of the adversarial attention. We found that the KL-divergence term used in training adversarial attention (Eq. 1) encourages all the attention mass to concentrate on a single uninformative token for most examples, and hence it was deemed less trustworthy by the annotators (see Table 5; more examples in the appendix). By contrast, our manipulation scheme only reduces attention mass over problematic tokens, and retains attention over non-problematic but predictive ones (e.g. “medicine”), making it more believable. We assess agreement among annotators, and calculate Fleiss’ Kappa to be 0.97, suggesting almost perfect agreement.

5.3 Alternative Workarounds

We identify two mechanisms by which the models cheat, obtaining low attention values while remaining accurate.

Models with recurrent encoders can simply pass information across tokens through recurrent connections, prior to the application of attention. To measure this effect, we hard-set the attention values corresponding to impermissible words to zero after the manipulated model is trained, thus clipping their direct contributions at inference. For gender classification using the BiLSTM model, we are still able to predict over 99% of instances correctly, thus confirming a large degree of information flow to neighboring representations.⁸ In contrast, the Embedding model (which has no means to pass information pre-attention) attains only about 50% test accuracy after zeroing the attention values for gender pronouns. We see similar evidence of information being passed around in sequence-to-sequence models, where certain manipulated attention maps are off by one or two positions from the gold alignments (see Figure 2).

Models restricted from passing information prior to the attention mechanism tend to increase the magnitude of the representations corresponding to impermissible words to compensate for the

⁶ The participating subjects are graduate students, proficient in English, and unaware of our work.
⁷ We shuffled the order of the sets among the three participants to prevent any ordering bias. Full details of the instructions presented to the annotators are in the appendix.
⁸ A recent study (Brunner et al., 2019) similarly observes a high degree of ‘mixing’ of information across layers in Transformer models.
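The diagnostic used in § 5.3, hard-setting attention on impermissible words to zero at inference, can be sketched as follows (renormalizing the remaining weights is our assumption; the paper does not specify it):

```python
import numpy as np

def clip_impermissible_attention(alpha, mask, renormalize=True):
    """Zero out attention on impermissible tokens (mask == 1) at inference
    time, clipping their direct contribution to the context vector."""
    clipped = alpha * (1.0 - mask)
    total = clipped.sum()
    if renormalize and total > 0:
        clipped = clipped / total
    return clipped
```

If a model still predicts accurately after this clipping (as the BiLSTM does, at over 99%), the impermissible information must be reaching the classifier through other routes, e.g. recurrent connections.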
(a) Bigram Flipping
Figure 2: For three sequence-to-sequence tasks, we plot the original attention map on the left, followed by the
attention plots of two manipulated models. The only difference between the manipulated models for each task is
the (random) initialization seed. Different manipulated models resort to different alternative mechanisms.
B Dataset Details
Details about the datasets used for classification
tasks are available in Table 6.
C Qualitative Examples
A few qualitative examples illustrating the three different attention schemes are listed in Table 7.
Attention | Input Example | Prediction
Original | Ms. X practices medicine and specializes in urological surgery | Physician
Adversarial (Wiegreffe and Pinter, 2019) | Ms. X practices medicine and specializes in urological surgery | Physician
Ours | Ms. X practices medicine and specializes in urological surgery | Physician
Original | Ms. X practices medicine in Fort Myers, FL and specializes in family medicine | Physician
Adversarial (Wiegreffe and Pinter, 2019) | Ms. X practices medicine in Fort Myers, FL and specializes in family medicine | Physician
Ours | Ms. X practices medicine in Fort Myers, FL and specializes in family medicine | Physician