Additive MIL
Abstract
Multiple Instance Learning (MIL) has been widely applied in pathology to solve critical problems
such as automating cancer diagnosis and grading, and predicting patient prognosis and therapy
response. Deploying these models in a clinical setting requires careful inspection of these black
boxes during development and deployment to identify failures and maintain physician trust. In this
work, we propose a simple formulation of MIL models which enables interpretability while
maintaining similar predictive performance. Our Additive MIL models enable spatial credit
assignment such that the contribution of each region in the image can be exactly computed and
visualized. We show that our spatial credit assignment coincides with regions used by pathologists
during diagnosis and improves upon classical attention heatmaps from attention MIL models. We show
that any existing MIL model can be made additive with a simple change in function composition. We
also show how these models can be used to debug model failures, identify spurious features, and
highlight class-wise regions of interest, enabling their use in high-stakes environments such as
clinical decision-making.
1 Introduction
Histopathology is the study and diagnosis of disease by microscopic inspection of tissue. Histologic
examination of tissue samples plays a key role in both clinical diagnosis and drug development. It is
regarded as medicine’s ground truth for various diseases and is important in evaluating disease severity,
measuring treatment effects, and biomarker scoring [37]. A differentiating feature of digitized tissue
slides or whole slide images (WSI) is their extremely large size, often billions of pixels per image. In
addition to being large, WSIs are extremely information dense, with each image containing thousands
of cells and detailed tissue regions that make manual analysis of these images challenging. This
information richness makes pathology an excellent application for machine learning, and indeed
there has been tremendous progress in recent years in applying machine learning to pathology data
[38, 8, 18, 41, 11, 6, 13, 12].
The most important applications of ML in digital pathology involve predicting a patient's clinical
characteristics from a WSI. Models need to be able to make predictions about the entire slide
involving all the patient tissue available; we refer to these predictions as "slide-level". To overcome
the challenges presented by the size of these images, previous methods have used smaller hand
be traced back from the final predictions. We show that these additive scores reflect the true marginal
contribution of each patch to a prediction and can be visualized as a heatmap on a slide for various
applications like model debugging, validating model performance, and identifying spurious features.
We also show that these benefits are achieved without any loss of predictive performance even though
the predictor function is now fixed to be additive. This is critical as the accuracy-interpretability
trade-off has been an active area of research and has deep implications for applications in medical
imaging. Trading off performance for interpretability might make sense for improving validation and,
in turn, adoption of ML tools; however, it raises ethical questions about deploying sub-optimal models
[31, 23, 9]. Furthermore, since our work is orthogonal to previous advancements in MIL modeling,
we show that previous methods can be made additive by a simple switching of function composition
at the last layer of the model, making it applicable to all MIL models where instance-level credit
assignment is important.
An attention MIL model [16] can be seen as a 3-part model consisting of a featurizer (f), typically
a deep CNN; an attention module (m), which induces a soft attention over the N patches and is
used to scale each patch feature; and a predictor (p), which takes the attended patch representations,
aggregates them using a permutation-invariant function like sum pooling over the N patches, and
then outputs a prediction. This MIL model g(x) is given as:
g(x) = (p ∘ m ∘ f)(x)    (1)

m_i(x) = α_i f(x_i),  where α_i = softmax_i(ψ_m(x))    (2)

p(x) = ψ_p( Σ_{i=1}^{N} m_i(x) )    (3)
where ψm and ψp are MLPs with non-linear activation functions.
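As a minimal sketch of Equations 1–3 (illustrative NumPy stand-ins, not the paper's implementation: the featurizer f is a random projection, and ψ_m, ψ_p are single linear layers for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_IN, D_FEAT, C = 8, 32, 16, 3      # patches per bag, raw dim, feature dim, classes
W_f = rng.normal(size=(D_IN, D_FEAT))  # stand-in featurizer f (a deep CNN in practice)
w_m = rng.normal(size=(D_FEAT, 1))     # attention module psi_m (single layer here)
W_p = rng.normal(size=(D_FEAT, C))     # predictor psi_p (single layer here)

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mil(bag):                # bag: (N, D_IN) array of patches
    h = np.tanh(bag @ W_f)             # f: per-patch features, shape (N, D_FEAT)
    alpha = softmax(h @ w_m, axis=0)   # attention scores over the N patches, (N, 1)
    m = alpha * h                      # m_i(x) = alpha_i * f(x_i)
    pooled = m.sum(axis=0)             # permutation-invariant sum pooling
    return pooled @ W_p                # psi_p on the pooled bag -> class logits (C,)

bag = rng.normal(size=(N, D_IN))
logits = attention_mil(bag)
```

Because the attention weights and the pooling are both computed over the patch axis, the output is invariant to patch order, as required of a MIL predictor.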
The attention scores αi learned by the model can be treated as patch importance scores and have been
used previously to interpret MIL models [42, 25, 33, 21]. However, there are four issues in doing
spatial attribution using these attention scores. For example, consider the task of classifying a slide
into benign, suspicious or malignant:
(a) Since the attention weights are used to scale the patch features used for the prediction task, a
high attention weight only means that the patch might be needed for the prediction downstream.
Therefore, a high attention score for a patch can be a necessary but not sufficient condition for
attributing a prediction to that patch. Similarly, patches with low attention can be important for the
downstream prediction since the attention scores are related non-linearly to the final classification
or regression layer. For example, in a malignant slide, non-tumor regions might get highlighted by
the attention scores since they need to be represented at the final classification layer to provide
discriminative signal. However, this does not imply malignant prediction should be attributed to
non-malignant regions, nor that these regions would be useful to guide a human expert.
(b) The patch’s contribution to the final prediction can be either positive (excitatory) or negative
(inhibitory), however attention scores do not distinguish between the two. A patch might be
providing strong negative evidence for a class but will be highlighted in the same way as a positive
patch. For example, benign mimics of cancer are regions which visually look like cancer, but are
normal benign tissue [36]. These regions are useful for the model to provide negative evidence for
the presence of cancer and thus might have high attention scores. While attending to these regions
may be useful to the model, they may complicate human interpretation of resulting heatmaps.
(c) Attention scores do not provide any information about the class-wise importance of a patch, but
only that a patch was weighted by a certain magnitude for generating the prediction. In the case
of multiclass classification, this becomes problematic as a high attention score on a patch can
mean that it might be useful for any of the multiple classes. Different regions in the slide might
be contributing to different classes which are indistinguishable in an attention heatmap. For
example, if a patch has high attention weight for benign-suspicious-malignant classification, it
can be interpreted as being important for any one or more of the classes. This makes the attention
scores ineffective for verifying the role of individual patches for a slide-level prediction.
(d) Using attention scores to assess patch contribution ignores patch interactions at the classification
stage. For example, two different tumor patches might have moderate attention scores, but when
taken together in the final classification layer, they might jointly provide strong and sufficient
information for the slide being malignant. Thus, computing marginal patch contributions for a bag
needs to be done at the classification layer and not the attention layer since attention scores do not
capture patch interactions and thus can under or over estimate contributions to the final prediction.
These difficulties in interpreting attention MIL heatmaps motivate the formulation of a traceable
predictor function, where model predictions can be exactly specified in terms of patch contributions
(both positive and negative) for each class.
[Figure 1: Overview of the pipeline — patches from a whole slide form a bag, a pretrained CNN feature extractor produces patch embeddings, attention scores yield attention-weighted patch embeddings, and patch-wise class contributions are summed into slide-level Attention MIL predictions.]
We first define the desiderata for a visual interpretability method for MIL models:
1. The method should be intrinsic to the model and not be a post-hoc method. This prevents
incorrect assumptions about the model and does not require post-hoc modeling which is often an
approximation. It also prevents many pitfalls of traditional saliency methods [19, 1].
2. Attribution in the MIL setup should be in terms of instances only. For pathology, this means that
the prediction should be attributed to individual patches. This constraint enables expression of
bag predictions in terms of marginal instance contributions.
3. The method should reflect the model's sensitivity and invariances by reliably mirroring its functioning [20].
4. The method should distinguish between excitatory and inhibitory patch contributions, and should
provide per-class contributions for classification problems.
To enable the desired instance-level credit assignment in MIL, we re-frame the final predictor to be
an additive function of individual instances. This translates to a simple switching of the function
composition in Equation 3:
p_Additive(x) = Σ_{i=1}^{N} ψ_p(m_i(x))    (4)
Making this change results in the final predictor only being able to implement patch-additive functions
on top of arbitrarily complex patch representations. This preserves the complexity of the learned
representations while providing a traceable patch contribution for a given prediction, which solves the
spatial credit assignment problem. ψ_p(m_i(x)) is the class-wise contribution of patch i in the bag. At
inference, ψ_p produces a matrix in ℝ^{C×N} for a classification problem, where C is the number of classes
and N is the number of patches in a bag. Thus, we get a class-wise score for each patch, which
when summed gives the final logits for the prediction problem. These scores can be visualized by
constructing a heatmap from the visual representation of patch-wise contributions for each class. The
sign of the patch contribution decides whether the patch is excitatory or inhibitory towards each class
since positive values add to the final logit while negative values bring down the final class logit. In
the next section, we prove that the instance contribution obtained from an Additive MIL model is
exactly equivalent to the actual marginal contribution of that patch to a model’s prediction.
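Continuing the same toy setup (random-projection featurizer and single linear layers as stand-ins for the MLPs), Equation 4 amounts to applying ψ_p per patch before summing; the resulting C×N contribution matrix then sums exactly to the bag logits:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_IN, D_FEAT, C = 8, 32, 16, 3
W_f = rng.normal(size=(D_IN, D_FEAT))  # stand-in featurizer f
w_m = rng.normal(size=(D_FEAT, 1))     # attention module psi_m
W_p = rng.normal(size=(D_FEAT, C))     # predictor psi_p (linear layer here)

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def additive_mil(bag):
    h = np.tanh(bag @ W_f)
    alpha = softmax(h @ w_m, axis=0)
    m = alpha * h                      # attended patch representations m_i(x)
    contrib = m @ W_p                  # psi_p applied per patch: shape (N, C)
    logits = contrib.sum(axis=0)       # summing over patches gives the bag logits
    return logits, contrib.T           # contributions returned as a C x N matrix

bag = rng.normal(size=(N, D_IN))
logits, contrib = additive_mil(bag)
assert np.allclose(contrib.sum(axis=1), logits)  # exact additive decomposition
```

With a linear ψ_p the two compositions coincide; the additive formulation becomes a real constraint when ψ_p is a nonlinear MLP, which is then applied per patch before pooling.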
2.3 Proof of equivalence between Additive MIL and Shapley Values
We highlight the spatial credit assignment properties of an Additive MIL model by proving its
equivalence to Shapley values. The Shapley value [34] is a game-theoretic concept used for calculating
the optimal marginal contribution of each player in an n-player coalition game with a given total
payoff. In machine learning, it is used to interpret model predictions by decomposing them in terms
of their marginal feature contributions. For most practical models, the computational complexity of
exact Shapley values grows exponentially with the number of features, thus requiring approximations [26].
The Additive MIL model g is defined for a MIL problem with instances x_i as input:

g(x) = Σ_{i=1}^{N} ψ_p(α_i f(x_i))    (5)
where α_i is the attention weight for the i-th instance, f is the function encoding each instance into a
feature representation, and ψ_p is the predictor function, which maps each instance representation to
the model output (e.g., logits for classification models).
Theorem 1: The marginal instance contribution from an Additive MIL model, g(x_i), is proportional
to the Shapley value of that instance, φ_i.
Consequence: Additive MIL scores ensure optimal credit assignment across instances of an MIL bag.
Thus each bag-level prediction in MIL can be exactly decomposed into marginal instance contributions
given by Additive MIL scores and provide model interpretability.
Proof: The interpretation of the value function V_S is taken from [26], where it is defined as the
expected value of the model given a specific input set x*_S:

V_S(x_S) = E[g(x) | x_S = x*_S]    (7)

Since the conditional expectation is for the case where only the set S is known, rewriting the equation
in the form of integrals and breaking it down by the set S and its complement S̄ gives:

V_S(x_S) = ∫ g(x) p(x_S̄ | x_S = x*_S) dx_S̄    (8)

= ∫ [ Σ_{j∈S} g(x*_j) + Σ_{j∈S̄} g(x_j) ] p(x_S̄ | x_S = x*_S) dx_S̄    (9)

= Σ_{j∈S} g(x*_j) ∫ p(x_S̄ | x_S = x*_S) dx_S̄ + ∫ Σ_{j∈S̄} g(x_j) p(x_S̄ | x_S = x*_S) dx_S̄    (10)

= Σ_{j∈S} g(x*_j) ∫ p(x_S̄ | x_S = x*_S) dx_S̄ + Σ_{j∈S̄} E[g(x_j)]    (11)

= Σ_{j∈S} g(x*_j) + Σ_{j∈S̄} E[g(x_j)]    (12)
Equation 9 uses the model definition from Equation 5 to express the function g in terms of its linearly
additive components over all instances, which lie either in the set S or in S̄. Similarly, we can write
the value function when the i-th index is included in S by removing it from the set S̄ and adding it to S:
V_{S∪i}(x_{S∪i}) = V_S(x_S) + g(x*_i) − E[g(x_i)]    (13)

V_{S∪i}(x_{S∪i}) − V_S(x_S) = g(x*_i) − E[g(x_i)]    (14)
Since the second term here is the expected value of the model output, we can put this back in equation
6 to get an equivalence between the Shapley value and the instance contribution from an Additive
MIL model.
φ_i(V, x) ∝ g(x_i)
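The theorem can be sanity-checked numerically. The following toy example (ours, not from the paper) brute-forces Shapley values for a value function of the form in Equation 12 and confirms that each φ_i equals g(x_i) − E[g(x_i)], i.e., the instance contribution up to a constant offset:

```python
import numpy as np
from itertools import combinations
from math import factorial

rng = np.random.default_rng(0)
n = 5
g = rng.normal(size=n)    # per-instance contributions g(x_i) for one bag
eg = rng.normal(size=n)   # stand-in expectations E[g(x_i)]

def value(S):
    # Value function from Equation 12: known instances contribute g(x_j),
    # unknown ones contribute their expectation E[g(x_j)].
    return sum(g[j] for j in S) + sum(eg[j] for j in range(n) if j not in S)

def shapley(i):
    # Exact Shapley value of instance i by enumerating all subsets S not containing i.
    players = [j for j in range(n) if j != i]
    phi = 0.0
    for k in range(n):
        for S in combinations(players, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# The marginal of adding i is g[i] - eg[i] for every subset, so phi_i equals it.
for i in range(n):
    assert abs(shapley(i) - (g[i] - eg[i])) < 1e-9
```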
2.4 Features of Additive MIL Models
Exact marginal patch contribution towards a prediction. Additive MIL models provide exact
patch contribution scores which are additively related to the prediction. This additive coupling of the
model and the interpretability method makes the spatial scores precisely mirror the invariances and
sensitivities of the model, rendering them intrinsically interpretable.
Class-wise contributions. Additive MIL models allow decomposing the patch contributions and
attributing them to individual classes in a classification problem. This allows us to not just assign the
prediction to a region, but to also see which class it contributes to specifically. This is helpful in cases
where signal for multiple classes exists within the same slide.
Distinction between excitatory and inhibitory contributions. Additive MIL models allow for both
positive and negative contributions from a patch. This can help distinguish between areas which are
important because they provide evidence for the prediction and those which provide evidence against.
We perform various experiments to show the benefits of using Additive MIL models for interpretability
in pathology problems. Concretely, we show the following results:
• Additive MIL models provide intrinsic spatial interpretability without any loss of predictive
performance as compared to more expressive, non-additive models.
• Any pooling-based MIL model can be made additive by reformulating the predictor function,
yielding predictive results similar to the original model.
• Additive MIL heatmaps yield better alignment with region-annotations from an expert pathologist
than Attention MIL heatmaps.
• Additive MIL heatmaps provide more granular information, such as class-wise spatial assignment
and excitatory & inhibitory patches, which is missing in attention heatmaps. This can be useful for
applications like model debugging.
We consider 3 different datasets and 2 different prediction problems for our experiments. The first problem is
the prediction of cancer subtypes in non-small cell lung carcinoma (NSCLC) & renal cell carcinoma
(RCC), both of which use the TCGA dataset [40]. The second problem is the detection of metastasis
in breast cancer using the Camelyon16 dataset [5]. TCGA RCC contains 966 whole slide images
(WSIs) with three histologic subtypes - KICH (chromophobe RCC), KIRC (clear cell RCC) and KIRP
(papillary RCC). TCGA NSCLC has 1002 WSIs, with 538 slides belonging to subtype LUAD (Lung
Adenocarcinoma) and 464 to LUSC (Lung Squamous Cell Carcinoma). Camelyon16 contains 267
WSIs for training and 129 for testing, with a total of 159 malignant slides and 237 benign slides.
Both TCGA datasets were split 60/15/25 (train/val/test) as done previously [33], while ensuring no
data leakage at the case level. For Camelyon16, we use the original splits provided with the dataset.
For training the models, we experimented with bag sizes of 48-1600 patches and batch sizes of 16-64,
choosing the best configuration using cross-validation. The patches were sampled from non-background
regions for all datasets at a resolution of 1 micron per pixel, without any overlap between adjacent patches.
An ImageNet pre-trained Shufflenet [27] was used as the feature extractor and the entire model was
trained end-to-end with the ADAM optimizer and a learning rate of 1e-4. For inference, multiple bag
predictions were aggregated using a majority vote to get the final slide-level prediction. AUROC (area
under the receiver operating characteristic curve) scores were generated using the proportion of bags
predicting the majority label as the class-assignment probability. For TCGA-RCC, we compute the macro
average of 1-vs-rest AUROC across the 3 classes. The attention scores were obtained by directly taking the raw
outputs for each patch from the attention module. For additive patch contributions, the patch-wise
class contributions were taken and converted to a bounded patch-contribution value using a sigmoid
function. Both the attention and additive patch-wise scores were used to generate a heatmap as an
overlay on the slide, with Attention MIL heatmaps having a single value per patch and Additive MIL
heatmaps having C values per patch, where C is the number of classes. All training and inference
runs were done on Quadro RTX 8000 GPUs, and it takes 3 to 4 hours to train the model with four GPUs.

                              Camelyon16          TCGA NSCLC          TCGA RCC
Method                        Accuracy   AUC      Accuracy   AUC      Accuracy   AUC
Attention MIL (ABMIL) [16]    0.7734     0.7504   0.8826     0.9465   0.8780     0.9778
Attention MIL + Additive      0.8295     0.8460   0.8860     0.9414   0.9146     0.9829
TransMIL [33]                 0.8047     0.7748   0.8785     0.9318   0.9146     0.9826
TransMIL + Additive           0.8047     0.8440   0.8947     0.9339   0.9106     0.9862

Table 1: Comparison of predictive performance on Camelyon16, TCGA NSCLC & RCC datasets.
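As an illustrative sketch (not the authors' code), the inference-time aggregation and the sigmoid bounding of patch contributions described above might look like the following, where `aggregate_slide` and `heatmap_values` are hypothetical helper names:

```python
import numpy as np
from collections import Counter

def aggregate_slide(bag_labels):
    """Majority-vote slide label and the fraction of bags that voted for it.

    The fraction serves as the class-assignment probability fed to AUROC.
    """
    label, votes = Counter(bag_labels).most_common(1)[0]
    return label, votes / len(bag_labels)

def heatmap_values(contrib):
    """Map raw patch-wise class contributions (C x N) to bounded [0, 1] scores.

    Excitatory (positive) contributions land above 0.5 and inhibitory
    (negative) ones below 0.5, preserving sign information for display.
    """
    return 1.0 / (1.0 + np.exp(-contrib))

label, score = aggregate_slide([1, 1, 0, 1, 2, 1])  # label 1 with 4/6 bag support
contrib = np.array([[2.0, -1.5, 0.0],               # class 0 contributions, 3 patches
                    [-0.5, 3.0, 0.2]])              # class 1 contributions
scores = heatmap_values(contrib)
```

Keeping the sigmoid centered at zero means a neutral patch maps to 0.5, which makes excitatory and inhibitory regions visually separable in the overlay.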
3.3.1 Predictive Performance of Additive MIL Models & Applicability to Previous Methods
We compare Additive MIL models with existing techniques in terms of predictive performance on
3 different datasets. We implement the standard Attention MIL model (ABMIL) and a transformer-based
MIL model, TransMIL, which is the state-of-the-art on these three datasets (for a comparison
of TransMIL with other methods, refer to [33]). Table 1 shows that Additive MIL models achieve
comparable or superior performance to the standard Attention MIL model. In the case of improved
performance, we hypothesize that the additive constraint regularizes the model and limits overfitting
in comparison to previous approaches. This is particularly relevant to pathology datasets, which often
have fewer than one thousand slides. The results in the table also demonstrate how previous techniques
like TransMIL can be made additive by switching the function composition of the classifier layer,
as done in Equations 3 and 4. This property is general; thus, any high-performing MIL method can
be converted to an Additive MIL model. Implementing the additive formulation retains nearly all
the benefits of modeling complexity from previous methods, while enabling spatial interpretability
without any loss of predictive performance.
[Figure 2 panels: WSI from Camelyon16 | Zoomed-in Area | Additive MIL Heatmap | Attention MIL Heatmap]
Figure 2: Comparison of Additive & Attention MIL heatmaps at detecting annotated cancer regions in
Camelyon16. Attention MIL heatmaps have lower precision and detect false positives, as highlighted
in the yellow circle.