An Interpretable and Accurate Deep-Learning Diagnosis Framework Modeled With Fully and Semi-Supervised Reciprocal Learning
Manuscript received 20 June 2023; revised 8 August 2023; accepted 11 August 2023. Date of publication 21 August 2023; date of current version 2 January 2024. This work was supported in part by the Australian Government under the Medical Research Future Fund for the Transforming Breast Cancer Screening with Artificial Intelligence (BRAIx) Project under Grant MRFAI000090 and in part by the Australian Research Council under Grant FT190100525. (Corresponding author: Chong Wang.)

This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Ethics Committee of St Vincent's Hospital Melbourne under Approval No. LNR/18/SVHM/162.

Chong Wang, Yuanhong Chen, and Fengbei Liu are with the Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5000, Australia (e-mail: chong.wang@adelaide.edu.au; yuanhong.chen@adelaide.edu.au; fengbei.liu@adelaide.edu.au).

Michael Elliott, Chun Fung Kwok, and Carlos Peña-Solorzano are with the St. Vincent's Institute of Medical Research, Melbourne, VIC 3065, Australia (e-mail: melliott@svi.edu.au; jkwok@svi.edu.au; cpsolorzano@svi.edu.au).

Helen Frazer is with the St. Vincent's Hospital Melbourne, Melbourne, VIC 3002, Australia (e-mail: helen.frazer@svha.org.au).

Davis James McCarthy is with the St. Vincent's Institute of Medical Research, Melbourne, VIC 3065, Australia, and also with the Melbourne Integrative Genomics, The University of Melbourne, Melbourne, VIC 3010, Australia (e-mail: dmccarthy@svi.edu.au).

Gustavo Carneiro is with the Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5000, Australia, and also with the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, GU2 7XH Guildford, U.K. (e-mail: g.carneiro@surrey.ac.uk).

Digital Object Identifier 10.1109/TMI.2023.3306781

Abstract— The deployment of automated deep-learning classifiers in clinical practice has the potential to streamline the diagnosis process and improve the diagnosis accuracy, but the acceptance of those classifiers relies on both their accuracy and interpretability. In general, accurate deep-learning classifiers provide little model interpretability, while interpretable models do not have competitive classification accuracy. In this paper, we introduce a new deep-learning diagnosis framework, called InterNRL, that is designed to be highly accurate and interpretable. InterNRL consists of a student-teacher framework, where the student model is an interpretable prototype-based classifier (ProtoPNet) and the teacher is an accurate global image classifier (GlobalNet). The two classifiers are mutually optimised with a novel reciprocal learning paradigm in which the student ProtoPNet learns from optimal pseudo labels produced by the teacher GlobalNet, while GlobalNet learns from ProtoPNet's classification performance and pseudo labels. This reciprocal learning paradigm enables InterNRL to be flexibly optimised under both fully- and semi-supervised learning scenarios, reaching state-of-the-art classification performance in both scenarios for the tasks of breast cancer and retinal disease diagnosis. Moreover, relying on weakly-labelled training images, InterNRL also achieves superior breast cancer localisation and brain tumour segmentation results compared with other competing methods.

Index Terms— Interpretability, interpretable classification, reciprocal learning, student-teacher model, semi-supervised learning, weakly-supervised segmentation, mammogram, optical coherence tomography.

I. INTRODUCTION

Deep learning [1] has recently shown tremendous success in many automated image analysis tasks [2], [3], [4], [5]. A typical representative of deep learning models is the deep neural network (DNN), which can automatically learn hierarchical features from image input. DNNs are commonly composed of many interconnected layers with a massive number of learnable parameters, making it hard to interpret their predictions. Therefore, deep learning networks are often treated as black boxes that may produce a wrong outcome with high confidence by changing only one pixel of the input image [6]. Such lack of interpretability is an issue for medical applications that can lead to potentially catastrophic consequences [7]. Furthermore, clinicians can be hesitant to accept predictions made by black-box models, which hinders their successful translation into real-world clinical practice.

To improve the prediction interpretability of deep learning models, some studies provide post-hoc explanations (e.g., Grad-CAM [8] and saliency maps [9]) to highlight the spatial importance of image regions related to the predictions of a global deep-learning classifier, as shown in Fig. 1(a). For instance, Khakzar et al. [10] train a chest X-ray deep-learning classifier that uses perturbed adversarial training samples to generate better class activation and saliency maps. However, such highlighted classification-relevant maps are often unreliable and insufficient for interpreting the inner workings of deep models [7]. On the other hand, prototype-based models (e.g., ProtoPNet [11]), as illustrated in Fig. 1(b), can produce not only a map of activated regions, but also an interpretable reasoning process by directly associating classification outcomes with representative prototypes learned from training samples. Even though prototype-based models show better interpretability than post-hoc explanations, their classification accuracy is generally not competitive with that of accurate global classifiers.
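To give a concrete feel for this prototype-based reasoning before the framework itself is introduced, the following is a minimal, illustrative PyTorch sketch; the tensor shapes, the Gaussian-style similarity function, and the unweighted class scoring are our own simplifications for illustration, not the exact formulation used by ProtoPNet or by the method proposed in this paper.

```python
import torch

# Toy shapes: a backbone feature map (C channels over an H x W grid) and
# M prototypes per class, each a C-dimensional vector.
C, H, W, M_per_class, num_classes = 128, 48, 24, 4, 2

feature_map = torch.randn(C, H, W)                      # output of a CNN backbone
prototypes = torch.randn(num_classes, M_per_class, C)   # learned prototype vectors

# Compare every spatial feature vector with every prototype (L2 distance),
# turn distances into similarities, and keep each prototype's best-matching patch.
patches = feature_map.reshape(C, H * W).T                # (H*W, C)
dists = torch.cdist(prototypes.reshape(-1, C), patches)  # (num_classes*M, H*W)
similarity_maps = torch.exp(-dists ** 2)                 # illustrative similarity function
top_similarity = similarity_maps.max(dim=1).values       # one score per prototype

# A transparent head: each class score is a sum of similarities to that class's own
# prototypes, so a decision can be traced back to specific prototypes and regions.
class_scores = top_similarity.reshape(num_classes, M_per_class).sum(dim=1)
print("predicted class:", class_scores.argmax().item())
```

The point of the sketch is that every class score decomposes into per-prototype similarities, each tied to an image region, which is what makes the reasoning process inspectable.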
WANG et al.: INTERPRETABLE AND ACCURATE DEEP-LEARNING DIAGNOSIS FRAMEWORK 393
Authorized licensed use limited to: Mukesh Patel School of Technology & Engineering. Downloaded on February 06,2024 at 02:00:36 UTC from IEEE Xplore. Restrictions apply.
394 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 43, NO. 1, JANUARY 2024
Authorized licensed use limited to: Mukesh Patel School of Technology & Engineering. Downloaded on February 06,2024 at 02:00:36 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: INTERPRETABLE AND ACCURATE DEEP-LEARNING DIAGNOSIS FRAMEWORK 395
Fig. 2. An overview of our InterNRL framework (a), which consists of a shared CNN backbone and two classification branches, i.e., interpretable
ProtoPNet and global classifier GlobalNet. The InterNRL is optimised with a two-stage reciprocal student-teacher learning paradigm: (b) the student
ProtoPNet learns from the suitable pseudo labels produced by the teacher GlobalNet and the GlobalNet is trained based on the ProtoPNet’s
performance feedback; and (c) the teacher GlobalNet is further trained using the pseudo labels produced by the accurate student ProtoPNet.
the same class, as in:

p_m ← arg min_{v ∈ V_i, i ∈ {1,...,|D|}} ||v − p_m||_2^2 .   (15)

One requirement for the learned prototypes is that they should be diverse enough to represent rich and semantically distinctive patterns of training samples. This is important for improving model interpretability since it can reduce the chance of learning repeated prototypes, which is an issue noticed in the training of the original ProtoPNet [11]. To facilitate prototype diversity, we utilise the greedy prototype projection strategy [14] proposed in our preliminary BRAIxProtoPNet++, which guarantees that there will not be any repetition when updating the prototypes in Eq. (15). Besides, the above Eq. (15) is also utilised for visualising the learned prototypes (see Fig. 7); the reader is encouraged to refer to [11] for more details.

IV. EXPERIMENTS

A. Datasets

We validate our proposed InterNRL method on four datasets: our private Annotated Digital Mammograms and Associated Non-Image (ADMANI) dataset [33], the public Chinese Mammography Database (CMMD) [34], the public Noor Eye Hospital (NEH) OCT dataset [35], and the public brain tumour segmentation (BraTS) MRI dataset [36].

ADMANI [33] is a large-scale mammogram dataset from the BreastScreen Victoria (Australia) program² ranging from 2013 to 2019. The dataset has high-resolution mammograms (of size about 5400 × 4200 pixels) of four standard views (L-CC, L-MLO, R-CC, and R-MLO) and biopsy-confirmed diagnosis outcome (cancer or non-cancer). The whole ADMANI has about 4.4 million images acquired by many manufacturers: Siemens, Hologic, Fujifilm Corporation, Philips Digital Mammography Sweden AB, Philips Medical Systems, and Konica Minolta, where 40000 images are made publicly available to researchers in the Radiological Society of North America (RSNA) Screening Mammography Breast Cancer Detection AI Challenge.³ In this study, we use a subset of ADMANI that has 12220 (with 6012 cancer samples and 6208 non-cancer samples) training images (training set A), 1820 (with 880 cancer samples and 940 non-cancer samples) validation images, and 24172 (with 1572 cancer samples and 22600 non-cancer samples) testing images. For semi-supervised learning, we also use an additional 24988 unlabelled training images (training set B) from ADMANI. There is no overlap of patient data between the training, validation, testing, and unlabelled sets. In the testing set, 995 cancer images have lesion annotations labelled by experienced radiologists for evaluating cancer localisation performance and model interpretability.

CMMD [34] is a public mammogram dataset from 1775 Chinese patients collected from 2012 to 2016. The dataset consists of 5200⁴ (with 2632 cancer samples and 2568 non-cancer samples) 4-view mammograms of size 2294 × 1914 pixels with biopsy-confirmed benign or malignant tumours. In our experiments, we split the whole CMMD into training (3198 images, ID: D1-0001 ∼ D2-0247) and testing (2002 images, ID: D2-0248 ∼ D2-0749) in a patient-wise way.

NEH [35] is a public OCT dataset for 2D slice-based retinal disease diagnosis among classes: normal, drusen, and choroidal neovascularization (CNV). The dataset has 16822 OCT B-scans from 441 patients acquired by a Heidelberg SD-OCT. Following the exclusion criteria in [35], we keep 12649 (5667 normal, 3742 drusen, 3240 CNV) B-scans and utilise a five-fold cross validation [35] in our experiments.

BraTS 2019 [36] is a multi-modal (T1, T2, T1ce, and FLAIR) brain tumour segmentation dataset, which includes 335 3D MR scans with their segmentation masks. Following [37], we consider a 2D binary segmentation task, i.e., non-tumour vs. tumour, by merging all tumour classes into a single whole-tumour class. Slices within 3D MR scans are used as 2D images with a resolution of 240 × 240. As in [37], purely blank slices without any skull-stripped brain region are excluded in our experiments, and we randomly split the whole BraTS into training, validation and testing sets, containing 271, 32 and 32 scans, respectively.

² https://www.acmd.org.au/braix
³ https://www.kaggle.com/competitions/rsna-breast-cancer-detection
⁴ One examination (D1-1343) was excluded due to its pre-processing failure.

B. Experimental Settings

We use EfficientNet-B0 [38] to construct our InterNRL, where the GlobalNet g_θg and ProtoPNet p_θp are branched from the EfficientNet-B0's stem layer that works as the CNN backbone f_θf, containing a 3 × 3 convolutional layer with a stride of 2, followed by a batch normalisation (BN) layer. Our InterNRL is implemented with PyTorch [39] and trained using the Adam optimiser with an initial learning rate η_g = η_p = 10⁻³, weight decay of 10⁻⁵, and batch size |B| = |B_u| = 8. The hyper-parameters in Eq. (11) are set to λ1 = λ2 = 0.02, λ3 = 0.2, which are tuned on ADMANI and CMMD in Section IV-L. In Eq. (13), we set the margin γ = 10 given the robustness of our model to a large range of values, as shown in Section IV-L. For the prototype consistency loss in Eq. (14), we use the following transformations: random rotation (from −10° to 10°), random translation (up to 15% of the input image size), random scaling (by a factor between 0.95 and 1.05), and random shearing (within the range [−10, 10] degrees, following the PyTorch definition); a short sketch of these transformations is given at the end of this subsection. In ProtoPNet, the feature dimension is D = 128. The temperature τ should meet τ ≫ 1 to prevent the saturation effect of the exponential function used in the similarity metric, and we use τ = 128.

1) Mammogram Classification: For ADMANI and CMMD, the mammograms are pre-processed using an Otsu thresholding algorithm to crop the breast region, which is then resized to H = 1536, W = 768. We set the number of prototypes M to 400 (200 per class) for both datasets. Mammogram classification performance is assessed with the area under the receiver operating characteristic curve (AUC). To evaluate model interpretability on the 995 cancer-annotated testing images from ADMANI, we measure the area under the precision recall curve (PR-AUC) for breast cancer localisation.

2) Retinal OCT Classification: For the NEH dataset, the original OCT images are resized to H = 512, W = 768. In ProtoPNet, the number of prototypes M is set to 150 (50 per class). Retinal
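The transformations listed above for the prototype consistency loss map naturally onto a single torchvision call; the following is a minimal sketch under that assumption, with the single-channel input and the image size chosen only as an example and not prescribed by the paper.

```python
import torch
from torchvision import transforms

# Mirror the stated ranges: rotation within +/-10 degrees, translation up to 15%
# of the image size, scaling between 0.95 and 1.05, and x-shearing within
# +/-10 degrees (the PyTorch definition of shear).
consistency_augmentation = transforms.RandomAffine(
    degrees=10,
    translate=(0.15, 0.15),
    scale=(0.95, 1.05),
    shear=10,
)

image = torch.rand(1, 1536, 768)              # e.g., a pre-processed single-channel mammogram
augmented = consistency_augmentation(image)   # transformed view for the consistency term
print(augmented.shape)
```

Only the parameter ranges are taken from the text; how the original and transformed views enter the consistency loss follows Eq. (14), which is not reproduced here.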
C. Competing Methods
To validate the effectiveness of our InterNRL method, we first compare it with the following fully- and semi-supervised learning methods for the mammogram classification task:

Fully-supervised methods: EfficientNet-B0 [38], Sparse multiple instance learning (Sparse MIL) [40], Globally-aware multiple instance classifier (GMIC) [41], ProtoPNet [11], and BRAIxProtoPNet++ [14]. EfficientNet-B0 is the baseline network on which our InterNRL is established. Sparse MIL can localise lesions by dividing a mammogram into small regions that are classified using multiple instance learning with a sparsity constraint, where we adopt EfficientNet-B0 as the backbone for a fair comparison. GMIC first utilises a global module to select informative regions of a mammogram, then it relies on a local module to analyse those selected regions, and it finally employs a fusion module to aggregate the global and local outputs for classification. BRAIxProtoPNet++ is our previously proposed method that distils the knowledge of a global classifier when training ProtoPNet.

Semi-supervised methods: Mean Teacher [26], SRC-MT [27], and FlexMatch [42]. Mean Teacher (MT) is a widely used semi-supervised baseline model. SRC-MT extends MT by further maintaining the consistency of the semantic relationship among unlabelled training samples. FlexMatch is a SOTA semi-supervised method for natural images. It employs pseudo labels predicted from weakly-augmented unlabelled images to supervise a strongly-augmented version of the same image, where training samples are flexibly selected via a curriculum learning strategy. In FlexMatch, we use the transformation operations described in Section IV-B as weak augmentations, and further use as strong augmentations the Gaussian blur, Gaussian noise, elastic transformation, and gamma correction. To keep a fair comparison, we utilise the same EfficientNet-B0 backbone in MT, SRC-MT, and FlexMatch.

For the retinal OCT classification task, we compare our InterNRL with the following fully-supervised approaches: HOG-SVM [43], Transfer Learning [44], LACNN [18], and FPN-EfficientNet-B0 [35]. HOG-SVM is a traditional machine-learning method which uses the multi-scale HOG feature descriptor and multiple binary SVM classifiers for OCT classification. The Transfer Learning method utilises a pre-trained CNN backbone as a frozen feature extractor and retrains only the fully connected layers to recognise OCT images. LACNN employs a lesion detection network (LDN) to localise macular lesions, which are subsequently exploited to guide the classification task. FPN-EfficientNet-B0 is a multi-scale feature pyramid network (FPN) which can capture rich image features for identifying fine abnormalities in OCT images.

For the brain tumour segmentation task, we compare our InterNRL with the following weakly-supervised methods: CAM [45], Grad-CAM++ [46], SEAM [47], and WSS-CMER [37]. CAM and Grad-CAM++ are two popular baseline models for the weakly-supervised segmentation task. SEAM improves the quality of CAMs by incorporating an additional self-supervised equivariant constraint into the original CAMs. Based on SEAM, the WSS-CMER method further introduces a cross-modality equivariant constraint for the multi-modal segmentation task, which applies consistency regularisation on the CAMs produced across modalities.

D. Classification Results on Mammograms

Table I shows the classification AUC results of different methods on ADMANI under both fully- and semi-supervised settings. For the InterNRL and our previously proposed BRAIxProtoPNet++, we present the classification results of the ProtoPNet and GlobalNet branches independently, and
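As an aside on the FlexMatch baseline summarised in Section IV-C above, the sketch below illustrates its weak/strong pseudo-labelling step in a self-contained, toy form; the tiny model, the fixed 0.95 confidence threshold, and the noise-based stand-in for strong augmentation are all placeholder assumptions, and the class-wise curriculum thresholding of the real method is only indicated by a comment.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a "model" and a batch of unlabelled images with two augmented views.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
weak_view = torch.randn(8, 3, 32, 32)                          # weakly-augmented images
strong_view = weak_view + 0.3 * torch.randn_like(weak_view)    # stand-in for strong augmentation

with torch.no_grad():
    probs = F.softmax(model(weak_view), dim=1)   # predictions on the weak view
conf, pseudo_labels = probs.max(dim=1)

# FlexMatch adapts this threshold per class via curriculum pseudo-labelling;
# a single fixed threshold is used here purely for illustration.
mask = conf > 0.95

# Pseudo labels from the weak view supervise the strongly-augmented view;
# samples below the threshold contribute zero to the loss.
logits_strong = model(strong_view)
loss_u = (F.cross_entropy(logits_strong, pseudo_labels, reduction="none") * mask).mean()
loss_u.backward()
```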
TABLE II. Classification AUC results on CMMD test set, reported with 95% confidence interval computed from bootstrapping with 2000 replicates.

TABLE III. P-values computed from one-sided paired T-test between our InterNRL and other competing methods on ADMANI (Scenario F♥♥) and CMMD.

TABLE IV. Retinal OCT image classification results (in percentage) on NEH dataset.
TABLE V. P-values from one-sided paired T-test between our InterNRL and other methods on NEH OCT dataset.

TABLE VI. Quantitative results (in percentage) of weakly-supervised brain tumour segmentation on BraTS dataset.
EfficientNet-B0 and FPN-EfficientNet-B0, with accuracy improvements of about 2.1% and 1.3%, respectively. In addition, the proposed InterNRL method exhibits better classification results than the original ProtoPNet and our preliminary BRAIxProtoPNet++, which indicates the effectiveness of the student-teacher reciprocal learning paradigm. We should note that the retinal disease classification task is a 3-class problem, so the OCT classification results demonstrate that our InterNRL is extensible to multi-class interpretable diagnosis tasks.

To validate whether the performance gain of our method is statistically significant, we also conduct a one-sided paired T-test between our InterNRL and the other competing methods, where the overall accuracy is used as the evaluation metric. Results in Table V show that the p-values associated with all pairs are below the significance level of 0.05, verifying the superiority of our InterNRL over other methods.

G. Weakly-Supervised Segmentation Results on BraTS

Since our InterNRL method relies on image-level training labels only, we further perform experiments on BraTS to explore its effectiveness for weakly-supervised tumour segmentation. To obtain the predicted segmentation mask of the prototype-based methods (ProtoPNet, BRAIxProtoPNet++, and InterNRL), we use a threshold of 0.5 to binarise the mean similarity map of all tumour prototypes (instead of the top-1 activated tumour prototype, as the mean similarity map yields slightly better results in our observations). Quantitative segmentation results are given in Table VI. It is observed that our InterNRL method achieves the best segmentation results on the T1, T1ce, and FLAIR modalities. In particular, InterNRL outperforms the existing best method, WSS-CMER, by large margins of 6.28% and 6.95% (in Dice) on the T1 and T1ce modalities, respectively. Also, it is worth noting that the original ProtoPNet's results greatly exceed the results of CAM and Grad-CAM++, suggesting that the interpretable prototypes are also effective in the weakly-supervised segmentation task.

H. Prototype Visualisation and Interpretable Reasoning

Fig. 7 illustrates typical visual prototypes learned from ADMANI (a), NEH (b), and BraTS (c), their corresponding source training images, and self-activated similarity maps. Notice in Fig. 7(a) that the non-cancer mammogram prototypes are often from normal breast tissues or benign lesions, while the cancer mammogram prototypes usually comprise regions containing cancerous visual biomarkers (e.g., malignant masses and micro-calcifications). Also, we can see from Fig. 7(b) that the CNV prototypes often contain fibrotic scar or sub-retinal fluid, while the drusen prototypes tend to derive from regions with drusen. In Fig. 7(c), the non-tumour prototypes are prone to capture the healthy brain structures, while the tumour prototypes usually focus on the region with abnormal brain tumour. These findings are well aligned with the criteria that clinicians use for diagnosing diseases, suggesting that the learned prototypes are representative and discriminative. Fig. 8 displays the interpretable reasoning process of our InterNRL on a testing cancer mammogram from ADMANI. As can be seen, our InterNRL method classifies the image as belonging to the cancer class because the abnormality present in the image looks more similar to the cancer prototypes than the non-cancer ones, as evidenced by the higher similarity scores with the cancer prototypes.

I. Analysis of Two-Stage Reciprocal Learning Paradigm

Figure 9 shows the validation AUC performance of our InterNRL method on ADMANI under the fully-supervised setting F♥♥. One phenomenon we can see is that, in the first stage (S1), the student ProtoPNet's validation AUC gradually increases while the teacher network's validation performance goes down as the training progresses. At the end of the first stage, the student ProtoPNet exhibits a higher AUC than the teacher GlobalNet. This result can be explained by the observation that the teacher network is more effective at creating suitable pseudo labels to train the student network than at training to achieve high classification performance for itself. This finding indicates the necessity of the second stage (S2) of the reciprocal learning, where we can see that the teacher GlobalNet can also achieve a high validation AUC after being trained using the pseudo labels generated by the accurate ProtoPNet. We show further supporting evidence in Table VII by providing an ablation study to investigate the importance of the second stage, where the experimental parameters remain the same as in Section IV-B, e.g., training details (batch size, learning rate, and weight decay), hyper-parameters in the ProtoPNet branch (λ1, λ2, λ3 and γ), and
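To make the mask-generation step of Section IV-G concrete, here is a small NumPy sketch that averages toy tumour-prototype similarity maps, binarises the mean map at 0.5, and scores the result with the Dice coefficient reported in Table VI; the array shapes, the random values, and the assumption that the similarity maps are already normalised to [0, 1] are ours.

```python
import numpy as np

# Toy inputs: similarity maps of all tumour prototypes for one 240 x 240 slice
# (each assumed to be normalised to [0, 1]), plus a ground-truth tumour mask.
rng = np.random.default_rng(0)
similarity_maps = rng.random((10, 240, 240))   # (num_tumour_prototypes, H, W)
ground_truth = rng.random((240, 240)) > 0.7    # binary tumour mask

# Mean over all tumour prototypes (rather than the top-1 activated one), then
# binarise with a threshold of 0.5 to obtain the predicted segmentation mask.
predicted = similarity_maps.mean(axis=0) > 0.5

# Dice coefficient, reported in percentage as in Table VI.
intersection = np.logical_and(predicted, ground_truth).sum()
dice = 200.0 * intersection / (predicted.sum() + ground_truth.sum() + 1e-8)
print(f"Dice: {dice:.2f}%")
```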
Fig. 7. Visual prototypes of our InterNRL method learned from (a) ADMANI, (b) NEH, and (c) BraTS (T1) datasets. Each triplet represents the
class-specific learned prototype, the source training image where the prototype originates, and the corresponding self-activated similarity map.
TABLE VIII. Effect of our prototype consistency regularisation strategy on the mammogram classification performance.
TABLE X. Effectiveness of the second stage of reciprocal learning under semi-supervised learning scenarios (S♥ and S♥♥) on ADMANI datasets.
Fig. 11. Failure cases of our InterNRL method for breast cancer classification. (a) Wrong associations with ground-truth cancer annotations (yellow circles). (b) False negative classification of a cancer image. In each case, we randomly show three similarity maps with cancer prototypes and the highest similarity scores.

the teacher so that the teacher can also improve its performance, which finally contributes to their ensemble. To support this, we provide the semi-supervised learning results of our method on the ADMANI dataset, trained without (w/o) and with (w/) the second stage, in Table X. We notice that without retraining, the performance of the teacher network largely deteriorates, resulting in even worse ensembled results than the interpretable ProtoPNet branch. One may observe that the GlobalNet outperforms the ProtoPNet after the second learning stage. We suspect one of the reasons is that the non-interpretable GlobalNet branch is easier to optimise compared with the interpretable ProtoPNet branch. It is worth noting that the performances of both the GlobalNet and ProtoPNet branches are finally improved, compared with the original ProtoPNet [11] method and its non-interpretable counterpart EfficientNet-B0 [38], with results shown in Tables I, II, and IV.

Although our proposed InterNRL method can produce promising interpretable results (see Fig. 8) for breast cancer classification, there are still some unsatisfactory failure cases observed in our experiments, which can be summarised into the following two categories. (1) Wrong associations with ground-truth cancer annotations. As shown in Fig. 11(a), the cancerous lesion existing in the mammogram is poorly differentiable, presenting a diagnostically ambiguous biomarker, which leads prototypes to activate image regions that are not aligned with the ground-truth cancer lesion localisation. Notice that these wrong associations may not necessarily lead to false classification results, but we need to pay extra attention to them for debugging our model, e.g., to inspect if the learned prototypes are sufficiently accurate. (2) False negative classification. As illustrated in Fig. 11(b), the breast cancer lesion occupies a small region of the image, resulting in lower similarity scores with the cancer prototypes than with the non-cancer prototypes, which produces a final misclassification. These misclassified cases suggest that it is necessary to develop an effective strategy to learn more fine-grained prototypes for those small cancerous lesions.

In our experiments, we have demonstrated that our proposed InterNRL framework can be successfully applied to various basic CNN architectures (Section IV-M) for the interpretable diagnosis tasks in medical images. Notably, our InterNRL method is not restricted to those basic CNN architectures and it can be readily integrated into more advanced networks, e.g., visual transformers [55] and multi-level feature fusion [56]. Such integration is natural and uncomplicated because of the high adaptability of InterNRL. Currently, we empirically utilise an equal number of prototypes for each class in the ProtoPNet branch, which may not be appropriate considering the different learning difficulties and imbalanced

VI. CONCLUSION

In this paper, we proposed the InterNRL method for accurate medical image classification with effective prototype-based interpretability. Our method has been designed to seamlessly integrate the prototype-based interpretable model with existing highly accurate global CNN classifiers. We present a two-stage student-teacher reciprocal learning paradigm to reach highly accurate classification. This paradigm also enables InterNRL to exploit additional unlabelled training samples. Experimental results on four datasets demonstrate the superiority of our InterNRL compared to other SOTA methods in disease classification, cancer localisation, and tumour segmentation. Besides, our InterNRL is a general framework that can be easily applied to other binary or multi-class interpretable diagnosis tasks in the medical imaging domain.

ACKNOWLEDGMENT

The authors would like to thank the editor and the anonymous reviewers for their valuable comments and suggestions, particularly for proposing the brain tumour segmentation experiments on the BraTS dataset.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] L. Fang, D. Cunefare, C. Wang, R. H. Guymer, S. Li, and S. Farsiu, "Automatic segmentation of nine retinal layer boundaries in OCT images of non-exudative AMD patients using deep learning and graph search," Biomed. Opt. Exp., vol. 8, no. 5, pp. 2732–2744, 2017.
[3] G. Carneiro, J. Nascimento, and A. P. Bradley, "Automated analysis of unregistered multi-view mammograms with deep learning," IEEE Trans. Med. Imag., vol. 36, no. 11, pp. 2355–2365, Nov. 2017.
[4] C. Wang, Y. Jin, X. Chen, and Z. Liu, "Automatic classification of volumetric optical coherence tomography images via recurrent neural network," Sens. Imag., vol. 21, no. 1, pp. 1–15, Dec. 2020.
[5] C. Wang, Z. Cui, J. Yang, M. Han, G. Carneiro, and D. Shen, "BowelNet: Joint semantic-geometric ensemble learning for bowel segmentation from both partially and fully labeled CT images," IEEE Trans. Med. Imag., vol. 42, no. 4, pp. 1225–1236, Apr. 2023.
[6] J. Su, D. V. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," IEEE Trans. Evol. Comput., vol. 23, no. 5, pp. 828–841, Oct. 2019.
[7] C. Rudin, "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Mach. Intell., vol. 1, no. 5, pp. 206–215, May 2019.
[8] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[9] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," in Proc. ICLR Workshop, 2014, pp. 1–16.
[10] A. Khakzar, S. Albarqouni, and N. Navab, "Learning interpretable features via adversarially robust optimization," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Shenzhen, China: Springer, 2019, pp. 793–800.
[11] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, "This looks like that: Deep learning for interpretable image recognition," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 795–807.
[12] D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, "Transparency by design: Closing the gap between performance and interpretability in visual reasoning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4942–4950.
[13] B. H. M. van der Velden, H. J. Kuijf, K. G. A. Gilhuijs, and M. A. Viergever, "Explainable artificial intelligence (XAI) in deep learning-based medical image analysis," Med. Image Anal., vol. 79, Jul. 2022, Art. no. 102470.
[14] C. Wang et al., "Knowledge distillation to ensemble global and interpretable prototype-based mammogram classification models," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Singapore: Springer, 2022, pp. 1–9.
[15] M. Ganeshkumar, V. Ravi, V. Sowmya, E. Gopalakrishnan, and K. Soman, "Explainable deep learning-based approach for multilabel classification of electrocardiogram," IEEE Trans. Eng. Manag., vol. 70, no. 8, pp. 2787–2799, Aug. 2023.
[16] Y. Lei, Y. Tian, H. Shan, J. Zhang, G. Wang, and M. K. Kalra, "Shape and margin-aware lung nodule classification in low-dose CT images via soft activation mapping," Med. Image Anal., vol. 60, Feb. 2020, Art. no. 101628.
[17] Y. Oh, S. Park, and J. C. Ye, "Deep learning COVID-19 features on CXR using limited training data sets," IEEE Trans. Med. Imag., vol. 39, no. 8, pp. 2688–2700, Aug. 2020.
[18] L. Fang, C. Wang, S. Li, H. Rabbani, X. Chen, and Z. Liu, "Attention to lesion: Lesion-aware convolutional neural network for retinal optical coherence tomography image classification," IEEE Trans. Med. Imag., vol. 38, no. 8, pp. 1959–1970, Aug. 2019.
[19] I. Ilanchezian, D. Kobak, H. Faber, F. Ziemssen, P. Berens, and M. S. Ayhan, "Interpretable gender classification from retinal fundus images using BagNets," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Strasbourg, France: Springer, 2021, pp. 477–487.
[20] W. Brendel and M. Bethge, "Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet," in Proc. ICLR, 2018, pp. 1–15.
[21] E. Kim, S. Kim, M. Seo, and S. Yoon, "XProtoNet: Diagnosis in chest radiography with global and local explanations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 15719–15728.
[22] A. J. Barnett et al., "A case-based interpretable deep learning model for classification of mass lesions in digital mammography," Nature Mach. Intell., vol. 3, no. 12, pp. 1061–1070, Dec. 2021.
[23] J. E. Van Engelen and H. H. Hoos, "A survey on semi-supervised learning," Mach. Learn., vol. 109, no. 2, pp. 373–440, 2020.
[24] Y. Liu, Y. Tian, Y. Chen, F. Liu, V. Belagiannis, and G. Carneiro, "Perturbed and strict mean teachers for semi-supervised semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 4258–4267.
[25] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[26] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 33–45.
[27] Q. Liu, L. Yu, L. Luo, Q. Dou, and P. A. Heng, "Semi-supervised medical image classification with relation-driven self-ensembling model," IEEE Trans. Med. Imag., vol. 39, no. 11, pp. 3429–3440, Nov. 2020.
[28] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, "Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Shenzhen, China: Springer, 2019, pp. 605–613.
[29] Y. Wang et al., "Double-uncertainty weighted method for semi-supervised learning," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Lima, Peru: Springer, 2020, pp. 542–551.
[30] R. Zhao, X. Chen, Z. Chen, and S. Li, "Diagnosing glaucoma on imbalanced data with self-ensemble dual-curriculum learning," Med. Image Anal., vol. 75, Jan. 2022, Art. no. 102295.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[32] H. Pham, Z. Dai, Q. Xie, and Q. V. Le, "Meta pseudo labels," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 11557–11568.
[33] H. M. L. Frazer et al., "ADMANI: Annotated digital mammograms and associated non-image datasets," Radiol., Artif. Intell., vol. 5, no. 2, Mar. 2023, Art. no. e220072.
[34] H. Cai et al., "Breast microcalcification diagnosis using deep convolutional neural network from digital mammograms," Comput. Math. Methods Med., vol. 2019, pp. 1–10, Mar. 2019.
[35] S. Sotoudeh-Paima, A. Jodeiri, F. Hajizadeh, and H. Soltanian-Zadeh, "Multi-scale convolutional neural network for automated AMD classification using retinal OCT images," Comput. Biol. Med., vol. 144, May 2022, Art. no. 105368.
[36] B. H. Menze et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Trans. Med. Imag., vol. 34, no. 10, pp. 1993–2024, Oct. 2015.
[37] G. Patel and J. Dolz, "Weakly supervised segmentation with cross-modality equivariant constraints," Med. Image Anal., vol. 77, Apr. 2022, Art. no. 102374.
[38] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. ICML, 2019, pp. 6105–6114.
[39] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 32–46.
[40] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, "Deep multi-instance networks with sparse label assignment for whole mammogram classification," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Quebec, QC, Canada: Springer, 2017, pp. 603–611.
[41] Y. Shen et al., "An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization," Med. Image Anal., vol. 68, Feb. 2021, Art. no. 101908.
[42] B. Zhang et al., "FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 18408–18419.
[43] P. P. Srinivasan et al., "Fully automated detection of diabetic macular edema and dry age-related macular degeneration from optical coherence tomography images," Biomed. Opt. Exp., vol. 5, no. 10, pp. 3568–3577, 2014.
[44] D. S. Kermany et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122–1131.e9, Feb. 2018.
[45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[46] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, "Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 839–847.
[47] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, "Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12275–12284.
[48] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[49] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, "Weakly supervised object localization and detection: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5866–5885, Sep. 2022.
[50] H.-J. Oh, K. Lee, and W.-K. Jeong, "Scribble-supervised cell segmentation using multiscale contrastive regularization," in Proc. IEEE 19th Int. Symp. Biomed. Imag. (ISBI), Mar. 2022, pp. 1–5.
[51] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Munich, Germany: Springer, 2015, pp. 234–241.
[52] T. Lucas, P. Weinzaepfel, and G. Rogez, "Barely-supervised learning: Semi-supervised learning with very few labeled images," in Proc. AAAI Conf. Artif. Intell., vol. 36, 2022, pp. 1881–1889.
[53] S. Budd, E. C. Robinson, and B. Kainz, "A survey on active learning and human-in-the-loop deep learning for medical image analysis," Med. Image Anal., vol. 71, Jul. 2021, Art. no. 102062.
[54] V. Cheplygina, M. de Bruijne, and J. P. W. Pluim, "Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis," Med. Image Anal., vol. 54, pp. 280–296, May 2019.
[55] A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent., 2020, pp. 1–12.
[56] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[57] C. Wang et al., "Learning support and trivial prototypes for interpretable image classification," 2023, arXiv:2301.04011.