An Interpretable and Accurate Deep-Learning Diagnosis Framework Modeled With Fully and Semi-Supervised Reciprocal Learning
Manuscript received 20 June 2023; revised 8 August 2023; accepted 11 August 2023. Date of publication 21 August 2023; date of current version 2 January 2024. This work was supported in part by the Australian Government under the Medical Research Future Fund for the Transforming Breast Cancer Screening with Artificial Intelligence (BRAIx) Project under Grant MRFAI000090 and in part by the Australian Research Council under Grant FT190100525. (Corresponding author: Chong Wang.)

This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Ethics Committee of St Vincent's Hospital Melbourne under Approval No. LNR/18/SVHM/162.

Chong Wang, Yuanhong Chen, and Fengbei Liu are with the Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5000, Australia (e-mail: chong.wang@adelaide.edu.au; yuanhong.chen@adelaide.edu.au; fengbei.liu@adelaide.edu.au).

Michael Elliott, Chun Fung Kwok, and Carlos Peña-Solorzano are with the St. Vincent's Institute of Medical Research, Melbourne, VIC 3065, Australia (e-mail: melliott@svi.edu.au; jkwok@svi.edu.au; cpsolorzano@svi.edu.au).

Helen Frazer is with the St. Vincent's Hospital Melbourne, Melbourne, VIC 3002, Australia (e-mail: helen.frazer@svha.org.au).

Davis James McCarthy is with the St. Vincent's Institute of Medical Research, Melbourne, VIC 3065, Australia, and also with the Melbourne Integrative Genomics, The University of Melbourne, Melbourne, VIC 3010, Australia (e-mail: dmccarthy@svi.edu.au).

Gustavo Carneiro is with the Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5000, Australia, and also with the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, GU2 7XH Guildford, U.K. (e-mail: g.carneiro@surrey.ac.uk).

Digital Object Identifier 10.1109/TMI.2023.3306781

Abstract— The deployment of automated deep-learning classifiers in clinical practice has the potential to streamline the diagnosis process and improve the diagnosis accuracy, but the acceptance of those classifiers relies on both their accuracy and interpretability. In general, accurate deep-learning classifiers provide little model interpretability, while interpretable models do not have competitive classification accuracy. In this paper, we introduce a new deep-learning diagnosis framework, called InterNRL, that is designed to be highly accurate and interpretable. InterNRL consists of a student-teacher framework, where the student model is an interpretable prototype-based classifier (ProtoPNet) and the teacher is an accurate global image classifier (GlobalNet). The two classifiers are mutually optimised with a novel reciprocal learning paradigm in which the student ProtoPNet learns from optimal pseudo labels produced by the teacher GlobalNet, while GlobalNet learns from ProtoPNet's classification performance and pseudo labels. This reciprocal learning paradigm enables InterNRL to be flexibly optimised under both fully- and semi-supervised learning scenarios, reaching state-of-the-art classification performance in both scenarios for the tasks of breast cancer and retinal disease diagnosis. Moreover, relying on weakly-labelled training images, InterNRL also achieves superior breast cancer localisation and brain tumour segmentation results compared with other competing methods.

Index Terms— Interpretability, interpretable classification, reciprocal learning, student-teacher model, semi-supervised learning, weakly-supervised segmentation, mammogram, optical coherence tomography.

I. INTRODUCTION

Deep learning [1] has recently shown tremendous success in many automated image analysis tasks [2], [3], [4], [5]. A typical representative of deep learning models is the deep neural network (DNN), which can automatically learn hierarchical features from image input. DNNs are commonly composed of many interconnected layers with a massive number of learnable parameters, making it hard to interpret their predictions. Therefore, deep learning networks are often treated as black boxes that may produce a wrong outcome with high confidence by changing only one pixel of the input image [6]. Such lack of interpretability is an issue for medical applications that can lead to potentially catastrophic consequences [7]. Furthermore, clinicians can be hesitant to accept predictions made by black-box models, which hinders their successful translation into real-world clinical practice.

To improve the prediction interpretability of deep learning models, some studies provide post-hoc explanations (e.g., Grad-CAM [8] and saliency maps [9]) to highlight the spatial importance of image regions related to the predictions of a global deep-learning classifier, as shown in Fig. 1(a). For instance, Khakzar et al. [10] train a chest X-ray deep-learning classifier that uses perturbed adversarial training samples to generate better class activation and saliency maps. However, such highlighted classification-relevant maps are often unreliable and insufficient for interpreting the inner workings of deep models [7]. On the other hand, prototype-based models (e.g., ProtoPNet [11]), as illustrated in Fig. 1(b), can produce not only a map of activated regions, but also an interpretable reasoning process by directly associating classification outcomes with representative prototypes learned from training samples. Even though prototype-based models show better interpretability than post-hoc explanations, their classification accuracy is generally not competitive with that of accurate global classifiers.
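To give a concrete feel for this prototype-based reasoning before the framework itself is introduced, the following is a minimal, illustrative PyTorch sketch; the tensor shapes, the Gaussian-style similarity function, and the unweighted class scoring are our own simplifications for illustration, not the exact formulation used by ProtoPNet or by the method proposed in this paper.

```python
import torch

# Toy shapes: a backbone feature map (C channels over an H x W grid) and
# M prototypes per class, each a C-dimensional vector.
C, H, W, M_per_class, num_classes = 128, 48, 24, 4, 2

feature_map = torch.randn(C, H, W)                      # output of a CNN backbone
prototypes = torch.randn(num_classes, M_per_class, C)   # learned prototype vectors

# Compare every spatial feature vector with every prototype (L2 distance),
# turn distances into similarities, and keep each prototype's best-matching patch.
patches = feature_map.reshape(C, H * W).T                # (H*W, C)
dists = torch.cdist(prototypes.reshape(-1, C), patches)  # (num_classes*M, H*W)
similarity_maps = torch.exp(-dists ** 2)                 # illustrative similarity function
top_similarity = similarity_maps.max(dim=1).values       # one score per prototype

# A transparent head: each class score is a sum of similarities to that class's own
# prototypes, so a decision can be traced back to specific prototypes and regions.
class_scores = top_similarity.reshape(num_classes, M_per_class).sum(dim=1)
print("predicted class:", class_scores.argmax().item())
```

The point of the sketch is that every class score decomposes into per-prototype similarities, each tied to an image region, which is what makes the reasoning process inspectable.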
WANG et al.: INTERPRETABLE AND ACCURATE DEEP-LEARNING DIAGNOSIS FRAMEWORK 393
Authorized licensed use limited to: Mukesh Patel School of Technology & Engineering. Downloaded on February 06,2024 at 02:00:36 UTC from IEEE Xplore. Restrictions apply.
394 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 43, NO. 1, JANUARY 2024
Authorized licensed use limited to: Mukesh Patel School of Technology & Engineering. Downloaded on February 06,2024 at 02:00:36 UTC from IEEE Xplore. Restrictions apply.
WANG et al.: INTERPRETABLE AND ACCURATE DEEP-LEARNING DIAGNOSIS FRAMEWORK 395
Fig. 2. An overview of our InterNRL framework (a), which consists of a shared CNN backbone and two classification branches, i.e., interpretable
ProtoPNet and global classifier GlobalNet. The InterNRL is optimised with a two-stage reciprocal student-teacher learning paradigm: (b) the student
ProtoPNet learns from the suitable pseudo labels produced by the teacher GlobalNet and the GlobalNet is trained based on the ProtoPNet’s
performance feedback; and (c) the teacher GlobalNet is further trained using the pseudo labels produced by the accurate student ProtoPNet.
the same class, as in:

p_m ← arg min_{v ∈ V_i, i ∈ {1,...,|D|}} ||v − p_m||_2^2 .   (15)

One requirement for the learned prototypes is that they should be diverse enough to represent rich and semantically distinctive patterns of training samples. This is important for improving model interpretability since it can reduce the chance of learning repeated prototypes, which is an issue noticed in the training of the original ProtoPNet [11]. To facilitate prototype diversity, we utilise the greedy prototype projection strategy [14] proposed in our preliminary BRAIxProtoPNet++, which guarantees that there will not be any repetition when updating the prototypes in Eq. (15). Besides, the above Eq. (15) is also utilised for visualising the learned prototypes (see Fig. 7); the reader is encouraged to refer to [11] for more details.

IV. EXPERIMENTS

A. Datasets

We validate our proposed InterNRL method on four datasets: our private Annotated Digital Mammograms and Associated Non-Image (ADMANI) dataset [33], the public Chinese Mammography Database (CMMD) [34], the public Noor Eye Hospital (NEH) OCT dataset [35], and the public brain tumour segmentation (BraTS) MRI dataset [36].

ADMANI [33] is a large-scale mammogram dataset from the BreastScreen Victoria (Australia) program² ranging from 2013 to 2019. The dataset has high-resolution mammograms (of size about 5400 × 4200 pixels) of four standard views (L-CC, L-MLO, R-CC, and R-MLO) and biopsy-confirmed diagnosis outcome (cancer or non-cancer). The whole ADMANI has about 4.4 million images acquired by many manufacturers: Siemens, Hologic, Fujifilm Corporation, Philips Digital Mammography Sweden AB, Philips Medical Systems, and Konica Minolta, where 40000 images are made publicly available to researchers in the Radiological Society of North America (RSNA) Screening Mammography Breast Cancer Detection AI Challenge.³ In this study, we use a subset of ADMANI that has 12220 (with 6012 cancer samples and 6208 non-cancer samples) training images (training set A), 1820 (with 880 cancer samples and 940 non-cancer samples) validation images, and 24172 (with 1572 cancer samples and 22600 non-cancer samples) testing images. For semi-supervised learning, we also use an additional 24988 unlabelled training images (training set B) from ADMANI. There is no overlap of patient data between the training, validation, testing, and unlabelled sets. In the testing set, 995 cancer images have lesion annotations labelled by experienced radiologists for evaluating cancer localisation performance and model interpretability.

CMMD [34] is a public mammogram dataset from 1775 Chinese patients collected from 2012 to 2016. The dataset consists of 5200⁴ (with 2632 cancer samples and 2568 non-cancer samples) 4-view mammograms of size 2294 × 1914 pixels with biopsy-confirmed benign or malignant tumours. In our experiments, we split the whole CMMD into training (3198 images, ID: D1-0001 ∼ D2-0247) and testing (2002 images, ID: D2-0248 ∼ D2-0749) in a patient-wise way.

NEH [35] is a public OCT dataset for 2D slice-based retinal disease diagnosis among classes: normal, drusen, and choroidal neovascularization (CNV). The dataset has 16822 OCT B-scans from 441 patients acquired by a Heidelberg SD-OCT. Following the exclusion criteria in [35], we keep 12649 (5667 normal, 3742 drusen, 3240 CNV) B-scans and utilise a five-fold cross validation [35] in our experiments.

BraTS 2019 [36] is a multi-modal (T1, T2, T1ce, and FLAIR) brain tumour segmentation dataset, which includes 335 3D MR scans with their segmentation masks. Following [37], we consider a 2D binary segmentation task, i.e., non-tumour vs. tumour, by merging all tumour classes into a single whole-tumour class. Slices within 3D MR scans are used as 2D images with a resolution of 240 × 240. As in [37], purely blank slices without any skull-stripped brain region are excluded in our experiments, and we randomly split the whole BraTS into training, validation and testing sets, containing 271, 32 and 32 scans, respectively.

² https://www.acmd.org.au/braix
³ https://www.kaggle.com/competitions/rsna-breast-cancer-detection
⁴ One examination (D1-1343) was excluded due to its pre-processing failure.

B. Experimental Settings

We use EfficientNet-B0 [38] to construct our InterNRL, where the GlobalNet g_θg and ProtoPNet p_θp are branched from the EfficientNet-B0's stem layer that works as the CNN backbone f_θf, containing a 3 × 3 convolutional layer with a stride of 2, followed by a batch normalisation (BN) layer. Our InterNRL is implemented with PyTorch [39] and trained using the Adam optimiser with an initial learning rate η_g = η_p = 10⁻³, weight decay of 10⁻⁵, and batch size |B| = |B_u| = 8. The hyper-parameters in Eq. (11) are set to λ1 = λ2 = 0.02, λ3 = 0.2, which are tuned on ADMANI and CMMD in Section IV-L. In Eq. (13), we set the margin γ = 10 given the robustness of our model to a large range of values, as shown in Section IV-L. For the prototype consistency loss in Eq. (14), we use the following transformations: random rotation (from −10° to 10°), random translation (up to 15% of the input image size), random scaling (by a factor between 0.95 and 1.05), and random shearing (within the range [−10, 10] degrees, following the PyTorch definition); a short sketch of these transformations is given at the end of this subsection. In ProtoPNet, the feature dimension is D = 128. The temperature τ should meet τ ≫ 1 to prevent the saturation effect of the exponential function used in the similarity metric, and we use τ = 128.

1) Mammogram Classification: For ADMANI and CMMD, the mammograms are pre-processed using an Otsu thresholding algorithm to crop the breast region, which is then resized to H = 1536, W = 768. We set the number of prototypes M to 400 (200 per class) for both datasets. Mammogram classification performance is assessed with the area under the receiver operating characteristic curve (AUC). To evaluate model interpretability on the 995 cancer-annotated testing images from ADMANI, we measure the area under the precision recall curve (PR-AUC) for breast cancer localisation.

2) Retinal OCT Classification: For the NEH dataset, the original OCT images are resized to H = 512, W = 768. In ProtoPNet, the number of prototypes M is set to 150 (50 per class). Retinal
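The transformations listed above for the prototype consistency loss map naturally onto a single torchvision call; the following is a minimal sketch under that assumption, with the single-channel input and the image size chosen only as an example and not prescribed by the paper.

```python
import torch
from torchvision import transforms

# Mirror the stated ranges: rotation within +/-10 degrees, translation up to 15%
# of the image size, scaling between 0.95 and 1.05, and x-shearing within
# +/-10 degrees (the PyTorch definition of shear).
consistency_augmentation = transforms.RandomAffine(
    degrees=10,
    translate=(0.15, 0.15),
    scale=(0.95, 1.05),
    shear=10,
)

image = torch.rand(1, 1536, 768)              # e.g., a pre-processed single-channel mammogram
augmented = consistency_augmentation(image)   # transformed view for the consistency term
print(augmented.shape)
```

Only the parameter ranges are taken from the text; how the original and transformed views enter the consistency loss follows Eq. (14), which is not reproduced here.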
C. Competing Methods
To validate the effectiveness of our InterNRL method, we first compare it with the following fully- and semi-supervised learning methods for the mammogram classification task:

Fully-supervised methods: EfficientNet-B0 [38], Sparse multiple instance learning (Sparse MIL) [40], Globally-aware multiple instance classifier (GMIC) [41], ProtoPNet [11], and BRAIxProtoPNet++ [14]. EfficientNet-B0 is the baseline network on which our InterNRL is established. Sparse MIL can localise lesions by dividing a mammogram into small regions that are classified using multiple instance learning with a sparsity constraint, where we adopt EfficientNet-B0 as the backbone for a fair comparison. GMIC first utilises a global module to select informative regions of a mammogram, then it relies on a local module to analyse those selected regions, and it finally employs a fusion module to aggregate the global and local outputs for classification. BRAIxProtoPNet++ is our previously proposed method that distils the knowledge of a global classifier when training ProtoPNet.

Semi-supervised methods: Mean Teacher [26], SRC-MT [27], and FlexMatch [42]. Mean Teacher (MT) is a widely used semi-supervised baseline model. SRC-MT extends MT by further maintaining the consistency of the semantic relationship among unlabelled training samples. FlexMatch is a SOTA semi-supervised method for natural images. It employs pseudo labels predicted from weakly-augmented unlabelled images to supervise a strongly-augmented version of the same image, where training samples are flexibly selected via a curriculum learning strategy. In FlexMatch, we use the transformation operations described in Section IV-B as weak augmentations, and further use as strong augmentations the Gaussian blur, Gaussian noise, elastic transformation, and gamma correction. To keep a fair comparison, we utilise the same EfficientNet-B0 backbone in MT, SRC-MT, and FlexMatch.

For the retinal OCT classification task, we compare our InterNRL with the following fully-supervised approaches: HOG-SVM [43], Transfer Learning [44], LACNN [18], and FPN-EfficientNet-B0 [35]. HOG-SVM is a traditional machine-learning method which uses the multi-scale HOG feature descriptor and multiple binary SVM classifiers for OCT classification. The Transfer Learning method utilises a pre-trained CNN backbone as a frozen feature extractor and retrains only the fully connected layers to recognise OCT images. LACNN employs a lesion detection network (LDN) to localise macular lesions, which are subsequently exploited to guide the classification task. FPN-EfficientNet-B0 is a multi-scale feature pyramid network (FPN) which can capture rich image features for identifying fine abnormalities in OCT images.

For the brain tumour segmentation task, we compare our InterNRL with the following weakly-supervised methods: CAM [45], Grad-CAM++ [46], SEAM [47], and WSS-CMER [37]. CAM and Grad-CAM++ are two popular baseline models for the weakly-supervised segmentation task. SEAM improves the quality of CAMs by incorporating an additional self-supervised equivariant constraint into the original CAMs. Based on SEAM, the WSS-CMER method further introduces a cross-modality equivariant constraint for the multi-modal segmentation task, which applies consistency regularisation on the CAMs produced across modalities.

D. Classification Results on Mammograms

Table I shows the classification AUC results of different methods on ADMANI under both fully- and semi-supervised settings. For the InterNRL and our previously proposed BRAIxProtoPNet++, we present the classification results of the ProtoPNet and GlobalNet branches independently, and
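As an aside on the FlexMatch baseline summarised in Section IV-C above, the sketch below illustrates its weak/strong pseudo-labelling step in a self-contained, toy form; the tiny model, the fixed 0.95 confidence threshold, and the noise-based stand-in for strong augmentation are all placeholder assumptions, and the class-wise curriculum thresholding of the real method is only indicated by a comment.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a "model" and a batch of unlabelled images with two augmented views.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
weak_view = torch.randn(8, 3, 32, 32)                          # weakly-augmented images
strong_view = weak_view + 0.3 * torch.randn_like(weak_view)    # stand-in for strong augmentation

with torch.no_grad():
    probs = F.softmax(model(weak_view), dim=1)   # predictions on the weak view
conf, pseudo_labels = probs.max(dim=1)

# FlexMatch adapts this threshold per class via curriculum pseudo-labelling;
# a single fixed threshold is used here purely for illustration.
mask = conf > 0.95

# Pseudo labels from the weak view supervise the strongly-augmented view;
# samples below the threshold contribute zero to the loss.
logits_strong = model(strong_view)
loss_u = (F.cross_entropy(logits_strong, pseudo_labels, reduction="none") * mask).mean()
loss_u.backward()
```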
TABLE II. Classification AUC results on CMMD test set, reported with 95% confidence interval computed from bootstrapping with 2000 replicates.

TABLE III. P-values computed from one-sided paired T-test between our InterNRL and other competing methods on ADMANI (Scenario F♥♥) and CMMD.

TABLE IV. Retinal OCT image classification results (in percentage) on NEH dataset.
TABLE V. P-values from one-sided paired T-test between our InterNRL and other methods on NEH OCT dataset.

TABLE VI. Quantitative results (in percentage) of weakly-supervised brain tumour segmentation on BraTS dataset.
EfficientNet-B0 and FPN-EfficientNet-B0, with accuracy improvements of about 2.1% and 1.3%, respectively. In addition, the proposed InterNRL method exhibits better classification results than the original ProtoPNet and our preliminary BRAIxProtoPNet++, which indicates the effectiveness of the student-teacher reciprocal learning paradigm. We should note that the retinal disease classification task is a 3-class problem, so the OCT classification results demonstrate that our InterNRL is extensible to multi-class interpretable diagnosis tasks.

To validate whether the performance gain of our method is statistically significant, we also conduct a one-sided paired T-test between our InterNRL and the other competing methods, where the overall accuracy is used as the evaluation metric. Results in Table V show that the p-values associated with all pairs are below the significance level of 0.05, verifying the superiority of our InterNRL over other methods.

G. Weakly-Supervised Segmentation Results on BraTS

Since our InterNRL method relies on image-level training labels only, we further perform experiments on BraTS to explore its effectiveness for weakly-supervised tumour segmentation. To obtain the predicted segmentation mask of the prototype-based methods (ProtoPNet, BRAIxProtoPNet++, and InterNRL), we use a threshold of 0.5 to binarise the mean similarity map of all tumour prototypes (instead of the top-1 activated tumour prototype, as the mean similarity map yields slightly better results in our observations). Quantitative segmentation results are given in Table VI. It is observed that our InterNRL method achieves the best segmentation results on the T1, T1ce, and FLAIR modalities. In particular, InterNRL outperforms the existing best method, WSS-CMER, by large margins of 6.28% and 6.95% (in Dice) on the T1 and T1ce modalities, respectively. Also, it is worth noting that the original ProtoPNet's results greatly exceed the results of CAM and Grad-CAM++, suggesting that the interpretable prototypes are also effective in the weakly-supervised segmentation task.

H. Prototype Visualisation and Interpretable Reasoning

Fig. 7 illustrates typical visual prototypes learned from ADMANI (a), NEH (b), and BraTS (c), their corresponding source training images, and self-activated similarity maps. Notice in Fig. 7(a) that the non-cancer mammogram prototypes are often from normal breast tissues or benign lesions, while the cancer mammogram prototypes usually comprise regions containing cancerous visual biomarkers (e.g., malignant masses and micro-calcifications). Also, we can see from Fig. 7(b) that the CNV prototypes often contain fibrotic scar or sub-retinal fluid, while the drusen prototypes tend to derive from regions with drusen. In Fig. 7(c), the non-tumour prototypes are prone to capture the healthy brain structures, while the tumour prototypes usually focus on the region with abnormal brain tumour. These findings are well aligned with the criteria that clinicians use for diagnosing diseases, suggesting that the learned prototypes are representative and discriminative. Fig. 8 displays the interpretable reasoning process of our InterNRL on a testing cancer mammogram from ADMANI. As can be seen, our InterNRL method classifies the image as belonging to the cancer class because the abnormality present in the image looks more similar to the cancer prototypes than the non-cancer ones, as evidenced by the higher similarity scores with the cancer prototypes.

I. Analysis of Two-Stage Reciprocal Learning Paradigm

Figure 9 shows the validation AUC performance of our InterNRL method on ADMANI under the fully-supervised setting F♥♥. One phenomenon we can see is that, in the first stage (S1), the student ProtoPNet's validation AUC gradually increases while the teacher network's validation performance goes down as the training progresses. At the end of the first stage, the student ProtoPNet exhibits a higher AUC than the teacher GlobalNet. This result can be explained by the observation that the teacher network is more effective at creating suitable pseudo labels to train the student network than at training to achieve high classification performance for itself. This finding indicates the necessity of the second stage (S2) of the reciprocal learning, where we can see that the teacher GlobalNet can also achieve a high validation AUC after being trained using the pseudo labels generated by the accurate ProtoPNet. We show further supporting evidence in Table VII by providing an ablation study to investigate the importance of the second stage, where the experimental parameters remain the same as in Section IV-B, e.g., training details (batch size, learning rate, and weight decay), hyper-parameters in the ProtoPNet branch (λ1, λ2, λ3 and γ), and
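To make the mask-generation step of Section IV-G concrete, here is a small NumPy sketch that averages toy tumour-prototype similarity maps, binarises the mean map at 0.5, and scores the result with the Dice coefficient reported in Table VI; the array shapes, the random values, and the assumption that the similarity maps are already normalised to [0, 1] are ours.

```python
import numpy as np

# Toy inputs: similarity maps of all tumour prototypes for one 240 x 240 slice
# (each assumed to be normalised to [0, 1]), plus a ground-truth tumour mask.
rng = np.random.default_rng(0)
similarity_maps = rng.random((10, 240, 240))   # (num_tumour_prototypes, H, W)
ground_truth = rng.random((240, 240)) > 0.7    # binary tumour mask

# Mean over all tumour prototypes (rather than the top-1 activated one), then
# binarise with a threshold of 0.5 to obtain the predicted segmentation mask.
predicted = similarity_maps.mean(axis=0) > 0.5

# Dice coefficient, reported in percentage as in Table VI.
intersection = np.logical_and(predicted, ground_truth).sum()
dice = 200.0 * intersection / (predicted.sum() + ground_truth.sum() + 1e-8)
print(f"Dice: {dice:.2f}%")
```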
Fig. 7. Visual prototypes of our InterNRL method learned from (a) ADMANI, (b) NEH, and (c) BraTS (T1) datasets. Each triplet represents the
class-specific learned prototype, the source training image where the prototype originates, and the corresponding self-activated similarity map.
TABLE VIII. Effect of our prototype consistency regularisation strategy on the mammogram classification performance.
TABLE X. Effectiveness of the second stage of reciprocal learning under semi-supervised learning scenarios (S♥ and S♥♥) on ADMANI datasets.
Fig. 11. Failure cases of our InterNRL method for breast cancer classification. (a) Wrong associations with ground-truth cancer annotations (yellow circles). (b) False negative classification of a cancer image. In each case, we randomly show three similarity maps with cancer prototypes and the highest similarity scores.

the teacher so that the teacher can also improve its performance, which finally contributes to their ensemble. To support this, we provide the semi-supervised learning results of our method on the ADMANI dataset, trained without (w/o) and with (w/) the second stage, in Table X. We notice that without retraining, the performance of the teacher network largely deteriorates, resulting in even worse ensembled results than the interpretable ProtoPNet branch. One may observe that the GlobalNet outperforms the ProtoPNet after the second learning stage. We suspect one of the reasons is that the non-interpretable GlobalNet branch is easier to optimise compared with the interpretable ProtoPNet branch. It is worth noting that the performances of both the GlobalNet and ProtoPNet branches are finally improved, compared with the original ProtoPNet [11] method and its non-interpretable counterpart EfficientNet-B0 [38], with results shown in Tables I, II, and IV.

Although our proposed InterNRL method can produce promising interpretable results (see Fig. 8) for breast cancer classification, there are still some unsatisfactory failure cases observed in our experiments, which can be summarised into the following two categories. (1) Wrong associations with ground-truth cancer annotations. As shown in Fig. 11(a), the cancerous lesion existing in the mammogram is poorly differentiable, presenting a diagnostically ambiguous biomarker, which leads prototypes to activate image regions that are not aligned with the ground-truth cancer lesion localisation. Notice that these wrong associations may not necessarily lead to false classification results, but we need to pay extra attention to them for debugging our model, e.g., to inspect if the learned prototypes are sufficiently accurate. (2) False negative classification. As illustrated in Fig. 11(b), the breast cancer lesion occupies a small region of the image, resulting in lower similarity scores with the cancer prototypes than with the non-cancer prototypes, which produces a final misclassification. These misclassified cases suggest that it is necessary to develop an effective strategy to learn more fine-grained prototypes for those small cancerous lesions.

In our experiments, we have demonstrated that our proposed InterNRL framework can be successfully applied to various basic CNN architectures (Section IV-M) for the interpretable diagnosis tasks in medical images. Notably, our InterNRL method is not restricted to those basic CNN architectures and it can be readily integrated into more advanced networks, e.g., visual transformers [55] and multi-level feature fusion [56]. Such integration is natural and uncomplicated because of the high adaptability of InterNRL. Currently, we empirically utilise an equal number of prototypes for each class in the ProtoPNet branch, which may not be appropriate considering the different learning difficulties and imbalanced

VI. CONCLUSION

In this paper, we proposed the InterNRL method for accurate medical image classification with effective prototype-based interpretability. Our method has been designed to seamlessly integrate the prototype-based interpretable model with existing highly accurate global CNN classifiers. We present a two-stage student-teacher reciprocal learning paradigm to reach highly accurate classification. This paradigm also enables InterNRL to exploit additional unlabelled training samples. Experimental results on four datasets demonstrate the superiority of our InterNRL compared to other SOTA methods in disease classification, cancer localisation, and tumour segmentation. Besides, our InterNRL is a general framework that can be easily applied to other binary or multi-class interpretable diagnosis tasks in the medical imaging domain.

ACKNOWLEDGMENT

The authors would like to thank the editor and the anonymous reviewers for their valuable comments and suggestions, particularly for proposing the brain tumour segmentation experiments on the BraTS dataset.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] L. Fang, D. Cunefare, C. Wang, R. H. Guymer, S. Li, and S. Farsiu, "Automatic segmentation of nine retinal layer boundaries in OCT images of non-exudative AMD patients using deep learning and graph search," Biomed. Opt. Exp., vol. 8, no. 5, pp. 2732–2744, 2017.
[3] G. Carneiro, J. Nascimento, and A. P. Bradley, "Automated analysis of unregistered multi-view mammograms with deep learning," IEEE Trans. Med. Imag., vol. 36, no. 11, pp. 2355–2365, Nov. 2017.
[4] C. Wang, Y. Jin, X. Chen, and Z. Liu, "Automatic classification of volumetric optical coherence tomography images via recurrent neural network," Sens. Imag., vol. 21, no. 1, pp. 1–15, Dec. 2020.
[5] C. Wang, Z. Cui, J. Yang, M. Han, G. Carneiro, and D. Shen, "BowelNet: Joint semantic-geometric ensemble learning for bowel segmentation from both partially and fully labeled CT images," IEEE Trans. Med. Imag., vol. 42, no. 4, pp. 1225–1236, Apr. 2023.
[6] J. Su, D. V. Vargas, and K. Sakurai, "One pixel attack for fooling deep neural networks," IEEE Trans. Evol. Comput., vol. 23, no. 5, pp. 828–841, Oct. 2019.
[7] C. Rudin, "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Mach. Intell., vol. 1, no. 5, pp. 206–215, May 2019.
[8] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[9] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," in Proc. ICLR Workshop, 2014, pp. 1–16.
[10] A. Khakzar, S. Albarqouni, and N. Navab, "Learning interpretable features via adversarially robust optimization," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Shenzhen, China: Springer, 2019, pp. 793–800.
[11] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, "This looks like that: Deep learning for interpretable image recognition," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 795–807.
[12] D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, "Transparency by design: Closing the gap between performance and interpretability in visual reasoning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4942–4950.
[13] B. H. M. van der Velden, H. J. Kuijf, K. G. A. Gilhuijs, and M. A. Viergever, "Explainable artificial intelligence (XAI) in deep learning-based medical image analysis," Med. Image Anal., vol. 79, Jul. 2022, Art. no. 102470.
[14] C. Wang et al., "Knowledge distillation to ensemble global and interpretable prototype-based mammogram classification models," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Singapore: Springer, 2022, pp. 1–9.
[15] M. Ganeshkumar, V. Ravi, V. Sowmya, E. Gopalakrishnan, and K. Soman, "Explainable deep learning-based approach for multilabel classification of electrocardiogram," IEEE Trans. Eng. Manag., vol. 70, no. 8, pp. 2787–2799, Aug. 2023.
[16] Y. Lei, Y. Tian, H. Shan, J. Zhang, G. Wang, and M. K. Kalra, "Shape and margin-aware lung nodule classification in low-dose CT images via soft activation mapping," Med. Image Anal., vol. 60, Feb. 2020, Art. no. 101628.
[17] Y. Oh, S. Park, and J. C. Ye, "Deep learning COVID-19 features on CXR using limited training data sets," IEEE Trans. Med. Imag., vol. 39, no. 8, pp. 2688–2700, Aug. 2020.
[18] L. Fang, C. Wang, S. Li, H. Rabbani, X. Chen, and Z. Liu, "Attention to lesion: Lesion-aware convolutional neural network for retinal optical coherence tomography image classification," IEEE Trans. Med. Imag., vol. 38, no. 8, pp. 1959–1970, Aug. 2019.
[19] I. Ilanchezian, D. Kobak, H. Faber, F. Ziemssen, P. Berens, and M. S. Ayhan, "Interpretable gender classification from retinal fundus images using BagNets," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Strasbourg, France: Springer, 2021, pp. 477–487.
[20] W. Brendel and M. Bethge, "Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet," in Proc. ICLR, 2018, pp. 1–15.
[21] E. Kim, S. Kim, M. Seo, and S. Yoon, "XProtoNet: Diagnosis in chest radiography with global and local explanations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 15719–15728.
[22] A. J. Barnett et al., "A case-based interpretable deep learning model for classification of mass lesions in digital mammography," Nature Mach. Intell., vol. 3, no. 12, pp. 1061–1070, Dec. 2021.
[23] J. E. Van Engelen and H. H. Hoos, "A survey on semi-supervised learning," Mach. Learn., vol. 109, no. 2, pp. 373–440, 2020.
[24] Y. Liu, Y. Tian, Y. Chen, F. Liu, V. Belagiannis, and G. Carneiro, "Perturbed and strict mean teachers for semi-supervised semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 4258–4267.
[25] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[26] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 33–45.
[27] Q. Liu, L. Yu, L. Luo, Q. Dou, and P. A. Heng, "Semi-supervised medical image classification with relation-driven self-ensembling model," IEEE Trans. Med. Imag., vol. 39, no. 11, pp. 3429–3440, Nov. 2020.
[28] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, "Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Shenzhen, China: Springer, 2019, pp. 605–613.
[29] Y. Wang et al., "Double-uncertainty weighted method for semi-supervised learning," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Lima, Peru: Springer, 2020, pp. 542–551.
[30] R. Zhao, X. Chen, Z. Chen, and S. Li, "Diagnosing glaucoma on imbalanced data with self-ensemble dual-curriculum learning," Med. Image Anal., vol. 75, Jan. 2022, Art. no. 102295.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[32] H. Pham, Z. Dai, Q. Xie, and Q. V. Le, "Meta pseudo labels," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 11557–11568.
[33] H. M. L. Frazer et al., "ADMANI: Annotated digital mammograms and associated non-image datasets," Radiol., Artif. Intell., vol. 5, no. 2, Mar. 2023, Art. no. e220072.
[34] H. Cai et al., "Breast microcalcification diagnosis using deep convolutional neural network from digital mammograms," Comput. Math. Methods Med., vol. 2019, pp. 1–10, Mar. 2019.
[35] S. Sotoudeh-Paima, A. Jodeiri, F. Hajizadeh, and H. Soltanian-Zadeh, "Multi-scale convolutional neural network for automated AMD classification using retinal OCT images," Comput. Biol. Med., vol. 144, May 2022, Art. no. 105368.
[36] B. H. Menze et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Trans. Med. Imag., vol. 34, no. 10, pp. 1993–2024, Oct. 2015.
[37] G. Patel and J. Dolz, "Weakly supervised segmentation with cross-modality equivariant constraints," Med. Image Anal., vol. 77, Apr. 2022, Art. no. 102374.
[38] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. ICML, 2019, pp. 6105–6114.
[39] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 32–46.
[40] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, "Deep multi-instance networks with sparse label assignment for whole mammogram classification," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Quebec, QC, Canada: Springer, 2017, pp. 603–611.
[41] Y. Shen et al., "An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization," Med. Image Anal., vol. 68, Feb. 2021, Art. no. 101908.
[42] B. Zhang et al., "FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 18408–18419.
[43] P. P. Srinivasan et al., "Fully automated detection of diabetic macular edema and dry age-related macular degeneration from optical coherence tomography images," Biomed. Opt. Exp., vol. 5, no. 10, pp. 3568–3577, 2014.
[44] D. S. Kermany et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122–1131.e9, Feb. 2018.
[45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[46] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, "Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 839–847.
[47] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, "Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12275–12284.
[48] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[49] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, "Weakly supervised object localization and detection: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5866–5885, Sep. 2022.
[50] H.-J. Oh, K. Lee, and W.-K. Jeong, "Scribble-supervised cell segmentation using multiscale contrastive regularization," in Proc. IEEE 19th Int. Symp. Biomed. Imag. (ISBI), Mar. 2022, pp. 1–5.
[51] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Munich, Germany: Springer, 2015, pp. 234–241.
[52] T. Lucas, P. Weinzaepfel, and G. Rogez, "Barely-supervised learning: Semi-supervised learning with very few labeled images," in Proc. AAAI Conf. Artif. Intell., vol. 36, 2022, pp. 1881–1889.
[53] S. Budd, E. C. Robinson, and B. Kainz, "A survey on active learning and human-in-the-loop deep learning for medical image analysis," Med. Image Anal., vol. 71, Jul. 2021, Art. no. 102062.
[54] V. Cheplygina, M. de Bruijne, and J. P. W. Pluim, "Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis," Med. Image Anal., vol. 54, pp. 280–296, May 2019.
[55] A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent., 2020, pp. 1–12.
[56] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[57] C. Wang et al., "Learning support and trivial prototypes for interpretable image classification," 2023, arXiv:2301.04011.