

2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019)
Venice, Italy, April 8-11, 2019

DEEP LEARNING FOR WEAK SUPERVISION OF DIABETIC RETINOPATHY ABNORMALITIES

Maroof Ahmad, Nikhil Kasukurthi, Harshit Pande

SigTuple Technologies Pvt. Ltd., Bengaluru, India

ABSTRACT

Deep learning-based grading of the fundus images of the retina is an active area of research. Various existing studies use different deep learning architectures on different datasets. Results of some of the studies could not be replicated in other studies. Thus a benchmarking study across multiple architectures, spanning both classification and localization, is needed. We present a comparative study of different state-of-the-art architectures trained on a proprietary dataset and tested on the publicly available Messidor-2 dataset. Although evidence is of utmost importance in AI-based medical diagnosis, most studies limit themselves to the classification performance and do not report a quantification of the performance of abnormality localization. To alleviate this, using class activation maps, we also report a comparison of localization scores for the different architectures. For classification, we found that as the number of parameters increases, the models perform better, with NASNet yielding the highest accuracy and average precision, recall, and F1-scores of around 95%. For localization, VGG19 outperformed all the models with a mean Intersection over Minimum of 0.45. We also found that there is a trade-off between classification performance and localization performance. As the models get deeper, their receptive field increases, causing them to perform well on classification but underperform on the localization of fine-grained abnormalities.

Index Terms— Diabetic retinopathy, Messidor-2, Abnormality localization, Class activation maps

1. INTRODUCTION

Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, with an estimated 347 million [1] diabetics worldwide. Globally, 1.9% of moderate to severe vision loss and 2.6% of total blindness are caused by DR [2]. DR is broadly classified into two categories – non-referable and referable DR. Referable DR pertains to more than mild DR severity. There are many granular features like microaneurysms (MA), hard exudates (HE), and superficial hemorrhages, along with higher-level features like new vessels on disc (NVD) or new vessels elsewhere (NVE) and pre-retinal or vitreous hemorrhages, that an ophthalmologist takes into account while viewing a fundus image. For this study, we have classified the images into three classes based on their severity – non-referable DR (NRDR), non-proliferative DR (NPDR), and proliferative DR (PDR).

Fig. 1: Diabetic Retinopathy abnormalities

Automated detection of DR is an active area of research among computer vision researchers. Gulshan et al. [3] use an InceptionV3 [4] model pre-trained on ImageNet [5]. For training, they use a proprietary EyePACS-1 dataset and report the performance on the Messidor-2 [6, 7] dataset. A replication attempt conducted by Voets et al. [8] on the Kaggle DR dataset, which is a different version of the EyePACS-1 dataset, reports inability to replicate the original results by Gulshan et al. [3]. Voets et al. [8] report an area under the receiver operating characteristic curve (AUC) of 0.94 on Kaggle EyePACS and 0.80 on Messidor-2, while Gulshan et al. [3] report an AUC of 0.99 on both EyePACS and Messidor-2.

Deep learning models have yielded state-of-the-art results in classification tasks, but a deeper understanding of the reasoning behind a certain prediction made by the model has not been much studied. The work by Zhou et al. [9] has played a crucial role in localization using class activation maps (CAMs) through only weak label annotations on data, using the global average pooling (GAP) layer.



For DR, Gargeya et al. [10] use heatmaps for interpreting the output of the model through the GAP layer. For better coverage of the region of interest, Wang et al. [11] fused multiple CAMs generated from different scales of input images. They report that these fused CAMs accurately capture the abnormalities in most cases, but their study lacks quantification of the localization performance.

This motivated us to benchmark different deep learning architectures on a common dataset to gain insights into the models that are apt for DR classification. Since evidence is of utmost importance in medical diagnosis, we evaluate all the models on their localization capabilities for the various abnormalities which are considered by medical experts for diagnosis.

Using transfer learning of models pre-trained on ImageNet [5], we train and evaluate VGG16, VGG19 [12], InceptionV3 [4], InceptionResNetV2 [13], Xception [14], DenseNet121 [15], ResNet50 [16], and NASNet [17] for classification and localization through classification.

2. DATASET AND PRE-PROCESSING

A proprietary dataset of 10,274 images, captured through multiple different cameras and collected from Narayana Netralaya (NN), C.L. Gupta Eye Institute (CLGEI), and Sai Retinal Foundation (SRF) in India, is used for training. It is annotated into 3 classes, namely referable non-proliferative DR (NPDR), non-referable DR (NRDR), and proliferative DR (PDR), by a panel of 3 annotators having more than 5 years of experience in the retina sub-specialty. The ground truth label for an image is assigned if two or more annotators give it the same label. The images with disagreement among all three annotators are discarded and not used in training. The final distribution of the training data is given in Table 1.

Label                          Count
Non-Proliferative DR (NPDR)     3564
Non-Referable DR (NRDR)         4630
Proliferative DR (PDR)          2080

Table 1: Dataset distribution used for training

The results of all the models are reported on the Messidor-2 dataset annotated by a panel of 2 annotators. The dataset has 1748 images, out of which consensus is achieved for 1327 images for the above-mentioned 3 classes. The class distribution is given in Table 2. The abnormalities pertaining to DR are annotated by a single annotator to evaluate the localization capabilities of the models. Only 240 images belonging to the abnormal classes are annotated. The distribution of abnormalities in these 240 images is shown in Table 3.

Label                          Count
Non-Proliferative DR (NPDR)      218
Non-Referable DR (NRDR)         1087
Proliferative DR (PDR)            22

Table 2: Distribution of Messidor-2 annotations: the test data for reporting performance

Label                          Count
Deep-Haemorrhage                 262
Drusen                             2
Fovea                              3
Hard-Exudate                     287
Microaneurysm                    218
NVD/NVE/Fibrosis                  22
Others                            31
Scar                             347
Soft Exudate                      63
Subhyaloid-Haemorrhage            27
Superficial-Haemorrhage          920

Table 3: Messidor-2 abnormalities

As a part of the pre-processing steps, a background image, Ib, is created by applying a Gaussian blur with a standard deviation of 30 to a fundus image Ic. The edge image, Ie, is created by subtracting Ib from Ic. The edge image Ie is resized to 512×512 using bi-cubic interpolation. The image is rescaled to the range of 0 to 1. During training, random rotations in the range of 0°–360°, random horizontal and vertical flips, and a zoom range of [0.8, 1.2] are applied. To compensate for the varying illumination of images, gamma correction with a gamma range of [0.3, 1.6] and a gain of 1.0 is applied with a probability of 0.5. Since there is a skew towards the NPDR class, PDR is class balanced by repeating images randomly to ensure that the numbers of images in both abnormal classes are comparable. The dataset in Table 1 is split into 85% training and 15% validation in a stratified manner for each class.
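A minimal sketch of this pipeline, assuming OpenCV and NumPy for the background subtraction and the Keras ImageDataGenerator for the geometric augmentations; the helper names and the min-max mapping to [0, 1] are our assumptions, since the exact implementation is not spelled out in the paper:

```python
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def background_subtract(img):
    """I_e = I_c - I_b, where I_b is a Gaussian blur (sigma = 30) of the fundus image I_c."""
    img = img.astype(np.float32)
    background = cv2.GaussianBlur(img, (0, 0), sigmaX=30)
    edge = img - background
    # Resize to 512x512 with bi-cubic interpolation, then map to [0, 1]
    # (min-max scaling is an assumption; the paper only states "rescaled to range of 0 to 1").
    edge = cv2.resize(edge, (512, 512), interpolation=cv2.INTER_CUBIC)
    return (edge - edge.min()) / (edge.max() - edge.min() + 1e-8)

def random_gamma(img, prob=0.5, gamma_range=(0.3, 1.6)):
    """Gamma correction with gain 1.0, applied with probability 0.5.
    Assumes the image is already in [0, 1] (output of background_subtract)."""
    if np.random.rand() < prob:
        gamma = np.random.uniform(*gamma_range)
        img = np.clip(img, 0.0, 1.0) ** gamma
    return img

# Rotation, flips, and zoom as described above; gamma correction is applied
# on each generated image via preprocessing_function.
train_datagen = ImageDataGenerator(
    rotation_range=360,
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=[0.8, 1.2],
    preprocessing_function=random_gamma,
)
```

The PDR over-sampling and the stratified 85/15 split are not shown; they would be applied when assembling the arrays or directories fed to this generator.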

3. BENCHMARKING METHODOLOGY

Transfer learning through Keras with models pre-trained on ImageNet [5] is used to train VGG16, VGG19 [12], InceptionV3 [4], InceptionResNetV2 [13], Xception [14], DenseNet121 [15], ResNet50 [16], and NASNet [17]. All the models are trained and tested on a machine with a 12-core CPU, 110 GB RAM, and an Nvidia K80 GPU.

For all the models, the flattening layer after the last convolutional layer is replaced with a global average pooling (GAP) layer to reduce the number of parameters and to aid localization via class activation maps, as done by Zhou et al. [9]. Adam with a learning rate of 5e-5 and a batch size of 16, with categorical cross-entropy as the loss function, is used. While training, the best model is saved on the basis of minimum validation loss. The models are evaluated on precision, recall, and F1-scores of all three classes for classification benchmarking.
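A rough sketch of this setup for one backbone, assuming the Keras applications API; the 512×512 input size follows Section 2, and the single softmax layer after GAP, the checkpoint filename, and the CAM helper are our assumptions rather than details from the paper:

```python
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import VGG19

NUM_CLASSES = 3  # NRDR, NPDR, PDR

# ImageNet-pre-trained backbone without its fully connected head.
base = VGG19(weights="imagenet", include_top=False, input_shape=(512, 512, 3))

# The flattening layer after the last convolutional block is replaced with
# global average pooling so that class activation maps can be derived [9].
gap = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(gap)
model = models.Model(base.input, outputs)

model.compile(
    optimizer=optimizers.Adam(learning_rate=5e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Keep only the weights with the minimum validation loss.
checkpoint = callbacks.ModelCheckpoint("best.h5", monitor="val_loss", save_best_only=True)
# model.fit(train_gen, validation_data=val_gen, epochs=..., callbacks=[checkpoint])
# (the batch size of 16 is set in the data generators)

def class_activation_map(image, class_idx):
    """CAM for one class: the feature maps feeding the GAP layer, weighted by the
    softmax weights of that class (Zhou et al. [9])."""
    conv_model = models.Model(model.input, base.output)
    feature_maps = conv_model.predict(image[None, ...])[0]           # (h, w, k)
    class_weights = model.layers[-1].get_weights()[0][:, class_idx]  # (k,)
    return feature_maps @ class_weights                              # (h, w)
```

The same GAP-plus-softmax head would be attached to each of the other backbones; only the base model changes.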
To validate the localization capabilities of the classification models, following Wang et al. [18], we compute the mean Intersection over Minimum area (Equation 1) with respect to each type of abnormality. The ground truth contours are created from the boundary of the abnormalities marked by the annotators, and the predicted contours are generated from the boundary of the CAMs thresholded at a value of 1.5 × mean(CAM).
\[
\mathrm{mIoM} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{mIoM}_i \tag{1}
\]

where,

mIoM_i is the mean Intersection over Minimum area for the i-th predicted contour

n is the total number of predicted contours

\[
\mathrm{mIoM}_i = \frac{1}{|T|} \sum_{t \in T} \mathbb{1}\!\left\{ \frac{\mathrm{area}(P_i \cap G)}{\min\big(\mathrm{area}(P_i),\, \mathrm{area}(G)\big)} > t \right\} \tag{2}
\]

where,

T = {0.15, 0.35, ..., 0.75} is the set of thresholds

P_i is the i-th predicted contour by the model

G is the ground truth contour by the annotator for a specific abnormality

𝟙{·} is the indicator random variable

area(x) is the area of the contour x
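The sketch below, assuming NumPy and OpenCV (≥ 4.x) and hypothetical helper names, illustrates how Equations 1 and 2 could be evaluated: the CAM is thresholded at 1.5 × mean(CAM), each resulting contour becomes a predicted mask, and its overlap with the annotator's mask is scored over the threshold set T (assumed here to step by 0.2).

```python
import cv2
import numpy as np

THRESHOLDS = (0.15, 0.35, 0.55, 0.75)  # T in Equation 2, assuming a step of 0.2

def predicted_contour_masks(cam, out_hw):
    """Predicted contours: connected regions of the CAM thresholded at 1.5 * mean(CAM)."""
    cam = cv2.resize(cam.astype(np.float32), (out_hw[1], out_hw[0]))
    binary = (cam > 1.5 * cam.mean()).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    masks = []
    for contour in contours:
        mask = np.zeros(out_hw, np.uint8)
        cv2.drawContours(mask, [contour], -1, color=1, thickness=cv2.FILLED)
        masks.append(mask.astype(bool))
    return masks

def miom(pred_masks, gt_mask, thresholds=THRESHOLDS):
    """Mean Intersection over Minimum area (Equations 1 and 2) for one abnormality."""
    gt_area = gt_mask.sum()
    scores = []
    for pred in pred_masks:
        inter = np.logical_and(pred, gt_mask).sum()
        # area(P_i ∩ G) / min(area(P_i), area(G))
        iom = inter / max(min(pred.sum(), gt_area), 1)
        # Equation 2: fraction of thresholds t in T that the IoM exceeds.
        scores.append(np.mean([iom > t for t in thresholds]))
    # Equation 1: average over the n predicted contours.
    return float(np.mean(scores)) if scores else 0.0
```

The per-abnormality numbers in Table 5 would then presumably be aggregates of these scores over all annotated images containing that abnormality.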
4. RESULTS

For classification, as reported in Table 4, NASNet performs the best in terms of average precision, recall, and F1-score. However, the number of parameters in NASNet is much higher as compared to that of the other models.

As reported in Table 5, if the mean IoM area is considered, VGG19 performs the best for localization on almost all the abnormalities. Microaneurysms have the lowest score for all the models due to their small apparent size of roughly 40 µm in diameter. Hard exudates are localized best by VGG19, with a mean IoM of 0.65, as they have very distinct features in terms of color and appear in clusters as compared to the other abnormalities. Deep hemorrhages and superficial hemorrhages have similar scores since they have distinguishable features. Exceedingly high mean IoMs are not achieved for any of the abnormalities, which might be because a deep learning model does not necessarily consider all the abnormality regions to classify an image into its corresponding class and only learns the most discriminating inter-class features. For instance, given a dog image, a network can classify it by considering only the head and ignoring the legs or tail.

(a) Inception-ResNet-V2  (b) VGG19

Fig. 2: The heatmap generated by Inception-ResNet-V2 (2a) is slightly shifted due to its large receptive field, as can be observed from the red contours in comparison with the ground truth in blue contours. On the other hand, VGG19 (2b) has precise contours.

When compared across models, there is a high variance in the results for localization (Table 5), as the models have different receptive fields. The models with a higher receptive field, like InceptionV3, are able to effectively localize larger and more evident features like neovascularization on the disc (NVD), whereas for fine-grained abnormalities like deep hemorrhages, VGG16, with a lower receptive field, performs better. Inception-ResNet-V2 has the worst localization performance due to the increased depth of the architecture.

5. CONCLUSION

This benchmarking study addresses the need for a missing comparative analysis of the classification performance of multiple state-of-the-art deep learning architectures. Additionally, we also fill the gap in existing studies with regard to comparative analysis of the localization performance at the granular level of specific abnormalities. Localization of abnormalities is critical to evidence-based medical imaging. This study also revealed a trade-off between the performance of classification and the performance of abnormality localization across models. For classification performance, the NASNet architecture, which has the highest number of parameters, performs the best with 95% accuracy. On the other hand, VGG16, with about six-fold fewer parameters, achieves an accuracy only slightly lower than that of NASNet. VGG19, with less than a quarter of the parameters of NASNet, delivers the best localization performance, with a mean Intersection over Minimum of 0.45 when averaged over all the abnormalities. On the other hand, the best classification model, NASNet, only achieves a mean Intersection over Minimum of 0.23. Overall, across the different architectures, classification accuracy varies in the small range of 0.90 to 0.95, while there is a huge variation in the average localization score, in the range of 0.20 to 0.45.

                       NPDR              NRDR              PDR               Weighted Average
Model                  P     R     F1    P     R     F1    P     R     F1    P     R     F1    A     Pr
DenseNet-121           0.70  0.84  0.76  0.97  0.92  0.94  0.45  0.86  0.59  0.92  0.90  0.91  0.90   7,040,579
Inception-ResNetV2     0.76  0.88  0.82  0.98  0.95  0.96  0.79  0.86  0.83  0.94  0.93  0.94  0.94  54,341,347
Inception-V3           0.83  0.79  0.82  0.97  0.97  0.97  0.62  0.95  0.75  0.94  0.94  0.94  0.93  21,808,931
NASNet                 0.83  0.87  0.85  0.97  0.96  0.97  0.77  0.91  0.83  0.95  0.95  0.95  0.95  84,928,917
ResNet50               0.68  0.89  0.77  0.98  0.90  0.94  0.40  0.86  0.55  0.92  0.90  0.90  0.90  23,593,859
VGG16                  0.83  0.84  0.84  0.98  0.96  0.97  0.56  0.82  0.67  0.95  0.94  0.94  0.94  14,716,227
VGG19                  0.78  0.87  0.82  0.98  0.95  0.96  0.59  0.73  0.65  0.94  0.93  0.94  0.93  20,025,923
Xception               0.73  0.91  0.81  0.99  0.92  0.95  0.51  0.91  0.66  0.94  0.92  0.92  0.92  20,867,627

Table 4: Classification metrics: Precision (P), Recall (R), F1-score (F1), Accuracy (A), number of parameters (Pr); Non-Proliferative DR (NPDR), Non-Referable DR (NRDR), Proliferative DR (PDR)

Model                  Deep-H  Hard-    Micro-     NVD-NVE-  Others  Scar  Soft-    Subhyaloid-  Superficial-  wt. Avg
                               Exudate  aneurysm   Fibrosis                Exudate  H            H
DenseNet-121           0.24    0.56     0.12       0.50      0.37    0.36  0.45     0.35         0.30          0.33
Inception-ResNetV2     0.19    0.27     0.07       0.37      0.32    0.20  0.28     0.10         0.19          0.20
Inception-V3           0.23    0.42     0.10       0.52      0.14    0.24  0.23     0.28         0.23          0.25
NASNet                 0.18    0.53     0.14       0.43      0.28    0.17  0.17     0.42         0.18          0.23
ResNet50               0.20    0.47     0.07       0.40      0.21    0.30  0.42     0.27         0.26          0.27
VGG16                  0.42    0.56     0.13       0.26      0.41    0.55  0.48     0.50         0.43          0.44
VGG19                  0.42    0.65     0.14       0.32      0.49    0.46  0.51     0.51         0.46          0.45
Xception               0.30    0.51     0.12       0.51      0.21    0.27  0.22     0.25         0.28          0.29

Table 5: mIoM of abnormalities localization; H - Hemorrhage; wt. Avg - Weighted Average; support in Table 3

(a) DenseNet121  (b) Inception-ResNetV2  (c) InceptionV3  (d) NASNet
(e) ResNet50  (f) VGG16  (g) VGG19  (h) Xception

Fig. 3: Green contours are marked by the annotator and white contours are predicted by the models

6. REFERENCES

[1] Sebahat Atalikoğlu Başkan and Mehtap Tan, "Research of type 2 diabetes patients' problem areas and affecting factors," Journal of Diabetes Mellitus, vol. 07, no. 03, pp. 175–183, 2017.

[2] Rupert R. A. Bourne, Gretchen A. Stevens, Richard A. White, Jennifer L. Smith, Seth R. Flaxman, Holly Price, Jost B. Jonas, Jill Keeffe, Janet Leasher, Kovin Naidoo, Konrad Pesudovs, Serge Resnikoff, and Hugh R. Taylor, "Causes of vision loss worldwide, 1990–2010: a systematic analysis," The Lancet Global Health, vol. 1, no. 6, pp. e339–e349, Dec. 2013.

[3] V. Gulshan, L. Peng, M. Coram, et al., "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.

[4] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015.

[5] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[6] Etienne Decencière, Xiwei Zhang, Guy Cazuguel, Bruno Lay, Béatrice Cochener, Caroline Trone, Philippe Gain, Richard Ordonez, Pascale Massin, Ali Erginay, Béatrice Charton, and Jean-Claude Klein, "Feedback on a publicly distributed image database: the Messidor database," Image Analysis & Stereology, vol. 33, no. 3, pp. 231, Aug. 2014.

[7] G. Quellec, M. Lamard, P.M. Josselin, G. Cazuguel, B. Cochener, and C. Roux, "Optimal wavelet transform for the detection of microaneurysms in retina photographs," IEEE Transactions on Medical Imaging, vol. 27, no. 9, pp. 1230–1241, Sep. 2008.

[8] Mike Voets, Kajsa Møllersen, and Lars Ailo Bongo, "Replication study: Development and validation of deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," CoRR, vol. abs/1803.04337, 2018.

[9] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in CVPR, 2016.

[10] Rishab Gargeya and Theodore Leng, "Automated identification of diabetic retinopathy using deep learning," Ophthalmology, vol. 124, no. 7, pp. 962–969, Jul. 2017.

[11] Zhiguang Wang and Jianbo Yang, "Diabetic retinopathy detection via deep convolutional networks for discriminative localization and visual explanation," in AAAI Workshops, 2018.

[12] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.

[13] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke, "Inception-v4, inception-resnet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016.

[14] François Chollet, "Xception: Deep learning with depthwise separable convolutions," CoRR, vol. abs/1610.02357, 2016.

[15] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger, "Densely connected convolutional networks," CoRR, vol. abs/1608.06993, 2016.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.

[17] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le, "Learning transferable architectures for scalable image recognition," CoRR, vol. abs/1707.07012, 2017.

[18] Zhe Wang, Yanxin Yin, Jianping Shi, Wei Fang, Hongsheng Li, and Xiaogang Wang, "Zoom-in-net: Deep mining lesions for diabetic retinopathy detection," CoRR, vol. abs/1706.04372, 2017.

