Hard Attention Net for Automatic Retinal Vessel Segmentation

Dongyi Wang, Ayman Haytham, Jessica Pottenburgh, Osamah Saeedi*, and Yang Tao*

Abstract—Automated retinal vessel segmentation is among the most significant application and research topics in ophthalmologic image analysis. Deep learning based retinal vessel segmentation models have attracted much attention in recent years. However, current deep network designs tend to focus predominantly on vessels which are easy to segment, while overlooking vessels which are more difficult to segment, such as thin vessels or those with uncertain boundaries. To address this critical gap, we propose a new end-to-end deep learning architecture for retinal vessel segmentation: hard attention net (HAnet). Our design is composed of three decoder networks: the first dynamically locates which image regions are "hard" or "easy" to analyze, while the other two segment retinal vessels in these "hard" and "easy" regions independently. We introduce attention mechanisms in the network to reinforce focus on image features in the "hard" regions. Finally, a final vessel segmentation map is generated by fusing all decoder outputs. To quantify the network's performance, we evaluate our model on four public fundus photography datasets (DRIVE, STARE, CHASE_DB1, HRF), two recently published color scanning laser ophthalmoscopy image datasets (IOSTAR, RC-SLO), and a self-collected indocyanine green angiography dataset. Compared to existing state-of-the-art models, the proposed architecture achieves better or comparable performance in segmentation accuracy, area under the receiver operating characteristic curve (AUC), and f1-score. To further gauge the generalization ability of our model, cross-dataset and cross-modality evaluations are conducted and demonstrate promising extendibility of our proposed network architecture.

Index Terms—deep learning, ophthalmology, retina vessel segmentation, fundus photography, scanning laser ophthalmoscopy

I. INTRODUCTION

Retinal vessel characteristics, such as vessel width, reflectivity, tortuosity, and branching features, are important biomarkers for many retinal and systemic diseases, including diabetic retinopathy [1], glaucoma [2, 3], macular degeneration [4], hypertension [5], and cardiovascular diseases [6]. Manual vessel segmentation and delineation is a subjective and time-consuming process, and thus automatic retinal blood vessel segmentation is a significant research topic in computer-aided ophthalmology [7]. Furthermore, retinal vessel segmentation is a necessary preprocessing step for some cellular-level ophthalmology research [8], and objective automatic vessel extraction can standardize the analysis for certain emerging imaging modalities [8, 9].

The retinal vessel extraction problem can be generalized as an image foreground and background segmentation problem, on which related research began three decades ago [10]. Early studies focused on traditional image processing and unsupervised machine learning methods, including edge/line detector methods [11], matched filter based methods [10], morphology based methods [12], region growth methods [13], wavelet based methods [14, 15], clustering based methods [16, 17], and active contour methods [18, 19]. However, in the absence of supervision and manually annotated information, all the aforementioned methods are prone to detecting false edges and are susceptible to errors under image illumination variations [20, 21].

With the advent of several publicly available retinal image databases [22, 23], supervised vessel segmentation methods have become increasingly appealing to researchers [24]. Compared to unsupervised methods, supervised methods facilitate increased detection accuracy by learning from human-annotated training datasets [24]. Prior to the revolutionary success of deep neural networks [25], supervised methods had been employed in two general steps: a handcrafted feature extraction step and a classification step [26]. Common handcrafted features included color intensity information [27], principal component information [28], wavelet responses [29, 30], and edge responses [31], which were mainly inherited from unsupervised methods. The classification step could apply any type of classifier, including nearest neighbor [22], support vector machine [31], perceptron [32], random forest [33], Gaussian mixture model [29], and hybrid methods [34]. The performance of these traditional supervised methods is highly dependent on the extracted vessel features, but selecting effective features is a tedious "trial and error" process. Since different imaging modalities have varying fields of view, focal lengths, and pixel resolutions [35, 36], these hand-crafted features must be well designed for their specific applications.

Recently, emerging data-driven deep convolutional neural networks offer new opportunities to automatically extract task-related image features [25], and have achieved great success in many image processing tasks, such as classification [25], detection [37], and segmentation [38]. For the task of retinal vessel segmentation, deep learning models were first used for central-point binary classification [39]. The input of the network was cropped into small image patches, and the network was designed to predict whether the central pixel of the patch was a vessel pixel. Variants of this idea utilized an ensemble of different feature extraction models to improve discrimination performance [40, 41].

This work was supported by a National Institutes of Health/National Eye Institute Career Development Award (K23 EY025014).
D. Wang and Y. Tao are with the Department of Bioengineering, University of Maryland at College Park, College Park, MD 20742, USA (e-mail: dywang@terpmail.umd.edu; ytao@umd.edu).
A. Haytham is with Aureus University School of Medicine, Wayaca 31C, Oranjestad, Aruba (e-mail: aymanhhaytham@gmail.com).
J. Pottenburgh and O. Saeedi are with the Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, 419 W Redwood Street, Suite 470, Baltimore, MD 21201, USA (e-mail: JPottenburgh@som.umaryland.edu; OSaeedi@som.umaryland.edu).
However, these central-pixel classification ideas overlooked information hidden in the label dependencies of neighboring pixels, which were expected to benefit model segmentation performance [42]. The fully convolutional network (FCN) offered an end-to-end solution to take advantage of this neighboring information for the task of structured prediction [38, 42]. However, max-pooling operations in FCNs sacrificed network localization accuracy, which yielded relatively low segmentation accuracy [43]. To target this problem, a U-net structure [43, 44] was designed based on the structure of the FCN, except that it could propagate contextual image feature information from its down-sampling part to its up-sampling part to account for network features at varying resolution levels. To further account for this variance, researchers fed U-net with multiple inputs [45], and proposed various methods to reuse context information in the network [46, 47]. Wang et al. proposed a dual encoder model with different block designs to preserve both spatial and contextual information [48]. Wu et al. introduced multiscale supervision in the network and offered end-to-end solutions to preserve multiscale information [49, 50]. Since most retinal segmentation studies used cropped image patches as the network input, improvements attained by the multi-scale methods were of little benefit, and the means of effectively reusing features inside the network was a "black-box" problem necessitating further study.

In an effort to achieve superior retinal vessel segmentation performance, researchers utilized the structural properties of vessels to guide network training. Li et al. introduced new vessel connection loss values to preserve vessel connectivity [51]. Jin et al. introduced a deformable convolution operation to model vessels with various shapes [52]. One study claims that multi-task training can improve segmentation results considering the feature differences between arteries and veins [53]. Yan et al. recognized an inherent data imbalance between thin and thick vessels influencing the network outcome, suggesting that training loss values would be dominated by thick vessels while overlooking thin vessels [26]. Therefore, their work manually pre-defined thick and thin vessels. Then, two separate models were applied to segment the annotated thick and thin vessels, and an additional network was designed to fuse the predicted thick and thin vessels. Although this model could effectively reconstruct both thick and thin vessels, it required the diameter of thin vessels to be manually predefined, which could not be easily transferred to imaging systems with different focal lengths or to different image modalities. Further, in practice, blurry and noisy vessel boundaries may also be dominated by thick vessels, and such regions are difficult to manually predefine.

With this said, the main contribution of this work is to propose a new end-to-end network design, namely, hard attention net (HAnet), which automatically focuses the network's attention on regions which are "hard" to segment. Briefly, the proposed network has a basic encoder and decoder network architecture, which is expected to yield a "coarse" vessel segmentation result. Based on the coarse segmentation probability map, the vessel regions which are "hard" or "easy" to segment are automatically determined. Then, two additional decoder networks are built upon the encoded high-level image features. One decoder is targeted to segment "hard" vessels while the other is targeted to segment "easy" vessels. Finally, a shallow U-net structure is fed with all previous decoder outputs along with the original image input to generate a final refined segmentation output. All sub-networks in HAnet are trained together by introducing joint loss functions. Two attention mechanisms are introduced in HAnet to effectively reinforce vessel features, especially for hard vessels. Moreover, the design of HAnet can be seamlessly integrated with other baseline networks and loss function modifications.

The remainder of this paper is organized as follows: Section II presents the design of the proposed network framework and its training procedure. Section III introduces the datasets and segmentation evaluation metrics used in this study. Section IV provides visual retinal segmentation results, quantitative evaluations, and comparisons with state-of-the-art algorithms. The applicability of our proposed model is also discussed based on cross-dataset and cross-modality experimental results, and ablation studies are included to discuss the effectiveness of different network components. Section V concludes the paper and discusses potential future work.

II. HARD ATTENTION NET (HANET)

A. Network Architecture

An overview architectural diagram of HAnet is shown in Fig. 1. The core idea of our design adheres to the principles of the U-net architecture. U-net is composed of an encoder sub-network to extract high-level image features, and a decoder sub-network built above these features. The decoder aims to map the image features to a corresponding segmentation output. U-net experimentally proved that sharing (concatenating) encoder information with the decoder is effective in promoting network segmentation accuracy [43].

As shown in Fig. 1, HAnet is equipped with an encoder (sub-network illustrated in orange), as in U-net, which has basic standardized convolution layers (3*3 kernel) and maxpooling layers to extract image features and down-sample images, respectively. Rather than using a single decoder, HAnet has three different decoders (sub-networks illustrated in light blue (A), green (B), and red (C)). Decoder A, along with the encoder, forms a basic U-net structure, which is expected to acquire a coarse segmentation result. Its output, A_O, is a probability map with the same size as the input image. Its value, A_O(x, y), represents how likely the input pixel at position (x, y) is to be a vessel pixel. If A_O(x, y) is close to 1, the network tends to classify the pixel (x, y) as a vessel pixel. Conversely, if A_O(x, y) is close to 0, the pixel (x, y) tends to be classified as a non-vessel pixel. However, if A_O(x, y) is close to 0.5, the basic U-net encoder and decoder structure yields an inaccurate segmentation prediction at (x, y). Utilizing this, and combining with the segmentation ground truth, GT, we define a bi-level threshold based on the A_O(x, y) value to determine the vascular areas that are hard or easy to segment. In this experiment, without any pre-knowledge, hard and easy region segmentation masks, Mask_hard and Mask_easy, respectively, are created: Mask_hard(x, y) = 1 if 0.25 < A_O(x, y) < 0.75 or A_O(x, y) is misclassified with a 0.5 threshold, and Mask_hard(x, y) = 0 otherwise; Mask_easy(x, y) = 1 - Mask_hard(x, y). Mask_hard and Mask_easy are dynamically generated during the network training process.
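As an illustration, the bi-level rule above can be written directly in tensor operations. The following is a minimal sketch assuming A_O and GT are float maps in [0, 1]; the function and variable names are ours, not taken from the authors' released code, and treating mask generation as a gradient-free labeling step is our assumption:

```python
import torch

def hard_easy_masks(a_o: torch.Tensor, gt: torch.Tensor):
    """Bi-level rule: a pixel is 'hard' if decoder A is uncertain
    (0.25 < A_O < 0.75) or wrong under a 0.5 cut; recomputed each step."""
    a_o = a_o.detach()                               # assumption: no gradient through the labeling
    uncertain = (a_o > 0.25) & (a_o < 0.75)          # uncertain predictions
    wrong = (a_o > 0.5).float() != gt                # misclassified at the 0.5 threshold
    mask_hard = (uncertain | wrong).float()
    mask_easy = 1.0 - mask_hard
    return mask_hard, mask_easy
```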
Based on the two masks, the segmentation ground truths for the hard region and the easy region are defined as GT_hard and GT_easy, where GT_hard = GT * Mask_hard and GT_easy = GT * Mask_easy. Based on the two dynamic ground truth maps, another two decoders (decoders B and C) are built above the encoded high-level image features for easy and hard region segmentation specifically. Their outputs are denoted B_O and C_O for decoders B and C, respectively. Finally, the original input, along with A_O, B_O, and C_O, is concatenated and fed into a shallow U-net structure to yield the final segmentation prediction, U_O.
Fig. 1. Architecture diagram of our proposed method: hard attention net. First, the input passes through an encoder network (orange) to extract high-level image features. The light blue decoder maps these image features to a coarse segmentation result. Based on the coarse segmentation result and a bi-level threshold, two masks are generated to define the regions which are hard or easy to segment. Then, another two decoder networks (green and red) are built to segment the "easy" and "hard" regions separately. Finally, an additional shallow U-net structure (purple) fuses the previous prediction results to achieve a refined segmentation output.
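The attention gates placed on the skip connections into decoders B and C (shown in the Fig. 1 legend) follow the additive design of [54], which the next paragraphs describe in detail. A minimal sketch of such a gate, with illustrative layer sizes and the assumption that both inputs are already at the same resolution:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate in the spirit of [54] (a sketch, not the authors' code)."""
    def __init__(self, enc_ch, dec_ch, inter_ch=16):
        super().__init__()
        self.proj_enc = nn.Conv2d(enc_ch, inter_ch, kernel_size=1)  # encoder-feature branch
        self.proj_dec = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)  # decoder-feature branch
        self.to_map = nn.Conv2d(inter_ch, 1, kernel_size=1)         # single-channel attention map

    def forward(self, enc_feat, dec_feat):
        # the two inputs pass through convolutions independently, then fuse additively
        fused = torch.relu(self.proj_enc(enc_feat) + self.proj_dec(dec_feat))
        attn = torch.sigmoid(self.to_map(fused))      # attention map in [0, 1]
        return enc_feat * attn                        # re-weighted encoder features, later
                                                      # concatenated into the decoder as in U-net
```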
There are two additional designs embedded in HAnet to focus the network's attention on vessel regions, and particularly on "hard" vessel regions. First, attention gates are introduced to automatically give more weight to encoded vessel feature responses when concatenating the encoded features to decoders B and C. The attention gates' design follows that in [54]. As shown in the legend at the bottom right of Fig. 1, the attention gate has two inputs: one from the encoded features and the other from the corresponding decoder layer. The decoder input aims to transmit object-related contextual segmentation information to the encoder layers to identify salient image regions. The two inputs first pass through a convolutional layer independently, and then the refined encoder and decoder features are fused together additively. Based on the fused features, another convolutional layer with a single output channel generates a final attention map with sigmoid activation. The output of this attention gate is equivalent to the input encoded image features weighted by the generated attention map, and the output is finally concatenated with the decoder layer, as in U-net. As illustrated in [54], the attention map weights can be dynamically adjusted during the backpropagation procedure. In HAnet, the attention gates in decoders B and C can automatically and specifically focus on "easy" and "hard" regions in the shared encoded feature maps.

The second attention mechanism is the difference map. Decoder A is trained based on GT for coarse segmentation, and thus its final feature map (the feature map before sigmoid activation), FM_A, includes image features for both easy and hard vessel regions. Decoder B is designed for easy region segmentation, and only image features for easy regions will be highlighted in its final feature map, FM_B. Therefore, as shown in Fig. 1, a difference map is defined as the difference between FM_A and FM_B. An indicator function, 1(A_O > 0.01), is used to exclude the background region based on the output of decoder A. The difference map is expected to automatically capture image features in the hard region, and can also be used to guide the training of the shallow U-net (Net-U), as denoted in Fig. 1. The difference map can be weighted by the attention map to further highlight image features for hard vessels, and is concatenated into the last convolutional layer of decoder C and of Net-U for better segmentation. Importantly, the difference map is repeated to match the channel number of the last convolutional layer (32, in the experiment) when training decoder C; it is directly concatenated into Net-U, which reduces network parameters without diminishing network performance.

B. Loss function

The training of HAnet is a multitask learning process, which has been experimentally proven to be an effective tool for improving image segmentation performance [51, 53, 55]. The tasks in HAnet include coarse vessel segmentation, hard region segmentation, easy region segmentation, and refined vessel segmentation. In the experiment, following [48], pixel-level binary cross entropy loss and Jaccard loss are used as the loss function for each task. For each decoder, the joint loss function is shown in (1):

L_i(x, y) = λ1_i BCE(x, y) + λ2_i Jaccard(x, y)    (1)

where BCE represents the binary cross entropy loss, defined as BCE(x, y) = -Σ_i [y_i log(x_i) + (1 - y_i) log(1 - x_i)], in which all pixels in the image are considered; y_i represents the ground truth label at position i (0/1 in our case), and x_i ∈ [0, 1] represents the predicted output at position i after sigmoid activation. Jaccard represents the Jaccard loss, defined as Jaccard(x, y) = 1 - |x ∩ y| / |x ∪ y|, which captures higher-order similarities between the predicted and ground truth segmentation maps [48]. L_i, i ∈ [c, e, h, r], represent the vessel segmentation losses for the coarse output, easy region output, hard region output, and final refined output, respectively. Specifically, L_c(A_O, GT), L_e(B_O, GT_easy), L_h(C_O, GT_hard), and L_r(U_O, GT) are computed.

In the experiment, an additional auxiliary output, Aux_O = sigmoid(FM_B + FM_C), and its corresponding loss function, L_aux(Aux_O, GT), are defined, where FM_B and FM_C are the final feature maps for easy and hard segmentation, respectively. This design ensures that the features from easy and hard segmentation can collectively describe vessel features. Meanwhile, during the training process, the easy region features are expected to help the network localize hard regions, since hard regions are usually thin vessel branches off large vessels and vague boundaries of large vessels.

The total loss function for HAnet is the sum of all losses of the sub-tasks, as shown in (2):

L = Σ_i a_i L_i,   i ∈ [c, e, h, r, aux]    (2)

In the experiment, all sub-tasks are weighted evenly, with a_i equal to 1 for all i ∈ [c, e, h, r, aux]. λ1_i are set as [0.25, 0.25, 0.2, 0.2, 0.25] and λ2_i are set as [0.01, 0.01, 0.05, 0.05, 0.01] for i = [c, e, h, r, aux]. The hyperparameters are fine-tuned on the DRIVE dataset [22] and directly applied to the other experimental datasets. One advantage of HAnet is that the different decoders are relatively independent, which allows training with different loss function modifications. The weights of the network are initialized with Kaiming initialization [56]. The network training and testing are implemented with the deep learning library PyTorch. The gradient descent algorithm is Adam [57], with the initial learning rate set to 3*10^-4 and betas = (0.9, 0.999). The training batch size is set as 10 with a single Nvidia GeForce Titan X GPU.
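A compact rendering of (1)-(2) and the optimizer settings above, assuming each output is a sigmoid probability map and each target is a binary map of the same shape (a sketch under those assumptions, not the released training script):

```python
import torch
import torch.nn.functional as F

def joint_loss(pred, target, lam_bce, lam_jac, eps=1e-6):
    """Eq. (1): weighted pixel-wise BCE plus a soft Jaccard term."""
    bce = F.binary_cross_entropy(pred, target, reduction='sum')  # sum over pixels, as written in the text
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    jaccard = 1.0 - inter / (union + eps)                        # soft |x ∩ y| / |x ∪ y|
    return lam_bce * bce + lam_jac * jaccard

# Eq. (2): total loss over the five sub-tasks, evenly weighted (a_i = 1)
LAM1 = dict(c=0.25, e=0.25, h=0.2, r=0.2, aux=0.25)
LAM2 = dict(c=0.01, e=0.01, h=0.05, r=0.05, aux=0.01)

def total_loss(outputs, targets):
    # outputs/targets: dicts keyed by 'c', 'e', 'h', 'r', 'aux'
    return sum(joint_loss(outputs[k], targets[k], LAM1[k], LAM2[k]) for k in LAM1)

# Optimizer settings reported in the paper:
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
```

The per-pixel sum in the BCE term follows the formula as printed; the released code may normalize differently.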
III. DATABASES AND METRICS

A. Databases

We test our proposed model on images from several different imaging datasets and modalities. To demonstrate the superiority of our algorithm, it is first tested on several public fundus imaging datasets: DRIVE [22], STARE [58], CHASE_DB1 [59], and HRF [60]. Digital retinal imaging by fundus photography is a standard method of documenting the appearance of the retina [7].

The DRIVE dataset [22] has 40 fundus images (565*584 pixels) with a 45° field of view (FOV), and has been split into a training set and a test set, each containing 20 images. The vessel segmentation ground truths of all images were labelled by two trained human observers, and the labels from the first observer were used for network training. Binary FOV region masks are also offered in the dataset. The STARE dataset consists of 20 color images (700*605 pixels) with a 35° field of view, with half of the images showing eyes with ocular pathology. In our experiment, a leave-one-out cross validation strategy is used for network training, and the FOV masks from [61] are used to evaluate the model. The CHASE_DB1 dataset [59] has images acquired from 28 eyes of 14 ten-year-old children. The images have a 30° FOV and 999*960 pixels. Since the images were captured in subdued lighting and the operators adjusted illumination settings to account for perceived over- or underexposure, the images in CHASE_DB1 show more illumination variation compared to the DRIVE and STARE datasets. In the experiment, as in other studies [26], the first 20 images (from 10 subjects) in the CHASE dataset are used for network training, and the remaining 8 images (from 4 subjects) are used for model testing. The HRF dataset [60] is comprised of 45 high-resolution fundus images (3504*2336 pixels) with a 60° FOV. Of the 45 images, 15 are from healthy patients, 15 are of eyes with diabetic retinopathy, and the remaining 15 are of glaucomatous eyes. In the experiment, the first 5 images from each of the aforementioned classes (totaling 15 images) are used for network training. The remaining 30 images are used for network model evaluation.

We also evaluate our model on scanning laser ophthalmoscopy (SLO) images. Compared to fundus images, SLO possesses the advantage of lower levels of light exposure and relatively high contrast due to the confocal design [62]. A laser with a specific wavelength is used as the light source to generate monochrome images in SLO. Therefore, SLO is the core instrument for several common retinal imaging modalities, including fluorescein angiography and ICG angiography. To acquire color SLO images, monochrome images obtained using lasers with blue, green, and/or red wavelengths can be combined to generate a pseudo-color image. In our experiment, two public SLO datasets are used to test HAnet: IOSTAR [63] and RC-SLO [14]. The IOSTAR dataset has 30 images with a resolution of 1024*1024 pixels and a 45° FOV. The first 20 images are used for network training and the remaining 10 images are used for network evaluation. The RC-SLO dataset contains 40 image patches covering a wide range of difficult cases with a resolution of 360*320 pixels. The first 30 images are used for network training, and the remaining for evaluation.

In addition to vessel structure studies, SLO based angiography methods can be used to study retinal vascular perfusion [64], which requires analysis of angiographic videos as opposed to individual image frames. To account for vessel contraction and eye movement, the mean frame of the registered video is usually used to determine the vessel regions of interest [8], with the drawback of more blurring. In our experiment, four mean ICG angiography images from four ten-second ICG videos are used for validating the network model with a leave-one-out cross validation strategy. The videos are from four subjects and were captured by a Heidelberg Retina Angiograph 2 (HRA2, Heidelberg Engineering, Heidelberg, Germany) at 24.6 frames per second with a 15°*7.5° FOV.

B. Image preprocessing and network parameters

In the image preprocessing step, the color fundus images and color SLO images mentioned above are first transformed to grayscale images. This allows the trained network model to be potentially transferred to other monochrome imaging modalities. Our other preprocessing steps follow those in [42]. The grayscale images are normalized to ensure each image has a mean gray value of zero and a standard deviation of one. To adjust image contrast, contrast limited adaptive histogram equalization and gamma adjustment are applied (gamma=1.2).

The inputs of the network are 128*128 pixel image patches. Utilizing PyTorch, the full-sized grayscale retinal images are first augmented by random flips, then input image patches are randomly cropped from the augmented full-sized image along with rotation, translation, and scaling for network training. In the test procedure, the patches are cropped from the raw full-sized images, starting from the first block, with a 16-pixel overlap both horizontally and vertically. The final prediction concatenates all patch predictions together, and the average prediction value is used as the final prediction value in overlap regions.

C. Evaluation metrics

As mentioned above, the output of HAnet is a probability map, which describes the likelihood of a pixel being a vessel. A binary segmentation result can be generated by setting the probability threshold to 0.5. Following the methods used in other work [26], the evaluation metrics used in our experiment are described below.

We denote N_TP as the number of correctly detected vessel (true positive) pixels, N_TN as the number of correctly detected non-vessel (true negative) pixels, N_FP as the number of incorrectly detected vessel (false positive) pixels, and N_FN as the number of incorrectly detected non-vessel (false negative) pixels. Sensitivity (Se), also known as the true positive rate, is defined as Se = N_TP / (N_TP + N_FN). Specificity (Sp), also known as the true negative rate, is defined as Sp = N_TN / (N_TN + N_FP). The vessel segmentation accuracy (Acc) is defined as Acc = (N_TN + N_TP) / (N_TP + N_FN + N_TN + N_FP). To consider how the segmentation results change with different probability thresholds, the area under the receiver operating characteristic (ROC) curve is also computed, denoted AUC. In addition to the metrics mentioned in [26], an f1-score is further computed to consider Se together with precision (Pr). The definition of the f1-score is f1 = (2 * Pr * Se) / (Pr + Se), where Pr = N_TP / (N_TP + N_FP).
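The preprocessing and test-time patch averaging of Section III-B above, together with the metrics just defined, translate into a few short utilities. The sketch below is an illustration under stated assumptions (the CLAHE clip limit and tile size are not reported in the paper; scikit-learn is used for the AUC), not the authors' evaluation code:

```python
import cv2
import numpy as np
from sklearn.metrics import roc_auc_score

def preprocess(gray):
    """Section III-B: per-image normalization, CLAHE, gamma = 1.2.
    CLAHE clip/tile values below are assumptions."""
    g = (gray.astype(np.float32) - gray.mean()) / (gray.std() + 1e-8)
    g = cv2.normalize(g, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    g = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(g)
    return np.power(g / 255.0, 1.2).astype(np.float32)        # gamma adjustment

def predict_full_image(model_fn, image, patch=128, overlap=16):
    """Tile with overlap, run the network per patch, average predictions in overlaps.
    Border handling is omitted for brevity."""
    h, w = image.shape
    acc = np.zeros((h, w), np.float32)
    cnt = np.zeros((h, w), np.float32)
    step = patch - overlap
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            acc[y:y + patch, x:x + patch] += model_fn(image[y:y + patch, x:x + patch])
            cnt[y:y + patch, x:x + patch] += 1.0
    return acc / np.maximum(cnt, 1.0)

def vessel_metrics(prob_map, gt, fov_mask=None, thr=0.5):
    """Section III-C: Se, Sp, Acc, Pr, f1 at a fixed threshold, plus threshold-free AUC."""
    if fov_mask is not None:                                   # evaluate inside the FOV only
        prob_map, gt = prob_map[fov_mask > 0], gt[fov_mask > 0]
    pred = (prob_map >= thr).astype(np.uint8)
    gt = (gt > 0).astype(np.uint8)
    tp = int(np.sum((pred == 1) & (gt == 1)))
    tn = int(np.sum((pred == 0) & (gt == 0)))
    fp = int(np.sum((pred == 1) & (gt == 0)))
    fn = int(np.sum((pred == 0) & (gt == 1)))
    se, sp = tp / (tp + fn), tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp)
    f1 = 2 * pr * se / (pr + se)
    auc = roc_auc_score(gt.ravel(), prob_map.ravel())
    return dict(Se=se, Sp=sp, Acc=acc, Pr=pr, f1=f1, AUC=auc)
```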
IV. RESULTS AND DISCUSSIONS

A. Visualization Results

In this section, we utilize an example image from DRIVE to visually appreciate the intermediate and final outputs of the network. Fig. 2(a) is the first image in the DRIVE test dataset. Fig. 2(b) is the image preprocessing result following the preprocessing rubrics mentioned in Section III-B. Fig. 2(c) is the corresponding manual annotation, GT. As mentioned earlier, image patches are randomly cropped from the preprocessed images and used as the network input, and they are equal in size to the output. To obtain a full-sized output, the output patches are concatenated together following the procedure mentioned in Section III-B. In this section, all result images are shown at their full size.

Fig. 2. (a) An example fundus image from the DRIVE test set. (b) Corresponding preprocessed image. (c) Corresponding manual annotation (segmentation ground truth).

The input patches are fed into the encoder network to extract high-level image features. Through decoder A, these image features are mapped to a segmentation probability map, A_O, as shown in Fig. 3(a). Based on the predefined rubrics, Mask_hard and Mask_easy are generated, as shown in Fig. 3(b). The red area is considered the hard region, and the black area is defined as the easy region. As mentioned above, the two masks are dynamically adjusted during the network training process based on the value of A_O. In Fig. 3(b), it may be appreciated that hard regions are mainly composed of small vessels and boundaries of large vessels, where the baseline segmentation model becomes deficient. Superimposing the two masks onto the vessel segmentation ground truth shown in Fig. 2(c), we acquire the easy vessel segmentation ground truth, GT_easy, and the hard segmentation ground truth, GT_hard. Fig. 3(c) shows the two ground truths in a single image: the green channel represents GT_easy, while the red channel represents GT_hard. Supervised by the two ground truths, the targets of decoder B and decoder C are to generate segmentation probability maps, B_O and C_O, for the easy and hard vessel regions, respectively. As mentioned, in decoder C, an attention gate [54] is introduced to automatically strengthen the features in the hard regions and weaken the features in the easy regions. Fig. 3(d) is the generated attention map and Fig. 3(e) is the generated difference map, after the network is well trained. Compared with Fig. 2(c), both the attention map and the difference map successfully reinforce features for hard regions while suppressing features for easy vessels. The attention map broadly highlights uncertain regions by weighing the corresponding features. The difference map specifically reinforces existing hard-region features based on the outputs from decoders A and B, as shown in Fig. 1.

Fig. 3. (a) Probability map, A_O, generated from decoder A. (b) Masks for regions which are "hard" (red) and "easy" (black) to segment based on the bi-level threshold of A_O. (c) Vessels which are "hard" to segment (red channel) and vessels which are "easy" to segment (green channel). (d) Attention map generated in decoder C which highlights the convolutional features in the "hard" region. (e) Generated difference map used to further highlight "hard" regions.

Fig. 4(a) and Fig. 4(b) display the segmentation results from decoders B and C. Their segmentation probability maps are fed into the shallow U-net structure (Net-U), together with the original image input and the probability map from decoder A. The probability maps are normalized to 0~1 to ensure the different inputs can be treated equally by Net-U. The final output of Net-U is shown in Fig. 4(c).

Fig. 4. (a) "Easy" region segmentation result from decoder B. (b) "Hard" region segmentation result from decoder C. (c) Final binary segmentation result from Net-U.

Compared with the baseline network architecture in the medical image segmentation domain (U-net [43]), our proposed model, HAnet, acquires better segmentation results, particularly for smaller and low-contrast vessels. Fig. 5 showcases this comparison. Fig. 5(a) is a raw fundus image from the DRIVE test dataset (also shown in Fig. 2(a)). Fig. 5(b) is a magnified portion of the raw image (blue box). Fig. 5(c) is the corresponding manual annotation (ground truth) for the cropped image patch. Fig. 5(d) is the segmentation output from U-net, and Fig. 5(e) is the segmentation output from HAnet. Compared to the baseline network, U-net, the "hard" vessels are better detected with HAnet. In the experiment, U-net is composed of an encoder and a decoder which have the same network parameters as the encoder and decoder A in HAnet.

Fig. 5. (a) The first fundus image from DRIVE. (b) Magnified view of the image patch cropped from the raw image (blue box). (c) Corresponding manual segmentation annotation of the image patch. (d) Segmentation output from U-net. (e) Segmentation output from the proposed HAnet.

To further visualize model segmentation performance across different datasets, an example test image from each dataset mentioned in Section III and its segmentation outputs are shown in Figs. 6-11. Each figure is composed of five sub-images: (a) raw image; (b) manual annotation; (c) easy region segmentation (output from decoder B); (d) hard region segmentation (output from decoder C); (e) final segmentation result (output from Net-U). As supported by these visual results, it may be inferred that HAnet can effectively delineate easy and hard regions and fuse them to achieve a final output.
Fig. 6. An example image from STARE (#0077). (a) Raw image. (b) Manual annotation. (c) Output from decoder B (easy region segmentation). (d) Output from decoder C (hard region segmentation). (e) Output from Net-U (final output).

Fig. 7. An example image from CHASE_DB1 (#13L). (a) Raw image. (b) Manual annotation. (c) Output from decoder B (easy region segmentation). (d) Output from decoder C (hard region segmentation). (e) Output from Net-U (final output).

Fig. 8. An example image from the IOSTAR dataset (#48_OSN). (a) Raw image. (b) Manual annotation. (c) Output from decoder B (easy region segmentation). (d) Output from decoder C (hard region segmentation). (e) Output from Net-U (final output).

Fig. 9. An example image from the RC-SLO dataset (#32_ODC_patch4). (a) Raw image. (b) Manual annotation. (c) Output from decoder B (easy region segmentation). (d) Output from decoder C (hard region segmentation). (e) Output from Net-U (final output).

Fig. 10. An example image from the HRF dataset (#14h). (a) Raw image. (b) Manual annotation. (c) Output from decoder B (easy region segmentation). (d) Output from decoder C (hard region segmentation). (e) Output from Net-U (final output).
Fig. 11. An example image from the self-collected ICGA dataset. (a) Raw image (grayscale values inverted). (b) Manual annotation. (c) Output from decoder B (easy region segmentation). (d) Output from decoder C (hard region segmentation). (e) Output from Net-U (final output).

B. Quantitative Results

This section presents quantitative analysis results to support the segmentation efficacy of HAnet. The evaluation metrics are described in Section III-C, and include the Se, Sp, Acc, AUC, and f1-score. In cognizance of the extensive and ongoing research in retinal vessel segmentation, we compare HAnet to the best performance achieved by other recent methods that share the same test situations as HAnet. Since 2015, retinal vessel segmentation research has shifted from conventional unsupervised and supervised learning methods to convolutional neural network based deep supervised learning models. Deep learning based models have displayed significant segmentation performance increases compared to conventional methods [39]. Tables I-VII show the statistical comparison results across the different datasets. The last row in each table is the result obtained by HAnet, and the best value obtained within each metric is bolded.

DRIVE, STARE, CHASE_DB1 and HRF are commonly used fundus image databases for retinal vessel segmentation model evaluation. DRIVE has been split into a training and a test dataset with a FOV mask given, thus facilitating the ability to compare statistical metrics across different models. For the other three datasets, there are varying data split strategies and FOV definitions. In the tables, only the results from studies which use the same evaluation data are included. Dataset split problems are not considered for the listed unsupervised models.

TABLE I
STATISTICAL COMPARISON RESULTS FOR THE DRIVE DATASET

Model          Se      Sp      Acc     AUC     f1
2nd human      0.7796  0.9717  0.9470  -       0.7910
[68] (2014)    0.7250  0.9830  0.9520  0.9620  -
[69] (2015)    0.7569  0.9816  0.9527  0.9738  -
[73]* (2016)   0.7603  -       0.9523  -       -
[39]* (2016)   0.7811  0.9807  0.9535  0.9790  -
[66]* (2017)   0.7897  0.9684  -       -       0.7857
[42]* (2017)   0.7691  0.9801  0.9533  0.9744  -
[67]* (2018)   0.7653  0.9818  0.9542  0.9752  -
[71]* (2018)   0.7792  0.9813  0.9556  0.9784  0.8171
[26]* (2018)   0.7631  0.9820  0.9538  0.9750  -
[49]* (2018)   0.7844  0.9807  0.9567  0.9819  -
[52]* (2019)   0.7963  0.9800  0.9566  0.9802  0.8237
[72]* (2019)   0.7891  0.9804  0.9561  0.9806  0.8249
[53]* (2019)   0.7916  0.9811  0.9570  0.9810  -
[48]* (2019)   0.7940  0.9816  0.9567  0.9772  0.8270
[50]* (2019)   0.8038  0.9802  0.9578  0.9821  -
HAnet*         0.7991  0.9813  0.9581  0.9823  0.8293

* denotes deep learning models. The definitions of Se, Sp, Acc, AUC, and f1 are given in Section III-C. The best values are bolded. (This note applies to Tables I-VII.)

TABLE II
STATISTICAL COMPARISON RESULTS FOR THE STARE DATASET

Model          Se      Sp      Acc     AUC     f1
2nd human      0.8952  0.9384  0.9349  -       0.7600
[65] (2015)    0.7320  0.9840  0.9560  0.9670  -
[14] (2016)    0.7791  0.9758  0.9554  0.9748  -
[66]* (2017)   0.7680  0.9738  -       -       0.7644
[67]* (2018)   0.7581  0.9846  0.9612  0.9801  -
[26]* (2018)   0.7735  0.9857  0.9638  0.9833  -
[52]* (2019)   0.7595  0.9878  0.9641  0.9832  0.8143
HAnet*         0.8186  0.9844  0.9673  0.9881  0.8379

TABLE III
STATISTICAL COMPARISON RESULTS FOR THE CHASE_DB1 DATASET

Model          Se      Sp      Acc     AUC     f1
2nd human      0.8315  0.9745  0.9615  -       0.7970
[34] (2012)    0.7224  0.9711  0.9469  0.9712  -
[68] (2014)    0.7201  0.9824  0.9530  0.9532  -
[65] (2015)    0.7615  0.9575  0.9467  0.9623  -
[69] (2015)    0.7507  0.9793  0.9581  0.9716  -
[66]* (2017)   0.7277  0.9712  -       -       0.7332
[70]* (2017)   0.8194  0.9739  0.9630  -       -
[67]* (2018)   0.7633  0.9809  0.9610  0.9781  -
[71]* (2018)   0.7756  0.9820  0.9634  0.9815  0.7928
[26]* (2018)   0.7641  0.9806  0.9607  0.9776  -
[49]* (2018)   0.7538  0.9847  0.9637  0.9825  -
[52]* (2019)   0.8155  0.9752  0.9610  0.9804  0.7883
[72]* (2019)   0.7888  0.9801  0.9627  0.9840  0.7983
[48]* (2019)   0.8074  0.9821  0.9661  0.9812  0.8037
[50]* (2019)   0.8132  0.9814  0.9661  0.9860  -
HAnet*         0.8239  0.9813  0.9670  0.9871  0.8191

TABLE IV
STATISTICAL COMPARISON RESULTS FOR THE HRF DATASET

Model          Se      Sp      Acc     AUC     f1
[66]* (2017)   0.7874  0.9584  -       -       0.7158
[74] (2017)    0.7490  0.9420  0.9410  0.9710  -
[67]* (2018)   0.7881  0.9592  0.9437  -       -
[52]* (2019)   0.7467  0.9874  0.9651  0.9831  -
HAnet*         0.7803  0.9843  0.9654  0.9837  0.8074

TABLE V
STATISTICAL COMPARISON RESULTS FOR THE IOSTAR DATASET

Model          Se      Sp      Acc     AUC     f1
[14] (2016)    0.7545  0.9740  0.9514  0.9615  -
[74] (2017)    0.7720  0.9670  0.9480  0.9600  -
[75]* (2017)   0.8038  0.9801  0.9695  0.9771  -
HAnet*         0.7538  0.9893  0.9652  0.9859  0.8161

TABLE VI
STATISTICAL COMPARISON RESULTS FOR THE RC-SLO DATASET

Model          Se      Sp      Acc     AUC     f1
[14] (2016)    0.7787  0.9710  0.9512  0.9626  -
HAnet*         0.8681  0.9797  0.9699  0.9911  0.8350

TABLE VII
STATISTICAL COMPARISON RESULTS FOR THE SELF-COLLECTED ICG ANGIOGRAPHY DATASET

Model          Se      Sp      Acc     AUC     f1
2nd human      0.8684  0.9884  0.9728  -       0.8927
Baseline*      0.8598  0.9840  0.9678  0.9869  0.8744
HAnet*         0.8753  0.9828  0.9688  0.9880  0.8797

The baseline network is U-net [43].
The two SLO datasets, IOSTAR and RC-SLO, have not been well studied with either conventional methods or deep learning methods since they were published in 2016 [14]. The results we obtained during our literature review are presented in Tables V and VI. The self-collected small ICGA dataset has not been previously published. Therefore, the performance of our model is compared with the baseline U-net and our second trained grader. The baseline U-net has an encoder and a decoder which have the same network parameter configuration as HAnet, and all other hyperparameters, including the learning rate and training batch size, are kept identical when training U-net.

The superiority of deep learning methods compared to the state-of-the-art unsupervised [14, 65, 74] and shallow learning methods [34, 68, 69] is shown in these tables. The handcrafted features in these models are inferior at describing vessels, and this is reflected in low vessel segmentation sensitivity values. Compared with other deep learning methods, HAnet decodes hard and easy regions independently, and benefits from attention mechanisms which allow more focus on vessels which are hard to segment. This is reflected in the large improvements in segmentation sensitivity and the excellent AUC values and f1-scores on almost all datasets. Compared with [26], instead of manually defining thin vessels beforehand, hard regions can be dynamically adjusted in HAnet, which means that the additional decoders can dynamically compensate for uncertain predictions in the baseline decoder (decoder A). Compared with [48, 71, 72], the independent decoders and attention mechanisms in HAnet make the feature flow inside the network easier to visualize and control. The idea of HAnet can also be easily integrated with different baseline networks [52] and loss functions [67], because the three decoders in HAnet can be any decoder structure with their own loss functions.

C. Cross dataset & cross modality evaluation

To quantify the generalization ability and extendibility of the proposed model, additional cross-dataset and cross-modality evaluations are presented in this section.

The first experiment we conduct is a cross-training evaluation on the DRIVE and STARE datasets, which has also been studied by other groups [26, 52]. We directly apply our well-trained deep learning model to the other dataset without retraining the model on the new dataset. Table VIII shows the statistical cross-training evaluation results based on HAnet. Results reported by other state-of-the-art models are also listed. Compared to DRIVE, STARE has fewer images with thin vessels, but more images with ocular pathology. Therefore, using the well-trained STARE model to evaluate the DRIVE dataset may lead to a relatively low sensitivity value because of a different network focus. This is also evidenced by the specificity value decrease when the experiment is conducted conversely (trained on DRIVE and tested on STARE). [52] utilized shape knowledge from the dataset to train its novel "deformable" convolution; this knowledge is difficult to transfer between datasets. Conversely, HAnet and [26], which both treat "hard" and "easy" regions separately, yield a higher overall segmentation accuracy and higher AUC values. Specifically, HAnet can dynamically focus on vascular intricacies that are common across the dataset images, such as thin vessels and vague vessel boundaries, thus allowing HAnet to achieve a high overall sensitivity.

TABLE VIII
CROSS-TRAINING EVALUATION BASED ON THE DRIVE AND STARE DATASETS (WITHOUT FINE-TUNING)

Model   Se      Sp      Acc     AUC
STARE (train) -> DRIVE (test)
[26]    0.7014  0.9802  0.9444  0.9568
[52]    0.6505  0.9914  0.9481  0.9718
HAnet   0.7140  0.9879  0.9530  0.9758
DRIVE (train) -> STARE (test)
[26]    0.7319  0.9840  0.9580  0.9678
[52]    0.7000  0.9759  0.9474  0.9571
HAnet   0.8187  0.9699  0.9543  0.9648

The definitions of Se, Sp, Acc, and AUC are given in Section III-C.

The second experiment we conduct is a cross-training evaluation across different imaging modalities. We apply the model trained on fundus photography images to other imaging modalities, such as color SLO and ICGA. There are more fundus photography datasets available relative to SLO datasets; however, SLO based modalities are gaining increasing importance in ophthalmologic research [76]. To the best of our knowledge, this is the first study to evaluate cross-training performance on images from different retinal imaging modalities. Table IX shows the cross-training evaluation results for datasets from SLO-based imaging modalities. The network model is pre-trained on DRIVE and shows competitive segmentation performance on the IOSTAR and RC-SLO datasets compared with the state-of-the-art unsupervised method [14]. On the ICGA dataset, the slight sensitivity drop may be attributable to the differing optic disc size in our self-collected ICGA images, in addition to the fact that the vessels passing through the disc and strong choroid vessel signals have not been well studied during DRIVE model training.

TABLE IX
CROSS-TRAINING EVALUATION (PRE-TRAINED ON DRIVE) FOR SLO DATASETS (WITHOUT FINE-TUNING)

Test Dataset   Se      Sp      Acc     AUC     f1
IOSTAR         0.7903  0.9757  0.9568  0.9763  0.7891
RC-SLO         0.9099  0.9625  0.9579  0.9873  0.7914
ICGA           0.7130  0.9936  0.9570  0.9824  0.8122

The definitions of Se, Sp, Acc, AUC, and f1 are given in Section III-C.

D. Ablation tests for attention mechanisms and thin vessel segmentation evaluation

In HAnet, several attention concepts are newly introduced to improve the network segmentation results, including the "hard" and "easy" regions, an attention gate (AG), and the difference map (DM). In this section, we design two ablation experiments to see how these different components contribute to our network.

First, we evaluate how the "hard" and "easy" regions benefit HAnet. In this experiment, four settings are considered. The baseline setting is plain U-net, as introduced in Section IV-A and Fig. 5, which only includes the encoder and decoder A of HAnet. Then, decoder B and decoder C are added separately to the baseline design, which means that the shallow Net-U only considers the "easy" or the "hard" regions. The last setting combines both decoders B and C, as in HAnet; however, the AG and DM are not included in this condition. Table X shows the statistical comparison results, and Fig. 12 shows visualization outputs from two example patches.
In Fig. 12(e), it is appreciable that solely introducing "hard" regions may diminish the network performance for some large vessels, but "hard" regions improve the segmentation performance for small and low-contrast vessels, as shown in Fig. 12(k). Compared with the U-net results, solely introducing "easy" regions improves vessel completeness, as shown in Fig. 12(d) and (j), but increases the tendency to overlook minute image features. As shown in Fig. 12(f) and (l), combining both "hard" and "easy" regions constructively blends the advantages of both, showing better performance for both large and small vessels, and achieves the best Acc and f1-score in this experiment.

TABLE X
ABLATION TESTS FOR "EASY" AND "HARD" REGIONS BASED ON THE DRIVE DATASET

Attention configs   Se      Sp      Acc     AUC     f1
U-net               0.7860  0.9824  0.9575  0.9814  0.8247
"Easy" only         0.7747  0.9839  0.9573  0.9814  0.8219
"Hard" only         0.7809  0.9835  0.9577  0.9818  0.8245
"Easy"+"hard"       0.7920  0.9820  0.9578  0.9817  0.8271

The definitions of Se, Sp, Acc, AUC, and f1 are given in Section III-C.

Fig. 12. Two example patches (from test images #12 and #8) for the ablation study of "easy" and "hard" regions; columns show the raw patch, GT, U-net, "easy" only, "hard" only, and "easy"+"hard" results. GT represents ground truth.

The second ablation study considers the influence of the AG and DM. In this experiment, the baseline setting considers only the "easy" and "hard" regions, as above, and is defined as the no-attention setting. Then, the AG and DM are added individually. The final design is the proposed HAnet. To better quantify the performance, this study only considers thin vessels (<3 pixels) and their neighboring regions (5-pixel range), as in [67]. Following the above study, an Otsu threshold is selected for binarization. The comparison results are listed in Table XI, and Fig. 13 shows two visualization results. Fig. 13(a) shows an example in which introducing the attention mechanisms better segments low-contrast vessels compared to a network that only fuses "easy" and "hard" regions, which is reflected by a higher Se value. However, solely introducing the AG or DM can also lead to more false positive predictions, as shown in Fig. 13(g). HAnet fuses the advantages of the AG and DM, which were discussed in Section IV-A, to better balance the Se, Sp and Pr values, and to achieve the best Acc and f1-score.

TABLE XI
ABLATION TESTS FOR ATTENTION MECHANISMS BASED ON THE THIN VESSELS OF THE DRIVE DATASET

Attention configs   Se      Sp      Acc     Pr      f1
[67]                0.7567  0.9158  0.8778  0.7449  -
No attention        0.7612  0.9269  0.8870  0.7675  0.7643
AG only             0.7788  0.9192  0.8854  0.7535  0.7659
DM only             0.7747  0.9205  0.8854  0.7554  0.7649
HAnet               0.7716  0.9238  0.8872  0.7626  0.7671

The definitions of Se, Sp, Acc, Pr, and f1 are given in Section III-C.

Fig. 13. Two example patches (from test images #1 and #14) for the ablation study of attention mechanisms; columns show the raw patch, GT, no-attention, AG only, DM only, and HAnet results. GT represents ground truth.
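One way the thin-vessel evaluation region and the Otsu binarization used in this ablation might be realized is sketched below; the exact construction in [67] may differ, so the width-from-skeleton estimate and the 5-pixel dilation radius here are our assumptions:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_dilation
from skimage.morphology import skeletonize
from skimage.filters import threshold_otsu

def thin_vessel_region(gt, max_width=3, neighbor_px=5):
    """Mask of ground-truth vessels thinner than `max_width` pixels plus a 5-pixel neighborhood."""
    gt = gt > 0
    skel = skeletonize(gt)
    width = 2.0 * distance_transform_edt(gt)      # local vessel width estimate (assumption)
    thin_skel = skel & (width < max_width)
    # grow the thin-vessel centerlines into a surrounding evaluation region
    return binary_dilation(thin_skel, iterations=neighbor_px)

def otsu_binarize(prob_map):
    """Binarize a probability map with an Otsu threshold, as in this ablation study."""
    return prob_map >= threshold_otsu(prob_map)
```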
V. CONCLUSION AND FUTURE WORK

This research proposes a new deep neural network architecture for retinal blood vessel segmentation, namely, hard attention net (HAnet). It is composed of one encoder and three decoder sub-networks. The encoder is designed to extract high-level image features. The first decoder network generates a coarse vessel segmentation result which can be used to dynamically locate the vascular regions which are hard or easy to segment. Based on the two dynamic regions, the other two decoders are designed specifically for hard and easy region segmentation. Additional attention mechanisms are introduced in HAnet, which effectively reinforce the shared high-level image features in vessel regions. Finally, the outputs from the decoders are fused together and fed into a shallow network to acquire the final segmentation output.

The network segmentation performance is evaluated on four benchmark fundus photography image datasets, DRIVE, STARE, CHASE_DB1 and HRF, two recently published color SLO datasets, IOSTAR and RC-SLO, and a self-collected small ICGA dataset. Compared with other state-of-the-art conventional and deep learning based segmentation models, HAnet achieves better or comparable performance on all the aforementioned datasets. Further, the design of HAnet can be easily integrated with other baseline networks and loss function modifications.

This research also conducts an extensive cross-dataset and cross-modality evaluation. In general, the well-trained HAnet exhibits state-of-the-art cross-training performance on the two fundus photography datasets, DRIVE and STARE. Meanwhile, HAnet demonstrates good transferability from fundus photography to color SLO and ICGA images. In ICGA images, vessels passing through the optic disc and strong choroid vessel signals may influence the segmentation results, and this will be our direction for further study.

ACKNOWLEDGMENT

D. Wang would like to thank Heidelberg Engineering GmbH for providing equipment and research support for ICG-A image acquisition for this project.

REFERENCES

[1] R. J. Winder, P. J. Morrow, I. N. McRitchie, J. Bailie, and P. M. Hart, "Algorithms for digital image processing in diabetic retinopathy," Computerized Medical Imaging and Graphics, vol. 33, pp. 608-622, 2009.
[2] P. Mitchell, H. Leung, J. J. Wang, E. Rochtchina, A. J. Lee, T. Y. Wong, et al., "Retinal vessel diameter and open-angle glaucoma: the Blue Mountains Eye Study," Ophthalmology, vol. 112, pp. 245-250, 2005.
[3] J. B. Jonas, X. N. Nguyen, and G. Naumann, "Parapapillary retinal vessel diameter in normal and glaucoma eyes. I. Morphometric data," Investigative Ophthalmology & Visual Science, vol. 30, pp. 1599-1603, 1989.
[4] L. A. Yannuzzi, S. Negrão, I. Tomohiro, C. Carvalho, H. Rodriguez-Coleman, J. Slakter, et al., "Retinal angiomatous proliferation in age-related macular degeneration," Retina, vol. 32, pp. 416-434, 2012.
[5] M. K. Ikram, J. C. Witteman, J. R. Vingerling, M. M. Breteler, A. Hofman, and P. T. de Jong, "Retinal vessel diameters and risk of hypertension: the Rotterdam Study," Hypertension, vol. 47, pp. 189-194, 2006.
[6] O. Gishti, V. W. Jaddoe, J. F. Felix, C. C. Klaver, A. Hofman, T. Y. Wong, et al., "Retinal microvasculature and cardiovascular health in childhood," Pediatrics, vol. 135, pp. 678-685, 2015.
[7] S. Akbar, M. Sharif, M. U. Akram, T. Saba, T. Mahmood, and M. Kolivand, "Automated techniques for blood vessels segmentation through fundus retinal images: A review," Microscopy Research and Technique, vol. 82, pp. 153-170, 2019.
[8] D. Wang, A. Haytham, L. Mayo, Y. Tao, and O. Saeedi, "Automated retinal microvascular velocimetry based on erythrocyte mediated angiography," Biomedical Optics Express, vol. 10, pp. 3681-3697, 2019.
[9] R. Flower, E. Peiretti, M. Magnani, L. Rossi, S. Serafini, Z. Gryczynski, et al., "Observation of erythrocyte dynamics in the retinal capillaries and choriocapillaris using ICG-loaded erythrocyte ghost cells," Investigative Ophthalmology & Visual Science, vol. 49, pp. 5510-5516, 2008.
[10] N. Katz, M. Nelson, M. Goldbaum, S. Chaudhuri, and S. Chatterjee, "Detection of blood vessels in retinal images using two-dimensional matched filters," IEEE Transactions on Medical Imaging, vol. 8, pp. 263-269, 1989.
[11] N. Katz, M. Goldbaum, M. Nelson, and S. Chaudhuri, "An image processing system for automatic retina diagnosis," in Three-Dimensional Imaging and Remote Sensing Imaging, 1988, pp. 131-137.
[12] T. Spencer, J. A. Olson, K. C. McHardy, P. F. Sharp, and J. V. Forrester, "An image-processing strategy for the segmentation and quantification of microaneurysms in fluorescein angiograms of the ocular fundus," Computers and Biomedical Research, vol. 29, pp. 284-302, 1996.
[13] M. E. Martínez-Pérez, A. D. Hughes, A. V. Stanton, S. A. Thom, A. A. Bharath, and K. H. Parker, "Retinal blood vessel segmentation by means of scale-space analysis and region growing," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 1999, pp. 90-97.
[14] J. Zhang, B. Dashtbozorg, E. Bekkers, J. P. Pluim, R. Duits, and B. M. ter Haar Romeny, "Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores," IEEE Transactions on Medical Imaging, vol. 35, pp. 2631-2644, 2016.
[15] J. J. Leandro, J. Cesar, and H. F. Jelinek, "Blood vessels segmentation in retina: Preliminary assessment of the mathematical morphology and of the wavelet transform techniques," in Proceedings XIV Brazilian Symposium on Computer Graphics and Image Processing, 2001, pp. 84-90.
[16] A. Bhuiyan, B. Nath, J. Chua, and R. Kotagiri, "Blood vessel segmentation from color retinal images using unsupervised texture classification," in 2007 IEEE International Conference on Image Processing, 2007, pp. V-521-V-524.
[17] N. M. Salem, S. A. Salem, and A. K. Nandi, "Segmentation of retinal blood vessels based on analysis of the Hessian matrix and clustering algorithm," in 2007 15th European Signal Processing Conference, 2007, pp. 428-432.
[18] B. Al-Diri, A. Hunter, and D. Steel, "An active contour model for segmenting and measuring retinal vessels," IEEE Transactions on Medical Imaging, vol. 28, pp. 1488-1497, 2009.
[19] Y. Zhao, L. Rada, K. Chen, S. P. Harding, and Y. Zheng, "Automated vessel segmentation using infinite perimeter active contour model with hybrid region information with application to retinal images," IEEE Transactions on Medical Imaging, vol. 34, pp. 1797-1807, 2015.
[20] A. M. Mendonca and A. Campilho, "Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction," IEEE Transactions on Medical Imaging, vol. 25, pp. 1200-1213, 2006.
[21] R. M. Rangayyan, F. Oloumi, F. Oloumi, P. Eshghzadeh-Zanjani, and F. J. Ayres, "Detection of blood vessels in the retina using Gabor filters," in 2007 Canadian Conference on Electrical and Computer Engineering, 2007, pp. 717-720.
[22] J. Staal, M. D. Abràmoff, M. Niemeijer, M. A. Viergever, and B. van Ginneken, "Ridge-based vessel segmentation in color images of the retina," IEEE Transactions on Medical Imaging, vol. 23, pp. 501-509, 2004.
[23] A. Hoover, V. Kouznetsova, and M. Goldbaum, "Locating blood vessels in retinal images by piece-wise threshold probing of a matched filter response," in Proceedings of the AMIA Symposium, 1998, p. 931.
[24] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, et al., "Blood vessel segmentation methodologies in retinal images - a survey," Computer Methods and Programs in Biomedicine, vol. 108, pp. 407-433, 2012.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[26] Z. Yan, X. Yang, and K.-T. T. Cheng, "A three-stage deep learning model for accurate retinal vessel segmentation," IEEE Journal of Biomedical and Health Informatics, 2018.
[27] M. Niemeijer, J. Staal, B. van Ginneken, M. Loog, and M. D. Abramoff, "Comparative study of retinal vessel segmentation methods on a new publicly available database," in Medical Imaging 2004: Image Processing, 2004, pp. 648-656.
[28] C. Sinthanayothin, J. F. Boyce, H. L. Cook, and T. H. Williamson, "Automated localisation of the optic disc, fovea, and retinal blood vessels from digital colour fundus images," British Journal of Ophthalmology, vol. 83, pp. 902-910, 1999.
[29] J. V. Soares, J. J. Leandro, R. M. Cesar, H. F. Jelinek, and M. J. Cree, "Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification," IEEE Transactions on Medical Imaging, vol. 25, pp. 1214-1222, 2006.
[30] R. M. Rangayyan, F. J. Ayres, F. Oloumi, F. Oloumi, and P. Eshghzadeh-Zanjani, "Detection of blood vessels in the retina with multiscale Gabor filters," Journal of Electronic Imaging, vol. 17, p. 023018, 2008.
[31] E. Ricci and R. Perfetti, "Retinal blood vessel segmentation using line operators and support vector classification," IEEE Transactions on Medical Imaging, vol. 26, pp. 1357-1365, 2007.
[32] S. W. Franklin and S. E. Rajan, "Retinal vessel segmentation employing ANN technique by Gabor and moment invariants-based features," Applied Soft Computing, vol. 22, pp. 94-100, 2014.
[33] J. Zhang, Y. Chen, E. Bekkers, M. Wang, B. Dashtbozorg, and B. M. ter Haar Romeny, "Retinal vessel delineation using a brain-inspired wavelet transform and random forest," Pattern Recognition, vol. 69, pp. 107-123, 2017.
[34] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, et al., "An ensemble classification-based approach applied to retinal blood vessel segmentation," IEEE Transactions on Biomedical Engineering, vol. 59, pp. 2538-2548, 2012.
[35] M. D. Abràmoff, M. K. Garvin, and M. Sonka, "Retinal imaging and image analysis," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 169-208, 2010.
[36] B. M. ter Haar Romeny, E. J. Bekkers, J. Zhang, S. Abbasi-Sureshjani, F. Huang, R. Duits, et al., "Brain-inspired algorithms for retinal image analysis," Machine Vision and Applications, vol. 27, pp. 1117-1135, 2016.
[37] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[38] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[39] P. Liskowski and K. Krawiec, "Segmenting retinal blood vessels with deep neural networks," IEEE Transactions on Medical Imaging, vol. 35, pp. 2369-2380, 2016.
[40] A. Lahiri, A. G. Roy, D. Sheet, and P. K. Biswas, "Deep neural ensemble for retinal vessel segmentation in fundus images towards achieving label-free angiography," in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 1340-1343.
[41] D. Maji, A. Santara, P. Mitra, and D. Sheet, "Ensemble of deep convolutional neural networks for learning to detect retinal vessels in fundus images," arXiv preprint arXiv:1603.04833, 2016.
[42] A. Dasgupta and S. Singh, "A fully convolutional neural network based structured prediction approach towards the retinal vessel segmentation," in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), 2017, pp. 248-251.
[43] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234-241.
[44] W. Xiancheng, L. Wei, M. Bingyi, J. He, Z. Jiang, W. Xu, et al., "Retina blood vessel segmentation using a U-Net based convolutional neural network," in Procedia Computer Science: International Conference on Data Science (ICDS 2018), Beijing, China, 2018, pp. 8-9.
[45] K. Yue, B. Zou, Z. Chen, and Q. Liu, "Retinal vessel segmentation using dense U-Net with multiscale inputs," Journal of Medical Imaging, vol. 6, p. 034004, 2019.
[46] P. M. Samuel and T. Veeramalai, "Multilevel and multiscale deep neural network for retinal blood vessel segmentation," Symmetry, vol. 11, p. 946, 2019.
[47] K. Hu, Z. Zhang, X. Niu, Y. Zhang, C. Cao, F. Xiao, et al., "Retinal vessel segmentation of color fundus images using multiscale convolutional neural network with an improved cross-entropy loss function," Neurocomputing, vol. 309, pp. 179-191, 2018.
[48] B. Wang, S. Qiu, and H. He, "Dual encoding U-Net for retinal vessel segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019, pp. 84-92.
[49] Y. Wu, Y. Xia, Y. Song, Y. Zhang, and W. Cai, "Multiscale network followed network model for retinal vessel segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018, pp. 119-126.
[50] Y. Wu, Y. Xia, Y. Song, D. Zhang, D. Liu, C. Zhang, et al., "Vessel-Net: Retinal vessel segmentation under multi-path supervision," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019, pp. 264-272.
[51] R. Li, M. Li, and J. Li, "Connection sensitive attention U-Net for accurate retinal vessel segmentation," arXiv preprint arXiv:1903.05558, 2019.
[52] Q. Jin, Z. Meng, T. D. Pham, Q. Chen, L. Wei, and R. Su, "DUNet: A deformable network for retinal vessel segmentation," Knowledge-Based Systems, vol. 178, pp. 149-162, 2019.
[53] W. Ma, S. Yu, K. Ma, J. Wang, X. Ding, and Y. Zheng, "Multi-task neural networks with spatial activation for retinal vessel segmentation and artery/vein classification," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019, pp. 769-778.
[54] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, et al., "Attention U-Net: Learning where to look for the pancreas," arXiv preprint arXiv:1804.03999, 2018.
[55] H. Chen, X. Qi, L. Yu, and P.-A. Heng, "DCAN: Deep contour-aware networks for accurate gland segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2487-2496.
[56] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[57] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[58] A. Hoover, V. Kouznetsova, and M. Goldbaum, "Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response," IEEE Transactions on Medical Imaging, vol. 19, pp. 203-210, 2000.
[59] C. G. Owen, A. R. Rudnicka, R. Mullen, S. A. Barman, D. Monekosso, P. H. Whincup, et al., "Measuring retinal vessel tortuosity in 10-year-old children: validation of the computer-assisted image analysis of the retina (CAIAR) program," Investigative Ophthalmology & Visual Science, vol. 50, pp. 2004-2010, 2009.
[60] A. Budai, R. Bock, A. Maier, J. Hornegger, and G. Michelson, "Robust vessel segmentation in fundus images," International Journal of Biomedical Imaging, vol. 2013, 2013.
[61] D. Marín, A. Aquino, M. E. Gegúndez-Arias, and J. M. Bravo, "A new supervised method for blood vessel segmentation in retinal images by using gray-level and moment invariants-based features," IEEE Transactions on Medical Imaging, vol. 30, pp. 146-158, 2010.
[62] F. LaRocca, D. Nankivil, S. Farsiu, and J. A. Izatt, "True color scanning laser ophthalmoscopy and optical coherence tomography handheld probe," Biomedical Optics Express, vol. 5, pp. 3204-3216, 2014.
[63] S. Abbasi-Sureshjani, I. Smit-Ockeloen, J. Zhang, and B. T. H. Romeny, "Biologically-inspired supervised vasculature segmentation in SLO retinal fundus images," in International Conference Image Analysis and Recognition, 2015, pp. 325-334.
[64] T. A. Ciulla, A. Harris, and B. J. Martin, "Ocular perfusion and age-related macular degeneration," Acta Ophthalmologica Scandinavica, vol. 79, pp. 108-115, 2001.
[65] S. Roychowdhury, D. D. Koozekanani, and K. K. Parhi, "Iterative vessel segmentation of fundus images," IEEE Transactions on Biomedical Engineering, vol. 62, pp. 1738-1749, 2015.
[66] J. I. Orlando, E. Prokofyeva, and M. B. Blaschko, "A discriminatively trained fully connected conditional random field model for blood vessel segmentation in fundus images," IEEE Transactions on Biomedical Engineering, vol. 64, pp. 16-27, 2016.
[67] Z. Yan, X. Yang, and K.-T. Cheng, "Joint segment-level and pixel-wise losses for deep learning based retinal vessel segmentation," IEEE Transactions on Biomedical Engineering, vol. 65, pp. 1912-1923, 2018.
[68] S. Roychowdhury, D. D. Koozekanani, and K. K. Parhi, "Blood vessel segmentation of fundus images by major vessel extraction and subimage classification," IEEE Journal of Biomedical and Health Informatics, vol. 19, pp. 1118-1128, 2014.
[69] Q. Li, B. Feng, L. Xie, P. Liang, H. Zhang, and T. Wang, "A cross-modality learning approach for vessel segmentation in retinal images," IEEE Transactions on Medical Imaging, vol. 35, pp. 109-118, 2015.
[70] Y. M. Kassim and K. Palaniappan, "Extracting retinal vascular networks using deep learning architecture," in 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2017, pp. 1170-1174.
[71] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, "Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation," arXiv preprint arXiv:1802.06955, 2018.
[72] S. Guo, K. Wang, H. Kang, Y. Zhang, Y. Gao, and T. Li, "BTS-DSN: Deeply supervised neural network with short connections for retinal vessel segmentation," International Journal of Medical Informatics, vol. 126, pp. 105-113, 2019.
[73] H. Fu, Y. Xu, S. Lin, D. W. K. Wong, and J. Liu, "DeepVessel: Retinal vessel segmentation via deep learning and conditional random field," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 132-139.
[74] Y. Zhao, Y. Zheng, Y. Liu, Y. Zhao, L. Luo, S. Yang, et al., "Automatic 2-D/3-D vessel enhancement in multiple modality images using a weighted symmetry filter," IEEE Transactions on Medical Imaging, vol. 37, pp. 438-450, 2017.
[75] M. I. Meyer, P. Costa, A. Galdran, A. M. Mendonça, and A. Campilho, "A deep neural network for vessel segmentation of scanning laser ophthalmoscopy images," in International Conference Image Analysis and Recognition, 2017, pp. 507-515.
[76] P. F. Sharp, A. Manivannan, H. Xu, and J. V. Forrester, "The scanning laser ophthalmoscope: a review of its role in bioscience and medicine," Physics in Medicine & Biology, vol. 49, p. 1085, 2004.