
Pixel-Based Iris and Pupil Segmentation in Cataract Surgery Videos Using Mask R-CNN

Natalia Sokolova¹, Mario Taschwer¹, Stephanie Sarny²,³, Doris Putzgruber-Adamitsch² and Klaus Schoeffmann¹
¹ Klagenfurt University, Austria
² Klinikum Klagenfurt, Austria
³ Medical University Graz, Austria

ABSTRACT

Automatically detecting clinically relevant events in surgery video recordings is becoming increasingly important for documentary, educational, and scientific purposes in the medical domain. From a medical image analysis perspective, such events need to be treated individually and associated with specific visible objects or regions. In the field of cataract surgery (lens replacement in the human eye), pupil reaction (dilation or constriction) during surgery may lead to complications and hence represents a clinically relevant event. Its detection requires automatic segmentation and measurement of pupil and iris in recorded video frames. In this work, we contribute to research on pupil and iris segmentation methods by (1) providing a dataset of 82 annotated images for training and evaluating suitable machine learning algorithms, and (2) applying the Mask R-CNN algorithm to this problem, which, in contrast to existing techniques for pupil segmentation, predicts free-form pixel-accurate segmentation masks for iris and pupil. The proposed approach achieves consistently high segmentation accuracies on several metrics while delivering an acceptable prediction efficiency, establishing a promising basis for further segmentation and event detection approaches on eye surgery videos.

Index Terms: object segmentation, cataract surgery videos, Mask R-CNN, deep learning

1. INTRODUCTION

Cataract surgery is the replacement of the eye's natural lens with an artificial lens, and it is one of the most common surgical operations performed worldwide. Because of the rather small size of the operated organ, surgeons have to work with tiny instruments under a microscope. Such microscopes are equipped with a camera, and the video output signal can be recorded and stored in an archive.

Cataract surgery videos are useful for educational and documentary purposes in hospitals, but they may also be analyzed automatically to detect the occurrence of adverse events or complications during postoperative analysis. One such event that may lead to complications is a pupil reaction, i.e. dilation or constriction of the pupil during the surgical procedure. Automatic detection and measurement of pupil and iris in cataract surgery videos is, therefore, a necessary preprocessing step of such automated analysis.

Most of the existing techniques that have already addressed this problem in the medical domain search for the outer circles of iris and pupil in the frame. However, the eye moves during surgery, and instruments cover parts of it during essential moments. It is also not always possible to find such circles in the frames, for example, if the visible part of the pupil or iris does not look round because of the instruments in use. In addition, the edge detection algorithms that are mostly used to find those circles do not always work well because of eye movements and focus conditions. Pixel-wise segmentation detects the iris and pupil more precisely and differentiates their areas from, for example, surgical instruments.

In this work, we investigate pixel-accurate pupil and iris segmentation by a region-based convolutional neural network (Mask R-CNN), which, to the best of our knowledge, has not been applied to this problem before. We evaluate the performance of Mask R-CNN with different backbone networks on a manually annotated image dataset collected from different cataract surgeries. The backbone networks were initialized with weights from COCO [1].

The contribution of this paper is two-fold: (1) we provide a dataset¹ of 82 annotated images for pixel-accurate pupil and iris segmentation that can be used to train and evaluate machine learning algorithms; and (2) we train and evaluate the Mask R-CNN algorithm on this dataset and achieve high accuracy on several metrics, establishing a promising approach for further segmentation tasks on eye surgery videos.

2. RELATED WORK

The problems of recognition, localization and tracking of eyes in images and videos have already been studied in the research community. This topic was addressed in terms of different use case scenarios. For example, iris segmentation was used in biometric identification approaches [2], [3], [4] to localize the iris for further biometric analysis.

*This work was funded by the FWF Austrian Science Fund under grant P 31486-N31.
¹ http://ftp.itec.aau.at/datasets/ovid/iris_pupil_seg/

978-1-7281-7401-3/20/$31.00 ©2020 IEEE
Authorized licensed use limited to: Universitas Airlangga. Downloaded on January 08, 2024 at 10:38:01 UTC from IEEE Xplore. Restrictions apply.
Iris segmentation was also applied to some medical imaging tasks. For example, McConnon et al. [5] evaluated the effect of different diseases on iris segmentation. This approach was also applied to diagnose diseases such as pterygium [6], strabismus [7] and cataract [8] in images, and even to compare eyes before and after cataract surgery [9].

Iris and pupil localization has also been applied over the last years to improve video analysis in the medical domain for different purposes. In particular, it was used to assist surgeons by introducing an eye gaze tracking control interface for the surgical system in [10], based on iris segmentation.

Some authors have already used segmentation methods for pupil and iris in cataract surgery videos. For Lalys et al. [11], the goal was to localize exactly where the iris and pupil are, to use this as a preprocessing step later. The images from cataract surgery videos were first filtered and then analyzed using color histograms. The authors first defined personal color characteristics for every patient and then thresholds in the obtained histograms to distinguish the boundaries between pupil, iris and sclera. As a result, the authors obtained for every image a binary mask segmenting the iris from the sclera. This method, as the authors also mentioned, was not able to completely overcome the problems of lighting conditions and used medical instruments, and it requires an individual setup for every analyzed cataract surgery video.

Even more progress was made in the research works [12] and [13]. There the main idea was to find the pupil area to use it as a region of interest for further instrument localization and classification. The images were first transformed to the YUV color space, and then, under the assumption that the color of the pupil is different from everything else in the frame, a binary mask was obtained. From this mask the most probable pupil circle was retrieved using the Hough transform [14]. To decrease the fault rate, pixel and circle center counting were used.

One more pupil localization approach was presented in [15]. This localization was used later as a preprocessing step of an operation workflow retrieval in [16] to normalize the cataract surgery videos. For this purpose, the center of the pupil was found in the frames using Canny edge detection and the Hough transform, in order to implement eye tracking and to further determine a region of interest.

Also, some analyses have already been done in this area. For example, the authors of [17] aimed to solve the problem of iris registration under both rigid and nonrigid deformations for use in refractive cataract surgery. They split the task of iris and pupil segmentation into two parts. First, they used a random sample consensus algorithm to find the most probable circles of the pupil, which were afterwards combined into an elliptical fit. To find the boundary between iris and sclera, a circular splines algorithm was developed in this work. It searches for the circle outside of the pupil which has the highest gradient value between its inner and outer areas.

However, none of these approaches used pixel-based segmentation, which would allow us to localize pupil and iris and accurately segment them from surgical instruments and other tissues in a sufficiently precise way for further automated analysis.

3. MATERIAL AND METHODS

3.1. Collected dataset

For this work we used 35 cataract surgery videos, collected at Klinikum Klagenfurt with the following parameters:
• a resolution of 540×720 pixels;
• a frame rate of 25 fps;
• recorded during the years 2017–2018.

These videos vary in terms of zoom level, eye color, brightness and blur levels. We extracted 82 frames from these videos; 51 of them also contain surgical tools. The frames were obtained from different phases of cataract surgery. Therefore, the images contain different surgical tools and show the eyes in different states, for example, with natural or artificial lenses, or even without any lens.

All of the frames were manually annotated with free-form regions of iris and pupil. Iris annotations do not include the pupil area. Figures 1, 2 and 3 show examples of the original frames (a), as well as corresponding annotations (b) and predictions (c). All annotations were created and stored in the MS COCO format using the COCO Annotator tool [18]. The dataset was randomly split into training, validation and test subsets with a ratio of 70:15:15.

3.2. Experimental setup

As segmentation and classification algorithm, we use Mask R-CNN [19], which is known to work well for general object segmentation. This region-based convolutional neural network is an extension of Faster R-CNN [20] with an additional branch to predict object masks. For every detected object, Mask R-CNN outputs a binary segmentation mask, a bounding box, a class category and a confidence value.

In our evaluations we use two different backbone networks for Mask R-CNN: ResNet-50 and ResNet-101 [21]. Both models were initialized with COCO weights, which allows us to train our network on the rather small dataset. As shown in Table I, training with random weights results in much lower performance. We experiment with different hyperparameters and achieve the best validation results with the following values:
• learning rate of 0.002;
• momentum of 0.9;
• batch size of 1;
• 100 epochs;
• 5 validation steps.

For model selection, we use the training epoch with the lowest validation loss. For ResNet-50 and ResNet-101 this was epoch 52 and 73, respectively. We used the Mask R-CNN implementation provided by [22] with TensorFlow as a backend and an NVIDIA TITAN RTX GPU with 24 GB of RAM. This implementation does not work with occluded object annotations or with annotated regions enclosing holes (iris without pupil), so we had to slightly adapt it.

4. RESULTS AND DISCUSSION

Segmentation performance is evaluated using Intersection over Union (IoU), which is the intersection of the ground truth segment with the predicted segment divided by

Fig. 1: Initial (a), annotated (b) and predicted (c) frames without any surgical tool.

Fig. 2: Initial (a), annotated (b) and predicted (c) frames with a secondary incision knife.

Fig. 3: Initial (a), annotated (b) and predicted (c) frames with the lowest iris IoU value.

the union of both of them. Using different IoU thresholds to determine true positives, we compute mean average precision (mAP) as in the original Mask R-CNN paper [19]. The value mAP50:95 represents the average of mAP values with IoU thresholds ranging between 50% and 95% in steps of 5%.

During training we discovered that the results for the network with all layers trained are slightly better than when only the head layers were trained. We also found that the bounding box computed from the predicted segmentation mask is more precise than the bounding box output of the Mask R-CNN algorithm. We therefore use these computed bounding boxes for the evaluation of bounding box segmentation.

Both Mask R-CNN models (with ResNet-50 and ResNet-101 backbone networks, respectively) lead to similar evaluation results, both for free-form mask segmentation (Table I) and bounding box segmentation (Table II). For mask segmentation, all of the iris test examples are recognized (mAP = 1) with an IoU threshold of 75% and 80% by the ResNet-101 and ResNet-50 models, respectively. Both models achieve identical results for pupil mask segmentation, where all test examples are recognized with at least 90% IoU. For bounding box segmentation, both models correctly recognized all iris and pupil test examples with at least 85% IoU (mAP85 = 1 in Table II).

To interpret these results further, we look at the test example recognized with the lowest IoU value. According to the minimal IoU thresholds X with mAPX < 1 in Tables I and II, this occurred for mask segmentation of an iris test example recognized by the ResNet-101 model (mAP75 = 1 and mAP80 < 1). The actual IoU value for the predicted iris mask is 78.4%, and the corresponding test example is shown in Figure 3 (the iris corresponds to the red area). Apparently, the predicted iris mask does not look too bad and may still be usable for further processing or analysis. The reason for the low IoU is that the area of this iris is rather small and has quite long external and internal borderlines compared to the other iris examples. In this case even small shortcomings in the predicted region contribute significantly to the IoU value. At the same time, the pupil IoU value for mask segmentation in the same frame is 91.47% (cyan area in Figure 3c).

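The evaluation protocol described above can be sketched in a few lines. This is a minimal illustration assuming binary NumPy masks and exactly one ground-truth instance per class and image, so average precision at a single threshold reduces to the fraction of recognized test examples; the helper names are ours, not taken from the paper's code:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union of two binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / union) if union > 0 else 0.0

def mean_ap(ious, thresholds=np.arange(0.50, 0.96, 0.05)):
    """mAP over IoU thresholds 50%..95% in steps of 5% (mAP50:95).

    With one instance per class and image, AP at a single threshold
    is the fraction of test examples whose IoU reaches that threshold.
    """
    aps = [np.mean([v >= t for v in ious]) for t in thresholds]
    return float(np.mean(aps))
```

Under this simplification, a single iris predicted with IoU 78.4% counts as a true positive at the 75% threshold but not at 80%, which corresponds to the mAP75 = 1 and mAP80 < 1 case discussed above.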
TABLE I: Free-form mask evaluation results.

Class (CNN)         mAP75   mAP80   mAP85   mAP90   mAP95   mAP50:95
Iris (ResNet-50)    1.0000  1.0000  0.7273  0.0909  0.0000  0.7818
Iris (ResNet-101)   1.0000  0.9091  0.9091  0.1818  0.0000  0.8000
Pupil (ResNet-50)   1.0000  1.0000  1.0000  1.0000  0.0909  0.9091
Pupil (ResNet-101)  1.0000  1.0000  1.0000  1.0000  0.0909  0.9091
Random weights:
Iris (ResNet-101)   0.5455  0.0909  0.0000  0.0000  0.0000  0.4091
Pupil (ResNet-101)  0.7273  0.7273  0.6364  0.2727  0.0000  0.6455

TABLE II: Bounding box evaluation results.

Class (CNN)         mAP75   mAP80   mAP85   mAP90   mAP95   mAP50:95
Iris (ResNet-50)    1.0000  1.0000  1.0000  0.9091  0.3636  0.9273
Iris (ResNet-101)   1.0000  1.0000  1.0000  0.9091  0.1818  0.9191
Pupil (ResNet-50)   1.0000  1.0000  1.0000  0.7273  0.0909  0.8818
Pupil (ResNet-101)  1.0000  1.0000  1.0000  0.8182  0.0909  0.8909

Prediction efficiency is also quite similar for both evaluated Mask R-CNN models. The ResNet-101 model performs predictions at 6.07 frames per second (fps) in our test environment, while the ResNet-50 model runs at 6.37 fps.

In summary, both evaluated Mask R-CNN models achieve promising and effective results for iris and pupil segmentation. According to the mAP50:95 values, the ResNet-101 model tends to be slightly more effective than the ResNet-50 model, except for bounding box segmentation of the iris, where the ResNet-101 model seems to overfit slightly.

5. CONCLUSION AND FURTHER WORK

This work provides a diverse dataset of 82 images selected from 35 cataract surgery videos, where the regions of iris and pupil have been manually segmented and annotated. We trained and evaluated the Mask R-CNN segmentation algorithm on this dataset, which predicts free-form pixel-accurate segmentation masks for iris and pupil, in contrast to existing techniques that constrain predicted segments to circular or elliptic regions. Our evaluations employ ResNet-50 and ResNet-101 as backbone networks of Mask R-CNN and demonstrate consistently high segmentation accuracies on several metrics while delivering an acceptable prediction efficiency of 6 frames per second on current GPU hardware.

Future work may include extending the iris and pupil dataset to allow for more robust segmentation models and evaluation results. More importantly, the promising results reported in this study may lead to the development of accurate segmentation and measurement approaches for other regions of interest in eye surgery videos, such as surgical instruments or natural and artificial lenses. Tracking such regions over time may eventually allow us to detect events that may be relevant for clinical studies, such as intraoperative pupil reactions in cataract surgeries.

6. REFERENCES

[1] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: common objects in context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[2] C. N. Devi, "Automatic segmentation and recognition of iris images: With special reference to twins," in 2017 Fourth International Conference on Signal Processing, Communication and Networking (ICSCN), March 2017, pp. 1–5.
[3] R. Raghavendra, K. B. Raja, V. K. Vemuri, S. Kumari, P. Gacon, E. Krichen, and C. Busch, "Influence of cataract surgery on iris recognition: A preliminary study," in 2016 International Conference on Biometrics (ICB), June 2016, pp. 1–8.
[4] R. Keshari, S. Ghosh, A. Agarwal, R. Singh, and M. Vatsa, "Mobile periocular matching with pre-post cataract surgery," in 2016 IEEE International Conference on Image Processing (ICIP), Sep. 2016, pp. 3116–3120.
[5] G. McConnon, F. Deravi, S. Hoque, K. Sirlantzis, and G. Howells, "Impact of common ophthalmic disorders on iris segmentation," in 2012 5th IAPR International Conference on Biometrics (ICB), March 2012, pp. 277–282.
[6] R. G. Mesquita and E. M. N. Figueiredo, "An algorithm for measuring pterygium's progress in already diagnosed eyes," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2012, pp. 733–736.
[7] N. Khumdat, P. Phukpattaranont, and S. Tengtrisorn, "Development of a computer system for strabismus screening," in The 6th 2013 Biomedical Engineering International Conference, Oct 2013, pp. 1–5.
[8] S. Patange and A. Jagadale, "Framework for detection of cataract and gradation according to its severity," in 2015 International Conference on Pervasive Computing (ICPC), Jan 2015, pp. 1–3.
[9] A. Lakra, P. Tripathi, R. Keshari, M. Vatsa, and R. Singh, "Segdensenet: Iris segmentation for pre-and-post cataract surgery," in 2018 24th International Conference on Pattern Recognition (ICPR), Aug 2018, pp. 3150–3155.
[10] Z. Li, I. Tong, L. Metcalf, C. Hennessey, and S. E. Salcudean, "Free head movement eye gaze contingent ultrasound interfaces for the da Vinci surgical system," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2137–2143, July 2018.
[11] A. Ektesabi and A. Kapoor, "Exact pupil and iris boundary detection," in The 2nd International Conference on Control, Instrumentation and Automation, Dec 2011, pp. 1217–1221.
[12] F. Lalys, L. Riffaud, D. Bouget, and P. Jannin, "A framework for the recognition of high-level surgical tasks from video images for cataract surgeries," IEEE Transactions on Biomedical Engineering, vol. 59, no. 4, pp. 966–976, April 2012.
[13] D. Bouget, F. Lalys, and P. Jannin, "Surgical Tools Recognition and Pupil Segmentation for Cataract Surgical Process Modeling," in Medicine Meets Virtual Reality - NextMed, vol. 173. Newport Beach, CA, United States: IOS Press, Feb. 2012, pp. 78–84. [Online]. Available: https://www.hal.inserm.fr/inserm-00669660
[14] M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991. [Online]. Available: https://doi.org/10.1007/BF00130487
[15] G. Quellec, K. Charrière, M. Lamard, B. Cochener, and G. Cazuguel, "Normalizing videos of anterior eye segment surgeries," in 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Aug 2014, pp. 122–125.
[16] K. Charrière, G. Quellec, M. Lamard, G. Coatrieux, B. Cochener, and G. Cazuguel, "Automated surgical step recognition in normalized cataract surgery videos," in 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Aug 2014, pp. 4647–4650.
[17] D. Morley and H. Foroosh, "Computing cyclotorsion in refractive cataract surgery," IEEE Transactions on Biomedical Engineering, vol. 63, no. 10, pp. 2155–2168, Oct 2016.
[18] J. Brooks, "COCO Annotator," https://github.com/jsbroks/coco-annotator/, 2019.
[19] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," CoRR, vol. abs/1703.06870, 2017. [Online]. Available: http://arxiv.org/abs/1703.06870
[20] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015. [Online]. Available: http://arxiv.org/abs/1506.01497
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[22] W. Abdulla, "Mask R-CNN for object detection and instance segmentation on keras and tensorflow," https://github.com/matterport/Mask_RCNN, 2017.
