
Automated Detection of Non-Informative Frames for Colonoscopy Through a Combination of Deep Learning and Feature Extraction

Heming Yao1, Ryan W. Stidham2,3, Reza Soroushmehr1,3, Jonathan Gryak1, and Kayvan Najarian1,4,5

1 Heming Yao, Reza Soroushmehr, Jonathan Gryak, and Kayvan Najarian are with the Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
2 Division of Gastroenterology, University of Michigan, Ann Arbor, MI, USA
3 Michigan Integrated Center for Health Analytics & Medical Prediction, University of Michigan, Ann Arbor, MI, USA
4 Michigan Center for Integrative Research in Critical Care, University of Michigan, Ann Arbor, MI, USA
5 Department of Emergency Medicine, University of Michigan, Ann Arbor, MI, USA

Abstract— Colonoscopy is a standard medical examination used to inspect the mucosal surface and detect abnormalities of the colon. Objective assessment and scoring of disease features in the colon are important in conditions such as colorectal cancer and inflammatory bowel disease. However, human disease assessment and measurement are hampered by subjectivity, interobserver variation, and several biases. A computer-aided system for colonoscopy video analysis could facilitate diagnosis and disease severity measurement, which would aid in treatment selection and clinical outcome prediction. However, a large number of images captured during colonoscopy are non-informative, making the detection and removal of those frames an important first step in automated analysis. In this paper, we present a combination of deep learning and conventional feature extraction to distinguish non-informative from informative images in patients with ulcerative colitis. Our results show that combining bottleneck features in the RGB color space with hand-crafted features in the HSV color space can boost classification performance. Our proposed method was validated using 5-fold cross-validation and achieved an average AUC of 0.939 and an average F1 score of 0.775.

Fig. 1. Examples of non-informative frames (top row) and informative frames (bottom row).

I. INTRODUCTION

Optical colonoscopy is a medical procedure in which a flexible probe containing a charge-coupled device (CCD) camera and a fiber-optic light source is inserted into the rectum and advanced through the length of the colon. Colonoscopy is commonly performed to inspect the colon surface for a range of abnormalities such as polyps, adenocarcinoma, diverticula, and inflammatory changes. Despite its common use in both the diagnosis and longitudinal monitoring of disease, the objective grading of disease severity and the localization of disease features remain challenging. Moreover, recent data suggest that there is a significant miss-rate for the detection of polyps and cancers [1], [2].

An efficient computer-aided system for automated detection and estimation of disease features present in colonoscopy could facilitate standardized disease assessment. However, an important barrier to automated analysis of colonoscopy videos is the large proportion of non-informative frames caused by motion blur, a field of view obscured by debris, variation in light intensity, and image artifacts. Those non-informative frames may interfere with disease severity estimation by providing non-informative or conflicting information. An effective classification method is therefore essential to drop non-informative frames while retaining informative ones.

Several methods have been proposed for non-informative colonoscopy image classification. Texture analysis using Local Binary Patterns in the frequency domain has been proposed [3] to detect non-informative frames. A set of convolutional neural network (CNN) architectures was explored in [4], demonstrating the effectiveness of CNNs for this image classification task. In [5], non-informative images were classified through motion, edge, and color features. Most previous methods have focused on either hand-crafted feature extraction or end-to-end deep learning.

In this work, we propose the use of a combination of hand-crafted features and bottleneck features (i.e., the last activation maps before the fully-connected layers) from deep learning to perform non-informative image classification. An Inception-v3 model pre-trained on ImageNet was used to extract textural and high-level features in the RGB color space. In colonoscopy videos, image sharpness can be affected by camera motion. Further, image brightness and sharpness are also influenced by variable features of the internal colon environment, including the amount of water or debris, the surface texture, and the distance between the camera and the colon wall. Therefore, image analysis in the HSV color model, which separates color components from intensity, may provide additional information. In this study, a set of hand-crafted features, including measures of edges, reflections, blur, and focus, was extracted in the HSV color space.

From our experiments, the combination of hand-crafted features in the HSV color space and bottleneck features in the RGB color space efficiently improved the classification performance.
II. METHODS AND MATERIALS

A. Dataset

We collected colonoscopies from 10 patients with inflammatory bowel disease. An Olympus PCF-H190 colonoscope with a CLV-190 light source and image processor was used, and video was recorded at 1920x1080 resolution at 30 frames per second. Video frames sampled at a rate of one frame per second were manually annotated by a board-certified gastroenterologist. Specifically, the following four types of frames were annotated as non-informative (examples shown in Fig. 1):

• Frames with motion blur significant enough to blur the colon vasculature.
• Frames captured when the camera view is obscured by excessive liquid or solid debris.
• Frames captured when the camera is too close to the colon wall.
• Frames overexposed due to the failure of automatic lighting controls.

Of 16,659 total frames, 12,830 frames (77.0%) were annotated as non-informative and 3,829 frames (23.0%) were annotated as informative. Fig. 2 provides the distribution of frames within each colonoscopy video used in the experiments.

Fig. 2. Distribution of informative/non-informative frames within each colonoscopy video.

B. Pre-processing

Frames from different colonoscopies were pre-processed to ensure consistency. First, each frame was binarized and the largest connected component (the content captured by the camera) was identified. The smallest rectangular region containing this component was used to crop the original frame. After that, zero padding was added to extend the rectangular region into a square, which was then resized to 256 × 256 pixels (an example is shown in Fig. 3(a)). As shown in the right image in Fig. 3(a), the region of interest (ROI) for feature extraction is an octagon.
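To make this step concrete, a minimal sketch using OpenCV and NumPy is given below. The binarization threshold, the connected-components call, and the helper name preprocess_frame are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr, out_size=256, thresh=10):
    """Crop the largest bright component, zero-pad to a square, and resize."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Binarize: pixels brighter than the (assumed) threshold form the camera content.
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # Keep the largest connected component (the content captured by the camera).
    _, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # skip background label 0
    x = stats[largest, cv2.CC_STAT_LEFT]
    y = stats[largest, cv2.CC_STAT_TOP]
    w = stats[largest, cv2.CC_STAT_WIDTH]
    h = stats[largest, cv2.CC_STAT_HEIGHT]
    crop = frame_bgr[y:y + h, x:x + w]
    # Zero-pad the crop into a square region, then resize to 256 x 256 pixels.
    side = max(w, h)
    square = np.zeros((side, side, 3), dtype=frame_bgr.dtype)
    square[(side - h) // 2:(side - h) // 2 + h,
           (side - w) // 2:(side - w) // 2 + w] = crop
    return cv2.resize(square, (out_size, out_size))
```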

C. Bottleneck Features

Fig. 4. A diagram of the deep network. The bottleneck features used in this work are shown in the red box (Conv: convolutional layer; Avg pooling: global average pooling layer; FC: fully-connected layer).

The Inception-v3 model [6] was employed for our non-informative frame classification. Although more than 10 thousand annotated frames are available for model training, those frames come from only 10 different patients. To avoid bias toward individual patient characteristics, a model pre-trained on ImageNet [7] was used and fine-tuned with the annotated training data. As shown in Fig. 4, the activation maps from the last average pooling layer were taken as bottleneck features. The number of bottleneck features is 2048.
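The sketch below shows one way to obtain such bottleneck features with tf.keras. The Adam optimizer and the 10^-6 learning rate follow Sec. III-A; the classification head, input preprocessing, and training-loop details are assumptions.

```python
import tensorflow as tf

# Inception-v3 pre-trained on ImageNet; global average pooling of the last
# activation maps yields the 2048-dimensional bottleneck feature vector.
backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(256, 256, 3))

# Binary head used only for fine-tuning on informative / non-informative labels.
model = tf.keras.Sequential([backbone,
                             tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# Inputs should be scaled with tf.keras.applications.inception_v3.preprocess_input.
# model.fit(train_frames, train_labels, epochs=..., validation_data=...)

# After fine-tuning, the pooled activations of the backbone are the bottleneck features:
# bottleneck_features = backbone.predict(frames_rgb)   # shape: (n_frames, 2048)
```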
D. Hand-crafted Feature Extraction

The Hue, Saturation, and Value channels in the HSV color space correspond to the color, the amount of gray, and the brightness, respectively. The HSV channels of an example frame are shown in Fig. 3(b), where the Hue channel is visualized by color mapping. Unlike the RGB color space, the HSV color space separates color from intensity, so feature extraction from frames in the HSV color space may provide additional information. While many pre-trained models are available for generating features in the RGB color space, large-scale datasets suitable for HSV feature extraction, in which reflections, shadows, and brightness relate to the classification task, are limited.

To overcome the lack of large-scale datasets, the frames were converted from RGB to HSV and a feature extraction algorithm was proposed. In total, 37 hand-crafted features were extracted to characterize frames in the colonoscopy videos. A diagram of the feature extraction algorithm is given in Fig. 5 and the number of features in each category is summarized in Table I. The details of each feature category are as follows.

TABLE I
A SUMMARY OF EXTRACTED FEATURES

Category                                   | Number of features
Reflection mask                            | 1
Intensity statistics                       | 12
Edge features                              | 2
Gray-level co-occurrence matrix statistics | 12
Blur measure                               | 10

Specular reflection mask: Because of specular reflections from the colon surface, many frames contain bright regions (an example is shown in Fig. 3(a)). Those regions significantly increase the number of edges and the contrast measure of the image. To remove the effects of these reflections, a reflection mask was built for each frame using a threshold on the Saturation channel followed by a dilation operation. The fraction of reflection regions in the ROI was calculated as a feature. The features in the other categories were corrected using the reflection mask, as detailed in the descriptions of the following feature categories.
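A minimal sketch of this mask construction with OpenCV follows; the saturation threshold and the dilation kernel size shown are placeholder values, not the tuned settings.

```python
import cv2
import numpy as np

def reflection_mask(frame_bgr, roi_mask, sat_thresh=30, dilate_px=5):
    """Flag specular highlights (low-saturation pixels), slightly dilated.

    roi_mask is assumed to be a uint8 image with 255 inside the octagonal ROI.
    Returns the reflection mask and the fraction of the ROI it covers (one feature).
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    sat = hsv[:, :, 1]
    # Specular reflections appear as near-white regions with very low saturation.
    mask = (sat < sat_thresh).astype(np.uint8) * 255
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    mask = cv2.dilate(mask, kernel, iterations=1)
    mask = cv2.bitwise_and(mask, roi_mask)        # keep reflections inside the ROI only
    frac = mask.sum() / max(roi_mask.sum(), 1)    # fraction of ROI covered
    return mask, frac
```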
Fig. 3. Illustration of the proposed image processing method. (a) Image pre-processing. (b) HSV color model; the Hue, Saturation, and Value channels are shown from left to right, with the Hue channel visualized by color mapping. (c) The Hue, Saturation, and Value channels after interpolation over reflection regions and background regions, shown from left to right. (d) Edge detection (in blue) and line detection (in green) using the Hough transform.

Intensity statistics: After removing the pixels within the reflection mask and the pixels outside of the ROI, the mean, variance, skewness, and kurtosis of pixel intensities in the Hue, Saturation, and Value channels were calculated. The mean gives the average level while the variance measures heterogeneity; the skewness and kurtosis measure the asymmetry and the tailedness of the distribution, respectively. Together, the intensity statistics characterize the distribution of color components, gray amount, and brightness in the ROI.
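These twelve statistics (four per HSV channel) can be computed as in the sketch below, which assumes the ROI and reflection masks are binary images of the same size as the frame.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def intensity_statistics(hsv_frame, roi_mask, refl_mask):
    """Mean, variance, skewness, and kurtosis of each HSV channel inside the ROI,
    excluding specular-reflection pixels (12 features in total)."""
    valid = (roi_mask > 0) & (refl_mask == 0)
    feats = []
    for c in range(3):  # Hue, Saturation, Value
        vals = hsv_frame[:, :, c][valid].astype(np.float64)
        feats += [vals.mean(), vals.var(), skew(vals), kurtosis(vals)]
    return np.array(feats)
```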

Edge features: First, the contrast in the Value channel was enhanced using contrast-limited adaptive histogram equalization [8], in which the contrast transform function is calculated on local regions. Second, a Canny edge detector [9] was used to generate an initial edge mask. To remove the effects of strong reflections, edges within the reflection mask were removed. In addition, each edge component was analyzed and edges with an eccentricity smaller than 0.9 (i.e., close to a circle) were removed. The percentage of edge pixels in the ROI was calculated as one edge feature. Next, the Hough transform [10] was used to find lines in the edge mask. Considering that a typical informative frame contains a clear view of the colon lumen and colon folds, the number of line-like edges was also used as an edge feature. Fig. 3(d) gives two examples of the proposed edge detection and line detection method, where the contours of the colon folds were successfully detected. While edge detection with the Canny method will also find sharp intensity gradients in the colon wall (e.g., vessels), the line detection responses tend to cluster at the fold contours and salient disease features.
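A condensed sketch of this edge pipeline with OpenCV and scikit-image follows; the CLAHE settings, Canny thresholds, and Hough parameters are illustrative assumptions.

```python
import cv2
import numpy as np
from skimage.measure import label, regionprops

def edge_features(value_channel, roi_mask, refl_mask):
    """Two edge features: fraction of edge pixels in the ROI and the number of
    line-like edge responses (fold/lumen contours) found by the Hough transform."""
    # Contrast-limited adaptive histogram equalization on the Value channel.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(value_channel)
    edges = cv2.Canny(enhanced, 50, 150)
    edges[refl_mask > 0] = 0      # drop edges caused by specular reflections
    edges[roi_mask == 0] = 0      # drop edges outside the octagonal ROI
    # Remove blob-like (near-circular) edge components, keep elongated ones.
    lab = label(edges > 0)
    for region in regionprops(lab):
        if region.eccentricity < 0.9:
            edges[lab == region.label] = 0
    edge_fraction = (edges > 0).sum() / max((roi_mask > 0).sum(), 1)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                            minLineLength=30, maxLineGap=5)
    n_lines = 0 if lines is None else len(lines)
    return edge_fraction, n_lines
```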
Gray-level co-occurrence matrix: To remove the effect of the ROI border, regions outside the ROI were filled using inward interpolation. Similarly, to remove the effect of reflections, regions falling within the reflection mask were also filled (shown in Fig. 3(c)). After that, the gray-level co-occurrence matrix (GLCM) [11] was calculated for each channel, with pixel intensities quantized into 32 levels. Second-order statistics of the GLCM, including contrast, energy, and homogeneity, were calculated as:

F_{contrast} = \sum_{i,j} |i - j|^2 \, v(i, j),   (1)

F_{energy} = \sum_{i,j} v(i, j)^2,   (2)

F_{homogeneity} = \sum_{i,j} \frac{v(i, j)}{1 + |i - j|},   (3)

where v(i, j) is the value of the GLCM at location (i, j).
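A sketch of these GLCM statistics, computed directly from the normalized co-occurrence matrix so that it matches Eqs. (1)-(3), is given below; the single-pixel horizontal offset is an assumption, since the offset is not specified above.

```python
import numpy as np
from skimage.feature import graycomatrix

def glcm_features(channel, levels=32):
    """Contrast, energy, and homogeneity (Eqs. (1)-(3)) of the GLCM of one
    HSV channel quantized into 32 levels."""
    quantized = np.floor(channel.astype(np.float64) / 256.0 * levels).astype(np.uint8)
    # Pixel-pair offset of one pixel to the right (distance/angle assumed).
    glcm = graycomatrix(quantized, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    v = glcm[:, :, 0, 0]                               # normalized co-occurrence matrix
    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    contrast = np.sum(np.abs(i - j) ** 2 * v)          # Eq. (1)
    energy = np.sum(v ** 2)                            # Eq. (2)
    homogeneity = np.sum(v / (1.0 + np.abs(i - j)))    # Eq. (3)
    return contrast, energy, homogeneity
```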
Blur measure: The extent of blur is an important indicator for non-informative image classification. To measure blur, the rectangular sub-regions resulting from the inward interpolation were used. An absolute difference map between the original sub-region and a Gaussian-blurred version of it was generated, and the mean and standard deviation of these differences were used as blur measures; this is based on the assumption that a Gaussian filter has less effect on already blurry images. Additionally, a no-reference quality measure for blurred images in the frequency domain proposed in [12] was calculated. As blurred images may also result from optical de-focus, a set of focus measures was calculated. Subbarao et al. [13] proposed a focus measure obtained by convolving a discrete Laplacian mask with the input image:

F_{LAP_E} = \sum_{i=1}^{H} \sum_{j=1}^{W} \left( I(i+1, j) + I(i-1, j) + I(i, j+1) + I(i, j-1) - 4 I(i, j) \right)^2,   (4)

where I denotes the input image and H and W denote its height and width, respectively.

Shen and Chen [14] measured focus by dividing the image into multiple square blocks and calculating the ratio of AC to DC coefficient energy of each block in the discrete cosine transform domain:

F_{DCTR} = \sum_{u=1}^{H/S} \sum_{v=1}^{W/S} \frac{EAC_{u,v}}{EDC_{u,v}},   (5)

where S is the size of the block, EAC_{u,v} is the sum of squares of the AC coefficients, and EDC_{u,v} is the square of the DC coefficient in block (u, v). In this study, we used S = 15.
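The sketch below illustrates the Gaussian-difference blur statistics and the two focus measures of Eqs. (4) and (5); the Gaussian kernel size and the handling of partial border blocks are assumptions.

```python
import cv2
import numpy as np
from scipy.fftpack import dct

def blur_and_focus_measures(gray, block=15):
    """Gaussian-difference blur statistics, energy of the Laplacian (Eq. (4)),
    and the per-block DCT AC/DC energy ratio of Shen and Chen (Eq. (5))."""
    g = gray.astype(np.float64)
    # Blur measure: a Gaussian filter changes a sharp image more than a blurry one.
    diff = np.abs(g - cv2.GaussianBlur(g, (9, 9), 0))
    blur_mean, blur_std = diff.mean(), diff.std()
    # Eq. (4): sum of squared responses of the discrete Laplacian mask.
    lap = cv2.Laplacian(g, cv2.CV_64F)
    f_lap_e = np.sum(lap ** 2)
    # Eq. (5): ratio of AC energy to DC energy per S x S block in the DCT domain.
    h, w = g.shape
    f_dctr = 0.0
    for u in range(0, h - block + 1, block):
        for v in range(0, w - block + 1, block):
            patch = g[u:u + block, v:v + block]
            coeffs = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
            e_dc = coeffs[0, 0] ** 2
            e_ac = np.sum(coeffs ** 2) - e_dc
            f_dctr += e_ac / max(e_dc, 1e-8)
    return blur_mean, blur_std, f_lap_e, f_dctr
```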
Fig. 5. A diagram of the feature extraction and feature fusion in the HSV and RGB color spaces.

E. Classifier Training

A random forest (RF) model was trained using the combination of hand-crafted features and bottleneck features. Considering that the number of features after feature fusion is large, the RF model was selected for its built-in feature selection.
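A sketch of this classifier, evaluated with the patient-wise cross-validation described in Sec. III-A below, is given here using scikit-learn; the number of trees is an assumption, as it is not reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import GroupKFold

def evaluate_fusion(bottleneck_feats, handcrafted_feats, labels, patient_ids):
    """Random forest on the fused feature vector, evaluated with patient-wise
    5-fold cross-validation (each fold holds out the frames of two patients)."""
    X = np.hstack([bottleneck_feats, handcrafted_feats])   # 2048 + 37 features
    aucs, f1s = [], []
    splitter = GroupKFold(n_splits=5)
    for train_idx, test_idx in splitter.split(X, labels, groups=patient_ids):
        rf = RandomForestClassifier(n_estimators=500, n_jobs=-1)  # tree count assumed
        rf.fit(X[train_idx], labels[train_idx])
        prob = rf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(labels[test_idx], prob))
        f1s.append(f1_score(labels[test_idx], prob > 0.5))
    return np.mean(aucs), np.mean(f1s)
```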
III. RESULTS AND DISCUSSION

A. Experimental Settings

Our CNN was implemented using the TensorFlow library. During fine-tuning, the Adam optimizer was used to minimize the cross-entropy loss with a learning rate of 10^-6. To better compare the performance of the different methods, a patient-wise 5-fold cross-validation was performed: frames were divided into 5 folds, where each fold contained the frames from two patients. The mean (µ) and standard deviation (σ) of the F1-score, sensitivity, specificity, and AUC over the 5 folds were calculated as our final results.

B. Classification Performance of Hand-crafted Features

Table II lists the classification performance using single feature categories or different combinations of hand-crafted feature categories. Among single feature categories, the GLCM and blur measures perform significantly better than the others. The classification model using only intensity statistics has a high standard deviation, and the model using only edge features has low classification accuracy. This may be because these two feature categories are important for only a portion of the images. For example, in colonoscopy videos, frames recorded before the scope is inserted and after it is withdrawn contain abundant edge information but are non-informative. Also, the intensity values of each channel may be affected by the patient's individual colon environment and the camera settings. However, when we combine features from reflections, intensity statistics, and edges, the classification performance increases significantly, indicating that they are highly complementary.

TABLE II
CLASSIFICATION PERFORMANCE COMPARISON USING DIFFERENT CATEGORIES OF HAND-CRAFTED FEATURES. THE AUC IS GIVEN IN THE FORMAT OF µ (σ) FROM 5-FOLD CROSS-VALIDATION.

Hand-crafted features                       | AUC
Reflection                                  | 0.740 (0.036)
Intensity                                   | 0.840 (0.067)
Edge                                        | 0.718 (0.022)
GLCM                                        | 0.884 (0.022)
Blur                                        | 0.891 (0.021)
Reflection + Intensity                      | 0.843 (0.047)
Reflection + Intensity + Edge               | 0.873 (0.030)
Reflection + Intensity + Edge + GLCM        | 0.899 (0.023)
Reflection + Intensity + Edge + GLCM + Blur | 0.909 (0.020)
C. Classification Performance of Feature Fusion

As shown in Table III, combining hand-crafted features in the HSV color space with deep-learning-based features in the RGB color space achieves a statistically significant improvement in non-informative frame classification performance compared with the other methods. While deep learning methods have proven effective at extracting comprehensive features, our result demonstrates that a small set of hand-crafted features based on visual prior knowledge can still provide additional, helpful information. This observation may also apply to other medical applications where the training set is small and domain knowledge is available.

TABLE III
CLASSIFICATION PERFORMANCE COMPARISON. THE EVALUATION MEASURES ARE GIVEN IN THE FORMAT OF µ (σ) FROM 5-FOLD CROSS-VALIDATION.

Method                     | AUC           | F1            | Sensitivity   | Specificity
Hand-crafted features + RF | 0.909 (0.020) | 0.720 (0.045) | 0.846 (0.043) | 0.845 (0.020)
Deep Learning              | 0.924 (0.020) | 0.752 (0.032) | 0.821 (0.022) | 0.890 (0.051)
Bottleneck features + RF   | 0.928 (0.012) | 0.756 (0.040) | 0.824 (0.043) | 0.891 (0.008)
Feature fusion + RF        | 0.939 (0.009) | 0.775 (0.028) | 0.828 (0.057) | 0.919 (0.029)

Note: The method "Deep Learning" uses an end-to-end Inception-v3 architecture, while the method "Bottleneck features + RF" uses bottleneck features and a random forest (RF) classifier. A paired t-test was performed comparing cross-validation results from the "Feature fusion + RF" and "Bottleneck features + RF" methods; the p-values for AUC, F1, and specificity are smaller than 0.05.
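The significance test reported in the note of Table III can be reproduced with SciPy as sketched below, assuming the per-fold scores of the two methods are available; the values used here are synthetic placeholders, not the actual fold-level results.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Placeholder per-fold AUC values (5 folds per method), for illustration only;
# in practice these come from the patient-wise cross-validation loop.
auc_fusion = 0.94 + 0.01 * rng.standard_normal(5)
auc_bottleneck = 0.93 + 0.01 * rng.standard_normal(5)

t_stat, p_value = ttest_rel(auc_fusion, auc_bottleneck)
print(f"paired t-test on AUC: t = {t_stat:.2f}, p = {p_value:.4f}")
```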
Fig. 6 gives examples of classification results using feature fusion. While our model makes correct classifications for most frames, misclassification happens when frames exhibit features of both the informative and non-informative groups. For the first two frames in Fig. 6(a), fold contours are visible while water or reflections obscure part of the camera view. For the last frames in Fig. 6(a), the camera is too close to the colon wall to capture informative content, although vessels in the colon wall are quite clear. The frames in Fig. 6(b) may have been misclassified as non-informative because they are blurry and partially obscured by water; however, the overall colon structures are still visible, so it is reasonable to annotate them as informative. These misclassified instances point to one limitation of our study: the lack of quantitative criteria for image annotation. Adding uncertainty grading during the annotation process and integrating that uncertainty into the training process may improve model performance.

Fig. 6. Examples of classification results using feature fusion. (a) Frames incorrectly classified as informative; (b) frames incorrectly classified as non-informative.
IV. CONCLUSION

A new algorithm for non-informative frame detection in colonoscopy videos was proposed using a combination of bottleneck features in the RGB color space and a small set of hand-crafted features in the HSV color space. In our experiments, feature fusion achieved an average AUC of 0.939 under 5-fold cross-validation, better than using bottleneck features or hand-crafted features alone. Our key contribution is threefold. First, we designed a feature extraction algorithm in the HSV color space and demonstrated that features from both the RGB and HSV color spaces can better characterize frames from colonoscopy videos. Second, we demonstrated the effectiveness of integrating visual prior knowledge into data representations extracted using deep learning: although deep learning techniques have achieved wide success in various fields, feature engineering with domain knowledge can still provide valuable information, and feature fusion has the potential to improve model performance, especially when the training dataset is small and domain knowledge is available. Finally, the proposed automatic and accurate non-informative frame detection system is essential for further colonoscopy video analysis, since accurate detection and removal of non-informative frames can improve the accuracy of disease severity estimation and reduce computational cost.

REFERENCES

[1] L. Hixson, M. B. Fennerty, R. Sampliner, and H. Garewal, "Prospective blinded trial of the colonoscopic miss-rate of large colorectal polyps," Gastrointestinal Endoscopy, vol. 37, no. 2, pp. 125–127, 1991.
[2] T. Kaltenbach, S. Friedland, and R. Soetikno, "A randomized tandem colonoscopy trial of narrow band imaging versus white light examination to compare neoplasia miss rates," Gut, 2008.
[3] C. Ballesteros, M. Trujillo, C. Mazo, D. Chaves, and J. Hoyos, "Automatic classification of non-informative frames in colonoscopy videos using texture analysis," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, C. Beltrán-Castañón, I. Nyström, and F. Famili, Eds. Cham: Springer International Publishing, 2017, pp. 401–408.
[4] A. Islam, A. Alammari, J. Oh, W. Tavanapong, J. Wong, and P. C. de Groen, "Non-informative frame classification in colonoscopy videos using CNNs," in Proceedings of the 2018 3rd International Conference on Biomedical Imaging, Signal Processing. ACM, 2018, pp. 53–60.
[5] M. A. Armin, G. Chetty, F. Jurgen, H. De Visser, C. Dumas, A. Fazlollahi, F. Grimpen, and O. Salvado, "Uninformative frame detection in colonoscopy through motion, edge and color features," in International Workshop on Computer-Assisted and Robotic Endoscopy. Springer, 2015, pp. 153–162.
[6] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[8] A. M. Reza, "Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 38, no. 1, pp. 35–44, 2004.
[9] C. Harris and M. Stephens, "A combined corner and edge detector," in Alvey Vision Conference, vol. 15, no. 50, 1988, pp. 10–5244.
[10] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes," Pattern Recognition, vol. 13, no. 2, pp. 111–122, 1981.
[11] A. Baraldi and F. Parmiggiani, "An investigation of the textural characteristics associated with gray level cooccurrence matrix statistical parameters," IEEE Transactions on Geoscience and Remote Sensing, vol. 33, no. 2, pp. 293–304, 1995.
[12] K. De and V. Masilamani, "Image sharpness measure for blurred images in frequency domain," Procedia Engineering, vol. 64, pp. 149–158, 2013.
[13] M. Subbarao and J.-K. Tyan, "Selecting the optimal focus measure for autofocusing and depth-from-focus," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 864–870, 1998.
[14] C.-H. Shen and H. H. Chen, "Robust focus measure for low-contrast images," in 2006 Digest of Technical Papers, International Conference on Consumer Electronics (ICCE). IEEE, 2006, pp. 69–70.
