Major Project Report


SEMANTIC SEGMENTATION OF WIRELESS

CAPSULE ENDOSCOPY IMAGES

Thesis

Submitted in partial fulfillment of the requirements for the award of the


degree of
MASTER OF TECHNOLOGY

in

Signal Processing and Machine Learning

by

Jayakrishnan T
192SP008

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING,


NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
SURATHKAL, MANGALORE-575025
June 2021
SEMANTIC SEGMENTATION OF WIRELESS
CAPSULE ENDOSCOPY IMAGES

Thesis

Submitted in partial fulfillment of the requirements for the award of the


degree of
MASTER OF TECHNOLOGY
in
Signal Processing and Machine Learning
by
Jayakrishnan T
192SP008

under the guidance of


Dr. Aparna P
Assistant Professor, National Institute of Technology, Karnataka

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING,


NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
SURATHKAL, MANGALORE-575025
June 2021
DECLARATION
by the P.G. (M.Tech) Student

I hereby declare that the report of the P.G. Project work entitled SEMANTIC SEG-
MENTATION OF WIRELESS CAPSULE ENDOSCOPY IMAGES which is being
submitted to the National Institute of Technology Karnataka, Surathkal in partial ful-
fillment of the requirements for the award of the degree of Master of Technology in
Signal Processing and Machine Learning in the department of Electronics and Com-
munication Engineering, is a bonafide report of the work carried out by me. The material
contained in the report has not been submitted to any University or Institution for the
award of any degree.

Jayakrishnan T
Reg. No:-192473SP008
Department of Electronics and Communication Engineering

Place: NITK Surathkal


Date: 6th June 2021
CERTIFICATE

This is to certify that the P.G. project work entitled SEMANTIC SEGMENTATION
OF WIRELESS CAPSULE ENDOSCOPY IMAGES submitted by Jayakrishnan
T (Registration No:- 192473SP008) as the record of the work carried out by him is
accepted as the P.G. Project Work Report submission in partial fulfillment of the re-
quirements for the award of degree of Master of Technology in Signal Processing and
Machine Learning of the Department of Electronics and Communication Engineering
during the academic year of 2020-2021.

Dr. Ashvini Chaturvedi Dr. Aparna P


Head of the Department Internal Guide
Professor Assistant Professor
Dept. of E & C Engg Dept. of E & C Engg
NITK Surathkal -575025 NITK Surathkal -575025
ACKNOWLEDGEMENT

At the very outset of this report, I would like to extend my sincere and heartfelt gratitude to everyone who has helped in this endeavor. I am extremely thankful to my guide, Dr. Aparna P, Assistant Professor in the Department of Electronics and Communication Engineering, for her valuable guidance and support in the completion of this project. I am also deeply indebted to Dr. Ashvini Chaturvedi, Head of the Department, for the guidance and encouragement to accomplish this work.

Jayakrishnan T
192SP008
Abstract

Wireless Capsule Endoscopy (WCE) is of great importance nowadays. It has been widely used for direct inspection of the Gastro-Intestinal (GI) tract without any surgical operation. A tiny wireless camera in the form of a capsule is swallowed by the patient, and it records video as it passes through the GI tract. By analyzing these videos, polyps in the GI tract can be detected. Polyps are abnormal tissue growths that most often look like small, flat bumps, and they can result in cancer.

For the analysis of WCE videos, medical examiners need to screen a very large number of images per patient manually, which is time-consuming and tedious work. To ease the job of the medical examiner, semantic segmentation of WCE videos can be used. This helps in understanding and locating the regions of the GI tract where abnormalities exist. This project develops various semantic segmentation schemes using deep neural networks to detect polyps in the GI tract automatically.
Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . 2

2 Background 3

3 Implementation 5
3.1 Fully Convolutional Network (FCN-32s, 16s, and 8s) . . 5
3.2 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 ResU-Net . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Attention U-Net . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Double U-Net . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.7 Data Augmentation . . . . . . . . . . . . . . . . . . . . 20
3.8 Training Details . . . . . . . . . . . . . . . . . . . . . . 21
3.9 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . 21

4 Results 23
4.1 FCN-32s . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 FCN-16s . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 FCN-8s . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Res U-Net . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.6 Attention U-Net . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Double U-Net . . . . . . . . . . . . . . . . . . . . . . . 29

5 Conclusion 31

Bibliography 33

Biodata 34

List of Figures

1 WCE Capsule . . . . . . . . . . . . . . . . . . . . . . . 2

2 FCN Architecture . . . . . . . . . . . . . . . . . . . . . 6
3 FCN Architecture Variations . . . . . . . . . . . . . . . 7
4 U-Net Architecture . . . . . . . . . . . . . . . . . . . . 8
5 Building blocks of neural networks. (a) Plain neural unit
used in U-Net and (b) Residual unit with identity mapping
used in ResU-Net . . . . . . . . . . . . . . . . . . . . . 9
6 ResU-Net Architecture . . . . . . . . . . . . . . . . . . 11
7 Attention Gate . . . . . . . . . . . . . . . . . . . . . . . 14
8 Attention U-Net . . . . . . . . . . . . . . . . . . . . . . 15
9 Atrous Convolution: 2D convolution using a 3 x 3 kernel
with a dilation rate of 2 and no padding . . . . . . . . . 16
10 Atrous Spatial Pyramid Pooling (ASPP) . . . . . . . . . 17
11 Squeeze and Excitation Block . . . . . . . . . . . . . . 17
12 Double U-Net . . . . . . . . . . . . . . . . . . . . . . . 19

13 FCN-32s Results. (a) Input image (b) Ground truth (c)


Segmented output . . . . . . . . . . . . . . . . . . . . . 24
14 FCN-16s Results. (a) Input image (b) Ground truth (c)
Segmented output . . . . . . . . . . . . . . . . . . . . . 25
15 FCN-8s Results. (a) Input image (b) Ground truth (c)
Segmented output . . . . . . . . . . . . . . . . . . . . . 26

16 U-Net Results. (a) Input image (b) Ground truth (c) Seg-
mented output . . . . . . . . . . . . . . . . . . . . . . . 27
17 ResU-Net Results. (a) Input image (b) Ground truth (c)
Segmented output . . . . . . . . . . . . . . . . . . . . . 28
18 Attention U-Net Results. (a) Input image (b) Ground truth
(c) Segmented output . . . . . . . . . . . . . . . . . . . 29
19 Double U-Net Results. (a) Input image (b) Ground truth
(c) Segmented output . . . . . . . . . . . . . . . . . . . 30

List of Tables

1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

List of Abbreviations

GIE Gastro-Intestinal Endoscopy

GI tract Gastro-Intestinal tract

WCE Wireless Capsule Endoscopy

FCN Fully Convolutional Network

ReLU Rectified Linear Unit

IoU Intersection over Union

AG Attention Gate

ASPP Atrous Spatial Pyramid Pooling

CNN Convolutional Neural Network

TP True Positive

TN True Negative

FP False Positive

FN False Negative

Chapter 1

Introduction

Gastro-Intestinal Endoscopy (GIE) is conducted to detect pathologies or abnormal conditions in the Gastro-Intestinal (GI) tract. The GI tract is divided into four regions, namely the entrance, stomach, small intestine, and large intestine. Examination of the GI tract is used to detect polyps, which are abnormal tissue growths that can result in cancer. Various technologies exist for carrying out GIE, and Wireless Capsule Endoscopy (WCE) is the latest of them. Traditional endoscopy involves passing a long, flexible tube equipped with a video camera down the throat or through the rectum. Capsule endoscopy helps doctors see inside the small intestine, an area that is not easily reached with traditional endoscopy procedures.
Wireless Capsule Endoscopy is a procedure used to record internal
images of the Gastro-Intestinal tract for use in medical diagnosis. A capsule (Figure 1) containing a tiny wireless camera is swallowed by the patient. The capsule records video as it passes through the GI tract. The
images collected by the miniature camera are transferred wirelessly to
an external receiver worn by the patient. The collected images are then
transferred to a computer for display, review, and diagnosis.

Figure 1: WCE Capsule

1.1 Motivation

The Wireless Capsule Endoscopy technique is used for detecting polyps in the GI tract, which can result in cancer. Examination of WCE videos can therefore save lives. Manual examination of WCE videos by physicians or technicians is a time-consuming and tedious task due to the very long video length, which increases the chances of human error. The video recorded using the WCE technique is typically 6-8 hours long and contains 60,000 to 120,000 images. Manual examination is also laborious and significantly increases the procedural cost. Artificial intelligence can help here by providing faster examination with human-level accuracy.

1.2 Problem Statement

Analysis of WCE videos is time-consuming and prone to human error. To avoid these issues, artificial intelligence can be used. It would be helpful if a neural network could analyse the WCE videos and find polyps inside the GI tract automatically. For this purpose, image segmentation can be used. Semantic segmentation of WCE videos using deep learning outputs images in which each pixel is classified as either background (non-polyp region) or polyp. This project implements segmentation architectures ranging from basic to advanced using deep learning.

Chapter 2

Background

Image segmentation has been a hot topic inside the deep learning com-
munity for quite a long time. It is one of the toughest tasks in computer
vision. The benefits of image segmentation are numerous especially in
the biomedical field where the data is not always clean and easy to exam-
ine. Other applications include image compression, scene understanding,
locating objects in satellite images, autonomous vehicles, augmented re-
ality, etc. Over time, many algorithms have been developed for image
segmentation but with the advent of deep learning in computer vision,
many deep learning models for image segmentation have also emerged.
Semantic segmentation and instance segmentation are the two types
of segmentation schemes in deep learning. Semantic segmentation per-
forms pixel-level labeling with a set of object categories (e.g., human,
car, tree, sky) for all image pixels. It treats multiple objects of the same
class as a single entity. It is generally a harder undertaking than image
classification, which predicts a single label for the entire image.
Instance segmentation extends the semantic segmentation scope fur-
ther by detecting and delineating each object of interest in the image (e.g.,
partitioning of individual persons). It treats multiple objects of the same
class as distinct individual objects (or instances). In the case of the WCE images dealt with in this project, semantic segmentation is sufficient.

Various deep neural network architectures are present for semantic
segmentation and are always getting updated. The initial architectures
were an advancement in standard classification architectures by replacing
the fully connected layers with convolutional layers to get images as the
output. These architectures are generally called fully convolutional net-
works. Later, architectures were designed exclusively for segmentation.
Although several architectures exist for semantic segmentation and they may seem entirely different, the basic work a segmentation architecture does is largely the same; the difference lies mainly in how this basic work is enhanced. In most segmentation schemes, the input image is fed through a series of convolutional layers which decrease the spatial dimensions of the image and increase the number of channels. From this reduced form, the segmented output image is obtained by increasing the spatial dimensions back to those of the input through another series of convolutional layers.

Chapter 3

Implementation

In this project, seven different segmentation architectures from basic to


advanced levels are implemented for the segmentation of WCE images.
The architectures are variants of Fully Convolutional Network (FCN) and
U-Net. These are FCN-32s, FCN-16s, FCN-8s, U-Net, Residual U-Net,
Attention U-Net, and Double U-Net. Each of them is discussed below in
detail.

3.1 Fully Convolutional Network (FCN-32s, 16s, and 8s)

FCN [1] is one of the earliest and most widely used deep learning architectures for image segmentation. It was proposed in 2014. VGG16 [2] is chosen as the base network for FCN, which, according to the original paper, gave better performance compared to other standard networks such as GoogLeNet.
The architecture (Figure 2) consists of a series of convolutional and max-
pooling layers. The height and width of the input image are reduced con-
tinuously because of the convolutional and maxpooling layers. Also, the
depth is increased as the number of filters used increases in deeper layers.
From the output of the last maxpooling layer, the segmented output is ob-
tained using upsampling. The major change in FCN compared to VGG16
is the replacement of the fully connected layers by convolutional layers. The
network consists of a downsampling path, used to extract and interpret
the context, and an upsampling path, which allows for localization.

Figure 2: FCN Architecture

There are three variations in FCN. These are FCN-32s, FCN-16s and
FCN-8s. FCN-32s is the basic variant and does not have any skip connections. In FCN-16s and FCN-8s, skip connections are introduced to improve the information flow and thereby the output. FCN-16s has one skip connection from the second-last pooling layer, whereas FCN-8s has two skip connections from the second- and third-last pooling layers. FCN-32s, FCN-16s, and FCN-8s require 32x, 16x, and 8x upsampling, respectively, to obtain the segmented output; hence the names. The
basic difference in the architecture is shown below (Figure 3).

Figure 3: FCN Architecture Variations

In all three variants of FCN, the input images are resized to 224 x 224 x 3. The fully connected layers of VGG16 are replaced by convolutional layers, making the network fully convolutional. Batch normalization is introduced into the VGG16 architecture for better results. Transposed convolution (deconvolution) is used to upsample the encoded representation back to the input resolution. The segmented output size is 224 x 224 x 1.
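As an illustration of how such an FCN head can be built, a minimal FCN-8s sketch in tf.keras is given below. It is only a sketch under assumed layer choices (filter sizes, transposed-convolution kernels) and is not necessarily the exact configuration used in this project.

    # Minimal FCN-8s sketch in tf.keras. Layer choices are illustrative
    # assumptions, not the exact configuration used in this project.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def build_fcn8s(input_shape=(224, 224, 3), n_classes=1):
        vgg = tf.keras.applications.VGG16(include_top=False, input_shape=input_shape)
        pool3 = vgg.get_layer("block3_pool").output   # 28 x 28
        pool4 = vgg.get_layer("block4_pool").output   # 14 x 14
        pool5 = vgg.get_layer("block5_pool").output   # 7 x 7

        # Replace the fully connected layers of VGG16 with convolutions.
        x = layers.Conv2D(4096, 7, padding="same", activation="relu")(pool5)
        x = layers.Conv2D(4096, 1, activation="relu")(x)
        score5 = layers.Conv2D(n_classes, 1)(x)

        # 2x upsample and fuse with pool4, then with pool3 (the two skip connections).
        up4 = layers.Conv2DTranspose(n_classes, 4, strides=2, padding="same")(score5)
        fuse4 = layers.Add()([up4, layers.Conv2D(n_classes, 1)(pool4)])
        up3 = layers.Conv2DTranspose(n_classes, 4, strides=2, padding="same")(fuse4)
        fuse3 = layers.Add()([up3, layers.Conv2D(n_classes, 1)(pool3)])

        # Final 8x upsampling back to 224 x 224, sigmoid for the binary polyp mask.
        out = layers.Conv2DTranspose(n_classes, 16, strides=8, padding="same",
                                     activation="sigmoid")(fuse3)
        return Model(vgg.input, out)

    model = build_fcn8s()

FCN-32s and FCN-16s follow the same pattern, except that the fusion with pool3 (and, for FCN-32s, also pool4) is omitted and the final upsampling factor changes accordingly.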

3.2 U-Net

U-Net [3] architecture was proposed in 2015. It is the most popular seg-
mentation architecture which is widely used in biomedical applications.
U-Net architecture (Figure 4) consists of a contracting path and an ex-
pansive path. The contracting path follows the typical architecture of a

convolutional network. It consists of the repeated application of two
3 x 3 convolutions (unpadded convolutions), each followed by a Recti-
fied Linear Unit (ReLU) and a 2 x 2 max pooling operation with stride
2 for downsampling. At each downsampling step, the number of fea-
ture channels is doubled. Every step in the expansive path consists of an
upsampling of the feature map followed by concatenation with the corre-
spondingly cropped feature map from the contracting path, and two
3 x 3 convolutions, each followed by a ReLU. The cropping is necessary
due to the loss of border pixels in every convolution. At the final layer, a
1 x 1 convolution is used to map each feature vector to the desired num-
ber of classes. The U-Net training strategy relies on
the strong use of data augmentation to use the available annotated samples
more efficiently.

Figure 4: U-Net Architecture

The input images are resized to 256 x 256 x 3 and the encoder output
dimension is 16 x 16 x 128. Output size is 256 x 256 x 1.
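For reference, a compact U-Net sketch in tf.keras is given below. The filter counts are assumptions chosen so that the bridge produces a 16 x 16 x 128 feature map for a 256 x 256 x 3 input; same-padded convolutions are used here, so no cropping of the skip connections is required.

    # Compact U-Net sketch in tf.keras (filter counts are assumed, not the
    # project's exact hyperparameters).
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def conv_block(x, filters):
        # Two 3x3 convolutions, each followed by ReLU.
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        return x

    def build_unet(input_shape=(256, 256, 3), filters=(16, 32, 64, 128)):
        inputs = layers.Input(input_shape)
        x, skips = inputs, []
        for f in filters:                                   # contracting path
            x = conv_block(x, f)
            skips.append(x)
            x = layers.MaxPooling2D(2)(x)
        x = conv_block(x, filters[-1])                      # bridge: 16 x 16 x 128
        for f, skip in zip(reversed(filters), reversed(skips)):   # expansive path
            x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
            x = layers.Concatenate()([x, skip])
            x = conv_block(x, f)
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)   # 256 x 256 x 1 mask
        return Model(inputs, outputs)

    model = build_unet()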

3.3 ResU-Net

ResU-Net [4] refers to Deep Residual U-Net. It was proposed in 2017. It


is an encoder-decoder architecture developed for semantic segmentation.
It was initially used for road extraction from high-resolution aerial im-
ages in the field of remote sensing image analysis. Later, it was adopted
by researchers for multiple other applications such as polyp segmenta-
tion, brain tumor segmentation, human image segmentation, and many
more. It is a fully convolutional neural network that is designed to get
high performance with fewer parameters. It is an improvement over the
existing UNET architecture. ResU-Net takes the advantage of both the
U-Net architecture and the Deep Residual Learning [5].

Figure 5: Building blocks of neural networks. (a) Plain neural unit used in U-Net and
(b) Residual unit with identity mapping used in ResU-Net

The major difference between the building blocks of U-Net and ResU-
Net is shown (Figure 5). The building block of U-Net contains a series
of convolutional layers. In ResU-Net, the building block contains a skip
connection through which the input of the block is added to its output.
The residual network was proposed to avoid the issues that arise when the network is made deeper. Going deeper can improve the performance of a multi-layer neural network; however, it can also hamper training, and a degradation problem may occur. Residual networks overcome these problems. The residual neural network consists of a series of stacked
residual units. Each residual unit can be expressed in the general form:

y_l = h(x_l) + F(x_l),
x_{l+1} = f(y_l),

where x_l and x_{l+1} are the input and output of the l-th residual unit, F(·) is the residual function, f(y_l) is the activation function, and h(x_l) is an identity mapping function, a typical choice being h(x_l) = x_l.
The deep ResU-Net combines the strengths of both U-Net and the residual neural network. This combination brings two benefits: 1) the residual units ease training of the network; 2) the skip connections within a residual unit and between the low and high levels of the network facilitate information propagation without degradation, making it possible to design a neural network with far fewer parameters that still achieves better performance on semantic segmentation. The original deep
residual network paper suggested a full pre-activation design. The ResU-
Net paper also employs a full pre-activation residual unit to build the ar-
chitecture.
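A sketch of such a full pre-activation residual unit (BN, ReLU, convolution, applied twice, with an identity shortcut) is shown below in tf.keras; the 1x1 projection on the shortcut when the shape changes is a common implementation choice, not a detail taken from this report.

    # Sketch of a full pre-activation residual unit used as the building block
    # of ResU-Net. Details such as the 1x1 projection are assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters, strides=1):
        shortcut = x
        # Residual function F(x): two pre-activated 3x3 convolutions.
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(filters, 3, strides=strides, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        # Identity mapping h(x): project with a 1x1 convolution if shapes differ.
        if strides != 1 or shortcut.shape[-1] != filters:
            shortcut = layers.Conv2D(filters, 1, strides=strides, padding="same")(shortcut)
        return layers.Add()([y, shortcut])   # x_{l+1} = f(y_l), with f the identity here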

Figure 6: ResU-Net Architecture

The ResU-Net architecture is shown (Figure 6). It consists of an en-
coding network, decoding network, and a bridge connecting both these
networks, just like a U-Net. The U-Net uses two 3 x 3 convolutions, where
each is followed by a ReLU activation function. In the case of ResU-Net,
these layers are replaced by a pre-activated residual block.
The encoder takes the input image and passes it through different en-
coder blocks, which helps the network to learn an abstract representation.
The encoder consists of three encoder blocks, which are built using the
pre-activated residual block. The output of each encoder block acts as
a skip connection for the corresponding decoder block. The bridge also
consists of a pre-activated residual block.
The decoder takes the feature map from the bridge and the skip con-
nections from different encoder blocks and learns a better semantic rep-
resentation, which is used to generate a segmentation mask. The decoder
consists of three decoder blocks, and after each block, the spatial dimen-
sions of the feature map are doubled and the number of feature channels
is reduced.
Each decoder block begins with a 2 × 2 upsampling, which doubles
the spatial dimensions of the feature maps. These feature maps are then concatenated with the appropriate skip connection from the encoder block. These skip connections help the decoder blocks to obtain the features learned by the encoder network. After this, the feature maps from the concatenation operation are passed through a pre-activated residual block.
The output of the last decoder passes through a 1×1 convolution with
sigmoid activation. The sigmoid activation function gives the segmenta-
tion mask representing the pixel-wise classification.
The input images are resized to 256 x 256 x 3 and the encoder output
dimension is 16 x 16 x 256. Output size is 256 x 256 x 1.

3.4 Attention U-Net

Attention U-Net [6] was proposed in 2018. It introduces a novel attention gate (AG) mechanism that allows the U-Net to focus on target structures of varying size and shape.
Attention, in the context of image segmentation, is a way to highlight
only the relevant activations during training. This reduces the computa-
tional resources wasted on irrelevant activations, providing the network
with better generalisation power. Essentially, the network can pay “atten-
tion” to certain parts of the image.
Attention comes in two forms, hard and soft. Hard attention works on
the basis of highlighting relevant regions by cropping the image or iter-
ative region proposal. Since hard attention can only choose one region of an image at a time, it has two implications: it is non-differentiable and requires reinforcement learning to train. Since it is non-differentiable, it
means that for a given region in an image, the network can either pay
“attention” or not, with no in-between. As a result, standard backpropa-
gation cannot be done, and Monte Carlo sampling is needed to calculate
the accuracy across various stages of backpropagation.
Soft attention works by weighting different parts of the image. Areas
of high relevance are multiplied with a larger weight and areas of low
relevance are tagged with smaller weights. As the model is trained, more
focus is given to the regions with higher weights. Unlike hard attention,
these weights can be applied to many patches in the image.
Due to the deterministic nature of soft attention, it remains differen-
tiable and can be trained with standard backpropagation. As the model
is trained, the weighting is also trained such that the model gets better at
deciding which parts to pay attention to.
During upsampling in the expanding path, spatial information recre-
ated is imprecise. To counteract this problem, the U-Net uses skip con-

nections that combine spatial information from the downsampling path
with the upsampling path. However, this brings across many redundant
low-level feature extractions, as feature representation is poor in the initial
layers. Soft attention implemented at the skip connections will actively
suppress activations in irrelevant regions, reducing the number of redun-
dant features brought across.

Figure 7: Attention Gate

For incorporating soft attention, attention gates are introduced. The


structure of the attention gate is shown (Figure 7). The attention gate
takes in two inputs, the vectors x_l and g. The vector g is taken from the next lowest layer of the network. This vector has smaller dimensions and better feature representation, given that it comes from deeper in the network. The vector x_l goes through a strided convolution and the vector g goes through a 1x1 convolution so that their dimensions become equal. The two vec-
tors are summed element-wise. This process results in aligned weights
becoming larger while unaligned weights become relatively smaller. The
resultant vector goes through a ReLU activation layer and a 1x1 convo-
lution. This vector goes through a sigmoid layer which scales the vector
between the range [0, 1], producing the attention coefficients (weights),
where coefficients closer to 1 indicate more relevant features. The atten-
tion coefficients are then upsampled to the dimensions of the vector x_l. The attention coefficients are multiplied element-wise with the original x_l vector. This is then passed along the skip connection as normal.
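A minimal sketch of this attention gate in tf.keras is given below; the intermediate channel count and the bilinear upsampling are assumptions for illustration.

    # Sketch of an attention gate: x_l is the encoder skip feature, g is the
    # gating signal from the next lowest (decoder) level.
    import tensorflow as tf
    from tensorflow.keras import layers

    def attention_gate(x_l, g, inter_channels):
        theta_x = layers.Conv2D(inter_channels, 1, strides=2)(x_l)  # strided conv on x_l
        phi_g = layers.Conv2D(inter_channels, 1)(g)                 # 1x1 conv on g
        # Additive attention: element-wise sum, ReLU, 1x1 conv, sigmoid.
        f = layers.Activation("relu")(layers.Add()([theta_x, phi_g]))
        alpha = layers.Conv2D(1, 1, activation="sigmoid")(f)        # coefficients in [0, 1]
        # Upsample the coefficients back to the size of x_l and rescale it.
        alpha = layers.UpSampling2D(size=2, interpolation="bilinear")(alpha)
        return x_l * alpha   # broadcast the single-channel weights over all channels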

Figure 8: Attention U-Net

The complete architecture of attention U-Net is shown (Figure 8). At-


tention U-Net is a combination of U-Net and attention gates. Attention gates are
introduced at the skip connections of the U-Net architecture. This addi-
tional feature helps in suppressing the irrelevant regions. The input to the
attention gate comes from the encoder side as well as the decoder side.
The input from the decoder side has smaller dimensions compared to that
from the encoder side.
The input images are resized to 256 x 256 x 3 and the encoder output
dimension is 16 x 16 x 128. Output size is 256 x 256 x 1.

3.5 Double U-Net

Double U-Net [7] is a combination of two U-Net architectures stacked on


top of each other. The first U-Net uses a pre-trained VGG-19 [2] as the
encoder, which has already learned features from ImageNet and can be

transferred to another task easily. The main reasons for using the VGG-19 network are: (1) VGG-19 is a lightweight model compared to other pre-trained models, (2) the architecture of VGG-19 is similar to U-Net, making it easy to concatenate with U-Net, and (3) it allows a much deeper network, producing a better output segmentation mask. To cap-
ture more semantic information efficiently, another U-Net is added at the
bottom. Atrous Spatial Pyramid Pooling (ASPP) [8] is adopted to capture
contextual information within the network. ASPP uses atrous convolution
or dilated convolution which is a special type of convolution.

Figure 9: Atrous Convolution: 2D convolution using a 3 x 3 kernel with a dilation rate


of 2 and no padding

Atrous or dilated convolutions [9] introduce another parameter to con-


volutional layers called the dilation rate. This defines the spacing between
the values in a kernel. A 3 x 3 kernel with a dilation rate of 2 is shown
(Figure 9). It will have the same field of view as a 5 x 5 kernel, while only

using 9 parameters. This is similar to taking a 5 x 5 kernel and deleting
every second column and row. This delivers a wider field of view at the
same computational cost. Atrous convolutions are particularly popular in
the field of real-time segmentation. They are used when a wide field of view is
needed and multiple convolutions or larger kernels cannot be afforded.
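In tf.keras, this only requires the dilation_rate argument of an ordinary convolution, as in the short sketch below (the shapes are illustrative).

    # A 3x3 convolution with dilation rate 2 covers a 5x5 field of view while
    # still using only 9 weights per filter and input channel.
    import tensorflow as tf
    from tensorflow.keras import layers

    x = layers.Input((256, 256, 64))
    y = layers.Conv2D(64, kernel_size=3, dilation_rate=2, padding="same")(x)
    print(y.shape)   # (None, 256, 256, 64): same spatial size, wider receptive field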

Figure 10: Atrous Spatial Pyramid Pooling (ASPP)

In ASPP (Figure 10), parallel atrous convolutions with different rates are applied to the input feature map and the results are concatenated. As objects
of the same class can have different scales in the image, ASPP helps to
account for different object scales which can improve the accuracy.
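A sketch of an ASPP block along these lines is given below; the dilation rates and filter count are assumptions borrowed from common DeepLab-style settings, not values taken from this report.

    # ASPP sketch: parallel atrous convolutions with different rates, concatenated
    # and fused with a 1x1 convolution. Rates and filters are assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers

    def aspp_block(x, filters=64, rates=(1, 6, 12, 18)):
        branches = []
        for r in rates:
            k = 1 if r == 1 else 3                      # a 1x1 branch plus dilated 3x3 branches
            b = layers.Conv2D(filters, k, dilation_rate=r, padding="same")(x)
            b = layers.BatchNormalization()(b)
            b = layers.Activation("relu")(b)
            branches.append(b)
        y = layers.Concatenate()(branches)
        # Fuse the multi-scale context into a single feature map.
        return layers.Conv2D(filters, 1, activation="relu")(y)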

Figure 11: Squeeze and Excitation Block

Double U-Net also uses Squeeze and Excitation blocks [10] (Figure
11), which improve channel interdependencies at almost no computational
cost.
Convolutional Neural Networks (CNN) use their filters to extract in-
formation from images. Lower layers find trivial pieces of context like
edges or high frequencies, while upper layers can detect faces, text, or
other complex geometrical shapes. All of this works by fusing the spatial
and channel information of an image. The different filters will first find
spatial features in each input channel before adding the information across
all available output channels. The network weighs each of its channels
equally when creating the output feature maps. Squeeze and excite block
adds a content aware mechanism to weigh each channel adaptively. To
get a global understanding of each channel, the feature maps are squeezed
into a single numeric value. This results in a vector of size n, where n is
equal to the number of convolutional channels. Then, it is fed through a
two-layer neural network, which outputs a vector of the same size. These
n values can now be used as weights on the original feature maps, scaling
each channel based on its importance.
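A sketch of a squeeze-and-excitation block implementing this idea is shown below; the reduction ratio is an assumption.

    # Squeeze-and-excitation sketch: squeeze each channel to one value, learn
    # per-channel weights with a small two-layer network, and rescale the map.
    import tensorflow as tf
    from tensorflow.keras import layers

    def se_block(x, ratio=8):
        channels = x.shape[-1]
        s = layers.GlobalAveragePooling2D()(x)                 # squeeze: one value per channel
        s = layers.Dense(channels // ratio, activation="relu")(s)
        s = layers.Dense(channels, activation="sigmoid")(s)    # excitation: channel weights
        s = layers.Reshape((1, 1, channels))(s)
        return x * s                                           # scale each channel by its weight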
The architecture of the Double U-Net is shown (Figure 12). It can be seen as two networks connected together, Network 1 and Network 2. In Network 1, a VGG-19 pretrained on ImageNet is used as the encoder. The encoder output is fed to the decoder through ASPP to generate the first segmented mask, Output 1. The decoder of Network 1 is the same as that of U-Net.

Figure 12: Double U-Net

An element-wise multiplication is performed between Output 1 and the input image. This acts as a kind of attention mechanism. The multiplied output is then fed to the encoder of Network 2. The encoder and decoder of Network 2 are similar to those of U-Net. The encoder output is again fed through ASPP to the decoder. The second segmented mask, Output 2, is generated at the decoder output. Finally, both masks (Output 1 and Output 2) are concatenated to show the qualitative difference between the intermediate mask (Output 1) and the final predicted mask (Output 2).
The squeeze and excitation block is used in the encoder of Network
1 and decoder blocks of Network 1 and Network 2. This reduces the re-
dundant information and passes the most relevant information. In the first
decoder, only skip connections from the first encoder are used, but in the second decoder, skip connections from both encoders are used. This
maintains the spatial resolution and enhances the quality of the output
feature maps.
Summary: To summarise the segmentation architectures used in this

project, the first one is FCN which modifies VGG16 by introducing skip
connections and replacing fully connected layers with convolutional lay-
ers to get an image as the output. Three variants of FCN (32s, 16s, and
8s) were implemented. U-Net is the second architecture implemented and
it uses an encoder-decoder architecture with skip connections from the
encoder side to the decoder to improve the information flow. ResU-Net
is the third architecture implemented, which combines deep residual learning with the basic U-Net architecture to attain better information flow.
The fourth architecture implemented is Attention U-Net. It introduces
attention gates to suppress irrelevant regions. Double U-Net is the final
architecture implemented. It stacks two U-Nets on top of each other. To
improve the performance, ASPP, squeeze and excitation block, and atten-
tion mechanism are introduced.

3.6 Dataset

The dataset used in this project is the Kvasir-SEG dataset [11]. This
dataset is based on the previous Kvasir dataset, which is the first multi-
class dataset for Gastro-Intestinal (GI) tract disease detection and clas-
sification. The original Kvasir dataset comprises 8,000 GI tract images
from 8 classes where each class consists of 1000 images. In the Kvasir-
SEG dataset, only the polyp class of the original Kvasir dataset is used.
It consists of 1000 polyp images captured through WCE and the corre-
sponding ground truth images. The resolution of the images contained in
the dataset varies from 332 x 487 to 1920 x 1072 pixels.

3.7 Data Augmentation

Data augmentation is an important technique that has been widely used


in the machine learning pipeline. By introducing variations of images,

such as different orientations, locations, scales, brightness, etc., to existing
data, the robustness of the model is increased and over-fitting reduced.
Five basic augmentation methods are used which are rotation, cropping,
horizontal flip, vertical flip, and distortion. After data augmentation, the
number of images in the dataset is increased from 1000 to 6000. 80%
of the dataset (4800 images) is used for training, and 10% each (600 images) is used for validation and testing. Data augmentation is carried out using a library named Albumentations.
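An illustrative Albumentations pipeline for the five augmentations mentioned above is sketched below; the specific transform classes and parameters are assumptions, and the exact pipeline used in the project may differ.

    # Illustrative augmentation pipeline (rotation, cropping, flips, distortion).
    import albumentations as A

    augment = A.Compose([
        A.Rotate(limit=45, p=0.5),
        A.RandomCrop(height=256, width=256, p=0.5),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.GridDistortion(p=0.5),
    ])

    # The same transform is applied to the image and its ground-truth mask:
    # augmented = augment(image=image, mask=mask)
    # image_aug, mask_aug = augmented["image"], augmented["mask"]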

3.8 Training Details

The models are implemented using the TensorFlow library. The optimizer


used is NAdam. The initial learning rate is chosen as 1e-4. Using Re-
duceLROnPlateau, the learning rate is reduced by a factor of 0.1 when a
metric has stopped improving for 4 epochs.
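The corresponding tf.keras configuration can be sketched as below; the loss function and the monitored quantity are assumptions, since they are not stated in this report.

    # Training configuration sketch: NAdam with lr 1e-4 and ReduceLROnPlateau
    # (factor 0.1, patience 4). Loss and monitored metric are assumptions.
    import tensorflow as tf

    optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-4)
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                     factor=0.1, patience=4)

    # model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[reduce_lr])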

3.9 Evaluation Metrics

1) Dice Coefficient: Dice coefficient is a spatial overlap index and a re-


producibility validation metric used in machine learning, especially in
semantic image segmentation. It measures the similarity between the predicted binary segmentation result and the ground truth mask. The value of a Dice coefficient ranges from 0, indicating no spatial overlap between
two sets of binary segmentation results, to 1, indicating complete overlap.
The value of the Dice coefficient equals twice the number of elements
common to both sets divided by the sum of the number of elements in
each set. Formally, the Dice coefficient is defined as:

Dice(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)
where |X| and |Y| are the cardinalities of the two sets (the number of pix-
els in each binary mask image).

2) Intersection over Union: The Intersection over Union (IoU) is another


standard metric to evaluate a segmentation method. The IoU calculates
the similarity between the prediction (A) and its corresponding ground truth (B), as shown in the equation below:

IoU(A, B) = |A ∩ B| / |A ∪ B|

In the equation, t is the threshold. At each threshold value t, a precision value is calculated based on the above equation by matching the predicted object to all the ground truth objects.

3) Recall and Precision: Recall and Precision are calculated using the
equations given below:

Recall = TP / (TP + FN),    Precision = TP / (TP + FP)
where TP and TN denote the number of true positives and true negatives,
FP and FN denote the number of false positives and false negatives. A
detection is considered a true positive when the center of the prediction
bounding box is located within the ground truth bounding box. In bi-
nary classification, recall is also referred to as sensitivity, which shows the model's ability to return most of the true positive samples, for example,
polyps in this project. Precision represents the model’s ability to detect
more true positives than false positives, for example, more real polyps
than incorrectly detected normal tissue.
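For reference, the four metrics can be computed from binary masks as in the NumPy sketch below; the small epsilon added to avoid division by zero is an implementation choice, not part of the definitions above.

    # Dice, IoU, recall, and precision on binary masks.
    import numpy as np

    def evaluate(pred, gt, eps=1e-7):
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        dice = 2 * tp / (pred.sum() + gt.sum() + eps)       # 2|X ∩ Y| / (|X| + |Y|)
        iou = tp / (np.logical_or(pred, gt).sum() + eps)    # |A ∩ B| / |A ∪ B|
        recall = tp / (tp + fn + eps)                       # TP / (TP + FN)
        precision = tp / (tp + fp + eps)                    # TP / (TP + FP)
        return dice, iou, recall, precision

    # Example: dice, iou, recall, precision = evaluate(pred_mask > 0.5, gt_mask > 0.5)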

Chapter 4

Results

Seven segmentation architectures were implemented for the segmenta-


tion of WCE images. All the networks were trained on the Kvasir-SEG
dataset. It consists of 1000 polyp images captured through WCE and the
corresponding ground truth images. Data Augmentation was introduced
to reduce overfitting by increasing the number of images. After data aug-
mentation, the number of images in the dataset is increased from 1000
to 6000. 80% of the dataset (4800 images) is used for training, and 10% each (600 images) is used for validation and testing.
After training all the networks on the Kvasir-SEG dataset, the follow-
ing results were obtained.

4.1 FCN-32s

The metric values obtained during training are good and there was some
level of reduction in the metric values during testing. The Dice coefficient obtained during training is 0.9664, while during testing it is 0.8307. The IoU
obtained during training and testing is 0.9356 and 0.7175 respectively.
The recall and precision obtained are 0.9046 and 0.9871 respectively in
training and 0.7782 and 0.8710 in testing.

The output obtained using FCN-32s is shown (Figure 13). The seg-
mented output is similar to the ground truth.

Figure 13: FCN-32s Results. (a) Input image (b) Ground truth (c) Segmented output

4.2 FCN-16s

The Dice coefficient obtained during training is 0.9653, while during testing it is 0.8529. The IoU obtained during training and testing is 0.9336 and
0.7502 respectively. The recall and precision obtained are 0.9038 and
0.9848 respectively in training and 0.7915 and 0.8921 in testing. The
metric values obtained are almost similar to that of FCN-32s. There is a
slight improvement in the metric values during testing compared to FCN-
32s.
The output obtained using FCN-16s is shown (Figure 14). The seg-
mented output is similar to the ground truth.

Figure 14: FCN-16s Results. (a) Input image (b) Ground truth (c) Segmented output

4.3 FCN-8s

The Dice coefficient obtained during training is 0.9509, while during testing it is 0.8580. The IoU obtained during training and testing is 0.9076 and
0.7570 respectively. The recall and precision obtained are 0.8878 and
0.9755 respectively in training and 0.8150 and 0.8800 in testing. The
metric values obtained are almost similar to that of the other FCN vari-
ants. There is a slight improvement in the metric values during testing
compared to the other two.
The output obtained using FCN-8s is shown (Figure 15). The seg-
mented output is similar to the ground truth.

Figure 15: FCN-8s Results. (a) Input image (b) Ground truth (c) Segmented output

4.4 U-Net

The Dice coefficient obtained during training is 0.9248, while during testing it is 0.8016. The IoU obtained during training and testing is 0.8618 and
0.6766 respectively. The recall and precision obtained are 0.8604 and
0.9593 respectively in training and 0.7546 and 0.8379 in testing. The
metric values are good, but there is a small reduction compared to the
FCN variants.
The output obtained using U-Net is shown (Figure 16). The segmented
output is similar to the ground truth.

Figure 16: U-Net Results. (a) Input image (b) Ground truth (c) Segmented output

4.5 Res U-Net

The Dice coefficient obtained during training is 0.9576, while during testing it is 0.7649. The IoU obtained during training and testing is 0.9193 and
0.6268 respectively. The recall and precision obtained are 0.8942 and
0.9832 respectively in training and 0.7111 and 0.8071 in testing. The
training metric values are better compared to U-Net, but a bit lower compared to the FCN variants.
The output obtained using ResU-Net is shown (Figure 17). The seg-
mented output is similar to the ground truth.

Figure 17: ResU-Net Results. (a) Input image (b) Ground truth (c) Segmented output

4.6 Attention U-Net

The Dice coefficient obtained during training is 0.9489, while during testing it is 0.8028. The IoU obtained during training and testing is 0.9039 and
0.6778 respectively. The recall and precision obtained are 0.8870 and
0.9768 respectively in training and 0.7530 and 0.8421 in testing. The
metric values are better compared to U-Net and ResU-Net, but a bit lower
compared to the FCN variants.
The output obtained using Attention U-Net is shown (Figure 18). The
segmented output is similar to the ground truth.

Figure 18: Attention U-Net Results. (a) Input image (b) Ground truth (c) Segmented
output

4.7 Double U-Net

The Dice coefficient obtained during training is 0.7538, while during testing it is 0.7313. The IoU obtained during training and testing is 0.6080 and
0.5803 respectively. The recall and precision obtained are 0.8878 and
0.9315 respectively in training and 0.8582 and 0.8979 in testing.
The metric values obtained during training are lower than those of the other architectures, but the test set performance is very good. The dif-
ference between the metric values of the training and test set is very low.
This means that the overfitting is avoided to a greater extent. The out-
put obtained using Double U-Net is shown (Figure 19). The segmented
output is similar to the ground truth.

Figure 19: Double U-Net Results. (a) Input image (b) Ground truth (c) Segmented output

Summary: To summarize the results, each architecture gives a good segmented output which is almost identical to the ground truth. The FCN variants, especially FCN-8s, showed better performance in the training phase as well as the testing phase. As the complexity increases, the difference between the training phase results and the testing phase results
keeps on decreasing. This is a sign of decreasing overfitting. The metric
values of each architecture in the training and testing phase are shown in
the table (Table 1).

Table 1: Results

Architecture        Training                              Testing
                    Dice   IoU    Recall  Precision       Dice   IoU    Recall  Precision
FCN-32s             0.97   0.94   0.90    0.98            0.83   0.72   0.78    0.87
FCN-16s             0.97   0.94   0.90    0.98            0.85   0.75   0.79    0.89
FCN-8s              0.95   0.91   0.89    0.97            0.86   0.76   0.81    0.88
U-Net               0.92   0.86   0.86    0.96            0.80   0.68   0.75    0.84
ResU-Net            0.96   0.91   0.89    0.98            0.76   0.62   0.71    0.80
Attention U-Net     0.95   0.90   0.89    0.98            0.80   0.68   0.75    0.84
Double U-Net        0.75   0.61   0.89    0.93            0.73   0.58   0.86    0.90

Chapter 5

Conclusion

Seven image segmentation architectures (FCN-32s, FCN-16s, FCN-8s,


U-Net, ResU-Net, Attention U-Net, and Double U-Net) were implemented
and trained on the Kvasir-SEG dataset for the segmentation of Wireless
Capsule Endoscopy images. The results obtained using each architec-
ture were good. Each architecture was able to produce segmented output
which was almost identical to the ground truth. Out of the seven segmen-
tation architectures, FCN-8s had slightly better performance compared
with other architectures in both the training and testing phase. Double
U-Net, which has the most complex architecture, was found to avoid overfitting to a greater extent than the others.

Bibliography

[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convo-
lutional networks for semantic segmentation. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages
3431–3440, 2015.

[2] Karen Simonyan and Andrew Zisserman. Very deep convolu-


tional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.

[3] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Con-
volutional networks for biomedical image segmentation. In Inter-
national Conference on Medical image computing and computer-
assisted intervention, pages 234–241. Springer, 2015.

[4] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang. Road extrac-
tion by deep residual u-net. IEEE Geoscience and Remote Sensing
Letters, 15(5):749–753, 2018.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–
778, 2016.

[6] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mat-


tias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh,
Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning

where to look for the pancreas. arXiv preprint arXiv:1804.03999,
2018.

[7] Debesh Jha, Michael A Riegler, Dag Johansen, Pål Halvorsen, and
Håvard D Johansen. Doubleu-net: A deep convolutional neural net-
work for medical image segmentation. In 2020 IEEE 33rd Inter-
national Symposium on Computer-Based Medical Systems (CBMS),
pages 558–564. IEEE, 2020.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial
pyramid pooling in deep convolutional networks for visual recog-
nition. IEEE transactions on pattern analysis and machine intelli-
gence, 37(9):1904–1916, 2015.

[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin


Murphy, and Alan L Yuille. Deeplab: Semantic image segmen-
tation with deep convolutional nets, atrous convolution, and fully
connected crfs. IEEE transactions on pattern analysis and machine
intelligence, 40(4):834–848, 2017.

[10] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In


Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 7132–7141, 2018.

[11] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen,


Thomas de Lange, Dag Johansen, and Håvard D Johansen. Kvasir-
seg: A segmented polyp dataset. In International Conference on
Multimedia Modeling, pages 451–462. Springer, 2020.

Biodata

NAME: Jayakrishnan T
DATE OF BIRTH: 01 November 1995
CONTACT NO.: 7510695597
EMAIL ID: jayan01krishnan@gmail.com

EDUCATIONAL QUALIFICATIONS
BACHELOR OF TECHNOLOGY

Institution Government Engineering College, Thrissur


University University of Calicut
Stream Electronics and Communication Engineering
Year of Passing 2018

MASTER OF TECHNOLOGY

Institution National Institute of Technology Karnataka


Specialization Signal Processing and Machine Learning
Year of Passing 2021

