Thesis Abhishek Singh Final
Dissertation submitted to
National Institute of Technology, Agartala
for the award of the degree of
Master of Technology
(Communication Engineering)
by
Abhishek Singh
(18PEC001)
Dedicated
to
my Parents
and
everyone who has been a part of my life at any moment of time.
APPROVAL SHEET
This thesis entitled “Segmentation of polyps using modified U-Net” by Mr. Abhishek Singh
(18PEC001) is approved for the degree of M.Tech. in Communication Engineering
specialization.
Examiners
________________________
________________________
________________________
Supervisor
________________________
________________________
________________________
Chairman
________________________
Date :____________
Place :____________
DECLARATION
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources. I
also declare that I have adhered to all principles of academic honesty and integrity and have not
misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.
_________________________________
Abhishek Singh
Enrolment No.: 18PEC001
Date: ___________
CERTIFICATE
It is certified that the work contained in the thesis titled “Segmentation of polyps using
modified U-Net” by Mr. Abhishek Singh has been carried out under my/our supervision and
that this work has not been submitted elsewhere for a degree.
___________________________
Dr. Dibyendu Ghoshal
Associate Professor
Electronics and Communication Engineering
N.I.T. Agartala
June, 2020
ACKNOWLEDGEMENT
I would like to express my gratitude to Dr. Dibyendu Ghoshal, Associate Professor, NIT Agartala,
department of Electronics & Communication Engineering who has always been an inspiration,
guiding factor and support throughout this project work. I am extremely grateful to have the
opportunity to work with him.
I wish to express my gratitude to Dr. Atanu Chowdhury, Assistant Professor, NIT Agartala, and
Mr. Manik Bhowmik, Assistant Professor, NIT Agartala, Department of Electronics and
Communication Engineering, and the entire faculty as well as the staff of the Department, who
devoted their valuable time and helped me towards the successful completion of this work.
I would also like to thank Mr. Rajiv Das, PhD Scholar, Mr. Shubham Anand, M.Tech Scholar,
NIT Agartala, and all other seniors and friends who have extended all sorts of help in accomplishing
this undertaking. Lastly, I would like to thank my parents and my family for their years of love,
support and encouragement.
LIST OF FIGURES
Figure 5.4: Resultant from trained U-Net model
LIST OF TABLES
Table 5.1: Loss and accuracy of the modified U-Net and Original U-Net
LIST OF ABBREVIATIONS
ABSTRACT
CONTENTS
Title
Certificate of Approval
Declaration
Certificate
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
Abstract
Contents
Chapter 1
Introduction
1.1 Segmentation of polyps in GI tract
1.1.1 Polyps in GI tract
1.1.2 Segmentation of polyps
1.1.3 Segmentation of polyps using neural network
1.1.4 Significance of the work in medical field
1.2 Software Used
1.2.1 Anaconda
1.2.2 Google Colab
1.3 Objective and Scope of the work
1.4 References
Chapter 2
Literature Review
2.1 Review of research papers
2.2 References
Chapter 3
Unet CNN with modifications
3.1 Unet Architecture
3.2 Modifications possible
3.3 Modified U-Net
3.4 References
Chapter 4
Model training factors
4.1 Parameters
4.1.1 Hyperparameters
4.1.2 Learned parameters
4.2 Layers
4.2.1 Input layer
4.2.2 Convolutional layer
4.2.3 Pooling layer
4.2.4 Concatenation layer
4.2.5 Upconvolutional layer
4.3 Activation functions
4.4 Loss function
4.4.1 Mean squared error
4.4.2 Binary Crossentropy
4.5 Optimizer
4.5.1 Gradient Descent
4.5.2 ADAM
4.6 References
Chapter 5
Model training and result analysis
5.1 Kvasir-SEG dataset
5.2 Metrics used
5.3 Model training
5.4 Comparison with U-Net and watershed methods
5.5 References
6.1 Conclusion
6.2 Future Scope
Appendix
Chapter 1
Introduction
Image segmentation is the process of subdividing an image into its constituent parts or objects in
order to extract information and features that are useful in machine vision applications.
The level of subdivision is always application dependent.
Approaches: 1. Discontinuity based approach
2. Similarity based approach
3. Neural network based approach
Edge linking: This operation is applied to edge-detected images to remove discontinuities in the
edges, which may occur due to noise.
Ex: Hough transform for global linking
Similarity based approach: In similarity based segmentation we group regions based on the
similarity of their intensity values in order to segment them.
This approach includes 1) Thresholding the image based on its histogram
2) Region growing
3) Region splitting and merging
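As a minimal sketch of the first similarity based technique (thresholding), the following NumPy snippet separates bright pixels from a dark background. The iterative rule used to pick the threshold, the function name `threshold_segment` and the toy image are assumptions made for illustration; they are not taken from this work.

```python
import numpy as np

def threshold_segment(image, threshold=None, max_iter=100, tol=1e-6):
    """Return (mask, threshold); mask is True where image > threshold."""
    image = np.asarray(image, dtype=float)
    if threshold is None:
        # Assumed rule: start at the global mean, then repeatedly move the
        # threshold to the midpoint of the two class means until it settles.
        threshold = image.mean()
        for _ in range(max_iter):
            foreground = image[image > threshold]
            background = image[image <= threshold]
            if foreground.size == 0 or background.size == 0:
                break
            new_threshold = 0.5 * (foreground.mean() + background.mean())
            if abs(new_threshold - threshold) < tol:
                break
            threshold = new_threshold
    return image > threshold, threshold

# Hypothetical 3x4 intensity image: four bright pixels on a dark background.
img = np.array([[10, 12, 11, 200],
                [9, 13, 210, 205],
                [11, 10, 12, 198]])
mask, t = threshold_segment(img)
print(mask.astype(int))
```

A single global threshold only works when the two intensity populations are well separated, which is one reason learned features are attractive for polyps that resemble the surrounding tissue.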
1. Object detection
2. Object segmentation
3. Motion capturing
4. Image captioning
Polyps are lesions within the bowel, detectable as mucosal outgrowths. Polyps in the GI-tract are
precursors of colon cancer. Hence detecting and removing them is a medically important task.
Polyps can be classified into 3 categories based on their appearance.
1. Flat
2. Elevated
3. Pedunculated
Polyps can be distinguished from the GI-tract wall by color or by surface pattern. Detection of
polyps has traditionally been done by manual inspection of the video data collected during endoscopy
and colonoscopy. These procedures are considered the gold standard for the detection of polyps.
Since polyps mostly look like the wall of the GI-tract, and the analysis of hours-long video is an
exhausting process, polyps may be overlooked by doctors. Hence the design of a polyp detection
and segmentation network can improve the diagnosis of the patient.
After detection the polyps can be removed using the endoscopic mucosal resection process.
Polyps are found inside the GI-tract. For the purpose of treatment, the size, shape, color and location
of the polyp must be known. That is why we extract the portion of the image which contains the polyp
and analyse it to get medical recommendations.
Polyps can be segmented out based on texture and shape using traditional image processing
techniques, or we can employ a neural network to learn the features required to segment out the
polyp.
AI is an emerging field in the automation of a huge variety of tasks such as detection, classification,
sound generation, image generation, game playing etc. The task of segmentation also includes the
detection part, hence it is important to have a robust algorithm for the purpose.
Neural networks learn the required features directly from the data, hence they can segment out the
object without handcrafting of features. For the purpose of segmentation the
network uses convolutional layers, which are very effective in image processing tasks. The
procedure is explained further in the following topics.
1.1.5 Significance of the work in medical field
Polyps are precursors of cancer, hence their removal is of utmost priority. Colon cancer
causes a large number of deaths throughout the world, and by detecting and removing polyps at an
early stage of development we can decrease the number of deaths.
Segmentation of the polyp can help in generating a report and treatment possibilities. An automated
system can analyse hours-long videos of several patients and make recommendations on the spot.
Analysis by computer may also encourage people to undergo endoscopy and colonoscopy.
1.2.1 Anaconda
Anaconda is a free software distribution which provides an easy way to install and run Python and R
programs and to upload them to the cloud. It provides several software packages such as Jupyter
Notebook, Spyder, RStudio, Orange, Glue, etc. through a GUI called Anaconda Navigator.
This software simplifies package management and product development. To install any
package we can simply select and install it using the Anaconda Navigator GUI.
Packages can also be installed from the command line using the pip and conda commands. An
advantage of these commands is that they also install the dependencies of the requested package
along with it.
Anaconda also provides a cloud service, Anaconda Cloud, to find, store, access, and share
notebooks. It hosts repositories like PyPI and environments to run Python and R code
online.
A default installation is available for both Python 3.7 and Python 2.7, and we can customize
each environment to include whatever version of each package our application requires.
1.2.2 Google Colab
Google Colab is a free Jupyter Notebook environment that runs in the cloud. The data needed in the
notebook and the notebook itself can be stored in Google Drive, so that they can be accessed and run
anywhere and from any device. To use the Google Colab services for free the user just
needs a Gmail account.
The Colab environment provides the packages required to run most scientific computing programs,
so the user usually does not need to install the libraries required by the code.
It provides support for R, Julia, Python and Swift.
1.3 Objective and Scope of the work
The GI-tract plays a very important role in the proper functioning of the human body. It is
also prone to many diseases caused by the substances passing through it. A polyp is an abnormal
tissue growth on the inner wall of the GI-tract. Since polyps are prone to become cancerous, in
this work we present a neural network model to identify, locate and segment out the polyp from
images of the GI-tract collected through endoscopy or colonoscopy.
The method explores the improvements possible in the existing U-Net model. Using this
model the video of the GI-tract can be analysed speedily and manual error can be avoided.
After segmentation the size of the polyp can be calculated, and by combining it with color
information we can estimate the possibility of the polyp becoming cancerous.
The method discussed here can be used in the automation of the polyp removal procedure from the
GI-tract. It can also be used to get recommendations for treatment and precautions to cure the
polyp outgrowth.
1.4 References
1. Ronneberger, Olaf, Philipp Fischer and Thomas Brox. “U-Net: Convolutional Networks
for Biomedical Image Segmentation.” ArXiv abs/1505.04597 (2015).
2. C. Di Ruberto, A. Dempster, S. Khan and B. Jarra, "Segmentation of blood images using
morphological operators," Proceedings 15th International Conference on Pattern
Recognition. ICPR-2000, Barcelona, Spain, 2000, pp. 397-400 vol.3.
3. A. V. Mamonov, I. N. Figueiredo, P. N. Figueiredo, and Y.-H. R. Tsai, “Automated Polyp
Detection in Colon Capsule Endoscopy,” IEEE Transactions on Medical Imaging, vol. 33,
no. 7, pp. 1488–1502, 2014.
4. S. Albawi, T. A. Mohammed and S. Al-Zawi, "Understanding of a convolutional neural
network," 2017 International Conference on Engineering and Technology (ICET), Antalya,
2017, pp. 1-6, doi: 10.1109/ICEngTechnol.2017.8308186
Chapter 2
Literature Review
A significant amount of work has been done in the area of image segmentation, both with neural
networks and with digital image processing techniques. Here are a few important works related to my
work. These papers outline the past and present work and the future scope in the area of image
segmentation.
A convolutional neural network is a network which uses the convolution operation to learn features
from the image. Each convolution operation detects one feature; to perform it we require an image
and a kernel.
For the computation of one feature we need to learn a fixed number of weights per kernel. For example,
for a 3 * 3 kernel we need to learn 9 weights, and similarly for a 4 * 4 kernel we need to learn 16
weights. Hence the number of weights to learn depends on the kernel size and not on the size of the
image.
The main advantage of using convolutional layers in the network is that we can design a bigger
network and still have a manageable number of parameters to learn. The network employs several
other types of layers, like the pooling layer to decrease the number of pixels in the image and the
concatenation layer to merge the features learned in different parts of the network.
For image data, a convolutional network achieves better performance than a network of fully connected
layers while using far fewer parameters. Hence, for learning image-related features, convolutional
networks are preferred over other types of neural networks.
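The parameter saving described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names `conv_params` and `dense_params` and the 256 x 256 RGB input size are assumptions, not values taken from this work.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable parameters of one convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def dense_params(n_inputs, n_outputs, bias=True):
    """Learnable parameters of one fully connected layer."""
    return n_inputs * n_outputs + (n_outputs if bias else 0)

# One single-channel kernel, as in the text: 3*3 -> 9 weights, 4*4 -> 16 weights.
single_3x3 = conv_params(3, 3, 1, 1, bias=False)
single_4x4 = conv_params(4, 4, 1, 1, bias=False)

# Assumed example: 64 output features computed from a 256 x 256 RGB image.
conv_layer = conv_params(3, 3, 3, 64)        # independent of the image size
dense_layer = dense_params(256 * 256 * 3, 64)  # grows with the image size

print(single_3x3, single_4x4, conv_layer, dense_layer)
```

Under these assumptions the convolutional layer needs 1,792 parameters while the fully connected layer needs 12,582,976.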
U-Net:
It is a fully convolutional neural network designed for semantic segmentation. This model relies on
heavy use of data augmentation, so that it can train 4 to 5 times on the same data presented in
different variations. The model has proved to be better than the well known sliding window method
of segmentation. At the output, the network produces a semantic map of the image which assigns
each pixel a probability of being part of the polyp.
The network follows the encoder-decoder model. It has two parts, a contracting path and an
expanding path, joined by a bottleneck layer.
The contracting path captures the context of the image and learns its features using convolution
filters. At each successive layer the number of filters doubles and the size of the image is
halved. U-Net uses skip connections to transfer the features learned in the encoder directly to the
corresponding decoder part. This allows the learned features to be reused in the decision making
process and increases the resolution at which context is captured.
The expanding path uses the features learned in the contracting path to precisely locate the pixels
which belong to the polyp. It takes the features from the corresponding contracting layers and
concatenates them with the features coming from the layers below.
The performance of the U-Net model has been very good for segmentation, since data augmentation
causes the neural network to learn invariance and robustness to deformations in the data.
Figure 2.1: U-Net Architecture
MDU-Net:
This network is the result of modifications made to the U-Net. It fuses the feature maps from
the higher layers and the lower layers to strengthen feature propagation in the current layer.
In this way the network is able to cope with segmentation abnormalities and histological variations
and to predict each pixel with a high level of accuracy. This matters because even a marginal bias
can result in a high rate of false clinical treatment.
MDU-Net has two types of connections in addition to those of U-Net. The intra-block connections of
the MDU-Net embed a dense block into the traditional convolutional block. In this way the features
learned in the encoder block are fed to the different-level blocks of the decoder.
The inter-block connections feed the features learned in the previous layer to the current layer. This
causes the MDU-Net to learn features of the features in the network.
1. The work analyses the multi-scale dense U-Net architecture with quantization, which affects the
accuracy and performance of the model.
2. It also analyses the influence of multi-scale dense connections on the U-Net.
Figure 2.2: MDU-Net Architecture
Stacked U-Net:
The Stacked U-Net architecture uses the power of stacking several miniaturised U-Net models to
get a performance boost. It learns long distance contextual information while retaining high spatial
resolution at the output.
Several segmentation architectures suffer from parameter heaviness and low resolution; this model
attempts to solve those problems too.
U-Net++: A Nested U-Net Architecture for Medical Image Segmentation
U-Net++ is made up of an encoder and a decoder that are connected to each other by a series of
nested dense convolutional blocks. U-Net++ bridges the semantic gap between the feature
maps of the encoder and decoder blocks prior to fusion. For example, the semantic gap
between $(X^{0,0}, X^{1,3})$ is bridged using a dense convolution block with three convolution layers.
In the graphical abstract, black indicates the original U-Net, green and blue show dense convolution
blocks on the skip pathways, and red indicates deep supervision. The red, green, and blue components
distinguish U-Net++ from U-Net.
This architecture takes advantage of re-designed skip pathways and deep supervision. The
re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder
and decoder sub-networks, resulting in a possibly simpler optimization problem for the optimizer to
solve.
SegNet has an encoder network and a corresponding decoder network, which are followed by a final
pixel-wise classification layer. This network is shown below. The encoder network consists of 13
convolutional layers which correspond to the first 13 convolutional layers of the VGG16 network
designed for object classification. Hence we can initialize the training process from weights trained
for classification on large datasets.
Here we can discard the fully connected layers in favour of retaining higher resolution feature maps
at the deepest encoder output. This also reduces the number of parameters in the SegNet encoder
network significantly as compared to other architectures. Each encoder layer has a corresponding
decoder layer and hence the decoder network has 13 layers. The final decoder output is fed to a
multi-class soft-max classifier to produce class probabilities for each pixel independently.
SegNet is efficient since it stores only the max-pooling indices of the feature maps and uses
them in its decoder network to achieve good performance. On large and well known datasets SegNet
performs competitively, achieving high scores for road scene understanding.
H-Dense U-Net: Hybrid Densely Connected U-Net for Liver and Tumor Segmentation from
CT Volumes
Dense U-Net effectively probes hierarchical intra-slice features for liver and tumor segmentation.
Its densely connected paths and U-Net connections are carefully integrated, based on pre-defined
design principles, to improve liver tumor segmentation performance. The H-Dense U-Net framework
explores hybrid features for segmentation of the liver and tumor.
The hybrid feature learning architecture sidesteps the problems that 2D networks neglect the
volumetric context and that 3D networks suffer from heavy computational cost, and it can serve as a
new paradigm for effectively exploiting 3D contexts. This method ranked 1st on lesion
segmentation and achieved very competitive performance on liver segmentation in the 2017 LiTS
Leaderboard, and it also achieved state-of-the-art results on the 3DIRCADb dataset.
Figure 2.6: H-Dense U-Net
Mathematical morphology is used here to segment the blood image. The system is based on
morphological techniques: it uses granulometries to evaluate the size of the red cells and the size of
the nuclei of parasites, opening or closing to enhance, suppress or smooth some areas, and thinning to
improve cell contours.
The first step in the whole analysis procedure is the detection of white blood cells and schizonts and
their removal from the image to be segmented. Platelets are ignored in this analysis because they are
regarded as noise. The image produced after segmentation contains two kinds of objects, "red blood
cell" and "background", with the aim of isolating each individual red blood cell, especially when the
cells overlap, are partially occluded and form clusters in the viewing field of the microscope.
Finally it locates the cell contours and isolates them.
Granulometry is used to capture information about objects of a particular size and shape. The method
uses a non-flat (hemisphere) disk-shaped structuring element to enhance the roundness and the
compactness of the red cells before applying the watershed algorithm to finally segment the image.
2.2 References:
1. Ronneberger, Olaf, Philipp Fischer and Thomas Brox. “U-Net: Convolutional Networks for
Biomedical Image Segmentation.” ArXiv abs/1505.04597 (2015).
2. Zhou, Zongwei & Rahman Siddiquee, Md Mahfuzur & Tajbakhsh, Nima & Liang, Jianming.
(2018). UNet++: A Nested U-Net Architecture for Medical Image Segmentation.
3. Badrinarayanan, Vijay et al. “SegNet: A Deep Convolutional Encoder-Decoder Architecture for
Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017):
2481-2495.
4. X. Li, H. Chen, X. Qi, Q. Dou, C. Fu and P. Heng, "H-DenseUNet: Hybrid Densely Connected
UNet for Liver and Tumor Segmentation From CT Volumes," in IEEE Transactions on Medical
Imaging, vol. 37, no. 12, pp. 2663-2674, Dec. 2018, doi: 10.1109/TMI.2018.2845918.
5. Zhang, Jiawei et al. “MDU-Net: Multi-scale Densely Connected U-Net for biomedical image
segmentation.” ArXiv abs/1812.00352 (2018): n. pag.
6. Sun, Tao & Chen, Zehui & Yang, Wenxiang & Wang, Yin. (2018). Stacked U-Nets with Multi-
output for Road Extraction. 187-1874. 10.1109/CVPRW.2018.00033.
7. C. Di Ruberto, A. Dempster, S. Khan and B. Jarra, "Segmentation of blood images using
morphological operators," Proceedings 15th International Conference on Pattern Recognition.
ICPR-2000, Barcelona, Spain, 2000, pp. 397-400 vol.3, doi: 10.1109/ICPR.2000.903568.
8. S. Albawi, T. A. Mohammed and S. Al-Zawi, "Understanding of a convolutional neural network,"
2017 International Conference on Engineering and Technology (ICET), Antalya, 2017, pp. 1-6, doi:
10.1109/ICEngTechnol.2017.8308186.
Chapter 3
3.1 U-Net Architecture
U-Net is a fully convolutional neural network which is used especially for the segmentation of
biomedical images to identify abnormalities and special features or landmarks.
The architecture has three main parts based on the convolution operation.
1. Contracting path
2. Bottleneck
3. Expansive path
The contracting path acts as a feature detection path which generates features of varying degree for
the input image. It feeds these features to the expansive path through skip
connections. The contracting path decreases the size of the image by half at every successive step of
the architecture, hence the name contracting.
The bottleneck layer serves as a bridge between the contracting path and the expansive path. It
transfers the features learned in the contracting path to the expansive path.
The expansive path concatenates the feature map learned at the corresponding stage of the contracting
path with the feature map coming from the stage below, and expands the feature map using the
up-convolution process.
1. The contracting path consists of the repeated application of two 3x3 convolutions (unpadded), each
followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for
downsampling. At each downsampling step we double the number of feature channels. Every step in the
expansive path consists of an upsampling of the feature map followed by a 2x2 convolution
(“up-convolution”) that halves the number of feature channels, a concatenation with the
correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each
followed by a ReLU activation function.
2. At the final layer a 1x1 convolution is used to map each 64-component feature vector to the
desired number of classes. In total the network has 23 convolutional layers.
3. Since very little training data is available for biomedical tasks, we use data
augmentation by applying elastic deformations to the available training images. This allows the
network to learn invariance to such deformations without the need to see these transformations in
the annotated image corpus. This is important in biomedical segmentation, since deformation is
the most common variation in tissue and realistic deformations can be simulated efficiently.
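The layer arithmetic in points 1 and 2 can be checked with a short sketch. It assumes four downsampling steps, 64 channels at the first level and a 572 x 572 input as in the original U-Net paper, and it counts up-convolutions as convolutional layers; the function name `unet_trace` is hypothetical.

```python
def unet_trace(input_size=572, base_channels=64, depth=4):
    """Trace spatial size and channel count through an unpadded U-Net."""
    size, channels = input_size, base_channels
    for _ in range(depth):
        size -= 4                      # two unpadded 3x3 convs remove 2 pixels each
        if size <= 0 or size % 2:
            raise ValueError("feature map must be positive and even before pooling")
        size //= 2                     # 2x2 max pooling with stride 2
        channels *= 2                  # channels double at each downsampling step
    size -= 4                          # two 3x3 convs in the bottleneck
    bottleneck_channels = channels
    for _ in range(depth):
        size = size * 2 - 4            # up-convolution, then two unpadded 3x3 convs
        channels //= 2                 # up-convolution halves the channels
    # 2 convs per contracting step + 2 in the bottleneck
    # + (1 up-conv + 2 convs) per expansive step + the final 1x1 conv
    n_convs = 2 * depth + 2 + 3 * depth + 1
    return size, bottleneck_channels, channels, n_convs

out_size, bottleneck_ch, final_ch, n_convs = unet_trace()
print(out_size, bottleneck_ch, final_ch, n_convs)
```

With these assumptions the output map is smaller than the input (388 versus 572), which is why the feature maps from the contracting path have to be cropped before concatenation.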
[Figure: U-Net block diagram with Input, Bottle Neck Layer (C, C) and Output]
3.2 Modifications possible:
The U-Net can be modified in several ways to suit the segmentation application at hand. First of
all, we can modify the way the feature map propagates through the network.
Feature maps can be passed to the upper layer of the contracting path or to all layers of the
contracting path. We can also propagate them downwards in the contracting path or upwards in the
expanding path by using skip connections. DenseNet is a very good example of dense feature propagation.
Apart from feature propagation, the U-Net itself can be stacked in many ways to produce
better segmentation results. Stacked U-Net is a very good example of this method.
Apart from the structure, there are several hyperparameters, like batch size, image size, number of
filters, learning rate and optimizer, which can be varied to tune the performance of the network.
3.3 Modified U-Net
The modification which I have made is the inclusion of one extra learning layer between the contracting
path and the expansive path. Using this, the network achieves higher accuracy in less time.
[Figure: Modified U-Net block diagram with Input, 1 * 1 Conv, Bottle Neck Layer (C, C) and Output]
3.4 References
1. Ronneberger, Olaf, Philipp Fischer and Thomas Brox. “U-Net: Convolutional Networks for
Biomedical Image Segmentation.” ArXiv abs/1505.04597 (2015).
2. Zhou, Zongwei & Rahman Siddiquee, Md Mahfuzur & Tajbakhsh, Nima & Liang, Jianming.
(2018). UNet++: A Nested U-Net Architecture for Medical Image Segmentation.
3. Badrinarayanan, Vijay et al. “SegNet: A Deep Convolutional Encoder-Decoder Architecture for
Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017):
2481-2495.
4. X. Li, H. Chen, X. Qi, Q. Dou, C. Fu and P. Heng, "H-DenseUNet: Hybrid Densely Connected
UNet for Liver and Tumor Segmentation From CT Volumes," in IEEE Transactions on Medical
Imaging, vol. 37, no. 12, pp. 2663-2674, Dec. 2018, doi: 10.1109/TMI.2018.2845918.
5. Zhang, Jiawei et al. “MDU-Net: Multi-scale Densely Connected U-Net for biomedical image
segmentation.” ArXiv abs/1812.00352 (2018): n. pag.
Chapter 4
A convolutional neural network has some important factors which need to be decided manually or
calculated automatically to get optimum performance from the network. The important parts and their
roles in the network are discussed below.
4.1.1 Hyperparameters
A hyperparameter is a parameter whose value is used to control the model training and optimisation
process. Hyperparameters are not learned from the data, but they have a considerable impact on the
speed and quality of the learning process.
These parameters represent the size and structure of the components of the network. Since they cannot
be learned, we tune them by experimenting with several values or take pretested values.
Some hyperparameters are the learning rate, number of filters, size of kernel, batch size, etc.
4.1.2 Learned parameters
The weights of the kernels are the main learned parameters.
4.2 Layers
The network is divided into several types of layers. Each layer performs a specific part of the
learning process. There are numerous types of layers which can be used in a neural network; here we
discuss the layers used in this architecture.
The convolution of an input x with a kernel w is defined as
$$ y(t) = (x * w)(t) = \int_{-\infty}^{+\infty} x(a)\, w(t - a)\, da $$
The function y(t) can be seen as a modified version of x(t) weighted by w(a). In a neural network
context, x(t) is the input, w(a) the kernel, and y(t) the output or feature map. Since algorithms only
compute in discrete steps, the convolution operation has to be discretized:
$$ y(t) = (x * w)(t) = \sum_{a=-\infty}^{+\infty} x(a)\, w(t - a) $$
Here, t is an integer. A convolution can be generalized to multiple dimensions, where the functions x
and w are defined on a set t of integers. With a 2-dimensional input x(i, j), for example images with
width and height coordinates i and j, the convolution can be written in the following form:
$$ y(i, j) = (x * w)(i, j) = \sum_{m} \sum_{n} x(m, n)\, w(i - m, j - n) $$
The first prominent implementation of convolutions in neural networks was carried out by LeCun et
al., proving that the discrete convolution has important properties for use in image recognition. It is
an operation that preserves the notion of ordering. Only a few input units feed into a given output
unit, and parameters are systematically reused, such that the same weights are applied to multiple
locations in the input. The convolution operation can be visualized with the example in figure 2.6.
The light blue grid is called the input feature map. The kernel (shaded blue area) slides across the
input feature map. In this example, the kernel is
$$ w = \begin{pmatrix} 0 & 1 & 2 \\ 2 & 2 & 0 \\ 0 & 1 & 2 \end{pmatrix} $$
At each location, the product between each element of the kernel and the input element it overlaps
with is computed and the results are summed to obtain the output at the current location. The result is
called the output feature map (green grid).
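The sliding-window computation described above can be sketched in a few lines of NumPy. The kernel is the one given in the text; the 5 x 5 input and the function name `conv2d_valid` are assumptions made for illustration. As in most deep learning libraries, the kernel is applied without flipping, which is strictly a cross-correlation.

```python
import numpy as np

def conv2d_valid(x, w, stride=1):
    """Slide kernel w over input x (no padding) and sum the element-wise products."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    kh, kw = w.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = x[r:r + kh, c:c + kw]   # input elements the kernel overlaps
            y[i, j] = np.sum(patch * w)     # product, then sum
    return y

w = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]])
x = np.arange(25).reshape(5, 5)   # hypothetical 5x5 input feature map
y = conv2d_valid(x, w)
print(y)
```

The same nine weights are reused at every position, so the 5 x 5 input yields a 3 x 3 output feature map without any additional parameters.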
The following parameters can be adjusted in a convolutional layer:
• The kernel size, also called filter size in neural networks;
• The step size (stride), which is the distance between two consecutive positions of the kernel;
• The zero padding, which is the number of zeros concatenated at the beginning and end of an axis.
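The three parameters listed above determine the size of the output feature map along each axis. The helper below is a small sketch of the usual relation; the name `conv_output_size` and the example values are illustrative assumptions.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_padding = conv_output_size(5, 3)                          # 3x3 kernel on a 5-wide input
strided = conv_output_size(5, 3, stride=2, padding=1)        # stride 2 with one zero each side
same_size = conv_output_size(256, 3, stride=1, padding=1)    # padding 1 keeps the size
print(no_padding, strided, same_size)
```

For a 3 x 3 kernel with stride 1, a zero padding of 1 keeps the output the same size as the input, whereas an unpadded convolution shrinks each axis by 2.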
4.2.3 Pooling layer
The pooling layer ensures invariance of the output y under small translations of the input and is
also used for downsampling. It is very often used in combination with convolutional layers. The pooling
operation computes a summary statistic of nearby inputs using an arithmetic function, and can be
freely defined. One of the most popular pooling operations is called max pooling, which returns the
maximum input within a rectangular neighbourhood. The pooling operation is similar to discrete
convolution, but replaces the linear combination with some other function, such as the maximum or
average of the input of the neighbours. Figure 4.2 shows the max pooling operation applied to the
same input as in figure 4.2 (shown in blue) and its output (shown in green) for the computation in
nine steps. Similar to the convolutional layer, the pooling layer can also have different kernel sizes,
strides and zero padding.
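A minimal NumPy sketch of max pooling is given below. The 4 x 4 input and the function name `max_pool2d` are assumptions for illustration; a 2 x 2 window with stride 2 is used, as in the U-Net contracting path.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Return the maximum of each size x size neighbourhood (no padding)."""
    x = np.asarray(x, dtype=float)
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            y[i, j] = x[r:r + size, c:c + size].max()
    return y

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [3, 4, 6, 5]])   # hypothetical input feature map
pooled = max_pool2d(x)
print(pooled)
```

Replacing `.max()` with `.mean()` in the inner loop would give average pooling, the other summary statistic mentioned above.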
4.2.4 Concatenation layer
A concatenation layer takes inputs and concatenates them along a specified dimension. The inputs
must have the same size in all dimensions except the concatenation dimension.
In the U-Net model the concatenation layer plays a very important role in merging the features learned
in different blocks of the model.
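The behaviour of a concatenation layer can be illustrated with NumPy. The feature map shapes (64 x 64 with 128 channels, channels last) and the helper name `merge_features` are assumptions chosen for the example.

```python
import numpy as np

def merge_features(skip, upsampled, axis=-1):
    """Concatenate two feature maps along the given axis (channels by default)."""
    return np.concatenate([skip, upsampled], axis=axis)

# Hypothetical feature maps in (height, width, channels) layout.
encoder_features = np.zeros((64, 64, 128))
decoder_features = np.ones((64, 64, 128))
merged = merge_features(encoder_features, decoder_features)
print(merged.shape)

# Inputs whose height or width differ cannot be merged along the channel axis.
try:
    merge_features(np.zeros((64, 64, 128)), np.zeros((32, 32, 128)))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

The channel counts add up while the spatial size is unchanged, which is why a mismatch in height or width has to be resolved (for example by cropping or padding) before concatenation.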
4.3 Activation functions:
Activation functions are functions which decide the values at which a neuron gets activated and
passes its activation forward.
1. Sigmoid:
A sigmoid function produces a curve with an “S” shape. The sigmoid non-linearity is defined as
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
The output of the sigmoid falls in the interval (0, 1). The gradient of the sigmoid at either tail is
almost zero. A neural network using sigmoid as the activation function therefore suffers from the
vanishing gradient problem when we use gradient-based training. In the vanishing gradient problem,
the gradient value in the front layers of the network decreases greatly, which makes those layers
train very slowly.
2. Hyperbolic Tangent:
Tanh squashes a real-valued number to the range (-1, 1). It is non-linear but, unlike the sigmoid,
its output is zero-centered. Therefore, in practice the tanh non-linearity is usually preferred to the
sigmoid non-linearity.
3. ReLU:
$$f(z) = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases}$$
The ReLU has become very popular in the last few years because its linear, non-saturating form was
found to greatly shorten training. The ReLU also introduces sparsity, since it outputs exactly zero
for all negative inputs, and it trains faster than the sigmoid and tanh. However, ReLU units can
“die” during training: a dead unit outputs zero for any input and stops learning.
Figure 4.7: ReLU Function
4. Leaky ReLU:
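Leaky ReLU addresses the dying ReLU problem. Because the ReLU and its gradient are zero when z < 0,
a unit stuck in that region cannot recover; a Leaky ReLU instead has a small positive gradient α
for negative inputs.
The Leaky ReLU has the following mathematical form:
$$f(z) = \begin{cases} z, & z > 0 \\ \alpha z, & z \le 0 \end{cases}$$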
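The four activation functions of this section can be written in a few lines of NumPy. This is a minimal sketch; the function names and the default slope alpha = 0.01 for the Leaky ReLU are assumptions of the example, not values taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid; close to zero at both tails
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # alpha is the small positive slope for negative inputs (assumed value)
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))       # values strictly between 0 and 1
print(np.tanh(z))       # values strictly between -1 and 1, zero-centered
print(relu(z))          # negative inputs are set to zero
print(leaky_relu(z))    # negative inputs are scaled by alpha
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # tiny at the tails, 0.25 at zero
```

The last line illustrates the vanishing gradient behaviour of the sigmoid: the derivative peaks at 0.25 and shrinks rapidly away from zero.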
4.4 Loss function
The loss function is a mathematical function which compares the target output with the predicted
output to compute the error. This error is minimized during model training. There are several loss
functions; here we discuss binary crossentropy.
− 𝑦𝑖 |2
Here $x_i$ represents the loss of a single prediction, $n$ is the total number of examples the model
has gone through, and $y$ holds the classes of the predictions.
The binary crossentropy loss is used for yes/no decisions, e.g., multi-label classification. The
loss tells you how wrong your model’s predictions are. For instance, in multi-label problems, where
an example can belong to multiple classes at the same time, the model decides for each class
whether the example belongs to that class or not.
$$\mathrm{Loss}(x, y) = -\sum_{i=1}^{n} y_i \log \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$
where $x$ is the vector of $n$ predictions of the neural network, and $y$ is a binary vector that is 0
everywhere except for 1s marking the real classes of the input images.
Using this loss function we predict the probability of a pixel being part of the polyp.
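For a per-pixel yes/no decision, binary crossentropy is commonly written in its two-class form, $-\frac{1}{n}\sum_i [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$, where $p_i$ is the predicted probability. The sketch below implements that common form in NumPy; the function name, the clipping constant eps and the example values are assumptions of this illustration and are not taken from the thesis code.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean two-class cross-entropy; eps (assumed value) avoids log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    per_pixel = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(per_pixel.mean())

# made-up ground-truth labels and predicted probabilities for four pixels
y_true = np.array([1, 0, 1, 0])
good_prediction = np.array([0.9, 0.1, 0.8, 0.3])
bad_prediction = np.array([0.2, 0.7, 0.4, 0.9])

loss_good = binary_crossentropy(y_true, good_prediction)
loss_bad = binary_crossentropy(y_true, bad_prediction)
print(loss_good, loss_bad)  # the better prediction gives the lower loss
```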
4.5 Optimizer
The optimizer is an optimisation algorithm which uses the gradient of the loss function with respect
to the parameters to move them towards their optimal values and so reduce the loss.
It is the main algorithm behind the whole learning process.
For the optimisation process the network uses back propagation to propagate the gradient to each
level of the network; with the help of this gradient the values of the parameters are then updated
in steps scaled by the learning rate.
The learning rate plays a very important role in the optimisation process: if the learning rate is
too high, gradient descent may not converge, and if it is too low, gradient descent may take a long
time to converge. Hence the learning rate should be chosen very carefully.
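The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2; the objective, the starting point and the three learning rates are assumptions chosen for illustration and have nothing to do with the U-Net training itself.

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad  # parameter update scaled by the learning rate
    return w

w_good = gradient_descent(lr=0.1)     # converges close to the minimum at 0
w_slow = gradient_descent(lr=0.001)   # too low: barely moves in 50 steps
w_diverged = gradient_descent(lr=1.1) # too high: each step overshoots further
print(w_good, w_slow, w_diverged)
```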
4.5.1 RMSProp
RMSProp is an optimization method closely related to Adam. It is sometimes used with momentum. There
are some differences between RMSProp with momentum and Adam: RMSProp with momentum generates its
parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly
estimated using a running average of the first and second moments of the gradient.
RMSProp also lacks a bias-correction term. This matters most when the decay rate of the second-moment
average is close to 1, since in that case not correcting the bias leads to very large step sizes and
often divergence, which can be demonstrated empirically. RMSProp often converges faster than plain
stochastic gradient descent.
4.5.2 Adam
The method computes individual adaptive learning rates for different parameters from estimates of
the first and second moments of the gradients; the name Adam is derived from adaptive moment
estimation. The method is designed to combine the advantages of two popular methods:
AdaGrad, which works well with sparse gradients, and RMSProp.
Adam is generally regarded as robust and well-suited to a wide range of non-convex optimization
problems in the field of machine learning.
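The Adam update described above can be sketched in NumPy as follows. The hyperparameter values (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly quoted defaults; the learning rate, the number of steps and the quadratic test function are assumptions made for this illustration only.

```python
import numpy as np

def adam_minimise(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a function with the Adam update rule, given its gradient."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)  # running average of the gradient (first moment)
    v = np.zeros_like(w)  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction, absent in RMSProp
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w

def toy_gradient(w):
    # gradient of the toy function (w[0] - 3)^2 + 10 * (w[1] + 1)^2
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

w_final = adam_minimise(toy_gradient, [0.0, 0.0])
print(w_final)  # close to the minimum at (3, -1)
```

Dividing by the square root of v_hat gives each parameter its own effective learning rate, which is why the two coordinates converge at a similar pace despite their different curvature.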
4.6 References
Chapter 5
This dataset (Kvasir-SEG) is derived from the Kvasir dataset, which is a collection of 8000 images
classified into 8 classes. The dataset aims to address the problem of colon cancer in the GI tract by
providing images with polyps and their corresponding masks.
To create masks for the polyp-containing images, the images were uploaded to the Labelbox
application. A team consisting of a medical doctor and an engineer was assigned to outline the polyps
and create masks using Labelbox, which were then verified by an experienced gastroenterologist.
Some images also contain the image of the endoscope position marking probe from the ScopeGuide.
The dataset contains two folders, one for images and one for masks, each containing 1000 images.
An image and its corresponding mask have the same name. Image files are encoded using JPEG
compression.
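Because an image and its mask share a file name, the two folders can be paired by name. The sketch below assumes sub-folders called "images" and "masks" (the Appendix code reads an "/images" folder; the "masks" name and the helper list_pairs are assumptions of this example) and demonstrates the idea on a temporary directory with empty placeholder files rather than on the real dataset.

```python
import os
import tempfile

def list_pairs(root):
    """Return (image_path, mask_path) pairs that share a file name (illustrative helper)."""
    image_dir = os.path.join(root, "images")
    mask_dir = os.path.join(root, "masks")  # assumed folder name
    pairs = []
    for name in sorted(os.listdir(image_dir)):
        mask_path = os.path.join(mask_dir, name)
        if os.path.exists(mask_path):  # skip images without a mask
            pairs.append((os.path.join(image_dir, name), mask_path))
    return pairs

# demo on a throw-away directory with empty placeholder files
with tempfile.TemporaryDirectory() as root:
    for sub in ("images", "masks"):
        os.makedirs(os.path.join(root, sub))
        for name in ("a.jpg", "b.jpg"):
            open(os.path.join(root, sub, name), "w").close()
    pairs = list_pairs(root)

print(len(pairs))  # 2
```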
5.2 Metrics used
Different metrics exist for evaluating and comparing the performance of the architectures. For
medical image segmentation tasks, perhaps the most commonly used metrics are the Dice coefficient and
IoU. In this medical image segmentation approach, each pixel of the image belongs either to a polyp
or to a non-polyp region. We calculate the Dice coefficient and mean IoU based on this principle.
Accuracy:
Accuracy of the U-Net is the fraction of pixels that are correctly classified as polyp or non-polyp.
It is calculated in percentage.
Loss:
Loss of the U-Net measures how far the predicted mask is from the ground truth; the model in the
Appendix is compiled with binary crossentropy (Section 4.4) as its loss.
Dice coefficient: The Dice coefficient is a standard metric for comparing the pixel-wise results
between predicted segmentation and ground truth. It is defined as:
$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FP + FN}$$
where A signifies the predicted set of pixels and B is the ground truth of the object to be found in the
image. Here, TP represents true positive, FP represents false positive, and FN represents the false
negative.
Intersection over Union: The Intersection over Union (IoU) is another standard metric to evaluate a
segmentation method. The IoU calculates the similarity between the predicted mask (A) and its
corresponding ground truth (B) as shown in the equation below:
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP(t)}{TP(t) + FP(t) + FN(t)}$$
In the equation, t is the threshold. At each threshold value t, a value is calculated from the
above equation and parameters by comparing the predicted object with all the real objects.
There are other parameters such as recall, specificity, precision, and accuracy, which are mostly used
for frame-wise image classification tasks.
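Both metrics follow directly from the TP, FP and FN pixel counts. The NumPy sketch below is a minimal illustration; the function name, the tiny example masks and the convention of returning 1.0 when both masks are empty are assumptions of this example.

```python
import numpy as np

def dice_and_iou(pred, truth):
    """Dice coefficient and IoU of two binary masks, from TP/FP/FN counts."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.logical_and(pred, truth).sum())   # polyp predicted, polyp present
    fp = int(np.logical_and(pred, ~truth).sum())  # polyp predicted, none present
    fn = int(np.logical_and(~pred, truth).sum())  # polyp missed
    if tp + fp + fn == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    dice = 2.0 * tp / (2.0 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou

# made-up 2x3 masks: TP = 2, FP = 1, FN = 1
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
truth = np.array([[1, 0, 0],
                  [0, 1, 1]])
dice, iou = dice_and_iou(pred, truth)
print(dice, iou)  # Dice = 4/6, IoU = 2/4
```

For any pair of masks the Dice coefficient is at least as large as the IoU, so the two numbers should not be compared with each other directly.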
5.3 Model training
Steps of Algorithm:
Figure 5.3: Resultant mask from trained model
2. Downsampling Path:
1. It consists of two 3x3 convolutions (unpadded convolutions), each followed by a rectified
linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling.
2. At each downsampling step we double the number of feature channels.
Upsampling Path:
1. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2
convolution (“up-convolution”), a concatenation with the corresponding feature map
from the downsampling path, and two 3x3 convolutions, each followed by a ReLU.
Skip Connection:
The skip connection from the downsampling path is concatenated with the feature map of the
upsampling path. These skip connections supply local, high-resolution information to the global
information carried by the upsampled feature maps.
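The way feature-map sizes and channel counts evolve along the two paths can be traced without building the network. The sketch below assumes a plain U-Net with four downsampling steps, "same" padding (as in the Appendix code, so convolutions keep the spatial size) and the filter list [8, 16, 32, 64, 128] from the Appendix; it does not model the additional connections of the modified U-Net, and the function name is hypothetical.

```python
def unet_shapes(image_size=128, filters=(8, 16, 32, 64, 128)):
    """Trace (stage, spatial size, channels) through a plain U-Net layout."""
    shapes = [("input", image_size, 3)]
    size = image_size
    skips = []
    for f in filters[:-1]:
        shapes.append(("down conv", size, f))
        skips.append((size, f))      # feature map kept for the skip connection
        size //= 2                   # 2x2 max pooling with stride 2
        shapes.append(("max pool", size, f))
    channels = filters[-1]
    shapes.append(("bottleneck", size, channels))
    for skip_size, f in reversed(skips):
        size *= 2                    # upsampling restores the skip's spatial size
        assert size == skip_size
        shapes.append(("upsample + concat", size, channels + f))
        channels = f                 # the following convolutions reduce the channels
        shapes.append(("up conv", size, channels))
    shapes.append(("1x1 conv output", size, 1))
    return shapes

shapes = unet_shapes()
for stage, size, channels in shapes:
    print(f"{stage:18s} {size:3d} x {size:3d} x {channels}")
```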
Final Layer:
At the final layer a 1x1 convolution is used to map each feature vector to the desired number of
classes.
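A 1x1 convolution is the same linear map applied independently to the feature vector of every pixel, so it can be written as a matrix product. The NumPy sketch below uses random features and weights purely for illustration; the 8 input channels, the single output class and the 0.5 decision threshold are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(128, 128, 8))  # hypothetical final feature map (H, W, C)
weights = rng.normal(size=(8, 1))          # 1x1 kernel: C input channels -> 1 class
bias = np.zeros(1)

logits = features @ weights + bias         # same linear map at every pixel
probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid gives a per-pixel probability
mask = probs > 0.5                         # assumed threshold for the polyp decision

print(logits.shape, mask.shape)  # (128, 128, 1) (128, 128, 1)
```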
Results:
The model is trained for 100 epochs using the Adam optimizer on Google Colaboratory. We have used
the ReLU activation function for all layers except the output layer. In the output layer we have used
the sigmoid activation function to decide whether each pixel is part of the polyp.
The following images show the output of the model after training for 100 epochs. Some of them give
very good results, as shown below.
In the image below, the image contains a polyp but the model does not predict its presence.
This shows the tendency of the model to fail on polyps that are very similar to the background.
The training accuracy increases with the number of epochs, but the testing accuracy stops improving
after epoch 67, which is therefore the best place to stop training the model.
The maximum training accuracy achieved in the training process is 95.95 % and the maximum testing
accuracy is 91.71 %.
The model performs very well on polyps whose texture and shape differ from the background.
The image below shows the accuracy vs epoch plot for the training of the model.
Figure 5.5 : Plot of accuracy vs epoch
As the accuracy increases, the loss of the model decreases. The lowest training loss
achieved is 0.013, and the lowest testing loss achieved is 0.2863.
Similarly to the accuracy, the testing loss stops improving after 67 epochs and oscillates around a
value. The binary cross entropy of the model behaves in the same way as the loss and the accuracy;
hence we can conclude that our model achieves its best results at 67 epochs of training and does not
improve much after that.
Figure 5.7: Plot of binary crossentropy vs epoch
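The observation that the testing metrics stop improving at a certain epoch is the idea behind early stopping. The sketch below picks the stopping epoch from a validation-accuracy history; the function name, the patience value and the history values are made up for illustration and are not the measurements reported in this chapter.

```python
def best_epoch(val_metric, patience=10):
    """Return (epoch, value) of the best validation metric, stopping the scan
    once no improvement has been seen for `patience` epochs."""
    best_value = float("-inf")
    best_index = 0
    for epoch, value in enumerate(val_metric, start=1):
        if value > best_value:
            best_value, best_index = value, epoch
        elif epoch - best_index >= patience:
            break  # no improvement for `patience` epochs: stop training here
    return best_index, best_value

# invented validation accuracies, one per epoch
history = [0.80, 0.85, 0.88, 0.90, 0.89, 0.90, 0.89, 0.88]
stop = best_epoch(history, patience=3)
print(stop)  # (4, 0.9)
```

Keras provides a callback, keras.callbacks.EarlyStopping, that applies the same idea during training; whether it was used for the experiments here is not stated.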
The following image shows the training of the model from epoch 81 to epoch 100. We can see that the
training accuracy keeps increasing with time but the testing accuracy does not. Similar behaviour is
shown by the loss and the binary crossentropy metrics of the model.
Figure 5.8: Showing training process of the model
The above figure shows the training of the modified U-Net model for 100 epochs.
5.4 Comparison with U-Net model
As observed from the table, the modified U-Net learns faster and better than the original U-Net. In
the table the modified U-Net starts with higher accuracy from the first epoch, while the U-Net lags by
at least 10 percent. Finally, the modified U-Net achieves 95 percent accuracy on the training set and
90 percent accuracy on the testing set.
It can be observed that the proposed modified U-Net performs better than the U-Net model.
5.5 References
1. Jha D. et al. (2020) Kvasir-SEG: A Segmented Polyp Dataset. In: Ro Y. et al. (eds) MultiMedia
Modeling. MMM 2020. Lecture Notes in Computer Science, vol 11962. Springer, Cham
2. Sharma, M., Rasmuson, D., Rieger, B., Kjelkerud, D., et al.: Labelbox: The best way to create and
manage training data. Software, LabelBox, Inc, https://www.labelbox.com/ (2019), accessed: 2019-05-21
3. K V, Lalitha & .R, Amrutha & Michahial, Stafford & M, Dr. (2016). Implementation of
Watershed Segmentation. IJARCCE. 5. 196-199. 10.17148/IJARCCE.2016.51243.
6.1 Conclusion:
From the above comparison table, we can conclude that the modified U-Net performs better than the
traditional U-Net model for the segmentation of biomedical images. Because its modifications share
more feature maps between layers, the modified U-Net learns faster and better than the
original U-Net model.
Appendix
        image = cv2.resize(image, (self.image_size, self.image_size))  # resize the input image to 128 x 128
        mask = cv2.imread(mask_path, 0)  # read the mask as a single-channel grayscale image
        mask = cv2.resize(mask, (self.image_size, self.image_size))  # resize the mask to match the 128 x 128 image
        mask = np.expand_dims(mask, axis=-1)  # (H, W) -> (H, W, 1)
        # image normalisation to the range [0, 1]
        image = image / 255.0
        mask = mask / 255.0
        return image, mask

    def on_epoch_end(self):
        print("epoch completed")

    def __len__(self):
        # number of batches in one epoch of the DataGen object
        return int(np.ceil(len(self.ids) / float(self.batch_size)))
# hyperparameters
image_size = 128
train_path = "/content/gdrive/My Drive/Colab Notebooks/Segmentation_data"  # address of the dataset
epochs = 15  # number of passes over the training dataset
batch_size = 32  # training batch size

# train path
train_ids = os.listdir(train_path + "/images")

# validation data size
val_data_size = 32  # number of images used for validation
valid_ids = train_ids[:val_data_size]  # first val_data_size image ids, used for validation
train_ids = train_ids[val_data_size:]  # remaining image ids, used for training
# print(valid_ids, "\n\n")
print("training_size: ", len(train_ids), "validation_size: ", len(valid_ids))

# making generator object
gen = DataGen(train_ids, train_path, batch_size, image_size)
print("total batches: ", len(gen))
# print(valid_ids)
# unet model
def UNet():
    f = [8, 16, 32, 64, 128]
    inputs = keras.layers.Input((image_size, image_size, 3))
    p0 = inputs

    c2 = keras.layers.Conv2D(f[1], (3, 3), padding="same", strides=1, activation="relu")(c2)
    p2 = keras.layers.MaxPool2D((2, 2), (2, 2))(c2)

    bn = bottleneck(p4, f[4])

    uc3 = keras.layers.Conv2D(f[1], (3, 3), padding="same", strides=1, activation="relu")(uconcat3)
    uc3 = keras.layers.Conv2D(f[1], (3, 3), padding="same", strides=1, activation="relu")(uc3)

model = UNet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc", keras.metrics.BinaryCrossentropy()])
model.summary()
# Size classification
def size_classification(pi_mask):
    # analysis of the polyp based on color and size for classification
    '''
    Size: < 1 cm -> 1 %
          1 - 2 cm -> 10 %
          > 2 cm -> 50 %
    '''
    # Based on size:
    polyp_sizes = [1600, 2500]
    polyp_size = np.count_nonzero(pi_mask)
    if polyp_size == 0:
        print('No polyp detected you are safe !')
    elif polyp_size < polyp_sizes[0]:
        print('Polyp size: ', polyp_size, ' pixels. It is a low risk polyp of type 1. It has 1% chance of becoming cancerous.')
    elif polyp_size < polyp_sizes[1]:
        print('Polyp size: ', polyp_size, ' pixels. It is a medium risk polyp of type 2. It has 10% chance of becoming cancerous.')
    else:
        print('Polyp size: ', polyp_size, ' pixels. It is a high risk polyp of type 3. It has 50% chance of becoming cancerous.')
    inverted_masked_image = (((1 - predicted_mask[i]) * 255) * pr_image).astype(int)

    my_plot = fig1.add_subplot(1, 5, 3)
    pred_mask = np.reshape(predicted_mask[i] * 255, (image_size, image_size))  # (128, 128, 1) -> (128, 128)
    my_plot.imshow(pred_mask, cmap="gray")
    plt.title("Predicted mask")

    my_plot = fig1.add_subplot(1, 5, 4)
    my_plot.imshow(masked_image)
    plt.title("masked_image")

    my_plot = fig1.add_subplot(1, 5, 5)
    my_plot.imshow(inverted_masked_image)
    plt.title("inv_masked_image")

    size_classification(pred_mask)

    # IOU calculation
    # variables used will be p_mask and pred_mask
    intersection = np.logical_and(p_mask, pred_mask)
    union = np.logical_or(p_mask, pred_mask)
    iou_score = np.sum(intersection) / np.sum(union)
    print("IOU of prediction = ", iou_score)