
Hamburg University of Applied Sciences

Faculty of Life Sciences


Department of Biomedical Engineering

Bachelor Thesis

Single cell classification:


An approach to understand the diversity of protein localization

Version 0.2, March 2021

submitted: June 1, 2021

by: Philip James Sullivan


born on November 12, 1994
in Heidelberg, Germany
Contents

1 Introduction

2 Dataset
  2.1 Biological background
  2.2 Data Analysis
  2.3 Ground truth formation
    2.3.1 Cell segmentation
    2.3.2 Composite image creation
    2.3.3 Cropping and Resizing
    2.3.4 Label creation

3 Image Classification Network
  3.1 Deep Learning
  3.2 Network Architecture
    3.2.1 Neurons
    3.2.2 Layers
      3.2.2.1 Convolutional Layer
      3.2.2.2 Pooling Layer
      3.2.2.3 Fully Connected Layer
    3.2.3 Activation Functions
      3.2.3.1 Sigmoid
      3.2.3.2 SiLU
    3.2.4 Overfitting Prevention
      3.2.4.1 Stochastic Depth
      3.2.4.2 MBConv Block
  3.3 Training
    3.3.1 Loss function
    3.3.2 Backpropagation
    3.3.3 Optimizer
    3.3.4 Results

4 Evaluation
  4.1 Metrics
    4.1.1 F1 Score
    4.1.2 Confusion Matrix
  4.2 K-Fold Cross Validation

5 Conclusion

Bibliography
1 Introduction

When describing biological cells, it is usually assumed that the function of a cell is
dependent on its cell type and that specific cell types operate in specific ways. While
this is not incorrect, it is still a simplification. When viewing a cell population as a
whole, the population might seem uniform and the cells indistinguishable from one
another; however, when focusing on individual cells, slight differences in protein structure
and function become apparent. Much of the current biological knowledge is based on
ensemble measurements that ignore these cell-to-cell differences. These differences, referred
to as cellular heterogeneity, can be observed in all cell populations and are mainly due to
differing cell-specific protein distributions that arise from several factors during protein
expression, as well as from environmental factors after the proteins have been expressed.
In single-cell research, the locations of proteins in cells are analyzed to further the
knowledge about cell heterogeneity and to understand the impacts of these individual
differences on the overall cell.
To understand correlations between protein localization and cell structures, it is necessary
to have a large dataset of labelled sample images that can be analyzed to subsequently
infer meaning from new unlabelled images. This procedure is achieved by neural networks
that improve their understanding of input images by training on a pre-labeled dataset,
known as a ground truth dataset. The resulting trained network can then be fed a
new image of the same kind and outputs the labels that it deems most probable.
This thesis attempts to analyze the problem of cell heterogeneity by training a neural
network on a dataset of labelled images. The dataset is provided by the Human Protein
Atlas (HPA) initiative and includes over 20000 images of cell populations from human cells,
tissues and organs. In each image, proteins-of-interest are marked and labelled based on
the structures that they are localized to. The task at hand is to detect and crop individual
cells from the population and then perform cell-specific classification of the structures that
the proteins-of-interest are localized to.
Many different multi-cell classification models based on the HPA dataset have been
created and published in the past. Their outputs are general classifications for the locations
of proteins labelled in multi-cell images. These models are not calibrated to single-cell
images and only give imprecise image-level outputs. The main improvement of my work
will be the creation of a single-cell classifier that infers precise labels for individual cells.

2 Dataset

<placeholder text>

2.1 Biological background

<placeholder text>

2.2 Data Analysis

The HPA dataset consists of weakly labelled images of biological cell populations. Each
image is made up of 4 filters, each of which represents a different structure. The labels
describe the cell regions that the proteins of interest are localized in for each given cell
image. They are termed weak because they only describe the protein localizations on
an imprecise image level and not on a cell level. Therefore, not every cell in the image
necessarily expresses the protein localization specified in the label.
In total, the dataset consists of 21806 images of cell populations, encompassing 17 different
cell types. The 4 filters that describe each cell population are an antibody-stained image
of the protein of interest, as well as 3 reference filters to outline the cell: microtubules,
nucleus and endoplasmic reticulum. An example of the four filters of an image is shown in
Figure 2.1.

Figure 2.1: Four filters describing the structures of one cell population in the HPA dataset

The label distribution in the HPA dataset is visualized in Figure 2.2. A clear
imbalance in the label occurrences can be observed.
A publicly available pre-trained instance segmentation program [1] is used to segment the
dataset images into single cell masks from which cell-specific bounding boxes are extracted.
Extracting all cells from the dataset yields a list of 489871 bounding boxes.


Figure 2.2: Class labels of the HPA dataset ranked by number of occurrences

2.3 Ground truth formation

The aim of this project, as stated in the introduction, is to build a classification algorithm
that can deduce the protein localizations of the highlighted proteins in the HPA dataset.
To accomplish this, an Image Classification Network will be built, as further detailed in
chapter 3. In order to train the network using the HPA dataset,
a proper ground truth must be created from it. As a supervised learning approach will
be used to train the network, the dataset must be structured in a way that each input is
given with its desired output values. This means that each input image is given with a set
of localization classes describing the protein localizations.

2.3.1 Cell segmentation

The first step of ground truth creation is the segmentation of the composite cell images
into individual cells. Because the task of the approach is to perform classification on
an individual cell level, the individual cells must first be extracted from the overall cell
population image. To achieve this, a pre-trained segmentation network is used that was
published specifically to produce a cell segmentation from the HPA dataset images. This
network operates by accepting the four filters of each cell population image and returning
a segmentation mask overlay for each image, assigning each pixel a value denoting
the presence or absence of a cell. The assigned values separate
between cell instances, meaning that the pixels of the first cell are assigned the value 1,
the pixels of the second cell the value 2, and so forth.

The exact function of this network is beyond the scope of this work and its outputs are
accepted as is.
After one segmentation mask has been created for each cell population, a simple script
is executed to convert the segmentation of each cell into a set of bounding box coordinates.
One set of bounding box coordinates consists of four values: xmin, xmax, ymin, ymax.
These four values describe a tight box around the cell, allowing efficient storage of all
cell locations.
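The conversion itself is straightforward; a minimal sketch, assuming the segmentation
mask is given as a 2D NumPy array in which 0 marks background and the values 1 to N
mark the cell instances, could look as follows:

import numpy as np

def extract_bounding_boxes(mask: np.ndarray) -> dict:
    # Map each cell instance id in a labelled mask to (xmin, xmax, ymin, ymax)
    boxes = {}
    for cell_id in np.unique(mask):
        if cell_id == 0:  # 0 denotes background, not a cell
            continue
        ys, xs = np.nonzero(mask == cell_id)
        boxes[int(cell_id)] = (int(xs.min()), int(xs.max()),
                               int(ys.min()), int(ys.max()))
    return boxes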

2.3.2 Composite image creation

The second step of the ground truth formation is to create the input images. The HPA
dataset consists of 4 filters per cell population, as described in section 2.2. Image
Classification Networks are designed to be used with single images as input, which means
that the four filters of each image must be combined into one RGB image per cell
population. In the HPA dataset, each of the four filters is given as a separate .png image
file. The file names are structured with an id and the filter color, easily allowing all four
filters to be associated with one id. A simple script was written to iterate through all 81000
image files and to combine the filter image files into one composite image per id. The
challenge of combining four filter images into a three-channel RGB composite image was
solved by merging the yellow filter into the red and green channels.
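The original script is not reproduced here; a minimal sketch of this merging, assuming
hypothetical file names of the form <id>_<color>.png, could look as follows:

import numpy as np
from PIL import Image

def make_composite(image_id: str) -> Image.Image:
    # Load the four filters as greyscale arrays
    channels = {}
    for color in ("red", "green", "blue", "yellow"):
        img = Image.open(f"{image_id}_{color}.png").convert("L")
        channels[color] = np.asarray(img, dtype=np.float32)
    # Average the yellow filter into the red and green channels;
    # the blue filter is used unchanged (see the explanation below)
    r = (channels["red"] + channels["yellow"]) / 2
    g = (channels["green"] + channels["yellow"]) / 2
    b = channels["blue"]
    return Image.fromarray(np.stack([r, g, b], axis=-1).astype(np.uint8))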
The blue filter was used directly as the blue channel of the RGB image. The red and green
channels were created as the mean value between the respective filter and the yellow filter,
because yellow is an equal mixture of red and green when operating in RGB color space.

2.3.3 Cropping and Resizing

Once the bounding boxes of all cells of all cell populations have been saved and the RGB
images of the cell populations created, the next step is to crop the individual cells from
the composite images. This is done to create one image file per cell, resulting in 420000
image files, one for each cell. This is achieved by iterating through the dictionary from
section 2.3.1 that maps each cell population id to a set of bounding boxes. The image of
each cell population id is opened and each bounding box is cropped into a new image file.
To preserve all information, the new file names consist of the cell population id and a cell
id that is given by the segmentation script from section 2.3.1.

Once all individual cell images have been created, the resulting image files must be
resized to a uniform resolution, as demanded by the input size of the image
classification network. All images are therefore resized to a resolution of 224x224.
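A minimal sketch of this step, reusing the hypothetical helpers and file names from the
previous sketches:

from PIL import Image

def crop_and_resize(image_id: str, boxes: dict, size: int = 224) -> None:
    composite = Image.open(f"{image_id}.png")
    for cell_id, (xmin, xmax, ymin, ymax) in boxes.items():
        # PIL crop boxes are given as (left, upper, right, lower)
        cell = composite.crop((xmin, ymin, xmax + 1, ymax + 1))
        cell = cell.resize((size, size))
        # The file name preserves both the population id and the cell id
        cell.save(f"{image_id}_{cell_id}.png")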

2.3.4 Label creation

1. rgb
2. cell masks
3. cell bboxes
4. cropping
5. resizing
6. labels
7. data generator

3 Image Classification Network

The task at hand is to create a program that can classify, with sufficient accuracy, protein
localizations from cell images in which the proteins of interest have been highlighted. The
dataset described in chapter 2 is used as a basis for creating this model. The goal is
to create a model that is generalizable enough to be used with any dataset of this type.
For this task, an artificial neural network will be created that can extract meaning from
this input data and classify it into a specified group of pre-defined localization classes.

3.1 Deep Learning

Creating and training deep neural networks (DNNs) belongs to the area of computer
science known as deep learning, which is a subset of machine learning. Machine learning in
general is an approach of creating predictive models from datasets without solely relying
on pre-defined functions. Most machine learning still requires feature engineering, meaning
that the feature extraction and processing of the network must still be adjusted manually.
In deep learning most of the feature extraction is automated, eliminating much of the
required manual intervention. DNNs are referred to as "deep", as they possess many
consecutive layers, allowing them to learn more complex relationships between input and
output data.
DNNs are used to solve problems that are imprecise or require great amounts of
information [?]. They are implemented as computational models that can extract meaning
from imprecise input information using a data processing algorithm that is continually
improved by training it with new input data. The main characteristic of a neural network
is the presence of so-called neurons which connect the input data to the output data
and are structured in layers. The details of these neurons will be explained later in this
document. DNNs offer a list of advantages over classical data processing algorithms [?]:

• Adaptive Learning: models can learn non-linear and complex relationships between
input and output data

• Self-Organization: deducing relevant information from input data can be learned by
neural networks

• Fault Tolerance: the meaning of data can be recognized even with varying or
erroneous input data


Today, neural networks are widely used to solve a variety of complex problems, ranging
from speech recognition and letter recognition to object detection in static images and
videos. While the networks for all of these use cases share the same principles, there exist
a variety of differences in the combination of these elements to form task-specific networks.
The approach of this work to classify the protein locations in the given cell images
demands the creation of an image classification network [?]. This network will then be
trained on the dataset described in chapter 2 in order to create a program that can
reliably localize protein structures in the given microscopic cell images. The type of
training is referred to as supervised training, as the desired output values are given for all
input values.
In mathematical terms, the network training will approximate the mapping function
connecting the images with the classifications.
The steps to create a neural network to approach the given classification problem are
as follows: First, a ground truth dataset is created, as explained in chapter 2, that has
both the input and the desired output data. Then, the network architecture is created
using a structure that is fitting to the problem at hand. Subsequently, the network is
trained to improve the accuracy of the predictions. This training has two alternating
steps: forward propagation and backpropagation. Forward propagation is the process
of feeding an input image into the network to obtain the classification output from the
network. This classification output is then compared to the desired output using a loss
function. An optimization algorithm then improves the network in a process
named backpropagation in order to improve its classification accuracy so that its output
values approach the desired values.

3.2 Network Architecture

When approaching a deep learning problem, it is necessary to choose a network architecture


that is well-suited to the task. Image Classification Tasks are best approached using a
Convolutional Neural Network (CNN). This network type consists of convolutional layers
followed by pooling layers. For the given Image Classification Problem, I have chosen
EfficientNet, as it has been proven to offer high classification accuracy, while being light
on processing resources and offering fast training and inference. This section explains all
of the elements that are used in the EfficientNet architecture.

3.2.1 Neurons

The most fundamental building block of any neural network is a neuron. A neuron works
as a simple linear function with multiple inputs. The values of a neuron are its weights
and its bias. Each input value $x_i$ has one weight $w_i$ associated with it, by which it is
multiplied. The sum of all of these products is then offset by the neuron's bias value $b$.
The resulting value $z$ is the neuron's output. The formula of each neuron can be written as:

$$z = \sum_i w_i \cdot x_i + b$$

This function transforms the weight and input vectors into a scalar.
These simple elements can map linear functions between the input and output data, as
they are at their core nothing more than linear functions with multiple inputs.
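For illustration, the computation of a single neuron with three hypothetical inputs can be
written as:

import numpy as np

x = np.array([0.5, -1.0, 2.0])   # input values x_i
w = np.array([0.8, 0.2, -0.5])   # weights w_i
b = 0.1                          # bias b

z = np.dot(w, x) + b             # z = sum_i(w_i * x_i) + b
print(z)                         # -0.7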

insert some more visualizations here


3.2.2 Layers

To construct a neural network that can describe complex relationships between input
and output data, many neurons are needed. They are arranged in consecutive groups
that are referred to as layers. The input data constitutes the first layer. During forward
propagation, data passes from the input layer through so-called hidden layers to the final
output layer, which returns the predictions.

insert basic NN dense layer type thing here


Our network architecture includes several layer types that are used in different
parts of the network to achieve different functions. The basic layer types used include
convolutional layers, pooling layers, and a fully connected layer. The functions of these
layers are explained in this section.
Each layer also possesses an activation function that is applied to each of its neuron
output values. Its function and purpose will be explained in section 3.2.3.

3.2.2.1 Convolutional Layer

-stride -padding

When processing images, convolutional layers are used. These layers
conserve spatial information while propagating information through the network. Each
neuron only receives information from neurons in the previous layer that are in the same
image region, the receptive field, of that neuron. This concept is inspired by biology and
the receptive fields found in the retina. The input values of a neuron are processed as a
mathematical matrix and then multiplied by a specific matrix referred to as a filter kernel.
Each filter kernel produces a feature map from the image that filters for specific image
structures. The feature maps in the network progress from visualizing low-level image
features early in the network to visualizing high-level image features in the network’s latter

layers.
Before attempting to understand convolutional layers in a mathematical way, it is
beneficial to view their function in a visual format first. In this graph, an input image
is shown as x, a filter kernel is shown as w and the resulting feature map is shown as y.
The convolution process is achieved by sliding the filter kernel over the input image and
calculating the dot product for each location. The shown filter kernel example is designed
to detect only horizontal lines. In fig.1 the result of applying this filter to image A is
shown. In fig.2 the result on image B is shown. As image A contains horizontal lines,
the resulting feature map A contains values at the location of this line. Feature map B
contains nothing, proving that the filter kernel works as intended by only being sensitive
to horizontal and not to vertical lines.

After gaining this intuitive understanding of the function of convolutional layers, it
is important to expand on the underlying mathematics of this process. In essence, this
operation works the same as a simple convolutional operation that processes the input
image as the function x(t) and the filter kernel as the function w(t). The result of this
operation is a feature map that visualizes a specific feature. A convolutional function is
described mathematically as:
$$y(t) = (x * w)(t) = \int_{-\infty}^{+\infty} x(a)\,w(t-a)\,da$$
This general definition of a convolution must be adapted to the circumstances of the
convolutional layer.
The filter progresses across the input image in discrete steps with a specific stride length of
usually one element per step. This means that the integral of the convolutional operation
can be reduced to a sum as follows:
$$y(t) = (x * w)(t) = \sum_{a=-\infty}^{+\infty} x(a)\,w(t-a)$$

This formula must now be modified to accept two-dimensional input, as the images are
two dimensional. The filter must then also be modified to be of two dimensions. This
leads to the following function for the convolutional layer:
$$y(i, j) = (x * w)(i, j) = \sum_m \sum_n x(i-m,\, j-n)\,w(m, n)$$

On an individual neuron scale, this convolution is performed as described by the formula
in section 3.2.1. A neuron uses the outputs of spatially neighboring neurons from
the previous layer and multiplies each with a specific weight given by the filter kernel.
Each neuron uses the same filter kernel. In the case of an artificial neural network, the
filters are not manually defined, but rather a result of model training on a dataset.
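To make the discrete 2D case concrete, a minimal, unoptimized sketch of the convolution
with stride 1 and without padding follows; the horizontal-line kernel is a hypothetical
example, not a trained filter from the network:

import numpy as np

def conv2d(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Discrete 2D convolution: y(i,j) = sum_m sum_n x(i-m, j-n) * w(m,n)
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    w_flipped = w[::-1, ::-1]  # flipping the kernel turns convolution into a sliding dot product
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(x[i:i+kh, j:j+kw] * w_flipped)
    return y

# hypothetical filter kernel that responds to horizontal lines
w = np.array([[-1, -1, -1],
              [ 2,  2,  2],
              [-1, -1, -1]])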


3.2.2.2 Pooling Layer

The purpose of pooling layers is to make the network invariant to small irregularities
in the data, as well as for downsampling input data. After aggregating the necessary
feature maps in our network, a pooling layer is utilized in order to disregard unnecessary
information and to preserve important information. Different pooling layer types are used
for different purposes. Max pooling layers only preserve the maximum value of a portion
of the feature map, while average pooling layers preserve the mean value of a feature map
region. EfficientNet uses max pooling layers to extract the important information from its
feature maps. At the end of the network, a global average pooling layer is added. This last
layer produces the average value of each feature map and creates a one-dimensional vector
from it, which is then used as input to the final layer that creates the classification
predictions. SE-Blocks also make use of the global average pooling layer, as described in
section 3.2.4.2.
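For illustration, minimal sketches of the two pooling operations, assuming feature maps
given as NumPy arrays:

import numpy as np

def max_pool(fmap: np.ndarray, size: int = 2) -> np.ndarray:
    # Keep only the maximum value of each size-by-size region
    h, w = fmap.shape
    fmap = fmap[:h - h % size, :w - w % size]
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def global_average_pool(fmaps: np.ndarray) -> np.ndarray:
    # Reduce a CxHxW stack of feature maps to one value per channel
    return fmaps.mean(axis=(1, 2))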

3.2.2.3 Fully Connected Layer

A Fully Connected Layer is a layer type that uses all neuron outputs of a preceding layer
as inputs to each of its neurons. This results in a very large number of connections that
all have separate trainable parameters. In our network, a fully connected layer only
appears in the SE-Block and once in the final network layer. Its function in the SE-Block
is explained in section 3.2.4.2. The FC Layer found in the last network layer has the
purpose of classifying the input image based on a given set of pre-defined classes. The
fully connected layer of this prediction output is constructed with the same number of
neurons as there are classes. The input to this layer is the output of the global average
pooling layer, which returns one value per feature map. Every output value from the
pooling operation is fed into each of the prediction output neurons. The prediction layer
then creates a confidence score for each of the classes, corresponding to the probability of
the class appearing in the input image.

3.2.3 Activation Functions

When using neurons as described in section neurons, a problem arises: because the neurons
work with linear functions, only linear correlations between input and output can be
mapped. This problem is solved by applying an activation function to the outputs of each
layer. An activation function is a non-linear function that effectively allows the layer that
is made up of neurons with linear functions to return an output value in a non-linear way.


3.2.3.1 Sigmoid

The sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

In our network this function is used in three locations: in the SE-Block, in the SiLU
activation function, and as a standalone function in the final layer of the network in order
to produce class predictions. The use as part of the SE-Block and in the SiLU function
will be explained in the respective sections. In the prediction output, the sigmoid function
serves the purpose of creating confidence scores for each class. This means that the network
creates a score for each class representing the network's confidence that the given class is
present in the input image. Because probabilities must be given in the interval [0; 1], the
sigmoid function is used. It receives the unbounded values from the global average
pooling operation described in section 3.2.2.2 and converts them into the necessary
probability interval.

3.2.3.2 SiLU

As described in section Neurons, neurons work with linear functions and can only map
linear relationships between input and output information. This is only useful in the
most basic of input-output-correlations. An approach to map non-linear functions must
therefore be implemented. This is achieved by adding activation functions after each layer.
Besides the special case of prediction layers described in section 3.2.3.1, most networks
use the max function as their most common activation:

$$f(x) = \max(0, x)$$

The part of the network that employs this function is referred to as the Rectified Linear
Unit (ReLU). Even though our network does not use this function and instead uses the
Sigmoid Linear Unit (SiLU), understanding ReLU’s function is important to understand
SiLU.
The way that the max function allows the mapping of non-linear correlations is evident
from its graph: all negative input values are reduced to zero, while all positive input values

are left unchanged. This effectively means that neurons are deactivated selectively based
on the input value. This is similar to a piece-wise definition of a function, as only the
remaining enabled neurons and their weights and biases influence the network output.

EfficientNet uses SiLU instead of ReLU. It is defined as $f(x) = x \cdot \sigma(x)$, where
$\sigma(x)$ is the sigmoid function described in section 3.2.3.1.
The important difference between SiLU and ReLU is that SiLU does not reduce negative
values to zero and is continuously differentiable. This trait has some very advantageous
properties for backpropagation that will be explained in section 3.3.3.
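For comparison, a minimal sketch of both activation functions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def silu(x):
    # SiLU keeps a small, smooth contribution from negative inputs
    return x * sigmoid(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
print(silu(x))  # approx. [-0.24 -0.19  0.    0.31  1.76]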

3.2.4 Overfitting Prevention

A good DNN is one that can generalize well. This means that even though the network
is trained on a specific dataset, it also performs well on unseen data that is not part
of the given dataset. The difference between a network’s performance on the training
dataset and an unseen testing dataset is known as the generalization error. It must be
reduced to create a versatile model. Large generalization errors occur due to overfitting.
When a network overfits to a dataset, it is analogous to a person memorizing data instead
of understanding its meaning. In a neural network, this problem manifests itself when
one type of input data is always processed by specific neuronal connections. This gives
a reproducible and correct output for this data but not for similar unseen data of the
same class. To reduce overfitting, a network must be trained in a way that it is not too
dependent on specific connections. Our network architecture uses a concept known as
stochastic depth to reduce overfitting.

3.2.4.1 Stochastic Depth

One very common approach is to randomly disable some of the neurons in the network.
This leads to input data being processed by different neurons in every iteration. This
approach is referred to as Dropout. Stochastic depth is a variation of this, the difference
being that instead of disabling single neurons, entire layers chosen at random are skipped
in each iteration.
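A minimal sketch of the idea, assuming a residual layer of the form x + f(x); the survival
probability value is illustrative, and the rescaling used at test time is omitted:

import random

def residual_layer(x, f, survival_prob=0.8, training=True):
    # During training, the residual branch f is skipped at random
    if training and random.random() > survival_prob:
        return x           # layer skipped: only the identity connection remains
    return x + f(x)        # layer active: normal residual computation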

3.2.4.2 MBConv Block

In the EfficientNet architecture, an MBConv Block is used. This Block can be deconstructed
into two important parts: an Inverted Residual Block and a Squeeze-Excitation Block
(SE-Block). These two elements serve several purposes that will be explained in this
section.

Vanishing gradient
The main issue that is solved by the inverted residual block is the problem of vanishing
gradient. Vanishing gradient is an issue that occurs during backpropagation that leads to
a performance plateau at a point that does not reflect the ideal parameters. This is due
to the nature of backpropagation and the chain rule. In summary, the backpropagation
aims to improve the trainable parameters of the network by following the gradient that
points into the direction in which the network’s parameters are optimal. This gradient is
calculated using the derivatives of all of the neuronal connections that are passed during
forward propagation. These derivatives are computed using the chain rule, which means
that if only one of its elements has a small value close to zero, the entire result is reduced
to a value close to zero. The exact mathematics of backpropagation are explained in the
section about optimizers.
A Residual Block solves this problem by adding skip connections to the network. Skip
connections directly connect non-adjacent layers, thereby effectively skipping some layers.
This reduces the possibility of single neurons negatively influencing the overall gradient,
as the skip connection will always add the current gradient value back into the equation.
A simple residual block is nothing more than a connection that adds the two values from
two non-adjacent locations in the network and divides the value by two.
An improvement on the standard residual block is the bottleneck residual block. The
advantage of this element is that it not only connects two non-adjacent layers but that it
adds convolutional layers that have the effect of reducing the trainable parameters which
allows more efficient network training and the creation of deeper networks.
The inverted residual block that is used in EfficientNet is a development that improves
on the structure of the bottleneck residual block by inverting the progression of its
convolutional layers. A normal bottleneck residual block has the progression
wide->narrow->wide. Inverting this filter progression to narrow->wide->narrow means
that an inverted residual block possesses even fewer trainable parameters and is therefore
less resource-consuming during training. Networks that use the inverted residual block can
be trained on less powerful hardware and in less time.
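A simplified sketch of an inverted residual block, written in PyTorch for illustration;
batch normalization, the SE-Block, and stochastic depth are omitted, and the expansion
factor is illustrative:

import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),   # narrow -> wide (expansion)
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3,
                      padding=1, groups=hidden),          # depthwise convolution
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),   # wide -> narrow (projection)
        )

    def forward(self, x):
        return x + self.block(x)                          # skip connection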

Channel Attention
Before classification, a convolutional neural network must first perform feature extraction
on the input image. This is done as described in the section Convolutional layer using a
set of convolutional layers that produce a set of feature maps. These feature maps each
describe a different feature of the original image. Not all of these features are, however,
equally important. Some must be given a higher importance than others. This unequal
focus on some channels is named channel attention and is implemented in the MBConv
Block of EfficientNet using a Squeeze-Excitation Block (SE-Block).

The SE-Block possesses three components: squeeze, excitation, and scaling. The squeeze
operation takes a set of feature maps as input that have the dimensions CxHxW, with H
and W being height and width, respectively, and C being the number of channels. Each
feature map constitutes one channel. The squeeze block then performs a global average
pooling operation on this set of feature maps, reducing each feature map to one value and
resulting in an output with the dimensions Cx1x1.

The subsequent excitation layer consists of a Fully Connected Layer (see section 3.2.2.3)
with a ReLU activation function applied to it (see section 3.2.3). This excitation module
has the goal of learning channel attentions for each of its input values.


The last part of the SE-Block is the scaling module. This module has the task of creating
an importance value for each channel. This is achieved by an FC-Layer and a sigmoid
function (see section 3.2.3.1). As described before, a sigmoid activation function returns
a value between 0 and 1 for each of its input values. These values are then returned to the
main path of the network and multiplied with each feature map, effectively deciding the
effect that each feature map has on the network output.
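The three components can be summarized in a simplified PyTorch sketch; the reduction
ratio is illustrative:

import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # squeeze: CxHxW -> Cx1x1
        self.excite = nn.Sequential(                      # excitation: learn channel attentions
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                 # importance value in (0, 1) per channel
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.excite(self.squeeze(x).view(b, c))
        return x * weights.view(b, c, 1, 1)               # scaling: weight each feature map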

A network can also learn channel weights without the use of an SE-Block; however,
research has shown that explicitly implementing channel attention as a separate trainable
block increases a network's ability to assign importance to each of its feature maps and
therefore increases performance. This is because the SE-Block, by its use of the global
average pooling operation, ensures that each feature map is seen in its entirety instead of
only in parts, as a convolutional layer does.

3.3 Training

The network architecture described in section 3.2 is not able to perform a classification of
the input data without first being trained on a dataset. This process and the underlying
mechanisms are explained in this section. As described in section 3.1, training a network
is the process of finding the optimal values for all trainable parameters of a network in
order to create a model that can perform correct classification of all kinds of unseen
input data. Training is achieved by utilizing an optimizer that improves the network by
using the process of backpropagation. This section will explain loss functions followed by
backpropagation and gradients, and then explain the way in which an optimizer works.

3.3.1 Loss function

Once a network architecture has been constructed, its weights and biases must be adjusted
in order to make the network’s mapping function approach the true mapping function.
To improve a model, the performance of the model must be measured. A loss function
achieves this by calculating one loss value for a model output. Different output types
require different loss functions because of the nature of their output values. In the case of a
classification problem, the quality of a prediction is based on the difference between the
confidence scores for each class and the binary ground truth values given by the ground
truth dataset. Ground truth class scores are either 1 or 0 as there is no classification
ambiguity in the ground truth dataset. Confidence scores that are outputs of the network
are decimal numbers in the range [0;1] as the network can be "sure" or "unsure" about a
classification. The closer a confidence score is to the value 0 or the value 1, the higher its
confidence is for this classification. The closer the value is to 0.5, the lower its confidence
in this classification is.
In our case, the loss function that was used is the F1 Loss, which is an adaptation of
the F1 score. The F1 score reflects a prediction’s accuracy by calculating its precision and
recall values. Precision and recall are calculated as:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Here, "true" (T) refers to correct classifications and "false" (F) to incorrect
classifications, while "positive" (P) and "negative" (N) refer to the predicted class. A
theoretical model with perfect classification accuracy would have an F1 score of
1, whereas a model that is consistently wrong in every prediction would have an F1 score
of 0. The F1 score calculation requires binary values of 0 or 1 as its prediction inputs. To
achieve this, the decimal values of the prediction output are rounded to the nearest whole
value. To adapt the F1 score calculation to be used as a loss function, this rounding step
is skipped. This creates the F1 loss. This loss function reacts to changes in the prediction
values in a more granular way, because even small changes in the prediction values lead to
a change in the F1 loss. In comparison, the F1 score only changes when a prediction value
crosses the threshold of 0.5, at which point its classification changes. This granular
reaction of the loss function is important, as it allows the fine adjustment of the model's
parameters, as explained in section 3.3.2.
F1 Loss BCE
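A minimal sketch of the resulting soft F1 loss, assuming y_true contains binary labels and
y_pred contains unrounded confidence scores in [0, 1]:

import numpy as np

def f1_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    tp = np.sum(y_true * y_pred)            # soft true positives
    fp = np.sum((1 - y_true) * y_pred)      # soft false positives
    fn = np.sum(y_true * (1 - y_pred))      # soft false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return 1.0 - f1                         # lower loss means a better prediction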

3.3.2 Backpropagation

In order to understand model training, it is necessary to understand the underlying process


of backpropagation and gradient descent. When a model with a specific set of parameters
completes a forward pass, this means that input data is passed through the network’s
layers according to its architecture from its first layer until its classification layer. The
next step is then to calculate a loss value for this set of class confidence scores produced
by the network. The aim is then to lower this loss value in order to increase the model’s
classification accuracy. To achieve this, the model must be viewed in a mathematical way.
Each layer and activation function can be represented by a function, which means that
the overall network is nothing more than a nested function comprised of all of its layers’
functions. The input to this function is the input image and the output is the classification
values for each class. The network can also be described by a different function: its
inputs being all adjustable parameters and its output being the single loss value. This
function returns the loss value for each set of weights and biases. Finding the global
minimum of this loss
function returns the optimal weights and biases that have to be used in order to tune the
network to produce the best possible results. Because of the unfathomably huge number
of parameters of the function, this cannot be easily visualized. To simplify the explanation,
a hypothetical loss function f(x) of a network with only one adjustable parameter x is
considered.

This network's loss is only dependent on its parameter x. In order to improve the loss,
the function is differentiated to create its derivative f'(x).
At any point on the function f(x), a change of parameter x has a different effect on the
loss value. This is visualized in the plot of f'(x) as the derivative of f(x). The optimal
value for x in this example is ... because it results in a global minimum at which x=..
results in the lowest possible loss value. Because f(x) is a very simple function, creating
the derivative in a mathematical way is simple and therefore the value of f’(x) can be
calculated for every point on f(x). In neural networks this is different, as there are usually
several million adjustable parameters that impact the loss value. The overall derivative can
therefore effectively not be calculated as all possible combinations of all of the parameters
would have to be considered. Instead, numerical differentiation is used for each specific set
of parameters to calculate the impact that a change in each of the parameters would have
on the loss value. For this, the overall network loss function must be differentiated.
Because this is a nested function, the chain rule is used for this differentiation. The
resulting function
can be used to calculate a gradient vector. This vector reflects the changes that have to
be made to each adjustable network parameter in order to lower the loss value.
The derivation of the network function is shown for a simple network:
....
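A minimal sketch of numerical differentiation for a hypothetical one-parameter loss
function:

def numerical_gradient(loss, x: float, h: float = 1e-5) -> float:
    # Approximate f'(x) by the symmetric difference quotient
    return (loss(x + h) - loss(x - h)) / (2 * h)

f = lambda x: (x - 2) ** 2 + 1     # hypothetical loss with its minimum at x = 2
print(numerical_gradient(f, 0.0))  # approx. -4.0; stepping against it moves x toward 2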

3.3.3 Optimizer

When training a network, not only a loss function must be chosen, but also an optimization
algorithm. This optimizer has the task of adjusting the network's parameters by utilizing
the gradient calculated at a point and then stepping along the descent direction of this
gradient to find a minimum on the loss function of the network. This approach is named
gradient descent. The challenge, however, is to decide on the exact amount by which each
of the adjustable parameters has to be changed. If the parameters are all changed by a value
that is too great, then the result could be that many local minima are skipped, resulting
in an entirely different location on the overall function. This is visualized in the following
simple function that only has one adjustable parameter:

The starting point on this function is (A). After calculating the slope at this point using
numerical differentiation, it is decided to change the parameter x in the direction of the
slope by a value of 3. This leads to the point (B). This point is at an entirely different
location on the function plot that is mostly unrelated to the original location (A). Cal-
culating the slope of this new point results in an entirely different slope. Following this
new slope again by changing the value of x by 3 leads to point (C), a point that is again
entirely unrelated to point (B). Updating the function parameters by values of too great
a magnitude is therefore akin to performing random jumps on the function that do not
improve the loss value and do not result in finding the desired global minimum. When
attempting
to find a global minimum it is therefore important to choose the scope of the parameter
updates wisely. This is referred to as the learning rate. In this plot the same function f(x)
is visualized with small updates to its parameter, resulting in finding the global minimum.

Simply choosing a small learning rate is not a definitive solution either, as a learning rate
that is too small would lead to a very slow training process. The learning rate in our
approach was therefore set in an incrementally decreasing manner, quartering every 10
epochs.
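A minimal sketch of this schedule; the initial learning rate value is illustrative:

def learning_rate(epoch: int, initial_lr: float = 1e-3) -> float:
    # Quarter the learning rate every 10 epochs
    return initial_lr * 0.25 ** (epoch // 10)

print(learning_rate(0))    # 0.001
print(learning_rate(10))   # 0.00025
print(learning_rate(25))   # 6.25e-05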

gradient descent sgd, rmsprop, momentum, adam learning rate

3.3.4 Results

Now that all network elements have been explained, an overview of the overall network
architecture and the training process can be given.
The EfficientNet architecture is a so-called Residual Network (ResNet) that has been
improved by adding the MBConv Block as described in section 3.2.4.2. The overall
network architecture is visualized in fig 1:

The training process to

4 Evaluation

4.1 Metrics

4.1.1 F1 Score

4.1.2 Confusion Matrix

4.2 K-Fold Cross Validation

5 Conclusion

Bibliography

[1] HPA-Cell-Segmentation. https://github.com/CellProfiling/HPA-Cell-Segmentation, April 2021.

