
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 192 (2021) 582–591
www.elsevier.com/locate/procedia

25th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems
Morphological Cross Entropy Loss for Improved Semantic Segmentation of Small and Thin Objects

René Pihlak∗, Andri Riid

Department of Software Science, Tallinn University of Technology, Ehitajate tee 5, Tallinn 19086, Estonia

Abstract

Image segmentation often faces a trade-off between using higher resolution images to detect fine details, such as the edges of objects or thin structures, and lower resolution images, which are more suitable for accurate segmentation of massive objects. Because low resolution images require fewer resources, accurate detection of small objects is often less prioritized in trying to achieve the highest accuracy. In this paper, we propose to improve the segmentation of small and thin objects by convolutional neural networks by adding a morphological element to the loss function used for training the segmentation network. The approach is tested on a traffic sign segmentation problem using the Cityscapes dataset with a training set of 2 979 images and is shown to have an advantage over the popular cross-entropy (CE) and a ground truth (GT) affinity map weighted CE, as it yields higher global IoU and, more importantly, an IoU gain among the smaller traffic signs with no additional computational resources.
© 2021 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of KES International.
Keywords: Semantic segmentation; Traffic signs; Loss function; Convolutional neural network

1. Introduction
Advances in deep learning (DL) have transformed and boosted computer vision (CV) in the last decade. The most important problems in CV are image classification, object detection, and semantic segmentation, in the ascending order of difficulty. Semantic segmentation — assigning one of the pre-defined classes to each pixel in an image — has applications in many fields such as autonomous driving, robotics, industrial inspection, medical imaging, scene understanding, etc.

It has been observed that there is a trade-off in semantic segmentation — fine details, such as the edges of objects or thin structures, are often better captured with scaled-up images in higher resolution, whereas accurate segmentation of massive objects assumes more context and is achieved better with downscaled images of lower resolution [1].

∗ Corresponding author.
E-mail address: rene.pihlak@taltech.ee
1877-0509 © 2021 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of KES International.
10.1016/j.procs.2021.08.060

This issue has been addressed by a number of researchers and the proposed solutions range from multi-scale
networks [1] and attention maps [2, 3] to specifically designed loss functions [4, 5]. Typically, those approaches suffer
from the requirement for an excessive amount of time or GPU resources and they do not specifically focus on small
and thin objects. Instead, [1, 2, 3] use ‘black box’ attention maps that are trained as part of the neural network (NN), while [4] uses an affinity map (AF) to give more weight to pixels that are near borders and [1, 5] calculate the loss based on a region including and surrounding a given pixel.
We propose a loss function to improve semantic segmentation performance on small and thin objects — without
a performance loss on larger objects — by adding a morphological element to the loss function used for training the
segmentation NN.
The proposed approach has the following advantages: unlike [1, 4, 5, 6], it requires neither a multi-scale NN architecture nor excessive GPU resources; it does not focus solely on the edges of objects; and it can be implemented in existing NNs without architectural changes. To verify the approach, the NN is trained and tested for a traffic sign segmentation problem on the publicly available Cityscapes dataset [7], making the results reproducible and comparable. The results indicate that for traffic signs (which vary greatly in size), the proposed method yields a gain of ≈ 2.11 % in terms of the evaluation metric, i.e. global intersection over union (IoU) at the pixel level.
We also apply a dedicated metric that assesses the segmentation performance for different object sizes; it reveals that the performance gain is highest among smaller objects.

2. Related Work

The task of semantic (image) segmentation is to give each pixel in an image a corresponding label so that the pixels
with the same label share certain characteristics (i.e. they depict the same kind of object — car, flower, building, etc.).
The goal of segmentation is to recognize what is depicted in the image. It, however, does not distinguish between
different instances of objects of the same class (this is called instance segmentation).
Over the years, several different methods and techniques have been proposed for semantic segmentation. Some of
the older segmentation solutions include thresholding [8], k-means clustering [9], histogram-based segmentation [10]
and region-growing [11] as the principal method. However, these methods are generally outperformed by more recent
DL-based methods, especially by convolutional neural networks (CNNs) [12, 13, 14]. The evolution is not surprising
as, after all, in recent years, CNNs have become a cornerstone in the field of computer vision, and have been applied
to many tasks ranging from image classification [15] and object detection [16] to semantic segmentation [1] and even
to artistic style transfer [17].
Semantic segmentation networks convert an existing CNN architecture constructed for classification into a fully
convolutional neural network (FCN). A typical classification CNN consists of a series of convolutions and poolings,
followed by fully connected layers that map an image to the label space. An FCN, on the other hand, maps an image
to another image with a number of channels that corresponds to the number of classes. This was first attempted by
replacing the fully connected layer(s) with an upsampling layer [18] but the approach suffered from limitations —
the objects that are substantially larger or smaller than the pre-defined fixed-size receptive field became fragmented
or mislabelled and small objects were often ignored and classified as background. This was improved by an encoder-
decoder architecture [19] where the encoder (the usual contracting network) decreases the spatial size of the input
maps while increasing the number of channels, and the decoder uses up-convolution operations with a decreasing number
of filters so as to increase the spatial dimensions of the feature map and reduce its channels. [20] introduced the skip
connections — the feature maps from the encoder path are fed into their respective decoder paths where these two
feature maps are concatenated along the channel dimension — at several stages.
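To make the encoder-decoder idea concrete, the following minimal sketch (assuming a PyTorch setup; the layer widths are illustrative and do not reproduce the network of Fig. 2) shows how encoder feature maps are concatenated into the decoder path:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Illustrative two-stage encoder-decoder with skip connections."""
    def __init__(self, in_ch=3, n_classes=1):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec2 = nn.Sequential(nn.Conv2d(256 + 128, 128, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(128 + 64, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, n_classes, 1)          # 1 x 1 convolution to class maps

    def forward(self, x):
        e1 = self.enc1(x)                                # full-resolution features
        e2 = self.enc2(self.pool(e1))                    # 1/2 resolution
        b = self.bottleneck(self.pool(e2))               # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up(b), e2], 1))   # skip connection from enc2
        d1 = self.dec1(torch.cat([self.up(d2), e1], 1))  # skip connection from enc1
        return self.head(d1)                             # per-pixel class scores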
The segmentation trade-off (fine details are better captured at higher and large structures at lower resolutions)
is commonly addressed by multi-scale inference. Predictions are made at a range of scales and those multi-scale predictions are combined at the pixel level by averaging, max pooling, or an attention module [1].
Some researchers have stated that it suffices to use single-resolution FCNs but with dual (DRANet [21],
DANet [22]) or criss-cross (CCNet [2, 3]) attention modules inserted into the FCN. However, adding attention modules introduces a new architecture, which complicates transfer learning and requires additional, sometimes excessive, computational resources.

2.1. Loss Functions

FCN based semantic segmentation approaches treat the segmentation problem as pixel-wise classification and
therefore train the model by minimizing an average pixel-wise classification loss of the image or images in a mini-
batch. One of the most commonly used loss functions is the softmax cross entropy loss (CE):

CE(y, p) = -N^{-1} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log p_{n,c}    (1)

where N denotes the number of pixels, C denotes the number of classes, y ∈ {0, 1} is the ground truth (GT) and
p ∈ [0, 1] is the prediction of the NN.
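For reference, (1) amounts to the following small NumPy computation, where y is the one-hot encoded GT and p contains the softmax outputs, both flattened to shape (N, C); the names are illustrative:

import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Eq. (1): average pixel-wise cross entropy over N pixels and C classes."""
    n = y.shape[0]
    return -np.sum(y * np.log(p + eps)) / n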
Alternatively, a loss function could take into account the ‘context’ of a pixel. This would give a better estimation of how the pixel fits in with the surrounding pixels. This approach is, for example, taken by [4] with a neighbouring pixel affinity loss (NPALoss) and by [5] with a region mutual information (RMI) loss. NPALoss uses a GT based affinity map that is calculated as a sum of XOR operations between a pixel and all pixels in the search area and is used as pixel-wise weights for CE. RMI, on the other hand, builds a multidimensional point from the pixels that are within a search area and uses the distance between such multidimensional points in prediction and GT. These approaches have the advantage that they can be used with existing FCN architectures; only the loss function must be changed. However, the computational complexity increases when neighbouring pixels are included in the loss calculations, which causes the computation to require more resources and/or become slower. In contrast to the proposed method, [4] uses the AF to give more weight to pixels that are near borders and [5] calculates a loss based on a region including and surrounding a given pixel. In other words, these methods do not specifically focus on improving semantic segmentation of small and thin objects.

3. Binary Morphological Loss

A loss function has to fulfil two requirements: (a) calculating the loss function should not require an excessive
amount of resources; (b) the training process that minimizes the loss function should obtain a trained model that pro-
duces predictions that are close to the expected results — the GT. These two conditions, however, are often conflicting
because the loss functions that yield more accurate results are often more complex thus requiring more computational
resources.
We propose a combined loss function that is a linear combination of (1) and a cross entropy loss for small objects

MorCE = \alpha \times CE + \beta \times CE_{small}(k),    (2)

where CE is the cross entropy loss, CE_small is the cross entropy loss for small objects, α, β are the weights of the individual
losses, and k is the size of the kernel for morphological operations.
It is expected that combining several loss functions will provide enough synergy to overcome the individual loss functions' shortcomings, and further loss functions could be weighted into (2); however, in order to isolate the effect of the loss component calculated on small and thin objects, we do not consider that in this paper.
The CE loss for small objects is defined as

 
CE_{small}(k) = CE\left( WTH(y_{true}, k),\, WTH(y_{pred}, k) \right),    (3)

where WTH is the white-top-hat operation and k is the size of the kernel. The morphological white-top-hat operation is used to filter out objects that are neither small nor thin. The top-hat operations can be further broken down into

erosion and dilation operations:

WTH(y, k) = y - y \circ b(k) = y - \left( (y \ominus b(k)) \oplus b(k) \right)    (4)

where ∘, ⊖ and ⊕ are the opening, erosion and dilation operations, respectively, b(k) is a structuring element (kernel), and k is the size of the kernel.
In computer vision, morphological operations, such as erosion and dilation, are primarily used for pre- and post-
processing for noise filtering [23] though some image segmentation methods [24, 25] also rely on morphological
operations. (4), however, uses morphological operations to filter out small and thin objects from the rest of the objects
while calculating the loss.
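A minimal sketch of (2)-(4) for the binary (single-class) case is given below. It assumes a PyTorch setup in which dilation and erosion with a square structuring element are approximated by max-pooling, so that the white-top-hat remains differentiable inside the training loss; the paper does not specify the authors' implementation, so the function names and the clamping are illustrative assumptions.

import torch
import torch.nn.functional as F

def dilate(x, k):
    """Grayscale dilation with a k x k square structuring element (max filter)."""
    return F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2)

def erode(x, k):
    """Grayscale erosion as the dual of dilation (min filter)."""
    return -F.max_pool2d(-x, kernel_size=k, stride=1, padding=k // 2)

def white_top_hat(x, k):
    """Eq. (4): WTH(y, k) = y - ((y eroded) dilated); keeps only small/thin structures."""
    opening = dilate(erode(x, k), k)
    return x - opening

def morce_loss(y_pred, y_true, k=95, alpha=1.0, beta=1.0):
    """Eqs. (2)-(3): MorCE = alpha * CE + beta * CE_small(k) for sigmoid outputs y_pred."""
    ce = F.binary_cross_entropy(y_pred, y_true)
    ce_small = F.binary_cross_entropy(
        white_top_hat(y_pred, k).clamp(0.0, 1.0),   # clamp guards against numerical drift
        white_top_hat(y_true, k).clamp(0.0, 1.0),
    )
    return alpha * ce + beta * ce_small

Here y_pred and y_true are expected as (batch, 1, H, W) tensors with values in [0, 1]; an odd kernel size such as 95 keeps the spatial size unchanged with padding k // 2.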
The proposed approach is similar to [6], which uses a weighted CE where the weights, calculated in the pre-processing stage, are based on the size of the GT objects; however, there are several key differences. First, we can use the proposed method with data augmentation that zooms in or out, while [6] pre-calculates the weights, thus making them fixed and valid only for the original zoom level. This also means that [6] must crop out small object masks near the edges, as those might belong to objects that are not fully visible for a given rotation and cropping. Furthermore, [6] calculates the weights based on the pixel size of objects, thus classifying a long thin object as an easily detectable large object, while the proposed method identifies such thin objects also as hard-to-detect ones. At first sight, [6] appears to have a more nuanced weight calculation as the weight varies with the size of the object, while the proposed method uses a simpler binary threshold: small/thin object or not. However, increasing the loss weights for very tiny objects might hinder the overall performance of the FCN because many of those very tiny GT annotations are artifacts of annotation mistakes when multiple classes overlap. Finally, unlike [6], the proposed method also evaluates whether the predicted objects are small and/or thin, and thus penalizes the case where two small or thin GT objects are mistakenly predicted as one object. Moreover, it penalizes the prediction of small or thin objects in locations where there are neither small nor thin objects.

4. Performance Metrics

Semantic segmentation tasks are usually evaluated by the Jaccard similarity index, also known as IoU (intersection
over union).


IoU = \frac{|X \cap Y|}{|X \cup Y|} = \frac{TP}{TP + FP + FN},    (5)

which is the area of overlap between the GT (X) and the predicted semantic segmentation (Y) divided by the area of
union between the predicted semantic segmentation and the GT. TP stands for the number of pixels correctly predicted
to belong to a given class (TP = |X ∩ Y|), FP for the number of pixels falsely predicted to belong to a given class
(FP = |Y| − TP), and FN for the number of pixels falsely predicted not to belong to a class (FN = |X| − TP). It
is well-known that the global IoU measure (5) is biased toward object instances that cover a large image area. This
becomes troublesome in applications where there is a strong scale variation among instances. To address the issue, an
instance-level intersection over union (iIoU) metric has been proposed [26]


iIoU = \frac{\sum_i w_i\, iTP_i}{\sum_i w_i\, iTP_i + FP + \sum_i w_i\, iFN_i},    (6)

where iTP_i and iFN_i denote the numbers of TP and FN pixels for the i-th GT instance, respectively, which are weighted by w_i = S̄/(iTP_i + iFN_i), where S̄ is the average GT instance size. The FP pixels, however, are not associated with any
GT instance and thus do not require normalization.
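Computed from per-instance pixel counts, (5) and (6) reduce to a few lines of NumPy (the variable names are illustrative):

import numpy as np

def global_iou(tp, fp, fn):
    """Eq. (5): pixel-level intersection over union."""
    return tp / (tp + fp + fn)

def instance_iou(itp, ifn, fp):
    """Eq. (6): iIoU from per-GT-instance TP and FN pixel counts and the global FP count."""
    itp, ifn = np.asarray(itp, dtype=float), np.asarray(ifn, dtype=float)
    sizes = itp + ifn                    # GT instance sizes iTP_i + iFN_i
    w = sizes.mean() / sizes             # w_i = S_bar / (iTP_i + iFN_i)
    return (w * itp).sum() / ((w * itp).sum() + fp + (w * ifn).sum())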

Fig. 1. Objects obtained by combining TP, FP and FN pixels. (Example values shown in the figure: object #1: TP = 0, FN = 0, FP = 888, oIoU = 0.000; object #2: TP = 4005, FN = 741, FP = 15, oIoU = 0.841; object #3: TP = 3228, FN = 1528, FP = 0, oIoU = 0.679; object #4: TP = 0, FN = 1016, FP = 0, oIoU = 0.000.)

The iIoU aims to evaluate how well the individual instances in the scene are represented in the labelling; however, if iIoU < IoU, it simply means that smaller-than-average objects are not detected with the same accuracy as larger-than-average objects.
On the other hand, average precision — a popular metric for measuring the accuracy of object detectors [16] — is occasionally broken down into individual measures for small, medium, and large objects (medium objects are specifically defined as having between 32² and 96² pixels), which helps to better identify where predictions fail. However, computing the size of a prediction is trivial in object detection, which yields object bounding boxes, in contrast to segmentation problems, and the definition of size classes is specific to the COCO dataset [27] with relatively small images.

4.1. Object-level and size-specific measures

We combine these two assessments into one. As a first step, we combine GT and prediction pixels, X and Y to yield
object instances. A given object comprising prediction and GT pixels can contain any combination of TP, FP and FN
pixels (Fig. 1) that are defined as follows:

 
oTP_j = \left| X_j \cap Y_j \right|,    (7)

where X_j and Y_j are the GT and prediction pixels belonging to the j-th object,

 
oFP_j = \left| Y_j \right| - oTP_j,    (8)

oFN_j = \left| X_j \right| - oTP_j,    (9)

and computation of object-level intersection over union (oIoU) for the j-th object is straightforward

oIoU_j = \frac{oTP_j}{oTP_j + oFP_j + oFN_j}.    (10)
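A sketch of how the object instances and (7)-(10) could be computed from binary GT and prediction masks, assuming the combined objects are taken as connected components of the union of the two masks (the paper does not detail this step, so the use of scipy.ndimage.label is an assumption):

import numpy as np
from scipy import ndimage

def object_iou(gt, pred):
    """Return (oTP, oFP, oFN, oIoU) for each connected object of the union gt | pred."""
    labels, n_obj = ndimage.label(np.logical_or(gt, pred))
    results = []
    for j in range(1, n_obj + 1):
        x_j = np.logical_and(gt, labels == j)      # GT pixels of the j-th object
        y_j = np.logical_and(pred, labels == j)    # prediction pixels of the j-th object
        otp = np.logical_and(x_j, y_j).sum()       # eq. (7)
        ofp = y_j.sum() - otp                      # eq. (8)
        ofn = x_j.sum() - otp                      # eq. (9)
        results.append((otp, ofp, ofn, otp / (otp + ofp + ofn)))  # eq. (10)
    return results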
Fig. 2. Architecture of the neural network (a U-Net-style encoder-decoder with a 512 × 512 × 3 input, convolutional blocks of 64 to 2 048 filters interleaved with max-pooling and dropout in the encoder, a mirrored decoder with upsampling, concatenation of the corresponding encoder feature maps, convolutional blocks and dropout, and a final conv ×1 layer producing a 512 × 512 × 1 output)

The metric for segmentation quality assessment can be developed at two levels. At the pixel level, the objects are grouped by size into pre-determined size groups and each object contributes its TP, FP and FN pixel counts only to the particular size group metric used to calculate the group pixel-level IoU.
Additionally, whether an object detection is valid or not can be determined (as in object detection) by its oIoU value as follows: if oIoU_j > 0.5, the object receives a true positive status; if oIoU_j ≤ 0.5 and the object contains any FN pixels, it obtains a false negative status; otherwise, it is given a false positive status. For example, Fig. 1 depicts two true positive objects, one false positive object, and one false negative object. The classification results can be used to build confusion matrices and, incidentally, to calculate the object-level IoU of classification, which can be further broken down into different size groups.
The size of the j-th object in this approach is determined by oTP_j + oFN_j (the size of the j-th GT object). If an object does not contain any GT pixels, its size is determined by oFP_j.
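Continuing the sketch above, the object-level status and the size used for grouping could be assigned as follows (illustrative helper functions; the size group boundaries themselves are given later in Table 1):

def object_status(otp, ofp, ofn):
    """Classify a combined object as 'TP', 'FN' or 'FP' based on its oIoU and pixel counts."""
    oiou = otp / (otp + ofp + ofn)
    if oiou > 0.5:
        return "TP"                      # valid detection
    return "FN" if ofn > 0 else "FP"     # missed GT object vs. spurious prediction

def object_size(otp, ofp, ofn):
    """Object size: GT pixel count if the object contains GT pixels, otherwise its FP count."""
    gt_size = otp + ofn
    return gt_size if gt_size > 0 else ofp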

5. Experiments

5.1. Experimental Setup

We trained three U-Net based models employing three different loss functions: (a) CE, (b) AF (a simplified version of NPALoss [4]) and (c) MorCE. We used kernel sizes 7 and 95 for the AF and MorCE based loss functions, respectively. This allowed training the AF model with batch size two and the MorCE model with batch size four on a single GeForce RTX 2080 Ti GPU. Besides, kernel size 95 roughly corresponds to the COCO dataset threshold for large objects, which is 96² pixels.
In order to isolate the effect that different loss functions have on the segmentation performance, we kept all other
variables, such as the base model architecture, constant throughout the experiments. Therefore, each experiment com-
prises training a CNN model with the same fixed architecture (Fig. 2) that is inspired by U-Net [20]. In this base
model, the final convolutional layer has a kernel size 1 × 1 while all the other convolutional layers have a kernel
size 3 × 3. Unlike the original U-Net architecture, the base model has dropout layers with a dropout probability of 0.1. We obtained this dropout rate through hyper-parameter optimization.
All three models were trained on the Cityscapes dataset [7] that contains 3 475 finely annotated panoramic images.
These high-definition images have a size of 1 024 × 2 048 pixels. We divided these images into training and validation
datasets with 2 979 and 496 images respectively by sorting the images using the relative paths of the files and tak-
ing every seventh image for validation. This ensures that the validation dataset contains images from all cities in the
dataset. Besides the training and validation images, the Cityscapes dataset also contains 1 525 testing images. How-
ever, these testing images are provided without publicly available annotations. Therefore, we randomly chose 276
images out of 1 525 and manually annotated these following the guidelines of the Cityscapes dataset.

Fig. 3. Test results for object and pixel based global IoU (object based: CE 0.492, MorCE 0.521, AF 0.490; pixel based: CE 0.663, MorCE 0.677, AF 0.640)

At the beginning of training, we set the learning rate to 10⁻⁵. During the training process, the learning rate is
halved if the validation metric does not improve for 13 consecutive training steps, and if the validation metric has not
improved for 18 consecutive training steps, the training is stopped.
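The learning-rate policy described above could be expressed, for example, with PyTorch's plateau scheduler and a simple early-stopping counter; this is only a sketch of the schedule, with a stand-in model, optimizer and validation step, not the authors' training code:

import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)          # stand-in for the segmentation network
def validate() -> float:                       # stand-in for computing the validation metric
    return 0.0

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=13)   # halve LR after 13 steps without improvement

best, stale = -float("inf"), 0
for step in range(10_000):
    # ... one training step on a mini-batch would go here ...
    metric = validate()
    scheduler.step(metric)                     # plateau detection on the validation metric
    if metric > best:
        best, stale = metric, 0
    else:
        stale += 1
        if stale >= 18:                        # stop after 18 steps without improvement
            break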
Training on a GPU sets limitations on the input image dimensions and the number of trainable parameters of the NN. Our models are trained using an input size of 512 × 512 pixels obtained by randomized cropping of the original images at each training cycle. The inference, however, is performed on a CPU; thus, the full image at the original size of 1 024 × 2 048 is used as input.
Especially in Northern Europe, the lighting conditions (brightness and color of sunlight) can significantly vary dur-
ing a day. This means a large dataset is needed to fully capture these variations. However, from the Cityscapes dataset,
only 3 475 images are available for training and validation combined, therefore, data augmentation is necessary to
increase the robustness of the model.
At each training step, we randomly apply the following augmentations to each (copy of an) image separately: (a) adding random values (±6) to the R, G and B channels, (b) random zoom (0.75–1.25), (c) random rotation angle (±5°), (d) gamma correction (0.95–1.5), (e) top-hat with a random kernel (size 75–275), (f) CLAHE with parameters 1.33 and 100, (g) simple white balance and (h) adding salt-and-pepper noise.
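As an illustration, a few of the listed augmentations (channel shift, gamma correction and salt-and-pepper noise) could be realized with NumPy as follows; the noise fraction and the use of uint8 RGB images are assumptions, and the remaining operations (zoom, rotation, top-hat, CLAHE, white balance) would be added in the same manner:

import numpy as np

rng = np.random.default_rng()

def augment(img):
    """img: H x W x 3 uint8 RGB image; returns a randomly augmented copy."""
    out = img.astype(np.float32)
    out += rng.uniform(-6, 6, size=3)                    # (a) random shift of the R, G, B channels
    gamma = rng.uniform(0.95, 1.5)                       # (d) gamma correction
    out = 255.0 * (np.clip(out, 0, 255) / 255.0) ** gamma
    mask = rng.random(out.shape[:2]) < 0.001             # (h) salt-and-pepper on ~0.1 % of pixels
    out[mask] = rng.choice([0.0, 255.0], size=(int(mask.sum()), 1))
    return np.clip(out, 0, 255).astype(np.uint8)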

5.2. Results on Cityscapes Dataset

After training the three models — CE baseline, AF and MorCE-95 — we evaluated the performance of the models
on the test dataset of 276 images. The results over all object sizes show that the AF model has the lowest performance,
while the MorCE-95 model trained on MorCE loss with a kernel size of 95 outperforms both the CE baseline and AF
models for both object and pixel based global IoU measures (Fig. 3): the object and pixel based IoU values for the
baseline model are, respectively, 0.492 and 0.663, while the same values for MorCE-95 are 0.521 and 0.677. To put it in perspective, the global IoU improved by ≈ 0.029 for the object based measure and by ≈ 0.014 for the pixel based measure when using the MorCE-95 model. This means that the IoU increased by 5.89 % and 2.11 % for the object and pixel based measures, respectively.
While the resulting global IoU values (Fig. 3) suggest that the MorCE-95 model outperforms both the CE baseline and AF models, this information alone does not confirm that the increased performance is in fact due to better detection of small and thin objects. Therefore, we divided the objects into five empirically determined size groups based on their pixel-wise size (Table 1) and assessed each model’s performance per group. In the case of object based evaluation, the MorCE-95 model outperforms both the CE baseline and AF models in the tiny, small and medium groups, while the CE baseline model outperforms both the AF and MorCE-95 models in the large and extra large groups (Fig. 4). In other words, the increased performance of the MorCE-95 model is, indeed, due to improved performance in detecting small and thin objects.

Fig. 4. Detailed test results by size (tiny, small, medium, large, extra large) for object based IoU of the CE, MorCE and AF models

Table 1. Size groups and the distribution of ground truth (GT) objects depending on the size of a traffic sign (M)

Group name     Boundaries         No. of GT objects    No. of pixels of GT objects
tiny           5² < M < 15²       755 (37.6 %)         87 977 (3.9 %)
small          15² ≤ M < 30²      721 (36.0 %)         325 313 (14.5 %)
medium         30² ≤ M < 55²      358 (17.9 %)         569 117 (25.3 %)
large          55² ≤ M < 84²      110 (5.5 %)          481 969 (21.4 %)
extra large    M ≥ 84²            61 (3.0 %)           785 586 (34.9 %)

Fig. 5. Detailed test results by size (tiny, small, medium, large, extra large) for pixel based IoU of the CE, MorCE and AF models

Similar results are obtained when comparing the pixel based IoU values of the models (Fig. 5). The MorCE-95 model outperforms the CE baseline and AF models in all size categories except for large, where the CE baseline model edges slightly higher with a difference of ≈ 0.001. The AF model is outperformed by the MorCE-95 model in all size groups, while the CE baseline model outperforms AF in all size groups except for tiny, where AF scores ≈ 0.014 higher. Some segmentation results are depicted in Fig. 6.

Fig. 6. Segmentation results: input image (a), ground truth (b), segmentation with CE (c), AF (d), and MorCE-95 (e)

However, it must be noted that choosing too small a kernel size (e.g. 35) for the MorCE loss function will hinder the
model’s performance both for object and pixel based IoU. The reason for this could be that sometimes these small/thin
GT objects are artifacts of annotation mistakes and focusing on those can lead to a model that generates noisy outputs.

6. Conclusion

Semantic image segmentation suffers from the segmentation trade-off where fine details of objects are better captured at a higher resolution while large structures require more context and hence a lower resolution; the computational footprint of the multi-scale approaches that address this issue might be excessive and not accessible to everyone. We propose a simple yet efficient method to improve the segmentation of small and thin objects by adding to the loss function used for training the segmentation network an element that specifically focuses on such objects, which are picked out using morphological operations.
The experiments with the traffic sign segmentation problem on the Cityscapes dataset show that the proposed method yields higher IoU than plain CE (the IoU gain is the highest among the smaller traffic signs) and suggest that the proposed binary threshold for picking out small or thin objects is more efficient than giving proportionally higher weights to smaller objects. The method can be implemented in existing NNs without architectural changes and does not require excessive GPU resources.

Acknowledgements

This study was partially supported by the Archimedes Foundation and EyeVi Ltd. in the scope of the smart special-
ization research and development project #LEP19022: “Applied research for creating a cost-effective interchangeable
3D spatial data infrastructure with survey-grade accuracy”.

References

[1] Tao, Andrew, Karan Sapra, and Bryan Catanzaro. (2020) “Hierarchical multi-scale attention for semantic segmentation.” arXiv:2005.10821.
[2] Huang, Zilong, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, and Thomas S. Huang. (2020) “CCNet: Criss-cross
attention for semantic segmentation.” arXiv:1811.11721.
[3] Huang, Z., X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu. (2019) “CCNet: Criss-cross attention for semantic segmentation”, in Proceedings
of 2019 IEEE/CVF International Conference on Computer Vision (ICCV): 603–612.
[4] Feng, Y., W. Diao, X. Sun, J. Li, K. Chen, K. Fu, and X. Gao. (2020) “NPALoss: Neighboring pixel affinity loss for semantic segmentation in
high-resolution aerial imagery.” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences V-2-2020: 475–482.
[5] Zhao, Shuai, Yang Wang, Zheng Yang, and Deng Cai. (2019) “Region mutual information loss for semantic segmentation.” arXiv:1910.12037.
[6] Mohanty, Sharada Prasanna, Jakub Czakon, Kamil A. Kaczmarek, Andrzej Pyskir, Piotr Tarasiewicz, Saket Kunwar, Janick Rohrbach, Dave
Luo, Manjunath Prasad, Sascha Fleer, Jan Philip Göpfert, Akshat Tandon, Guillaume Mollard, Nikhil Rayaprolu, Marcel Salathe, and Malte
Schilling. (2020) “Deep learning for understanding satellite imagery: An experimental survey.” Frontiers in Artificial Intelligence 3: 534696.
[7] Cordts, Marius, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and
Bernt Schiele. (2016) “The Cityscapes dataset for semantic urban scene understanding”, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR): 3213–3223.
[8] Hoover, A. D., V. Kouznetsova, and M. Goldbaum. (2000) “Locating blood vessels in retinal images by piecewise threshold probing of a
matched filter response.” IEEE Transactions on Medical Imaging 19 (3): 203–210.
[9] Kanungo, T., D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. (2002) “An efficient k-means clustering algorithm:
analysis and implementation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7): 881–892.
[10] Tobias, O. J. and R. Seara. (2002) “Image segmentation by histogram thresholding using fuzzy sets.” IEEE Transactions on Image Processing
11 (12): 1457–1465.
[11] Deng, Y. and B. S. Manjunath. (2001) “Unsupervised segmentation of color-texture regions in images and video.” IEEE Transactions on
Pattern Analysis and Machine Intelligence 23 (8): 800–810.
[12] Chen, L., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. (2018) “Deeplab: Semantic image segmentation with deep convolutional
nets, atrous convolution, and fully connected crfs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4): 834–848.
[13] Zhao, H., J. Shi, X. Qi, X. Wang, and J. Jia. (2017) “Pyramid scene parsing network”, in Proceedings of 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR): 6230–6239.
[14] Liu, S., L. Qi, H. Qin, J. Shi, and J. Jia. (2018) “Path aggregation network for instance segmentation”, in Proceedings of 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition: 8759–8768.
[15] Xian, Y., T. Lorenz, B. Schiele, and Z. Akata. (2018) “Feature generating networks for zero-shot learning”, in Proceedings of 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition: 5542–5551.
[16] Redmon, J., S. Divvala, R. Girshick, and A. Farhadi. (2016) “You only look once: Unified, real-time object detection”, in Proceedings of 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 779–788.
[17] Chen, Y., Y. Lai, and Y. Liu. (2018) “Cartoongan: Generative adversarial networks for photo cartoonization”, in Proceedings of 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition: 9465–9474.
[18] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. (2015) “Fully convolutional networks for semantic segmentation”, in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition: 3431–3440.
[19] Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. (2015) “Learning deconvolution network for semantic segmentation”, in Proceedings
of the IEEE International Conference on Computer Vision: 1520–1528.
[20] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. (2015) “U-Net: Convolutional networks for biomedical image segmentation”, in Nassir
Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (eds) Medical Image Computing and Computer-Assisted Intervention
– MICCAI 2015, Springer International Publishing.
[21] Fu, Jun, Jin Liu, Jie Jiang, Yong Li, Yongjun Bao, and Hanqing Lu. (2020) “Scene segmentation with dual relation-aware attention network.”
IEEE Transactions on Neural Networks and Learning Systems: 1–14.
[22] Fu, Jun, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. (2019) “Dual attention network for scene segmentation”,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 3146–3154.
[23] Kobayashi, S. and R. Miyamoto. (2019) “Noise reduction of segmented images by spatio-temporal morphological operations”, in Proceedings
of 2019 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS): 297–300.
[24] Benediktsson, J. A., M. Pesaresi, and K. Amason. (2003) “Classification and feature extraction for remote sensing images from urban areas
based on morphological transformations.” IEEE Transactions on Geoscience and Remote Sensing 41 (9): 1940–1949.
[25] Grau, V., A. U. J. Mewes, M. Alcaniz, R. Kikinis, and S. K. Warfield. (2004) “Improved watershed transform for medical image segmentation
using prior information.” IEEE Transactions on Medical Imaging 23 (4): 447–458.
[26] Alhaija, Hassan, Siva Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. (2018) “Augmented reality meets computer vision:
Efficient data generation for urban driving scenes.” International Journal of Computer Vision (IJCV) 126: 961–972.
[27] Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. (2014)
“Microsoft COCO: Common objects in context”, in Computer Vision – ECCV 2014, Springer International Publishing: 740–755.
