
Digital Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Science and Technology 2253

Image Processing and Analysis
Methods for Biomedical Applications

EVA BREZNIK

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2023

ISSN 1651-6214
ISBN 978-91-513-1760-1
URN urn:nbn:se:uu:diva-498953
Dissertation presented at Uppsala University to be publicly examined in Sonja Lyttkens
(101121), Ångström Laboratoriet, Lägerhyddsvägen 1, Uppsala, Friday, 12 May 2023 at
09:15 for the degree of Doctor of Philosophy. The examination will be conducted in English.
Faculty examiner: Professor Alejandro Frangi (University of Leeds).

Abstract
Breznik, E. 2023. Image Processing and Analysis Methods for Biomedical Applications.
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and
Technology 2253. 74 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-1760-1.

With new technologies and developments, medical images can be acquired more quickly and
at a larger scale than ever before. However, the increased amount of data induces an overhead
in the human labour needed for its inspection and analysis. To support clinicians in decision
making and enable swifter examinations, computerized methods can be utilized to automate the
more time-consuming tasks. For such use, methods need to be highly accurate, fast, reliable and
interpretable. In this thesis we develop and improve methods for image segmentation, retrieval
and statistical analysis, with applications in imaging-based diagnostic pipelines.
Individual objects often need to first be extracted/segmented from the image before they
can be analysed further. We propose methodological improvements for deep learning-based
segmentation methods using distance maps, with the focus on fully-supervised 3D patch-based
training and training on 2D slices under point supervision. We show that using a directly
interpretable distance prior helps to improve segmentation accuracy and training stability.
For histological data in particular, we propose and extensively evaluate a contrastive learning
and bag-of-words-based pipeline for cross-modal image retrieval. The method is able to recover
correct matches from the database across modalities and small transformations, with improved
accuracy compared to competing methods.
In addition, we examine a number of methods for multiplicity correction on statistical
analyses of correlation using medical images. Evaluation strategies are discussed and anatomy-
observing extensions to the methods are developed as a way of directly decreasing the
multiplicity issue in an interpretable manner, providing improvements in error control.
The methods presented in this thesis were developed with clinical applications in mind and
provide a strong base for further developments and future use in medical practice.

Keywords: Multiple comparisons, image segmentation, image retrieval, deep learning,
medical image analysis, magnetic resonance imaging, whole-body imaging

Eva Breznik, Department of Information Technology, Computerized Image Analysis and
Human-Computer Interaction, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.

© Eva Breznik 2023

ISSN 1651-6214
ISBN 978-91-513-1760-1
URN urn:nbn:se:uu:diva-498953 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-498953)
To my family.
List of papers

This thesis is based on the following papers, which are referred to in the text
by their Roman numerals.

I E. Breznik, J. Kullberg, H. Ahlström, R. Strand. "Introducing spatial
context in patch-based deep learning for semantic segmentation in
whole body MRI". In Scandinavian Conference on Image Analysis
(SCIA), accepted for publication, 2023.

II E. Breznik, H. Kervadec, F. Malmberg, J. Kullberg, H. Ahlström, M.
de Bruijne, R. Strand. "Leveraging point annotations in segmentation
learning with boundary loss". Submitted.

III E. Breznik, E. Wetzer, J. Lindblad, N. Sladoje. "Cross-Modality
Sub-Image Retrieval using Contrastive Multimodal Image
Representations". Under review.

IV E. Breznik, F. Malmberg, J. Kullberg, H. Ahlström, R. Strand.
"Multiple comparison correction methods for whole-body magnetic
resonance imaging". Journal of Medical Imaging, 7(1), 2020.

Reprints were made with permission from the publishers.


Summary of Contributions
The roman numerals correspond to the numbers in the list of papers.

I I developed the idea independently, designed the experiments, wrote
the code, ran and analyzed the experiments and wrote the paper.

II I developed the idea independently. Extensions to the code were done
jointly with the first co-author and with input from Filip Malmberg.
The experiments were designed and analyzed together with the first
co-author. I wrote the paper with input from co-authors.

III The experimental design, code writing, and results analysis were done
together with the first co-author. The paper was written jointly with all
co-authors.

IV I developed the ideas, designed the experiments, wrote the code, and
performed and analyzed the experiments with support from co-authors.
The paper was written with input from all the co-authors.
Related Work

In addition to the papers included in this thesis, the author has also written or
contributed to the following works:

Associated (Peer-reviewed) Acts


R1 S. Kundu, S. Banerjee, E. Breznik, D. Toumpanakis, J. Wikström, R.
Strand, A. Kumar Dhara. "ASE-Net for Segmentation of
Post-operative Glioblastoma and Patient-specific Fine-tuning for
Segmentation Refinement of Follow-up MRI Scans". Submitted for
journal publication.

R2 R. Heil, E. Breznik. "A Study of Augmentation Methods for
Handwritten Stenography Recognition". Submitted for conference
publication.

R3 I. Tominec, E. Breznik. "An unfitted RBF-FD method in a
least-squares setting for elliptic PDEs on complex geometries".
Journal of Computational Physics, 436:110283, 2021.

R4 R. Strand, S. Ekström, E. Breznik, T. Sjöholm, M. Pilia, L. Lind, F.
Malmberg, H. Ahlström, and J. Kullberg. "Recent advances in large
scale whole body MRI image analysis: Imiomics". In Proceedings of
the 5th International Conference on Sustainable Information
Engineering and Technology (SIET ’20). 10-15, 2021.

R5 E. Wetzer, E. Breznik, J. Lindblad, N. Sladoje. "Re-ranking strategies
in cross-modality microscopy retrieval". 19th IEEE International
Symposium on Biomedical Imaging (ISBI), 2022. [reviewed extended
abstract, poster]

R6 E. Breznik, F. Malmberg, J. Kullberg, H. Ahlström, R. Strand.
"Segmentation of abdominal organs in MRI". In Women in Machine
Learning (WiML) workshop at Neural Information Processing Systems,
2017. [reviewed extended abstract, poster]
Associated (not Peer-reviewed) Acts
P1 E. Breznik, E. Wetzer, J. Lindblad, N. Sladoje. "Label-Free Reverse
Image Search of Multimodal Microscopy Images". In Swedish
Symposium on Image Analysis (SSBA), 2022. [oral]

P2 E. Breznik, R. Strand. "Effects of distance transform choice in training
with boundary loss". In Swedish Symposium on Deep Learning
(SSDL), 2021. [poster]

P3 E. Wetzer, N. Pielawski, E. Breznik, J. Öfverstedt, J. Lu, C. Wählby, J.
Lindblad, N. Sladoje. "Contrastive Learning for Equivariant
Multimodal Image Representations". In The Power of Women in Deep
Learning Workshop at the Mathematics of deep learning Programme at
the Isaac Newton Institute for Mathematical Sciences, 2021. [oral]

P4 T. Asplund, K. Bengtsson Bernarder, E. Breznik. "CNNs on Graphs:
A New Pooling Approach and Similarities to Mathematical
Morphology". In Swedish Symposium on Deep Learning (SSDL), 2019.
[poster]

P5 E. Breznik, F. Malmberg, J. Kullberg, H. Ahlström, R. Strand. "Using
deep learning with anatomical information for segmentation of
abdominal organs in whole-body MRI". In Swedish Symposium on
Deep Learning (SSDL), 2018. [poster]

P6 E. Breznik, F. Malmberg, J. Kullberg, H. Ahlström, R. Strand.
"Statistical considerations in whole-body MRI analyses for Imiomics".
In Swedish Symposium on Image Analysis (SSBA), 2017. [oral]
Contents

1 Introduction .......................................................... 11
   1.1 Aims and contributions ........................................... 12
   1.2 Thesis outline .................................................... 12
2 Working with biomedical data .......................................... 14
   2.1 The challenges of working with medical images .................... 14
      2.1.1 Data availability and the effect on repeatability ........... 14
      2.1.2 Data heterogeneity ........................................... 15
      2.1.3 Reliability and interpretability ............................. 15
   2.2 Dataset overview .................................................. 16
      2.2.1 Private dataset .............................................. 16
      2.2.2 Open datasets ................................................ 17
3 Technical background .................................................. 19
   3.1 Medical image processing fundamentals ............................ 19
      3.1.1 Image segmentation ........................................... 19
      3.1.2 Content-based image retrieval ................................ 22
   3.2 Deep Learning ..................................................... 26
      3.2.1 Architectures ................................................ 26
      3.2.2 Training considerations ...................................... 28
      3.2.3 Evaluation ................................................... 31
   3.3 Distance transforms ............................................... 33
      3.3.1 Common distance definitions .................................. 33
      3.3.2 Using distance transforms within DL .......................... 35
   3.4 Statistical analyses .............................................. 36
      3.4.1 Hypothesis testing ........................................... 36
      3.4.2 Test multiplicity ............................................ 39
4 Improving segmentation learning ....................................... 43
   4.1 Patch-based learning in 3D ........................................ 43
      4.1.1 Motivation ................................................... 44
      4.1.2 Methods ...................................................... 44
      4.1.3 Results ...................................................... 46
      4.1.4 Conclusions .................................................. 47
   4.2 Guiding the learning under weak supervision ...................... 47
      4.2.1 Motivation ................................................... 47
      4.2.2 Methods and background experiments ........................... 48
      4.2.3 Results ...................................................... 50
      4.2.4 Conclusions .................................................. 51
5 Cross-modality image retrieval ........................................ 52
   5.1 Motivation ........................................................ 52
   5.2 Methods ........................................................... 53
      5.2.1 Step I: Bridging the modality gap ............................ 53
      5.2.2 Step II: Feature extraction and matching ..................... 53
      5.2.3 Step III: Reranking Strategies ............................... 54
   5.3 Results ........................................................... 55
   5.4 Conclusions ....................................................... 55
6 Handling statistical analyses in Imiomics ............................. 56
   6.1 Motivation ........................................................ 56
   6.2 Methods ........................................................... 57
      6.2.1 Evaluation strategies ........................................ 57
      6.2.2 Anatomically compliant corrections ........................... 57
   6.3 Results ........................................................... 59
   6.4 Conclusions ....................................................... 59
7 Conclusions and future work ........................................... 61
Sammanfattning på svenska ............................................... 63
Acknowledgements ........................................................ 65
References .............................................................. 67
1. Introduction

Ever since the first X-Ray image was made in 1895, medical doctors have been
increasingly relying on images and scans of different modalities for diagnosis
and disease progression monitoring [86]. With new technical developments
came new techniques, improved image resolution and faster, high-throughput
imaging protocols. Today, even whole-body scans can be obtained in a rea-
sonable time to be used not only for research but also in medical practice [96],
for example in diagnostic exploratory searches, tumour detection, staging, and
therapy evaluation [77].
Depending on the application and the reasons for acquiring the images,
they must be processed and inspected in specialized ways, typically relying on
expert knowledge. For example, apart from identifying and localizing various
pathologies, one might also need to measure the size or volume of objects in
the image. With the increased availability of imaging for daily use and the
unprecedented rise in the amount of data acquired, the human effort required
for the analyses has become the bottleneck in high-throughput medical imaging
pipelines. Numerous computer-based automatic and semi-automatic image
processing and analysis tools have emerged, aiming to reduce this bottleneck.
A very high-level (and by no means exhaustive) overview of a medical
image-processing-based workflow for diagnostics is shown in Figure 1.1. Ra-
diology and microscopy/pathology typically play complementary roles in image-
based diagnostics. While radiological scans are often used as a stand-alone
tool, they sometimes require additional processing or indicate the need for
more targeted examinations, like biopsies. Similarly, while examining histo-
logical images can be the end goal of a medical procedure, it can also indicate
the need for broader radiological imaging. Statistical analyses are commonly
required for practical use in decision-making in order to enable a sound and
reliable medical interpretation of the final imaging results and findings.
The work of this thesis is concerned with developing and improving com-
puterized, automated methods for different stages of the shown diagnostic
pipeline, namely the processing of radiological scans (by segmentation meth-
ods), biomedical image retrieval (with a focus on histological data), and the
final statistical analyses as applied to images.
While a quick literature search reveals an abundance of methods for im-
age segmentation, retrieval and analysis, these are generally data- or
application-specific. In addition, medical applications have the potential
for a strong societal impact. As such, they are subject to heavier requirements
with regard to accuracy, reliability and explainability.

Figure 1.1. A high-level overview of an example imaging-based diagnostic pipeline,
with the emphasis on the steps that were the focus of this thesis work. Microscopy and
radiological images can be used jointly or as stand-alone tools. Biomedical images
typically require further processing, handling and analyses, to be used as the basis for
diagnosis.

1.1 Aims and contributions


This thesis work comprises four papers with a connecting goal of improving
automatic image processing and analysis methods for use within diagnostics.
In paper I, we aim to improve the patch-based training paradigm for segmen-
tation in 3D. We propose and evaluate the inclusion of landmark-informed
maps on the problem of abdominal organ segmentation. Paper II investigates
the use of distances with various properties in training with boundary loss,
with the aim to improve segmentation training with point annotations and ul-
timately loosen the need for laborious pixel-wise annotations. To tackle the
problem of accurate image retrieval across multi-modal histological image
pairs, a contrastive-learning-based pipeline is proposed in paper III. Finally,
in paper IV, we evaluate a number of correction methods and proposed exten-
sions for statistical hypothesis testing using whole-body MR scans. The results
form the basis to discuss their applicability within analyses by Imiomics.

1.2 Thesis outline


As the methods and tools developed in this thesis target biomedical data, we
begin by discussing some particularities and challenges of such data in chap-
ter 2. We also provide a brief account of all the datasets used in this thesis in
section 2.2.
Chapter 3 offers the essential technical background and terminology under-
pinning the work in the included papers. Chapters 4 and 5 give more detailed
accounts of the work done in papers I and II, and in paper III, respectively.

Chapter 6 covers the work of paper IV. Finally, a short conclusion with possi-
ble future work directions is given in chapter 7.

2. Working with biomedical data

While wrangling medical data can lead to useful insights with potentially high
societal impact, it comes with its own set of challenges. In this chapter, we
first discuss the unique problems of working with biomedical data with regard
to data access, research reproducibility and real-life applications, followed by
a brief summary of all the datasets that have been used during the work on this
thesis and included papers.

2.1 The challenges of working with medical images


2.1.1 Data availability and the effect on repeatability
Medical data, and specifically medical images, can contain identifying infor-
mation, through which the patient identity may be recoverable. To protect
patient anonymity, special measures need to be taken. Following the General
Data Protection Regulation (GDPR) [18], stricter rules have been imposed on
online data handling and the process of anonymization, due to which many
datasets cannot be made publicly available for benchmarking.
In addition to anonymity, another limiting factor in medical data usage is
patient consent, which needs to be acquired before image acquisition and shar-
ing, but can usually be withdrawn at any point. This means that any
patient-based datasets need to be continuously curated and updated, poten-
tially affecting the repeatability of studies and method evaluations.
Working with data whose sharing is restricted limits not only
the reproducibility and repeatability of the published works using that data but
also potential collaborations and use of resources. For example, off-site su-
percomputing centres, while attractive for time-consuming methods, are often
off-limits if sensitive data needs to be moved around.
At the same time, in the medical image processing field, the work is often
motivated by dataset-specific questions that arise in collaborations with local
medical centres. While there is nothing negative in drawing inspiration for
method development from local, closed-source datasets, any results presented
on such data are of limited use to the broader research community. To en-
courage transparency, repeatability and fair comparisons, it is thus advisable
to, whenever possible, perform evaluations also on open, publicly available
datasets.
Nowadays, with faster and more widespread imaging, an increasing num-
ber of open datasets are becoming available, for example through open chal-
lenges [39, 9] or population-wide screening. A notable example is the
UK Biobank [95], a large-scale biomedical database with the goal of scanning
some 100k subjects.

2.1.2 Data heterogeneity


Radiological as well as microscopy data suffers from being very heteroge-
neous, in the sense that even when the data is of the same modality, it may
exhibit different properties (e.g. due to slight differences in imaging proto-
cols, machines and vendors, staining processes, the person carrying out the
procedure). This means that methods developed on a particular dataset may
not easily generalize to another, even within the same imaging modality. In
addition, multiple modalities may need to be combined for accurate diagnosis
or decision-making. Therefore, multimodal methods (either joining or gener-
alizing over multiple modalities), as well as domain adaptation techniques, are
now particularly interesting.
For many learning-based methods, or simply for evaluation purposes, anno-
tations of the desired output are required together with the raw data. How-
ever, these depend heavily on the expert creating them and give rise to yet
another source of heterogeneity and uncertainty in medical data due to intra-,
and inter-observer variability [69].
Data heterogeneity affects not only the generalizability of methods, but
also their comparability. For fair comparisons, the effect of data heterogeneity
should be removed; all compared methods should generally be evaluated on
the same datasets [88].

2.1.3 Reliability and interpretability


In medical applications, there is a need for very high – potentially human-level
– method accuracy, which is the driving force behind the incremental research
focused on the "search for that 1-point improvement". However, as discussed
above, inter-/intra-rater variability and human mistakes can affect the method
evaluation and put the objective interpretability of the results under question.
In addition, while high accuracy is desired, it is more important in practice
that the methods are stable and reliable, and that their behaviour is not only
predictable but also explainable. That way the medical experts can feel safe
and comfortable using them in practice [4].
To produce methods that can be used in a practical setting, the evaluation
should be done on multiple datasets when possible [15], and the expected
results and quality assessment should be interpretable. This means that the
evaluations should be provided together with statistical analyses and explana-
tions of their limitations, and be reliable also in a statistical sense [88].

2.2 Dataset overview
2.2.1 Private dataset
The work in this thesis was primarily developed for an in-house dataset of
whole-body MRI scans.

POEM
The POEM data comes from a Prospective investigation of Obesity, ENergy
production and Metabolism study on a healthy sample of 50-year-olds from
Uppsala. It is a local (not currently publicly available) cohort of whole-body
fat/water-separated 3D MR images. The cohort includes data from 502 pa-
tients.
The imaged field of view (FOV) was 530 × 377 × 2000 mm³, and the resolu-
tion anisotropic, with a reconstructed voxel size of 2.07 × 2.07 × 8.0 mm³ in
the left-right × anterior-posterior × foot-head directions. For additional technical
details regarding the properties and acquisition of the images see [59].

Figure 2.1. Example corresponding slices of water (above) and fat (below) content
images of a random subject from the POEM dataset [59].

In addition to the scans, a number of non-imaging measurements such as
bioimpedance analysis, triglycerides, joint positions, etc. are available for
each subject. For a small subset of 50 images, full pixel-wise annotations of
the liver, kidneys, bladder, pancreas and spleen are also provided. This dataset
was used in papers I, IV and II.

2.2.2 Open datasets
For additional benchmarking and reproducibility reasons, we also use a num-
ber of publicly available datasets for different tasks.

ISLES data
The Ischemic Stroke Lesion Segmentation (ISLES) 2018 challenge [39] is a
cohort of multimodal brain scans of 103 patients with ischemic stroke lesions.
The patients underwent CT perfusion imaging (CTP) followed by MRI
diffusion-weighted imaging (DWI). The provided imaging data consists of perfu-
sion maps (cerebral blood flow CBF, mean transit time MTT, cerebral blood
volume CBV, the time of maximum residue Tmax, and CTP source data). For
each subject, the DWI sequence was co-registered with CTP. See example pa-
tient images in Figure 2.2. Ground truth segmentation, based on the DWI
scans, is also provided for all lesions (for the training set). Scans are 3D vol-
umes of varying voxel size and slice thickness, containing only a few slices
each. For more details see [39].

Figure 2.2. All modalities for one example subject slice from the ISLES 2018 dataset
[39]. From left to right: CTP source data, CBF, CBV, MTT and Tmax.

This dataset was used for work leading to the ideas of paper II, and is in-
cluded also in the discussions and experiments of chapter 4.

ACDC data
The Automated Cardiac Diagnosis Challenge (ACDC) [9] is a public bench-
mark multi-class heart segmentation dataset. It contains cine-MR images of
150 patients, covering healthy scans and four types of pathologies in equal
amounts, with pixel-wise annotations for the right ventricle (RV), myocardium
(Myo) and left ventricle (LV) heart structures. The images are 3D volumes,
with anisotropic inter-slice spacing and varying spatial resolution. More de-
tails regarding the dataset are available in [9]. This dataset is used mainly
within paper II. See Figure 2.3 for a few example slices.

Figure 2.3. Example 2D slices from 4 patients from the ACDC dataset [9].

Multi-modal histological images


This is a publicly available 2D multi-modal histological image dataset [32]
that we used for the image retrieval work, and consists of 206 aligned pairs
of Bright-Field (BF) and Second-Harmonic Generation (SHG) tissue micro-
array images. SHG is a label-free, non-linear imaging modality [53], while BF
microscopy is a trans-illumination technique exploiting differences in light ab-
sorbance properties. BF images are generally of low contrast, thus staining is
often required to increase it. In this dataset, the standard Hematoxylin&Eosin
stain is used.
As this dataset was primarily introduced as a registration benchmark,
it contains both the reference and rigidly transformed versions for each pair.
The transformations were applied randomly, using rotations of up to ±30◦ ,
and random translations up to ±100 px in both directions. All images in the
dataset are of size 834 × 834 pixels. An example pair of images is shown in
Figure 2.4. We use this dataset for image retrieval method development in
paper III.

Figure 2.4. An example pair of Second-Harmonic Generation (SHG) and Bright-Field
(BF) microscopy images from the histological SHG and BF dataset [32]. From left:
original BF and SHG images, and randomly transformed versions of the same pair.

3. Technical background

This chapter briefly introduces the technical background of the tools and meth-
ods that are used in the included papers. The basic terms and tasks of image
processing are given in section 3.1. Section 3.2 covers the important con-
cepts of the deep learning methods applied in some of the included papers and
section 3.3 gives a brief account of distance transforms on images. Finally,
section 3.4 summarizes the essentials of statistical analyses on images.

3.1 Medical image processing fundamentals


In a computer, an image is represented as a matrix, with its elements holding
the intensity values. The differences in these intensities are on some level as-
sumed to correspond to differences in image content (i.e. presence or absence
of objects). The individual elements of the image are usually called pixels (in
2D) or voxels (in 3D or higher). In this thesis, the terms voxel and pixel are
used somewhat interchangeably.
Formally, an image can be viewed as a function, mapping a spatial domain
Ω ⊂ Z^d onto a set of intensity values, e.g. f : Ω → R, in the case of a d-
dimensional discrete image with arbitrary (real) intensity values. In general,
the images are often also discretized in intensity, mapping onto a discrete range
of values {0, 1, . . . , k} = Z_k.
Some examples of mid-level image processing tasks one encounters within
medical applications include classification (assigning a label, or class to an im-
age), pattern/object detection (finding and localizing a specified object within
an image, if it exists), segmentation (finding and perfectly delineating an ob-
ject of interest), and registration (transforming the coordinate system of one
image onto another, while preserving object correspondences). In the case of
microscopy data which is generally less identifiable than radiological scans,
the images are often stored in large databases after processing, to be searched
and retrieved at a later stage for further analyses or comparisons. This is
termed image retrieval. As the papers included in this thesis are mainly con-
cerned with segmentation and image retrieval, we describe these in more detail
below.

3.1.1 Image segmentation


Image segmentation is the problem of exactly delineating the object of inter-
est in the image. It boils down to a classification problem at the pixel level.

Formally, a (binary) segmentation of an object in an image can be viewed as
some function f from image domain Ω to a discrete set {0, 1}, where values 1
and 0 denote the presence and absence of the object respectively. In the case
of a K-class segmentation, it could analogously be defined as a mapping from
Ω to Z_K; however, for a more consistent notation throughout the thesis we assume
a so-called one-hot representation: f : Ω → {0, 1}^K, where f(x) sums to 1 for
all x ∈ Ω and the k-th element of f(x) represents the membership of pixel x in
class k.
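As a small illustration of this one-hot convention, the following minimal PyTorch sketch (PyTorch is the framework used for the DL methods in this thesis) converts an integer label map into a one-hot representation; the toy image size and class count are arbitrary:

    import torch
    import torch.nn.functional as F

    # Toy label map: one integer class per pixel, shape (H, W), values in {0, ..., K-1}.
    K = 3
    labels = torch.randint(0, K, (4, 4))

    # One-hot representation f : Omega -> {0, 1}^K.
    onehot = F.one_hot(labels, num_classes=K)      # shape (H, W, K)
    onehot = onehot.permute(2, 0, 1).float()       # (K, H, W), the usual CNN layout

    # Each pixel sums to 1 over the class dimension, as required above.
    assert torch.all(onehot.sum(dim=0) == 1)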
There are multiple flavours of image segmentation. For example, object
segmentation is concerned only with segmenting all areas of a specific class,
instance segmentation requires different instances of the same class to be seg-
mented separately, and semantic segmentation is the task of splitting the entire
image into the specified classes (i.e. each pixel needs to be segmented as one
of the available classes). A combination of the latter two, where each pixel
needs to be assigned a class, but individual instances of the same class need
to be separated too, is termed panoptic segmentation. For examples of the
different segmentation tasks, see Figure 3.1.

Figure 3.1. An illustration of various segmentation tasks. From left to right: original
image, (multiple) object segmentation, instance segmentation, semantic segmentation
and panoptic segmentation.

Traditionally, image segmentation methods often included thresholding,
clustering, edge detection, etc. [36]. In recent years, however, machine
learning and particularly deep learning (DL) based techniques explained in
section 3.2 have established themselves as the state-of-the-art for segmenta-
tion tasks [70, 62].

Evaluation
When evaluating segmentation methods, the segmented output is usually com-
pared to the ground truth (i.e. the manually segmented reference, desired
output) in a chosen evaluation metric. The choice of appropriate metric will
depend on what we are segmenting and why, as different quantities may be
important in different applications (e.g. accurate volumetric measure, exact
delineation, or best object coverage). In addition, different metrics have dif-
ferent properties and limitations that can affect their interpretability [81]. To
evaluate whether the output of a segmentation method really represents the

ground truth well, it is thus important to combine multiple metrics and visual
inspection of the results [88]. The most commonly used metrics for segmen-
tation evaluation include Dice score [27] and symmetric Hausdorff distance
[44], which are used also in the included papers.

Dice score
The Dice score [27] (commonly abbreviated as DSC, for Dice Similarity or
Dice-Sørensen Coefficient) is a similarity measure, measuring overlap between
images. For a given image and a single object we aim to segment, let G, S :
Ω → {0, 1} be the ground truth and the segmentation that we wish to evaluate,
respectively. Then the Dice score of the segmentation S given the ground truth
G can be formally defined as
DSC(S, G) = \frac{2 \sum_{x \in \Omega} G(x) \, S(x)}{\sum_{x \in \Omega} G(x) + \sum_{x \in \Omega} S(x)}    (3.1)
By definition, the Dice score attains values in the range [0, 1], with 1 corre-
sponding to a perfect segmentation. It is sometimes also called the F1 score, and
is very similar to another well-known metric, intersection over union (IoU,
also called the Jaccard index): IoU = DSC / (2 − DSC).
The DSC definition in Equation 3.1 is primarily intended for evaluating a
single class/object segmentation. For a multi-class setting with K classes, there
is no one clear definition of a summarizing Dice score metric. One option
would be to simply calculate the average of Dice scores over all classes of
interest, or consider all pixels where the labellings of G, S : Ω → {0, 1}^K agree,
equally: |{x ∈ Ω : G(x) = S(x)}| / |Ω|, where | · | denotes set cardinality (i.e. number of
elements). But without any distinction between the different classes, these
definitions favour the majority classes, which makes them non-representative
of the actual segmentation success when classes are severely imbalanced (as
is often the case in medical applications). To account for differing class sizes,
a so-called Generalized Dice was proposed in [20], weighting the individual
classes by the inverse of their size in the ground truth:

DSC(S, G) = \frac{2 \sum_{k=0}^{K-1} w_k \sum_{x \in \Omega} G_k(x) \, S_k(x)}{\sum_{k=0}^{K-1} w_k \sum_{x \in \Omega} \left( G_k(x) + S_k(x) \right)}    (3.2)

with weights w_k = 1 / \sum_{x \in \Omega} G_k(x).
While the summarization of the metric is sometimes inevitable, it is good
practice to always show the results for the individual classes for improved
interpretability [71].
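A minimal NumPy sketch of the binary Dice score (Equation 3.1) and the class-weighted Generalized Dice (Equation 3.2) may look as follows; the handling of empty masks and the small eps constant are implementation choices, not part of the definitions:

    import numpy as np

    def dice(seg, gt):
        """Binary Dice score (Equation 3.1) between boolean arrays."""
        seg, gt = seg.astype(bool), gt.astype(bool)
        denom = seg.sum() + gt.sum()
        if denom == 0:                       # both empty: treat as perfect agreement
            return 1.0
        return 2.0 * np.logical_and(seg, gt).sum() / denom

    def generalized_dice(seg_onehot, gt_onehot, eps=1e-8):
        """Generalized Dice (Equation 3.2) over one-hot arrays of shape (K, ...),
        weighting each class by the inverse of its ground-truth size."""
        axes = tuple(range(1, gt_onehot.ndim))
        w = 1.0 / (gt_onehot.sum(axis=axes) + eps)          # w_k = 1 / sum_x G_k(x)
        num = 2.0 * (w * (seg_onehot * gt_onehot).sum(axis=axes)).sum()
        den = (w * (seg_onehot + gt_onehot).sum(axis=axes)).sum()
        return num / (den + eps)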

Hausdorff distance
As opposed to the Dice score, the Hausdorff distance [44] is a dissimilar-
ity measure: lower numbers mean the evaluated segmentation is better, more
similar to the ground truth.

It measures the distance between the surfaces of the segmented objects,
hence focusing on the segmentation boundaries. Reusing the notation from
above, assuming a binary segmentation, the Hausdorff distance between a
segmented object X = {x | x ∈ Ω, S(x) = 1} and its ground truth Y = {x | x ∈
Ω, G(x) = 1} can be defined as

HD(X, Y) = \max \{ d(X, Y), d(Y, X) \}    (3.3)

where d(X, Y) is a directed distance based on some norm \| \cdot \| : Ω → R (usually
L2 or L1 on the discrete image domain):

d(A, B) = \max_{a \in A} \min_{b \in B} \| a - b \|.    (3.4)

The definition in Equation 3.3 is also called a symmetric Hausdorff dis-
tance, as it takes into account the distances in both directions. As it is very
sensitive to noise/outliers and medical images generally tend to be noisy, we
mostly use a 95th percentile version of HD, where the maximum in Equation 3.4
is replaced by the 95th percentile of the distances. The 95th percentile adjusted
metric is denoted by HD95.
Same as in the case of DSC, the definition in Equation 3.3 is valid for a
single object/class. While per-class scores should be reported for a clearer in-
terpretation of results, average HD95 over all classes can be used as a summa-
rization metric. As it is a boundary- and not overlap-/count-based metric (like
DSC), the class prevalence does not directly affect the average over classes.
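One possible way to compute HD95 is to extract the object surfaces and read directed surface distances off a Euclidean distance transform; the sketch below, using SciPy, follows this approach (voxel spacing and the handling of empty masks are left to the caller):

    import numpy as np
    from scipy import ndimage

    def _surface_distances(seg, gt, spacing=None):
        """Distances from the surface voxels of `seg` to the surface of `gt`."""
        seg, gt = seg.astype(bool), gt.astype(bool)
        seg_surf = seg ^ ndimage.binary_erosion(seg)     # boundary voxels of seg
        gt_surf = gt ^ ndimage.binary_erosion(gt)        # boundary voxels of gt
        # Distance of every voxel to the nearest gt surface voxel.
        dist_to_gt = ndimage.distance_transform_edt(~gt_surf, sampling=spacing)
        return dist_to_gt[seg_surf]

    def hd95(seg, gt, spacing=None):
        """Symmetric 95th percentile Hausdorff distance: Equation 3.3 with the
        maximum in Equation 3.4 replaced by the 95th percentile."""
        d_sg = _surface_distances(seg, gt, spacing)
        d_gs = _surface_distances(gt, seg, spacing)
        return max(np.percentile(d_sg, 95), np.percentile(d_gs, 95))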

3.1.2 Content-based image retrieval


The term content-based image retrieval (CBIR) stands for retrieving images
from a database based on their content and the content of the query. Generally
speaking, a query can be of any type; a property, label, text, image, etc. In the
scope of this thesis, we focus on the case when the query is also an image; such
CBIR is sometimes termed query-by-example, or reverse image search (RIS).
When only a small image segment or patch is provided as a query image, the
task of finding the images containing the query is called sub-image retrieval,
or s-CBIR for short.
When querying a database, the end-goal is finding images that are similar
to the query, for example, in terms of properties it exhibits, what objects it con-
tains, or what class it belongs to. This is particularly prevalent in the diagnos-
tic pipelines, where databases of images with known pathologies are queried
with images of new, undiagnosed patients to inform or confirm their diagnosis
[72]. The concept of similarity is very important here; if we query an image
of a pulmonary embolism-affected lung, retrieving a brain scan of an embolic
stroke patient may be more informative than retrieving scans of healthy lungs
[90]. Based on the desired retrieval outcome, (s-)CBIR tasks can be divided
into category- and instance-level retrieval. Category-level corresponds to the
retrieval of images that are similar to the query in terms of belonging to the
same class, while instance-level covers cases where each query has a single
correct match within the database.

Figure 3.2. An illustration of the common essential steps in a general CBIR pipeline.
Most CBIR systems to date consist of two main parts [3], namely feature
extraction and similarity-based matching; see Figure 3.2 for an illustration
of the process. Feature extraction refers to the calculation or extraction of
important features (properties) from the query as well as all database images.
The matching step refers to the process of matching the (aggregated) extracted
features of the query image to those of the database images via a chosen dis-
tance/similarity measure. The final output of the system is a selected number
of the best (highest similarity) matches.
With growing image databases, the computational efficiency of (s-)CBIR
systems is of utmost importance for practical use. To lay the basis for the
discussion of the included papers, we briefly summarize the feature extraction
and matching steps, as well as discuss ways of evaluating CBIR systems.

Feature extraction
The choice of feature extractors in the context of image retrieval depends on
the type of similarity that is expected to be exhibited by a good query-retrieval
image match. Typically features can encode colour, shape, texture, etc. This
construction, or hand-crafting, of the application-specific features, however,
requires extensive domain knowledge. As in many image processing appli-
cations, DL approaches have become very common even in image retrieval
[62, 28]. By using neural networks (NNs) as feature extractors, we can cir-
cumvent the laborious and expertise-based handcrafting of features.
More often than not, the query and database images are not expected to be
aligned in any way, even when searching for exact matches. Thus invariance to
a certain amount of transformations can be a desirable property in a feature ex-
tractor. In paper III, we employ and compare two very well-known and widely
used classical local feature extractors, robust to various transformations: the

Scale Invariant Feature Transform (SIFT) [64] and Speeded Up Robust Fea-
tures (SURF) [7]. In addition, we use an NN-based feature extractor based
on an adapted version of the so-called ResNet architecture, explained in more
detail in section 3.2.

The SIFT transform consists of first finding interesting keypoints and calculating
their lower dimensional descriptions. The keypoints of interest in the image
are found by repeatedly smoothing the image with Gaussian kernels of in-
creasing width, and finding the extremal points of the differences between
adjacent smoothing levels. The local point descriptors, describing the ap-
pearance of the regions around the points, are subsequently calculated based
on their oriented histograms and are invariant to rotation, translation, scaling,
and partially even to illumination changes and shear.

SURF works in a similar way to SIFT, using a different approximation for the
scale-space (the differences between the smoothed images at varying scales),
which is more computationally efficient than the one of SIFT. The local point
descriptors again take into account orientation and intensity changes in the
local neighbourhood of each point but are calculated in a different way than
within SIFT, forming a descriptor that is half the size of the SIFT descriptor.
Hence SURF features tend to have similar performance as SIFT for image
matching but are more efficient to compute.
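For a concrete example of classical local feature extraction, the sketch below extracts SIFT keypoints and 128-dimensional descriptors with OpenCV; note that SIFT is available in standard opencv-python builds from version 4.4 onwards, while SURF typically requires the contrib/non-free build, and the file name here is only a placeholder:

    import cv2

    img = cv2.imread("query_patch.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # The number of keypoints varies from image to image, which is why an
    # aggregation step such as Bag-of-Words is needed before matching.
    print(len(keypoints), descriptors.shape)   # descriptors: (num_keypoints, 128)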

Some available feature extractors produce an equal, predefined number of fea-
tures for every image (e.g. NN based extractors or general dense extractors).
In the case of SIFT and SURF, however, the number of extracted features per
image can be arbitrary. An intermediate step of feature aggregation is thus re-
quired in order to produce comparable, equally-sized, compact image descrip-
tors. A Bag-Of-Features (BOF), also called Bag-Of-Words (BoW) method,
is a frequently used approach to feature aggregation, which we also applied
in paper III. It requires clustering the set of all database image features into
a predefined number (also called vocabulary size) of clusters. The cluster
membership histogram (frequency vector) of all image features is then used
as an aggregated BoW image descriptor, or image encoding. Instead of di-
rectly matching individual features, these aggregated image descriptors are
compared in the matching step.
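A minimal BoW aggregation, sketched here with scikit-learn's k-means (the vocabulary size of 512 is only illustrative), clusters the pooled database descriptors and encodes each image as a normalised histogram of visual-word assignments:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(all_db_descriptors, vocab_size=512, seed=0):
        """Cluster the pooled descriptors of all database images into visual words."""
        return KMeans(n_clusters=vocab_size, random_state=seed).fit(all_db_descriptors)

    def bow_encode(descriptors, vocabulary):
        """L2-normalised histogram of visual-word assignments for one image."""
        words = vocabulary.predict(descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / (np.linalg.norm(hist) + 1e-12)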

Similarity matching
The database images need to be ranked according to how similar their fea-
ture representations are to that of the query image. As the representations
are equally sized, simple distance measures (e.g. Euclidean, Manhattan, see

section 3.3 for more) can be used to evaluate how close (similar) they are. In
paper III we calculate the similarities using a cosine similarity measure, which
is used most frequently in combination with BoW descriptors. It is defined as
d_{cos}(v_1, v_2) = \frac{v_1 \cdot v_2}{\| v_1 \| \, \| v_2 \|}    (3.5)

where v_1, v_2 stand for the two compared descriptors (frequency vectors), and
\| \cdot \| is a vector norm.
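In code, the matching step then reduces to computing Equation 3.5 between the query descriptor and every database descriptor and sorting; a small NumPy sketch:

    import numpy as np

    def cosine_similarity(v1, v2, eps=1e-12):
        """Cosine similarity between two descriptors (Equation 3.5)."""
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + eps))

    def rank_database(query_vec, db_vecs):
        """Indices of database descriptors, sorted from most to least similar."""
        sims = np.array([cosine_similarity(query_vec, v) for v in db_vecs])
        return np.argsort(-sims)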

Evaluation
The desired output of an (s-)CBIR system is a ranked set of K best (most
similar) matches for each query. When attempting a category-level retrieval
task, many of the retrieved matches may be correct, i.e. corresponding to what
we wished to retrieve. In instance-level retrieval, on the other hand, there
is only one correct match, which either is or is not found within the first K
retrievals. Depending on the type of retrieval and whether or not the actual
ranking of the correctly retrieved images within the first K best matches is of
interest, different evaluation metrics may be suitable.

(Average) Precision at K
Let I_j denote the indicator function of a correct match at rank j for a given
query (i.e. if the image at rank j is a valid match, I_j = 1, else I_j = 0). Ap-
plicable when caring only about the number of correctly retrieved images, the
Precision at K (P@K) is defined as

P@K = \frac{\sum_{k=1}^{K} I_k}{\min(K, n)}    (3.6)
where n stands for the number of all database images correctly corresponding
to the query. If, however, a manual inspection of the first K matches is not
reasonable for the application at hand and the first-place matches are the only
ones of interest, the actual ranks of the correct matches can be included in the
metrics. This rank-adjusted precision measure is called Average precision at
K, or AP@K:
AP@K = \frac{\sum_{k=1}^{K} I_k \cdot P@k}{\sum_{k=1}^{K} I_k}.    (3.7)
As this is a per-query success metric, it can be averaged over multiple queries
to be more representative of the CBIR system success (this query-averaged
version is denoted by mAP@K, where m stands for mean).
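Both metrics are straightforward to compute from the per-rank indicator list; a small sketch following Equations 3.6 and 3.7 literally, including the min(K, n) denominator:

    def precision_at_k(indicators, k, n_relevant):
        """P@K (Equation 3.6); `indicators` is the per-rank 0/1 correctness list."""
        return sum(indicators[:k]) / min(k, n_relevant)

    def average_precision_at_k(indicators, k, n_relevant):
        """AP@K (Equation 3.7); returns 0 if no correct match is retrieved."""
        num, hits = 0.0, 0
        for j in range(1, k + 1):
            if indicators[j - 1]:
                num += precision_at_k(indicators, j, n_relevant)
                hits += 1
        return num / hits if hits else 0.0

    # mAP@K is then simply the mean of average_precision_at_k over all queries.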

Accuracy at K
When dealing with instance-based retrieval with only one possible correct
match per query, the set of viable queries is usually well-defined. In such

problems, the P@K metric results in a binary indicator of success and can be
averaged over the query set to produce a measure of Accuracy at K (some-
times denoted also by Acc@K). In paper III, we refer to this value as a top-K
retrieval success, as it indicates for what fraction of queries the CBIR system
was successful in retrieving the correct image within the first K matches.

3.2 Deep Learning


The term machine learning [11] covers methods and algorithms that repeatedly
adapt to the given data in an automated way, according to some predefined ob-
jectives. This process of repeated adaptation with the aim of solving a specific
task is referred to as training. A specialized subset of machine learning tech-
niques, called deep learning (DL) [38], concerns the use of (artificial) neural
networks (NN).
In a NN, multiple layers of neurons, which compute a weighted linear com-
bination of their inputs followed by some nonlinear activation function, are
stacked on top of each other. These layers are applied consecutively and
produce increasingly higher-level feature representations of the original input.
During training, the network learns the neuronal weights through backprop-
agation of the objective/loss function values. The depth in DL relates to the
number of layers in a NN. Generally speaking, a deeper network has a higher
representational power (i.e. is able to learn more complex things).
Neural network-based models can be efficiently implemented on GPUs and
to date constitute the state-of-the-art approaches for most image processing
tasks [49, 62]. In this thesis, all DL methods are implemented in PyTorch [76]
(unless stated otherwise).

3.2.1 Architectures
In DL, the term architecture refers to the design choices when building a net-
work. To date, numerous architectures have been proposed for various image
processing problems [49, 62]. While NNs can be used on different types of
data, we focus here only on image inputs. A NN layer comprised of neurons
which work globally (i.e. use the entire input in their calculations) is called
a dense or densely-connected layer. A convolutional layer, on the other hand,
consists of locally-focused neurons, that are connected to only a subset of the
input at a time and applied in a sliding window fashion (i.e. applying one
neuron corresponds to convolving a filter with the input image, followed by a
nonlinearity). The size (in terms of the area of the input they are connected to)
of such neurons is then referred to as kernel size.

Convolutional layers are much more computationally efficient than dense
layers, and their number of parameters scales better with the size of input data.

Hence convolutional neural networks (CNNs) are particularly well suited for
application on images. Contemporary CNN architectures consist of a com-
bination of convolutional layers and other types of layers like dropout, batch-
normalization, etc. For more details on different layers and design possibilities
see [38]. In papers I and II, we employ CNNs for medical image segmentation,
and in paper III, CNNs are used both for feature extraction and style transfer.
More specifically, we use a U-Net [82], DeepMedic [50] and a simple Vanilla
CNN architecture in paper I, and an E-Net [75] architecture in paper II. In paper
III, ResNet [40] is used within a replacement study, and Tiramisu [48] within
a method used as a part of the proposed pipeline.
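To make the earlier efficiency comparison between dense and convolutional layers concrete, the toy PyTorch sketch below counts the parameters of a single densely connected layer versus a single 3 × 3 convolutional layer on a 64 × 64 single-channel input; the layer sizes are arbitrary and chosen only for illustration:

    import torch.nn as nn

    dense = nn.Linear(64 * 64, 64 * 64)                 # one densely connected layer
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)    # one 3x3 convolutional layer

    def n_params(module):
        return sum(p.numel() for p in module.parameters())

    print(n_params(dense))   # 16 781 312 parameters, and this grows with image size
    print(n_params(conv))    # 10 parameters, independent of the image size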

Vanilla networks refer to simple feed-forward networks without any added
complexity. ResNet is a very deep convolutional architecture with additional
residual (also known as skip) connections between nonadjacent layers, which
improve the gradient flow during training. The U-Net architecture has a U-
shaped (encoder-decoder) structure, combining blocks of convolutional lay-
ers with down- and up-sampling layers and skip connections, and DeepMedic
consists of two fully convolutional pathways, each operating on a different
scale (sampling) of the input image, which are concatenated in the last lay-
ers and trained jointly. E-Net is a lightweight and very efficient version of an
encoder-decoder structure. It is inspired by SegNet [5], which uses the gen-
eral u-structure of the U-Net but utilizes the pooling indices from the encoder
for the upsampling step instead of learning it. Improving on SegNet, E-Net
applies heavy downsampling in the beginning layers, and uses different types
of convolution and a more compact decoder. The Tiramisu architecture [48]
has a similar u-shaped structure with skip connections as U-Net, but includes
dense blocks in both the encoder and decoder paths. Within the dense blocks,
the layers are densely connected by skip connections.

The so-called Generative Adversarial Networks, or GANs for short [37, 49],
form a special type of NN architecture. These consist of a generator and a dis-
criminator, competing in a zero-sum game. The generator learns to generate a
synthetic image based on the distribution of the input data, while the discrim-
inator classifies the representation as generated or real, thereby pushing the
generator to produce images indistinguishable from the real ones. The result-
ing generated images are often termed "fake". To date, GANs have been used
successfully within medical applications like segmentation [17], data genera-
tion and modality transfer [101]. Two GAN architectures, pix2pix [47] and
cycleGAN [106], are used also within the replacement study part of the paper
III, for the task of modality transfer. Pix2pix uses a conditional GAN to learn
a mapping of an input image to a representation of this image in a desired
style. It requires pairs of images (corresponding images in the raw and desired
styles) as input. CycleGAN differs from pix2pix by achieving this goal with-
out requiring examples of image pairs. Instead, it employs a cycle consistency

enforcing that the input image can be reconstructed back from the newly gen-
erated image.

The presented networks comprise only a small subset of available architec-
tures. A more comprehensive survey of various networks can be found e.g. in
[62, 49, 55].

3.2.2 Training considerations


Training of neural networks is equivalent to finding parameters (network weights)
that optimize a given cost (loss) function. However, finding a combination that
is appropriate and well-performing for any given problem also requires a careful
setting of hyper-parameters. These are decided before training starts, and in-
clude e.g. architecture design, learning rate and scheduling, optimizer, and
loss functions. It has been shown [84, 24] that injecting the networks with
prior knowledge can improve their performance, speed up the convergence
and make them more robust w.r.t. input and parameter settings. Below we dis-
cuss a few important considerations and terminology relating to the training
process of CNNs. Note that we use the terms training and learning somewhat
interchangeably.

Types of supervision
Supervision is related to the type of data neural networks are trained with.
If there is no ground truth available during training, it is said that the net-
work is trained in an unsupervised manner. In contrast, supervised training
means that the network is shown both data and the expected output connected
to it. When ground truth is available only for a subset of the training dataset,
such a training setting is called semi-supervised. Furthermore, if the ground
truth exists for each sample but is in some way incomplete, the training is
said to be carried out under weak supervision. Different types of learning can
also be joined within the same pipeline. When a network, for example, uses
unstructured data to learn labels (i.e. unsupervised) which are then used for
supervising downstream learning tasks, the training is called self-supervised.
Sometimes, only a certain kind of information is available about the data; for
example which samples are similar (and which are different). When the net-
work is trained with such information instead of the ground truth in terms
of the desired output labels, this is called (self-supervised) contrastive learn-
ing. In papers I and II, we employ full and weak supervision for segmentation
learning, respectively, and in paper III the proposed pipeline relies on con-
trastive learning.

Effective receptive field


The receptive field is the amount of the input image that the network takes into
consideration during learning, and that affects the final output.

Figure 3.3. An illustration of the receptive field growth by stacking convolutional
layers. A single unit of output relies on the information of a 7 × 7 area of the input in
a simple 4-layer convolutional network with kernels of size 3 × 3.

With densely connected layers, a single output in a layer is connected to the
entire image in the previous layer, thus taking into account the global information of the
whole input. In convolutional neural networks on the other hand, a single out-
put unit of a layer is connected only to a local neighbourhood of that unit in
the previous layer, limiting the amount of input the network can base its deci-
sion for that unit on. However, the local information aggregates over stacked
layers, and with enough layers, the receptive field can actually be extended to
the size of the entire image (see Figure 3.3 for illustration). Larger receptive
fields enable networks to capture more spatial context. Receptive field size
can be computed in closed-form and depends on the network architecture. It
can be further increased with layers of pooling, dilated convolution, etc. This
calculated receptive field however provides only a theoretical (upper) bound
on its size. It has been shown experimentally that the actual, effective receptive
field is usually smaller [65], meaning that only a portion of the theoretically
calculated receptive field area has a non-negligible contribution to the output.
As wider context tends to be useful in learning tasks, the networks are com-
monly designed in a way that their receptive field covers the entire relevant
input image region. Alternatively, the information that resides in the global
context can be to some extent replaced by the addition of problem-specific
prior information.
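
As a side note on the closed-form computation mentioned above, the theoretical receptive field of a plain sequential CNN can be obtained with a few lines of code. The sketch below (plain Python, ignoring padding and assuming square kernels) only illustrates the calculation and is not tied to any particular network used in the papers.

```python
def receptive_field(layers):
    """Theoretical receptive field of a sequential CNN.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Returns the side length (in input pixels) that one output unit depends on.
    """
    rf, jump = 1, 1  # current receptive field size and cumulative stride in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions with stride 1 yield a 7x7 receptive field;
# pooling or dilated convolutions grow it faster.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # -> 7
```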

Losses
During training, the network’s parameters are repeatedly optimized to min-
imize the chosen loss. The choice of loss depends on the problem at hand
and the specifics of the data. In supervised training, it is designed to measure
the difference between the network output and the ground truth. Contrastive
learning on the other hand requires a loss that pushes the network to learn sim-
ilar/different representations for the positive/negative pairs. In the context of
segmentation, more well-known losses include Cross-entropy (CE), Dice loss,
Focal loss, Tversky loss and Boundary loss [66]. These are used in papers I
and II under both full and weak supervision. For contrastive learning, triplet

loss [87] is typically used. It compares a matching (positive) sample and a
non-matching (negative) sample to a reference input, minimizing the distance
between positive and maximizing the distance between negative pairs.
For the definitions that follow, let s represent the network output and G
the one-hot encoding of the desired ground truth labelling, G : Ω → {0, 1}^K,
where Ω is again the image domain, G_k denotes the k-th output of G and K
is the number of classes. When dealing with segmentation problems, the raw
network output is usually sent through a softmax nonlinearity [13] to produce
values in the range [0, 1] that can be interpreted as probabilities. Hence s : Ω →
[0, 1]^K, where s(x) sums to 1 for any x ∈ Ω and s_k(x) (the k-th element of the output at
x) represents the probability of x belonging to class k.

Pixel-wise CE loss
Cross entropy loss stems from the Kullback-Leibler divergence which mea-
sures dissimilarity between two distributions. For a multi-class problem with
K classes, it is defined as
$$\mathcal{L}_{CE}(s, G) = -\frac{1}{N} \sum_{k=1}^{K} \; \sum_{\substack{x \in \Omega \\ G_k(x)=1}} \log(s_k(x)) \qquad (3.8)$$

with N representing the number of pixels, N = |Ω|. This definition of CE as-


signs equal importance to every pixel in the image. For imbalanced problems,
weighting by inverse class frequencies can help shift the focus from the ma-
jority classes.
Another adaptation of CE, the Focal loss [58], puts more emphasis on harder
examples, i.e. it down-weights the loss contribution of pixels which the network
already classifies with low uncertainty.

Dice loss
The Dice loss as used for training CNNs is generally defined simply as
$$\mathcal{L}_{DSC}(s, G) = 1 - DSC(s, G) \qquad (3.9)$$
for a chosen definition of the DSC metric (see section 3.1.1). The only dif-
ference is that the DSC in this case is calculated directly on the output prob-
abilities instead of the class labelling (which is then sometimes also termed soft
Dice).
Multiple versions of the Dice loss have been proposed for segmentation
training [46]. Tversky loss [85] can be seen as a version of Dice loss, weighing
false positives differently than false negatives (while the original Dice loss
weighs both equally).
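
For illustration, a minimal soft Dice loss sketch is given below, assuming PyTorch tensors and a one-hot encoded ground truth; the small eps term is a common numerical-stability addition and is not part of Equation 3.9, and the exact formulation used in the papers may differ.

```python
import torch

def soft_dice_loss(probs: torch.Tensor, target_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """1 - soft DSC, computed on probabilities and one-hot ground truth of shape (B, K, H, W)."""
    dims = (0, 2, 3)  # aggregate over batch and spatial dimensions, keep classes separate
    intersection = (probs * target_onehot).sum(dim=dims)
    cardinality = probs.sum(dim=dims) + target_onehot.sum(dim=dims)
    dsc = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dsc.mean()  # average the per-class soft DSC
```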

Boundary loss
Boundary loss (BL) was first proposed in [54] with the aim to improve seg-
mentation accuracy in problems with highly imbalanced classes. Both Dice

and CE involve summations over entire regions, which can have a detrimental
effect on training performance if the differences in class sizes are very large.
Instead, BL is calculated on the space of object contours, integrating only over
values in the areas between the segmented and the ground truth boundaries. In
order to put a higher penalty on larger deviations, the integration is done over
distances from the ground truth boundary. It is formally defined as:
$$\mathcal{L}_{BL}(s, G) = \sum_{k=1}^{K} \sum_{x \in \Omega} s_k(x)\, \phi_G^{(k)}(x) \qquad (3.10)$$
where $\phi_G^{(k)}$ is a signed distance map (see section 3.3 for a detailed explanation)
of G, computed on the class k mask. Boundary loss actually complements
regional information, and is in practice often combined with losses like Dice
and CE for better stability (especially in the binary segmentation case). As op-
posed to the Dice and CE losses, the boundary loss can attain negative values.
While it is bounded, its bounds depend on the image data and the distances
used.
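
To make Equation 3.10 concrete, a minimal sketch is given below; it assumes PyTorch tensors of shape (batch, classes, height, width) and precomputed signed distance maps, and it is not the exact implementation used in papers I and II, where the loss is typically combined with a regional loss such as Dice.

```python
import torch

def boundary_loss(probs: torch.Tensor, dist_maps: torch.Tensor) -> torch.Tensor:
    """Boundary loss (Eq. 3.10): sum over classes and pixels of s_k(x) * phi_G^(k)(x).

    probs:     softmax outputs of shape (B, K, H, W)
    dist_maps: signed distance maps of the ground truth, same shape,
               negative inside and positive outside each class mask
    """
    per_sample = (probs * dist_maps).sum(dim=(1, 2, 3))  # sum over classes and pixels
    return per_sample.mean()                              # average over the batch
```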

3.2.3 Evaluation
There are two important aspects of evaluating how a neural network performs.
Firstly, there is the application-dependent evaluation of the final results in
comparison to the expected/desired ground truth. For the case of segmen-
tation learning for example, two common metrics have been described in sec-
tion 3.1.1. Secondly, the behaviour of a model throughout the training should
also be taken into account. To some extent, these model properties can be
observed through the inspection of training curves.
For both aspects it is important to not rely on a single measurement but
carry out the experiments repeatedly. Utilizing data in an appropriate way (in
terms of splitting and input size) also plays a vital role.

Why receptive field matters


Due to the memory limitations of modern-day GPUs, CNNs are sometimes
trained with only parts, cropped segments of the original images. Evaluation is
however often still required on the whole images. In such cases, the receptive
field of the model can have a considerable effect on the final results.
Some very deep networks have large receptive fields (e.g. the 101-layer-deep
ResNet called ResNet-101 [40] has a receptive field of more than 900^2 pixels).
Nevertheless, they may still be trained on smaller images, often with
advantage. However, training a network with images smaller than its recep-
tive field is effectively the same as feeding it zero-padded images. Hence the
network does not learn how to incorporate a larger spatial context, which is
present when evaluating the model on larger images. Consequently, the final
decision/output of the network is based on the input in an uncontrolled way.

In short, while seeing more context is generally beneficial in CNNs, pro-
viding this context when a network has not been trained to use it can lead to
unexpected outputs and degradation in performance. For a fair evaluation, the
models should thus be evaluated on inputs of the same size as the training data.

The importance of splitting the data


As CNNs are data-driven methods optimized for the best fit to shown samples,
they should always be evaluated with unseen datasets. Typically the available
data is split three ways prior to any training, namely training, validation and
test sets [88, 15]. While training samples serve as the basis for backprop-
agation and weight update, the validation samples do not affect the model
directly. Instead one usually monitors metrics, calculated on the validation set
throughout the training, to inform settings of hyperparameters (e.g. learning
rate, early stopping, model choice). When the training is finished and the best
model is chosen, the final evaluation is done on the holdout (i.e. not visible
during training) test dataset. The validation set also offers an opportunity to
gain intuition on how the model will perform on unseen data (how well it can
generalize), however, as soon as it is used for tuning any hyperparameters, the
evaluation is not unbiased anymore. When the test (and even validation) set is
not held separate from training, there is always a danger of data leakage (i.e.
using information from outside the training dataset for model creation), and
any insights from post-training evaluation on such data are unreliable.
As neural networks typically require ample amounts of data which is often
not available in real-life applications, removing the need for the test set is
often desirable. In such cases evaluation is sometimes performed through k-
fold cross-validation. It requires an exhaustive k-ways equal splitting of data
and repeated training of the model using one fold for validation and the rest for
training. While the repetitions lead to more stable evaluations, the best models
evaluated on the validation folds can still rely on the underlying knowledge
about the data in that fold. Therefore it is preferable to employ an
independent test set, and when that is not possible, to tune the hyperparameters
on a single split and use the same best model for the remaining k − 1 splits [88].
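
One common way to avoid leakage in medical imaging, not spelled out above, is to split on the subject level so that no slices or patches from a test subject ever appear in training. The sketch below uses hypothetical subject identifiers and fractions; the exact protocol used in the papers may differ.

```python
import random

def split_subjects(subject_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle subjects once and split them into train/validation/test sets.

    Splitting on the subject (not slice/patch) level prevents data leakage when
    several images or patches originate from the same subject.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_frac))
    n_val = int(round(len(ids) * val_frac))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test
```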

The importance of training curves


Training curves portray the evolution of the loss and metric values by epoch.
They should be examined together with validation curves, and the term train-
ing curves is typically used as an umbrella term for both together.
Using a validation set is not important only for hyperparameter tuning.
Looking at the validation curve evolution with respect to the training curve can
offer insights into generalizing capabilities of the network. Overfitting for ex-
ample (when the training set results keep improving while the results on the
validation set start deteriorating again) implies bad generalization. When the
training and validation sets are very similar, even large differences in con-
verged validation and training results can indicate that the network will not

generalize well. On the contrary, results that are too similar can be a cause for
concern; perhaps the datasets are too similar, which can cause problems and
unpredictable behaviour in real-life deployment [15].
Particularly in medical applications, generalization properties are very im-
portant, due to extreme data heterogeneity (see section 2.1). It is not practical,
or reasonable, to constrain method use to images of the exact same properties.
While networks are trained for many epochs, the model at a single specific
epoch is chosen for evaluation, meaning that highly oscillatory curves can
produce good results "by chance". Examination of training curves should thus
also be done to establish the stability properties and inform parameter adjust-
ments.

3.3 Distance transforms


A distance transform (also a distance map, abbreviated DT) on a digital image
is a representation of the image where each pixel is mapped to a numerical
value corresponding to the shortest distance from this pixel to a chosen loca-
tion/pixel/object in the image. Signed distance transform simply refers to a
DT where a negative/positive sign for the distance is used as an indication of
the pixel being inside/outside of the object the distance is calculated from. It
is in essence a signed point-to-set distance. Formally, on an image domain Ω
we can define it as

$$DT(x, O) = \begin{cases} \displaystyle\min_{y \in O} d(x, y), & \text{if } x \notin O \\[2mm] -\displaystyle\min_{y \in \partial O} d(x, y), & \text{if } x \in O \end{cases} \qquad (3.11)$$

for every x ∈ Ω, where O ⊂ Ω is the point/object we are calculating the dis-


tance to, and ∂ O = {y | y ∈ O, y has a neighbour in Ω \ O} its boundary. His-
torically, DTs have played a major role in segmentation, template matching,
skeletonization, clustering and more [10, 73, 60, 21]. The appearance of a dis-
tance transform will depend on the choice of distance and underlying image
size and/or intensity profile.
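
For binary masks, a signed distance transform in the spirit of Equation 3.11 can be assembled from two Euclidean distance transforms. The sketch below uses scipy; its treatment of boundary pixels may differ slightly from the definition above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Signed Euclidean DT of a binary mask: negative inside, positive outside.

    mask: boolean array where True marks object pixels (the set O in Eq. 3.11).
    """
    outside = distance_transform_edt(~mask)  # distance to the object for background pixels
    inside = distance_transform_edt(mask)    # distance to the background for object pixels
    return outside - inside
```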

3.3.1 Common distance definitions


Formally speaking, a function d : M × M → R on a set M is called a (distance)
metric or distance (function) if it satisfies the following axioms:
1. d(x, y) ≥ 0 (nonnegativity)
2. d(x, y) = 0 ⇔ x = y (reflexivity)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality)

for all x, y, z ∈ M. For an arbitrary function to be called a distance function,
the nonnegativity condition is actually the only requirement. However, met-
ric properties are interesting theoretically and have implications for practical
computations of distances. When a function satisfies all but the reflexivity
property, it is called a pseudo-metric, and when it only fails to satisfy the tri-
angle inequality, it is called a semi-metric. For a more in-depth reference on
metrics and various distances see [26].
Intuitively a distance is simply a function that measures a difference be-
tween pixels or objects. Even the common evaluation metrics (dissimilarity
measures in particular) can be formulated/viewed as distances. Some dis-
tances are global (i.e. depend only on source and target points) and some
are path-based (i.e. depend on the actual path taken between them).
On digital images, in order to speed up the computations of DTs, even the
global distance values are commonly approximated through propagation of
local distances (or their vectors), i.e. distances between neighbouring pixels
[83, 33, 16, 23]. This leads to distances that are computed along (discrete)
paths, requiring that a reachable local pixel neighbourhood needs to be decided
upon for the practical calculations.
Below we define a few common distance measures that are used mainly in
paper II. While they originally work on continuous domains, we here consider
only distances on images. For the sake of simplicity, we provide definitions
for images on 2D image domains, I : Ω ⊂ Z2 → R, but extensions to 3D are
trivial.
For any two points (pixels) x, y ∈ Ω, let πx,y = (x = p1 , p2 , . . . , pn−1 , pn = y)
be a path between them, with pi and pi+1 , i = 1, . . . , n − 1 adjacent points
(pixels) on the path. We use Πx,y to denote a set of all such allowed paths.
The distance between x and y computed along a path πx,y is calculated as
$\sum_{i=1}^{n-1} d(p_i, p_{i+1})$, where d is the chosen distance function. To obtain (or ap-
proximate, for the case of non-path-based distances) the final distance between
x and y, the minimum of all distances along paths from Πx,y is taken.

Euclidean distance
Euclidean or L2 distance is the most widely used distance, arising naturally in
the physical world. For points x, y ∈ Ω, x = (x1 , x2 ), y = (y1 , y2 ), it is defined
as
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}. \qquad (3.12)$$

Taxicab distance
Taxicab distance, called also Manhattan or L1 distance, is a distance between
points computed along a grid. If we consider all possible paths in the image
as valid, the path-based computation will produce the exact distance.

d(x, y) = |x1 − y1 | + |x2 − y2 | (3.13)

Geodesic distance
The geodesic distance [98] is the distance along a curve (on a manifold in
higher dimensions). For images, this means that the geodesic distance between
pixels depends not only on their spatial proximity, but also on their intensities.
It is commonly defined using L2 norms (i.e. for 2D images, it boils down to
L2 in 3D), but along paths:

$$d(x, y) = \sqrt{\|x - y\|_2^2 + (I(x) - I(y))^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (I(x) - I(y))^2} \qquad (3.14)$$
In practice, it is however often computed using the L1 distance instead. Fur-
thermore, to account for the fact that the additional dimension (in the case of
images) is the intensity which may have a very different range than the rest,
an additional parameter λ is commonly used to balance the contributions:
d(x, y) = (1 − λ )||x − y||1 + λ |I(x) − I(y)| (3.15)
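
As an illustration of how such a path-based distance map can be computed in practice, the sketch below propagates the local cost of Equation 3.15 with a Dijkstra-style front from a set of seed pixels. It assumes a 4-connected neighbourhood and a placeholder value of λ, and is only meant as a minimal example, not the implementation used in the papers.

```python
import heapq
import numpy as np

def geodesic_distance_map(image, seeds, lam=0.5):
    """Geodesic distance transform in the spirit of Eq. 3.15, propagated with Dijkstra.

    image: 2D intensity image (numpy array)
    seeds: boolean mask of source pixels (distance 0), same shape as image
    lam:   balance between the spatial (L1) and intensity contributions
    """
    h, w = image.shape
    dist = np.full((h, w), np.inf)
    heap = []
    for i, j in zip(*np.nonzero(seeds)):
        dist[i, j] = 0.0
        heap.append((0.0, (int(i), int(j))))
    heapq.heapify(heap)
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if d > dist[i, j]:
            continue  # stale heap entry
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connected neighbourhood
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w:
                step = (1 - lam) * 1.0 + lam * abs(float(image[ni, nj]) - float(image[i, j]))
                if d + step < dist[ni, nj]:
                    dist[ni, nj] = d + step
                    heapq.heappush(heap, (d + step, (ni, nj)))
    return dist
```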

Minimum barrier distance


Minimum barrier distance (MBD) [92] is a path-based distance. It is calcu-
lated exclusively on the image intensities, and effectively independent of the
path length in space. Given an admissible path πx,y = (p0 = x, p1 , . . . , pn−1 , pn =
y) between pixels x and y, a barrier along that path is the largest intensity dif-
ference on the path: $b(\pi_{x,y}) = \max_{p_i \in \pi_{x,y}} I(p_i) - \min_{p_i \in \pi_{x,y}} I(p_i)$. With this in mind,
MBD is then defined as
$$d(x, y) = \min_{\pi \in \Pi_{x,y}} b(\pi) \qquad (3.16)$$
As opposed to the rest of the distances defined here, MBD is actually only a
pseudo-distance, disobeying the reflexivity property.

3.3.2 Using distance transforms within DL


Distance transforms can be used with advantage in CNN training in various
ways, particularly for segmentation tasks [67]. Through the introduction of
DT in training, the network can be supplied with prior information without
the need for a human to explicitly formulate it. DTs can also be used to en-
force/emphasise the importance of specific image properties. In recent years,
incorporating the distance transform maps (calculated based on the ground
truth annotations) into CNN pipelines has thus received more and more at-
tention.
Distance maps are commonly included through losses [54, 52], direct met-
ric learning (i.e. learning/regressing the distance map for object location or
segmentation, [104]) or multi-task learning and regularization [6, 22]. Dif-
ferent distance functions have different properties and choosing which one to
use, and how, is again application dependent.

3.4 Statistical analyses
Within the field of statistics, hypothesis testing is the topic that permeates all
levels of image processing, be it for investigating the true significance of
the seemingly improved results from a new model, or at the very end of
the diagnostic pipeline, when providing medical practitioners with statistically
sound measures of signals of interest in the images. The latter is the focus in
paper IV, aiming to provide interpretable and reliable testing procedures on
imaging data. Below we provide a short overview of the hypothesis testing
procedure and the problems related to it when considering images. More de-
tails and the underlying theory can be found e.g. in [99].

3.4.1 Hypothesis testing


Hypothesis testing is one of the more commonly used statistical inference
techniques. It is the process of assessing to what extent our initial hypoth-
esis about a population is aligned with (supported by) the actual data at hand.
The general procedure of hypothesis testing consists of four steps:
1. choose null (and alternative) hypothesis,
2. define and compute the test statistic,
3. calculate the p-value,
4. assess significance (interpret the p-value).
The first step refers to deciding on what we wish to test for; the null hypoth-
esis, H0 , commonly covers the statement of no effect/no difference (between
some compared sample-based values, θ1 and θ2 ), while the alternative hypoth-
esis, Ha , constitutes the alternative possibilities. Depending on what kind of
differences we are interested in, a two-sided (H0 : θ1 = θ2 ) or one-sided (H0 : θ1 ≥ θ2
or θ1 ≤ θ2 ) test may be appropriate.
After the null hypothesis has been decided upon, the statistics to be tested
needs to be computed, measuring the compatibility of the data with the null
hypothesis. Its choice depends on the type of test we wish to carry out (testing
for the difference of distribution means, or difference in effect, etc.) and the
underlying distributions. Test statistics are typically designed in a way that
their size indicates inconsistency with H0 . Large inconsistencies however do
not directly imply that the null hypothesis is false – it is possible that H0 holds
and our dataset is simply a very unusual observation.
The p-value, calculated in step 3, represents the probability that the test
statistic would take a value at least as extreme as the one that was actually
observed, under the premise that the null hypothesis is true. Thus small p-
values indicate that the observed dataset would be unusual under H0 , casting
doubt on its validity.
The final goal of the testing procedure is to determine whether there is suf-
ficient evidence (as given by the data) to reject the null hypothesis. For that,
one needs to define what amount of evidence should be considered sufficient;

that is the scope of the fourth step. Most commonly, p-values under 0.05 are
interpreted as strong evidence against H0 (with values < 0.01 considered as
very strong evidence). If the values are within the range 0.05 to 0.1, that is
usually considered as weak evidence, while values above 0.1 provide no or
little evidence against the null hypothesis. The threshold at which we consider
the evidence sufficient is called the significance level, α.
The errors that can occur in hypothesis testing can be split into Type I and
Type II errors. Type I error, or false positive, is the case when we mistakenly
reject H0 that is true. Its probability is controlled by the significance level.
On the other hand, when H0 is false but we fail to reject it, we commit a
type II error or a false negative. While α is usually set such that it keeps
the type I errors at bay, its settings affect the type II errors too. Choosing an
appropriate α therefore requires balancing between sensitivity (true positives)
and specificity (true negatives).
Below we shortly explain three common types of tests that are used also
across the papers included in this thesis.

Student t-test
According to the central limit theorem, the distribution of a sample-mean vari-
able becomes approximately normal with increasing sample size. Therefore,
given a sufficiently large sample of a population with known variance, the
expected underlying distribution of the statistic can be assumed normal. How-
ever, it is often the case in practical applications that we are dealing with a
limited sample and unknown population variance. In such cases one can in-
stead use a Student-t distribution, and perform a so-called t-test.
There are different types of t-tests available, with regard to the values we
wish to compare: comparing means stemming from a single group/population
(e.g. measuring the outcome of a treatment) requires a paired t-test, compar-
ing groups from two different populations boils down to a two-sample t-test,
and comparing a group based value against a specific scalar (e.g. comparing
correlation coefficient to 0) calls for a one-sample t-test.
The t-statistic T for a two-sample test comparing groups of sizes N1 , N2
with means m1 , m2 and standard deviations s1 , s2 is calculated as:
$$T = \frac{m_1 - m_2}{s_p \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}} \qquad (3.17)$$

where $s_p = \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}}$ is the pooled standard deviation. This def-
inition assumes the groups have sufficiently similar variances. If that is not
the case, a more general version can be found e.g. in [19]. In the case of a
one-sample test, comparing group mean m1 to μ, this simplifies to $T = \frac{m_1 - \mu}{s_1 / \sqrt{N_1}}$.
The paired version of the test requires dependent samples (repeated mea-
sures of a single sample or two paired samples) and is calculated according to

the one-sample test, using the mean and standard deviation of the differences
between corresponding pairs.
The t-test is a parametric test (assuming approximately normal distribu-
tions), and thus not appropriate when this assumption is violated. It can be
sensitive to outliers when the sample size is very small.

The permutation test


The permutation test [2] is an exact (i.e. not based on large sample theory
approximations) nonparametric test of the equality of the underlying dis-
tributions for given samples. Instead of assuming the underlying distribution
shape for a chosen statistic under H0 , it is estimated through random permu-
tations of the data. However, this requires exchangeability under H0 , meaning
that, given H0 , all permutations are equally likely. And when each of the pos-
sible permutations is equally likely under H0 , the p-value can simply be com-
puted as the proportion of the statistic values on permuted data that is greater
than or equal to the original value of the statistic.
To illustrate, let the two samples s1 , s2 we wish to compare in terms of
some statistic T (s1 , s2 ) be of size N1 and N2 , respectively. Then new values of
the examined statistic T are computed over all N possible resamplings of the
pooled data s1 ∪ s2 into two groups ŝ1 , ŝ2 of sizes N1 and N2 , producing values
T (ŝ1 , ŝ2 ) that constitute the distribution under H0 . The p-value p for T (s1 , s2 )
is then computed using this empirical null distribution:
$$p = \frac{1}{N} \cdot \left|\{\,T(\hat{s}_1, \hat{s}_2) \mid T(\hat{s}_1, \hat{s}_2) \geq T(s_1, s_2)\,\}\right|. \qquad (3.18)$$
Naturally, the sample sizes (the number of possible permutations) will affect
the accuracy of the distribution representation. And from Equation 3.18 it
clearly follows that given a permutation set of size N, the only attainable p-
values are k/N, for k = 1, . . . , N. Hence the minimal N required to perform a test
at significance level α is such that Nα ≥ 1.
For very large samples performing an exhaustive permutation test is not
always feasible, and a (large enough) random subset of possible permutations
may be used. This is also known as the approximate permutation test and is
more conservative and less powerful than the ordinary permutation test.
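
To make the procedure concrete, a small sketch of such an approximate (randomly subsampled) permutation test is given below; the choice of statistic and number of permutations are placeholders, and a common conservative adjustment for random subsets is noted in a comment.

```python
import numpy as np

def permutation_test(s1, s2, statistic, n_perm=10_000, seed=0):
    """Approximate two-sample permutation test (Eq. 3.18 over random resamplings).

    statistic: function of two 1D arrays returning a scalar, larger meaning more extreme.
    """
    rng = np.random.default_rng(seed)
    s1, s2 = np.asarray(s1), np.asarray(s2)
    observed = statistic(s1, s2)
    pooled = np.concatenate([s1, s2])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if statistic(perm[:s1.size], perm[s1.size:]) >= observed:
            count += 1
    # Eq. 3.18; for random subsets, (count + 1) / (n_perm + 1) is a common conservative choice.
    return count / n_perm

# e.g. testing a difference in means:
# p = permutation_test(a, b, lambda x, y: abs(x.mean() - y.mean()))
```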

Paired signed-rank Wilcoxon test


This is a nonparametric counterpart to the paired samples t-test [100]. It is
intended for repeated paired measures and requires the measures in question
to be subtractable (i.e. numerical), and their differences must be comparable
(possible to order).
To perform the test, pair differences must first be computed and ranked
according to their absolute values (if there are ties, the mean value of untied
ranks is usually used for them). The acquired ranks are assigned the signs of
their corresponding differences. Absolute values of all negative and positive

ranks are summed individually, and the lowest sum is used as the Wilcoxon
statistic (W) value.
To obtain the corresponding p-value, the acquired value needs to be com-
pared to the underlying distribution of W under H0 . The null hypothesis here
corresponds to the sample of positive and negative differences being normally
distributed around 0 (with median 0) [100]. For small sample sizes, it can be
computed exactly, in a combinatorial manner. But for larger sample sizes, that
is not computationally tractable, and a normal approximation is used for the
standardized W statistic.

3.4.2 Test multiplicity


With the enormous amounts of data available for analyses nowadays, there
is often a need for carrying out multiple tests simultaneously. In images, for
example, simultaneous testing is often performed over all pixels in the image.
The multiplicity of tests however complicates the relationship between the
expected amount of type I errors and the significance threshold setting. When
performing a single test, using a significance level of α means that we can
expect to make a type I error with α probability. But if two tests are performed
simultaneously instead, the probability that at least one of them will result in
a type I error is already larger, at α + (1 − α)α = (2 − α)α. What is more,
with an increasing number of tests m, we quickly reach very high (close to 1)
probabilities of at least one type I error: P(at least one type I error) = 1 − (1 − α)^m.
This multiple comparison problem is a well-known problem in statistical
hypothesis testing (see [89] for more details on the topic) that reportedly leads
to inflated detection rates and problems with interpretability [45, 8, 31]. If we
wish to adequately control the family-wise type I error rate (FWER) on the
experiment as a whole (i.e. error rate on the entire family of tests instead of
each individual test), the significance level α (or the computed p-values) can
be adjusted using one of the so-called correction methods. While there are
initiatives for developing better statistical models and methods based on the
specifics of acquisition systems [14], using correction methods is still the stan-
dard way of dealing with multiplicity issues at the time of writing. Throughout
the years, many correction methods relying on various assumptions have been
developed; most targeting neuroimaging data in particular [61, 74].
Below we describe several FWER-controlling correction methods, appro-
priate (or even explicitly designed for) testing on images, which we use and
evaluate in paper IV. With respect to inference level (i.e. the individual units
we aim to establish the significance of), correction methods as applied to im-
ages can be grouped into cluster-wise, pixel/voxel-wise or set-level methods,
and we will focus on the former two. From here on we assume to be dealing
with images, performing an individual test at every pixel.

Bonferroni method
The oldest and simplest FWER-controlling correction procedure is the Bon-
ferroni correction [29], which assumes test independence. The correction con-
sists of simply dividing the significance threshold by the number of all tests.
Clearly, this method is very conservative when test multiplicity is very high,
quickly losing the power of detecting any actual significance.
This is especially prominent in the case of images, where the number of
tests equals the number of pixels. In addition, test independence is typically
violated on images, as some spatial correlation between pixels is almost uni-
versally present. While this does not undermine the corrected threshold’s the-
oretical validity, it exacerbates its stringency.
Some improvements (with respect to extreme stringency) of the Bonferroni
method include the step-down Holm procedure [42], which relies on ordering
the signals/p-values and rejecting them sequentially, at different thresholds.
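
Both corrections are straightforward to express in code; the sketch below (NumPy, with hypothetical helper names) returns a boolean rejection mask over a flattened set of p-values.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 wherever p < alpha / m, with m the total number of tests."""
    p = np.asarray(p_values)
    return p < alpha / p.size

def holm_reject(p_values, alpha=0.05):
    """Step-down Holm procedure: compare the i-th smallest p-value against
    alpha / (m - i) and stop at the first non-rejection."""
    p = np.asarray(p_values)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(np.argsort(p)):
        if p[idx] < alpha / (m - i):
            reject[idx] = True
        else:
            break
    return reject
```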

RFT method
The random field theory method (RFT) [1, 103] was developed as a solution
to the very restrictive independence assumption of the Bonferroni method. It
has been most popular within the neuroimaging community in recent years, as
it accounts for the fact that pixels are influenced by the signal in nearby pixels.
RFT considers the image to be a realization of a (e.g. Gaussian) random
field and relies on theoretical results for Euler characteristics in a thresholded
image. Euler characteristic (EC) can be interpreted as the number of con-
tinuous areas/clusters consisting of pixels with values exceeding some given
threshold (minus the number of holes/voids in these areas). In that sense, it
is tightly connected to homology, as it is a topological descriptor that can be
written as an alternating sum of Betti numbers.
Using a large enough threshold (w.r.t pixel statistic values), EC will repre-
sent the number of local maxima and eventually fall to 0 (when the threshold
is set higher than the maximal pixel statistic). The expected value of EC can
thus be understood as the probability of at least one suprathreshold area, or in
other words, the probability of the maximal value (over all pixels) exceeding
the given threshold. This means that calculating the expected value of EC for
an image is roughly the same as performing FWER correction on its p-values.
A closed-form approximation for the expected value of EC (i.e. for the tails
of the null distribution) is derived in [1] for Gaussian fields, and extended
to other types of random fields in [103]. The RFT method involves a consider-
able amount of mathematical theory, however, the availability of closed-form
expressions makes it computationally undemanding and thus feasible for use
in imaging.
Although RFT solves the independence problem, it also introduces new re-
strictions on image smoothness, differentiability of the autocorrelation func-
tion, and locality-independent distributions. All of these are often unattainable

for medical images in practice. In addition, it tends to be very stringent when
the number of subjects/images involved in the analysis is small.
While this formulation of RFT holds for voxel-based correction and infer-
ence, it is possible to use it for cluster-based inference too, by instead control-
ling the probability that a single cluster of a certain size exceeds the threshold
under the null hypothesis.

Permutation-based method
The permutation-based methods [2] are based on the same idea as the per-
mutation test and use the same mechanism to empirically estimate the null
distribution. They are also subject to the same assumption of exchangeability
under H0 . However, to deal with the multiplicity issue, an extremal (maximal)
summarizing statistic needs to be used in the calculations.
In essence, given a family of n tests producing statistic values T1 , . . . , Tn ,
the summarizing (maximal) statistic is given as T = maxi=1..n Ti and the em-
pirical distribution under H0 is given by the maximal statistic values for all
permutations. Since the possibility of at least one false rejection coincides
with rejecting at least the test with the largest statistic, setting a significance
threshold for T then results in control of the FWER.
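
A minimal sketch of this correction is given below (NumPy); it assumes the per-permutation statistic maps have already been computed, which is usually the expensive part.

```python
import numpy as np

def maxstat_fwer_threshold(perm_stat_maps, alpha=0.05):
    """FWER-controlling threshold from the empirical max-statistic null distribution.

    perm_stat_maps: array of shape (n_permutations, n_pixels) holding the statistic
                    recomputed on permuted data.
    Pixels of the original map exceeding the returned threshold are significant
    with family-wise error rate at most alpha.
    """
    max_stats = perm_stat_maps.max(axis=1)   # maximal statistic per permutation
    return np.quantile(max_stats, 1.0 - alpha)
```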

Cluster-extent based method


So far all the presented methods offer voxel-based inference. However, to deal
with the problem of image smoothness, another popular option is to resort to
cluster-based inference instead.
The cluster-extent-based method relies on an arbitrarily chosen primary
threshold to define the clusters, then analyzes them in terms of their size (spa-
tial extent in pixels). The cluster-extent threshold for a given significance level
is determined according to the distribution of the cluster sizes (more specif-
ically, the largest cluster size, in order to control FWER) under H0 , with the
null hypothesis here being no activation in any pixels in the cluster. This dis-
tribution is typically estimated either via RFT, Monte Carlo simulations or the
permutation method [61].
Because of its sensitivity, the cluster-extent-based method has been very
popular in medical imaging. However, due to the loss of spatial specificity (a
p-value or significance information for a cluster does not say anything about
the signal at individual pixels) and the arbitrary choice of the cluster-defining
threshold (affecting the inference), there are many pitfalls and interpretation
issues connected to it [102, 31].

TFCE method
To overcome the problem of an arbitrary cluster-defining threshold, an ex-
tension called Threshold-Free Cluster Enhancement (TFCE) [91] has been
proposed. The statistic it uses, the so-called TFCE-score, is a pixel-based

measure, encoding signal strength as well as its spatial extent information.
Formally, the TFCE-score for a pixel x is defined as:
$$TFCE(x) = \int_{h_0}^{h_x} e(h)^E\, h^H\, dh \qquad (3.19)$$

where e(h) is the extent of the cluster containing x at threshold h, and E, H are
parameters typically set to 0.5 and 2 respectively. In practice, this is computed
using a summation, with dh = 1 for discretized images. The TFCE-score out-
put map can be converted into FWER-corrected p-values using permutation
testing (for the maximal TFCE-score statistic).
As a pixel-based score, TFCE retains localization power which is otherwise
lost in cluster-based methods. However, the spatial extent information that
accounts for smoothness and lack of test independence is still incorporated in
the calculations through the integration of spatial support for the given pixel
over a set of thresholds. This way, the method is equipped to detect both
diffuse signals with low strength and strong, focal signals.
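
As an illustration of the discretized computation mentioned above, a simple (and unoptimized) sketch for a 2D statistic map could look as follows; the step size dh and the cluster connectivity are assumptions, not values taken from the papers.

```python
import numpy as np
from scipy import ndimage

def tfce_map(stat_map, dh=0.1, E=0.5, H=2.0):
    """Discretized TFCE scores (Eq. 3.19) for a 2D statistic map.

    At each threshold h, every suprathreshold pixel accumulates e(h)^E * h^H * dh,
    where e(h) is the size (in pixels) of the cluster it belongs to at that threshold.
    """
    tfce = np.zeros_like(stat_map, dtype=float)
    for h in np.arange(dh, stat_map.max() + dh, dh):
        labels, n = ndimage.label(stat_map >= h)   # clusters above the current threshold
        if n == 0:
            break
        sizes = np.bincount(labels.ravel())        # sizes[0] counts the background
        extent = sizes[labels].astype(float)       # per-pixel extent of its own cluster
        tfce += np.where(labels > 0, extent ** E * h ** H * dh, 0.0)
    return tfce
```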

CBA method
In Cluster-Based Analysis (CBA) [41], the initial step of defining clusters is
done on a no-activation image, which is excluded from the analysis. Further-
more, the so-defined clusters are assigned some summarizing statistic (taking
into account all elements of the cluster), which is finally analyzed instead of
pixel-based statistics. In this sense, the CBA method lies at the intersection
of pixel- and cluster-based methods, as all individual pixels are in a way con-
sidered (and have the possibility of being significant), but the actual analysis
(and inference) units are predefined non-overlapping clusters.
The process of defining clusters is based on the pairwise inter-pixel corre-
lation in the local neighbourhood. For each pixel, the correlation is computed
with all its neighbours, and the pixel is then clustered together with its most
highly correlated neighbour. Since the maximum correlation property is not
symmetric, the produced clusters can be of arbitrary sizes.
For a given pixel or voxel, not all of its neighbours are at the same distance
from it. Therefore the correlations need to be corrected with respect to the
distances. After the clusters are defined and their signals computed, the p-
values for the clusters are calculated. The null hypotheses here are that none
of the cluster elements is active, under which the cluster p-values are uniformly
distributed.
The so-acquired p-values still need to be adjusted to correct for the test
multiplicity. The original paper [41] suggests using methods of false discovery
rate control, but the adjustment can just as well be done for FWER control,
for example by any of the previously described approaches. By effectively
(at least) halving the number of tests and joining strongly correlated areas,
the CBA method directly alleviates the multiplicity issue and problems with
violating the independence assumptions of other correction methods.

4. Improving segmentation learning

Nowadays CNN-based methods constitute the state-of-the-art approaches for


image segmentation problems [70]. Learning methods can be steered, and their
training improved, by including prior (domain-specific) knowledge. This can
be done in many ways, e.g. through additional inputs, specially designed loss
functions, architectural adjustments, enforcing prior-compliant learning rules,
and more [24, 84].
During this thesis work, more specifically, in papers I and II, we investi-
gated the use of distance maps as prior knowledge. As described in subsec-
tion 3.3.2, combining DL with DTs has been an active field of research, with
works ranging from providing DTs as input to using them as ground truth in
training.
This chapter is split two ways: section 4.1 covers the work of paper I, focus-
ing on patch-based training and improving the segmentation accuracy through
adding DTs at the network input stage. In section 4.2, the work behind pa-
per II is introduced, exploiting specific properties of DTs to allow for weak
(point-based) supervision.

4.1 Patch-based learning in 3D


Deep learning models tend to be memory-consuming and typically require the
use of GPUs to be trained in reasonable times. GPUs in turn impose limits
on available memory, and while their capabilities have been steadily grow-
ing through the years, the quality and resolution of medical scans have been
increasing, too. Increased resolution results in large image sizes, preventing
the use of full image data for the training of CNNs. To overcome memory
constraints, training can be done on image patches (e.g. [50]).
Training CNN models with patches is less memory-demanding and can also
be easier to optimize in highly imbalanced problems, through class-based sam-
pling and augmentation. However, the patches, cropped from a single image,
are typically presented to the network as individual training samples, disre-
garding their interrelationship and source locations.
While neural networks have a high capacity for learning, they are limited
by the data they are shown. Thus, in paper I, we investigate their training
behaviour when providing additional information encoding the source location
of the patches. The final aim is to improve performance in patch-based training
paradigms for medical images. This section provides a short overview of the
work done in paper I together with an additional supporting experiment. For
specific technical details and in-depth explanations refer to the original paper.

4.1.1 Motivation
Medical image segmentation, particularly multiclass, tends to depend heavily
on contextual information since object positions and coocurrences are typi-
cally class dependent. Using only image patches for CNN training effectively
restricts the receptive field, and the only way of introducing broader context is
by adding it to the network separately.
In patch-based training, patch locations within the image carry information
about the broader context and can even encode anatomical priors where the un-
derlying anatomy is well-defined (e.g. whole-body scans). A number of works
have experimented with patch-location-informed training settings [51, 35, 34],
but mostly rely on absolute coordinates or on prior probability or structural at-
lases, requiring image registration.
When working with anisotropic data or data like whole-body scans, where
the objects in the image can exhibit a fair amount of variability in terms of size
and positioning, absolute coordinates of the patch positions across subjects
may not be comparable or informative. In paper I we therefore propose a
distance-map-based encoding of the patch locations, which uses predefined
landmarks but requires no prior registration and produces directly comparable
data across patches.

4.1.2 Methods
The method development in paper I was driven by the POEM dataset (see sec-
tion 2.2), from which additional absolute landmark position data for ankle,
knee, hip and shoulder joints has been extracted. While similar methodologi-
cal development could be done by any other anatomically relevant points (or
even sequentially, by using two landmarks at a time), we focused specifically
on abdominal organ segmentation and utilized the hip and shoulder joint posi-
tions.
To provide the patch locations in a relative manner (i.e. such that similar
numbers correspond to the same anatomical areas across images), we propose
using landmark-normalized distance maps as additional input to the network
at the convolutional level. Let (i, j, k) represent the 3D image-space coordi-
nates in the left-right (X), front-back (Y) and foot-head (Z) directions. We
define two distance maps, Dx and Dz , omitting the front-back direction (since
variability in landmark/object locations in that direction is low):

$$D_x(i, j, k) = \frac{i - A(L)}{A(R) - A(L)} - 0.5 \quad \text{and} \quad D_z(i, j, k) = \frac{k - F}{H - F}. \qquad (4.1)$$
Here F and H refer to the minimum and maximum k-coordinates of the hip
and shoulder landmarks, respectively, while A(L) and A(R) are the average i-
coordinates of the left and right pairs of landmarks. We chose to offset the Dx
map by 0.5, to bring the two maps closer in range.
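
The sketch below illustrates Equation 4.1 in NumPy. The landmark coordinates (in voxel indices) and their grouping into left/right pairs are assumptions for the example, and the multiplication by reconstructed voxel sizes that counteracts anisotropy is omitted.

```python
import numpy as np

def landmark_normalized_maps(shape, left_landmarks_ik, right_landmarks_ik):
    """Compute Dx and Dz of Eq. 4.1 for a volume of the given (I, J, K) shape.

    left_landmarks_ik / right_landmarks_ik: (i, k) voxel coordinates of the
    left and right hip and shoulder landmarks, as arrays of shape (2, 2).
    """
    left = np.asarray(left_landmarks_ik, dtype=float)
    right = np.asarray(right_landmarks_ik, dtype=float)
    A_L, A_R = left[:, 0].mean(), right[:, 0].mean()   # average i of left/right pairs
    all_k = np.concatenate([left[:, 1], right[:, 1]])
    F, H = all_k.min(), all_k.max()                    # min/max k over the landmarks
    I, J, K = shape
    i = np.arange(I, dtype=float)[:, None, None]       # left-right coordinate
    k = np.arange(K, dtype=float)[None, None, :]       # foot-head coordinate
    Dx = np.broadcast_to((i - A_L) / (A_R - A_L) - 0.5, shape)
    Dz = np.broadcast_to((k - F) / (H - F), shape)
    return Dx, Dz
```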

Figure 4.1. Two subjects of different sizes from the POEM cohort. From both, we
sample an equally-sized patch centred on the landmark midpoint. The patches of
proposed Dx and Dz thus have the same value in the centre and encode the amount
of actual tissue captured within the patches through their ranges. The absolute voxel
coordinates of the patch centre on the other hand hold no such information and are not
easily comparable.

Sampling patches from maps defined in Equation 4.1 captures the patch
location relative to the landmarks. Further multiplying the coordinates by the
reconstructed voxel sizes counteracts anisotropy. Providing patch locations in
such a relative sense has an important advantage over simply considering the
absolute patch positions: it encodes information about the extent of the subject
within the image. The effect is illustrated in Figure 4.1.
Considering that we chose to work with four landmarks with a known oc-
currence pattern (i.e. their convex hull forms an approximate trapezoid when
projected on the coronal (frontal) plane), another possible way of computing
relevant distance maps could be to simply compute a projective transforma-
tion of the landmark convex hull onto a [−1, 1]^2 square.
But, due to the elongated overall shape of the human body, this can in-
duce extreme values and potential singularities where the lines connecting the
left two and the right two landmarks intersect (see Figure 4.2). Hence, so-
created maps are not appropriate for training unless additionally processed
(e.g cropped or value clipped).

Figure 4.2. The distance maps created through a projective transform of the landmark
convex hull (in left-right, X, and foot-head, Z, directions only) onto [0, 1]^2. The two
maps represent the transformed X and Z voxel coordinates, respectively. Ranges are
clipped for visualization. Pink dots show landmark positions. In the Z coordinate
map, a singularity occurs at the level of the feet (where lines connecting shoulder and
hip landmarks cross), producing extreme values and a sign swap.

4.1.3 Results
Training with different networks
The experiments in paper I were performed on the task of abdominal or-
gan segmentation (bladder, kidneys, liver, pancreas and spleen), using three
network architectures that take into account different levels of context: U-
Net [82], Vanilla CNN and DeepMedic [50]. All networks were trained on
patches of size 25^3, both with and without added Dx and Dz maps (see paper
I for detailed training settings). With all three networks, we saw an improve-
ment both in terms of DSC and stability (as deduced from the training curves)
when using our proposed distance maps.

Effects of distance map calculation


In addition to the work presented in paper I, we ran a separate experiment to
confirm the motivation behind our proposed computation of the distance maps,
using Vanilla CNN with absolute (but normalized to range [0, 1]) distances. A
box plot of achieved Dice scores is shown in Figure 4.3.
According to the results, using absolute distances does not improve the Dice
scores to the same extent as our proposed maps. This shows the importance
of the extent information, hidden in Dx and Dz maps. Nevertheless, the differ-
ences are small. The subjects from the POEM cohort that we used for training
were not extremely different in sizes, and we hypothesize that the effect of
using proposed distances would be more pronounced (compared to the abso-
lute ones) when used on images with higher (and potentially more anisotropic)
resolution or a dataset of very differently sized subjects.

Figure 4.3. A box plot of the 5-fold validation Dice scores per class (averaged over 5
repetitions), for training with absolute- and the proposed DTs, and without them.

4.1.4 Conclusions
In paper I we proposed a way of informing the network of the patch positions
to improve patch-based segmentation training. We evaluated the idea using
three network architectures and compared it to using absolute patch positions,
as is often done in the literature. We showed that using the proposed landmark-
normalized maps improves both training stability as well as the final Dice
scores on the task of abdominal organ segmentation.

4.2 Guiding the learning under weak supervision


CNN-based methods have proven very successful for various tasks within
medical image segmentation. However, they require large amounts of train-
ing data to achieve sufficient accuracy and generalizability. At the same time,
pixel-wise annotations are very time-consuming and expensive to obtain within
the medical field, as they require expert knowledge.
In paper II we thus moved from fully-supervised patch-based training in
3D to training on 2D slices under point-based supervision. We once again
proposed to use distance transforms as a way of supplementing the weak an-
notations with more information about the underlying objects. This chapter
briefly summarizes the work done in paper II and provides the background
experiments that motivated the approach.

4.2.1 Motivation
Point annotations are very cheap to acquire and can be done accurately even
by less experienced annotators. However, they do not contain any object shape
or extent information, like full pixel-wise annotations. It is thus reasonable to
try and provide such priors to the network in a different way.
At the same time, for training with full annotations, the DT-reliant Bound-
ary loss [54] has proven very effective, particularly for imbalanced datasets

and irregularly shaped objects, which is often the case in medical image seg-
mentation. Obviously, enforcing closeness in boundary/space when training
with point annotations would result in degenerated outputs containing points
or only background. But as mentioned in section 3.3, the appearance (and the
information it encodes) of a DT depends on the choice of the distance function.
This motivates the question of whether it is possible to harness the strengths
of the Boundary loss even under weak supervision, given appropriate distance
definitions.

Figure 4.4. A grayscale image (left) and examples of optimal learned segmentation
curves under losses focusing only on spatial proximity (red curve, middle) or intensity
differences (multiple curves, right) to the desired ground truth (the shaded object in
the centre). Enforcing exact overlap with the ground truth in space, the optimal curve
coincides with the ground truth delineation. When using intensity information on the
other hand, many curves can incur a close-to-zero cost and prove optimal. Either
behaviour can be desirable, depending on the application.

In practice, when using medical images, individual objects to be segmented


often comprise fairly homogeneous areas in terms of intensity. Therefore,
interpreting the pixel intensities as indicative of object membership can be
a way of encoding the spatial extent or shape of the objects when the exact
delineations are unknown. Figure 4.4 illustrates how optimal segmentation
curves may change, depending on what is the focus of the training loss. Since
the main problem in training with point annotations is that the segmentation
curve may be pushed inwards, causing undersegmentation, we proposed using
Boundary loss (BL) with intensity-based distances to counteract it.

4.2.2 Methods and background experiments


There are many distance definitions available that, in some way, consider in-
tensity information. Prior to the work of paper II, we investigated the be-
haviour of the BL with different distances, and conducted experiments with the
Euclidean, Taxicab, Geodesic and MBD distances on fully annotated ISLES
data (see section 2.2 for details) in the related paper P2 [12]. Figure 4.5 il-
lustrates the difference in the appearance of DTs on an example ISLES slice,
using Euclidean, Geodesic and Minimum barrier distances.

Figure 4.5. Example image from the ISLES dataset (see section 2.2) with its ground
truth contour and corresponding signed distance maps: spatial proximity-based, both
intensity- and spatial proximity-based, and only intensity-based, respectively.

ISLES data is highly imbalanced, with the lesion class representing only a
very small area, so not all 2D slices contain lesions. Hence the signed DTs
used for BL were computed directly on the 3D scans. Training with U-Net
[82] (see [12] for specific experimental settings) produced the training curves
given in Figure 4.6.

Figure 4.6. Training and validation curves showing DSC and HD95 scores on training
with fully annotated ISLES data. GDL denotes training with generalized DSC, and
BL(·) denotes training with boundary loss using the specified distance.

According to these curves, the fully intensity-based distance, MBD, may accelerate the training and overfit quicker. In addition, we
noticed that the training curves for BL with Geodesic distance follow the orig-
inal curves (BL in Euclidean setting) fairly tightly, incurring only a small loss
of performance towards the end of the training. That can be attributed to over-

penalizing the close-to-exact segmentation, since the Geodesic-based BL pe-
nalizes both spatial and intensity differences.
These experiments further confirmed our decision to focus on fully intensity-
based distances. In paper II we thus proposed using BL with MBD for segmen-
tation training under point supervision. For completeness, we again compare it
to Euclidean and Geodesic distance, but also to another purely intensity-based
distance, using Equation 3.15 with λ = 1.

4.2.3 Results
The experiments of paper II were run on 2D slices of the ACDC and POEM
datasets (see section 2.2). Since both datasets come with full pixel-wise an-
notations, we created synthetic point annotations randomly. For faster experi-
mentation, we used the E-Net [75] architecture. As the point annotations were
created per slice, even the DTs were computed on individual slices. The pro-
posed combination of BL with MBD was compared not only to combinations
with other distances but also to the state-of-the-art approach in weak segmen-
tation, using the Conditional Random Field (CRF) regularized loss [97]. For
more details regarding the experiments consult paper II.

Using 2D maps
The experiments done on the ACDC dataset showed that the proposed ap-
proach, using BL with MBD, performed best in terms of the highest reached
Dice. It was also characterized by a slower collapse towards the weak ground
truth, providing a wider interval of epochs for choosing the best model. The
method using CRF-loss was not able to compete with BL derivatives, on av-
erage. Inspecting individual runs, however, showed that it could potentially
outperform the proposed combination of BL and MBD, but the performance
is very unstable, varying a lot across runs.
The Dice scores achieved in training on POEM data, on the other hand, were
less convincing. The combination of BL and MBD was on par with, but not
significantly better than, the other combinations. We suspect that was due to
the relatively low resolution of the images, allowing the MBD distance values
to more easily bleed through object boundaries and leading to more spatially
inconsistent, fragmented segmentation regions.

Using 3D maps
When training on coronal slices of the POEM data, most slices will lack at
least some of the foreground classes. By design, the DT on a slice where
annotation for some class is absent will be 0, allowing for unpenalized over-
segmentations of that class. Instead of artificially setting them to some pre-
defined nonzero value, one can use slices of a DT calculated on the entire 3D
subject. In paper II, we ran an additional set of experiments with 3D distance

maps. The overall performance is improved, but the MBD, Geodesic and In-
tensity distances still perform on par. Due to the additional dimension, the
MBD distance has even more potential for bleed-through, which could be the
reason behind the lack of improvement.

Time considerations
The DTs for use with boundary loss need to be computed per class (and po-
tentially per channel), which increases the importance of the computational
burden of distance map calculations. To evaluate the tradeoff between the in-
crease in segmentation accuracy and required time, we also benchmarked the
computation times of different distance maps in paper II. As the MBD was the
only distance not computed with the use of GPU, it was expectedly most time-
demanding. However, the time requirements are still reasonable for practical
use when the DTs can be precomputed before training, particularly in 2D.

4.2.4 Conclusions
We proposed the use of the purely intensity based Minimum barrier distance
to harness the advantages of using Boundary loss even in weakly supervised
segmentation training. Based on its definition, the Boundary loss appears to
be directly incompatible with inexact ground truth boundaries. But through
extensive experiments we showed that, perhaps counterintuitively, the Bound-
ary loss behaves well even under extremely weak forms of supervision when
combined with intensity-based distances. The experiments confirmed that, un-
der point annotations, the combination of BL with MBD produces promising
results, but potentially requires additional preprocessing for some datasets.

5. Cross-modality image retrieval

As described in subsection 3.1.2, image retrieval today plays an important role


in diagnostic pipelines. Medical doctors can, for example, use large databases
of images with known pathologies to confirm their diagnosis or generate a set
of viable ones based on the new patient query data [72]. With new emerg-
ing technologies it is becoming increasingly desirable to allow for searches
across different modalities. While there are studies available on image re-
trieval across radiological modalities, like CT and MRI [68], works using
histo(patho)logical images have mainly been focused on within-modality re-
trieval [105], or retrieval across text and image data [43].
This chapter summarizes the work done in paper III, where we develop
a cross-modality (sub-)image retrieval pipeline with a focus on microscopy
images.

5.1 Motivation
Examination of histological images (especially hematoxylin and eosin stained
brightfield (BF) microscopy images) forms the basis for many medical diag-
noses (e.g. cancer) [80]. These images are typically acquired in whole slide
image (WSI) scanners that generally only capture a single modality. However,
different modalities can capture different, complementary information about
the data, providing further insights into potential pathologies. In particular,
second harmonic generation (SHG) has been found to facilitate content under-
standing when examined together with the corresponding BF images.
To be able to utilize the large databases of BF and SHG images together, it
is thus important to be able to query them across modalities. Furthermore, BF
images captured with WSI scanners generally cover a large tissue area (even
up to 100,000 × 100,000 pixels), while the coverage of SHG images (at the same
resolution) is typically much smaller, illustrating the need for methods that
can handle sub-image retrieval. Additionally, as SHG and BF images of the
corresponding specimen are often not taken simultaneously or with the same
machine, it may be necessary to perform the retrieval on the instance level.
In paper III we therefore proposed a modular method, able to handle instance-
level cross-modal histological image retrieval across very dissimilar modali-
ties, and evaluated it on the multimodal histological dataset of BF and SHG
images (see section 2.2 for the details).

5.2 Methods
Our proposed pipeline consists of three main steps. The first step provides a way
to bridge the gap between the modalities and bring them closer together. In
the second one, features are extracted and binned (through the use of a bag of
words (BoW) model) to make them comparable across images. Using these
BoW encodings, pairwise similarities can be computed as the criterion for
ranking the matches. The third step consists of re-applying the second step
to the few best matches chosen in the first pass, reranking them for increased
accuracy.

5.2.1 Step I: Bridging the modality gap


Learning new image representations by means of contrastive learning is a
commonly used strategy for embedding two different domains into a com-
mon space. It is based on training with positive and negative samples, with
the aim of bringing the positive ones closer together in their embeddings and
pushing the negative ones further apart. In paper III, we employed the so-
called Contrastive Multimodal Image Representations (CoMIRs) [78] as the
common embeddings of the two modalities. The network structure used in the
contrastive learning process to produce CoMIRs differs from the typical set-
tings in that it relies on the full U-Net [82] architecture, producing 2D (image-
like) embeddings by design. See Figure 5.1 for examples of images and their
CoMIRs.

Figure 5.1. The first and third columns show example BF (above) and SHG (below)
image pairs [32] with their corresponding CoMIR [78] embeddings in the second and
last column.
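The exact CoMIR criterion is given in [78]; the following is only a simplified sketch of the underlying contrastive idea, an InfoNCE-style loss over a batch of paired, pooled embeddings, with all names and shapes chosen for illustration rather than taken from the implementation used in paper III.

import torch
import torch.nn.functional as F

def info_nce_loss(emb_a, emb_b, tau=0.1):
    # emb_a, emb_b: (N, D) pooled embeddings of the two modalities; row i of
    # each tensor is a positive pair, all other rows act as negatives.
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / tau                   # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric cross-entropy: match a -> b and b -> a
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))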

5.2.2 Step II: Feature extraction and matching


As the CoMIRs created in step I are image-like representations, we employed
the SURF feature extractor on them. For similarity matching, the detected
features are binned through a BoW model with a reasonably large vocabulary
size of 20,000. The histogram encodings of the query and database images are
compared via the cosine similarity measure (see subsection 3.1.2). The larger
the similarity score, the better (higher ranked) the match.
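A minimal sketch of this step, assuming the local (e.g. SURF) descriptors have already been extracted into NumPy arrays and using scikit-learn's k-means to build the vocabulary; function and parameter names are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(pooled_descriptors, n_words=20000):
    # Cluster descriptors pooled over the database images into visual words.
    return KMeans(n_clusters=n_words, n_init=1, random_state=0).fit(pooled_descriptors)

def bow_histogram(descriptors, vocab):
    # L2-normalized histogram of visual-word occurrences for one image.
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def cosine_similarity(query_hist, db_hist):
    # The larger the score, the higher the match is ranked.
    return float(np.dot(query_hist, db_hist))  # inputs are already L2-normalized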

5.2.3 Step III: Reranking Strategies


As the last step of the pipeline presented in paper III, we proposed a reranking
of the first few matches produced by step II. It is based on the SURF features
computed on the query and the top few retrieval matches, combining them
through another BoW in the same way as described in step II. For the case of
sub-image retrieval, we propose to cut each top retrieved match into the minimal
number of (equidistantly placed) query-sized patches fully covering it, prior
to computing the new BoW model for reranking.
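As a hedged illustration of the patch-cutting step, a small sketch (2D single-channel arrays assumed) that places the minimal number of equidistant query-sized patches so that they fully cover a retrieved image:

import numpy as np

def query_sized_patches(image, query_shape):
    # Minimal number of equidistantly placed query-sized patches covering `image`.
    ih, iw = image.shape[:2]
    qh, qw = query_shape
    ny = max(1, int(np.ceil(ih / qh)))
    nx = max(1, int(np.ceil(iw / qw)))
    ys = np.linspace(0, ih - qh, ny).round().astype(int) if ih > qh else [0]
    xs = np.linspace(0, iw - qw, nx).round().astype(int) if iw > qw else [0]
    return [image[y:y + qh, x:x + qw] for y in ys for x in xs]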
While paper III only describes and uses one specific reranking strategy
to further improve the retrieval results, this particular strategy was evaluated
through further experiments presented in related work R2. In those experi-
ments, we compared three reranking strategies that we here denote simply by
Strategy A-C for brevity.
Strategy A stands for reranking based on local features, i.e. by comparing
individual SURF features of the query and top matches, sorting them accord-
ing to the number of feature matches. Within Strategy B, reranking was per-
formed through feature-based registration. More specifically, the M-estimator
SAmple Consensus (MSAC) algorithm was applied to compute the transfor-
mation matrix between the features of the query and top matches. Strategy C
denotes the strategy we proposed to use within our pipeline.

Figure 5.2. The top-1, 5 and 10 retrieval success using various reranking methods on
the task of retrieval across BF and SHG modalities and transformations. BF in SHG
denotes searching through images of SHG modality for a BF query and vice versa.

The results of using no reranking versus reranking the first 15 matches with
the three described reranking methods on the problem of retrieval across trans-
formations and BF, SHG modalities are summarized in Figure 5.2.

Based on these limited experiments, we confirm that the reranking method
we chose to use within our proposed pipeline is a reasonable option.

5.3 Results
Replacement study
To further confirm the proposed pipeline design choices in paper III, we per-
formed a replacement study, swapping the individual parts of the pipeline for
a few viable alternatives. More concretely, we used the Pix2Pix [47] and Cy-
cleGAN [106] architectures as alternatives for bridging the modality gap. For
feature extraction we evaluated SURF [7], SIFT [64] and ResNet [40]. In-
stead of feature extraction with BoW encodings, we tried a recent reverse im-
age search tool using 2D Krawtchouk Descriptors (2DKD) [25]. The results
showed that our proposed pipeline outperformed all other tested combinations
and was the only one able to deal both with the retrieval across modalities and
transformations.

Our pipeline and competitors


Our proposed pipeline also clearly outperformed the two competitor meth-
ods we used for comparison (an instance-level retrieval method designed for
sketch retrieval [57], and a medical image retrieval method with implied ap-
plicability in cross-modal retrieval [79]), reaching an average 79.5% top-10
retrieval success (on whole images) across modalities and transformations af-
ter reranking (of the first 30 matches). The competing methods achieved
below 50% success.

5.4 Conclusions
Among the current state-of-the-art methods in cross-modality image retrieval,
few apply to instance-level retrieval across modalities as different as BF and
SHG. In addition, many are specific to category-level retrieval (requiring more
than one instance per category for training) [28]. In paper III, we proposed a
modular image retrieval pipeline for query by example across imaging modal-
ities on an instance level. We motivated the design choices and illustrated the
superiority of our method through extensive experimentation. We also dis-
cussed the reasons behind the failures of the compared methods and steps.

6. Handling statistical analyses in Imiomics

The most common way of joining the imaging data with non-imaging-based
parameters for use in research or diagnostics to date is to extract features and
measurements of interest from the images and examine them together with
other parameters. However, that requires concrete decisions on what those
features of interest might be and cannot represent the full richness of informa-
tion present in images.
A concept called Imiomics (imaging-omics) [94, 93] instead aims to do the
opposite and integrate the non-imaging measurements onto the whole-body
imaging data to preserve as much underlying information from the images as
possible. First, all subjects involved in the analyses are registered together so
that their pixel values at individual locations are comparable. Then the chosen
non-imaging parameters can be merged with the registered images through
per-pixel application of functions (e.g. correlation) depending on both the
parameter and the underlying pixel value. Instead of the raw intensity values,
the Jacobian determinants (JD) of the displacement fields, recovered by the
registration procedure, can be used [94]. These can be considered a measure of
areal/volumetric changes between subjects. Examples of analyses that can be
performed through Imiomics include creating a healthy whole-body imaging
atlas, anomaly detection, and cross-sectional and longitudinal studies.
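As an illustration of this per-pixel integration (a simplified sketch, not the Imiomics implementation; the array layout and the choice of Pearson correlation are assumptions), a correlation map between a non-imaging parameter and registered JD volumes can be computed as follows.

import numpy as np

def correlation_map(jd, parameter):
    # jd: (n_subjects, *volume_shape) registered Jacobian determinants,
    # parameter: (n_subjects,) non-imaging measurement.
    n = jd.shape[0]
    x = jd.reshape(n, -1).astype(np.float64)
    xc = x - x.mean(axis=0)
    yc = parameter - parameter.mean()
    r = xc.T @ yc / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return r.reshape(jd.shape[1:])  # per-voxel Pearson correlation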
While presenting information as an image can be more informative, it is
also subject to multiplicity issues (as discussed in section 3.4) and can thus
complicate the detection of true deviations/activity/significance in the data.
Numerous correction methods are available, but their performance depends on the
underlying data. In paper IV, we develop paradigms for correction evaluation on
whole-body scans, benchmark the methods for use within Imiomics and pro-
pose method extensions. This chapter briefly presents the key results of the
work. For more details and in-depth discussions, refer to the text of the origi-
nal paper.

6.1 Motivation
As mentioned in section 3.4, most of the correction methods have been de-
veloped with neuroimaging applications in mind. While the general issues
of image smoothness and test dependence persist across all medical imaging
applications, differences in modalities and scan regions severely affect com-
pliance with various assumptions of correction methods and their performance
on the data.

It is therefore essential to evaluate the methods not only on synthetic data but
also on the datasets of interest, in order to establish principled correction proce-
dures. But the evaluation process is not always straightforward. In functional
neuroimaging, the standard approach is to evaluate them using special no-
activation scans, which may be impossible to acquire in other applications.
In paper IV, we thus performed (to the best of our knowledge) the first val-
idation of correction methods on large-scale imaging data, specifically whole-
body MRI, and proposed an Imiomics-specific evaluation strategy. In addition,
anatomy-compliant method improvements were developed.

6.2 Methods
6.2.1 Evaluation strategies
For the particular case of correlation analyses, we propose two strategies for
correction method evaluation. The first one is done in the absence of activity,
analogous to the no-activation-based evaluations from neuroimaging studies.
Since it is not possible to acquire a non-imaging parameter that would surely
not be correlated with any signal in the body, we suggest using a mix of ar-
tificial and real-life data and constructing a synthetic set of measures to be
correlated with the imaging data.
The second evaluation strategy relies on data with known presence of activ-
ity pattern to empirically evaluate the type II error rate. Such evaluation can be
of interest as type I and type II error rates are nontrivially connected [56]. As-
suming that the tested methods really do keep the FWER below the imposed
upper bound, their performance can be further evaluated via sensitivity (i.e.
type II errors).
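A minimal sketch of the first (null-case) strategy; the array layout is assumed, and a plain Bonferroni threshold stands in for the correction methods actually benchmarked in paper IV.

import numpy as np
from scipy import stats

def empirical_fwer(jd_maps, n_rep=200, alpha=0.05):
    # jd_maps: (n_subjects, n_voxels) registered image-derived values.
    # Repeatedly correlate with a synthetic N(0,1) 'measurement' and count how
    # often any voxel survives correction (Bonferroni used as a placeholder).
    n_sub, n_vox = jd_maps.shape
    xc = jd_maps - jd_maps.mean(axis=0)
    xn = np.linalg.norm(xc, axis=0)
    false_alarms = 0
    for _ in range(n_rep):
        y = np.random.randn(n_sub)
        yc = y - y.mean()
        r = xc.T @ yc / (xn * np.linalg.norm(yc) + 1e-12)
        t = r * np.sqrt((n_sub - 2) / np.clip(1.0 - r ** 2, 1e-12, None))
        p = 2.0 * stats.t.sf(np.abs(t), df=n_sub - 2)
        false_alarms += int((p < alpha / n_vox).any())
    return false_alarms / n_rep  # empirical family-wise error rate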

6.2.2 Anatomically compliant corrections


Imiomics is based on the use of whole-body scans, which possess a specific
(known) anatomy. The idea is thus to build this knowledge into the correction
methods. Prior spatial information has been included in the corrections for
example in [30], focusing on smaller regions of interest. However, that again
results in discarding potentially interesting data from the analyses and is thus
unsuitable for Imiomics. At the time of writing, the only work aiming to in-
clude the anatomy in the corrections without limiting them to a specific region
was that of [63], which uses brain-specific symmetry priors.
We proposed two ways of making the corrections anatomy-compliant: by
limiting FWER on predefined anatomy-based clusters, or by limiting the clus-
ter support within CBA, TFCE or permutation methods.

Correction on predefined anatomy-based clusters
When it is reasonable to assume that any true activity will always encompass
the entire structure/tissue/organ, the multiplicity problem can be alleviated by
treating individual structures as cluster units for inference.
A simple mean of the signal over each structure cluster could boost the
signal-to-noise ratio. However, it can be sensitive to incorrect delineations.
Hence the uncertainties of structure membership at each pixel should be com-
pensated for. Our proposed summarizing statistic for segmentation-defined
clusters uses a weighted average of the pixel signals in each cluster as the
cluster signal. The weights are chosen such that they decrease towards the cluster
boundaries, representing the segmentation accuracy. If segmentation is done
in an automated manner, the network’s uncertainty can be used directly. If, on
the other hand, the segmentation is performed manually by multiple experts,
the fraction of the experts annotating a particular pixel as the given structure
can be used as that pixel’s accuracy/weight. Formally, given a structure seg-
mentation U with a known accuracy function a : Ω → [0, 1], the signal S of
that structure cluster is defined as
    S = \frac{1}{N} \sum_{x \in U} x \cdot a(x)                    (6.1)

where N is the cardinality (extent in pixels) of U.
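A minimal sketch of this statistic for one anatomical cluster (names and inputs are chosen for illustration; the accuracy map could, as described above, come from network confidence or annotator agreement):

import numpy as np

def cluster_signal(values, cluster_mask, accuracy):
    # values: per-voxel signal, cluster_mask: boolean segmentation U,
    # accuracy: a(x) in [0, 1] for every voxel.
    u = cluster_mask.astype(bool)
    n = u.sum()  # N: extent of U in voxels
    return float((values[u] * accuracy[u]).sum() / n)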


To ensure appropriate FWER control over the anatomically-defined clus-
ters, one may then employ any of the described correction methods. Since the
multiplicity is now much lower and the tests more independent, even the oth-
erwise severe stringency of Bonferroni-type methods is no longer a problem.

Limiting cluster support


As many popular correction methods are cluster-based, and the clusters they
allow for can be of arbitrary extents, we proposed making them more anatomy-
compliant by limiting the cluster support to the underlying anatomical regions.
In TFCE, for example, this is done by integrating only over the cluster support
that falls within a single anatomical region; in CBA and cluster-extent thresholding,
by allowing individual clusters to extend only within the underlying regions.
Such an extension to the TFCE score can have negative implications for its use
due to varying structure sizes (i.e. if organ A is larger than organ B, its overall
signal need not be as large as that of organ B for both to be detected active).
Thus we proposed to adjust the TFCE score by scaling with anatomical
structure sizes (maximal allowed clusters):
    TFCE(x) = \frac{e_{max}}{e_X} \sum_{h=h_0}^{h_N} e(h)^E \cdot h^H \, dh          (6.2)

where x ∈ X, e_X is the extent of structure X, e_max = max_X e_X, and the
thresholded extent e(h) is computed only within X.
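A hedged sketch of this size-adjusted, anatomy-limited score, using simple connected components and commonly used exponent values; it illustrates the idea rather than reproducing the implementation of paper IV.

import numpy as np
from scipy.ndimage import label as connected_components

def anatomy_limited_tfce(stat_map, regions, h0=0.0, dh=0.1, E=0.5, H=2.0):
    # stat_map: voxel-wise statistic; regions: integer map of anatomical
    # structures (0 = outside). Clusters are restricted to one structure and
    # each structure's score is scaled by e_max / e_X.
    out = np.zeros(stat_map.shape, dtype=np.float64)
    structure_ids = [r for r in np.unique(regions) if r != 0]
    extents = {r: int((regions == r).sum()) for r in structure_ids}
    e_max = max(extents.values())
    thresholds = np.arange(h0 + dh, stat_map.max() + dh, dh)
    for r in structure_ids:
        in_region = regions == r
        scale = e_max / extents[r]
        for h in thresholds:
            comps, n = connected_components((stat_map >= h) & in_region)
            for c in range(1, n + 1):
                cluster = comps == c
                e_h = cluster.sum()  # thresholded extent, computed only within X
                out[cluster] += scale * (e_h ** E) * (h ** H) * dh
    return out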

6.3 Results
We evaluate several known correction procedures, namely Holm, RFT, TFCE,
permutation method and CBA, on correlation analyses for Imiomics. They are
compared to the proposed anatomy-based method and extensions through both
proposed evaluation strategies, namely in the assumed absence and known
presence of a signal. All the experiments are performed using a subset of the
POEM dataset (see section 2.2), split by gender, together with the Bioimpedance
Analysis (BIA) measurements. The significance level was set to 0.05, and the
primary cluster defining threshold to 0.001 where applicable. The BIA mea-
sure represents the total fat in the body and is expected to relate strongly to the
Jacobian determinants of the registration displacement fields. The integration
of BIA with image data was thus done through correlation maps (with JD).

In absence of activity
For the null case, a random vector from a standard normal distribution was
used in correlations. The nominal error rates over 200 repetitions showed that,
while cluster-based methods are typically better at retaining true activity, they
tend to also be more lenient with false positives. The anatomy-compliant ex-
tension to the permutation method (i.e. looking directly at the cluster extents)
offers an improvement over the original, but has a somewhat opposite effect
in TFCE and CBA.

In presence of activity
Analyzing correlations of BIA with JD, the (true) signal was expected to be
present in high-fat regions identified a priori, based on prior knowledge and
manual segmentation.
The number of voxels retaining activity post-correction was interpreted as
indicative of the method stringency. Ideally, all the observed activity inside
the areas with high-fat content should be retained. But according to theory,
some level of true activity outside of fat areas cannot be excluded. Through
examination of the voxel-activity histogram, we deduced that the cluster-based
methods are generally better at retaining all the activity in the fat regions and
that our proposed anatomy-based extensions improve true-activity retention.

6.4 Conclusions
We have proposed strategies for correction method evaluation on whole-body
MRI and Imiomics analyses, as well as adapted versions of various correc-
tion methods for more interpretable, anatomy-compliant error control. The
performed study constituted the first large-scale evaluation of the methods on
real-life data outside of neuroimaging.

With respect to the aims of the project, we proposed methods that improve
the robustness and provided an in-depth discussion on the appropriateness of
available methods for statistical evaluations in Imiomics.
Limitations of the study include the correlation-restricted analyses and sen-
sitivity of proposed methods to uncertainties of anatomical priors (they require
accurate delineations and/or a good uncertainty quantification of the inaccura-
cies). For corrections on predefined clusters directly, the statistic values may
not be trivially comparable among clusters for arbitrary accuracy functions. If
only the outermost segmentation borders are assumed to have lower accuracy,
for example, the method may be biased towards detecting signals from large,
circular areas (with a lower boundary-to-area ratio).

7. Conclusions and future work

To make medical care more widely accessible and enable high-throughput
screenings, it is essential to try to support the medical diagnosis- and decision-
making pipelines with automated methods wherever possible. The image pro-
cessing and analysis methods, developed and discussed in this thesis, were all
developed with specific biomedical applications in mind and are applicable at
different stages of a diagnostic process, as described in the introduction.

In papers I and II, we focused on segmentation problems, with the more spe-
cific aims of relaxing the full annotation requirements for training DL models
and improving patch learning to enable training on images too large to fit in
a contemporary GPU memory. In paper I, we proposed a way to introduce
relative patch location and extent information through landmark-based dis-
tance maps. Validation with three different networks showed significant im-
provements in the segmentation accuracy and training stability on the task of
abdominal organ segmentation from whole-body MRI, enabling better patch
learning with a negligible increase in memory requirements.

Distance maps were used also in paper II, where we discuss and evaluate
the potential applicability of Boundary loss in segmentation learning under
very weak forms of supervision. We argue for using intensity-based distances
and show that the combination of the Minimum barrier distance and Boundary
loss outperforms the current state of the art in point-supervised training with-
out a need for extensive parameter tuning. With this approach, we managed to
relax the pixel-wise annotation requirements while still achieving satisfactory
segmentation results.

Image retrieval has long been used within diagnostics for retrieving and
comparing similar pathologies in histological images. More recently, with the
increase in available data, it has become popular also within radiology. In
both cases, the importance of being able to search across modalities is clear.
With the goal of establishing a method for accurate and reliable instance-
level retrieval across modalities and transformations, we proposed a three-step
method consisting of contrastive-learning-based 2D image representations, a
bag of words model and a reranking scheme. We provide an extensive replace-
ment study, motivating our design choices. Through comparison with state-of-
the-art methods, we show the clear superiority of the proposed method on the
problem of histological image retrieval across BF and SHG modalities. With
state-of-the-art top-K performance, the results are up to par with respect to the set
aims.

The statistical analyses required for reliable inference are often overlooked
but play just as important a part in medical image analysis as the rest of the
image-processing tools and methods. Our work in paper IV aimed to improve
and evaluate the robustness of statistical analyses for Imiomics on whole-body
MRI. We proposed two evaluation strategies, evaluated a number of known
correction methods and extended their definitions to include priors on the un-
derlying anatomy in the images. We showed, experimentally, that using such
underlying anatomical knowledge can help relieve the multiplicity issue, ful-
filling our aims to improve robustness and reduce the stringency of methods
for Imiomics analyses.

These paper contributions are well aligned with the main objective of the
thesis work, to develop and improve automatic methods supporting medical
(specifically diagnostic) workflows.

Considering the complexity of medical workflows, we only tackled a very
small subset of image-processing applications within diagnostics, and there
is still immense room for high-impact improvements and automation of the
process. Potential avenues for further investigations, in connection to the con-
tributions of this thesis, include developing yet more accurate and explainable
segmentation methods, for example by incorporating additional prior knowledge or
relying on multi-source data in the case of pathologies (patient records, other
types of measurements). In the case of weakly-supervised training with Boundary loss
in particular, the remaining open questions include DT (or channel intensity)
scaling and treatment of individual channels in DT computation. More testing
and development are also needed to reveal whether using a summary intensity
statistic for the entire annotation (or at least its border) would affect the dis-
tance transform smoothness and behaviour in training in a positive way.

While the retrieval pipeline we proposed in this thesis performs well in
comparison to other available methods, we still have a long way to go in
reaching adequate precision. Potential future works therefore include the vali-
dation of our pipeline across datasets and combining the contrastive-learning-
based representations with additional knowledge (potentially based on image
topology or modality specifications), to further improve the top-1 retrieval
accuracy.

Within the work on statistical analyses, we only proposed a very crude way
of using anatomical knowledge, and the possibilities of including it more di-
rectly, in a statistical manner, remain to be explored.

Summary in Swedish

Medical images have played an important role in diagnostics ever since the first X-ray image was taken in 1895. In recent years, both the number of different modalities and the number of acquired images have increased enormously. In addition, image resolution has improved and the time needed to acquire the images has decreased. Taken together, this development means that physicians increasingly rely on medical images for diagnosis and for monitoring of various diseases.

With the large amount of available image data, there is today a shortage of experts to analyse it, and the manual effort required for image processing and analysis has become a bottleneck in the process. Computerized methods have become an important tool for supporting physicians in their work and for speeding up image processing and analysis.

Many computer-based image processing and analysis methods have long been developed with the aim of supporting physicians and simplifying their work. For computer-based methods, automatic as well as semi-automatic, to be applied in medicine, they should be reliable, explainable and reproducible, not least within diagnostics, where the outcome can have major consequences for the patient.

The specific application and the reason for acquiring the images determine how the images are processed and/or analysed, usually with the help of expert knowledge. Image-based diagnostics typically relies on radiological images and microscopy images, which often contain complementary information. While radiological images are often used as a stand-alone tool, they sometimes call for further work-up based on, for example, histological images of biopsies. Correspondingly, an examination of histological images can either be the goal of a medical procedure, or indicate the need for, for example, radiological imaging.

In this thesis we aim to improve methods used within image-based diagnostics, both for radiological images and for microscopy images. We develop or improve methods that can be applied at different steps of a diagnostic pipeline, namely processing of radiological images (through segmentation methods), biomedical image retrieval (with a focus on histological data) and statistical analyses applied within the so-called Imiomics concept.

In papers I and II we work with segmentation methods based on neural networks. In some applications, neural networks need to be trained on cut-out parts, patches, of the images, since medical images are sometimes too large to fit in the memory of modern GPUs. In paper I we improve and stabilize the training of patch-based methods through the use of specially normalized distance transforms based on anatomical landmarks.

Modern learning-based segmentation methods usually require a large amount of annotated data. Since exact annotations are expensive and time-consuming to generate, methods that use incompletely annotated data, based on training with so-called weak supervision, can be used instead. In paper II we present a segmentation method that combines the Boundary loss with the Minimum barrier distance (MBD) for better learning under weak supervision.

Histological images are often stored in large databases together with other information, such as previous diagnoses. When physicians examine a new tissue sample, they can then look at similar tissue samples in a database to make sure that the new diagnosis agrees with other data. Accurate and reliable automatic methods are therefore needed that can search a database and retrieve images similar to the query image.

Paper III introduces such a method, developed specifically for image retrieval of histological data. The method is built on feature extraction, matching and reranking of the identified top results.

For an error-free and reliable medical interpretation of the final image results, and to enable practical use in medical decision making, statistical analyses are often required. In paper IV we evaluate methods for multiple-testing correction in hypothesis testing on whole-body MR images. In addition, we present anatomy-based methods, which were shown to give better results in the correction for multiple tests.

Acknowledgements

This PhD journey has been a long and exhausting one, and yet somehow si-
multaneously also super interesting and fun. It has taught me more than I ever
thought it would, but most importantly, it has brought some amazing people
to my life that I would otherwise not have had the pleasure of meeting.
There are a lot of people I would like to extend my thanks to, for being a
part of my journey and making it better through their presence.
First, to my main supervisor, Robin Strand: thank you for giving me this
opportunity and for your help, support and encouragement throughout these
years. I am immensely grateful for being allowed to explore my own ideas
and do things my way, even when it didn’t work out quite as planned.
Filip Malmberg, thank you for being a great teacher, for entertaining my
curiosity and patiently answering all my random questions. The same thank
you also extends to Justin Pearson. Justin, the writing of this thesis broke our
TDA focus. But I hope we get the chance to collaborate in the future!
Thank you also, Nataša Sladoje and Joakim Lindblad, for the company and the fun
discussions. You have been very good, engaged and supportive collaborators to me -
thank you so much.
A big thank you goes to the seniors I have had the pleasure to teach with
(I’m looking at you, Joachim Parrow and Matteo Magnani): it was a pleasure
to learn from such great educators.
As with most jobs, being a PhD student involves a surprising amount of
bureaucracy-related tasks. Thank you, Anna-Lena Forsberg, Elisabeth Lindqvist
and Inger Hammarin, for helping me navigate through it all. Anna-Lena, you
have done much more for me than your job description entails. Thank you for being
so open, supportive and understanding.
To all the Equal Opportunities group members that I’ve met through the
years - thank you for making the work environment better, and helping me see
through my own biases.
Carolina Wählby and Ida-Maria Sintorn, thank you for your contagious
smiles, and for being the amazing women that you are. I look up to you.
Ida-Maria, you light up the space with your presence. I cannot possibly put
into words how much your hugs and kind words meant to me. Thank you, so
so much, for checking in on me.
To all the past and present PhD students and coworkers at the division,
thank you for creating a welcoming and pleasant workplace! Gabriele Partel,
thank you for your wisdoms and all the delicious food, Anindya Gupta, for
lending a shoulder to lean on, and Teo Asplund, for fun discussions and book
suggestions.

To my PhD-parents-support group, Amanda&Håkan, Elisabeth&Ragnar,
Li&Nicolas, Virginia&Martin: thank you for helping me keep my sanity by
listening to my endless complaining and oversharing. May the tantrums and
VAB stay away from you!
Thank you, Jennifer Alvén, for making conferences fun and for all the par-
enting advice. Nicolas Pielawski, it’s been fun to listen to all your crazy ideas.
I hope we now finally get to work on some of them together. Elisabeth Wet-
zer, thank you for being a great collaborator and an even greater friend. I hope
we get the chance to work together again in the future. Virginia Grande, you
always know exactly what to say when I’m down, making my world a better
place by being there.
Dear Raphaela Heil, without you - breakdown and dehydration. And def-
initely no good thesis figures. Thank you for being a supportive and under-
standing friend, and my IT helpdesk. Raphaela and Nikolaus Huber, I will
miss our breakfasts and random laughs. Movie night soon, Dankeschön!
Håkan Wieslander, thank you for being a great friend and an irreplaceable
gym buddy. Mr Gupta, Ankit! Sharing the office has been super fun, and I
will miss our random deep talks. Thank you for your friendship and encour-
agement, and for always telling it like it is.
To my Slovenian circle, Teja and Martin Tement-štrumbelj, Sanja Obrovnik,
Maša Brumec. For always being there in times of need, no matter how much
time has passed. For being understanding about my lack of planning abilities
and for always carving out some time in your busy schedules for me. Hvala!
And I owe you plenty.
The last people to mention here are those that I hold dearest and without
whom this thesis would either not exist or cost me my sanity. To my dear little
family, Igor&Sever. Thank you, Igor, for all your support, love and encour-
agement. For the warmest hugs and the cosiest moments, for feeding me, for
listening to me, and for taking care of Sever through my late work hours. For
loving me and being annoyed with me in equal measure. You are the Batch-
Norm to my unstable training; my exploding gradients would be even more
explosive without you! Sever, my little sunshine, my explosive bundle of joy.
Thank you for teaching me that there are more important things in life than
work.

References

[1] Chapter 44 - introduction to random field theory. In R. S. Frackowiak, K. J.


Friston, C. D. Frith, R. J. Dolan, C. J. Price, S. Zeki, J. T. Ashburner, and
W. D. Penny, editors, Human Brain Function (Second Edition), pages
867–879. Academic Press, Burlington, second edition edition, 2004.
[2] Chapter 46 - nonparametric permutation tests for functional neuroimaging. In
R. S. Frackowiak, K. J. Friston, C. D. Frith, R. J. Dolan, C. J. Price, S. Zeki,
J. T. Ashburner, and W. D. Penny, editors, Human Brain Function (Second
Edition), pages 887–910. Academic Press, Burlington, second edition edition,
2004.
[3] A. Alzu’bi, A. Amira, and N. Ramzan. Semantic content-based image
retrieval: A comprehensive study. Journal of Visual Communication and
Image Representation, 32:20–54, 2015.
[4] J. Amann, A. Blasimme, E. Vayena, D. Frey, V. I. Madai, and the
Precise4Q consortium. Explainability for artificial intelligence in healthcare: a
multidisciplinary perspective. BMC Medical Informatics and Decision
Making, 20, 2020.
[5] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional
encoder-decoder architecture for image segmentation.
[6] S. Banerjee, D. Toumpanakis, A. K. Dhara, J. Wikström, and R. Strand.
Topology-aware learning for volumetric cerebrovascular segmentation. In
2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI),
pages 1–4, 2022.
[7] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In
A. Leonardis, H. Bischof, and A. Pinz, editors, Computer Vision – ECCV
2006, pages 404–417, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
[8] C. M. Bennett, M. B. Miller, and G. L. Wolford. Neural correlates of
interspecies perspective taking in the post-mortem atlantic salmon: An
argument for multiple comparisons correction. Neuroimage, 47(Suppl
1):S125, 2009.
[9] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng,
I. Cetin, K. Lekadir, O. Camara, M. A. Gonzalez Ballester, G. Sanroma,
S. Napel, S. Petersen, G. Tziritas, E. Grinias, M. Khened, V. A. Kollerathu,
G. Krishnamurthi, M.-M. Rohé, X. Pennec, M. Sermesant, F. Isensee, P. Jäger,
K. H. Maier-Hein, P. M. Full, I. Wolf, S. Engelhardt, C. F. Baumgartner, L. M.
Koch, J. M. Wolterink, I. Išgum, Y. Jang, Y. Hong, J. Patravali, S. Jain,
O. Humbert, and P.-M. Jodoin. Deep learning techniques for automatic MRI
cardiac multi-structures segmentation and diagnosis: Is the problem solved?
IEEE Transactions on Medical Imaging, 37(11):2514–2525, 2018.
[10] S. Beucher. Use of watersheds in contour detection. In Proc. Int. Workshop on
Image Processing, Sept. 1979, pages 17–21, 1979.

[11] C. M. Bishop. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer, 1 edition, 2007.
[12] E. Breznik and R. Strand. Effects of distance transform choice in training with
boundary loss, 2021. Online, Retrieved from
http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-499054.
[13] J. S. Bridle. Probabilistic interpretation of feedforward classification network
outputs, with relationships to statistical pattern recognition. In NATO
Neurocomputing, 1989.
[14] E. N. Brown and M. Behrmann. Controversy in statistical analysis of
functional magnetic resonance imaging data. Proceedings of the National
Academy of Sciences of the United States of America, 114:E3368–E3369,
2017.
[15] F. Cabitza, A. Campagner, F. Soares, L. G. Guadiana-Romualdo, F. Challa,
A. Sulejmani, M. Seghezzi, and A. Carobene. The importance of being
external: methodological insights for the external validation of machine
learning models in medicine. Computer Methods and Programs in
Biomedicine, 208:106288, 2021.
[16] K. C. Ciesielski, R. Strand, F. Malmberg, and P. K. Saha. Efficient algorithm
for finding the exact minimum barrier distance. Computer Vision and Image
Understanding, 123:53–64, 2014.
[17] M. D. Cirillo, D. Abramian, and A. Eklund. Vox2vox: 3d-gan for brain tumour
segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and
Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in
Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised
Selected Papers, Part I 6, pages 274–284. Springer, 2021.
[18] E. Commission. 2018 reform of EU data protection rules (GDPR,
corrigendum). Available at http:
//data.europa.eu/eli/reg/2016/679/corrigendum/2018-05-23/oj,
accessed 2023-03-10.
[19] H. Cramér. Mathematical methods of statistics, volume 26. Princeton
university press, 1999.
[20] W. Crum, O. Camara, and D. Hill. Generalized overlap measures for
evaluation and validation in medical image analysis. IEEE Transactions on
Medical Imaging, 25(11):1451–1461, 2006.
[21] V. Curic, J. Lindblad, N. Sladoje, H. Sarve, and G. Borgefors. A new set
distance and its application to shape registration. Pattern Analysis and
Applications, 17:141–152, 2014.
[22] S. Dangi, C. A. Linte, and Z. Yaniv. A distance map regularized CNN for
cardiac cine MR image segmentation. Medical physics, 46(12):5637–5651,
2019.
[23] P.-E. Danielsson. Euclidean distance mapping. Computer Graphics and Image
Processing, 14(3):227–248, 1980.
[24] T. Dash, S. Chitlangia, A. Ahuja, and A. Srinivasan. A review of some
techniques for inclusion of domain-knowledge into deep neural networks.
Scientific Reports, 12, 2022.
[25] J. S. DeVille, D. Kihara, and A. Sit. 2DKD: a toolkit for content-based local
image search. Source Code Biol Med., 2020.

[26] M. M. Deza and E. Deza. Encyclopedia of Distances. Springer, 2013.
[27] L. R. Dice. Measures of the amount of ecologic association between species.
Ecology, 26(3):297–302, 1945.
[28] S. R. Dubey. A decade survey of content based image retrieval using deep
learning. IEEE Transactions on Circuits and Systems for Video Technology,
32(5):2687–2704, 2022.
[29] O. J. Dunn. Multiple comparisons among means. Journal of the American
Statistical Association, 56(293):52–64, 1961.
[30] S. B. Eickhoff, S. Heim, K. Zilles, and K. Amunts. Testing anatomically
specified hypotheses in functional imaging using cytoarchitectonic maps.
NeuroImage, 32(2):570–582, 2006.
[31] A. Eklund, T. E. Nichols, and H. Knutsson. Cluster failure: Why fMRI
inferences for spatial extent have inflated false-positive rates. Proceedings of
the national academy of sciences, 113(28):7900–7905, 2016.
[32] K. Eliceiri, B. Li, and A. Keikhosravi. Multimodal biomedical dataset for
evaluating registration methods (patches from TMA cores). zenodo
https://zenodo.org/record/3874362, June 2020.
[33] C. Fouard, R. Strand, and G. Borgefors. Weighted distance transforms
generalized to modules and their computation on point lattices. Pattern
Recognition, 40(9):2453–2474, 2007.
[34] M. Ghafoorian, N. Karssemeijer, T. Heskes, M. Bergkamp, J. Wissink,
J. Obels, K. Keizer, F.-E. de Leeuw, B. van Ginneken, E. Marchiori, et al.
Deep multi-scale location-aware 3D convolutional neural networks for
automated detection of lacunes of presumed vascular origin. NeuroImage:
Clinical, 14:391–399, 2017.
[35] M. Ghafoorian, N. Karssemeijer, T. Heskes, I. W. van Uden, C. I. Sanchez,
G. Litjens, F.-E. de Leeuw, B. van Ginneken, E. Marchiori, and B. Platel.
Location sensitive deep convolutional neural networks for segmentation of
white matter hyperintensities. Scientific Reports, 7(1):1–12, 2017.
[36] R. C. Gonzalez. Digital image processing. Pearson, New York, NY, fourth
edition, global edition. edition, 2018.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani,
M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in
Neural Information Processing Systems, volume 27. Curran Associates, Inc.,
2014.
[38] I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press,
Cambridge, MA, USA, 2016.
[39] A. Hakim, S. Christensen, S. Winzeck, M. G. Lansberg, M. W. Parsons,
C. Lucas, D. Robben, R. Wiest, M. Reyes, and G. Zaharchuk. Predicting
infarct core from computed tomography perfusion in acute ischemia with
machine learning: Lessons from the ISLES challenge. Stroke,
52(7):2328–2337, 2021.
[40] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image
recognition. In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 770–778, 2016.
[41] R. Heller, D. Stanley, D. Yekutieli, N. Rubin, and Y. Benjamini. Cluster-based

analysis of fMRI data. NeuroImage, 33(2):599–608, 2006.
[42] S. Holm. A simple sequentially rejective multiple test procedure.
Scandinavian journal of statistics, pages 65–70, 1979.
[43] D. Hu, F. Xie, Z. Jiang, Y. Zheng, and J. Shi. Histopathology cross-modal
retrieval based on dual-transformer network. In 2022 IEEE 22nd International
Conference on Bioinformatics and Bioengineering (BIBE), pages 97–102,
2022.
[44] D. Huttenlocher, G. Klanderman, and W. Rucklidge. Comparing images using
the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 15(9):850–863, 1993.
[45] J. P. A. Ioannidis. Excess Significance Bias in the Literature on Brain Volume
Abnormalities. Archives of General Psychiatry, 68(8):773–780, 08 2011.
[46] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein.
nnU-Net: a self-configuring method for deep learning-based biomedical image
segmentation. Nature methods, 18(2):203–211, 2021.
[47] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with
conditional adversarial networks. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017.
[48] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one
hundred layers Tiramisu: Fully convolutional densenets for semantic
segmentation. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), pages 1175–1183, 2017.
[49] L. Jiao and J. Zhao. A survey on the new generation of deep learning in image
processing. IEEE Access, 7:172231–172263, 2019.
[50] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K.
Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3D CNN with fully
connected CRF for accurate brain lesion segmentation. Medical Image
Analysis, 36:61–78, 2017.
[51] P.-Y. Kao, S. Shailja, J. Jiang, A. Zhang, A. Khan, J. W. Chen, and B. S.
Manjunath. Improving patch-based convolutional neural networks for MRI
brain tumor segmentation by leveraging location information. Frontiers in
Neuroscience, 13, 2020.
[52] D. Karimi and S. E. Salcudean. Reducing the hausdorff distance in medical
image segmentation with convolutional neural networks. IEEE Transactions
on Medical Imaging, 39(2):499–513, 2020.
[53] A. Keikhosravi, J. S. Bredfeldt, A. K. Sagar, and K. W. Eliceiri. Chapter 28 -
second-harmonic generation imaging of cancer. In Quantitative Imaging in
Cell Biology, volume 123 of Methods in Cell Biology, pages 531–546.
Academic Press, 2014.
[54] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ben
Ayed. Boundary loss for highly unbalanced segmentation. Medical Image
Analysis, 67:101851, 2021.
[55] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi. A survey of the recent
architectures of deep convolutional neural networks. Artificial intelligence
review, 53:5455–5516, 2020.
[56] M. D. Lieberman and W. A. Cunningham. Type I and Type II error concerns in
fMRI research: re-balancing the scale. Social Cognitive and Affective

Neuroscience, 4(4):423–428, 12 2009.
[57] H. Lin, Y. Fu, P. Lu, S. Gong, X. Xue, and Y.-G. Jiang. TC-Net for ISBIR:
Triplet classification network for instance-level sketch based image retrieval.
In Proc. ACM Intl. Conf. on Multimedia, page 1676–1684. ACM, 2019.
[58] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense
object detection. In Proceedings of the IEEE international conference on
computer vision, pages 2980–2988, 2017.
[59] L. Lind. Relationships between three different tests to evaluate
endothelium-dependent vasodilation and cardiovascular risk in a middle-aged
sample. Journal of Hypertension, 31:1570–1574, 2013.
[60] J. Lindblad and N. Sladoje. Linear time distances between fuzzy sets with
applications to pattern matching and classification. IEEE Transactions on
Image Processing, 23(1):126–136, 2014.
[61] M. A. Lindquist and A. Mejia. Zen and the art of multiple comparisons.
Psychosomatic medicine, 77:114–125, 2015.
[62] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian,
J. A. van der Laak, B. van Ginneken, and C. I. Sánchez. A survey on deep
learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
[63] G. Lohmann, J. Neumann, K. Mueller, and J. Lepsien. The multiple
comparison problem in fMRI: a new method based on anatomical priors.
MICCAI Workshop on Analysis of Functional Images, 1-8 (2008), 01 2008.
[64] D. Lowe. Object recognition from local scale-invariant features. In
Proceedings of the Seventh IEEE International Conference on Computer
Vision, volume 2, pages 1150–1157 vol.2, 1999.
[65] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective
receptive field in deep convolutional neural networks, 2017.
[66] J. Ma, J. Chen, M. Ng, R. Huang, Y. Li, C. Li, X. Yang, and A. L. Martel. Loss
odyssey in medical image segmentation. Medical Image Analysis, 71:102035,
2021.
[67] J. Ma, Z. Wei, Y. Zhang, Y. Wang, R. Lv, C. Zhu, C. Gaoxiang, J. Liu, C. Peng,
L. Wang, Y. Wang, and J. Chen. How distance transform maps boost
segmentation CNNs: An empirical study. In International Conference on
Medical Imaging with Deep Learning, 2020.
[68] A. Mbilinyi and H. Schuldt. Cross-modality medical image retrieval with deep
features. In 2020 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), pages 2632–2639, 2020.
[69] A. M. Meesters, K. Ten Duis, H. Banierink, V. M. Stirler, P. C. Wouters,
J. Kraeima, J.-P. P. de Vries, M. J. Witjes, and F. F. IJpma. What are the
interobserver and intraobserver variability of gap and stepoff measurements in
acetabular fractures? Clinical Orthopaedics and Related Research,
478(12):2801, 2020.
[70] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and
D. Terzopoulos. Image segmentation using deep learning: A survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 44(7):3523–3542,
2022.
[71] D. Müller, I. Soto-Rey, and F. Kramer. Towards a guideline for evaluation
metrics in medical image segmentation. BMC Research Notes, 15, 2022.

[72] H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler. A review of
content-based image retrieval systems in medical applications—clinical
benefits and future directions. International Journal of Medical Informatics,
73(1):1–23, 2004.
[73] C. Niblack, P. B. Gibbons, and D. W. Capson. Generating skeletons and
centerlines from the distance transform. CVGIP: Graphical Models and Image
Processing, 54(5):420–437, 1992.
[74] T. Nichols and S. Hayasaka. Controlling the familywise error rate in functional
neuroimaging: a comparative review. Statistical Methods in Medical Research,
12(5):419–446, 2003. PMID: 14599004.
[75] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural
network architecture for real-time semantic segmentation.
[76] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch.
In NIPS-W, 2017.
[77] G. Petralia and A. R. Padhani. Whole-body magnetic resonance imaging in
oncology: uses and indications. Magnetic Resonance Imaging Clinics,
26(4):495–507, 2018.
[78] N. Pielawski, E. Wetzer, J. Öfverstedt, J. Lu, C. Wählby, J. Lindblad, and
N. Sladoje. CoMIR: Contrastive multimodal image representation for
registration. In Advances in Neural Information Processing Systems,
volume 33, pages 18433–18444. Curran Associates, Inc., 2020.
[79] L. Putzu, A. Loddo, and C. D. Ruberto. Invariant moments, textural and deep
features for diagnostic MR and CT image retrieval. In Computer Analysis of
Images and Patterns, pages 287–297. Springer Intl. Publishing, 2021.
[80] A. Rahman, C. Jahangir, S. M. Lynch, N. Alattar, C. Aura, N. Russell,
F. Lanigan, and W. M. Gallagher. Advances in tissue-based imaging: impact
on oncology research and clinical practice. Expert Review of Molecular
Diagnostics, 20(10):1027–1037, 2020. PMID: 32510287.
[81] A. Reinke, M. D. Tizabi, C. H. Sudre, M. Eisenmann, T. Rädsch,
M. Baumgartner, L. Acion, M. Antonelli, T. Arbel, S. Bakas, P. Bankhead,
A. Benis, M. J. Cardoso, V. Cheplygina, E. Christodoulou, B. Cimini, G. S.
Collins, K. Farahani, B. van Ginneken, B. Glocker, P. Godau, F. Hamprecht,
D. A. Hashimoto, D. Heckmann-Nötzel, M. M. Hoffman, M. Huisman,
F. Isensee, P. Jannin, C. E. Kahn, A. Karargyris, A. Karthikesalingam,
B. Kainz, E. Kavur, H. Kenngott, J. Kleesiek, T. Kooi, M. Kozubek,
A. Kreshuk, T. Kurc, B. A. Landman, G. Litjens, A. Madani, K. Maier-Hein,
A. L. Martel, P. Mattson, E. Meijering, B. Menze, D. Moher, K. G. M. Moons,
H. Müller, B. Nichyporuk, F. Nickel, M. A. Noyan, J. Petersen, G. Polat,
N. Rajpoot, M. Reyes, N. Rieke, M. Riegler, H. Rivaz, J. Saez-Rodriguez,
C. S. Gutierrez, J. Schroeter, A. Saha, S. Shetty, M. van Smeden, B. Stieltjes,
R. M. Summers, A. A. Taha, S. A. Tsaftaris, B. Van Calster, G. Varoquaux,
M. Wiesenfarth, Z. R. Yaniv, A. Kopp-Schneider, P. Jäger, and L. Maier-Hein.
Common limitations of image processing metrics: A picture story, 2021.
[82] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for
biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and
A. F. Frangi, editors, Medical Image Computing and Computer-Assisted

Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer
International Publishing.
[83] A. Rosenfeld and J. Pfaltz. Distance functions on digital pictures. Pattern
Recognition, 1(1):33–61, 1968.
[84] S. Roychowdhury, M. Diligenti, and M. Gori. Regularizing deep networks
with prior knowledge: A constraint-based approach. Knowledge-Based
Systems, 222:106989, 2021.
[85] S. S. M. Salehi, D. Erdogmus, and A. Gholipour. Tversky loss function for
image segmentation using 3D fully convolutional deep networks. In Machine
Learning in Medical Imaging: 8th International Workshop, MLMI 2017, Held
in Conjunction with MICCAI 2017, Quebec City, QC, Canada, September 10,
2017, Proceedings 8, pages 379–387. Springer, 2017.
[86] J. H. Scatliff and P. J. Morris. From röntgen to magnetic resonance imaging:
The history of medical imaging. North Carolina Medical Journal,
75(2):111–113, 3 2014.
[87] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for
face recognition and clustering. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015.
[88] M. L. Seghier. Ten simple rules for reporting machine learning methods
implementation and evaluation on biomedical data. International Journal of
Imaging Systems and Technology, 32(1):5–11, 2022.
[89] J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology,
46(1):561–584, 1995.
[90] W. Silva, T. Gonçalves, K. Härmä, E. Schröder, V. C. Obmann, M. C. Barroso,
A. Poellinger, M. Reyes, and J. S. Cardoso. Computer-aided diagnosis through
medical image retrieval in radiology. Scientific Reports, 12, 2022.
[91] S. M. Smith and T. E. Nichols. Threshold-free cluster enhancement:
Addressing problems of smoothing, threshold dependence and localisation in
cluster inference. NeuroImage, 44(1):83–98, 2009.
[92] R. Strand, K. C. Ciesielski, F. Malmberg, and P. K. Saha. The minimum
barrier distance. Computer Vision and Image Understanding, 117(4):429–437,
2013. Special issue on Discrete Geometry for Computer Imagery.
[93] R. Strand, S. Ekström, E. Breznik, T. Sjöholm, M. Pilia, L. Lind, F. Malmberg,
H. Ahlström, and J. Kullberg. Recent advances in large scale whole body MRI
image analysis: Imiomics. In Proceedings of the 5th International Conference
on Sustainable Information Engineering and Technology, SIET ’20, pages
10–15, New York, NY, USA, 2021. Association for Computing Machinery.
[94] R. Strand, F. Malmberg, L. Johansson, L. Lind, M. Sundbom, H. Ahlström,
and J. Kullberg. A concept for holistic whole body MRI data analysis,
Imiomics. PLOS ONE, 12:1–17, 02 2017.
[95] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey,
P. Elliott, J. Green, M. Landray, B. Liu, P. Matthews, G. Ong, J. Pell,
A. Silman, A. Young, T. Sprosen, T. Peakman, and R. Collins. UK Biobank:
An open access resource for identifying the causes of a wide range of complex
diseases of middle and old age. PLOS Medicine, 12(3):1–10, 03 2015.
[96] P. Summers, G. Saia, A. Colombo, P. Pricolo, F. Zugni, S. Alessi, G. Marvaso,
B. A. Jereczek-Fossa, M. Bellomi, and G. Petralia. Whole-body magnetic

resonance imaging: Technique, guidelines and key applications.
ecancermedicalscience, 15, 2021.
[97] M. Tang, F. Perazzi, A. Djelouah, I. B. Ayed, C. Schroers, and Y. Boykov. On
regularized losses for weakly-supervised cnn segmentation. In V. Ferrari,
M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV
2018, pages 524–540, Cham, 2018. Springer International Publishing.
[98] P. J. Toivanen. New geodosic distance transforms for gray-scale images.
Pattern Recognition Letters, 17(5):437–450, 1996.
[99] L. Wasserman. All of statistics: a concise course in statistical inference,
volume 26. Springer, 2004.
[100] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics
Bulletin, 1(6):80–83, 1945.
[101] J. M. Wolterink, A. M. Dinkla, M. H. F. Savenije, P. R. Seevinck, C. A. T.
van den Berg, and I. Išgum. Deep MR to CT synthesis using unpaired data. In
S. A. Tsaftaris, A. Gooya, A. F. Frangi, and J. L. Prince, editors, Simulation
and Synthesis in Medical Imaging, pages 14–23, Cham, 2017. Springer
International Publishing.
[102] C.-W. Woo, A. Krishnan, and T. D. Wager. Cluster-extent based thresholding
in fMRI analyses: Pitfalls and recommendations. NeuroImage, 91:412–419,
2014.
[103] K. J. Worsley, S. Marrett, P. Neelin, A. C. Vandal, K. J. Friston, and A. C.
Evans. A unified statistical approach for determining significant signals in
images of cerebral activation. Human Brain Mapping, 4(1):58–73, 1996.
[104] Y. Xue, H. Tang, Z. Qiao, G. Gong, Y. Yin, Z. Qian, C. Huang, W. Fan, and
X. Huang. Shape-aware organ segmentation by predicting signed distance
maps. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 34, pages 12565–12572, 2020.
[105] P. Yang, Y. Zhai, L. Li, H. Lv, J. Wang, C. Zhu, and R. Jiang. A deep metric
learning approach for histopathological image retrieval. Methods, 179:14–25,
2020. Interpretable machine learning in bioinformatics.
[106] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks. In Intl. Conf. on
Computer Vision (ICCV), 2017.

Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Science and Technology 2253
Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and


Technology, Uppsala University, is usually a summary of a
number of papers. A few copies of the complete dissertation
are kept at major Swedish research libraries, while the
summary alone is distributed internationally through
the series Digital Comprehensive Summaries of Uppsala
Dissertations from the Faculty of Science and Technology.
(Prior to January, 2005, the series was published under the
title “Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Science and Technology”.)

ACTA
UNIVERSITATIS
UPSALIENSIS
Distribution: publications.uu.se UPPSALA
urn:nbn:se:uu:diva-498953 2023
