EVA BREZNIK
ACTA
UNIVERSITATIS
UPSALIENSIS ISSN 1651-6214
ISBN 978-91-513-1760-1
UPPSALA URN urn:nbn:se:uu:diva-498953
2023
Dissertation presented at Uppsala University to be publicly examined in Sonja Lyttkens
(101121), Ångström Laboratoriet, Lägerhyddsvägen 1, Uppsala, Friday, 12 May 2023 at
09:15 for the degree of Doctor of Philosophy. The examination will be conducted in English.
Faculty examiner: Professor Alejandro Frangi (University of Leeds).
Abstract
Breznik, E. 2023. Image Processing and Analysis Methods for Biomedical Applications.
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and
Technology 2253. 74 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-1760-1.
With new technologies and developments, medical images can be acquired more quickly and
at a larger scale than ever before. However, the increased amount of data induces an overhead
in the human labour needed for its inspection and analysis. To support clinicians in decision
making and enable swifter examinations, computerized methods can be utilized to automate the
more time-consuming tasks. For such use, methods need to be highly accurate, fast, reliable and
interpretable. In this thesis we develop and improve methods for image segmentation, retrieval
and statistical analysis, with applications in imaging-based diagnostic pipelines.
Individual objects often need to first be extracted/segmented from the image before they
can be analysed further. We propose methodological improvements for deep learning-based
segmentation methods using distance maps, with the focus on fully-supervised 3D patch-based
training and training on 2D slices under point supervision. We show that using a directly
interpretable distance prior helps to improve segmentation accuracy and training stability.
For histological data in particular, we propose and extensively evaluate a contrastive learning
and bag of words-based pipeline for cross-modal image retrieval. The method is able to recover
correct matches from the database across modalities and small transformations, with improved
accuracy compared to competing methods.
In addition, we examine a number of methods for multiplicity correction in statistical
analyses of correlation using medical images. Evaluation strategies are discussed, and anatomy-
observing extensions to the methods are developed as a way of directly mitigating the
multiplicity issue in an interpretable manner, providing improvements in error control.
The methods presented in this thesis were developed with clinical applications in mind and
provide a strong base for further developments and future use in medical practice.
To my family.
List of papers
This thesis is based on the following papers, which are referred to in the text
by their Roman numerals.
III The experimental design, code writing, and results analysis were done
together with the first co-author. The paper was written jointly with all
co-authors.
IV I developed the ideas, designed the experiments, wrote the code, and
performed and analyzed the experiments with support from co-authors.
The paper was written with input from all the co-authors.
Related Work
In addition to the papers included in this thesis, the author has also written or
contributed to the following works:
Contents
1 Introduction
1.1 Aims and contributions
1.2 Thesis outline
2 Working with biomedical data
2.1 The challenges of working with medical images
2.1.1 Data availability and the effect on repeatability
2.1.2 Data heterogeneity
2.1.3 Reliability and interpretability
2.2 Dataset overview
2.2.1 Private dataset
2.2.2 Open datasets
3 Technical background
3.1 Medical image processing fundamentals
3.1.1 Image segmentation
3.1.2 Content-based image retrieval
3.2 Deep Learning
3.2.1 Architectures
3.2.2 Training considerations
3.2.3 Evaluation
3.3 Distance transforms
3.3.1 Common distance definitions
3.3.2 Using distance transforms within DL
3.4 Statistical analyses
3.4.1 Hypothesis testing
3.4.2 Test multiplicity
4 Improving segmentation learning
4.1 Patch-based learning in 3D
4.1.1 Motivation
4.1.2 Methods
4.1.3 Results
4.1.4 Conclusions
4.2 Guiding the learning under weak supervision
4.2.1 Motivation
4.2.2 Methods and background experiments
4.2.3 Results
4.2.4 Conclusions
5 Cross-modality image retrieval
5.1 Motivation
5.2 Methods
5.2.1 Step I: Bridging the modality gap
5.2.2 Step II: Feature extraction and matching
5.2.3 Step III: Reranking strategies
5.3 Results
5.4 Conclusions
6 Handling statistical analyses in Imiomics
6.1 Motivation
6.2 Methods
6.2.1 Evaluation strategies
6.2.2 Anatomically compliant corrections
6.3 Results
6.4 Conclusions
7 Conclusions and future work
Sammanfattning på svenska
Acknowledgements
References
1. Introduction
Ever since the first X-ray image was made in 1895, medical doctors have been
increasingly relying on images and scans of different modalities for diagnosis
and disease progression monitoring [86]. With new technical developments
came new techniques, improved image resolution and faster, high-throughput
imaging protocols. Today, even whole-body scans can be obtained in a reasonable
time, and are used not only for research but also in medical practice [96],
for example in diagnostic exploratory searches, tumour detection, staging, and
therapy evaluation [77].
Depending on the application and the reasons for acquiring the images,
they must be processed and inspected in specialized ways, typically relying on
expert knowledge. For example, apart from identifying and localizing various
pathologies, one might also need to measure the size or volume of objects in
the image. With the increased availability of imaging for daily use and the
unprecedented rise in the amount of data acquired, the human effort required
for analysis has become the bottleneck in high-throughput medical imaging
pipelines. Numerous computer-based automatic and semi-automatic image
processing and analysis tools have emerged, aiming to reduce this bottleneck.
A very high-level (and by no means exhaustive) overview of a medical
image-processing-based workflow for diagnostics is shown in Figure 1.1. Ra-
diology and microscopy/pathology typically play complementary roles in image-
based diagnostics. While radiological scans are often used as a stand-alone
tool, they sometimes require additional processing or indicate the need for
more targeted examinations, like biopsies. Similarly, while examining histo-
logical images can be the end goal of a medical procedure, it can also indicate
the need for broader radiological imaging. Statistical analyses are commonly
required for practical use in decision-making in order to enable a sound and
reliable medical interpretation of the final imaging results and findings.
The work of this thesis is concerned with developing and improving com-
puterized, automated methods for different stages of the shown diagnostic
pipeline, namely the processing of radiological scans (by segmentation meth-
ods), biomedical image retrieval (with a focus on histological data), and the
final statistical analyses as applied to images.
While a quick literature search reveals an abundance of methods for image
segmentation, retrieval and analysis, these are generally tailored to specific
data or applications. In addition, medical applications have the potential
for a strong societal impact. As such, they are subject to stricter requirements
with regard to accuracy, reliability and explainability.
Figure 1.1. A high-level overview of an example imaging-based diagnostic pipeline,
with emphasis on the steps that were the focus of this thesis work. Microscopy and
radiological images can be used jointly or as stand-alone tools. Biomedical images
typically require further processing, handling and analysis before they can serve as
the basis for diagnosis.
Chapter 6 covers the work of paper IV. Finally, a short conclusion with possi-
ble future work directions is given in chapter 7.
2. Working with biomedical data
While wrangling medical data can lead to useful insights with potentially high
societal impact, it comes with its own set of challenges. In this chapter, we
first discuss the unique problems of working with biomedical data with regard
to data access, research reproducibility and real-life applications, followed by
a brief summary of all the datasets that have been used during the work on this
thesis and included papers.
UK Biobank [95], a large-scale biomedical database with the goal of scanning
some 100k subjects.
2.2 Dataset overview
2.2.1 Private dataset
The work in this thesis was primarily developed for an in-house dataset of
whole-body MRI scans.
POEM
The POEM data comes from a Prospective investigation of Obesity, ENergy
production and Metabolism study on a healthy sample of 50-year-olds from
Uppsala. It is a local (not currently publicly available) cohort of whole-body
fat/water-separated 3D MR images. The cohort includes data from 502 pa-
tients.
The imaged field of view (FOV) was 530 × 377 × 2000 mm³, and the resolution
was anisotropic, with a reconstructed voxel size of 2.07 × 2.07 × 8.0 mm³ in
the left-right × anterior-posterior × foot-head directions. For additional technical
details regarding the properties and acquisition of the images, see [59].
Figure 2.1. Example corresponding slices of water (above) and fat (below) content
images of a random subject from the POEM dataset [59].
2.2.2 Open datasets
For additional benchmarking and reproducibility reasons, we also use a num-
ber of publicly available datasets for different tasks.
ISLES data
The Ischemic Stroke Lesion Segmentation (ISLES) 2018 challenge [39] is a
cohort of multimodal brain scans of 103 patients with ischemic stroke lesions.
The patients underwent CT perfusion imaging (CTP) followed by MRI
diffusion-weighted imaging (DWI). The provided imaging data consist of
perfusion maps (cerebral blood flow CBF, mean transit time MTT, cerebral blood
volume CBV, time-to-maximum of the residue function Tmax) and the CTP source data. For
each subject, the DWI sequence was co-registered with CTP. See example pa-
tient images in Figure 2.2. Ground truth segmentation, based on the DWI
scans, is also provided for all lesions (for the training set). Scans are 3D vol-
umes of varying voxel size and slice thickness, containing only a few slices
each. For more details see [39].
Figure 2.2. All modalities for one example subject slice from the ISLES 2018 dataset
[39]. From left to right: CTP source data, CBF, CBV, MTT and Tmax.
This dataset was used for work leading to the ideas of paper II, and is in-
cluded also in the discussions and experiments of chapter 4.
ACDC data
The Automated Cardiac Diagnosis Challenge (ACDC) [9] is a public bench-
mark multi-class heart segmentation dataset. It contains cine-MR images of
150 patients, covering healthy scans and four types of pathologies in equal
amounts, with pixel-wise annotations for the right ventricle (RV), myocardium
(Myo) and left ventricle (LV) heart structures. The images are 3D volumes,
with anisotropic inter-slice spacing and varying spatial resolution. More details
regarding the dataset are available in [9]. This dataset is used mainly
within paper II. See Figure 2.3 for a few example slices.
Figure 2.3. Example 2D slices from 4 patients from the ACDC dataset [9].
3. Technical background
This chapter briefly introduces the technical background of the tools and meth-
ods that are used in the included papers. The basic terms and tasks of image
processing are given in section 3.1. Section 3.2 covers the important con-
cepts of the deep learning methods applied in some of the included papers and
section 3.3 gives a brief account of distance transforms on images. Finally,
section 3.4 summarizes the essentials of statistical analyses on images.
Formally, a (binary) segmentation of an object in an image can be viewed as
some function f from the image domain Ω to a discrete set {0, 1}, where the values 1
and 0 denote the presence and absence of the object, respectively. In the case
of a K-class segmentation, it could analogously be defined as a mapping from
Ω to Z_K; however, for a more consistent notation throughout the thesis we assume
a so-called one-hot representation: f : Ω → {0, 1}^K, where f(x) sums to 1 for
all x ∈ Ω and the k-th element of f(x) represents the belongingness of pixel x to
class k.
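The one-hot representation f : Ω → {0, 1}^K can be built directly from an integer label map; a minimal NumPy sketch (the function name is ours, not from the thesis):

```python
import numpy as np

def to_one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Convert an integer label map of shape (H, W) to a one-hot map (H, W, K)."""
    one_hot = np.zeros(labels.shape + (num_classes,), dtype=np.uint8)
    # Place a 1 in the channel corresponding to each pixel's class.
    np.put_along_axis(one_hot, labels[..., None], 1, axis=-1)
    return one_hot

labels = np.array([[0, 1], [2, 1]])      # a tiny 2x2 label image, K = 3
f = to_one_hot(labels, num_classes=3)
assert f.sum(axis=-1).min() == 1         # f(x) sums to 1 for every pixel x
```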
There are multiple flavours of image segmentation. For example, object
segmentation is concerned only with segmenting all areas of a specific class,
instance segmentation requires different instances of the same class to be seg-
mented separately, and semantic segmentation is the task of splitting the entire
image into the specified classes (i.e. each pixel needs to be segmented as one
of the available classes). A combination of the latter two, where each pixel
needs to be assigned a class, but individual instances of the same class need
to be separated too, is termed panoptic segmentation. For examples of the
different segmentation tasks, see Figure 3.1.
Figure 3.1. An illustration of various segmentation tasks. From left to right: original
image, (multiple) object segmentation, instance segmentation, semantic segmentation
and panoptic segmentation.
Evaluation
When evaluating segmentation methods, the segmented output is usually com-
pared to the ground truth (i.e. the manually segmented reference, desired
output) in a chosen evaluation metric. The choice of appropriate metric will
depend on what we are segmenting and why, as different quantities may be
important in different applications (e.g. accurate volumetric measure, exact
delineation, or best object coverage). In addition, different metrics have
different properties and limitations that can affect their interpretability [81]. To
evaluate whether the output of a segmentation method really represents the
ground truth well, it is thus important to combine multiple metrics and visual
inspection of the results [88]. The most commonly used metrics for segmen-
tation evaluation include Dice score [27] and symmetric Hausdorff distance
[44], which are used also in the included papers.
Dice score
The Dice score [27] (commonly abbreviated as DSC, for Dice Similarity or
Dice-Sørensen Coefficient) is a similarity measure, measuring the overlap between
images. For a given image and a single object we aim to segment, let G, S :
Ω → {0, 1} be the ground truth and the segmentation that we wish to evaluate,
respectively. Then the Dice score of the segmentation S given the ground truth
G can be formally defined as
DSC(S, G) = \frac{2 \sum_{x \in \Omega} G(x)\, S(x)}{\sum_{x \in \Omega} G(x) + \sum_{x \in \Omega} S(x)} \qquad (3.1)
By definition, the Dice score attains values in the range [0, 1], with 1 corresponding
to a perfect segmentation. It is sometimes also called the F1 score, and
is closely related to another well-known metric, intersection over union (IoU,
also called the Jaccard index), via IoU = DSC / (2 − DSC).
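Equation 3.1 translates directly into NumPy; a minimal sketch (our own, including a chosen convention for two empty masks), with IoU derived from DSC:

```python
import numpy as np

def dice_score(seg: np.ndarray, gt: np.ndarray) -> float:
    """Dice score between two binary masks, per Equation 3.1."""
    intersection = np.sum(gt * seg)
    denom = gt.sum() + seg.sum()
    if denom == 0:          # both masks empty: define DSC as 1 (a convention)
        return 1.0
    return 2.0 * intersection / denom

gt  = np.array([[1, 1, 0], [0, 1, 0]])
seg = np.array([[1, 0, 0], [0, 1, 1]])
dsc = dice_score(seg, gt)       # 2 * 2 / (3 + 3) = 2/3
iou = dsc / (2.0 - dsc)         # Jaccard index recovered from DSC
```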
The DSC definition in Equation 3.1 is primarily intended for evaluating a
single class/object segmentation. For a multi-class setting with K classes, there
is no one clear definition of a summarizing Dice score metric. One option
would be to simply calculate the average of Dice scores over all classes of
interest, or consider all pixels where the labellings of G, S : Ω → {0, 1}K agree,
equally: |{x|x∈Ω; G(x)=S(x)}|
|Ω| , where | · | denotes set cardinality (i.e. number of
elements). But without any distinction between the different classes, these
definitions favour the majority classes, which makes them non-representative
of the actual segmentation success when classes are severely imbalanced (as
is often the case in medical applications). To account for differing class sizes,
a so-called Generalized Dice was proposed in [20], weighting the individual
classes by the inverse of their size in the ground truth.
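As a reference point, one widely cited formulation of the Generalized Dice (following Sudre et al.; whether [20] uses the inverse or the inverse-square of the class size as weight should be checked against that reference) is:

```latex
GDS(S, G) = \frac{2 \sum_{k=1}^{K} w_k \sum_{x \in \Omega} G_k(x)\, S_k(x)}
                 {\sum_{k=1}^{K} w_k \sum_{x \in \Omega} \bigl(G_k(x) + S_k(x)\bigr)},
\qquad
w_k = \frac{1}{\bigl(\sum_{x \in \Omega} G_k(x)\bigr)^{2}}
```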
Hausdorff distance
As opposed to the Dice score, the Hausdorff distance [44] is a dissimilar-
ity measure: lower numbers mean the evaluated segmentation is better, more
similar to the ground truth.
It measures the distance between the surfaces of the segmented objects,
hence focusing on the segmentation boundaries. Reusing the notation from
above, assuming a binary segmentation, the Hausdorff distance between a
segmented object X = {x | x ∈ Ω, S(x) = 1} and its ground truth Y = {x | x ∈
Ω, G(x) = 1} can be defined as

HD(X, Y) = \max\Bigl( \sup_{x \in X} \inf_{y \in Y} d(x, y),\; \sup_{y \in Y} \inf_{x \in X} d(x, y) \Bigr) \qquad (3.3)

where d(·, ·) is a distance between points (typically the Euclidean distance).
The definition in Equation 3.3 is often also called the symmetric Hausdorff distance,
as it takes the distances in both directions into account. Since it is very
sensitive to noise/outliers, and medical images generally tend to be noisy, we
mostly use a 95th-percentile version of HD, where the maximum in Equation 3.4
is replaced by the 95th percentile of the distances. This adjusted
metric is denoted by HD95.
As in the case of DSC, the definition in Equation 3.3 is valid for a
single object/class. While per-class scores should be reported for a clearer
interpretation of results, the average HD95 over all classes can be used as a
summarizing metric. Since it is a boundary-based rather than an overlap-/count-based metric (like
DSC), class prevalence does not directly affect the average over classes.
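As an illustration (not the exact implementation used in the papers), HD95 between two point sets of boundary coordinates, assuming Euclidean point-to-point distances, can be sketched as:

```python
import numpy as np

def hd95(X: np.ndarray, Y: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between two point sets,
    e.g. boundary pixel coordinates of a segmentation and its ground truth."""
    # Pairwise Euclidean distances between all points of X and all points of Y.
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    d_xy = d.min(axis=1)   # distance from each x in X to its nearest y in Y
    d_yx = d.min(axis=0)   # distance from each y in Y to its nearest x in X
    # The classic symmetric HD takes the maximum of these distances;
    # HD95 replaces the maximum by the 95th percentile for outlier robustness.
    return float(max(np.percentile(d_xy, 95), np.percentile(d_yx, 95)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0]])
h = hd95(X, Y)
```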
Figure 3.2. An illustration of the common essential steps in a general CBIR pipeline.
Feature extraction
The choice of feature extractors in the context of image retrieval depends on
the type of similarity that is expected to be exhibited by a good query-retrieval
image match. Typically features can encode colour, shape, texture, etc. This
construction, or hand-crafting, of the application-specific features, however,
requires extensive domain knowledge. As in many image processing appli-
cations, DL approaches have become very common even in image retrieval
[62, 28]. By using neural networks (NNs) as feature extractors, we can cir-
cumvent the laborious and expertise-based handcrafting of features.
More often than not, the query and database images are not expected to be
aligned in any way, even when searching for exact matches. Invariance to a
certain range of transformations is thus a desirable property of a feature
extractor. In paper III, we employ and compare two well-known and widely
used classical local feature extractors, robust to various transformations: the
Scale Invariant Feature Transform (SIFT) [64] and Speeded Up Robust Fea-
tures (SURF) [7]. In addition, we use an NN-based feature extractor based
on an adapted version of the so-called ResNet architecture, explained in more
detail in section 3.2.
SURF works similarly to SIFT, but uses a different, computationally more
efficient approximation of the scale-space (the differences between the smoothed
images at varying scales). The local point descriptors again take into account
orientation and intensity changes in the local neighbourhood of each point,
but are calculated differently from SIFT, forming a descriptor half the size
of a SIFT descriptor. Hence SURF features tend to perform similarly to SIFT
for image matching, but are more efficient to compute.
Similarity matching
The database images need to be ranked according to how similar their fea-
ture representations are to the one of the query image. As the representations
are equally sized, simple distance measures (e.g. Euclidean, Manhattan, see
section 3.3 for more) can be used to evaluate how close (similar) they are. In
paper III we calculate the similarities using a cosine similarity measure, which
is used most frequently in combination with BoW descriptors. It is defined as
d_{\cos}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|} \qquad (3.5)

where v_1, v_2 stand for the two compared descriptors (frequency vectors), and
\| · \| denotes the vector norm.
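Equation 3.5 and the resulting ranking step can be sketched as follows; the frequency vectors below are hypothetical bag-of-words descriptors, not data from the papers:

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine similarity between two descriptor (frequency) vectors, Eq. 3.5."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

query = np.array([3.0, 0.0, 1.0])
db_a  = np.array([6.0, 0.0, 2.0])   # same direction as the query
db_b  = np.array([0.0, 5.0, 0.0])   # shares no visual words with the query

# Rank database images by decreasing similarity to the query.
sims = sorted([("a", cosine_similarity(query, db_a)),
               ("b", cosine_similarity(query, db_b))],
              key=lambda t: -t[1])
```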
Evaluation
The desired output of an (s-)CBIR system is a ranked set of K best (most
similar) matches for each query. When attempting a category-level retrieval
task, many of the retrieved matches may be correct, i.e. corresponding to what
we wished to retrieve. In instance-level retrieval, on the other hand, there
is only one correct match, which either is or is not found within the first K
retrievals. Depending on the type of retrieval and whether or not the actual
ranking of the correctly retrieved images within the first K best matches is of
interest, different evaluation metrics may be suitable.
(Average) Precision at K
Let I_j denote the indicator of a correct match at rank j for a given
query (i.e. if the image at rank j is a valid match, I_j = 1, else I_j = 0).
When only the number of correctly retrieved images is of interest, the
Precision at K (P@K) is defined as
P@K = \frac{\sum_{k=1}^{K} I_k}{\min(K, n)} \qquad (3.6)
where n stands for the number of all database images correctly corresponding
to the query. If, however, a manual inspection of the first K matches is not
reasonable for the application at hand and the first-place matches are the only
ones of interest, the actual ranks of the correct matches can be included in the
metrics. This rank-adjusted precision measure is called Average precision at
K, or AP@K:
AP@K = \frac{\sum_{k=1}^{K} I_k \cdot P@k}{\sum_{k=1}^{K} I_k}. \qquad (3.7)
As this is a per-query success metric, it can be averaged over multiple queries
to be more representative of the CBIR system success (this query-averaged
version is denoted by mAP@K, where m stands for mean).
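A direct transcription of Equations 3.6 and 3.7 (variable and function names are ours); note that with the min(K, n) normalization, P@K reaches 1.0 as soon as all n correct items appear in the top K:

```python
def precision_at_k(matches, K, n):
    """P@K per Eq. 3.6: correct matches among the top K retrievals,
    normalized by min(K, n), where n is the number of correct database items."""
    return sum(matches[:K]) / min(K, n)

def average_precision_at_k(matches, K, n):
    """AP@K per Eq. 3.7: P@k accumulated at the ranks of correct matches,
    divided by the number of correct matches within the top K."""
    hits = sum(matches[:K])
    if hits == 0:
        return 0.0
    return sum(matches[k - 1] * precision_at_k(matches, k, n)
               for k in range(1, K + 1)) / hits

# Hypothetical indicator list I_j: is the retrieval at rank j a correct match?
matches = [1, 0, 1, 0, 0]
n = 2    # two database images correctly correspond to this query
p5 = precision_at_k(matches, 5, n)
ap5 = average_precision_at_k(matches, 5, n)
```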
Accuracy at K
When dealing with instance-based retrieval with only one possible correct
match per query, the set of viable queries is usually well-defined. In such
problems, the P@K metric results in a binary indicator of success and can be
averaged over the query set to produce a measure of Accuracy at K (some-
times denoted also by Acc@K). In paper III, we refer to this value as a top-K
retrieval success, as it indicates for what fraction of queries the CBIR system
was successful in retrieving the correct image within the first K matches.
3.2.1 Architectures
In DL, the term architecture refers to the design choices when building a net-
work. To date, numerous architectures have been proposed for various image
processing problems [49, 62]. While NNs can be used on different types of
data, we focus here only on image inputs. An NN layer composed of neurons
that operate globally (i.e. use the entire input in their calculations) is called
a dense or densely-connected layer. A convolutional layer, on the other hand,
consists of locally-focused neurons, that are connected to only a subset of the
input at a time and applied in a sliding window fashion (i.e. applying one
neuron corresponds to convolving a filter with the input image, followed by a
nonlinearity). The size (in terms of the area of the input they are connected to)
of such neurons is then referred to as kernel size.
Hence convolutional neural networks (CNNs) are particularly well suited for
application on images. Contemporary CNN architectures consist of a com-
bination of convolutional layers and other types of layers like dropout, batch-
normalization, etc. For more details on different layers and design possibilities
see [38]. In papers I and II, we employ CNNs for medical image segmentation,
and in paper III, CNNs are used both for feature extraction and style transfer.
More specifically, we use a U-Net [82], DeepMedic [50] and a simple vanilla
CNN architecture in paper I, and the E-Net [75] architecture in paper II. In paper
III, ResNet [40] is used within a replacement study, and Tiramisu [48] within
a method that forms part of the proposed pipeline.
enforcing that the input image can be reconstructed back from the newly gen-
erated image.
Types of supervision
Supervision is related to the type of data neural networks are trained with.
If there is no ground truth available during training, it is said that the net-
work is trained in an unsupervised manner. In contrast, supervised training
means that the network is shown both data and the expected output connected
to it. When ground truth is available only for a subset of the training dataset,
such a training setting is called semi-supervised. Furthermore, if the ground
truth exists for each sample but is in some way incomplete, the training is
said to be carried out under weak supervision. Different types of learning can
also be joined within the same pipeline. When a network, for example, uses
unstructured data to learn labels (i.e. unsupervised) which are then used for
supervising downstream learning tasks, the training is called self-supervised.
Sometimes, only a certain kind of information is available about the data; for
example which samples are similar (and which are different). When the net-
work is trained with such information instead of the ground truth in terms
of the desired output labels, this is called (self-supervised) contrastive learn-
ing. In papers I and II, we employ full and weak supervision for segmentation
learning, respectively, and in paper III the proposed pipeline relies on
contrastive learning.
Figure 3.3. An illustration of the receptive field growth by stacking convolutional
layers. A single unit of output relies on the information of a 7 × 7 area of the input in
a simple 4-layer convolutional network with kernels of size 3 × 3.
Losses
During training, the network’s parameters are repeatedly optimized to min-
imize the chosen loss. The choice of loss depends on the problem at hand
and the specifics of the data. In supervised training, it is designed to measure
the difference between the network output and the ground truth. Contrastive
learning on the other hand requires a loss that pushes the network to learn sim-
ilar/different representations for the positive/negative pairs. In the context of
segmentation, more well-known losses include Cross-entropy (CE), Dice loss,
Focal loss, Tversky loss and Boundary loss [66]. These are used in papers I
and II under both full and weak supervision. For contrastive learning, triplet
loss [87] is typically used. It compares a matching (positive) sample and a
non-matching (negative) sample to a reference input, minimizing the distance
between positive and maximizing the distance between negative pairs.
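As an illustration (not code from the papers), the hinge form of the triplet loss can be sketched in a few lines of numpy; the L2 embedding distance and the margin value are illustrative choices:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss over a batch of embedding vectors:
    push matching pairs closer than non-matching ones by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-positive distances
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-negative distances
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

The loss is zero whenever every negative is already farther from the anchor than its positive by at least the margin.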
For the definitions that follow, let s represent the network output and G
the one-hot encoding of the desired ground truth labelling G : Ω → {0, 1}K ,
where Ω is again the image domain, Gk denotes the k-th output of G and K
is the number of classes. When dealing with segmentation problems, the raw
network output is usually sent through a softmax nonlinearity [13] to produce
values in the range [0, 1] that can be interpreted as probabilities. Hence s : Ω →
[0, 1]K where s(x) sums to 1 for any x ∈ Ω and sk (x) (k-th element of output at
x) represents the probability of x belonging to class k.
Pixel-wise CE loss
Cross-entropy loss stems from the Kullback-Leibler divergence, which
measures dissimilarity between two distributions. For a multi-class problem with
K classes, it is defined as
L_CE(s, G) = −(1/N) ∑_{k=1}^{K} ∑_{x∈Ω : G_k(x)=1} log(s_k(x))    (3.8)
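As a minimal numpy sketch of this definition (with the usual small epsilon added for numerical stability, an implementation detail not stated in the text):

```python
import numpy as np

def pixelwise_ce(s, G, eps=1e-12):
    """Multi-class cross-entropy of Eq. (3.8).
    s: softmax probabilities, shape (K, N); G: one-hot ground truth, shape (K, N),
    where N is the number of pixels in the image domain."""
    N = s.shape[1]
    # log-probabilities contribute only at positions where G_k(x) = 1
    return -(G * np.log(s + eps)).sum() / N
```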
Dice loss
The Dice loss as used for training CNNs is generally defined simply as
LDSC (s, G) = 1 − DSC(s, G)    (3.9)
for a chosen definition of the DSC metric (see section 3.1.1). The only
difference is that the DSC in this case is calculated directly on the output
probabilities instead of on the class labelling (this is sometimes also termed
soft Dice).
Multiple versions of the Dice loss have been proposed for segmentation
training [46]. Tversky loss [85] can be seen as a version of Dice loss, weighing
false positives differently than false negatives (while the original Dice loss
weighs both equally).
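A sketch of both losses for a single class, on flattened probability and mask arrays (the smoothing epsilon and the Tversky weights alpha/beta shown here are illustrative defaults, not values from the papers):

```python
import numpy as np

def soft_dice_loss(s, g, eps=1e-8):
    """Soft Dice loss of Eq. (3.9), on probabilities s and binary mask g."""
    inter = (s * g).sum()
    return 1.0 - (2.0 * inter + eps) / (s.sum() + g.sum() + eps)

def tversky_loss(s, g, alpha=0.3, beta=0.7, eps=1e-8):
    """Tversky variant [85]: alpha weighs false positives, beta false negatives;
    alpha = beta = 0.5 recovers the Dice loss."""
    tp = (s * g).sum()
    fp = (s * (1.0 - g)).sum()
    fn = ((1.0 - s) * g).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```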
Boundary loss
Boundary loss (BL) was first proposed in [54] with the aim to improve seg-
mentation accuracy in problems with highly imbalanced classes. Both Dice
and CE involve summations over entire regions, which can have a detrimental
effect on training performance if the differences in class sizes are very large.
Instead, BL is calculated on the space of object contours, integrating only over
values in the areas between the segmented and the ground truth boundaries. In
order to put a higher penalty on larger deviations, the integration is done over
distances from the ground truth boundary. It is formally defined as:
L_BL(s, G) = ∑_{k=1}^{K} ∑_{x∈Ω} s_k(x) φ_G^{(k)}(x)    (3.10)
where φ_G^{(k)} is a signed distance map (see section 3.3 for a detailed explanation)
of G, computed on the class k mask. Boundary loss actually complements
regional information, and is in practice often combined with losses like Dice
and CE for better stability (especially in the binary segmentation case). As op-
posed to the Dice and CE losses, the boundary loss can attain negative values.
While it is bounded, its bounds depend on the image data and the distances
used.
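A single-class sketch of this loss, using scipy's Euclidean distance transform to build the signed map (sign conventions vary in the literature; here the map is negative inside the object, so a correct prediction yields a negative loss value):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """Euclidean signed distance map: negative inside the object, positive outside."""
    inside = distance_transform_edt(mask)        # distance to background, inside pixels
    outside = distance_transform_edt(1 - mask)   # distance to object, outside pixels
    return outside - inside

def boundary_loss(s, mask):
    """Single-class instance of Eq. (3.10): probabilities weighted by signed distances."""
    return (s * signed_distance_map(mask)).sum()
```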
3.2.3 Evaluation
There are two important aspects of evaluating how a neural network performs.
Firstly, there is the application-dependent evaluation of the final results in
comparison to the expected/desired ground truth. For the case of segmen-
tation learning for example, two common metrics have been described in sec-
tion 3.1.1. Secondly, the behaviour of a model throughout the training should
also be taken into account. To some extent, these model properties can be
observed through the inspection of training curves.
For both aspects it is important to not rely on a single measurement but
carry out the experiments repeatedly. Utilizing data in an appropriate way (in
terms of splitting and input size) also plays a vital role.
In short, while seeing more context is generally beneficial in CNNs, providing
this context when a network has not been trained to use it can lead to
unexpected outputs and degradation in performance. For a fair evaluation, the
models should thus be evaluated on inputs of the same size as the training data.
generalize well. On the contrary, results that are too similar can be a cause for
concern; perhaps the datasets are too similar, which can cause problems and
unpredictable behaviour in real-life deployment [15].
Particularly in medical applications, generalization properties are very im-
portant, due to extreme data heterogeneity (see section 2.1). It is not practical,
or reasonable, to constrain method use to images of the exact same properties.
While networks are trained for many epochs, the model at a single specific
epoch is chosen for evaluation, meaning that highly oscillatory curves can
produce good results "by chance". Examination of training curves should thus
also be done to establish the stability properties and inform parameter
adjustments.
for all x, y ∈ M. For an arbitrary function to be called a distance function,
the nonnegativity condition is actually the only requirement. However, met-
ric properties are interesting theoretically and have implications for practical
computations of distances. When a function satisfies all but the reflexivity
property, it is called a pseudo-metric, and when it only fails to satisfy the tri-
angle inequality, it is called a semi-metric. For a more in-depth reference on
metrics and various distances see [26].
Intuitively a distance is simply a function that measures a difference be-
tween pixels or objects. Even the common evaluation metrics (dissimilarity
measures in particular) can be formulated/viewed as distances. Some dis-
tances are global (i.e. depend only on source and target points) and some
are path-based (i.e. depend on the actual path taken between them).
On digital images, in order to speed up the computations of DTs, even the
global distance values are commonly approximated through propagation of
local distances (or their vectors), i.e. distances between neighbouring pixels
[83, 33, 16, 23]. This leads to distances that are computed along (discrete)
paths, requiring that a reachable local pixel neighbourhood be decided upon
for the practical calculations.
Below we define a few common distance measures that are used mainly in
paper II. While they originally work on continuous domains, we here consider
only distances on images. For the sake of simplicity, we provide definitions
for images on 2D image domains, I : Ω ⊂ Z2 → R, but extensions to 3D are
trivial.
For any two points (pixels) x, y ∈ Ω, let πx,y = (x = p1 , p2 , . . . , pn−1 , pn = y)
be a path between them, with pi and pi+1 , i = 1, . . . , n − 1 adjacent points
(pixels) on the path. We use Πx,y to denote a set of all such allowed paths.
The distance between x and y computed along a path π_{x,y} is calculated by
∑_{i=1}^{n−1} d(p_i, p_{i+1}), where d is the chosen distance function. To obtain (or
approximate, for the case of non-path-based distances) the final distance between
x and y, the minimum over all paths in Π_{x,y} is taken.
Euclidean distance
Euclidean or L2 distance is the most widely used distance, arising naturally in
the physical world. For points x, y ∈ Ω, x = (x1 , x2 ), y = (y1 , y2 ), it is defined
as
d(x, y) = √((x1 − y1)² + (x2 − y2)²).    (3.12)
Taxicab distance
Taxicab distance, called also Manhattan or L1 distance, is a distance between
points computed along a grid. If we consider all possible paths in the image
as valid, the path-based computation will produce the exact distance.
Geodesic distance
The geodesic distance [98] is the distance along a curve (on a manifold in
higher dimensions). For images, this means that the geodesic distance between
pixels depends not only on their spatial proximity, but also on their intensities.
It is commonly defined using the L2 norm (i.e. for 2D images, it boils down to
the L2 distance in 3D, with intensity as the third dimension), but computed along paths:
d(x, y) = √(‖x − y‖₂² + (I(x) − I(y))²) = √((x1 − y1)² + (x2 − y2)² + (I(x) − I(y))²)    (3.14)
In practice, it is however often computed using the L1 distance instead. Fur-
thermore, to account for the fact that the additional dimension (in the case of
images) is the intensity which may have a very different range than the rest,
an additional parameter λ is commonly used to balance the contributions:
d(x, y) = (1 − λ )||x − y||1 + λ |I(x) − I(y)| (3.15)
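The path-based computation described above can be sketched as a Dijkstra-style propagation over the 4-neighbourhood, with the per-step cost of Equation 3.15; this is a minimal illustration, not the algorithm used in the papers:

```python
import heapq
import numpy as np

def geodesic_distance(img, seed, lam=0.5):
    """Geodesic distance map from `seed`, using the step cost of Eq. (3.15):
    (1 - lam) * |spatial step|_1 + lam * |intensity difference|."""
    h, w = img.shape
    dist = np.full((h, w), np.inf)
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if d > dist[i, j]:
            continue  # stale heap entry
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-neighbourhood
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w:
                step = (1 - lam) * 1.0 + lam * abs(img[i, j] - img[ni, nj])
                if d + step < dist[ni, nj]:
                    dist[ni, nj] = d + step
                    heapq.heappush(heap, (d + step, (ni, nj)))
    return dist
```

Setting λ = 0 reduces this to the taxicab distance, and λ = 1 to a purely intensity-based distance.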
3.4 Statistical analyses
Within the field of statistics, hypothesis testing is a topic that permeates all
levels of image processing, be it for investigating the true significance of
seemingly improved results from a new model, or at the very end of the
diagnostic pipeline, when providing medical practitioners with statistically
sound measures of signals of interest in the images. The latter is the focus in
paper IV, aiming to provide interpretable and reliable testing procedures on
imaging data. Below we provide a short overview of the hypothesis testing
procedure and the problems related to it when considering images. More de-
tails and the underlying theory can be found e.g. in [99].
that is the scope of the fourth step. Most commonly, p-values under 0.05 are
interpreted as strong evidence against H0 (with values < 0.01 considered as
very strong evidence). If the values are within the range 0.05 to 0.1, that is
usually considered as weak evidence, while values above 0.1 provide no or
little evidence against the null hypothesis. The threshold at which we consider
the evidence sufficient is called the significance level, α.
The errors that can occur in hypothesis testing can be split into Type I and
Type II errors. Type I error, or false positive, is the case when we mistakenly
reject H0 that is true. Its probability is controlled by the significance level.
On the other hand, when H0 is false but we fail to reject it, we commit a
type II error or a false negative. While α is usually set such that it keeps
the type I errors at bay, its settings affect the type II errors too. Choosing an
appropriate α therefore requires balancing between sensitivity (true positives)
and specificity (true negatives).
Below we shortly explain three common types of tests that are used also
across the papers included in this thesis.
Student t-test
According to the central limit theorem, the distribution of a sample-mean vari-
able becomes approximately normal with increasing sample size. Therefore,
given a sufficiently large sample of a population with known variance, the
expected underlying distribution of the statistic can be assumed normal. How-
ever, it is often the case in practical applications that we are dealing with a
limited sample and unknown population variance. In such cases one can in-
stead use a Student-t distribution, and perform a so-called t-test.
There are different types of t-tests available, with regard to the values we
wish to compare: comparing means stemming from a single group/population
(e.g. measuring the outcome of a treatment) requires a paired t-test, compar-
ing groups from two different populations boils down to a two-sample t-test,
and comparing a group based value against a specific scalar (e.g. comparing
correlation coefficient to 0) calls for a one-sample t-test.
The t-statistic T for a two-sample test comparing groups of sizes N1 , N2
with means m1 , m2 and standard deviations s1 , s2 is calculated as:
T = (m1 − m2) / (s_p √(1/N1 + 1/N2))    (3.17)
where s_p is the pooled standard deviation, s_p = √(((N1 − 1)s1² + (N2 − 1)s2²)/(N1 + N2 − 2)).
This definition assumes the groups have sufficiently similar variances. If that is
not the case, a more general version can be found e.g. in [19]. In the case of a
one-sample test, comparing group mean m1 to μ, this simplifies to T = (m1 − μ)/(s1/√N1).
The paired version of the test requires dependent samples (repeated mea-
sures of a single sample or two paired samples) and is calculated according to
the one-sample test, using the mean and standard deviation of the differences
between corresponding pairs.
The t-test is a parametric test (assuming approximately normal distribu-
tions), and thus not appropriate when this assumption is violated. It can be
sensitive to outliers when the sample size is very small.
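The pooled two-sample statistic of Equation 3.17 is straightforward to compute directly; a small numpy sketch (equal-variance case only):

```python
import numpy as np

def two_sample_t(x1, x2):
    """Pooled two-sample t-statistic, Eq. (3.17); assumes similar group variances."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = np.mean(x1), np.mean(x2)
    s1, s2 = np.std(x1, ddof=1), np.std(x2, ddof=1)  # sample standard deviations
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * np.sqrt(1 / n1 + 1 / n2))
```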
ranks are summed individually, and the lowest sum is used as the Wilcoxon
statistic (W) value.
To obtain the corresponding p-value, the acquired value needs to be com-
pared to the underlying distribution of W under H0 . The null hypothesis here
corresponds to the sample of positive and negative differences being normally
distributed around 0 (with median 0) [100]. For small sample sizes, it can be
computed exactly, in a combinatorial manner. But for larger sample sizes, that
is not computationally tractable, and a normal approximation is used for the
standardized W statistic.
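The W statistic itself can be sketched as follows; this minimal version discards zero differences and ignores tied ranks, which a full implementation would handle:

```python
import numpy as np

def wilcoxon_w(diffs):
    """Signed-rank W: rank |differences|, sum ranks per sign, take the smaller sum."""
    diffs = np.asarray(diffs, dtype=float)
    diffs = diffs[diffs != 0]                    # zero differences are discarded
    order = np.abs(diffs).argsort()
    ranks = np.empty(len(diffs))
    ranks[order] = np.arange(1, len(diffs) + 1)  # no tie handling in this sketch
    w_pos = ranks[diffs > 0].sum()
    w_neg = ranks[diffs < 0].sum()
    return min(w_pos, w_neg)
```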
Bonferroni method
The oldest and simplest FWER-controlling correction procedure is the Bon-
ferroni correction [29], which assumes test independence. The correction con-
sists of simply dividing the significance threshold by the number of all tests.
Clearly, this method is very conservative when test multiplicity is very high,
quickly losing the power to detect any actual significance.
This is especially prominent in the case of images, where the number of
tests equals the number of pixels. In addition, test independence is typically
violated on images, as some spatial correlation between pixels is almost uni-
versally present. While this does not undermine the corrected threshold’s the-
oretical validity, it exacerbates its stringency.
Some improvements (with respect to extreme stringency) of the Bonferroni
method include the step-down Holm procedure [42], which relies on ordering
the signals/p-values and rejecting them sequentially, at different thresholds.
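Both procedures fit in a few lines; a sketch over a flat array of p-values (one per pixel):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0 wherever p < alpha / m, with m the total number of tests."""
    return np.asarray(pvals) < alpha / len(pvals)

def holm_reject(pvals, alpha=0.05):
    """Step-down Holm: sort p-values and compare p_(i) against alpha / (m - i)."""
    p = np.asarray(pvals)
    order = p.argsort()
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):
        if p[idx] < alpha / (m - i):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject
```

Holm rejects at least as much as Bonferroni while still controlling the FWER.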
RFT method
The random field theory method (RFT) [1, 103] was developed as a solution
to the very restrictive independence assumption of the Bonferroni method. It
has been most popular within the neuroimaging community in recent years, as
it accounts for the fact that pixels are influenced by the signal in nearby pixels.
RFT considers the image to be a realization of a (e.g. Gaussian) random
field and relies on theoretical results for Euler characteristics in a thresholded
image. Euler characteristic (EC) can be interpreted as the number of con-
tinuous areas/clusters consisting of pixels with values exceeding some given
threshold (minus the number of holes/voids in these areas). In that sense, it
is tightly connected to homology, as it is a topological descriptor that can be
written as an alternating sum of Betti numbers.
Using a large enough threshold (w.r.t. pixel statistic values), EC will repre-
sent the number of local maxima and eventually fall to 0 (when the threshold
is set higher than the maximal pixel statistic). The expected value of EC can
thus be understood as the probability of at least one suprathreshold area, or in
other words, the probability of the maximal value (over all pixels) exceeding
the given threshold. This means that calculating the expected value of EC for
an image is roughly the same as performing FWER correction on its p-values.
A closed-form approximation for the expected value of EC (i.e. for the tails
of the null distribution) is derived in [1] for Gaussian fields, and extended
to other types of random fields in [103]. The RFT method involves a consider-
able amount of mathematical theory, however, the availability of closed-form
expressions makes it computationally undemanding and thus feasible for use
in imaging.
Although RFT solves the independence problem, it also introduces new re-
strictions on image smoothness, differentiability of the autocorrelation func-
tion, and locality-independent distributions. All of these are often unattainable
for medical images in practice. In addition, it tends to be very stringent when
the number of subjects/images involved in the analysis is small.
While this formulation of RFT holds for voxel-based correction and infer-
ence, it is possible to use it for cluster-based inference too, by instead control-
ling the probability that a single cluster of a certain size exceeds the threshold
under the null hypothesis.
Permutation-based method
The permutation-based methods [2] are based on the same idea as the per-
mutation test and use the same mechanism to empirically estimate the null
distribution. They are also subject to the same assumption of exchangeability
under H0 . However, to deal with the multiplicity issue, an extremal (maximal)
summarizing statistic needs to be used in the calculations.
In essence, given a family of n tests producing statistic values T1 , . . . , Tn ,
the summarizing (maximal) statistic is given as T = maxi=1..n Ti and the em-
pirical distribution under H0 is given by the maximal statistic values for all
permutations. Since the possibility of at least one false rejection coincides
with rejecting at least the test with the largest statistic, setting a significance
threshold for T then results in control of the FWER.
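As an illustration, the maximal-statistic procedure for a two-group pixel-wise comparison can be sketched as follows (using the absolute difference of group means as the per-pixel statistic; any statistic could be substituted):

```python
import numpy as np

rng = np.random.default_rng(0)

def maxstat_threshold(data_a, data_b, n_perm=1000, alpha=0.05):
    """Empirical FWER threshold: permute group labels, record the maximum
    per-pixel statistic for each permutation, and take its (1 - alpha) quantile.
    data_a, data_b: arrays of shape (n_subjects, n_pixels)."""
    both = np.concatenate([data_a, data_b])
    n_a = len(data_a)
    max_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(both))
        t = np.abs(both[perm[:n_a]].mean(0) - both[perm[n_a:]].mean(0))
        max_stats.append(t.max())
    return np.quantile(max_stats, 1 - alpha)
```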
TFCE method
To overcome the problem of an arbitrary cluster-defining threshold, an ex-
tension called Threshold-Free Cluster Enhancement (TFCE) [91] has been
proposed. The statistic it uses, the so-called TFCE-score, is a pixel-based
measure, encoding signal strength as well as its spatial extent information.
Formally, the TFCE-score for a pixel x is defined as:
TFCE(x) = ∫_{h0}^{hN} e(h)^E h^H dh    (3.19)
where e(h) is the extent of the cluster containing x at threshold h, and E, H are
parameters typically set to 0.5 and 2, respectively. In practice, this is computed
using a summation, with dh = 1 for discretized images. The TFCE-score out-
put map can be converted into FWER-corrected p-values using permutation
testing (for the maximal TFCE-score statistic).
As a pixel-based score, TFCE retains localization power which is otherwise
lost in cluster-based methods. However, the spatial extent information that
accounts for smoothness and lack of test independence is still incorporated in
the calculations through the integration of spatial support for the given pixel
over a set of thresholds. This way, the method is equipped to detect both
diffuse signals with low strength and strong, focal signals.
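A direct discretization of Equation 3.19 can be sketched with scipy's connected-component labelling; the step size dh here is an illustrative choice, and 4-connectivity is assumed:

```python
import numpy as np
from scipy.ndimage import label

def tfce(stat_map, dh=0.1, E=0.5, H=2.0):
    """Discrete TFCE score (3.19): at each threshold h, every suprathreshold
    pixel gains e(h)^E * h^H * dh, where e(h) is the size of its cluster."""
    out = np.zeros_like(stat_map, dtype=float)
    for h in np.arange(dh, stat_map.max() + dh, dh):
        clusters, n = label(stat_map >= h)  # connected suprathreshold areas
        for c in range(1, n + 1):
            mask = clusters == c
            out[mask] += mask.sum() ** E * h ** H * dh
    return out
```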
CBA method
In Cluster-Based Analysis (CBA) [41], the initial step of defining clusters is
done on a no-activation image, which is excluded from the analysis. Further-
more, the so-defined clusters are assigned some summarizing statistic (taking
into account all elements of the cluster), which is finally analyzed instead of
pixel-based statistics. In this sense, the CBA method lies at the intersection
of pixel- and cluster-based methods, as all individual pixels are in a way con-
sidered (and have the possibility of being significant), but the actual analysis
(and inference) units are predefined non-overlapping clusters.
The process of defining clusters is based on the pairwise inter-pixel corre-
lation in the local neighbourhood. For each pixel, the correlation is computed
with all its neighbours, and the pixel is then clustered together with its most
highly correlated neighbour. Since the maximum correlation property is not
symmetric, the produced clusters can be of arbitrary sizes.
For a given pixel or voxel, not all of its neighbours are at the same distance
from it. Therefore the correlations need to be corrected with respect to the
distances. After the cluster are defined and their signals computed, the p-
values for the clusters are calculated. The null hypotheses here are that none
of the cluster elements is active, under which the cluster p-values are uniformly
distributed.
The so-acquired p-values still need to be adjusted to correct for the test
multiplicity. The original paper [41] suggests using methods of false discovery
rate control, but the adjustment can just as well be done for FWER control,
for example by any of the previously described approaches. By effectively
(at least) halving the number of tests and joining strongly correlated areas,
the CBA method directly alleviates the multiplicity issue and problems with
violating the independence assumptions of other correction methods.
4. Improving segmentation learning
4.1.1 Motivation
Medical image segmentation, particularly multiclass, tends to depend heavily
on contextual information, since object positions and co-occurrences are
typically class dependent. Using only image patches for CNN training effectively
restricts the receptive field, and the only way of introducing broader context is
by adding it to the network separately.
In patch-based training, patch locations within the image carry information
about the broader context and can even encode anatomical priors where the un-
derlying anatomy is well-defined (e.g. whole-body scans). A number of works
have experimented with patch-location-informed training settings [51, 35, 34],
but mostly use absolute coordinates or prior probability or structural atlases,
relying on image registration.
When working with anisotropic data or data like whole-body scans, where
the objects in the image can exhibit a fair amount of variability in terms of size
and positioning, absolute coordinates of the patch positions across subjects
may not be comparable or informative. In paper I we therefore propose a
distance-map-based encoding of the patch locations, which uses predefined
landmarks but requires no prior registration and produces directly comparable
data across patches.
4.1.2 Methods
The method development in paper I was driven by the POEM dataset (see sec-
tion 2.2), from which additional absolute landmark position data for ankle,
knee, hip and shoulder joints has been extracted. While similar methodological
development could be carried out with any other anatomically relevant points (or
even sequentially, by using two landmarks at a time), we focused specifically
on abdominal organ segmentation and utilized the hip and shoulder joint posi-
tions.
To provide the patch locations in a relative manner (i.e. such that similar
numbers correspond to the same anatomical areas across images), we propose
using landmark-normalized distance maps as additional input to the network
at the convolutional level. Let (i, j, k) represent the 3D image-space coordi-
nates in the left-right (X), front-back (Y) and foot-head (Z) directions. We
define two distance maps, Dx and Dz , omitting the front-back direction (since
variability in landmark/object locations in that direction is low):
Dx(i, j, k) = (i − A(L))/(A(R) − A(L)) − 0.5   and   Dz(i, j, k) = (k − F)/(H − F).    (4.1)
Here F and H refer to the minimum and maximum k-coordinates of the hip
and shoulder landmarks, respectively, while A(L) and A(R) are the average i-
coordinates of the left and right pairs of landmarks. We chose to offset the Dx
map by 0.5, to bring the two maps closer in range.
Figure 4.1. Two subjects of different sizes from the POEM cohort. From both, we
sample an equally-sized patch centred on the landmark midpoint. The patches of
proposed Dx and Dz thus have the same value in the centre and encode the amount
of actual tissue captured within the patches through their ranges. The absolute voxel
coordinates of the patch centre on the other hand hold no such information and are not
easily comparable.
Sampling patches from maps defined in Equation 4.1 captures the patch
location relative to the landmarks. Further multiplying the coordinates by the
reconstructed voxel sizes counteracts anisotropy. Providing patch locations in
such a relative sense has an important advantage over simply considering the
absolute patch positions: it encodes information about the extent of the subject
within the image. The effect is illustrated in Figure 4.1.
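Equation 4.1 amounts to two broadcast operations over the voxel grid; a minimal sketch (the function name and argument layout are illustrative, not from paper I):

```python
import numpy as np

def landmark_distance_maps(shape, A_L, A_R, F, H):
    """Landmark-normalized location maps of Eq. (4.1) for a 3D volume.
    A_L / A_R: average i-coordinates of the left/right landmark pairs;
    F / H: minimum/maximum k-coordinates of the hip and shoulder landmarks."""
    i = np.arange(shape[0])[:, None, None]   # left-right coordinate
    k = np.arange(shape[2])[None, None, :]   # foot-head coordinate
    Dx = (i - A_L) / (A_R - A_L) - 0.5       # offset to match the range of Dz
    Dz = (k - F) / (H - F)
    # broadcast to the full volume; the front-back (j) direction is omitted
    return (np.broadcast_to(Dx, shape).copy(),
            np.broadcast_to(Dz, shape).copy())
```

Patches sampled from these maps then carry location values that are comparable across subjects of different sizes.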
Considering that we chose to work with four landmarks with a known
occurrence pattern (i.e. their convex hull forms an approximate trapezoid when
projected on the coronal (frontal) plane), another possible way of computing
relevant distance maps could be to simply compute a projective transformation
of the landmark convex hull onto a [−1, 1]² square.
But, due to the elongated overall shape of the human body, this can in-
duce extreme values and potential singularities where the lines connecting the
left two and the right two landmarks intersect (see Figure 4.2). Hence, so-
created maps are not appropriate for training unless additionally processed
(e.g. cropped or value clipped).
Figure 4.2. The distance maps created through a projective transform of the landmark
convex hull (in left-right, X, and foot-head, Z, directions only) onto [0, 1]2 . The two
maps represent the transformed X and Z voxel coordinates, respectively. Ranges are
clipped for visualization. Pink dots show landmark positions. In the Z coordinate
map, a singularity occurs at the level of the feet (where lines connecting shoulder and
hip landmarks cross), producing extreme values and a sign swap.
4.1.3 Results
Training with different networks
The experiments in paper I were performed on the task of abdominal or-
gan segmentation (bladder, kidneys, liver, pancreas and spleen), using three
network architectures that take into account different levels of context: U-
Net [82], Vanilla CNN and DeepMedic [50]. All networks were trained on
patches of size 25³, both with and without the added Dx and Dz maps (see paper
I for detailed training settings). With all three networks, we saw an improve-
ment both in terms of DSC and stability (as deduced from the training curves)
when using our proposed distance maps.
Figure 4.3. A box plot of the 5-fold validation Dice scores per class (averaged over 5
repetitions), for training with absolute and with the proposed DTs, and without them.
4.1.4 Conclusions
In paper I we proposed a way of informing the network of the patch positions
to improve patch-based segmentation training. We evaluated the idea using
three network architectures and compared it to using absolute patch positions,
as is often done in the literature. We showed that using the proposed landmark-
normalized maps improves both training stability as well as the final Dice
scores on the task of abdominal organ segmentation.
4.2.1 Motivation
Point annotations are very cheap to acquire and can be done accurately even
by less experienced annotators. However, unlike full pixel-wise annotations,
they contain no object shape or extent information. It is thus reasonable to
try and provide such priors to the network in a different way.
At the same time, for training with full annotations, the DT-reliant Bound-
ary loss [54] has proven very effective, particularly for imbalanced datasets
and irregularly shaped objects, which is often the case in medical image seg-
mentation. Obviously, enforcing closeness in boundary/space when training
with point annotations would result in degenerated outputs containing points
or only background. But as mentioned in section 3.3, the appearance (and the
information it encodes) of a DT depends on the choice of the distance function.
This motivates the question of whether it is possible to harness the strengths
of the Boundary loss even under weak supervision, given appropriate distance
definitions.
Figure 4.4. A grayscale image (left) and examples of optimal learned segmentation
curves under losses focusing only on spatial proximity (red curve, middle) or intensity
differences (multiple curves, right) to the desired ground truth (the shaded object in
the centre). Enforcing exact overlap with the ground truth in space, the optimal curve
coincides with the ground truth delineation. When using intensity information on the
other hand, many curves can incur a close-to-zero cost and prove optimal. Either
behaviour can be desirable, depending on the application.
Figure 4.5. Example image from the ISLES dataset (see section 2.2) with its ground
truth contour and corresponding signed distance maps: spatial proximity-based, both
intensity- and spatial proximity-based, and only intensity-based, respectively.
ISLES data is highly imbalanced, with the lesion class representing only a
very small area, so not all 2D slices contain lesions. Hence the signed DTs
used for BL were computed directly on the 3D scans. Training with U-Net
[82] (see [12] for specific experimental settings) produced the training curves
given in Figure 4.6.
Figure 4.6. Training and validation curves showing DSC and HD95 scores on training
with fully annotated ISLES data. GDL denotes training with generalized DSC, and
BL(·) denotes training with boundary loss using the specified distance.
According to these curves, the fully intensity-based distance, MBD, may
accelerate the training and overfit more quickly. In addition, we
noticed that the training curves for BL with Geodesic distance follow the orig-
inal curves (BL in Euclidean setting) fairly tightly, incurring only a small loss
of performance towards the end of the training. That can be attributed to
over-penalizing the close-to-exact segmentation, since the Geodesic-based BL pe-
nalizes both spatial and intensity differences.
These experiments further confirmed our decision to focus on fully intensity-
based distances. In paper II we thus proposed using BL with MBD for segmen-
tation training under point supervision. For completeness, we again compare it
to Euclidean and Geodesic distance, but also to another purely intensity-based
distance, using Equation 3.15 with λ = 1.
4.2.3 Results
The experiments of paper II were run on 2D slices of the ACDC and POEM
datasets (see section 2.2). Since both datasets come with full pixel-wise an-
notations, we created synthetic point annotations randomly. For faster experi-
mentation, we used the E-Net [75] architecture. As the point annotations were
created per slice, even the DTs were computed on individual slices. The pro-
posed combination of BL with MBD was compared not only to combinations
with other distances but also to the state-of-the-art approach in weak segmen-
tation, using the Conditional Random Field (CRF) regularized loss [97]. For
more details regarding the experiments consult paper II.
Using 2D maps
The experiments done on the ACDC dataset showed that the proposed ap-
proach, using BL with MBD, performed best in terms of the highest reached
Dice. It was also characterized by a slower collapse towards the weak ground
truth, providing a wider interval of epochs for choosing the best model. The
method using CRF-loss was not able to compete with BL derivatives, on av-
erage. Inspecting individual runs, however, showed that it could potentially
outperform the proposed combination of BL and MBD, but the performance
is very unstable, varying a lot across runs.
The Dice scores achieved in training on POEM data, on the other hand, were
less convincing. The combination of BL and MBD was on par with, but not
significantly better than, the other combinations. We suspect this was due to
the relatively low resolution of the images, allowing the MBD distance values
to more easily bleed through object boundaries, leading to more spatially
inconsistent, fragmented segmentation regions.
Using 3D maps
When training on coronal slices of the POEM data, most slices will lack at
least some of the foreground classes. By design, the DT on a slice where the
annotation for some class is absent will be 0, allowing for unpenalized
over-segmentation of that class. Instead of artificially setting such maps to
some predefined nonzero value, one can use slices of a DT calculated on the
entire 3D subject. In paper II, we ran an additional set of experiments with 3D distance
maps. The overall performance is improved, but the MBD, Geodesic and Intensity
distances still perform on par. Due to the additional dimension, the MBD has
even more potential for bleed-through, which could be the reason behind the
lack of improvement.
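The per-slice failure mode and the 3D remedy described above can be illustrated with a small sketch. Again the Euclidean distance serves as a hypothetical stand-in for the distances used in the paper; the point is that an empty slice yields an all-zero 2D map, while the corresponding slice of a full-volume map carries distances to annotations in neighbouring slices.

```python
import numpy as np

def dt_to_annotation(mask: np.ndarray) -> np.ndarray:
    """Distance to the nearest annotated element, in any dimension.

    Returns all zeros when the mask is empty, mirroring the per-slice
    failure mode described above (no penalty for over-segmentation).
    """
    pts = np.argwhere(mask)
    if len(pts) == 0:
        return np.zeros(mask.shape)
    grid = np.indices(mask.shape).reshape(mask.ndim, -1).T
    d = np.linalg.norm(grid[:, None, :] - pts[None, :, :], axis=-1)
    return d.min(axis=1).reshape(mask.shape)

# Toy volume of 3 slices; the class is annotated only on slice 0.
vol = np.zeros((3, 4, 4), dtype=bool)
vol[0, 1, 1] = True

dt2d = dt_to_annotation(vol[2])   # empty slice: all-zero 2D map
dt3d = dt_to_annotation(vol)[2]   # same slice of the 3D map: nonzero
```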
Time considerations
The DTs for use with the boundary loss need to be computed per class (and
potentially per channel), which magnifies the computational burden of distance
map calculation. To evaluate the trade-off between the gain in segmentation
accuracy and the required time, we also benchmarked the computation times of
the different distance maps in paper II. As the MBD was the only distance not
computed on the GPU, it was, as expected, the most time-consuming. However,
the time requirements are still reasonable for practical use when the DTs can
be precomputed before training, particularly in 2D.
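Since the maps depend only on the annotations, the precomputation mentioned above amounts to building a per-class cache once before training. A minimal sketch (the function name and the placeholder distance are hypothetical; any of the compared distance functions could be plugged in as `dt_fn`):

```python
import numpy as np

def precompute_class_dts(labels: np.ndarray, n_classes: int, dt_fn):
    """Build one distance map per class before training starts.

    labels: integer label image; dt_fn: any distance-map function
    (Euclidean, geodesic, MBD, ...) applied to a boolean class mask.
    The returned dict can be stored on disk and reused every epoch,
    so the cost is paid once rather than per training iteration.
    """
    return {c: dt_fn(labels == c) for c in range(n_classes)}

# Demo with a placeholder "distance": simply the inverted class mask.
labels = np.array([[0, 0, 1], [0, 2, 1], [2, 2, 1]])
dts = precompute_class_dts(labels, 3, lambda m: (~m).astype(float))
```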
4.2.4 Conclusions
We proposed the use of the purely intensity-based Minimum barrier distance
to harness the advantages of the Boundary loss even in weakly supervised
segmentation training. Based on its definition, the Boundary loss appears to
be directly incompatible with inexact ground truth boundaries. But through
extensive experiments we showed that, perhaps counterintuitively, the Bound-
ary loss behaves well even under extremely weak forms of supervision when
combined with intensity-based distances. The experiments confirmed that, un-
der point annotations, the combination of BL with MBD produces promising
results, but potentially requires additional preprocessing for some datasets.
5. Cross-modality image retrieval
5.1 Motivation
Examination of histological images (especially hematoxylin and eosin stained
brightfield (BF) microscopy images) forms the basis for many medical diag-
noses (e.g. cancer) [80]. These images are typically acquired in whole slide
image (WSI) scanners that generally only capture a single modality. However,
different modalities can capture different, complementary information about
the data, providing further insights into potential pathologies. In particular,
second harmonic generation (SHG) has been found to facilitate content under-
standing when examined together with the corresponding BF images.
To be able to utilize the large databases of BF and SHG images together, it
is thus important to be able to query them across modalities. Furthermore, BF
images captured with WSI scanners generally cover a large tissue area (even
up to 100,000 square pixels), while the coverage of SHG images (at the same
resolution) is typically much smaller, illustrating the need for methods that
can handle sub-image retrieval. Additionally, as SHG and BF images of the
same specimen are often not taken simultaneously or with the same machine,
it may be necessary to perform the retrieval on the instance level.
In paper III we therefore proposed a modular method, able to handle instance-
level cross-modal histological image retrieval across very dissimilar modali-
ties, and evaluated it on the multimodal histological dataset of BF and SHG
images (see section 2.2 for the details).
5.2 Methods
Our proposed pipeline consists of three main steps. The first step provides a
way to bridge the gap between the modalities and bring them closer together.
In the second, features are extracted and binned (through the use of a bag of
words (BoW) model) to make them comparable across images. Using these BoW
encodings, pairwise similarities can be computed as the criterion for ranking
the matches. The third step re-applies the second step to a subset of the best
initial matches to rerank them for increased accuracy.
Figure 5.1. The first and third columns show example BF (above) and SHG (below)
image pairs [32] with their corresponding CoMIR [78] embeddings in the second and
last column.
compared via the cosine similarity measure (see subsection 3.1.2). The larger
the similarity score, the better (higher ranked) the match.
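The ranking step can be sketched in a few lines: each image is represented by its BoW histogram, and database entries are ordered by cosine similarity to the query. This is a minimal illustration of the ranking criterion, not the full pipeline (the CoMIR embedding and reranking steps are omitted).

```python
import numpy as np

def rank_by_cosine(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Rank database BoW vectors by cosine similarity to the query.

    query: (D,) histogram; database: (N, D) histograms.
    Returns indices of database entries, best match first.
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity per database entry
    return np.argsort(-sims)         # descending: larger similarity ranks higher

# Tiny toy database of three 3-bin BoW histograms.
db = np.array([[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [0.0, 1.0, 0.0]])
order = rank_by_cosine(np.array([1.0, 0.05, 0.0]), db)
```

Reranking then re-applies a (finer) version of this comparison to only the top few entries of `order`.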
Figure 5.2. The top-1, 5 and 10 retrieval success using various reranking methods on
the task of retrieval across BF and SHG modalities and transformations. BF in SHG
denotes searching through images of SHG modality for a BF query and vice versa.
The results of using no reranking versus reranking the first 15 matches with
the three described reranking methods, on the problem of retrieval across
transformations and the BF and SHG modalities, are summarized in Figure 5.2.
Based on these limited experiments, we confirm that the reranking method
we chose to use within our proposed pipeline is a reasonable option.
5.3 Results
Replacement study
To further confirm the proposed pipeline design choices in paper III, we per-
formed a replacement study, swapping the individual parts of the pipeline for
a few viable alternatives. More concretely, we used the Pix2Pix [47] and Cy-
cleGAN [106] architectures as alternatives for bridging the modality gap. For
feature extraction we evaluated SURF [7], SIFT [64] and ResNet [40]. Instead
of feature extraction with BoW encodings, we tried a recent reverse image
search tool using 2D Krawtchouk Descriptors (2DKD) [25]. The results
showed that our proposed pipeline outperformed all other tested combinations
and was the only one able to handle retrieval across both modalities and
transformations.
5.4 Conclusions
Among the current state-of-the-art methods in cross-modality image retrieval,
few apply to instance-level retrieval across modalities as different as BF and
SHG. In addition, many are specific to category-level retrieval (requiring more
than one instance per category for training) [28]. In paper III, we proposed a
modular image retrieval pipeline for query by example across imaging modal-
ities on an instance level. We motivated the design choices and illustrated the
superiority of our method through extensive experimentation. We also dis-
cussed the reasons behind the failures of the compared methods and steps.
6. Handling statistical analyses in Imiomics
To date, the most common way of joining imaging data with non-imaging
parameters for use in research or diagnostics is to extract features and
measurements of interest from the images and examine them together with the
other parameters. However, this requires deciding in advance what those
features of interest might be and cannot represent the full richness of the
information present in the images.
A concept called Imiomics (imaging-omics) [94, 93] instead aims to do the
opposite and integrate the non-imaging measurements onto the whole-body
imaging data to preserve as much underlying information from the images as
possible. First, all subjects involved in the analyses are registered together so
that their pixel values at individual locations are comparable. Then the chosen
non-imaging parameters can be merged with the registered images through
per-pixel application of functions (e.g. correlation) depending on both the
parameter and the underlying pixel value. Instead of the raw intensity values,
the Jacobian determinants (JD) of the displacement fields, recovered by the
registration procedure, can be used [94]. These can be considered a measure of
areal/volumetric changes between subjects. Examples of analyses that can be
performed through Imiomics include creating a healthy whole-body imaging
atlas, anomaly detection, and cross-sectional and longitudinal studies.
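The per-pixel merging of a non-imaging parameter with registered image values can be sketched as a voxel-wise Pearson correlation map. This is a minimal, hypothetical illustration of the Imiomics correlation analysis described above, with the Jacobian determinants represented by a random array:

```python
import numpy as np

def correlation_map(jd: np.ndarray, param: np.ndarray) -> np.ndarray:
    """Per-voxel Pearson correlation between a non-imaging parameter
    and registered image values (e.g. Jacobian determinants).

    jd: (S, X, Y, Z) array, one registered volume per subject.
    param: (S,) non-imaging measurement per subject.
    Returns an (X, Y, Z) correlation map.
    """
    jd_c = jd - jd.mean(axis=0)               # centre over subjects
    p_c = param - param.mean()
    num = np.tensordot(p_c, jd_c, axes=(0, 0))  # sum over subjects per voxel
    den = np.sqrt((jd_c ** 2).sum(axis=0)) * np.linalg.norm(p_c)
    return num / den

rng = np.random.default_rng(0)
param = rng.normal(size=8)                    # 8 subjects
jd = rng.normal(size=(8, 2, 2, 2))            # toy "registered volumes"
jd[:, 0, 0, 0] = 2.0 * param                  # plant a perfectly correlated voxel
rmap = correlation_map(jd, param)
```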
While presenting information as an image can be more informative, it is also
subject to multiplicity issues (as discussed in section 3.4), which can
complicate the detection of true deviations/activity/significance in the data.
Numerous correction methods are available, but their performance depends on
the underlying data. In paper IV, we develop paradigms for evaluating
corrections on whole-body scans, benchmark the methods for use within Imiomics
and propose method extensions. This chapter briefly presents the key results
of the work. For more details and in-depth discussions, refer to the text of
the original paper.
6.1 Motivation
As mentioned in section 3.4, most of the correction methods have been de-
veloped with neuroimaging applications in mind. While the general issues
of image smoothness and test dependence persist across all medical imaging
applications, differences in modalities and scan regions severely affect com-
pliance with various assumptions of correction methods and their performance
on the data.
It is therefore essential to evaluate the methods not only on synthetic data
but also on the datasets of interest, in order to establish principled
correction procedures. But the evaluation process is not always
straightforward. In functional neuroimaging, the standard approach is to
evaluate correction methods using special no-activation scans, which may be
impossible to acquire in other applications.
In paper IV, we thus performed (to the best of our knowledge) the first
validation of correction methods on large-scale imaging data, specifically
whole-body MRI, and proposed an Imiomics-specific evaluation strategy. In
addition, anatomy-compliant method improvements were developed.
6.2 Methods
6.2.1 Evaluation strategies
For the particular case of correlation analyses, we propose two strategies for
correction method evaluation. The first one is done in the absence of activity,
analogous to the no-activation-based evaluations from neuroimaging studies.
Since it is not possible to acquire a non-imaging parameter that is guaranteed
to be uncorrelated with any signal in the body, we suggest using a mix of
artificial and real-life data and constructing a synthetic set of measures to
be correlated with the imaging data.
The second evaluation strategy relies on data with a known activity pattern to
empirically evaluate the type II error rate. Such an evaluation is of interest
because type I and type II error rates are nontrivially connected [56].
Assuming that the tested methods really do keep the FWER below the imposed
upper bound, their performance can be further evaluated via sensitivity (i.e.
type II errors).
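The first (null-case) strategy can be sketched as a small simulation: under the null hypothesis the per-test p-values are uniform, so any rejection after correction is a family-wise error. The sketch below uses the Holm step-down procedure (one of the methods evaluated in paper IV) as the correction; the simulation setup itself is a simplified illustration, not the evaluation protocol of the paper.

```python
import numpy as np

def holm(p: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Holm step-down correction: with p-values sorted ascending,
    reject p_(i) while p_(i) < alpha / (m - i + 1), then stop."""
    order = np.argsort(p)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):          # i = 0, 1, ...
        if p[idx] < alpha / (m - i):
            reject[idx] = True
        else:
            break                            # step-down: stop at first failure
    return reject

def estimate_fwer(correct_fn, n_tests=500, n_reps=200, seed=0):
    """Empirical FWER under a synthetic null: null p-values are
    Uniform(0, 1), so any surviving rejection is a family-wise error."""
    rng = np.random.default_rng(seed)
    errors = sum(correct_fn(rng.uniform(size=n_tests)).any()
                 for _ in range(n_reps))
    return errors / n_reps

fwer = estimate_fwer(holm)   # should stay at or below roughly 0.05
```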
Correction on predefined anatomy-based clusters
When it is reasonable to assume that any true activity will always encompass
the entire structure/tissue/organ, the multiplicity problem can be alleviated by
treating individual structures as cluster units for inference.
A simple mean of the signal over each structure cluster could boost the
signal-to-noise ratio. However, it can be sensitive to incorrect delineations.
Hence the uncertainties of structure membership at each pixel should be com-
pensated for. Our proposed summarizing statistic for segmentation-defined
clusters uses a weighted signal average of pixels in each cluster as the cluster
signals. The weights are chosen such that they decrease towards the cluster
boundaries, representing the segmentation accuracy. If segmentation is done
in an automated manner, the network’s uncertainty can be used directly. If, on
the other hand, the segmentation is performed manually by multiple experts,
the fraction of the experts annotating a particular pixel as the given structure
can be used as that pixel's accuracy/weight. Formally, given a structure
segmentation U with a known accuracy function a : Ω → [0, 1], the signal S of
that structure cluster is defined as

    S = (1/N) ∑_{x∈U} x · a(x)        (6.1)
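A minimal sketch of this weighted cluster statistic, under the assumptions that N is the number of pixels in the cluster U and that x in the sum denotes the pixel's signal value (neither is spelled out here). The accuracy weights are built from the fraction-of-experts scheme described above; all array names are illustrative.

```python
import numpy as np

def cluster_signal(values: np.ndarray, accuracy: np.ndarray,
                   cluster: np.ndarray) -> float:
    """Accuracy-weighted cluster signal, Eq. (6.1):
    S = (1/N) * sum over x in U of value(x) * a(x).
    Assumes N = |U|, the number of pixels in the cluster."""
    v = values[cluster]
    a = accuracy[cluster]
    return float((v * a).sum() / cluster.sum())

# Accuracy from three experts: the fraction annotating each pixel.
expert_masks = np.array([
    [[1, 1], [1, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [1, 0]],
], dtype=float)
accuracy = expert_masks.mean(axis=0)   # per-pixel agreement in [0, 1]
cluster = accuracy > 0                 # pixels considered part of U
values = np.array([[1.0, 2.0], [3.0, 4.0]])
S = cluster_signal(values, accuracy, cluster)
```

Pixels near the cluster boundary, where experts disagree, contribute less to S, which is exactly the compensation for uncertain delineations motivated above.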
6.3 Results
We evaluate several known correction procedures, namely Holm, RFT, TFCE, the
permutation method and CBA, on correlation analyses for Imiomics. They are
compared to the proposed anatomy-based method and extensions through both
proposed evaluation strategies, i.e. in the assumed absence and the known
presence of a signal. All experiments are performed on a subset of the POEM
dataset (see section 2.2), split by gender, together with the Bioimpedance
Analysis (BIA) measurements. The significance level was set to 0.05, and the
primary cluster-defining threshold to 0.001 where applicable. The BIA measure
represents the total fat in the body and is expected to relate strongly to the
Jacobian determinants of the registration displacement fields. The integration
of BIA with the image data was thus done through correlation maps (with JD).
In absence of activity
For the null case, a random vector from a standard normal distribution was
used in the correlations. The nominal error rates over 200 repetitions showed
that, while cluster-based methods are typically better at retaining true
activity, they also tend to be more lenient with false positives. The
anatomy-compliant extension to the permutation method (i.e. looking directly
at the cluster extents) offers an improvement over the original, but has a
somewhat opposite effect for TFCE and CBA.
In presence of activity
Analyzing the correlations of BIA with JD, the (true) signal was expected to
be present in high-fat regions identified a priori through anatomical
knowledge and manual segmentation.
The number of voxels retaining activity post-correction was interpreted as
indicative of a method's stringency. Ideally, all the observed activity inside
the areas with high fat content should be retained, although, in theory, some
true activity outside the fat areas cannot be excluded. Through
examination of the voxel-activity histogram, we deduced that the cluster-based
methods are generally better at retaining all the activity in the fat regions and
that our proposed anatomy-based extensions improve true-activity retention.
6.4 Conclusions
We have proposed strategies for correction method evaluation on whole-body
MRI and Imiomics analyses, as well as adapted versions of various correction
methods for more interpretable, anatomy-compliant error control. The performed
study constituted the first large-scale evaluation of these methods on
real-life data outside of neuroimaging.
With respect to the aims of the project, we proposed methods that improve
the robustness and provided an in-depth discussion on the appropriateness of
available methods for statistical evaluations in Imiomics.
Limitations of the study include the correlation-restricted analyses and the
sensitivity of the proposed methods to uncertainties in the anatomical priors
(they require accurate delineations and/or good uncertainty quantification of
the inaccuracies). For corrections on predefined clusters directly, the
statistic values may
not be trivially comparable among clusters for arbitrary accuracy functions. If
only the outermost segmentation borders are assumed to have lower accuracy,
for example, the method may be biased towards detecting signals from large,
circular areas (with a lower boundary-to-area ratio).
7. Conclusions and future work
In papers I and II, we focused on segmentation problems, with the more
specific aims of relaxing the full-annotation requirement for training DL
models and improving patch-based learning to enable training on images too
large to fit in contemporary GPU memory. In paper I, we proposed a way to
introduce
relative patch location and extent information through landmark-based dis-
tance maps. Validation with three different networks showed significant im-
provements in the segmentation accuracy and training stability on the task of
abdominal organ segmentation from whole-body MRI, enabling better patch
learning with a negligible increase in memory requirements.
Distance maps were also used in paper II, where we discuss and evaluate
the potential applicability of Boundary loss in segmentation learning under
very weak forms of supervision. We argue for using intensity-based distances
and show that the combination of the Minimum barrier distance and Boundary
loss outperforms the current state of the art in point-supervised training with-
out a need for extensive parameter tuning. With this approach, we managed to
relax the pixel-wise annotation requirements while still achieving satisfactory
segmentation results.
Image retrieval has long been used within diagnostics for retrieving and
comparing similar pathologies in histological images. More recently, with the
increase in available data, it has become popular also within radiology. In
both cases, the importance of being able to search across modalities is clear.
With the goal of establishing a method for accurate and reliable instance-
level retrieval across modalities and transformations, we proposed a three-step
method consisting of contrastive-learning-based 2D image representations, a
bag of words model and a reranking scheme. We provide an extensive replace-
ment study, motivating our design choices. Through comparison with state-of-
the-art methods, we show the clear superiority of the proposed method on the
problem of histological image retrieval across BF and SHG modalities. With
state-of-the-art top-K results, the method meets the set aims.
The statistical analyses required for reliable inference are often overlooked
but play just as important a part in medical image analysis as the rest of the
image-processing tools and methods. Our work in paper IV aimed to improve
and evaluate the robustness of statistical analyses for Imiomics on whole-body
MRI. We proposed two evaluation strategies, evaluated a number of known
correction methods and extended their definitions to include priors on the un-
derlying anatomy in the images. We showed, experimentally, that using such
underlying anatomical knowledge can help relieve the multiplicity issue, ful-
filling our aims to improve robustness and reduce the stringency of methods
for Imiomics analyses.
These paper contributions are well aligned with the main objective of the
thesis work, to develop and improve automatic methods supporting medical
(specifically diagnostic) workflows.
Within the work on statistical analyses, we only proposed a very crude way
of using anatomical knowledge, and the possibilities of including it more di-
rectly, in a statistical manner, remain to be explored.
Summary in Swedish (Sammanfattning på svenska)

Medical images have played an important role in diagnostics ever since the
first X-ray image was taken in 1895. In recent years, both the number of
different modalities and the number of acquired images have increased
enormously. In addition, image resolution has improved, and the time needed to
acquire images has decreased. Taken together, these developments mean that
physicians increasingly rely on medical images for diagnosing and monitoring
various diseases.

With the large amount of available image data, there is today a shortage of
experts to analyse it, and the manual effort required for image processing and
analysis has become a bottleneck in the process. Computerized methods have
become an important tool for supporting physicians in their work and for
speeding up image processing and analysis.

Many computer-based image processing and analysis methods have long been
developed with the aim of supporting physicians and simplifying their work.
For computer-based methods, automatic as well as semi-automatic, to be applied
in medicine, they should be reliable, explainable and reproducible; not least
within diagnostics, where the outcome can have major consequences for the
patient.

The specific application and the reason for image acquisition determine how
the images are processed and/or analysed, usually with the help of expert
knowledge. In image-based diagnostics, radiological images and microscopy
images are typically used, and they often contain complementary information.
While radiological images are often used as a stand-alone tool, they sometimes
require further processing based on e.g. histological images of biopsies.
Correspondingly, an examination of histological images can either be the goal
of a medical procedure, or indicate the need for e.g. radiological imaging.

In this thesis we aim to improve methods used within image-based diagnostics,
both for radiological images and for microscopy images. We develop or improve
methods applicable at different steps of a diagnostic pipeline, namely
processing of radiological images (through segmentation methods), biomedical
image retrieval (with a focus on histological data) and statistical analyses
applied to the so-called Imiomics concept.

In papers I and II we work with segmentation methods based on neural networks.
In some applications, neural networks need to be trained on cut-out parts of
the images, patches, since medical images are sometimes too large to fit in
the memory of modern GPUs. In paper I we improve and stabilize the training of
patch-based methods through the use of specially normalized distance
transforms based on anatomical landmarks.

Modern learning-based segmentation methods typically require a large amount of
annotated data. Since exact annotations are expensive and time-consuming to
generate, methods that use incompletely annotated data, based on training with
so-called weak supervision, can be used. In paper II we present a segmentation
method that combines the Boundary loss with the Minimum barrier distance (MBD)
for better learning under weak supervision.

Histological images are often stored in large databases together with other
information, such as previous diagnoses. When physicians examine a new tissue
sample, they can then look at similar tissue samples in a database, to make
sure that the new diagnosis is consistent with other data. Accurate and
reliable automatic methods are therefore needed that can search a database and
retrieve images similar to the query image.

Paper III introduces such a method, developed specifically for retrieval of
histological data. The method builds on feature extraction, matching and
reranking of the identified top results.

For an accurate and reliable medical interpretation of the final image results,
and to enable practical use in medical decision making, statistical analyses
are often required. In paper IV we evaluate methods for multiple-testing
correction in hypothesis testing on whole-body MR images. In addition, we
present anatomy-based methods, which were shown to give better results in the
correction for multiple testing.
Acknowledgements
This PhD journey has been a long and exhausting one, and yet somehow si-
multaneously also super interesting and fun. It has taught me more than I ever
thought it would, but most importantly, it has brought some amazing people
to my life that I would otherwise not have had the pleasure of meeting.
There are a lot of people I would like to extend my thanks to, for being a
part of my journey and making it better through their presence.
First, to my main supervisor, Robin Strand: thank you for giving me this
opportunity and for your help, support and encouragement throughout these
years. I am immensely grateful for being allowed to explore my own ideas
and do things my way, even when it didn’t work out quite as planned.
Filip Malmberg, thank you for being a great teacher, for entertaining my
curiosity and patiently answering all my random questions. The same thank
you also extends to Justin Pearson. Justin, the writing of this thesis broke our
TDA focus. But I hope we get the chance to collaborate in the future!
Thank you too, Nataša Sladoje and Joakim Lindblad, for the company and the fun
discussions. You have been wonderful, engaged and supportive collaborators -
thank you so much.
A big thank you goes to the seniors I have had the pleasure to teach with
(I’m looking at you, Joachim Parrow and Matteo Magnani): it was a pleasure
to learn from such great educators.
As with most jobs, being a PhD student involves a surprising amount of
bureaucracy-related tasks. Thank you, Anna-Lena Forsberg, Elisabeth Lindqvist
and Inger Hammarin, for helping me navigate through it all. Anna-Lena, you
have done much more for me than your job description entails. Thank you for being
so open, supportive and understanding.
To all the Equal Opportunities group members that I’ve met through the
years - thank you for making the work environment better, and helping me see
through my own biases.
Carolina Wählby and Ida-Maria Sintorn, thank you for your contagious
smiles, and for being the amazing women that you are. I look up to you.
Ida-Maria, you light up the space with your presence. I cannot possibly put
into words how much your hugs and kind words meant to me. Thank you, so
so much, for checking in on me.
To all the past and present PhD students and coworkers at the division,
thank you for creating a welcoming and pleasant workplace! Gabriele Partel,
thank you for your wisdoms and all the delicious food, Anindya Gupta, for
lending a shoulder to lean on, and Teo Asplund, for fun discussions and book
suggestions.
To my PhD-parents-support group, Amanda&Håkan, Elisabeth&Ragnar,
Li&Nicolas, Virginia&Martin: thank you for helping me keep my sanity by
listening to my endless complaining and oversharing. May the tantrums and
VAB stay away from you!
Thank you, Jennifer Alvén, for making conferences fun and for all the par-
enting advice. Nicolas Pielawski, it’s been fun to listen to all your crazy ideas.
I hope we now finally get to work on some of them together. Elisabeth Wet-
zer, thank you for being a great collaborator and an even greater friend. I hope
we get the chance to work together again in the future. Virginia Grande, you
always know exactly what to say when I’m down, making my world a better
place by being there.
Dear Raphaela Heil, without you - breakdown and dehydration. And def-
initely no good thesis figures. Thank you for being a supportive and under-
standing friend, and my IT helpdesk. Raphaela and Nikolaus Huber, I will
miss our breakfasts and random laughs. Movie night soon. Thank you!
Håkan Wieslander, thank you for being a great friend and an irreplaceable
gym buddy. Mr Gupta, Ankit! Sharing the office has been super fun, and I
will miss our random deep talks. Thank you for your friendship and encour-
agement, and for always telling it like it is.
To my Slovenian circle, Teja and Martin Tement-štrumbelj, Sanja Obrovnik,
Maša Brumec. For always being there in times of need, no matter how much
time has passed. For being understanding about my lack of planning abilities
and for always carving out some time in your busy schedules for me. Thank you!
And I owe you plenty.
The last people to mention here are those that I hold dearest and without
whom this thesis would either not exist or cost me my sanity. To my dear little
family, Igor&Sever. Thank you, Igor, for all your support, love and encour-
agement. For the warmest hugs and the cosiest moments, for feeding me, for
listening to me, and for taking care of Sever through my late work hours. For
loving me and being annoyed with me in equal measure. You are the Batch-
Norm to my unstable training; my exploding gradients would be even more
explosive without you! Sever, my little sunshine, my explosive bundle of joy.
Thank you for teaching me that there are more important things in life than
work.
References
[11] C. M. Bishop. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer, 1 edition, 2007.
[12] E. Breznik and R. Strand. Effects of distance transform choice in training with
boundary loss, 2021. Online, Retrieved from
http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-499054.
[13] J. S. Bridle. Probabilistic interpretation of feedforward classification network
outputs, with relationships to statistical pattern recognition. In NATO
Neurocomputing, 1989.
[14] E. N. Brown and M. Behrmann. Controversy in statistical analysis of
functional magnetic resonance imaging data. Proceedings of the National
Academy of Sciences of the United States of America, 114:E3368–E3369,
2017.
[15] F. Cabitza, A. Campagner, F. Soares, L. G. Guadiana-Romualdo, F. Challa,
A. Sulejmani, M. Seghezzi, and A. Carobene. The importance of being
external: methodological insights for the external validation of machine
learning models in medicine. Computer Methods and Programs in
Biomedicine, 208:106288, 2021.
[16] K. C. Ciesielski, R. Strand, F. Malmberg, and P. K. Saha. Efficient algorithm
for finding the exact minimum barrier distance. Computer Vision and Image
Understanding, 123:53–64, 2014.
[17] M. D. Cirillo, D. Abramian, and A. Eklund. Vox2vox: 3d-gan for brain tumour
segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and
Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in
Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised
Selected Papers, Part I 6, pages 274–284. Springer, 2021.
[18] European Commission. 2018 reform of EU data protection rules (GDPR,
corrigendum). Available at
http://data.europa.eu/eli/reg/2016/679/corrigendum/2018-05-23/oj,
accessed 2023-03-10.
[19] H. Cramér. Mathematical methods of statistics, volume 26. Princeton
university press, 1999.
[20] W. Crum, O. Camara, and D. Hill. Generalized overlap measures for
evaluation and validation in medical image analysis. IEEE Transactions on
Medical Imaging, 25(11):1451–1461, 2006.
[21] V. Curic, J. Lindblad, N. Sladoje, H. Sarve, and G. Borgefors. A new set
distance and its application to shape registration. Pattern Analysis and
Applications, 17:141–152, 2014.
[22] S. Dangi, C. A. Linte, and Z. Yaniv. A distance map regularized CNN for
cardiac cine MR image segmentation. Medical physics, 46(12):5637–5651,
2019.
[23] P.-E. Danielsson. Euclidean distance mapping. Computer Graphics and Image
Processing, 14(3):227–248, 1980.
[24] T. Dash, S. Chitlangia, A. Ahuja, and A. Srinivasan. A review of some
techniques for inclusion of domain-knowledge into deep neural networks.
Scientific Reports, 12, 2022.
[25] J. S. DeVille, D. Kihara, and A. Sit. 2DKD: a toolkit for content-based local
image search. Source Code Biol Med., 2020.
[26] M. M. Deza and E. Deza. Encyclopedia of Distances. Springer, 2013.
[27] L. R. Dice. Measures of the amount of ecologic association between species.
Ecology, 26(3):297–302, 1945.
[28] S. R. Dubey. A decade survey of content based image retrieval using deep
learning. IEEE Transactions on Circuits and Systems for Video Technology,
32(5):2687–2704, 2022.
[29] O. J. Dunn. Multiple comparisons among means. Journal of the American
Statistical Association, 56(293):52–64, 1961.
[30] S. B. Eickhoff, S. Heim, K. Zilles, and K. Amunts. Testing anatomically
specified hypotheses in functional imaging using cytoarchitectonic maps.
NeuroImage, 32(2):570–582, 2006.
[31] A. Eklund, T. E. Nichols, and H. Knutsson. Cluster failure: Why fMRI
inferences for spatial extent have inflated false-positive rates. Proceedings of
the national academy of sciences, 113(28):7900–7905, 2016.
[32] K. Eliceiri, B. Li, and A. Keikhosravi. Multimodal biomedical dataset for
evaluating registration methods (patches from TMA cores). zenodo
https://zenodo.org/record/3874362, June 2020.
[33] C. Fouard, R. Strand, and G. Borgefors. Weighted distance transforms
generalized to modules and their computation on point lattices. Pattern
Recognition, 40(9):2453–2474, 2007.
[34] M. Ghafoorian, N. Karssemeijer, T. Heskes, M. Bergkamp, J. Wissink,
J. Obels, K. Keizer, F.-E. de Leeuw, B. van Ginneken, E. Marchiori, et al.
Deep multi-scale location-aware 3D convolutional neural networks for
automated detection of lacunes of presumed vascular origin. NeuroImage:
Clinical, 14:391–399, 2017.
[35] M. Ghafoorian, N. Karssemeijer, T. Heskes, I. W. van Uden, C. I. Sanchez,
G. Litjens, F.-E. de Leeuw, B. van Ginneken, E. Marchiori, and B. Platel.
Location sensitive deep convolutional neural networks for segmentation of
white matter hyperintensities. Scientific Reports, 7(1):1–12, 2017.
[36] R. C. Gonzalez. Digital image processing. Pearson, New York, NY, fourth
edition, global edition. edition, 2018.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani,
M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in
Neural Information Processing Systems, volume 27. Curran Associates, Inc.,
2014.
[38] I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press,
Cambridge, MA, USA, 2016.
[39] A. Hakim, S. Christensen, S. Winzeck, M. G. Lansberg, M. W. Parsons,
C. Lucas, D. Robben, R. Wiest, M. Reyes, and G. Zaharchuk. Predicting
infarct core from computed tomography perfusion in acute ischemia with
machine learning: Lessons from the ISLES challenge. Stroke,
52(7):2328–2337, 2021.
[40] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image
recognition. In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 770–778, 2016.
[41] R. Heller, D. Stanley, D. Yekutieli, N. Rubin, and Y. Benjamini. Cluster-based
analysis of fMRI data. NeuroImage, 33(2):599–608, 2006.
[42] S. Holm. A simple sequentially rejective multiple test procedure.
Scandinavian Journal of Statistics, 6(2):65–70, 1979.
[43] D. Hu, F. Xie, Z. Jiang, Y. Zheng, and J. Shi. Histopathology cross-modal
retrieval based on dual-transformer network. In 2022 IEEE 22nd International
Conference on Bioinformatics and Bioengineering (BIBE), pages 97–102,
2022.
[44] D. Huttenlocher, G. Klanderman, and W. Rucklidge. Comparing images using
the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 15(9):850–863, 1993.
[45] J. P. A. Ioannidis. Excess significance bias in the literature on brain volume
abnormalities. Archives of General Psychiatry, 68(8):773–780, 2011.
[46] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein.
nnU-Net: a self-configuring method for deep learning-based biomedical image
segmentation. Nature Methods, 18(2):203–211, 2021.
[47] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with
conditional adversarial networks. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017.
[48] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one
hundred layers Tiramisu: Fully convolutional DenseNets for semantic
segmentation. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), pages 1175–1183, 2017.
[49] L. Jiao and J. Zhao. A survey on the new generation of deep learning in image
processing. IEEE Access, 7:172231–172263, 2019.
[50] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K.
Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3D CNN with fully
connected CRF for accurate brain lesion segmentation. Medical Image
Analysis, 36:61–78, 2017.
[51] P.-Y. Kao, S. Shailja, J. Jiang, A. Zhang, A. Khan, J. W. Chen, and B. S.
Manjunath. Improving patch-based convolutional neural networks for MRI
brain tumor segmentation by leveraging location information. Frontiers in
Neuroscience, 13, 2020.
[52] D. Karimi and S. E. Salcudean. Reducing the Hausdorff distance in medical
image segmentation with convolutional neural networks. IEEE Transactions
on Medical Imaging, 39(2):499–513, 2020.
[53] A. Keikhosravi, J. S. Bredfeldt, A. K. Sagar, and K. W. Eliceiri. Chapter 28 -
second-harmonic generation imaging of cancer. In Quantitative Imaging in
Cell Biology, volume 123 of Methods in Cell Biology, pages 531–546.
Academic Press, 2014.
[54] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ben
Ayed. Boundary loss for highly unbalanced segmentation. Medical Image
Analysis, 67:101851, 2021.
[55] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi. A survey of the recent
architectures of deep convolutional neural networks. Artificial Intelligence
Review, 53:5455–5516, 2020.
[56] M. D. Lieberman and W. A. Cunningham. Type I and Type II error concerns in
fMRI research: re-balancing the scale. Social Cognitive and Affective
Neuroscience, 4(4):423–428, 2009.
[57] H. Lin, Y. Fu, P. Lu, S. Gong, X. Xue, and Y.-G. Jiang. TC-Net for ISBIR:
Triplet classification network for instance-level sketch based image retrieval.
In Proc. ACM Intl. Conf. on Multimedia, pages 1676–1684. ACM, 2019.
[58] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense
object detection. In Proceedings of the IEEE international conference on
computer vision, pages 2980–2988, 2017.
[59] L. Lind. Relationships between three different tests to evaluate
endothelium-dependent vasodilation and cardiovascular risk in a middle-aged
sample. Journal of Hypertension, 31:1570–1574, 2013.
[60] J. Lindblad and N. Sladoje. Linear time distances between fuzzy sets with
applications to pattern matching and classification. IEEE Transactions on
Image Processing, 23(1):126–136, 2014.
[61] M. A. Lindquist and A. Mejia. Zen and the art of multiple comparisons.
Psychosomatic Medicine, 77:114–125, 2015.
[62] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian,
J. A. van der Laak, B. van Ginneken, and C. I. Sánchez. A survey on deep
learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
[63] G. Lohmann, J. Neumann, K. Mueller, and J. Lepsien. The multiple
comparison problem in fMRI: a new method based on anatomical priors. In
MICCAI Workshop on Analysis of Functional Images, pages 1–8, 2008.
[64] D. Lowe. Object recognition from local scale-invariant features. In
Proceedings of the Seventh IEEE International Conference on Computer
Vision, volume 2, pages 1150–1157, 1999.
[65] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective
receptive field in deep convolutional neural networks. arXiv preprint
arXiv:1701.04128, 2017.
[66] J. Ma, J. Chen, M. Ng, R. Huang, Y. Li, C. Li, X. Yang, and A. L. Martel. Loss
odyssey in medical image segmentation. Medical Image Analysis, 71:102035,
2021.
[67] J. Ma, Z. Wei, Y. Zhang, Y. Wang, R. Lv, C. Zhu, C. Gaoxiang, J. Liu, C. Peng,
L. Wang, Y. Wang, and J. Chen. How distance transform maps boost
segmentation CNNs: An empirical study. In International Conference on
Medical Imaging with Deep Learning, 2020.
[68] A. Mbilinyi and H. Schuldt. Cross-modality medical image retrieval with deep
features. In 2020 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), pages 2632–2639, 2020.
[69] A. M. Meesters, K. Ten Duis, H. Banierink, V. M. Stirler, P. C. Wouters,
J. Kraeima, J.-P. P. de Vries, M. J. Witjes, and F. F. IJpma. What are the
interobserver and intraobserver variability of gap and stepoff measurements in
acetabular fractures? Clinical Orthopaedics and Related Research,
478(12):2801, 2020.
[70] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and
D. Terzopoulos. Image segmentation using deep learning: A survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 44(7):3523–3542,
2022.
[71] D. Müller, I. Soto-Rey, and F. Kramer. Towards a guideline for evaluation
metrics in medical image segmentation. BMC Research Notes, 15, 2022.
[72] H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler. A review of
content-based image retrieval systems in medical applications—clinical
benefits and future directions. International Journal of Medical Informatics,
73(1):1–23, 2004.
[73] C. Niblack, P. B. Gibbons, and D. W. Capson. Generating skeletons and
centerlines from the distance transform. CVGIP: Graphical Models and Image
Processing, 54(5):420–437, 1992.
[74] T. Nichols and S. Hayasaka. Controlling the familywise error rate in functional
neuroimaging: a comparative review. Statistical Methods in Medical Research,
12(5):419–446, 2003. PMID: 14599004.
[75] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural
network architecture for real-time semantic segmentation. arXiv preprint
arXiv:1606.02147, 2016.
[76] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch.
In NIPS-W, 2017.
[77] G. Petralia and A. R. Padhani. Whole-body magnetic resonance imaging in
oncology: uses and indications. Magnetic Resonance Imaging Clinics,
26(4):495–507, 2018.
[78] N. Pielawski, E. Wetzer, J. Öfverstedt, J. Lu, C. Wählby, J. Lindblad, and
N. Sladoje. CoMIR: Contrastive multimodal image representation for
registration. In Advances in Neural Information Processing Systems,
volume 33, pages 18433–18444. Curran Associates, Inc., 2020.
[79] L. Putzu, A. Loddo, and C. D. Ruberto. Invariant moments, textural and deep
features for diagnostic MR and CT image retrieval. In Computer Analysis of
Images and Patterns, pages 287–297. Springer Intl. Publishing, 2021.
[80] A. Rahman, C. Jahangir, S. M. Lynch, N. Alattar, C. Aura, N. Russell,
F. Lanigan, and W. M. Gallagher. Advances in tissue-based imaging: impact
on oncology research and clinical practice. Expert Review of Molecular
Diagnostics, 20(10):1027–1037, 2020. PMID: 32510287.
[81] A. Reinke, M. D. Tizabi, C. H. Sudre, M. Eisenmann, T. Rädsch,
M. Baumgartner, L. Acion, M. Antonelli, T. Arbel, S. Bakas, P. Bankhead,
A. Benis, M. J. Cardoso, V. Cheplygina, E. Christodoulou, B. Cimini, G. S.
Collins, K. Farahani, B. van Ginneken, B. Glocker, P. Godau, F. Hamprecht,
D. A. Hashimoto, D. Heckmann-Nötzel, M. M. Hoffman, M. Huisman,
F. Isensee, P. Jannin, C. E. Kahn, A. Karargyris, A. Karthikesalingam,
B. Kainz, E. Kavur, H. Kenngott, J. Kleesiek, T. Kooi, M. Kozubek,
A. Kreshuk, T. Kurc, B. A. Landman, G. Litjens, A. Madani, K. Maier-Hein,
A. L. Martel, P. Mattson, E. Meijering, B. Menze, D. Moher, K. G. M. Moons,
H. Müller, B. Nichyporuk, F. Nickel, M. A. Noyan, J. Petersen, G. Polat,
N. Rajpoot, M. Reyes, N. Rieke, M. Riegler, H. Rivaz, J. Saez-Rodriguez,
C. S. Gutierrez, J. Schroeter, A. Saha, S. Shetty, M. van Smeden, B. Stieltjes,
R. M. Summers, A. A. Taha, S. A. Tsaftaris, B. Van Calster, G. Varoquaux,
M. Wiesenfarth, Z. R. Yaniv, A. Kopp-Schneider, P. Jäger, and L. Maier-Hein.
Common limitations of image processing metrics: A picture story. arXiv
preprint arXiv:2104.05642, 2021.
[82] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for
biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and
A. F. Frangi, editors, Medical Image Computing and Computer-Assisted
Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer
International Publishing.
[83] A. Rosenfeld and J. Pfaltz. Distance functions on digital pictures. Pattern
Recognition, 1(1):33–61, 1968.
[84] S. Roychowdhury, M. Diligenti, and M. Gori. Regularizing deep networks
with prior knowledge: A constraint-based approach. Knowledge-Based
Systems, 222:106989, 2021.
[85] S. S. M. Salehi, D. Erdogmus, and A. Gholipour. Tversky loss function for
image segmentation using 3D fully convolutional deep networks. In Machine
Learning in Medical Imaging: 8th International Workshop, MLMI 2017, Held
in Conjunction with MICCAI 2017, Quebec City, QC, Canada, September 10,
2017, Proceedings 8, pages 379–387. Springer, 2017.
[86] J. H. Scatliff and P. J. Morris. From Röntgen to magnetic resonance imaging:
The history of medical imaging. North Carolina Medical Journal,
75(2):111–113, 2014.
[87] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for
face recognition and clustering. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015.
[88] M. L. Seghier. Ten simple rules for reporting machine learning methods
implementation and evaluation on biomedical data. International Journal of
Imaging Systems and Technology, 32(1):5–11, 2022.
[89] J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology,
46(1):561–584, 1995.
[90] W. Silva, T. Gonçalves, K. Härmä, E. Schröder, V. C. Obmann, M. C. Barroso,
A. Poellinger, M. Reyes, and J. S. Cardoso. Computer-aided diagnosis through
medical image retrieval in radiology. Scientific Reports, 12, 2022.
[91] S. M. Smith and T. E. Nichols. Threshold-free cluster enhancement:
Addressing problems of smoothing, threshold dependence and localisation in
cluster inference. NeuroImage, 44(1):83–98, 2009.
[92] R. Strand, K. C. Ciesielski, F. Malmberg, and P. K. Saha. The minimum
barrier distance. Computer Vision and Image Understanding, 117(4):429–437,
2013. Special issue on Discrete Geometry for Computer Imagery.
[93] R. Strand, S. Ekström, E. Breznik, T. Sjöholm, M. Pilia, L. Lind, F. Malmberg,
H. Ahlström, and J. Kullberg. Recent advances in large scale whole body MRI
image analysis: Imiomics. In Proceedings of the 5th International Conference
on Sustainable Information Engineering and Technology, SIET ’20, pages
10–15, New York, NY, USA, 2021. Association for Computing Machinery.
[94] R. Strand, F. Malmberg, L. Johansson, L. Lind, M. Sundbom, H. Ahlström,
and J. Kullberg. A concept for holistic whole body MRI data analysis,
Imiomics. PLOS ONE, 12:1–17, 2017.
[95] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey,
P. Elliott, J. Green, M. Landray, B. Liu, P. Matthews, G. Ong, J. Pell,
A. Silman, A. Young, T. Sprosen, T. Peakman, and R. Collins. UK Biobank:
An open access resource for identifying the causes of a wide range of complex
diseases of middle and old age. PLOS Medicine, 12(3):1–10, 2015.
[96] P. Summers, G. Saia, A. Colombo, P. Pricolo, F. Zugni, S. Alessi, G. Marvaso,
B. A. Jereczek-Fossa, M. Bellomi, and G. Petralia. Whole-body magnetic
resonance imaging: Technique, guidelines and key applications.
ecancermedicalscience, 15, 2021.
[97] M. Tang, F. Perazzi, A. Djelouah, I. B. Ayed, C. Schroers, and Y. Boykov. On
regularized losses for weakly-supervised CNN segmentation. In V. Ferrari,
M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV
2018, pages 524–540, Cham, 2018. Springer International Publishing.
[98] P. J. Toivanen. New geodosic distance transforms for gray-scale images.
Pattern Recognition Letters, 17(5):437–450, 1996.
[99] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference,
volume 26. Springer, 2004.
[100] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics
Bulletin, 1(6):80–83, 1945.
[101] J. M. Wolterink, A. M. Dinkla, M. H. F. Savenije, P. R. Seevinck, C. A. T.
van den Berg, and I. Išgum. Deep MR to CT synthesis using unpaired data. In
S. A. Tsaftaris, A. Gooya, A. F. Frangi, and J. L. Prince, editors, Simulation
and Synthesis in Medical Imaging, pages 14–23, Cham, 2017. Springer
International Publishing.
[102] C.-W. Woo, A. Krishnan, and T. D. Wager. Cluster-extent based thresholding
in fMRI analyses: Pitfalls and recommendations. NeuroImage, 91:412–419,
2014.
[103] K. J. Worsley, S. Marrett, P. Neelin, A. C. Vandal, K. J. Friston, and A. C.
Evans. A unified statistical approach for determining significant signals in
images of cerebral activation. Human Brain Mapping, 4(1):58–73, 1996.
[104] Y. Xue, H. Tang, Z. Qiao, G. Gong, Y. Yin, Z. Qian, C. Huang, W. Fan, and
X. Huang. Shape-aware organ segmentation by predicting signed distance
maps. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 34, pages 12565–12572, 2020.
[105] P. Yang, Y. Zhai, L. Li, H. Lv, J. Wang, C. Zhu, and R. Jiang. A deep metric
learning approach for histopathological image retrieval. Methods, 179:14–25,
2020. Interpretable machine learning in bioinformatics.
[106] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks. In Intl. Conf. on
Computer Vision (ICCV), 2017.
Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations
from the Faculty of Science and Technology 2253
Editor: The Dean of the Faculty of Science and Technology

Distribution: publications.uu.se
urn:nbn:se:uu:diva-498953
Uppsala 2023