
KU Leuven

Biomedical Sciences Group


Faculty of Medicine
Department of Imaging and Pathology
Medical Physics and Quality Assessment

THE DEVELOPMENT OF
MATHEMATICAL OBSERVERS FOR
OPTIMIZATION IN BREAST IMAGING

Dimitar PETROV

Jury:

Promotor: Prof. Dr. Ir. Hilde Bosmans


Co-promotor: Prof. Dr. Nicholas Marshall
Prof. Dr. Chantal Van Ongeval
Jury members: Prof. Dr. Johan Nuyts
Prof. Dr. Sabine Deprez
Dr. Ann-Katherine Carton
Dr. Ljiljana Platisa
Prof. Dr. Stephen Glick

Dissertation presented in partial fulfilment of the requirements for
the degree of Doctor in Biomedical Sciences

April 2020
Table of Contents
TABLE OF CONTENTS
INTRODUCTION
DIGITAL MAMMOGRAPHY AND BREAST TOMOSYNTHESIS
STATISTICAL DECISION THEORY AND IMAGE QUALITY
1. STATISTICAL DECISION THEORY
2. IMAGE QUALITY ESTIMATION
THESIS OBJECTIVES AND WORK PLAN
REFERENCES
CHAPTER 1: SYSTEMATIC APPROACH TO A CHANNELIZED HOTELLING
MODEL OBSERVER IMPLEMENTATION FOR A PHYSICAL PHANTOM
CONTAINING MASS-LIKE LESIONS: APPLICATION TO DIGITAL BREAST
TOMOSYNTHESIS
INTRODUCTION
MATERIALS AND METHODS
1. IMAGE ACQUISITION
2. HUMAN OBSERVER STUDY
3. CHANNELIZED HOTELLING OBSERVER AND GENERAL WORK FLOW
4. CHANNEL SELECTION AND TUNING
5. CHO TRAINING
RESULTS
1. CHANNEL COMPARISON
2. CHANNEL TUNING PARAMETERS
3. EXPECTED SIGNAL
4. TRAINING IMAGES
5. BIAS
6. REPRODUCIBILITY
7. INFLUENCE OF DOSE LEVEL
DISCUSSION
CONCLUSIONS
REFERENCES
CHAPTER 2: REAL SPACE CHANNELIZATION FOR GENERIC DBT SYSTEM
IMAGE QUALITY EVALUATION WITH CHANNELIZED HOTELLING OBSERVER
INTRODUCTION
MATERIALS AND METHODS
1. PHANTOM PROPERTIES
2. SYSTEMS AND STUDIES
3. SIGNAL PRESENT AND SIGNAL ABSENT REGIONS OF INTEREST
4. HUMAN OBSERVER STUDY
5. MODEL OBSERVER
RESULTS
1. TUNING OF THE CHANNEL SET
2. THREE DOSE LEVELS STUDY
3. RECONSTRUCTION ALGORITHM STUDY
4. REPRODUCIBILITY STUDY
DISCUSSION AND CONCLUSIONS
REFERENCES
CHAPTER 3: CALCIFICATION CLUSTER DETECTION IN 2D FFDM AND DBT
WITH A CHANNELIZED HOTELLING OBSERVER
DIGITAL BREAST TOMOSYNTHESIS
1. INTRODUCTION AND PURPOSE
2. METHODS
3. RESULTS
4. DISCUSSION
5. CONCLUSIONS
2D FULL-FIELD DIGITAL MAMMOGRAPHY
1. INTRODUCTION
2. MATERIALS AND METHODS
3. RESULTS
4. DISCUSSION AND CONCLUSIONS
REFERENCES
CHAPTER 4: CHANNELIZED HOTELLING OBSERVER FOR MULTI-VENDOR
BREAST TOMOSYNTHESIS IMAGE QUALITY ESTIMATION: DETECTION OF
CALCIFICATION CLUSTERS IN AN ANTHROPOMORPHIC PHANTOM
INTRODUCTION
MATERIALS AND METHODS
1. IMAGE AND PHANTOM PROPERTIES
2. MODEL OBSERVER
RESULTS
DISCUSSION AND CONCLUSIONS
REFERENCES
CHAPTER 5: CHANNELIZED HOTELLING OBSERVER FOR BREAST VIRTUAL
CLINICAL TRIALS: APPLICATION TO DBT AND FFDM
INTRODUCTION
MATERIALS AND METHODS
1. IMAGE DATASET GENERATION
2. HUMAN OBSERVER STUDY
3. CHANNELIZED HOTELLING MODEL OBSERVER
RESULTS AND DISCUSSION
1. TUNING THE SSCHO FOR FFDM
2. TUNING THE VCHO FOR DBT
CONCLUSIONS
REFERENCES
CHAPTER 6: DEEP LEARNING APPLICATIONS
DEEP LEARNING CHANNELIZED HOTELLING OBSERVER FOR MULTI-VENDOR DBT SYSTEM
IMAGE QUALITY EVALUATION USING A STRUCTURED PHANTOM
1. INTRODUCTION
2. MATERIALS AND METHODS
3. RESULTS
4. DISCUSSION
5. CONCLUSIONS
RESNET18 FOR MULTI-VENDOR DBT IMAGE QUALITY EVALUATION USING A
STRUCTURED PHANTOM
1. INTRODUCTION
2. MATERIALS AND METHODS
3. RESULTS AND DISCUSSION
4. CONCLUSION
REFERENCES
CONCLUSIONS AND FUTURE WORK
LOW CONTRAST MASS-LIKE LESIONS
CALCIFICATION CLUSTERS
FUTURE WORK, POTENTIAL IMPROVEMENTS AND OUTLOOK
REFERENCES
SUMMARY
SAMENVATTING
CURRICULUM VITAE

Introduction
DIGITAL MAMMOGRAPHY AND BREAST
TOMOSYNTHESIS
Breast cancer is the most common form of cancer in women. It is associated with
the highest incidence and mortality rate among all types of cancer in Europe
(Ferlay et al., 2018). Earlier cancer detection is crucial for reducing the
mortality rate (Mcphail et al., 2015), and many countries have therefore
established a population screening program (E. R. Myers et al., 2015). The
European guidelines for breast cancer screening suggest that a two-view 2D
mammography examination is performed every 2-3 years for women aged 45-74 years
(European commission, 2019).
The most widely used mammography screening method is 2D full-field digital
mammography (FFDM), where single projections of the breast are captured by
an X-ray digital detector, producing the raw images. After post-processing
algorithms have been applied to improve the visual characteristics of the breast
images, the resulting images are viewed by radiologists. However, a major
drawback of the method is the limited lesion detectability due to overlapping
fibroglandular tissues, which can obscure cancers and reduce observer
sensitivity (Bird et al., 1992). To cope with this, the digital breast tomosynthesis
(DBT) technique was introduced (Niklason et al., 1997).
DBT is a form of limited angle tomography, where multiple projections are taken
as the X-ray tube moves over a prescribed path (Sechopoulos, 2013a). The total
angular range of the tube movement differs between systems; for current systems
it lies between 15° and 50°. The projection images are reconstructed into a stack of
planes parallel to the detector, providing some volumetric information on the
breast structure and partly solving the tissue overlap found in FFDM (Skaane et
al., 2013).
While there are clear potential advantages of DBT techniques over FFDM, clear
evidence is needed that DBT improves lesion detectability, recall rate and lesion
type identification compared to FFDM, before DBT can be applied in screening
and diagnostic examinations. Looking at the clinical studies on this topic, results
range from no significant improvement in cancer detection between DBT and
FFDM (Gennaro et al., 2013; Gur et al., 2009; Hofvind et al., 2019) to better cancer
detection with DBT (Ciatto et al., 2013; Marinovich et al., 2018; Pattacini et al.,
2018; Skaane et al., 2019) while some authors found poorer calcification cluster
visualization with DBT (Horvat et al., 2019; Peppard et al., 2015; Spangler et al.,
2011). Note that DBT technology, both hardware and software, has changed over
the period in which these studies were initiated and concluded.
A crucial aspect of the implementation of DBT into screening, and an
important link to optimal system operation, is image quality
evaluation. As with all X-ray imaging techniques, radiological protection should
be imposed. The International Commission on Radiological Protection (ICRP,
2007) has set multiple principles for the safe use of ionizing radiation for
imaging. One of these principles is “optimization of protection”, which states that
the magnitude of the individual radiation dose should be kept as low as
reasonably achievable. In other words, there should always be a balance between
radiation dose and the amount of diagnostic information needed to successfully
perform the specific examination. In order to find this balance, it is essential to
set a reliable way of estimating the quality of the image with respect to the ability
to convey meaningful diagnostic information. In this work, a practical tool for
image quality evaluation in breast imaging will be developed, tested and
validated against human estimations.

STATISTICAL DECISION THEORY AND IMAGE QUALITY


This section is based on the book “Foundations of Image Science” by Harrison
Barrett and Kyle Myers (Harrison H Barrett et al., 2004), from which the
fundamental principles of decision theory and image quality estimation are
summarized.
Image quality estimation is often linked to the efficiency with which a reader can
perform a certain decision task. Therefore, any description of image quality (IQ)
should consider essential information about the following:
• Task. The task refers to the information to be extracted from the
images. In mammography, the task could be either to extract
quantitative information from the mammogram, e.g. breast density, or
to assign the image a certain label, i.e. to associate it with a certain
hypothesis. The former is often called an estimation task and the latter a
classification task. The most common task in mammography screening is
binary classification, where the radiologist scores a mammogram as
abnormality present or absent, usually followed by an estimation task.
For example, after the mammogram is scored as ‘lesion present’, the
location of the tumor, the size and number of the lesions, etc. would have to
be estimated. In practice, the distinction between classification and
estimation is not strict: the results from an estimation task may
serve as an input for classification and vice versa.
• Observer. In decision theory, the method with which the task is
accomplished is called an observer. In the most common scenario in
medical imaging, the images are scrutinized by a radiologist for
pathology, i.e. the observer is a human performing a classification task.
In many cases, the observer can be a mathematical operator that
processes the image in order to fulfill a classification or estimation task.
These observers are called machine observers and are designed and
developed to support the workflow of the radiologists. ‘Computer-aided
diagnosis’ algorithms support the decision of the radiologist, and
model observers can automate specific detection tasks in order to save
valuable time.
• Images. The image properties used for the IQ estimation will determine
the approach of the observer to achieve the task. For example, in a
classification task, the imaging system visualizes data from multiple
objects usually belonging to more than one of the possible classes. Thus
there will be distributions with a signal-carrying component and a
background-carrying component. The observer strategy will depend
strongly on these distributions.
• Measure of task performance. For a given task, observer and set of
images, the IQ can be estimated in terms of a certain figure of merit
(FOM) that measures how well the task is accomplished by the observer.
The FOM can depend on the task and the type of images. In
mammography, the most common task is that of binary classification,
where the observer performing the task can make two types of correct
decisions and two types of errors. The analysis of these observer
outputs forms the basis of receiver operating characteristic (ROC)
analysis. From this analysis, a number of task performance figures of
merit can be derived, including the area under the ROC curve (AUC), the
signal to noise ratio (SNR), the detectability (d’), the percentage of
correctly detected lesions (PC) and so forth. Each of these can
serve as an IQ measure.
Image quality evaluation premised on these four aspects is called ‘task-based IQ
estimation’, and is one of the pillars of diagnostic imaging evaluation. Only
through this method can we assess how successfully a medical task is performed,
by obtaining information on both the imaging system performance and the observer
performance. The following two sections give more detail on statistical decision
theory and the means of image quality estimation.
1. Statistical decision theory
In this section, the statistical paradigms and mathematical operations associated
with image quality estimation are discussed. The task of classification was
rigorously studied in the field of radar technology, and many of the fundamental
assumptions emerged from that period (Peterson et al., 1954). As medical
imaging entered the digital era, these principles were adopted and
improved (Harrison H. Barrett, 1990) along with the field of human perception
(A. E. Burgess et al., 1981; Chakraborty & Winter, 1990; Eckstein, 2011; Krupinski,
2010; Obuchowski, 2000; Wunderlich & Abbey, 2013). Our study focuses on the
‘normal’ versus ‘abnormal’ binary classification task. From now on, any task
mentioned is a binary classification task unless stated otherwise.
1.1. Inputs to the classification process
In mammography and breast tomosynthesis, the breast anatomy is assessed by
acquiring multiple measurements {𝑔𝑚 }: pixel values for 2D mammography and
voxel values for tomosynthesis. Together, these measurements form an image
vector 𝑔 of size 𝑀 × 1, where 𝑀 is the number of measurements. Such an image
is generally linked to a single specific object 𝑓, which is visualized by
mapping the continuous object 𝑓 into the discrete image vector 𝑔 via the
system mapping operator ℋ:
$$ g = \mathcal{H} f + n \tag{1} $$
The mapping operator ℋ is a continuous-to-discrete operator and is specific to
the imaging system. The noise component 𝑛 represents the randomness in the
imaging process. It usually has zero mean, and its higher order statistical
properties depend on the imaging system properties.
Given the non-ideal mapping operator and the noise component, an image is a
random representation of an object. Therefore, in order to discriminate between
the two classes of a binary task (signal present and signal absent), we need to
study the statistical properties of this random data. In other words, the process
of classification largely relies on information about how the data are distributed
under each of the two hypotheses: 𝐻0 for signal absent (normal anatomy) and 𝐻1 for
signal present (abnormal anatomy). This information is captured by the
probability density function 𝑝𝑟(𝑔|𝐻𝑖 ), where 𝑖 equals 0 or 1. Here 𝑝𝑟(𝑔|𝐻𝑖 )
denotes the probability density of an image 𝑔 given that the hypothesis
𝐻𝑖 is true. The probability density function (PDF) gives a
complete mathematical description of an imaging system, but for real systems it
is difficult or infeasible to estimate with precision. Fortunately, it can be
approximated sufficiently well using prior knowledge of the randomness of the
image dataset. For example, if the only noise source in the images is that
associated with x-ray absorption events registered as a signal by the detector,
the PDF can be approximated by a Poisson noise model. In more realistic
circumstances there are multiple sources of noise in the imaging process. In
this case, the central limit theorem can be applied, so that each component
𝑔𝑚 of the image, with 𝑚 = 1, …, 𝑀, can be treated as a
Gaussian distributed variable. In some cases the measurements {𝑔𝑚 }
forming the image are also independent of each other. In this scenario the PDF can be
summarized as follows:
$$ pr(g \mid H_i) = \left( \frac{1}{2\pi\sigma^{2}} \right)^{M/2} \prod_{m=1}^{M} \exp\left( -\frac{(g_m - s_{im})^{2}}{2\sigma^{2}} \right) \tag{2} $$
In this formula, 𝑠𝑖𝑚 is the 𝑚th component of the mean (expected) image under
the 𝑖th hypothesis, and 𝜎 is the standard deviation.
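As an illustration of equation 2, the sketch below evaluates the natural logarithm of the PDF for a single image vector. Working with the logarithm is a choice made here to avoid numerical underflow for large 𝑀; the function name and the example values are hypothetical and not taken from the thesis.

```python
import numpy as np

def log_pdf_independent_gaussian(g, s_i, sigma):
    """Natural log of equation 2 for one image vector g of length M."""
    g = np.asarray(g, dtype=float)
    s_i = np.asarray(s_i, dtype=float)
    M = g.size
    # log of the normalization term (1 / (2*pi*sigma^2))^(M/2)
    log_norm = -0.5 * M * np.log(2.0 * np.pi * sigma**2)
    # log of the product of exponentials equals the sum of the exponents
    log_prod = -np.sum((g - s_i) ** 2) / (2.0 * sigma**2)
    return log_norm + log_prod

# Illustrative two-pixel image that coincides with its mean image
log_p = log_pdf_independent_gaussian([5.0, 5.0], [5.0, 5.0], sigma=1.0)
```

For an image equal to its mean, only the normalization term remains, which gives a quick sanity check of the implementation.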
In cases where the pixel values of different measurements are not
independent of one another, the PDF can be approximated using the
multivariate normal distribution:
$$ pr(g \mid H_i) = \frac{1}{\sqrt{(2\pi)^{M} \det(K)}} \exp\left( -\frac{1}{2} (g - s_i)^{T} K^{-1} (g - s_i) \right) \tag{3} $$
where 𝐾 is the covariance matrix describing the relation between different pixels,
𝑠𝑖 is the mean image vector under hypothesis 𝐻𝑖 , ( )^T is the transpose operator
and det( ) is the determinant of the matrix.
These probability density functions will be used in the later sections to help us
find the optimal and/or practically applicable means to separate between the
two classes of images.
1.2. Ideal observer
The ideal observer is a statistical concept that maximizes the separation between
the two classes. It forms the basis of model observer analysis. A model observer
is a set of mathematical and statistical operations applied to the images in order to
produce a set of test statistic values, which can be used for decision making. The
ideal observer utilizes all of the available statistical information about the image
dataset to achieve the highest possible observer performance by applying a
likelihood ratio test Λ(𝑔):
$$ \Lambda(g) = \frac{pr(g \mid H_1)}{pr(g \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \Lambda_c \tag{4} $$
where Λ𝑐 is a threshold value used in the decision making process. With the
likelihood ratio given in this form, the ideal observer chooses the signal
present hypothesis (𝐻1 ) if the ratio is higher than the threshold value, and the
signal absent hypothesis (𝐻0 ) otherwise. Thus a crucial part of the ideal
observer principle is not only the image dataset PDFs, but also the choice of
the threshold value Λ𝑐 . The computation of this value requires precise
knowledge of the prevalence 𝑝𝑟(𝐻𝑗 ) of each hypothesis and of the size of the four
possible costs 𝐶(𝐷𝑖 , 𝐻𝑗 ) of making a decision 𝐷𝑖 , 𝑖 ∈ {0,1}, when hypothesis
𝐻𝑗 , 𝑗 ∈ {0,1}, is the truth. One of the most common methods for threshold
estimation is minimizing the overall average cost, also called the Bayes risk, whose
general form is:
$$ \bar{C} = \sum_{i=0}^{1} \sum_{j=0}^{1} C(D_i, H_j)\, pr(D_i \mid H_j)\, pr(H_j) \tag{5} $$
In the case of binary classification, the minimization of this equation can be
reduced to the following:
$$ \Lambda(g) = \frac{pr(g \mid H_1)}{pr(g \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{\left( C(D_1, H_0) - C(D_0, H_0) \right) pr(H_0)}{\left( C(D_0, H_1) - C(D_1, H_1) \right) pr(H_1)} = \Lambda_c \tag{6} $$
The formula shows that, in order to evaluate the likelihood ratio test, knowledge
of the dataset PDFs and of the costs of making a decision is needed. These conditions
dictate that the image background and signal properties need to be known
exactly. In practice these requirements usually cannot be met, so the
true ideal observer performance via the likelihood ratio is unobtainable.
A more practical approach is to use a linear
approximation to the ideal observer. In this case only the first and second order
statistics are needed, and not the complete probability density function. This is
explained in more detail in the following section.
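The threshold on the right-hand side of equation 6 can be sketched numerically as follows. The cost values and the prevalence below are hypothetical numbers chosen only to show how asymmetric costs and a low prevalence shift the threshold; they are not values from the thesis.

```python
def bayes_threshold(cost, prior_h1):
    """Likelihood ratio threshold of equation 6; cost[i][j] holds C(D_i, H_j)."""
    prior_h0 = 1.0 - prior_h1
    numerator = (cost[1][0] - cost[0][0]) * prior_h0
    denominator = (cost[0][1] - cost[1][1]) * prior_h1
    return numerator / denominator

# Hypothetical costs: correct decisions cost nothing and a missed lesion
# is assumed to be ten times as costly as a false alarm.
cost = [[0.0, 10.0],   # decision D0: C(D0, H0), C(D0, H1)
        [1.0, 0.0]]    # decision D1: C(D1, H0), C(D1, H1)
lambda_c = bayes_threshold(cost, prior_h1=0.01)
```

With equal error costs and equal priors the threshold reduces to 1, i.e. the observer simply picks the more likely hypothesis.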

1.3. Ideal-linear observer
The process of classification can also be considered a type of hypothesis test.
Such tests are usually solved using a test statistic. In this case, the test statistic 𝑡
is derived from an image via a discriminant function, which maps the image pixel
values into a single scalar 𝑡 = 𝑇(𝑔). This transformation is also a
model observer, as test statistic values are generated from the image dataset.
Generally, the discriminant function is non-linear over a wide
range of input data. Estimating such a function can be challenging
and can require a lot of data and/or assumptions, so a linear approximation is
often the only practically feasible method. In some special cases, the linear
discriminant is equal to the ideal observer. The test statistic 𝑡 of any linear
observer can be summarized by the following equation:
$$ t = T(g) = w^{T} g \tag{7} $$
where the vector 𝑤 often combines multiple operations transforming the image 𝑔
into a test statistic. Both vectors have size 𝑀 × 1, and the test statistic is a
scalar equal to their dot product.
Applying this discriminant function to a dataset of
multiple images generates a distribution of test statistic values. For a
binary classification task, such a test statistic distribution is calculated for each
hypothesis. The test statistic is a random variable
with respect to the input image and, by the central limit theorem, is
approximately normally distributed. The classification of such test statistic
distributions can therefore be carried out using linear discriminant analysis (Fukunaga,
1990), where a decision is made by partitioning the test statistic distribution
with a decision threshold 𝑡𝑐 . If the test statistic is lower than 𝑡𝑐 , hypothesis
𝐻0 is chosen, and if the test statistic is higher than 𝑡𝑐 , hypothesis 𝐻1 is
chosen.
Classification using linear discriminant analysis can be visualized and explained
in greater depth using an example with a dataset of two-pixel image vectors (i.e.
𝑔). The concept of vector images is needed in order to perform the mathematical
operations producing the test statistic values. The image, which is usually
represented as a matrix of pixel values, is now represented with each row of
pixel values stored consecutively after the previous one, i.e. an 𝑀 × 𝑀 image becomes
an 𝑁 × 1 vector with 𝑁 = 𝑀². For the example dataset of two-pixel image vectors,
the vectorization is intrinsic, due to the limited number of pixels.
In that case the image vectors can be plotted in a Cartesian plane with initial
point 𝑂 = (0,0) and, as terminal point, the corresponding image pixel values 𝐴 =
(𝑝𝑥0 , 𝑝𝑥1 ) (figure 1).
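The vectorization and the dot product of equation 7 can be sketched in a few lines. The 3 × 3 image and the all-ones vector 𝑤 below are placeholders chosen for illustration only.

```python
import numpy as np

# Illustrative 3 x 3 image (M = 3) with pixel values 0..8
image = np.arange(9.0).reshape(3, 3)

# Row-by-row vectorization: the M x M matrix becomes an N x 1 vector, N = M**2
g = image.reshape(-1, 1)

# Placeholder weight vector of the same size (not a meaningful template)
w = np.ones_like(g)

# Scalar test statistic t = w^T g (equation 7)
t = (w.T @ g).item()
```

NumPy's default row-major ordering matches the row-after-row storage described above, so `reshape(-1, 1)` is sufficient here.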

Figure 1. Visualization of the image vector concept plotted in a Cartesian plane.
The length and orientation of each two-pixel image vector 𝑂𝐴 can be calculated
as:
$$ \lVert \overrightarrow{OA} \rVert = \sqrt{(px_0 - 0)^{2} + (px_1 - 0)^{2}} = \sqrt{px_0^{2} + px_1^{2}}, \qquad \theta = \tan^{-1}\left( px_1 / px_0 \right) \tag{8} $$
where ‖𝑂𝐴‖ and 𝜃 are the length and the orientation, respectively. The
orientation is given as the angle between the x-axis and the vector in a
right-handed Cartesian coordinate system. Throughout the text such image vectors
will be denoted by their terminal point only. In the following two subsections
we investigate two cases of Gaussian distributed two-pixel datasets, with and
without correlation. The form of the discriminant function is investigated
and a fundamental classification principle is introduced.
1.3.1. Gaussian distributed image PDF without correlation
In Gaussian distributed image PDFs without correlation, each pixel in the dataset
comes from a normal distribution with a given standard deviation. There are two
underlying hypotheses: signal present (𝐻1 ) with a subset of signal present
images 𝑔1 = (𝒩(7,1), 𝒩(6,1)), i.e. generated with mean value 𝑠1 = (7,6) and
standard deviation 𝜎1 = (1,1); and signal absent (𝐻0 ) with a subset of signal
absent images 𝑔0 = (𝒩(5,1), 𝒩(5,1)) with 𝑠0 = (5,5) and 𝜎0 = (1,1). Figure 2
shows a scatter plot of 1000 realizations each of 𝑔1 and 𝑔0 , with the first pixel
on the horizontal axis and the second pixel on the vertical axis, i.e. visualizing the
relation 𝑝𝑥1 (𝑝𝑥0 ). Images with underlying hypothesis 𝐻1 are depicted
in red, and those with underlying hypothesis 𝐻0 in blue. Each point
in the graph thus represents an image 𝑔𝑖 = (𝑝𝑥𝑖,0 , 𝑝𝑥𝑖,1 ), regardless of the underlying
hypothesis. Along the horizontal axis, the overlapping univariate PDFs of the first
pixel 𝑝𝑥0 are plotted for each of the two hypotheses, 𝑝𝑟(𝑝𝑥0 |𝐻0 ) and 𝑝𝑟(𝑝𝑥0 |𝐻1 ), while
along the vertical axis the same is done for the second pixel 𝑝𝑥1 : 𝑝𝑟(𝑝𝑥1 |𝐻0 ) and
𝑝𝑟(𝑝𝑥1 |𝐻1 ). Because the image pixels are generated independently of one
another, the dataset PDF under each hypothesis is the product of the two
univariate PDFs under the same hypothesis, i.e. 𝑝𝑟(𝑔|𝐻𝑖 ) = 𝑝𝑟(𝑝𝑥0 |𝐻𝑖 ) 𝑝𝑟(𝑝𝑥1 |𝐻𝑖 ),
and follows the Gaussian distribution from equation 2. Additionally, a test
statistic axis 𝑡 is visualized together with a crossing line of an example
decision threshold 𝑡𝑐 . The two generated subsets
(signal present and signal absent) can be seen as a linear transformation of
one another. In our two-pixel example this means that the signal
present subset is a translation of the signal absent subset by an average distance
of (2,1):
$$ g_1 = g_0 + \tau + n \tag{9} $$
where 𝜏 = (2,1) and 𝑛 is Gaussian noise with zero mean and 𝜎 = 1.
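A dataset with the stated means and standard deviation can be generated along the following lines. This is a minimal sketch: the random seed and the variable names are arbitrary choices made here, and the thesis does not state how its own example data were produced.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily for repeatability
n_images = 1000                     # number of realizations per hypothesis

s0 = np.array([5.0, 5.0])           # mean signal-absent image
s1 = np.array([7.0, 6.0])           # mean signal-present image
sigma = 1.0                         # common pixel standard deviation

# Each row is one two-pixel image; the pixels are drawn independently
g0 = rng.normal(loc=s0, scale=sigma, size=(n_images, 2))
g1 = rng.normal(loc=s1, scale=sigma, size=(n_images, 2))

# Empirical estimate of the translation between the two subsets
tau_hat = g1.mean(axis=0) - g0.mean(axis=0)
```

The estimate `tau_hat` should lie close to (2, 1), with a spread that shrinks as the number of images grows.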

Figure 2. Scatter plot of the two-pixel image dataset g, where the first pixel is
plotted against the second pixel. The univariate PDFs of each pixel under
hypothesis H0 (blue) and hypothesis H1 (red) are also plotted on the axes.
To accomplish a binary classification using linear discriminant analysis, we need
to define a decision threshold on a test statistic axis (lines 𝑡𝑐 and 𝑡 in figure 2,
respectively) that divides the two hypotheses in an ideal way. The ideal
orientation of the decision threshold line would usually
not be parallel or perpendicular to either of the two graph axes (figure 2). If it
were, this would mean that only one of the two pixels effectively
differentiates between the two hypotheses, while the other pixel is unaffected
by the hypothesis. In order to solve the problem in a more general way, we need
to find the vector that changes the most when the
linear transformation is applied to the first subset to produce the second
subset (equation 9). In linear algebra such a vector is called an eigenvector, and in
the case of Gaussian distributions with no correlation, the eigenvector is equal to
the difference between the mean images of the two subsets:
𝑤 = 𝑠1 − 𝑠0 (10)
From now on we will refer to this eigenvector as the observer template, or simply
the template. In signal detection theory the template defines the type of model
observer solving the classification task. The template transforms the images
into a new subspace of test statistics via equation 7. In figure 2 the direction of
the new subspace is given by the axis 𝑡, passing through the two mean points of
𝑝𝑟(𝑔|𝐻0 ) and 𝑝𝑟(𝑔|𝐻1 ). The decision threshold 𝑡𝑐 would be a line perpendicular
to axis 𝑡 that divides the two classes in an ideal way. In statistical decision theory
the point 𝑡 = 𝑡𝑐 is usually set taking into account the resulting cost of making a
wrong or right decision, in the same manner as for the ideal observer. The whole
process described by equation 7 is visualized in figure 3.

Figure 3. Calculation of test statistic distributions from input images.


For every image in the dataset 𝑔, a test statistic 𝑡 is calculated, and with a
given decision threshold the image is scored either as signal present, if the test
statistic is higher than the threshold, or as signal absent, if it is lower. This decision
is based solely on the data 𝑔, which, due to the randomness generated by the
imperfections and noise of the imaging system, is a loose representation of the
underlying object. These multiple noise sources usually satisfy the conditions of
the central limit theorem, so the test statistics will most often be normally
distributed around two values with non-zero standard deviation. Often the two
resulting distributions overlap, and the decisions made using the data
will then contain some misclassifications (figure 3).
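The scoring procedure can be sketched for the uncorrelated two-pixel example as follows. The midpoint threshold used here is an assumption that corresponds to equal costs and equal priors; the seed and sample size are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)                  # arbitrary seed
s0, s1 = np.array([5.0, 5.0]), np.array([7.0, 6.0])
g0 = rng.normal(s0, 1.0, size=(1000, 2))        # signal-absent images
g1 = rng.normal(s1, 1.0, size=(1000, 2))        # signal-present images

w = s1 - s0                                     # template (equation 10)
t0 = g0 @ w                                     # test statistics (equation 7)
t1 = g1 @ w

# Assumed threshold: midpoint between the two mean test statistics
t_c = 0.5 * (w @ s0 + w @ s1)

tpf = np.mean(t1 > t_c)   # fraction of signal-present images scored as present
fpf = np.mean(t0 > t_c)   # fraction of signal-absent images scored as present
```

Because the two test statistic distributions overlap, `tpf` stays below 1 and `fpf` above 0 for any finite threshold.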
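The same mathematical problem can be solved also with the ideal observer
likelihood ratio approach. With a dataset 𝑔 generated using equation 2, we can
use this image PDF in equation 4 to calculate the likelihood ratio of the ideal
observer:
$$
\begin{aligned}
\Lambda(g) = \frac{pr(g \mid H_1)}{pr(g \mid H_0)}
&= \frac{\dfrac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left( -\dfrac{(g - s_1)^{2}}{2\sigma^{2}} \right)}{\dfrac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left( -\dfrac{(g - s_0)^{2}}{2\sigma^{2}} \right)} \\
&= \exp\left( -\frac{(g - s_1)^{2}}{2\sigma^{2}} + \frac{(g - s_0)^{2}}{2\sigma^{2}} \right)
 = \exp\left( \frac{g (s_1 - s_0)}{\sigma^{2}} + \frac{s_0^{2} - s_1^{2}}{2\sigma^{2}} \right)
\end{aligned}
\tag{11}
$$
$$ \lambda(g) = \ln \Lambda(g) = \frac{g (s_1 - s_0)}{\sigma^{2}} + \frac{s_0^{2} - s_1^{2}}{2\sigma^{2}} \underset{H_0}{\overset{H_1}{\gtrless}} \ln \Lambda_c = \lambda_c $$
$$ \lambda(g) = g (s_1 - s_0) \underset{H_0}{\overset{H_1}{\gtrless}} \left( \lambda_c - \frac{s_0^{2} - s_1^{2}}{2\sigma^{2}} \right) \sigma^{2} = \lambda'_c $$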
In order to solve the likelihood ratio for images with Gaussian statistics, it is
useful to introduce the log-likelihood ratio, which is equal to the natural
logarithm of the likelihood ratio. This eliminates the exponential function in the
Gaussian PDF and allows direct mathematical operations on the first and
second order statistics of the data. The same applies to the likelihood ratio
threshold, which is transformed into a log-likelihood ratio threshold. From
equation 11 it can be seen that, by solving the ideal observer ratio, we effectively
apply a linear discriminant function to the image dataset. In fact the template of
this discriminant function is equal to equation 10:
$$ t = T(g) = w^{t} g = (s_1 - s_0)^{t} g = \lambda(g) \tag{12} $$
This shows that, for an image dataset with
Gaussian statistics and no correlation, the ideal-linear observer is equal to the
ideal observer. This
transformation of the images into test statistic distributions is a model observer,
and the model observer from equation 12 is called the non-prewhitening
matched filter or NPW.
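The linearity of the log-likelihood ratio can be checked numerically for the two-pixel example. The sketch below compares the difference of the two log-PDFs of equation 2 with the linear expression in the exponent of equation 11; the test image 𝑔 is an arbitrary illustrative choice.

```python
import numpy as np

s0, s1, sigma = np.array([5.0, 5.0]), np.array([7.0, 6.0]), 1.0

def log_pdf(g, s):
    """Natural log of equation 2 for image g with mean image s."""
    norm = -0.5 * g.size * np.log(2.0 * np.pi * sigma**2)
    return norm - np.sum((g - s) ** 2) / (2.0 * sigma**2)

g = np.array([6.5, 5.5])   # arbitrary test image

# Log-likelihood ratio computed directly from the two PDFs
llr_direct = log_pdf(g, s1) - log_pdf(g, s0)

# Linear form: template applied to g plus a constant offset
llr_linear = g @ (s1 - s0) / sigma**2 + (s0 @ s0 - s1 @ s1) / (2.0 * sigma**2)
```

Both expressions agree to numerical precision, so the only image-dependent term is the dot product with the template 𝑠1 − 𝑠0.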
In figure 3, the test statistics graph on the right shows the two test statistic
distributions for the two hypotheses of the dataset g: 𝑝𝑟(𝑡|𝐻𝑖 ). The decision
threshold splits the two distributions into four fractions ranging from zero to
one: the true positive fraction (TPF) and true negative fraction (TNF), in which the
observer correctly determines the underlying hypothesis, and the false positive
fraction (FPF) and false negative fraction (FNF), in which the observer is mistaken. The
four scenarios are shown in table 1.
Table 1. The decision outcomes for a binary decision problem.
                                Ground truth: signal present (H1)   Ground truth: signal absent (H0)
Decision: signal present (H1)   True positive (TPF)                 False positive (FPF)
Decision: signal absent (H0)    False negative (FNF)                True negative (TNF)

If the two probability density functions of the decision variable 𝑡 given each of the
hypotheses are known, the four possible decision outcomes can be calculated for
a given 𝑡𝑐 :
$$ TPF = \lim_{num \to \infty} \left[ \frac{\text{Num. true positive decisions}}{\text{Num. signal present images}} \right] = \int_{t_c}^{\infty} pr(t \mid H_1)\, dt \tag{11} $$
$$ FPF = \lim_{num \to \infty} \left[ \frac{\text{Num. false positive decisions}}{\text{Num. signal absent images}} \right] = \int_{t_c}^{\infty} pr(t \mid H_0)\, dt $$
$$ TNF = 1 - FPF = \int_{-\infty}^{t_c} pr(t \mid H_0)\, dt $$
$$ FNF = 1 - TPF = \int_{-\infty}^{t_c} pr(t \mid H_1)\, dt $$
From the expressions above and figure 3 it can be seen that the four fractions
depend on the chosen decision threshold 𝑡𝑐 for these specific two PDFs. By plotting the
relation between TPF and FPF for a changing threshold value, a curve can be
generated that summarizes the separation of the two test statistic PDFs. This
plot is called a receiver operating characteristic (ROC) curve and is a crucial
method for assessing the difficulty of the task and the performance of the
observer. With perfect separation between the two PDFs the ROC curve will rise
close to the upper left corner of the diagram, and if no separation is possible
the ROC curve will follow the diagonal of the diagram. The ROC curve generated
from the two test statistic distributions, obtained with the mathematical observer
described by equation 7 reading the two-pixel image dataset, is plotted in figure 4.
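An empirical ROC curve can be obtained by sweeping the threshold over the observed test statistics. The sketch below is one possible implementation under the assumptions of the two-pixel example (uncorrelated pixels, 𝜎 = 1, template 𝑠1 − 𝑠0); the helper name `roc_curve`, the seed and the trapezoidal AUC estimate are choices made here for illustration.

```python
import numpy as np

def roc_curve(t0, t1):
    """Empirical ROC: sweep t_c from +inf to -inf and return (FPF, TPF)."""
    thresholds = np.sort(np.concatenate([t0, t1, [-np.inf, np.inf]]))[::-1]
    fpf = np.array([np.mean(t0 > tc) for tc in thresholds])
    tpf = np.array([np.mean(t1 > tc) for tc in thresholds])
    return fpf, tpf

rng = np.random.default_rng(2)                       # arbitrary seed
s0, s1 = np.array([5.0, 5.0]), np.array([7.0, 6.0])
w = s1 - s0                                          # template (equation 10)
t0 = rng.normal(s0, 1.0, size=(1000, 2)) @ w         # signal-absent statistics
t1 = rng.normal(s1, 1.0, size=(1000, 2)) @ w         # signal-present statistics

fpf, tpf = roc_curve(t0, t1)

# Area under the curve by the trapezoidal rule
auc = np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2.0)
```

The curve starts at (0, 0) for an infinitely high threshold and ends at (1, 1) for an infinitely low one; an AUC of 0.5 would correspond to the diagonal.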

Figure 4. Receiver operating characteristic curve.


1.3.2. Gaussian distributed image PDF with correlation
A more general problem than the example discussed in the previous
section is the case of images with a multivariate normal distribution, whose PDFs are
given by equation 3. In fact, if the covariance matrix in equation 3 has
diagonal elements equal to 1 and zeros elsewhere, i.e. it is the identity matrix,
equation 3 reduces to equation 2 with 𝜎 = 1. For a better understanding of
linear discriminant analysis applied to this type of image, the two-pixel image
example is used again. The dataset consists of images coming from the same two
hypotheses: signal absent (𝐻0 ) and signal present (𝐻1 ). The images, however, are
generated with the multivariate normal distribution, so that there is a correlation
between the two pixel values of each image. This effectively means that if the
first pixel has a higher value, then the second pixel is likely to have a higher value
too if the correlation coefficient is positive, and a lower value if the
correlation coefficient is negative. Let 𝑔0 and 𝑔1 be the image subsets linked to the signal
absent hypothesis and the signal present hypothesis, respectively; then each
image is drawn as:
$$ g_0 = mv\mathcal{N}\left( \begin{pmatrix} 5 \\ 5 \end{pmatrix}, \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix} \right), \qquad g_1 = mv\mathcal{N}\left( \begin{pmatrix} 7 \\ 6 \end{pmatrix}, \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix} \right) \tag{12} $$
where 𝑚𝑣𝒩(𝑥, 𝑦) stands for the multivariate normal distribution with expected
value 𝑥 and covariance matrix 𝑦. By applying the NPW
template derived in the previous section (equation 10), we can estimate the ROC curve
for the dataset and the outcomes of solving the detection task with this
template. Figure 5 shows the scatter plot of the newly generated image dataset,
with two elongated clouds showing the positive correlation between the first and
the second pixel for all images. The figure also shows the ROC curve for
these specific conditions. However, due to the between-pixel correlation
present in the dataset, it can be seen in figure 5 that the template used
is not equal to the actual eigenvector of the linear transformation between the
two image subsets. In other words, the direction between the average signal
absent image and the average signal present image (equation 10) is ideal only
when there is no correlation between pixels and the scatter plot of the
data shows circular distributions of the image subsets (figure 2 compared to figure
5).

Figure 5. ROC curve estimated from the correlated data by a non-prewhitened
template.
To account for the correlation, a prewhitening transformation is introduced
(figure 6). The process of prewhitening uses the covariance matrix of the data
to eliminate the underlying correlation. If the covariance matrix of the image
dataset 𝑔 is $K = \overline{(g - s)(g - s)^t}$, where 𝑠 denotes the expected
value, the overline denotes the average over the dataset and $(\;)^t$ denotes
the transpose operator, and this covariance matrix is nonsingular, we can
transform the dataset 𝑔 into a dataset 𝑧 by applying the inverse of the square
root of the covariance matrix, $K^{-1/2}$. If we calculate the covariance
matrix of the new dataset 𝑧, $K_z$, we will find that it is the identity
matrix $I$ and therefore has correlation coefficients equal to zero:

$$
\begin{aligned}
K &= \overline{(g - s)(g - s)^t}, \qquad z = K^{-1/2} g, \\
K_z &= \overline{(z - \bar{z})(z - \bar{z})^t}
     = \overline{\left(K^{-1/2} g - \overline{K^{-1/2} g}\right)\left(K^{-1/2} g - \overline{K^{-1/2} g}\right)^t} \\
    &= K^{-1/2}\, \overline{(g - s)(g - s)^t}\, K^{-1/2} \\
    &= K^{-1/2} K K^{-1/2} = I
\end{aligned}
\tag{13}
$$

Figure 6. Visualization of the effect of prewhitening.


Figure 6 shows the transformation of the images when the prewhitening
operator is applied to the correlated image dataset. From equation 13 it
follows that, after prewhitening, the pixels of the two image PDFs are
uncorrelated and the difference between the expected signal present and signal
absent vectors is again the eigenvector of the transformation. Because
prewhitening also changes the expected values of the PDFs, the orientation of
the eigenvector has to be changed accordingly with the same prewhitening
operation:
$$
w^t = (s_1 - s_0)^t K^{-1} \tag{14}
$$

Note that the covariance matrix enters equation 14 as the inverse and not as
the square root of the inverse, because two prewhitening operators are needed:
one to prewhiten the image dataset 𝑔 and one to prewhiten the eigenvector
(𝑠1 − 𝑠0 ). Figure 7 visualizes the application of this template to the
prewhitened image dataset 𝑔. There is a clear difference between the ROC curve
in figure 5, obtained with the non-prewhitened method applied to correlated
images, and the ROC curve in figure 7. This difference will be quantified in a
following section. A model observer that uses the template from equation 14 is
called the prewhitening matched filter (PW) and is ideal for data with
multivariate normal statistics and the same covariance matrix under both
hypotheses.

Figure 7. ROC curve estimated from the prewhitened data by a prewhitened template.
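
The two-pixel example can be reproduced numerically. The following is a minimal sketch, assuming that equation 12 describes class means (5, 5) and (7, 6) with a shared covariance matrix, and using NumPy; the sample size, the random seed and all variable names are illustrative choices and not taken from the thesis.

```python
import numpy as np

# Assumed reading of equation 12: both classes share K, only the means differ.
K = np.array([[1.0, 0.9], [0.9, 1.0]])   # pixel covariance, correlation 0.9
s0 = np.array([5.0, 5.0])                # expected signal absent image
s1 = np.array([7.0, 6.0])                # expected signal present image

rng = np.random.default_rng(0)           # fixed seed, illustration only
g0 = rng.multivariate_normal(s0, K, size=5000)   # rows are two-pixel images
g1 = rng.multivariate_normal(s1, K, size=5000)

# Inverse square root of K from its eigendecomposition (K is symmetric).
eigval, eigvec = np.linalg.eigh(K)
K_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Prewhitened data z = K^(-1/2) g. K_inv_sqrt is symmetric, so multiplying
# the row-stored images from the right gives the same result.
z0 = g0 @ K_inv_sqrt
K_z = np.cov(z0, rowvar=False)           # should be close to the identity

def snr(w, absent, present):
    """Separation of the test statistics t = w^t g of the two image sets."""
    t0, t1 = absent @ w, present @ w
    return (t1.mean() - t0.mean()) / np.sqrt(0.5 * (t1.var() + t0.var()))

w_npw = s1 - s0                          # non-prewhitened template
w_pw = np.linalg.solve(K, s1 - s0)       # prewhitened template K^-1 (s1 - s0)

snr_npw = snr(w_npw, g0, g1)
snr_pw = snr(w_pw, g0, g1)
print("K_z =", K_z.round(2).tolist())
print(f"SNR NPW = {snr_npw:.2f}, SNR PW = {snr_pw:.2f}")
```

With correlated pixels, the prewhitened template should separate the two classes better than the non-prewhitened one, in line with the comparison of figure 5 and figure 7.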
The classification problem in this subsection can again be solved with the ideal
observer approach. In this case, the image datasets have PDFs summarized in
equation 3. If this equation is used in the likelihood ratio from equation 4, the
problem can be solved in the following way:
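$$
\begin{aligned}
\Lambda(g) &= \frac{pr(g|H_1)}{pr(g|H_0)}
  = \frac{\frac{1}{\sqrt{2\pi\det(K)}}\exp\left(-\frac{1}{2}(g - s_1)^t K^{-1}(g - s_1)\right)}
         {\frac{1}{\sqrt{2\pi\det(K)}}\exp\left(-\frac{1}{2}(g - s_0)^t K^{-1}(g - s_0)\right)} \\
 &= \exp\left(-\frac{1}{2}(g - s_1)^t K^{-1}(g - s_1) + \frac{1}{2}(g - s_0)^t K^{-1}(g - s_0)\right) \\
 &= \exp\left(\frac{1}{2}\left(g^t K^{-1}(s_1 - s_0) + (s_1 - s_0)^t K^{-1} g + s_0^t K^{-1} s_0 - s_1^t K^{-1} s_1\right)\right)
\end{aligned}
\tag{15}
$$
$$
\lambda(g) = \ln\Lambda(g) = \frac{1}{2}\left(g^t K^{-1}(s_1 - s_0) + (s_1 - s_0)^t K^{-1} g + s_0^t K^{-1} s_0 - s_1^t K^{-1} s_1\right) \underset{H_0}{\overset{H_1}{\gtrless}} \ln\Lambda_c = \lambda_c
$$
$$
\lambda(g) = (s_1 - s_0)^t K^{-1} g \underset{H_0}{\overset{H_1}{\gtrless}} \lambda_c + \frac{1}{2}\left(s_1^t K^{-1} s_1 - s_0^t K^{-1} s_0\right) = \lambda'_c
$$

The image dataset PDFs are Gaussian and therefore it is again convenient to
use the log-likelihood ratio. A few additional steps simplify this ratio. The
scalar terms that do not depend on the image data are moved to the
log-likelihood threshold, leaving two nearly identical terms, one written
with 𝑔 as a row vector and one with 𝑔 as a column vector. Both terms are
scalars and the covariance matrix is symmetric, so each term is the transpose
of the other and the two are equal. They can therefore be summed, i.e.
$g^t K^{-1}(s_1 - s_0) + (s_1 - s_0)^t K^{-1} g = 2\,(s_1 - s_0)^t K^{-1} g$.
The remaining data-dependent term, again denoted 𝜆(𝑔), is compared with the
shifted threshold $\lambda'_c$.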
From equation 15 we can see that the log-likelihood ratio is equal to the
ideal-linear test statistic using the PW template. This demonstrates that the
PW linear model observer is equal to the ideal observer for datasets with
multivariate normal statistics:
$$
t = T(g) = w^t g = \Delta s^t K^{-1} g = (s_1 - s_0)^t K^{-1} g = \lambda(g) \tag{16}
$$

1.4. Performance figure of merit


1.4.1. Area under the receiver operating characteristic curve (AUC)
The most common observer performance estimate taking into account the test
statistics probability density functions 𝑝𝑟(𝑡|𝐻𝑖 ) is the area under the ROC curve
(AUC):
$$
AUC = \int_0^1 TPF \; d(FPF) \tag{17}
$$

The area under the ROC curve, introduced in section 1.3.1, gives the average
TPF over all FPF values. The TPF and FPF range from 0 to 1 and therefore the
AUC also ranges from 0 to 1, with a meaningful range from 0.5 to 1. In the
first extreme case, an AUC equal to 0.5, either the observer is in fact
guessing or the task is infinitely difficult. In the other extreme case, an
AUC equal to 1, the observer is ‘perfect’ or the task is obvious. An AUC lower
than 0.5 usually means that the observer was not trained properly, as it
detects the wrong class of images. For example, an observer with AUC=0 can
still be called a ‘perfect’ observer, on the condition that each image is
assigned to the class opposite to the one the observer selected.
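
Equation 17 can be evaluated numerically from a finite set of test statistics. The sketch below, in plain Python, sweeps the decision threshold over all observed values and integrates the resulting ROC points with the trapezoidal rule; the function names and the example values are hypothetical.

```python
def roc_points(t_absent, t_present):
    """Sweep the threshold over all observed scores and return (FPF, TPF) lists."""
    thresholds = sorted(set(t_absent) | set(t_present), reverse=True)
    fpf, tpf = [0.0], [0.0]
    for c in thresholds:
        # An image is called 'signal present' when its test statistic is >= c.
        fpf.append(sum(t >= c for t in t_absent) / len(t_absent))
        tpf.append(sum(t >= c for t in t_present) / len(t_present))
    return fpf, tpf

def auc_trapezoid(fpf, tpf):
    """Equation 17 approximated with the trapezoidal rule."""
    return sum((fpf[i] - fpf[i - 1]) * (tpf[i] + tpf[i - 1]) / 2
               for i in range(1, len(fpf)))

# Hypothetical test statistics, for illustration only.
t_absent = [0.1, 0.4, 0.5, 0.9, 1.2]
t_present = [0.8, 1.0, 1.5, 1.7, 2.1]

fpf, tpf = roc_points(t_absent, t_present)
auc = auc_trapezoid(fpf, tpf)
print(f"AUC = {auc:.2f}")
```

Swapping the two input lists in this sketch gives an AUC below 0.5, the 'wrong class' situation described above.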
Figure 8 shows the ROC curves from section 1.3 in one graph. The three different
curves are acquired with the 2 discussed model observers applied to the 2 image
statistics – the NPW applied to images with Gaussian statistics without
correlation (NPW_corr=0.0), NPW applied to images with Gaussian statistics
with correlation (NPW_corr=0.9) and PW applied to images with correlation
(PW_corr=0.9) (Note that PW_corr=0.0 is equal to NPW_corr=0.0, as discussed in
sub-section 1.3.2). In the center of the plot the AUC values have been added. It
can be seen that the NPW_corr=0.9 performs worse than the case PW_corr=0.9,
which numerically confirms the visual results from section 1.3.2.

Figure 8. ROC curves estimated by the NPW and PW model observers applied on two
datasets with and without correlation.
1.4.2. Signal to noise ratio (SNR)
The simplest method of determining how separated two peaks in a plot are, is
the signal to noise ratio (SNR). This figure of merit extensively used in electronics
is also applicable for estimation of the separation of the PDFs of the test statistic
𝑝𝑟(𝑡|𝐻𝑖 ), visualized in figure 3.
$$
SNR = \frac{\bar{t}_1 - \bar{t}_0}{\sqrt{\dfrac{\sigma_1^2 + \sigma_0^2}{2}}} \tag{18}
$$

Where 𝑡1̅ and 𝑡0̅ are the mean test statistics for signal present and signal absent
respectively and 𝜎1 and 𝜎0 are the standard deviations of the test statistics for
signal present and signal absent respectively.
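
As a minimal sketch of equation 18, the SNR can be computed directly from two samples of test statistics. The example uses population variances from the Python standard library; the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance

def snr(t_absent, t_present):
    """Equation 18: mean difference over the root of the averaged variances."""
    averaged_variance = (pvariance(t_present) + pvariance(t_absent)) / 2
    return (mean(t_present) - mean(t_absent)) / sqrt(averaged_variance)

# Hypothetical test statistics, for illustration only.
t0 = [0.0, 1.0, 2.0]   # signal absent
t1 = [3.0, 4.0, 5.0]   # signal present
snr_value = snr(t0, t1)
print(f"SNR = {snr_value:.3f}")
```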
Figure 9 shows the ROC curves of the model observers from section 1.3. The
SNR of each curve is depicted as an arrow in the colour of that curve. The SNR
is non-linearly related to the AUC: it ranges from 0, where AUC=0.5, to
infinity, where the arrow would reach the top left corner of the ROC plot and
AUC=1.0.

Figure 9. ROC curves estimated by the NPW and PW model observers applied on
two datasets with and without correlation. The arrows represent the SNR for
each of the curves.

When the test statistic is normally distributed and the variance is equal for
both hypotheses, the conversion between SNR and AUC can be expressed by the
following (Vennart, 1997):
$$
SNR = d' = 2\,\mathrm{erf}^{-1}(2\,AUC - 1) \tag{18}
$$
Note that in this case the SNR is equal to a new figure of merit called the
detectability 𝑑′. The SNR and the value obtained from the AUC coincide only
under the aforementioned conditions on the test statistic distributions. When
there is no prior information about the shape and variance of these
distributions, the value obtained from the AUC is therefore called
detectability rather than SNR and is usually denoted as 𝑑′.
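
The conversion can be sketched with the Python standard library. Since the standard normal CDF is Φ(x) = (1 + erf(x/√2))/2, the relation 𝑑′ = 2 erf⁻¹(2 AUC − 1) is equivalent to AUC = Φ(𝑑′/√2), which is the form used below; the function names are illustrative.

```python
from math import erf, sqrt
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def auc_from_dprime(d_prime):
    """AUC = Phi(d'/sqrt(2)), valid for equal-variance normal test statistics."""
    return _STANDARD_NORMAL.cdf(d_prime / sqrt(2))

def dprime_from_auc(auc):
    """Inverse relation, equal to 2 * erfinv(2 * AUC - 1); needs 0 < AUC < 1."""
    return sqrt(2) * _STANDARD_NORMAL.inv_cdf(auc)

for d in (0.0, 1.0, 2.0, 3.0):
    print(f"d' = {d:.1f} -> AUC = {auc_from_dprime(d):.3f}")
```

The loop illustrates the non-linear relation: the AUC saturates towards 1 while 𝑑′ keeps growing.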
1.4.3. Percentage correctly detected targets (PC)
The two previous methods rely on full or statistical knowledge of the test
statistic PDFs. When these are impossible to estimate, there is an
alternative approach to image quality estimation that is related to the AUC.
The observer is simultaneously presented with two images, 𝑔𝑠𝑝 and 𝑔𝑠𝑎 , one
drawn from each of the two hypotheses 𝐻𝑖 , i.e. there is always one signal
present and one signal absent image. The task for the observer is to select
the image coming from the signal present class 𝐻1 . This observer experiment
is called the two-alternative forced-choice (2AFC) method. The task is
performed as follows: the observer estimates a test statistic value for each
of the two images and compares them. The image with the higher test statistic
value is always selected as the one coming from the signal present class.
When the estimated test statistic for the signal present image is larger than
that for the signal absent image, i.e. 𝑇(𝑔𝑠𝑝 ) > 𝑇(𝑔𝑠𝑎 ), the outcome is
correct and called a ‘hit’ (a term derived from early radar detection
experiments). The opposite, i.e. 𝑇(𝑔𝑠𝑎 ) > 𝑇(𝑔𝑠𝑝 ), produces a ‘miss’. If the
2AFC trial is repeated multiple times over different images, it is possible
to estimate a figure of merit linked to the likelihood of the observer
selecting the desired image. This figure of merit is called the percentage
correct (PC) and is equal to the fraction of ‘hits’ in a 2AFC study:
$$
PC = \frac{\text{number of ‘hits’}}{\text{number of 2AFC trials}} \tag{19}
$$
A property of the PC from a 2AFC observer study is that, regardless of the
test statistic PDFs, its expected value is equal to the AUC for the given
observer and image dataset. This makes the 2AFC method convenient for quick
and reliable AUC estimation, and it is often preferred in practice.
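
The equality between PC and AUC can be checked with a small simulation. The sketch below deliberately uses skewed, non-Gaussian test statistics (exponential distributions chosen purely for illustration) and compares the 2AFC percentage correct with the AUC computed as the fraction of correctly ordered image pairs; names, sample sizes and the seed are assumptions.

```python
import random

def two_afc_percentage_correct(t_absent, t_present, n_trials, rng):
    """Equation 19: fraction of trials in which the signal present image wins."""
    hits = 0
    for _ in range(n_trials):
        t_sa = rng.choice(t_absent)     # one signal absent image per trial
        t_sp = rng.choice(t_present)    # one signal present image per trial
        if t_sp > t_sa:
            hits += 1
    return hits / n_trials

def auc_from_pairs(t_absent, t_present):
    """AUC as the fraction of correctly ordered pairs (ties count one half)."""
    correct = sum((sp > sa) + 0.5 * (sp == sa)
                  for sp in t_present for sa in t_absent)
    return correct / (len(t_absent) * len(t_present))

rng = random.Random(1)                                   # fixed seed
t_absent = [rng.expovariate(1.0) for _ in range(400)]    # hypothetical scores
t_present = [rng.expovariate(0.5) for _ in range(400)]

pc = two_afc_percentage_correct(t_absent, t_present, 20000, rng)
auc = auc_from_pairs(t_absent, t_present)
print(f"PC = {pc:.3f}, AUC = {auc:.3f}")
```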
2. Image quality estimation
In this section the focus is more on the practical aspects of image quality
estimation. As discussed above, there are four major components to a
task-based image quality estimation: Task, Observer, Types of images and a
Measure of task performance. While the emphasis in the previous section was on
the theory behind these four aspects, most of the methods are impractical or
even not applicable to real medical images. Here we start with the gold
standard of image quality estimation, the human observer study, and with
different aspects of using human observers for classification. The topic then
shifts towards model observers applicable to clinical images. The section
concludes with some practical remarks on image data preparation.
2.1. Human observer
In current medical imaging practice, it is the human observer who makes
decisions based on the image data. It is relatively complex to determine the
exact way in which humans transform an image into diagnostic information or
formulate a certain decision. The approach described in the previous section,
the use of a discriminant function to derive test statistics used for decision
making, is not useful in this case, as the discriminant function is unknown.
Human decision making can be approached from the biological and optical points
of view. The human eye is the optical receptor that transforms the visual data
into sensory input for the lateral geniculate nucleus (LGN), which further
transforms and transports the data to the primary visual cortex of the brain.
These three anatomical parts, along with the image-processing activities of
the visual pathways, form the human visual system (HVS). The HVS is the
biological counterpart of the discriminant function of the model observers
introduced in the previous sections. While it is possible to study
biologically how information is transported from the eye to the visual cortex,
it is not trivial to determine the image-processing functions of the HVS. The
investigation of how the HVS responds to changing visual input stimuli falls
within the field of psychophysics. The human observer response to specific
image content is assessed, and from it quantitative measures characterizing
human perception are determined. Common physical characteristics include
contrast, luminance, image resolution or different attributes describing a
target within the images, e.g. size, shape, noise, etc.
Kelly (Kelly, 1977) studied the contrast sensitivity of human observers in
relation to spatial frequency. Human observers scored sine-wave images at
decreasing contrast and increasing frequency. It was found that the resulting
contrast-sensitivity function (CSF) fitted the following equation:
$$
C_{tr} = f^2 \exp(-f) \tag{20}
$$
where 𝐶𝑡𝑟 is the threshold contrast, i.e. the contrast at which the observers
could no longer detect the sine pattern, and 𝑓 is the spatial frequency of the
sine wave. Figure 10 shows a plot of the function.

[Plot: contrast sensitivity function 𝐶𝑡𝑟 = 𝑓² exp(−𝑓); horizontal axis: spatial frequency [cpd], 0 to 15; vertical axis: contrast threshold]

Figure 10. Visual human contrast-sensitivity function.
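
Equation 20 is simple to evaluate. The sketch below samples it over the 0 to 15 cpd range of figure 10 and locates its maximum; setting the derivative (2𝑓 − 𝑓²) exp(−𝑓) to zero places the peak at 𝑓 = 2 cpd. The sampling step and the names are illustrative choices.

```python
from math import exp

def csf(f):
    """Equation 20 evaluated at spatial frequency f in cycles per degree."""
    return f ** 2 * exp(-f)

# Sample the function every 0.5 cpd over the range shown in figure 10.
frequencies = [0.5 * k for k in range(31)]
values = [csf(f) for f in frequencies]
f_peak = frequencies[values.index(max(values))]
print(f"peak at f = {f_peak} cpd, value = {max(values):.3f}")
```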


The CSF varies substantially with different image parameters influencing the
neurophysiology of the HVS. For example, the adaptation of the human eye to
different background luminance levels, flickering and motion can change the
form and the fit of the function. This marks a difference between humans and
the ideal observer described in the previous sections: human decisions can be
affected by such parameters, whereas the ideal observer would always score the
same.
A more conventional and widely used method for estimating HVS performance for
a certain detection task is the rating-scale approach. Here human observers
are presented with a single image, for example a mammogram. The observer task
is to score the image with a certain level of confidence about some decision,
e.g. the presence of a lesion. The confidence levels are tabulated and set in
the reading study protocol. An example of such a rating scale is shown in
table 2.

Table 2. Example of rating scale for mammography images.
Rating Description of the confidence level
1 Mammogram is definitely normal
2 Mammogram is likely to be normal
3 Mammogram is equally likely to be normal or abnormal
4 Mammogram is likely to be abnormal
5 Mammogram is definitely abnormal
When rating images with normal or abnormal ground truth, the observer will
potentially make errors or be uncertain as to which class some of the images
belong. The experiment will then naturally produce two distributions of human
confidence scores – one for each image ground truth, the normal and abnormal
images. This process of image grading is the same as the one previously
explained in statistical decision theory (section 1.3). The different confidence
levels bring the same information as the decision variables, i.e. 𝑡 = 𝑇(𝑔) ,
although in this case the discriminant function T is unknown. Figure 11 shows
example results using the rating-scale method and the specific ROC curve for
these human scorings.

Figure 11. Rating-scale results from a human observer study and the ROC curve
calculated from the results.
This method is known as the ROC experimental method as an ROC curve is
produced. From the ROC curve, we can calculate the area under the curve (as in
section 1.4.1), which in this case is AUC=0.93. While the ROC experiment is a
useful method for human observer image quality estimation, it often takes a long
time to conduct. In the field of mammography ROC experiments require a
number of radiologists (perhaps five or six), whose time and availability are
generally limited. This makes the experiment expensive to conduct even for
small studies and perhaps even impossible if many system parameters need to
be adjusted or the image quality needs to be assessed daily.
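
The step from rating-scale results to an ROC curve can be sketched as follows. Each confidence level of table 2 serves as a threshold ('call the image abnormal if the rating is at least r'), which gives one operating point per level. The rating counts below are hypothetical and are not the data behind figure 11.

```python
# Hypothetical numbers of images given ratings 1..5 (see table 2).
normal_counts = [30, 10, 5, 3, 2]       # ground truth: normal
abnormal_counts = [2, 4, 6, 13, 25]     # ground truth: abnormal

def operating_points(normal, abnormal):
    """Return one (FPF, TPF) point per threshold, from strictest to most lenient."""
    n_normal, n_abnormal = sum(normal), sum(abnormal)
    points = [(0.0, 0.0)]
    for r in range(len(normal) - 1, -1, -1):
        fpf = sum(normal[r:]) / n_normal        # normal images rated >= r + 1
        tpf = sum(abnormal[r:]) / n_abnormal    # abnormal images rated >= r + 1
        points.append((fpf, tpf))
    return points

points = operating_points(normal_counts, abnormal_counts)

# Area under the piecewise-linear ROC curve (trapezoidal rule).
auc = sum((x1 - x0) * (y1 + y0) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points)
print(f"AUC = {auc:.3f}")
```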

A more practical way of using human readers in image quality assessment is via
forced-choice experiments (section 1.4.3 described the 2-AFC reading study).
In practice the 2-AFC experiments have the benefit of requiring a small number
of signal absent images, as one signal absent image is required for each
signal present image, but they also have certain limitations. Given the 1:1
ratio between signal present and signal absent images displayed simultaneously
to the observer, the rate at which the observer would pick the correct image
at random (guessing rate) is 50%, which also sets the meaningful range of the
percentage correct (PC) results: from 50% PC up to 100% PC. For this reason,
4-AFC experiments are more commonly undertaken: with a guessing rate of 25%,
they have a meaningful PC range of 75%. Jakel and Wichmann (Jakel & Wichmann,
2006) show that the 4-AFC method is the preferred paradigm among numerous
different forced-choice experiments. The 4-AFC method is statistically more
efficient than the 2-AFC, as the variance inherent in the responses of the
observer is lower. In the same way, 8-AFC is statistically more efficient than
4-AFC, but it requires significantly more signal absent images for a given
signal present image. Therefore, the 4-AFC experiment, being a good compromise
between required images and efficiency, is our choice throughout the different
studies carried out in the following chapters.
Elangovan et al. (Elangovan et al., 2018) have studied the performance of
specialist vs non-specialist observers for clinically relevant targets in 2D
mammography and DBT. They found a significant difference in performance
between the two groups of observers, with the specialists producing higher
scores. Nevertheless, the relative difference in observer performance for the
different stimuli remained approximately constant. Thus, for comparison
studies, the 4-AFC method could be performed with non-specialists, while still
showing the same relevant differences in performance as with specialist
observers. This is a key feature of forced-choice studies, as it is not
necessary to request the input of radiologists, which makes the experiments
much more feasible. It should be noted that AFC studies usually solve a simple
detection task with no target searching involved. This supports the
replacement of radiologists with non-specialist observers, as just the average
HVS is tested and no expertise is required. The conclusions drawn via this
method should be used and presented with caution: conclusions of such an AFC
study should never be compared to the conclusions of a more thorough ROC type
of study, especially if different types of images and/or more experienced
observers are used for the latter method.
2.2. The channelized Hotelling model observer
The ideal observer as discussed in section 1.2 sets the limit of the detectability in
the image dataset for the task of classification. In practice though, the
requirement of the full knowledge of the data PDFs can never be met. The real
clinical images contain variable signals and backgrounds, thus it is practically
unfeasible to make a good PDF estimation. For this purpose linear model
observers are introduced.
The test statistic of the ideal-linear observer is equal to the
log-likelihood ratio of the ideal observer in the cases described in section
1.3, and the ideal-linear observer is the model observer that achieves maximum
performance among all linear observers. Its major drawback comes from the
requirement for exact knowledge of the first and second order statistics of
the data. In other words, the probability density functions are not required,
but certain aspects of the data still have to be known exactly. As with the
ideal observer, this is practically infeasible, due to the randomness in real
medical images.
A linear observer which utilizes the ideal-linear observer template, but with
first and second order statistics calculated from a finite dataset, is called
the optimal-linear observer, or the Hotelling observer. The Hotelling observer
maximizes the observer performance with respect to the accuracy of the data
used for forming the Hotelling observer template. It can be mathematically
summarized by the following:
$$
t = w_{Hot}^t\, g = \Delta\bar{g}^t K_g^{-1} g \tag{21}
$$
where 𝑤𝐻𝑜𝑡 is the Hotelling observer template and ∆𝑔̅ = 𝑔̅1 − 𝑔̅0 is the
difference in the mean of the two image subsets, also called the expected
signal. Note the difference between the expected signal ∆𝑔̅ and the covariance
matrix 𝐾𝑔 in eq. 21 and the expected signal 𝑠 and covariance matrix 𝐾 in eq.
16. This denotes the difference in how these two components are estimated: for
the Hotelling observer they are known statistically, based on the finite
dataset 𝑔, whereas for the ideal-linear observer they are known exactly a
priori, i.e. sample statistics for the Hotelling observer and population
statistics for the ideal observer. As a result, the Hotelling observer does
not reach the performance of the ideal observer, yet it is a useful tool for
classification (detectability) tasks. These can include mimicking human
observer performance or evaluating image post-processing.
A practical limitation of the Hotelling observer comes from the limited data
available for template estimation, called training data. Specifically, if the
covariance matrix 𝐾𝑔 is estimated from fewer training images than there are
pixels in an image, the covariance matrix will be singular and non-invertible.
For example, if an image has 200𝑥200 pixels, the covariance matrix will have
200⁴ elements and a rank of at most the number of training images minus one,
so for an invertible covariance matrix estimate the training data has to
include more than 40 thousand images (and considerably more for a stable
estimate), which is obviously not practical. In order to cope with this, data
channelization mechanisms were introduced.
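
The singularity can be demonstrated on a deliberately small case. The sketch below, assuming NumPy and white-noise images of a hypothetical 16𝑥16 pixel ROI, estimates a sample covariance matrix from fewer images than pixels and inspects its rank.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, illustration only
n_pixels = 16 * 16                  # hypothetical small ROI: 256 pixels
n_images = 100                      # fewer training images than pixels

g = rng.normal(size=(n_images, n_pixels))   # rows are white-noise images
K_g = np.cov(g, rowvar=False)               # 256 x 256 sample covariance
rank = np.linalg.matrix_rank(K_g)

# The rank cannot exceed n_images - 1, so K_g cannot be inverted.
print(f"K_g shape: {K_g.shape}, rank: {rank}, full rank needs: {n_pixels}")
```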
Initially the channel mechanism was introduced for the purpose of mimicking
the HVS (figure 10) (K. J. Myers & Barrett, 1987). The proposal was to select
certain frequencies in order to downgrade the ideal-linear observer
performance to match that of human observers. As a consequence, by applying a
small number of selective frequency channels to the whole image dataset, the
image vectors are transformed into channel output vectors with size equal to
the number of channels:
$$
v = U^t g \tag{21}
$$
where 𝑈 is the channel matrix with size 𝑀𝑥𝐼 and 𝑔 is the image dataset with
size 𝑀𝑥𝐾. Here 𝐼 is the number of channels, 𝑀 is the number of pixels and 𝐾
is the number of images in the dataset (not to be confused with the covariance
matrix). The resulting channelized dataset 𝑣 thus has a size of 𝐼𝑥𝐾, and the
covariance matrix of the channelized data has a size of 𝐼𝑥𝐼:
$$
K_v = U^t K_g U \tag{21}
$$
Usually the number of channels used in these channelized model observers is
𝐼 < 10. This way, by using the channelization mechanism in the Hotelling
observer, the requirement for training images decreases to fewer than 100
unique images, which is practically feasible. This new model observer is
called the channelized Hotelling observer (CHO) and is the main topic of
interest in this work. The test statistic can be calculated as follows:
$$
t = w_{CHO}^t\, v = \Delta\bar{v}^t K_v^{-1} v = \Delta\bar{g}^t\, U \left[U^t K_g U\right]^{-1} U^t g \tag{21}
$$

The channel mechanism can be used to change the features the CHO receives
from the image, thus it can be used to tailor the performance estimate to a
specific observer, e.g. human observers. There are numerous channel profiles
available, but they usually fall into one of two categories: efficient and
anthropomorphic channels. Efficient channels are usually combined with a
priori known information about the signals within the images, such as target
shape, size or certain background properties, to bring the detectability score
as close as possible to the ideal one. Anthropomorphic channels include
different functions that aim to mimic human observer detectability scores. A
more in-depth methodological study of all components influencing the CHO
performance is given in the coming chapters.
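
The channelization and template steps described in this section can be sketched as follows, assuming NumPy. The channel set (three centred Gaussians), the toy signal, the white-noise background and the use of the averaged class covariance for 𝐾𝑣 are all illustrative assumptions; they are not the channels or data used in the following chapters.

```python
import numpy as np

def gaussian_channels(side, widths):
    """Hypothetical channels: centred 2D Gaussians, one column each (M x I)."""
    y, x = np.mgrid[:side, :side] - (side - 1) / 2
    r2 = (x ** 2 + y ** 2).ravel()
    U = np.stack([np.exp(-r2 / (2 * s ** 2)) for s in widths], axis=1)
    return U / np.linalg.norm(U, axis=0)        # unit-norm columns

def train_cho(g_absent, g_present, U):
    """Channelized template K_v^-1 * delta_v; training images are rows."""
    v0, v1 = g_absent @ U, g_present @ U        # channel outputs, v = U^t g
    delta_v = v1.mean(axis=0) - v0.mean(axis=0)
    # Assumption: K_v is taken as the average of the two class covariances.
    K_v = 0.5 * (np.cov(v0, rowvar=False) + np.cov(v1, rowvar=False))
    return np.linalg.solve(K_v, delta_v)

def cho_scores(g, U, w):
    """Test statistic t = w^t U^t g for every image (row) in g."""
    return (g @ U) @ w

rng = np.random.default_rng(0)                  # fixed seed, illustration only
side, n = 16, 200
U = gaussian_channels(side, widths=[1.0, 2.0, 4.0])
signal = 2.0 * U[:, 1]                          # toy signal on white noise

def draw(n_images, with_signal):
    noise = rng.normal(size=(n_images, side * side))
    return noise + signal if with_signal else noise

w = train_cho(draw(n, False), draw(n, True), U)
t0 = cho_scores(draw(n, False), U, w)           # independent test images
t1 = cho_scores(draw(n, True), U, w)
d_prime = (t1.mean() - t0.mean()) / np.sqrt(0.5 * (t1.var() + t0.var()))
print(f"I = {U.shape[1]} channels, d' = {d_prime:.2f}")
```

Only a 3𝑥3 channel covariance matrix has to be inverted here, instead of a 256𝑥256 pixel covariance matrix.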
2.3. Deep learning model observers
Deep learning is a subset of the machine learning technique based on the
computation of a multi-layer neural network [Duda R. ‘Pattern Recognition’
2001]. In classical machine learning a mathematical model is set to make
decisions based on input image data. The linear model observers described in
the previous sections are part of machine learning, where the mathematical
model is a linear classifier. The decision making process is usually carried out in
two steps – feature extraction and classification. The channel mechanism
described in subsection 2.2 is the feature extraction for the CHO linear model
observer and the CHO template formed from the covariance matrix and the
expected signal is related to the classifier producing decision variables. The deep
learning approach utilizes multiple linear functions set into different layers to
combine the feature extraction and classification steps in a single algorithm.
Feed-forward convolutional neural networks (CNN) are the simplest deep
learning implementations, with an architecture inspired by the processes
occurring in the human brain. CNNs are formed from a network of connected
artificial neuron functions, such that the output of some neurons serves as
the input for others. An example of such a network is shown in figure 12. The
artificial neuron is a mathematical function that represents a model of
biological neurons (Agatonovic-Kustrin & Beresford, 2000).

Figure 12. Example of feed-forward neural network structure.


In CNNs, each incident input 𝑥𝑖 of a neuron is weighted by a separate weight
𝑤𝑖 , and the weighted inputs are combined and then transformed by a certain
activation function (Nwankpa et al., 2018) (figure 13). In the process of CNN
training, the weights of the neurons are trainable parameters and are
optimized in such a way as to produce the desired model output.

Figure 13. Model of an artificial neuron (image from Agatonovic-Kustrin and
Beresford (Agatonovic-Kustrin & Beresford, 2000))
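
A single artificial neuron can be written in a few lines. The sketch below uses a ReLU activation and an optional bias term; the bias and the example numbers are assumptions added for illustration.

```python
def relu(a):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, a)

def neuron(inputs, weights, bias=0.0, activation=relu):
    """Weighted sum of the inputs x_i with weights w_i, then an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical inputs and weights, for illustration only.
print(neuron([1.0, 2.0], [0.5, 1.0]))    # positive sum passes through
print(neuron([1.0, 2.0], [0.5, -1.0]))   # negative sum is clipped
```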
Usually CNNs are organized with three types of layers: ‘input’, ‘hidden’ and
‘output’ (Karpathy A., 2017). The input layer always has the size of the input
image, as it utilizes the structural information from neighboring pixels. The
hidden layers normally consist of a mixture of convolutional layers with their
activation functions, max-pooling layers and fully connected layers. The
output layer produces the model scores, usually from the output of a fully
connected layer; these scores are the model test statistics used for decision
making.
• The Convolutional layers (CONV) are the main components forming the
CNN. They consist of a set of small filters, each with trainable weights.
The filters are applied to the input data by convolving across the image,
producing response maps. These responses are then passed through a
certain activation function to produce a feature map; a rectified linear
unit (ReLU) (Agarap, 2018) is implemented for the example in figure 12.
The training process changes the filter weights in order to produce the
desired feature map, leading to the correct classification choice.
• The Max-Pooling layers (MP) reduce the size and number of variables of
the feature maps produced by the convolutional layers. They take the
maximum value over certain features, selected by a kernel. Each time
the kernel moves across the feature map a new maximum value is stored
in the max-pooling output feature map.
• The Fully-Connected (Dense) layer (FC) mimics neurons that are fully
pairwise connected to all features in the previous layer. Each connection
is set by a trainable weight, followed by an activation function, usually a
Softmax function (Nwankpa et al., 2018).
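
The three layer types can be chained into a minimal forward pass. The sketch below assumes NumPy, a single 3𝑥3 filter, 2𝑥2 max-pooling and a two-class dense layer with random, untrained weights; it illustrates the data flow only and is not the network of figure 12.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over the image (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(feature_map, size=2):
    """Keep the maximum of every non-overlapping size x size block."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    blocks = feature_map[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

def softmax(a):
    e = np.exp(a - a.max())          # subtract the maximum for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)       # random, untrained weights
image = rng.normal(size=(8, 8))      # hypothetical input ROI
kernel = rng.normal(size=(3, 3))     # one convolutional filter

features = max_pool(relu(conv2d_valid(image, kernel)))   # 8x8 -> 6x6 -> 3x3
W_dense = rng.normal(size=(2, features.size))            # two output classes
class_scores = softmax(W_dense @ features.ravel())
print(features.shape, class_scores)
```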
In the training stage, small batches of images are passed through the CNN to
estimate their outputs. Given the images’ ground truth, the weights of all
neurons of the network are updated in order to match the CNN output to the
actual image class. This is usually performed using a loss function, which
calculates the deviation from the ground truth for each estimate, and an
algorithm called an optimizer, which updates the CNN weights so as to reduce
this loss. Once the network is trained, all trainable weights are fixed and
the DL classifier can be applied. Here, in a similar way to the linear
observers, the DL model observer receives images as input and produces output
values, which can be used for the purpose of image quality estimation in a way
similar to the one described in the previous sections. More details on the
deep learning topic and two applications will be given in chapter 5.
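
The interplay of batches, loss function and optimizer can be sketched on the smallest possible model. The example below, assuming NumPy, trains a single sigmoid neuron (not a CNN) with a binary cross-entropy loss and plain mini-batch gradient descent on hypothetical four-pixel images; the learning rate, batch size and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed, illustration only
n, m = 200, 4                                     # images per class, pixels
x_absent = rng.normal(size=(n, m))                # hypothetical signal absent
x_present = rng.normal(size=(n, m)) + 1.0         # signal adds 1 to every pixel
X = np.vstack([x_absent, x_present])
y = np.r_[np.zeros(n), np.ones(n)]                # ground truth labels

def predict(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid output in (0, 1)

def loss(X, y, w, b, eps=1e-12):
    """Binary cross-entropy between predictions and ground truth."""
    p = predict(X, w, b)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w, b = np.zeros(m), 0.0                           # trainable parameters
learning_rate, batch_size = 0.1, 32
initial_loss = loss(X, y, w, b)

for epoch in range(20):
    order = rng.permutation(len(y))               # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        error = predict(X[idx], w, b) - y[idx]    # gradient of the loss w.r.t. the logit
        w -= learning_rate * X[idx].T @ error / len(idx)
        b -= learning_rate * error.mean()

final_loss = loss(X, y, w, b)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```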
2.4. Preparation for image quality estimation
The observer reading studies in this work are based on a 4-AFC paradigm. In
order to accomplish a successful 4-AFC reading experiment, there are certain
practical requirements to follow. These include requirements for the image
dataset, the human reading and the model observer reading.
1. There should be no unintentional difference between the signal present
and signal absent image sets. The observer should not be able to detect
any cue, which could bias the detection task, or eliminate one of the four
candidate images as being signal present. For example different contrast
levels or image size for certain images, or visible patterns not present in
all of the dataset should be avoided.
2. The experiment should only test the observer response to a certain
stimulus, so the task should be ‘signal-known exactly’, i.e. the target
location, size and shape within the signal present image are known to
the observer prior to the start of the reading.
3. Prior to the reading experiment the human observers should perform
some training on a separate dataset with feedback after each choice. The
results of this pre-study are not stored for further consideration. This
ensures that the observers are familiar with the task and the controlling
software, which will potentially reduce the effect of the learning curve
and improve the stability and the certainty of their choice for the real
reading study.
4. The number of signal absent images should be at least three times the
number of signal present images. This prevents a signal absent image from
being shown to the observer more than once, which could be a cue for the
observer.
5. The ROI cropping algorithm should be executed carefully with respect
to ROI overlapping. While overlapping is generally allowed, it should be
kept at minimum given the structure of the object and background. A
specific part of a structured background can appear as a cue for the
observer.
6. The observer reading conditions should be as equal as possible to the
reading conditions of the radiologists in their daily practice. This
requires a diagnostic monitor and a reading room with low background
luminance.
7. In order to improve the accuracy of the different human observers, the
distance from the observer’s eyes to the screen should be kept as constant
as possible during the reading time for a given observer. If the distance
varies throughout the reading period, the perceived frequency
properties of the image change, which could lead to a performance
different from the expected one.
8. The training and the reading datasets for both model and human
observers should ideally be made out of separate X-ray acquisitions. If
this condition is not satisfied, it will cause a dataset bias, where images
seen by the observer more than once can act as a hint that reduces the
number of alternatives in the 4-AFC test or even points out the target image.
Requirements and limitations for a successful reading experiment using the CHO
are extensively studied in chapter 1.
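
Requirement 4 above can be made concrete with a small trial-assembly routine. The sketch below, in plain Python, pairs every signal present ROI with three signal absent ROIs that are never reused and randomizes the target position; the identifiers, the dictionary layout and the function name are hypothetical and do not describe the reading software used in this work.

```python
import random

def build_4afc_trials(present_ids, absent_ids, rng):
    """Pair each signal present ROI with three unused signal absent ROIs."""
    if len(absent_ids) < 3 * len(present_ids):
        raise ValueError("need at least three signal absent ROIs per signal present ROI")
    pool = list(absent_ids)
    rng.shuffle(pool)                              # random assignment of distractors
    trials = []
    for k, target in enumerate(present_ids):
        alternatives = [target] + pool[3 * k:3 * k + 3]
        rng.shuffle(alternatives)                  # random target position
        trials.append({"images": alternatives,
                       "target_position": alternatives.index(target)})
    return trials

rng = random.Random(0)                             # fixed seed, illustration only
present = [f"sp{k}" for k in range(5)]             # hypothetical ROI identifiers
absent = [f"sa{k}" for k in range(15)]
trials = build_4afc_trials(present, absent, rng)
for trial in trials:
    print(trial)
```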

THESIS OBJECTIVES AND WORK PLAN


The main aim of this project is the development and verification of a model
observer methodology to assess the image quality of 2D digital mammography
and digital breast tomosynthesis. To this end, a previously developed
structured physical phantom will be used. Cockmartin et al. (2017) showed
that the 3D structured phantom with a background of spheres and
3D printed lesions (also called the ‘L1’ phantom) could be used successfully to
evaluate image quality on different DBT and FFDM systems under different
scanning conditions using human observers. We hypothesize that the phantom
evaluation can be achieved using model observers, which could allow for quicker
and more reliable observer performance estimations. In this thesis, different
approaches with channelized Hotelling model observers will be trained, tuned
and tested against human observer results in order to develop a practical tool for
future FFDM and DBT optimization and quality control. Initially, the applications
will be limited to the mass simulating lesions without spicules and the
calcification clusters of the 3D structured phantom. Later the algorithm will be
tested on more realistic virtual clinical trial images.
As a first step, we had to get familiar with the CHO model and find the most
practically applicable algorithm for image quality estimation with mass-like
lesions from the ‘L1’ phantom. Preliminary work showed that a CHO formed
solely by following the literature could not match human observers reading the
same DBT image dataset. This forced us to study all components and conditions
contributing to the CHO performance estimation for mass-like lesions to achieve
a good anthropomorphic algorithm (Chapter 1). The achieved anthropomorphic
CHO method, developed using a Siemens DBT system, was then applied to other
DBT vendors, where it showed poor correlation with human results. In order to
achieve good generalization to a wider range of DBT systems, the channel
mechanism was improved to select the same frequency range regardless of the
reconstruction pixel size of the different DBT systems, and the first comparative
studies with the CHO on all commercially available systems were performed
(Chapter 2). The next step focused on the calcification clusters. With these
targets largely differing from mass lesions, a more innovative CHO approach was
required. The calcification clusters in the ‘L1’ phantom consist of 5 calcification
particles forming the target. A two-layer CHO algorithm was developed for the
cluster detection in DBT and FFDM. First, the particle locations were found by
scanning areas around the expected locations. This was followed by a
classification step, in which the separate particle test statistics were
combined into a single cluster test statistic (Chapter 3). Exactly the same
algorithm was also tested for
different DBT vendors and showed good correlation without a requirement for
additional tuning (Chapter 4). With a CHO validated to work in different
scanning conditions and multiple DBT vendors, the CHO was tested on different
3D structured DBT test images. The virtual clinical trial image dataset obtained
from the OPTIMAM simulation framework consisted of 2D FFDM images and
DBT images of mass lesions. The already developed CHO algorithm was
redesigned to work for 2D images instead of a DBT stack and the observer
performance was improved by introducing a volumetric CHO for the DBT images
(Chapter 5). Updates of the DBT systems over time showed that the
algorithm from Chapter 2 did not generalize well enough. To improve the CHO,
we hypothesized that the ultimate channel algorithm for mass-like lesions can
be estimated using deep learning. Thus a deep learning CHO was developed and,
in parallel, a separate deep learning model observer. Both deep learning
methods were trained on 4-AFC examples from humans, which resulted in better
generalization across a wider range of DBT vendors (Chapter 6).
Chapter 1: Systematic approach to a channelized Hotelling
model observer implementation for a physical phantom
containing mass-like lesions: application to digital breast
tomosynthesis
In order to develop a robust CHO algorithm, all aspects involved in the
computation of the CHO decision variable were studied separately. A total of 108
DBT acquisitions of a structured phantom, acquired at 3 dose levels, were read
by human observers. A channelized Hotelling observer with three different
channel types (Gabor, Laguerre-Gauss and Difference of Gaussians) was tuned
to match the human observer scores. With regard to the CHO template, various
methods for generating the expected signal were studied along with the
influence of the number of training images used to form the covariance matrix.
The impact of bias in the training process on the observer template was evaluated
together with human and model observer reproducibility. The resulting CHO,
matching the human observer scores most closely, had 8 Gabor channels with
tuned phase, orientation and frequency, and used an observer template
generated from the training image dataset. The human and model observer
reproducibility was similar and the correlation between the two observers for
the dose levels study exceeded 0.95.
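The computation of the CHO decision variable summarized above can be sketched as follows. This is a minimal illustration of a standard channelized Hotelling observer with Gabor channels, not the tuned implementation of Chapter 1: the channel frequencies, orientations and width, the white-noise backgrounds and the Gaussian signal are all assumptions chosen for the example, and NumPy is assumed to be available.

```python
import numpy as np


def gabor_channel(size, freq, theta, phase, fwhm):
    """Gabor channel on a size x size grid; freq in cycles/pixel."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size] - half
    envelope = np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / fwhm**2)
    carrier = np.cos(
        2.0 * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)) + phase
    )
    return envelope * carrier


def train_cho(sp_rois, sa_rois, channels):
    """Return channel matrix U (pixels x channels) and Hotelling template w."""
    U = channels.reshape(len(channels), -1).T
    v_sp = sp_rois.reshape(len(sp_rois), -1) @ U
    v_sa = sa_rois.reshape(len(sa_rois), -1) @ U
    # Average of the two class covariance matrices in channel space.
    S = 0.5 * (np.cov(v_sp, rowvar=False) + np.cov(v_sa, rowvar=False))
    w = np.linalg.solve(S, v_sp.mean(axis=0) - v_sa.mean(axis=0))
    return U, w


def decision_variables(rois, U, w):
    """Scalar test statistic t = w^T U^T g for every ROI g."""
    return rois.reshape(len(rois), -1) @ U @ w


# Synthetic example (assumed data): white noise plus a Gaussian blob.
rng = np.random.default_rng(0)
size = 32
y, x = np.mgrid[0:size, 0:size] - (size - 1) / 2.0
signal = 0.5 * np.exp(-(x**2 + y**2) / (2 * 4.0**2))

# 8 illustrative channels: 2 frequencies x 4 orientations, phase 0.
channels = np.stack([
    gabor_channel(size, f, th, 0.0, 16.0)
    for f in (1 / 16, 1 / 8)
    for th in np.arange(4) * np.pi / 4
])

sa_train = rng.normal(size=(200, size, size))
sp_train = rng.normal(size=(200, size, size)) + signal
U, w = train_cho(sp_train, sa_train, channels)

# Independent test images, kept separate from the training images.
sa_test = rng.normal(size=(200, size, size))
sp_test = rng.normal(size=(200, size, size)) + signal
t_sp = decision_variables(sp_test, U, w)
t_sa = decision_variables(sa_test, U, w)
```

The template is estimated from training images only and then applied to separate test images, mirroring the train and test separation studied in this chapter.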
Chapter 2: Real space channelization for generic DBT system
image quality evaluation with channelized Hotelling observer
A common task in routine quality control (QC) of X-ray systems is to periodically
assess the image quality provided by the system. For DBT this includes systems
of different vendors. These systems have different intrinsic image properties,
due to the different geometries and reconstruction algorithms used. A CHO
selected for routine applications should provide a fair evaluation of any system
in comparison with other systems. The goal of this work was to test the CHO
applicability on DBT systems different from the one studied in chapter 1, without
further tuning. For this purpose DBT acquisitions on GE SenoClaire 3D, Giotto
Class, Fujifilm AMULET Innovality, Philips MicroDose, Siemens Inspiration and
Hologic Dimensions systems were taken. CHO readings for different dose levels
and reconstruction algorithms, as well as CHO reproducibility, were tested
against human scores. The CHO algorithm was improved and retuned by making the
channels account for pixel size differences: the Gabor channel frequencies,
width and orientation were generated in real space. This showed excellent
overall results: the linear correlation coefficients between human and model
observers were found to be between 0.87 and 0.99 for all tested conditions.
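The idea of generating the channels in real space can be sketched as a unit conversion: the channel parameters are fixed in millimetres and translated to pixel units separately for each system. The pixel sizes, the channel frequency and the width below are hypothetical values chosen for illustration; they are not the tuned parameters of this chapter.

```python
import math


def gabor_in_pixels(freq_cyc_per_mm, fwhm_mm, pixel_mm):
    """Convert real-space Gabor parameters to pixel units for one system."""
    freq_cyc_per_pixel = freq_cyc_per_mm * pixel_mm
    fwhm_pixels = fwhm_mm / pixel_mm
    return freq_cyc_per_pixel, fwhm_pixels


# Hypothetical reconstruction pixel sizes (mm) for two systems.
pixel_sizes_mm = {"system_A": 0.085, "system_B": 0.100}

for name, pixel_mm in pixel_sizes_mm.items():
    f_px, w_px = gabor_in_pixels(
        freq_cyc_per_mm=0.5, fwhm_mm=4.0, pixel_mm=pixel_mm
    )
    # Converting back recovers the same physical frequency on every system.
    assert math.isclose(f_px / pixel_mm, 0.5)
    print(f"{name}: {f_px:.4f} cycles/pixel, FWHM {w_px:.1f} pixels")
```

With this conversion the same physical frequency band is selected on every system, whereas a channel defined directly in cycles per pixel would shift in physical frequency whenever the reconstruction pixel size changes.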
Chapter 3: Calcification cluster detection in 2D FFDM and DBT
with a channelized Hotelling observer
In addition to the detection of mass lesions, the efficiency of a system for
detecting microcalcification clusters is a crucial aspect of system performance.
The appearance of microcalcifications in mammograms significantly differs from
that of mass lesions, as the clusters are formed from a number of high contrast
miniature calcifications, whereas masses are much larger, yet appear with
much lower contrast. Therefore the CHO detection approach has to be changed to
account for the multiple objects forming the task. The location of the
calcification particles in our phantom is not known exactly, and we found that a
localization step had to be included in the CHO prior to the detection. To do so, a
two-layer CHO was developed. The first layer consisted of a localizing CHO that
identified the most conspicuous calcifications using two Laguerre-Gauss
channels. The most probable locations of the three most visible calcification
particles (the highest decision variables) were passed to the second layer. Then
a CHO with 8 Gabor channels estimated the detectability for the calcification
cluster by placing the sensitive spot at the locations estimated from the previous
step.
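The hand-over between the two layers can be sketched as follows. The first-layer scores are assumed to be available as a mapping from candidate locations to decision variables, and the per-particle statistics of the second layer are combined by summation; both the data layout and the summation rule are assumptions made for this illustration, since the combination rule is not specified in this summary.

```python
import heapq


def most_conspicuous(first_layer_scores, k=3):
    """Return the k locations with the highest first-layer decision variable.

    first_layer_scores: dict mapping (row, col) -> localizing-CHO statistic.
    """
    return heapq.nlargest(k, first_layer_scores, key=first_layer_scores.get)


def cluster_statistic(locations, second_layer_score):
    """Combine per-particle statistics; summation is an assumption here."""
    return sum(second_layer_score(loc) for loc in locations)


# Made-up first-layer scores for five candidate particle locations.
scores = {
    (10, 12): 0.4,
    (11, 30): 2.1,
    (25, 8): 1.7,
    (26, 27): 0.9,
    (18, 19): 1.2,
}
locations = most_conspicuous(scores)
# The first-layer scores stand in for the second-layer CHO in this example.
t_cluster = cluster_statistic(locations, lambda loc: scores[loc])
```

In a real application the second argument would be the Gabor-channel CHO evaluated with its sensitive spot placed at each selected location.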
An application on the 2D FFDM modality examined the ability of the CHO to
predict human observer results at 5 different tube voltage settings. For each tube
voltage, ten 2D images of the spheres phantom were taken, and signal present and
signal absent ROIs were extracted for a 4-AFC observer study. The
results for the 2D modality showed a correlation between CHO and human scores
higher than 0.94, and both observers confirmed that the change in tube voltage
in this range does not affect calcification detectability in this test object.
For the 3D modality 217 DBT scans were acquired at 3 dose levels to test the
human and model observer calcification detection reproducibility. Both image
sets were acquired using a Siemens Inspiration x-ray system. For all three dose
levels the correlation between CHO and humans was higher than 0.95 and the
CHO showed better reproducibility than human readers for the smaller
calcifications.
Chapter 4: Channelized Hotelling observer for multi-vendor
breast tomosynthesis image quality estimation: detection of
calcification clusters in an anthropomorphic phantom
The purpose of this study is to test the applicability of the previously developed
model observer for detection of calcification clusters in DBT (Chapter 3) on five
different types of DBT scanners.
The spheres phantom was scanned 180 times on Fujifilm Amulet Innovality, GE
Senographe Pristina, IMS Giotto Class, Hologic 3Dimensions and Siemens
Mammomat Revelation at three dose levels. The phantom images were then
prepared for a 4-AFC reading study, in which six medical physicists
participated as observers. The previously developed two-layer CHO algorithm
for calcification clusters was applied to the same image dataset in a 4-AFC
paradigm in order to compare with the human observer results.
The results show that the CHO can successfully approximate the human observer
scores without further tuning. The Pearson correlation coefficient was higher than 0.94
for all reading sessions. This suggests that the algorithm can be used for future
image quality studies.
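The comparison between model and human observers in a 4-AFC paradigm can be sketched in two steps: the model scores a trial as correct when the signal present ROI yields the largest decision variable, and the resulting percent correct values are then correlated with the human scores across conditions. All numbers below are made up for illustration and are not results from this thesis.

```python
import math


def percent_correct_4afc(trials):
    """trials: list of (t_signal, (t_absent_1, t_absent_2, t_absent_3))."""
    correct = sum(1 for t_sp, t_sa in trials if t_sp > max(t_sa))
    return correct / len(trials)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Four made-up trials: one signal present and three signal absent statistics.
trials = [
    (1.2, (0.1, 0.5, -0.3)),
    (0.2, (0.4, 0.0, 0.1)),
    (2.0, (1.1, 0.3, 0.9)),
    (0.7, (0.6, -0.2, 0.5)),
]
pc_model = percent_correct_4afc(trials)

# Made-up percent correct values per condition, for illustration only.
human = [0.45, 0.62, 0.80, 0.93]
model = [0.50, 0.66, 0.85, 0.95]
r = pearson(human, model)
```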
Chapter 5: Channelized Hotelling observer for breast virtual
clinical trials: application to DBT and FFDM
The gold standard procedure for breast imaging system performance evaluation
is a clinical study or trial. However, a thorough system evaluation and
optimization are often not feasible due to the large number of parameters
influencing the image quality. Recently, virtual clinical trials (VCTs) have
received close attention as an efficient method of approximating or even
replacing clinical studies (Gong et al., 2006; Badano et al., 2018). VCTs are based
on computerized models of the human anatomy, image acquisition, display and
processing, which mimic the clinical reality sufficiently closely that the results
correlate with those from a standard clinical trial. However, these studies often
involve a human reading step, which represents a major bottleneck in the VCT
workflow. In order to take full advantage of the VCT workflow, a model observer
can be used instead of human readers.
The impetus behind this study was to develop a CHO algorithm that correlated
with human reader results in a VCT study of low contrast mass lesions. For this
purpose, DBT and 2D FFDM images were generated by Elangovan et al.
(2018) using the VCT platform of the Optimam2 project, of
which KU Leuven was also a partner. Breast models that contain realistic
anatomical breast structures and a set of validated synthetic mass lesions were
used to create the detection tasks across a range of target sizes and contrast
levels for both DBT and 2D FFDM. The VCT platform also included modelling
tools used to simulate the image formation and degradation process for the
Hologic Selenia Dimensions 3D system including system geometry, noise, blur
and scatter. The human reading was done in a 4-AFC paradigm. Image sets were
made available by the Optimam2 project partners.
For the DBT images a new volumetric CHO (vCHO) was implemented that
incorporates a 3D channel mechanism applied to the DBT volumes of interest
(VOI). In this way the vCHO uses the correlation between the DBT planes as
additional information, which produces higher performance than the
previous CHO method. For the 2D modality, the CHO from Chapter 2 was
redesigned for use with 2D anthropomorphic images. The results for the 2D
FFDM images showed correlation between the 2D CHO and human observers
higher than 0.97 for all lesion sizes and contrast levels, and for the DBT dataset,
the correlation between vCHO and humans exceeded 0.93.
Chapter 6: Deep learning applications
The purpose of this study is to test the applicability of deep learning algorithms
for image quality evaluation using the mass-like lesions of the spheres phantom.
The aim is twofold: first, to develop a deep learning channelized Hotelling
observer (DL-CHO) and second, to develop a complete ResNet18 (He et al., 2015)
deep learning network for the same purpose.
Acquisitions of the spheres phantom over 4 years of testing and validation were
collected. This resulted in 324 DBT scans from 6 different DBT systems at 3
different dose levels with their corresponding 4-AFC human observer results,
and 270 acquisitions from 7 DBT systems at 3 dose levels without human
readout. The exact configuration of 4-AFC images, i.e. 1 signal present and 3
signal absent images, seen by each human observer was stored with the specific
human response, and then used for training of the DL algorithm. From all images
and human results 28000 examples were used for training and 1664 examples
used for validation.
The DL-CHO was developed using a single convolutional layer with five kernels
functioning as feature-extracting channels, after which a standard CHO
algorithm was used for classification. The ResNet18 observer consisted of 18
layers incorporating the feature extraction and classification. Both deep learning
applications used the same training, testing and validation image data. In
this study the DL-CHO performed better than the ResNet18 observer, with
correlation to human observer results higher than 0.91 for the DL-CHO and 0.90
for the ResNet18.

REFERENCES
Abbey, C. K., & Barrett, H. H. (2001). Human- and model-observer performance in ramp-
spectrum noise: effects of regularization and object variability. Journal of the Optical
Society of America A, 18(3), 473–488.
Abdurahman, S., Dennerlein, F., Jerebko, A., Fieselmann, A., & Mertelmeier, T. (2014).
Optimizing high resolution reconstruction in digital breast tomosynthesis using
filtered back projection. Lecture Notes in Computer Science (Including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8539
LNCS, 520–527.
Abdurahman, S., Jerebko, A., Mertelmeier, T., Lasser, T., & Navab, N. (2012). Out-of-Plane
Artifact Reduction in Tomosynthesis Based on Regression Modeling and Outlier
Detection (pp. 729–736). Springer, Berlin, Heidelberg.
Agarap, A. F. (2018). Deep Learning using Rectified Linear Units (ReLU).
Agatonovic-Kustrin, S., & Beresford, R. (2000). Basic concepts of artificial neural network
(ANN) modeling and its application in pharmaceutical research. In Journal of
Pharmaceutical and Biomedical Analysis (Vol. 22, Issue 5, pp. 717–727).
Alberto Donzelli. (2012). The benefits and harms of breast cancer screening: an
independent review. The Lancet, 380(9855), 1778–1786.
Alnowami, M., Mills, G., Young, K., Dance, D. R., Awais, M., Halling-Brown, M. D., Wells, K.,
Elangovan, P., & Patel, M. (2018). A deep learning model observer for use in
alternative forced choice virtual clinical trials. Medical Imaging 2018: Image
Perception, Observer Performance, and Technology Assessment, March, 25.
Ba, A., Abbey, C. K., Baek, J., Han, M., Bouwman, R. W., Balta, C., Brankov, J., Massanes, F.,
Gifford, H. C., Hernandez‐Giron, I., Veldkamp, W. J. H., Petrov, D., Marshall, N.,
Samuelson, F. W., Zeng, R., Solomon, J. B., Samei, E., Timberg, P., Förnvik, H., …
Bochud, F. O. (2018). Inter‐laboratory comparison of channelized hotelling
observer computation. Medical Physics, 45(7), 3019–3030.
Badano, A., Graff, C. G., Badal, A., Sharma, D., Zeng, R., Samuelson, F. W., Glick, S. J., & Myers,
K. J. (2018). Evaluation of Digital Breast Tomosynthesis as Replacement of Full-
Field Digital Mammography Using an In Silico Imaging Trial. JAMA Network Open,
1(7), e185474.
Bakic, P. R., Pokrajac, D. D., De Caro, R., & Maidment, A. D. A. (2014). Realistic simulation
of breast tissue microstructure in software anthropomorphic phantoms. Lecture
Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), 8539 LNCS(3), 348–355.
Barrett, H. H. (1990). Objective assessment of image quality: effects of quantum noise and
object variability. Journal of the Optical Society of America A, 7(7), 1266.
Barrett, H. H., & Myers, K. J. (2004). Foundations of image science. ISBN: 978-0-471-15300-
9.
Barrett, H. H., Myers, K. J., Hoeschen, C., Kupinski, M. A., & Little, M. P. (2015a). Task-based
measures of image quality and their relation to radiation dose and patient risk.
Physics in Medicine and Biology, 60(2), R1-75.
Barrett, H. H., Myers, K. J., Hoeschen, C., Kupinski, M. A., & Little, M. P. (2015b). Task-based
measures of image quality and their relation to radiation dose and patient risk.
Physics in Medicine and Biology, 60(2), R1–R75.
Barrett, H. H., Myers, K. J., & Rathee, S. (2004). Foundations of Image Science. In Medical
Physics.
Barrett, H. H., Yao, J., Rolland, J. P., & Myers, K. J. (1993). Model observers for assessment
of image quality. Proceedings of the National Academy of Sciences of the United States
of America, 90(21), 9758–9765.
Berns, E., Becker, J., & Barke, L. (2016). Digital Mammography Quality Control Manual.
ACR American College of Radiology.
Bick, U., & Diekmann, F. (2007). Digital mammography: what do we and what don’t we
know? European Radiology, 17(8), 1931–1942.
Bird, R. E., Wallace, T. W., & Yankaskas, B. C. (1992). Analysis of cancers missed at
screening mammography. Radiology, 184(3), 613–617.
Bluekens, A. M. J., Karssemeijer, N., Beijerinck, D., Deurenberg, J. J. M., van Engen, R. E.,
Broeders, M. J. M., & den Heeten, G. J. (2010). Consequences of digital
mammography in population-based breast cancer screening: initial changes and
long-term impact on referral rates. European Radiology, 20(9), 2067–2073.
Bochud, F., Abbey, C., & Eckstein, M. (1999). Statistical texture synthesis of
mammographic images with super-blob lumpy backgrounds. Optics Express, 4(1),
33–42.
Bochud, F. O., Valley, J. F., Verdun, F. R., Hessler, C., & Schnyder, P. (1999). Estimation of
the noisy component of anatomical backgrounds. Medical Physics, 26(7), 1365–
1370.
Bosse, S., Maniry, D., Müller, K.-R., Wiegand, T., & Samek, W. (n.d.). Deep Neural Networks
for No-Reference and Full-Reference Image Quality Assessment.
Bouwman, R. W., Goffi, M., van Engen, R. E., Broeders, M. J. M., Dance, D. R., Young, K. C., &
Veldkamp, W. J. H. (2017). Can the channelized Hotelling observer including aspects
of the human visual system predict human observer performance in
mammography? Physica Medica, 33, 95–105.
Brankov, J. G. (2013). Evaluation of the channelized Hotelling observer with an internal-
noise model in a train-test paradigm for cardiac SPECT defect detection. Physics in
Medicine and Biology, 58, 7159–7182.
Burgess, A. E., Jacobson, F. L., & Judy, P. F. (2001). Human observer detection experiments
with mammograms and power-law noise. Med. Phys., 28(4), 419–437.
Burgess, A. E., Wagner, R. F., Jennings, R. J., & Barlow, H. B. (1981). Efficiency of human
visual signal discrimination. Science, 214(4516), 93–94.
Castella, C., Abbey, C. K., Eckstein, M. P., Verdun, F. R., Kinkel, K., & Bochud, F. O. (2007).
Human linear template with mammographic backgrounds estimated with a genetic
algorithm. Journal of the Optical Society of America A: Optics and Image Science, and
Vision, 24(12), B1–B12.
Castella, C., Eckstein, M. P., Abbey, C. K., Kinkel, K., Verdun, F. R., Saunders, R. S., Samei, E.,
& Bochud, F. O. (2009). Mass detection on mammograms: influence of signal shape
uncertainty on human and model observers. Journal of the Optical Society of
America A, 26(2), 425–436.
Castella, C., Kinkel, K., Descombes, F., Eckstein, M. P., Sottas, P.-E., Verdun, F. R., & Bochud,
F. O. (2008). Mammographic texture synthesis: second-generation clustered lumpy
backgrounds using a genetic algorithm. Optics Express, 16(11), 7595.
Chakraborty, D. P., & Winter, L. H. L. (1990). Free-response methodology: Alternate
analysis and a new observer-performance experiment. Radiology, 174(3), 873–881.
Chen, M., Bowsher, J. E., Baydush, A. H., Gilland, K. L., DeLong, D. M., & Jaszczak, R. J. (2002).
Using the Hotelling observer on multislice and multiview simulated SPECT
myocardial images. IEEE Transactions on Nuclear Science, 49 I(3), 661–667.
Ciatto, S., Houssami, N., Bernardi, D., Caumo, F., Pellegrini, M., Brunelli, S., Tuttobene, P.,
Bricolo, P., Fantò, C., Valentini, M., Montemezzi, S., & Macaskill, P. (2013).
Integration of 3D digital mammography with tomosynthesis for population breast-
cancer screening (STORM): A prospective comparison study. The Lancet Oncology,
14(7), 583–589.
Cockmartin, L., Bosmans, H., & Marshall, N. W. (2013). Comparative power law analysis of
structured breast phantom and patient images in digital mammography and breast
tomosynthesis. Medical Physics, 40(8), 081920.
Cockmartin, L., Marshall, N. W., & Bosmans, H. (2014). Comparison of SNDR, NPWE Model
and Human Observer Results for Spherical Densities and Microcalcifications in Real
Patient Backgrounds for 2D Digital Mammography and Breast Tomosynthesis (pp.
134–141). Springer, Cham.
Cockmartin, L., Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Ongeval, C. Van, &
Fredenberg, E. (2017). Design and application of a structured phantom for
detection performance comparison between breast tomosynthesis and digital
mammography. Physics in Medicine and Biology, 62(3), 15.
Das, M., & Gifford, H. C. (2011). Comparison of model-observer and human-observer
performance for breast tomosynthesis: effect of reconstruction and acquisition
parameters (N. J. Pelc, E. Samei, & R. M. Nishikawa (eds.); Vol. 7961, p. 796118).
International Society for Optics and Photonics.
Ebrahimi, M. S., & Abadi, H. K. (2018). Study of Residual Networks for Image Recognition.
EC. (2012). European Commission, Radiation Protection No 162, Criteria for Acceptability of
Medical Radiological equipment used in diagnostic radiology, nuclear medicine
and radiotherapy. In Radiology: Vol. RADIATION.
Eckstein, M. P. (2011). Visual search: A retrospective. Journal of Vision, 11(5), 1–36.
Eckstein, M. P., Abbey, C. K., & Whiting, J. S. (1998). Human vs model observers in anatomic
backgrounds. Proc. SPIE, 3340, 16–26.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., Cooke, V., Wilkinson, L., Given-
Wilson, R. M., Wallis, M. G., & Wells, K. (2017). Design and validation of realistic
breast models for use in multiple alternative forced choice virtual clinical trials.
Physics in Medicine and Biology, 62(7), 2778–2794.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., & Wells, K. (2018). Lesion
detectability in 2D-mammography and digital breast tomosynthesis using different
targets and observers. Physics in Medicine and Biology, 63(9), 1–15.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., & Wells, K. (2017). Using non-
specialist observers in 4AFC human observer studies (T. G. Flohr, J. Y. Lo, & T. Gilat
Schmidt (eds.); p. 1013256).
Elangovan, P., Warren, L. M., Mackenzie, A., Rashidnasab, A., Diaz, O., Dance, D. R., Young,
K. C., Bosmans, H., Strudley, C. J., & Wells, K. (2014). Development and validation of
a modelling framework for simulating 2D-mammography and breast
tomosynthesis images. Physics in Medicine and Biology, 59(15), 4275–4293.
European commission. (2019). Screening ages and frequencies. European Breast Cancer
Guidelines.
FDA. (2002). MQSA. Mammography Quality Standards Act Regulations. Mammography,
21(CFR PART 900).
Ferlay, J., Autier, P., Boniol, M., Heanue, M., Colombet, M., & Boyle, P. (2007). Estimates of
the cancer incidence and mortality in Europe in 2006. Annals of Oncology, 18(3),
581–592.
Ferlay, J., Colombet, M., Soerjomataram, I., Dyba, T., Randi, G., Bettio, M., Gavin, A., Visser,
O., & Bray, F. (2018). Cancer incidence and mortality patterns in Europe: Estimates
for 40 countries and 25 major cancers in 2018. In European Journal of Cancer (Vol.
103, pp. 356–387). Elsevier Ltd.
Ferreira, P., Baptista, M., Di Maria, S., & Vaz, P. (2016). Cancer risk estimation in Digital
Breast Tomosynthesis using GEANT4 Monte Carlo simulations and voxel phantoms.
Physica Medica, 32(5), 717–723.
Fetterly, K. A., & Favazza, C. P. (2016). Direct estimation and correction of bias from
temporally variable non-stationary noise in a channelized Hotelling model
observer. Physics in Medicine and Biology, 61(15), 5606–5620.
Gallas, B. D. (2003). Variance of the channelized-hotelling observer from a finite number of
trainers and testers (D. P. Chakraborty & E. A. Krupinski (eds.); p. 100).
Gallas, B. D., & Barrett, H. H. (2003). Validating the use of channels to estimate the ideal
linear observer. Journal of the Optical Society of America. A, Optics, Image Science,
and Vision, 20(9), 1725–1738.
Gennaro, G., Hendrick, R. E., Ruppel, P., Chersevani, R., Di Maggio, C., La Grassa, M.,
Pescarini, L., Polico, I., Proietti, A., Baldan, E., Bezzon, E., Pomerri, F., & Muzzio, P. C.
(2013). Performance comparison of single-view digital breast tomosynthesis plus
single-view digital mammography with two-view digital mammography. European
Radiology, 23(3), 664–672.
Gifford, H. C., Karbaschi, Z., Banerjee, K., & Das, M. (2017). Visual-search models for
location-known detection tasks. 1013612(March 2017), 1013612.
Gifford, H. C., Liang, Z., & Das, M. (2016). Visual-search observers for assessing
tomographic x-ray image quality. Medical Physics, 43(3).
Glasziou, P., & Houssami, N. (2011). The evidence base for breast cancer screening.
Preventive Medicine, 53(3), 100–102.
Gong, H., Yu, L., Leng, S., Dilger, S. K., Ren, L., Zhou, W., Fletcher, J. G., & McCollough, C. H.
(2019). A deep learning- and partial least square regression-based model observer
for a low-contrast lesion detection task in CT. Medical Physics, 46(5), 2052–2063.
Gong, X., Glick, S. J., Liu, B., Vedula, A. A., & Thacker, S. (2006). A computer simulation study
comparing lesion detection accuracy with digital mammography, breast
tomosynthesis, and cone-beam CT breast imaging. Medical Physics, 33(4), 1041–
1052.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics (Wiley & So).
Gur, D., Abrams, G. S., Chough, D. M., Ganott, M. A., Hakim, C. M., Perrin, R. L., Rathfon, G. Y.,
Sumkin, J. H., Zuley, M. L., & Bandos, A. I. (2009). Digital Breast Tomosynthesis:
Observer Performance Study. American Journal of Roentgenology, 193(2), 586–591.
Hadjipanteli, A., Elangovan, P., Looney, P., Mackenzie, A., Wells, K., Dance, D. R., & Young,
K. C. (2016). Detection of Microcalcification Clusters in 2D-Mammography and
Digital Breast Tomosynthesis and the Relation to the Standard Method of Measuring
Image Quality (pp. 217–221). Springer, Cham.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
Hofvind, S., Holen, Å. S., Aase, H. S., Houssami, N., Sebuødegård, S., Moger, T. A., Haldorsen,
I. S., & Akslen, L. A. (2019). Two-view digital breast tomosynthesis versus digital
mammography in a population-based breast cancer screening programme (To-Be):
a randomised, controlled trial. The Lancet Oncology, 20(6), 795–805.
Horvat, J. V., Keating, D. M., Rodrigues-Duarte, H., Morris, E. A., & Mango, V. L. (2019).
Calcifications at digital breast tomosynthesis: Imaging features and biopsy
techniques. Radiographics, 39(2), 307–318.
Hu, Y.-H., & Zhao, W. (2011). The effect of angular dose distribution on the detection of
microcalcifications in digital breast tomosynthesis. Medical Physics, 38(5), 2455–
2466.
ICRP. (2007). ICRP Publication 103. Recommendations of the International Commission on
Radiological Protection, ICRP 37, (2-4).
Ikejimba, L., Glick, S. J., Samei, E., & Yo, Y. J. (2016). Comparison of model and human
observer performance in FFDM, DBT, and synthetic mammography. Proceedings
of SPIE Medical Imaging, 9783(978325), 1–10.
Jakel, F., & Wichmann, F. A. (2006). Spatial four-alternative forced-choice method is the
preferred psychophysical method for naive observers. Journal of Vision, 6(11), 13–
13.
Jäkel, F., & Wichmann, F. A. (2006). Spatial four-alternative forced-choice method is the
preferred psychophysical method for naïve observers. Journal of Vision, 6(11),
1307–1322.
Karbaschi, Z., & Gifford, H. C. (2018). Assessing CT acquisition parameters with visual-
search model observers. Journal of Medical Imaging, 5(02), 1.
Karpathy, A. (2017). CS231n Convolutional Neural Networks for Visual Recognition.
Karssemeijer, N., & Thijssen, M. A. O. (1996). Determination of contrast-detail curves of
mammography systems by automated image analysis. Elsevier, 115–160.
Kelly, D. H. (1977). Visual contrast sensitivity. Optica Acta, 24(2), 107–129.
Kopans, D., Gavenonis, S., Halpern, E., & Moore, R. (2011). Calcifications in the Breast and
Digital Breast Tomosynthesis. The Breast Journal, 17(6), 638–644.
Krammer, J., Stepniewski, K., Kaiser, C. G., Brade, J., Riffel, P., Schoenberg, S. O., & Wasser,
K. (2017). Value of Additional Digital Breast Tomosynthesis for Preoperative
Staging of Breast Cancer in Dense Breasts. Anticancer Research, 37(9), 5255–5261.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep
Convolutional Neural Networks. Advances In Neural Information Processing
Systems, 1–9.
Krupinski, E. A. (2010). Current perspectives in medical image perception. Attention,
Perception, and Psychophysics, 72(5), 1205–1217.
Kundel, H. L., Nodine, C. F., Conant, E. F., & Weinstein, S. P. (2007). Holistic Component of
Image Perception in Mammogram Interpretation: Gaze-tracking Study. Radiology,
242(2), 396–402.
Kupinski, M. A., Clarkson, E., & Hesterman, J. Y. (2007). Bias in Hotelling observer
performance computed from finite data. Medical Imaging 2007: Image Perception,
Observer Performance, and Technology Assessment, 6515(65150 Suppl), 65150S.
Li, Z., Desolneux, A., Muller, S., Milioni de Carvalho, P., & Carton, A.-K. (2018). Comparison
of microcalcification detectability in FFDM and DBT using a virtual clinical trial. In
R. M. Nishikawa & F. W. Samuelson (Eds.), Medical Imaging 2018: Image Perception,
Observer Performance, and Technology Assessment (Vol. 10577, p. 12). SPIE.
Maidment, A. D. A. (2014). Virtual clinical trials for the assessment of novel breast
screening modalities. Lecture Notes in Computer Science (Including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8539
LNCS, 1–8.
Maldera, A., De Marco, P., Colombo, P. E., Origgi, D., & Torresin, A. (2017). Digital breast
tomosynthesis: Dose and image quality assessment. Physica Medica, 33, 56–67.
Marinovich, M. L., Hunter, K. E., Macaskill, P., & Houssami, N. (2018). Breast Cancer
Screening Using Tomosynthesis or Mammography: A Meta-analysis of Cancer
Detection and Recall. JNCI: Journal of the National Cancer Institute, 110(9), 942–949.
Marshall, N. W., & Bosmans, H. (2012). Measurements of system sharpness for two digital
breast tomosynthesis systems. Physics in Medicine and Biology, 57(22), 7629–7650.
Massanes, F., & Brankov, J. G. (2017). Evaluation of CNN as anthropomorphic model
observer. In M. A. Kupinski & R. M. Nishikawa (Eds.), Medical Imaging 2017: Image
Perception, Observer Performance, and Technology Assessment (Vol. 10136, p.
101360Q). International Society for Optics and Photonics.
Mcphail, S., Johnson, S., Greenberg, D., Peake, M., & Rous, B. (2015). Stage at diagnosis and
early mortality from cancer in England.
Mertelmeier, T., Orman, J., Haerer, W., & Dudam, M. K. (2006). Optimizing filtered
backprojection reconstruction for a breast tomosynthesis prototype device. SPIE
Medical Imaging: Physics of Medical Imaging, 6142, 61420F.
Michielsen, K., Nuyts, J., Cockmartin, L., Marshall, N. W., & Bosmans, H. (2016). Design of a
model observer to evaluate calcification detectability in breast tomosynthesis and
application to smoothing prior optimization. Medical Physics, 43(12), 6577–6587.
Michielsen, K., Zanca, F., Marshall, N., Bosmans, H., & Nuyts, J. (2013). Two complementary
model observers to evaluate reconstructions of simulated micro-calcifications in
digital breast tomosynthesis. SPIE Medical Imaging: Image Perception, Observer
Performance, and Technology Assessment, 8673(1), 86730G.
Myers, E. R., Moorman, P., Gierisch, J. M., Havrilesky, L. J., Grimm, L. J., Ghate, S., Davidson,
B., Montgomery, R. C., Crowley, M. J., McCrory, D. C., Kendrick, A., & Sanders, G. D.
(2015). Benefits and harms of breast cancer screening: A systematic review. JAMA -
Journal of the American Medical Association, 314(15), 1615–1634.
Myers, K. J., & Barrett, H. H. (1987). Addition of a channel mechanism to the ideal-observer
model. Journal of the Optical Society of America. A, Optics and Image Science, 4(12),
2447–2457.
Niklason, L. T., Christian, B. T., Niklason, L. E., Kopans, D. B., Castleberry, D. E., Opsahl-Ong,
B. H., Landberg, C. E., Slanetz, P. J., Giardino, A. A., Moore, R., Albagli, D., DeJule, M. C.,
Fitzgerald, P. F., Fobare, D. F., Giambattista, B. W., Kwasnick, R. F., Liu, J., Lubowski,
S. J., Possin, G. E., … Wirth, R. F. (1997). Digital tomosynthesis in breast imaging.
Radiology, 205(2), 399–406.
Nwankpa, C., Ijomah, W., Gachagan, A., & Marshall, S. (2018). Activation Functions:
Comparison of trends in Practice and Research for Deep Learning.
Obuchowski, N. A. (2000). Sample size tables for receiver operating characteristic studies.
American Journal of Roentgenology, 175(3), 603–608.
Park, S., Jennings, R., Liu, H., Badano, A., & Myers, K. (2010). A statistical, task-based
evaluation method for three-dimensional x-ray breast imaging systems using
variable-background phantoms. Medical Physics, 37, 6253–6270.
Park, S., Zhang, G., & Myers, K. J. (2016). Comparison of Channel Methods and Observer
Models for the Task-Based Assessment of Multi-Projection Imaging in the Presence
of Structured Anatomical Noise. IEEE Transactions on Medical Imaging, 35(6),
1431–1442.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A.,
Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch.
Pattacini, P., Nitrosi, A., Giorgi Rossi, P., Iotti, V., Ginocchi, V., Ravaioli, S., Vacondio, R.,
Braglia, L., Cavuto, S., & Campari, C. (2018). Digital Mammography versus Digital
Mammography Plus Tomosynthesis for Breast Cancer Screening: The Reggio Emilia
Tomosynthesis Randomized Trial. Radiology, 288(2), 375–385.
Peppard, H. R., Nicholson, B. E., Rochman, C. M., Merchant, J. K., Mayo, R. C., & Harvey, J. A.
(2015). Digital breast tomosynthesis in the diagnostic setting: Indications and
clinical applications. Radiographics, 35(4), 975–990.
Perry, N., Broeders, M., de Wolf, C., Törnberg, S., Holland, R., & von Karsa, L. (2006).
European guidelines for quality assurance in breast cancer screening and diagnosis.
In Annals of oncology : official journal of the European Society for Medical Oncology
/ ESMO (Vol. 19, Issue 4).
Peterson, W. W., Birdsall, T. G., & Fox, W. C. (1954). The theory of signal detectability. IRE
Professional Group on Information Theory, 4(4), 171–212.
Petrov, D., Cockmartin, L., Marshall, N., Vancoillie, L., Young, K., & Bosmans, H. (2017). Real
space channelization for generic DBT system image quality evaluation with
channelized Hotelling observer. In M. A. Kupinski & R. M. Nishikawa (Eds.), Progress
in Biomedical Optics and Imaging - Proceedings of SPIE (Vol. 10136, p. 101360N).
International Society for Optics and Photonics.
Petrov, D., Marshall, N. W., Young, K. C., & Bosmans, H. (2019). Systematic approach to a
channelized Hotelling model observer implementation for a physical phantom
containing mass-like lesions: Application to digital breast tomosynthesis. Physica
Medica, 58, 8–20.
Petrov, D., Marshall, N., Young, K., & Bosmans, H. (2018). Model and human observer
reproducibility for detecting microcalcifications in digital breast tomosynthesis
images. In R. M. Nishikawa & F. W. Samuelson (Eds.), Medical Imaging 2018: Image
Perception, Observer Performance, and Technology Assessment (Vol. 10577, p. 10).
SPIE.
Petrov, D., Marshall, N., Young, K., Zhang, G., & Bosmans, H. (2019). Model and human
observer reproducibility for detection of microcalcification clusters in digital breast
tomosynthesis images of three-dimensionally structured test object. Journal of
Medical Imaging, 6(01), 1.
Petrov, D., Michielsen, K., Cockmartin, L., Zhang, G., & Young, K. (2016). Development and
application of a channelized Hotelling observer for DBT optimization on structured
background test images with mass simulating targets. SPIE Medical Imaging, 9787,
1–9.
Platiša, L., Goossens, B., Vansteenkiste, E., Park, S., Gallas, B. D., Badano, A., & Philips, W.
(2011). Channelized Hotelling observers for the assessment of volumetric imaging
data sets. Journal of the Optical Society of America. A, Optics, Image Science, and
Vision, 28(6), 1145–1163.
Poplack, S. P., Tosteson, T. D., Kogel, C. A., & Nagy, H. M. (2007). Digital breast
tomosynthesis: Initial experience in 98 women with abnormal digital screening
mammography. American Journal of Roentgenology, 189(3), 616–623.
Pratt, L. Y. (1993). Discriminability-Based Transfer between Neural Networks. Advances
in Neural Information Processing Systems, 204–211.
Racine, D., Ba, A. H., Ott, J. G., Bochud, F. O., & Verdun, F. R. (2016). Objective assessment
of low contrast detectability in computed tomography with Channelized Hotelling
Observer. Physica Medica, 32(1), 76–83.
Rafferty, E. A. (2007). Digital Mammography: Novel Applications. Radiologic Clinics of
North America, 45(5), 831–843.
Rashidnasab, A., Elangovan, P., Diaz, O., Mackenzie, A., Young, K., Dance, D., & Wells, K.
(2013). Simulation of 3D DLA masses in digital breast tomosynthesis (R. M.
Nishikawa & B. R. Whiting (eds.); Vol. 8668, p. 86680Y). International Society for
Optics and Photonics.
Rashidnasab, A., Elangovan, P., Yip, M., Diaz, O., Dance, D. R., Young, K. C., & Wells, K.
(2013). Simulation and assessment of realistic breast lesions using fractal growth
models. Physics in Medicine and Biology, 58(16), 5613–5627.
Rodríguez-Ruiz, A., Castillo, M., Garayoa, J., & Chevalier, M. (2016). Evaluation of the
technical performance of three different commercial digital breast tomosynthesis
systems in the clinical environment. Physica Medica, 32(6), 767–777.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale
Visual Recognition Challenge. International Journal of Computer Vision, 115(3),
211–252.
Saunders, R. S., Baker, J. A., Delong, D. M., Johnson, J. P., & Samei, E. (2007). Does image
quality matter? Impact of resolution and noise on mammographic task
performance. Medical Physics, 34(10), 3971–3981.
Scarparo, D. C., Salvadeo, D. H. P., Pedronette, D. C. G., Barufaldi, B., & Maidment, A. D. A.
(2019). Evaluation of denoising digital breast tomosynthesis data in both
projection and image domains and a study of noise model on digital breast
tomosynthesis image domain. Journal of Medical Imaging, 6(03), 1.
Schrauf, M., & Stern, C. (2001). The visual resolution of Landolt-C optotypes in human
subjects depends on their orientation: the ‘gap-down’ effect. Neuroscience Letters,
299(3), 185–188.
Sechopoulos, I. (2013a). A review of breast tomosynthesis. Part I. The image acquisition
process. Medical Physics, 40(1), 1–12.
Sechopoulos, I. (2013b). A review of breast tomosynthesis. Part II. Image reconstruction,
processing and analysis, and advanced applications. Medical Physics, 40(1), 1–17.
Segars, W. P., Mahesh, M., Beck, T. J., Frey, E. C., & Tsui, B. M. W. (2008). Realistic CT
simulation using the 4D XCAT phantom. Medical Physics, 35(8), 3800–3808.
Shaheen, E., De Keyzer, F., Bosmans, H., Dance, D. R., Young, K. C., & Van Ongeval, C. (2014).
The simulation of 3D mass models in 2D digital mammography and breast
tomosynthesis. Medical Physics, 41(8), 081913.
Sharma, D., Graff, C. G., Badal, A., Zeng, R., Sawant, P., Sengupta, A., Dahal, E., & Badano, A.
(2019). Technical Note - In silico imaging tools from the VICTRE clinical trial.
Medical Physics.
Skaane, P., Bandos, A. I., Gullien, R., Eben, E. B., Ekseth, U., Haakenaasen, U., Izadi, M.,
Jebsen, I. N., Jahr, G., Krager, M., & Hofvind, S. (2013). Prospective trial comparing
full-field digital mammography (FFDM) versus combined FFDM and tomosynthesis
in a population-based screening programme using independent double reading
with arbitration. European Radiology, 23(8), 2061–2071.
Skaane, P., Bandos, A. I., Niklason, L. T., Sebuødegård, S., Østerås, B. H., Gullien, R., Gur, D.,
& Hofvind, S. (2019). Digital Mammography versus Digital Mammography Plus
Tomosynthesis in Breast Cancer Screening: The Oslo Tomosynthesis Screening
Trial. Radiology, 291(1), 23–30.
Spangler, M. L., Zuley, M. L., Sumkin, J. H., Abrams, G., Ganott, M. A., Hakim, C., Perrin, R.,
Chough, D. M., Shah, R., & Gur, D. (2011). Detection and
Classification of Calcifications on Digital Breast Tomosynthesis and 2D Digital
Mammography: A Comparison. American Journal of Roentgenology, 196, 320–324.
Svahn, T. M., Chakraborty, D. P., Ikeda, D., Zackrisson, S., Do, Y., Mattsson, S., & Andersson,
I. (2012). Breast tomosynthesis and digital mammography: A comparison of
diagnostic accuracy. British Journal of Radiology, 85(1019).

Tabar, L., Yen, M.-F., Vitak, B., Chen, H.-H. T., Smith, R. A., & Duffy, S. W. (2003).
Mammography service screening and mortality in breast cancer patients: 20-year
follow-up before and after introduction of screening. The Lancet, 361(9367), 1405–
1410.
Thomas, J. A., Chakrabarti, K., Kaczmarek, R., & Romanyukha, A. (2005). Contrast-detail
phantom scoring methodology. Medical Physics, 32(3), 807–814.
Timberg, P., Båth, M., Andersson, I., Svahn, T., Ruschin, M., Hemdal, B., Mattsson, S., &
Tingberg, A. (2008). Impact of dose on observer performance in breast
tomosynthesis using breast specimens. SPIE Medical Imaging: Physics of Medical
Imaging, 6913, 69134J.
U.S. Food and Drug Administration. (2018). VICTRE: Virtual Imaging Clinical Trials for
Regulatory Evaluation.
van Engen, R. E., Bosmans, H., Bouwman, R. W., Dance, D. R., Heid, P., Lazzari, B., Marshall,
N. W., Schopphoven, S., Strudley, C., Thijssen, M., & Young, K. C. (2014). A European
Protocol for Technical Quality Control of Breast Tomosynthesis Systems (pp. 452–
459). Springer, Cham.
Vennart, W. (1997). ICRU Report 54: Medical imaging—the assessment of image quality.
Radiography, 3(3), 243–244.
Warren, L. M., Mackenzie, A., Cooke, J., Given-Wilson, R. M., Wallis, M. G., Chakraborty, D.
P., Dance, D. R., Bosmans, H., & Young, K. C. (2012). Effect of image quality on
calcification detection in digital mammography. Medical Physics, 39(6), 3202–3213.
Watson, A. B. (1983). Detection and recognition of simple spatial forms. Berlin:
Springer-Verlag, 100–114.
Weilong Hou, Xinbo Gao, Dacheng Tao, & Xuelong Li. (2015). Blind Image Quality
Assessment via Deep Learning. IEEE Transactions on Neural Networks and Learning
Systems, 26(6), 1275–1286.
Wen, G., Markey, M. K., Haygood, T. M., & Park, S. (2018). Model observer for assessing
digital breast tomosynthesis for multi-lesion detection in the presence of
anatomical noise. Physics in Medicine and Biology, 63(4).
Wigati, K. T., Vancoillie, L., Salomon, E., Zhang, G., Cockmartin, L., Marshall, N., Bosmans,
H., Soejoko, D. S., Bliznakova, K., & Petrov, D. (2019). Channelized Hotelling
observer assessing microcalcification detectability on 2D mammography: a first
application to study the impact of tube voltage. Journal of Physics: Conference Series,
1248.
Witten, J. M., Park, S., & Myers, K. J. (2010). Partial least squares: A method to estimate
efficient channels for the ideal observers. IEEE Transactions on Medical Imaging,
29(4), 1050–1058.
Wunderlich, A., & Abbey, C. K. (2013). Utility as a rationale for choosing observer
performance assessment paradigms for detection tasks in medical imaging. Medical
Physics, 40(11).
Young, K. C., Alsager, A., Oduko, J. M., Bosmans, H., Verbrugge, B., Geertse, T., & van Engen,
R. (2008). Evaluation of software for reading images of the CDMAM test object to
assess digital mammography systems. 6913, 69131C.
Young, K. C., Cook, J. J. H., Oduko, J. M., & Bosmans, H. (2006). Comparison of software and
human observers in reading images of the CDMAM test object to assess digital
mammography systems. SPIE Medical Imaging: Physics of Medical Imaging, 6142,
614206.
Young, S., Bakic, P. R., Myers, K. J., Jennings, R. J., & Park, S. (2013). A virtual trial framework
for quantifying the detectability of masses in breast tomosynthesis projection data.
Medical Physics, 40(5), 051914.
Young, S., Park, S., Anderson, S. K., Badano, A., Myers, K. J., & Bakic, P. R. (2009). Estimating
breast tomosynthesis performance in detection tasks with variable-background
phantoms. SPIE Medical Imaging: Physics of Medical Imaging, 7258(301), 72580O.
Zeng, R., Badano, A., & Myers, K. J. (2015). Evaluating the sensitivity of the optimization of
acquisition geometry to the choice of reconstruction algorithm in digital breast
tomosynthesis through a simulation study. Physics in Medicine and Biology, 60, 1259.
Zeng, R., Badano, A., & Myers, K. J. (2017). Optimization of digital breast tomosynthesis
(DBT) acquisition parameters for human observers: effect of reconstruction
algorithms. Physics in Medicine & Biology.
Zhang, G., Cockmartin, L., & Bosmans, H. (2016). A four-alternative forced choice (4AFC)
software for observer performance evaluation in radiology (C. K. Abbey & M. A.
Kupinski (eds.); p. 97871E). International Society for Optics and Photonics.
Zhao, B., Zhou, J., Hu, Y.-H., Mertelmeier, T., Ludwig, J., & Zhao, W. (2008). Experimental
validation of a three-dimensional linear system model for breast tomosynthesis.
Medical Physics, 36(1), 240–251.
Zhou, W., Li, H., & Anastasio, M. A. (2019). Approximating the Ideal Observer and Hotelling
Observer for binary signal detection tasks by use of supervised learning methods.
IEEE Transactions on Medical Imaging, 1–1.

Chapter 1: Systematic approach to a channelized
Hotelling model observer implementation for a
physical phantom containing mass-like lesions:
application to digital breast tomosynthesis

Based on D. Petrov, N. Marshall, K. Young, H. Bosmans, ‘Systematic approach to a
channelized Hotelling model observer implementation for a physical phantom
containing mass-like lesions: Application to digital breast tomosynthesis’, Physica
Medica: European Journal of Medical Physics, Volume 58, 8–20 (2019)

INTRODUCTION
Breast screening employing conventional 2D full field digital mammography
(FFDM) plays a key role in early breast cancer detection and, when implemented
carefully, is known to reduce breast-cancer mortality (Alberto Donzelli, 2012;
Glasziou & Houssami, 2011). A limitation associated with FFDM is the projection
of all breast structures into a single image. This can obscure lesions and therefore
reduce sensitivity or increase the number of false alarms (Poplack et al., 2007;
Rafferty, 2007). Digital breast tomosynthesis (DBT) systems acquire a series of
projection images using an x-ray source that moves over a limited angle. The
resulting projection data are used to reconstruct a set of planes parallel to the
detector (Maldera et al., 2017; Sechopoulos, 2013a, 2013b), generating a 3D
dataset of the breast anatomy. Recent studies have shown that the addition of
DBT to digital mammography or even the stand-alone use of DBT can decrease
recall rate and increase lesion detection compared to standard FFDM (Skaane et
al., 2013; Svahn et al., 2012). Digital breast tomosynthesis is therefore a modality
that shows great potential (Ferreira et al., 2016; Niklason et al., 1997).
Physical phantoms are an established means of quantifying a (technical)
measure of image quality and of ensuring that image quality meets a sufficient
level while remaining within accepted dose levels (van Engen et al., 2014). It is
likely that the practice of physical phantom evaluation, initially scored by human
observers, will be carried over to the assessment and optimization of DBT
devices. One such candidate is the 3D structured phantom developed by
Cockmartin et al. (Cockmartin et al., 2017), which contains clinically relevant
targets and was developed for comparative studies of FFDM and DBT. Future
applications may include the use in routine QC and in technical optimization of
DBT systems. Currently, phantom images are read by human readers (HR). This
is a time consuming approach, and furthermore the same observers may not
always be available to score the images of successive investigations.
Computerized evaluations of physical phantoms may boost the development and
application of QC protocols for DBT systems, in line with the European approach
(Perry et al., 2006).

While evaluation of technical parameters that influence imaging performance
yields useful information (Marshall & Bosmans, 2012; Rodríguez-Ruiz et al., 2016;
van Engen et al., 2014; Zhao et al., 2008), some of these methods involve the use
of Fourier techniques, which require system linearity and therefore must be
applied with care to reconstructed image datasets. As an alternative, many
studies have described the potential of spatial domain model observers (MO)
(Abbey & Barrett, 2001; H H Barrett et al., 1993; Burgess et al., 2001) to perform
these detection tasks for a variety of medical imaging modalities, including CT
(Racine et al., 2016), FFDM (C Castella et al., 2009) and fluoroscopy (Fetterly &
Favazza, 2016). More specifically, the channelized Hotelling observer (CHO) has
been proven to be a useful classifier in DBT images (Michielsen et al., 2016; Park
et al., 2010, 2016; Platiša et al., 2011; Wen et al., 2018; Young et al., 2013; Zeng et
al., 2017). For example, Young et al. (Young et al., 2013) used a CHO to quantify
different DBT setups with varying scan angles and numbers of projections. Zeng
et al. (Zeng et al., 2017) used three types of CHO to compare two DBT
reconstruction methods for different angular spans and numbers of views, and
verified the results with a human observer study. Wen et al. (Wen et al.,
2018) compared CHO model observers for identifying the optimal DBT
acquisition geometries. Most of these studies focus on optimization via
simulation with mathematical phantoms (Perry et al., 2006). Turning to physical
phantoms, Park et al. (Park et al., 2010) applied a Hotelling observer to assess
the detectability of spheres, while Michielsen et al. (Michielsen et al., 2016) used
a CHO in the evaluation of iterative algorithms to reconstruct microcalcifications
in a physical DBT phantom.
CHO methods typically involve three distinct steps: design and tuning of a
channel set relevant to the targets/backgrounds, training of the observer
template and finally application. This last step generates a scalar test statistic,
characterizing the response of the CHO to a given image. Evaluating this test
statistic at various thresholds generates a performance metric such as the area
under the curve (AUC). Channel selection influences CHO performance and the
degree to which HR performance is approximated. Castella et al. (C Castella et al.,
2007) and Bouwman et al. (Bouwman et al., 2017) showed that some channels
provide better anthropomorphic matching than others, while Zeng et al. (Zeng
et al., 2017) showed that channel parameter tuning can improve agreement with
HR scores. There is currently little consensus on the selection or development of
CHOs for physical test objects. The goal of this study was therefore to describe
the systematic development and validation of a CHO algorithm for use with the
physical phantom described by Cockmartin et al. (Cockmartin et al., 2017), with the
focus on channel design, the generation of signal templates and the number of
images required for a robust estimate of the covariance matrix.
In the current study, the channel design comprises a comparison of channel
types and an investigation of the channel parameters needed to approximate
human observer results. The CHO signal template sets the target to be detected,
and a proper estimate is crucial for performance. The study therefore compares
how different estimates of the signal template influence the CHO scores. For
implementation in daily practice, the number of DBT phantom acquisitions needed
for a CHO reading is crucial. The CHO is trained with different numbers of DBT
scans to establish the minimum number of training acquisitions. In the
same context, the number of training acquisitions can be lowered if some of the
images to be read are also used for training. This clearly introduces
observer bias, which could influence the results. The CHO is therefore trained with
datasets inducing different amounts of bias, to test whether a small percentage of
bias can be tolerated when scanning time is limited. Finally, the reproducibility
of the developed model observer is tested against human results at three
dose levels.

MATERIALS AND METHODS


1. Image acquisition
This study is based around the DBT test object described by Cockmartin et al.
(Cockmartin et al., 2017) (Figure 1), a 3D physical phantom that uses spheres to
generate the backgrounds for the detection study. Readers are referred to
Cockmartin et al. (Cockmartin et al., 2017) for a detailed discussion of the
advantages and limitations of using spheres for this purpose. Briefly, the
phantom is made from a poly(methyl methacrylate) (PMMA) semi-circular
container filled with PMMA spheres of six different diameters (15.88, 12.70, 9.52,
6.35, 3.18, and 1.58 mm), with water filling the remaining volume (Figure 1). The
spheres are free to move and thus shaking the phantom produces another
background realization, but with similar power spectra characteristics. This
study evaluated the non-spiculated masses, with average diameters equal to 1.5,
2.1, 3.0, 4.3 and 5.7 mm, with a fixed position within the phantom volume. Each
phantom scan generates a single set of signal present data, while the same scan
yields 15 signal absent datasets.

Figure 1. Images of the phantom, from left to right: DBT reconstructed plane at the
level of the mass models, mammographic image without the background spheres
and a photograph
The phantom was scanned 108 times in total on a Siemens Inspiration
Tomosynthesis system (Siemens-Healthineers, Erlangen, Germany). Acquisition
parameters were related to the settings under automatic exposure control (AEC),
namely 30 kV, W/Rh anode/filter combination and 204 mAs. Sixty acquisitions
were taken at the AEC dose level with manually selected factors (30 kVp and
200 mAs), 24 acquisitions at a ‘Low’ dose level, close to half
the standard dose (30 kVp and 112 mAs), and 24 acquisitions at a ‘High’ dose level
(30 kVp and 400 mAs). DBT volumes were reconstructed using the “Enhanced
Multiple Parameter Iterative Reconstruction (EMPIRE)” algorithm. This
reconstruction algorithm is a filtered back projection (FBP) with additional
iterative processing for artefact reduction, noise reduction and higher resolution
(Abdurahman et al., 2012, 2014). Between each scan, the phantom was shaken
for 10 seconds to generate a different background realization.

Figure 2. (a) Extraction positions for signal present and signal absent volumes of
interest. (b) Screenshot of the in-house developed ‘Foursquares’ 4AFC software
(Zhang et al., 2016). For illustration purposes the signal present image is marked
in green and the signal absent images in red.
Once the scans were acquired, the volumes of interest (VOIs) containing the
lesions were located as follows. Sets of small high contrast, microcalcification
simulating lesions are also present within the phantom and these were used as
the localization landmark. DBT scans were made of the phantom with the lesion
models, but without the background spheres or water, and distances to the five
non-spiculated masses were measured from the calcification reference points.
All the lesions are located at fixed positions on a 2 mm thick PMMA sheet within
the phantom and therefore do not move during the shaking. Knowing the
location of the microcalcifications enables the position of the non-spiculated
masses to be calculated and VOIs extracted at the appropriate locations (lesions
are at the centre of the green squares in figure 2). As a check, the extraction
method was applied to a dataset of the empty phantom (no spheres) and a visual
check made that the extracted masses were at the centre of the relevant VOI. The
signal present VOIs were formed from a cropped volume of size 20 × 20 × 30 mm³
centred around the target mass. The physical height of the phantom is 48 mm
including the phantom walls (Cockmartin et al., 2017), generating ~40 DBT slices
between the top and bottom PMMA plates. Five slices at the top and bottom were
not used due to the influence of the PMMA plates on the background, giving VOIs
with 30 planes that have similar background statistics. This also meant that the
signal absent VOIs had to be extracted at the same z-height as the signal present
VOIs, but then from lesion free areas of the phantom (Figure 2.a).

While the complete VOIs were used for the human observer reading studies, the
particular application of the CHO described later in the text required
two-dimensional images. For this purpose, regions of interest (ROIs) were extracted
from the in-focus plane and the adjacent planes of all VOIs.
2. Human observer study
Six medical physicists participated in a four alternative forced choice (4-AFC)
image reading study (Jakel & Wichmann, 2006). For detection tasks (as
considered here), Jäkel and Wichmann suggest that 4-AFC and 8-AFC methods
are more time efficient and have lower uncertainty (after some given number of
trials) than a 2-AFC method. However, an 8-AFC experiment requires more signal
absent images than a 4-AFC study and therefore we consider the 4-AFC study
used here to be a good compromise. Three signal absent VOIs and one signal
present VOI were shown, with the task of indicating the signal present image. An
in-house software tool, ‘Foursquares’ (Zhang et al., 2016; figure 2.b), was used
to conduct the 4-AFC study.
Reading was performed on a diagnostic monitor (5MP Barco MDNG-6121) at an
ambient light level of 3 lx and viewing distance of 40 to 50 cm. No magnification
and window level adjustments were allowed, and only scrolling through the VOI
was permitted. No time constraints were imposed for the reading sessions. While
it is common for observers to read a training set before participating in this type
of study, this was not done in this study. All the observers had read many DBT
images of this phantom already, with images acquired under different conditions
for a different study, and thus a training set was not required. One reading
session (of 12 images/trials) typically took 5 minutes. During reading, the
observers were shown, next to the 4 VOIs, an additional ROI (similar to that
in figure 5.d) containing the specific target to be detected.
The 60 DBT acquisitions taken at AEC dose level were split into 5 groups of 12
DBT stacks. A further 2 groups were formed for the high and low dose level
acquisitions, each with 12 DBT stacks. Volumes of interest were extracted and
subsequently sorted into 5 reading sessions, corresponding to the 5 lesion sizes
for each reading group. This way a reading session consisted of twelve 4AFC
trials, which gave one percentage correct (PC) result for each human reader.
With 7 reading groups (5 at the AEC dose level and one each at the high and low
dose levels) and 5 lesion sizes, each observer read 35 sessions in
total. The overall PC for a given lesion size and dose level was found by averaging
the PC over all 6 readers, with the uncertainty taken as the standard error of the
mean.
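The pooling of reader scores described above can be illustrated with a short Python sketch. The hit counts are hypothetical and the function names `session_pc` and `pooled_pc` are introduced here for illustration only; the sketch assumes the sample standard deviation is used when forming the standard error of the mean.

```python
from math import sqrt
from statistics import mean, stdev

def session_pc(n_correct, n_trials=12):
    """Percentage correct for one reader in one 4-AFC reading session."""
    return 100.0 * n_correct / n_trials

def pooled_pc(reader_pcs):
    """Mean PC over readers and the standard error of that mean."""
    sem = stdev(reader_pcs) / sqrt(len(reader_pcs))
    return mean(reader_pcs), sem

# Hypothetical numbers of correct trials (out of 12) for six readers.
hits = [9, 10, 8, 11, 9, 10]
pcs = [session_pc(h) for h in hits]
avg_pc, sem_pc = pooled_pc(pcs)
print(f"PC = {avg_pc:.1f} +/- {sem_pc:.1f} %")
```

Note that a reader guessing at random in a 4-AFC trial is expected to score 25 %, which sets the floor of the PC scale.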
3. Channelized Hotelling observer and general work flow
The Hotelling observer computes a test statistic t for each image g, by applying
an observer template (Harrison H Barrett et al., 2004):
$$t(g) = w^{T} g = \Delta\bar{g}^{T} S_g^{-1} g, \qquad (1)$$
where $\{\}^{T}$ is the vector transpose operator and $w$ is the observer template,
formed from the mean difference between the signal present and signal absent
data ($\Delta\bar{g}$) and the inverse of the interclass covariance matrix of the signal present
and signal absent data ($S_g^{-1}$).
In practice, due to the large dimensions of the covariance matrix, inversion is
often impossible. To overcome this dimensionality problem, a channel
mechanism was introduced in the definition of the template (Myers & Barrett,
1987), resulting in the channelized Hotelling observer (CHO):
$$t_{CHO}(v) = w_{CHO}^{T} v = (\bar{v}_{sp} - \bar{v}_{sa})^{T} S_v^{-1} v, \qquad (2)$$
where $v = U^{T} g$ is the channel output vector, formed by the product of the
channel set U and image g. The work flow of development and application of a
CHO can be split into three interlinked phases: ‘channel tuning’, ‘training’ and
‘reading’ (figure 3). In the tuning phase, the channels are generated and
associated parameters adjusted using a set of training images to find a CHO
whose performance matches the human observer performance. During training,
the observer template is usually estimated from a set of training images
(Harrison H Barrett et al., 2004). In the last phase, called ‘reading’, the observer
template is applied to the images that have to be evaluated.
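The training and reading phases can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the implementation used in this work: the function names, the random channel matrix and the white-noise images are all hypothetical, and the sketch assumes that the channel covariance is estimated as the average of the signal present and signal absent covariance matrices.

```python
import numpy as np

def channelize(images, U):
    """Channel outputs v = U^T g for a stack of flattened images.

    images: (n_images, n_pixels); U: (n_pixels, n_channels).
    """
    return images @ U

def train_cho(v_sp, v_sa):
    """Estimate the template w_CHO = S_v^{-1} (mean(v_sp) - mean(v_sa)).

    Assumption: S_v is the average of the two class covariance matrices.
    """
    delta_v = v_sp.mean(axis=0) - v_sa.mean(axis=0)
    s_v = 0.5 * (np.cov(v_sp, rowvar=False) + np.cov(v_sa, rowvar=False))
    return np.linalg.solve(s_v, delta_v)  # avoids forming the explicit inverse

def read_cho(w_cho, v):
    """Test statistic t_CHO = w_CHO^T v for each row of v."""
    return v @ w_cho

# Synthetic stand-in data (hypothetical, not phantom images).
rng = np.random.default_rng(0)
n_pixels, n_channels = 64, 8
U = rng.normal(size=(n_pixels, n_channels))
signal = 0.5 * np.ones(n_pixels)
g_sa = rng.normal(size=(200, n_pixels))
g_sp = rng.normal(size=(200, n_pixels)) + signal

v_sa, v_sp = channelize(g_sa, U), channelize(g_sp, U)
w_cho = train_cho(v_sp, v_sa)
t_sa, t_sp = read_cho(w_cho, v_sa), read_cho(w_cho, v_sp)
```

With 8 channels the covariance matrix is only 8 × 8, which is why it can be estimated and inverted from a few hundred images, whereas the pixel-domain matrix of equation (1) could not.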
The 4-AFC method used to estimate human observer performance was also
implemented for the CHO. Decision variables for the 4 images in a given 4-AFC
trial were calculated and if the decision variable for the signal present image had
the highest value, the algorithm counted a ‘hit’ (figure 3). This action was
repeated for all signal present images in a reading session, with the
algorithm each time picking 3 random signal absent decision variables with
replacement. The PC was then calculated from the total number of ‘hits’. This was
repeated six times for the same reading session, matching the number of human
readers, and the average PC was taken as the final result for the session. The
uncertainty was estimated via bootstrapping: the resampling process was repeated
30 times and the standard deviation was taken as the uncertainty measure. This work
implemented a single slice CHO (ssCHO) in the terminology of Platiša et al.
(Platiša et al., 2011), who showed that ssCHO and multi-slice CHO had similar
performance for clustered lumpy background (CLB) images. As Bochud et al.
(Bochud et al., 1999) and Castella et al. (Cyril Castella et al., 2008) have shown,
CLB images have an appearance similar to real mammographic backgrounds.
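The 4-AFC scoring of the CHO decision variables described above can be sketched as follows. The decision variables are hypothetical, and because the text does not specify which quantity is resampled in the bootstrap, the sketch assumes that the signal present decision variables are resampled with replacement.

```python
import random
from statistics import mean, stdev

def afc4_pc(t_sp, t_sa, rng):
    """PC for one pass: each signal present statistic competes against
    three signal absent statistics drawn at random with replacement."""
    hits = 0
    for t in t_sp:
        rivals = rng.choices(t_sa, k=3)  # sampling with replacement
        if t > max(rivals):              # a 'hit': signal present scores highest
            hits += 1
    return 100.0 * hits / len(t_sp)

def session_result(t_sp, t_sa, n_readers=6, n_boot=30, seed=0):
    """Average PC over n_readers passes and a bootstrap standard deviation."""
    rng = random.Random(seed)
    pc = mean([afc4_pc(t_sp, t_sa, rng) for _ in range(n_readers)])
    boot = []
    for _ in range(n_boot):
        # Assumption: the signal present statistics are the resampled quantity.
        resampled = rng.choices(t_sp, k=len(t_sp))
        boot.append(mean([afc4_pc(resampled, t_sa, rng) for _ in range(n_readers)]))
    return pc, stdev(boot)

# Hypothetical decision variables for one session of 12 trials.
t_sp_demo = [2.1, 0.3, 1.8, 2.6, 1.1, 0.9, 3.0, 1.7, 0.2, 2.2, 1.5, 2.8]
t_sa_demo = [0.1, -0.4, 0.8, 1.2, -1.0, 0.5, 0.0, 1.6, -0.2, 0.7, 0.3, -0.6]
pc_demo, sd_demo = session_result(t_sp_demo, t_sa_demo)
```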
The CHO implementation requires signal known exactly conditions and, given
that perfect segmentation cannot be guaranteed, the CHO was applied to 25
positions around the expected signal position. This was repeated
over 5 planes in the through-plane (z) direction, resulting in a 125 voxel
scanning region around the initial (expected) position of the lesion (i.e. a
4.25 × 4.25 × 5 mm³ real-space volume). After calculating the CHO performance for
all points, the maximum was taken as the final result.
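Selecting the final result from the scanned positions amounts to a maximum over a small three-dimensional array. The sketch below assumes that the 25 in-plane positions form a 5 × 5 grid and that the PC at each position has already been computed; the array contents and the name `best_position` are hypothetical.

```python
import numpy as np

def best_position(pc_map):
    """Return the maximum PC and its (z, y, x) index in the scanning region.

    pc_map[z, y, x] holds the CHO PC at each candidate position.
    """
    idx = np.unravel_index(np.argmax(pc_map), pc_map.shape)
    return float(pc_map[idx]), tuple(int(i) for i in idx)

# Hypothetical PC values for a 5 x 5 x 5 scanning region.
pc_map = np.full((5, 5, 5), 50.0)
pc_map[2, 1, 3] = 83.3
best_pc, best_idx = best_position(pc_map)
# Offset of the best position relative to the expected (central) position.
offset = tuple(i - 2 for i in best_idx)
```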

Figure 3. Flow chart of channelized Hotelling model observer performance
assessment in the 4-AFC test.
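The 4-AFC scoring of the CHO can be sketched in a few lines. This is a minimal sketch under stated assumptions: the decision variables are simulated Gaussian values rather than CHO outputs, ties are counted as misses, and the bootstrap resamples the decision variables themselves (the text does not specify what is resampled). All function names are illustrative.

```python
import random
import statistics

def pc_4afc(sp_scores, sa_scores, rng):
    """Percentage correct: one 4-AFC trial per signal present decision variable."""
    hits = 0
    for t_sp in sp_scores:
        # Three signal absent decision variables, drawn with replacement.
        distractors = rng.choices(sa_scores, k=3)
        if t_sp > max(distractors):
            hits += 1
    return 100.0 * hits / len(sp_scores)

def session_pc(sp_scores, sa_scores, rng, n_repeats=6):
    """Average PC over repeated readings of the same session."""
    return statistics.mean(pc_4afc(sp_scores, sa_scores, rng) for _ in range(n_repeats))

def bootstrap_sd(sp_scores, sa_scores, rng, n_boot=30):
    """Standard deviation of the session PC over bootstrap resamples (assumed scheme)."""
    pcs = []
    for _ in range(n_boot):
        sp = rng.choices(sp_scores, k=len(sp_scores))
        sa = rng.choices(sa_scores, k=len(sa_scores))
        pcs.append(session_pc(sp, sa, rng))
    return statistics.stdev(pcs)

rng = random.Random(0)
sa = [rng.gauss(0.0, 1.0) for _ in range(300)]  # simulated signal absent scores
sp = [rng.gauss(1.5, 1.0) for _ in range(60)]   # simulated signal present scores
pc = session_pc(sp, sa, rng)
sd = bootstrap_sd(sp, sa, rng)
```

With four alternatives, a guessing observer scores 25 PC, so any useful observer should lie between 25 and 100 PC.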
4. Channel selection and tuning
The first step is selection of the type of channel for the CHO as this obviously has
a large influence on MO performance. The channels can be viewed as a type of
filter applied to the images, extracting features relevant to the task (e.g.
detection) and generating higher scores if the targets are present in the images.
These channels have to be adjusted (tuned) so that they match the HR
performance obtained using the phantom. Two types of channels are commonly
used (Harrison H Barrett et al., 2015): anthropomorphic and efficient.
Anthropomorphic channels are used to incorporate some characteristics of the
human visual system within the model observer, while efficient channels are
used to approximate the ideal observer performance. Three of the most common
types of channels were studied in this work: Gabor, Difference of Gaussians
(DOG) and Laguerre-Gauss (LG), where the first two are considered
anthropomorphic and the third is considered efficient. Castella et al. (C Castella
et al., 2007) observed that using higher numbers of channels may lead to an
increased overall observer performance, but not necessarily to a better matching
of human results. For our study, the channel number was set to 8. This was based
on some earlier tests (data available, but not shown) in which it was found that
8 channels represented stable results and a workable compromise in terms of
image acquisition work load and required number of training images.

Figure 4. Images of the studied channel types


The Gabor function is defined in the spatial domain by multiplying a sinusoidal
wave with a Gaussian function (figure 4.1) (Watson, 1983):
$$C_{i,j,k}(x, y) = \exp\left[-4\ln(2)\,\frac{x^2 + y^2}{w_i^2}\right] \cos\left[2\pi f\,(x\cos\theta_j + y\sin\theta_j) + \phi_k\right], \tag{3}$$
where $w_i = \frac{t_w}{(e^i + 2)}$, $f = \frac{t_f}{w_i}$, $\theta_j = \frac{\pi j}{t_t}$ and $\phi_k = 45k$.

The Gaussian function determines the width ($w_i$) of the channels and the
sinusoidal function does so for their frequency (𝑓), orientation (𝜃) and phase (𝜙).
Here, the frequency was set as a function of the standard deviation of the
Gaussian function, guaranteeing that only one maximum of the sinusoidal wave
has an impact and therefore forcing the sensitive region of the CHO to the center
of the image. Three parameters are required to generate a channel set: the
number of frequencies (I), orientations (J) and phases (K). Thus, in order to
generate 8 channels, one phase, two orientations and 4 frequencies were
selected. To set the properties of each channel within the channel set, three
tuning factors were implemented: $t_w$, $t_f$ and $t_t$. These set the Gaussian standard
deviation, the sine wave frequency and the orientation. During tuning, the
parameters were varied as follows: $t_w$ and $t_f$ ranged from 5 to 100 and $t_t$ was
varied such that $\theta_1$ varied from 10° to 180°.
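As an illustration of equation (3), the sketch below samples a set of 8 Gabor channels (4 widths, 2 orientations, 1 phase). It takes the width, frequency, orientation and phase directly as inputs rather than deriving them from the tuning factors. The grid size, the widths and the choice f = 1/w are illustrative assumptions, not the values used in this work.

```python
import math

def gabor(x, y, w, f, theta, phi):
    """Equation (3) evaluated at (x, y); theta and phi are in radians."""
    envelope = math.exp(-4.0 * math.log(2.0) * (x * x + y * y) / (w * w))
    carrier = math.cos(2.0 * math.pi * f * (x * math.cos(theta) + y * math.sin(theta)) + phi)
    return envelope * carrier

def gabor_channel(size, w, f, theta, phi=0.0):
    """Sample one channel on a size x size grid centred on the middle pixel."""
    c = (size - 1) / 2.0
    return [[gabor(col - c, row - c, w, f, theta, phi) for col in range(size)]
            for row in range(size)]

# Illustrative values only: 4 widths x 2 orientations x 1 phase = 8 channels.
widths = [4.0, 8.0, 16.0, 32.0]
thetas = [0.0, math.pi / 2.0]
channels = [gabor_channel(33, w, 1.0 / w, th) for w in widths for th in thetas]
```

With this envelope the Gaussian falls to half its peak value at a radius of w/2, so tying the frequency to the width keeps only the central lobe of the cosine prominent.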
DOG channels are radially symmetric channel sensitivity functions, formed by
subtracting two Gaussian functions with different standard deviations (figure
4.2). They are defined as a function of radial frequency (pixels⁻¹) (Abbey &
Barrett, 2001):

$$C_j(\rho) = \exp\left[-\frac{1}{2}\left(\frac{\rho}{Q\sigma_j}\right)^2\right] - \exp\left[-\frac{1}{2}\left(\frac{\rho}{\sigma_j}\right)^2\right] \tag{4}$$
with $\sigma_j = \sigma_0 \alpha^j$, the standard deviation of each channel. After generation in
frequency space, an inverse Fourier transform is applied, yielding a real space
set of channels. The first 8 DOG channels are generated by varying j from 0 to 7.
There are three channel parameters: 𝑄 defines the channel bandwidth, 𝜎0 is the
standard deviation of the first channel and 𝛼 sets the ratio between the
standard deviations of successive channels. During tuning, 𝜎0 ranged from 0.002
to 0.02 while 𝛼 and 𝑄 were varied from 1.1 to 3.0.
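A minimal sketch of the DOG profile in equation (4) is given here, evaluated in the frequency domain only. The inverse Fourier transform to real space is not shown. The frequency sampling step of 0.01 is an arbitrary choice for illustration, and the parameter values are those listed for the DOG channels in Table 3.

```python
import math

def dog(rho, sigma_j, q):
    """Equation (4): radial frequency profile of one DOG channel."""
    wide = math.exp(-0.5 * (rho / (q * sigma_j)) ** 2)
    narrow = math.exp(-0.5 * (rho / sigma_j) ** 2)
    return wide - narrow

def dog_sigmas(sigma0, alpha, n_channels=8):
    """Standard deviations sigma_j = sigma0 * alpha**j for j = 0 .. n_channels - 1."""
    return [sigma0 * alpha ** j for j in range(n_channels)]

sigmas = dog_sigmas(sigma0=0.016, alpha=1.4)
# Profile of the first channel on an arbitrary frequency grid (illustrative).
profile = [dog(0.01 * i, sigmas[0], q=2.4) for i in range(50)]
```

Each profile is zero at zero frequency and non-negative for Q > 1, so every channel acts as a band-pass filter whose pass band moves to higher frequencies as j increases.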
As with DOG channels, LG channels are rotationally symmetric, and are formed
in the spatial domain as the product of Laguerre polynomials and Gaussian
functions (Gallas & Barrett, 2003) (figure 4.3):
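$$C_j(r) = \frac{\sqrt{2}}{a_u} \exp\left(-\frac{\pi r^2}{a_u^2}\right) L_j\left(\frac{2\pi r^2}{a_u^2}\right) \tag{5}$$
where $a_u = \sqrt{2\pi}\,\sigma_u$, with $\sigma_u$ the standard deviation of the Gaussian. $L_j$ is the Laguerre
polynomial, defined as:
$$L_j(x) = \sum_{k=0}^{j} (-1)^k \binom{j}{k} \frac{x^k}{k!} \tag{6}$$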
For this channel type, only the standard deviation factor 𝜎𝑢 was tuned for the
purpose of matching observer performance, with values ranging from 3 to 100.
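Equations (5) and (6) translate directly into code. The sketch below is illustrative; the function names are not from this work.

```python
import math

def laguerre(j, x):
    """Equation (6): Laguerre polynomial of order j."""
    return sum((-1) ** k * math.comb(j, k) * x ** k / math.factorial(k)
               for k in range(j + 1))

def lg_channel(r, j, sigma_u):
    """Equation (5): value of the j-th Laguerre-Gauss channel at radius r."""
    a_u = math.sqrt(2.0 * math.pi) * sigma_u
    g = 2.0 * math.pi * r * r / (a_u * a_u)
    # exp(-pi r^2 / a_u^2) equals exp(-g / 2)
    return (math.sqrt(2.0) / a_u) * math.exp(-g / 2.0) * laguerre(j, g)

# Radial profiles of the first 8 channels on an illustrative grid of radii.
profiles = [[lg_channel(float(r), j, sigma_u=14.0) for r in range(64)] for j in range(8)]
```

Because every channel depends on r only, the set is rotationally symmetric, and a single parameter (the Gaussian standard deviation) controls the spatial extent of all channels.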
First, rigorous tuning was performed against the human observer results using the
images acquired at the AEC dose level. For the three channel types, each tuning parameter was
varied while the other parameters were held constant. The PC for the CHO was
calculated for all five mass lesion models and compared against those for the
averaged human reader data by applying the evaluation criteria in Table 1. The
three evaluation results generated by each set of channel parameters (for a given
channel type) were tabulated and the best performing parameter sets were
initially selected by requiring the absolute ME to be lower than 5 PC, a linear
slope (a) between 0.9 and 1.1 and a correlation (r) greater than 0.9. Out of this
initial selection, the channel parameter set that had the best overall evaluation
score for all three test scores was then selected for that channel type and used
for the further studies. In so doing, we systematically covered a wide range of
channel settings; however, it is recognized that not all combinations in the
parameter space could be tested.

Table 1. Criteria used to assess CHO performance
Criterion | Description
Mean error (ME) | Measure of the distance between the MO scores and HR scores. If the ME value is positive, the MO underestimates the human results; if negative, the MO overestimates the human observer scores. A value closer to 0 is desired. $ME = \frac{\sum_{i=1}^{n} (y_i - x_i)}{n}$ (7), where $y_i$ and $x_i$ are the HR and MO scores respectively and $n$ is the number of lesion sizes.
Linear regression slope (a) | Object size dependency of the MO versus that of the human observer. If lower than 1, the effect of size is less pronounced for the MO than for human observers and vice versa. A value closer to 1 for the linear regression slope is desired for the MO.
Pearson correlation coefficient (r) | Linear relation between the two observers, where the better observer has an r value closer to 1.
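The three criteria can be computed as in the following sketch. Two points are assumptions: the slope is taken from an ordinary least squares fit of MO scores on HR scores, and the PC values are invented for illustration. The thresholds in `passes` are the initial selection limits quoted in the text.

```python
import math

def evaluate(hr, mo):
    """Return (ME, slope a, Pearson r) for paired HR and MO scores."""
    n = len(hr)
    me = sum(y - x for y, x in zip(hr, mo)) / n          # equation (7)
    mean_hr, mean_mo = sum(hr) / n, sum(mo) / n
    s_hh = sum((y - mean_hr) ** 2 for y in hr)
    s_mm = sum((x - mean_mo) ** 2 for x in mo)
    s_hm = sum((y - mean_hr) * (x - mean_mo) for y, x in zip(hr, mo))
    slope = s_hm / s_hh                                  # MO regressed on HR (assumed)
    r = s_hm / math.sqrt(s_hh * s_mm)
    return me, slope, r

def passes(me, slope, r):
    """Initial selection: |ME| < 5 PC, 0.9 <= a <= 1.1, r > 0.9."""
    return abs(me) < 5.0 and 0.9 <= slope <= 1.1 and r > 0.9

# Invented PC values for five lesion sizes (not data from this study).
hr = [40.0, 55.0, 70.0, 85.0, 95.0]
mo = [42.0, 56.0, 72.0, 86.0, 96.0]
me, slope, r = evaluate(hr, mo)
```

In this invented example the MO scores slightly higher than the human readers at every size, so the ME is negative.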

5. CHO training
Training the CHO requires the definition of the (expected) signal in the form of a
template that is applied to the images and this section examines how different
signal templates influence the CHO scores. The training phase also builds the
covariance matrix and therefore we also examine the number of DBT phantom
acquisitions needed for the implementation of a robust and practically
achievable CHO. In this context, the number of training acquisitions can be
reduced if some of the images used in the reading stage are also used for CHO
training. This adds bias to the CHO, which could influence the results, and therefore
this section also examines CHO training with datasets that have a range of bias
levels.
The CHO template is given by the following formula (Gallas & Barrett, 2003):
$$w_{CHO}^T = (\bar{v}_{sp} - \bar{v}_{sa})^T K_v^{-1} = s^T K_v^{-1} \tag{8}$$
This can be split into two parts: (1) the subtraction of mean signal present and
mean signal absent channel output vectors and (2) the covariance matrix of the
ensembles. Both parts are determined in the training phase.
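Equation (8) can be sketched with the standard library alone. The example assumes that the covariance matrix is estimated from signal absent channel outputs only, uses a toy 2-channel data set in place of real channel outputs, and solves the linear system instead of forming the inverse explicitly. All names are illustrative.

```python
def mean_vec(vs):
    """Element-wise mean of a list of channel output vectors."""
    n = len(vs)
    return [sum(v[c] for v in vs) / n for c in range(len(vs[0]))]

def covariance(vs):
    """Unbiased sample covariance matrix of a list of channel output vectors."""
    m = mean_vec(vs)
    n, d = len(vs), len(m)
    return [[sum((v[a] - m[a]) * (v[b] - m[b]) for v in vs) / (n - 1)
             for b in range(d)] for a in range(d)]

def solve(K, s):
    """Solve K w = s by Gaussian elimination with partial pivoting."""
    d = len(s)
    A = [row[:] + [s[i]] for i, row in enumerate(K)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, d):
            f = A[r][c] / A[c][c]
            for k in range(c, d + 1):
                A[r][k] -= f * A[c][k]
    w = [0.0] * d
    for r in range(d - 1, -1, -1):
        w[r] = (A[r][d] - sum(A[r][k] * w[k] for k in range(r + 1, d))) / A[r][r]
    return w

def cho_template(v_sp, v_sa):
    """w = K_v^{-1} s, which equals equation (8) because K_v is symmetric."""
    s = [a - b for a, b in zip(mean_vec(v_sp), mean_vec(v_sa))]
    return solve(covariance(v_sa), s)

def decision_variable(w, v):
    """Scalar test statistic: inner product of template and channel outputs."""
    return sum(wi * vi for wi, vi in zip(w, v))

# Toy 2-channel data (invented): the signal shifts channel 0 only.
v_sa = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
v_sp = [[3.0, 1.0], [5.0, 1.0]]
w = cho_template(v_sp, v_sa)
t = decision_variable(w, [4.0, 1.0])
```

In the toy data the template has weight on channel 0 only, since that is the only channel whose mean differs between signal present and signal absent.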

Figure 5. Images of the studied ‘expected signal’ models: a). signal estimated from
training images; b). Central slice of the mass model; c). Maximum Intensity
Projection of the mass model in z-direction; d). DBT scan of the physical mass
model; e). Gaussian blob; f). Small Landolt C; g). Large Landolt C; h). Rectangle.
The standard way of including the expected signal is from image sets with known
truth, often acquired in low noise (high dose) conditions (Harrison H Barrett et
al., 2004). Nevertheless, the signal template can also be provided in other ways
(Michielsen et al., 2016; Park et al., 2010) and therefore the influence of signal
template on CHO performance was explored for seven alternative templates,
listed in table 2 and illustrated in figure 5. Templates f), g) and h) are only weakly
related to the expected signal content but were included to assess the sensitivity
of the CHO to the template choice (variation in PC and scale of the scores) and
potential template mismatches. Note that all are 2D templates (Platiša et al.,
2011). Again, the criteria in table 1 were applied and the expected signal
estimation method with the highest score was used in further studies.
The covariance matrix in the template characterizes correlations between the
channel output values, with the diagonal giving the variances of the channel
output values. If the channel output vector v is multiplied by $K_v^{-1/2}$, where $K_v$ has
already been estimated from vectors with the same statistics as v, the covariance
matrix of their product is the identity matrix, i.e. $cov(y) = cov(K_v^{-1/2} v) = I$. This
means that the fluctuations within y have only white noise properties. In the case
of the CHO, the pre-whitening is needed for both expected signal s and the
channel output vector v, as both their backgrounds are strongly correlated. Using
a CHO in correlated backgrounds requires two different estimates of the
covariance matrix – one for signal present and one for signal absent (Harrison H
Barrett et al., 2004). For this study, this was not feasible due to the limited
number of signal present images that can be derived from a single DBT
acquisition and therefore the covariance matrix was only estimated from the
signal absent training images.
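The pre-whitening property quoted above follows in one line, assuming that $K_v$ is the true covariance of v and is symmetric positive definite, so that $K_v^{-1/2}$ exists and is itself symmetric:

```latex
\begin{aligned}
y &= K_v^{-1/2}\, v \\
\operatorname{cov}(y)
  &= E\!\left[(y - \bar{y})(y - \bar{y})^T\right]
   = K_v^{-1/2}\, E\!\left[(v - \bar{v})(v - \bar{v})^T\right] K_v^{-1/2} \\
  &= K_v^{-1/2}\, K_v\, K_v^{-1/2} = I
\end{aligned}
```

When $K_v$ is only an estimate from a finite number of training vectors, the result holds approximately, which is one reason why the number of training images matters.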
The number of elements in the CHO covariance matrix is equal to the number of
channels squared. This is crucial for the inversion, as the inverse does not exist
unless the number of training channel output vectors exceeds the number of
vector elements, i.e. channels. For CHO implementations, this is easily achievable
but stable results require more than just the bare minimum and therefore the
number for a robust estimate was investigated.
To form the observer template, signal present training images were used to
estimate the expected signal present channel output vector, while signal absent
training images were used to form both the expected signal absent channel
output vector and the covariance matrix. In order to estimate the minimum
number of training images needed to derive CHO observer results similar to the
HR ground truth, the observer performance was studied in different conditions
of training, with images extracted from 2 to 48 DBT acquisitions. From each
acquisition, 3 signal present training images for each mass model diameter (the
central 3 slices from a signal present VOI) and 75 signal absent training images
(the central 5 slices at 15 positions) were extracted. Training with the maximum
available number of images was defined as the ground truth (48 DBT
acquisitions giving 144 signal present and 2160 signal absent images).
Table 2. Methods used for template formation.
Template method | Figure
Signal estimated from training images. Expected signal for a given lesion diameter estimated from 144 signal present VOIs, using 3 central (adjacent) planes. | 5a
Central slice of the mass model, taken from the binary 3D printing stereo lithography (STL) file of the mass models, scaled and rotated to match the physical mass model in the phantom. | 5b
Maximum intensity projection (MIP) of the 3D mass model (STL file) in z-direction. | 5c
DBT scan of just the physical mass models (i.e. no spheres or water): template is the central plane of the reconstructed mass model. | 5d
Gaussian blob with FWHM set to match the average mass diameter measured in images of the phantom with no spheres, i.e. background free DBT acquisitions. | 5e
Landolt C (Schrauf & Stern, 2001). Outer diameter set to insert diameter for a given lesion. | 5f
Landolt C (Schrauf & Stern, 2001). Inner diameter set to insert diameter for a given lesion. | 5g
Rectangle with height equal to lesion diameter. | 5h

The lower limit for training images was defined as the minimum number of DBT
scans needed to achieve CHO performance within the 95% confidence interval
(CI) of the ground truth PC results. The number of DBT acquisitions required was
then averaged over the 5 lesion sizes and this number of training images was
used to form the observer template for the following studies.
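One way to express this lower limit is sketched here. The PC values and confidence limits are invented for illustration, and the rule "smallest number of scans from which every larger number also stays inside the CI" is an assumed reading of the definition.

```python
def min_scans_within_ci(pc_by_scans, ci_low, ci_high):
    """Smallest number of training scans whose PC, and that of all larger
    numbers of scans, lies inside the ground truth confidence interval."""
    needed = None
    for n in sorted(pc_by_scans, reverse=True):
        if ci_low <= pc_by_scans[n] <= ci_high:
            needed = n
        else:
            break  # first number of scans (from the top) that falls outside the CI
    return needed

# Invented PC values per number of training DBT acquisitions.
pc_by_scans = {2: 60.0, 4: 63.0, 8: 71.0, 12: 74.0, 24: 75.0, 48: 75.5}
lower_limit = min_scans_within_ci(pc_by_scans, ci_low=72.0, ci_high=79.0)
```

Scanning downwards from the largest training set avoids accepting a small training set that falls inside the CI by chance while intermediate sizes do not.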
Given the limited number of images available for training and reading, these two
image datasets are often mixed. Using the same dataset for both training and
reading introduces bias into the results – ideally a different image dataset should
be used for reading. In the literature there are two methods to train an observer
(Gallas, 2003) – the holdout method and the re-substitution method. In order to
study how bias influences the observer results and what constitutes an
acceptable level of bias, 24 DBT acquisitions taken at the standard dose were
divided into two equal datasets for training and reading (Harrison H Barrett et
al., 2004). Bias was varied from 0% to 100% in steps of 1/12, with initially 0%
bias meaning that the two sets consisted of unique scans of the phantom – this is
the so-called ‘holdout’ method. Then, in successive (1/12) steps, training DBT
acquisitions were substituted by reading DBT acquisitions until the CHO only
used reading images for both training and reading. This is termed the ‘re-
substitution’ method and has 100% bias. The mean error for each biased MO
result from the respective non-biased result was calculated, and the highest bias
percentage that had a performance within the 95% CI of the non-biased results
was defined as the ‘permitted’ bias percentage. In order to explore the influence
of dose (noise) level on this bias estimate, this process was then repeated for the
low and high dose acquisition datasets.
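The substitution scheme can be sketched as follows. Acquisition identifiers are hypothetical, and the choice of which training acquisitions are replaced first is an assumption, since the text only specifies the number replaced at each step.

```python
def training_set(train_ids, read_ids, k):
    """Replace the first k training acquisitions with reading acquisitions."""
    if len(train_ids) != len(read_ids) or not 0 <= k <= len(train_ids):
        raise ValueError("k must lie between 0 and the number of training acquisitions")
    return read_ids[:k] + train_ids[k:]

def bias_fraction(train, read_ids):
    """Fraction of training acquisitions that are also used for reading."""
    return len(set(train) & set(read_ids)) / len(train)

# Hypothetical identifiers for 12 training and 12 reading acquisitions.
train_ids = [f"T{i:02d}" for i in range(12)]
read_ids = [f"R{i:02d}" for i in range(12)]

# 13 bias levels: 0/12 (holdout) up to 12/12 (re-substitution).
levels = [training_set(train_ids, read_ids, k) for k in range(13)]
```

The size of the training set stays fixed at 12 acquisitions at every level, so any change in PC can be attributed to the overlap with the reading set and not to the amount of training data.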
If sound conclusions are to be drawn regarding image quality assessment of an
imaging system in a QC setting, then CHO reproducibility is crucial. This was
evaluated using 60 DBT acquisitions, split into 5 groups, each group then read by
the finalized CHO. The standard deviation from the 5 readings for each lesion was
taken as an estimate for the CHO reproducibility and compared to HR
reproducibility for the same 5 image groups. Finally, the potential use of the MO
was tested on a first practical application: the CHO was applied to the three sets
of DBT acquisitions made at low, AEC and high dose levels, to examine whether
the finalized CHO tracked HR performance as dose changed. The criteria in table
1 were used to compare the results of the CHO to the human readings.
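The reproducibility estimate reduces to a standard deviation over the five group results. In this sketch the PC values are invented, the split into consecutive groups is an assumption, and the sample (n - 1) standard deviation is used because the text does not say which form was applied.

```python
import statistics

def split_groups(items, n_groups):
    """Split a list into n_groups consecutive groups of equal size."""
    size = len(items) // n_groups
    return [items[i * size:(i + 1) * size] for i in range(n_groups)]

acquisitions = list(range(60))           # placeholder acquisition indices
groups = split_groups(acquisitions, 5)   # 5 groups of 12 acquisitions

# Invented PC results of the finalized CHO for one lesion size, one per group.
pcs = [71.0, 68.5, 74.0, 70.0, 66.5]
reproducibility_sd = statistics.stdev(pcs)
```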

RESULTS
1. Channel comparison
[Figure 6: three panels (Gabor channels, DOG channels, LG channels); x-axis: Human PC, %; y-axis: Model PC, %]

Figure 6. Graphs of CHO performance plotted against human observer performance.
The three graphs represent the results of the best tuning parameter set of the three
channel types, listed in Table 3.
After channel tuning for the best combination of correlation (r), slope and ME
against human observer, the MO performance of each channel type with the best
tuning parameters is shown in figure 6. The corresponding tuning parameter
values are listed in Table 3, with the Gabor channels giving the best overall score
for the lesion targets in this phantom.

Table 3. Overview of the final mean error (ME), linear regression slope (a) and
Pearson correlation (r) between human observer and CHO for the final parameter
set. The 95% CIs for the slope and correlation are shown in brackets.
Channel type | Channel parameters | ME | a | r
Gabor | $t_t = 1.2$, $t_f = 65$, $t_w = 15$ | -1.75 | 0.99 (0.50 – 1.48) | 0.967 (0.558 – 0.998)
DOG | $\sigma_0 = 0.016$, $\alpha = 1.4$, $Q = 2.4$ | -3.87 | 0.99 (0.47 – 1.52) | 0.961 (0.517 – 0.998)
LG | $\sigma_u = 14$ | -3.27 | 1.20 (0.59 – 1.80) | 0.964 (0.548 – 0.998)

2. Channel tuning parameters

Figure 7. CHO performance versus channel tuning parameter value. The varied
parameter is indicated above each graph (equations 3, 4 and 5), where the other
tuning parameters were kept constant at the values indicated by the vertical
dashed line. The combination of tuning parameters specified by the dashed lines
forms the channels described in Table 3 and produces the observer results shown in
figure 6.

Figure 7 shows CHO performance as channel parameters are systematically
varied for the three different channel types and separately for all lesion sizes. In
this figure, one parameter is varied while the remaining parameters are fixed at
the values in Table 3. For Gabor and LG channels, a large tuning range is observed,
whereas DOG performance is only influenced by the parameter 𝜎0 and not by α
or Q. Vertical lines indicate the parameters that give the closest match between
CHO and human observer performance (Figure 7 and Table 3).
3. Expected signal
[Figure 8: bar chart titled "Signal model"; x-axis: Expected signal (Human observer, Trained, Gaussian, Empty phantom, Central STL slice, Summed STL, C-in, Rectangle, C-out); y-axis: PC, %; legend: mass diameters 1.5 mm, 2.1 mm, 3.0 mm, 4.3 mm, 5.7 mm]

Figure 8. Human observer performance along with CHO performances using
different estimations of the expected signal, ordered from highest to lowest
correlation with HR results
Table 4. Influence of signal template method on CHO performance
Template | ME | a | r
Trained CHO | -1.35 | 1.03 (0.50 – 1.56) | 0.963 (0.537 – 0.998)
Gaussian blob | -1.61 | 0.72 (0.56 – 0.88) | 0.993 (0.892 – 0.999)
Empty phantom | -2.40 | 0.77 (0.31 – 1.23) | 0.951 (0.425 – 0.997)
Central slice | -4.50 | 0.67 (0.15 – 1.19) | 0.922 (0.210 – 0.995)
Summed | -2.97 | 0.48 (-0.12 – 1.07) | 0.827 (-0.205 – 0.989)
C-small | 17.70 | 1.02 (-0.39 – 1.20) | 0.685 (-0.499 – 0.977)
Rectangle | 31.94 | 0.01 (-1.89 – 1.91) | 0.008 (-0.881 – 0.884)
C-large | 51.44 | -0.52 (-1.52 – 0.47) | -0.695 (-0.978 – 0.485)

The influence of signal template on PC for the Gabor CHO performance is plotted
in Figure 8 for the different expected signal templates studied. The first set of
results represents the HR scores (gold standard). Table 4 presents the evaluation
results for each CHO; best match is given by estimating the expected signal from
a training set of images i.e. what could be considered the standard method
(Harrison H Barrett et al., 2004). This is followed by using a Gaussian ‘blob’ with
FWHM equal to the mass diameter. The observer templates used for the
upcoming tests were thus produced from the training images.
4. Training images
[Figure 9: five panels, one per mass size (1.5 mm, 2.1 mm, 3.0 mm, 4.3 mm, 5.7 mm); x-axis: No of DBT scans; y-axis: PC, %]

Figure 9. Gabor channel CHO performance for the 5 masses as a function of the
number of training DBT acquisitions. The dotted horizontal lines represent the 95%
CI for the gold standard CHO trained with 48 DBT scans. The dashed vertical line
points to the highest number of acquisitions producing the CHO performance
outside the 95% CI of the gold standard.
Figure 9 shows the result of changing the number of DBT scans, used to train the
CHO template, from 2 to 48. The dashed vertical lines in this figure show the
points at which the PC moves outside the 95% CI of the ground truth, as the
number of DBT training scans is reduced from 48. The overall number of DBT
scans or acquisitions was found by averaging over data for the different lesions
(Table 5). As a result of this analysis, further studies with this CHO/phantom
combination will be performed with 12 training DBT acquisitions.
Table 5. The number of training DBT stacks and the number of signal present and
signal absent ROIs cropped from them, forming a CHO template, which produces
observer results inside the 95% CI of the gold standard CHO results.
Lesion diameter [mm] | 1.5 | 2.1 | 3.0 | 4.3 | 5.7 | Overall
DBT scans | 20 | 4 | 4 | 12 | 18 | 11.6
Absent ROIs | 1500 | 300 | 300 | 900 | 1350 | 870
Present ROIs | 60 | 12 | 12 | 36 | 54 | 35
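The entries of Table 5 follow from the number of DBT scans and the number of ROIs cropped per scan (75 signal absent and 3 signal present per lesion diameter). A short check of this arithmetic, with variable names chosen for illustration:

```python
# DBT scans needed per lesion diameter [mm], as listed in Table 5.
scans = {1.5: 20, 2.1: 4, 3.0: 4, 4.3: 12, 5.7: 18}

SA_PER_SCAN = 75  # signal absent ROIs per acquisition (5 slices x 15 positions)
SP_PER_SCAN = 3   # signal present ROIs per acquisition and lesion diameter

absent = {d: n * SA_PER_SCAN for d, n in scans.items()}
present = {d: n * SP_PER_SCAN for d, n in scans.items()}

overall_scans = sum(scans.values()) / len(scans)
overall_absent = sum(absent.values()) / len(absent)
overall_present = sum(present.values()) / len(present)
```

The averages come out at 11.6 scans, 870 signal absent ROIs and 34.8 signal present ROIs, the last of which is listed as 35 in the table.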

5. Bias

Figure 10. a). Low dose level; b). AEC dose level; c). High dose level. Bias ratio of
Gabor channel CHO readings with result outside the 95% CI of the non-biased data,
evaluated for different mass sizes. Horizontal dotted lines represent 95% CI of the
non-biased data and the dashed vertical line – the limit bias ratio.
Figure 10 shows the influence of changing bias, in 13 steps from 0% to 100%, on
the PC for the Gabor channel CHO, with the values ordered from 0% to 100%
bias. Table 6 gives the bias threshold ratio, defined as the fraction of biased
training images for which the MO results remain within the 95% CI of completely
non-biased training. Table 6 also shows the ME difference between 100% biased
and non-biased MO results.
As a result of this tuning and training procedure, the CHO with parameters
summarized in Table 7 was found as the finalized CHO for the structured
phantom used with the Siemens Inspiration DBT system.

Table 6. Bias threshold and ME results between 0% and 100% biased model
observer results.
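Dose level | Lesion diameter (mm) | Bias threshold | ME 0-100
Low | 1.5 | 5/12 | 10.1
Low | 2.1 | 12/12 | 2.57
Low | 3.0 | 10/12 | -1.7
Low | 4.3 | 11/12 | 5.4
Low | 5.7 | 12/12 | 0.6
AEC | 1.5 | 4/12 | 21.6
AEC | 2.1 | 12/12 | 0.9
AEC | 3.0 | 12/12 | 0.6
AEC | 4.3 | 5/12 | -0.8
AEC | 5.7 | 12/12 | 0.6
High | 1.5 | 7/12 | 14.7
High | 2.1 | 12/12 | -1.0
High | 3.0 | 12/12 | -2.71
High | 4.3 | 12/12 | 0.5
High | 5.7 | 12/12 | 0.4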

Table 7. CHO formation method with best overall performance against human
observer scores from structured phantom scanned on Siemens.
CHO component | Channels | Expected signal | Training images | Bias
Trained Gabor channels | $t_t = 1.2$, $t_f = 65$, $t_w = 15$ | From acquired (training) images | Cropped from 12 DBT acquisitions | Separate DBT acquisitions for training and reading. Some bias not detrimental.

6. Reproducibility
[Figure 11: two panels (CHO reproducibility, Human observer reproducibility); x-axis: Lesion size [mm]; y-axis: PC, %]

Figure 11. Reproducibility of the Gabor channel MO (left) and human observers
(right). The longer line for each mass size in both graphs represents the mean PC
results and the two smaller lines, the upper and lower limit of the standard
deviation.
Figure 11 shows the mean and the standard deviation of the PC data from the
reproducibility study, indicating comparable reproducibility for both observers.
This figure shows slightly lower PC for the CHO for the 4.3 mm diameter mass,
relative to the HR data. It is possible that this is a result of the final tuning
parameters used for the Gabor channels. Table 8 lists the standard deviation of
the PC results from the five observations (for both model observer and human
reading) and for all lesion sizes (units of percentage correct).
Table 8. Standard deviation of the CHO and human observer performance in the
reproducibility study.
Standard deviation (PC)
Lesion diameter | 1.5 mm | 2.1 mm | 3.0 mm | 4.3 mm | 5.7 mm
Model observer | 5.3 | 3.2 | 2.5 | 2.9 | 2.3
Human reader | 7.3 | 5.1 | 1.9 | 2.3 | 0.6

7. Influence of dose level
[Figure 12: (a) Observer correlation: Model PC, % versus Human PC, % for the Low (r = 0.95), AEC (r = 0.99) and High (r = 0.99) dose levels; (b) Observer performance for 3.0 mm lesion: Percentage correct, % versus Dose level (Low, AEC, High) for the human observer and the model observer]

Figure 12. Observer performance comparison for the three dose levels. a) presents
the CHO scores against Human observer readings, with the correlation coefficients
shown in the legend along with the dose level indication. b) PC scores for the 3.0
mm mass model for both observers versus dose level.
The results of applying the finalized CHO (Table 7) to the DBT scans acquired at
low, AEC and high dose levels are compared to HR data in Figure 12a. Average
PC scores for human observer and CHO are plotted versus dose level in Figure
12b for the 3.0 mm diameter lesion. The evaluation criteria applied to these
data sets are listed in Table 9.
Table 9. CHO evaluation criteria for the dose levels study.
Dose level | ME | a | r
Low | -1.07 | 0.92 (0.36 – 1.49) | 0.948 (0.407 – 0.997)
AEC | -1.75 | 0.99 (0.50 – 1.48) | 0.967 (0.558 – 0.998)
High | 3.68 | 0.99 (0.65 – 1.34) | 0.983 (0.752 – 0.999)

DISCUSSION
The impetus behind this study was the design, tuning, validation and
documentation of a CHO for use with a 3D structured test object in DBT. Although
this test object had been shown to produce reliable results when read with
human readers (Cockmartin et al., 2017), the use of the phantom for routine QC
purposes is expected to benefit from more objective methods such as MO
methods. This has been the case for the CDMAM test object, where the
availability of automatic readout via CDCOM has greatly facilitated the use of
this phantom in routine QA of FFDM systems (Karssemeijer & Thijssen, 1996).

Three main aspects of CHO design were studied: channel type and associated
parameters, the expected signal template and the covariance matrix. This CHO
design phase was then completed with a study on the required number of test
images and the amount of acceptable bias. As a start, channel selection examined
three channel types, namely Gabor, DOG and LG, and assessed the influence of
channel tuning on CHO performance. After tuning, all three channel types
tracked human observer results with good accuracy. The number of channels for
each channel type was fixed to 8, as preliminary results showed that using a higher
number of channels generally requires a substantially higher number of training
images. In addition, a higher number of channels increases the CHO performance, as
investigated by Castella et al. (C Castella et al., 2007), which could lead to poorer
correlation with human observer scores. Thus, 8 channels were chosen as a good
compromise between observer performance and the required number of training
images. The CHO with Gabor channels was selected for further investigation, as
this gave the closest correspondence to the human reader scores. DOG channels
gave similar slope and Pearson correlation; however, the absolute mean error was lower
for the Gabor set. With regard to channel tuning, the DOG channels showed a
rather low tuning potential: only the initial channel width had an impact on the
model observer performance. This lack of tuning ability would mean that where
CHO and human reading results differed, one would probably be forced to
implement some form of internal noise method, rather than using tuning to
match performance (Brankov, 2013).
For Gabor channels, where orientation, width and frequency can be tuned,
certain orientations (controlled by the parameter θ) gave notably higher
detectability. This may be related to the particular shape of the masses in the
phantom, especially for the ‘trained’ template, formed from many images. The
main aim of the lesion model was to represent a real mass (Shaheen et al., 2014)
and there was therefore no attempt to select a more isotropic mass model. Only
one shape of mass-like lesion was used (with different dimensions) and the
models were carefully glued to have a similar orientation. While a change in the
mass-like lesions used in the phantom is not anticipated, the use of Gabor
channels means that the CHO could probably be adapted to new mass lesion
types if required. At higher channel frequencies (𝑡𝑓 ) , there was a fall in the
detectability for larger lesions, as the channels start to exclude parts of the target.
A drop in detectability was also seen for smaller diameter masses as width (𝑡𝑤 )
was increased, due to the larger background area around the targets, included in
the computation of the decision variable. The response (i.e. in terms of PC) of
Laguerre-Gauss channels was also found to be sensitive to changes in the
channel parameters. Increasing the initial channel width reduced the
detectability score for the masses with diameter 3.0 mm, 2.1 mm and 1.5 mm, as
expected.
Of equal importance is the training phase and generation of the template.
Expected signal approximations ranging from visually good estimates, expected
to give strong performance, to objects which were clearly not related to the
targets, were examined. The template built from signal present/signal absent
training images gave the closest match to the HR data (table 4), supporting the
standard approach to CHO template formation (Harrison H Barrett et al., 2004).
This was followed by the Gaussian blob signal, which somewhat surprisingly
outperformed the a-priori signal methods such as using a slice through the STL
file or a summed projection of the STL file. This raises the question of which
template should be used for (comparative) performance testing. This could be
something close to a physical version of the signal (i.e. the input), or the signal as
rendered by the imaging system (i.e. the output). As expected, the signal template
choice has strong impact on CHO performance. Significant correlation (p<0.026)
was only found for the template trained from signal present/signal absent
images, the Gaussian blob, the empty phantom and the central STL file slice signal
estimation methods. The remaining signal templates (the MIP, rectangle and
Landolt C) gave some idea of the CHO sensitivity to mismatched templates.
Poorest performance occurred for the large Landolt C (C-large) (r=-0.69), whose
diameter excluded the majority of expected image signal, but was sensitive to a
region around the target. The results suggest that if only a limited number of
images can be acquired for practical reasons, a Gaussian blob
might be a good candidate for the observer template.
The number of images used for CHO training depends (theoretically) on the
minimum number required and (practically) on the number of feasible
acquisitions for a phantom study such as this. This is in contrast to CHO studies
using simulated images, where computational resources are generally the
limiting factors (Park et al., 2016; Young et al., 2009). As discussed, training
images may also be needed for signal template generation. The training for the
covariance matrix is crucial for inversion and for the task of pre-whitening,
where insufficient images give an unreliable covariance estimate or even a
singular (non-invertible) covariance matrix. For the 3D structured phantom
background in this work, the smallest number of training images needed to
achieve observer performance with a CHO using 8 Gabor channels within 95%
CI of the ground truth PC results was 12 DBT acquisitions. This gave 900 signal
absent VOIs with in plane dimensions of 236x236 pixels and 5 adjacent planes
(20x20x5 mm³) and 36 signal present VOIs with in plane dimensions of 236x236
pixels and 3 adjacent planes (20x20x3 mm³). This means that for a full reading
of the phantom, 24 DBT acquisitions (12 for training and 12 for reading) are
enough to produce observer results not significantly different from an observer
trained with images cropped from 48 DBT acquisitions and applied on a separate
set of 12 DBT acquisitions for reading. The results in figure 9 show the somewhat
surprising trend that the number of images used for training has only a small
influence on the PC value for most mass diameters, although there was some
influence on PC for the 2.1mm and 3.0mm diameter masses. To examine this
further, the training study was repeated using the “central slice” expected signal
(Figure 5b) instead of the trained expected signal (acquired from many images)
(Figure 5a). It was thought this would give some insight as to whether it was the
signal template definition or the covariance estimation that was responsible for
the trend seen. The results were closer to the expected behavior (i.e. there was
an increased influence of number of training images), and showed that a greater
number of training images would be required. This suggests that the trained
expected signal may provide more information on the task to the final observer
template than the central slice signal template, thus requiring fewer training
images. This will be investigated in future work. Another limitation in the
training study is the method of extraction of the signal absent images. As seen in
figure 2.a, the signal absent ROIs overlap by 35%, resulting in cropped ROIs that
are not completely independent from one another and this can introduce a
degree of bias in the training process. Extracting ROIs with 0% overlap would
increase the number of acquisitions required to train the model observer, which
could substantially prolong the scanning time. Acquisition of 24 scans typically
takes between 30 and 45 minutes on current DBT systems. This precludes the
use of such a test as part of daily QC, but is certainly practically feasible for
physicists performing acceptance and yearly QC tests. The number of VOIs
generated per DBT acquisition could be increased without a time penalty if a
larger phantom were constructed and/or more signals were present in the
phantom.
Turning to the bias study, the largest impact was seen on the smallest mass
(1.5mm), where bias for the AEC dose level gave a ME equal to 21.6 PC. However,
for this mass, PC remained within the 95% CI until the bias ratio reached 5/12,
while for the other 4 mass diameters, ME is less than 1 % PC at all bias levels.
A similar trend is observed for the Low and High dose levels, where the highest
bias impact is seen for the smallest lesion, suggesting that the effect of dataset bias
does not depend on the dose level. Ideally, CHOs should use two different image sets
for training and reading, but for cases where this is not possible, these data show
that partly biased CHOs give acceptable results for the 3D structured phantom in
this work. The extent to which this holds for other targets and backgrounds
remains to be seen.
One potential application of the phantom used in this study is the routine
assessment of DBT system image quality. Once the image quality score is
established for some system baseline, the reproducibility determines the
smallest deviations that can be tracked. The current reproducibility study
showed that, when the same CHO is applied on another set of training and
reading images from the same modality, the observer performance results might
be slightly different. Nevertheless, the differences were very similar to what was
seen for human observers reading the same test sets. Neither the CHO nor the human
readers could reliably detect the smallest mass (1.5 mm diameter), where PC was
close to 50%. For the next smallest lesions (2.1 mm and 3.0 mm), the CHO had
higher reproducibility, while the spread in the human reader results was smallest
for the largest mass (5.0 mm). It is unclear why the CHO PC performance is
slightly lower for the 4.3 mm diameter mass. The CHO enables automated
reading of phantom images, which could be of considerable benefit
to physics QC services with limited staffing resources.
These steps resulted in a CHO that gave the closest correspondence with HR data
for the phantom, yet using a practical number of channels and an acceptable
minimal number of training images. The features of this CHO are summarized in
Table 7. Application of this CHO on the 3D structured phantom data acquired at
Low, AEC and High dose levels showed good agreement with human observer
results. The linear correlation coefficient was higher than 0.95, the slope larger than
0.92 and the ME less than 2% PC. As for the impact of dose on detectability of masses,
the CHO and human observers found the same overall result: the dose level has
a minimal impact on detectability of these mass-like lesions, a finding well known
for FFDM and which also appears to apply to DBT images of the targets and
background in the 3D structured phantom (Saunders et al., 2007; Timberg et al.,
2008).
The present study has a number of limitations. First, the choice was made to use
a ss-CHO (Platiša et al., 2011), applied over 5 adjacent planes, rather than a full
3D (volumetric) CHO. One could argue that a fully 3D CHO would offer a more
robust MO implementation, with access to the volumetric dataset and possibly
improved detection performance. Instead, three different channel functions
were explored with the ss-CHO and we were able to successfully match HR
performance using all three methods. Given that the aim of the study was to
implement a CHO for the evaluation of the 3D structured phantom for QA/QC
purposes, we consider the ss-CHO a simple and efficient approach. Further work
may examine the use of multi-slice or volumetric CHOs for the phantom
evaluation.
Second, the CHO algorithm was tuned and trained using DBT acquisitions made
on a single device (Siemens Inspiration, using the EMPIRE reconstruction
algorithm). Future work will investigate the applicability of this CHO to other
DBT systems and reconstruction algorithms. Differences in DBT scanning
parameters (angular range, anti-scatter grid use, reconstruction algorithms
(Sechopoulos, 2013b)) lead to strong differences in the appearance of DBT
images between the various systems. Whether this translates into differences in
task performance evaluated using a CHO is unclear, yet preliminary data (Petrov
et al., 2017) show that a CHO with Gabor channels generated in real space can be
a starting point in this investigation.
In the present analysis, medical physicists performed the human reading tests.
Compared to standard contrast-detail phantoms with homogeneous backgrounds
(Karssemeijer & Thijssen, 1996), the structured test object has more realistic
targets and background, closer to a radiologist task. While radiologists would
likely perform differently from medical physicists, we expect that this would give
a systematic offset, with minimal impact on scoring for QC purposes (Elangovan
et al., 2017). Differences in scoring between different groups of medical
physicists can be expected too. A 4AFC method was used for observer
performance evaluation. This was considered a good compromise between the
number of tests and the required statistical stability, as seen in similar studies in
the literature (Jäkel & Wichmann, 2006).
One could question whether a mass-like object is required when evaluating
imaging performance, given that some studies have shown a link between radiologist
scores using simulated realistic (calcification) lesions and technical test object
scores using sharp edged gold discs (Warren et al., 2012). Future developments
in DBT, for example the synthetic mammogram calculated from the
reconstructed stack, may use a-priori information on breast anatomy and search
for lesion-like objects in the volume, enhancing their appearance in the final
image. Purely technical (perhaps structure-less) test objects may fail to provide
an accurate assessment of system or algorithm performance in this case. If this
is pursued, then some decision as to the shape and type of the mass-model
lesions to be included must be made – clearly this needs to cover the range of
lesions relevant to breast imaging. Alternatively, the results from the different
template study suggest we may not expect large differences using cancer-like
targets with slightly different shapes, as the Gaussian blob gave promising
results when used as signal template. This could be seen as a kind of averaging
over a range of non-spiculated lesion orientations. Nevertheless, if the lesion
models are too far removed from reality, then there may be differences in
absolute performance for the human observers and, correspondingly, for the
developed CHO, as found in the study by Elangovan et al. (2018),
which was used as a starting point for the CHO development described in
chapter 5.
Finally, the CHO was only tested on one type of background i.e. that of the
structured phantom. In the future, we expect to extend the use of the CHO to
applications in virtual clinical trials (Maidment, 2014). This will require further
evaluation in real and simulated anthropomorphic backgrounds covering a
range of breast glandularities and lesion types. Whether the Gabor based CHO
derived here for the phantom background would prove successful is not known;
however, the systematic CHO design process laid out here should help in the
design of a robust model observer.

CONCLUSIONS
With reconstructed images that may use non-linear methods of image generation
and processing, most Fourier based techniques cannot be applied and thus CHOs
are a promising means of image evaluation, but only if carefully tuned and
trained. This study has shown that a CHO can be tuned, trained and applied with
acceptable reproducibility to DBT acquisitions of a physical phantom. The CHO
was successfully applied to sets of images acquired at different dose levels. The
systematic procedure outlined here should help in CHO development for other
tomosynthesis systems and for new applications.

REFERENCES
Abbey, C. K., & Barrett, H. H. (2001). Human- and model-observer performance
in ramp-spectrum noise: effects of regularization and object variability.
Journal of the Optical Society of America A, 18(3), 473–488.
Abdurahman, S., Dennerlein, F., Jerebko, A., Fieselmann, A., & Mertelmeier, T.
(2014). Optimizing high resolution reconstruction in digital breast
tomosynthesis using filtered back projection. Lecture Notes in Computer
Science (Including Subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), 8539 LNCS, 520–527.
Abdurahman, S., Jerebko, A., Mertelmeier, T., Lasser, T., & Navab, N. (2012). Out-
of-Plane Artifact Reduction in Tomosynthesis Based on Regression Modeling
and Outlier Detection (pp. 729–736). Springer, Berlin, Heidelberg.
Donzelli, A. (2012). The benefits and harms of breast cancer screening: an
independent review. The Lancet, 380(9855), 1778–1786.
Barrett, H. H., Yao, J., Rolland, J. P., & Myers, K. J. (1993). Model observers for
assessment of image quality. Proceedings of the National Academy of
Sciences of the United States of America, 90(21), 9758–9765.
Barrett, H. H., Myers, K. J., Hoeschen, C., Kupinski, M. A., & Little, M. P.
(2015). Task-based measures of image quality and their relation to
radiation dose and patient risk. Physics in Medicine and Biology, 60(2),
R1–75.
Barrett, H. H., Myers, K. J., & Rathee, S. (2004). Foundations of Image
Science. In Medical Physics.
Bochud, F., Abbey, C., & Eckstein, M. (1999). Statistical texture synthesis of
mammographic images with super-blob lumpy backgrounds. Optics
Express, 4(1), 33–42.

Bouwman, R. W., Goffi, M., van Engen, R. E., Broeders, M. J. M., Dance, D. R., Young,
K. C., & Veldkamp, W. J. H. (2017). Can the channelized Hotelling observer
including aspects of the human visual system predict human observer
performance in mammography? Physica Medica, 33, 95–105.
Brankov, J. G. (2013). Evaluation of the channelized Hotelling observer with an
internal-noise model in a train-test paradigm for cardiac SPECT defect
detection. Physics in Medicine and Biology, 58, 7159–7182.
Burgess, A. E., Jacobson, F. L., & Judy, P. F. (2001). Human observer detection
experiments with mammograms and power-law noise. Medical Physics, 28(4),
419–437.
Castella, C., Abbey, C. K., Eckstein, M. P., Verdun, F. R., Kinkel, K., & Bochud, F. O.
(2007). Human linear template with mammographic backgrounds
estimated with a genetic algorithm. Journal of the Optical Society of America
A: Optics and Image Science, and Vision, 24(12), B1–B12.
Castella, C., Eckstein, M. P., Abbey, C. K., Kinkel, K., Verdun, F. R., Saunders, R. S.,
Samei, E., & Bochud, F. O. (2009). Mass detection on mammograms:
influence of signal shape uncertainty on human and model observers.
Journal of the Optical Society of America A, 26(2), 425–436.
Castella, C., Kinkel, K., Descombes, F., Eckstein, M. P., Sottas, P.-E., Verdun, F.
R., & Bochud, F. O. (2008). Mammographic texture synthesis: second-
generation clustered lumpy backgrounds using a genetic algorithm. Optics
Express, 16(11), 7595.
Cockmartin, L., Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Ongeval, C.
Van, & Fredenberg, E. (2017). Design and application of a structured
phantom for detection performance comparison between breast
tomosynthesis and digital mammography. Physics in Medicine and Biology,
62(3), 15.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., & Wells, K. (2018). Lesion
detectability in 2D-mammography and digital breast tomosynthesis using
different targets and observers. Physics in Medicine and Biology, 63(9), 1–
15.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., & Wells, K. (2017). Using
non-specialist observers in 4AFC human observer studies (T. G. Flohr, J. Y. Lo,
& T. Gilat Schmidt (eds.); p. 1013256).
Ferreira, P., Baptista, M., Di Maria, S., & Vaz, P. (2016). Cancer risk estimation in
Digital Breast Tomosynthesis using GEANT4 Monte Carlo simulations and
voxel phantoms. Physica Medica, 32(5), 717–723.
Fetterly, K. A., & Favazza, C. P. (2016). Direct estimation and correction of bias
from temporally variable non-stationary noise in a channelized Hotelling
model observer. Physics in Medicine and Biology, 61(15), 5606–5620.
Gallas, B. D. (2003). Variance of the channelized-hotelling observer from a finite
number of trainers and testers (D. P. Chakraborty & E. A. Krupinski (eds.);
p. 100).
Gallas, B. D., & Barrett, H. H. (2003). Validating the use of channels to estimate
the ideal linear observer. Journal of the Optical Society of America. A, Optics,
Image Science, and Vision, 20(9), 1725–1738.
Glasziou, P., & Houssami, N. (2011). The evidence base for breast cancer
screening. Preventive Medicine, 53(3), 100–102.
Jäkel, F., & Wichmann, F. A. (2006). Spatial four-alternative forced-choice method
is the preferred psychophysical method for naïve observers. Journal of
Vision, 6(11), 1307–1322.
Karssemeijer, N., & Thijssen, M. A. O. (1996). Determination of contrast-detail
curves of mammography systems by automated image analysis. Elsevier,
115–160.
Maidment, A. D. A. (2014). Virtual clinical trials for the assessment of novel
breast screening modalities. Lecture Notes in Computer Science (Including
Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 8539 LNCS, 1–8.
Maldera, A., De Marco, P., Colombo, P. E., Origgi, D., & Torresin, A. (2017). Digital
breast tomosynthesis: Dose and image quality assessment. Physica Medica,
33, 56–67.
Marshall, N. W., & Bosmans, H. (2012). Measurements of system sharpness for
two digital breast tomosynthesis systems. Physics in Medicine and Biology,
57(22), 7629–7650.
Michielsen, K., Nuyts, J., Cockmartin, L., Marshall, N. W., & Bosmans, H. (2016).
Design of a model observer to evaluate calcification detectability in breast
tomosynthesis and application to smoothing prior optimization. Medical
Physics, 43(12), 6577–6587.
Myers, K. J., & Barrett, H. H. (1987). Addition of a channel mechanism to the ideal-
observer model. Journal of the Optical Society of America. A, Optics and
Image Science, 4(12), 2447–2457.
Niklason, L. T., Christian, B. T., Niklason, L. E., Kopans, D. B., Castleberry, D. E.,
Opsahl-Ong, B. H., Landberg, C. E., Slanetz, P. J., Giardino, A. A., Moore, R.,
Albagli, D., DeJule, M. C., Fitzgerald, P. F., Fobare, D. F., Giambattista, B. W.,
Kwasnick, R. F., Liu, J., Lubowski, S. J., Possin, G. E., … Wirth, R. F. (1997).
Digital tomosynthesis in breast imaging. Radiology, 205(2), 399–406.
Park, S., Jennings, R., Liu, H., Badano, A., & Myers, K. (2010). A statistical, task-
based evaluation method for three-dimensional x-ray breast imaging
systems using variable-background phantoms. Medical Physics, 37, 6253–
6270.
Park, S., Zhang, G., & Myers, K. J. (2016). Comparison of Channel Methods and
Observer Models for the Task-Based Assessment of Multi-Projection
Imaging in the Presence of Structured Anatomical Noise. IEEE Transactions
on Medical Imaging, 35(6), 1431–1442.

Perry, N., Broeders, M., de Wolf, C., Törnberg, S., Holland, R., & von Karsa, L.
(2006). European guidelines for quality assurance in breast cancer
screening and diagnosis. In Annals of oncology : official journal of the
European Society for Medical Oncology / ESMO (Vol. 19, Issue 4).
Petrov, D., Cockmartin, L., Marshall, N., Vancoillie, L., Young, K., & Bosmans, H.
(2017). Real space channelization for generic DBT system image quality
evaluation with channelized Hotelling observer (M. A. Kupinski & R. M.
Nishikawa (eds.); Vol. 10136, p. 101360N). International Society for Optics
and Photonics.
Platiša, L., Goossens, B., Vansteenkiste, E., Park, S., Gallas, B. D., Badano, A., &
Philips, W. (2011). Channelized Hotelling observers for the assessment of
volumetric imaging data sets. Journal of the Optical Society of America. A,
Optics, Image Science, and Vision, 28(6), 1145–1163.
Poplack, S. P., Tosteson, T. D., Kogel, C. A., & Nagy, H. M. (2007). Digital breast
tomosynthesis: Initial experience in 98 women with abnormal digital
screening mammography. American Journal of Roentgenology, 189(3), 616–
623.
Racine, D., Ba, A. H., Ott, J. G., Bochud, F. O., & Verdun, F. R. (2016). Objective
assessment of low contrast detectability in computed tomography with
Channelized Hotelling Observer. Physica Medica, 32(1), 76–83.
Rafferty, E. A. (2007). Digital Mammography: Novel Applications. Radiologic
Clinics of North America, 45(5), 831–843.
Rodríguez-Ruiz, A., Castillo, M., Garayoa, J., & Chevalier, M. (2016). Evaluation of
the technical performance of three different commercial digital breast
tomosynthesis systems in the clinical environment. Physica Medica, 32(6),
767–777.
Saunders, R. S., Baker, J. A., Delong, D. M., Johnson, J. P., & Samei, E. (2007). Does
image quality matter? Impact of resolution and noise on mammographic
task performance. Medical Physics, 34(10), 3971–3981.
Schrauf, M., & Stern, C. (2001). The visual resolution of Landolt-C optotypes in
human subjects depends on their orientation: the 'gap-down' effect.
Neuroscience Letters, 299(3), 185–188.
Sechopoulos, I. (2013a). A review of breast tomosynthesis. Part I. The image
acquisition process. Medical Physics, 40(1), 1–12.
Sechopoulos, I. (2013b). A review of breast tomosynthesis. Part II. Image
reconstruction, processing and analysis, and advanced applications.
Medical Physics, 40(1), 1–17.
Shaheen, E., De Keyzer, F., Bosmans, H., Dance, D. R., Young, K. C., & Ongeval, C.
Van. (2014). The simulation of 3D mass models in 2D digital
mammography and breast tomosynthesis. Medical Physics, 41(36),
81913–84920.
Skaane, P., Bandos, A. I., Gullien, R., Eben, E. B., Ekseth, U., Haakenaasen, U., Izadi,
M., Jebsen, I. N., Jahr, G., Krager, M., & Hofvind, S. (2013). Prospective trial
comparing full-field digital mammography (FFDM) versus combined FFDM
and tomosynthesis in a population-based screening programme using
independent double reading with arbitration. European Radiology, 23(8),
2061–2071.
Svahn, T. M., Chakraborty, D. P., Ikeda, D., Zackrisson, S., Do, Y., Mattsson, S., &
Andersson, I. (2012). Breast tomosynthesis and digital mammography: A
comparison of diagnostic accuracy. British Journal of Radiology, 85(1019).
Timberg, P., Båth, M., Andersson, I., Svahn, T., Ruschin, M., Hemdal, B., Mattsson,
S., & Tingberg, A. (2008). Impact of dose on observer performance in breast
tomosynthesis using breast specimens. SPIE Medical Imaging: Physics of
Medical Imaging, 6913, 69134J.
van Engen, R. E., Bosmans, H., Bouwman, R. W., Dance, D. R., Heid, P., Lazzari, B.,
Marshall, N. W., Schopphoven, S., Strudley, C., Thijssen, M., & Young, K. C.
(2014). A European Protocol for Technical Quality Control of Breast
Tomosynthesis Systems (pp. 452–459). Springer, Cham.
Warren, L. M., Mackenzie, A., Cooke, J., Given-Wilson, R. M., Wallis, M. G.,
Chakraborty, D. P., Dance, D. R., Bosmans, H., & Young, K. C. (2012). Effect
of image quality on calcification detection in digital mammography. Medical
Physics, 39(6), 3202–3213.
Watson, A. B. (1983). Detection and recognition of simple spatial forms. Berlin
SpringerVerlag, 100–114.
Wen, G., Markey, M. K., Haygood, T. M., & Park, S. (2018). Model observer for
assessing digital breast tomosynthesis for multi-lesion detection in the
presence of anatomical noise. Physics in Medicine and Biology, 63(4).
Young, S., Bakic, P. R., Myers, K. J., Jennings, R. J., & Park, S. (2013). A virtual trial
framework for quantifying the detectability of masses in breast
tomosynthesis projection data. Medical Physics, 40(5), 051914.
Young, S., Park, S., Anderson, S. K., Badano, A., Myers, K. J., & Bakic, P. R. (2009).
Estimating breast tomosynthesis performance in detection tasks with
variable-background phantoms. SPIE Medical Imaging: Physics of Medical
Imaging, 7258(301), 72580O.
Zeng, R., Badano, A., & Myers, K. J. (2017). Optimization of digital breast
tomosynthesis (DBT) acquisition parameters for human observers: effect
of reconstruction algorithms. Physics in Medicine & Biology.
Zhang, G., Cockmartin, L., & Bosmans, H. (2016). A four-alternative forced choice
(4AFC) software for observer performance evaluation in radiology (C. K.
Abbey & M. A. Kupinski (eds.); p. 97871E). International Society for Optics
and Photonics.
Zhao, B., Zhou, J., Hu, Y.-H., Mertelmeier, T., Ludwig, J., & Zhao, W. (2008).
Experimental validation of a three-dimensional linear system model for
breast tomosynthesis. Medical Physics, 36(1), 240–251.

Chapter 2: Real space channelization for generic
DBT system image quality evaluation with
channelized Hotelling observer
Based on D. Petrov, N. Marshall, K. Young, H. Bosmans, ‘Real space channelization
for generic DBT system image quality evaluation with channelized Hotelling
observer’, Proc. SPIE 10136, Medical Imaging 2017: Image Perception, Observer
Performance, and Technology Assessment, 101360N (2017)

INTRODUCTION
Conventional 2D mammography imaging, using either screen-film or digital
detectors, has limitations in detection sensitivity due to overlapping breast
tissues. In digital breast tomosynthesis (DBT), a series of projection images is
acquired as the x-ray tube moves over a limited angle; these projections are then
reconstructed in planes parallel to the detector. DBT systems can therefore provide
some three-dimensional information on the breast anatomy, at least partly solving
the problem of the masking effect of overlapping structures. However,
rigorous optimization of this technique is a challenge, given the large number of
adjustable parameters that are involved. Furthermore, if image quality
assessment by human observers (HO) is needed then this is not feasible for all
conditions, as it would require a significant amount of time. Model observers
(MO) offer a potential solution, and if validated, could be used to predict human
reading. Such observers have been developed in previous studies, which attempted
to establish a correlation between HO and MO. The background images used in
these studies are either not clinically relevant or use digitally inserted lesions
within real clinical backgrounds (Eckstein et al., 1998; Ikejimba et al., 2016). Our
previous MO (Petrov et al., 2016), which was applied on physical phantom images
containing 3D printed lesions that were reconstructed with 3 different
reconstruction algorithms, produced good correlation between the MO and HO
scores. The study was performed on a Siemens Inspiration system and the
channelized Hotelling observer (CHO) with Laguerre-Gauss channels and
internal noise was tuned for each reconstruction algorithm. The work in the
previous chapter shows an improved and better validated CHO algorithm, which
will be used as a basis for the current study.
In this work, an anthropomorphic CHO with novel Gabor channels was
constructed. The channels account for the image pixel size, such that they have
the same physical properties in the spatial domain regardless of the system
image dataset being evaluated. The study was performed on Fujifilm AMULET
Innovality (Fujifilm, Tokyo, Japan), General Electric SenoClaire 3D (GE Health-
care, Buc, France), Giotto Class (I.M.S, Bologna, Italy), Hologic Selenia Dimensions
(Hologic, Inc., Marlborough, USA), Philips MicroDose (Philips, Solna, Sweden)
and Siemens Inspiration (Siemens AG Healthcare, Erlangen, Germany) DBT
systems. The impact of reconstruction, dose level and reproducibility of
detection results was studied in terms of the correlation between the human
observer and the MO.

MATERIALS AND METHODS


1. Phantom properties

Figure 1. Images of the phantom, from left to right: Photograph, Mammographic
image without the background spheres and DBT reconstructed plane.
The 3D structured phantom (Cockmartin et al., 2017) used was made of an acrylic
semi-circular, compressed breast-shaped container filled with equal volumes of
acrylic spheres of six different diameters (15.9, 12.7, 9.52, 6.35, 3.18 and
1.58mm); the space between the spheres is filled with water (Figure 1). This
creates images with power spectra similar to those found in patients
(Cockmartin et al., 2013). As the spheres are free to move slightly inside the
phantom, shaking the phantom is enough to produce different phantom
configurations that still have similar power spectra. Target objects were included
in the phantom: calcification groups, spiculated and non-spiculated 3D printed
simulated mass models (Shaheen et al., 2014), with differing diameters. For this
study only the non-spiculated masses were evaluated, with diameters averaged
over 3 dimensions equal to: 1.5, 2.1, 3.0, 4.3 and 5.7mm. All mass lesions are
positioned at the same height in the phantom (and thus in the same in-focus plane),
parallel to the detector. They are made of the same material and the different
diameters give different contrast levels relative to the background. For each
phantom scan only one unique signal for a given mass model is acquired, while
from the same scan, fifteen signal absent images can be created. Therefore, in
order to conduct an observer study, multiple acquisitions have to be made.
2. Systems and studies
In order to validate the applicability of the MO, the phantom was scanned on six
different systems and for different conditions. Three types of studies were
performed:
2.1. Dose level
The correlation between model and human observers (MO-HO) was studied at
three different dose levels. The reference dose was set by the systems’ automatic
exposure control (AEC), then a lower dose scan (half of the AEC dose) and higher
dose scan (usually double the AEC dose) were performed. For each condition, 10
DBT images were acquired on the Giotto Class (I.M.S, Bologna, Italy), 12 on the Fujifilm
AMULET Innovality (Fujifilm, Tokyo, Japan), 10 on the Philips MicroDose
(Philips, Solna, Sweden) and 12 on the General Electric SenoClaire 3D (GE
Healthcare, Buc, France) DBT systems.
2.2. Reconstruction algorithm
The influence of 3 different reconstruction algorithms was studied on a Siemens
Inspiration tomosynthesis system (Siemens-Healthineers, Erlangen, Germany).
These data have already been reported elsewhere (Petrov et al., 2016). Fifteen DBT
acquisitions were made in AEC mode and then reconstructed with the following
algorithms: Filtered Back Projection (FBP) (Mertelmeier et al., 2006), FBP with
Super Resolution and Statistical Artifact Reduction (SRSAR) (Abdurahman et al.,
2014) and Maximum Likelihood for Transmission with resolution modeling
(MLTRpr) (Michielsen et al., 2013).
2.3. Reproducibility
Sixty DBT scans acquired on a Hologic Selenia Dimensions (Hologic, Inc., Marlborough,
USA) DBT system at the AEC dose level were divided into 5 series of 12
DBT acquisitions each, allowing the repeatability of detection results to be
studied.
3. Signal present and signal absent regions of interest
Similar to chapter 1, volumes of interest (VOI) were created from the DBT image
stacks. As an aid to cropping, the central microcalcification particle of the largest
size group was taken as a reference point. Furthermore, the distances to the five
non-spiculated lesions were estimated from FFDM 2D images of the phantom
without the background spheres and corrected for magnification, then visually
verified. This verification was required for all DBT systems, except the Siemens
Inspiration, for which the original cropping algorithm was developed. The signal
present VOIs were formed from a cropped volume of 20x20x30mm3, with the
target lesion in the center. Along with the signal present images, signal absent
images were created from lesion free areas including the same DBT planes as the
signal present images.
VOIs were defined in terms of physical size (mm3). All DBT systems in this study
had plane thickness of 1mm, but differences in the reconstructed image pixel size
in x-y direction across the six DBT systems (from 0.085 to 0.108mm) gave
regions of interest (ROI) with different lengths in terms of number of pixels.
Namely the ROI sizes in pixels were 236x236 for Siemens, 186x186 for Hologic,
222x222 for Giotto and 200x200 for Philips, GE and Fujifilm, all with 30 planes
forming the VOI.
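Because the VOIs are defined in millimetres, the pixel counts quoted above follow from the pixel pitch of each system. As a rough consistency check, the sketch below inverts that relation and recovers the approximate in-plane pixel pitch implied by each ROI size, assuming a 20 mm ROI side length. The dictionary and the function name are illustrative and are not taken from the original analysis software.

```python
# Approximate in-plane pixel pitch implied by the ROI sizes quoted in the text,
# assuming every ROI spans 20 mm (all names here are illustrative only).
ROI_SIDE_MM = 20.0

roi_side_px = {
    "Siemens": 236,
    "Hologic": 186,
    "Giotto": 222,
    "Philips": 200,
    "GE": 200,
    "Fujifilm": 200,
}


def implied_pixel_pitch_mm(n_pixels, side_mm=ROI_SIDE_MM):
    """Pixel pitch (mm) of an ROI of side_mm sampled with n_pixels pixels."""
    return side_mm / n_pixels


pitch_mm = {name: implied_pixel_pitch_mm(n) for name, n in roi_side_px.items()}

for name, pitch in pitch_mm.items():
    print(f"{name}: {pitch:.4f} mm/pixel")
```

The recovered values span roughly the 0.085 to 0.108 mm range stated in the text; the exact rounding used when cropping is not specified here and is therefore not reproduced.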
In the same manner as in chapter 1, the scanning ssCHO algorithm required
2D images, thus the in-focus plane and the four adjacent planes (two above and two
below) were extracted from the VOIs for the automated image quality
evaluation. Thus from every VOI, 5 ROIs were extracted around the expected
location of the lesions.
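The plane selection described above can be sketched with simple array slicing. This is a minimal illustration, assuming the VOI is stored as a NumPy array indexed as (plane, row, column) and that the in-focus plane index is already known; the function name and the choice of plane 15 are hypothetical.

```python
import numpy as np


def extract_rois(voi, in_focus_index, n_each_side=2):
    """Return the in-focus plane plus n_each_side planes above and below it.

    voi is assumed to be indexed as (plane, row, column).
    """
    start = in_focus_index - n_each_side
    stop = in_focus_index + n_each_side + 1
    if start < 0 or stop > voi.shape[0]:
        raise ValueError("requested planes fall outside the VOI")
    return voi[start:stop]


# Synthetic 30-plane VOI of 236x236 pixels; plane 15 is taken as the in-focus
# plane purely for illustration.
voi = np.zeros((30, 236, 236), dtype=np.float32)
rois = extract_rois(voi, in_focus_index=15)
print(rois.shape)  # (5, 236, 236)
```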

4. Human observer study
Five medical physicists participated in all the different study conditions in
several reading sessions. The observer study followed the four alternative forced
choice (4AFC) paradigm. The VOIs were visualized with an in-house software
tool for scoring (Zhang et al., 2016). The reading was performed on a diagnostic
monitor (5MP Barco MDNG-6121) at an ambient light level of approximately 3
lx. The observers were asked to score all sessions and studies with a consistent
distance to the screen. No magnification or window level adjustments were
allowed, and only scrolling through the VOI was permitted. No time constraints
were given for the reading sessions.
At the end of each reading session, the percentage correct (PC) was obtained for
each human observer, the value of which gave the probability of correctly
identifying a signal present image among signal absent images. When all readers
finished a given reading session, the overall PC was calculated as the mean of the
population and the uncertainty was calculated as the standard error of the mean
of the population.
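The scoring described above can be illustrated with a short sketch. It assumes that each reader's responses are stored as one boolean per 4AFC trial and that the standard error of the mean is computed with the sample standard deviation (ddof=1); both are assumptions, and the reader scores below are made-up numbers, not study data. Note that pure guessing in a 4AFC experiment gives a PC of 25%.

```python
import numpy as np


def percentage_correct(responses):
    """PC (%) from a sequence of booleans, one per 4AFC trial."""
    responses = np.asarray(responses, dtype=bool)
    return 100.0 * responses.mean()


def pooled_pc(reader_pcs):
    """Mean PC over readers and the standard error of that mean."""
    reader_pcs = np.asarray(reader_pcs, dtype=float)
    mean_pc = reader_pcs.mean()
    # Sample standard deviation (ddof=1) divided by sqrt(number of readers).
    sem = reader_pcs.std(ddof=1) / np.sqrt(reader_pcs.size)
    return mean_pc, sem


# Made-up PC values for five readers, for illustration only.
reader_pcs = [80.0, 85.0, 75.0, 90.0, 70.0]
mean_pc, sem = pooled_pc(reader_pcs)
print(f"PC = {mean_pc:.1f} +/- {sem:.1f} %")
```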
5. Model observer
5.1. Channelized Hotelling model observer
Following from the introduction of this thesis, the CHO test statistic 𝑡 can be
written as:
$$t_{CHO}(v) = w_{CHO}^{T}\, v = \left(\bar{v}_{sp} - \bar{v}_{sa}\right)^{T} K_v^{-1}\, v \qquad (1)$$
Where $w_{CHO}$ is the observer template, $\{\,\}^{T}$ is the vector transpose operator, $\bar{v}_{sp}$
and $\bar{v}_{sa}$ are respectively the expected signal present and signal absent channel
output vectors and $K_v$ is the covariance matrix. Here the expected signal
$(\bar{v}_{sp} - \bar{v}_{sa})$ was estimated using Gaussian ‘blobs’ with full width at half
maximum equal to the average size of the mass-targets to be detected. Due to the
limited number of DBT acquisitions, the CHO template was trained with the same
images used for reading afterwards; this gave a dataset bias of 100%.
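To make the test statistic concrete, the following sketch computes a channel-space Hotelling template and applies it to channel output vectors. It is an illustration only: the channel outputs are synthetic Gaussian data, the covariance matrix is estimated as the average of the two class covariances (an assumption, since the estimator is not restated here), and all function names are hypothetical. Training and reading on the same vectors, as in the demo, mirrors the 100% dataset bias mentioned above.

```python
import numpy as np


def cho_template(v_sp, v_sa, expected_signal=None):
    """Hotelling template in channel space: w = K_v^{-1} (mean_sp - mean_sa).

    v_sp, v_sa: arrays of shape (n_images, n_channels) of channel outputs.
    expected_signal: optional channelized signal used in place of the
    difference of sample means (for example a channelized Gaussian blob).
    """
    if expected_signal is None:
        expected_signal = v_sp.mean(axis=0) - v_sa.mean(axis=0)
    # Assumed estimator: average of the two class covariance matrices.
    k_v = 0.5 * (np.cov(v_sp, rowvar=False) + np.cov(v_sa, rowvar=False))
    # Solve K_v w = signal instead of forming the explicit inverse.
    return np.linalg.solve(k_v, expected_signal)


def cho_statistic(w, v):
    """Test statistic t = w^T v for one or many channel output vectors."""
    return v @ w


def pc_4afc(t_sp, t_sa, rng):
    """Fraction of trials in which a signal present score exceeds the scores
    of three randomly drawn signal absent images."""
    picks = rng.choice(t_sa, size=(t_sp.size, 3))
    return float(np.mean(t_sp > picks.max(axis=1)))


# Synthetic channel outputs for 8 channels, for illustration only.
rng = np.random.default_rng(0)
n_channels = 8
signal = np.full(n_channels, 1.0)
v_sa = rng.normal(size=(200, n_channels))
v_sp = rng.normal(size=(200, n_channels)) + signal

w = cho_template(v_sp, v_sa)
t_sp = cho_statistic(w, v_sp)
t_sa = cho_statistic(w, v_sa)
pc = pc_4afc(t_sp, t_sa, rng)
print(f"4AFC proportion correct on synthetic data: {pc:.2f}")
```

Solving the linear system rather than inverting the covariance matrix is a numerical convenience; it does not avoid the singularity problem that arises when too few training images are available for the number of channels.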
5.2. Channels
The channels used for this study were generated using a modified Gabor function
in the spatial domain (Chen et al., 2002):
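$$W(x, y) = \exp\!\left(-4\ln(2)\,\frac{x^{2}+y^{2}}{W_s^{2}}\right)\cos\left[2\pi f\,(x\cos\theta + y\sin\theta) + \phi\right] \qquad (2)$$
Where $W_s = \dfrac{t_{std}}{(e^{n_{fq}}+2)\,pxs}$, $f = \dfrac{t_{fq}}{W_s}$, $\theta = \dfrac{\pi\, n_{th}}{t_t}$ and $\phi = 45\, n_{ph}$.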

The Gabor function is defined by a sinusoidal wave multiplied by a Gaussian
function. The Gaussian function sets the width ($W_s$) of the channels and the
sinusoidal wave sets frequency ($f$), orientation ($\theta$) and the phase ($\phi$). In the
present work the frequency was set as a function of the width of the Gaussian
function in a way to include only one maximum of the sinusoidal wave within the
channel FWHM. This sets the sensitive spot of the MO at the center of the image.
In our implementation, only three features of the channels can be selected – the
number of frequencies (nfq), orientations (nth) and phases (nph). The previous
chapter had shown that four frequencies, two orientations and one phase were
a good candidate setting, which gives eight unique channels.
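As an illustration of equation 2, the following minimal sketch samples a Gabor channel on a pixel grid and stacks four width/frequency pairs with two orientations into an eight-column channel matrix. The function names, the ROI size and the numerical widths and frequencies in the usage lines are hypothetical and chosen only for demonstration; angles are passed in radians, and the width and frequency are assumed to be already expressed in pixel units.

```python
import numpy as np

def gabor_channel(n_pix, ws, f, theta, phi=0.0):
    """Sample equation (2) on an n_pix x n_pix grid centred on the ROI.

    ws    : FWHM of the Gaussian envelope, in pixels
    f     : frequency of the cosine, in cycles per pixel
    theta : orientation in radians
    phi   : phase in radians
    """
    coords = np.arange(n_pix) - (n_pix - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    envelope = np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / ws**2)
    carrier = np.cos(2.0 * np.pi * f * (x * np.cos(theta) + y * np.sin(theta)) + phi)
    return envelope * carrier

def channel_matrix(n_pix, widths, freqs, thetas, phi=0.0):
    """One column per (width/frequency pair, orientation) combination."""
    cols = [gabor_channel(n_pix, ws, f, th, phi).ravel()
            for ws, f in zip(widths, freqs) for th in thetas]
    return np.stack(cols, axis=1)

# Hypothetical example: four width/frequency pairs, two orientations, one phase.
U = channel_matrix(65, widths=[40, 28, 18, 9], freqs=[0.01, 0.02, 0.03, 0.06],
                   thetas=[0.0, np.deg2rad(72.0)])
```

With the factor 4 ln(2) in the exponent, the envelope drops to one half at a distance of 𝑊𝑠/2 from the center, so 𝑊𝑠 acts as the FWHM of the channel.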
To estimate human observer reading scores, the CHO algorithm should
include a tuning step in which its results are (usually) downgraded in order to
match the human results. Three tuning factors are included in the channel
generation: tstd, tfq and tt, which are respectively a tuning factor for the
Gaussian function standard deviation, for the frequency and for the orientation.
By altering the channel width, frequency and orientation, the channelized
features extracted from the images can be varied in order to approximate the
human reading scores. This was an iterative procedure resulting in channel
parameters that provided the highest correlation between model observer and
human results.
The automated image quality estimation for a number of DBT systems can be
approached in two ways. The first is to tune the feature extraction algorithm
for each system and find the channel set that provides a good correlation with
human observers. This would produce a different channel set for each of the DBT
systems, which is potentially time consuming and prone to overfitting. A more
desirable approach is to find a single channel set that generalizes the task of
mass-lesion detection in DBT and is usable regardless of which DBT vendor is
considered. We chose the second method, and a procedure was set up to find the
channel tuning parameters that work well for all of the studied systems and
conditions. In this study, we iteratively searched for the best candidate in
terms of correlation and slope at standard scanning parameters (at AEC dose
level and with the standard reconstruction algorithm) and then applied it to all
systems and conditions for the final comparison between CHO and human
observer.

Figure 2. Gabor function profiles for different pixel sizes (pxs).


The reconstruction pixel size of the final images (planes) varies between the
different systems. If an ROI with a certain real space size is chosen, the
selected regions will therefore contain a different number of pixels on
different systems. Consequently, generating channels with the same tuning
parameters for different systems will produce different spatial frequencies and
widths in real space, which could alter the CHO detectability scores. Assuming
human detectability is (largely) affected by the real space target size,
detectability scores for an anthropomorphic CHO should not be affected by the
number of pixels within the ROIs. Thus the channel mechanism and the CHO
algorithm should be tuned, trained and applied with regard to the real space
properties of the targets.
Figure 2 illustrates the problem. A Gabor channel is generated for the
Hologic DBT system (pxs=0.108mm); the red line in the figure shows its profile
in real space. If a Gabor channel is then generated for the Siemens DBT system
(pxs=0.085mm) with exactly the same tuning parameters as for Hologic, this
produces the channel profile in real space depicted by the green line. It can be
seen that the channel width is reduced spatially and the channel frequency is
increased. To cope with this, the pixel size is included in the Gabor channel
formula (equation 2), where it affects the width and the frequency of the
generated channels. With this new channel generation method, the channel
profile produced for the Siemens system (blue line in the figure) matches the
channel profile for the Hologic system (red line in the figure). This shows that
the proposed method produces channels that are invariant in real space
regardless of the DBT system used.
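The pixel size scaling can be checked numerically with a short sketch. It assumes a channel specified by a real space width (mm) and frequency (cycles/mm), converts both to pixel units for each system, and compares the resulting profiles at the same real space positions. The function name and the width and frequency values are hypothetical; only the two pixel sizes are taken from the text.

```python
import numpy as np

def gabor_profile(n_pix, pxs, width_mm, freq_per_mm):
    """1D Gabor profile (theta = 0, phi = 0) defined by real space parameters.

    The width is divided by the pixel size and the frequency is multiplied by
    it, so that both are expressed in pixel units before sampling.
    """
    ws_px = width_mm / pxs        # FWHM in pixels
    f_px = freq_per_mm * pxs      # cycles per pixel
    x_px = np.arange(n_pix) - (n_pix - 1) / 2.0
    profile = (np.exp(-4.0 * np.log(2.0) * x_px**2 / ws_px**2)
               * np.cos(2.0 * np.pi * f_px * x_px))
    return x_px * pxs, profile    # positions in mm, channel values

# Pixel sizes quoted for the Hologic and Siemens systems; other values are illustrative.
x_a, p_a = gabor_profile(201, 0.108, width_mm=3.0, freq_per_mm=0.2)
x_b, p_b = gabor_profile(257, 0.085, width_mm=3.0, freq_per_mm=0.2)

# Resample the second profile at the real space positions of the first one.
p_b_on_a = np.interp(x_a, x_b, p_b)
max_diff = np.max(np.abs(p_a - p_b_on_a))
```

The remaining difference `max_diff` comes only from the linear interpolation between the two sampling grids, not from the channel definition itself.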
5.3. Adaptations of the MO
In order to cope with practical aspects of the MO implementation, adaptations
were implemented in accordance with the CHO developed in chapter 1.

Figure 5. Illustration of the MO scanning mechanism


Misalignments often occur in the process of ROI extraction. In order to cope with
this, the channels of the CHO were shifted over 25 pixel positions around the
center of each ROI. This was repeated for the four slices adjacent to the
central slice, which gave a scanning volume of 125 voxels around the initial,
expected position of the lesion (figure 5). After calculating the observer test
statistic at all positions, the final result was taken as the maximum of these
values.
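A minimal sketch of such a scanning step is given below. It assumes that the 25 positions form a 5 x 5 in-plane grid, that the CHO is applied through an equivalent image space template (for a linear observer, the channel matrix applied to the channel space template), and that the volume is indexed as (slice, row, column). The function and argument names are hypothetical.

```python
import numpy as np

def scan_max_statistic(volume, template_2d, centre, half_shift=2, half_slices=2):
    """Maximum template response over shifted positions and neighbouring slices.

    volume      : (n_slices, n_rows, n_cols) reconstructed ROI stack
    template_2d : (h, w) image space observer template, h and w odd
    centre      : (slice, row, col) expected lesion position
    """
    z0, r0, c0 = centre
    hh, hw = template_2d.shape[0] // 2, template_2d.shape[1] // 2
    best = -np.inf
    # 5 slices x 5 rows x 5 columns = 125 candidate positions by default.
    for dz in range(-half_slices, half_slices + 1):
        for dr in range(-half_shift, half_shift + 1):
            for dc in range(-half_shift, half_shift + 1):
                r, c = r0 + dr, c0 + dc
                patch = volume[z0 + dz, r - hh:r + hh + 1, c - hw:c + hw + 1]
                best = max(best, float(np.sum(patch * template_2d)))
    return best
```

The sketch does not guard against positions that fall outside the volume; the ROI is assumed to be large enough for the template at every scanned position.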

Each dataset was read by the CHO in a 4AFC paradigm and percentage correct
(PC) was chosen as the MO performance estimate. For this purpose the algorithm
calculates the test statistics for the signal present and the signal absent
images and pairs each signal present test statistic with three randomly selected
signal absent test statistics for comparison. If the CHO response is the highest
for the signal present image, the algorithm counts a ‘hit’. The final percentage
correct is calculated as the number of hits divided by the number of
realizations. The standard error was estimated by bootstrapping this process 30
times.
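The 4AFC scoring described above can be sketched as follows. The function names are hypothetical, and the error estimate is taken here as the standard deviation of PC over 30 repetitions of the random pairing, which is one possible reading of the bootstrap step rather than a confirmed detail of the implementation.

```python
import numpy as np

def four_afc_pc(t_sp, t_sa, rng):
    """Percentage correct for one 4AFC pass over all signal present statistics."""
    t_sp = np.asarray(t_sp, dtype=float)
    t_sa = np.asarray(t_sa, dtype=float)
    hits = 0
    for t in t_sp:
        # Three distinct signal absent statistics are drawn for every trial.
        rivals = rng.choice(t_sa, size=3, replace=False)
        hits += int(t > rivals.max())
    return 100.0 * hits / t_sp.size

def four_afc_with_error(t_sp, t_sa, n_repeats=30, seed=0):
    """Mean PC and its spread over repeated random pairings."""
    rng = np.random.default_rng(seed)
    pcs = np.array([four_afc_pc(t_sp, t_sa, rng) for _ in range(n_repeats)])
    return float(pcs.mean()), float(pcs.std(ddof=1))
```

For an observer with no discriminating power the expected PC is 25%, the chance level of a four-alternative task.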
The goodness of fit between the model and the human observers was tested in
the same way as in chapter 1. Three criteria were used: the Pearson’s
correlation coefficient, the linear regression slope and the mean error, with
ideal values of 1, 1 and 0 respectively.
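The three criteria can be computed with a few lines of NumPy, as in the sketch below. It assumes that the model scores are regressed on the human scores and that the mean error is the mean of human minus model PC; the exact definitions are given in chapter 1, so both conventions and the function name are assumptions made for illustration.

```python
import numpy as np

def goodness_of_fit(pc_human, pc_model):
    """Pearson r, regression slope (model on human) and mean error (human - model)."""
    h = np.asarray(pc_human, dtype=float)
    m = np.asarray(pc_model, dtype=float)
    r = np.corrcoef(h, m)[0, 1]
    slope, _intercept = np.polyfit(h, m, 1)   # first-order fit: m = slope * h + intercept
    mean_error = np.mean(h - m)               # assumed sign convention
    return float(r), float(slope), float(mean_error)
```

A model observer that reproduces the human scores exactly returns (1, 1, 0), the ideal values quoted above.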

RESULTS
1. Tuning of the channel set

Figure 3. a) Channel profiles used for this study (left); b) images of the used
channels (right).
Table 1. Parameters of the channels.
Parameter   Value
𝒕𝒘          39
𝒕𝒇𝒒         50
𝒕𝒕          2.5
𝑾𝒔          13.0/𝑝𝑥𝑠, 9.0/𝑝𝑥𝑠, 6.0/𝑝𝑥𝑠, 3.0/𝑝𝑥𝑠
𝒇           3.0𝑝𝑥𝑠, 5.0𝑝𝑥𝑠, 8.0𝑝𝑥𝑠, 18.0𝑝𝑥𝑠
𝜽           0°, 72°
𝝓           0°

Table 1 contains the tuning parameter values used for this study. Figure 3a
shows the profiles of the channels, grouped in pairs with the same frequency
but different orientation, as depicted in figure 3b. The parameters of the
applied channels are approximated from a tuning step performed on all systems at
standard dose and reconstruction settings. The best parameters in terms of
MO-HO correlation were then used for all other conditions.
2. Three dose levels study
Scores of the MO performance plotted against HO performance for the
non-spiculated masses are presented in figure 6 for the case where the dose
level is varied. The goodness of fit criteria are used to estimate how well the
CHO approximates human observer results; the Pearson’s correlation, linear
regression slope and mean error are listed in Table 2, along with the 95%
confidence intervals where possible. Note that data points closer to the
diagonal of the plot indicate better accuracy of the model observer. Overall the
results show good Pearson’s correlation coefficients, with 𝑟 ≥ 0.87. For the
GE low dose session the MO overestimates the 3.0mm lesion, which produced a
linear regression slope of 𝑎 = 0.73, and for the Fujifilm low dose session, with
𝑎 = 1.42, the model observer underestimated the smaller lesion sizes. The
absolute mean error is smaller than 10 percentage points for the Giotto system,
where the model observer underestimates all lesions on average. The rest of the
reading sessions showed much better agreement with humans.

[Figure 6: four panels (Giotto Class, Philips MicroDose, Fujifilm Innovality, GE SenoClaire), each plotting CHO PC (%) against human PC (%) for the High, AEC and Low dose levels.]

Figure 6. Dose level study. Performance of human and model observers.
3. Reconstruction algorithm study
[Figure 7: one panel (Siemens Inspiration) plotting CHO PC (%) against human PC (%); legend entries High, AEC and Low.]

Figure 7. Reconstruction algorithm study. Performance of human and model
observers.
Table 2. Goodness of the fit criteria between MO and HO reading scores for the dose
levels study. The Pearson correlation, linear regression slope and the mean error
are given with the 95% confidence intervals given in brackets where possible.

System     Condition   Correlation (r)       Slope (a)             ME, %
Giotto     AEC dose    0.98 (0.71; 1.00)     0.87 (0.54; 1.21)     6.0
Giotto     High dose   0.98 (0.68; 1.00)     1.05 (0.62; 1.47)     9.3
Giotto     Low dose    0.97 (0.64; 1.00)     1.03 (0.58; 1.48)     8.7
Philips    AEC dose    0.99 (0.98; 1.00)     0.95 (0.85; 1.04)     -1.4
Philips    High dose   0.98 (0.76; 1.00)     0.87 (0.58; 1.17)     0.8
Philips    Low dose    0.96 (0.54; 1.00)     0.91 (0.44; 1.37)     0.9
Fujifilm   AEC dose    0.99 (0.87; 1.00)     1.22 (0.93; 1.52)     1.6
Fujifilm   High dose   0.98 (0.72; 1.00)     1.18 (0.73; 1.63)     0.7
Fujifilm   Low dose    0.99 (0.79; 1.00)     1.42 (0.97; 1.87)     3.3
GE         AEC dose    0.87 (-0.07; 0.99)    1.05 (-0.06; 2.16)    -1.9
GE         High dose   0.95 (0.39; 1.00)     0.97 (0.36; 1.57)     0.2
GE         Low dose    0.89 (0.05; 0.99)     0.73 (0.06; 1.41)     -8.7

Figure 7 shows the results for different reconstruction algorithms. The linear
regression line parameters along with their 95% confidence intervals are given
in table 3. The correlation found was satisfactory. The lowest correlation
(r=0.97) was found on the Standard FBP session with reasonable slope and mean
error criteria.
Table 3. Goodness of the fit criteria between MO and HO reading scores for the
reconstruction algorithms study. The Pearson correlation, linear regression slope
and the mean error are given with the 95% confidence intervals given in brackets
where possible.
System    Condition        Correlation (r)      Slope (a)            ME, %
Siemens   Standard FBP     0.98 (0.77; 1.00)    0.79 (0.53; 1.06)    1.5
Siemens   FBP with SRSAR   0.97 (0.64; 1.00)    1.07 (0.60; 1.54)    3.3
Siemens   MLTRpr           0.98 (0.64; 1.00)    0.87 (0.55; 1.19)    7.5

4. Reproducibility study
[Figure 8: two panels for Hologic Dimensions plotting CHO PC (%) against human PC (%); left: individual reading sets (Set1 to Set5); right: averaged over the sets.]

Figure 8. Reproducibility study. Performance of human and model observers.
Figure 8 shows the results from the reproducibility study. Despite the widely
distributed individual PC scores in the left graph, satisfactory goodness of
fit criteria are observed. If the PC is averaged across the 5 reading sets for
each observer type (right graph), the goodness of fit remains satisfactory
(table 4).

Table 4. Goodness of the fit criteria between MO and HO reading scores for the
reproducibility study. The Pearson correlation, linear regression slope and the
mean error are given with the 95% confidence intervals given in brackets where
possible.
System    Condition                   Correlation (r)      Slope (a)            ME, %
Hologic   Average of all 5 datasets   0.99 (0.92; 1.00)    1.19 (0.97; 1.41)    4.0

DISCUSSION AND CONCLUSIONS


Our goal was to develop a MO capable of predicting human observer
performance for basic but clinically relevant phantom-based detection tasks on
different DBT systems. This MO is now a candidate for image quality assessment
and optimization tasks in which multiple conditions need to be investigated. The
key part of the work presented here is the new way of generating spatial
channels that accounts for pixel size differences, which allowed the same set of
channel parameters to be used for all systems and conditions. It was shown that
the presented CHO can predict human observer scores on images with different
properties for five different DBT systems, three dose levels and three
reconstruction algorithms, and that it remains stable during repeatability
tests.
A limitation of this study, however, is that a dataset bias of 100% is present
within the current model observer. The covariance matrix is estimated from the
same set of signal absent images that are later used for the detectability
estimate. This was done because of the limited number of DBT scans available.
In an attempt to eliminate this bias, signal absent images from different planes
towards the bottom of the phantom (real distance of 8mm) were extracted and used
for training. This resulted in a large difference in the correlation
coefficients and overall the results were poorer. It is likely that planes far
away from the in-focus plane have different noise properties than the central
planes and that this influences the CHO performance. Preliminary studies on the
Philips system using a separate image dataset specifically allocated for
training (i.e. 0% bias) have shown no major difference in correlation and linear
regression parameters compared to those for the biased CHO. This might be caused
by the use of Gaussian ‘blobs’ for the expected signal, which eliminates the
requirement for signal present training images. This might also help to reduce
the effect of bias for the smaller lesion sizes observed in chapter 1, figure
10. Differences in correlation coefficients were within 2%, in the slope of the
regression line within 7% and in the mean error within 16%. It is generally
accepted that, whenever possible, the unbiased model observer should be used and
training and reading should not be performed on the same datasets (Kupinski et
al., 2007).
In conclusion, a CHO built using Gabor channels has been developed to
predict human observer performance for non-spiculated mass lesions within a
3D structured phantom using a limited number of DBT series. The MO shows
good results for five different DBT systems in different conditions.

REFERENCES
Abdurahman, S., Dennerlein, F., Jerebko, A., Fieselmann, A., & Mertelmeier, T.
(2014). Optimizing high resolution reconstruction in digital breast
tomosynthesis using filtered back projection. Lecture Notes in Computer
Science (Including Subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), 8539 LNCS, 520–527.
Chen, M., Bowsher, J. E., Baydush, A. H., Gilland, K. L., DeLong, D. M., & Jaszczak, R.
J. (2002). Using the Hotelling observer on multislice and multiview
simulated SPECT myocardial images. IEEE Transactions on Nuclear Science,
49 I(3), 661–667.
Cockmartin, L., Bosmans, H., & Marshall, N. W. (2013). Comparative power law
analysis of structured breast phantom and patient images in digital
mammography and breast tomosynthesis. Medical Physics, 40(8), 081920.
Cockmartin, L., Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Van Ongeval,
C., & Fredenberg, E. (2017). Design and application of a structured
phantom for detection performance comparison between breast
tomosynthesis and digital mammography. Physics in Medicine and Biology,
62(3), 15.
Eckstein, M. P., Abbey, C. K., & Whiting, J. S. (1998). Human vs model observers in
anatomic backgrounds. Proc. SPIE, 3340, 16–26.
Ikejimba, L., Glick, S. J., Samei, E., & Yo, Y. J. (2016). Comparison of model and
human observer performance in FFDM, DBT, and synthetic
mammography. Proceedings of SPIE Medical Imaging, 9783(978325), 1–10.
Kupinski, M. A., Clarkson, E., & Hesterman, J. Y. (2007). Bias in Hotelling observer
performance computed from finite data. Medical Imaging 2007: Image
Perception, Observer Performance, and Technology Assessment, 6515(65150
Suppl), 65150S.
Mertelmeier, T., Orman, J., Haerer, W., & Dudam, M. K. (2006). Optimizing filtered
backprojection reconstruction for a breast tomosynthesis prototype
device. SPIE Medical Imaging: Physics of Medical Imaging, 6142, 61420F.
Michielsen, K., Zanca, F., Marshall, N., Bosmans, H., & Nuyts, J. (2013). Two
complementary model observers to evaluate reconstructions of simulated
micro-calcifications in digital breast tomosynthesis. SPIE Medical Imaging:
Image Perception, Observer Performance, and Technology Assessment,
8673(1), 86730G.
Petrov, D., Michielsen, K., Cockmartin, L., Zhang, G., & Young, K. (2016).
Development and application of a channelized Hotelling observer for DBT
optimization on structured background test images with mass simulating
targets. SPIE Medical Imaging, 9787, 1–9.
Shaheen, E., De Keyzer, F., Bosmans, H., Dance, D. R., Young, K. C., & Van Ongeval,
C. (2014). The simulation of 3D mass models in 2D digital mammography
and breast tomosynthesis. Medical Physics, 41(8), 081913.
Zhang, G., Cockmartin, L., & Bosmans, H. (2016). A four-alternative forced choice
(4AFC) software for observer performance evaluation in radiology (C. K.
Abbey & M. A. Kupinski (eds.); p. 97871E). International Society for Optics
and Photonics.

Chapter 3: Calcification cluster detection in 2D
FFDM and DBT with a channelized Hotelling
observer
DIGITAL BREAST TOMOSYNTHESIS
Based on D. Petrov, N. Marshall, K. Young, G. Zhang, H. Bosmans, ‘Model and
human observer reproducibility for detection of microcalcification clusters in
digital breast tomosynthesis images of three-dimensionally structured test object’,
J. Med. Imag. 6(1), 015503 (2019)
1. Introduction and purpose
Digital breast tomosynthesis (DBT) is an imaging technique that provides some
3D information on the breast from a series of projection images. This modality
promises improved low contrast lesion detectability compared to conventional
2D mammography in the presence of breast anatomical structure (Skaane et al.,
2013; Svahn et al., 2012), due to the reduction in overlapping fibroglandular
tissues. The ability of DBT systems to detect microcalcifications, however,
remains a topic of interest, as some studies show conflicting results (Hadjipanteli
et al., 2016; Kopans et al., 2011; Li et al., 2018; Spangler et al., 2011). The gold
standard for such image quality estimation is a clinical trial, where the human
observer efficiency to detect calcification clusters in anatomical images is
evaluated. It is difficult to find an alternative to large scale clinical studies, where
many real world aspects of clinical studies are evaluated in the measured
performance (e.g. sensitivity and specificity) of a given modality for some imaged
population. There is rapid progress being made in the field of virtual studies
(Bakic et al., 2014; U.S. Food and Drug Administration, 2018) and these promising
methods are likely to augment or help to orientate practical clinical studies at
some point in the future. If a practical estimate of imaging performance for a
modality in the field is required at a fixed time point (e.g. for QC purposes) then
we are left with test object-based evaluations. Work by Warren et al. (2012)
has shown that well defined stimuli within a test object can correlate
with human observer performance reading clinical images, suggesting that there
is value in carefully controlled evaluations using test objects.
Quality control guidelines in mammography (Berns et al., 2016; FDA, 2002; van
Engen et al., 2014) often specify the minimum detectability level which must be
reached for some specified targets if the system in question is to be suitable for
clinical use in a breast screening context. In practice, these stimuli are generated
by test objects containing circular or speck-like objects positioned within a
homogeneous background, such as the CDMAM (Thomas et al., 2005) or the ACR
(Berns et al., 2016) phantoms. Regarding the CDMAM test object, human readers
initially performed the reading to assess the object detectability in terms of
contrast threshold or threshold gold thickness, for the specific exposure
condition used clinically, on a given imaging system. Over time, human reading
has been replaced by computerized or automated reading (Karssemeijer &
Thijssen, 1996; Young et al., 2008).
It is important that the evaluation process is reproducible, i.e. producing a stable
detection threshold, such that system performance can be tracked reliably for QC
purposes over time. The variability of threshold contrast estimated using the
CDMAM phantom has been compared for human and automatic readout (Young
et al., 2006), showing that some of the automated readout/processing methods
gave reliable threshold contrasts over certain target diameters. However, for
DBT there is still no consensus on how to estimate the minimally required
detectability level, as its benefit regarding 3D information of the breast and the
reduced overlapping glandular tissue cannot be assessed in the same way as 2D
mammography (e.g. via CDMAM). Testing the detectability of specific targets
within test objects against specified (regulatory) detectability levels also has
implications regarding the reproducibility of the measurements made, either by
human readers or by automated readout software. It is important that the
uncertainty on the measured thresholds is known so that meaningful decisions
regarding system performance can be made.
The recent work by Cockmartin et al. (2017)
introduced a 3D structured test phantom suitable for evaluation of DBT imaging,
with scoring again performed by human readers. This phantom is the starting
point for this work, with the aim of investigating human reader reproducibility
and testing computerized scoring alternatives. Recent studies show that model
observers can be used for such scoring tasks in DBT, showing good correlation
with human results (Cockmartin et al., 2014; Hu & Zhao, 2011; Michielsen
et al., 2016). The calcification clusters within the phantom are of special interest,
given that the detection of small objects such as microcalcifications is likely to be
limited by system quantum noise (Bochud et al., 1999) even in breast simulating
structured backgrounds. This work first describes the scoring of the calcification
clusters by human observers to derive detectability scores and the associated
reproducibility. A model observer (MO) is then developed using the human
reader output and the reproducibility of both observers is compared.
The MO presented here is a development of an earlier channelized Hotelling
observer (CHO) (Petrov et al., 2018) that modelled human observer
performance for the task of detecting calcifications using internal noise. That
approach of adjusting CHO performance to match that of human readers using
internal noise, however, is prone to overfitting and is unlikely to generalize
well to new data (Brankov, 2013). In previous work (Petrov et al., 2017; Petrov
et al., 2019), summarized in chapters 1 and 2 of this thesis, we found that
the internal noise approach could be replaced by careful channel tuning for the
case of non-spiculated mass lesions. In that work, the CHO channel parameters
were selected to include sufficient information from the images such that human
observer performance was matched. This method was shown to approximate
human performance for six different models of DBT system, covering a wide
range of scanning parameters, without the need for further tuning. Nevertheless,
the calcification clusters require a different approach for detection, as a
cluster is made up of a number of calcification particles. Here localization is
crucial for a properly working algorithm, as the calcification particles are
substantially smaller than the non-spiculated masses. With the scanning and
detection carried out in one step, the precise localization of each
calcification particle might be compromised by the anthropomorphic properties of
the channel set. To overcome these problems, this work describes a two-stage CHO
that utilizes separate localization and classification stages (Kundel et al.,
2007). The concept of a two-stage CHO was introduced as the visual search
observer (VSO) in prior work by the group of Gifford et al. (Das & Gifford,
2011; Gifford et al., 2017; Gifford et al., 2016; Karbaschi & Gifford, 2018),
with an initial localization stage in which an MO finds candidates potentially
belonging to a given calcification cluster. This was followed by a candidate
deletion step, after which the second stage applied another MO at the derived
locations to estimate the cluster detectability. The work of Das & Gifford
(2011) using a VSO did not require that all microcalcifications (MCs) in the
cluster be detected, and the same relaxed criterion is employed in the work
described here. There are some differences from the visual search based methods
(Das & Gifford, 2011; Gifford et al., 2017; Gifford et al., 2016; Karbaschi &
Gifford, 2018), notably that our implementation focuses on channel tuning at the
classification stage. Furthermore, the initial localization stage implements a
CHO that utilizes prior information about the target size, shape and location to
detect the most probable locations of the calcification particles forming the
cluster; in the classification stage, a tuned CHO is applied at the derived
locations to estimate detectability. This way, the sensitive area of the
algorithm is always aligned to the same locations within the images regardless
of the classification scores and tuning settings, ensuring that changes in
detectability reflect the influence of the MO classifier and not misalignments.
Finally, in this work we extend the reproducibility evaluation over three dose
levels, as the noise level may have some influence on both human and CHO
reproducibility.

2. Methods
2.1. Phantom properties and image acquisition

Figure 1. a). A photograph of the phantom. b). On the left, a DBT slice of the
phantom and on the right the positions where the ROIs were cropped from. Green
squares depict the positions of the signal absent ROIs and red the signal present
ROIs. c). Example images of the cropped ROIs with 5 calcifications forming a cluster
(4 in the corners of a square and 1 in the center).
The images used in this study were generated using a 3D structured phantom
(Cockmartin et al., 2017), made of an acrylic semi-circular container filled with
equal volumes of acrylic spheres of six different diameters (15.9, 12.7, 9.52, 6.35,
3.18 and 1.58mm), with the remaining volume filled with water (figure 1). Five
calcification clusters were inserted in the phantom, with calcification particles of
a given group lying respectively within diameter ranges: 90-100µm (mean
95µm), 112-125µm (mean 119µm), 140-160µm (mean 150µm), 180-200µm
(mean 190µm) and 224-250µm (mean 237µm) (figure 1). Each of the five groups
consisted of five individual calcifications placed in an X shape within the
phantom, with one calcification located at the center and four at the extreme
positions.
The phantom was scanned a total of 217 times on a Siemens Inspiration
tomosynthesis system (Siemens AG Healthcare, Erlangen, Germany) in order to
generate a sufficiently large image database to assess reproducibility. The first
scan was acquired under automatic exposure control (AEC) to establish the
typical tube voltage, anode/filter (A/F) and tube current-time product (mAs) for
the phantom on this system. The following 216 DBT scans were made at 3 dose
levels: 72 acquisitions with fixed parameters set as close as possible to the AEC
values (30kV, W/Rh A/F and 200 mAs), 72 acquisitions at a low dose setting
(30kV, W/Rh A/F and 112 mAs) and 72 acquisitions at a high dose setting (30kV,
W/Rh A/F and 400 mAs), all reconstructed with the EMPIRE reconstruction
algorithm (Abdurahman et al., 2014). The phantom was shaken after each
acquisition to produce different realizations of the background. The images of
each dose group were split into 6 sets of 12 scans each; the first 5 sets
(i.e. 60 DBT scans) were used for human and model observer reading,
while the 6th set was used for training the model observer template. Volumes of
interest (VOIs) with a size of 20 x 20 x 30 mm³ (236 x 236 x 30 voxels) were
cropped for the observer studies (figure 1b). From each scan, 15 partially
overlapping signal absent VOIs were extracted from the signal free volume
within the phantom, along with 5 unique signal present VOIs centered at each of
the five calcification clusters, corresponding to the five cluster sizes. Thus
each set of 12 scans generated an image reading set that contained 180 signal
absent VOIs and 60 signal present VOIs.
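The bookkeeping of the image database can be verified with a few lines of arithmetic. The variable names are illustrative, and the 0.085 mm pixel size is the value quoted for the Siemens system in chapter 2.

```python
# Image database bookkeeping, using the numbers given in the text.
scans_per_dose_level = 72
dose_levels = 3
sets_per_dose_level = 6

scans_per_set = scans_per_dose_level // sets_per_dose_level   # 12 scans per set
total_scans = 1 + dose_levels * scans_per_dose_level          # AEC scan + 216 fixed scans

sa_vois_per_scan, sp_vois_per_scan = 15, 5
sa_vois_per_set = sa_vois_per_scan * scans_per_set            # signal absent VOIs per set
sp_vois_per_set = sp_vois_per_scan * scans_per_set            # signal present VOIs per set

# In-plane side of a VOI in mm: 236 voxels at the Siemens pixel size.
voi_side_mm = 236 * 0.085
```

The in-plane VOI side evaluates to about 20.06 mm, in agreement with the nominal 20 mm.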
2.2. Human observer study
Six medical physicists participated in a four-alternative forced choice (4AFC)
image reading study. Three signal absent VOIs and 1 signal present VOI
were displayed simultaneously and the observer's task was to indicate the signal
present image. A correct response was defined as a ‘hit’. After each
reading session the percentage of correctly detected signals (PC) was calculated
using the formula:
$$PC = \frac{\text{Number of hits}}{\text{Number of trials in a session}} \cdot 100\% \tag{1}$$
The 4AFC study was conducted using an in-house software tool (Zhang et al.,
2016) and the images were read on a diagnostic monitor (5MP Barco MDNG-6121)
at an ambient light level of 3 lux. The observers were instructed to score
the sessions at a consistent distance of around 50cm from the screen. No
magnification or window level adjustments were allowed; only scrolling
through the VOI was permitted (and was in fact necessary). No time constraints
were imposed on the reading sessions and the whole study was performed
within a period of 1 week. It is common for observers to read a training
dataset before participating in such reading studies; however, at that time our
observers were participating in other reading studies with the same phantom, so
a training set was not required. To study the reproducibility, the standard
deviation (SD) was estimated for each of the 6 observers by bootstrapping 100
times: 12 random 4AFC scores were drawn without replacement from the 60 scores
that were available by combining the 5 datasets at a given dose level and
calcification size.
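The resampling scheme for one observer, dose level and calcification size could look like the sketch below. It assumes that each of the 60 scores is a binary trial outcome (1 for a hit, 0 for a miss) and that PC is computed for every draw of 12 scores; the function name and the seed are hypothetical.

```python
import numpy as np

def bootstrap_pc_sd(trial_hits, n_draw=12, n_boot=100, seed=0):
    """SD of PC over n_boot subsamples of n_draw trials drawn without replacement."""
    rng = np.random.default_rng(seed)
    trial_hits = np.asarray(trial_hits, dtype=float)
    pcs = [100.0 * rng.choice(trial_hits, size=n_draw, replace=False).mean()
           for _ in range(n_boot)]
    return float(np.std(pcs, ddof=1))

# Hypothetical observer with 45 hits out of 60 trials.
sd_example = bootstrap_pc_sd([1] * 45 + [0] * 15)
```

An observer who scores every trial correctly (or incorrectly) yields an SD of zero, since every subsample then gives the same PC.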
For each condition the smallest SD among the human observers was used to
quantify the human reproducibility, on the assumption that the smallest SD in
the responses represents the most precise human reader in our group, with the
least external distraction. In this way the greater part of the variability is
caused by the reading task, and the effect of observer strain or distraction is
diminished. The mean PCs from the 5 image datasets, with error bars equal to
the bootstrapped SD, were plotted along with the observer scores for each
session and dose level.
2.3. Model observer
Signal detection theory (Green & Swets, 1966) utilizes the concept of a decision
variable, which is an internal response generated by the observer that is related
to the presence of a stimulus (e.g. a calcification here) and which forms the basis
of the decision made by the observer for each AFC trial. In this study a
channelized Hotelling observer (CHO) was implemented, with a decision
variable calculated using the following formula (H H Barrett et al., 1993):
$$\lambda(v) = w_{CHO}^{t}\, v, \qquad v = U^{t} g; \qquad w_{CHO} = s^{t} K_v^{-1} \tag{2}$$

Here, 𝑣 is the channel output vector, which is the product of the transposed
channel matrix U and the image vector g. In the same expression, 𝑤𝐶𝐻𝑂 is the
observer template, formed from the expected signal output vector 𝑠 and the
inverse of the covariance matrix, $K_v^{-1}$. The covariance matrix was estimated
from signal absent images and the expected signal was approximated using a
Gaussian function with a full width at half maximum (FWHM) equal to the average
diameter of the individual calcifications within a calcification group. This
way, only signal absent training images were needed for the estimation of the
CHO template for our application. In order to minimize the CHO bias, the
training of the observer template was performed on a separate image set of 12
DBT acquisitions that were not used for reading. From the training image
dataset, 180 signal absent VOIs were extracted, and for each VOI the 5 central
slices were taken for the covariance matrix formation, giving 900 signal absent
training images.
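A minimal sketch of this training step is shown below: signal absent images are channelized, the channel covariance is estimated, and a Gaussian with a given FWHM is passed through the same channels to obtain the expected signal. The function names are hypothetical, the channel matrix is assumed to hold one channel per column, and the linear system is solved directly instead of forming the explicit inverse.

```python
import numpy as np

def gaussian_signal(n_pix, fwhm_px):
    """Centred 2D Gaussian with the requested FWHM (in pixels)."""
    coords = np.arange(n_pix) - (n_pix - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    return np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / fwhm_px**2)

def train_cho(u, sa_images, fwhm_px):
    """Template from signal absent images only.

    u         : (n_pix * n_pix, n_channels) channel matrix
    sa_images : (n_images, n_pix, n_pix) signal absent training images
    """
    n_pix = sa_images.shape[1]
    v_sa = sa_images.reshape(len(sa_images), -1) @ u      # v = U^t g for every image
    k_v = np.atleast_2d(np.cov(v_sa, rowvar=False))       # channel covariance matrix
    s = u.T @ gaussian_signal(n_pix, fwhm_px).ravel()     # expected signal in channel space
    return np.linalg.solve(k_v, s)                        # K_v^{-1} s (K_v is symmetric)

def decision_variable(w_cho, u, image):
    """Decision variable for a single image."""
    return float(w_cho @ (u.T @ image.ravel()))
```

Because the covariance matrix is symmetric, solving the linear system gives the same template as multiplying the expected signal by the inverse covariance matrix in equation 2.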
The Gaussian approximation of the expected signal was used with the consent to
the results from chapter 1, where this template showed good potential to replace
the image estimated expected signal. This way the sensitive region of the CHO
included only a single Gaussian function, corresponding to a single calcification
particle. As a consequence, detection of the calcification cluster within a VOI
requires the exact location of the calcification particles within the cluster of 5.
Due to physical manufacturing tolerances, the exact location of the targets within
the physical phantom is not known and thus the CHO algorithm was divided into
two stages – localization followed by classification (Kundel et al., 2007).

Localization CHO

Figure 2. Visualization of the searching mechanism. Left: a decision variable
map showing the observer response within the central 16x16 pixels of a signal
present image. The position of the maximum observer response is taken as the
real location of the center calcification. Right: from this position, the extreme
calcifications are detected within areas of the same size.
The localization stage was inspired by the work of Michielsen et al. (Michielsen
et al., 2016), in which the calcification clusters were detected by scanning 5
volumes within the VOI and the localization and the classification were done in a
single stage. In the current study the central calcification was detected by
scanning an area of 1.36x1.36 mm² in the center of the ROI (figure 2), i.e.
16x16 pixels for the Siemens system. This produces a decision variable
map, in which the position of the highest observer response was taken as the
most probable calcification position. The four extreme calcifications are
expected 5.2 mm from this center calcification position.
Each of the four corner calcifications was detected using the same scanning area
as for the center calcification (1.36x1.36 mm²). This was repeated in the 4
adjacent planes (2 above and 2 below the central plane) and, for each
calcification, the most probable position within the VOI was stored for the
classification stage. This process was performed for both signal present and
signal absent images.
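The search described above can be sketched as follows for a single plane. This is an illustration only: `response` stands for any function returning the localization CHO decision variable at a pixel position, and the corner offsets passed to `localize_cluster` are hypothetical, since the exact target layout is not specified here.

```python
def scan_size_px(size_mm=1.36, pixel_mm=0.085):
    """Side length of the scanned area in pixels (16 for a 0.085 mm pixel)."""
    return round(size_mm / pixel_mm)

def localize(response, center, size_px):
    """Return the (x, y) position with the highest response inside a
    size_px x size_px window placed around `center`."""
    cx, cy = center
    half = size_px // 2
    best_value, best_pos = None, None
    for y in range(cy - half, cy - half + size_px):
        for x in range(cx - half, cx - half + size_px):
            value = response(x, y)
            if best_value is None or value > best_value:
                best_value, best_pos = value, (x, y)
    return best_pos

def localize_cluster(response, roi_center, corner_offsets_px, size_px):
    """Find the center calcification first, then search each corner in a
    window of the same size placed at the expected offset from it."""
    center = localize(response, roi_center, size_px)
    positions = [center]
    for dx, dy in corner_offsets_px:
        expected = (center[0] + dx, center[1] + dy)
        positions.append(localize(response, expected, size_px))
    return positions
```

The same routine would be applied to signal present and signal absent images alike, and repeated per plane to cover the adjacent slices.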
The localization CHO was implemented using 2 Laguerre-Gauss (LG) (Gallas &
Barrett, 2003) channels:
$$C_j(r) = \frac{\sqrt{2}}{a_u} \exp\left(\frac{-\pi r^2}{a_u^2}\right) L_j\left(\frac{2\pi r^2}{a_u^2}\right); \quad L_j(x) = \sum_{k=0}^{j} (-1)^k \binom{j}{k} \frac{x^k}{k!} \qquad (3)$$

where $a_u = \sqrt{2\pi}\,\sigma_u$, with 𝜎𝑢 the standard deviation of the Gaussian, which
is used as a tuning factor. The rank j of the Laguerre function was varied from
0 to 1, forming the 2 channels.
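As an illustration of equation (3), the two channels could be sampled on a pixel grid as sketched below. The grid size, the use of pixel units for 𝜎𝑢 and the function names are assumptions made for this example, and the relation a_u = sqrt(2π)·σ_u is the reading of the width parameter adopted here.

```python
import math

def laguerre(j, x):
    """L_j(x) = sum_{k=0}^{j} (-1)^k * C(j, k) * x^k / k!"""
    return sum((-1) ** k * math.comb(j, k) * x ** k / math.factorial(k)
               for k in range(j + 1))

def lg_channel(j, sigma_u, size):
    """Sample C_j(r) of equation (3) on a size x size grid, r measured in
    pixels from the grid center."""
    a_u = math.sqrt(2.0 * math.pi) * sigma_u
    c = (size - 1) / 2.0
    channel = []
    for y in range(size):
        row = []
        for x in range(size):
            r2 = (x - c) ** 2 + (y - c) ** 2
            gauss = math.exp(-math.pi * r2 / a_u ** 2)
            row.append(math.sqrt(2.0) / a_u * gauss
                       * laguerre(j, 2.0 * math.pi * r2 / a_u ** 2))
        channel.append(row)
    return channel

# Ranks j = 0 and j = 1 with sigma_u = 2 (grid size is a hypothetical choice).
lg_channels = [lg_channel(j, sigma_u=2.0, size=17) for j in (0, 1)]
```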
The number of channels and the tuning factor were varied in order to pick the
channels that maximize the localization abilities of the algorithm. This was
performed by visually inspecting the decision variable map (Figure 2), with the
final channel set selected as the one that generated a narrow peak (ideally at only
one pixel within the scanned area).
Classification CHO
At the classification stage, a CHO was used to compute five observer responses
for each signal present and signal absent image, further used to estimate the final
observer performance. Given that the CHO performance was to be compared
with human reader results, a CHO with anthropomorphic channel properties was
used (Harrison H Barrett et al., 2015). Eight Gabor channels (Watson, 1983) were
selected:
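$$C_{i,j,k}(x, y) = \exp\left(-4\ln(2)\,\frac{x^2 + y^2}{w_i^2}\right) \cos\left[2\pi f\,(x \cos\theta_j + y \sin\theta_j) + \phi_k\right], \qquad (4)$$
where $w_i = \frac{t_w}{(e^{i}+2)\,pxs}$, $f = \frac{t_f}{w_i}$, $\theta_j = \frac{\pi j}{t_t}$ and $\phi_k = 45\,k$.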

The parameters 𝑤𝑖 , 𝑓 , 𝜃𝑗 and 𝜙𝑘 control the width, (spatial) frequency,
orientation and phase of the channels, respectively. In this Gabor channel
implementation, the frequency was set as a function of channel width, resulting
in a channel set with just one sinusoidal maximum around the point selected by
𝑥 = 0 and 𝑦 = 0 in the formula. In generating the 8 channels here, one phase (𝑘 =
0), two orientations (𝑗 = [0,1]) and four frequencies (𝑖 = [0,3]) were selected.
Our implementation therefore has 3 tuning factors, 𝑡𝑤 , 𝑡𝑓 and 𝑡𝑡 , influencing
channel width, frequency and orientation, respectively. In this study, prior to the
comparison of the human and model observers, the Gabor parameters were
tuned to give the closest approximation of human observer performance. The
tuning was performed by evaluating the CHO performance at the AEC dose level
with different combinations of the three tuning parameters, with 𝑡𝑤 ∈
[0.9, 20.0]; 𝑡𝑓 ∈ [3, 160]; 𝑡𝑡 ∈ [5 deg, 175 deg]. Candidates for channel
parameter sets were selected using three criteria between CHO and human
observer scores: the Pearson’s correlation (r), linear regression slope (a) and the
mean error (ME), with ideal scores of r = 1, a = 1 and ME = 0. Initial candidate
parameter values had to fulfill the criteria: correlation r > 0.98, linear slope
a ∈ [0.98, 1.02] and absolute ME < 2 PC. These criteria were combined with
equal weight. These sets of candidate parameters were then tested on the Low
and High dose levels and the best performing parameter set for all dose levels
was chosen. In order to avoid overfitting, the tuning was performed on a separate
image dataset, different from the images used for reading in the reproducibility
study.
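The Gabor function of equation (4) can be sketched as below. To stay on safe ground, the sketch takes the width, frequency, orientation and phase directly as arguments and does not reproduce the mapping from the tuning factors 𝑡𝑤, 𝑡𝑓 and 𝑡𝑡; the numerical values, the use of radians for the angles and the function name are assumptions made for this example.

```python
import math

def gabor_channel(width, freq, theta, phase, size):
    """Sample equation (4) on a size x size grid, (x, y) = (0, 0) at the center.

    width: full width at half maximum of the Gaussian envelope, in pixels
    freq:  spatial frequency, in cycles per pixel
    theta: orientation, in radians
    phase: phase, in radians
    """
    c = size // 2  # center index; size is assumed to be odd
    channel = []
    for row in range(size):
        y = row - c
        values = []
        for col in range(size):
            x = col - c
            envelope = math.exp(-4.0 * math.log(2.0) * (x * x + y * y) / width ** 2)
            carrier = math.cos(2.0 * math.pi * freq
                               * (x * math.cos(theta) + y * math.sin(theta)) + phase)
            values.append(envelope * carrier)
        channel.append(values)
    return channel

# Eight channels: one phase, two orientations, four widths (hypothetical values),
# with the frequency tied to the width so that one maximum sits at the center.
widths = [16.0, 8.0, 4.0, 2.0]
gabor_channels = [gabor_channel(w, 0.5 / w, theta, 0.0, 33)
                  for theta in (0.0, math.pi / 2.0) for w in widths]
```

The factor 4 ln(2) in the exponent makes the width parameter the full width at half maximum of the envelope, which is why the envelope equals 0.5 at a distance of half the width from the center.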
In order to calculate the CHO response for signal present and signal absent VOIs,
the Gabor channels were generated for each VOI at the most probable
calcification locations estimated from the localization stage. This gave 5 observer
responses corresponding to the 5 calcification particles forming the cluster.
Since the human observers internally produce a single observer response for the
VOI and score the calcification cluster as a single target, the CHO should mimic
this operation (Michielsen et al., 2016). If we assume the 5 CHO responses to be
independent, the overall image observer response can be calculated as a
summation of the observer responses from the individual calcifications:
$$\lambda(v|H_i)_k = \sum_{j=1}^{3} \lambda_{jk} \qquad (5)$$

where (𝑣|𝐻𝑖 ), 𝑖 ∈ {0,1} refers to the image with signal present or signal absent
ground truth, 𝑘 refers to the 5 calcification group sizes and 𝑗 refers to the
calcification positions within the cluster. In the formula 𝑗 ∈ {1,2,3}, as the human
observers reported that the detection of three calcifications in the expected
positions is enough to score the image positively. In other words, the model
observer decision variable for the calcification cluster is a sum of the three
highest decision variables calculated for the 5 individual calcifications.
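Equation (5) amounts to sorting the five individual responses and summing the three largest. A minimal sketch, with a hypothetical function name and made-up response values:

```python
def cluster_decision_variable(responses, n_highest=3):
    """Sum of the n_highest largest individual decision variables."""
    return sum(sorted(responses, reverse=True)[:n_highest])

# Five responses from the five calcification positions of one image.
example = cluster_decision_variable([0.2, 1.5, -0.3, 0.9, 1.1])
```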
To facilitate comparison with the human results, the CHO scores were expressed
as the percentage of correctly estimated signals (PC) in a 4AFC test. This was
done by comparing a decision variable estimated from a signal present image
with 3 decision variables from 3 signal absent images. The outcome was scored
a ‘hit’ when the decision variable for the signal present image was the greatest
among the four. The PC was calculated using the same formula used for the
human observer scores. Finally, CHO reproducibility was assessed by calculating
the SD from bootstrapping the results 100 times, where each sample consisted
of 12 4AFC trials. The mean and SD estimates were plotted along with the CHO
results for each reading session.
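The 4AFC scoring and the bootstrap estimate of reproducibility can be sketched as follows. The sketch assumes that PC is the plain percentage of hits and that bootstrap samples are drawn with replacement from the recorded trial outcomes; the function names and the fixed random seed are choices made for this example only.

```python
import random
import statistics

def four_afc_hit(present, absents):
    """A trial is a hit when the signal present decision variable is the
    greatest among the four alternatives."""
    return present > max(absents)

def percent_correct(hits):
    """Percentage of hits over a list of boolean trial outcomes."""
    return 100.0 * sum(hits) / len(hits)

def bootstrap_pc(hits, n_boot=100, sample_size=12, seed=0):
    """Mean and SD of PC over n_boot resamples of sample_size trials each."""
    rng = random.Random(seed)
    pcs = [percent_correct([rng.choice(hits) for _ in range(sample_size)])
           for _ in range(n_boot)]
    return statistics.mean(pcs), statistics.stdev(pcs)
```

With 12 trials per sample, PC can only take values in steps of 100/12, about 8.3 PC, which sets the granularity of the SD estimates.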
3. Results
3.1. CHO tuning

Figure 3. Top: CHO performance expressed as percentage correct (PC) plotted
against Gabor channel tuning parameter values. The title of each graph indicates
the parameter varied, while the other tuning parameters were kept constant at the
values indicated by the two vertical dashed lines in the other graphs. The tuning
parameters specified by the dashed lines describe the channels used for the
reproducibility study. Bottom: The three images below each graph visualize the
exact channels produced by setting the parameter at the lowest, the median and
the highest value within the specific studied range. The visualizations are formed
from 8 rectangles (60x236px) cropped from the 8 used channels.

[Figure 4: three panels, a) to c), each plotting model PC (%) against human PC (%),
both axes ranging from 0 to 100.]

Figure 4. Model observer PC results plotted against the human observer PC results.
a). Low dose level; b). AEC dose level; c). High dose level.
The tuning parameter for the localization CHO (𝜎𝑢 ) produced a single peak in the
1st stage CHO response within the scanned area with 𝜎𝑢 = 2 (figure 2). Figure 3
shows the tuning ability of the Gabor channels used in the classification stage
with visualization of the channels. Each graph shows the effect on the CHO
scores of tuning a single parameter while keeping the others constant.
It can be seen that the channel tuning parameters clearly influence CHO
performance, with a different effect for each calcification diameter. Table 1
contains the tuning parameters chosen for this study (indicated with a vertical
line in figure 3), with the criteria scores shown in Table 2. Figure 4 shows the
model against human observer results, where the CHO results are produced
using the aforementioned tuning parameters.
Table 1. CHO channels tuning parameters.
First layer CHO (localization)    Second layer CHO (classification)
𝝈𝒖                                𝜃𝑡 [degree]    𝑡𝑓    𝑡𝑤
2                                 50             57    4.1

Table 2. Model observer against human observer results criteria.
Dose level    Pearson’s correlation (r)    Linear regression slope (a)    Mean error (ME)
Low           0.994                        1.13                           4.44 PC
AEC           0.991                        1.05                           0.69 PC
High          0.977                        1.13                           5.89 PC

3.2. Observer reproducibility

Figure 5. Observer results for a). Low dose level; b). AEC dose level; c). High dose
level. For each dose level the top graph shows the human observer reproducibility
results and the bottom graph the model observer reproducibility results. On the
graphs the mean and the SD estimates are depicted in black and the observer
scores for each of the 5 sessions in gray.
Scatter plots of the model and human observer reproducibility results are
presented in figure 5. For each mean calcification diameter, the graphs show PC
values for each of the 5 image datasets (12 scans in each dataset) in gray and the
mean PC with the SD estimates are indicated with black lines/bars. Table 3 gives
SD values for each dose level, observer and mean calcification diameter.
The reproducibility results show that the CHO is approximately twice as
reproducible as human reading for the smaller calcification sizes.
For the human observer, the lowest reproducibility was found for the 119 µm
calcification cluster in the low dose level dataset with 𝑆𝐷𝐻𝑢𝑚𝑎𝑛,119µm = 13.1 PC
and the most reproducible are the two largest calcification clusters, which were
100% correctly read by the humans with no variation, and thus SD equal to 0.0.
For the model observer, the low dose level 95 µm calcification cluster had the
lowest reproducibility (Table 3), with 𝑆𝐷𝑀𝑜𝑑𝑒𝑙,95µm = 5.5 PC, and the most
reproducible were the largest clusters at the low and high dose levels, which had
SD equal to 0.0 PC. For the case of the AEC dose level, the CHO in fact exhibits
some variability in PC for the two largest calcification clusters, with SD equal
to 2.2 PC and 0.9 PC compared to the value of 0.0 PC seen for the human observers.

Table 3. Standard deviation of the observer results measured in PC.
Dose level    Observer    95 µm    119 µm    150 µm    190 µm    237 µm
Low           𝑆𝐷𝐶𝐻𝑂       5.5      4.8       5.2       1.1       0.0
Low           𝑆𝐷𝐻𝑢𝑚𝑎𝑛     12.9     13.1      9.6       0.0       0.0
AEC           𝑆𝐷𝐶𝐻𝑂       4.7      5.4       3.2       2.2       0.9
AEC           𝑆𝐷𝐻𝑢𝑚𝑎𝑛     11.2     12.4      7.0       0.0       0.0
High          𝑆𝐷𝐶𝐻𝑂       4.2      5.4       1.2       0.0       0.0
High          𝑆𝐷𝐻𝑢𝑚𝑎𝑛     11.9     3.6       3.7       0.0       0.0

4. Discussion
The aim of this study was to develop an anthropomorphic CHO algorithm for
automated image quality estimation in DBT using a structured phantom with
calcification clusters and assess the human and model observer reproducibility,
when reading images of this type. Three dose levels were studied in order to
check how observer reproducibility varies as the system (x-ray) noise level is
varied between images. The initial work on this topic, performed only at the AEC
dose level (Dimitar Petrov et al., 2018), did not have the localization stage described in this
study, but instead used an internal noise algorithm to adjust the CHO
performance so that the human observer PC results were matched. Using the
internal noise step, we found algorithm evaluation results of r = 0.997; a = 0.92
and ME = -0.57 PC at the AEC dose level only (Dimitar Petrov et al., 2018). Using
internal noise, however, compromised the localization abilities of that algorithm,
which led to a poor performance when applied to the two additional dose levels
studied in the present work. Furthermore, the guessing rate of the previous CHO
implementation exceeded 40%. These arguments were the trigger to further
improve the algorithm. The CHO detection was split into two stages, an initial
localization step followed by a classification step. The localization stage used 2
LG channels and did not include any observer performance estimate such as area
under the curve (AUC), detectability index (d’) or PC, as such an estimate could
compromise the localization performance of the algorithm. The most probable
calcification locations forming the calcification cluster were estimated only from
the 1st stage model observer responses calculated from 5 volumes within the VOIs.
In order to avoid bias, the localization stage was performed on both signal present
and signal absent images. The second stage, that of classification, applied 8 (tuned)
Gabor channels at the most probable calcification positions for the signal present
and the signal absent images. The single decision variable produced for each
image was then used in the 4AFC evaluation. Using this technique, the guessing
rate of the modified CHO method was 23 ± 4 PC, which is close to
the expected value of 25 PC for a 4AFC method.
Both the visual search observers (VSO) introduced by Gifford et al. (Das & Gifford, 2011;
H. C. Gifford et al., 2017; Howard C. Gifford et al., 2016; Karbaschi & Gifford, 2018)
and our method use two consecutive MOs. However, the impetus behind our
implementation is solely to allow channel tuning. In general, the VSO algorithms
find as many candidates as possible within the image in the first step, followed
by candidate deletion, where a candidate is dismissed if its score is
under a certain threshold or its location is considered mistaken. In contrast,
the two-stage CHO presented here is spatially restricted in its first stage to 5
areas known a priori, where in each of them a single calcification particle is
expected. In fact, once the localization channel parameters are set, we expect that
the algorithm will always pick the same position for each of the five particles.
This way, i.e. without candidate deletion, the classification stage can utilize a
tuned channel set to approximate human performance. In the case of the VSO
method, it seems the opposite is observed in that the performance matching
mostly occurs in the localization and candidate deletion stage.
The results in Figure 3 clearly show that the tuning of the CHO can be done by
selecting the channel parameters instead of adding internal noise, which is
prone to overfitting (Brankov, 2013). The graph on the left side (figure 3),
showing the observer performance against the orientation of the Gabor filters,
shows an improvement in performance at around 90° spatial orientation for the
calcification particles with a size of 150 μm. This suggests that the imaged
calcifications might not be ideally circular and a more realistic expected signal
estimate could be used in the future instead of a Gaussian blob. The graph in the
middle (figure 3), showing the observer performance against the frequency
tuning parameter, shows a drastic drop in performance for 𝑡𝑓 > 99, as this
frequency is too high to include the full calcification size, which brings the
signal present and signal absent decision variables into a closer value range. The
same effect is reached when the channel width is too small, depicted on the right
graph (figure 3) for 𝑡𝑤 < 2.5. Also in the same graph, making the channels much
wider than the object size (i.e. increasing the 𝑡𝑤 value) gradually lowers the
performance, due to inclusion of too much background, which in turn
reduces the effect of the signal in the decision variable.
The results in Figure 4 show that the parameters used for the 2nd (classification)
stage can be tuned so that CHO performance predicts the human reader results
and that a single parameter set applies at all three dose (noise) levels for this
system. The two-stage CHO gives a good approximation to human reader results
with 𝑟 > 0.97; 𝑎 ≤ 1.13 and 𝑀𝐸 < 5.9 PC (Table 2), without the use of internal
noise. An increase in microcalcification detection rate with increasing dose is
clearly visible in the data. After tuning and training, the model observer achieved
scores similar to those of the human readers and therefore the impact of dose on
the scores was similar for both MO and human reading. In an earlier study
(L Cockmartin et al., 2017), the same phantom was used to evaluate a series of DBT
systems operated in 2D and DBT mode, using human reading. A combined analysis
using 2D and DBT data is required before initial estimates of ‘optimal’ dose levels
can be given for patient studies for the different DBT systems. Detailed dose
optimization studies are outside the scope of the present study. It must be
stressed that the human reader data was used to tune the CHO and therefore it
gives similar PC results to the human readers by design. Nevertheless, the reading
results shown in Figure 4 represent the CHO performance on a separate image set
and differ only within error bars from the reading results from the tuning process.
The reproducibility data show that the MO is more reproducible than human
observers for the smaller calcification sizes. This is an encouraging result for the
application of the test object for quality assurance purposes using model
observers. For example, the increased reproducibility of the CHO would lead to
a statistically significant difference between AEC and low dose levels for the
119 μm and 150 μm calcification clusters, with 𝑝 < 0.001 in both cases for the
CHO, as opposed to 𝑝 = 0.068 and 𝑝 = 0.088 for the human
observers. For the larger calcification particle sizes (190 and 237 μm) at AEC dose
level, despite the good agreement between the CHO and human observer scores,
the CHO reproducibility is notably poorer, with values of 2.2 PC and 0.9 PC for
the CHO compared to 0.0 PC and 0.0 PC found for human readers. This could
indicate that some further tuning might be needed or that a larger image dataset
is required.
Young et al. (Young et al., 2006) compared human reader reproducibility with
that of four different methods of processing contrast-detail curves for the
CDMAM mammography test object, which also presents a 4AFC detection task
to the observers. The four processing methods were applied to the output of the
CDCOM module automatic detection algorithm developed by Karssemeijer and
Thijssen (Karssemeijer & Thijssen, 1996). The study found lower contrast
thresholds for the CDCOM based methods yet higher reproducibility for the
human observers than for the automatic scoring methods. In contrast, the data
in our study show better reproducibility for the CHO by a factor of ~2. Higher
reproducibility for human reading of CDMAM over the CDCOM method could
result from the task being disc detection in a homogeneous background
containing only x-ray quantum mottle. If a test object with targets set in a
homogeneous background (e.g. the ACR digital mammography phantom) is used
for constancy testing of DBT systems, the in-focus plane can be established for a
given system and this plane used for constancy testing, always scored by the
same reader. This scenario is likely to give low uncertainty on the threshold
diameters obtained, but will not assess the detection performance of the
system for targets in structured background (which is likely to be related to
aspects such as the angular range and reconstruction algorithm used)
(Sechopoulos, 2013). One could argue that a simpler phantom could be used for
constancy testing, where the aim is to find changes in system performance.
This CHO implementation has certain limitations. (1) Our reliance on tuning to
match the detection performance of a certain group of human observers could
be considered a limitation of this work. For anthropomorphic MOs, one could
argue that the channel functions should be closely linked to the task and
thus tuning away from the expected (task) function to alter MO performance is
somewhat arbitrary and a move away from a pure task-based observer. Future
work could explore the implementation of a full VSO, although we will be
constrained by the test object. In fact, our CHO was specifically developed for the
evaluation of the 3D structured test object, in the place of human observers in
the Medical Physics department. We are tuning to a specific group of observers
(we could equally tune to a different set of observers), because we wanted
a tool that produced similar output to the observers at our center. (2) The
localization stage requires a priori knowledge of the approximate position of the
calcification particles forming the cluster. This is needed in order to avoid
scanning large areas within the phantom as this would require a significant
amount of computational time. In our case the exact locations of the targets
rarely deviated by more than 5-6 pixels from the expected locations, so the search
range of 8 pixels from the center provides a safety margin. Although the CHO algorithm can be used on
images derived with different acquisition parameters (kV, image reconstruction,
etc.) or on different DBT systems, the channel parameters for both stages will
need to be validated and then possibly retuned in order to achieve a good
agreement with human observers. Nevertheless, our previous study (Dimitar
Petrov et al., 2017) on the detectability of simulated non-spiculated masses in
the same phantom background showed that channels generated in real space
perform well under different conditions, such as different DBT systems, dose
levels and reconstruction algorithms (chapter 2). The channel parameters described
in this paper might be used as a starting point for a finer tuning to specific applications.
5. Conclusions
This work has demonstrated that the CHO is approximately twice as reproducible
as the human observers in this study for the smaller calcification
diameters. This makes the combination of structured phantom and CHO
reading a promising approach for Quality Assurance and related activities.

2D FULL-FIELD DIGITAL MAMMOGRAPHY


The results of this study are based on K.T Wigati, L. Vancoillie, E. Salomon, G.
Zhang, L. Cockmartin, N. Marshall, H. Bosmans, D. S Soejoko, K. Bliznakova, D.
Petrov, ‘Channelized Hotelling observer assessing microcalcification detectability
on 2D mammography: a first application to study the impact of tube voltage’, IOP
Conf. Series: Journal of Physics: Conf. Series 1248 (2019)
1. Introduction
Full-field digital mammography (FFDM) is the accepted method for population
breast cancer screening. As such, the European Commission (EC, 2012; Perry et
al., 2006) has set a strict quality control procedure in order to successfully apply
the ‘as low as reasonably achievable’ principle. Often the suggested limiting
values are hard to link to actual image quality. The impetus behind the original
study (Wigati et al., 2019), in which the automated image quality evaluation tool
was applied, was to study the change in image quality at different tube voltage
settings on a Siemens Inspiration 2D FFDM system.
There are two levels of limiting values in the European directives regarding the
tube voltage: the tube voltage accuracy should be within ±1 kVp of the set
value and the suspension limit, where the system should be excluded from the
patient workflow, is ±2 kVp deviation from the set value. In order to test these
limits, a physical phantom was scanned at AEC settings and this tube voltage was
set as the reference point from which 4 more tube voltage points were added to
the image dataset: ±1 kVp and ±2 kVp. The calcification clusters present in the
physical phantom had been read by human observers.
The aim of this subchapter is to present a newly developed 2D CHO algorithm
based on the DBT CHO algorithm from the previous subchapter. The 2-layer CHO
methodology is adapted to work with 2D images and re-tuned to
reproduce the already available human observer results.
2. Materials and methods
2.1. Phantom, image acquisition and human observer study
The spheres phantom developed by Cockmartin et al. (L Cockmartin et al., 2017)
was used with the same calcification cluster targets as described in the previous
subchapter. The 2D images were acquired on a Siemens Mammomat Inspiration
FFDM system. The reference tube voltage level was 30 kVp. With the reference
level set, the phantom was scanned 10 times in semi-automatic mode, where the
tube voltage was manually set and the tube current-time product (mAs) was
selected by the system exposure control. This was repeated for each tube voltage
in the range 28 kVp to 32 kVp. The scanning parameters are given in table 4.
Table 4. Scanning parameters for the FFDM system
Tube voltage [kVp]    Acquisitions    Exposure [mAs]    Anode/Filter
28                    10              129±3             W/Rh
29                    10              115±4             W/Rh
30                    10              98±2              W/Rh
31                    10              83±3              W/Rh
32                    10              74±2              W/Rh
From the FFDM images, region of interest (ROI) segments of 20x20 mm² were
extracted. Twenty images were cropped from each FFDM image, namely 15 signal
absent images and 1 signal present image for each of the 5 calcification cluster
sizes. With a pixel size of 0.085 mm, the ROI size was 236x236 pixels.
Five experienced medical physicists read the complete dataset in a 4-AFC
paradigm using the ‘Foursquares’ tool, under the reading conditions described in
the previous subchapter. The percentage of correctly detected targets (PC) was
used as an estimate of image quality. The final results are expressed as a diameter
threshold, the calcification diameter above which the targets become reliably
visible. The thresholds were calculated as the diameter at which 𝑃𝐶 = 62.5%,
using a logistic regression fit to the PC data:
$$PC(d) = 0.25 + \frac{0.75}{1 + e^{-f(d - d_{tr})}} \qquad (6)$$

where 𝑑 is the average calcification particle size for the specific cluster, 𝑑𝑡𝑟 is the
diameter threshold and 𝑓 is a free parameter. The regression fit was performed
in GraphPad Prism software, which was also used for the diameter threshold
uncertainty estimates.
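As a sketch of how equation (6) yields the threshold, the function below evaluates the psychometric curve (with PC expressed as a proportion) and a simple least-squares grid search recovers 𝑑𝑡𝑟 and 𝑓. The grid search is a stand-in chosen for this example and is not the fitting procedure of GraphPad Prism; the function names, grids and synthetic data are hypothetical.

```python
import math

def psychometric(d, d_tr, f):
    """Equation (6): proportion correct in a 4AFC task, 0.25 guessing rate."""
    return 0.25 + 0.75 / (1.0 + math.exp(-f * (d - d_tr)))

def fit_threshold(diameters, pcs, d_grid, f_grid):
    """Return the (d_tr, f) pair on the grid with the smallest squared error."""
    best = None
    for d_tr in d_grid:
        for f in f_grid:
            sse = sum((psychometric(d, d_tr, f) - pc) ** 2
                      for d, pc in zip(diameters, pcs))
            if best is None or sse < best[0]:
                best = (sse, d_tr, f)
    return best[1], best[2]

# Synthetic example: the five mean calcification diameters in mm.
diameters = [0.095, 0.119, 0.150, 0.190, 0.237]
pcs = [psychometric(d, 0.12, 60.0) for d in diameters]
d_grid = [0.08 + 0.001 * i for i in range(101)]
f_grid = [10.0 * i for i in range(1, 16)]
d_tr_fit, f_fit = fit_threshold(diameters, pcs, d_grid, f_grid)
```

At 𝑑 = 𝑑𝑡𝑟 the exponential equals 1, so PC = 0.25 + 0.75/2 = 0.625, which is why the threshold corresponds to 62.5% correct.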
2.2. 2D CHO for calcification clusters
The CHO model observer from the previous subchapter was reworked for the
FFDM application, since the 2D datasets differ from the DBT datasets for
which the model observer was originally developed. As an example, the 2-layer
design had to be reworked for input from 2D images only. A minor rewriting of the
1st layer was performed, where the localization was constrained to the x-y plane
(the z-direction was not available). This gave the most probable (X,Y) location
of the calcification particles for each specific image; the 2nd
classification layer was applied at this location. Given the same pixel size for
FFDM and DBT, the tuning parameter for the Laguerre-Gauss channels in the
localization step was kept the same. Due to the different properties of the
background visualized with FFDM and DBT, the Gabor channels in the
classification layer were retuned. The tuning was performed on the 28 kVp tube
voltage level, and once a good channel parameter set was found the same CHO
was applied on the other 4 tube voltage levels.
The goodness of the fit between the tuned model observer and the average
human observer was estimated using the same criteria as in the previous
subchapter: Pearson’s correlation, linear regression slope and mean error. The
PC results from the model observer were used to estimate the diameter threshold,
which was compared with the threshold estimated from the human PC results.
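The three agreement criteria can be computed from paired PC scores as sketched below. Two conventions are assumed here because they are not spelled out in the text: the regression is of model scores on human scores (model on the vertical axis, as in the figures), and the mean error is the mean of model minus human scores. The function name and the example scores are hypothetical.

```python
import math

def agreement_criteria(human_pc, model_pc):
    """Return (r, a, ME): Pearson's correlation, slope of the least-squares
    line of model on human scores, and mean of (model - human)."""
    n = len(human_pc)
    mean_h = sum(human_pc) / n
    mean_m = sum(model_pc) / n
    sxx = sum((h - mean_h) ** 2 for h in human_pc)
    syy = sum((m - mean_m) ** 2 for m in model_pc)
    sxy = sum((h - mean_h) * (m - mean_m) for h, m in zip(human_pc, model_pc))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    mean_error = sum(m - h for h, m in zip(human_pc, model_pc)) / n
    return r, slope, mean_error

# Made-up PC scores for the five cluster sizes.
human = [30.0, 50.0, 70.0, 90.0, 100.0]
model = [h + 2.0 for h in human]
r, a, me = agreement_criteria(human, model)
```

A constant offset between the two observers, as in this example, leaves r and a at their ideal values of 1 and shows up only in ME, which is why the three criteria are used together.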

3. Results
3.1. CHO tuning

[Figure 6: five panels (28, 29, 30, 31 and 32 kVp), each plotting CHO PC (%) against
human PC (%), both axes ranging from 0 to 100.]

Figure 6. Model observer PC results plotted against the human observer PC results
for the 5 tube voltage levels.
Figure 6 shows the goodness of the fit between the tuned CHO and the human
observers. For three out of five calcification sizes (0.150 mm, 0.190 mm and
0.237 mm), both human and model observers achieve close to a perfect
percentage correct score. The 2-layer CHO channel parameters used to produce
these graphs are given in Table 5. Table 6 shows the goodness of fit criteria
values, with Pearson’s correlation of at least 0.95, linear regression slope of at
most 1.21 and mean error of at most 11 PC. The results indicate that the CHO
decisions were in close agreement with the human observers.
Table 5. CHO channels tuning parameters.
Localization CHO    Classification CHO
𝝈𝒖                  𝜃𝑡 [degree]    𝑡𝑓    𝑡𝑤
2                   110            51    4.9

Table 6. Model observer against human observer results criteria.
Tube voltage level    Pearson’s correlation (r)    Linear regression slope (a)    Mean error (ME) [PC]
28 kVp                0.97                         0.97                           2.6
29 kVp                0.95                         0.95                           7.7
30 kVp                0.99                         1.03                           3.6
31 kVp                0.97                         1.21                           11
32 kVp                1.00                         1.08                           1.9

3.2. Impact of tube voltage on microcalcification detectability

[Figure 7: a) two panels (model observer, human observer) plotting PC (%) against
calcification diameter (mm) for the 28, 29, 30, 31 and 32 kVp tube voltage levels;
b) one panel plotting the diameter threshold (mm) against tube voltage (kVp) for
the model and human observers.]

Figure 7. a). Graphs of the model and human observers psychometric curves. b).
Plot of the diameter threshold for both types of observers.
The CHO and Human observer PC results were plotted against the average
calcification particle size and a logistic regression was fitted to the resulting
dependency (Figure 7a)). This was later used to calculate the diameter threshold
values for all tube voltage conditions (Figure 7b) and Table 7).
Table 7. The diameter threshold values shown in figure 7.b) in millimeters.
Observer          28 kVp       29 kVp       30 kVp       31 kVp       32 kVp
Human observer    0.11±0.02    0.12±0.01    0.12±0.01    0.12±0.02    0.12±0.01
Model observer    0.11±0.01    0.13±0.01    0.12±0.01    0.13±0.01    0.12±0.01

4. Discussion and conclusions


The results show satisfactory agreement between the human and model
observer. The worst agreement is observed at the 31 kVp voltage level with
ME = 11 PC and a = 1.21, which in turn leads to the highest separation for the
diameter threshold values shown in Figure 7b). The Gabor channel tuning factors
in this study differ in value from the ones used for DBT, which confirms that
due to the different background properties a retuning was required for the FFDM
images. In fact the original channels used for the DBT study are also applicable,
but do not give the best achievable agreement between the two observers.
The overlapping error bars from the diameter threshold study show that there
is no significant difference between the two observers in the threshold estimates
for the 28 kVp, 30 kVp and 32 kVp tube voltage levels (p>0.06). For 29 kVp and
31 kVp there is a significant difference between the two observers with p<0.039.
Nevertheless, given the small number of reading images (10 acquisitions per
condition), the statistical power of such significance tests is limited. The results
show that only the CHO at 31 kVp finds a difference in image quality within the
5 tube voltage levels, with a diameter threshold poorer than the rest. The human
observers do not find a significant difference at these acquisition settings
(p>0.05).
There are certain limitations to the application of the CHO in this study. The
observer template was formed with an ‘ideal’ Gaussian blob as the expected signal
and a covariance matrix trained on 150 signal absent images. While the training
images are sufficient to estimate a non-singular covariance matrix, it is not
trivial to estimate how well the observer template is suited to the
classification task. Due to time constraints and impracticality, the phantom could
be scanned only 10 times per tube voltage condition (50 acquisitions altogether).
This made dataset bias unavoidable, because training and reading of the
CHO algorithm were performed on the same image dataset.
The study shows, however, a successful application of a 2-layer channelized
Hotelling observer for image quality estimation in 2D full-field digital
mammography. While more images are needed for higher statistical significance
in the results, the study has demonstrated a working concept for a CHO with good
agreement with the human reader scores.

REFERENCES
Abdurahman, S., Dennerlein, F., Jerebko, A., Fieselmann, A., & Mertelmeier, T.
(2014). Optimizing high resolution reconstruction in digital breast
tomosynthesis using filtered back projection. Lecture Notes in Computer
Science (Including Subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), 8539 LNCS, 520–527.
Bakic, P. R., Pokrajac, D. D., De Caro, R., & Maidment, A. D. A. (2014). Realistic
simulation of breast tissue microstructure in software anthropomorphic
phantoms. Lecture Notes in Computer Science (Including Subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8539
LNCS(3), 348–355.
Barrett, H H, Yao, J., Rolland, J. P., & Myers, K. J. (1993). Model observers for
assessment of image quality. Proceedings of the National Academy of
Sciences of the United States of America, 90(21), 9758–9765.
Barrett, Harrison H, Myers, K. J., Hoeschen, C., Kupinski, M. a, & Little, M. P.
(2015). Task-based measures of image quality and their relation to
radiation dose and patient risk. Physics in Medicine and Biology, 60(2), R1–
R75.
Berns, E., Becker, J., & Barke, L. (2016). Digital Mammography Quality Control
Manual. ACR American College of Radiology.
Bochud, F. O., Valley, J. F., Verdun, F. R., Hessler, C., & Schnyder, P. (1999).
Estimation of the noisy component of anatomical backgrounds. Medical
Physics, 26(7), 1365–1370.
Brankov, J. G. (2013). Evaluation of the channelized Hotelling observer with an
internal-noise model in a train-test paradigm for cardiac SPECT defect
detection. Physics in Medicine and Biology, 58, 7159–7182.
Cockmartin, L, Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Ongeval, C.
Van, & Fredenberg, E. (2017). Design and application of a structured
phantom for detection performance comparison between breast
tomosynthesis and digital mammography. Physics in Medicine and Biology,
Volume 62, Number 3, 15.
Cockmartin, Lesley, Marshall, N. W., & Bosmans, H. (2014). Comparison of SNDR,
NPWE Model and Human Observer Results for Spherical Densities and
Microcalcifications in Real Patient Backgrounds for 2D Digital
Mammography and Breast Tomosynthesis (pp. 134–141). Springer, Cham.
Das, M., & Gifford, H. C. (2011). Comparison of model-observer and human-
observer performance for breast tomosynthesis: effect of reconstruction and
acquisition parameters (N. J. Pelc, E. Samei, & R. M. Nishikawa (eds.); Vol.
7961, p. 796118). International Society for Optics and Photonics.
EC. (2012). European Commission, Radiation Protection n° 162, Criteria for
Acceptability of Medical Radiological equipment used in diagnostic
radiology, nuclear medicine and radiotherapy. In Radiology: Vol.
RADIATION.
FDA. (2002). MQSA. Mammography Quality Standards Act Regulations.
Mammography, 21(CFR PART 900).
Gallas, B. D., & Barrett, H. H. (2003). Validating the use of channels to estimate
the ideal linear observer. Journal of the Optical Society of America. A, Optics,
Image Science, and Vision, 20(9), 1725–1738.
Gifford, H. C., Karbaschi, Z., Banerjee, K., & Das, M. (2017). Visual-search models
for location-known detection tasks. 1013612(March 2017), 1013612.
Gifford, Howard C., Liang, Z., & Das, M. (2016). Visual-search observers for
assessing tomographic x-ray image quality. Medical Physics, 43(3).
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics
(Wiley & So).
Hadjipanteli, A., Elangovan, P., Looney, P., Mackenzie, A., Wells, K., Dance, D. R., &
Young, K. C. (2016). Detection of Microcalcification Clusters in 2D-
Mammography and Digital Breast Tomosynthesis and the Relation to the
Standard Method of Measuring Image Quality (pp. 217–221). Springer,
Cham.
Hu, Y.-H., & Zhao, W. (2011). The effect of angular dose distribution on the
detection of microcalcifications in digital breast tomosynthesis. Medical
Physics, 38(5), 2455–2466.
Karbaschi, Z., & Gifford, H. C. (2018). Assessing CT acquisition parameters with
visual-search model observers. Journal of Medical Imaging, 5(02), 1.
Karssemeijer, N., & Thijssen, M. A. O. (1996). Determination of contrast-detail
curves of mammography systems by automated image analysis. Elsevier,
115–160.
Kopans, D., Gavenonis, S., Halpern, E., & Moore, R. (2011). Calcifications in the
Breast and Digital Breast Tomosynthesis. The Breast Journal, 17(6), 638–
644.
Kundel, H. L., Nodine, C. F., Conant, E. F., & Weinstein, S. P. (2007). Holistic
Component of Image Perception in Mammogram Interpretation: Gaze-
tracking Study. Radiology, 242(2), 396–402.
Li, Z., Desolneux, A., Muller, S., Milioni de Carvalho, P., & Carton, A.-K. (2018).
Comparison of microcalcification detectability in FFDM and DBT using a
virtual clinical trial. In R. M. Nishikawa & F. W. Samuelson (Eds.), Medical
Imaging 2018: Image Perception, Observer Performance, and Technology
Assessment (Vol. 10577, p. 12). SPIE.
Michielsen, K., Nuyts, J., Cockmartin, L., Marshall, N. W., & Bosmans, H. (2016).
Design of a model observer to evaluate calcification detectability in breast
tomosynthesis and application to smoothing prior optimization. Medical
Physics, 43(12), 6577–6587.
Perry, N., Broeders, M., de Wolf, C., Törnberg, S., Holland, R., & von Karsa, L.
(2006). European guidelines for quality assurance in breast cancer
screening and diagnosis. In Annals of oncology : official journal of the
European Society for Medical Oncology / ESMO (Vol. 19, Issue 4).
Petrov, D., Cockmartin, L., Marshall, N., Vancoillie, L., Young, K., & Bosmans, H.
(2017). Real space channelization for generic DBT system image quality
evaluation with channelized Hotelling observer. Progress in Biomedical
Optics and Imaging - Proceedings of SPIE, 10136.
Petrov, Dimitar, Cockmartin, L., Marshall, N., Vancoillie, L., Young, K., & Bosmans,
H. (2017). Real space channelization for generic DBT system image quality
evaluation with channelized Hotelling observer (M. A. Kupinski & R. M.
Nishikawa (eds.); Vol. 10136, p. 101360N). International Society for Optics
and Photonics.
Petrov, Dimitar, Marshall, N. W., Young, K. C., & Bosmans, H. (2019). Systematic
approach to a channelized Hotelling model observer implementation for a
physical phantom containing mass-like lesions: Application to digital
breast tomosynthesis. Physica Medica, 58, 8–20.
Petrov, Dimitar, Marshall, N., Young, K., & Bosmans, H. (2018). Model and human
observer reproducibility for detecting microcalcifications in digital breast
tomosynthesis images. In R. M. Nishikawa & F. W. Samuelson (Eds.),
Medical Imaging 2018: Image Perception, Observer Performance, and
Technology Assessment (Vol. 10577, p. 10). SPIE.
Sechopoulos, I. (2013). A review of breast tomosynthesis. Part II. Image
reconstruction, processing and analysis, and advanced applications.
Medical Physics, 40(1), 1–17.
Skaane, P., Bandos, A. I., Gullien, R., Eben, E. B., Ekseth, U., Haakenaasen, U., Izadi,
M., Jebsen, I. N., Jahr, G., Krager, M., & Hofvind, S. (2013). Prospective trial
comparing full-field digital mammography (FFDM) versus combined FFDM
and tomosynthesis in a population-based screening programme using
independent double reading with arbitration. European Radiology, 23(8),
2061–2071.
Spangler, M. L., Zuley, M. L., Sumkin, J. H., Abrams, G., Ganott, M. A., Hakim, C.,
Perrin, R., Chough, D. M., Shah, R., & Gur, D. (2011).
Detection and Classification of Calcifications on Digital Breast
Tomosynthesis and 2D Digital Mammography: A Comparison. AJR, 196, 320–324.
Svahn, T. M., Chakraborty, D. P., Ikeda, D., Zackrisson, S., Do, Y., Mattsson, S., &
Andersson, I. (2012). Breast tomosynthesis and digital mammography: A
comparison of diagnostic accuracy. British Journal of Radiology, 85(1019).
Thomas, J. A., Chakrabarti, K., Kaczmarek, R., & Romanyukha, A. (2005). Contrast-
detail phantom scoring methodology. Medical Physics, 32(3), 807–814.
U.S. Food and Drug Administration. (2018). VICTRE: Virtual Imaging Clinical
Trials for Regulatory Evaluation.
van Engen, R. E., Bosmans, H., Bouwman, R. W., Dance, D. R., Heid, P., Lazzari, B.,
Marshall, N. W., Schopphoven, S., Strudley, C., Thijssen, M., & Young, K. C.
(2014). A European Protocol for Technical Quality Control of Breast
Tomosynthesis Systems (pp. 452–459). Springer, Cham.
Warren, L. M., Mackenzie, A., Cooke, J., Given-Wilson, R. M., Wallis, M. G.,
Chakraborty, D. P., Dance, D. R., Bosmans, H., & Young, K. C. (2012). Effect
of image quality on calcification detection in digital mammography. Med
Phys, 39(6), 3202–3213.
Watson, A. B. (1983). Detection and recognition of simple spatial forms. Berlin
SpringerVerlag, 100–114.
Wigati, K. T., Vancoillie, L., Salomon, E., Zhang, G., Cockmartin, L., Marshall, N.,
Bosmans, H., Soejoko, D. S., Bliznakova, K., & Petrov, D. (2019). Channelized
Hotelling observer assessing microcalcification detectability on 2D
mammography: a first application to study the impact of tube voltage.
Journal of Physics: Conference Series, 1248.
Young, K. C., Alsager, A., Oduko, J. M., Bosmans, H., Verbrugge, B., Geertse, T., &
van Engen, R. (2008). Evaluation of software for reading images of the
CDMAM test object to assess digital mammography systems. 6913, 69131C.
Young, K. C., Cook, J. J. H., Oduko, J. M., & Bosmans, H. (2006). Comparison of
software and human observers in reading images of the CDMAM test object
to assess digital mammography systems. SPIE Medical Imaging: Physics of
Medical Imaging, 6142, 614206.
Zhang, G., Cockmartin, L., & Bosmans, H. (2016). A four-alternative forced choice
(4AFC) software for observer performance evaluation in radiology (C. K.
Abbey & M. A. Kupinski (eds.); p. 97871E). International Society for Optics
and Photonics.

Chapter 4: Channelized Hotelling observer for
multi-vendor breast tomosynthesis image quality
estimation: detection of calcification clusters in an
anthropomorphic phantom
The text is based on an upcoming proceeding D. Petrov, N. Marshall, H. Bosmans,
‘Channelized Hotelling observer for multi-vendor breast tomosynthesis image
quality estimation: detection of calcification clusters in an anthropomorphic
phantom’, International workshop on breast imaging (IWBI) May 2020.

INTRODUCTION
Digital breast tomosynthesis (DBT) is a three-dimensional imaging technique,
which promises better low contrast detectability compared to digital
mammography, especially for mass lesions whose detectability is largely
determined by breast structure noise. The ability of DBT systems to detect
microcalcifications, however, remains a topic of interest. In previous work
(Petrov et al., 2019) we described a two-layer channelized Hotelling model
observer (CHO) algorithm, which successfully predicted human results from a
4-AFC observer study performed with DBT images of microcalcification clusters
in a physical phantom (“L1”). The aim of that study was to compare the
reproducibility of the human and model observers, thus only images from a
single system (Siemens Inspiration DBT unit) were considered.
In this work, the same CHO algorithm will be applied to images of the L1 phantom
acquired on five other DBT scanners, at three dose levels on each system. The
feasibility of using the same channel tuning parameters from our previous work
will be tested and the CHO reading results will be compared to human observer
scores reading the same images.

MATERIALS AND METHODS


1. Image and phantom properties
The study was performed using the L1 anthropomorphic phantom made of an
acrylic semi-circular, compressed breast-shaped container filled with equal
volumes of acrylic spheres and water (Cockmartin et al., 2017). Five calcification
clusters are also present in the phantom each with differing average particle size:
95 μm, 119 μm, 150 μm, 190 μm and 237 μm. The phantom was scanned on five
different DBT systems: Fujifilm Amulet Innovality, GE Senographe Pristina, IMS
Giotto Class, Hologic 3Dimensions and Siemens Mammomat Revelation. For each
DBT system 12 acquisitions were taken at three dose levels: at the tube voltage
and tube current-time product (mAs) selected by the AEC, and then at factors of
0.5 and 2 times the AEC mAs level. These datasets are referred to throughout the
text as AEC, Low and High dose levels. From the reconstructed images, volumes
of interest (VOIs) were extracted from specific regions in the phantom with the
calcification clusters centered, giving the signal present images, and from areas
without any lesion present, giving the signal absent images. The size of the
extracted volumes was 2×2×3 cm³, which corresponds to between
200×200×30 and 308×308×30 voxels, depending on the DBT system.
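As a rough check of the voxel counts quoted above, the in-plane size of a 2 cm VOI can be derived from the reconstructed pixel size. The sketch below is illustrative only: the pixel sizes of 100 μm and 65 μm and the 1 mm slice spacing are assumptions inferred from the quoted voxel counts, not values taken from the system specifications, and `voi_shape` is a hypothetical helper name.

```python
def voi_shape(pixel_size_um, side_mm=20.0, depth_mm=30.0, slice_spacing_mm=1.0):
    """Return (rows, cols, slices) for a VOI of fixed physical size.

    pixel_size_um    : in-plane pixel size in micrometres (assumed value)
    side_mm          : in-plane side length of the VOI (2 cm in the text)
    depth_mm         : VOI depth (3 cm in the text)
    slice_spacing_mm : reconstructed slice spacing (assumed to be 1 mm)
    """
    n_inplane = round(side_mm * 1000.0 / pixel_size_um)
    n_slices = round(depth_mm / slice_spacing_mm)
    return (n_inplane, n_inplane, n_slices)


# Coarsest and finest in-plane sampling assumed for this illustration.
print(voi_shape(100.0))
print(voi_shape(65.0))
```

The same physical VOI therefore contains a different number of voxels on each system, which is why the channels are later defined in real space rather than in pixels.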
Human reading was performed using a four-alternative forced choice (4AFC)
paradigm. The cropped images were visualized via a software tool that also
facilitated the observer scoring (Zhang et al., 2016). The human reader results
were analyzed and the percentage of correctly detected calcification clusters
(PC) was used as the image quality figure of merit (FOM).
2. Model observer
The channelized Hotelling model observer (CHO) developed and described in
chapter 3 of this thesis was applied in this study. This CHO uses a two-layer
approach, where the lesion localization and classification are split into two
consecutive algorithms. In the localization step a CHO with two Laguerre-Gauss
(LG) channels is used to scan five areas in each VOI corresponding to the
expected locations of the calcification particles forming the cluster. The observer
responses are calculated and the most probable target location for each of the
calcification particles is stored. In the second step the classification is performed.
To do so, a CHO with 8 Gabor channels is applied to each of the five locations
estimated from the previous step. The calcification cluster CHO response is
calculated as the summation of the three best visible calcification particles, i.e.
the three highest CHO response values from the five responses for each
calcification particle. The MO was developed and tuned using images of the L1
phantom on the Siemens Inspiration DBT system. Due to the different number of
voxels per image between the different DBT systems, using the same tuning
factors would lead to different spatial frequencies extracted from the images and
a different spatial region of sensitivity for the CHO. To cope with this, the DBT
system pixel size was included as a variable in the channel generation algorithm, in a
similar way as in chapter 2. By keeping the same tuning parameters
estimated in our previous study, the generated channel set is then identical in
real space, so that similar features are extracted from all images regardless of
the image size. Other than that, the MO was applied ‘as is’ to the five new DBT
datasets, without any additional tuning steps.
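Two elements of this description lend themselves to a short sketch: generating a channel on a grid defined in millimetres, so that it covers the same physical region for any pixel size, and summing the three highest particle responses to obtain the cluster response. The Gabor expression below is the commonly used Gaussian-envelope form; the function names and all parameter values are hypothetical and are not the tuned values of this thesis.

```python
import numpy as np


def gabor_channel(size_mm, pixel_mm, freq_cyc_per_mm, width_mm, theta=0.0, phase=0.0):
    """Gabor channel sampled on a grid defined in millimetres, not pixels.

    size_mm         : physical side length of the channel support
    pixel_mm        : pixel size of the DBT system (the only system-dependent input)
    freq_cyc_per_mm : centre frequency of the channel
    width_mm        : full width at half maximum of the Gaussian envelope
    """
    n = int(round(size_mm / pixel_mm))
    coords = (np.arange(n) - (n - 1) / 2.0) * pixel_mm  # pixel centres in mm
    x, y = np.meshgrid(coords, coords)
    envelope = np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / width_mm**2)
    carrier = np.cos(
        2.0 * np.pi * freq_cyc_per_mm * (x * np.cos(theta) + y * np.sin(theta)) + phase
    )
    return envelope * carrier


def cluster_response(particle_responses, n_best=3):
    """Sum of the n_best highest CHO responses among the particles of a cluster."""
    r = np.sort(np.asarray(particle_responses, dtype=float))
    return float(r[-n_best:].sum())


# Same tuning parameters, two hypothetical pixel sizes: the arrays differ in
# size but describe the same channel in real space.
coarse = gabor_channel(20.0, 0.100, freq_cyc_per_mm=1.0, width_mm=2.0)
fine = gabor_channel(20.0, 0.065, freq_cyc_per_mm=1.0, width_mm=2.0)
print(coarse.shape, fine.shape)

# Five hypothetical particle responses; the three highest are summed.
print(cluster_response([0.2, 1.5, 0.7, 1.1, 0.4]))
```

Keeping the frequency and envelope width in physical units means that only `pixel_mm` changes between systems, which mirrors the idea of reusing the tuning parameters unchanged.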
In order to compare with human observer results, the CHO was also applied in
4AFC observer study paradigm. The PC was calculated via comparison of three
signal absent CHO responses to one signal present CHO response.
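A minimal sketch of this 4AFC scoring is given below. It assumes that, for each signal present response, three signal absent responses are drawn at random without replacement and that the trial is scored as correct when the signal present response is the highest; how the trials were actually assembled is not specified here, so the sampling scheme and the name `pc_4afc` are assumptions.

```python
import numpy as np


def pc_4afc(present, absent, n_alternatives=4, seed=0):
    """Percentage correct of a model observer in an M-AFC paradigm.

    present : CHO responses to signal present images
    absent  : CHO responses to signal absent images
    Each trial compares one signal present response against
    (n_alternatives - 1) randomly drawn signal absent responses.
    """
    rng = np.random.default_rng(seed)
    present = np.asarray(present, dtype=float)
    absent = np.asarray(absent, dtype=float)
    correct = 0
    for lam in present:
        distractors = rng.choice(absent, size=n_alternatives - 1, replace=False)
        correct += int(lam > distractors.max())
    return 100.0 * correct / len(present)


# Synthetic responses, for illustration only.
rng = np.random.default_rng(1)
demo_present = rng.normal(1.5, 1.0, size=50)
demo_absent = rng.normal(0.0, 1.0, size=150)
print(pc_4afc(demo_present, demo_absent))
```

With four alternatives, pure guessing gives 25 PC on average, so the usable range of the figure of merit lies between 25 and 100.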
The goodness of fit between the human and model observers was assessed
using the following criteria (Petrov et al., 2019): Pearson’s correlation (𝑟); linear
regression slope (𝑎); and the mean error (ME). The target values, i.e. those indicating
good agreement, were 𝑟 = 1, 𝑎 = 1 and 𝑀𝐸 = 0.
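The three criteria can be computed from paired PC scores as sketched below. Two points are assumptions of this sketch rather than statements of the source: the regression is taken as model PC against human PC with a free intercept, and ME is taken as the mean signed difference (model minus human). The PC values in the example are invented.

```python
import numpy as np


def agreement_metrics(human_pc, model_pc):
    """Return (r, a, ME) for paired human and model observer PC scores.

    r  : Pearson's correlation coefficient
    a  : slope of the linear regression of model PC on human PC
    ME : mean signed error, model minus human (assumed definition)
    """
    human = np.asarray(human_pc, dtype=float)
    model = np.asarray(model_pc, dtype=float)
    r = np.corrcoef(human, model)[0, 1]
    a, _intercept = np.polyfit(human, model, 1)
    me = np.mean(model - human)
    return float(r), float(a), float(me)


# Hypothetical scores for the five calcification cluster sizes.
human = [30.0, 45.0, 70.0, 90.0, 98.0]
model = [28.0, 50.0, 74.0, 93.0, 100.0]
print(agreement_metrics(human, model))
```

Perfect agreement gives r = 1, a = 1 and ME = 0, which are the target values quoted in the text.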

RESULTS
[Figure 1: five panels (Fujifilm Innovality, GE Pristina, Giotto Class, Hologic 3Dimensions, Siemens Revelation), each plotting CHO PC (%) against human PC (%) for the High, AEC and Low dose levels; both axes run from 0 to 100.]
Figure 1. Plots for each DBT system comparing the mean CHO PC scores (on Y axis)
and the mean human observer PC scores (on X axis) for calcification clusters in the
anthropomorphic phantom. For each graph, the plots for the three dose levels are
visualized in different colors.
Figure 1 gives a graphical representation of the model observer PC data versus
the human observer scores, when reading the same image dataset. Table 1 shows
the goodness of fit criteria scores. The results show good ability to predict
the human observer detectability scores, with Pearson’s correlation coefficient
varying from 0.94 to 1.00. The linear regression slope ranged from 0.96 to 1.37,
while the mean error varied between -2.2 PC and 5.2 PC. Good correlation is
therefore found using the same tuning parameters for the localization and
classification stages. The poorest agreement, found for the Siemens High dose
level, was caused by overestimation of the PC scores for the 150 μm calcification
cluster, although all other calcification sizes were closely estimated, with PC
scores similar to those of the human observers.

Table 1. Regression line parameters, with the 95% confidence intervals where
possible.
System       Dose level  r (95% CI)          a (95% CI)           ME [PC]
Fujifilm     AEC         0.98 (0.68, 1.00)   0.96 (-0.57, 1.35)   -2.2
             Low         0.99 (0.90, 1.00)   1.01 (0.79, 1.22)    -0.9
             High        0.97 (0.64, 1.00)   1.21 (0.68, 1.74)    -1.8
GE Pristina  AEC         0.98 (0.68, 1.00)   1.23 (1.01, 1.45)    2.9
             Low         0.99 (0.81, 1.00)   1.03 (0.71, 1.34)    3.6
             High        1.00 (0.93, 1.00)   1.04 (0.84, 1.24)    3.5
Giotto       AEC         0.97 (0.65, 1.00)   1.19 (0.96, 1.29)    0.7
             Low         1.00 (0.95, 1.00)   1.13 (0.96, 1.29)    3.1
             High        0.99 (0.90, 1.00)   1.09 (0.85, 1.32)    4.9
Hologic      AEC         0.99 (0.90, 1.00)   1.06 (0.84, 1.28)    2.7
             Low         0.97 (0.63, 1.00)   1.17 (0.65, 1.69)    1.6
             High        1.00 (0.98, 1.00)   1.10 (1.06, 1.14)    1.8
Siemens      AEC         1.00 (0.98, 1.00)   1.19 (1.10, 1.28)    5.2
             Low         0.95 (0.45, 1.00)   1.11 (0.46, 1.75)    1.7
             High        0.94 (0.31, 1.00)   1.37 (0.42, 2.32)    0.9

DISCUSSION AND CONCLUSIONS


In the present study the two-stage MO for detection of calcification clusters in
a structured phantom from our previous work was applied to image datasets
acquired on five new DBT systems. The MO algorithm performance was compared
against human observers reading the same image datasets and the results show
good correlation between the two observers. This suggests that the developed
CHO can be used in medical physics practice to quantify microcalcification
detection performance and to track changes in this score over time.

REFERENCES
Cockmartin, L., Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Ongeval, C.
Van, & Fredenberg, E. (2017). Design and application of a structured
phantom for detection performance comparison between breast
tomosynthesis and digital mammography. Physics in Medicine and Biology,
Volume 62, Number 3, 15.
Petrov, D., Marshall, N., Young, K., Zhang, G., & Bosmans, H. (2019). Model and
human observer reproducibility for detection of microcalcification clusters
in digital breast tomosynthesis images of three-dimensionally structured
test object. Journal of Medical Imaging, 6(01), 1.
Zhang, G., Cockmartin, L., & Bosmans, H. (2016). A four-alternative forced choice
(4AFC) software for observer performance evaluation in radiology (C. K.
Abbey & M. A. Kupinski (eds.); p. 97871E). International Society for Optics
and Photonics.

Chapter 5: Channelized Hotelling observer for
breast virtual clinical trials: application to DBT and
FFDM
INTRODUCTION
Breast cancer is the most common type of cancer diagnosed in women. Among
the many initiatives to reduce the associated burden, high quality breast cancer
screening has been shown to lead to a reduction in the mortality rate (Ferlay et
al., 2007; Tabar et al., 2003). Most breast cancer screening programs use 2D
full-field digital mammography (FFDM) (Bick & Diekmann, 2007; Bluekens et al.,
2010), in which typically two views of each breast are captured by the digital
detector and scored by the radiologist. However, a limitation associated with
FFDM is the projection of overlying fibroglandular tissue in the x-ray image,
which can potentially obscure lesions and limit the sensitivity of the
modality (Rafferty, 2007). Digital breast tomosynthesis (DBT) has the potential
to improve the sensitivity compared to FFDM imaging (Svahn et al., 2012). DBT
utilizes multiple projections taken over a limited angle around the breast, which
are then reconstructed as a quasi-three-dimensional (3D) stack, providing
volumetric information on the breast structure and, to some extent, suppressing
the overlapping structures (Sechopoulos, 2013a, 2013b). Recent studies have
confirmed that, compared to standard FFDM, DBT can decrease recall rate and
increase the detectability of low contrast lesions (Krammer et al., 2017; Skaane
et al., 2013; Svahn et al., 2012).
A large number of scanning parameters are available (Sechopoulos 2013a;
Sechopoulos 2013b) and vendors have adopted different parameter sets for DBT
solutions; the ideal combination of parameters remains under investigation
(Zeng et al., 2017). The gold standard means of evaluating the resultant detection
performance from a given combination of scan parameters is a clinical trial.
However, the use of clinical studies to systematically explore such a large
parameter space is simply not practical. Recently, virtual clinical trials (VCTs)
have generated a lot of interest as an efficient means of exploring system
performance (Badano et al., 2018; Scarparo et al., 2019; Sharma et al., 2019). VCTs
use computer models of lesions and/or human anatomy as input to an image
simulation pipeline; the images are then evaluated after the relevant image
processing and reconstruction stages have been applied (Bakic et al., 2014;
Elangovan et al., 2014; Maidment, 2014). All the components of this virtual
imaging chain have to be validated for realism so that the results are expected to
correlate with those from a real clinical trial, within some allowed margin.
However, these studies often involve a human reading step, where the images
are scored for the abnormality. This human reading represents a major
bottleneck in the VCT workflow, as it is often expensive and time consuming.
In order to take full advantage of the VCT approach, model observers should be
developed and validated.
Recent work by Elangovan et al. (Elangovan et al., 2018) compared
detectability of simulated mass-lesions and spherical targets in 2D
mammography and DBT using human observers with different levels of
experience. Image sets for a 4-alternative forced choice (4-AFC) human observer
study were simulated using a VCT framework developed by the same group
(Elangovan et al., 2014, 2017). In the frame of the Optimam project cooperation,
we were offered these datasets and readings for the development of a model
observer that could replace human readout for future applications on target
detectability in 2D mammography and DBT.
Many studies have shown that the channelized Hotelling model observer (CHO)
is a good candidate for this task; with the correct tuning steps, CHOs are able to
approximate human observer scores in FFDM and DBT imaging. As an example,
Castella et al. (Castella et al., 2009) and Bouwman et al. (Bouwman et al., 2017)
found good agreement between different CHO models and human observer
results for simple detection tasks in simulated FFDM images. Ikejimba et al.
(Ikejimba et al., 2016) have compared CHO and human observer results for
FFDM, DBT and synthetic mammography images of physical phantoms. In
previous work we developed a scanning CHO that correlated closely with human
observer results for different DBT scanning conditions using realistic targets in
a breast simulating test object (Dimitar Petrov et al., 2019). Preliminary tests
with this CHO applied to the Optimam VCT dataset from Elangovan et al.
(Elangovan et al., 2018) showed that the human scores could not be predicted
within the same error intervals. The CHO itself was a single slice algorithm that
tended to underestimate the detectability results and thus an algorithm with
improved detection performance was needed, while maintaining the
anthropomorphic modelling properties of the CHO.
Platiša et al. (Platiša et al., 2011) proposed three different types of CHO for
application in DBT images: single slice CHO (ssCHO), multi-slice CHO (msCHO)
and the volumetric CHO (vCHO). While the latter two methods cannot be applied
to FFDM images, Platiša et al. found that the vCHO had the highest performance
among the three for volumetric images. The vCHO has been used in other studies.
As an example, Zeng et al. (Zeng et al., 2015) compared the ssCHO against the
vCHO for DBT VCT images and confirmed a slight advantage for the vCHO with
better detectability and robustness. Wen et al. (Wen et al., 2018) used a vCHO to
compare different DBT geometries in a multi-lesion detection task. The
anthropomorphic properties of this method, i.e. the ability to predict human
scores, are, however, still not well described in the literature.
The aims of this study were therefore twofold. The first was to study and, if
needed, tune our previous DBT algorithm, which was a scanning ssCHO from
chapter 1, for 2D FFDM images. The second was to use this ssCHO as the starting
point for the design and implementation of an anthropomorphic vCHO able to
predict the human observer DBT reading results for lesions simulated in the
Optimam background.

MATERIALS AND METHODS


This section briefly recaps the image dataset generation and the human reading
study methodology used by Elangovan et al. (Elangovan et al., 2018).
1. Image dataset generation
The images used in this study were obtained using the Optimam VCT framework,
which is described in detail elsewhere (Elangovan et al., 2014, 2017). Four virtual
breast models were created that included adipose tissue, fibro-glandular tissue,
Cooper’s ligaments, blood vessels and skin layers. Image segments of the
produced models had been validated for realism via ROC-based analysis
(Elangovan et al., 2017). Six irregular solid masses with physiological
characteristics were used as targets. Generated using a Diffusion Limited
Aggregation (DLA) method (A Rashidnasab et al., 2013), each of the six masses
had a unique surface structure and shape. After insertion in a simulated breast
background, the masses were also scored for realism by experienced radiologists
who had given an average feedback rating of “definitely realistic” for both DBT
and 2D FFDM (A Rashidnasab et al., 2013; Alaleh Rashidnasab et al., 2013). The
mass models were resized to generate 4mm and 6mm diameter targets and were
inserted into the breast models by voxel replacement. The projection data were
produced by simulating the system geometry and acquisition settings of a
Hologic Selenia Dimensions 3D system. The projection data were acquired using
ray tracing and were corrected for the presence or absence of an anti-scatter grid
(2D FFDM and DBT, respectively), noise and blur (Elangovan et al., 2014). The
exposure factors were set equal to those picked by the automatic exposure control
of a real system for real breasts of equivalent size and glandularity (Table 1),
and an MGD of 2.5 mGy was simulated for both modalities. The 15 DBT projections
were reconstructed using the manufacturer’s FBP image reconstruction application.
Table 1. X-ray exposure parameters for FFDM and DBT.
       Tube voltage [kVp]  Anode/Filter combination  Mean glandular dose [mGy]  Pixel size [μm]
FFDM   31                  W/Rh                      2.5                        70
DBT    33                  W/Al                      2.5                        100

Figure 1. Image segments without (top row) and with (bottom row) a signal for
both FFDM and DBT.
To generate the image segments for the human observer study, 3×3 cm² regions
of interest (ROI) were extracted from the simulated breast model images with
and without a centrally located target (Elangovan et al., 2018) (figure 1). The
available DBT stacks or volumes of interest (VOIs) had 18 slices for the 6 mm
lesion and 12 slices for the 4 mm lesion; interplane spacing was always 1mm. For
each modality and target size, the signal present image segments were simulated
with three average contrast levels, achieved by inserting the target into breast
regions that were more heterogeneous (~glandular) or homogeneous (~fatty).
These locations (and resulting contrasts) were chosen such that a percentage of
correctly detected targets (PC) equal to 90.7% occurred within the range of three
PC results for the corresponding three contrast levels. The contrast was
calculated as the relative difference between the mean background signal and
the mean target signal calculated using the raw image data of the FFDM
acquisitions. The reading image dataset consisted of 50 signal present ROIs for
each condition (600 ROIs in total) and 900 signal absent ROIs for each modality.
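The contrast definition given above can be written as a short function. This is a sketch under assumptions: the relative difference is normalised by the mean background signal, the background is taken as all pixels outside a known target mask, and the sign is chosen so that a target with lower raw signal than its surroundings has positive contrast. None of these choices is confirmed by the text, and the pixel values are invented.

```python
import numpy as np


def relative_contrast(image, target_mask):
    """Relative difference between mean background and mean target signal.

    image       : 2D array of raw pixel values
    target_mask : boolean array of the same shape, True inside the target
    Normalisation by the background mean is an assumption of this sketch.
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(target_mask, dtype=bool)
    mean_target = image[mask].mean()
    mean_background = image[~mask].mean()
    return (mean_background - mean_target) / mean_background


# Synthetic ROI: uniform background of 100 with a 95-valued square target.
roi = np.full((64, 64), 100.0)
mask = np.zeros((64, 64), dtype=bool)
mask[24:40, 24:40] = True
roi[mask] = 95.0
print(relative_contrast(roi, mask))
```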
Table 2. Experimental conditions and size of the datasets.
Modality  Mass lesion size  Average contrast levels  Training images (present/absent)  Reading images (present/absent)
2D        4 mm              9%, 7%, 5%               180/1200                          50/300
2D        6 mm              5%, 3%, 1%               180/1200                          50/300
DBT       4 mm              5%, 3.5%, 2%             180/570                           50/150
DBT       6 mm              4%, 2.5%, 1%             180/570                           50/150

2. Human observer study
The human reading study (Elangovan et al., 2018) was performed using a 4-AFC,
whereby three signal absent images and one signal present image were
simultaneously presented to the observer whose task was then to select the
image which most likely contained the target. Signal cues were also provided to
guide the observers on the signal size and shape, along with circles on each
of the four images, indicating the potential location of the target. The
experiments were conducted in low ambient light (<6 lux) on a diagnostic LCD
monitor (Barco, B-8500, 5MP, Belgium). Five medical physicists, six clinical
readers and five naïve non-specialist observers, 16 observers in total,
participated in the study (Elangovan et al., 2018). Human observer performance
was evaluated as the percentage correctly detected targets (PC) from the 4-AFC
study. These human observer results were used as an input for tuning of the CHO.
In addition to the PC results, the contrast threshold ($C_{tr}$) was estimated. In
accordance with (Elangovan et al., 2018), the three PC scores for each modality
and lesion size were plotted against the three contrast levels of the targets and a
linear regression was fitted to the data. From this regression, $C_{tr}$ was
calculated as the lesion contrast that gave an observer score of 90.7 PC.
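The threshold estimate amounts to fitting a straight line through three (contrast, PC) points and inverting it at 90.7 PC. A minimal sketch follows; the function name and the example scores are hypothetical and do not reproduce the study data.

```python
import numpy as np


def contrast_threshold(contrasts, pc_scores, pc_target=90.7):
    """Contrast at which a linear fit of PC against contrast reaches pc_target.

    contrasts : lesion contrast levels (for example in percent)
    pc_scores : percentage correct measured at each contrast level
    """
    slope, intercept = np.polyfit(
        np.asarray(contrasts, dtype=float), np.asarray(pc_scores, dtype=float), 1
    )
    return float((pc_target - intercept) / slope)


# Invented example: PC rises from 80 to 100 as contrast goes from 5% to 9%.
print(contrast_threshold([5.0, 7.0, 9.0], [80.0, 90.0, 100.0]))
```

Because the contrast levels were chosen so that 90.7 PC falls within the measured range, the inversion is an interpolation rather than an extrapolation.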
3. Channelized Hotelling model observer
In signal detection theory, a model observer is an algorithm that produces a
decision variable given some input. An internalized observer response is
associated with the presence or absence of a stimulus and this is used in the
decision making process. In this study a channelized Hotelling observer (CHO)
was implemented, with a decision variable calculated using the following
formula:
$$\lambda(v) = w_{CHO}^{t}\, v, \qquad v = U^{t} g; \qquad w_{CHO} = s^{t} K_v^{-1} \qquad (1)$$
in which $\{\}^{t}$ is the vector transpose operator. The steps are applied as follows:
the images (𝑔) are first filtered by the channel set 𝑈 to produce a channel output
vectors 𝑣 . Then the channelized expected signal 𝑠 and the inverse of the
covariance matrix 𝐾𝑣 are estimated in order to calculate the CHO template 𝑤𝐶𝐻𝑂 .
The decision variable 𝜆(𝑣) is a scalar given by the product of the observer
template and the channel output vector.
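The sequence of steps in equation (1) can be illustrated with synthetic data. This is a sketch under stated assumptions rather than the implementation used in this work: the expected signal is taken as the difference of the class means of the channel outputs, the covariance is the average of the two class covariances, and the channel matrix `U` and all images are random placeholders.

```python
import numpy as np

def cho_template(v_present, v_absent):
    """Estimate the CHO template from channel outputs (rows = images)."""
    # Assumption: expected signal = difference of mean channel outputs
    s = v_present.mean(axis=0) - v_absent.mean(axis=0)
    # Assumption: K_v = average of the two class covariance matrices
    k_v = 0.5 * (np.cov(v_present, rowvar=False) + np.cov(v_absent, rowvar=False))
    return np.linalg.solve(k_v, s)            # K_v^{-1} s, without an explicit inverse

def decision_variable(w_cho, channels, images):
    """lambda = w^t v with v = U^t g; each image is flattened to one row."""
    v = images.reshape(len(images), -1) @ channels
    return v @ w_cho

rng = np.random.default_rng(0)
n_pix, n_ch = 64, 5
U = rng.standard_normal((n_pix, n_ch))        # placeholder channels (columns)
signal = 0.5 * U[:, 0]                        # placeholder signal
g_absent = rng.standard_normal((200, n_pix))
g_present = rng.standard_normal((200, n_pix)) + signal

w = cho_template(g_present @ U, g_absent @ U)
lam_present = decision_variable(w, U, g_present)
lam_absent = decision_variable(w, U, g_absent)
```

Solving the linear system instead of inverting the covariance matrix is a common numerical choice; both give the same template when the matrix is well conditioned.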
In previous work (Dimitar Petrov et al., 2019), we described in detail a CHO that
successfully predicted the presence of mass simulating lesions in a physical
phantom using an ssCHO detection approach. In a scanning step, the template
was systematically applied to the different planes to account for in-plane
misalignments and to find the plane (i.e. in the z-direction) in which the lesion
generated the highest signal. The resultant CHO (Dimitar Petrov et al., 2019) that
agreed most closely with the human scores had: (a) tuned Gabor channels, (b) an
expected signal estimated from the acquired image dataset and (c) separate
datasets for training and reading. These findings were used as a starting point in
this study, with the channel set 𝑈 consisting of a number of tuned Gabor channels
and the expected signal 𝑠 estimated from an image dataset dedicated for training.

Given the differences between the phantom images used previously (Dimitar
Petrov et al., 2019) and the Optimam images used in this work, the number of
channels and associated tuning parameters had to be adjusted. However, the
scanning step used previously was not required as the location of all lesions
within the images was known exactly.
In order to train the new CHO, an additional 180 signal present ROIs were
required and therefore simulated with the Optimam tools for each condition and
modality and a further 600 signal absent ROIs were created for each modality.
This formed a dataset of 220 signal present and 1500 signal absent ROIs for the
2D FFDM study. Due to computational memory constraints 220 signal present
and 720 signal absent VOIs with only 9 central planes were used for the DBT
study (table 2). For each modality and each contrast and lesion size category, the
signal present and signal absent images were split into 50 signal present and 300
signal absent images solely used for reading and the rest for training. This
method, called the ‘hold-out’ (Barrett & Myers, 2004), ensures that the dataset
bias is 0%.
In order to compare the CHO performance to the human observer performance,
the estimated decision variables from signal present and signal absent images
were implemented as a 4-AFC test, in which the decision variable from a signal
present image was compared to 3 decision variables from signal absent images.
If the decision variable of the signal present image was the highest, a hit was
concluded. In this manner, the PC performance of the model observer and the
human observers were compared. In addition, the threshold contrast was
estimated in the same way as for the human observers explained previously.
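A 4-AFC test built from model observer decision variables can be sketched as follows. How the signal absent decision variables were grouped with each signal present one is not specified here, so this sketch assumes random draws with replacement; the function name `pc_4afc` and the example values are hypothetical.

```python
import numpy as np

def pc_4afc(lam_present, lam_absent, rng):
    """Percentage correct in a 4-AFC test built from decision variables."""
    lam_present = np.asarray(lam_present, dtype=float)
    # Assumption: three signal absent values are drawn at random for each trial
    rivals = rng.choice(lam_absent, size=(lam_present.size, 3), replace=True)
    # A hit is scored when the signal present value is the highest of the four
    hits = lam_present > rivals.max(axis=1)
    return 100.0 * hits.mean()

rng = np.random.default_rng(1)
pc_easy = pc_4afc([10.0, 11.0, 12.0], [0.0, 1.0, 2.0, 3.0], rng)
pc_hard = pc_4afc([-1.0, -2.0], [0.0, 1.0, 2.0, 3.0], rng)
```

For an observer with no information the expected score is 25% PC, the chance level of a four-alternative task.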
3.1. 2D full-field digital mammography ssCHO
For the 2D FFDM study, the image dataset was split into training and reading
subsets. The training dataset was used for observer template estimation, where
the covariance matrix and the expected signal were estimated using both signal
present and signal absent training images. When applying our previously
developed algorithm to 2D images, scanning was not needed (as the locations
were known exactly) and the channels were applied directly to the centre of the
image dataset (as all signals are perfectly centred).
Fifteen Gabor channels were generated then tested using the following formula
(Dimitar Petrov et al., 2019):
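$$C_{i,j,k}(x, y) = e^{-4\ln(2)\,\frac{x^{2}+y^{2}}{w^{2}(i)}} \cos\left[2\pi f\,(x\cos\theta_j + y\sin\theta_j) + \phi_k\right], \qquad (2)$$

where $w(i) = \frac{t_w}{(e^{i}+2)\,pxs}$, $f = \frac{t_f}{w(i)}$, $\theta_j = j\,t_t$ and $\phi_k = 45\,k$.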

Here, the Gaussian function determines the width (w(i)) of the channels with
respect to the pixel size (pxs), and the sinusoidal function determines their
frequency (𝑓), orientation (𝜃) and phase (𝜙). Three parameters are required to
generate a channel set: the number of frequencies (I), orientations (J) and
phases (K). Thus, to generate 15 channels, one phase, three orientations and
five frequencies were used. The number of channels was selected based on
preliminary tests (data available, not shown). To set the properties of each
channel within the channel set, three tuning factors were implemented: 𝑡𝑤, 𝑡𝑓
and 𝑡𝑡. These set the Gaussian width, the sine wave frequency and the
orientation, respectively. To study the effect of the tuning factors on CHO
performance, different parameter combinations were evaluated. Parameter tuning
was performed by varying one parameter while holding the other two fixed.
Values for 𝑡𝑤 and 𝑡𝑓 were varied from 5 to 95 in steps of 5, and 𝑡𝑡 ranged
between 10 and 170 in steps of 10. Finer steps were possible, but not justified
given the reproducibility of the human reading results.
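The structure of such a channel set (five width and frequency pairs, three orientations, one phase) can be sketched as below. This is an illustration only: the widths and frequencies are supplied directly in pixel units as hypothetical values instead of being derived from the tuning factors 𝑡𝑤 and 𝑡𝑓, the indices j and k are assumed to start at 0, and the function names are not taken from the original implementation.

```python
import numpy as np

def gabor_2d(size, w, f, theta_deg, phi_deg):
    """Gaussian envelope of full width at half maximum w times an oriented cosine."""
    coords = np.arange(size) - (size - 1) / 2.0        # pixel coordinates, centred
    x, y = np.meshgrid(coords, coords)
    theta, phi = np.deg2rad(theta_deg), np.deg2rad(phi_deg)
    envelope = np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / w**2)
    carrier = np.cos(2.0 * np.pi * f * (x * np.cos(theta) + y * np.sin(theta)) + phi)
    return envelope * carrier

def gabor_channel_set(size, widths, freqs, t_t, n_orient=3, n_phase=1):
    """Stack channels for all (width, frequency), orientation and phase combinations."""
    channels = [gabor_2d(size, w, f, j * t_t, 45.0 * k)
                for w, f in zip(widths, freqs)
                for j in range(n_orient)               # theta_j = j * t_t (degrees)
                for k in range(n_phase)]               # phi_k = 45 * k (degrees)
    return np.stack(channels)

# Hypothetical widths (pixels) and frequencies (cycles per pixel)
widths = [40.0, 30.0, 20.0, 12.0, 8.0]
freqs = [0.02, 0.03, 0.05, 0.08, 0.12]
U = gabor_channel_set(65, widths, freqs, t_t=130.0)    # shape (15, 65, 65)
```

Flattening each channel to a column gives the channel matrix that is applied to the image in equation (1).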
3.2. Digital breast tomosynthesis vCHO
For the DBT study, a vCHO was implemented using volumetric channels. With
the addition of the third dimension, the number of degrees of freedom available
when generating the Gabor channels is greatly increased. Therefore some
constraints were applied to the tuning factors used with the 3D Gabor function
to make the channel estimation manageable. The rotation of the channels was
limited only to yaw rotation and the roll and pitch rotations were fixed to 0°, such
that the Gabor channels were always parallel to the image slices of the VOI. The
extent of the Gabor channel in z-direction was set equal to the real space size in
the x-y direction. In this way the number of tuning factors was the same as for
the FFDM CHO and the same formula could be used (Dimitar Petrov et al., 2019),
the only difference being the addition of the z-direction.
Thirty 3D Gabor channels were generated using the following formula:
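$$C_{i,j,k}(x, y, z) = e^{-4\ln(2)\left(\frac{x^{2}+y^{2}}{w_{xy}^{2}(i)} + \frac{z^{2}}{w_{z}^{2}(i)}\right)} \cos\left[2\pi f\,(x\cos\theta_j + y\sin\theta_j) + \phi_k\right], \qquad (3)$$

where $w_{xy}(i) = \frac{t_w}{(e^{i}+2)\,pxs}$, $w_z(i) = w_{xy}(i)\,pxs$, $f = \frac{t_f}{w_{xy}(i)}$, $\theta_j = j\,t_t$ and $\phi_k = 45\,k$.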
Here, 𝑤𝑧 is the width of the channel in the z-direction. To generate the
channel sets, 2 phases (k), 3 orientations (j) and 5 frequencies (i) were selected.
As for the ssCHO used with the FFDM data, the three tuning factors 𝑡𝑤, 𝑡𝑓 and 𝑡𝑡
were varied and the vCHO performance was compared to the human results; the
range of values studied for 𝑡𝑤, 𝑡𝑓 and 𝑡𝑡 was the same as for the ssCHO. The
goodness of fit of both the ssCHO and the vCHO against the human observer results
was estimated via three criteria: the Pearson correlation coefficient (r), the linear
regression slope (a) and the mean error (ME) (Dimitar Petrov et al., 2019).
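The three goodness of fit criteria can be computed as in the sketch below. The sign convention of the mean error (model minus human) and the example PC values are assumptions made for illustration.

```python
import numpy as np

def goodness_of_fit(pc_human, pc_model):
    """Return Pearson r, linear regression slope a and mean error ME."""
    pc_human = np.asarray(pc_human, dtype=float)
    pc_model = np.asarray(pc_model, dtype=float)
    r = np.corrcoef(pc_human, pc_model)[0, 1]
    a, _ = np.polyfit(pc_human, pc_model, deg=1)   # slope of model PC against human PC
    me = np.mean(pc_model - pc_human)              # assumed convention: model minus human
    return r, a, me

# Hypothetical PC scores (in %): the model reads 2 PC higher than the humans
human = [70.0, 80.0, 90.0, 95.0]
model = [72.0, 82.0, 92.0, 97.0]
r, a, me = goodness_of_fit(human, model)
```

The example shows why the three criteria are used together: a constant offset leaves r and a at 1 and is only revealed by the mean error.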

RESULTS AND DISCUSSION
1. Tuning the ssCHO for FFDM
The tuning phase resulted in 6137 tuning parameter sets with their
corresponding CHO performance. After careful comparison to the human results
using the three criteria (r, a and ME), values of 130, 60, 30 were selected for
𝑡𝑡 , 𝑡𝑤 𝑎𝑛𝑑 𝑡𝑓 , respectively. Figure 2.a. shows the plot of CHO PC for the three
contrast levels and two lesion sizes against the corresponding human PC. The
three criteria describing the goodness of the fit can be found in Table 3. Both
Pearson’s correlation and linear slope are close to 1.0, while the mean error is
less than 2 PC for both lesion sizes. We associate this with successful tuning of
the CHO.
All three tuning parameters were found to have a significant impact on the CHO
performance (figure 2.b.). For the orientation of the Gabor channels (𝑡𝑡), the
CHO performs similarly well in the ranges 40 to 80 and 100 to 140, whereas
performance is notably lower elsewhere. This indicates that the CHO is
sensitive to the asymmetry of the non-spiculated mass lesions. For both the
width and frequency tuning parameters (𝑡𝑤 and 𝑡𝑓), a periodic pattern in the
PC can be seen as these parameters are varied. These experiments underline the
necessity of CHO tuning.

[Figure 2 panels: a) MO(HO), CHO PC (%) against human PC (%) for 4 mm and 6 mm
lesions; b) PC (%) against tt (deg), tf (value) and tw (value) for 1%, 3% and
5% contrast.]

Figure 2. Tuning results for the ssCHO on FFDM. a). The CHO reading result against
human PC scores for both lesion sizes; b). The impact of each of the three tuning
parameters to the CHO performance (with the 2 other parameters constant). The
dashed vertical line for the three graphs shows the chosen parameter value used to
compare with human performance in a).
Table 3. Goodness of fit results for the model observer against human observer.
Parameters for 𝑡𝑡, 𝑡𝑤 and 𝑡𝑓 were 130, 60, 30 for 2D FFDM, respectively.
Parameters for 𝑡𝑡, 𝑡𝑤 and 𝑡𝑓 were 60, 35, 20 for DBT, respectively.

Modality   Lesion size   Pearson correlation (r)   Linear slope (a)   Mean error (ME)
2D FFDM    4 mm          0.99                      1.02               -1.9
2D FFDM    6 mm          0.99                      1.02                1.8
DBT        4 mm          0.97                      0.95               -0.7
DBT        6 mm          0.98                      0.94               -0.1

2. Tuning the vCHO for DBT


Figure 3 presents the results for the vCHO. The best performing channel set for
the DBT dataset had 𝑡𝑡, 𝑡𝑤 and 𝑡𝑓 equal to 60, 35 and 20, respectively. The
goodness of fit results are shown in Table 3, with a Pearson's correlation of
at least 0.97, a linear regression slope of at least 0.94 and an absolute mean
error smaller than 1 PC.

[Figure 3 panels: a) MO(HO), CHO PC (%) against human PC (%) for 4 mm and 6 mm
lesions; b) PC (%) against tt (deg), tf (value) and tw (value) for 1%, 2.5% and
4% contrast.]

Figure 3. Tuning results for the vCHO on DBT. a). The CHO reading results against
human PC scores for both lesion sizes; b). The impact of each of the three tuning
parameters on the CHO performance, while the other parameters were kept
constant. The dashed vertical line in the three graphs shows the chosen parameter
value.
The tuning results show a significant drop in performance for all contrast levels
at 𝑡𝑡 = 90°; this is also observed in the tuning results for the ssCHO. The 𝑡𝑡
channel parameter controls the separation in degrees between successive
orientation subsets. With 𝑡𝑡 = 90° and three orientations used for both the
vCHO and the ssCHO in the present study, the channel generation algorithm
creates channels at 0°, 90° and 180°, effectively duplicating the 1st and the 3rd
channel orientation subsets. The results suggest that partly replicated channel
subsets do not perform well and should be avoided. The periodic pattern
observed for the ssCHO for 𝑡𝑓 and 𝑡𝑤 (Figure 2b) is also seen for the vCHO
(Figure 3b), although it is less pronounced. Furthermore, the fall in
performance at extreme 𝑡𝑡 values for the vCHO is not as large as that seen for
the ssCHO. This observation, along with the absence of a large drop in
performance at lower 𝑡𝑤 values for the vCHO, suggests that the vCHO is not as
tuneable as the ssCHO. A future study could explore the potential of vCHO
tuning by varying the number of channels. Nevertheless, with increasing channel
width (higher 𝑡𝑤 values), the performance falls, as expected.
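The duplication at 𝑡𝑡 = 90° can be checked numerically. The sketch below assumes a zero phase, for which the cosine carrier is even and the 0° and 180° channels coincide; with a non-zero phase the two channels are only partly replicated. The width and frequency values are hypothetical.

```python
import numpy as np

def gabor(size, w, f, theta_deg, phi_deg=0.0):
    """2D Gabor channel: Gaussian envelope times an oriented cosine."""
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    theta, phi = np.deg2rad(theta_deg), np.deg2rad(phi_deg)
    envelope = np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / w**2)
    return envelope * np.cos(2.0 * np.pi * f * (x * np.cos(theta) + y * np.sin(theta)) + phi)

t_t = 90.0
# Three orientations at 0, 90 and 180 degrees, one flattened channel per row
subset = np.stack([gabor(33, w=12.0, f=0.1, theta_deg=j * t_t).ravel() for j in range(3)])

same = np.allclose(subset[0], subset[2])    # 0 and 180 degree channels coincide
rank = np.linalg.matrix_rank(subset)        # only two independent channels remain
```

A rank-deficient channel subset adds no new information to the channel output vector, so the three-orientation set behaves like a two-orientation set at this value of 𝑡𝑡.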

The vCHO was chosen and developed because of its higher performance compared to
the ssCHO approach. When the ssCHO described in chapter 1 was initially
applied to the VCT tomosynthesis images of the present study, the mean absolute
error was larger than 10% PC. The new vCHO performs better, with a mean
absolute error of less than 2% PC.
For both the ssCHO and the vCHO, threshold contrasts were calculated for the
relevant lesion sizes using the linear fit of the PC results as a function of
lesion contrast, as described earlier. This was needed in order to compare the
ssCHO and vCHO performance against the human contrast thresholds published by
Elangovan et al. (2018). The uncertainty on the results was estimated by
bootstrapping the PC results 1000 times. The results are given in Figure 4 and
Table 4 and show that the threshold contrast values for the CHO closely follow
those of the human observer for all conditions. The differences between the MO
and human reader results for a given modality and lesion diameter were not
significant (p>0.01).
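One way to bootstrap the threshold contrast is sketched below. The resampling unit is not stated in the text, so this sketch assumes that the individual 4-AFC trial outcomes (hit or miss) are resampled with replacement at each contrast level; the function name and the example hit counts are hypothetical.

```python
import numpy as np

def bootstrap_threshold(contrasts, hits_per_level, target_pc, n_boot=1000, seed=0):
    """Mean and standard deviation of the threshold contrast over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample the trial outcomes at every contrast level and recompute PC
        pcs = [100.0 * rng.choice(h, size=len(h), replace=True).mean()
               for h in hits_per_level]
        slope, intercept = np.polyfit(contrasts, pcs, deg=1)
        if slope > 0:                       # skip degenerate fits without a threshold
            estimates.append((target_pc - intercept) / slope)
    return np.mean(estimates), np.std(estimates, ddof=1)

# Hypothetical outcomes of 50 trials at each of three contrast levels (1 = hit)
hits = [np.r_[np.ones(30), np.zeros(20)],
        np.r_[np.ones(40), np.zeros(10)],
        np.r_[np.ones(48), np.zeros(2)]]
mean_ctr, sd_ctr = bootstrap_threshold([1.0, 3.0, 5.0], hits, target_pc=90.7)
```

The standard deviation of the bootstrap estimates plays the role of the uncertainty quoted alongside each threshold.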

Figure 4. Comparison of threshold contrast between the CHO and the human
observer scores.
Table 4. Percentage contrast threshold to achieve 90.3% PC scores.
                 FFDM                          DBT
                 4 mm          6 mm            4 mm          6 mm
Human observer   7.0% ± 0.9%   3.8% ± 1.6%     2.1% ± 0.8%   0.8% ± 0.3%
Model observer   6.7% ± 0.9%   4.0% ± 1.2%     2.3% ± 0.8%   0.6% ± 0.3%

CONCLUSIONS
This work has extended a previously developed model observer, designed for the
scoring of images of a structured test object, from a scanning ssCHO type to a
fully volumetric vCHO that operates on an image stack of simulated breast images
that have a more complex and realistic appearance. A systematic approach to
channel tuning is an important aspect of CHO implementation if close agreement
with human readings is a prerequisite. These results demonstrate that the CHO
algorithms and associated tuning parameters described in this work have the
potential to replace human observers for both 2D FFDM and DBT images of
non-spiculated masses generated using a virtual clinical trial framework.

REFERENCES
Badano, A., Graff, C. G., Badal, A., Sharma, D., Zeng, R., Samuelson, F. W., Glick, S. J.,
& Myers, K. J. (2018). Evaluation of Digital Breast Tomosynthesis as
Replacement of Full-Field Digital Mammography Using an In Silico Imaging
Trial. JAMA Network Open, 1(7), e185474.
Bakic, P. R., Pokrajac, D. D., De Caro, R., & Maidment, A. D. A. (2014). Realistic
simulation of breast tissue microstructure in software anthropomorphic
phantoms. Lecture Notes in Computer Science (Including Subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8539
LNCS(3), 348–355.
Barrett, H. H., & Myers, K. J. (2004). Foundations of image science. ISBN: 978-0-
471-15300-9.
Bick, U., & Diekmann, F. (2007). Digital mammography: what do we and what
don’t we know? European Radiology, 17(8), 1931–1942.
Bluekens, A. M. J., Karssemeijer, N., Beijerinck, D., Deurenberg, J. J. M., van Engen,
R. E., Broeders, M. J. M., & den Heeten, G. J. (2010). Consequences of digital
mammography in population-based breast cancer screening: initial
changes and long-term impact on referral rates. European Radiology, 20(9),
2067–2073.
Bouwman, R. W., Goffi, M., van Engen, R. E., Broeders, M. J. M., Dance, D. R., Young,
K. C., & Veldkamp, W. J. H. (2017). Can the channelized Hotelling observer
including aspects of the human visual system predict human observer
performance in mammography? Physica Medica, 33, 95–105.
Castella, C., Eckstein, M. P., Abbey, C. K., Kinkel, K., Verdun, F. R., Saunders, R. S.,
Samei, E., & Bochud, F. O. (2009). Mass detection on mammograms:
influence of signal shape uncertainty on human and model observers.
Journal of the Optical Society of America A, 26(2), 425–436.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., Cooke, V., Wilkinson, L.,
Given-Wilson, R. M., Wallis, M. G., & Wells, K. (2017). Design and validation
of realistic breast models for use in multiple alternative forced choice
virtual clinical trials. Physics in Medicine and Biology, 62(7), 2778–2794.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., & Wells, K. (2018). Lesion
detectability in 2D-mammography and digital breast tomosynthesis using
different targets and observers. Physics in Medicine and Biology, 63(9), 1–15.
Elangovan, P., Warren, L. M., Mackenzie, A., Rashidnasab, A., Diaz, O., Dance, D. R.,
Young, K. C., Bosmans, H., Strudley, C. J., & Wells, K. (2014). Development
and validation of a modelling framework for simulating 2D-mammography
and breast tomosynthesis images. Physics in Medicine and Biology, 59(15),
4275–4293.
Ferlay, J., Autier, P., Boniol, M., Heanue, M., Colombet, M., & Boyle, P. (2007).
Estimates of the cancer incidence and mortality in Europe in 2006. Annals
of Oncology, 18(3), 581–592.
Ikejimba, L., Glick, S. J., Samei, E., & Yo, Y. J. (2016). Comparison of model and
human observer performance in FFDM, DBT, and synthetic
mammography. Proceedings of SPIE Medical Imaging, 9783(978325), 1–10.
Krammer, J., Stepniewski, K., Kaiser, C. G., Brade, J., Riffel, P., Schoenberg, S. O., &
Wasser, K. (2017). Value of Additional Digital Breast Tomosynthesis for
Preoperative Staging of Breast Cancer in Dense Breasts. Anticancer
Research, 37(9), 5255–5261.
Maidment, A. D. A. (2014). Virtual clinical trials for the assessment of novel
breast screening modalities. Lecture Notes in Computer Science (Including
Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 8539 LNCS, 1–8.
Petrov, D., Cockmartin, L., Marshall, N., Vancoillie, L., Young, K., & Bosmans, H.
(2017). Real space channelization for generic DBT system image quality
evaluation with channelized Hotelling observer. Progress in Biomedical
Optics and Imaging - Proceedings of SPIE, 10136.
Petrov, D., Marshall, N. W., Young, K. C., & Bosmans, H. (2019). Systematic
approach to a channelized Hotelling model observer implementation for a
physical phantom containing mass-like lesions: Application to digital
breast tomosynthesis. Physica Medica, 58, 8–20.
Platiša, L., Goossens, B., Vansteenkiste, E., Park, S., Gallas, B. D., Badano, A., &
Philips, W. (2011). Channelized Hotelling observers for the assessment of
volumetric imaging data sets. Journal of the Optical Society of America. A,
Optics, Image Science, and Vision, 28(6), 1145–1163.
Rafferty, E. A. (2007). Digital Mammography: Novel Applications. Radiologic
Clinics of North America, 45(5), 831–843.
Rashidnasab, A., Elangovan, P., Diaz, O., Mackenzie, A., Young, K., Dance, D., &
Wells, K. (2013). Simulation of 3D DLA masses in digital breast
tomosynthesis (R. M. Nishikawa & B. R. Whiting (eds.); Vol. 8668, p.
86680Y). International Society for Optics and Photonics.
Rashidnasab, A., Elangovan, P., Yip, M., Diaz, O., Dance, D. R., Young, K. C., & Wells,
K. (2013). Simulation and assessment of realistic breast lesions using
fractal growth models. Physics in Medicine and Biology, 58(16), 5613–5627.

Scarparo, D. C., Salvadeo, D. H. P., Pedronette, D. C. G., Barufaldi, B., & Maidment,
A. D. A. (2019). Evaluation of denoising digital breast tomosynthesis data
in both projection and image domains and a study of noise model on digital
breast tomosynthesis image domain. Journal of Medical Imaging, 6(03), 1.
Sechopoulos, I. (2013a). A review of breast tomosynthesis. Part I. The image
acquisition process. Medical Physics, 40(1), 1–12.
Sechopoulos, I. (2013b). A review of breast tomosynthesis. Part II. Image
reconstruction, processing and analysis, and advanced applications.
Medical Physics, 40(1), 1–17.
Sharma, D., Graff, C. G., Badal, A., Zeng, R., Sawant, P., Sengupta, A., Dahal, E., &
Badano, A. (2019). Technical Note - In silico imaging tools from the VICTRE
clinical trial. Medical Physics.
Skaane, P., Bandos, A. I., Gullien, R., Eben, E. B., Ekseth, U., Haakenaasen, U., Izadi,
M., Jebsen, I. N., Jahr, G., Krager, M., & Hofvind, S. (2013). Prospective trial
comparing full-field digital mammography (FFDM) versus combined FFDM
and tomosynthesis in a population-based screening programme using
independent double reading with arbitration. European Radiology, 23(8),
2061–2071.
Svahn, T. M., Chakraborty, D. P., Ikeda, D., Zackrisson, S., Do, Y., Mattsson, S., &
Andersson, I. (2012). Breast tomosynthesis and digital mammography: A
comparison of diagnostic accuracy. British Journal of Radiology, 85(1019).
Tabar, L., Yen, M.-F., Vitak, B., Chen, H.-H. T., Smith, R. A., & Duffy, S. W. (2003).
Mammography service screening and mortality in breast cancer patients:
20-year follow-up before and after introduction of screening. The Lancet,
361(9367), 1405–1410.
Wen, G., Markey, M. K., Haygood, T. M., & Park, S. (2018). Model observer for
assessing digital breast tomosynthesis for multi-lesion detection in the
presence of anatomical noise. Physics in Medicine and Biology, 63(4).
Zeng, R., Badano, A., & Myers, K. J. (2015). Evaluating the sensitivity of the
optimization of acquisition geometry to the choice of reconstruction
algorithm in digital breast tomosynthesis through a simulation study. Phys.
Med. Biol, 60, 1259.
Zeng, R., Badano, A., & Myers, K. J. (2017). Optimization of digital breast
tomosynthesis (DBT) acquisition parameters for human observers: effect
of reconstruction algorithms. Physics in Medicine & Biology.

Chapter 6: Deep learning applications
DEEP LEARNING CHANNELIZED HOTELLING OBSERVER
FOR MULTI-VENDOR DBT SYSTEM IMAGE QUALITY
EVALUATION USING A STRUCTURED PHANTOM
This chapter is based on the text that will be published in the proceedings of the
SPIE Medical imaging 2020 conference (Houston, USA)
1. Introduction
Digital breast tomosynthesis (DBT) is a breast imaging technique in which a
series of projection images are acquired as the x-ray tube moves over a limited
angle around the breast. The projection data are reconstructed to produce a
quasi-volumetric dataset in which the breast is represented as a stack of planes
parallel to the detector. In this way the DBT method recovers some of the lesion
detection sensitivity that is lost due to the projection of overlapping tissues in 2D
mammography. Image quality estimation and system characterization in a
radiological setting, however, can be challenging given the large number of
parameters influencing DBT system performance. While human observer
detectability studies are considered the gold standard for quality assurance
performance testing and optimization, a significant amount of time would be
required for broad ranging optimization and this is generally not available.
Recent studies have shown that the channelized Hotelling observer (CHO) is a
potential substitute for simple human observer detection studies (Petrov et al.,
2019). The CHO is a numerical model observer, which incorporates a channel set
to extract feature vectors that generate a linear classifier for the decision making
task. In practice, CHO workflow includes a tuning step, in which the functions
used to generate the channels and associated parameters are selected. In a
previous study (Petrov et al., 2017) our group evaluated DBT image quality with
a CHO applied to images produced by a structured phantom that contained 3D
printed non-spiculated mass simulating targets (Cockmartin et al., 2017). The
CHO used Gabor channels with the DBT system pixel size as the only system
dependent parameter, allowing the same channel set to be used for multiple
systems and acquisition parameters. However, further use of this algorithm
showed occasionally poor agreement with human observer results for newer
applications on different DBT systems with the same structured phantom (data
available, but not shown). This was an indication that the process of selecting the
channel functions that predicted the human reader results was a potential
bottleneck in the model observer work flow. Finding anthropomorphic channels
(Petrov et al., 2019), even for one channel type (e.g. Gabor functions), is also a
time consuming process and there is no guarantee of convergence towards a
sufficiently accurate solution. Appropriate anthropomorphic channel sets could
be realized with multiple functions, or even using channels which are not defined
by a specific combination of functions, e.g. image-like feature extractors. With
regard to the different DBT systems, it may not be possible to find a single
channel function that predicts all the human observer results for the same
specific task. In order to generalize the CHO and improve performance
prediction across the vendor solutions, we propose a new method that uses
deep learning (DL) to compute the channel set from a large number of human
readings on non-spiculated mass lesions from a physical phantom scanned with
different types of DBT systems, with the ultimate aim of constructing a robust,
multivendor CHO for DBT systems.
Recently, a number of studies have investigated the use of DL networks for the
image quality evaluation task. For example, Massanes et al. (Massanes & Brankov,
2017) used a convolutional neural network (CNN) to automatically estimate the
detectability of targets in correlated noise background images and compare it to
human observer scores; Alnowami et al. (Alnowami et al., 2018) used CNN
classifiers to estimate detectability in mammograms of a virtual clinical trial and
demonstrated promising correlation with human reader results. Zhou et al.
(Zhou et al., 2019) proposed a single-layer neural network to estimate the
Hotelling model observer template, along with a complete CNN, to approximate
the ideal observer performance. The DL networks in these papers incorporate
both the feature extraction algorithm and the classifier. This partly solves the
extensive training problem of a CHO, but requires a lot of training examples in
order to achieve good generalization performance, i.e. applicability on multiple
systems and conditions. In addition, these methods are known to require high-
end hardware to support the heavy computational load of these networks.
Other studies split the feature extraction and the classification processes using
machine learning for the first part and regression based model observers for the
latter. Gong et al. (Gong et al., 2019) introduced pre-trained CNNs to build feature
vectors that are then further engineered by a partial least square
regression-based observer to generate test statistics for the decision making
task. The use of internal noise by Gong et al. (2019) to match human
performance might compromise the generalization of the model observer. Witten
et al. (Witten et al., 2010) proposed a CHO channel set estimated from a
training dataset. The
technique used for the channel generation is partial least squares (PLS), where
the channels are estimated from signal present and signal absent training
images. Due to the design of these channels, no anthropomorphic properties are
extracted, and comparison with human observers is not a requirement.
The approach proposed in this work is similar to the PLS channels method. The
well-known CHO model observer is kept for the classification step, but a DL
algorithm is introduced that acts as an anthropomorphic channel generator. We
call this model observer the DL-CHO, which is based on the CHO developed
previously for the 3D printed mass lesions in the structured phantom (Petrov et
al., 2019), but instead of applying Gabor channels, a DL algorithm is trained on
the image datasets that were allocated for training. All available DBT phantom
images, previously read by human observers in a 4 alternative forced choice
(4-AFC) paradigm, were assigned labels according to the human choice and
used to train a custom convolutional neural network with a single convolutional
layer. The CNN output resembles a channel output vector that can be used as an
input to a CHO. In this way, anthropomorphic feature extraction is carried out
by the convolutional layer and the classification step is performed by a standard
linear model observer. The decision variable outputs from this DL-CHO are used
to calculate a loss function, which is backpropagated to update the
convolutional layer parameters such that the loss is minimized. The DL-CHO
was trained on human readings from 6 DBT systems with images acquired at
three dose levels. The resulting DL channels were tested against human
detectability and reproducibility on a separate image dataset acquired on 7 DBT
systems.
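The forward pass of such a scheme can be sketched with NumPy. Several elements are assumptions made for illustration and are not taken from this work: each kernel has the same size as the ROI, so that one kernel yields one channel output; the CHO template is passed in as a fixed vector instead of being re-estimated from the channel outputs; and the loss is a softmax cross-entropy between the four decision variables and the human label vector. The gradient step itself would normally be delegated to an automatic differentiation framework and is not shown.

```python
import numpy as np

def channel_outputs(kernels, rois):
    """Inner product of every ROI with every kernel; output shape (..., n_channels)."""
    return np.einsum('...hw,chw->...c', rois, kernels)

def dl_cho_loss(kernels, examples, labels, w_cho):
    """Assumed loss: softmax cross-entropy over the four decision variables."""
    v = channel_outputs(kernels, examples)          # (n_examples, 4, n_channels)
    lam = v @ w_cho                                 # (n_examples, 4) decision variables
    lam = lam - lam.max(axis=1, keepdims=True)      # shift for numerical stability
    log_p = lam - np.log(np.exp(lam).sum(axis=1, keepdims=True))
    return -(labels * log_p).sum(axis=1).mean()

rng = np.random.default_rng(0)
kernels = 0.01 * rng.standard_normal((3, 8, 8))     # 3 placeholder learnable channels
examples = rng.standard_normal((10, 4, 8, 8))       # 10 examples of four ROIs each
choices = rng.integers(0, 4, size=10)               # placeholder human choices
labels = np.eye(4)[choices]                         # one-hot label vectors
w_cho = np.ones(3)                                  # placeholder CHO template

v = channel_outputs(kernels, examples)
loss = dl_cho_loss(kernels, examples, labels, w_cho)
```

Keeping the classifier linear means that everything the network learns is contained in the kernels, which can then be inspected as images in the same way as analytic channels.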
2. Materials and methods
2.1. Image acquisition
The image datasets for this study were acquired using a 3D physical phantom
developed by Cockmartin et al. (Cockmartin et al., 2017) (figure 1). The phantom
is made of a PMMA semi-cylindrical container filled with PMMA spheres of
different sizes and water. This creates images with power spectra similar to
those found in patients (Cockmartin et al., 2013). The spheres are not fixed in
place and therefore shaking the phantom produces another background
realization, but with similar power spectra characteristics. Five 3D printed non-
spiculated mass simulating targets with different sizes (with average diameters
from 1.5mm to 5.7mm) are also present within the phantom structure and serve
as the signal in the current study.
The phantom was scanned 324 times on Fuji Innovality, GE Pristina, Giotto
Class, Hologic 3Dimensions, Siemens Inspiration and Siemens Revelation DBT
systems and used in a variety of human observer experiments at three dose
levels (Cockmartin et al., 2017; Petrov et al., 2019, 2017). All results were used for
the DL-CHO training (table 1). The phantom was scanned a further 270 times on
Fujifilm Amulet Innovality, GE Senographe Pristina, GE Senographe SenoClaire,
IMS Giotto Class, Hologic Selenia Dimensions, Siemens Mammomat Inspiration
and Siemens Mammomat Revelation DBT systems for testing of the trained DL-
CHO algorithm (table 1).

Figure 1. The spheres phantom, from left to right: reconstructed tomosynthesis
plane, mammographic image without the background spheres and a photograph.
Table 1. The number of DBT acquisitions per system and dose level for the human
reading & DL-CHO training dataset and the DL-CHO testing dataset. The semicolon
separated values in both columns show the images taken in Low, AEC and High dose
level order. Zero indicates that a particular dataset was not used for training or for
testing.
SYSTEM                          HUMAN READING (DL-CHO TRAINING)   DL-CHO TESTING
                                LOW; AEC; HIGH                    LOW; AEC; HIGH
FUJIFILM AMULET INNOVALITY      24; 24; 24                        12; 12; 12
GE SENOGRAPHE PRISTINA          12; 12; 12                        12; 12; 12
GE SENOGRAPHE SENOCLAIRE        0; 0; 0                           12; 12; 12
IMS GIOTTO CLASS                20; 20; 20                        10; 10; 10
HOLOGIC 3DIMENSIONS             12; 12; 12                        0; 0; 0
HOLOGIC SELENIA DIMENSIONS      0; 0; 0                           0; 60; 0
SIEMENS MAMMOMAT INSPIRATION    12; 60; 12                        12; 12; 12
SIEMENS MAMMOMAT REVELATION     12; 12; 12                        12; 12; 12
TOTAL                           324                               270
2.2. Human observer study

Figure 2. a). extraction positions for signal present and signal absent VOIs; b).
Screenshot of the ‘Foursquares’ 4-AFC tool.
From the DBT acquisitions, volumes of interest (VOI) of 20x20x30mm3, with and
without signal, were cropped and used in human observer experiments (figure
2). Three signal absent VOIs and one signal present VOI were shown, with the
task of indicating the signal present image. Six medical physicists performed the
4-AFC image reading studies and the results were collected and grouped for the

137
DBT systems. A software tool developed in-house (“Foursquares” (Zhang et al.,
2016)) was used to conduct the 4-AFC studies. The reading was performed on a
diagnostic monitor (Barco MDNG-6121) in a room with ambient light level <3 lx.
The observers were instructed to view the images at a viewing distance of 40 to
50 cm and no time constraints were imposed. All observers had previously read
many DBT images of the phantom acquired under different conditions, thus no
training procedure before the reading was carried out. The percentage of
correctly detected targets (PC) was calculated from the results and used as the
figure of merit (FOM) for the image quality evaluation. The uncertainty of the PC
results was estimated using the standard error on the mean.
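The figure of merit can be illustrated with a short Python sketch. It computes the PC of each reader from 4-AFC hit/miss records and then the standard error on the mean across readers. The reader outcomes are made-up values and the function names are hypothetical; the original analysis code is not given in the text.

```python
import math
import statistics


def percent_correct(outcomes):
    """PC (%) from a list of 4-AFC outcomes, where 1 = hit and 0 = miss."""
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_and_sem(values):
    """Mean and standard error on the mean (sample std / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Made-up outcomes of six readers for one lesion size (illustrative only).
reader_outcomes = [
    [1, 1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 1],
]
reader_pc = [percent_correct(o) for o in reader_outcomes]
pc_mean, pc_sem = mean_and_sem(reader_pc)
print(f"PC = {pc_mean:.1f} +/- {pc_sem:.1f} %")
```

With four alternatives, pure guessing corresponds to a PC of 25%, which is the lower bound against which the reader scores should be interpreted.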
2.3. DL-CHO images and preprocessing
For the DL-CHO training, the exact configuration of the four VOIs was stored and
annotated with label 1 for the selected image and 0 for the rest. In addition, for
every set of 4 VOIs read by the human readers, the human choice was labelled
and stored as a label vector. This combination of images and label vector will be
called an example. A preliminary study showed that if all available examples were
used for the training, the resulting DL channels became structureless and lost
their feature extraction abilities (Figure 3). After a few epochs the DL-CHO was
not able to achieve the FOM of the human readers. This effect was caused by the
many DL training examples from reading sessions with human PC results of about
25%, i.e. with more 4-AFC ‘misses’ than ‘hits’ (in fact, just pure guessing). When
training with such examples, the kernel update values were dominated by
background structures that triggered the human choice rather than by the (weak)
lesion that was present. Due to the random nature of the background details, such
training datasets produce noisy and uniform DL-CHO channels with little structure.
To cope with this, a PC threshold was estimated experimentally and all available
examples with PC smaller than 50% were excluded from the training dataset.
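The exclusion step can be sketched as follows. This is a minimal illustration that assumes each example is tagged with its reading condition (system, dose level, lesion diameter) and that the human PC is known per condition; the dictionary layout and all names are hypothetical. Only the 50% threshold comes from the text.

```python
PC_THRESHOLD = 50.0  # percent, the threshold reported in the text


def filter_examples(examples, pc_by_condition, threshold=PC_THRESHOLD):
    """Keep examples whose reading condition reached the PC threshold.

    Examples with a human PC smaller than the threshold are dropped.
    """
    return [
        ex for ex in examples
        if pc_by_condition[ex["condition"]] >= threshold
    ]


# Hypothetical human PC per (system, dose level, lesion diameter in mm).
pc_by_condition = {
    ("SystemA", "AEC", 1.5): 27.0,   # close to guessing in a 4-AFC task
    ("SystemA", "AEC", 5.7): 98.0,
}

# Each example holds the label vector of one 4-AFC trial (1 = human choice).
examples = [
    {"condition": ("SystemA", "AEC", 1.5), "labels": [0, 1, 0, 0]},
    {"condition": ("SystemA", "AEC", 5.7), "labels": [1, 0, 0, 0]},
]

kept = filter_examples(examples, pc_by_condition)
print(len(kept), "of", len(examples), "examples kept")
```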

Figure 3. Visualization of the diminishing training progress using all the human
reading data, including the images with PC ~25%: the 1st channel after the 1st epoch
and after the 10th epoch.
The CHO component of the DL-CHO is taken from earlier work (Petrov et al.,
2019) and is based on a single-slice CHO algorithm (Platiša et al., 2011), which
requires 2D images as input. Regions of interest (ROIs) therefore had to be
extracted from the VOIs used for the human reader studies. As the signal present
VOIs were centered at the expected position of the lesions, the central three
planes were taken as input for the DL-CHO, because the maximum visibility of the
targets occurs at slightly different positions in the z-direction. Each VOI
example (4 VOIs and label vector) therefore produced three ROI examples, each
with 4 ROIs at the same plane position along with a replicated label vector. In
this way, the ROI examples from all DBT systems and dose levels formed a DL-CHO
training set of 29664 examples.
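A minimal sketch of this expansion is given below. It assumes that a VOI is stored as a list of planes and uses string tags in place of the 2D pixel arrays; the function names and the handling of an even number of planes are assumptions, not the original implementation.

```python
def central_planes(voi, n=3):
    """Return the n central planes of a VOI given as a list of planes."""
    start = (len(voi) - n) // 2
    return voi[start:start + n]


def voi_example_to_roi_examples(vois, labels, n=3):
    """Split one VOI example (4 VOIs + label vector) into n ROI examples.

    ROI example k holds plane k of each of the four VOIs, so all four ROIs
    share the same plane position, plus a copy of the label vector.
    """
    planes = [central_planes(voi, n) for voi in vois]
    return [([p[k] for p in planes], list(labels)) for k in range(n)]


# Toy VOIs with 7 planes each; a tag stands in for each 2D plane.
vois = [[f"voi{i}_z{z}" for z in range(7)] for i in range(4)]
labels = [0, 0, 1, 0]  # the human reader picked the third VOI

roi_examples = voi_example_to_roi_examples(vois, labels)
for rois, lab in roi_examples:
    print(rois, lab)
```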
The different DBT systems reconstruct the DBT stacks with different pixel sizes
determined by detector and/or reconstruction algorithm. With an ROI size in
real space of 20x20mm2, the extracted ROIs of the different DBT system brands
ranged from 186x186 to 236x236 pixels. In order to use all examples in the same
manner and produce a generic channel set for all systems, the ROIs were resized
to 150x150 pixels. The down-scaling was performed using a bi-linear
interpolation with anti-aliasing to minimize the aliasing artifacts. The anti-
aliasing was performed via Gaussian filter smoothing with standard deviation
estimated by the following formula:
\[ \sigma_{aa} = \frac{s - 1}{2}, \qquad s = \frac{\text{input image size}}{\text{output image size}} \tag{1} \]
where 𝑠 is the down-scaling factor.
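As a worked example of equation (1), the sketch below evaluates the anti-aliasing standard deviation for the smallest and largest ROI sizes quoted above (186 and 236 pixels) resized to 150 pixels. The function name is hypothetical, and the clamp to zero for up-scaling is an added assumption that is not stated in the text.

```python
def antialias_sigma(input_size, output_size):
    """Gaussian anti-aliasing sigma from equation (1): (s - 1) / 2."""
    s = input_size / output_size          # down-scaling factor
    return max(0.0, (s - 1.0) / 2.0)      # assumed: no blur when up-scaling


for size in (186, 236):
    sigma = antialias_sigma(size, 150)
    print(f"{size} -> 150 pixels: sigma = {sigma:.3f} pixels")
```

For these two cases the standard deviation is about 0.12 and 0.29 pixels respectively, so the smoothing applied before interpolation is mild.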

The ROI pixel values have different magnitudes depending on the dose level and
system manufacturer. In order to make the DL-CHO training more effective and
avoid false DL-CHO decisions in the validation process because of these
differences, the ROI intensity range was standardized. The original pixel values
were rescaled to the range between 0 and 1 with ‘float32’ precision and the ROIs
were saved as 32-bit tiff files. This allowed more efficient training of the
DL-CHO, while preserving the ROI first and second order statistics.
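The intensity standardization can be sketched as a min-max rescaling. The text does not state whether the minimum and maximum were taken per ROI or over a larger set of images; the sketch below assumes per-ROI scaling and uses nested lists in place of image arrays.

```python
def rescale_unit_range(roi):
    """Min-max rescale a 2D ROI (list of rows) to the range 0 to 1."""
    flat = [p for row in roi for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:                       # flat image: avoid division by zero
        return [[0.0 for _ in row] for row in roi]
    return [[(p - lo) / (hi - lo) for p in row] for row in roi]


roi = [[100, 200],
       [300, 500]]
print(rescale_unit_range(roi))
```

Because the mapping is linear, the relative pixel differences within an ROI are unchanged; only the offset and scale are removed.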
2.4. Deep learning channelized Hotelling observer
The channelized Hotelling observer computes a test statistic 𝑡 for each image, by
applying the observer template 𝑤𝐶𝐻𝑂 (Petrov et al., 2019):
\[ t_{CHO}(v) = w_{CHO}^{t} v = (\bar{v}_{sp} - \bar{v}_{sa})^{t} S_v^{-1} v \tag{2} \]
The observer template 𝑤𝐶𝐻𝑂 is formed from the difference between the mean signal
present and signal absent channel outputs and the inverse of the interclass
covariance matrix 𝑆𝑣. The CHO uses channels to extract information from the
images by calculating a channel output vector 𝑣 = 𝑈 𝑡 𝑔, which is the product of
the channel set 𝑈 and an image 𝑔. In our work the channel output vectors were
calculated using a single convolutional layer with five kernels, each equal in
size to the input image 𝑔, and a bias vector. In this way, every convolutional
kernel produces a single scalar, and the complete convolution process gives a
channel output vector 𝑣 of 5 x 1 elements. The number of kernels was determined
with regard to the size of the training dataset. The more kernels the
convolutional layer has, the more features can be extracted from the images, but
the more training images are needed. In our study five kernels were found to be a
good compromise between the number of training images and DL-CHO performance.
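The channelization step can be illustrated without a deep learning library. When a kernel has the same size as the image, the convolution (implemented as cross-correlation in most libraries) reduces to one element-wise product and sum per kernel. The sketch below uses a 2 x 2 toy image and five made-up kernels; the names and values are illustrative only.

```python
def channel_outputs(image, kernels, biases):
    """Channel output vector v: one scalar per full-size kernel.

    v_k = sum_ij U_k[i][j] * g[i][j] + b_k
    """
    v = []
    for kernel, bias in zip(kernels, biases):
        total = bias
        for img_row, ker_row in zip(image, kernel):
            total += sum(g * u for g, u in zip(img_row, ker_row))
        v.append(total)
    return v


image = [[1.0, 2.0],
         [3.0, 4.0]]
kernels = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
    [[0, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
biases = [0.0, 0.0, 0.0, 0.0, 0.5]

v = channel_outputs(image, kernels, biases)
print(v)  # five channel outputs, one per kernel
```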

In order to calculate the CHO test statistic, the convolutional layer is applied to
all image datasets (figure 4). In order to train the convolutional kernels, the exact
combinations of four images (1 signal present and 3 signal absent images) read
by the human observers are given as input. The CHO output for these four images
is used along with the label vector to calculate a binary cross-entropy loss
estimate. This estimate is used with stochastic gradient descent with momentum
to update and optimize the convolutional layer weights and biases (Krizhevsky et
al., 2012). The DL-CHO was developed using the PyTorch library in Python. The
gradients used for the convolutional layer optimization are calculated via the
autograd package (Paszke et al., 2007).
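The loss and update rule can be sketched in plain Python. The text does not say how the CHO outputs are mapped to probabilities; the sketch assumes a sigmoid and averages the binary cross-entropy over the four images. The momentum update follows the common form (velocity = momentum * velocity + gradient; weight = weight - learning rate * velocity), and the learning rate and momentum values are placeholders, not the values used in the study.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def binary_cross_entropy(outputs, labels):
    """Mean binary cross-entropy of four observer outputs vs. label vector."""
    eps = 1e-12
    total = 0.0
    for t, y in zip(outputs, labels):
        p = min(max(sigmoid(t), eps), 1.0 - eps)   # assumed sigmoid mapping
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One stochastic gradient descent step with momentum."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


labels = [1, 0, 0, 0]                       # human picked the first image
agree = binary_cross_entropy([4.0, -4.0, -4.0, -4.0], labels)
disagree = binary_cross_entropy([-4.0, 4.0, -4.0, -4.0], labels)
print(f"loss when agreeing with the reader:    {agree:.3f}")
print(f"loss when disagreeing with the reader: {disagree:.3f}")
```

The loss is small when the highest output falls on the image chosen by the human reader, which is what drives the kernels towards human-like rather than ideal decisions.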
From the human 4-AFC scoring, 29664 training examples were created with 4
images each. In order to train the DL-CHO, the available training examples were
split into three subsets: 26000 examples for CNN kernel training, 2000 examples
for training of the CHO template and the remaining 1664 examples for validation.
The kernel weights and biases were updated after every training example until
all 26000 had been used; the resulting DL-CHO was then applied to the 1664
validation examples to estimate the validation loss and accuracy. One such
training pass over all available images defines an epoch, and this was repeated
50 times. In each epoch the convolutional kernels start from the values reached
in the previous epoch and are trained further. The validation subset is allocated
at the beginning of the training and kept fixed throughout the complete training
process. Conversely, the image subsets for the CHO template training and the CNN
kernel training were re-drawn at the beginning of each epoch, which improves the
DL-CHO generalization and reduces overfitting.
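The data split can be sketched with index lists. The subset sizes are those given in the text; the random seeds, function names and the use of Python's `random` module are assumptions made for illustration.

```python
import random

N_TOTAL, N_KERNEL, N_TEMPLATE, N_VAL = 29664, 26000, 2000, 1664


def fixed_validation_split(n_examples, n_val, seed=0):
    """Allocate the validation subset once; it stays fixed for all epochs."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return idx[n_val:], idx[:n_val]        # (remaining pool, validation)


def epoch_split(pool, n_kernel, rng):
    """Re-draw the kernel-training and template-training subsets."""
    shuffled = pool[:]
    rng.shuffle(shuffled)
    return shuffled[:n_kernel], shuffled[n_kernel:]


pool, validation = fixed_validation_split(N_TOTAL, N_VAL)
rng = random.Random(1)
for epoch in range(2):                     # 50 epochs in the text
    kernel_idx, template_idx = epoch_split(pool, N_KERNEL, rng)
    print(epoch, len(kernel_idx), len(template_idx), len(validation))
```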

Figure 4. Flow chart of the DL-CHO training.


After the DL-CHO was trained, the convolutional kernels and bias vector were
fixed and saved for use on new sets of DBT acquisitions (Table 1 column ‘DL-CHO
testing’) that were not included in the training process. In this testing stage the
same channel set and bias vector were used for all ROIs extracted from the extra
270 DBT testing acquisitions (table 1) without further tuning. The CHO template
was estimated using the 1664 examples allocated for validation in the training
process; thus 1664 signal present and 4992 signal absent images were used to
calculate the channelized expected signal and the interclass covariance matrix.
In this way the DL-CHO testing relies completely on images not used to update the
DL channels, so no dataset bias is expected. The application of the trained
DL-CHO does not require any demanding computation, so for this task a standard
computer was used with a standard CHO algorithm.
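The template estimation and equation (2) can be illustrated with a small pure-Python sketch. It assumes that 𝑆𝑣 is taken as the average of the signal present and signal absent sample covariance matrices of the channel outputs, which is a common CHO convention but is not spelled out in the text. Two channels and a handful of made-up channel output vectors are used; all names are hypothetical.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def covariance(vectors):
    """Sample covariance matrix of a list of channel output vectors."""
    m, n, d = mean_vector(vectors), len(vectors), len(vectors[0])
    return [[sum((v[i] - m[i]) * (v[j] - m[j]) for v in vectors) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def cho_template(v_sp, v_sa):
    """w = S_v^{-1} (mean_sp - mean_sa), with S_v the averaged covariance."""
    ds = [a - b for a, b in zip(mean_vector(v_sp), mean_vector(v_sa))]
    c_sp, c_sa = covariance(v_sp), covariance(v_sa)
    d = len(ds)
    s_v = [[0.5 * (c_sp[i][j] + c_sa[i][j]) for j in range(d)] for i in range(d)]
    return solve(s_v, ds)


def cho_statistic(w, v):
    """Test statistic t = w^t v for one channel output vector."""
    return sum(wi * vi for wi, vi in zip(w, v))


# Made-up channel outputs (2 channels) for signal present / absent images.
v_sp = [[2.0, 0.0], [3.0, 1.0], [4.0, 0.0], [3.0, -1.0]]
v_sa = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [1.0, -1.0]]
w = cho_template(v_sp, v_sa)
print(w, cho_statistic(w, [3.0, 0.0]), cho_statistic(w, [1.0, 0.0]))
```

In a 4-AFC trial the image with the largest test statistic is taken as the observer's choice, so only the ordering of the four statistics matters.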
Most of the DBT acquisitions in the testing dataset were not read by human
observers; this applies to the data from all DBT systems except Hologic
Dimensions and GE SenoClaire. In order to study how the DL-CHO reads new images
in comparison to the human observers, we hypothesized that a human reading of
images acquired under similar conditions would give a score within the
confidence interval of the earlier readings, i.e. that the human reading
uncertainty is similar to the human reading reproducibility.
The testing of the DL-CHO was performed in two parts:
1. Dose levels study – the trained model observer was applied to images
acquired with DBT systems from different vendors at three dose levels. The
DL-CHO PC results were plotted against the human observer PC results
and the goodness of fit between the two observers was assessed
using three criteria (Petrov et al., 2019):
a. Pearson’s correlation coefficient (𝑟): measures the linear
relation between the DL-CHO and the human observer results.
A value closer to 1 is desired.
b. Linear regression slope (𝑎): measures the target size
dependency of the DL-CHO versus the human observers. A
value closer to 1 is desired.
c. Mean error (𝑀𝐸): measures the average difference between the
DL-CHO and human observer PC scores. A value closer to 0 is
desired.
2. Reproducibility study – the trained model observer was applied to 60
DBT acquisitions taken with the same DBT system at the same scanning
parameters. The acquisitions were split into 5 sets and the standard
deviation of the 5 readings for each lesion was taken as an estimate
of the DL-CHO reproducibility. This was then compared to the human
reader reproducibility for the same 5 image groups.
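The three goodness of fit criteria can be computed as in the sketch below. It assumes that the regression is of the model observer PC on the human PC and that the mean error is signed (model minus human), which matches the signed values reported in the results but is not stated explicitly. The PC values are made up.

```python
import math


def fit_criteria(human_pc, model_pc):
    """Return Pearson r, regression slope a and mean error ME."""
    n = len(human_pc)
    mh = sum(human_pc) / n
    mm = sum(model_pc) / n
    sxx = sum((h - mh) ** 2 for h in human_pc)
    syy = sum((m - mm) ** 2 for m in model_pc)
    sxy = sum((h - mh) * (m - mm) for h, m in zip(human_pc, model_pc))
    r = sxy / math.sqrt(sxx * syy)           # linear relation
    a = sxy / sxx                            # slope of model vs. human
    me = sum(m - h for h, m in zip(human_pc, model_pc)) / n
    return r, a, me


# Made-up PC values for the five lesion sizes (illustrative only).
human = [30.0, 50.0, 70.0, 90.0, 100.0]
model = [35.0, 52.0, 75.0, 92.0, 100.0]
r, a, me = fit_criteria(human, model)
print(f"r = {r:.2f}, a = {a:.2f}, ME = {me:.1f} PC")
```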

3. Results
3.1. Training results and DL channel set

Figure 5. a) Graph of the accuracy and loss during the training process; b) the
channel set estimated after 50 training epochs, used for all systems.
Figure 5 a) shows the graph of accuracy and loss against epoch number. After
each epoch, the average training loss and accuracy were stored along with the
validation loss and accuracy calculated from the 1664 validation examples. After
50 runs over the full training dataset, the final convolution kernel values were
stored as matrices (figure 5 b)) along with the bias vector. The training loss and
accuracy estimates showed little improvement after the 10th epoch, with the loss
remaining below 0.08 and a validation accuracy of about 0.87. For all 5 channels
(figure 5 b)) there is a region of enhanced signal visible at the center; this is most
likely due to the average target present in the phantom. It can be seen in figure 5
b) that four channels have negative central signals (black grey value) and one has
a positive central signal (white grey value). This could be linked to the
anthropomorphic detection task required from the DL-CHO, which differs from the
usual application of deep learning networks, namely ‘ideal’ classification.
This might help the DL channels to suppress the presence of some lesions and
so downgrade the DL-CHO performance. It should be noted that the channels with
negative central signals have a negative bias value and the channel with a positive
central signal has a positive bias value.
3.2. Testing against human PC results
Dose levels study
The channel set from the training (figure 5 b)) was applied in a conventional way
to the image dataset that was saved for validation and the results were compared
to the human readings (in absolute terms). The results show good agreement
with humans (table 2 and figure 6) for all systems and reading sessions. The
goodness of fit criteria were close to ideal for most of the sessions, with a
Pearson’s correlation coefficient of at least 0.90, a linear regression slope of at
least 0.60 and an absolute mean error between the observers of at most 12.7 PC.
Table 2. The fitting criteria results: Pearson’s correlation, linear regression slope
and mean error.
DBT system            Dose level    r       a       ME [PC]
Fujifilm              High          0.99    0.87     1.1
                      AEC           1.00    0.74    -2.6
                      Low           0.98    0.60    -5.2
GE Pristina           High          0.91    0.76     7.1
                      AEC           0.91    1.06     5.6
                      Low           0.92    0.89     6.2
GE SenoClaire         High          0.96    1.04     3.3
                      AEC           0.97    1.17     5.5
                      Low           0.96    0.94     5.0
Giotto Class          High          0.97    0.93    -3.3
                      AEC           0.90    1.13     3.3
                      Low           0.99    0.91    -0.2
Siemens Revelation    High          0.95    1.00     3.7
                      AEC           0.99    1.03     7.0
                      Low           0.99    0.86     1.1
Siemens Inspiration   High          0.96    0.76     0.1
                      AEC           0.97    0.79    -5.6
                      Low           0.96    1.11     4.6
Hologic Dimensions    AEC Set 1     0.97    1.26     5.5
                      AEC Set 2     0.93    0.83     1.7
                      AEC Set 3     0.96    0.88     1.2
                      AEC Set 4     0.97    1.08    12.7
                      AEC Set 5     0.98    1.08     9.0

Figure 6. Scatter plots of the percentage of correctly detected targets of the
DL-CHO against the human observers for 7 DBT systems. Panels: Fujifilm, GE
Pristina, GE SenoClaire, Giotto Class, Siemens Inspiration and Siemens Revelation
(High, AEC and Low dose levels) and Hologic Dimensions (Sets 1 to 5). Axes: human
PC (%) against CHO PC (%).
Reproducibility study

Figure 7. Reproducibility of the DL-CHO (left) and human observers (right) for the
Hologic Dimensions DBT system at AEC dose level, shown as PC (%) against lesion
size [mm]. The longer horizontal line for each graph and lesion size represents
the mean PC and the two shorter lines the upper and lower limits of the standard
deviation.
Table 3. Standard deviation of the model and human observer results in the
reproducibility study.

Standard deviation (PC)
Lesion diameter    1.5 mm   2.1 mm   3.0 mm   4.3 mm   5.7 mm
DL-CHO             3.1      4.7      4.8      1.1      2.4
Human observers    7.7      15       4.5      4.4      3.3
Figure 7 shows the mean and the standard deviation of the PC results for the
reproducibility study for the Hologic Dimensions DBT system. Table 3 lists the
standard deviation values for the DL-CHO and the human observers. The results
indicate that the DL-CHO is generally more reproducible, with less variability in
the PC results when reading images taken under the same conditions; the only
exception is the 3.0 mm lesion, where the two observers are comparable. Overall
the standard deviation of the DL-CHO stays below 5 PC, whereas for humans it
reaches 15 PC.
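The reproducibility estimate for one lesion size can be sketched as the standard deviation of the PC over the five image sets. The sketch assumes the sample standard deviation (n − 1 in the denominator), which the text does not specify, and uses made-up PC values.

```python
import statistics


def reproducibility(pc_per_set):
    """Mean PC and sample standard deviation over repeated image sets."""
    return statistics.mean(pc_per_set), statistics.stdev(pc_per_set)


# Made-up PC values of one lesion size for the 5 sets of acquisitions.
pc_sets = [60.0, 70.0, 80.0, 70.0, 70.0]
mean_pc, sd_pc = reproducibility(pc_sets)
print(f"mean PC = {mean_pc:.1f}, standard deviation = {sd_pc:.1f} PC")
```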
4. Discussion
The results show the potential of the developed DL-CHO to replace the human
observer reading for future image quality studies on these DBT systems using
the spheres phantom. The ability to predict human results is maintained over a
wide range of acquisition doses and over the considerable differences between
the DBT reconstruction algorithms. The DL-CHO showed better reproducibility
than humans when applied to a number of datasets acquired with the same
acquisition settings. Human reading PC increases with increasing lesion
diameter and, to a lesser extent, with increasing acquisition dose. These same
trends were seen for all the DBT systems. While expected, this is an important
verification for the DL-CHO.
We have to note that the human observer studies used for DL-CHO training were
acquired over a period of 3 years. While the reading laboratory was not changed
during this time frame, for some reading sessions a different set of six human
observers was used. This was necessary in order to acquire the number of images
required for a robust DL channel estimation. The time for a human reading study
for one DBT system at three dose levels with 6 independent observers was
typically between one and two weeks. This can be compared to the application
of the trained DL-CHO to all 7 DBT systems (at 3 dose levels) studied in the
present work (figure 6), which took only a few hours. Once designed and trained,
the model observer therefore promises great time savings for the evaluation of
images such as the test object images here.
The standard application of a CHO requires an extra dataset of training images to
calculate the observer template. In our previous study (Petrov et al., 2019), it
was concluded that no fewer than 17 DBT phantom scans were needed in
order to acquire training and reading datasets for a robust image quality
estimate. For the DL-CHO an extra training dataset is not necessary: the DL-CHO
is tuned and trained on the fixed dataset of training images from all systems
and scanning parameters, so the observer template in this case is a fixed vector.
This greatly facilitates image quality evaluation, as a reliable image quality
estimate can be made from a dataset of just 12 DBT phantom acquisitions for a
given system and dose level.
There are certain limitations to the DL-CHO presented in this work. The channels
are tuned to work specifically for the spheres phantom (Cockmartin et al., 2017)
and will not give meaningful results if applied to a different DBT phantom. Also,
we would not expect such a good correlation between the human and DL-CHO
results if the phantom were read by a different group of human observers, e.g.
radiologists or naïve observers. While our own human reader group obviously
contains some between-observer variation, substituting a new observer in the
group of six will change the fit results in table 2 slightly. The channel set has been
validated to work on multiple DBT systems only under the described conditions. A
change of parameters in the imaging chain, the scanning protocol, or the
reconstruction algorithm might require re-validation with a new set of human
reader scores acquired for these new acquisitions. This would also be the case if
a new DBT system is introduced. Another limitation is associated with the
resizing of the images used for training and testing of the DL-CHO. The resizing
was necessary for the multi-vendor application of the model observer and
includes an anti-aliasing filter that applies Gaussian blur to the processed images.
In general this may reduce the image quality compared to the original image. In
this respect the DL-CHO training dataset differs slightly from the original human
reading dataset. While this might not be crucial for the mass-like target detection,
given the results we suspect that the DL-CHO channels compensate for this
reduced image quality.
Overall, the DL-CHO algorithm can be seen to be particularly promising for QA
applications, where imaging performance needs to be estimated periodically,
perhaps eventually against a limiting standard, where high precision on the
generated test object score would be required. For any new DBT system
introduced, or a major update of a current DBT system, a human reading on the
system acceptance test would be required for the purpose of re-validation of the
DL-CHO.
5. Conclusions
This study has proposed a method of estimating a DL channel set for use in a
standard linear CHO model observer. The process includes a convolutional layer
with 5 kernels; the output of these kernels serves as the input for a CHO classifier.
The channels were estimated using a large dataset of DBT scans acquired on six
DBT models and evaluated by human readers. The DL-CHO was then applied to
an independent set of DBT scans and human reading results, acquired on seven
DBT models. There was good agreement between the DL-CHO output and the
human observer data for the models considered and over a wide range of dose
levels. The DL-CHO algorithm has clear potential for use in DBT system image
quality evaluation.

RESNET18 FOR MULTI-VENDOR DBT IMAGE QUALITY EVALUATION USING A STRUCTURED
PHANTOM
This chapter is based on the text that will be published in the proceedings of the
SPIE Medical Imaging 2020 conference (Houston, USA).
1. Introduction
Digital breast tomosynthesis (DBT) is a breast imaging technique in which a series
of projection images is acquired as the x-ray tube moves over a limited angle
around the breast. A reconstruction of the projection data produces a
quasi-volumetric image of the breast, visualized as evenly spaced planes parallel
to the detector plane. The DBT method is therefore an attempt to overcome the
reduction in lesion detection sensitivity due to tissue overlap in 2D
mammography. Image quality estimation and system characterization for the
purposes of quality assurance or optimization, however, can be challenging given
the large number of parameters influencing the DBT system performance.
Furthermore, organizing human observer detectability studies for all
optimization and characterization procedures, while offering a gold standard
performance measure, requires considerable input in terms of reading time by the
observers.
Recent studies have investigated the use of deep learning (DL) networks as a
classifier for image quality estimation. For example, Hou et al. (Hou et al.,
2015) and Bosse et al. (Bosse et al., n.d.) propose a convolutional neural network
(CNN) method for image quality estimation on natural images, largely based on
human scores for the ground truth. In the field of medical imaging a few groups
have developed DL driven model observers (MOs) for image quality estimation.
For example, Massanes and Brankov (Massanes & Brankov, 2017), Alnowami et
al. (Alnowami et al., 2018) and Gong et al. (Gong et al., 2019) propose different
applications for a DL approach to generating a model observer for different
X-ray imaging modalities. These studies, while aiming to mimic human
performance, do not use the human observer ground truth for training, but rely on
the image class from which the image originates (signal present or signal absent).
This can be seen as developing an ideal observer, then applying steps in which
the performance is downgraded so that the MO detectability matches that of the
human observers.
The aim of this study was to develop a CNN anthropomorphic classifier for image
quality evaluation in DBT for mass-simulating lesions using images of a 3D
structured phantom (Cockmartin et al., 2017). This test object is evaluated using
a 4-alternative forced choice (4-AFC) method, in which human observers are shown
four images and have to pick the image they consider to contain the signal. In
order to develop the DL classifier, every image selected by the human reader as
positive was labelled as an image containing the signal, regardless of the signal
ground truth. These images were then split into training and validation groups
and given to a pre-trained ResNet18 neural network (He et al., 2015) modified
for the purpose of this study. The ResNet18 CNN uses residual
connections to cope with the vanishing gradient problem found in the deeper
network layers (Ebrahimi & Abadi, 2018). This also allows for higher
classification accuracy with a smaller number of trainable parameters. After the
training step, the ResNet18 algorithm was applied to sets of separate DBT
acquisitions in order to validate the DL classifier performance against human
readers.
2. Materials and methods
2.1. Image acquisition and human reading
The image datasets for the reading were acquired using a structured phantom.
The phantom (Cockmartin et al., 2017) is made of a PMMA semi-circular
container filled with PMMA spheres of different sizes; the remaining volume
between the spheres is filled with water. This creates images with power spectra
similar to those found in patients (Cockmartin et al., 2013). Five non-spiculated
mass simulating targets, covering a range of diameters are also present within
the phantom structure. There were 324 DBT scans (each scan generated a single
DBT image volume) available with human reading results, acquired on six
different models of DBT system. The imaging systems were the Fujifilm Innovality,
GE Pristina, Giotto Class, Hologic 3Dimensions, Siemens Inspiration and Siemens
Revelation. These DBT scans were acquired at both the automatic exposure
control (AEC) dose level for a given system and at lower and higher dose levels.
The phantom was also used to generate another 270 DBT image volumes in total;
these scans were acquired on Fujifilm Innovality, GE Pristina, GE SenoClaire, Giotto
Class, Hologic Dimensions, Siemens Inspiration and Siemens Revelation DBT
systems. The first group of DBT volumes (i.e. with the human reading results)
were used for the CNN training, while the second set of DBT scans were used for
validation of the trained CNN algorithm against the human results.
From the DBT acquisitions, volumes of interest (VOI) with the size of
20x20x30mm3, with and without signal, were cropped and used in human
observer experiments. The DBT images were scored as a 4-AFC study in which
four images were shown to an observer, one of which contained the signal
stimulus. The observer had to select the image considered to contain the signal
stimulus. A software tool developed in-house (“Foursquares” (Zhang et al.,
2016)) was used to conduct the 4-AFC studies. Following a reading session, the
percentage of correctly identified signals (PC) was calculated and this was used
as the figure of merit. In total, six medical physicists participated in these reading
studies; all results were collected and grouped for use as input for the CNN
training. For all readers and reading sessions, the exact configuration of the four
images was stored and annotated with a label “1” for the selected image and “0”
given to the remaining 3 images. In this way, for each set of 4-AFC images that
had been read, there was a label vector indicating which of the four images had
been selected by the human reader.

The second dataset with 270 acquisitions used for ResNet18 validation has not
been read by human observers. A human reading of all validation acquisitions
would not be feasible within a reasonable time period, as the results for
the first dataset were accumulated from multiple studies over a period of
two years. Therefore, with the assumption that the human observer
reproducibility is similar to the PC results uncertainty (Petrov et al., 2019), the
human results from the training dataset of 324 acquisitions were also used as PC
results for the validation dataset.
2.2. ResNet18
The new DL approach for image quality estimation presented here is based on
the residual convolutional neural network (ResNet) (He et al., 2015) with 18
layers, pre-trained on more than a million natural images from the ImageNet
database (Russakovsky et al., 2015). The pre-training was used to initialize the
network variables in order to achieve a quicker training stage that requires fewer
training images. The layers of the network consist of an input layer, 16 hidden
convolutional layers with different number of nodes and 1 fully connected layer
with a single binary output. The ResNet18 takes as an input 2D images with an
x-y size of 224x224 pixels, so in order to preserve the ResNet18 structure and
pre-trained weights and biases, from all VOIs the central three regions of interest
(ROI), including the ‘in focus’ plane and the two adjacent, were extracted and
resized to 224x224 pixels using a bi-linear interpolation. The output of the
network was taken after application of a softmax activation function and this was
the variable used for the decision making. In order to train the ResNet18 with
the human responses, the exact combinations of four images (1 present and 3
absent) read by the human observers, were given as the input. The ResNet18
output for these four images was used along with the label vector to calculate a
binary cross-entropy loss estimate, which showed how well the algorithm
performs the task. This estimate was used with stochastic gradient descent
(SGD) with momentum to update the network weights and biases. From the
human 4-AFC scoring data, 29664 training examples were created, each
containing 4 images. These examples were then randomly assigned to
two groups: 28000 examples for training and 1664 for validation. The network
weights and biases were updated after each training example until all 28000 had
been used. The resulting ResNet18 was then applied to the 1664 validation
examples in order to estimate the validation loss and accuracy. The
values were stored and, following a randomization step, the algorithm was
trained again using the same (but randomized) dataset, for a total of 50 epochs.
Once the ResNet18 had been trained (i.e. after the 50 epochs) a further testing
step was carried out with the dataset of 270 DBT acquisitions not included in the
training process. In order to compare the DL network results against the human
observer, the ResNet18 output was used in a 4-AFC study; one signal present and
three signal absent images were given as an input and if the network output was
the highest for the signal present image, the 4-AFC trial was scored as a ‘hit’.
Conversely, if a signal absent image scored highest then a ‘miss’ was assigned.
From the number of ‘hits’ and the total number of 4-AFC trials, the PC of detected
targets was calculated.
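The 4-AFC scoring of the network output can be sketched as follows. The network outputs are made-up numbers and the function names are hypothetical; ties are resolved in favour of the first image, which is an arbitrary choice not described in the text.

```python
def score_4afc_trial(outputs, signal_index):
    """Return True ('hit') if the signal present image scores highest."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return best == signal_index


def percent_correct(trials):
    """PC (%) over a list of (outputs, signal_index) 4-AFC trials."""
    hits = sum(score_4afc_trial(outputs, idx) for outputs, idx in trials)
    return 100.0 * hits / len(trials)


# Made-up network outputs for four trials; the second entry of each tuple
# is the position of the signal present image.
trials = [
    ([0.9, 0.1, 0.2, 0.3], 0),   # hit
    ([0.2, 0.8, 0.1, 0.4], 0),   # miss: a signal absent image scored highest
    ([0.1, 0.2, 0.7, 0.3], 2),   # hit
    ([0.3, 0.2, 0.1, 0.6], 3),   # hit
]
pc = percent_correct(trials)
print(f"PC = {pc:.1f} %")
```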
3. Results and discussion
3.1. Training results

Figure 1. The validation accuracy and loss of the ResNet18 against the training
epoch.
The pre-trained ResNet18 was further trained on the DBT phantom images in
order to achieve a high accuracy for the image quality estimation task. After
every pass over all 28000 training examples, the resulting state of the ResNet18
was validated against human observers on a separate set of 1664 examples. The
loss and the accuracy of the validation step against the training epoch are shown
in figure 1. The graph shows no significant improvement in the validation
accuracy after the 20th epoch, thus after 50 epochs the training process was
stopped. The resulting network had a loss lower than 0.05 and an accuracy higher
than 91% when compared to the human observer decisions in a 4-AFC test.
3.2. Validation against human PC results
Figure 2 shows the percentage correct validation results for the ResNet18
classifier against human observers. Three criteria were used for the evaluation
of the DL classifier, taken from the work of Petrov et al. (Petrov et al., 2019).
These are the Pearson correlation coefficient, the gradient of the linear
regression and the mean absolute error between the ResNet18 and the human reader
data. The results show satisfactory agreement with human results, with Pearson’s
correlation higher than 0.92 (the lowest value was found for the Giotto at High
dose level), linear regression slope better than 0.58 (the lowest value was found
for Giotto at High dose level) and mean absolute error between the observers
smaller than 11 PC (the highest value was found for Siemens Revelation at Low
dose level).

Figure 2. Scatter plots of the PC of the ResNet18 against human observer for 7
DBT systems. Panels: Fujifilm, GE Pristina, GE SenoClaire, Giotto Class, Siemens
Inspiration and Siemens Revelation (High, AEC and Low dose) and Hologic
Dimensions (Sets 1 to 5). Axes: human PC (%) against ResNet PC (%).
4. Conclusion
With the present study, we propose a method for automated image quality
estimation of DBT images using a ResNet18 deep learning approach. The trained
DL network was used on a separate set of images to validate the correlation with
humans. The results show good agreement with human observers for most
tested system conditions and DBT scanners.

REFERENCES
Alnowami, M., Mills, G., Young, K., Dance, D. R., Awais, M., Halling-Brown, M. D.,
Wells, K., Elangovan, P., & Patel, M. (2018). A deep learning model observer
for use in alterative forced choice virtual clinical trials. Medical Imaging
2018: Image Perception, Observer Performance, and Technology Assessment,
March, 25.
Bosse, S., Maniry, D., Müller, K.-R., Wiegand, T., & Samek, W. (n.d.). Deep Neural
Networks for No-Reference and Full-Reference Image Quality Assessment.
Cockmartin, L., Bosmans, H., & Marshall, N. W. (2013). Comparative power law
analysis of structured breast phantom and patient images in digital
mammography and breast tomosynthesis. Medical Physics, 40(8), 081920.
Cockmartin, L., Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Ongeval, C.
Van, & Fredenberg, E. (2017). Design and application of a structured
phantom for detection performance comparison between breast
tomosynthesis and digital mammography. Physics in Medicine and Biology,
62(3), 15.
Ebrahimi, M. S., & Abadi, H. K. (2018). Study of Residual Networks for Image
Recognition.
Gong, H., Yu, L., Leng, S., Dilger, S. K., Ren, L., Zhou, W., Fletcher, J. G., &
McCollough, C. H. (2019). A deep learning- and partial least square
regression-based model observer for a low-contrast lesion detection task
in CT. Medical Physics, 46(5), 2052–2063.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image
Recognition.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with
Deep Convolutional Neural Networks. Advances In Neural Information
Processing Systems, 1–9.
Massanes, F., & Brankov, J. G. (2017). Evaluation of CNN as anthropomorphic
model observer. In M. A. Kupinski & R. M. Nishikawa (Eds.), Medical
Imaging 2017: Image Perception, Observer Performance, and Technology
Assessment (Vol. 10136, p. 101360Q). International Society for Optics and
Photonics.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in
PyTorch.
Petrov, D., Cockmartin, L., Marshall, N., Vancoillie, L., Young, K., & Bosmans, H.
(2017). Real space channelization for generic DBT system image quality
evaluation with channelized Hotelling observer (M. A. Kupinski & R. M.
Nishikawa (eds.); Vol. 10136, p. 101360N). International Society for Optics
and Photonics.
Petrov, D., Marshall, N. W., Young, K. C., & Bosmans, H. (2019). Systematic
approach to a channelized Hotelling model observer implementation for a
physical phantom containing mass-like lesions: Application to digital
breast tomosynthesis. Physica Medica, 58, 8–20.
Platiša, L., Goossens, B., Vansteenkiste, E., Park, S., Gallas, B. D., Badano, A., &
Philips, W. (2011). Channelized Hotelling observers for the assessment of
volumetric imaging data sets. Journal of the Optical Society of America. A,
Optics, Image Science, and Vision, 28(6), 1145–1163.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer
Vision, 115(3), 211–252.
Hou, W., Gao, X., Tao, D., & Li, X. (2015). Blind Image Quality
Assessment via Deep Learning. IEEE Transactions on Neural Networks and
Learning Systems, 26(6), 1275–1286.
Witten, J. M., Park, S., & Myers, K. J. (2010). Partial least squares: A method to
estimate efficient channels for the ideal observers. IEEE Transactions on
Medical Imaging, 29(4), 1050–1058.
Zhang, G., Cockmartin, L., & Bosmans, H. (2016). A four-alternative forced choice
(4AFC) software for observer performance evaluation in radiology (C. K.
Abbey & M. A. Kupinski (eds.); p. 97871E). International Society for Optics
and Photonics.
Zhou, W., Li, H., & Anastasio, M. A. (2019). Approximating the Ideal Observer and
Hotelling Observer for binary signal detection tasks by use of supervised
learning methods. IEEE Transactions on Medical Imaging, 1–1.

Conclusions and future work
Task-based image quality evaluation for the purpose of system optimization and
quality control in digital breast imaging can be a time-consuming task. The
standard procedure requires reading by human observers, or even medical
specialists, in order to assess the image quality. If model observers can be tuned
to human readings, they can remove a major bottleneck in the image quality
estimation process. In this work we sought to design, implement and validate a
practical channelized Hotelling model observer methodology for image quality
evaluation in the field of digital breast imaging.
In order to validate the applicability of a model observer algorithm to replace
humans for the task, several studies have been performed using images of a 3D
structured phantom (L1 phantom (Cockmartin et al., 2017)) and of a virtual
clinical trial. The previous chapters have shown that the model observer
algorithms differ in implementation, dependent on the type of lesion to be
detected, either mass-like lesions or calcification clusters. These studies yielded
two algorithms, one for each lesion type, that will be used for task-based image
quality evaluation in 2D FFDM and DBT imaging by the Leuven team
and beyond.

LOW CONTRAST MASS-LIKE LESIONS


In chapters 1, 2, 5 and 6, anthropomorphic model observers were developed to
assess the detectability of low contrast mass-like lesions. The agreement between the
MO and human observers was successively improved and generalized such that
the MO could cope with the wide range of DBT systems available on the market.
This PhD project initially focused on the application of MO methods to DBT
images, as there was a clear need for robust and reproducible methods to
quantify the imaging performance of DBT systems. A DBT system produces a
volumetric dataset with anisotropic pixel spacing: the in-plane pixel spacing is
small and differs between vendors, while the spacing between the reconstructed
planes is large. This is very challenging for model observer research. Our initial
study (Petrov et al., 2016) had shown that a CHO with a channel set and an
observer template estimated from a limited number of training images, using
Laguerre-Gauss channels tuned for the highest performance and an internal-noise
algorithm to match human observers (Abbey &
Barrett, 2001; Brankov, 2013; Chen et al., 2002; Gallas & Barrett, 2003), did not
perform well against human observers on images of the 3D structured phantom
and did not prove practical in the long term. The CHO overestimated object
detectability compared to humans and required retuning for every different
lesion size and studied scanning condition. The first chapter of this work
addressed specifically how we were able to set up a successful CHO for the
phantom images on the Siemens Mammomat Inspiration DBT system (Siemens
Healthcare, Erlangen, Germany). The channel selection in DBT images was found
to be crucial in obtaining close agreement between the humans and the model
observer and we have described how to calculate the model observer test
statistics given a limited number of training images. This involved a systematic
study of the channel types and associated parameters, the expected signal
template and the covariance matrix estimation. The results showed that the
channel type and parameter selection is crucial for an anthropomorphic model
observer. The three studied channel types all showed different potential for
tuning. Based on these investigations, eight Gabor filters were selected as
candidate channels for the human performance matching task. With the channel
set fixed, it was found that, for the specific 3D structured phantom, 17 DBT sets
are a good compromise between the system occupation time needed to acquire the
images and the requirement for a sufficient number of images for model observer
training at an acceptable bias level. The developed CHO methodology was then
used to estimate the image quality of DBT acquisitions at three dose levels, which
showed good agreement with human observers. The reproducibility study
showed that human and model observers are similarly reproducible, with a
similar range of variance.
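The CHO computation summarized above (channelization, covariance estimation from a limited training set, Hotelling template, test statistic) can be sketched as follows. This is a minimal illustration on synthetic white-noise images, not the implementation of chapter 1: the Gaussian-shaped channels stand in for the tuned Gabor channels, the signal, noise and sample sizes are hypothetical, and the scoring as a 4-alternative forced choice percentage correct is an assumption about the reading paradigm.

```python
import numpy as np

def channelize(rois, channels):
    """Project ROIs (n, h, w) onto channels (k, h, w); returns channel outputs (n, k)."""
    return rois.reshape(len(rois), -1) @ channels.reshape(len(channels), -1).T

def train_cho(sp_rois, sa_rois, channels):
    """Hotelling template in channel space from signal-present/absent training ROIs."""
    v_sp = channelize(sp_rois, channels)
    v_sa = channelize(sa_rois, channels)
    delta = v_sp.mean(axis=0) - v_sa.mean(axis=0)          # expected signal in channel space
    cov = 0.5 * (np.cov(v_sp, rowvar=False) + np.cov(v_sa, rowvar=False))
    return np.linalg.solve(np.atleast_2d(cov), delta)      # w = S^-1 * delta

def percent_correct_afc(t_signal, t_background):
    """PC (%) when the signal ROI must score above all background ROIs of its trial."""
    return 100.0 * np.mean(t_signal > t_background.max(axis=1))

# Synthetic data (hypothetical): Gaussian blob in white noise, 32 x 32 pixel ROIs.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[:32, :32] - 15.5
r2 = xx**2 + yy**2
signal = 0.5 * np.exp(-r2 / (2 * 4.0**2))
channels = np.stack([np.exp(-r2 / (2 * s**2)) for s in (2.0, 4.0, 8.0)])

# Train on one set of images, test on a separate set to limit bias.
sa_train = rng.normal(size=(200, 32, 32))
sp_train = rng.normal(size=(200, 32, 32)) + signal
w = train_cho(sp_train, sa_train, channels)

sp_test = rng.normal(size=(100, 32, 32)) + signal
sa_test = rng.normal(size=(300, 32, 32))
t_sp = channelize(sp_test, channels) @ w
t_sa = (channelize(sa_test, channels) @ w).reshape(100, 3)  # 3 background ROIs per trial
pc = percent_correct_afc(t_sp, t_sa)
print(f"4-AFC PC = {pc:.1f} %")
```

Because the covariance matrix is estimated in the low-dimensional channel space rather than in pixel space, a few hundred ROIs are enough for a stable inverse, which is why the number of available training images and the number of channels have to be considered together.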
This study laid a solid basis for future CHO applications in DBT using the 3D
structured phantom. With the model observer estimating the human results
successfully on a Siemens Mammomat Inspiration DBT system for a wide dose
range, the next step was to extend the method to DBT systems from different
manufacturers. A preliminary study was undertaken that examined the applicability of
the current CHO to the Hologic Selenia Dimensions (Hologic Inc., Marlborough,
USA). The results showed that the channel set had to be re-tuned for a good
agreement with the human reader results. This led us to the hypothesis that
with a well-designed anthropomorphic channel set it is possible to estimate the
human results on numerous DBT systems from different manufacturers. In
chapter 2 the structured phantom was scanned multiple times on IMS Giotto
Class (IMS, Bologna, Italy), Fujifilm AMULET Innovality (Fujifilm, Tokyo, Japan),
Philips MicroDose (Philips, Solna, Sweden), GE Senographe SenoClaire 3D (GE
Healthcare, Buc, France) and Hologic Selenia Dimensions DBT systems. It was
observed that, because the DBT manufacturers chose different pixel sizes,
ROIs cropped from the reconstructed volumes to the same size in real space
contain different numbers of pixels. The channel set was therefore defined in
real space, so that it accounts for pixel size differences and extracts the same
features from the images of every system. The newly tuned channel set
performed well in approximating the human scores on all of the studied DBT
systems, but with some discrepancy between human observer and CHO for the
GE SenoClaire system. However, with the addition of new DBT systems, namely
Hologic 3Dimensions, Siemens Mammomat Revelation and GE Senographe
Pristina, and with new and improved reconstruction algorithms on some systems,
the CHO described in chapters 1 and 2 no longer generalized, and the algorithm
could not match the human
observer image quality scores for some of the systems. For this reason a new
approach was needed with a more robust channel set, capable of better or
broader generalization.
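The idea of a channel defined in real space can be illustrated with a Gabor function whose frequency and width are given in millimetres and which is sampled on the pixel grid of each system. The sketch below is an assumption-laden illustration: the function name, the parametrization of the envelope by its full width at half maximum and the pixel sizes are hypothetical and are not the tuned values of chapter 2.

```python
import numpy as np

def gabor_channel_mm(roi_size_mm, pixel_mm, freq_cyc_per_mm, width_mm,
                     theta=0.0, phase=0.0):
    """Sample a 2D Gabor channel, defined in mm, on a grid with the given pixel size."""
    n = int(round(roi_size_mm / pixel_mm))                 # pixels per ROI side
    coords = (np.arange(n) - (n - 1) / 2.0) * pixel_mm     # pixel centres in mm
    x, y = np.meshgrid(coords, coords)
    xr = x * np.cos(theta) + y * np.sin(theta)             # axis of the carrier wave
    # Gaussian envelope whose full width at half maximum equals width_mm.
    envelope = np.exp(-4.0 * np.log(2.0) * (x**2 + y**2) / width_mm**2)
    return envelope * np.cos(2.0 * np.pi * freq_cyc_per_mm * xr + phase)

# The same physical channel on two systems with different (hypothetical) pixel sizes.
ch_a = gabor_channel_mm(roi_size_mm=20.0, pixel_mm=0.085,
                        freq_cyc_per_mm=0.1, width_mm=5.0)
ch_b = gabor_channel_mm(roi_size_mm=20.0, pixel_mm=0.100,
                        freq_cyc_per_mm=0.1, width_mm=5.0)
print(ch_a.shape, ch_b.shape)
```

Both arrays describe the same structure in millimetres, although they contain different numbers of pixels, so the channel responds to the same physical feature size on either system.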

With deep learning (DL) techniques showing great promise in many fields,
including observer performance and image perception studies, these methods
were adopted as a means of improving the generalization of the MO. One key
difference is that DL techniques described in the literature applied to perception
studies normally try to approximate the ideal observer. This type of model
observer, for example, will try to maximize the separation between the two
classes present in the image dataset and this usually surpasses rather than
matches human observer performance. Such a method is not entirely useful for
our task, as ideal observers do not mimic human observer performance and can
miss changes in DBT reconstruction or image processing that would improve
human observer performance; image features important to increased detection
performance by human observers should also influence the performance of the
anthropomorphic MO. Regardless of any concrete implementation, the DL
algorithms require a large enough dataset during the training stage. The next
phase of this work started from the following hypothesis: 594 DBT acquisitions
from 8 DBT systems, cropped into the required regions of interest for image
quality evaluation with more than half of them read by human readers, are
sufficient to drive the development of a DL based model observer. From the
human readings, 28994 training examples were selected, grouped as required
for a reading study and labelled with the human observer response. Two model
observers were then derived: a deep learning CHO (DL-CHO) and a ResNet18
based implementation. The impetus behind the first model observer was that the
classical search for channel functions relevant to all seven DBT models would
most likely have been very laborious, perhaps requiring a combination of
different channel functions, and it might not have been possible to generate the
channels analytically. Thus, a convolutional layer was used to act as a channel set, and
deep learning techniques were used to update this channel set in
accordance with the training examples from the humans. This resulted in a
channel set of 5 kernels applied in the conventional CHO procedure for the
classification of the phantom images for the comparison with humans. The
second model observer, based on the ResNet18 network, utilized a complete deep
learning network pre-trained on natural images and retrained for our purpose,
then applied to the same reading dataset as the first model observer for the
comparison with humans. For both of the DL classification methods, the
generalization test consisted of images from two DBT systems, included in the
reading dataset, but excluded from the training dataset. In this way, it was tested
how well the DL-CHO and ResNet18 would predict human results, if applied to
images from newly introduced DBT systems or with different reconstruction
algorithms that were never seen in training.
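The generalization test described above amounts to splitting the labelled examples by DBT system rather than at random. A minimal sketch, assuming each example is stored as a record with a system name (the field names and the records themselves are hypothetical):

```python
def split_by_system(examples, held_out_systems):
    """Split labelled examples so that held-out systems never appear in training."""
    held_out = set(held_out_systems)
    train = [ex for ex in examples if ex["system"] not in held_out]
    test = [ex for ex in examples if ex["system"] in held_out]
    return train, test

# Hypothetical records: system name, ROI identifier and human response.
examples = [
    {"system": "System A", "roi": 0, "human_correct": True},
    {"system": "System A", "roi": 1, "human_correct": False},
    {"system": "System B", "roi": 2, "human_correct": True},
    {"system": "System C", "roi": 3, "human_correct": True},
    {"system": "System D", "roi": 4, "human_correct": False},
]
train, test = split_by_system(examples, held_out_systems=["System C", "System D"])
print(len(train), "training examples,", len(test), "generalization examples")
```

Splitting at the level of the system, rather than the image, ensures that good test performance cannot be explained by the network having seen the noise texture or reconstruction of the tested system during training.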
With the CHO for the low-contrast lesion detection of the 3D structured phantom
well described and validated, the next step was to test its applicability in 2D
FFDM and in DBT with more realistic breast simulating images. For this purpose
the study with the virtual clinical trial images of the OPTIMAM project was
performed. Applying a CHO method to these images, rather than human reading, would
substantially improve the potential for DBT optimization. Along with this, a
model observer working in 2D full-field digital mammography would also be of
interest, to illustrate the potential added value of the (newer) DBT technique.
The images simulated with the OPTIMAM VCT framework consisted of two
datasets of DBT and 2D FFDM images with human observer results published
in the work of Elangovan et al. (Elangovan et al., 2018). The CHO model observer
was applied ‘as is’ to the DBT images and the results showed that even after
intensive tuning of the channel parameters, the model observer could not match
the human observer results. We were forced to conclude that a new CHO
methodology was needed to cope with the increased complexity of the VCT
images. For the 2D FFDM images the CHO was redesigned to work with just 2D
images, and for the 3D DBT VCT images a new volumetric CHO was developed
with a volumetric channel set. The results showed that the two proposed CHO
methods successfully estimated the human results and have the potential to be
used as a substitute in future VCT image quality studies.
These studies, performed for non-spiculated mass lesions, produced two clear
candidates for use in day-to-day medical physics practice. With the images
of the 3D structured phantom the DL-CHO produced the best performance, while
for a more complex background, volumetric channelization was a solution. The
overall conclusion of this part was that analytical channels, as described in
chapters 1 and 2, worked well under certain limited conditions, but required
extension before they could be applied successfully on all commercially available
DBT systems and in more complex backgrounds. While we are confident that the
DL-CHO or even a more basic CHO may work on other types of breast imaging
devices, this will always require transfer learning (DL) (Pratt, 1993) or tuning
and testing phases (CHO). This is not necessarily a limitation, as a human readout
or any performance test by human readers should also be performed during the
first commissioning or acceptance test of a device. These human reading results
can be used to find the desired channel set, which can be used later for system
consistency testing and periodic image quality tests. This makes it practically
feasible to generate separate channel sets, specifically for each imaging system.
For the purpose of system benchmarking and image quality evaluation over a
wider parameter range, the deep learning methods showed greater potential. As
the DL-CHO outperformed the ResNet18 observer in the anthropomorphic
aspect of the studies, this first algorithm would be our long-term candidate.
Beyond our results, theory suggests that a model observer with an analytically
defined template should generalize better than an algorithm in which the
template is derived from finite examples of training data. The former method
would always require fewer training images in order to derive a good estimate of
the observer template. In other words, the DL-CHO is expected to match human
performance over a wider range of image datasets: only the feature
extraction is estimated entirely from image examples, whereas the observer template,
although trained with images to obtain the first and second order statistics required for
the Hotelling observer, is known to be the optimal linear template. The
ResNet18 based observer generates a classifier that is optimal for the training
dataset, but it is not known whether this will be optimal for image datasets
acquired with different scanning parameters, as the classifier is purely empirical and
not based on a specific theory.
The model observers for low-contrast lesions were developed specifically for
non-spiculated masses from the L1 phantom developed by Cockmartin et al.
(Cockmartin et al., 2017) or for the images of the VCT framework (Elangovan et
al., 2014). The L1 phantom also includes a series of spiculated masses but, because
of the background construction, they have an unrealistically high contrast when
imaged in DBT mode. Most of the detectability scores were therefore close to
100%; this is easily obtained with a model observer but has little or no use with
respect to system optimization or benchmarking. In fact, these spiculated masses
have not been used for any specific study. The VCT images did not include
spiculated masses either, although in recent efforts the OPTIMAM team has
introduced spiculated masses. It is very difficult to make simulated spiculated
masses disappear in tomosynthesis. Truly realistic integration of these objects
into breast tissue probably requires breast simulating models with integrated
spiculated lesions, but these were not available during this study.

CALCIFICATION CLUSTERS
In chapters 3 and 4 a two-layer CHO algorithm was proposed for the detection of
calcification clusters in the images of the L1 structured phantom. Calcification
clusters differ markedly from mass lesions, both visually and in composition.
They usually consist of a number of tiny, high contrast specks (less than
0.5 mm in diameter), while mass lesions present as a single object with very
limited contrast and larger size. In the structured phantom, the calcification
clusters are formed from five calcification particles, arranged like the five spots
on the face of a die (a ‘quincunx’). The individual calcification particles are referred to as ‘calcs’
throughout this chapter. Given this, the detection task for a group of five calcs
differs from the detection of just a single calcification, as the observer should
estimate a single test statistic for the whole image, regardless of the number of
detected particles. This is challenging for the design of a model observer,
as the CHO usually has one sensitive spot set by the x-y position of the channels.
There were two possible approaches. The first is to use a channel set with a
large enough spatial extent that the channels can extract features from the cluster
as a whole, and to estimate the expected signal template with all calcs included in a
single effort. The second is to scan for each calc individually and
then combine the observer responses in a final test statistic. For our purpose, the
first approach is not feasible, as the location of all calcs is not known exactly and
the algorithm for ROI segmentation does not always succeed in centering the
clusters exactly within the ROI. A ‘miss’ would then be attributable to an incorrect
position rather than to the cluster not being detectable. We chose the second approach,
that of scanning, as the calc locations are known to within a certain area. For this
case the 2-layer CHO model observer was developed, in order to first determine
the most probable locations of the particles forming the cluster, and then assess
detectability at these positions. The first layer consisted of a CHO with efficient
LG channels that could successfully find the most probable locations of the 5 calcs
in each cluster. These locations were passed to the second layer, consisting of a
CHO with anthropomorphic Gabor channels to estimate the final image quality
figure of merit.
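The scanning logic of the two layers can be sketched as follows. This is a simplified illustration, not the algorithm of chapters 3 and 4: the two templates stand in for the trained LG-channel and Gabor-channel observer templates, both are assumed to have the same shape, the minimum separation between detected calcs is a hypothetical parameter, and summing the five second-layer responses is an assumed way of combining them.

```python
import numpy as np

def response_map(roi, template):
    """Layer 1 scan: cross-correlate the template with every position in the ROI."""
    windows = np.lib.stride_tricks.sliding_window_view(roi, template.shape)
    return np.einsum("ijkl,kl->ij", windows, template)

def top_k_locations(rmap, k=5, min_dist=3):
    """Pick the k highest responses, suppressing a neighbourhood around each pick."""
    rmap = rmap.astype(float).copy()
    locs = []
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(rmap), rmap.shape)
        locs.append((int(i), int(j)))
        rmap[max(0, i - min_dist):i + min_dist + 1,
             max(0, j - min_dist):j + min_dist + 1] = -np.inf
    return locs

def two_layer_statistic(roi, locate_template, score_template, k=5, min_dist=3):
    """Locate k candidate calcs (layer 1), then score them (layer 2) and sum."""
    locs = top_k_locations(response_map(roi, locate_template), k, min_dist)
    th, tw = score_template.shape
    scores = [np.sum(roi[i:i + th, j:j + tw] * score_template) for i, j in locs]
    return float(np.sum(scores)), locs

# Toy example: five single-pixel 'calcs' in a quincunx on an empty background.
roi = np.zeros((40, 40))
for p in [(10, 10), (10, 30), (20, 20), (30, 10), (30, 30)]:
    roi[p] = 1.0
template = np.array([[0.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 0.0]])
stat, locs = two_layer_statistic(roi, template, template)
print(stat, locs)
```

Each returned location is the top-left corner of the window in which the template is centred on a calc. Because the first layer searches the whole ROI, a cluster that is slightly off-centre after segmentation is still scored at the positions of its calcs.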
Initially the algorithm was developed for tomosynthesis and was validated
against human observers on Siemens Mammomat Inspiration, and the results
showed good correlation between the humans and the model observer
performance. This algorithm was later applied to images acquired with Siemens
Mammomat Revelation, Fujifilm AMULET Innovality, GE Senographe Pristina,
IMS Giotto Class and Hologic 3Dimensions DBT systems and successfully
matched the human observer results without any channel tuning. The only
change was simply to use the pixel size specific to the system, which in turn
changed the channel frequency and channel width automatically. Next, the 2-
layer CHO was redesigned to work for 2D FFDM images and it was tested for
correlation with human reader results acquired over five different x-ray tube
voltage levels. Due to the difference of the background properties between FFDM
and DBT images, the CHO algorithm required retuning of the second layer. The
final model observer showed good agreement with the human observer results.
The successful detection task generalization and reproducibility, seen from the
results in the DBT modality, show great potential for the 2-layer CHO approach
for future applications. The CHO for the 2D FFDM should be used with caution as
experience with this modality is more limited and the MO was constructed from
a limited number of images. Nevertheless, it is a good starting point for a future
study including more images.
A study was also performed with a model observer based on the ResNet18 network,
trained and tested on the calcification cluster images in a similar way as in chapter 6.
Because the ResNet18 network scans the complete images
via multiple convolutions with small kernels, the uncertainty in the calc positions
was not expected to have an impact on the observer performance. Nevertheless,
the results (data available, but not shown) indicated that this model observer
could not reach the human performance. This might be caused by the limited
amount of training data, in comparison to the large amount of DBT training data
available for the non-spiculated lesions.

FUTURE WORK, POTENTIAL IMPROVEMENTS AND OUTLOOK
The initial, practical request to propose a single model observer algorithm for a
complete image quality evaluation in breast imaging in fact remains unanswered.
With this thesis work, we have, however, provided some insight into what is
practically achievable for two breast imaging modalities, and are currently
investigating such methods applied to the synthetic mammograms calculated
from DBT stacks. The evidence from this study suggests that a single algorithm
for all conditions might be impossible to achieve, as different lesion types,
different imaging techniques, scanning protocols, reconstruction algorithms and
test objects probably require different approaches for the image quality
estimation and/or different feature extraction algorithms. This seems in line
with how human reading is performed in practice: masses are detected as such,
whereas calcification clusters may be detected via the largest calcs, one by one. The
fields of psychophysics and more general studies into image perception and
signal detectability by humans under wide ranging conditions are likely to
provide useful methods. This research is typically performed by means of eye
trackers. While many such studies have been performed for standard
mammography (2D) imaging, especially when used in screening regimes, the
study of DBT volume reading is still in its infancy. Once typical (3D) search
patterns have been described, a more dedicated MO could probably be developed.
Future improvements to the algorithms developed within the scope of this thesis
depend on the desired goals. We have addressed the problem of system
benchmarking and system consistency testing in DBT via the DL-CHO algorithm
for mass lesions and the 2-layer CHO for the calcification clusters. The
finalization of such algorithms for all, or most, currently available 2D FFDM
systems is likely to require a substantial number of 2D images with human
readout and algorithm fine tuning. The target signals and the background for
these studies have mostly been generated by the L1 phantom. While this
phantom was developed to compare 2D FFDM to DBT imaging, initial work
indicates that the phantom is also valuable for the evaluation of detection
performance in synthetic mammograms. Ultimately, one would like to compare
the detectability of specific objects for several (breast) imaging modalities, from
2D FFDM, DBT with the synthetic summary image, all the way up to breast CT,
contrast enhanced mammography, (3D) ultrasound and MRI. This is obviously a
huge challenge. The availability of virtual clinical trial (VCT) simulation chains
would at least enable a start to be made, and this is work in progress for a number
of groups (Elangovan et al., 2018; Maidment, 2014; Segars et al., 2008; U.S. Food
and Drug Administration, 2018). Comparison of DBT to classical (2D) ultrasound
for the same type of lesions would be very difficult due to the strong user
dependency in the selection of view/probe orientation and in which views of the
breast are documented by the operator during the examination. The systematic
scanning patterns used in 3D ultrasound make this a more
promising modality for evaluation in this regard. A large community of medical
physicists are also working on new breast phantoms that may be useful for many
modalities. A VCT approach may ultimately be much faster than getting all test
object details right and model observer methods are likely to play a large role in
the necessary task evaluation.
The methods explored in this thesis extend outside the domain of breast imaging.
Computed tomography applications are candidate number one. In this modality,
relatively high doses are given to the patients yet the intrinsic quality of the CT
scan generally remains untested. We have successfully participated in a
multicenter study which required the generation and application of a CHO
algorithm on a defined dataset of signal present and signal absent images (Ba et
al., 2018). Following this, a new phantom was designed and purchased, and extensive
testing of the CT scanners and associated scanning protocols with this phantom is
planned at our institution.
Interventional radiology, with complex moving structures such as linear
guidewires or stents, represents an even harder challenge. Development of model
observers for this modality is currently limited by the availability of dynamic test
objects containing relevant details. This could prompt the use of VCT methods;
however, all of the image processing, which plays a crucial role in interventional
radiology and cardiology imaging, must be included in the simulation platform
and access to this will remain a difficulty. If these problems can be overcome then
MOs will have a role to play in optimization, benchmarking studies and even in
routine quality control.
Our attempt to include DL techniques in the VCT could point the way to the next
step: taking CAD techniques on board. These techniques are themselves evolving
quickly, yet require extensive testing prior to use on patient data. In this thesis,
we have emphasized the need to design robust, unbiased methods, from the
selection of a large number of appropriate input images, through to algorithm
optimization and the final testing stage. Lessons learned here can now be applied
to the evaluation or even development of CAD algorithms.
It is our hope that more applications with model observers, possibly combined
with artificial intelligence tools, will be developed for improved image quality
evaluations and ultimately better patient care. With that aim, we have already
developed a course module on the design, development and use of model observers and
our software tools have been assembled in an easy-to-use toolbox.

REFERENCES
Abbey, C. K., & Barrett, H. H. (2001). Human- and model-observer performance
in ramp-spectrum noise: effects of regularization and object variability.
Journal of the Optical Society of America A, 18(3), 473–488.
Ba, A., Abbey, C. K., Baek, J., Han, M., Bouwman, R. W., Balta, C., Brankov, J.,
Massanes, F., Gifford, H. C., Hernandez‐Giron, I., Veldkamp, W. J. H., Petrov,
D., Marshall, N., Samuelson, F. W., Zeng, R., Solomon, J. B., Samei, E., Timberg,
P., Förnvik, H., … Bochud, F. O. (2018). Inter‐laboratory comparison of
channelized hotelling observer computation. Medical Physics, 45(7), 3019–
3030.
Brankov, J. G. (2013). Evaluation of the channelized Hotelling observer with an
internal-noise model in a train-test paradigm for cardiac SPECT defect
detection. Physics in Medicine and Biology, 58, 7159–7182.
Chen, M., Bowsher, J. E., Baydush, A. H., Gilland, K. L., DeLong, D. M., & Jaszczak, R.
J. (2002). Using the Hotelling observer on multislice and multiview
simulated SPECT myocardial images. IEEE Transactions on Nuclear Science,
49(3), 661–667.
Cockmartin, L., Marshall, N. W., Zhang, G., Lemmens, K., Shaheen, E., Ongeval, C.
Van, & Fredenberg, E. (2017). Design and application of a structured
phantom for detection performance comparison between breast
tomosynthesis and digital mammography. Physics in Medicine and Biology,
62(3), 15.
Elangovan, P., Mackenzie, A., Dance, D. R., Young, K. C., & Wells, K. (2018). Lesion
detectability in 2D-mammography and digital breast tomosynthesis using
different targets and observers. Physics in Medicine and Biology, 63(9), 1–
15.
Elangovan, P., Warren, L. M., Mackenzie, A., Rashidnasab, A., Diaz, O., Dance, D. R.,
Young, K. C., Bosmans, H., Strudley, C. J., & Wells, K. (2014). Development
and validation of a modelling framework for simulating 2D-mammography
and breast tomosynthesis images. Physics in Medicine and Biology, 59(15),
4275–4293.
Gallas, B. D., & Barrett, H. H. (2003). Validating the use of channels to estimate
the ideal linear observer. Journal of the Optical Society of America. A, Optics,
Image Science, and Vision, 20(9), 1725–1738.
Maidment, A. D. A. (2014). Virtual clinical trials for the assessment of novel
breast screening modalities. Lecture Notes in Computer Science (Including
Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 8539 LNCS, 1–8.
Petrov, D., Michielsen, K., Cockmartin, L., Zhang, G., & Young, K. (2016).
Development and application of a channelized Hotelling observer for DBT
optimization on structured background test images with mass simulating
targets. SPIE Medical Imaging, 9787, 1–9.
Pratt, L. Y. (1993). Discriminability-Based Transfer between Neural Networks.
Advances in Neural Information Processing Systems, 204–211.
Segars, W. P., Mahesh, M., Beck, T. J., Frey, E. C., & Tsui, B. M. W. (2008). Realistic
CT simulation using the 4D XCAT phantom. Medical Physics, 35(8), 3800–
3808.
U.S. Food and Drug Administration. (2018). VICTRE: Virtual Imaging Clinical
Trials for Regulatory Evaluation.

Summary

Samenvatting

Curriculum vitae

Dimitar Petrov was born on October 8, 1990 in Smolyan, Bulgaria. He
obtained a Bachelor's degree in Medical Physics at Sofia University, Sofia,
Bulgaria in 2013. A year later he obtained a Master's degree in Medical Physics.
During that time, he also worked as a medical physicist at the National
Centre of Radiobiology and Radiation Protection in Sofia, Bulgaria, in the
Department of Radiation Protection in Medical Exposure. In February 2015, he
started his PhD project at the Department of Radiology at KU Leuven, Leuven,
Belgium. The topic of the PhD project was the development of mathematical
model observers for optimization in breast imaging.

Peer reviewed articles

D. Petrov, N. Marshall, K. Young, H. Bosmans, “Systematic approach to a
channelized Hotelling model observer implementation for a physical phantom
containing mass-like lesions: Application to digital breast tomosynthesis”, Physica
Medica: European Journal of Medical Physics, Volume 58, 8-20 (2019)
D. Petrov, N. Marshall, K. Young, G. Zhang, H. Bosmans, “Model and human
observer reproducibility for detection of microcalcification clusters in digital
breast tomosynthesis images of three-dimensionally structured test object”, J. Med.
Imag. 6(1), 015503 (2019)
A. Ba, CK Abbey, K. Baek, M. Han, RW Bouwman, C. Balta, J. Brankov, F. Massanes,
HC Gifford, I. Hernandez-Giron, WJH Veldkamp, D. Petrov, N. Marshall, FW
Samuelson, R. Zeng, JB Solomon, E. Samei, P. Timberg, H. Fornvik, I. Reiser, L. Yu,
H. Gong, FO Bochud, “Inter-laboratory comparison of channelized Hotelling
observer computation”, Med. Phys. 45(7), 3019-3030 (2018)

Conference proceedings

D. Petrov, K. Michielsen, L. Cockmartin, G. Zhang, K. Young, N. Marshall, H.
Bosmans, “Development and application of a channelized Hotelling observer for
DBT optimization on structured background test images with mass simulating
targets”, Proc. SPIE 9787, Medical Imaging 2016: Image Perception, Observer
Performance, and Technology Assessment, 97871K (2016)
D. Petrov, L. Cockmartin, N. Marshall, L. Vancoillie, K. Young, H. Bosmans, “Real
space channelization for generic DBT system image quality evaluation with
channelized Hotelling observer”, Proc. SPIE 10136, Medical Imaging 2017: Image
Perception, Observer Performance, and Technology Assessment, 101360N
(2017)
D. Petrov, N. Marshall, L. Cockmartin, H. Bosmans, “Development and application
of a channelized Hotelling observer for DBT optimization on structured
background test images with mass simulating targets”, Proc. SPIE. 10718, 14th
International Workshop on Breast Imaging (2018)
KT Wigati, L. Vancoillie, E. Salomon, G. Zhang, L. Cockmartin, N. Marshall, H.
Bosmans, DS Soejoko, D. Petrov, “Channelized Hotelling observer assessing
microcalcification detectability on 2D mammography: a first application to study
the impact of tube voltage”, IOP Conf. Series: Journal of Physics: Conf. Series 1248
(2019) 012018
E. Salomon, F. Semturs, E. Unger, L. Cockmartin, D. Petrov, M. Figl, H. Bosmans,
J. Hummel, “Equivalent breast thickness and dose sensitivity of a next iteration 3D
structured breast phantom with lesion models”, Upcoming Proc. SPIE, Medical
Imaging 2020: Image Perception, Observer Performance, and Technology
Assessment (2020)
D. Petrov, N. Marshall, L. Vancoillie, L. Cockmartin, H. Bosmans, “Deep learning
channelized Hotelling observer for multi-vendor DBT system image quality
evaluation”, Upcoming Proc. SPIE, Medical Imaging 2020: Image Perception,
Observer Performance, and Technology Assessment (2020)
D. Petrov, N. Marshall, L. Vancoillie, L. Cockmartin, H. Bosmans,
“Anthropomorphic ResNet18 for multi-vendor DBT image quality evaluation”,
Upcoming Proc. SPIE, Medical Imaging 2020: Image Perception, Observer
Performance, and Technology Assessment (2020)
D. Petrov, N. Marshall, H. Bosmans, “Channelized Hotelling observer for multi-
vendor breast tomosynthesis image quality estimation: detection of calcification
clusters in an anthropomorphic phantom”, Upcoming Proc. SPIE, 15th
International Workshop on Breast Imaging (2020)
D. Petrov, H. Bosmans, N. Marshall, “Task-based artifact spread function
estimation in digital breast tomosynthesis using a structured phantom”, Upcoming
Proc. SPIE, 15th International Workshop on Breast Imaging (2020)
L. Vancoillie, D. Petrov, L. Cockmartin, N. Marshall, H. Bosmans, “Application of a
model observer for detection of lesions in synthetic mammograms”, Upcoming
Proc. SPIE, 15th International Workshop on Breast Imaging (2020)
