Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

Biomedical Signal Processing and Control 90 (2024) 105831

Contents lists available at ScienceDirect

Biomedical Signal Processing and Control


journal homepage: www.elsevier.com/locate/bspc

Analysis of the Clever Hans effect in COVID-19 detection using Chest X-Ray
images and Bayesian Deep Learning
Julián D. Arias-Londoño ∗, Juan I. Godino-Llorente
Department of Signals, Systems and Radiocommunications, ETSI Telecomunicación, Universidad Politécnica de Madrid, Av. Complutense, 30, 28040, Madrid, Spain

ARTICLE INFO ABSTRACT

Dataset link: https://github.com/jdariasl/COVI In recent months, the detection of COVID-19 from radiological images has become a topic of significant interest.
D_BayesianNET Several works have proposed different AI models to demonstrate the feasibility of the application. However, the
literature has also reported unwanted behaviours, spurious correlations, and biases of the developed systems
MSC:
68T05
that significantly limit their translation to the clinic. This paper deals with a set of interpretability techniques to
68T10 analyse spurious correlations during the inference, the consistency of the decisions, and the uncertainty of the
68U10 models, and evaluate the model’s performance in a broader and thoughtful way, especially regarding biasing
68W99 effects, aiming to provide new methodological cues that can increase the systems’ robustness. Two different
off-the-shelf convolutional neural networks (DenseNet-121 and EfficientNet-B6) were tested along with their
Keywords:
Bayesian counterparts. Different saliency maps are used to evaluate the effects of artifacts and confounding
Deep learning
Bayesian learning factors, and, taking advantage of uncertainty estimations, a new version of the importance of context measure
Explainability was proposed, to provide more evidence of the spurious correlation affecting models’ performance. In view of
Uncertainty the results, DenseNet is preferred in both its standard and Bayesian versions, reaching BAcc over 97% training
Calibration with a large data set (more than 70,000 images). However, results demonstrate that models are significantly
COVID-19 affected by the biasing effects, which is minimised by pre-processing with a semantic segmentation of the
Pneumonia lungs to guide the learning process towards areas with causal relationships with the problem under study. The
Radiological imaging conclusions could be extrapolated to the general context of pneumonia detection from chest RX.
Chest X-Ray

1. Introduction the references therein), especially for the last radiological modality.
Different DL architectures have been used for this purpose, reporting
Artificial Intelligence (AI) technologies have emerged as tools ca- classification accuracies of around 94% or even higher, categorising
pable of automatically identifying underlying clinical patterns and as controls, other bacterial/viral/fungal types of pneumonia, and SARS-
mechanisms for the search for new digital biomarkers [1]. With the CoV2 infection [3,4]. Most of these papers use off-the-shelf Convolu-
advent of the COVID-19 pandemic and in the search for rapid, more ob- tional Neural Networks (CNN), including ResNet-18 or ResNet-50 [5],
jective, accurate, and sensitive procedures that could complement the DenseNet-121 [6], VGG-16 or VGG-19 [7], Inception [8], Efficient-
diagnosis and evaluation of COVID-19, a research trend has emerged Net [9], MobileNet [10]; but there are also attempts using custom-made
that uses AI to evaluate thorax Computerised Tomography (CT) or architectures [4,11].
plain chest X-Ray (XR) images for automatic evaluation of the disease. Due to the sudden appearance of the pandemic, no standardised
The underlying hypothesis is that automatic models derived from ra- data sets are available, so every research work used a different corpus.
diological images using AI techniques can characterise the pneumonic The first published articles used data sets of scarcely a few hun-
states associated with COVID-19 even in asymptomatic populations [2]. dred images of patients with COVID-19 [12–14] and made critical
Furthermore, due to the rapid increase in contacts and hospitalisations, methodological mistakes due to duplicate samples and improper testing
radiologists and medical experts were inundated with a large number partitions, as discussed in [4,15]. With the continuous increase in
of images that needed analysis, creating an ideal scenario for deploying available data, the most recent and reliable work uses thousands of
AI solutions that help physicians by screening infected patients. images in their studies [16]. Yet, to the best of our knowledge, the most
The literature on the detection of COVID-19 using thorax CT or extensive reported data set was compiled and used in [4].
chest XR images and Deep Learning (DL) models is vast (see [3] and

∗ Corresponding author.
E-mail address: julian.arias@upm.es (J.D. Arias-Londoño).

https://doi.org/10.1016/j.bspc.2023.105831
Received 11 July 2023; Received in revised form 15 September 2023; Accepted 3 December 2023
Available online 6 December 2023
1746-8094/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Apart from the problem that data set size represents for the con- Regarding explainability techniques, in [21], LRP was adapted to
sistency and generalisation capabilities of the results [15], one of the work with the prototypical part network (ProtoPNet), which is a self-
main drawbacks of the existing literature is that it focusses mainly on explaining model that uses prototype patches of the training images
variations of the network architecture, with little or almost no detailed as core patterns to explain the decisions made by the model. The
attention to certain variability factors that might bias the results—such authors call their technique Prototypical Relevance Propagation (PRP)
as the image projection, the technology of the detector, the gender and and show how it is able to detect bias of the network decisions to-
age of the patient, etc. A deeper analysis of many proposed AI models wards textual annotations and how the models can act as hospital
for COVID-19 shows that purely model-centric approaches yield wrong detectors instead of disease detectors when there is a slight imbalance
conclusions. This is because AI models, especially overparametric DL in data sources. Consequently, the authors recommend ensuring label-
models, can correctly categorise images using patterns in the image that balancing while multi-source data sets are used but did not provide
are not necessarily associated with anatomical or functional compo- any methodological contribution to avoid the observed bias. Further-
nents related to the disease [4]. As discussed in [17,18], this behaviour more, the authors in [28] showed that ProtoPNet, on which PRP
is not unique to medical image processing applications, but is particu- relies, presents serious limitations in providing meaningful prototypical
larly problematic for medical diagnosis, since it reflects that the model explanations when tested with adversarial samples.
learns the underlying specific characteristics of the data set rather than Although the former discussed works have addressed part of the
the patterns associated with the disease under study. In other words, discussion of bias and spurious correlations in the context of COVID-19
the model learns spurious correlations that help it make correct predic- detection, the most recent work has turned a blind eye to this problem
tions, but without a causal relationship with the expected phenomenon and continues to compare new approaches and models mainly in terms
under analysis. Lapuschkin et al. [17] call this behaviour the Clever of accuracy, only including uncertainty estimates of model decisions in
Hans effect (CHE) in analogy to a psychology phenomenon used to the best cases [29]. Although the incorporation of uncertainty estimates
describe ‘‘when an animal or a person senses what someone wants them has particular relevance in the biomedical context, there is also evi-
dence that the CHE can even exist with ‘‘high certainty’’ [30]. Beyond
to do, even though they are not deliberately being given signals’’ [19]. For
research work in this field of application, the lack of reliability is one
their part, Carter et al. [18] call this phenomenon ‘‘overinterpretation’’
of the leading causes of the limited clinical translation of AI-based
and define it as ‘‘confident predictions made by algorithms based on details
solutions, as it undermines the confidence of medical professionals
that do not make sense to humans’’, such as random patterns or image
in their use [31]. Therefore, a better dependability analysis must be
borders. Consequently, many AI models fail to perform adequately once
performed, along with the classical performance metrics, to assess the
deployed in the real world, as these spurious or artifactual correlations
real performance of AI-based systems supporting medical decisions.
may not be present [17]. Unexpectedly, although the combination of
In this context, this work makes additional efforts to provide in-
multiple data sets and image resources can, in principle, satisfy the
terpretability to DL model’s outcomes in the context of COVID-19
data requirements of DL models’ proper training procedures, it can also
detection, by analysing the consistency of the decisions, the uncer-
induce some of those unwanted behaviours, mainly when each source
tainty of the models, their performance, the appearance of spurious
exclusively contains labelled samples from a single class [20] or there
correlations during the inference, and certain biasing effects. These
is a strong label imbalance between source domains [21]. It has been
analyses contribute to the establishment of a more robust standardised
demonstrated that in those cases, the DL models use shortcut learning
methodology that provides better support for the results of AI systems
strategies capable of reducing the training loss function by learning
in biomedical contexts and ultimately to translate these technologies
characteristics associated with the data set but not necessarily with the into the medical practise.
phenomena under analysis [22,23]. For comparison purposes, two different off-the-shelf CNN models
Facing all these problems has become a new trend in biomedical are used in this work: DenseNet-121 and EfficientNet-B6. These have
image analysis [24], but as [3] clearly stated, few works in the context been integrated into several previous works that address the same prob-
of COVID-19 detection have paid attention to the issues presented and lem. Additionally, Bayesian variants of these models were trained using
made explicit efforts to build reliable models. Among the works that variational layers with reparameterised Monte Carlo estimators [32]
have investigated further bias factors in the detection of COVID-19 and the Bayes by Backprop learning algorithm. From the Bayesian ver-
using CNN-type models, the primary strategy introduced to support sion of the models, and to evaluate the consistency of the decisions, this
the models’ predictions and/or evidence inconsistencies in their de- work also analyses the calibration curves of the predicted score vs. the
cisions, has been the use of posthoc interpretability analyses based correct class and the uncertainty vs. the correct class. Also, following
on saliency maps obtained from several techniques such as Gradient- a similar analysis as in [30], the uncertainty of two interpretability
weighted Class Activation Mapping [4,25] and Layer-wise relevant maps, LRP and Deep Learning Important FeaTures (DeepLIFT) [33], is
propagation (LRP) [26]. This technique has allowed researchers to analysed to determine to what extent artifacts in the image affect the
identify that often, the DL models interpret as essential those regions models’ performance. A variant of the importance of the context index
corresponding to the breastbone, clavicle, stomach, or burned-in an- presented in [17] is proposed to, under certain assumptions, assess the
notations introduced by the X-Ray device. To avoid some of these level of bias in COVID-19 detection models considering uncertainty
confounding factors, in [26], burned-in text annotations appearing on estimations of interpretability maps.
the images were manually covered with black rectangles. The results The rest of the paper is organised as follows. Section 2 introduces
showed that the black rectangles can reduce the model’s bias to spe- the material and methods used in this paper. Section 3 mainly describes
cific annotations, but create artifacts to which the models still pay the results and an analysis of the different experiments. And Section 4
attention. Consequently, the most extended methodological decision ends with a discussion and conclusions.
is to integrate a lung segmentation preprocessing step into the whole
detection problem [4], which induces the model to learn relevant 2. Materials and methods
features belonging only to the lung area, yet relies on the performance
of the lung segmenter. Some works have introduced different strategies This section presents the data set used for the experimental phase,
for estimating uncertainty of the models’ predictions [27] as a way to the pre-processing techniques, and the methods developed. Fig. 1
improve the confidence in the models’ decisions, but they did not anal- shows a schematic of the proposed methodology to help the reader
yse the effects of image artifacts and decisions biases in the uncertainty identify each of the components and how they are used throughout the
estimations. process.

2
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 1. General scheme of the methodology.

Table 1 all these variability factors, as well as several domains represented by


Number of images per class for training and testing subsets.
several data sets, is intentionally done to ensure a generalisation ability
Subset Control Pneumonia COVID-19 and to minimise spurious correlations.
Training 45 022 21 707 7716 Only AP and PA views were selected. No differentiation was made
Testing 4961 2407 857
between erect, either standing or sitting, or decubitus. This information
was inferred by carefully analysing the DICOM tags and required a
manual check due to certain labelling errors.
2.1. Corpus A detailed description of the corpus is presented in [4].

The corpus used for the experimental phase was compiled from data 2.2. Pre-processing
belonging to different public repositories. It contains images of patients
with COVID-19, pneumonia, and without any observable pathology (no XR images were resized to 224 × 224 pixels, encoded with 16 bits,
findings or controls). Table 1 reports the number of images per subset pre-processed using DICOM WindowCenter and WindowWidth details
and class. In general terms, the corpus is built around more than 80,000 (when needed), and converted to a Monochrome 2 photometric interpre-
XR images, including more than 8500 patients with COVID-19. tation. The XR images were also converted and stored as uncompressed
In all cases, the annotations were made by a specialist, as indicated greyscale ’.png’ files.
by the authors of the repositories. For comparison purposes, as in [4,42], three different pre-processing
The COVID-19 class is built around three open data collection
schemes (Fig. 2) were evaluated.
initiatives: HM Hospitales COVID-19 data set, [34], BIMCV-COVID-19
The first scheme applies a simple image equalisation using His-
(1st and 2nd iterations) [35] and Actualmed COVID-19 [36]. The final
togram Equalisation (HE) and Contrast Limited Adaptive Histogram
result of this compilation process is a subset of 8573 XR images from
Equalisation (CLAHE) (Fig. 2Left) to the raw XR image.
more than 3600 patients at different stages of the disease. Only patients
The second scheme involves zooming and cropping to the rectan-
with at least one positive PCR test or positive immunological test for
SARS-CoV-2 were included in the COVID-19 class. gular region of interest (RRoI) containing the lungs, and performing
The pneumonia class was compiled from ChestX-Ray8, [37], MIMIC- an equalisation like the one used in the first scheme (Fig. 2.Centre)
CXR [38], the Montgomery set [39,40], CheXpert [41] and the China- following the methods presented in [4]. This method removes the
Shenzhen set [39]. The compilation process led to a subset of 24,114 background areas in the image, and the burned-in text annotations,
XR images. which are well known to be confounding factors [20].
And finally, the control class was compiled from the ChestX-Ray8 The third scheme identifies the region of interest associated with the
[37], Actualmed COVID-19 [36], Montgomery set [39,40], and the lungs (LRoI) by semantic segmentation using a U-Net architecture,1 and
China-Shenzhen data sets [39]. The final result of this compilation crops to the RRoI. This last method extracts an external black mask, to
process is a subset of 49,938 XR images. later perform an equalisation only on non-zero pixels (Fig. 2Right).
Table 2 summarises the most significant characteristics of the data These schemes aim to decrease the variability of the data, simplify
sets used. As shown, the corpus has several variability factors that the network training process, and focus on the lungs’ region while
might influence the results, such as the type of projection (Posterior–
Anterior [PA] and Anterior–Posterior [AP]), and the type of detector
(Computed Radiography [CR] or Digital Radiography [DX]). The cor- 1
Following the methods available at https://github.com/imlab-uiip/lung-
pus is reasonably well balanced according to gender and age. Including segmentation-2d.

3
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Table 2
Demographic data of the data sets used. Column C-19 corresponds to the COVID-19, Pn to Pneumonia, and Co to Control
class.
Mean age ± std # Males/# Females # Images AP/PA DX/CR C-19 Pn Co
HM Hospitales [34] 67,8 ± 15,7 3703/1857a 5560 5018/542 1264/4296 Y N N
BIMCV [35] 62,4 ± 16,7 1527/1486b 3013 1171/1217 1145/1868 Y N N
ACT [36] – – 188 30/155 126/59 Y N Y
ChinaSet [39] 35,4 ± 14,8 449/213 662 0/662 662/0 N Y Y
Montgomery [39,40] 51,9 ± 2,41 63/74 138 0/138 0/138 N Y Y
CRX8 [37] 45,75 ± 16,83 34760/27030 61 790 21860/39930 61790/0 N Y Y
CheXpert [41] 62.38 ± 18,62 2697/1926 4623 3432/1191 – N Y N
MIMIC [38] – – 16 399 10850/5549 – N Y N
a
1377/929 patients.
b
727/626 patients.

Fig. 2. Examples of the three pre-processing schemes applied to a sample taken from the Actualmed COVID-19 data set [36] Left: Equalised raw image (equalisation). Centre:
Cropped image (cropping). Right: Semantic segmentation of the lungs (segmentation).

ensuring the comparability of the experiments. From now on, these DenseNet obtains additional inputs from all preceding layers and passes
three pre-processing strategies will be called, respectively, equalisation, on its own feature maps to all subsequent layers. Therefore, the 𝓁-th
cropping, and segmentation. layer receives 𝓁 inputs, consisting of the previous convolutional blocks’
feature maps. Although the input to the layers increases, preserving
2.3. Models information makes the required number of filters and parameters in
each layer significantly smaller than in ResNets [45].
This work uses two standard CNN models to analyse the effects of This work uses the 121 version of DenseNet (the smallest pro-
artifacts and spurious correlations during the inference phase for classi- posed in [45]), which corresponds to a network with 121 layers in
fication among control, pneumonia, and COVID-19 samples. The mod- total divided into four dense blocks plus three transition layers. The
els used are DenseNet-121 [43] and EfficientNet-B6 [44], which have 1000-dimensional fully connected output layer was replaced by a three-
previously been used for the detection of COVID-19 with competitive dimensional layer according to the number of classes in the data set (see
results [6,9]. Section 2.1).

2.3.1. DenseNet-121 2.3.2. EfficientNet-B6


DenseNet-121 is a CNN explicitly designed for image classification, EfficientNet-B6 is a deep CNN first introduced in [44]. It has been
which was first introduced in [43]. Its key innovation is its dense shown to achieve state-of-the-art results in several benchmark data sets
connectivity pattern, where each layer receives inputs from all pre- for image classification.
ceding layers in a feed-forward fashion. This results in a very deep EfficientNet uses a method to optimise a CNN architecture by
neural network with relatively few parameters, being more efficient scaling up the depth (𝑑), width (𝑤), and resolution (𝑟) of the network
than traditional deep neural networks. DenseNet-121 has achieved simultaneously. This approach allows the network to improve perfor-
state-of-the-art performance in a variety of image classification tasks. mance with fewer parameters. The main problem with this approach
This architecture was designed to solve the problem of input infor- is that the optimal values of 𝑑, 𝑟 and 𝑤 depend on each other and are
mation (or gradient) vanishing, also called the ‘‘washout’’ problem. In limited by resource constraints. This problem is addressed in Efficient-
networks of practical size, information must traverse through multiple Net by proposing a compound scaling method to uniformly scale the
layers before reaching the end from the beginning, being difficult to network 𝑑, 𝑟, and 𝑤 in a principled way using a compound coefficient
maintain the flow of information and gradients. The strategy followed 𝜙 shared throughout the formulation of all parameters under analysis.
to overcome this problem involves incorporating residual connections EfficientNet uses a baseline architecture composed of 9 stages
between subsequent layers [45], allowing signals to bypass each layer, and whose main building blocks were mobile inverted bottleneck
proceeding directly to the next via identity connections, thus allowing layers [46]. The selection of 𝑑, 𝑟, and 𝑤 was carried out for values of 𝜙
for better flow of information and gradients. In addition, to ensure in the interval [0, 7] and subject to restrictions of memory and FLOPS,
maximum information flow between layers, DenseNet connects all resulting in networks with a wide range of sizes and performance.
layers directly with each other, ensuring maximum information flow EfficientNet-B6 corresponds to the optimised network for 𝜙 = 6. Its per-
between layers in the network [43]. In contrast to the summation formance on ImageNet [47] was comparable to that of EfficientNet-B7
of consecutive layers’ feature maps suggested in [45], each layer in but with considerably fewer parameters.

4
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

EfficientNet-B6 has 528 layers in total and is based on a combina- 2.4. Model interpretability methods
tion of convolutional, pooling, and squeeze-and-excitation blocks. The
convolutional layers perform feature extraction, while pooling layers Due to the success of DL models in various fields and applications,
downsample the spatial dimensions of feature maps. The squeeze- in recent years, many efforts have been made to develop methods and
and-excitation blocks focus on channel-wise feature recalibration to tools to explain the decisions made by this type of models [53]. Accord-
enhance the representation of important features and suppress irrele- ing to [54], there are different approaches to building explainable AI
vant ones. In our case, the last fully connected layer was replaced by a methods, including understandability, comprehensibility, interpretabil-
ity, and transparency. The most explored are model explainability or
three-dimensional layer according to the number of classes in the data
interpretability methods, which are intended to explain or provide
set.
meaning to model predictions [55].
Typically, model explainability methods study the problem of as-
2.3.3. Stochastic variational versions of the models signing an attribution value to each input feature of the model. For
cases where the input is an image, the set of attributions can be rear-
Typical neural networks for regression or classification tasks are
ranged to have the same shape as the input sample, and the resulting
trained using the Maximum Likelihood (ML) principle (or Maximum a
image is called an attribution map [53]. For a classification problem,
posteriori [MAP] in the case of 𝐿1 , or 𝐿2 regularised loss functions).
the attribution value corresponds to the relevance or contribution that
Given a data set  = {𝐱𝑖 , 𝑦𝑖 }𝑁 𝑖=1
of 𝑁 training samples, where 𝐱𝑖 one particular feature (or neuron) 𝑗 makes for the model to decide
represents the input and 𝑦𝑖 is the corresponding output, the ML criterion the correct class 𝑐. The attribution map is then represented by the
provides only point estimations of the weights of the connections of the array 𝐑𝑐 = [𝑅𝑐1 , 𝑅𝑐2 , … , 𝑅𝑐𝑗 , … , 𝑅𝑐𝑛 ], where 𝑛 is the total number of input

network, w, maximising the likelihood 𝑝(|𝐰) = (𝐱𝑖 ,𝑦𝑖 )∈ 𝑝(𝑦𝑖 |𝐱𝑖 , 𝐰). features.
Networks trained in this way are then prone to overfitting and inca- There are many methods to estimate the attribution maps from CNN
pable of correctly assessing the uncertainty of their decisions; therefore, predictions; however, due to the lack of compatibility with different
they often make overly confident decisions about the correct class or model architectures, these methods are difficult to compare and eval-
prediction, which is particularly problematic for some applications, uate empirically [53]. This work used two methods to evaluate and
such as those in security or biomedical contexts. Contrary to ML, compare the effects of spurious correlations on model predictions. The
Bayesian learning (BL) imposes a prior distribution 𝑝(𝐰) on the parame- methods were selected based on their widespread use, as well as their
ters of the models and estimates a posterior distribution of the parame- sensitivity and completeness properties [30,53,56].
ters given the data 𝑝(𝐰|) [48]. The posterior predictive distribution of
𝑦 given a test input 𝐱 is then given by 𝑝(𝑦|𝐱, ) = ∫ 𝑝(𝑦|𝐱, 𝐰)𝑝(𝐰|)𝑑𝐰. 2.4.1. Layer-wise relevance propagation
This approach makes models more robust to overfitting, performs better LRP is an explanation technique that assumes a conservation law
through the network layers to establish the contribution of the input
when trained from small data sets [49], and gives the model the ability
features to the network output [57]. In other words, LRP splits the
to quantify the uncertainty of predictions [50].
relevance of the input features in proportion to the contribution of
Unfortunately, taking the expectation under the posterior distribu- each input to the neuron activation of the correct class 𝑐. To do that,
tion on weights is equivalent to using an ensemble of an uncount- LRP backpropagates the relevance scores of the output layer backward
ably infinite number of neural networks, which is intractable in prac- onto neurons of the lower layer are achieved. Formally, let 𝐱𝑖 be an
tice [32]. Therefore, several approximations have been proposed to input sample, and 𝑓𝑐 (𝐱𝑖 ) the corresponding network’s output for the
estimate the model’s posterior [32,50]. For neural networks, the vari- neuron associated with class 𝑐. Relevance scores at the input layer are
ational approximation to the Bayesian posterior distribution on the estimated by applying the rule [57]:
weights is the most used approach [50], which relies on a cost function ∑ 𝑐,(𝑙+1) ∑ 𝑐,(𝑙) ∑ 𝑐,(1)
𝑓𝑐 (𝐱𝑖 ) = ⋯ 𝑅𝑗 = 𝑅𝑗 = ⋯ = 𝑅𝑗 (2)
known as the expected lower bound given by:
𝑗∈𝑙+1 𝑗∈𝑙 𝑗∈𝑛

 (, 𝜃) = KL[𝑞(𝐰|𝜃) ∥ 𝑝(𝐰)] − E𝑞(𝐰|𝜃) [log 𝑝(|𝐰)] (1) where =𝑅𝑐𝑗 𝑅𝑐,(1)
represents the relevance of the input feature (or
𝑗
neuron) 𝑗 which, for the 𝑙th layer, is estimated as [58]:
where KL is the Kullback–Leibler divergence between an approxi-
mate posterior distribution 𝑞(𝐰|𝜃) and the prior distribution 𝑝(𝐰). 𝜃 ∑ 𝑧𝑙+1
𝑗𝑘
𝑅𝑐,(𝑙) = ∑ 𝑙+1 𝑅𝑘
𝑐,(𝑙+1)
(3)
represents the set of hyperparameters that characterise the posterior 𝑗
𝑘∈𝑙+1 𝑗 ′ 𝑧𝑗 ′ 𝑘
distribution 𝑞(⋅) and must be set during the learning process. The KL
term acts as a regularisation term that controls the complexity of the being 𝑧𝑙+1
𝑗′ 𝑘
the contribution of the neurone 𝑗 to the activation of the
model. The second term in Eq. (1) corresponds to the prediction error. neurone 𝑘 in the 𝑙 + 1 layer and in turn to make it relevant. The
The exact estimation of the KL term is computationally expensive for denominator in Eq. (3) enforces the conservation property. Besides,
training neural networks, so, in this article, the Bayes by backpro there exist stabilising modifications of Eq. (3) that can be consulted
algorithm [32], which is based on the reparametrisation trick [51], is in [58].
used. Thus, we created Bayesian versions of the DenseNet (BDenseNet)
2.4.2. Deep learning important features
and EfficientNet (BEfficientNet) networks following the stochastic vari-
Similarly to LRP, DeepLIFT is a method to decompose the out-
ational implementation in [52], which uses weight priors based on
put prediction by backpropagating the contributions of all neurons
isotropic Gaussians. Furthermore, according to the proposal for KL
back to the input features, but instead of other backpropagation-based
re-weighting in mini-batches presented in [32], an alternative loss
approaches, at each step, DeepLIFT compares the activation of each
function 𝑘 (𝑘 , 𝜃) was defined. It weights the KL term in Eq. (1) so that neuron with its ‘reference activation’ obtained by forward propagating
the complexity cost more influences the first few mini-batches, whereas the network using a ‘reference’ input sample [33]. Using this difference-
the data largely influences the later mini-batches. The coefficient that from-reference approach, DeepLIFT is able to propagate a relevance
𝐵−𝑘
multiplies the KL term is given by 𝜋𝑘 = 22𝐵 −1 , where 𝐵 is the number score even in situations where the gradient is zero. The reference input
of mini-batches, and 𝑘 represents the index that identifies the mini- must be selected according to the problem being addressed; in image
batch that is being processed. During inference, the posterior predictive processing applications, a black image is typically used as a reference.
distribution is estimated using Monte Carlo sampling. For DeepLIFT, the relevance score in the output layer is estimated

5
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

as 𝑅𝑐,𝐿 = 𝑓𝑐 (𝐱𝑖 ) − 𝑓𝑐 (𝐱),


̄ where 𝐱̄ represents the reference input. The Table 3
Performance of the different model architectures for the three different image
backpropagation rule, in this case, is given by [53],
pre-processing strategies (equalisation, cropping, and segmentation).
∑ 𝑧𝑙+1
𝑗𝑘
− 𝑧̄ 𝑙+1
𝑗𝑘 Pre-processing Metric Model architecture
𝑅𝑐,(𝑙)
𝑗 = ∑ 𝑙+1 𝑅𝑐,(𝑙+1) (4)
𝑙+1 𝑘 DenseNet EfficientNet BDenseNet BEfficientNet
𝑘∈𝑙+1 ′ 𝑧
𝑗 𝑗′ 𝑘 − 𝑧̄ 𝑗′ 𝑘
PPV 98.56 ± 0.62a 99.14 ± 0.57 97.20 ± 0.77 92.36 ± 2.78
where 𝑧̄ 𝑙+1
𝑗′ 𝑘
is the contribution of the neuron 𝑗 to the activation of the Recall 98.95 ± 0.21 98.99 ± 0.22 97.11 ± 1.63 93.37 ± 3.28
Equalisation
neuron 𝑘 in the 𝑙 + 1 layer when the baseline 𝐱̄ is fed into the network. F1 98.75 ± 0.21 99.07 ± 0.37 97.15 ± 1.20 92.79 ± 1.78
BAcc 98.95 98.99 97.11 93.37
This work uses the LRP and DeepLIFT implementations available
at [59]. And following the approach proposed in [30], the uncertainty PPV 97.71 ± 0.51 97.43 ± 0.72 94.54 ± 1.99 81.19 ± 8.63
Recall 97.13 ± 1.68 97.08 ± 1.55 95.01 ± 1.17 82.54 ± 5.42
of the interpretability maps provided by LRP and DeepLIFT (B-LRP and Cropping
F1 97.41 ± 1.06 97.25 ± 1.08 94.77 ± 1.49 81.79 ± 6.94
B-DeepLIFT for LRP and DeepLIFT methods, respectively) is analysed BAcc 97.13 97.08 95.01 82.54
to determine to what extent elements outside the RRoI or LRoI affect PPV 98.47 ± 0.56 98.32 ± 0.78 95.66 ± 1.74 80.89 ± 7.61
the performance of the models and their certainty with respect to the Recall 98.51 ± 0.42 98.30 ± 0.56 96.37 ± 1.95 80.92 ± 7.15
Segmentation
relevance of the input characteristics. However, instead of visually F1 98.49 ± 0.47 98.30 ± 0.49 95.99 ± 1.07 80.90 ± 7.35
inspecting the uncertainty of the attribution maps for some samples BAcc 98.51 98.30 96.37 80.92
(as in [30]), this work uses them to estimate a global measure of a mean ± std.
performance (see Section 3.1 for details).

3. Experiments and results so, an operator 𝛼 ({𝐑𝑐𝑚 }) = 𝛼 ({[𝑅𝑐1 , … , 𝑅𝑐𝑗 , … , 𝑅𝑐𝑛 ]𝑚 }) = 𝛼 ({𝑅𝑐𝑚,𝑗 }) is
used to calculate the entry percentile of a set {𝐑𝑐𝑚 }𝑀 𝑚=1
of the attribution
This section presents the experimental setup and the main results of maps obtained from 𝑀 Monte Carlo samples from a Bayesian model.
the experiments carried out. With this in mind, IoC𝛼 is defined as
𝛼 |−1 ∑
|𝑃𝑜𝑢𝑡 𝑐 })
𝛼 𝛼 ({𝑅
3.1. Experimental setup 𝑞∈𝑃𝑜𝑢𝑡 𝑚,𝑞
IoC𝛼 (𝐱𝑖 ) = ∑ (5)
−1
|𝑃𝑖𝑛𝛼 | 𝑝∈𝑃 𝛼 𝛼 ({𝑅 𝑐 })
𝑚,𝑝
𝑖𝑛
Models were trained using the data set described in Section 2.1. A
5-fold cross-validation strategy was used to evaluate the performance of where 𝑃𝑖𝑛𝛼 is the set of pixels inside the LRoI with positive relevance
𝛼 is composed of pixels outside the LRoI
scores in the 𝛼 percentile, 𝑃𝑜𝑢𝑡
the systems. Experiments were carried out to establish the performance
of the models in the classification of the three classes: control, pneumo- with positive relevance scores in the 𝛼 percentile, and 𝑅𝑐𝑚,𝑗 is the
nia, and COVID-19. The original versions of DenseNet and EfficientNet relevance of the pixel 𝑗 in the attribution map of the 𝑚th Monte Carlo
were trained using an Adam optimiser with 40 epochs, a learning rate sampling of the Bayesian model evaluated in the input sample 𝐱𝑖 .
of 2−5 , and a weight decay of 1−7 . Similar to [4], data augmentation
for pneumonia and COVID-19 classes was leveraged with the following 3.2. Results
augmentation types: horizontal flip, Gaussian noise with a variance
of 0.015, rotation, elastic deformation, and scaling. COVID-19 and Table 3 summarises the main performance results for the three pre-
Pneumonia classes are oversampled to compensate for data imbalance, processing strategies considered and for both versions of DenseNet and
and data augmentation is applied more frequently to those classes (80% EfficientNet (i.e., the frequentist (standard) and the Bayesian counter-
vs. 50% to control). All operations are applied with a probability of 0.5. part). In the frequentist versions, EfficinetNet provides better results
Bayesian versions of the models, namely BDenseNet and BEfficientNet than DenseNet, reaching a BAcc of 98.95%, while DenseNet performed
were trained using the same optimiser configuration but for 80 epochs, better than EfficientNet for the Bayesian counterpart. In any case, the
since these models require more iterations to converge. Training and standard version of both architectures provides slightly better results
simulations were performed in a cluster with four Nvidia RTX 3090 according to all calculated metrics. These results are also supported by
GPUs of 24 GB each. The performance of the models was evaluated in Figs. 3 and 8, which show the confusion matrices for the frequentist
terms of Balanced Accuracy (BAcc), Positive Predictive Value (PPV), and Bayesian models for each of the three pre-processing schemes
Recall, F1-score, calibration curves, and confusion matrices. Moreover, evaluated, and for the three classes considered, namely: control, pneu-
following the evaluation proposals in [60], the Maximum Calibration monia, and COVID-19. Table 3 shows that except for BEfficientNet,
Error (MCE) was included along with the calibration curves. This the pre-processing with the segmentation approach provides results
summary calibration statistic is preferred in those high-risk applications comparable to those using the raw images with just an equalisation,
where reliable confidence measures are absolutely necessary, such as in which indicates that selected CNN networks, specially DenseNet-121,
medical contexts. can achieve similar performance when the training is guided to regions
To quantify how much networks pay attention to artifacts in the of the image where a physiological interpretation of the inference
image, the LRoI is segmented and used to approximate the impor- process can be enhanced.
tance of context metric (IoC) [17]. Although the characteristics that Since the segmentation pre-processing approach delineates the ex-
define the presence of pneumonia or COVID-19 cannot be associated ternal shape of the lungs, there is a certain risk that the system makes
with the entire lung area, any relevance assigned by the network to decisions based on differences in their external shape, as was identified
pixels outside this region can be considered an indicator of spurious in [20]. These differences may be induced due to the position of the
correlations. IoC was estimated only for the first two pre-processing patient, the sex, the projection, or the pathology itself. To evaluate
schemes described in Section 2.2 (i.e., equalisation and cropping), this extent, Fig. 4 shows, for explorations that were pre-processed using
as estimating the IoC after semantic lung segmentation (third pre- the third approach (which crops to the RRoI and performs a semantic
processing scheme) is meaningless. Furthermore, as suggested in [30], segmentation), the average XR images extracted from all samples be-
uncertainty estimations are calculated from the interpretability maps longing to each of the three classes considered: control, pneumonia, and
provided by Bayesian models by defining a new metric IoC𝛼 (IoC in the COVID-19. These images show an almost identical external shape for
𝛼 percentile), so that if a pixel not belonging to the LRoI has positive the consequent classes, thus suggesting that the shape would not be a
relevance in the 𝛼th percentile 100(1 − 𝛼)% of the attribution maps confounding factor. This is mainly attributed to the process of cropping
samples, the model incorrectly considers it as strongly relevant. To do to the RRoI, which aligns and resizes the lungs.

6
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 3. Confusion matrices and calibration curves for the two base model architectures (EfficientNet and DenseNet) and the three pre-processing schemes considered (equalisation,
cropping, and segmentation).

Fig. 4. Average images for the training set using the segmentation procedure for each class. Left: Control. Centre: Pneumonia. Right: COVID-19.

7
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Table 4
Performance of the different model architectures using the pre-processing strategies considered during the validation phase, occluding the
external parts out of the lungs area.
Training pre-processing Testing pre-processing Metric Model architecture
DenseNet EfficientNet BDenseNet BEfficientNet
PPV 73.11 ± 21.47 58.39 ± 32.11 67.88 ± 21.71 59.74 ± 0.31
Recall 85.35 ± 9.15 63.15 ± 29.92 79.99 ± 13.82 57.63 ± 0.27
Cropping
F1 75.52 ± 10.03 44.70 ± 12.41 69.36 ± 9.10 43.79 ± 12.71
BAcc 85.35 63.15 79.99 57.63
Equalisation
PPV 45.70 ± 26.15 60.26 ± 31.16 66.95 ± 13.08 67.00 ± 36.25
Recall 53.56 ± 21.44 42.60 ± 19.38 67.46 ± 13.88 51.84 ± 38.55
Segmentation
F1 41.53 ± 10.64 42.60 ± 19.38 66.71 ± 11.99 33.45 ± 28.94
BAcc 53.56 56.66 67.46 51.84
PPV 61.15 ± 27.57 56.89 ± 27.71 58.51 ± 25.75 64.91 ± 28.77
Recall 51.55 ± 37.39 61.05 ± 42.73 63.42 ± 40.90 61.49 ± 28.07
Cropping Segmentation
F1 38.05 ± 16.72 35.90 ± 24.01 41.00 ± 21.24 50.60 ± 19.10
BAcc 51.55 61.05 63.42 61.49

To evaluate the behaviour of the models in terms of the attribution Table 5


Importance of Context (IoC) estimated for the two base architectures and for each class.
assigned to pixels in the images, Fig. 5 shows the attribution maps
Only the correctly classified XR images are included for estimating the IoC.
(calculated for DenseNet and EfficientNet using the LRP and DeepLIFT
Pre-processing Model Attribution map Class
methods, and for the equalisation and cropping pre-processing strate-
Control Pneumonia COVID-19
gies) of a control and a pneumonia patient. Furthermore, Fig. 6 shows
the attribution maps for one control and a patient with COVID-19. Attri- LRP 1.16 ± 0.20 1.11 ± 0.18 1.31 ± 0.22
DenseNet
DeepLIFT 1.25 ± 0.27 1.01 ± 0.20 1.45 ± 0.29
bution maps are generalised approaches to visualise the contributions Equalisation
to the correct class prediction of each image’s pixel by superposing the LRP 1.20 ± 0.35 1.02 ± 0.19 1.40 ± 0.31
EfficientNet
DeepLIFT 1.38 ± 0.39 1.09 ± 0.25 1.50 ± 0.35
image and the importance score obtained from a model interpretability
method [61]. No matter the architecture, the pre-processing strategy LRP 1.42 ± 0.18 1.16 ± 0.18 1.56 ± 0.29
DenseNet
DeepLIFT 1.72 ± 0.30 1.26 ± 0.24 1.77 ± 0.39
or the method used to calculate the attribution maps, they show a Cropping
LRP 1.48 ± 0.31 1.21 ± 0.25 1.57 ± 0.39
significant number of pixels relevant (in red) to the classification EfficientNet
DeepLIFT 1.39 ± 0.29 1.12 ± 0.25 1.47 ± 0.36
process that does not fall within the LRoI. Specifically, a large number
of significant pixels are concentrated in the breastbone, the diaphragm,
the cardiac silhouette, and in the background of the image (with special
attention to the upper corners). The corresponding attribution maps for to estimate IoC. The results show, for all scenarios, IoC values > 1,
BDenseNet and BEfficientNet are included in Appendix. and higher for DeepLIFT compared to LRP, suggesting that the area
Considering that there is an ongoing discussion about the quality outside the LRoI is more relevant for classification (i.e., contains more
of attribution maps to clearly identifycriminatory characteristics in the relevant pixels) than the LRoI itself. These results confirm the presence
images [62], and to evaluate the effect of the predictive capability of a strong CHE.
of the models in those regions outside the LRoI, Table 4 provides To complement these results, Table 6 reports, for the Bayesian
results for the different architectures using the first two pre-processing version of the models (BDenseNet and BEfficientNet), the IoC𝛼 values
strategies (i.e., equalisation and cropping), but validating with images estimated from the DeepLIFT attribution maps, for each class (i.e., con-
artificially modified to contain a black mask occluding the external part trol, pneumonia, and COVID-19), and for different values of 𝛼 {25,
of the LRoI. This experiment is intended to evaluate to what extend 50, 75, 95}. DeepLIFT was chosen because it provided larger values in
models that are not guided to focus on a specific RoI, can find as rele- Table 5, and is thus considered the worst scenario. Only well-classified
vant, information in the images that is not related with the pathology samples are included to estimate these values. The results show sig-
under study. As expected, performance dropped significantly, suggest- nificantly lower values than those reported in Table 5, being close to
ing that despite the good accuracy provided by the baseline methods,
1 (or even smaller) in certain cases (such as in the identification of
the models are taking into account significant information from the XR
pneumonia with BDenseNet). These results suggest better interpretabil-
images that do not belong to the lungs (i.e., to the LRoI). Note that the
ity and coherence of the Bayesian networks due to a better balance
results in Table 4 show large standard deviations, since the models tend
of relevant points within/outside the LRoI, but still demonstrate that
to assign most samples to a single class.
models are paying attention to the external area of the LRoI, confirming
On the other hand, Table 5 shows the mean IoC values estimated
the presence of a CHE.
for the two base network architectures (i.e., the frequentist version of
The previous results are also supported graphically by Fig. 7, which
DenseNet and EfficientNet) and the first two pre-processing modali-
ties (equalisation and cropping). The LRoI was used to estimate the shows the attribution maps for both Bayesian approaches using B-LRP
IoC according to the LRP and DeepLIFT methods. Consequently, the and B-DeepLIFT with 𝛼 = 25 and following the equalisation pre-
pre-processing using the segmentation scheme was not included for processing approach. As expected, in these cases, significant pixels are
comparison, since the input XR images were pre-processed using the inside the LRoI. Note that in these figures, the lungs’ boundaries are
same mask. The method used in this paper to estimate IoC approximates delineated with many points, but they are considered non-relevant
the original definition, using LRoI as a whole for the pneumonia and (in blue). In comparison with BEfficientNet, BDenseNet provides more
COVID-19 classes instead of a specific RoI inside. Thus, IoC would relevant points within the LRoI, thus, it is considered, again, a more
provide additional evidence on to what extent the network is paying interpretable architecture.
attention to regions considered non-relevant and not linked with the Going back to Fig. 3, it also shows the calibration curves for the
diseases at hand. So, as long as the relevant pixels are inside the LRoI, frequentist versions of the models along with the MCE values. The
the IoC would lead to values close to 0. IoC values < 1 are expected results show poor calibrations for all pairs (pneumonia vs. control,
when the most relevant pixels fall within the LRoI, and > 1 when pneumonia vs. COVID-19, and control vs. COVID-19) and for all pre-
placed outside the LRoI. Only correctly classified samples are included processing schemes considered (with 0.26 < MCE > 0.57). All models

8
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 5. Attribution maps of a control and a pneumonia patient obtained for the original images equalised and the cropped images. For the sake of comparison, the maps are
obtained for DenseNet and EfficientNet using LRP and DeepLIFT. Red dots correspond to pixels with positive attribution (i.e., relevant) and blue dots to pixels with negative
attribution (i.e., nonrelevant).

9
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 6. Attribution maps of one control and one COVID-19 patient obtained for the original images equalised and the cropped images. Maps are obtained for DenseNet and
EfficientNet using LRP and DeepLIFT for comparison purposes. Red dots correspond to pixels with positive attribution (i.e., relevant) and blue dots to pixels with negative
attribution (i.e., nonrelevant).

10
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 7. Attribution maps of the controls, pneumonia, and COVID patients obtained for the segmented images. For comparison purposes, the maps are obtained for BDenseNet
and BEfficientNet using B-LRP and B-DeepLIFT for 𝛼 = 25. Red dots correspond to pixels with positive attribution (i.e., relevant) and blue dots to pixels with negative attribution
(i.e., nonrelevant).

11
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 8. Confusion matrices and calibration curves for the Bayesian versions of the model architectures and the three pre-processing schemes (equalisation, cropping, and
segmentation).

tend to overestimate the probabilities for the pair pneumonia vs. con- Finally, Fig. 9 shows the kernel density distribution of the uncer-
trol, whereas the two remaining pairs tend to underestimate it. In tainty of the model and the predictive uncertainty when the models
addition, Fig. 8, also shows the calibration curves for the Bayesian classify the samples correctly and when they make mistakes. Follow-
versions of the architectures. The results show a good calibration ing [52], the prediction uncertainty is measured as the entropy of
for all stochastic models when calculating the probability of the pair the mean of the predictive distribution obtained from Monte Carlo
pneumonia vs. control (MCE < 0.1 for BDenseNet), but a poor and sampling during the prediction phase, and the model uncertainty is
obtained by computing the difference between the entropy of the mean
underestimated calibration for the remaining two combinations. This
of the predictive distribution and the mean of the entropy. The results
behaviour appears in the three pre-processing schemes considered.
show that BDenseNet provides similar patterns for model uncertainty
These results suggest that models make decisions based on under- and prediction uncertainty for the subsequent three pre-processing
estimated probabilities, especially for pairs involving COVID-19 cases. approaches. These patterns show very low model and prediction un-
In any case, the calibration curves show a much smoother behaviour certainty when they hit, whereas uncertainty increases significantly
(monotonically increasing) for the Bayesian versions of the networks. when they fail, suggesting trustable true positives, no matter the pre-
This behaviour suggests a more predictable and interpretable output. processing approach used. Additionally, Fig. 9 shows that BEfficientNet

12
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Table 6
Importance of Context per percentiles (IoC𝛼 ) estimated for the two Bayesian versions of the model architectures and each class. Only the
correctly classified samples are included for estimating the IoC𝛼 . The attributions maps are estimated using B-DeepLIFT.
Model Pre-processing Class Percentile
IoC25 IoC50 IoC75 IoC95
Control 1.01 ± 0.33 1.11 ± 0.27 1.20 ± 0.26 1.29 ± 0.26
Equalisation Pneumonia 0.89 ± 0.24 0.93 ± 0.21 0.99 ± 0.20 1.01 ± 0.20
COVID-19 1.11 ± 0.32 1.09 ± 0.26 1.13 ± 0.24 1.21 ± 0.23
BDenseNet
Control 1.15 ± 0.26 1.20 ± 0.24 1.25 ± 0.23 1.30 ± 0.22
Cropping Pneumonia 0.77 ± 0.17 0.84 ± 0.16 0.89 ± 0.16 0.92 ± 0.16
COVID-19 1.23 ± 0.25 1.27 ± 0.24 1.33 ± 0.24 1.41 ± 0.24
Control 1.08 ± 0.29 1.13 ± 0.19 1.16 ± 0.20 1.20 ± 0.23
Equalisation Pneumonia 1.14 ± 0.26 1.04 ± 0.14 1.05 ± 0.13 1.05 ± 0.15
COVID-19 1.07 ± 0.23 1.11 ± 0.17 1.13 ± 0.18 1.18 ± 0.20
BEfficientNet
Control 1.30 ± 0.27 1.34 ± 0.18 1.36 ± 0.15 1.40 ± 0.20
Cropping Pneumonia 1.34 ± 0.29 1.15 ± 0.18 1.11 ± 0.17 1.10 ± 0.18
COVID-19 1.21 ± 0.27 1.26 ± 0.22 1.30 ± 0.23 1.34 ± 0.24

Fig. 9. Model and prediction uncertainty distributions estimated using the Bayesian versions of DenseNet and EfficientNet for the three pre-processing schemes (equalisation,
cropping, and segmentation) and differentiating between model hits and failures.

provides, for the last two pre-processing strategies (cropping and seg- To provide new evidence in this regard, this work explicitly fo-
mentation), comparable model uncertainty patterns in both, hits and cusses on the CHE visible in the application of DL techniques for
failures, as well as similar prediction uncertainty, but with signifi- COVID-19 screening, providing interpretability to the results obtained
cantly higher uncertainty than in BDenseNet. These results suggest that by analysing various factors to demonstrate the need for a specific pre-
BEffientNet is less reliable in its decisions. processing strategy of the chest XR images to semantically segment the
lungs, even at the cost of precision. The aim is to force the network to
4. Discussion focus on certain areas to reduce shortcuts in the learning process.
For this purpose, the paper uses two off-the-shelf architectures,
The COVID-19 pandemic has led to a trend in research that uses DL DenseNet and EfficientNet, and their Bayesian counterparts, which, in
to evaluate XR images for automatic disease evaluation. However, as principle, are more interpretable architectures. The results suggest that
suggested by [20], despite the high precision reached by these models, the DenseNet-based architecture performs better in both its frequentist
they are not exempt from practical limitations. and Bayesian versions. The precision, which was obtained with models
This article provides evidence and discusses certain limitations of trained with a large data set (more than 70,000 XR images, being,
DL models for the detection of COVID-19 from chest XR images due to our knowledge, the largest used for this purpose), is very relevant
to artifactual correlations. Although some research has superficially (BAcc > 97%), but further experiments have shown significant learning
addressed this issue, most recent work continues to compare models shortcuts, suggesting that the decision made by the models is guided
based solely on their accuracy, ignoring interpretability, explicability, by the ‘principle of least effort’ [22,63], detecting textures with a not
and uncertainty estimates. Furthermore, the limited interpretability and clear casual relationship with the disease. This is confirmed by the
explicability of these DL models undermine their confidence and the attribution maps calculated using the LRP and DeepLIFT methods, by a
possibility of being transferred to the clinical setting. set of experiments validating with a black mask that covers the area

13
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 10. Attribution maps of a control and a pneumonia patient obtained for the original images equalised and the cropped images. For comparison purposes, the maps are
obtained for BDenseNet and BEfficientNet using B-LRP and B-DeepLIFT for 𝛼 = 25. Red dots correspond to pixels with positive attribution (i.e., relevant) and blue dots to pixels
with negative attribution (i.e., nonrelevant).

14
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Fig. 11. Attribution maps of one control and one COVID-19 patient obtained for the original images equalised and the cropped images. For comparison purposes, maps are
obtained for BDenseNet and BEfficientNet using B-LRP and B-DeepLIFT for 𝛼 = 25. Red dots correspond to pixels with positive attribution (i.e., relevant) and blue dots to pixels
with negative attribution (i.e., non relevant).

15
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

outside the LRoI, and by the experiments reporting IoC values > 1, 5. Conclusions and future lines
which suggest that the area outside the LRoI has more relevant points
for the decision than the area inside. This work is focused on the analysis of spurious correlations and
Despite the high precision obtained by the artificial models used to CHE phenomena observed during the detection of COVID-19 by DL
detect COVID-19, these models reveal significant limitations in terms models from Chest RX. The results evidenced that without careful
of interpretability. Although artificial models pose, in general terms, analysis and image pre-processing techniques, the models tend to ap-
well-known limitations (e.g. data set bias [4], learning under covariate ply shortcut learning strategies associated with the data source, data
shift [64], anti-causal learning [65], and the tank legend [66]), the imbalance, artifacts, and spurious text, labels, or black space shapes,
CHE is one of the main existing barriers. This is because DL networks which corresponds to the overinterpretation behaviour also described
learn the easiest possible solution to the problem [22], taking shortcuts in other fields of application [18].
instead of learning the intended solution. This fact leads to a good This work has shown that combining a Bayesian version of DenseNet
accuracy of the models, but based on non interpretable causal-effect with a segmentation pre-processing approach constitutes the most reli-
relationships. In our problem, this is clearly seen by the large number able scheme of those tested for detecting COVID-19 from plain chest
of relevant pixels shown in the attribution maps outside the LRoI, XR images. Using a Bayesian approach allows us to get uncertainty
measures on the model’s predictions that can also be extended to
also reflected in the IoC values obtained significantly over 1 and the
saliency visualisation obtained from models’ explainability techniques
validation results occluding the regions out of the LRoI, which showed
and IoC indexes. However, the IoC applicability is subject to prob-
a significant drop in the models’ performance.
lems where a clear RoI can be identified and segmented; otherwise,
In any case, given the IoC values reported, BDenseNet networks
other approaches must be developed to overcome similar issues as
have demonstrated to be more interpretable and ascertainable, but their
those observed and described in this work. Apart from the results
results are not completely coherent: even these networks are affected by
of specific model configurations and pre-processing techniques, this
the CHE, apparently learning from patterns not belonging to the LRoI.
work is intended to emphasise the need for a more rigorous evaluation
The results have also shown that pre-processing using a semantic methodology of DL models in the context of COVID-19 detection that
segmentation of the lungs (exemplified in 2Right) provides slightly goes far beyond accuracy metrics and provides some elements that
worse but comparable results to those using raw images with just an could be incorporated into a standardised validation methodology.
equalisation. These results are in line with those reported in [4] for a Nevertheless, important aspects are still open regarding the in-
different DL architecture. In view of the significant issues identified re- terpretability of the results, which would require an analysis based
lated to shortcut learning due to the CHE, this pre-processing technique on more recent approaches for model interpretability, such as better
reveals itself as a mandatory step in any DL system aimed at detecting attribution maps in terms of localisation metrics [62] or even moving
COVID-19 from chest XR images. The use of this pre-processing scheme from attribution maps to concept level explainability tools given the
does not completely solve potential biases introduced by the network, identified limitations of the former ones [68]. Besides, since most of
but is considered mandatory to guide the learning process towards the work done in the field of explainability has been orientated toward
an area with a causal relationship with the problem under study, common computer vision tasks, the use of specific interpretability
obligating the network to make an effort to attend to this specific area. techniques tailored for medical image analysis is worth exploring in this
Notwithstanding the positive effects, pre-processing with a semantic context [69]. Furthermore, the feasibility of other methods focused on
segmentation of the lungs is not free of potential biases. This process providing systems with additional information to guide training (such
may introduce a confounding factor due to the external shape of the as texts, shapes, or boxes) or to remove biasing factors (such as domain
lungs, which may be related to the patient’s position, sex, projection adversarial training [23]) should also be evaluated.
(i.e., PA or AP), or the pathology itself. However, results suggest Concerning uncertainty and calibration performance, other loss
that the semantic segmentation process and the cropping to the RRoI functions designed to increase the calibration of Bayesian networks [70]
minimise this risk by producing class-independent masks. are worth exploring in the context of COVID-19 and pneumonia de-
In any case, it is worth remembering that some of the limitations tection, which could provide a better balance and reliability of the
described are inherent to the discriminative learning typically per- networks’ outcomes.
formed in DL architectures for computer vision, as classification is In conclusion, DL models certainly have the potential to solve the
shape-agnostic and strongly relies on identifying textures rather than challenge under study, serving as scientific models for prediction and
predefined shapes [67]. In our case, the opacities due to COVID-19 do explanation. However, we should not confuse the performance of a data
not have a predefined shape, clear boundaries, and structure, but their set with the acquisition of the underlying diagnostic capacity [22,71].
texture might appear in many areas of the image superimposed on other This is why, among other aspects, guiding the process with semantic
structures. Therefore, we could assume that the joint combination of segmentation is mandatory to minimise learning shortcomings.
textures, contrast, and structures is the differential characteristic that This work goes into the CHE present when applying DL techniques
to screen for COVID-19, but the conclusions could also be extrapolated
humans follow to identify the underlying lesions. But, as mentioned
to the general context of pneumonia detection from chest RX images or
above, the networks mainly focus on textures. This might be a plausible
even further.
explanation that supports the high precision of the models and the
learning shortcuts, which also justifies the need to look deeper into
CRediT authorship contribution statement
their interpretability.
Regarding calibration, models make decisions based on underes- Julián D. Arias-Londoño: Conceptualization, Methodology, Soft-
timated probabilities, especially for pairs involving COVID-19 cases. ware, Validation, Formal analysis, Investigation, Resources, Data cu-
This could directly be due to the unbalanced data set, with fewer ration, Writing – original draft, Visualization, Supervision. Juan I.
COVID-19 images. In any case, the monotonically increasing shape of Godino-Llorente: Conceptualization, Methodology, Formal analysis,
the calibration curves suggests that Bayesian versions of the networks Investigation, Data curation, Writing – original draft, Writing – review
are more consistent and predictable, and thus their results are more & editing, Supervision, Project administration, Funding acquisition.
interpretable.
With respect to uncertainty, the results show that BDenseNet pro- Data availability
vides more reliable and trustworthy results than BEfficientNet. More-
over, results suggest trustable true positives for BDenseNet, no matter The code necessary to reproduce the experimental findings can be
the pre-processing approach used. found at https://github.com/jdariasl/COVID_BayesianNET.

16
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

Acknowledgements Appendix

Funded by the agreement between the Comunidad de Madrid (Con- Figs. 10 and 11 show the attribution maps for BDenseNet and
sejería de Educación, Universidades, Ciencia 𝑦 Portavocía) and Univer- BEfficientNet using B-LRP and B-DeepLIFT, respectively, calculated
sidad Politécncia de Madrid, to finance research actions on SARS-CoV-2 with percentile 25 from the set of attribution maps obtained from
and COVID-19 disease with the REACT-UE resources of the European each of the Monte Carlo samples of the models. As in Figs. 5 and 6,
Regional Development Funds. they show the equalisation and cropping pre-processing strategies of
This work was also supported by the Ministry of Economy and Com- a control and a pneumonia patient, and one control and a COVID-
petitiveness of Spain under Grants DPI2017-83405-R1 and PID2021- 19 patient, respectively. Results show a large number of nonrelevant
128469OB-I00; and partially funded by the Autonomous Community pixels delineating the boundaries of the different structures. Compared
of Madrid through the ELLIS Unit Madrid.
to BEfficientNet, BDenseNet shows a higher density of relevant points
Universidad Politécnica de Madrid supports Julián D. Arias-Londoño
within the LRoI. Thus, it is again considered a more interpretable
through a María Zambrano grant, 2021.
architecture.
Abbreviations
References
The following abbreviations are used in this manuscript:
[1] K. Mäkelä, M.I. Mäyränpää, H.-K. Sihvo, P. Bergman, E. Sutinen, H. Ollila, R.
AI Artificial Intelligence Kaarteenaho, M. Myllärniemi, Artificial intelligence identifies inflammation and
AP Anterior–Posterior confirms fibroblast foci as prognostic tissue biomarkers in idiopathic pulmonary
fibrosis, Hum. Pathol. 107 (2021) 58–68.
BAcc Balanced Accuracy
[2] J.F.-W. Chan, S. Yuan, K.-H. Kok, K.K.-W. To, H. Chu, J. Yang, F. Xing, J. Liu,
BDenseNet Bayesian DenseNet C.C.-Y. Yip, R.W.-S. Poon, et al., A familial cluster of pneumonia associated with
BEfficientNet Bayesian EfficientNet the 2019 novel coronavirus indicating person-to-person transmission: a study of
BL Bayesian Learning a family cluster, The Lancet 395 (10223) (2020) 514–523.
[3] A. Shah, M. Shah, Advancement of deep learning in pneumonia/COVID-19 clas-
CHE Clever Hans effect
sification and localization: A systematic review with qualitative and quantitative
CLAHE Contrast Limited Adaptive Histogram Equalisation analysis, Chronic Dis. Transl. Med. 8 (03) (2022) 154–171.
CNN Convolutional Neural Network [4] J.D. Arias-Londoño, J.A. Gomez-Garcia, L. Moro-Velazquez, J.I. Godino-Llorente,
CT Computer Tomography Artificial intelligence applied to chest X-ray images for the automatic detec-
CR Computed Radiography tion of COVID-19. a thoughtful evaluation approach, IEEE Access 8 (2020)
226811–226827.
DICOM Digital Imaging and Communications in Medicine
[5] E. Tartaglione, C.A. Barbano, C. Berzovini, M. Calandri, M. Grangetto, Unveiling
Standard COVID-19 from chest x-ray with deep learning: a hurdles race with small data,
DL Deep Leaning Int. J. Environ. Res. Public Health 17 (18) (2020) 6933.
DeepLIFT Deep Learning Important FeaTures [6] R. Zhang, X. Tie, Z. Qi, N.B. Bevins, C. Zhang, D. Griner, T.K. Song, J.D.
Nadig, M.L. Schiebler, J.W. Garrett, et al., Diagnosis of coronavirus disease 2019
DX Digital Radiography
pneumonia by using chest radiography: value of artificial intelligence, Radiology
HE Histogram Equalisation 298 (2) (2021) E88–E97.
IoC Importance of Context [7] M.M. Rahaman, C. Li, Y. Yao, F. Kulwa, M.A. Rahman, Q. Wang, S. Qi, F. Kong,
LRP Layer-wise Relevance Propagation X. Zhu, X. Zhao, Identification of COVID-19 samples from chest X-Ray images
LRoI Region of Interest corresponding to the lungs using deep learning: A comparison of transfer learning approaches, J. X-ray Sci.
Technol. 28 (5) (2020) 821–839.
MAP Maximum a Posteriori
[8] N. Tsiknakis, E. Trivizakis, E.E. Vassalou, G.Z. Papadakis, D.A. Spandidos,
MCE Maximum Calibration Error A. Tsatsakis, J. Sánchez-García, R. López-González, N. Papanikolaou, A.H.
ML Maximum Likelihood Karantanas, et al., Interpretable artificial intelligence framework for COVID-19
RoI Region of Interest screening on chest X-rays, Exp. Therapeutic Med. 20 (2) (2020) 727–735.
[9] E. Luz, P. Silva, R. Silva, L. Silva, J. Guimarães, G. Miozzo, G. Moreira, D.
SARS Severe Acute Respiratory Syndrome
Menotti, Towards an effective and efficient deep learning model for COVID-19
PA Posterior–Anterior patterns detection in X-ray images, Res. Biomed. Eng. (2021) 1–14.
PCR Polymerase Chain Reaction [10] E.F. Ohata, G.M. Bezerra, J.V.S. das Chagas, A.V.L. Neto, A.B. Albuquerque,
PPV Positive Predictive Value V.H.C. De Albuquerque, P.P. Reboucas Filho, Automatic detection of COVID-
RRoI Rectangular Region of Interest 19 infection using chest X-ray images through transfer learning, IEEE/CAA J.
Autom. Sin. 8 (1) (2020) 239–248.
XR Plain Chest X-Ray
[11] L. Wang, Z.Q. Lin, A. Wong, Covid-net: A tailored deep convolutional neural
network design for detection of COVID-19 cases from chest x-ray images, Sci.
Institutional review Rep. 10 (1) (2020) 1–12.
[12] D. Das, K. Santosh, U. Pal, Truncated inception net: COVID-19 outbreak screening
using chest X-rays, Phys. Eng. Sci. Med. 43 (2020) 915–925.
This retrospective study used non-institutional public Health Insur- [13] J. Civit-Masot, F. Luna-Perejón, M. Domínguez Morales, A. Civit, Deep learning
ance Portability and Accountability Act (HIPAA) compliant deidentified system for COVID-19 diagnosis aid using X-ray pulmonary images, Appl. Sci. 10
imaging data. Ethical review and approval were waived for this reason. (13) (2020) 4640.
[14] Y. Oh, S. Park, J.C. Ye, Deep learning COVID-19 features on CXR using limited
Informed consent training data sets, IEEE Trans. Med. Imaging 39 (8) (2020) 2688–2700.
[15] M. Roberts, D. Driggs, M. Thorpe, J. Gilbey, M. Yeung, S. Ursprung, A.I.
Aviles-Rivero, C. Etmann, C. McCague, L. Beer, et al., Common pitfalls and
Patient consent was waived due to the use of open data.
recommendations for using machine learning to detect and prognosticate for
COVID-19 using chest radiographs and CT scans, Nat. Mach. Intell. 3 (3) (2021)
Declaration of competing interest 199–217.
[16] J. Zhang, Y. Xie, G. Pang, Z. Liao, J. Verjans, W. Li, Z. Sun, J. He, Y. Li, C.
The authors declare the following financial interests/personal rela- Shen, et al., Viral pneumonia screening on chest X-rays using confidence-aware
tionships which may be considered as potential competing interests: anomaly detection, IEEE Trans. Med. Imaging 40 (3) (2020) 879–890.
Juan Ignacio Godino-Llorente reports financial support was provided [17] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K.-R. Müller,
Unmasking Clever Hans predictors and assessing what machines really learn,
by Comunidad de Madrid (Consejerıa de Educación, Universidades,
Nat. Commun. 10 (1) (2019) 1096.
Ciencia y Portavocıa) and Universidad Politécncia de Madrid. Juan [18] B. Carter, S. Jain, J.W. Mueller, D. Gifford, Overinterpretation reveals image
Ignacio Godino Llorente reports financial support was provided by State classification model pathologies, Adv. Neural Inf. Process. Syst. 34 (2021)
Agency of Research. 15395–15407.

17
J.D. Arias-Londoño and J.I. Godino-Llorente Biomedical Signal Processing and Control 90 (2024) 105831

[19] O. Pfungst, Clever Hans:(the Horse of Mr. Von Osten.) a Contribution to [43] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected
Experimental Animal and Human Psychology, Holt, Rinehart and Winston, 1911. convolutional networks, in: Proceedings of the IEEE Conference on Computer
[20] A.J. DeGrave, J.D. Janizek, S.-I. Lee, AI for radiographic COVID-19 detection Vision and Pattern Recognition, 2017, pp. 4700–4708.
selects shortcuts over signal, Nat. Mach. Intell. 3 (7) (2021) 610–619. [44] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural
[21] S. Gautam, M.M.-C. Höhne, S. Hansen, R. Jenssen, M. Kampffmeyer, Demon- networks, in: International Conference on Machine Learning, PMLR, 2019, pp.
strating the risk of imbalanced datasets in chest x-ray image-based diagnostics by 6105–6114.
prototypical relevance propagation, in: 2022 IEEE 19th International Symposium [45] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:
on Biomedical Imaging (ISBI), IEEE, 2022, pp. 1–5. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
[22] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F.A. 2016, pp. 770–778.
Wichmann, Shortcut learning in deep neural networks, Nat. Mach. Intell. 2 (11) [46] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Mobilenetv2: Inverted
(2020) 665–673, http://dx.doi.org/10.1038/s42256-020-00257-z. residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on
[23] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
Marchand, V. Lempitsky, Domain-adversarial training of neural networks, J. [47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A.
Mach. Learn. Res. 17 (1) (2016) 1–35. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition
[24] M. Reyes, R. Meier, S. Pereira, C.A. Silva, F.-M. Dahlweid, H.v. Tengg-Kobligk, challenge, Int. J. Comput. Vis. 115 (2015) 211–252.
R.M. Summers, R. Wiest, On the interpretability of artificial intelligence in [48] R.M. Neal, Bayesian Learning for Neural Networks, Vol. 118, Springer Science
radiology: challenges and opportunities, Radiol.: Artif. Intell. 2 (3) (2020) & Business Media, 2012.
e190043. [49] K. Shridhar, F. Laumann, M. Liwicki, A comprehensive guide to bayesian
[25] H. Panwar, P. Gupta, M.K. Siddiqui, R. Morales-Menendez, P. Bhardwaj, V. convolutional neural network with variational inference, 2019, arXiv preprint
Singh, A deep learning and grad-CAM based color visualization approach for arXiv:1901.02731.
fast detection of COVID-19 cases using chest X-ray and CT-scan images, Chaos [50] A. Graves, Practical variational inference for neural networks, Adv. Neural Inf.
Solitons Fractals 140 (2020) 110190. Process. Syst. 24 (2011).
[26] P.R. Bassi, R. Attux, A deep convolutional neural network for COVID-19 detection [51] P.K. Diederik, M. Welling, et al., Auto-encoding variational bayes, in: Proceedings
using chest X-rays, Res. Biomed. Eng. (2021) 1–10. of the International Conference on Learning Representations (ICLR), Vol. 1, 2014.
[27] B. Ghoshal, A. Tucker, Estimating uncertainty and interpretability in deep [52] R. Krishnan, P. Esposito, M. Subedar, Bayesian-Torch: Bayesian neural net-
learning for coronavirus (COVID-19) detection, 2020, arXiv preprint arXiv:2003. work layers for uncertainty estimation, 2022, http://dx.doi.org/10.5281/zenodo.
10769. 5908307, https://github.com/IntelLabs/bayesian-torch.
[28] A. Hoffmann, C. Fanconi, R. Rade, J. Kohler, This looks like that... does it? [53] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Towards better understanding
shortcomings of latent space prototype interpretability in deep networks, 2021, of gradient-based attribution methods for deep neural networks, 2017, arXiv
arXiv preprint arXiv:2105.02968. preprint arXiv:1711.06104.
[29] S. Calderon-Ramirez, S. Yang, A. Moemeni, S. Colreavy-Donnelly, D.A. Elizondo, [54] U. Kamath, J. Liu, Explainable Artificial Intelligence: An Introduction To
L. Oala, J. Rodríguez-Capitán, M. Jiménez-Navarro, E. López-Rubio, M.A. Molina- Interpretable Machine Learning, Springer, 2021.
Cabello, Improving uncertainty estimation with semi-supervised deep learning for [55] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks:
covid-19 detection using chest x-ray images, IEEE Access 9 (2021) 85442–85454. visualising image classification models and saliency maps, in: Proceedings of
[30] K. Bykov, M.M.-C. Höhne, K.-R. Müller, S. Nakajima, M. Kloft, How much can I the International Conference on Learning Representations (ICLR), ICLR, 2014.
trust you?–quantifying uncertainties in explaining neural networks, 2020, arXiv [56] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in:
preprint arXiv:2006.09000. International Conference on Machine Learning, PMLR, 2017, pp. 3319–3328.
[31] C.J. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, D. King, Key chal- [57] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On
lenges for delivering clinical impact with artificial intelligence, BMC Med. 17 pixel-wise explanations for non-linear classifier decisions by layer-wise relevance
(2019) 1–9. propagation, PLoS One 10 (7) (2015) e0130140.
[32] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in [58] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, K.-R. Müller, Layer-wise
neural network, in: International Conference on Machine Learning, PMLR, 2015, relevance propagation: an overview, in: Explainable AI: Interpreting, Explaining
pp. 1613–1622. and Visualizing Deep Learning, Springer, 2019, pp. 193–209.
[33] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through [59] N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds,
propagating activation differences, in: International Conference on Machine A. Melnikov, N. Kliushkina, C. Araya, S. Yan, et al., Captum: A unified and
Learning, PMLR, 2017, pp. 3145–3153. generic model interpretability library for pytorch, 2020, arXiv preprint arXiv:
[34] HM-Hospitales, Covid data save lives english version, 2020, https:// 2009.07896.
www.hmhospitales.com/coronavirus/covid-data-save-lives/english-version, [On- [60] C. Guo, G. Pleiss, Y. Sun, K.Q. Weinberger, On calibration of modern neural
line; accessed 8-August-2020]. networks, in: International Conference on Machine Learning, PMLR, 2017, pp.
[35] M. de la Iglesia Vayá, J.M. Saborit, J.A. Montell, A. Pertusa, A. Bustos, M. 1321–1330.
Cazorla, J. Galant, X. Barber, D. Orozco-Beltrán, F. García-García, M. Caparrós, [61] E. Mohamed, K. Sirlantzis, G. Howells, A review of visualisation-as-explanation
G. González, J.M. Salinas, BIMCV COVID-19: a large annotated dataset of RX techniques for convolutional neural networks and their evaluation, Displays 73
and CT images from COVID-19 patients, 2020, arXiv:2006.01174. (2022) 102239.
[36] DarwinAI Corp., Canada and Vision and Image Processing Research Group, [62] M. Bohle, M. Fritz, B. Schiele, Convolutional dynamic alignment networks for
University of Waterloo, Actualmed COVID-19 chest X-ray dataset initiative, 2020, interpretable classifications, in: Proceedings of the IEEE/CVF Conference on
URL https://github.com/agchung/Figure1-COVID-chestxray-dataset. Computer Vision and Pattern Recognition, 2021, pp. 10029–10038.
[37] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R.M. Summers, Chestx-ray8: Hospital- [63] G.K. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley
scale chest x-ray database and benchmarks on weakly-supervised classification Press, MA, USA, Cambridge, 1949.
and localization of common thorax diseases, in: Proceedings of the IEEE [64] S. Bickel, M. Brückner, T. Scheffer, Discriminative learning under covariate shift,
Conference on Computer Vision and Pattern Recognition, 2017, pp. 2097–2106. J. Mach. Learn. Res. 10 (2009) 2137–2155.
[38] A.E.W. Johnson, T.J. Pollard, S.J. Berkowitz, N.R. Greenbaum, M.P. Lungren, [65] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, J.M. Mooij, On causal
C.-y. Deng, R.G. Mark, S. Horng, MIMIC-CXR, a de-identified publicly available and anticausal learning, in: International Conference on Machine Learning, 2012.
database of chest radiographs with free-text reports, Sci. Data (ISSN: 2052-4463) [66] A. Torralba, A.A. Efros, Unbiased look at dataset bias, CVPR 2011 (2011)
6 (1) (2019) 317, http://dx.doi.org/10.1038/s41597-019-0322-0. 1521–1528.
[39] S. Jaeger, S. Candemir, S. Antani, Y.-X.J. Wáng, P.-X. Lu, G. Thoma, Two public [67] N. Baker, H. Lu, G. Erlikhman, P.J. Kellman, Deep convolutional networks do
chest X-ray datasets for computer-aided screening of pulmonary diseases, Quant. not classify based on global object shape, PLoS Comput. Biol. 14 (2018).
Imaging Med. Surg. 4 (6) (2014) 475. [68] R. Achtibat, M. Dreyer, I. Eisenbraun, S. Bosse, T. Wiegand, W. Samek, S. La-
[40] S. Jaeger, A. Karargyris, S. Candemir, L. Folio, J. Siegelman, F. Callaghan, Z. Xue, puschkin, From ‘‘where’’ to ‘‘what’’: Towards human-understandable explanations
K. Palaniappan, R.K. Singh, S. Antani, et al., Automatic tuberculosis screening through concept relevance propagation, 2022, arXiv preprint arXiv:2206.03208.
using chest radiographs, IEEE Trans. Med. Imaging 33 (2) (2013) 233–245. [69] M. Graziani, V. Andrearczyk, S. Marchand-Maillet, H. Müller, Concept attribu-
[41] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, tion: Explaining CNN decisions to physicians, Comput. Biol. Med. 123 (2020)
B. Haghgoo, R. Ball, K. Shpanskaya, et al., Chexpert: A large chest radiograph 103865.
dataset with uncertainty labels and expert comparison, in: Proceedings of the [70] R. Krishnan, O. Tickoo, Improving model calibration with accuracy versus
AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 590–597. uncertainty optimization, Adv. Neural Inf. Process. Syst. 33 (2020) 18237–18248.
[42] J.D. Arias-Londoño, A. Moure-Prado, J.I. Godino-Llorente, Automatic identifica- [71] R.M. Cichy, D. Kaiser, Deep neural networks as scientific models, Trends in
tion of lung opacities due to COVID-19 from chest X-Ray images. Focussing Cognitive Sciences 23 (2019) 305–317.
attention on the lungs, Diagnostics 13 (8) (2023) 1381.

18

You might also like