Bae 2021
https://doi.org/10.1007/s10278-021-00499-2
Abstract
This study aimed to develop a method for detection of femoral neck fracture (FNF), including displaced and non-displaced fractures, using a convolutional neural network (CNN) with plain X-rays, and to validate its use across hospitals through internal and external validation sets. This is a retrospective study using hip and pelvic anteroposterior films for training and detecting femoral neck fracture through an 18-layer residual neural network (ResNet18) with a convolutional block attention module (CBAM++). The study was performed at two tertiary hospitals between February and May 2020 and used data from January 2005 to December 2018. Our primary outcome was favorable performance for diagnosis of femoral neck fracture from negative studies in our dataset. We described the outcomes as area under the receiver operating characteristic curve (AUC), accuracy, Youden index, sensitivity, and specificity. A total of 4,189 images, containing 1,109 positive images (332 non-displaced and 777 displaced) and 3,080 negative images, were collected from two hospitals. The test values after training with one hospital dataset were 0.999 AUC, 0.986 accuracy, 0.960 Youden index, 0.966 sensitivity, and 0.993 specificity. Values of external validation with the other hospital dataset were 0.977, 0.971, 0.920, 0.939, and 0.982, respectively. Values of the merged hospital datasets were 0.987, 0.983, 0.960, 0.973, and 0.987, respectively. A CNN algorithm for FNF detection in both displaced and non-displaced fractures using plain X-rays could be used in other hospitals to screen for FNF after training with images from the hospital of interest.
* Jaehoon Oh
  ojjai@hanmail.net; ohjae7712@gmail.com
* Tae Hyun Kim
  taehyunkim@hanyang.ac.kr

1 Department of Emergency Medicine, College of Medicine, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Republic of Korea
2 Department of Computer Science, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul 04763, Republic of Korea
3 Machine Learning Research Center for Medical Data, Hanyang University, Seoul, Republic of Korea
4 Department of Otolaryngology – Head and Neck Surgery, College of Medicine, Hanyang University, Seoul, Republic of Korea
5 Department of HY-KIST Bio-Convergence, College of Medicine, Hanyang University, Seoul, Republic of Korea
6 Department of Emergency Medicine, College of Medicine, Chung-Ang University, Seoul, Republic of Korea
7 Department of Emergency Medicine, Seoul National University Bundang Hospital, Gyeonggi-do, Republic of Korea
Journal of Digital Imaging
that was not indicative of FNF. Their reports were stated by radiologists as "unremarkable study," "non-specific finding," or "no definite acute lesion." We also excluded images with severe deformation of anatomical structures caused by an acute fracture in another area or by past damage, but did not exclude patients with implants from past surgery in the absence of anatomical deformation. A relatively larger number of non-fracture images than fracture images was obtained at random, with a ratio of fracture to non-fracture images of 1:3. We collected these images from the same hospitals and within the same time period.

X-ray images were extracted using the picture archiving and communication system (PACS, Piview, INFINITT Healthcare, Seoul, South Korea) and stored in .jpeg format. No personal information was included when saving images for data collection, and data were obtained without personal identifying data. In addition, arbitrary numbers were assigned to images, which were then coded and managed.

Data Pre-processing and Augmentation

In general, medical images acquired from different sources have different sizes (i.e., resolutions) and aspect ratios. Moreover, as it is difficult to obtain large training datasets with ground-truth labels, proper pre-processing and data-augmentation steps are needed. First, we added zeros to the given input images to create square shapes and resized the images to a fixed 500 × 500 pixels. Next, to increase the amount of training data and secure robustness to geometric transformations such as scale, translation, and rotation, we augmented the training images by applying random transformations (e.g., flip, flop, or rotation) and randomly cropping 450 × 450 patches during the training process. Although the resulting cropped image includes both hips within the processed input image, our data pre-processing and augmentation do not aim to select the regions of interest (i.e., hip regions).

Fracture Detection by Image Classification

The overall network architecture of the proposed method for FNF detection was based on image classifiers, since our fracture diagnosis task can be categorized as a typical binary classification problem. Our architecture was based on CBAM, which includes residual neural network (ResNet) and spatial attention modules [31, 32]. Specifically, ResNet uses residual mapping to successfully address the issue of vanishing gradients in a CNN. This residual mapping, achieved using skip connections between layers, allows the model to contain many layers. CBAM integrates an attention module with ResNet and improves classification performance with a few additional parameters in feed-forward CNNs.
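The padding, resizing, and cropping pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' code: nearest-neighbour resizing and 90-degree rotations stand in for the unspecified resampling and rotation details, and `pad_to_square`, `resize_nearest`, and `augment` are hypothetical helper names.

```python
import numpy as np

def pad_to_square(img):
    """Zero-pad a 2-D image so that height == width (the paper's first step)."""
    h, w = img.shape
    size = max(h, w)
    out = np.zeros((size, size), dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    out[top:top + h, left:left + w] = img
    return out

def resize_nearest(img, size=500):
    """Nearest-neighbour resize to size x size (a stand-in for the resampling used)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def augment(img, rng, crop=450):
    """Random flip/flop/rotation plus a random 450 x 450 crop, as in the paper."""
    if rng.random() < 0.5:
        img = np.flipud(img)                  # vertical flip ("flop")
    if rng.random() < 0.5:
        img = np.fliplr(img)                  # horizontal flip
    img = np.rot90(img, k=rng.integers(4))    # 90-degree rotation as a simple proxy
    h, w = img.shape
    top = rng.integers(h - crop + 1)
    left = rng.integers(w - crop + 1)
    return img[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
x = rng.random((480, 600))             # a fake radiograph
x = resize_nearest(pad_to_square(x))   # 600 x 600 -> 500 x 500
patch = augment(x, rng)                # 450 x 450 training patch
```

Because the crop is smaller than the padded input, each epoch sees slightly shifted views of the same pelvis, which is what provides the robustness to translation noted above.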
CBAM sequentially generates attention maps along the two dimensions of channel and space in the given input feature map, and these attention maps refine the input feature map.

Fig. 2 Architecture and network comparison of deep learning neural networks for detection of FNF. a Architecture of ResNet18 with attention module. ResNet18 is composed of 8 residual blocks; every two blocks belong to the same stage and have the same number of output channels. b Diagram of CBAM++. The two outputs pooled along the channel axis are resized by additional pooling operations according to stage number and forwarded to a convolution layer. c AUC of internal validation with the Hospital A dataset and the number of network parameters. The parameter count of ResNet18 with CBAM++ is the lowest, and its AUC value is the highest.

Effective Attention via Coarse-to-Fine Approaches

The CBAM module functions well in an image-classification task. However, since low-level activation maps are acquired in the early stages of the CNN, it is difficult to accurately generate spatial attention maps using the spatial attention module in CBAM. Therefore, we slightly modified the spatial attention module in the early stages of the CNN so that a coarse attention map could be calculated. As shown in Fig. 2 (a) and (b), we applied average-pooling and max-pooling operations along the channel axis and concatenated them. Subsequently, additional average-pooling operations were applied before the convolution layer; these average-pooling operations produce the intermediate feature map. By decreasing the number of average-pooling operations in each stage, we obtain activation maps of sizes (H/8, W/8), (H/4, W/4), (H/2, W/2), and (H, W) after average pooling in each stage of the network. On this intermediate feature map, we applied a convolution layer with filter size 5 to encode areas to emphasize and suppress. We increased the resolution to H × W using bilinear interpolation so that the attention map has the same size as the input feature map. Finally, we multiplied the attention map with the input feature map. Using our modified attention module, we generated more accurate attention maps with a smaller number of convolutional operations than the original CBAM, even though the number of parameters is equivalent.

Experiments

To verify the performance of the proposed method, we performed the following two steps. (1) Phase 1: the Hospital A dataset was separated into training data (80%), internal validation data (10%), and test data (10%). We determined the optimal cut-off value using results from internal validation, then performed the test with the Hospital A test dataset and external validation with the Hospital B dataset. (2) Phase 2: training, internal validation, and testing were conducted with all data from Hospitals A and B, using 80%, 10%, and 10% of each dataset, respectively.

Effects of Data Augmentation

To demonstrate the performance of our data augmentation, we provide AUC values with and without the augmentation technique. The AUC value on the Hospital A dataset was 0.880 when data augmentation was not applied during training. In contrast, with our data augmentation, this AUC value was 0.991. Data augmentation thus has a significant impact when the training set of medical images is insufficient.

Network Comparison for Detection of FNF

We conducted experiments with four types of attention modules: CBAM (with spatial attention and with channel attention), CBAM− (with spatial attention and without channel attention), CBAM+ (with the proposed spatial attention and with channel attention), and CBAM++ (with the proposed spatial attention and without channel attention), using ResNet18 and ResNet50 as baselines. We trained the networks under the same conditions and evaluated the performance of each module. Figure 2c shows the number of parameters in each network and the AUC value on the internal validation set from Hospital A. Although our training set was small, we achieved AUC values similar to those of a larger network. Among the modules, ResNet18 with CBAM++ was the most efficient, with the smallest number of parameters and the highest AUC value. Therefore, we used ResNet18 with CBAM++ as our final model.

Visualization and Verification of the Medical Diagnosis

The proposed method is a computer-aided diagnostic system aimed at helping radiologists and emergency doctors in medical imaging analysis. Therefore, we not only classify whether the input X-ray image includes fractured parts but also visualize suspicious parts. For visualization, we employed Grad-CAM to highlight regions of the image that are important for diagnosis, since the proposed network is composed of CNNs with fully connected layers [33]. Accordingly, Grad-CAM was applied to the last convolutional layer, placed before the fully connected layer, to verify the medical diagnosis.

Primary Outcomes and Validation

Our primary outcome was favorable performance of detection of FNF from negative studies in our dataset. For validation, we evaluated the model with accuracy, Youden index, and AUC. Accuracy is the fraction of correct predictions over total predictions [34]. The Youden index is calculated as [(sensitivity + specificity) − 1] and is the vertical distance between the 45-degree line and the point on the ROC curve. Sensitivity, also known as recall, is the fraction of correct predictions over all FNF cases, and specificity is the fraction of correct normal predictions over all normal cases. The AUC is the area under the ROC curve, which plots the true positive rate (sensitivity) against the false-positive rate (1 − specificity).
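The outcome measures defined above can be computed directly from a confusion matrix and a threshold sweep. The following is a minimal NumPy sketch with hypothetical helper names, not the study's code; in practice a rank-based AUC or a library routine would normally be used.

```python
import numpy as np

def confusion_counts(y_true, score, cutoff):
    """Counts at a given cut-off: a prediction is positive (fracture) if score >= cutoff."""
    pred = score >= cutoff
    tp = int(np.sum(pred & (y_true == 1)))
    tn = int(np.sum(~pred & (y_true == 0)))
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    return tp, tn, fp, fn

def outcome_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and Youden index as defined in the text."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # correct predictions over all FNF cases
    specificity = tn / (tn + fp)          # correct normals over all normal cases
    youden = sensitivity + specificity - 1
    return accuracy, sensitivity, specificity, youden

def auc_trapezoid(y_true, score):
    """AUC: sweep cut-offs, trace the ROC (FPR, TPR) curve, integrate by trapezoids."""
    fpr, tpr = [0.0], [0.0]
    for t in np.sort(np.unique(score))[::-1]:
        tp, tn, fp, fn = confusion_counts(y_true, score, t)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return sum((f1 - f0) * (t0 + t1) / 2
               for f0, f1, t0, t1 in zip(fpr[:-1], fpr[1:], tpr[:-1], tpr[1:]))
```

Unlike accuracy, sensitivity, specificity, and the Youden index, the AUC integrates over all cut-offs, which is why it serves as the cut-off-independent primary metric here.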
Since the metrics other than AUC change according to the cut-off values that determine fracture or negative predictions, we used the AUC as the primary evaluation metric. For the test and external validation, we selected the optimal cut-off value where the Youden index was highest in internal validation, because plain radiography of the pelvis is commonly used as a first-line screening test to diagnose hip fractures. We applied the same cut-off value in the external validation set.

Statistical Analysis

Data were compiled using a standard spreadsheet application (Excel 2016; Microsoft, Redmond, WA, USA) and analyzed using NCSS 12 (Statistical Software 2018, NCSS, LLC, Kaysville, UT, USA, ncss.com/software/ncss). Kolmogorov–Smirnov tests were performed to assess the normal distribution of all datasets. We generated descriptive statistics and present them as frequency and percentage for categorical data, and as either median and interquartile range (IQR) (non-normal distribution), mean and standard deviation (SD) (normal distribution), or 95% confidence interval (95% CI) for continuous data. We used one ROC curve with cut-off analysis for internal validation, and two ROC curves with independent groups designed to compare the ROC curve of external validation with that of the test after training and internal validation. Two-tailed p < 0.05 was considered significantly different.

Results

A total of 4,189 images containing 1,109 positive images (332 difficult cases and 777 easy cases) and 3,080 negative images were collected from 4,189 patients (Table 1).

Table 1 Baseline characteristics of participants who provided images in datasets

                        All datasets    Hospital A set    Hospital B set
                        (n = 4,189)     (n = 2,090)       (n = 2,099)
  Total                 4,189           2,090             2,099
  Positive images, n    1,109           589               520
    Non-displaced, n    332             139               193
    Displaced, n        777             450               327
    Age, years          75.7 [12.9]     76.8 [13.8]       74.3 [11.7]
    Sex, male           28.1%           28.4%             27.8%
  Negative images       3,080           1,501             1,579
    Age, years          46.4 [16.5]     50.9 [19.1]       42.3 [12.2]
    Sex, male           48.5%           52.0%             45.2%

  Continuous variables are presented as mean [standard deviation] and categorical variables as n (%)
Table 2 Diagnostic performance matrix (A) and outcomes (B) on the internal validation, test, and external validation with optimal cut-off values (phase 1). The optimal cut-off value was estimated at the highest Youden index in the internal validation

(A) Diagnostic performance matrix: positive/negative prediction counts for (1) internal validation, (2) test after (1), and (3) external validation (individual counts are not legible in this extraction)

(B)
                  (2) Test after (1)        (3) External validation
  AUC (95% CI)    0.999 (0.996, 1.000)      0.977 (0.965, 0.984)
  Accuracy        0.986                     0.971
  Youden index    0.960                     0.920
  Sensitivity     0.966                     0.939
    E             1.000                     0.985
    D             0.933                     0.861
  Specificity     0.993                     0.982

(1) Internal validation with Hospital A set; (2) test after training and internal validation with Hospital A set; (3) external validation with Hospital B set. Pos, positive (fracture of femoral neck); Neg, negative (no fracture). AUC, area under the receiver operating characteristic (ROC) curve. Accuracy, the fraction of correct predictions over total predictions. Youden index, sensitivity + specificity − 1, the vertical distance between the 45° line and the point on the ROC curve. Sen, sensitivity; Spe, specificity. CI, confidence interval. E, easy cases subclassified as Garden III or IV; D, difficult cases subclassified as Garden I or II. In the test and external validation, we selected the optimal cut-off value where the Youden index was highest in the internal validation with the Hospital A set. *p < 0.05 is statistically significant
From Hospital A, 2,090 images consisted of 589 positive images (male, 28.4%; mean [SD] age, 76.8 [12.8] years) of FNF and 1,501 negative images with no fracture. From Hospital B, 2,099 images comprising 520 positive images (male, 27.8%; age, 74.3 [11.7] years) and 1,579 negative images were analyzed.

Phase 1: comparison of external validation results with those of testing after internal validation in one hospital.

The diagnostic performance matrix and outcomes are shown in Table 2. The optimal cut-off value was 0.72, at the highest Youden index (0.939) on the ROC in internal validation, for estimating values in the test and external validation. Test values after training and internal validation with the Hospital A dataset were 0.999 (0.996, 1.000) AUC, 0.986 accuracy, and 0.960 Youden index, with a 0.966 sensitivity (1.000 in easy cases and 0.933 in difficult cases) and a 0.993 specificity. Values of external validation with the Hospital B dataset were lower, at 0.977 (0.965, 0.984) AUC, 0.971 accuracy, and 0.920 Youden index, with a 0.939 sensitivity (0.985 in easy cases and 0.861 in difficult cases) and a 0.982 specificity (p < 0.001). These results are shown in Table 2 and Fig. 3.
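The cut-off selection just described, maximizing the Youden index on the internal validation split and then fixing that threshold for the test and external validation, can be sketched as follows. `best_cutoff_by_youden` is a hypothetical name, not from the paper.

```python
import numpy as np

def best_cutoff_by_youden(y_true, score):
    """Return the cut-off maximizing Youden's J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in np.unique(score):              # candidate thresholds from observed scores
        pred = score >= t
        sens = np.sum(pred & (y_true == 1)) / max(int(np.sum(y_true == 1)), 1)
        spec = np.sum(~pred & (y_true == 0)) / max(int(np.sum(y_true == 0)), 1)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = float(t), float(j)
    return best_t, best_j

# The threshold chosen on the internal validation split is then applied,
# unchanged, when scoring the held-out test set and the external hospital.
```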
Table 3 Diagnostic performance matrix (A) and outcomes (B) on the internal validation and test after training with all datasets (phase 2). The optimal cut-off value was estimated at the highest Youden index in the internal validation

(A) Diagnostic performance matrix: positive/negative prediction counts for (1) internal validation and (2) test after (1) (individual counts are not legible in this extraction)

(B)
                  (1) Internal validation     (2) Test after (1)
  AUC (95% CI)    0.991 (0.971, 0.997)        0.987 (0.962, 0.995)
  Accuracy        0.981                       0.983
  Youden index    0.963                       0.960
  Sensitivity     0.982                       0.973
    E             0.982                       1.000
    D             0.982                       0.946
  Specificity     0.981                       0.987

(1) Internal validation with Hospital A and B sets; (2) test after training and internal validation with Hospital A and B sets. Pos, positive (fracture of femoral neck); Neg, negative (no fracture). AUC, area under the receiver operating characteristic (ROC) curve. Accuracy, the fraction of correct predictions over total predictions. Youden index, sensitivity + specificity − 1, the vertical distance between the 45° line and the point on the ROC curve. Sen, sensitivity; Spe, specificity. CI, confidence interval. E, easy cases subclassified as Garden III or IV; D, difficult cases subclassified as Garden I or II. The optimal cut-off value was selected where the Youden index was highest in the internal validation. *p < 0.05 is statistically significant
Phase 2: evaluation of the internal test with combined image datasets.

We set the cut-off value (58.61) at the highest Youden index (0.963) on the ROC of internal validation after training with the merged image datasets. Test values using the merged dataset were 0.987 (0.962, 0.995) AUC, 0.983 accuracy, and 0.960 Youden index, with a 0.973 sensitivity (1.000 in displaced images and 0.946 in non-displaced images) and a 0.987 specificity, as shown in Table 3 and Fig. 3.

Visualization with Grad-CAM

Figure 4 visualizes the feature maps of our network. In testing for FNF after training, ResNet18 with the spatial attention module concentrated on the bilateral hip joints, even when images were negative.

Fig. 4 Visualization with gradient-weighted class activation mapping (Grad-CAM) results in the external validation test. A Correct detection images. The images in the 1st and 2nd rows (true positive images) are the original plain X-ray images and CAM-applied images with FNF, whereas the images in the 3rd and 4th rows (true negative images) are the original plain X-ray images and CAM-applied images without fracture. B False detection images. The images in the 1st row are false-positive images, whereas the images in the 2nd, 3rd, and 4th rows are false-negative images with unidentified areas highlighted by CAM

Discussion

Traditional deep neural networks require many training datasets, but medical images with annotations are not easy to acquire and are insufficient in number to train large networks. To avoid the overfitting problem with insufficient datasets, we employed an efficient network that shows high performance with a small dataset. Thus, we developed CBAM++ as an efficient version of CBAM. In the experiments for detection of FNF with X-ray, we evaluated the performance of ResNets with different types of attention modules. We ultimately selected ResNet18 with the proposed CBAM++ as it showed the best performance with the smallest number of parameters. Moreover, we demonstrated the performance of the proposed network by providing visualization results through Grad-CAM.

The X-ray image detection function for FNF through deep learning performed with images of Hospital A shows results that are equivalent to or higher than the sensitivity of medical staff, especially emergency medicine doctors. In a previous study, the sensitivity and specificity of X-ray readings by emergency medicine doctors, excluding radiologists and orthopedic surgeons, were 0.956 and 0.822 [21]. In other recent studies of deep learning for detection of femoral fracture conducted at a single institution, ranges of outcomes were 0.906–0.955 for accuracy, 0.939–0.980 for sensitivity, and 0.890–0.980 for AUC [20–22]. Our study showed excellent respective outcomes of 0.986, 0.966, and 0.999. Our deep learning algorithm increased the capability for FNF detection with X-ray as the first screening tool and can be applied to the clinical practice of any hospital that provides training and test datasets. Krogue et al. showed that the sensitivity using deep learning was 0.875 in displaced fracture and 0.462 in non-displaced fracture, and Mutasa et al. showed that the sensitivity using a generative adversarial network with digitally reconstructed radiographs was 0.910 in displaced fracture and 0.540 in non-displaced fracture [24, 25]. In our study, we did not classify normal, displaced fracture, and non-displaced fracture separately but detected displaced and non-displaced fracture images together. Nevertheless, the sensitivity for easy (displaced) fractures was 1.000, and that for difficult (non-displaced) fractures was 0.933. Our algorithm can aid in detection of not only Garden type III or IV but also Garden type I or II FNF from hip or pelvic AP X-ray images.

We conducted external validation with the dataset of a second hospital after deep learning with a single-hospital dataset for detection of FNF. The external validation test results were 0.971 accuracy, 0.939 sensitivity, and 0.977 AUC. Comparing the external validation with the test after training and internal validation, the difference in AUC was 0.023 (p < 0.001), likely due to the resolution difference and degradation of data with decreasing image intensity level and contrast. However, with merged images of disparate hospitals using the same protocols, the AUC between a single institution and multiple institutions was not statistically different (difference in AUC = 0.013, p = 0.076). This indicates that a completed model trained using one hospital dataset can be transferred to other hospitals and used as a screening tool for FNF. Hospitals that use the model would need to train it using their own positive and negative images to optimize performance in their specific environments.

Limitations

There were several limitations in this study. First, the age and sex of patients with and without fractures were not completely matched, because FNF has a particularly high incidence at certain ages and in one sex. To apply the results of this study to clinical situations, the method must be verified in adult patients of various ages and sexes who visit the emergency room; we considered that limiting the range of the control group for statistical matching would degrade clinical applicability. Second, we did not compare the performance of our model to that of physicians with respect to key factors such as clinical outcomes, the time required to reach a diagnosis, and the equipment required to use the model as a screening tool. Third, there was a difference in resolution between the medical images and the input images due to resizing. Therefore, information loss may occur when attempting to detect FNF, since the medical images were downsized [35]. Finally, the external validation in phase 1, applying the network trained with a single hospital directly to another hospital, showed lower accuracy than the results using a single hospital. Thus, a phase 2 study in which images from the two hospitals are learned together should be conducted, because we do not yet know how many images are needed to apply our network equally to external data.
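The Grad-CAM step used throughout the visualization sections reduces, at its core, to weighting the last convolutional layer's activation maps by their globally averaged gradients. The following is a framework-agnostic NumPy sketch of that core step; the activations and gradients would come from the trained network, and the names are illustrative, not from the study's code.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Grad-CAM heat map from last-conv activations (C, H, W) and the gradients
    of the fracture-class score with respect to them (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # global-average-pool the gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for overlay
    return cam
```

The normalized map is then upsampled to the input resolution and overlaid on the radiograph, producing highlighted hip regions like those in Fig. 4.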
27. Garden RS: Low-Angle Fixation in Fractures of the Femoral Neck. Journal of Bone and Joint Surgery-British Volume 43(4):647-663, 1961
28. Frandsen PA, Andersen E, Madsen F, Skjødt T: Garden's classification of femoral neck fractures. An assessment of interobserver variation. J Bone Joint Surg Br 70(4):588-590, 1988
29. Thorngren KG, Hommel A, Norrman PO, Thorngren J, Wingstrand H: Epidemiology of femoral neck fractures. Injury 33:1-7, 2002
30. Van Embden D, Rhemrev SJ, Genelin F, Meylaerts SA, Roukema GR: The reliability of a simplified Garden classification for intracapsular hip fractures. Orthop Traumatol Surg Res 98(4):405-408, 2012
31. He K, Zhang X, Ren S, Sun J: Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770-778, 2016
32. Woo S, Park J, Lee J, Kweon IS: CBAM: Convolutional Block Attention Module. Proceedings of the European Conference on Computer Vision (ECCV) 3-19, 2018
33. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D: Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. Proceedings of the IEEE International Conference on Computer Vision (ICCV) 618-626, 2017
34. Youden WJ: Index for rating diagnostic tests. Cancer 3(1):32-35, 1950
35. Kwon G, Ryu J, Oh J, Lim J, Kang BK, Ahn C, et al: Deep learning algorithms for detecting and visualising intussusception on plain abdominal radiography in children: a retrospective multicenter study. Sci Rep 10(1):17582, 2020

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.