

Using BERT Models for Breast Cancer Diagnosis from Turkish Radiology
Reports

Article in Language Resources and Evaluation · June 2023


DOI: 10.1007/s10579-023-09669-w


4 authors, including:

Pinar Uskaner Hepsağ Selma Ayse Özel


Case Western Reserve University Cukurova University

Adnan Yazici
Nazarbayev University


Language Resources and Evaluation
https://doi.org/10.1007/s10579-023-09669-w

ORIGINAL PAPER

Using BERT models for breast cancer diagnosis from Turkish radiology reports

Pınar Uskaner Hepsağ1 · Selma Ayşe Özel2 · Kubilay Dalcı3 · Adnan Yazıcı4

Accepted: 16 May 2023


© The Author(s), under exclusive licence to Springer Nature B.V. 2023

Abstract
Diagnostic radiology is concerned with obtaining images of the internal organs
using radiological imaging procedures. These images are then interpreted by a diag-
nostic radiologist, who produces a textual report that assists in the diagnosis of ill-
ness or injury. Early detection of certain illnesses, particularly cancer, is critical, and
the reports produced by diagnostic radiologists play a key role in this process. To
develop models for the early detection of cancer, text classification techniques can
be applied to radiological reports. However, this process requires access to a data-
set of radiology reports, which is not widely available. Currently, radiology report
datasets exist for high-resource languages such as English and Dutch, but not for
low-resource languages such as Turkish. This article describes the collection of a
mammography report dataset for Turkish, consisting of 62 reports from real patients
that were manually labeled by an expert for diagnosing breast cancer. Classical machine
learning models, pre-trained BERT and DistilBERT models, and an ensemble learning approach
based on hard voting were applied to this dataset. Among the individual models, BERTTurkish
achieved the best performance, with a 91% F1-score. Hard Voting, which combined the results
of BERTTurkish, BERTClinical, and BERTMultilingual, achieved the highest F1-score of 93%.
The results show that BERT and Hard Voting outperform the other machine learning models for
breast cancer diagnosis from Turkish radiology reports.

Keywords Turkish dataset · Breast cancer · Contextualized word embeddings · Radiology reports · Machine learning

Selma Ayşe Özel, Kubilay Dalcı and Adnan Yazıcı have contributed equally to this work.

Extended author information available on the last page of the article


1 Introduction

Radiologists seek to identify critical features such as microcalcifications (MCs),
architectural distortions (ADs), and asymmetries as biomarkers of cancer or cancer
risk (Abdelrahman et al., 2021). Radiological reports are interpretations of diagnostic
imaging written by radiologists, who usually use complex medical terms and a limited
vocabulary. Radiology reports therefore contain a smaller vocabulary than other electronic
health records (Suárez-Paniagua et al., 2021). Because radiology reports are unstructured,
it is very difficult to extract meaningful data from them. Nevertheless, extracting
information from radiology reports is very important for helping physicians make more
accurate decisions when diagnosing breast cancer (Casey et al., 2021).
The literature contains many studies that address classification tasks on radiological
reports¹ in high-resource languages such as English and Dutch. Such work is far less
common for low-resource languages. Turkish is an agglutinative and morphologically rich
language in which grammar is expressed by means of suffixes added to nouns and verbs. As
far as we know, there is no corpus of labeled radiological reports in Turkish, and the lack
of such a dataset is a major shortcoming in this field. In this paper, we present the
first dataset of mammography reports in Turkish. This dataset was collected from
patients in the Department of General Surgery, Faculty of Medicine, Cukurova University².
We hope that this new dataset will enable future research on breast cancer classification
using Turkish radiology reports.
Advances in natural language processing (NLP) and machine learning have
made it possible to use large volumes of clinical text to help physicians and medical
researchers identify early symptoms of diseases (Grancharova & Dalianis, 2021).
Recently, NLP has been widely used to extract information from radiology reports
(Pons et al., 2016). Many studies apply NLP methods to text classification tasks, but
few address radiological report classification (Shin et al., 2017), and most of these
use clinical texts in English (Onan et al., 2017). Only a few studies apply NLP to
clinical text classification in Turkish (Arifoğlu et al., 2014; Parlak & Uysal, 2020),
and none of them aims to diagnose breast cancer.
Contextualized word embeddings such as BERT (Devlin et al., 2018) and its variations
DistilBERT (Sanh et al., 2019) and XLNet (Yang et al., 2019) have recently been
introduced and used in almost all NLP tasks, including text classification, and
show even better performance, especially in the medical domain (Lee et al., 2020;
Gu et al., 2021; Smit et al., 2020). These models were trained on large monolingual and
cross-lingual corpora, and the pre-trained models were then successfully applied to NLP
tasks in radiology (Suárez-Paniagua et al., 2021). However, to the best of our knowledge,
there is no recent study in which pre-trained models have been applied to breast cancer
prediction using Turkish radiology reports. We take advantage of the pre-trained models
to obtain baseline results for our newly collected radiology report dataset.

1 https://radreport.org/home.
2 Details of this dataset can be found in Sect. 5.2

Fig. 1  The proposed system’s overview

The quality of the training dataset has a significant impact on the success of
classification tasks (Bayer et al., 2022). Gathering data in the clinical domain is very
expensive and requires an expert for annotation. Because of these difficulties, our
radiology report dataset is very small; we therefore used augmentation methods to
increase its size, with the aim of achieving better classification results. We also used
ensemble learning because of its success in text classification. The main idea of
ensemble learning is to obtain more accurate results by combining the classification
results of different base classifiers (Kılınç, 2016). In the voting process, the results
of a set of different classifiers are obtained and the class label is determined by
majority voting (Zhu et al., 2016).
The main objective of this study is to collect a new dataset of Turkish radiology
reports to further investigate the problem of diagnosing breast cancer from
radiology reports in Turkish. In this work, we evaluate this new dataset by applying
classical machine learning models, namely support vector machines, random forest,
stochastic gradient descent, logistic regression, and naive Bayes, and by fine-tuning
BERT-based pre-trained models, namely BERTTurkish³, BERTClinical⁴, BERTMultilingual⁵,
and DistilBERTbase⁶, and we compare the classification performance of these models.
Moreover, we use voting ensembles to improve the classification results of the models.
An overview of the proposed model is given in Fig. 1.
The main contributions of this work can be summarised as follows:

• We built a new Turkish radiology report dataset obtained from real breast cancer
patients.
• We applied pre-trained models (BERT and DistilBERT) and classical machine
learning models to diagnose breast cancer from the collected dataset and
compared their performances.
• We applied data augmentation methods to the collected dataset because of its small
size and showed the effects of data augmentation on the classification performance of
the models.
• We also employed an ensemble learning method based on the pre-trained BERT
models to improve breast cancer classification results.
• We showed that using our new dataset with the adaptation of BERT for Turkish
results in satisfactory classification accuracy for breast cancer diagnosis from
radiological reports.

3 https://huggingface.co/dbmdz/bert-base-turkish-cased.
4 https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT.
5 https://huggingface.co/bert-base-multilingual-cased.
6 https://huggingface.co/distilbert-base-uncased.

This paper is organized as follows: Section 2 discusses related work on radiology
report classification. Section 3 describes the construction of the dataset. Section 4
explains the proposed model. Section 5 summarizes the evaluation metrics and
experimental setup and presents the dataset details and our experimental results.
Finally, Section 6 concludes the paper and outlines potential future goals.

2 Related work

This research work aimed to apply NLP and text classification techniques to mam-
mography reports written in Turkish for the classification of breast cancer as malig-
nant or benign. In this section, we present the most recent studies proposed for solv-
ing this problem.
Castro et al. (2017) focus on a rule-based BI-RADS Observation Kit (BROK)
algorithm and Bayesian networks to classify mammography reports. BROK uses
regular-expression-based string matching to obtain the reports’ BI-RADS category
and the laterality of the breast to which it is assigned, when possible. Breast imag-
ing-reporting and data system (BI-RADS) (Niknejad, 2022) is a risk assessment and
quality assurance tool developed by the American College of Radiology (Bell, 2020)
that provides a widely accepted lexicon and reporting schema for imaging of the
breast. They first evaluate the BROK algorithm to extract BI-RADS categories data
from mammography reports and then use this data as input to a Bayesian network to
categorize radiology reports as malignant and benign according to the probability
provided by the Bayesian network.
Casey et al. (2021) present a systematic review of publications that apply NLP to
radiology reports (2015–2019), following the preferred reporting items for systematic
reviews and meta-analysis (PRISMA) (Moher et al., 2015). Nguyen et al. (2020) develop a
hybrid system of a language model and a BI-RADS score classifier for automatically
summarizing Dutch radiology reports on breast cancer. BI-RADS applies to
mammography, ultrasound, and MRI. They use a bag-of-words approach: they build a TF-IDF
matrix from the radiology reports and use it as the input feature for machine learning
algorithms that classify the BI-RADS score. The SVM algorithm is found to be the most
successful of the compared algorithms, predicting the BI-RADS score with an accuracy
of 83.3%.
A recent study by Saib et al. (2020) presents a hierarchical CNN classification
method applied to the prediction of ICD-O codes for pathological breast cancer
reports. The performance of hierarchical CNN is compared with the performance of
multiclass CNN. The results show that a multiclass CNN has F1 macro and F1 micro
scores of 0.717 and 0.738, respectively, compared to a hierarchical CNN with F1
macro and F1 micro scores of 0.722 and 0.748. A recent study by Boroumandzadeh
and Parvinnia (2021) proposes an approach consisting of five main parts, namely
text report processing, feature extraction, feature engineering, prediction, and evalu-
ation for automatic extraction of BI-RADS classifications from text reports. In the
first part, the information is extracted from medical reports using a mammography
dictionary. In the second step, the combination of Word2Vec and TF-IDF is used to
generate a feature vector for each medical report based on the extracted information.
In the third step, the patient records are added to the feature vectors generated by
Word2Vec and TF-IDF, called HIS. In the fourth step, the resulting feature vectors
are classified using support vector machine (SVM), Naive Bayesian (NB), extreme
gradient boosting (XGBoost), and multilevel fuzzy min-max neural network (MLF)
classifiers. In the last step, the results of with-HIS are compared with the results
of without-HIS in terms of accuracy. The experimental results show that MLF is
more accurate than other algorithms. The accuracy of MLF for HIS is 89%, while
the accuracy of MLF without HIS is 85%.
Islam et al. (2020) compare five supervised machine learning techniques, namely
support vector machine (SVM), K-nearest neighbors, random forests, artificial neural
networks (ANNs), and logistic regression, for breast cancer detection using the
Wisconsin Breast Cancer Dataset. In their experiments, ANNs obtain the highest accuracy
of 98.57%, while random forests and logistic regression obtain the lowest accuracy of
95.7%. Gupta et al. (2021) propose a method to classify gene mutations based on the text
descriptions of these mutations for cancer prediction using NLP techniques. In this
study, the texts are transformed into a matrix of token counts using the Word2Vec,
TfidfVectorizer, and CountVectorizer transformation models, and this sparse matrix is
then classified using the logistic regression, random forest, XGBoost, and recurrent
neural network (RNN) models. The experimental results show that RNN outperforms all
other algorithms with 70% accuracy. Shin et al. (2017) propose a novel neural attention
mechanism in which convolutional neural networks (CNNs) with attention analysis are used
to classify radiological head computed tomography reports. Their experiments show that
CNN attention models perform better than non-neural models.
Several text classification and NLP techniques have been applied to the medical
domain, such as sentiment analysis (Özçift, 2020), radiology report classification
(Shin et al., 2017), and medical article classification (Parlak & Uysal, 2020).
Most of these text classification tasks use English texts (Çelıkten & Bulut, 2021),
and we can find only a few studies that analyze radiological reports with NLP methods
in a language other than English. One of them is Suárez-Paniagua et al. (2021), in which
the authors analyze radiology reports written in Spanish using a hybrid entity
recognition system and BERT. They demonstrate the performance of combining the hybrid
system with multiple BERT and Google Healthcare Natural Language API over the documents
translated into English with cross-language word matching, and this technique achieves
the best recall in the task. In another study, Magna et al. (2020) focus on breast
cancer diagnosis using free medical text. They apply word2vec and BERT to convert text
data into word vectors and compare the results with those obtained using TF-IDF. These
word vectors are then classified using machine learning and deep learning models.
Support vector machine (SVM), decision tree (DT), naïve Bayes (NB), K-nearest neighbor
(KNN), and random forest (RF) are used as machine learning algorithms, and
Multiplicative Long Short Term Memory (LSTM), Dense, and Bidirectional LSTM are used as
deep learning algorithms. The experimental results show that the combination of word2vec
and Bidirectional LSTM, with a macro F1 score of 98%, performs better than all others
in classifying breast cancer in the Spanish dataset. Grancharova and Dalianis (2021)
investigate the fine-tuning of a Swedish BERT model and a multilingual BERT model for
NERC on a Swedish corpus of electronic patient records. Their experimental results show
that the Swedish model is very successful, with a recall of 0.9220 and a precision
of 0.9226.
Medical text classification is an important research problem for Turkish because of
the complex morphological structure of the language and the presence of many
field-specific words in medical texts. Only a few studies in the medical field deal with
text classification in Turkish: Parlak and Uysal (2020), Arifoğlu et al. (2014), and
Çelıkten and Bulut (2021).
Parlak and Uysal (2020) present a comprehensive comparison of the classification of the
Turkish and English counterparts of the same abstracts published in Turkish medical
journals. They use unigram, bigram, and hybrid (the combination of unigram and bigram)
methods to extract features, together with seven classification algorithms. The best
results are obtained by combining unigram features with the distinguishing feature
selector (DFS) and the multinomial naïve Bayes (MNB) classifier for both datasets.
Arifoğlu et al. (2014) construct an international classification of diseases (ICD-
10-AM) coding system for patient records in Turkish hospitals to classify disease.
This task requires expert knowledge and is also error-prone, as human annotators
must consider thousands of possible codes when assigning the correct ICD label to
a document. They employ the bag-of-words approach using a Lucene search engine
and the Borda counting method and show that their results are successful in auto-
matically assigning ICD-10-AM codes to clinical findings.
Çelıkten and Bulut (2021) study the classification of Turkish medical texts using BERT
models. In this study, article abstracts belonging to 10 different disease groups and
two BERT-based models are used, with the aim of matching each abstract to the
appropriate disease category. They achieve F-scores of 0.82 and 0.93 for multilingual
BERT and BERTurk, respectively. The results show that the BERTurk model is more
successful than the other compared models for Turkish medical text classification.
In our work, we apply three different BERT models to classify radiology reports
in Turkish for diagnosing breast cancer. We also apply the hard voting method to
improve the classification results. For this purpose, we collect a dataset from a
Turkish hospital, which distinguishes our study from others. The classification success
of the models is evaluated on our collected corpus of Turkish radiology texts. The
classification performance of the BERT models is also compared with that of classical
machine learning methods. To our knowledge, ours is the first study that prepares and
evaluates a Turkish radiology report dataset for breast cancer diagnosis.

Fig. 2  The workflow of dataset collection

3 Collecting Turkish radiology report dataset

The main focus of this research is to collect a dataset of radiology reports from
patients with suspected breast cancer. To this end, we collected a dataset for the
classification of radiology reports with the approval of the Ethics Committee of
Çukurova University for non-interventional clinical research. Figure 2 shows the
workflow of data collection. At the beginning of data collection in the hospital, we
first obtained the mammography images of the patients who came to the Department of
General Surgery with breast complaints. Second, we collected the radiological reports
of these patients' mammograms from the information technology department of the
hospital. Using the patient numbers, we obtained the corresponding mammography images
in digital imaging and communications in medicine (DICOM) format, together with the
corresponding radiology reports, from the hospital’s picture archiving and
communication system (PACS). We obtained mammogram images and associated radiology
reports from only 62 patients. The mammogram images of these 62 patients were acquired
using the Fujifilm Amulet Innovality B-115 mammography machine at the Department of
Radiology, Faculty of Medicine, Çukurova University. Subsequently, the Department of
Information Technology (IT) removed all patient-identifiable information, such as
patient name, address, and date of birth, from both the images and the associated
radiology reports.

Fig. 3  An example radiology report

Fig. 4  The most common words obtained from (a) malignant radiology reports and (b) benign
radiology reports

Labeling of the radiology reports was done by medical experts. Radiologists
categorize radiological imaging results using a numbered system (Medical, 2022):
physicians use a standard system, the Breast Imaging Reporting and Data System
(BI-RADS), to describe the findings and results of mammograms, and this system sorts
the results into categories numbered 0 through 6. After the reports were collected, a
physician reviewed them and labeled each as benign or malignant according to the
BI-RADS categories. For this study, we use only the radiological reports as data. An
example radiological report from the dataset is shown in Fig. 3.
Our dataset consists of 62 radiology reports in total, of which 20 are labeled as
benign and 42 as malignant. To understand the characteristics of the dataset, we
extracted the 35 most common words from the benign and malignant reports, respectively.
Word clouds of the most common words in the benign and malignant reports are given in
Fig. 4. Some words, such as "meme" ("breast" in English) and "kitle" ("mass" in
English), are common to both benign and malignant reports.
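The word-frequency step described above can be sketched as follows. This is only a minimal illustration: the toy report snippets, the whitespace tokenizer, and the helper name most_common_words are assumptions made here and do not come from the dataset or from the authors' preprocessing.

```python
from collections import Counter

def most_common_words(reports, top_k=35):
    """Return the top_k most frequent lowercase tokens across a list of reports."""
    counts = Counter()
    for text in reports:
        # Naive whitespace tokenization (assumption); real reports would need
        # punctuation handling and a Turkish-aware tokenizer.
        counts.update(text.lower().split())
    return counts.most_common(top_k)

# Hypothetical toy snippets, not taken from the collected dataset.
benign = ["sağ meme normal", "sol meme kitle yok"]
malignant = ["sağ meme kitle mevcut", "meme kitle şüpheli"]

top_benign = most_common_words(benign, top_k=2)
top_malignant = most_common_words(malignant, top_k=2)
```

Words that rank highly in both lists, such as "meme" here, carry little class information on their own, which is one motivation for weighting schemes such as TF-IDF.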


4 Materials and methods

In this work, we employ classical machine learning methods and pre-trained BERT
models to classify Turkish radiology reports and compare their performance. In
recent years, contextualized word embeddings such as BERT (Devlin et al., 2018)
and DistilBERT (Sanh et al., 2019) have been used in almost every NLP task, including
text classification. In this study, we use pre-trained BERT and DistilBERT models to
classify breast cancer radiology reports. To improve on the results of the pre-trained
BERT models, we apply ensemble learning of classifiers with hard voting.
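Hard voting can be illustrated with a short sketch. The three prediction lists below are hypothetical stand-ins for the outputs of three classifiers, and the function names are chosen here for illustration only.

```python
from collections import Counter

def hard_vote(votes):
    """Return the majority label among the per-classifier votes for one sample."""
    label, _ = Counter(votes).most_common(1)[0]
    return label

def hard_vote_batch(per_model_predictions):
    """Combine predictions from several models (one list of labels per model)."""
    return [hard_vote(votes) for votes in zip(*per_model_predictions)]

# Hypothetical outputs of three classifiers on four reports.
model_a = ["malignant", "benign", "malignant", "benign"]
model_b = ["malignant", "malignant", "malignant", "benign"]
model_c = ["benign", "benign", "malignant", "malignant"]

ensemble = hard_vote_batch([model_a, model_b, model_c])
```

With an odd number of classifiers and two classes, as in this sketch, a tie cannot occur, so no tie-breaking rule is needed.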

4.1 Classical machine learning models

How to represent texts in a collection is one of the most critical questions in text
classification tasks. There are several techniques for converting texts into numerical
vectors. In this work, we use the term frequency-inverse document frequency (TF-
IDF) method to obtain features from our created radiology reports dataset and use
them for applying classical machine learning algorithms to classify breast cancer
radiology reports as malignant or benign.
Term frequency-inverse document frequency (TF-IDF) is one of the most com-
monly used methods in Text Mining. This method calculates the weights of words to
show how significant a word is to the document. The formula for TF-IDF is shown
in Eq. 1.
$$\mathrm{tfidf}(i, j) = \mathrm{tf}(i, j)\,\log\left(\frac{N}{\mathrm{df}_i}\right) \tag{1}$$
where i is a word, j is a document, tf(i, j) is the number of occurrences of word
i in document j, df_i is the number of documents containing word i, and N is the
total number of documents in the collection. After converting each report to a
numeric vector form by using the TF-IDF weighting method, the collected dataset
is classified by applying support vector machines, random forest, stochastic gradient
descent, logistic regression, and naive Bayes classifiers.
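Eq. 1 can be computed directly, as in the minimal sketch below. The natural logarithm, the pre-tokenized toy documents, and the function name tfidf_vectors are assumptions of this illustration; Eq. 1 does not fix the base of the logarithm.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute tfidf(i, j) = tf(i, j) * log(N / df_i) for tokenized documents."""
    n_docs = len(documents)
    # df_i: number of documents that contain word i at least once.
    df = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # tf(i, j): raw count of word i in document j
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors

# Hypothetical toy documents, already split into tokens.
docs = [["meme", "kitle", "kitle"], ["meme", "normal"]]
weights = tfidf_vectors(docs)
```

A word that occurs in every document, such as "meme" here, receives a weight of zero under Eq. 1. Library implementations may differ in detail; for example, scikit-learn's TfidfVectorizer applies a smoothed IDF and vector normalization by default.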

4.1.1 Support vector machines

Support vector machines (SVM) are a supervised learning technique that can be
used for classification and regression problems. SVM attempts to find the best linear
or nonlinear boundary to partition data into two or more classes (Agarwal, 2013).
The most commonly used SVM classifier is binary: it predicts which of two possible
classes a test sample belongs to. SVM tries to find the optimal linear separating
hyperplane for the classes in an N-dimensional space, where N is the number of
features. The optimal hyperplane is selected by maximizing the distance between the
class data and the boundary (Soui et al., 2021). The distance between the boundary and
the closest data points is defined as the margin, and the data points closest to the
boundary are called support vectors. These support vectors are used to determine the
boundary. When an SVM model classifies new unlabeled samples, each sample is classified
according to the side of the best hyperplane on which it falls.
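The decision rule of a trained linear SVM can be sketched as follows. The hyperplane parameters w and b are hypothetical values chosen for illustration; finding them (the actual training step) is not shown.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(w, b, x):
    """Assign the class by the side of the hyperplane on which x falls."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

# Hypothetical, already-trained hyperplane in a two-feature space.
w, b = [3.0, 4.0], -5.0

side_far = classify(w, b, [3.0, 4.0])     # positive side
side_origin = classify(w, b, [0.0, 0.0])  # negative side
```

The margin of a trained model corresponds to the smallest absolute value of signed_distance over the training samples, attained at the support vectors.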

4.1.2 Random forest

Random forest (RF) combines a set of decision trees into a forest (Ao et al.,
2019). It builds decision trees on different samples and takes their majority vote
for the classification task. Each decision tree is a set of rules based on the values
of the input features (Agarwal, 2013). Each tree is built from a bootstrap sample, that
is, a sample drawn with replacement from the original data, and uses a random subset of
the features. Each tree is trained independently and produces its own output. The final
output is based on a majority vote over the outputs of all trees.
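The bootstrap-and-vote idea can be sketched in a few lines. To stay short, this illustration replaces full decision trees with one-split "stumps" and picks a single random feature per tree; these simplifications, the toy data, and all function names are assumptions made here.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, feature):
    """One-split stand-in for a decision tree: threshold at the feature mean."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))  # random feature for this tree
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote over the outputs of all trees."""
    votes = [low if row[f] <= t else high for f, t, low, high in forest]
    return majority(votes)

# Hypothetical toy data with two features and two classes.
rows = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]]
labels = [0, 0, 1, 1]
forest = train_forest(rows, labels)
```

Individual stumps trained on an unlucky bootstrap sample may be poor, but the majority vote over many of them is what gives the ensemble its stability.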

4.1.3 Stochastic gradient descent

Stochastic gradient descent (SGD) is a variant of gradient descent that updates the
parameters using one part of the data at a time. As a classifier, SGD is a linear model
whose parameters are tuned to minimize the cost function. If the objective function can
be decomposed into multiple terms, SGD iterates based on the gradient of a single term
at a time. The gradient of the loss function is computed each time for a random sample
with a decreasing learning rate, which is faster than gradient descent, which considers
the entire dataset when tuning the parameters. SGD converges quickly when the samples
are similar (Faris et al., 2021).
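A minimal sketch of SGD for a linear classifier is given below. The logistic loss, the learning-rate schedule, the toy data, and the function name sgd_logistic are assumptions of this illustration and are not taken from the experimental setup.

```python
import math
import random

def sgd_logistic(samples, labels, epochs=50, lr0=0.5, seed=0):
    """Fit the weights of a linear classifier by SGD on the logistic loss."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    step = 0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)  # visit the samples in random order each epoch
        for i in order:
            step += 1
            lr = lr0 / (1.0 + 0.01 * step)  # decreasing learning rate (assumed schedule)
            z = sum(wj * xj for wj, xj in zip(w, samples[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - labels[i]  # gradient of the loss w.r.t. z for this one sample
            w = [wj - lr * error * xj for wj, xj in zip(w, samples[i])]
            b -= lr * error
    return w, b

# Hypothetical one-feature toy data: class 0 on the left, class 1 on the right.
X = [[0.0], [1.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
predictions = [1 if w[0] * x[0] + b > 0 else 0 for x in X]
```

Each update uses the gradient of a single sample, so one pass over the data performs as many parameter updates as there are samples, whereas full gradient descent would perform one.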

4.1.4 Logistic regression

Logistic regression (LR) is a supervised learning model for predicting the probability
that a data point belongs to one of two classes (binary classification) (Devarakonda
& Demmel, 2020). This model is used in many applications, such as disease risk
prediction, website click prediction, and fraud detection, where data often need to
be classified into two classes. The sigmoid function is used to predict probability
values: it maps the model output to a value between 0 and 1, and the values approach 0
or 1, where the class labels lie, as the output moves away from zero. Based on these
values, the target variable is assigned to one of the two classes.
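The role of the sigmoid function can be shown with a short sketch. The weights, the bias, and the 0.5 decision threshold are hypothetical choices made for illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical fitted parameters for a two-feature model.
p, label = predict_class([2.0, -1.0], 0.0, [1.0, 2.0])
```

In this example the score z is exactly zero, so the predicted probability sits on the decision threshold; scores further from zero give probabilities closer to 0 or 1.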

4.1.5 Naïve Bayes

The naive Bayes (NB) classification algorithm is one of the most common machine
learning algorithms and is based on Bayes' theorem. The NB classifier calculates the
posterior probability of each class for a given sample and assigns the sample to the
class with the maximum posterior probability. A frequency table is created to calculate
the probability for each feature. The frequency tables are then converted to
probability tables using Bayes' theorem. The result of the prediction is the class with
the highest probability. The NB classifier assumes that all predictors are independent.
It is very successful in medical data applications. The NB classifier has several
advantages: it is efficient to train, is not affected by irrelevant features, and can
handle real-valued and discrete data (Maysanjaya et al., 2018).

Fig. 5  The architecture of BERT (Devlin et al., 2018) for created Turkish radiology
report data
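The frequency-table procedure described for naive Bayes can be sketched as follows. This illustration uses a multinomial word-count model with Laplace (add-one) smoothing; both choices, the toy documents, and the function names are assumptions made here rather than details from the paper.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels):
    """Build per-class word frequency tables and class counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)
    vocab = {w for doc in documents for w in doc}
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest posterior (log) probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n / total_docs)  # log prior of the class
        for w in doc:
            # Add-one smoothing (assumption) avoids zero probabilities.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy documents, already split into tokens.
docs = [["kitle", "şüpheli"], ["kitle", "düzensiz"], ["normal", "meme"]]
labels = ["malignant", "malignant", "benign"]
model = train_nb(docs, labels)
```

Summing log probabilities instead of multiplying probabilities is a standard way to avoid numerical underflow; it does not change which class obtains the highest score.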

4.2 BERT models

In the NLP field, a large amount of training data is needed to work with
deep-learning-based models. Bidirectional encoder representations from transformers
(BERT) is a major turning point in the field of NLP because it addresses the problem of
small training data through transfer learning. Devlin et al. (2018) used two
unsupervised techniques to train BERT, namely masked language modeling (MLM) and next
sentence prediction (NSP). In MLM, words in the sentence are randomly masked and the
model tries to predict them. In NSP, given a first sentence, the model predicts whether
a candidate second sentence is likely to follow it. BERT uses both directions and the
full context of the sentences to predict masked words. The overall architecture of the
BERT (Devlin et al., 2018) model is given in Fig. 5. The basic BERT model contains an
encoder with 12 Transformer blocks, 12 self-attention heads, and a hidden size of 768.
The network takes as input a sequence of no more than 512 tokens and outputs the
representation of the sequence. The sequence consists of one or two segments, where
the first token of the sequence is always [CLS], which contains the special
classification embedding, and another special token, [SEP], is used to separate the
segments (Tokgoz et al., 2021). The bidirectionality of the BERT model distinguishes it
from previous language models.
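The input layout described above can be illustrated with a small sketch. It works on word-level tokens for readability, whereas BERT operates on subword tokens; the truncation strategy and the function name build_bert_input are assumptions of this illustration.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=512):
    """Arrange tokens as [CLS] A [SEP] (B [SEP]) and truncate to max_len tokens."""
    tokens_a = list(tokens_a)
    tokens_b = list(tokens_b) if tokens_b else []
    specials = 3 if tokens_b else 2
    # Drop tokens from the end of the longer segment until everything fits.
    while len(tokens_a) + len(tokens_b) + specials > max_len:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()
    sequence = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(sequence)
    if tokens_b:
        sequence += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return sequence, segment_ids

# Hypothetical two-segment input and an over-long single-segment input.
seq, seg = build_bert_input(["sağ", "meme"], ["kitle", "yok"])
long_seq, _ = build_bert_input(["tok"] * 1000)
```

For single-report classification only the first segment is needed, so the sequence reduces to [CLS], the report tokens, and one [SEP].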
Pre-trained BERT models are publicly available for various languages, such as Turkish
and Chinese, as well as in multilingual versions. In this study, we evaluated and
compared three different BERT-based models, namely BERTClinical, BERTTurkish, and
BERTMultilingual, to show the performance of the models on our dataset. For the
radiology report classification task, we used the final hidden state h of the first
token [CLS] as the representation of the whole sequence. To achieve classification, we
retain only the embeddings of the [CLS] token and add a linear layer to BERT to reduce
the dimensionality to match the number of labels. These embeddings are passed through
another linear layer, whose output helps us identify the predicted class. Our aim is to
obtain baseline results from the models, and for this reason we follow the common
practice in the literature of using BERT for classification with the [CLS] token.
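The use of the [CLS] representation for classification can be sketched with plain Python lists. The toy hidden states, the single linear layer, the softmax, and all names below are assumptions made for illustration; the sketch does not load a BERT model and is not the authors' implementation.

```python
import math

def linear(vector, weight, bias):
    """Apply y = W x + b, with W given as one row of weights per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_from_cls(hidden_states, weight, bias):
    """Use the final hidden state of the first token ([CLS]) as the sequence vector."""
    cls_vector = hidden_states[0]
    probs = softmax(linear(cls_vector, weight, bias))
    return probs.index(max(probs)), probs

# Toy 4-dimensional hidden states for a 3-token sequence (BERT-base uses 768).
hidden = [[1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0]]
W = [[1.0, 0.0, 0.0, 0.0],  # hypothetical weights for label 0
     [0.0, 0.0, 0.0, 1.0]]  # hypothetical weights for label 1
b = [0.0, 0.0]
label, probs = classify_from_cls(hidden, W, b)
```

Only the first row of hidden enters the classifier; the hidden states of the remaining tokens are ignored at this stage, which mirrors the practice of classifying with the [CLS] token.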

4.2.1 Multilingual BERT model

The multilingual BERT model was introduced by Devlin et al. (2018). It is pre-trained
on 104 languages, including Turkish. The training data consists of Wikipedia articles
in each language. The model does not use markers to indicate the input language and has
no explicit mechanism to ensure that translation-equivalent pairs have similar
representations. The multilingual BERT model has two versions, cased and uncased. In
this study, the cased (case-sensitive) version was used, and the text data was not
converted to lowercase.

4.2.2 Turkish BERT model

We also used the Turkish BERT (digital library, 2020) model to classify Turkish
radiology reports. The Turkish BERT model is the result of training with the Turk-
ish Wikipedia corpus and was released by the MDZ Digital Library team. The cur-
rent version of the model was trained on a filtered and sentence-segmented version
of the Turkish OSCAR corpus, a recent Wikipedia dump, various OPUS corpora,
and a special corpus.

4.2.3 Clinical BERT model

ClinicalBERT is a modified BERT model: the representations are learned from
medical notes and further processed for downstream clinical tasks.
Clinical BERT was published by Alsentzer et al. (2019). The quality of
learned text representations depends on the text the model was trained on. The
original BERT is pre-trained on BooksCorpus and Wikipedia, whereas Clinical BERT
was specifically pre-trained for the clinical domain using MIMIC notes, which are
English medical texts. We used it to investigate whether employing a
domain-specific model improves the experimental results. Although Clinical BERT
was not trained on Turkish or multilingual text, it is worth noting that many
Turkish clinical terms share Latin roots with their English equivalents.

4.3 DistilBERT

BERT (Devlin et al., 2018) is a very large and memory-hungry model that is slow
in the training and testing phases. Therefore, DistilBERT (Sanh et al., 2019) was
proposed as a 'distilled' version of BERT that is smaller and faster while
retaining most of BERT's accuracy. The general architecture is the same as
that of BERT, except that the token-type embeddings and the pooler have been removed
and the number of layers has been reduced by a factor of 2, which has a significant
impact on computational efficiency. The model was distilled on very large
batches with dynamic masking and without the next sentence prediction (NSP)
objective (Sanh et al., 2019; VERMA, 2021). A percentage of the input tokens is
masked at random (replaced with the [MASK] token), and the model tries to predict
these masked tokens. Masking here refers to replacing a word to be predicted in
the masked language model with [MASK] and training the model to predict that word
from the rest of the sequence. The architecture of DistilBERT is given in Fig. 6. As
can be seen in Fig. 6, the architecture consists of a student and a teacher
network, and each network contains encoders (blocks in which attention,
normalization, a feed-forward network (FFN), and normalization are stacked on top
of each other). A loss is computed between the attention matrices generated by the
multi-head attention (MHA) of the teacher and those of the student, and training
reduces this loss. Similarly, a loss between the hidden-state outputs of the
encoder stacks in the student and teacher networks is reduced.

Fig. 6  The architecture of DistilBERT (Sanh et al., 2019) used for Turkish radiology report data
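To make the teacher and student idea concrete, the following sketch shows a generic soft-target distillation loss on output logits. Note that this is a swapped-in illustration: the text above describes losses on attention matrices and hidden states, whereas this example uses the Kullback-Leibler divergence between temperature-softened output distributions. The function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher incurs (numerically) zero loss.
matching = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# A student that disagrees with the teacher incurs a positive loss.
disagreeing = distillation_loss([3.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```

Minimizing such a loss pushes the smaller student network toward the behavior of the larger teacher network.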


Fig. 7  Hard Voting

4.4 Ensemble classifier

The voting classifier is one of the most powerful ensemble classifiers (Kumar,
2020). It combines the outputs of different classification models and chooses the
most frequently predicted label as the final prediction (Delgado, 2021). There are
two types of voting classifiers, namely hard voting and soft voting. While hard
voting classifies data based on the class labels predicted by each model, soft
voting classifies data based on the probabilities predicted by each model. In this
study, we used a hard voting classifier to classify radiological reports according
to the label predictions made by the BERT models. Figure 7 shows the hard voting
process. Equation 2 shows the formula for the computation of hard voting.
$$\text{prediction} = \text{majority\_prediction}(\text{prediction}_1, \text{prediction}_2, \dots, \text{prediction}_m) \tag{2}$$

where $\text{prediction}_i$ is the prediction obtained from the $i$th classifier, and $m$
is the number of models.
In classification tasks, predictions are affected by bias, variance, and noise.
Therefore, ensemble models are used to counteract these drawbacks. In this study,
we tried ensembles of different combinations of BERT models, such as
(DistilBERTTurkish, BERTClinical, and BERTMultilingual) and (DistilBERTTurkish,
BERTTurkish, and BERTMultilingual), and then selected the combination with the best
results. Thus, we used the predictions of the BERT models in the hard voting.
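Hard voting as defined in Eq. 2 can be sketched in a few lines. The helper names and the toy label sequences are hypothetical; "M" and "B" stand for malignant and benign, and an odd number of models is assumed so that binary votes cannot tie.

```python
from collections import Counter

def hard_vote(votes):
    """Return the label predicted by the majority of the models."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(per_model_predictions):
    """Combine per-model label lists into one majority label per report."""
    return [hard_vote(votes) for votes in zip(*per_model_predictions)]

# Toy predictions of three models for three reports (assumed values).
model_a = ["M", "B", "M"]
model_b = ["M", "M", "B"]
model_c = ["B", "B", "B"]

final_labels = ensemble_predict([model_a, model_b, model_c])
```

Each position in `final_labels` is the label that at least two of the three models agreed on for that report.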

5 Experimental evaluation and results

5.1 Evaluation metrics

We use the accuracy, precision, recall, and F1-score metrics (Han et al., 2022) to
evaluate the classification performance of radiology reports for the prediction of
breast cancer.
Accuracy, shown in Eq. 3, is the ratio of correctly predicted samples to total
samples.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
where TP is the number of samples in the positive class (class = Malignant) that
were correctly predicted, TN is the number of samples in the negative class
(class = Benign) that were correctly predicted, FP is the number of negative-class
instances that were predicted as positive, and FN is the number of positive-class
instances that were predicted as negative.
Precision, shown in Eq. 4, is the ratio of correctly predicted positive samples to
total predicted positive samples.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
Recall, shown in Eq. 5, is the ratio of correctly predicted positive samples to all
samples in the actual positive class.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$
F1-Score, shown in Eq. 6, is the harmonic mean of Precision and Recall.

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Recall} + \text{Precision}} \tag{6}$$
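Equations 3 to 6 translate directly into code. The sketch below is illustrative only: the function name is hypothetical and the confusion-matrix counts are made-up values, not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score (Eqs. 3-6)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for one test fold (malignant = positive class).
metrics = classification_metrics(tp=6, tn=3, fp=1, fn=1)
```

Because precision and recall are equal in this toy case, the harmonic mean in Eq. 6 returns that same value as the F1-score.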

5.2 Implementation details

Data Pre-processing: Preprocessing methods are applied to improve data quality
when classical machine learning algorithms are used for text classification. In this
study, we apply the following preprocessing steps to the textual reports using the
Natural Language Toolkit (NLTK) (Wang & Hu, 2021), a Python package that contains
important preprocessing functions.

• Lowercase conversion: All uppercase letters are converted to lowercase;
otherwise, uppercase and lowercase versions of the same word are treated as
different words. For example, “Kitle” and “kitle”, which mean “mass” in English,
would be considered different even though they are the same word. Therefore,
converting each word to lowercase is one of the most important preprocessing steps.
• Tokenization: This is the process of separating a document into its individual
words. In this step, each text is broken down into a list of individual words using
the NLTK library. For example, the text “cilt altı yağ dokusu ve meme başları doğaldır”
can be tokenized into “cilt”, “altı”, “yağ”, “dokusu”, “ve”, “meme”, “başları”,
“doğaldır”.
• Stop word elimination: Stop words are words that are used frequently in a lan-
guage (e.g., “a”, “this”, “that”, etc.). One of the basic preprocessing steps is
to remove stop words because there are many words in Turkish that we repeat
frequently in our daily life and they do not add value to the document for the
classification process. Therefore, we first obtain a list of stop words for Turkish
(Aksoy & Öztürk, 2018) and then remove them to obtain the less frequently used
words that are more important for classification.
• Converting numbers to words: The textual documents used in our work contain
numeric values expressed with both digits and letters. For example, some reports
write the number 8 with a digit (“8”), while others write it as a word (“eight”).
Both forms have the same meaning, but after tokenization “8” and “eight” are stored
as different tokens. To solve this problem, we convert all numbers written in
digits into their equivalent word forms.
• Punctuation elimination: Text documents contain various symbols that separate
sentences. These symbols are unnecessary for text classification, so all
punctuation marks are removed from the documents to improve the overall
performance of the classification algorithms. For example, after removing
punctuation marks from “Cilt, cilt altı yağ dokusu ve meme başları doğaldır.
Lipomatö parankim mevcuttur”, we obtain “Cilt cilt altı yağ dokusu ve
meme başları doğaldır Lipomatö parankim mevcuttur”.
• Stemming: Stemming is the reduction of words to their word stems, such as
“boyutunda” and “boyutta”, which are all based on the root word “boyut”. Zem-
berek (Akın & Akın, 2007), a natural language processing tool for Turkish, is
used for stemming in this research. The Turkish stemmer removes the suffixes/
affixes from words.

Preprocessing applied to an example sentence is given in Table 1.
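The first five steps above can be sketched with the Python standard library alone. This is not the authors' NLTK and Zemberek pipeline: the stop-word set is a tiny hypothetical sample, only single digits are converted to words, a regular expression stands in for the NLTK tokenizer, and stemming is omitted because it requires a Turkish morphological analyzer.

```python
import re
import string

# Hypothetical mini stop-word list; the paper uses a full Turkish list.
TURKISH_STOP_WORDS = {"ve", "bazı", "bu", "ile"}
# Single digits only; multi-digit numbers are left unchanged in this sketch.
DIGIT_WORDS = {"0": "sıfır", "1": "bir", "2": "iki", "3": "üç", "4": "dört",
               "5": "beş", "6": "altı", "7": "yedi", "8": "sekiz", "9": "dokuz"}

def turkish_lower(text):
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("I", "ı").replace("İ", "i").lower()

def preprocess(text):
    text = turkish_lower(text)                                   # lowercase conversion
    tokens = re.findall(r"\w+|[^\w\s]", text)                    # tokenization
    tokens = [t for t in tokens if t not in TURKISH_STOP_WORDS]  # stop word elimination
    tokens = [DIGIT_WORDS.get(t, t) for t in tokens]             # numbers to words
    return [t for t in tokens if t not in string.punctuation]    # punctuation elimination

tokens = preprocess("Cilt, cilt altı yağ dokusu ve meme başları doğaldır.")
```

The explicit `turkish_lower` helper is needed because Python's default `str.lower` maps "I" to "i", whereas Turkish distinguishes dotted and dotless i.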

Table 1  Output of preprocessing steps for the sentence "⋯ 9 mm çapında iki adet internal ekojenitesi izlenen bazı Hipoekoik lezyonlar izlenmektedir"

State | Sentence
Original | ⋯ 9 mm çapında iki adet internal ekojenitesi izlenen bazı Hipoekoik lezyonlar izlenmektedir.
Lowercase | ⋯ 9 mm çapında iki adet internal ekojenitesi izlenen bazı hipoekoik lezyonlar izlenmektedir.
Tokenization | ⋯ 9 mm çapında iki adet internal ekojenitesi izlenen bazı hipoekoik lezyonlar izlenmektedir .
Stop word elimination | ⋯ 9 mm çapında iki adet internal ekojenitesi izlenen hipoekoik lezyonlar izlenmektedir .
Converting numbers to words | ⋯ dokuz mm çapında iki adet internal ekojenitesi izlenen hipoekoik lezyonlar izlenmektedir .
Punctuation elimination | ⋯ dokuz mm çapında adet internal ekojenitesi izlenen hipoekoik lezyonlar izlenmektedir
Stemming | ⋯ dokuz mm çap adet internal ekojenite izle hipoekoik lezyon izle

K-fold cross-validation: Since our dataset is small, we apply five-fold
cross-validation (Han et al., 2022): we divide our data into five distinct subsets,
and in each of the five experiments we use four subsets as training data and leave
the remaining subset as test data. The average accuracy, precision, recall, and
F-measure values over the five experiments indicate the validity of the
classification model.
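The five-fold procedure can be sketched as follows. The function name and seed are hypothetical, and the split is a plain shuffled split; the text does not state whether the folds were stratified by class, so no stratification is assumed here.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 62 reports, as in the collected dataset.
splits = list(k_fold_indices(62, k=5))
fold_sizes = [len(test) for _, test in splits]
```

Every report appears in exactly one test fold, so averaging the metrics over the five experiments uses each report for testing once.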
Data augmentation and class distribution balancing: Our collected data includes
62 reports in total, of which 20 are labeled as benign and the remaining 42 as
malignant. The dataset is therefore small, and the number of benign texts is much
lower than the number of malignant texts. The size of the data is important for
achieving high accuracy in text classification, but collecting data such as
medical reports is usually tedious. To address the small-data problem, we use a
simple augmentation method for textual data, namely easy data augmentation (EDA).
EDA (Wei & Zou, 2019) augments texts with four operations: random insertion,
random deletion, random swap, and synonym replacement. In random insertion, a
random word that is not a stop word is selected, and a synonym of that word is
inserted at a random position in the text; this is repeated n times, where n is
the number of insertions for each text. In random deletion, each word is removed
with a given probability. In random swap, two words are randomly selected and
their positions are swapped. In synonym replacement, n words that are not stop
words are randomly selected and each is replaced with one of its synonyms.
Turkish WordNet (Çetinoğlu et al., 2018) is used as the synonym dictionary for
synonym replacement and random insertion. These four operations are applied to
each radiology report in the original dataset to increase the number of samples.
Using these methods, 10 new report samples were obtained from each benign and each
malignant report. Augmentation is applied only to the training data, after the
data has been divided into training and test sets for each fold. Figure 8 shows an
original radiology report and the augmented reports produced by the random
insertion, deletion, swap, and synonym replacement operations of EDA. After
augmentation, we obtain 160 benign and 350 malignant reports, as shown in Table 2.
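The four EDA operations can be sketched on tokenized text as shown below. This is a simplified illustration, not the reference EDA code: the synonym dictionary holds two toy entries that are not taken from Turkish WordNet, the stop-word set is a hypothetical sample, and all function names are assumptions.

```python
import random

# Toy resources (assumptions); the paper uses Turkish WordNet and a full stop list.
SYNONYMS = {"kitle": ["yığın"], "lezyon": ["yara"]}
STOP_WORDS = {"ve", "bir"}

def _replaceable(words):
    return [i for i, w in enumerate(words) if w not in STOP_WORDS and w in SYNONYMS]

def synonym_replacement(words, n, rng):
    out = list(words)
    positions = _replaceable(out)
    rng.shuffle(positions)
    for i in positions[:n]:  # replace up to n non-stop words
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    out = list(words)
    sources = [words[i] for i in _replaceable(words)]
    for _ in range(n if sources else 0):
        synonym = rng.choice(SYNONYMS[rng.choice(sources)])
        out.insert(rng.randint(0, len(out)), synonym)  # any position, including the end
    return out

def random_swap(words, n, rng):
    out = list(words)
    for _ in range(n if len(out) >= 2 else 0):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]  # drop each word with probability p
    return kept if kept else [rng.choice(words)]   # never return an empty text

rng = random.Random(0)
report = ["kitle", "ve", "lezyon", "izlenmektedir"]
replaced = synonym_replacement(report, 2, rng)
inserted = random_insertion(report, 2, rng)
swapped = random_swap(report, 1, rng)
deleted = random_deletion(report, 0.5, rng)
```

In the setting described above, such operations would be applied to the training reports of each fold only, so that augmented copies of a test report never reach the training set.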

Fig. 8  The original radiology report with augmented reports

Table 2  Number of instances in the original and augmented datasets

Class | Original dataset: # of reports in training | Original dataset: # of reports in test | Augmented dataset with EDA: # of reports in training | Augmented dataset with EDA: # of reports in test
Benign | 16 | 4 | 160 | 4
Malignant | 35 | 7 | 350 | 7
Total | 51 | 11 | 510 | 11

Table 3  The results of classical machine learning algorithms on the pure dataset

Weighted average/TF-IDF | F1-score
Multinomial Naïve Bayes | 0.65
SVM | 0.70
SGD | 0.71
LogisticRegression | 0.70
RandomForest | 0.62

The highest score in Table 3 is 0.71 (SGD).

Since the two classes do not contain the same number of instances in either the
original or the augmented dataset, these datasets are imbalanced, which may have a
negative impact on the classification of the minority class. We applied the
weighted random sampling technique (Efraimidis, 2015) to overcome this challenge.
In weighted random sampling, each item has an associated weight, and the
probability of each item being selected is determined by the item weights. After
applying weighted random sampling to the minority class, we obtain 350 malignant
and 350 benign reports in the training data of each fold of the five-fold
cross-validation.
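One common way to realize weighted random sampling is to give every item a weight inversely proportional to the size of its class and then draw with replacement. The sketch below follows that assumption; the function names and seed are hypothetical, and unlike the exact 350/350 split reported above, this variant yields classes that are only approximately balanced.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each item by 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

def weighted_resample(items, labels, k, seed=0):
    """Draw k items with replacement, with probability proportional to weight."""
    weights = inverse_frequency_weights(labels)
    rng = random.Random(seed)
    picked = rng.choices(range(len(items)), weights=weights, k=k)
    return [items[i] for i in picked], [labels[i] for i in picked]

# Class sizes of the augmented training data in Table 2.
labels = ["benign"] * 160 + ["malignant"] * 350
items = list(range(len(labels)))

sampled_items, sampled_labels = weighted_resample(items, labels, k=700)
class_counts = Counter(sampled_labels)
```

With these weights the total selection probability of each class is the same, so minority-class reports are drawn more often per item than majority-class reports.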
Table 4  The results of classical machine learning algorithms on the augmented and balanced dataset

Weighted average/TF-IDF | F1-score
Multinomial Naïve Bayes | 0.79
SVM | 0.77
SGD | 0.76
LogisticRegression | 0.78
RandomForest | 0.74

The highest score in Table 4 is 0.79 (Multinomial Naïve Bayes).

Parameter settings: For text classification problems, the AdamW optimizer
(Loshchilov & Hutter, 2017) is often used with BERT models and provides good
generalization performance. Therefore, we use the AdamW optimizer in the
experiments. The binary cross-entropy loss function calculates the cross-entropy
between the actual value and the value estimated by the model and is used in
binary classification problems. Since our problem is binary text classification,
we use the binary cross-entropy loss function in our study. We apply Optuna
(Akiba et al., 2019) for hyperparameter tuning. During training, the learning rate
is set to 2e-5 with a weight decay of 1e-8. We also use the sigmoid as the
activation function in the output layer and set the patience to 2. For the
classical classifiers (SVM, RF, NB, SGD, LR), we test several parameter values and
choose the ones that give the best classification performance.
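The combination of a sigmoid output and the binary cross-entropy loss mentioned in the parameter settings can be written out explicitly. The function names and the clipping constant `eps` are assumptions made for this sketch; the label 1 is taken to denote the positive (malignant) class.

```python
import math

def sigmoid(z):
    """Map a raw output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for one sample; y_true is 0 or 1, y_prob is the predicted P(y = 1)."""
    p = min(max(y_prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# An undecided model (logit 0, probability 0.5) on a malignant report.
undecided_loss = binary_cross_entropy(1, sigmoid(0.0))
# Confident and correct versus confident and wrong.
good_loss = binary_cross_entropy(1, 0.9)
bad_loss = binary_cross_entropy(1, 0.1)
```

The loss grows as the predicted probability moves away from the true label, which is what the optimizer reduces during fine-tuning.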

5.3 Experimental results

5.3.1 Experimental results of classical machine learning methods

As baseline results, we first apply classical machine learning models to the pure
dataset. The results obtained with machine learning algorithms on the pure
dataset, without augmentation or weighted random sampling, are presented in
Table 3. As can be seen in Table 3, SGD is the best algorithm with an F1-score of
71%, while Random Forest is the worst with 62%. SVM and Logistic Regression
provide the second-best F1-score of 70%, and Multinomial Naive Bayes is the
second-worst algorithm with 65%.
We also apply classical machine learning models to the augmented and balanced
dataset obtained with the EDA and weighted random sampling methods described
in Sect. 5.2. The results are given in Table 4. When comparing the results of the
machine learning algorithms on the augmented and balanced dataset, we can see
that Multinomial Naive Bayes is the best algorithm in terms of accuracy,
precision, recall, and F1-score, with the best F1-score of 79%. Logistic
Regression gives the second-best F1-score of 78%, and Random Forest achieves the
worst F1-score of 74%. From the results on the data enriched with the EDA and
sampling methods, we can conclude that Multinomial Naive Bayes diagnoses breast
cancer with the highest F1-score. In other words, Multinomial Naive Bayes predicts
most malignant findings as malignant, while Random Forest predicts most malignant
findings as benign. Based on these results, we can say that Multinomial Naive
Bayes is more successful than Random Forest in the early detection of breast
cancer. The use of Logistic Regression also improves the detection of benign
findings in the collected dataset. We further observe that the number of false
positives is lowest for Multinomial Naive Bayes and highest for SGD and Random
Forest. That is, most benign findings are predicted as benign by Multinomial
Naive Bayes, whereas more of them are predicted as malignant by SGD and Random
Forest, which may lead to unnecessary biopsy procedures. Therefore, by using
Multinomial Naive Bayes for the collected dataset, we can reduce unnecessary
breast biopsies.
Comparing the results in Tables 3 and 4, we can say that the results on the
enriched dataset are higher than those on the pure dataset. Thereby, unnecessary
breast biopsies can be identified more accurately with Multinomial Naive Bayes,
while the use of Random Forest can lead to a lower success rate in reducing
unnecessary breast biopsies. These results show that augmenting and balancing the
dataset have a positive impact on reducing unnecessary breast biopsies.

Table 5  The results of BERT models and hard voting on the pure dataset

Model | F1-score
DistilBERTTurkish | 0.65
BERTTurkish | 0.73
BERTClinical | 0.70
BERTMultilingual | 0.92
Hard voting | 0.90

The highest score in Table 5 is 0.92 (BERTMultilingual).

Table 6  The results of BERT models and hard voting on the augmented and balanced dataset

Model | F1-score
DistilBERTTurkish | 0.69
BERTTurkish | 0.89
BERTClinical | 0.87
BERTMultilingual | 0.92
Hard voting | 0.91

The highest score in Table 6 is 0.92 (BERTMultilingual).

5.3.2 Experimental results of BERT models

Table 5 shows the results of the BERT models on the pure dataset. If we compare
the three versions of BERT, we can see that BERTClinical has the worst results
while BERTMultilingual has the best results. Overall, DistilBERTTurkish has the
worst results except for the precision score. Hard Voting is performed to improve
the classification results; however, we could not get better results with Hard
Voting than with BERTMultilingual. Consequently, BERTMultilingual is the most
successful model compared to the others. Moreover, comparing Table 5 with Table 3,
we see that the results of the BERT models on the pure dataset are generally
better than those of the classical machine learning algorithms.
We also apply contextualized word embeddings (the BERT versions and DistilBERT)
to our augmented and balanced dataset, and the results are shown in Table 6.
Comparing the three versions of BERT, we find that the performance of the
BERTClinical model is the worst, while BERTMultilingual has the best results in
terms of F1-score. According to our experiment, BERTMultilingual outperforms
BERTTurkish. A likely reason is that BERTMultilingual was trained on a vast dataset
of 104 languages, whereas BERTTurkish was trained solely on Turkish text with
limited data. As Wu and Dredze (2020) point out, BERT cannot generate high-quality
representations for low-resource languages, which necessitates gathering more data
to transform them into high-resource languages. Our study suggests that
BERTTurkish yields poorer results due to its small monolingual corpus.
When we compare BERTClinical with the other BERT versions, we find that its
outcomes are the least desirable. This is because it was trained only on English
medical data, which is another small corpus. According to these results, the
numbers of false positives and false negatives are lowest for BERTMultilingual and
highest for BERTClinical. The enrichment of the collected data with the EDA method
may also affect the Turkish words used in the augmented reports, which could
explain why the F1-score of BERTTurkish is lower than that of BERTMultilingual.
These results show that BERTClinical and BERTTurkish are more successful in
classifying malignant reports with EDA enhancement, while BERTMultilingual is the
most successful in classifying benign reports. Thus, we can conclude that using
the EDA approach to expand the collected Turkish dataset has an impact on the
number of benign and malignant reports classified correctly, depending on the
domain. We also test DistilBERTTurkish with the EDA-enriched and balanced dataset.
As can be seen in Table 6, DistilBERTTurkish gives the worst results.

Table 7  The results of BERT models and hard voting on the augmented and balanced dataset using self-copy

Model | F1-score
DistilBERTTurkish | 0.86
BERTTurkish | 0.91
BERTClinical | 0.86
BERTMultilingual | 0.87
Hard voting | 0.93

The highest score in Table 7 is 0.93 (Hard voting).
In addition to these BERT models, we also use the ensemble method of Hard
Voting to improve breast cancer detection. However, we could not achieve better
results with Hard Voting than with BERTMultilingual. According to these classifica-
tion scores, we can say that BERTMultilingual is the most successful one in classifying
benign reports as benign and malignant reports as malignant.
If we compare Table 5 with Table 6, we see that the results on the pure dataset are
lower than or equal to the results on the EDA-enriched and balanced dataset. This
shows that the EDA augmentation and balancing methods increase the success of the
BERT models. Also, when comparing the BERT models with the baseline models, as
can be seen in Tables 4 and 6, the F1-score of BERTMultilingual (0.92) is higher
than that of Multinomial Naive Bayes (0.79). The overall results show that using
the BERTMultilingual model improves the classification performance of predicting
benign cases from radiological reports.

Fig. 9  Confusion matrices obtained from the augmented and balanced dataset by using (a) Turkish BERT
model, (b) Clinical BERT model, (c) Multilingual BERT model, (d) DistilBERT model, and (e) Hard voting

In our study, as can be seen in Table 7, we also experimented with data
augmentation by duplicating the reports (self-copy), and observed that the best
classification performance among the individual models was obtained from
BERTTurkish, compared to the DistilBERTTurkish, BERTClinical, and BERTMultilingual
models. The superior performance of BERTTurkish could be attributed to its
training on Turkish language texts, which was useful in classifying the collected
Turkish report dataset. While BERTMultilingual produced the second-best score,
BERTClinical performed the worst, possibly because it was trained on English
medical documents. Though medical terms in English and Turkish are similar, using
BERTClinical for classifying Turkish radiology reports was not suitable.
DistilBERTTurkish obtained the same F1-score as BERTClinical.

Our results indicate that using a Turkish-trained BERT model enhances the
performance of the classifier in classifying Turkish radiology reports. This, in
turn, can reduce unnecessary biopsy operations that may cause anxiety in patients
who do not have breast cancer. Additionally, BERTTurkish can help to detect breast
cancer more accurately based on the collected data, reducing the number of
incorrect findings by radiologists based on radiological images. We also
implemented hard voting on the collected Turkish radiological reports and achieved
an F1-score of 93%. This shows that the Hard Voting model improved on the best
single model in this setting (BERTTurkish, 91%).
The confusion matrices for the three versions of BERT, DistilBERT, and Hard Voting
on the EDA-enriched and balanced dataset are shown in Fig. 9. As can be seen from
the figure, the number of true positives is the lowest and the number of false
negatives is the highest for the BERTClinical model. This result shows that the
BERTTurkish, BERTMultilingual, and DistilBERTTurkish models classify the malignant
reports more correctly. On the other hand, the number of true negatives is the
lowest for the DistilBERTTurkish model, while the number of false negatives is the
lowest for the BERTMultilingual model. Based on these results, we can say that
benign reports are classified more correctly with the BERTMultilingual model than
with the other BERT models. In addition, we can say that the number of
false-positive reports increases while the number of true-negative reports
decreases when we use Hard Voting. When we compare the experimental results on the
augmented and balanced data with those on the pure data, it is clear that using
the pure data decreases the classification results of the pre-trained BERT models.
In the data augmentation and balancing method, the data to be included in the test
part are first separated from the whole dataset, and the remaining samples are used
as the training set for augmentation and balancing. In other words, the samples in
the train and test sets are disjoint. On the other hand, when we check the Machine
Learning algorithm results for this augmented and balanced dataset with that of the
pure dataset, we get higher scores for accuracy, precision, recall, and F1 score when
we use augmented and balanced data.
From the overall results, we can conclude that BERTMultilingual on the augmented
and balanced data is the most successful model in detecting breast cancer from the
reports written by radiologists, and that it provides the most successful results
in classifying the collected Turkish radiology reports compared to the other
methods tested in this study.
Figure 10 shows the confusion matrices of the BERT models for the pure dataset. As
can be seen from the figure, the number of true-negative reports is zero for every
model except BERTMultilingual. The number of false-negative reports is also zero
for all models.

5.3.3 Comparison of the results with the previous studies

Table 8 shows the comparison of our work with previous studies. In our work, we
created a new dataset of Turkish radiology reports to test the models. We used
classical machine learning algorithms as baseline methods and compared them with
pre-trained BERT models to evaluate the performance of deep learning methods.
We also applied the EDA data augmentation method with weighted random sampling to
increase the size of the dataset and balance it, in order to test the effects of
data augmentation and balancing on the classification of Turkish radiology
reports. In contrast, the other studies use different datasets without applying
augmentation methods. Similar to our study, Casey et al. (2021), Devarakonda and
Demmel (2020), and Parlak and Uysal (2020) use BERT models in the classification
part; in addition, our study applies Hard Voting using the results of the BERT
models. As far as we know, none of the previous studies made such a comparison.
While other studies employed soft voting or hierarchical ensemble methods, we used
the hard voting ensemble method calculated from the results of BERTTurkish,
BERTMultilingual, and BERTClinical.

Fig. 10  Confusion matrices obtained from the pure data by using (a) Turkish BERT model, (b) Clinical
BERT model, (c) Multilingual BERT model
As shown in Table 8, we performed a detailed comparison of the pre-trained BERT
models for classifying Turkish radiology reports. The best result we obtained is an
F1-score of 0.92, which is comparable to or higher than the previously observed
values for classifying Turkish medical documents.

Table 8  Comparison of our work with previous studies to classify medical data

Study | Dataset | Machine learning | Ensemble method | Augmentation | Deep learning | Result
Our study | Our newly created Turkish dataset | SVM; Logistic regression; SGD; Random forest; Multinomial Naïve Bayes | Hard voting | Easy data augmentation (EDA) | Turkish BERT; Multilingual BERT; Clinical BERT; DistilBERT | F1-score 0.92
Parlak and Uysal (2020) | Dataset uses information from the PACS systems at Namazi Hospital and Saadi Hospital | SVM; XGBoost; Naïve Bayes; MLF | – | – | – | F1-score 0.86
Sanh et al. (2019) | University of Pittsburgh text information extraction system (TIES) | Naïve Bayes; SVM; PART | – | – | – | F1-measure 0.93
Yang et al. (2019) | Articles published in the TUBITAK National Medical Database between 1976 and 2014 | – | – | – | Multilingual BERT; BERTurk | F1-score 0.93
Kılınc (2016) | Stockholm EPR PHI corpus | – | – | – | Swedish KB-BERT and the multilingual M-BERT | F-score 0.92
Castro et al. (2017) | Memorial Sloan Kettering Cancer Center (MSKCC) | Logistic regression; Random forest; XGBoost | – | – | RNN | Accuracy 0.70
Bell (2020) | Wisconsin breast cancer dataset | SVM; K-nearest neighbors; Random forests; Artificial neural networks (ANNs); Logistic regression | – | – | – | Accuracy 0.98
Islam et al. (2020) | MIMIC III and the clinical histories of 268,989 patients, written in Spanish, of the Dr. Guillermo Grant Benavente Regional Clinical Hospital | SVM; Decision tree (DT); Naïve Bayes (NB); K-nearest neighbor (KNN); Random forest (RF) | – | – | LSTM; Bi-LSTM | F1-macro 0.98
Ozcift (2020) | Breast cancer radiology reports from the Ziekenhuis Groep Twente (ZGT), a hospital in Hengelo, Netherlands, recorded between 2012 and 2018 | SVM; Logistic regression; Random forest; KNN; Multinomial Naïve Bayes; Gradient boosted trees; Ridge classifier | – | – | – | Accuracy 0.83
Medical (2022) | Data collected from various open-source hospital web pages | Logistic regression; Random forest; Decision tree | Soft voting | – | – | Accuracy 0.91
Ao et al. (2019) | NCR database | – | Hierarchical ensemble | – | CNN | F1-micro 0.748
Devarakonda and Demmel (2020) | Radiology head CT reports of patients from intensive care units (ICUs) provided by Emory Healthcare | SVM; Logistic regression; Random forest | – | – | CNN with neural attention mechanism | Accuracy 0.88
Tokgoz et al. (2021) | Radiology reports from a major pediatric hospital in Buenos Aires | – | – | – | Multi BERT hybrid system | F1 0.855

6 Conclusion and future work

This paper describes a new Turkish dataset of radiology reports obtained from breast
cancer patients to further support breast cancer research in Turkey. Our research
aims to demonstrate ways to decrease the cost associated with breast biopsy during
cancer diagnosis. As such, we also investigate methods to minimize the number of
unwarranted biopsies, and alleviate the detrimental effects of biopsies, such as anxi-
ety and cost, in breast cancer diagnosis.
The resulting dataset consists of Turkish radiology reports from 62 breast cancer
patients labeled as malignant or benign. Because the collected dataset is small and
imbalanced, we apply EDA and self-copying augmentation together with weighted random
sampling. We experimentally evaluate the classification performance of classical
machine learning and BERT models on the original (pure) and augmented+balanced
versions of our newly created Turkish radiology reports dataset. In addition, we use
a hard voting ensemble classifier to improve the results of the pre-trained BERT
models. The experimental results show that, with the EDA augmentation method,
BERTMultilingual classifies better than the BERTClinical, DistilBERTTurkish, and
BERTTurkish models. BERTMultilingual is therefore the best choice among the
EDA-augmented models for breast cancer detection from Turkish radiology reports.
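As an illustration of the balancing step mentioned above, the following minimal sketch draws training examples with probability inversely proportional to the size of their class, so that malignant and benign reports are expected to appear equally often in the resampled set. It is not the implementation used in this study: the function names `inverse_frequency_weights` and `weighted_resample` and the choice of inverse class frequency as the weight are assumptions made only for this example, and only the Python standard library is used.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per example: 1 / (size of its class).

    Assumption for this sketch: class balance is obtained by weighting
    each example with the inverse frequency of its label.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def weighted_resample(examples, labels, n_samples, seed=0):
    """Draw n_samples (example, label) pairs with replacement.

    Because every class has the same total weight, each class is
    expected to make up an equal share of the resampled set.
    """
    if len(examples) != len(labels):
        raise ValueError("examples and labels must have the same length")
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    weights = inverse_frequency_weights(labels)
    indices = rng.choices(range(len(examples)), weights=weights, k=n_samples)
    return [(examples[i], labels[i]) for i in indices]
```

For a toy set with three malignant reports and one benign report, each malignant report receives weight 1/3 and the benign report receives weight 1, so both classes carry the same total weight during sampling.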
Our evaluation also includes BERT models with data augmentation through self-copying.
The experimental results indicate that BERTTurkish outperforms the other BERT models
when self-copying is used as the augmentation method. Hence, we recommend BERTTurkish
as the best choice when self-copy augmentation is used for detecting breast cancer in
Turkish radiology reports. Additionally, we observed that the hard voting model
further improved the experimental results when combined with data augmentation.
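Hard voting can be summarized in a few lines: each model predicts a label for every report, and the ensemble outputs the label chosen by the most models. The sketch below is a generic majority vote, not the code used in this study. The function name `hard_vote` and the tie-breaking rule (a tie goes to the label that appears first in model order) are assumptions introduced for illustration.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine label predictions from several models by majority vote.

    predictions_per_model: list with one list of predicted labels per model;
    all inner lists must cover the same examples in the same order.
    Assumption for this sketch: ties are resolved in favour of the label
    that appears first in model order.
    """
    if not predictions_per_model:
        raise ValueError("at least one model is required")
    n_examples = len(predictions_per_model[0])
    if any(len(p) != n_examples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of examples")

    voted = []
    for votes in zip(*predictions_per_model):  # votes for one example
        counts = Counter(votes)  # keeps first-seen order of labels
        # max() returns the first label with the highest count
        voted.append(max(counts, key=counts.get))
    return voted
```

With three models voting (malignant, malignant, benign) on one report, the ensemble returns malignant. An odd number of models avoids ties in the binary malignant/benign setting.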
We believe that the dataset will prove to be a critical resource for progress toward
state-of-the-art results in detecting breast cancer from Turkish reports. In future
work, we intend to evaluate the dataset with various deep learning models that take
into account the grammatical rules of the Turkish language.

Author contributions All authors have contributed equally to this work.

Funding This work was supported by the Scientific Research Project Unit of Çukurova University [grant
number FDK-2016-6931] and by Nazarbayev University (Kazakhstan) Faculty-development competitive
research [grant number FY2019-FGP-1-STEMM].

Data availability The dataset analysed during the current study is not publicly available due to patient
privacy restrictions set by the Ethics Committee of Cukurova University, but it is available from the
Ethics Committee of Cukurova University on reasonable request.

Declarations
Competing interest The authors have no relevant financial or non-financial interests to disclose.

Ethical approval This work was carried out with the approval of the Ethics Committee of Cukurova
University for non-interventional clinical research.

Informed consent No conflicts of interest.

Consent for publication All authors whose names appear on the submission approved the version to be
published.

References
Abdelrahman, L., Al Ghamdi, M., Collado-Mesa, F., & Abdel-Mottaleb, M. (2021). Convolutional
neural networks for breast cancer detection in mammography: A survey. Computers in Biology
and Medicine, 131, 104248.
Agarwal, S. (2013). Data mining: Data mining concepts and techniques. 2013 international confer-
ence on machine intelligence and research advancement (pp. 203–207). USA: IEEE.
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyper-
parameter optimization framework. 25th ACM SIGKDD international conference on knowledge
discovery and data mining (pp. 2623–2631). ACM.
Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source nlp framework for turkic languages.
Structure, 10(2007), 1–5.
Aksoy, A., & Öztürk, T. (2018). Turkish stopwords. https://github.com/ahmetax/trstop
Alsentzer, E., Murphy, J., Boag, W., Weng, W. H., Jin, D., Naumann, T., & McDermott, M. (2019).
Publicly available clinical BERT embeddings. Proceedings of the 2nd clinical natural language
processing workshop (pp. 72–78). Association for Computational Linguistics.
Ao, Y., Li, H., Zhu, L., Ali, S., & Yang, Z. (2019). The linear random forest algorithm and its advan-
tages in machine learning assisted logging regression modeling. Journal of Petroleum Science
and Engineering, 174, 776–789.
Arifoğlu, D., Deniz, O., Aleçakır, K., & Yöndem, M. (2014). Codemagic: semi-automatic assignment
of icd-10-am codes to patient records. Information sciences and systems 2014 (pp. 259–268).
Springer.
Bayer, M., Kaufhold, M.-A., & Reuter, C. (2022). A survey on data augmentation for text classification.
ACM Computing Surveys, 55(7), 1–39.
Bell, D.J. (2020). American college of radiology. Retrieved December 7, 2020 from
https://radiopaedia.org/articles/american-college-of-radiology?lang=us
Boroumandzadeh, M., & Parvinnia, E. (2021). Automated classification of bi-rads in textual mammogra-
phy reports. Turkish Journal of Electrical Engineering & Computer Sciences, 29(2), 632–647.
Casey, A., Davidson, E., Poon, M., Dong, H., Duma, D., Grivas, A., Grover, C., Suárez-Paniagua, V.,
Tobin, R., Whiteley, W., et al. (2021). A systematic review of natural language processing applied to
radiology reports. BMC Medical Informatics and Decision Making, 21(1), 1–18.
Castro, S. M., Tseytlin, E., Medvedeva, O., Mitchell, K., Visweswaran, S., Bekhuis, T., & Jacobson, R.
S. (2017). Automated annotation and classification of bi-rads assessment from radiology reports.
Journal of biomedical informatics, 69, 177–187.
Çelıkten, A., & Bulut, H. (2021). Turkish medical text classification using bert. 2021 29th signal process-
ing and communications applications conference (SIU) (pp. 1–4). IEEE.
Çetinoğlu, Ö., Bilgin, O., & Oflazer, K. (2018). Turkish wordnet. Turkish natural language processing
(pp. 317–336). Springer.
Delgado, R. (2021). A semi-hard voting combiner scheme to ensemble multi-class probabilistic classi-
fiers. Applied Intelligence, 2021, 1–25.
Devarakonda, A., & Demmel, J. (2020). Avoiding communication in logistic regression. 2020 IEEE 27th
international conference on high performance computing, data, and analytics (HiPC) (pp. 91–100).
IEEE.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional
transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805
Digital Library. (2020). Turkish bert model. https://huggingface.co/dbmdz/bert-base-turkish-cased
Efraimidis, P.S. (2015). Weighted random sampling over data streams. Preprint at
https://arxiv.org/abs/1012.0256
Faris, H., Habib, M., Faris, M., Elayan, H., & Alomari, A. (2021). An intelligent multimodal medical
diagnosis system based on patients’ medical questions and structured symptoms for telemedicine.
Informatics in Medicine Unlocked, 23, 100513.
Grancharova, M., & Dalianis, H. (2021). Applying and sharing pre-trained bert-models for named entity
recognition and classification in swedish electronic patient records. Proceedings of the 23rd nordic
conference on computational linguistics (NoDaLiDa) (pp. 231–239). ACL.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., & Poon, H. (2021).
Domain-specific language model pretraining for biomedical natural language processing. ACM
Transactions on Computing for Healthcare (HEALTH), 3(1), 1–23.
Gupta, M., Wu, H., Arora, S., Gupta, A., Chaudhary, G., & Hua, Q. (2021). Gene mutation classifica-
tion through text evidence facilitating cancer tumour detection. Journal of Healthcare Engineering,
2021, 10.
Han, J., Pei, J., & Tong, H. (2022). Data Mining: Concepts and Techniques. Morgan Kaufmann.
Islam, M. M., Haque, M. R., Iqbal, H., Hasan, M. M., Hasan, M., & Kabir, M. N. (2020). Breast can-
cer prediction: a comparative study using machine learning techniques. SN Computer Science, 1(5),
1–14.
Kılınç, D. (2016). The effect of ensemble learning models on turkish text classification. Celal Bayar Uni-
versity Journal of Science, 12(2), 15.
Kumar, A. (2020). Hard versus soft voting classifier python example. Retrieved September 07, 2020
from https://vitalflux.com/hard-vs-soft-voting-classifier-python-example/
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). Biobert: a pre-trained biomed-
ical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. Preprint at
https://arxiv.org/abs/1711.05101
Magna, A. A. R., Allende-Cid, H., Taramasco, C., Becerra, C., & Figueroa, R. L. (2020). Application of
machine learning and word embeddings in the classification of cancer diagnosis using patient anam-
nesis. IEEE Access, 8, 106198–106213.
Maysanjaya, I., Pradnyana, I., & Putrama, I. (2018). Classification of breast cancer using wrapper and
naïve bayes algorithms. Journal of Physics: Conference Series, 1040, 012017.
Medical, T.A.C.S. Editorial Content Team. (2022). Understanding your Mammogram report. Retrieved
January 14, 2022 from
https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/mammograms/understanding-your-mammogram-report.html
Moher, D., Shamseer, L., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., Shekelle, P., & Stewart, L.
A. (2015). Preferred reporting items for systematic review and meta-analysis protocols (prisma-p)
2015 statement. Systematic Reviews, 4(1), 1–9.
Nguyen, E., Theodorakopoulos, D., Pathak, S., Geerdink, J., Vijlbrief, O., Van Keulen, M., & Seifert, C.
(2020). A hybrid text classification and language generation model for automated summarization
of dutch breast cancer radiology reports. 2020 IEEE second international conference on cognitive
machine intelligence (CogMI) (pp. 72–81). IEEE.
Niknejad, M.T. (2022). Breast imaging-reporting and data system (BI-RADS). Retrieved January 28, 2022
from https://radiopaedia.org/articles/breast-imaging-reporting-and-data-system-bi-rads?lang=us
Onan, A., Korukoğlu, S., & Bulut, H. (2017). A hybrid ensemble pruning approach based on consensus
clustering and multi-objective evolutionary algorithm for sentiment classification. Information Pro-
cessing & Management, 53(4), 814–833.
Özçift, A. (2020). Medical sentiment analysis based on soft voting ensemble algorithm. Yönetim Bilişim
Sistemleri Dergisi, 6(1), 42–50.
Parlak, B., & Uysal, A. K. (2020). On classification of abstracts obtained from medical journals. Journal
of Information Science, 46(5), 648–663.
Pons, E., Braun, L. M., Hunink, M. M., & Kors, J. A. (2016). Natural language processing in radiology:
A systematic review. Radiology, 279(2), 329–343.

Saib, W., Sengeh, D., Dlamini, G., & Singh, E. (2020). Hierarchical deep learning ensemble to automate
the classification of breast cancer pathology reports by icd-o topography. Preprint at
https://arxiv.org/abs/2008.12571
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of bert: smaller,
faster, cheaper and lighter. Preprint at https://arxiv.org/abs/1910.01108
Shin, B., Chokshi, F. H., Lee, T., & Choi, J. D. (2017). Classification of radiology reports using neural
attention models. 2017 international joint conference on neural networks (IJCNN) (pp. 4363–4370).
IEEE.
Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., & Lungren, M.P. (2020). Chexbert: combining
automatic labelers and expert annotations for accurate radiology report labeling using bert. Preprint
at https://arxiv.org/abs/2004.09167
Soui, M., Mansouri, N., Alhamad, R., Kessentini, M., & Ghedira, K. (2021). Nsga-ii as feature selec-
tion technique and adaboost classifier for covid-19 prediction using patient’s symptoms. Nonlinear
Dynamics, 106(2), 1453–1475.
Suárez-Paniagua, V., Dong, H., & Casey, A. (2021). A multi-bert hybrid system for named entity recogni-
tion in spanish radiology reports. CLEF eHealth.
Tokgoz, M., Turhan, F., Bolucu, N., & Can, B. (2021). Tuning language representation models for clas-
sification of Turkish news. 2021 International symposium on electrical, electronics and information
engineering (pp. 402–407). IEEE.
Verma, A. (2021). Python guide to HuggingFace DistilBERT—smaller, faster and cheaper distilled
BERT. Retrieved March 16, 2021 from
https://analyticsindiamag.com/python-guide-to-huggingface-distilbert-smaller-faster-cheaper-distilled-bert/
Wang, M., & Hu, F. (2021). The application of nltk library for python natural language processing in cor-
pus research. Theory and Practice in Language Studies, 11(9), 1041–1049.
Wei, J., & Zou, K. (2019). Eda: Easy data augmentation techniques for boosting performance on text
classification tasks. Preprint at https://arxiv.org/abs/1901.11196
Wu, S., & Dredze, M. (2020). Are all languages created equal in multilingual BERT. Proceedings of the
5th workshop on representation learning for NLP (pp. 120–130). Association for Computational
Linguistics.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized
autoregressive pretraining for language understanding. Advances in Neural Information Processing
Systems, 32, 10.
Zhu, Y., Moh, M., & Moh, T.-S. (2016). Multi-layer text classification with voting for consumer reviews.
2016 IEEE international conference on big data (Big Data) (pp. 1991–1999). IEEE.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.

Authors and Affiliations

Pınar Uskaner Hepsağ1 · Selma Ayşe Özel2 · Kubilay Dalcı3 · Adnan Yazıcı4

* Pınar Uskaner Hepsağ
  puskaner@atu.edu.tr

  Selma Ayşe Özel
  saozel@cu.edu.tr

  Kubilay Dalcı
  kubilaydalci@hotmail.com

  Adnan Yazıcı
  adnan.yazici@nu.edu.kz

1 Department of Computer Engineering, Adana Alparslan Türkeş Science and Technology University, 01250 Adana, Turkey
2 Department of Computer Engineering, Çukurova University, 01330 Adana, Turkey
3 Department of General Surgery, Çukurova University, 01330 Adana, Turkey
4 Department of Computer Science, Nazarbayev University, 010000 Nur Sultan, Kazakhstan
