Journal Pre-Proof: Pattern Recognition Letters
PII: S0167-8655(20)30043-X
DOI: https://doi.org/10.1016/j.patrec.2020.02.005
Reference: PATREC 7782
Please cite this article as: Oliver Faust, Murtadha Kareem, Alex Shenfield, Ali Ali, U Rajendra Acharya,
Validating the robustness of an internet of things based atrial fibrillation detection system, Pattern
Recognition Letters (2020), doi: https://doi.org/10.1016/j.patrec.2020.02.005
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
Validating the robustness of an internet of things based atrial fibrillation detection system
Oliver Faust^a, Murtadha Kareem^a, Alex Shenfield^a, Ali Ali^b, U Rajendra Acharya^c,d
a Department of Engineering and Mathematics, Sheffield Hallam University, United Kingdom
b Sheffield Teaching Hospitals NHS Foundation Trust, University of Sheffield, United Kingdom
c Ngee Ann Polytechnic, Singapore
d Department of Bioinformatics and Medical Engineering, Asia University, Taiwan
Article history:
ABSTRACT
fatigue, palpitation and light-headedness (Thrall et al., 2007). The fact that AF is an independent risk factor for stroke has significant clinical and economic implications (Ali et al., 2015). Overall, treatment of AF often includes cardioversion (restoring sinus rhythm) which may involve using medication or electrical cardioversion.
Before we can treat this serious disease, it is necessary to diagnose it first. One way of diagnosing AF is to record HR signals and subsequently look for signs of AF in these signals. The most promising approaches, documented in the scientific literature, use computer support for that detection task (Hagiwara et al., 2018; Song et al., 2012). The Computer-Aided Diagnosis (CAD) systems are trained and tested on labeled data, i.e. data that was analyzed by human cardiologists. Looking at the scientific literature in more detail revealed that all studies stopped at a one-time evaluation of the learning model (Chen et al., 2017; Gotlibovych et al., 2018; Xia et al., 2018; Yıldırım et al., 2018; Yuan et al., 2016). That means these studies did not test the efficacy of their model with more and more varied HR data. Unfortunately, using the model with more and more varied data is the use case scenario which arises in clinical practice. To be specific, in a practical setting the patient data represents a sample from the continuum of all possible data. That continuum is better represented with more and more varied data. Hence, testing a model with more and more varied data instills trust that the model will work in a practical scenario. The absence of such blindfold validation is a research gap, because the lack of follow-up verification results in diminished trust in the CAD systems. This lack of trust is one of the reasons why we do not see the widespread application of CAD systems in clinical practice.
The current study tried to overcome the above-mentioned problem by evaluating the performance of a mature deep learning model with unseen data. We have asked the model to detect AF in signals from 82 long-term HR recordings of subjects with paroxysmal or sustained AF. The system achieved a classification accuracy of 94%, a precision of 95% and a recall of 95%. This result is significant as it was obtained using totally unknown data. The model was established with a training data set that came from 20 patients, and now we have validated it with data from 82 patients. Doing so goes against the common concept of using a smaller portion of the data for testing. Hence, it is more difficult for the deep learning model to establish the correct result. Furthermore, the training data was measured with a different setup from the validation data, i.e. sampling frequency, voltage range, and resolution were different. The fact that the results were achieved in that different environment indicates that the deep learning system knows what an AF affected HR signal looks like. Hence, these results establish that the deep learning model can reliably detect AF in clinical settings. Having that impetus is a prerequisite for Intelligent Internet of Things (IIoT) based diagnosis support tools.
To communicate our findings, we have structured the remainder of the paper as follows. The next section provides background on IIoT and the deep learning model. The subsequent section introduces the methods used for validating that model. Section 4 states the validation results in the form of confusion matrix and ROC curve. The subsequent discussion takes centre stage in this paper, because in this section we outline why a robust decision-making process is a necessary prerequisite for IoT based AF monitoring. The Discussion section also provides limitations and future work, before we wrap up the paper with the conclusions.
2. Background on intelligent Internet of medical things
IIoTs deliver measurement data to a central storage location for centralized decision-making (Gubbi et al., 2013). In the medical domain, such measurement data are usually physiological signals, such as HR signals (Dimitrov, 2016). Figure 1 depicts a use case scenario for a medical IIoT. Patients wear an HR sensor, which is depicted in the figure as a black oval with a red dot. That sensor communicates the HR signal to a mobile device which relays the data to the cloud server via an IoT protocol. The LSTM based deep learning model, which was validated in this study, analyzes the incoming HR signals in real time. If AF is detected, a cardiologist is informed. The cardiologist can review the suspicious HR trace and reach a diagnosis. That diagnosis can be communicated to the patients in the form of a simple traffic light scheme, as indicated in the figure.
2.1. Long short term memory based deep learning
The use case scenario features LSTM based deep learning. LSTM is an extension of the Recurrent Neural Network (RNN) (LeCun et al., 2015). That algorithm was chosen because LSTM can process data sequences (Sak et al., 2014). This functionality is distinct from other deep learning approaches which consider an input vector as a data point in a multidimensional space (Faust et al., 2018a). LSTM includes the idea of time by incorporating memory cells and a sophisticated updating algorithm (Oh et al., 2018). The updating algorithm uses three regulators or gates which control the flow of information within the LSTM. The input gate determines how much new data flows into the cell. The forget gate controls how much information is retained and the output gate controls the outflow of information. That flexible use of memory enables LSTM to approach the utility of a "general purpose computer" (Siegelmann and Sontag, 1995).
Figure 2 shows a functional diagram of the LSTM algorithm. The upper part of the diagram indicates the RNN loop unrolling, which results in individual LSTM cells. The hidden state vector $\vec{h}_t \in \mathbb{R}^h$ and the cell state vector $\vec{c}_t \in \mathbb{R}^h$ are passed from one cell to the next. The cells consume the input vector $\vec{x}_t$ at different time instances $t$. Each cell A has LSTM functionality, as indicated in the lower part of the figure.
Each cell incorporates the three gates to establish the LSTM functionality (Gers et al., 2002). The forget gate regulates the information content stored within the cell and thereby it plays a vital role in modelling the way humans remember and indeed forget (Gers et al., 2000). It is implemented as the first multiplier from the left, highlighted in orange. The vector $\vec{f}_t \in \mathbb{R}^h$ controls the forget gate. Mathematically, that vector is established with:
$$\vec{f}_t = \sigma\left(W_f \vec{x}_t + U_f \vec{h}_{t-1} + \vec{b}_f\right) \tag{1}$$
Fig. 1: Use case diagram of an IoT based decision support system. The color of the mobile device indicates the diagnosis result. Red indicates immediate action is
required, orange indicates caution, and green no reason for concern.
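As an illustration only, one time step of the LSTM cell described in Section 2.1 can be sketched in Python with NumPy. The sketch follows the gate and cell-state updates that Section 2.1 states as Eqs. (1) to (4). The hidden-state line h_t = o_t ∘ tanh(c_t) is the standard LSTM output and is an assumption here, because that equation is not part of this excerpt. The recurrent matrices U are taken as h × h so that their product with the hidden state is defined. The names `sigmoid`, `init_params` and `lstm_step`, the random initialisation, and the sizes h = 8 and d = 1 are hypothetical and do not describe the trained model of Faust et al. (2018b).

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(h, d, seed=0):
    """Hypothetical random parameters (W, U, b) for gates f, i, o and cell candidate c."""
    rng = np.random.default_rng(seed)
    return {
        name: (
            rng.normal(scale=0.1, size=(h, d)),  # W: input weights, h x d
            rng.normal(scale=0.1, size=(h, h)),  # U: recurrent weights, h x h (assumed shape)
            np.zeros(h),                         # b: bias vector
        )
        for name in ("f", "i", "o", "c")
    }


def lstm_step(x_t, h_prev, c_prev, params):
    """Advance the cell by one time step and return (h_t, c_t)."""
    W_f, U_f, b_f = params["f"]
    W_i, U_i, b_i = params["i"]
    W_o, U_o, b_o = params["o"]
    W_c, U_c, b_c = params["c"]

    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)  # forget gate, Eq. (1)
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)  # input gate, Eq. (2)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)  # output gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # cell state, Eq. (4)
    h_t = o_t * np.tanh(c_t)  # assumption: standard LSTM hidden state, not given in the excerpt
    return h_t, c_t


# Usage: feed a made-up sequence of 100 beat-to-beat intervals (seconds) through the cell.
h, d = 8, 1
params = init_params(h, d)
h_t, c_t = np.zeros(h), np.zeros(h)
for rr in np.full(100, 0.8):
    h_t, c_t = lstm_step(np.array([rr]), h_t, c_t, params)
```

The element-wise products mirror the multipliers in the functional diagram: the forget gate scales the previous cell state, the input gate scales the new candidate, and the output gate scales what leaves the cell.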
In Eq. (1), $\sigma(\cdot)$ is the sigmoid activation function. $W_f \in \mathbb{R}^{h \times d}$ and $U_f \in \mathbb{R}^{h \times h}$ are weight matrices and $\vec{b}_f \in \mathbb{R}^h$ is a forget specific bias vector. The parameters $h$ and $d$ indicate the number of hidden units and the dimensionality of the input vector respectively.
The input gate is implemented as the second multiplier from the left, highlighted in blue. The vector $\vec{i}_t \in \mathbb{R}^h$ controls the input gate. The mathematical definition of $\vec{i}_t$ is similar to $\vec{f}_t$:
$$\vec{i}_t = \sigma\left(W_i \vec{x}_t + U_i \vec{h}_{t-1} + \vec{b}_i\right) \tag{2}$$
Only the weight matrices $W_i \in \mathbb{R}^{h \times d}$ and $U_i \in \mathbb{R}^{h \times h}$ as well as the bias vector $\vec{b}_i \in \mathbb{R}^h$ are different.
The output gate is implemented as the third multiplier from the left, highlighted in green. The vector $\vec{o}_t \in \mathbb{R}^h$ controls the output gate. The mathematical definition of $\vec{o}_t$ is similar to the other two gates:
$$\vec{o}_t = \sigma\left(W_o \vec{x}_t + U_o \vec{h}_{t-1} + \vec{b}_o\right) \tag{3}$$
Only the weight matrices $W_o \in \mathbb{R}^{h \times d}$ and $U_o \in \mathbb{R}^{h \times h}$ as well as the bias vector $\vec{b}_o \in \mathbb{R}^h$ are different.
Having established the control vectors for the gates, we can progress to formulate the equations which define the cell output. The cell state vector for the current time is established with the following equation:
$$\vec{c}_t = \vec{f}_t \circ \vec{c}_{t-1} + \vec{i}_t \circ \tanh\left(W_c \vec{x}_t + U_c \vec{h}_{t-1} + \vec{b}_c\right) \tag{4}$$
3. Methods
Recently, we developed an LSTM based deep learning model to detect AF episodes using HR signals (Faust et al., 2018b). That model can provide the intelligence needed for state of the art IoT based diagnosis support systems. The current study setup tests the same model with more and more varied data obtained from a different source.
Figure 3 provides an overview of the study setup. The upper half of the figure shows the initial study setup. The LSTM based deep learning system was trained and tested with labeled HR signal data from 20 subjects sourced from PhysioNet's Atrial Fibrillation Database (AFDB). Training the deep learning network means establishing the weight values of the individual neurons. The weight vectors from the initial study were used in the validation setup. However, more and more varied data were used for the blindfold validation testing. The following sections detail the validation setup by introducing the data sets and the processing steps in more detail.
3.1. Data used
The initial study was based on data collected from the MIT-BIH AFDB which is available on PhysioNet (Moody, 1983; Goldberger et al., 2000). This database includes twenty three 10 hour Electrocardiogram (ECG) recordings. These recordings were labelled with so called rhythm annotation files, the content of which was prepared manually. The specific types of rhythm annotations were (AFIB (atrial fibrillation), (AFL (atrial flutter), (J (AV junctional rhythm), and (N (used to indicate all
other rhythms). Furthermore, the database contains beat annotation files that were used to extract the HR signals.
For the blindfold validation, we used the data from 82 subjects, which were sourced from the Long-Term AF Database (LTAFDB) (Petrutiu et al., 2007). All patients in that database were diagnosed with paroxysmal or sustained AF. Each data set is composed of two ECG signals, each of which has a duration of 24 to 25 hours and was sampled at 128 Hz. The data sets also incorporate rhythm and beat annotations, which were established manually by experienced cardiologists. Moreover, the R peaks are labelled, and the RR interval sequence was extracted based on these labels. Rhythm annotations were used to label the RR interval sequence as either AF or non-AF. Table 1 provides the measurement parameters that were used to acquire the data for the two individual databases. The fact that there is a significant difference in the sampling frequency indicates the measurement setups were quite different.
The pre-processing of the data in the validation setup closely follows the initial study. A sliding window was used to partition the HR signals into overlapping sequences of 100 beats. Each sequence of 100 beats was labeled as AF if one of the beats within the sequence was labeled AF; all other sequences were labeled non-AF, indicating normal or other cardiac disease. Data from 82 subjects were used with the blindfold validation strategy to evaluate the performance of the robust deep learning system. This means the proposed methods generalize not only to unknown data, but also to unknown patients.
3.2. Performance measures
The confusion or error matrix summarizes the prediction results of a classification problem (Hay, 1988). The layout allows visualization of the performance of a decision-making algorithm. To be specific, the matrix has two rows and two columns that represent the actual and the predicted classes. Therefore, these performance measures report the number of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Most medical test results refer to a positive case (classifying the subject as having the disease) and a negative case (classifying the patient as not having the disease).
The ROC curve is a method used to evaluate the diagnostic accuracy of tests in modern medicine (Mas, 2018). It is widely used to illustrate how well a diagnostic model can distinguish between the presence and absence of disease and works equally well with data sets that exhibit class imbalance. The ROC represents the TP rate (sensitivity) plotted against the FP rate (1-specificity) for various cut-off points (Powers, 2011). Each point on the ROC graph indicates the sensitivity/specificity corresponding to a specific decision threshold. The Area Under Curve (AUC) is a summary metric indicating the discriminatory power of a classifier.
3.3. Initial study setup and performance
For the initial study, we designed a deep RNN (Pascanu et al., 2013; Pearlmutter, 1989) with LSTM (Hochreiter and Schmidhuber, 1997; Bengio et al., 1994) model to detect AF based on HR traces. The pre-processed data blocks were fed directly into the model without performing any feature engineering (Faust, 2018). Table 2 provides the ten-fold cross validation results obtained for the initial model based on the data from 20 AFDB subjects.
4. Validation results
This section documents the blindfold validation results of the LSTM based deep learning model obtained in (Faust et al., 2018b). The results are based on HR signals from the 82 LTAFDB subjects. The radar plot, shown in Figure 4, provides a graphical representation of the model performance in terms of accuracy, precision, recall and F1 score. The labels, around the radar plot, correspond to the patient ID as mentioned in the
Table 1: Comparison between AFDB and LTAFDB.
Parameter              AFDB                            LTAFDB
Duration               10 h                            Between 24 h and 25 h
ADC resolution         12-bit                          12-bit
Voltage range          ±10 mV                          20 mV
Sampling frequency     250 Hz                          128 Hz
Measurement location   Boston's Beth Israel Hospital   Not reported
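Table 1 shows that the two databases were sampled at different rates. One reading of the pre-processing in Section 3.1 is sketched here in Python: annotated R-peak positions are converted to RR intervals in seconds, which puts both recordings on the same time scale, and the interval sequence is then cut into overlapping windows of 100 beats that are labeled AF if any interval in the window carries an AF label. The function names, the per-interval boolean AF flag, and the stride of one beat are assumptions; the text specifies only overlapping windows of 100 beats.

```python
import numpy as np


def rr_intervals(r_peak_samples, fs_hz):
    """Convert R-peak positions (sample indices) into RR intervals in seconds."""
    peaks = np.asarray(r_peak_samples, dtype=float)
    return np.diff(peaks) / fs_hz


def label_windows(rr, interval_is_af, width=100, stride=1):
    """Cut an RR sequence into overlapping windows and label each window.

    A window is labeled True (AF) if at least one interval inside it is flagged AF.
    The stride is an assumption; the source only states that windows overlap.
    """
    rr = np.asarray(rr, dtype=float)
    interval_is_af = np.asarray(interval_is_af, dtype=bool)
    if rr.shape != interval_is_af.shape:
        raise ValueError("rr and interval_is_af must have the same length")
    starts = range(0, len(rr) - width + 1, stride)
    windows = [rr[s:s + width] for s in starts]
    labels = [bool(interval_is_af[s:s + width].any()) for s in starts]
    return windows, labels


# Usage: the same beat times, expressed as sample indices at 250 Hz and at 128 Hz,
# yield the same RR intervals once converted to seconds.
rr_afdb = rr_intervals([0, 250, 375], fs_hz=250)
rr_ltafdb = rr_intervals([0, 128, 192], fs_hz=128)
same_intervals = np.allclose(rr_afdb, rr_ltafdb)

# Toy windowing with width 3 instead of 100 so the result is easy to inspect.
windows, labels = label_windows(
    [0.8, 0.9, 0.7, 0.4, 0.8], [False, False, False, True, False], width=3
)
```

Working on intervals in seconds rather than on raw samples is what lets a model trained on 250 Hz recordings be applied to 128 Hz recordings without resampling the ECG.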
Table 2: Ten-fold cross validation result established during the initial training and testing of the LSTM deep learning model.
TP        TN        FP      FN      Accuracy   AUC
430,615   523,241   7,040   7,407   98.51%     0.9986
LTAFDB. The IDs were ordered in terms of accuracy. The signal with label 10, shown at 12 o'clock, has the highest accuracy. The accuracy performance of the model decreases clockwise, i.e. the accuracy was lower for signal 12 than for signal 10 etc. The model achieved the lowest accuracy of just under 70% for the HR signal from patient 22.
We have evaluated the model performance with a confusion matrix in order to validate AF and normal HR sequences obtained from 82 subjects. Figure 5 shows the results obtained using the new test data: (a) confusion matrix by blindfold validation, (b) ROC of the model. The results in Table 3 show that the LSTM deep learning model achieved a classification accuracy of 94% on the LTAFDB blindfold validation data set, classifying 95% of normal and 95% of HR sequences showing signs of AF correctly. The deep learning classifier achieves an AUC of 0.9658 which is close to a perfect score of 1, hence the classifier discriminates well between disease and no disease. Most classifier values varied between 0.0 (definitely negative) and 1.0 (definitely positive). The closer the ROC curve runs to the top-left corner, i.e. a high TP rate at a low FP rate, the higher the overall accuracy of the test performed. Figure 5b shows the ROC curve which indicates the AF diagnostic performance.
Table 3 provides the blindfold validation results for 3 AFDB subjects, as documented during the initial study, and 82 LTAFDB subjects. The 82 LTAFDB data sets represent 95 times more data than the initial 3 data sets from the AFDB.
5. Discussion
This study shows that a deep learning model is able to mimic human decision-making, when it comes to extracting relevant knowledge from a labeled data set. Training the deep learning algorithm models the knowledge generation which transpires during the education of a cardiologist. The blindfold validation, conducted in this study, models the clinical practice where the decision-making system is presented with signals from a wide range of patients. Validating the deep learning model with unknown data that was obtained with a different measurement setup builds trust in the AF detection model. Furthermore, the results might be transferable to other areas of CAD and beyond. Blindfold validation should become a standard method to evaluate deep learning and other decision support algorithms.
The fact that the deep learning system was trained with HR signals is of particular importance for home health care. HR signals capture the beat-to-beat interval of the human heart (Acharya et al., 2013). The beat impulse is the most prominent signal feature when the electrical activity of the human heart is recorded. That prominence is reflected in the high signal-to-noise ratio. Hence, measuring the beat-to-beat interval is robust to noise. As a consequence, the measurement setup is simple, when compared to other physiological signals, such as ECG (Martis et al., 2013a). Another advantage of HR is that the time from one beat to the next can be encoded with a 2-byte integer. Hence, at roughly one beat a second, a digital HR signal consists of approximately 2 bytes a second. In contrast, to sample the complete electrical activity of the human heart requires around 256 samples a second, each of which is encoded with 2 bytes, i.e. about 512 bytes a second. Therefore, ECG signals have a 256 times higher data rate. The lower data rate has practical benefits, in terms of data communication, storage and processing. For example, HR signals can be communicated via Bluetooth Low Energy (BLE) whereas ECG signals require a broader wireless channel, such as WiFi. Using BLE instead of WiFi has beneficial implications for battery powered devices.
The low data rate and patient led signal acquisition make HR signals a good choice for IoT based health care applications. For such applications the HR data travels from the point of measurement to a central cloud service over communication infrastructure. Having the data at a central location has a number of advantages. Deep learning can be used to detect AF in real time. That is a significant advantage over ECG measurements with Holter monitors, because Holter data can only be analyzed after the measurement period is complete. Validating the deep learning model for AF detection has paved the way for an IoT based AF diagnosis support tool. The knowledge extracted from a limited amount of labeled data can be used to provide real-time decision support. In order to reach the diagnosis, the deep learning result should be validated by a cardiologist. The basis for this validation can be the HR data which has been flagged as showing the subtle waveform alterations caused by AF.
A limitation of our study is that we did not address the important concept of training on the job. To be specific, a cardiologist learns while doing the job. The proposed deep learning model is static, i.e. it did not learn during the validation. At some point the knowledge, extracted from 20 patients, will be insufficient to cope with practical scenarios. In the future, we have to find a way to model that continuous training in order to improve the diagnostic quality of the proposed AF detection system. One
Fig. 4: Illustration of performance for various test subjects obtained by blindfold validation.
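The four scores plotted in Fig. 4 follow from the confusion-matrix counts introduced in Section 3.2. A minimal Python sketch, using the standard definitions of accuracy, precision, recall and F1 score, is given here as an illustration; the function name `binary_metrics` is hypothetical. The counts in the usage example are the ones reported in Table 2, and they reproduce the accuracy of 98.51% stated there.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. at least one positive
    prediction and at least one actual positive case.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found (sensitivity)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Usage with the ten-fold cross validation counts from Table 2.
table2 = binary_metrics(tp=430_615, tn=523_241, fp=7_040, fn=7_407)
accuracy_percent = round(100 * table2["accuracy"], 2)
```

Precision, recall and F1 score are not listed in Table 2, so the sketch can only be checked against the reported accuracy.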
way of providing this continuous learning is to retrain the deep learning model with measurement data. A prerequisite for such a methodology is to have the HR data stored in a central location. Hence, streaming the HR data to a central cloud server might prove to be an advantage when the continuous learning problem is tackled.
6. Conclusion
The current study indicates that deep learning can acquire the knowledge to understand specific aspects of HR signals. In this case, our deep learning model understands HR signals, such that it can differentiate AF from non-AF affected signals. This is different from the traditional machine learning approach which is based on feature engineering. To be specific, the traditional approach differentiates AF from non-AF signals based on parameters. It is likely that these parameters change when more and more varied data are analyzed. Hence, it is common practice for studies based on traditional methods to state this as a limitation, i.e. more and more varied data is needed as well as retraining the machine learning model to improve the diagnostic quality.
Fig. 5: Results obtained using the new test data: (a) confusion matrix by blindfold validation, (b) ROC of the model.
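To make the ROC construction described in Section 3.2 concrete, the following Python sketch computes the points of an ROC curve by lowering the decision threshold one sample at a time, and then integrates the curve with the trapezoidal rule to obtain the AUC. It is an illustration under stated assumptions, not the procedure used to produce Fig. 5: the function names are hypothetical, the scores and labels in the usage example are made up, and the sketch assumes that all scores are distinct and that both classes are present.

```python
import numpy as np


def roc_points(scores, labels):
    """Return (fpr, tpr) arrays obtained by sweeping the threshold over the scores.

    Assumes distinct scores and at least one positive and one negative label.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # highest score first
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)               # true positives at each threshold
    fp = np.cumsum(~sorted_labels)              # false positives at each threshold
    tpr = np.concatenate(([0.0], tp / labels.sum()))
    fpr = np.concatenate(([0.0], fp / (~labels).sum()))
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Usage with made-up classifier outputs: one AF sequence is ranked below a non-AF one.
fpr, tpr = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
auc = auc_trapezoid(fpr, tpr)
```

In this toy case three of the four positive/negative pairs are ranked correctly, so the area is 0.75; a classifier that ranks every AF sequence above every non-AF sequence reaches 1.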
Table 3: Blindfold validation obtained from the LSTM deep learning model for both 3 subjects from the AFDB and 82 subjects from the LTAFDB
In the current study, we have used more and more varied data without retraining the deep learning model. Data from 82 patients were used for this blindfold validation. With this data, our model achieved a classification accuracy of 94%, a precision of 95% and it had 95% recall. These results indicate that our deep learning model has fewer limitations when compared to traditional machine learning approaches. To be specific, we showed that knowledge extracted from a small training data set can be applied to a larger and more varied validation data set. This concept is significant for practical diagnostic support, because applying knowledge acquired during the limited training period is exactly what a cardiologist does. Hence, a practical IoT based medical decision support system must apply knowledge, extracted during training, to samples, i.e. patient data, from a very large data set.
References
Acharya, U.R., Faust, O., Kadri, N.A., Suri, J.S., Yu, W., 2013. Automated identification of normal and diabetes heart rate signals using nonlinear measures. Computers in Biology and Medicine 43, 1523–1529.
Adamson, J., Beswick, A., Ebrahim, S., 2004. Is stroke the most common cause of disability? Journal of Stroke and Cerebrovascular Diseases 13, 171–177.
Ali, A., Bailey, C., Abdelhafiz, A.H., 2012. Stroke prevention with oral anticoagulation in older people with atrial fibrillation-a pragmatic approach. Aging and Disease 3, 339.
Ali, A.N., Howe, J., Abdel-Hafiz, A., 2015. Cost of acute stroke care for patients with atrial fibrillation compared with those in sinus rhythm. Pharmacoeconomics 33, 511–520.
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 157–166.
Benjamin, E.J., Wolf, P.A., D'Agostino, R.B., Silbershatz, H., Kannel, W.B., Levy, D., 1998. Impact of atrial fibrillation on the risk of death: the Framingham Heart Study. Circulation 98, 946–952.
Chen, W., Liu, G., Su, S., Jiang, Q., Nguyen, H., 2017. A CHF detection method based on deep learning with RR intervals, in: Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, IEEE. pp. 3369–3372.
Dimitrov, D.V., 2016. Medical internet of things and big data in healthcare. Healthcare Informatics Research 22, 156–163.
Faust, O., 2018. Documenting and predicting topic changes in Computers in Biology and Medicine: A bibliometric keyword analysis from 1990 to 2017. Informatics in Medicine Unlocked 11, 15–27.
Faust, O., Hagiwara, Y., Hong, T.J., Lih, O.S., Acharya, U.R., 2018a. Deep learning for healthcare applications based on physiological signals: a review. Computer Methods and Programs in Biomedicine 161, 1–13.
Faust, O., Martis, R.J., Min, L., Zhong, G.L.Z., Yu, W., 2013. Cardiac arrhythmia classification using electrocardiogram. J. Med. Imaging Health Inf 3, 448.
Faust, O., Shenfield, A., Kareem, M., San, T.R., Fujita, H., Acharya, U.R., 2018b. Automated detection of atrial fibrillation using long short-term memory network with RR interval signals. Computers in Biology and Medicine 102, 327–335.
Gers, F.A., Schmidhuber, J.A., Cummins, F.A., 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12, 2451–2471.
Gers, F.A., Schraudolph, N.N., Schmidhuber, J., 2002. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3, 115–143.
Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E., 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, e215–e220.
Gotlibovych, I., Crawford, S., Goyal, D., Liu, J., Kerem, Y., Benaron, D., Yilmaz, D., Marcus, G., Li, Y., 2018. End-to-end deep learning from raw sensor data: Atrial fibrillation detection using wearables. arXiv preprint arXiv:1807.10707.
Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M., 2013. Internet of things (IoT): A vision, architectural elements, and future directions. Future genera-
Oliver Faust
Corresponding author