
Enhanced Medical Time-Series Forecasting Using LSTM, MDN, and Attention Mechanism

Abstract—We introduce the LSTM-MDN-ATTN model for predicting medical time-series data. The LSTM-MDN-ATTN model predicts the future value of medical data by approximating the distribution of the target data. Since medical data is multivariate data with various test items, an attention mechanism is used to model the distribution suitable for the target data. The attention layer used in this study predicts the target data by focusing on the distribution that is related to the target data. The proposed LSTM-MDN-ATTN model shows better results compared to baseline models on lab test data from Asan Medical Center in Seoul.

Index Terms—electronic medical records, mixture density networks, recurrent neural networks, time-series regression

I. INTRODUCTION

Prediction of the future state of a patient may greatly help the doctor to determine the current state of the patient. In healthcare systems, however, most applications of deep learning predict a patient's disease through image processing, signal processing, or natural language processing based on the patient's history. In this study, we try to predict the future values of biomarkers to help doctors diagnose diseases. There have been many studies predicting electronic medical records (EMR) using LSTM [1], which has outstanding performance in time-series analysis, because EMR data recorded over a patient's visits has a time-series characteristic [2, 3, 4]. CBLSTMs [3] successfully predicted the parameters of computer numerical control machines used in a health monitoring system. Time-aware LSTM [4] outperformed other regression models for monitoring progression in Parkinson's disease by predicting its important biomarkers. However, since the Long Short-Term Memory-Root Mean Square Error (LSTM-RMSE) model used in previous studies approximates the conditional averages of the target data, only a limited description of the properties of the target can be obtained. Besides, since LSTM-RMSE is a deterministic model optimized for one-to-one or many-to-one prediction, it is suboptimal for predicting continuous variables such as hemoglobin, mean corpuscular volume, and red blood cell count in EMR data [5].

Mixture Density Networks (MDN) [5] have been proposed to solve these problems. The MDN models the conditional probability distribution of the target data to obtain a complete description of the target data. In recent studies, the MDN combined with LSTM showed excellent performance in regression problems [6, 7]. The LSTM-MDN model [7] performed better than the LSTM-RMSE model by demonstrating stochastic behavior in an experiment in which an object was grasped and moved to a designated location using a robot arm. However, because EMR data consists of various diagnostic codes, it is complicated to model the appropriate distribution of the target data.

In this study, we propose a Long Short-Term Memory Mixture Density Networks Attention (LSTM-MDN-ATTN) model using attention mechanisms. The attention mechanisms increase the weight of the diagnostic codes that are highly related to the target data to generate appropriate conditional probability distributions of the target data [8, 9].

II. METHODS

Fig. 1 shows an overview of the LSTM-MDN-ATTN model. We denote the input sequence V = {v_1, v_2, ..., v_t} with V ∈ R^{t×n}, where t is the length of the input sequence and n is the number of features. In the LSTM-MDN-ATTN model, the first layer is an LSTM layer, which has proven stable and powerful for modeling sequences. The LSTM converts the given input sequence into a vector representation of length t called the hidden state, H = {h_1, h_2, ..., h_t} ∈ R^{t×m}, where m is the size of the hidden state. The second layer is a mixture density network (MDN) that models the distribution using a linear combination of Gaussian kernel functions:

p(y|x) = Σ_{i=1}^{k} α_i(x) g_i(y|x)    (1)

where k is the number of mixture components, α_i(x) are the mixture coefficients, and g_i(y|x) is a multivariate Gaussian.

[Fig. 1. LSTM-MDN-ATTN architecture: inputs v_1 ... v_t are encoded by an LSTM into hidden states h_1 ... h_t; each MDN block produces (ᾱ, µ, σ) through the attention layer, and the model is trained with a negative log-likelihood cost.]

The MDN layer projects each hidden state h_t into three vectors α_t, µ_t, and σ_t in R^k, where k is the number of kernels, using fully connected layers:

h′_t = W_h h_t + b_h    (2)

α_t = W_α h′_t + b_α
µ_t = W_µ h′_t + b_µ    (3)
σ_t = log(exp(W_σ h′_t + b_σ) + 1)

where W_h ∈ R^{k×m} and W_α, W_µ, W_σ ∈ R^{k×k} are trainable parameters, and b_h, b_α, b_µ, b_σ ∈ R^k are bias terms. α_t, µ_t, and σ_t are the final outputs of the LSTM-MDN model for predicting the future state of a patient: α_t are the mixture weights of the Gaussian mixture model, µ_t are the means of the target variable, and σ_t are the standard deviations.
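To make Eqs. (2)-(3) concrete, a minimal TensorFlow sketch of the MDN projection is given below; the class and variable names are illustrative rather than taken from our implementation. Note that Eq. (3) for σ_t is exactly the softplus function, which keeps the standard deviations positive.

import tensorflow as tf

class MDNHead(tf.keras.layers.Layer):
    """Projects each hidden state h_t into mixture parameters, Eqs. (2)-(3)."""
    def __init__(self, k):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(k)  # h'_t = W_h h_t + b_h
        self.alpha = tf.keras.layers.Dense(k)   # raw mixture coefficients
        self.mu = tf.keras.layers.Dense(k)      # component means
        self.sigma = tf.keras.layers.Dense(k)   # pre-activation scales

    def call(self, h):
        h_prime = self.hidden(h)                # Eq. (2)
        alpha = self.alpha(h_prime)             # Eq. (3); refined later by TATTN
        mu = self.mu(h_prime)
        # log(exp(x) + 1) is the softplus, so sigma_t stays positive
        sigma = tf.nn.softplus(self.sigma(h_prime))
        return alpha, mu, sigma

Applied to the LSTM output H with return_sequences enabled, this head yields per-step mixture parameters of shape (batch, t, k).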


Unlike a typical MDN, we propose an attention layer, called target attention (TATTN), inspired by the Scaled Dot-Product Attention of the Transformer, in which relevance scores are computed by the dot product of vectors [9]. Fig. 2 presents a detailed view of the attention mechanism. The TATTN generates a score for each mixture coefficient that is most relevant to the target value tg = {tg_1, tg_2, ..., tg_t} ∈ R^t, which represents the patient's current state. After all α_t are generated, the refined mixture coefficient ᾱ_t is computed by:

v_t = α_t
q_t = W_q α_t + b_q
k_t = tanh(W_k tg_t + b_k)    (4)
β_t = softmax(q_t k_t^T / √d)
ᾱ_t = softmax(β_t v_t)

where W_q, W_k ∈ R^{k×k} and b_q, b_k ∈ R^k are trainable parameters, and √d is a scaling factor that prevents the dot products from growing large in magnitude. The loss function is obtained by taking the negative logarithm of the likelihood:

Loss_t = −ln( Σ_{i=1}^{k} ᾱ^i_t N(y_{t+1} | µ^i_t, σ^i_t) )    (5)

[Fig. 2. Detailed view of TATTN.]
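The sketch below gives one possible reading of Eqs. (4)-(5) in the same style, treating the per-step score q_t k_t^T as an elementwise product over the k kernels for simplicity; this simplification and all names are our own, not a published implementation.

import math
import tensorflow as tf

class TargetAttention(tf.keras.layers.Layer):
    """Refines mixture coefficients with target-value attention, Eq. (4)."""
    def __init__(self, k):
        super().__init__()
        self.wq = tf.keras.layers.Dense(k)                     # q_t = W_q alpha_t + b_q
        self.wk = tf.keras.layers.Dense(k, activation="tanh")  # k_t = tanh(W_k tg_t + b_k)
        self.scale = math.sqrt(float(k))                       # sqrt(d) scaling factor

    def call(self, alpha, tg):
        v = alpha                                    # v_t = alpha_t
        q = self.wq(alpha)
        key = self.wk(tg)                            # tg: (batch, t, 1) target values
        beta = tf.nn.softmax(q * key / self.scale, axis=-1)  # scaled relevance scores
        return tf.nn.softmax(beta * v, axis=-1)      # refined coefficients, Eq. (4)

def mdn_nll(y_next, alpha_bar, mu, sigma):
    """Negative log-likelihood of y_{t+1} under the Gaussian mixture, Eq. (5)."""
    z = (y_next - mu) / sigma
    gauss = tf.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    likelihood = tf.reduce_sum(alpha_bar * gauss, axis=-1)
    return -tf.math.log(likelihood + 1e-8)           # epsilon guards against log(0)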
III. EXPERIMENTS

A. Datasets

In this study, we used lab test results obtained from the databases of Asan Medical Center, Seoul, Korea. The lab test results consist of inpatient and outpatient records from January 2007 to December 2017. This study was approved by the Institutional Review Board (IRB) of Asan Medical Center. We also obtained written informed consent from every patient.

We constructed a dataset for evaluating the LSTM-MDN-ATTN model through several pre-processing steps. First, we extracted EMR information from patients who visited the hospital more than 10 times, using a fixed window size of 10.


TABLE I
PERFORMANCE COMPARISON AMONG DIFFERENT MODELS.

                 |          L2012             |           L2015               |          L2081
Model            | RMSE     MAE      R-Square | RMSE      MAE       R-Square  | RMSE     MAE      R-Square
LSTM-RMSE        | 0.33309  0.24386  0.77366  | 44.63161  29.08252  0.62202   | 2.68772  1.90102  0.71389
LSTM-MDN         | 0.32458  0.22988  0.78509  | 44.33455  28.67702  0.62704   | 2.65469  1.8345   0.72088
LSTM-MDN-ATTN    | 0.32341  0.22983  0.78673  | 44.32293  28.53329  0.62723   | 2.59542  1.79768  0.7332

Then, 27 features with low missing rates, such as white blood cells, red blood cells, and hemoglobin, were selected from the lab test results in the dataset. Missing values were replaced by the actual values from previous visits. Because of the different sparsity of inpatient and outpatient records, the dataset was aggregated on a monthly basis; if there were multiple visits in a month, we used only the last record. The dataset was normalized using the min-max scaling method.
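As an illustration only, the pre-processing described above could be sketched in pandas as follows; the column names (patient_id, visit_date) and the helper itself are hypothetical, since the original pipeline is not published.

import pandas as pd

def preprocess(df, feature_cols, min_visits=10):
    """Monthly aggregation, last record per month, forward fill, min-max scaling.
    Assumes hypothetical columns: datetime visit_date and id patient_id."""
    df = df.sort_values("visit_date").copy()
    df["month"] = df["visit_date"].dt.to_period("M")
    # keep only the last record within each patient-month
    df = df.groupby(["patient_id", "month"], as_index=False).last()
    # replace missing values with the actual values from previous visits
    df[feature_cols] = df.groupby("patient_id")[feature_cols].ffill()
    # keep patients with more than min_visits monthly records (window size 10)
    df = df[df.groupby("patient_id")["month"].transform("size") > min_visits]
    # min-max normalization per feature
    mins, maxs = df[feature_cols].min(), df[feature_cols].max()
    df[feature_cols] = (df[feature_cols] - mins) / (maxs - mins)
    return df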
B. Training Details

The experimental environment is the TensorFlow framework. Hyper-parameters are set as follows: mini-batch size = 64, LSTM hidden dimension = 32, and mixture components = 6. We trained the network using the Adam optimizer, which is popular in the field of deep learning, at a learning rate of 0.0001.
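Putting the pieces together under these hyper-parameters, one training step might look like the following sketch, which builds on the hypothetical MDNHead and TargetAttention classes above and is not the original training script.

import tensorflow as tf

BATCH, HIDDEN, K, LR = 64, 32, 6, 1e-4   # hyper-parameters from this section

lstm = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)
mdn = MDNHead(K)              # sketched earlier
tattn = TargetAttention(K)    # sketched earlier
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)

@tf.function
def train_step(v, tg, y_next):
    """One gradient step on a mini-batch of input sequences v."""
    with tf.GradientTape() as tape:
        h = lstm(v)                       # (BATCH, t, HIDDEN)
        alpha, mu, sigma = mdn(h)         # per-step mixture parameters
        alpha_bar = tattn(alpha, tg)      # attention-refined coefficients
        loss = tf.reduce_mean(mdn_nll(y_next, alpha_bar, mu, sigma))
    variables = (lstm.trainable_variables + mdn.trainable_variables
                 + tattn.trainable_variables)
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss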
Three metrics are used to evaluate the LSTM-MDN-ATTN model: RMSE, MAE, and R². For RMSE and MAE, lower values indicate better performance.

RMSE = √( (1/N) Σ_{n=1}^{N} (y_n − ŷ_n)^2 )    (6)

MAE = (1/N) Σ_{n=1}^{N} |y_n − ŷ_n|

R² is used to determine whether the model has been properly trained on the dataset. As R² approaches zero, the regression model has failed to fit the dataset properly.

R² = 1 − Σ_{n=1}^{N} (y_n − ŷ_n)^2 / Σ_{n=1}^{N} (y_n − ȳ)^2    (7)
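The three metrics map directly onto Eqs. (6)-(7); a small NumPy sketch for reference:

import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))   # Eq. (6)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def r_square(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot                # Eq. (7)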
C. Results

Table I shows a summary of the experimental results. The performance of three models, LSTM-RMSE, LSTM-MDN [7], and LSTM-MDN-ATTN, was compared through the experiments, with L2012, L2015, and L2081 used as prediction targets.

First, L2012 denotes red blood cells. The normal range of L2012 at Asan Medical Center in Seoul is 4.2 to 6.3, with a unit of x10^6/mm^3. In the L2012 data of the experimental dataset, the maximum value is 7.33 and the minimum value is 1.12. The LSTM-MDN-ATTN model has the best RMSE and MAE for L2012, and the R-square values above 0.75 show that all three models learn the data well.

The second target variable, L2015, denotes the number of platelets. L2015 has a normal range of 150 to 350, and the unit is x10^3/mm^3. The maximum value of L2015 in the experimental dataset is 533.0, and the minimum value is 1.0. For the L2015 variable, LSTM-MDN-ATTN also shows the lowest RMSE and MAE values.

The final target is L2081, MCV, which indicates the mean corpuscular volume of red blood cells. The MCV of a typical person is between 81 and 96, and the unit is fl. The MCV in the experimental dataset has a maximum value of 114.6 and a minimum value of 70.6. For L2081, the LSTM-MDN-ATTN model shows the best performance in RMSE and MAE, and the R-square is 0.7 or higher for all three models.

This experiment showed that the LSTM-MDN-ATTN model outperformed the LSTM-RMSE model, which has become one of the most popular networks for modeling time series. In addition, the LSTM-MDN-ATTN model shows better performance than the LSTM-MDN model, demonstrating that the proposed TATTN layer is effective on the EMR dataset used.

IV. CONCLUSION

In this study, we propose the LSTM-MDN-ATTN model for predicting the future state of a patient by modeling the distribution suitable for the target data in multivariate medical data. The LSTM-MDN-ATTN model uses the attention mechanism to focus on distributions that are highly correlated with the target data to improve prediction accuracy.
REFERENCES

[1] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[2] R. Zhao, J. Wang, R. Yan, and K. Mao, "Machine health monitoring with LSTM networks," in Proc. 10th International Conference on Sensing Technology (ICST), 2016, pp. 1-6.
[3] R. Zhao, R. Yan, J. Wang, and K. Mao, "Learning to monitor machine health with convolutional bi-directional LSTM networks," Sensors, vol. 17, no. 2, pp. 273-290, 2017.
[4] I. M. Baytas, C. Xiao, X. Zhang, F. Wang, A. K. Jain, and J. Zhou, "Patient subtyping via time-aware LSTM networks," in Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017), Halifax, Canada, 2017, pp. 65-74.
[5] C. M. Bishop, "Mixture density networks," Tech. Rep., 1994.
[6] D. Ha and J. Schmidhuber, "Recurrent world models facilitate policy evolution," arXiv preprint arXiv:1809.01999, 2018.
[7] R. Rahmatizadeh, P. Abolghasemi, A. Behal, and L. Bölöni, "From virtual demonstration to real-world manipulation using LSTM and MDN," in Proc. AAAI, New Orleans, LA, USA, 2018.
[8] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. International Conference on Machine Learning (ICML), 2015.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6000-6010.
