WCSP 2018 8555945

A Multi-Layer Parallel LSTM Network for Human Activity Recognition with Smartphone Sensors


Tao Yu, Jianxin Chen, Na Yan, Xipeng Liu
Key Lab of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education
Nanjing University of Posts and Telecommunications
Nanjing, China 210003
1016010402@njupt.edu.cn, chenjx@njupt.edu.cn

Abstract: With the development of mobile communication, human activity recognition (HAR) with smartphones has attracted a lot of attention in recent years. Meanwhile, the emergence of deep learning technologies makes it possible to extract features automatically instead of hand-crafting them as in traditional machine learning methods. Among deep models, CNN-based HAR methods dominate the studies compared to RNN-based methods. In this paper, we propose an RNN-based multi-layer parallel LSTM network to recognize human activities. The experimental results on the public UCI HAR dataset indicate that the proposed approach performs better than the traditional machine-learning methods and achieves performance similar to that of CNN, but with lower computation complexity than CNN-based methods.

Index Terms: human activity recognition, deep learning, parallel LSTM network, smartphone sensor

I. INTRODUCTION

Human activity recognition (HAR), as a significant part of Human Robot Interaction, is applied widely in the healthcare domain, for example in elder care support, rehabilitation assistance and cognitive disorder recognition systems [1]. Generally, the data for human activity recognition are collected from two types of device: cameras and sensors [2]. With the development of mobile communication, sensor-based approaches to HAR using smartphones with low-cost inertial sensors, such as the accelerometer, gyroscope and magnetometer, have received extensive attention [3] [4].

The traditional machine learning (ML) methods applied in HAR often extract features from sensor data before classification, and the features are mostly statistical features in the time domain and the frequency domain [5] [6]. Nevertheless, choosing features for a specific application requires professional knowledge and involves a huge workload. Moreover, information such as the time dependency between actions is often lost after extracting features [7]. In recent years, deep learning methods, especially the CNN-based methods, have been widely used for HAR. They have deep, complex architectures and consider various characteristics of human actions [7] [8] [9].

In this paper, we focus on HAR with smartphone sensors by deep learning methods. We propose an approach based on the long short-term memory (LSTM) network to recognize human activities from the time series data collected by the inertial sensors attached to smartphones. The contributions of this paper are as follows:

• We propose an RNN-based approach to classify human activities. It extracts features automatically and can preserve time dependency.
• We design a parallel LSTM architecture to reduce the computation consumption of this RNN model.
• Extensive experiments have been performed to verify our proposed method. The results indicate that the model performs better than traditional machine-learning methods and achieves performance similar to that of CNN, but with lower computation complexity than CNN-based methods.

The rest of the paper is organized as follows. Section II introduces the related work on the deep learning technologies applied in human activity recognition. Section III gives an overview of the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network, and presents the detailed architecture of a parallel LSTM network. In Section IV, we discuss the model in terms of the recognition accuracy and the computation complexity on the public HAR dataset. Finally, conclusions are summarized in Section V.

II. RELATED WORK

The task of human activity recognition with sensor data using deep learning methods has been well studied over the last several years. In [7], Wang et al. surveyed and highlighted the recent advancement of deep learning approaches for sensor-based activity recognition. The survey showed that CNN-based methods dominate the studies and that they are better at inferring long-term repetitive activities, while RNNs are better at recognizing short activities. In [8], Nakano and Chakraborty used CNN for HAR with dynamic features which captured the dynamic characteristics of the time series of sensor data. It was found that the performance of dynamic features with CNN is better than that of static features with SVM. In [9], both the dynamic features from original sensor data and the statistical features were used for HAR with a CNN architecture. The obtained results on a public HAR dataset demonstrated that the CNN-based model significantly outperformed the baseline approaches.

As mentioned above, although CNN-based methods are well studied for the HAR tasks, there are few works based on RNN, especially the LSTM. In [10], W. Zhu et al. proposed an end-to-end fully connected deep LSTM network for skeleton based action recognition. The proposed model facilitated the

978-1-5386-6119-2/18/$31.00 ©2018 IEEE


Fig. 1. RNN Unfold

automatic learning of feature co-occurrences from the skeleton joints and achieved state-of-the-art performance on several datasets. In [11], Friday et al. proposed deep learning fusion strategies to increase the performance of HAR, which were a hybrid of a convolutional neural network and a variant of the recurrent neural network. The convolutional neural network captured the local regional features from multiple raw sensor data, which were aggregated by the gated recurrent units. However, this framework was still under implementation.

Overall, in this paper, we try to deal with the HAR problem with an LSTM network and expect to get outstanding results.

III. A MULTI-LAYER PARALLEL LSTM NETWORK

In this section, we give an overview of RNN and LSTM. Then we introduce the multi-layer parallel LSTM network and the training parameters such as dropout and loss function.

A. RNN AND LSTM

RNN is a kind of artificial neural network that contains cyclic connections, which can model contextual information. An RNN shares the parameters for every element of a sequence and generates outputs that depend on the current and previous inputs. It uses hidden states to hold information on previous inputs.

Fig. 1 illustrates the RNN architecture and its unfolded form. Here $x = (x_0, x_1, x_2, \cdots, x_t)$ and $y = (y_0, y_1, y_2, \cdots, y_t)$ represent the input and output series. As the computational flow of the RNN unit in Fig. 1 shows, a hidden state $h_t$ receives information from the previous hidden state $h_{t-1}$ and the current input $x_t$, acting like the memory of the network that keeps information about what was previously computed. The parameters involved in an RNN are described as follows.

$$h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h) \tag{1}$$

$$y_t = W_{hy} \cdot h_t + b_y \tag{2}$$

Here $W_{xh}$ is the weight matrix between the input $x_t$ and the hidden state $h_t$, $W_{hy}$ is the weight matrix between the hidden state $h_t$ and the output $y_t$, and $W_{hh}$ is the weight matrix between the previous hidden state $h_{t-1}$ and the current one $h_t$. $b_h$ and $b_y$ are the bias vectors.

Fig. 2. LSTM Unit

Nevertheless, the range of the historical information preserved by the hidden states is limited, which is known as the vanishing gradient problem. This problem results from the fact that the information of a given input decays or blows up exponentially as it circulates around the hidden states. The most effective solution for this problem is the LSTM architecture, which can learn long-term dependency from a time series. The architecture of the LSTM unit is shown in Fig. 2. The LSTM unit consists of a self-connected memory cell ($c_t$) and three gates: an input gate ($i_t$) to control the storage of the input data, a forget gate ($f_t$) to control the discarding of the previous state and an output gate ($o_t$) to generate the output results. At a given time step $t$, with the input and output represented by $x_t$ and $h_t$, the LSTM activations are calculated as follows.
$$i_t = \sigma(W_{xi} \cdot x_t + W_{hi} \cdot h_{t-1} + W_{ci} \cdot c_{t-1} + b_i) \tag{3}$$

$$f_t = \sigma(W_{xf} \cdot x_t + W_{hf} \cdot h_{t-1} + W_{cf} \cdot c_{t-1} + b_f) \tag{4}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} \cdot x_t + W_{hc} \cdot h_{t-1} + b_c) \tag{5}$$

$$o_t = \sigma(W_{xo} \cdot x_t + W_{ho} \cdot h_{t-1} + W_{co} \cdot c_t + b_o) \tag{6}$$

$$h_t = o_t \odot \tanh(c_t) \tag{7}$$

Here $\odot$ denotes the element-wise product and $\sigma$ is the sigmoid function. $W_{ab}$ is the weight matrix between $a$ and $b$, and $b_c$ is the bias term of $c$.

However, adding three gates to the LSTM neuron brings more training parameters and more computation compared to the standard RNN neuron. Thus, the LSTM network takes a lot of time when dealing with long-time-step samples.
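To make the data flow of Eqs. (3)-(7) concrete, the following is a minimal pure-Python sketch of a single LSTM time step. It is illustrative only and is not the authors' implementation: the helper `init_params`, the dictionary keys, the random initialization range and the toy dimensions are all assumptions introduced for this example.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def hadamard(u, v):
    """Element-wise product used in Eqs. (5) and (7)."""
    return [a * b for a, b in zip(u, v)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def init_params(n_in, n_hid, seed=0):
    """Hypothetical helper: small random weights keyed by the paper's symbols."""
    rng = random.Random(seed)
    mat = lambda rows, cols: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                              for _ in range(rows)]
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = mat(n_hid, n_in)
        p["W_h" + gate] = mat(n_hid, n_hid)
        p["b_" + gate] = [0.0] * n_hid
    for gate in "ifo":  # weights applied to the cell state in Eqs. (3), (4), (6)
        p["W_c" + gate] = mat(n_hid, n_hid)
    return p


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    i_t = sigmoid(add(matvec(p["W_xi"], x_t), matvec(p["W_hi"], h_prev),
                      matvec(p["W_ci"], c_prev), p["b_i"]))          # Eq. (3)
    f_t = sigmoid(add(matvec(p["W_xf"], x_t), matvec(p["W_hf"], h_prev),
                      matvec(p["W_cf"], c_prev), p["b_f"]))          # Eq. (4)
    candidate = tanh(add(matvec(p["W_xc"], x_t), matvec(p["W_hc"], h_prev),
                         p["b_c"]))
    c_t = add(hadamard(f_t, c_prev), hadamard(i_t, candidate))       # Eq. (5)
    o_t = sigmoid(add(matvec(p["W_xo"], x_t), matvec(p["W_ho"], h_prev),
                      matvec(p["W_co"], c_t), p["b_o"]))             # Eq. (6)
    h_t = hadamard(o_t, tanh(c_t))                                   # Eq. (7)
    return h_t, c_t


# Toy usage: 9 input channels, 4 hidden neurons, three time steps.
params = init_params(n_in=9, n_hid=4)
h, c = [0.0] * 4, [0.0] * 4
for x_t in ([0.1] * 9, [0.2] * 9, [0.3] * 9):
    h, c = lstm_step(x_t, h, c, params)
```

Each step evaluates four weighted sums over both the input and the previous state, one after another along the sequence, which is why the per-sample cost grows with the number of time steps.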
B. ARCHITECTURE OF MULTI-LAYER PARALLEL LSTM NETWORK

In this work, we propose a multi-layer parallel LSTM network as a classification algorithm for human activity recognition. We extract the features of human activities with the LSTM network from the time series collected by smartphone sensors, and classify the activities on the basis of the features with a softmax classifier.

A time series can be embedded as a matrix comprised of the sensor data collected during an interval by multiple sensors, which can be described as

$$\begin{pmatrix}
A_{x0} & A_{y0} & A_{z0} & B_{x0} & B_{y0} & B_{z0} & \cdots \\
A_{x1} & A_{y1} & A_{z1} & B_{x1} & B_{y1} & B_{z1} & \cdots \\
A_{x2} & A_{y2} & A_{z2} & B_{x2} & B_{y2} & B_{z2} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \\
A_{xt} & A_{yt} & A_{zt} & B_{xt} & B_{yt} & B_{zt} & \cdots
\end{pmatrix} \tag{8}$$

where $A$, $B$ represent various sensors, $x$, $y$, $z$ represent the three directions of each sensor and $t$ is the length of one activity. However, as mentioned before, handling such a long-time-step signal with a single LSTM neuron is time-consuming. So in our solution, several LSTM units are used in parallel to handle different parts of the activity information.

Fig. 3 depicts the architecture of the Multi-Layer Parallel LSTM Network. It consists of the input layer, the parallel LSTM layer, the merging LSTM layer, the fully connected layers and the softmax layer.

Fig. 3. Multi-Layer Parallel LSTM Network (from bottom to top: input layer split into segments S1 to Sn; parallel LSTM layer whose outputs O1 to On are combined as primary features; merging LSTM layer; fully connected layers 1 and 2; softmax layer; prediction label)

In the input layer, the time series signal is split evenly into $n$ segments, $S_1 \sim S_n$, along the time dimension, so that each segment has the same number of time steps with no overlap. Correspondingly, $n$ LSTM units are fed with the $n$ segments and form the parallel LSTM layer. Every LSTM unit has the same hyper-parameters, such as the number of neurons and the dropout rate, so as to handle each segment equally. Each LSTM unit deals with its segment step by step in chronological order and iterates through the loop. The outputs at the last time step of each LSTM unit are collected and combined as the preliminary features with a size of $n \times h_1$, where $h_1$ is the number of hidden neurons of each LSTM unit. After that, a merging LSTM layer is added. It is fed with the preliminary features and iterates through the outputs from the previous layer in order. In other words, the merging LSTM layer works in chronological order as well.

The output at the last time step of the merging LSTM layer is selected as the feature of the activity, which contains the time dependency. It is a vector of size $h_2$, where $h_2$ represents the number of hidden neurons of the merging LSTM unit.

With the splitting and merging steps mentioned above, we can extract the time dependency from the samples quickly. Moreover, signal processing can start before the complete acquisition of an activity, which reduces the waiting time for each activity recognition in practical applications.
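The split, parallel and merge steps described above can be sketched in pure Python as follows. This is a hedged illustration of the data flow rather than the authors' TensorFlow model. The function names (`make_lstm`, `run_lstm`, `build_model`, `predict`), the random untrained weights, and the simplified LSTM without the cell-state terms of Eqs. (3), (4) and (6) are assumptions made to keep the example short. The default sizes follow the values reported in the experiments (n = 6, 9 input channels, 24 and 64 hidden neurons, fully connected layers of 64 and 32 neurons, six activities). Placing a separate projection to the six classes inside the softmax layer is also an assumption, and so is dropping the trailing time steps that do not fill a segment.

```python
import math
import random


def make_lstm(n_in, n_hid, rng):
    """Random weights for one LSTM unit; each gate row sees [x_t, h_prev, 1]."""
    return {gate: [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid + 1)]
                   for _ in range(n_hid)] for gate in "ifco"}


def run_lstm(seq, w):
    """Iterate a simplified LSTM over seq in order; return the last hidden state."""
    n_hid = len(w["i"])
    h, c = [0.0] * n_hid, [0.0] * n_hid
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    for x in seq:
        z = list(x) + h + [1.0]  # current input, previous hidden state, bias input
        pre = {gate: [sum(a * b for a, b in zip(row, z)) for row in w[gate]]
               for gate in "ifco"}
        c = [sig(f) * c_old + sig(i) * math.tanh(g)
             for f, c_old, i, g in zip(pre["f"], c, pre["i"], pre["c"])]
        h = [sig(o) * math.tanh(c_new) for o, c_new in zip(pre["o"], c)]
    return h


def dense(v, W, relu=True):
    """Fully connected layer; the last weight in each row acts as the bias."""
    out = [sum(a * b for a, b in zip(row, v + [1.0])) for row in W]
    return [max(0.0, a) for a in out] if relu else out


def softmax(v):
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(n=6, channels=9, h1=24, h2=64, fc=(64, 32), classes=6, seed=0):
    rng = random.Random(seed)
    rand = lambda rows, cols: [[rng.uniform(-0.1, 0.1) for _ in range(cols + 1)]
                               for _ in range(rows)]
    return {
        "parallel": [make_lstm(channels, h1, rng) for _ in range(n)],
        "merge": make_lstm(h1, h2, rng),
        "fc1": rand(fc[0], h2),
        "fc2": rand(fc[1], fc[0]),
        "out": rand(classes, fc[1]),
    }


def predict(sample, model):
    n = len(model["parallel"])
    m = len(sample) // n                                       # time steps per segment
    segments = [sample[k * m:(k + 1) * m] for k in range(n)]   # even split, no overlap
    # Each segment goes to its own LSTM unit; the units are independent of each
    # other, so they could run concurrently. Only the last output is kept.
    prelim = [run_lstm(seg, w) for seg, w in zip(segments, model["parallel"])]
    feature = run_lstm(prelim, model["merge"])                 # vector of size h2
    hidden = dense(dense(feature, model["fc1"]), model["fc2"])
    return softmax(dense(hidden, model["out"], relu=False))


# Toy usage: one random window of 128 time steps by 9 channels.
rng = random.Random(1)
sample = [[rng.gauss(0.0, 1.0) for _ in range(9)] for _ in range(128)]
probs = predict(sample, build_model())
```

With these defaults each parallel unit iterates over 21 time steps and the merging unit over 6, instead of a single unit iterating over all 128 steps.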
After that, two fully connected layers are used to reduce the dimension of the features to a $1 \times m$ vector, where $m$ is the number of activities. Then the output is fed into a softmax logistic regression layer, which produces the classification outcome.

C. TRAINING SETUP

To avoid the over-fitting problem, we define a dropout rate for each hidden layer, which permits the hidden layer to drop out a certain number of neurons during training. The training parameters are randomly initialized and optimized with the stochastic gradient descent method by minimizing the training loss function, which is

$$\Gamma = -\frac{1}{m} \sum_{i=1}^{m} y_i \cdot \log y_i' + \lambda \cdot \|W\| \tag{9}$$

where $m$ represents the number of samples per training batch, and $\lambda$ is a weight parameter.

The loss function consists of two parts: a cross-entropy function and an L2 regularization term over all trainable parameters. The cross-entropy function is computed between the predicted output ($y_i$) from the model and the label ($y_i'$) from the benchmark dataset. The L2 regularization is another way to avoid over-fitting, which limits the trainable parameters $W$ to relatively small values. We keep training the model epoch by epoch on the training data until the model becomes stable, and then calculate the recognition accuracy on the test data.

IV. EXPERIMENTAL EVALUATION

We evaluate our model on the UCI dataset [12] in terms of accuracy and computation complexity.

A. UCI DATASET

This dataset consists of recordings of 30 subjects from 19 to 48 years old. They were instructed to carry out six activities: walking, going upstairs, going downstairs, sitting, standing and lying. A smartphone (Samsung Galaxy SII) was worn on the waist of each subject, with a sampling rate of 50 Hz. The triaxial acceleration from the accelerometer and the triaxial angular velocity from the gyroscope were captured and the data were labeled manually. The sensor signals (accelerometer and gyroscope) were pre-processed by applying Butterworth low-pass filters and then sampled in fixed-width sliding windows of 2.56 seconds with 50% overlap (128 readings per window). The obtained dataset is randomly partitioned into two parts, in which 70% is selected for training and the remaining 30% for testing.

B. EXPERIMENT SETUP

The TensorFlow machine learning library [13] is used for our model implementation, and the experiments run on a machine with Intel Xeon E5-2695 v4 CPUs and NVIDIA Tesla M40 GPUs. We tune the hyper-parameters greedily to get as close to the optimal performance as possible. Table I lists part of the hyper-parameters and the experimental setup after adjustment. We train the model with a learning rate of 0.005 and 100 batches per epoch, which has proved effective in a similar environment [14]. The performance of the model is evaluated on the dataset after each epoch of training. The training is done for more than 10000 epochs and stops if there is no increase in performance for 10 subsequent epochs.

TABLE I
EXPERIMENTAL SETUP PART I
Parameter                                        Value
Time steps of input                              128
Size of input channels                           9
Hidden neurons of fully connected layers         64-32
Activation function of fully connected layers    ReLU
Learning rate                                    0.005
Probability of dropout                           0.4
Samples of each batch                            200
Batches of each epoch                            100

C. RESULTS

Generally, sensor-based activity recognition methods are evaluated in two aspects: recognition accuracy and computation complexity. Table II lists the performance under different settings of the parallel LSTM layer. The number of parallel LSTM units, represented by $n$, and the number of input time steps of each parallel LSTM unit, represented by $m$, are set by the following principle, so that each sample is used as fully as possible and its effect on the comparison is eliminated.

$$\{(n, m) \mid n \in \mathbb{N}^+,\ m = \left\lfloor \frac{128}{n} \right\rfloor,\ m > 10\} \tag{10}$$

TABLE II
MODEL PERFORMANCE AT DIFFERENT NUMBER OF LSTM UNITS
Number of        Time Step of    Recognition     Recognition
Parallel Units   Each Input      Accuracy (%)    Time (ms)
11               11              92.55           8.02
10               12              93.07           7.43
9                14              93.75           7.19
8                16              94.06           6.63
7                18              94.10           6.21
6                21              94.21           5.76
5                25              93.99           9.11
4                32              94.02           10.23
2                64              93.96           18.22
1                128             93.72           32.63

From the table, we note that the recognition accuracy is around 94% for most settings (92.55% to 94.21%), which suggests that the model can extract similar features from the samples under different settings. However, the computation complexity decreases as the number of parallel LSTM units increases, because an LSTM unit takes much time to deal with a long-time-step input. Meanwhile, the computation complexity increases a little when many more parallel LSTM units are used. This might be because the merging LSTM layer takes much more time to deal with the output from the parallel LSTM layer under such settings. Overall, the model reaches the least computation complexity at the setting of $n = 6$, $m = 21$, with a recognition accuracy of 94.21%.

In addition, we study the influence of the number of LSTM unit neurons on the overall performance. Fig. 4 shows the effect of the number of neurons in the parallel LSTM unit on the model recognition accuracy. From it, we note that the performance of the model improves as the number of neurons in the parallel LSTM unit increases and converges to 94.27%. Fig. 5 shows the effect of the number of neurons in the merging LSTM unit on the model recognition accuracy, which follows a similar trend. Considering the computational burden of a larger number of neurons, we choose 24 and 64 neurons for the parallel and the merging LSTM units, respectively, on the premise of guaranteeing the model performance.

Fig. 4. Recognition accuracy of the model with different number of the neurons in the parallel LSTM unit (x-axis: number of the parallel LSTM unit neurons, 4 to 40; y-axis: recognition accuracy, 0.86 to 0.95)

Fig. 5. Recognition accuracy of the model with different number of the neurons in the merging LSTM unit (x-axis: number of the merging LSTM unit neurons, 12 to 144; y-axis: recognition accuracy, 0.925 to 0.945)

Moreover, we also study the influence of the dropout rate on the overall performance. Experiments show that the model overfits early in training with a high dropout rate and does not converge with a low dropout rate. Finally, we found that it is better to set the dropout rate between 0.35 and 0.5.

Overall, we get the remaining model setups, which are displayed in Table III, and the model reaches a recognition accuracy of 94.34%.

TABLE III
EXPERIMENTAL SETUP PART II
Parameter                                        Value
Number of parallel LSTM units                    6
Time step of each parallel LSTM unit input       21
Hidden neurons of parallel LSTM unit             24
Hidden neurons of merging LSTM unit              64
Probability of dropout                           0.4

Lastly, we compare our multi-layer LSTM network with other state-of-the-art methods in HAR. Table IV lists the human activity recognition accuracy achieved with different methods on the UCI dataset. From it, we note that our multi-layer LSTM network achieves a recognition accuracy of 94.34%, which is superior to traditional machine learning methods such as SVM and HMM, and close to the results of CNN-based methods. But compared with the CNN-based methods, our solution has much lower computation complexity, as shown in Table V. The feature extraction from a long-time-step sample input with the parallel-merging mode that we use reduces much computation time compared with the serial LSTM network. Meanwhile, our method can take advantage of the intermediate results of several previous calculations, which greatly reduces the computation time as well. In addition, the LSTM-based method can be fed with the data collected at each moment in real time, while the CNN-based method has to carry out a recognition with a complete sample.

TABLE IV
CLASSIFICATION RESULTS OF UCI DATASET
Method                                   Accuracy (%)
Dynamic Time Warping [15]                89.00
Handcrafted features + SVM [16]          89.00
Convolutional Neural Network [17]        90.89
Hidden Markov Models [18]                91.76
PCA + SVM [19]                           91.82
Stacked Autoencoders + SVM [19]          92.16
Hierarchical Continuous HMM [20]         93.18
LSTM Network (This paper)                94.34
Convolutional Neural Network [21]        94.79
Convolutional Neural Network [9]         95.31

TABLE V
COMPUTATION COMPLEXITY
Method                                   Recognition Time (ms)
Convolutional Neural Network [21]        21.32
Convolutional Neural Network [9]         25.61
Multi-layer Serial LSTM network          65.02
Multi-layer Parallel LSTM Network        5.76
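As a small check of the setting principle in Eq. (10), the candidate pairs (n, m) can be enumerated directly. The sketch below assumes that the bracket in Eq. (10) denotes the floor of 128/n, which matches the time steps listed in Table II; the function name `candidate_settings` is hypothetical.

```python
def candidate_settings(total_steps=128, min_steps=10):
    """Enumerate (n, m) with m = floor(total_steps / n) and m > min_steps."""
    settings = []
    n = 1
    # m never increases as n grows, so the loop can stop at the first failure.
    while total_steps // n > min_steps:
        settings.append((n, total_steps // n))
        n += 1
    return settings


for n, m in candidate_settings():
    # Time steps of the 128-step window that no segment covers.
    print(f"n={n:2d}  m={m:3d}  unused steps={total_unused}"
          if (total_unused := 128 - n * m) >= 0 else "")
```

Under this reading the principle admits n = 1 to 11. Table II reports ten of these settings; the pair n = 3, m = 42 also satisfies Eq. (10) but does not appear in the table.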
V. CONCLUSIONS

This paper proposes a smartphone-based HAR method using a multi-layer parallel LSTM network. The proposed method can automatically extract time-dependent features from the original sensor data and classify the activities with a softmax layer. A public HAR dataset, the UCI dataset, which contains six activities, is used for the experiments. The hyper-parameters of the network are tuned toward the optimal state, which leads to an accuracy of 94.34%. The results show that our proposed method performs better than traditional machine-learning methods and spends less time during recognition than CNN-based methods, which makes it suitable for building a low-cost real-time HAR system on smartphone platforms.

ACKNOWLEDGMENT

This work was supported by the funding of Key Lab of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications, Ministry of Education, JZNY201704), Nanjing University of Posts and Telecommunications (NY217021, NY218014).

REFERENCES

[1] C. Chen, R. Jafari, and N. Kehtarnavaz, “A survey of depth and inertial sensor fusion for human action recognition,” Multimedia Tools & Applications, vol. 76, no. 3, pp. 4405–4425, 2017.
[2] Y. Chen and C. Shen, “Performance analysis of smartphone-sensor behavior for human activity recognition,” IEEE Access, vol. 5, no. 99, pp. 3095–3110, 2017.
[3] A. Jahangiri and H. A. Rakha, “Applying machine learning techniques to transportation mode recognition using mobile phone sensor data,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 5, pp. 2406–2417, 2015.
[4] C. V. S. Buenaventura and N. M. C. Tiglao, “Basic human activity recognition based on sensor fusion in smartphones,” in Integrated Network and Service Management, pp. 1182–1185, 2017.
[5] Z. Chen, Q. Zhu, S. Y. Chai, and L. Zhang, “Robust human activity recognition using smartphone sensors via ct-pca and online svm,” IEEE Transactions on Industrial Informatics, vol. 13, no. 6, pp. 3070–3080, 2017.
[6] H. He, Y. Tan, and J. Huang, “Unsupervised classification of smartphone activities signals using wavelet packet transform and half-cosine fuzzy clustering,” in IEEE International Conference on Fuzzy Systems, pp. 1–6, 2017.
[7] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, “Deep learning for sensor-based activity recognition: A survey,” Pattern Recognition Letters, 2018.
[8] K. Nakano and B. Chakraborty, “Effect of dynamic feature for human activity recognition using smartphone sensors,” in International Conference on Awareness Science and Technology, pp. 539–543, 2017.
[9] I. Andrey, “Real-time human activity recognition from accelerometer data using convolutional neural networks,” Applied Soft Computing, pp. 1–8, 2017.
[10] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, “Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks,” National Laboratory of Pattern Recognition, pp. 3697–3703, 2016.
[11] N. H. Friday, M. A. Al-Garadi, G. Mujtaba, U. R. Alo, and A. Waqas, “Deep learning fusion conceptual frameworks for complex human activity recognition using mobile and wearable sensors,” in International Conference on Computing, Mathematics and Engineering Technologies, pp. 1–7, 2018.
[12] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A public domain dataset for human activity recognition using smartphones,” European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning ESANN, pp. 467–442, 2013.
[13] “Tensorflow api.” https://www.tensorflow.org.
[14] K. Li, X. Zhao, J. Bian, and M. Tan, “Sequential learning for multimodal 3d human activity recognition with long-short term memory,” in IEEE International Conference on Mechatronics and Automation, pp. 1556–1561, 2017.
[15] S. Seto, W. Zhang, and Y. Zhou, “Multivariate time series classification using dynamic time warping template selection for human activity recognition,” in Computational Intelligence, 2015 IEEE Symposium, pp. 1399–1406, 2016.
[16] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine,” in International Conference on Ambient Assisted Living and Home Care, pp. 216–223, 2012.
[17] C. A. Ronao and S. B. Cho, “Evaluation of deep convolutional neural network architectures for human activity recognition with smartphone sensors,” Korea Information Science Society, pp. 858–861, 2015.
[18] C. A. Ronao and S. B. Cho, “Human activity recognition using smartphone sensors with two-stage continuous hidden markov models,” in International Conference on Natural Computation, pp. 681–686, 2014.
[19] Y. Li, D. Shi, B. Ding, and D. Liu, “Unsupervised feature learning for human activity recognition using smartphone sensors,” in Mining Intelligence and Knowledge Exploration: Second International Conference, pp. 99–107, 2014.
[20] C. A. Ronao and S. B. Cho, “Recognizing human activities from smartphone sensors using hierarchical continuous hidden markov models,” International Journal of Distributed Sensor Networks, vol. 13, no. 1, pp. 1–16, 2017.
[21] C. A. Ronao and S. B. Cho, “Human activity recognition with smartphone sensors using deep learning neural networks,” in Pergamon Press, pp. 235–244, 2016.
