
Human Activity Recognition with HMM-DNN Model

Licheng Zhang, Xihong Wu and Dingsheng Luo*


Key Lab of Machine Perception (Ministry of Education), Speech and Hearing Research Center
School of Electronics Engineering and Computer Science, Peking University
Beijing, 100871, China
Emails: {zhanglc, wxh, dsluo}@cis.pku.edu.cn
* Corresponding author

Abstract—Activity recognition commonly made use of hidden Markov models (HMMs) to exploit temporal dependencies between activities. The emission distribution of HMMs could be represented by generative models, such as Gaussian mixture models (GMMs), or discriminative models, such as random forest (RF). These models, especially discriminative ones, needed to manually extract features from the sensor data, which relied on the experience of the researchers and usually was a time-consuming task when complicated features were extracted. Furthermore, with these methods, the process of quantization of the sensor data, i.e., manual feature extraction, might lose much useful information and thus lead to a performance debasement. In this paper, we recommend deep neural networks (DNNs) for modeling the emission distribution of HMMs, which automatically learn features suitable for classification from the raw sensor data and then estimate the posterior probabilities of the HMM states. We collected a dataset of daily activities, based on which experiments were performed to compare our HMM-DNN model with both HMM-GMM and HMM-RF. The results illustrated that HMM-DNN outperformed both HMM-GMM and HMM-RF.

Keywords—activity recognition; deep neural networks; hidden Markov models; sensor data; accelerometer

I. INTRODUCTION

Activity recognition usually made use of hidden Markov models (HMMs) to model temporal dependencies between consecutive activities [1-5]. One critical step of activity recognition with HMMs is to model the emission distribution of HMMs. The emission distribution could be modeled by generative models, such as Gaussian mixture models (GMMs) [1, 2], or discriminative models, such as decision stump (DS) [3], decision tree (DT) [4], random forest (RF) [5], and so on. These models that modeled the emission distribution of HMMs, especially the discriminative models, needed to manually extract features from the sensor data, which relied on the experience of the researchers, and then modeled the extracted features. For example, Lester et al. [3] collected sensor data and extracted features from the data, including mean, variance, and so on. Then they used DS and GMM to model the emission distribution of HMMs based on the extracted features. The results showed that HMM-DS performed better than HMM-GMM. Mannini et al. [1] used a dataset of acceleration from [6] for activity recognition. They extracted four kinds of features for each data frame, including the DC component of acceleration, which was estimated by taking the average of the data samples within each frame, the energy, the frequency-domain entropy, and the correlation coefficients. Then they used GMM to model the extracted features. They compared the performance of HMM-GMM with that of a large amount of static classifiers, which did not take the temporal dependencies between activities into consideration, including GMM, Naïve Bayes (NB), logistic regression (LR), DT, support vector machine (SVM), k-nearest neighbor (KNN), artificial neural network (ANN), and so on. The results showed that HMM-GMM performed better than the static classifiers. Ataya et al. [5] collected triaxial accelerometer data of nine activities and extracted several features for each frame, including mean, energy, correlation, and so on. Then they used several models to model the emission distribution of HMMs, including DT, RF, SVM, KNN, GMM, and AdaBoost. The results showed that taking the temporal dependencies between activities into account improved classification performance, and that discriminative classifiers, such as HMM-SVM, performed better than the generative one (HMM-GMM); among the discriminative models, HMM-RF performed better than the other models, i.e., HMM-DT, HMM-SVM, HMM-KNN, and HMM-AdaBoost.

From the aforementioned examples, we can find that previous works needed to manually extract features from the sensor data, which relied on the experience of the researchers. The problem of these methods lies in that the quantization of the sensor data, i.e., manual feature extraction, might lose much useful information that would improve classification performance.

In this paper, we recommend deep neural networks (DNNs) for modeling the emission distribution of HMMs, which automatically learn features suitable for classification from the raw sensor data and then estimate the posterior probabilities of the HMM states. We collected an acceleration dataset of daily activities using the accelerometer on a smartphone and did experiments on this dataset. We used DNN, GMM, and RF to model the emission distribution of HMMs respectively, and compared the performance of our HMM-DNN method with that of HMM-GMM and HMM-RF. The experimental results showed that HMM-DNN performed better than HMM-GMM and HMM-RF.

The remainder of this paper is organized as follows. Section II describes the related work. Section III gives a brief description of HMMs and details how to use a DNN to model the emission distribution of HMMs. Section IV introduces the data collection process, details the experimental design and gives the results of the experiments. Section V summarizes our conclusions and gives the future work.

II. RELATED WORK

Activity recognition that used HMMs to take the temporal dependencies between activities into consideration, and used generative models or discriminative models to model the emission distribution of HMMs, has been researched for many years. Lester et al. [3] collected sensor data using a multi-sensor board mounted on the shoulder of the subject. They extracted a total of 651 features for each frame. Then they selected several features from them that enabled the classifiers to distinguish well between activities. The recognized activities were sitting, standing, walking, jogging, walking upstairs, walking downstairs, riding a bicycle, driving a car, riding an elevator down, and riding an elevator up. They did several experiments. First of all, they compared the performance of static classifiers, i.e., a discriminative DS and a generative NB, and the results showed that the discriminative model performed better. Then they used HMM to model the temporal dependencies between activities and used GMM and DS to model the emission distribution of HMMs, and the results showed that HMM-DS performed better than DS and NB, but DS and NB performed better than HMM-GMM. Mannini et al. [1] used an acceleration dataset from MIT [6] for activity recognition. The sampling frequency was 76.25 Hz. The frame length was 6.7 s. Two consecutive frames overlapped by 50%. They extracted four kinds of features for each data frame: the DC component, the energy, the frequency-domain entropy, and the correlation coefficients. Then they used HMM to model the temporal dependencies between activities and used GMM to model the emission distribution of HMMs. The recognized activities were sitting, lying, standing, walking, stair climbing, running, and cycling. They compared the performance of HMM-GMM with a large amount of static classifiers, which did not consider the temporal dependencies between activities, including GMM, NB, LR, DT, SVM, KNN, ANN, and so on. The results showed that HMM-GMM performed better than the static classifiers. Lee et al. [7] collected acceleration data from a triaxial accelerometer on a smartphone. The sampling rate of the accelerometer was 12 Hz and the frame length was five seconds, i.e., 60 samples. They used GMM to model the raw acceleration data of each data frame. They also made use of HMM to model the temporal dependencies between activities. They recognized five activities: standing, running, walking, ascending, and descending. They compared the performance of HMM-GMM with that of a static classifier, namely ANN. The results showed that HMM-GMM performed better than ANN when recognizing standing, running, walking, and ascending, while ANN performed better than HMM-GMM when recognizing descending. Ataya et al. [5] collected acceleration data using a triaxial accelerometer sensor placed at the hip level. The sampling rate was 100 Hz. Nine activities were recognized, including lying down, slouching/sitting, standing (static activities), stamping, cycling, running, slow walking, fast walking, and using stairs (dynamic activities). The frame length was 2 s. Two consecutive frames overlapped by 50 percent. They used a normalized Signal Magnitude Area value to distinguish static and dynamic activities. For static activities, they continued to extract 14 features for each data frame, including mean, energy, and so on. For dynamic activities, they extracted 18 features, including entropy, correlation between axes, spectral energy, and so on. They used several models to model the emission distribution of HMMs, including DT, RF, SVM, KNN, GMM, and AdaBoost. They compared the performance of these models, i.e., HMM-DT, HMM-RF, HMM-SVM, HMM-KNN, HMM-GMM and HMM-AdaBoost, with that of the static classifiers, i.e., DT, RF, SVM, KNN, GMM, and AdaBoost, which did not consider the temporal dependencies between activities. The results showed that taking the temporal dependencies between activities into consideration improved classification performance, and that discriminative classifiers, such as HMM-SVM, performed better than the generative one (HMM-GMM); among the discriminative models, HMM-RF performed better than the other models, i.e., HMM-DT, HMM-SVM, HMM-KNN, and HMM-AdaBoost. Ronao et al. [8] used an acceleration and gyroscope sensor dataset, namely the Human Activity Recognition Using Smartphones Dataset from [9], collected using the accelerometer and gyroscope on a smartphone, for activity recognition. The sampling rate was 50 Hz. The frame length was 2.56 s and two consecutive frames overlapped by 50%. 561 features were extracted for each data frame. The researchers used random forest variable importance measures to select important features and obtained a feature subset. Six activities were recognized, including walking, walking upstairs, walking downstairs (moving activities), sitting, standing, and lying (stationary activities). They proposed two-stage HMMs for activity recognition. Specifically, they used GMM to model the selected features and HMM to model the temporal dependencies between activities. HMM-GMM first classified a data frame into the moving or the stationary subclass. Then, for each subclass, HMM-GMM classified the data frame into a specific activity. They compared the performance of the two-stage HMM model with that of DT, NB, and ANN. The results showed that the two-stage HMM-GMM model performed best.

The above works needed to manually extract features from the sensor data, relying on the experience of the researchers, and then modeled the emission distribution of HMMs based on these features. The common problem of the above works is that the quantization of the sensor data, i.e., manual feature extraction, might lose much useful information that would improve classification performance.

In this paper, we recommend DNNs for modeling the emission distribution of HMMs, which automatically learn features suitable for classification from the raw sensor data and then estimate the posterior probabilities of the HMM states. We did experiments on our collected acceleration dataset and proved that our recommended method performed better than the traditional methods.

III. HMM-DNN BASED ACTIVITY RECOGNITION

We recommend HMM-DNN for activity recognition. Specifically, we use HMM to model the temporal dependencies between activities and use DNN to model the emission probabilities of HMMs.

In this part, we first give a brief description of HMMs and then detail how to use a DNN to model the emission distribution of HMMs.

A. Using HMMs to model the temporal dependencies

HMMs are probabilistic models that model time sequences.

Fig. 1. The graphical representation of an HMM

A HMM consists of a sequence of hidden states and a sequence of observations. The graphical representation of a HMM is shown in Fig. 1. At each time step t, there is a hidden variable z_t and an observed variable x_t in the HMM. In the field of activity recognition from sensor data, the hidden variable is the activity performed, such as walking or running, and the observed variable is the data vector of each frame. HMMs make two basic assumptions:

• The current hidden state depends only on the previous hidden state.

• The observed variable x_t is governed only by the corresponding hidden state z_t.

A HMM is determined by the prior probability distribution, the transition probability distribution and the observation probability distribution. The probability density P(z_1) is the prior probability distribution. The conditional probability density p(z_t | z_{t-1}) is the transition model and the conditional probability density p(x_t | z_t) is the observation model. Training a HMM means maximizing the joint probability P(x_{1:T}, z_{1:T}), with T being the total number of time steps. Based on the above two assumptions, the joint distribution is calculated using the above three distributions as follows [10]:

P(x_{1:T}, z_{1:T}) = P(z_1) ∏_{t=2}^{T} p(z_t | z_{t-1}) ∏_{t=1}^{T} p(x_t | z_t).    (1)

After the training of the HMM, we get the parameters of the HMM. Then, we use the Viterbi algorithm [11] to infer the label sequence that best explains a new sequence of observations.
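As an illustration of this decoding step, a minimal NumPy sketch of the Viterbi algorithm (our illustration, not the authors' code) could look as follows; it works in log space to avoid numerical underflow, and the emission scores can come from any observation model, for example the DNN-based scores described below.

```python
import numpy as np

def viterbi(log_prior, log_trans, log_emission):
    """Infer the most likely hidden state sequence.

    log_prior:    (N,)   log P(z_1)
    log_trans:    (N, N) log p(z_t | z_{t-1}), rows index the previous state
    log_emission: (T, N) log p(x_t | z_t) for each frame t
    """
    T, N = log_emission.shape
    delta = log_prior + log_emission[0]       # best log-score ending in each state
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: best path ending i -> j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emission[t]
    # backtrack from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```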

B. Using DNNs to model the emission distribution of HMMs

One critical step of training a HMM is to learn the observation model p(x_t | z_t). In this paper, we recommend DNNs for modeling the observation probability distribution of HMMs.

DNNs are a kind of neural network model. A DNN has an input layer, two or more hidden layers, and an output layer. The graphical representation of the DNN model is shown in Fig. 2.

Fig. 2. The graphical representation of a DNN (input layer, hidden layers, output layer)

The nodes represent the variables and the links between the nodes represent the weight parameters. Arrows denote the direction of information flow through the network. Compared with the traditional three-layer ANN, a DNN has many more parameters because it has many more hidden layers and each hidden layer has many more units. Due to the large number of parameters, DNNs possess the ability of automatically learning features suitable for classification from the raw sensor data.

In a DNN, the data vector of a frame is fed into the input units. For each hidden unit j, its activation probability y_j is calculated using the inputs from the previous layer as follows [12]:

y_j = 1 / (1 + e^{-x_j}),   x_j = b_j + Σ_i y_i w_{ij},    (2)

where b_j is the bias of unit j, i is an index over units in the previous layer, y_i is the input from unit i in the previous layer, and w_{ij} is the weight on the connection from unit i to unit j. Then the activation probabilities of the units of this layer are taken as the inputs of the next layer. For multiclass classification, output unit j uses "softmax" to convert the inputs from the previous layer into a class probability p_j:

p_j = e^{x_j} / Σ_k e^{x_k},    (3)

where x_j is the total input to output unit j and k is an index over all classes.

The parameters of a DNN, including the weights and the biases, can be initialized using random initialization, or using pre-training by treating every two adjacent layers as a Restricted Boltzmann Machine (RBM), which was proposed by Hinton in [13]. After parameter initialization, the whole network is trained using the back-propagation algorithm with the information of the labels, to get the parameters optimized.
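To make equations (2) and (3) concrete, the forward pass of such a network can be sketched in NumPy as follows (an illustrative sketch with our own variable names; the trained weights and biases are assumed given):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output.

    x:       (D,) raw data vector of one frame
    weights: list of (n_in, n_out) matrices, one per layer
    biases:  list of (n_out,) vectors, one per layer
    Returns the class probabilities p(z | x) of equation (3).
    """
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        y = sigmoid(y @ W + b)          # equation (2), layer by layer
    logits = y @ weights[-1] + biases[-1]
    logits -= logits.max()              # for numerical stability
    e = np.exp(logits)
    return e / e.sum()                  # softmax, equation (3)
```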

Fig. 3. The accuracy of different methods (bar chart over GMM, HMM-GMM, RF, HMM-RF, DNN and HMM-DNN; legible values include 0.8488, 0.8569 and 0.9352)

After DNN training, the DNN outputs the class probabilities p(z_t | x_t) given an observation x_t at time step t. If the number of classes is N, p(z_t | x_t) is (p_1, p_2, ..., p_N), calculated using (3). But HMMs require p(x_t | z_t), as is shown in (1). We use Bayes' theorem to calculate the emission probabilities required by HMMs:

p(x_t | z_t) = p(z_t | x_t) p(x_t) / p(z_t).    (4)

It should be noted that p(x_t) has no effect on the result of inferring the label sequences. Therefore, the emission probabilities can be calculated by dividing the DNN outputs by the total number of training frames of the corresponding class, which is proportional to the prior p(z_t).
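In code, this conversion is a single vectorized step; the sketch below (our illustration) divides the DNN posteriors by class priors estimated from training-frame counts and works in log space, dropping the per-frame constant p(x_t), which does not change the Viterbi path:

```python
import numpy as np

def scaled_emission_logp(posteriors, class_counts):
    """Convert DNN posteriors p(z|x) into scaled likelihoods p(x|z).

    posteriors:   (T, N) softmax outputs for T frames
    class_counts: (N,) number of training frames per class,
                  proportional to the prior p(z)
    Returns (T, N) log emission scores equal to log p(x|z)
    up to an additive per-frame constant.
    """
    log_prior = np.log(class_counts / class_counts.sum())
    return np.log(posteriors) - log_prior   # eq. (4) without the p(x_t) term
```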
IV. EXPERIMENTS

In this part, we introduce the data collection process, detail the experimental design and give the results of the experiments.

A. Data Collection

We collected acceleration data using the accelerometer on an Android smartphone. The smartphone was attached to the left upper arm of one subject. The whole process was performed as follows. The subject first rode a bike to a building. After arriving at the building, the subject began to walk and walked into the building. Then the subject walked upstairs to the eighth floor. Then he walked into the office of his tutor and stood to listen to the tutor. After the teaching of the tutor, the subject walked out of the office and then went downstairs to the ground floor. Then he walked out of the building and finally rode the bike away from the building. Five activities were considered: riding, walking, walking upstairs, standing, and walking downstairs. The whole process is described as follows: riding-walking-walking upstairs-walking-standing-walking-walking downstairs-walking-riding. The process was performed four times when the tutor asked the subject to come to his office and taught him. The sampling rate was 42 Hz. We set the frame length to be 1 s, which was proved to be sufficient to distinguish different activities [14], and two consecutive frames overlapped by 50 percent. We took the data in the fourth collection as the testing data. The data of the other three collections were taken as the training data. TABLE I gives the details of the training set and the testing set. The first column lists the different activities, the second column gives the number of frames of each activity in the training set, and the third column gives the number of frames of each activity in the testing set. The last row of TABLE I gives the total number of frames in the training set and the testing set respectively.

TABLE I. THE AMOUNT OF DATA OF DIFFERENT ACTIVITIES IN OUR DATASET

Activities            Number of frames       Number of frames
                      in the training set    in the testing set
Riding                839                    238
Walking               582                    199
Walking upstairs      783                    237
Standing              979                    191
Walking downstairs    797                    246
Total amount          3980                   1111
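As an illustration of this framing scheme (a sketch with hypothetical names, not the authors' code), the overlapping frames can be cut from the raw triaxial signal as follows:

```python
import numpy as np

def make_frames(signal, rate=42, frame_sec=1.0, overlap=0.5):
    """Cut a (num_samples, 3) triaxial signal into overlapping frames.

    With rate=42 Hz and frame_sec=1.0, each frame holds 42 samples
    per axis, i.e., a 126-dimensional input vector for the DNN.
    """
    frame_len = int(rate * frame_sec)            # 42 samples
    step = int(frame_len * (1.0 - overlap))      # 21 samples (50% overlap)
    frames = [signal[s:s + frame_len].T.ravel()  # 42 samples of each axis, concatenated
              for s in range(0, len(signal) - frame_len + 1, step)]
    return np.asarray(frames)                    # shape (num_frames, 126)
```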

Fig. 4. The precision of different methods (bar chart over GMM, HMM-GMM, RF, HMM-RF, DNN and HMM-DNN; legible values include 0.8472, 0.9103 and 0.9337)

B. Experimental Design

We compared our HMM-DNN method with traditional methods. Among the traditional methods, HMM-RF and HMM-GMM performed better than the other methods, as was shown in [1, 5, 9]. Therefore, we chose HMM-RF and HMM-GMM as the baselines. We extracted four kinds of features for HMM-RF and HMM-GMM: the DC component, the energy, the frequency-domain entropy, and the correlation coefficients, which were used in [1, 2]. For RF, the number of trees was set to 50, as was used in [5]. For GMM, the number of components in the mixture was set to 10; more components had no significant impact on the accuracy.

For our HMM-DNN method, the parameters of the DNN were randomly initialized. The DNN was trained using stochastic gradient descent with a mini-batch size of 100 training samples. We ran 1000 epochs at a learning rate of 0.1 with a weight decay of 0.0002. An epoch refers to passing over all the mini-batches, i.e., the entire training set. We adjusted the structure of the DNN to achieve the best performance, including the number of hidden layers and the number of units in each hidden layer. When we adjusted the number of units of one layer, we kept the number of units of the other layers fixed. For example, when we adjusted the number of units of one of the hidden layers, we fixed the number of units of the other hidden layers and first set the number of units to a small value such as 50 and a large value such as 3072, respectively. Then we gradually increased the value from 50 in small steps such as 100, decreased the value from 3072 in small steps such as 100, and observed the results of the experiments. When we found a small interval containing the optimal solution, for example, between 400 and 600, we set the step for increasing and decreasing to a smaller value, such as 10 rather than 100. Finally, the DNN that achieved the best recognition accuracy had a structure of 126-500-500-2000-5. That is to say, it had 126 units in the input layer, 500 units in the first hidden layer, 500 units in the second hidden layer, 2000 units in the third hidden layer, and 5 units in the output layer, representing the 5 categories.

Fig. 5. The recall of different methods (bar chart over GMM, HMM-GMM, RF, HMM-RF, DNN and HMM-DNN; legible values include 0.8402, 0.8656 and 0.9322)

It should be noted that the 126 data points of the input consist of 42 samples (1 second of data) from each of the three axes.
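For reference, this architecture and these training settings could be written down as follows in a modern framework such as PyTorch (a sketch of the stated configuration, not the authors' original implementation):

```python
import torch
import torch.nn as nn

# The reported 126-500-500-2000-5 structure: sigmoid hidden units,
# softmax output (folded into the cross-entropy loss below).
model = nn.Sequential(
    nn.Linear(126, 500), nn.Sigmoid(),
    nn.Linear(500, 500), nn.Sigmoid(),
    nn.Linear(500, 2000), nn.Sigmoid(),
    nn.Linear(2000, 5),
)

# Reported training settings: SGD, mini-batches of 100 frames,
# learning rate 0.1, weight decay 0.0002, 1000 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=2e-4)
loss_fn = nn.CrossEntropyLoss()

def run_epoch(loader):
    """One pass over all mini-batches, i.e., the entire training set."""
    for frames, labels in loader:   # frames: (100, 126), labels: (100,)
        optimizer.zero_grad()
        loss = loss_fn(model(frames), labels)
        loss.backward()
        optimizer.step()
```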
During HMM training, we performed Laplace smoothing on the prior and the transition model with a small value of 0.01, to avoid zero probabilities.
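This smoothing simply adds the constant 0.01 as a pseudo-count to the maximum-likelihood counts, for example (a sketch with our own variable names):

```python
import numpy as np

def smoothed_hmm_params(label_seq, num_states, alpha=0.01):
    """Estimate prior and transition probabilities with Laplace smoothing.

    label_seq: 1-D array of per-frame activity labels from a training sequence
    alpha:     pseudo-count (0.01 in our experiments) that keeps unseen
               states and transitions from getting zero probability
    """
    prior = np.full(num_states, alpha)
    trans = np.full((num_states, num_states), alpha)
    prior[label_seq[0]] += 1                      # prior from the first frame(s)
    for a, b in zip(label_seq[:-1], label_seq[1:]):
        trans[a, b] += 1                          # count observed transitions
    return prior / prior.sum(), trans / trans.sum(axis=1, keepdims=True)
```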

C. Results

In this part, we give the results of HMM-GMM, HMM-RF and HMM-DNN. Besides, we give the results of the static classifiers, i.e., GMM, RF and DNN, which did not consider the temporal dependencies between activities. The accuracies of HMM-GMM, HMM-RF, HMM-DNN and the static classifiers are presented in Fig. 3. The precisions of the different methods are presented in Fig. 4. And the recalls of these methods are presented in Fig. 5.
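The paper does not spell out how precision and recall are averaged over the five classes; a common choice, shown in the sketch below (our illustration, not necessarily the authors' procedure), is the macro average of per-class values computed from the confusion matrix:

```python
import numpy as np

def summary_metrics(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision and recall."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                 # rows: true class, columns: predicted class
    accuracy = np.trace(cm) / cm.sum()
    precision = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)  # per predicted class
    recall = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)     # per true class
    return accuracy, precision.mean(), recall.mean()
```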

The results of HMM-GMM, HMM-RF and HMM-DNN show that the accuracy, the precision and the recall of HMM-DNN are all higher than those of HMM-GMM and HMM-RF. The reason that our HMM-DNN method performs better than the traditional methods, i.e., HMM-GMM and HMM-RF, lies in that manual feature extraction by the traditional methods loses much useful information that would improve classification performance, while the DNN used for modeling the emission distribution in our HMM-DNN method automatically learns features suitable for classification from the raw sensor data and thus gives a better emission distribution for the HMMs. Besides, we can find that both HMM-DNN and HMM-RF perform better than HMM-GMM, and both DNN and RF perform better than GMM. This proves once again that discriminative models perform better than the generative one, which has been concluded in [3, 5]. In addition, we can find that HMM-GMM performs better than GMM, HMM-RF performs better than RF, and HMM-DNN performs better than DNN. This proves once again that taking the temporal dependencies between activities into consideration improves activity recognition performance, which has been concluded in [5].

V. CONCLUSION AND FUTURE WORK

Traditional methods needed to manually extract features from the sensor data, which relied on the experience of the researchers. The problem of these methods lies in that the quantization of the sensor data, i.e., manual feature extraction, might lose much useful information that would improve the classification performance. In this paper, we recommend DNNs for modeling the emission distribution of HMMs, which automatically learn features suitable for classification from the raw sensor data. We collected an acceleration dataset and did experiments on this dataset. We used DNN, GMM, and RF to model the emission distribution of HMMs respectively, and compared the performance of HMM-DNN with that of HMM-GMM and HMM-RF. The results showed that HMM-DNN performed better than HMM-GMM and HMM-RF.

In the future, we will make our collected acceleration dataset public, so that it can serve as a public dataset for researchers to compare different methods in the field of activity recognition. Besides, we will try to make use of recurrent neural networks to consider more information in the past to contribute to the interpretation of the future, and thus further improve the classification performance. In addition, we will study whether the number of hidden layers in DNNs would be the larger the better, and whether the number of units in each layer would be the larger the better.

ACKNOWLEDGMENT

The work is supported in part by the National Basic Research Program (973 Program) of China (No. 2013CB329304), the National Natural Science Foundation of China (No. 90920302, No. 91120001, No. 61121002), a "Twelfth Five-Year" National Science & Technology Support Program of China (No. 2012BAI12B01), the Seeding Grant for Medicine and Information Sciences of Peking University (No. 2014-MI-10) and the Key Program of the National Social Science Foundation of China (No. 12&ZD119).

REFERENCES

[1] A. Mannini and A. M. Sabatini, "Machine learning methods for classifying human physical activity from on-body accelerometers," Sensors, vol. 10, no. 2, pp. 1154-1175, 2010.
[2] A. Mannini and A. M. Sabatini, "Accelerometry-based classification of human activities using Markov modeling," Computational Intelligence and Neuroscience, vol. 2011, no. 4, 2011.
[3] J. Lester, T. Choudhury, N. Kern, G. Borriello, and B. Hannaford, "A hybrid discriminative/generative approach for modeling human activities," Proceedings of the 19th International Joint Conference on Artificial Intelligence, vol. 5, pp. 766-772, 2005.
[4] T. Maekawa, Y. Kishino, Y. Sakurai, and T. Suyama, "Activity recognition with hand-worn magnetic sensors," Personal and Ubiquitous Computing, vol. 17, no. 6, pp. 1085-1094, 2013.
[5] A. Ataya, P. Jallon, P. Bianchi, and M. Doron, "Improving activity recognition using temporal coherence," in 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4215-4218, 2013.
[6] L. Bao and S. S. Intille, "Activity recognition from user-annotated acceleration data," Pervasive Computing, Lecture Notes in Computer Science, vol. 3001, pp. 1-17, 2004.
[7] Y. S. Lee and S. B. Cho, "Activity recognition using hierarchical hidden Markov models on a smartphone with 3D accelerometer," in Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, Springer Berlin Heidelberg, vol. 6678, pp. 460-467, 2011.
[8] C. A. Ronao and S. B. Cho, "Human activity recognition using smartphone sensors with two-stage continuous hidden Markov models," in 10th IEEE International Conference on Natural Computation (ICNC), pp. 681-686, 2014.
[9] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, "A public domain dataset for human activity recognition using smartphones," European Symposium on Artificial Neural Networks (ESANN), pp. 437-442, 2013.
[10] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[11] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, Nov. 2012.
[13] G. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[14] T. Plötz, N. Y. Hammerla, and P. Olivier, "Feature learning for activity recognition in ubiquitous computing," Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, vol. 22, pp. 1729-1734, 2011.
