
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 15, NO. 7, JULY 2019

A Novel Semisupervised Deep Learning Method for Human Activity Recognition

Qingchang Zhu, Zhenghua Chen, and Yeng Chai Soh

Abstract—Human activity recognition (HAR) based on inertial sensors has been investigated for many industrial informatics applications, such as healthcare and ubiquitous computing. Existing methods mainly rely on supervised learning schemes, which require large amounts of labeled training data. However, labeled data are sometimes difficult to acquire, while unlabeled data are readily available. Thus, we intend to make use of both labeled and unlabeled data with semisupervised learning for accurate HAR. In this paper, we propose a semisupervised deep learning approach, using temporal ensembling of deep long short-term memory, to recognize human activities with smartphone inertial sensors. With the deep neural network processing, features are extracted for local dependencies in the recurrent framework. Besides, with an ensemble approach based on both labeled and unlabeled data, we can combine the supervised and unsupervised losses, so as to make good use of unlabeled data that the supervised learning method cannot leverage. Experimental results indicate the effectiveness of our proposed semisupervised learning scheme, when compared to several state-of-the-art semisupervised learning approaches.

Index Terms—Deep long short-term memory (DLSTM), human activity recognition, semisupervised learning, temporal ensembling.

Manuscript received March 26, 2018; revised August 7, 2018; accepted December 11, 2018. Date of publication December 24, 2018; date of current version July 3, 2019. This work was supported by the A*STAR Industrial Internet of Things Research Program under the RIE2020 IAF-PP Grant A1788a0023. Paper no. TII-18-0752. (Corresponding author: Zhenghua Chen.)

Q. Zhu and Y. C. Soh are with the Department of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: Zhuq0004@e.ntu.edu.sg; eycsoh@ntu.edu.sg).

Z. Chen is with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore 138632 (e-mail: chen0832@e.ntu.edu.sg).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TII.2018.2889315

1551-3203 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

WITH the prevalence of wearable devices and smartphones, these devices can be readily used for human activity recognition (HAR), which is an interesting problem in understanding the behaviors of smart-device users. In some informatics applications, HAR is studied extensively according to a survey on body-worn sensors [1]. The sensors can be based on wireless sensor networks [2] or those embedded in mobile devices [3]. One of the promising applications based on HAR is in healthcare systems. For example, the healthcare Internet of Things platform [4] is used to understand the status of a person so as to provide more healthcare suggestions. Another potential application is fall detection based on a triaxial accelerometer [5].

Besides the applications related to healthcare, another useful application based on human activity recognition is that it can support many solutions for smart home applications in terms of ubiquitous computing, e.g., gesture recognition [6] and activity recognition [7].

Many sensors can be utilized for HAR. Vision-based HAR has been well studied in the literature, such as human pose estimation in [8] and group activity recognition in [9]. Yang et al. have proposed a one-shot learning method for the recognition of human actions, gesture, and expression [10]. Kirstein et al. have proposed a vector quantization-based method for life-long interactive learning for multiple visual classification [11]. The camera indeed can capture more information than other sensors in terms of understanding human behaviors. However, it suffers under some conditions such as low illumination and limited resolution. Thus, other sensor modalities should be used to enhance or enrich the understanding of human activity recognition. Lim et al. have proposed a two-step incremental learning method to recognize the normal and exceptional human behaviors based on four types of ubiquitous sensors [12]. Inertial sensor-based physical activity recognition [13] is useful due to its simple modality. The inertial measurement units of accelerometer and gyroscope, and sometimes magnetometer, are embedded in smartphones.

With the built-in inertial sensors of the smart devices, the motion data of users can be easily obtained by using some applications or software. However, users are normally not willing to annotate their activities each time, leading to a massive amount of unlabeled data with only a small portion of labeled data. How to leverage the unlabeled data in the recognition system is, therefore, a meaningful problem. Semisupervised learning is a suitable way to solve this problem. Indeed, typical semisupervised learning methods have been studied for human activity recognition [14], and weakly supervised learning combined with multiinstance learning has been investigated as well [15]. Bhattacharya et al. applied the sparse coding method to learn from unlabeled data in human transportation mode recognition [16]. However, these methods are based on simple low-level feature extraction, which sometimes cannot guarantee a satisfactory performance.

In this paper, we propose a deep long short-term memory (DLSTM) method with temporal ensembling for HAR with smartphone inertial sensors. The approach can be used in a semisupervised way. The traditional feature extraction method
in human activity recognition is to calculate some fundamental statistics of raw data. Based on these low-level features, we intend to automatically extract high-level features with the novel deep neural networks. The recurrent neural networks have some advantages in processing time-series data. In the deep architecture, a DLSTM is used since it can deal with long range dependencies of the temporal sequences [17]. In order to make good use of unlabeled data, we attempt to leverage the past predictions as ensembles, which are the outputs of the deep neural networks with some random operations inside to increase the generalization based on the previous epochs. This has been shown to be effective in the work using convolutional neural networks for image classification [18]. The updating process, which relies on past knowledge, is inspired by the idea of bootstrap aggregating (bagging). The difference between the bagging ensemble methods and temporal ensembling is that the former produces repetitions from the training sets and chooses the subsets for training to decrease the variance, while the latter produces extra inputs with the help of data augmentation and the dropout technique, which embed randomness inside the network. The unsupervised learning loss can be measured due to the stochastic property and is combined with the normal supervised learning loss for back propagation through time (BPTT) to determine suitable weights of the deep architecture.

The contributions of this paper are as follows.

1) We propose a novel semisupervised deep learning method, i.e., temporal ensembling of DLSTM, for human activity recognition to improve the performance of recognition by leveraging massive and cheap unlabeled data.

2) The DLSTM network is able to model sequential data and learn high-level representations. The temporal ensembling scheme can utilize both labeled and unlabeled data to improve the performance of DLSTM for HAR. By combining these two advantages into one scheme, we have proposed a deep learning method that can efficiently handle the unlabeled data.

3) We evaluate our proposed approach on a real experimental dataset and analyze the impacts of some important empirical parameters. The experimental results have demonstrated the effectiveness of the proposed method by comparing to the benchmark and state-of-the-art approaches.

In the following sections, we first give a literature review on human activity recognition based on inertial sensors in Section II. In Section III, we formulate the problem and describe our proposed approach in detail. In Section IV, experimental results are reported based on real experimental data with discussion and analysis. Last but not least, conclusions are drawn and future works are discussed in Section V.

II. RELATED WORKS

Many approaches have been developed for human activity recognition based on inertial sensors. In this section, we review the latest development in this area.

A. Semisupervised Learning

Semisupervised learning has been studied quite extensively for inertial sensor-based activity recognition. Stikic et al. investigated the weakly supervised learning in human activity recognition [14]. They first proposed to use traditional semisupervised learning methods, i.e., cotraining [19] and self-training [20]. Cotraining is a method that uses separate classifiers trained on multiple views of the samples, and it is an extension of self-training that treats the samples based on the confidence from previous steps. They then proposed a label propagation approach to handle labeled and unlabeled data using a multigraph approach [21]. They further summarized the semisupervised learning with multiinstance learning into a weakly supervised learning framework [14]. Multiinstance learning deals with a bag of instances, where each bag has only one annotation. The methods proposed by Stikic et al. are based on time sampling [14], [21], while some sophisticated activities need empirical estimation of the sampling. As these methods do not consider temporal analysis, Guan et al. proposed an autoregressive hidden Markov model to solve the multiinstance learning in acceleration-based activity recognition [15]. However, although these methods can make good use of unlabeled data, they basically rely on the low-level features. To improve performance, high-level features should be learned by using some feature learning methods.

B. Deep Learning for Feature Representations

Feature learning has also been studied with many unsupervised methods by learning some features from both labeled and unlabeled data. For example, a sparse coding framework has been applied in the human activity recognition problem, with good feature representations especially in an unsupervised way [16]. But this sparse method requires a time-consuming optimization process. Some researchers have applied deep learning methods in the human activity recognition problem since it has achieved some breakthroughs in other pattern recognition areas such as computer vision [22] and speech recognition [23]. For example, Zeng et al. proposed convolutional neural networks (CNN) on acceleration data to capture the scale invariant and local dependency patterns [24]. It seems that CNN is suitable for two-dimensional (2-D) images but may not fit well for the temporal signals. Besides the CNN, some other works have focused on recurrent neural networks due to their suitability for processing time-series data. The long short-term memory (LSTM) is applied in conjunction with CNN in a deep model for distinguishable features in terms of classification [25]. Another combination of deep convolutional and LSTM neural networks is also proposed to obtain high-level features on acceleration data [26]. Further, multicolumn bidirectional LSTM is applied to find better feature representations of motion data from mobile devices [3]. However, these methods are purely supervised learning methods, which cannot leverage the unlabeled data.

C. Combination of Deep Learning and Semisupervised Learning

Laine et al. proposed the CNN-based temporal ensembling for semisupervised learning in image classification [18]. The convolution operations, which capture the represented features in terms of scale invariant pattern and saliency of objects in images, are quite suitable for 2-D image signals. However, these
neural networks may not be suitable for the sequential data. This CNN-based method is inspired and investigated in terms of the loss function [27] and ladder network [28]. The ensemble framework in image classification relies on the regularization methods, i.e., stochastic transformation as data augmentation and perturbations as dropout. Sajjadi et al. [27] provided a formulation of the loss function in terms of unlabeled data. Given the weights of the mutual-exclusivity loss and transformation stability loss in unlabeled training samples, they can be integrated into the entire loss function, which means unlabeled data is leveraged to train the neural networks.

Our proposed approach is based on the framework of temporal ensembling, but it differs in how the features are extracted, since it targets a different type of signal. Since the DLSTM network can model sequential data and learn high-level features, it is a good candidate for HAR.

III. TEMPORAL ENSEMBLING OF DLSTM FOR SEMISUPERVISED LEARNING

The architecture of the proposed method is shown in Fig. 1. Given the raw data from the sensors, first, an augmentation technique is employed to increase the training dataset as well as embed some stochastic process therein. Then, simple statistics are extracted as the low-level features. The DLSTM networks then follow to learn the high-level features. Besides, the dropout approach is applied as a regularizer to increase the generalization of the deep architecture. On one hand, with some labels, supervised loss can be calculated. On the other hand, due to the randomness of the networks, unsupervised loss is calculated based on the comparison between unlabeled prediction and the ensembled predictions in the past epochs. The last layer is the softmax dense layer to provide classification results. Finally, the total loss is combined to determine the parameters of the deep neural networks using the back propagation method. We shall elaborate on the proposed algorithm and discuss each part of the flowchart in the following sections.

Fig. 1. Architecture of the proposed temporal ensembling of DLSTM for HAR.

A. Deep Learning for High-Level Features Extraction

The LSTM was first proposed by Hochreiter and Schmidhuber [29]. The recurrent structure is designed to capture the local dependency of the signals. Because of some gates and the memory cell state inside the LSTM, the information can be persistent for a while. The LSTM has been successfully applied in tough tasks such as speech signal processing [30] and hand-writing recognition [17].

We assume that the dataset consists of labeled and unlabeled data. We denote labeled data and their annotations as $\{x_l, y_l\}$, $l = 1, \ldots, L$, where $L$ is the number of labeled data. The labels are $y_l \in \{1, \ldots, C\}$, where $C$ is the number of classes in total. Meanwhile, unlabeled data are denoted as $\{x_u\}$, $u = 1, \ldots, U$, where $U$ is the number of unlabeled data.

The structure of DLSTM is shown in Fig. 2 and we formulate it as follows. We assume that the DLSTM has $K$ layers. The inputs are the labeled and unlabeled data; the input of the $k$th layer at time step $t$ is denoted as $h_t^{k-1}$, and the output of the $k$th layer of LSTM is denoted as $h_t^k$ at time step $t$. The inner structure of the LSTM is shown in Fig. 3. Three gates, i.e., input gate, forget gate, and output gate, and two memory cell states are parts of the LSTM to process the persistence memory of the past content, and we denote them as $i_t^k$, $f_t^k$, $o_t^k$, $\tilde{C}_t^k$, and $C_t^k$, respectively:

Fig. 2. DLSTM structure.

Fig. 3. Inner structure of the LSTM.

$$
\begin{aligned}
i_t^k &= \sigma\left(W_i^k \left[h_{t-1}^k, h_t^{k-1}\right] + b_i^k\right) \\
f_t^k &= \sigma\left(W_f^k \left[h_{t-1}^k, h_t^{k-1}\right] + b_f^k\right) \\
\tilde{C}_t^k &= \tanh\left(W_C^k \left[h_{t-1}^k, h_t^{k-1}\right] + b_C^k\right) \\
C_t^k &= f_t^k * C_{t-1}^k + i_t^k * \tilde{C}_t^k \\
o_t^k &= \sigma\left(W_o^k \left[h_{t-1}^k, h_t^{k-1}\right] + b_o^k\right) \\
h_t^k &= o_t^k * \tanh\left(C_t^k\right)
\end{aligned}
\tag{1}
$$
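As an illustration of (1), the following is a minimal pure-Python sketch of one LSTM time step and of stacking several layers. It is not the authors' implementation (which the paper builds on Lasagne); the function names `lstm_step`, `dlstm_forward`, and `init_params`, the list-based data layout, and the small uniform initialisation are all assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W and a vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One time step of a single LSTM layer, following (1).

    h_prev is h_{t-1}^k, c_prev is C_{t-1}^k, and x_t is the layer input
    h_t^{k-1} (the output of the layer below at the same time step).
    """
    v = h_prev + x_t  # list concatenation: [h_{t-1}^k, h_t^{k-1}]
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], v)]
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], v)]
    c_tilde = [math.tanh(a) for a in affine(params["WC"], params["bC"], v)]
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], v)]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def init_params(n_in, n_hidden, rng):
    """Hypothetical small random initialisation (not the scheme used in the paper)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + n_in)]
                for _ in range(n_hidden)]
    return {"Wi": mat(), "bi": [0.0] * n_hidden,
            "Wf": mat(), "bf": [0.0] * n_hidden,
            "WC": mat(), "bC": [0.0] * n_hidden,
            "Wo": mat(), "bo": [0.0] * n_hidden}


def dlstm_forward(layers, sequence, n_hidden):
    """Run a stack of LSTM layers over a sequence and return the top-layer outputs."""
    outputs = sequence
    for params in layers:
        h, c = [0.0] * n_hidden, [0.0] * n_hidden
        layer_out = []
        for x_t in outputs:
            h, c = lstm_step(params, h, c, x_t)
            layer_out.append(h)
        outputs = layer_out  # output of layer k becomes the input of layer k + 1
    return outputs


rng = random.Random(0)
two_layers = [init_params(3, 4, rng), init_params(4, 4, rng)]
top = dlstm_forward(two_layers, [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]], 4)
```

Here `top` holds one hidden vector per time step from the second layer. Every entry lies strictly between -1 and 1, because each is the product of an output gate in (0, 1) and a tanh value.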

In (1), $\sigma(\cdot)$ is the sigmoid function, $W_i^k$, $W_f^k$, $W_C^k$, and $W_o^k$ are the weights, and $b_i^k$, $b_f^k$, $b_C^k$, and $b_o^k$ are the biases. The sigmoid function is as follows:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{2}
$$

The current network output can be stacked as the input of the next layer. At the bottom of the stack, we can use a batch of samples as the input of the first layer, that is,

$$
h_t^0 = x_t. \tag{3}
$$

B. Generalization Based on Randomness

Three operations are embedded in the deep architecture so as to make good use of unlabeled data in terms of the unlabeled loss. They can be regarded as regularizers, bringing randomness inside the neural networks for evaluating the unsupervised loss.

First, the Gaussian noise layer is added into the input layer. By adding the Gaussian noise to the input randomly, the neural network gains some robustness to avoid overfitting. The change of the input is small, but the neural networks can still learn the desired output. Thanks to the perturbation, the networks can also be regularized, as stated in the previous work [31]. Besides, after introducing randomness, the output of the neural networks becomes a stochastic variable.

Second, data augmentation can also act as a regularizer in the view of randomness. Furthermore, it can enlarge the number of the training sets so as to improve the performance of the neural networks. The data augmentation of the time-series data is to use window slicing or window warping as discussed in the past work [32]. An overall input can be sliced into many frames. The window slicing process can also be overlapped. This is a typical way to increase the number of training samples in terms of time series data [33]. Besides, it can also use a random scaling ratio to rescale the magnitude of the signals.

Third, the dropout can be regarded as an ensemble as well [34]. The illustration of a normal dropout is shown in Fig. 4. Some neurons are neglected when randomly selected in a probabilistic way. If it is run many times, then the randomly dropped-out networks form an ensemble process. Although it is different from the subsets in the bagging ensemble method, it also produces many different outputs based on different neurons. For example, we add the dropout technique to increase the generalization of the neural networks with the probabilities denoted as $p$.

Fig. 4. Dropout illustrations [34]. (a) Original neural networks. (b) Dropout.

According to the study of Sajjadi et al. [27], the transformation and stability loss can also be used to evaluate the unlabeled loss. As a result, it is required to introduce the randomness into the deep architecture. The three aforementioned operations can produce different outputs. Besides, it is beneficial to add some noise to the inputs to enlarge the scale of the training numbers as they can enhance the generalization of deep neural networks.

C. Loss and Temporal Ensemble With BPTT

After the additional layer processing, the outputs of the networks become

$$
z_i = g(x_i; \theta) \tag{4}
$$

where $g(\cdot)$ is the neural network that tries to fit the training samples. All the weights and biases of all layers in the neural networks are termed as $\theta$:

$$
\theta = [W_i, b_i, W_f, b_f, W_C, b_C, W_o, b_o]. \tag{5}
$$

The DLSTM plays an important role in learning features. The temporal ensembling process is based on the learned features at each time step. After dropout, we denote the output of the neural networks as $z_i$ for each sample. For supervised learning loss, we can use the standard cross entropy to evaluate the difference between training labels and predicted ones. As a result, the supervised learning loss is defined as follows:

$$
\mathrm{loss}_l(\theta) = -\frac{1}{L} \sum_i \left( y_i \log(z_i) + (1 - y_i) \log(1 - z_i) \right). \tag{6}
$$

As for the unsupervised learning loss, which lacks the labels, we can compare the predicted outputs with the previous ensemble outputs. It is the stochastic operations in the deep neural networks that make this loss work. The square loss is one candidate to evaluate the difference between them. Thus,

$$
\mathrm{loss}_u(\theta) = \frac{1}{CU} \sum_i \left\| z_i - \tilde{z}_i \right\|^2 \tag{7}
$$

where $\tilde{z}_i$ is the temporal ensemble output from the previous time steps. It is the temporal accumulation of ensembles of predictions in the last several epochs. If we combine all the instances into a matrix form, then the temporal ensembling outcome is denoted as $\tilde{z}$. It can be updated based on the previous output from the past epochs with the following formula:

$$
\tilde{z} = \frac{Z_\tau}{1 - \alpha^\tau}. \tag{8}
$$
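The two loss terms in (6) and (7) can be sketched in a few lines of plain Python. This is an illustrative reading rather than the authors' code: it assumes that each label $y_i$ is a one-hot vector, that each $z_i$ is a vector of predicted class probabilities, and that the expression in (6) is applied element-wise. The function names and the `eps` guard against `log(0)` are additions for this example.

```python
import math


def supervised_loss(y_onehot, z, eps=1e-12):
    """Cross entropy of (6), averaged over the L labeled samples.

    Assumption: y_i is a one-hot vector and z_i a vector of predicted
    probabilities, with the expression in (6) applied element-wise.
    """
    total = 0.0
    for y_i, z_i in zip(y_onehot, z):
        for y, p in zip(y_i, z_i):
            p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
            total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_onehot)


def unsupervised_loss(z, z_tilde):
    """Squared difference of (7) between current and ensembled predictions.

    The normaliser is C * U, where C is the number of classes and U is the
    number of samples passed in.
    """
    n_samples, n_classes = len(z), len(z[0])
    total = sum((a - b) ** 2
                for z_i, t_i in zip(z, z_tilde)
                for a, b in zip(z_i, t_i))
    return total / (n_classes * n_samples)
```

Note that `unsupervised_loss` needs no labels at all: it only measures how far the current stochastic prediction has moved from the ensembled target, which is why it can be evaluated on unlabeled samples.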
TABLE I
LOW LEVEL FEATURES BASED ON SOME STATISTICS ON TIME AND FREQUENCY DOMAIN

In (8), $Z_\tau$ is the latent variable in matrix form at each epoch $\tau$, which is defined as follows:

$$
Z_\tau = \alpha Z_{\tau-1} + (1 - \alpha) z \tag{9}
$$

where $Z_\tau$ can be initialized as 0. In (8), $\alpha$ is the momentum factor, acting as a rate monitoring the previous temporal ensembles. The operation in (8) can also be regarded as a normalization process.

Combining the supervised and unsupervised learning components, we can determine the total loss function with a weight:

$$
\mathrm{loss}(\theta) = \mathrm{loss}_l(\theta) + w(\tau) * \mathrm{loss}_u(\theta). \tag{10}
$$
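The accumulation in (9), the normalization in (8), and the weighted sum in (10) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. In particular, `ramp_weight` is a hypothetical schedule: the paper only says that $w(\tau)$ should be small early in training and ramped up later, and the Gaussian-shaped rise, the `start` and `length` values, and the default `w_max` are choices made for this example.

```python
import math


def update_ensemble(Z_prev, z, alpha):
    """Accumulate predictions as in (9): Z_tau = alpha * Z_{tau-1} + (1 - alpha) * z."""
    return [[alpha * zp + (1.0 - alpha) * zc for zp, zc in zip(row_p, row_c)]
            for row_p, row_c in zip(Z_prev, z)]


def ensemble_target(Z, alpha, tau):
    """Normalise as in (8): z_tilde = Z_tau / (1 - alpha ** tau), for tau >= 1."""
    scale = 1.0 - alpha ** tau
    return [[v / scale for v in row] for row in Z]


def ramp_weight(tau, start=80, length=20, w_max=50.0):
    """Hypothetical schedule for w(tau): zero up to `start`, then a smooth rise to w_max."""
    if tau <= start:
        return 0.0
    t = min((tau - start) / length, 1.0)
    return w_max * math.exp(-5.0 * (1.0 - t) ** 2)


def total_loss(loss_l, loss_u, w):
    """Weighted sum of (10)."""
    return loss_l + w * loss_u


# Usage: with a constant prediction, the normalised target equals that prediction,
# which shows why the division in (8) corrects the zero initialisation of Z.
alpha = 0.6  # assumed momentum value for illustration only
Z = [[0.0, 0.0]]
z = [[0.2, 0.8]]
for tau in range(1, 4):
    Z = update_ensemble(Z, z, alpha)
    target = ensemble_target(Z, alpha, tau)
```

Without the division by $1 - \alpha^\tau$, the first few targets would be biased toward zero because $Z$ starts at 0; with it, `target` matches `z` at every epoch in this constant-prediction example.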
In (10), $w(\tau)$ is the weight parameter that depends on the training epoch. It should be small at the early stage of the deep neural network training and can be ramped up after the early training phase, say after 80 training epochs. The loss is minimized to find the suitable parameters $\theta$. This is called back propagation, as studied by Rumelhart et al. [19], with gradient descent to determine the weights in neural networks. But in recurrent neural networks, the case is a little different and a phenomenon called gradient vanishing appears. Hence, we choose the back propagation through time (BPTT). It can be used to determine the weights and biases in the recurrent neural networks. Adam [35] is employed in the optimization process with gradient clipping to tackle the gradient vanishing problem.

Finally, a softmax classifier is utilized based on the learned features to provide probabilistic interpretation, i.e.,

$$
p(Y = j \mid z_i, W, b) = \frac{\exp\left(W^j z_i + b^j\right)}{\sum_{c=1}^{C} \exp\left(W^c z_i + b^c\right)} \tag{11}
$$

where $W$ and $b$ denote the weight matrix and bias vector of the softmax layer, respectively, and $Y = j$ denotes the stochastic variable $Y$ belonging to the $j$th class. The model selects the prediction with the maximal probability:

$$
\hat{y}_i = \arg\max_j \; p(Y = j \mid z_i, W, b). \tag{12}
$$

IV. EXPERIMENTS AND DISCUSSIONS

We evaluate the proposed method using real experimental data and compare it with the state-of-the-art semisupervised methods. We first describe the dataset and the experimental setup in detail, especially some hyperparameter settings. Then, we discuss the effects of some parameters. Finally, we report and analyze the experimental results.

A. Dataset

The dataset based on smartphone inertial sensors was collected with 30 subjects with ages ranging from 19 to 48 years old, known as the UCI dataset in [33]. The smartphone used is Galaxy S II, and the device is attached to the waist of the subjects. The activities are sitting, laying, walking, walking upstairs, walking downstairs, and standing, which represent the most common daily activities. The built-in sensors include accelerometer and gyroscope with a 50 Hz sampling rate. The numbers of training samples and testing samples are approximately 7000 and 3000, respectively.

For each time-series instance, it can be sampled using a window framing with overlapping. Empirically, the fixed-width window can enrich the performance as this window can cover a full action period, e.g., walking cycle. Following the instructions of the work in [33], we choose the parameters of their settings. Specifically, the length of the sliding window is 2.56 s, and the overlapping rate is 50%.

Low-level features are some simple statistics. Basic statistics are calculated for each window instance based on the raw data to extract low-level features. The summary of the computed features is shown in Table I. At first, some statistics are calculated in the temporal aspect. For each window instance, 10 terms are utilized to represent the low-level features in the time domain. These features are maximum, minimum, mean value, standard deviation, median absolute deviation, signal magnitude area, interquartile range, energy, entropy, and autoregression coefficients. More details can be found in the dataset description file [33]. Besides the time domain, frequency domain features are also extracted via the Fourier transform. In the spectral aspect, five terms are calculated, and they are largest frequency component, skewness, kurtosis, energy of interval, and angle. Then, the extracted low-level features, which are 561-dimensional, are directly fed into the designed neural network to learn high-level features.

B. Experimental Setup

The state-of-the-art semisupervised learning methods are implemented for comparison. The toolbox is written in Python based on the scikit-learn library. Cotraining is a two-view method that processes different views for confident predictions, and the most confident one is used to deal with the training data in an iterative way [36]. We separate the instance feature into two views and use the random forest classifier with the number of trees in the forest set as 500. We call it cotraining-RF. Besides, other classifiers are also examined for the cotraining methods, and they are logistic regression (LR), K-nearest-neighbor (KNN), and support vector machine (SVM), which are called the co-training-LR, co-training-KNN, and co-training-SVM,
respectively. Self-training [36] uses the KNN classifier, with the number of neighbors $k = 5$. Label propagation (LP) [21] is used with a KNN kernel with $k = 6$ and the clamping factor $\alpha = 0.8$.

The deep architecture is implemented based on Lasagne [37] using Python. The other parameters follow the settings in [18]. Every layer applies the nonlinear leaky rectified linear unit [38] with a value of 0.1. The initialization follows the way in [39]. Furthermore, after the input layer, the standard deviation of the Gaussian noise layer is 0.2. Besides, the maximal learning rate of the Adam optimization is set to be 0.003 with the Adam momentum parameters of 0.9 and 0.999, respectively. The maximal weight of the unsupervised loss is set to be 50. The total number of epochs is 150 with an early stopping criterion. The computer used has dual core CPUs of Intel Xeon at 2.70 GHz and a GPU of NVIDIA K40.

C. Empirical Parameters

It is useful to explore the effects of some hyperparameters in the proposed architecture. We choose the most important parameters to observe the effect on the performance, i.e., the number of layers and the probability of the dropout.

We consider the number of layers from one to five. The performances with different numbers of layers are shown in Fig. 5. It can be found that the best performance is achieved when the number of layers is two; therefore, we select it as the default number of layers in all the experimental setups. In fact, as the number of layers increases, the neural networks will easily overfit and do not perform well on the testing set even though they perform well on the training set. Furthermore, the larger the number of layers, the longer the time required for the training process. As a result, the number of layers is set to be 2.

Fig. 5. Deep neural networks performances with different number of layers.

As for the probabilities of the dropout layers, we design an experiment to observe this effect. The results of different combinations of two dropout layers are shown in Table II. The probabilities of two layers are evaluated from low level to high level. Specifically, we denote $p = 0.2$ as low level, $p = 0.5$ as medium level, and $p = 0.8$ as high level, respectively. From the performance comparison under different probability levels in the dropout layers, the difference is not large. The best performance is 97.21% at the dropout probabilities of 0.2 and 0.2, respectively. The value of 0.2 is a typical value to use in training the neural network, while a value of 0.8 is large with too much noise. The performance of $p = (0.8, 0.8)$ is only 93.85%, which we should discard. As a result, the probabilities of the two dropout layers are $p = (0.2, 0.2)$.

TABLE II
DEEP NEURAL NETWORKS PERFORMANCES WITH DIFFERENT PROBABILITIES OF THE DROPOUT LAYER

Furthermore, the computing time is also analyzed. The testing time for around 3000 testing samples is 2.118 s. That is to say, the testing phase can be done in real time, with a time cost of about 0.7 ms per instance. Although the training time is a little longer than that of other semisupervised learning methods, it is still acceptable since the model is trained only once. It is known that it is rather time-consuming to determine the weights of deep neural networks. But it is also known that the testing process is just some multiplications and additions when using the trained weights. As a result, considering the classification performance, the deep neural networks are favorable methods.

D. Comparison With Other Semisupervised Learning Methods

The state-of-the-art semisupervised learning baseline methods that were implemented for comparison herein are cotraining-RF [36], self-training [36], and LP [21]. We also evaluated the proposed method of temporal ensembling of DLSTM (TEDLSTM). We have also compared other deep learning methods to evaluate the performance of our proposed method, e.g., DeepConvLSTM [26] and temporal ensemble of CNN (TECNN) [18]. As DeepConvLSTM is a purely supervised learning method, only the result with all samples as the training set is provided. Its performance on the UCI dataset with six classes is 92.10%. For LSTM networks, bidirectional LSTM (BLSTM) is also popular. We also presented a temporal ensemble of BLSTM (TEBLSTM) for semisupervised human activity recognition. The structure of the TEBLSTM is the same as the TEDLSTM, except for the bidirectional LSTM in each layer.

We have reported the performances of the semisupervised learning algorithms with the following percentages of labeled samples as the training set: 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. More details are illustrated in Table III. The results are also presented in Fig. 6. From these results, we can observe some phenomena as follows.

1) From Fig. 6, we can clearly see that the proposed TEBLSTM and TEDLSTM can achieve much better performance than the other semisupervised learning
TABLE III
COMPARISON BETWEEN TRADITIONAL SEMISUPERVISED LEARNING METHODS AND THE PROPOSED METHOD

Fig. 6. Comparison with other semisupervised learning methods on


UCI dataset. Fig. 7. Confusion Matrix with all samples for training for our proposed
TEDLSTM.

methods. For example, for the situation with only 10%


training samples, the proposed methods outperform the
other methods. As the percentage of training samples
increases, while all the semisupervised learning meth-
ods achieve better outcomes, but the proposed method is
still the best method in terms of accuracy among all the
semisupervised learning approaches. Due to the similar
structures of TEDLSTM and TEBLSTM, their results are
comparable.
2) For the scenario with very little labeled data, i.e., with only
1% of the samples labeled for training, the advantage of
the proposed TEDLSTM is very apparent. It achieves
about 11%, 6%, and 6% higher accuracy than
the co-training-RF, self-training, and label propagation
approaches, respectively.
3) When all the labeled samples are fed into the proposed
method, the recognition accuracy is 97.21%. This performance
is also better than that of the other semisupervised methods.
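The labeled-percentage protocol used in this comparison can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the text does not say how the labeled subset is drawn, so the per-class (stratified) sampling, the `UNLABELED` marker, and the function name `mask_labels` are all hypothetical choices made here.

```python
import random
from collections import defaultdict

UNLABELED = -1  # hypothetical marker for samples whose label is withheld


def mask_labels(labels, labeled_fraction, seed=0):
    """Keep roughly `labeled_fraction` of the labels in each class and
    replace the remaining labels with UNLABELED (assumed stratified split)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = set()
    for idx in by_class.values():
        # Keep at least one labeled sample per class, never more than exist.
        n_keep = min(len(idx), max(1, round(labeled_fraction * len(idx))))
        keep.update(rng.sample(idx, n_keep))
    return [y if i in keep else UNLABELED for i, y in enumerate(labels)]
```

Calling `mask_labels(labels, f)` for each `f` in 0.01, 0.10, 0.20, ..., 1.00 would give one training configuration per percentage in the list above; the withheld samples would then contribute only to the unsupervised part of a semisupervised loss.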
The confusion matrix of the proposed method with all the training samples is shown in Fig. 7. The confusion matrix of the proposed TEBLSTM is also presented in Fig. 8. From the comparison between predicted labels and true labels, we can find that the proposed method can easily distinguish the laying activity from the others. The second best performance is for the walking category, which is quite similar to walking upstairs and downstairs except for the vertical difference in the acceleration data. It is known that if one places the sensor at the waist, it is quite difficult to tell the difference between sitting and standing because the most significant physical movement depends on the knee. This is also found in [40], which indicates that the lower torso is the best on-body placement for human activity recognition when using inertial measurement units. Nevertheless, the proposed method can still achieve a satisfactory result for the recognition of sitting and standing.
Besides the traditional semisupervised learning methods, supervised learning methods are also compared to further validate the performance of the proposed methods. The comparative results are shown in Table IV. Based on these results, it is clear that the proposed semisupervised learning methods outperform the supervised learning methods in all situations. Especially when the labeled percentage is low, the proposed methods can improve the recognition results by at least 5%. With a lower labeled percentage, collecting labeled data is faster and easier if one would like to apply the method to human activity recognition problems in practice.

Fig. 8. Confusion Matrix with all samples for training for our proposed TEBLSTM.

TABLE IV
COMPARISON BETWEEN SUPERVISED LEARNING METHODS WITH LOW LABELED PERCENTAGES AND THE PROPOSED METHOD

Fig. 9. Recognition performance on the updated 12 classes UCI dataset.

Fig. 10. Statistical analysis of the experimental results.

E. Experiments With Larger Classes and Statistical Analysis
1) Experiments with larger classes: The UCI dataset has been updated with 12 classes, including the activity transition categories [33]. Herein, we use this dataset to evaluate and analyze the methods compared in Section D. The experimental results are shown in Fig. 9. DeepConvLSTM [26] has an accuracy of 91.55% with all the labeled samples as the training set. From the results, it is apparent that our proposed methods, i.e., TEDLSTM and TEBLSTM, outperform the other methods in most of the settings.
2) Statistical analysis: We have also conducted a t-test between the proposed method and the compared semisupervised methods and report the p-value. The statistical analysis results are presented in Fig. 10. Each method was run multiple times with different random seed settings when all the samples were used for training. The p-value for the hypothesis that the proposed method outperforms the other methods is $4.06 \times 10^{-8}$, indicating that this hypothesis is not rejected. The statistical analysis has also been performed to examine the effect of the number of layers for the first three settings, chosen for their satisfactory performance. The results are shown in Fig. 11. The p-value of two layers against three layers is $1.05 \times 10^{-4}$, indicating that two layers is a good choice.
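A comparison of repeated runs by t-test, as described in the statistical analysis, might look like the sketch below. The text does not state which variant of the t-test was used, so the choice of Welch's unequal-variance test, the one-sided alternative, and the helper names `welch_t`, `t_pdf`, and `upper_tail_p` are assumptions made here for illustration. The accuracy values in the demo are placeholders, not results from the paper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for mean(a) - mean(b). Assumes both samples have nonzero variance."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail_p(t, df, steps=20000):
    """One-sided p-value P(T >= t), via Simpson's rule on [0, |t|] and the
    symmetry of the t distribution. `steps` must be even. Very small
    p-values are limited by the accuracy of the numerical integration."""
    h = abs(t) / steps
    s = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = s * h / 3  # approximates P(0 <= T <= |t|)
    p = 0.5 - central if t >= 0 else 0.5 + central
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    # Placeholder accuracies for illustration only; NOT results from the paper.
    proposed = [0.970, 0.972, 0.968, 0.973, 0.971]
    baseline = [0.930, 0.925, 0.934, 0.928, 0.931]
    t, df = welch_t(proposed, baseline)
    print(f"t = {t:.2f}, df = {df:.1f}, one-sided p = {upper_tail_p(t, df):.2e}")
```

In this one-sided setup the null hypothesis is that the proposed method is no better than the baseline, so a small p-value counts as evidence in favor of the proposed method. A statistics library would normally be preferred over the hand-written integration, which is used here only to keep the sketch self-contained.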

Fig. 11. Statistical analysis of the experimental results with effect of the number of layers.

V. CONCLUSION
In this paper, we have proposed a semisupervised deep learning method for human activity recognition. The proposed method employs the DLSTM network to extract high-level features. Besides the feature extraction, temporal ensembling with some internal randomness is used to enhance the generalization of the neural networks. The outputs of the neural networks on unlabeled data are compared with the past ensembled predictions so as to calculate the unsupervised learning loss. A joint loss function of the supervised and unsupervised learning losses is utilized in the back propagation through time to determine the weights of the recurrent neural networks. After training with labeled and unlabeled data, the network can work quite well, especially when only a small portion of the training samples is annotated. Real experimental data were applied to evaluate the performance of the proposed approach. The experimental results showed that the proposed method outperforms the other state-of-the-art semisupervised learning methods with different percentages of labeled data for training. We also tested some empirical parameters of the proposed approach using actual data.
Future work could explore the problem of recognizing unseen classes. The recognition of unseen classes is indeed an interesting problem, but the current design cannot handle it. Zero-shot learning with attribute learning may be a possible solution to this problem, as studied by Cheng et al. [41]. The semisupervised learning problem would also be interesting if combined with multiinstance learning. In another future work, we intend to apply the proposed method in the pervasive healthcare field, which is an important topic [42], [43]. In this field, the recognition technique should be able to cope with lifelogging, pervasive, intelligent, and uncontrolled environments [44]. For an uncontrolled environment, when the environment changes, the previously collected labeled data may be useless for the new environment. With the proposed approach, we only need to collect a small portion of labeled data for the new environment. We can also attempt to combine the approach with transfer learning, which is widely used for cross-domain problems [45]. Owing to the development of pervasive and intelligent devices, the unlabeled data will be huge, which can be leveraged to improve the performance of our proposed semisupervised learning method. Therefore, the proposed method will promote the development of the field of pervasive healthcare.

REFERENCES
[1] A. Bulling, U. Blanke, and B. Schiele, “A tutorial on human activity recognition using body-worn inertial sensors,” ACM Comput. Surv., vol. 46, no. 3, 2014, Art. no. 3.
[2] D. Tao, L. Jin, Y. Wang, and X. Li, “Rank preserving discriminant analysis for human behavior recognition on wireless sensor networks,” IEEE Trans. Ind. Inform., vol. 10, no. 1, pp. 813–823, Feb. 2014.
[3] D. Tao, Y. Wen, and R. Hong, “Multi-column bi-directional long short-term memory for mobile devices-based human activity recognition,” IEEE Internet Things J., vol. 3, no. 6, pp. 1124–1134, Dec. 2016.
[4] G. Yang et al., “A health-iot platform based on the integration of intelligent packaging, unobtrusive bio-sensor, and intelligent medicine box,” IEEE Trans. Ind. Inform., vol. 10, no. 4, pp. 2180–2191, Nov. 2014.
[5] C. Wang et al., “Low-power fall detector using triaxial accelerometry and barometric pressure sensing,” IEEE Trans. Ind. Inform., vol. 12, no. 6, pp. 2302–2311, Dec. 2016.
[6] M. A. Simao, P. Neto, and O. Gibaru, “Unsupervised gesture segmentation by motion detection of a real-time data stream,” IEEE Trans. Ind. Inform., vol. 13, no. 2, pp. 473–481, Apr. 2017.
[7] S. A. Rokni and H. Ghasemzadeh, “Synchronous dynamic view learning: A framework for autonomous training of activity recognition models using wearable sensors,” in Proc. 16th ACM/IEEE Int. Conf. Inf. Process. Sensor Netw., 2017, pp. 79–90.
[8] X. Chu et al., “CRF-CNN: Modeling structured information in human pose estimation,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 316–324.
[9] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2016, pp. 1971–1980.
[10] Y. Yang, I. Saleemi, and M. Shah, “Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 7, pp. 1635–1648, Jul. 2013.
[11] S. Kirstein, H. Wersing, H.-M. Gross, and E. Körner, “A life-long learning vector quantization approach for interactive learning of multiple categories,” Neural Netw., vol. 28, pp. 90–105, 2012.
[12] G. H. Lim, “Two-step learning about normal and exceptional human behaviors incorporating patterns and knowledge,” in Proc. IEEE Int. Conf. Multisensor Fusion Integration Intell. Syst., 2016, pp. 162–167.
[13] R. C. Luo and C.-C. Chang, “Multisensor fusion and integration: A review on approaches and its applications in mechatronics,” IEEE Trans. Ind. Inform., vol. 8, no. 1, pp. 49–60, Feb. 2012.
[14] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, “Weakly supervised recognition of daily life activities with wearable sensors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2521–2537, Dec. 2011.
[15] X. Guan, R. Raich, and W.-K. Wong, “Efficient multi-instance learning for activity recognition from time series data using an auto-regressive hidden markov model,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 2330–2339.
[16] S. Bhattacharya, P. Nurmi, N. Hammerla, and T. Plötz, “Using unlabeled data in a sparse-coding framework for human activity recognition,” Pervasive Mobile Comput., vol. 15, pp. 242–262, 2014.
[17] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855–868, May 2009.
[18] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” 2016, arXiv:1610.02242.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” DTIC Document, Tech. Rep., 1985. [Online]. Available: https://apps.dtic.mil/docs/citations/ADA164453
[20] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning (Chapelle, O. et al., Eds.; 2006) [book reviews],” IEEE Trans. Neural Netw., vol. 20, no. 3, pp. 542–542, Mar. 2009.
[21] M. Stikic, D. Larlus, and B. Schiele, “Multi-graph based semi-supervised learning for activity recognition,” in Proc. Int. Symp. Wearable Comput., 2009, pp. 85–92.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[23] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[24] M. Zeng et al., “Convolutional neural networks for human activity recognition using mobile sensors,” in Proc. 6th Int. Conf. Mobile Comput., Appl. Services, 2014, pp. 197–205.
[25] N. Y. Hammerla, S. Halloran, and T. Ploetz, “Deep, convolutional, and recurrent models for human activity recognition using wearables,” 2016, arXiv:1604.08880.
[26] F. J. Ordóñez and D. Roggen, “Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,” Sensors, vol. 16, no. 1, pp. 115–139, 2016.
[27] M. Sajjadi, M. Javanmardi, and T. Tasdizen, “Regularization with stochastic transformations and perturbations for deep semi-supervised learning,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1163–1171.
[28] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Proc. Neural Inf. Process. Syst., 2015, pp. 3546–3554.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[30] A. Graves, S. Fernández, and J. Schmidhuber, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in Proc. Int. Conf. Artif. Neural Netw., 2005, pp. 753–753.
[31] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[32] A. Le Guennec, S. Malinowski, and R. Tavenard, “Data augmentation for time series classification using convolutional neural networks,” in Proc. ECML/PKDD Workshop Adv. Analytics Learn. Temporal Data, Sep. 2016, Riva Del Garda, Italy, [Online]. Available: https://halshs.archives-ouvertes.fr/halshs-01357973/document
[33] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A public domain dataset for human activity recognition using smartphones,” in Proc. Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., 2013, pp. 437–442.
[34] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[35] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[36] M. Stikic, K. Van Laerhoven, and B. Schiele, “Exploring semi-supervised and active learning for activity recognition,” in Proc. 12th IEEE Int. Symp. Wearable Comput., 2008, pp. 81–88.
[37] S. Dieleman et al., “Lasagne: First release,” Aug. 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.27878
[38] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. 30th Int. Conf. Mach. Learn., 2013, vol. 28, [Online]. Available: https://pdfs.semanticscholar.org/367f/2c63a6f6a10b3b64b8729d601e69337ee3cc.pdf
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 1026–1034.
[40] I. Cleland et al., “Optimal placement of accelerometers for the detection of everyday activities,” Sensors, vol. 13, no. 7, pp. 9183–9200, 2013.
[41] H.-T. Cheng, M. Griss, P. Davis, J. Li, and D. You, “Towards zero-shot learning for human activity recognition using semantic attribute sequence model,” in Proc. ACM Int. Joint Conf. Pervasive Ubiquitous Comput., 2013, pp. 355–358.
[42] J. Qi, P. Yang, G. Min, O. Amft, F. Dong, and L. Xu, “Advanced internet of things for personalised healthcare system: A survey,” Pervasive Mobile Comput., vol. 41, pp. 132–149, 2017.
[43] P. Yang et al., “Lifelogging data validation model for internet of things enabled personalized healthcare,” IEEE Trans. Syst. Man Cybern. Syst., vol. 48, no. 1, pp. 50–64, Jan. 2017.
[44] J. Qi, P. Yang, M. Hanneghan, and S. Tang, “Multiple density maps information fusion for effectively assessing intensity pattern of lifelogging physical activity,” Neurocomputing, vol. 220, 2016, pp. 199–209.
[45] S. A. Rokni and H. Ghasemzadeh, “Autonomous training of activity recognition algorithms in mobile sensors: A transfer learning approach in context-invariant views,” IEEE Trans. Mobile Comput., vol. 17, no. 8, pp. 1764–1777, Aug. 2018.

Qingchang Zhu received the B.Eng. degree in automation from Sun Yat-sen University, Guangdong, China, in 2013 and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2018.
He is currently a Scientist with Tencent, Shenzhen, China. His research interests include energy efficient buildings and machine learning, with particular applications to indoor localization and activity recognition.

Zhenghua Chen received the B.Eng. degree in mechatronics engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2011 and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2017.
He is currently a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His research interests include data analytics in smart buildings, ubiquitous computing, Internet of Things, machine learning, and deep learning.

Yeng Chai Soh received the B.Eng. (Hons. I) degree in electrical and electronic engineering from the University of Canterbury, Christchurch, New Zealand, and the Ph.D. degree in electrical engineering from the University of Newcastle, Callaghan, NSW, Australia, in 1983 and 1987, respectively.
He joined the Nanyang Technological University, Singapore, after his Ph.D. degree and is currently a Professor with the School of Electrical and Electronic Engineering. He has served as the Head of the Control and Instrumentation Division, the Associate Dean (Research and Graduate Studies) and the Associate Dean (Research) with the College of Engineering. He has authored and coauthored more than 260 refereed journal papers in these areas. His research interests include robust control and applications, robust estimation and filtering, optical signal processing, and energy efficient systems. His most recent research projects and activities are in sensor networks, sensor fusion, distributed control and optimization, and control and optimization of ACMV systems.
Dr. Soh was a panel member of several national grants and scholarships evaluation and awards committees.
