
2014 IEEE International Conference on Computer and Information Technology

Feature extraction and pattern recognition for human motion
by a deep sparse autoencoder

Hailong Liu
The Graduate School of Information Science and Engineering, Ritsumeikan University,
1-1-1 Noji Higashi, Kusatsu, Shiga 525-8577, Japan
Email: liu@em.ci.ritsumei.ac.jp

Tadahiro Taniguchi
The College of Information Science and Engineering, Ritsumeikan University,
1-1-1 Noji Higashi, Kusatsu, Shiga 525-8577, Japan
Email: taniguchi@em.ci.ritsumei.ac.jp

Abstract—Human motion data is high-dimensional time-series data, and it usually contains measurement error and noise. Recognizing human motion on the basis of such high-dimensional raw measurement data is often difficult, and high generalization performance cannot be expected. To increase generalization performance in a human motion pattern recognition task, we employ a deep sparse autoencoder to extract low-dimensional features, which can efficiently represent the characteristics of each motion, from the high-dimensional human motion data. After extracting low-dimensional features by using the deep sparse autoencoder, we employ random forests to classify the low-dimensional features representing human motion. In experiments, we compared using the raw data and three types of feature extraction methods - principal component analysis, a shallow sparse autoencoder, and a deep sparse autoencoder - for pattern recognition. The experimental results show that the deep sparse autoencoder outperformed the other methods with the highest averaged accuracy rate, 75.1%, and the lowest standard deviation, ±3.30%. The proposed method, application of a deep sparse autoencoder, thus enabled a higher accuracy rate, better generalization and more stability than could be achieved with the other methods.

Keywords-Human motion; Feature extraction; Pattern recognition; Deep sparse autoencoder; Deep learning

I. INTRODUCTION

We focus on the preprocessing of human motion data for pattern recognition in this paper. Measured raw human motion data is high-dimensional time-series data. Therefore, pattern recognition and the segmentation of human motion data are often difficult if we use the measured raw data directly.

Wang et al. proposed using a Gaussian process dynamical model (GPDM) to learn low-dimensional models of human poses and motion from high-dimensional motion capture data [1]. GPDM is based on Gaussian process latent variable models (GPLVM) [2], which build a Markov chain in latent space through a Gaussian process (GP). Hirose et al. proposed a computation method for bottom-up construction of a low-dimensional representation, which used kernel canonical correlation analysis (KCCA) to combine posture information and visual images [3]. However, [1], [2] and [3] used kernel methods. Such kernel methods usually require a large Gram matrix when the dataset is very large, which greatly increases the computation cost. A human motion recognition task usually comes with a large amount of high-dimensional measured motion capture data. Fox et al. proposed the beta process autoregressive hidden Markov model (BP-AR-HMM), which achieves good results in the unsupervised segmentation of visual motion capture data; the BP-AR-HMM has a beta process prior and Dirichlet transition probabilities in a hidden Markov model (HMM) [4]. Taniguchi et al. proposed a motion segmentation method for an imitation learning architecture for unsegmented human motion. They used singular value decomposition (SVD) to reduce the dimensionality of human motion and extract unit human motions embedded in a low-dimensional sub-space, and used a sticky hierarchical Dirichlet process HMM (sticky HDP-HMM) and an unsupervised chunking method, based on a Gibbs sampler and the minimal description length (MDL) principle, to segment the unsegmented human motion data [5]. Taniguchi et al. also proposed a concept called the Double Articulation Analyzer on the basis of the method in [5]; they replaced the MDL with a Nested Pitman-Yor language model (NPYLM), so that the increase in computational time with NPYLM is less than that with MDL [6]. The methods of [4], [5] and [6] are based on Bayesian models and use Gibbs sampling. The performance of Gibbs sampling, though, depends crucially on the distribution of the input data or features. If the input features are poorly extracted, the unsupervised learning outputs poor results. This problem is shared with supervised learning methods. We want extracted features which can efficiently represent high-dimensional human motion data.

Recently, feature extractors applying a deep learning approach, which uses neural networks having a deep structure with three or more layers, have attracted attention. The human brain has a multi-layer structure composed of neurons. Such a deep structure is believed to be the main reason why humans can realize the abilities of recognition, cogitation and memory. Deep learning focuses on the deep structure of neural networks, with the aim of realizing a machine which
978-1-4799-6239-6/14 $31.00 © 2014 IEEE   DOI 10.1109/CIT.2014.144
has cognitive capabilities similar to those of the human brain. Hinton et al. proposed deep belief nets (DBN), which are composed of multiple logistic belief neural networks and one restricted Boltzmann machine (RBM) [7]. The RBM is a perceptron proposed by Smolensky which has no connections between neurons in the same layer, and it can be used to extract features [8]. Salakhutdinov et al. designed a new deep structure called the deep Boltzmann machine (DBM), in which each layer is composed of an RBM [9]. Bengio developed the theory of deep learning, and showed that an architecture with insufficient depth can require many more computational elements, potentially exponentially more, than an architecture whose depth is matched to the task [10]. This theory was elucidated from the study of multi-layer electrical circuits by Håstad et al., who proved that there are monotone functions which can be computed in depth k but that require exponential size to be computed by a depth k − 1 monotone weighted threshold circuit [11]. In addition, [10] shows that training an autoencoder (AE) seems easier than training an RBM, and AEs have been used as building blocks to train deep networks, where each level is associated with an AE that can be trained separately. Taylor et al. used multiple conditional RBM models, which add layers like those in a DBN, to extract binary features of human motion [12]. Baccouche et al. used convolutional neural networks (CNN) and a recurrent neural network (RNN) to create a fully automated deep model for learning to classify human actions from images [13].

In this paper, we propose employing a deep learning model, called a deep sparse autoencoder (DSAE), to extract high-level low-dimensional features from high-dimensional human motion data, and employing random forests (RFs) for pattern recognition on the low-dimensional features of human motion. We show that using the DSAE as a data pretreatment process can improve the accuracy of pattern recognition for human motion information when using RFs.

Figure 1. Image of the deep sparse autoencoder

II. LOW-DIMENSIONAL FEATURE EXTRACTION BY DEEP SPARSE AUTOENCODER

The AE is a three-layer directed graph; i.e., it has a visible layer, a hidden layer and a reconstruction layer. The AE uses an activation function to encode data of the visible layer and generate data of the hidden layer. The AE uses the same activation function to decode data of the hidden layer and generate data of the reconstruction layer. If the reconstruction layer data is equal to the visible layer data, the hidden layer data can be regarded as a compressed feature of the visible layer data. To get sparser features, we introduce a sparse term on the AE hidden layer; the resulting model is called a sparse autoencoder (SAE). The DSAE is a learning architecture obtained by piling up many SAEs. In this paper, the SAEs individually accomplish their training one by one. The DSAE model is shown in Fig. 1.

First, we define the time-series dataset of human motion information as the matrix X ∈ R^(D_X × N_X). The vector of each time step t is

x_t = (x_{t,1}, x_{t,2}, ..., x_{t,D_X})^T,  (1)

where x_t ∈ R^(D_X), D_X is the dimensionality of x_t, and N_X is the number of time steps in the matrix X.

We use a hyperbolic tangent function tanh(·) as the activation function in each SAE. Since the range of tanh(·) is (−1, 1), we must normalize the dataset X from (−∞, +∞) to (−1, 1). The normalized data of each time step t is defined by n_t ∈ R^(D_X) in Eq. (2):

n_t = (n_{t,1}, n_{t,2}, ..., n_{t,D_X})^T,  (2)

and the d-th dimension of n_t is normalized by

n_{t,d} = 2 (x_{t,d} − X_min) / (X_max − X_min) − 1.  (3)

X_max and X_min are the maximum and minimum values over all dimensions of all data x in the matrix X.

We want to extract features with time-series attributes. When the matrix of the visible layer V in the first SAE is created, we use a windowing process to set a time window which includes w data. The t-th vector of windowed data v_t ∈ R^(D_V) is

v_t = (n_{t−w+1,1}, ..., n_{t−w+1,D_X}, ..., n_{t,1}, ..., n_{t,D_X})^T,  (4)

where t ≥ w. By moving v_t along the time axis one step at a time, we obtain the matrix of the visible layer V ∈ R^(D_V × N_V), which has D_V = w × D_X dimensions and N_V = N_X − w + 1 time steps.

The SAE has two process parts: encoding and decoding. The encoder function of the l-th SAE piled in the DSAE, shown in Eq. (5), generates the t-th vector of the hidden layer h_t^(l) ∈ R^(D_H^(l)) from the visible layer's t-th vector v_t^(l):

h_t^(l) = tanh(W_e^(l) v_t^(l) + b_e^(l)).  (5)
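The preprocessing pipeline of Eqs. (2)-(5) can be sketched as follows. This is an illustrative numpy sketch on synthetic data, not the paper's implementation: W_e and b_e here are random placeholders, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-180.0, 180.0, size=(64, 1000))   # D_X x N_X raw data (synthetic)

# Eq. (3): global linear normalization of X into (-1, 1)
Xmin, Xmax = X.min(), X.max()
N = 2.0 * (X - Xmin) / (Xmax - Xmin) - 1.0

# Eq. (4): stack w consecutive time steps into one visible-layer column
w = 6
DX, NX = N.shape
V = np.stack([N[:, t - w + 1:t + 1].T.reshape(-1)
              for t in range(w - 1, NX)], axis=1)  # D_V x N_V, D_V = w * D_X

# Eq. (5): encoder of the first SAE (random placeholder weights)
DH = 140
We = rng.normal(scale=0.05, size=(DH, w * DX))
be = np.zeros((DH, 1))
H = np.tanh(We @ V + be)                           # hidden features, D_H x N_V
```

Note that the window slides by one step, so the visible matrix has N_V = N_X − w + 1 columns, each of dimension w × D_X.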

We also use the decoder function of Eq. (6) to reconstruct the t-th vector of the reconstruction layer r_t^(l) ∈ R^(D_V) from the hidden layer:

r_t^(l) = tanh(W_d^(l) h_t^(l) + b_d^(l)).  (6)

In Eq. (5), W_e^(l) ∈ R^(D_H^(l) × D_V) is the weight matrix of the encoder, and b_e^(l) ∈ R^(D_H^(l)) is the bias vector of the encoder. D_H^(l) is the dimensionality of the hidden layer in the l-th SAE. In Eq. (6), W_d^(l) ∈ R^(D_V × D_H^(l)) and b_d^(l) ∈ R^(D_V) are the weight matrix and bias vector, respectively, of the decoder.

To ensure the reconstruction layer data is close to the visible layer data, we use the squared error to define the error function for all data, as shown by Eq. (7):

E^(l) = (1 / 2N_V) Σ_{t=1}^{N_V} ||r_t^(l) − v_t^(l)||_2^2 + (α / 2)(||W_e^(l)||_2^2 + ||W_d^(l)||_2^2) + β Σ_{i=1}^{D_H^(l)} KL(θ || h̄_i^(l)).  (7)

The general AE easily overfits to the given dataset when the elements of W_e^(l) and W_d^(l) become very large. We use ridge regression, which limits the elements of W_e^(l) and W_d^(l), with the L2 norm as a penalty term in Eq. (7). The strength of the penalty term is controlled by α. We also want the hidden layer to be sparse, because we want to obtain more distinct features. Usually, the L1 norm would be used as the sparse term, but the L1 norm cannot be differentiated. Therefore, we use Σ_{i=1}^{D_H^(l)} KL(θ || h̄_i^(l)) as the sparse term in Eq. (7); it is the Kullback-Leibler (KL) divergence between two Bernoulli random variables with means θ and h̄_i^(l) [14]. We let β control the strength of the sparse term. h̄_i^(l) becomes close to θ when the sparse term is minimized. The equation of the sparse term is

KL(θ || h̄_i^(l)) = θ log(θ / h̄_i^(l)) + (1 − θ) log((1 − θ) / (1 − h̄_i^(l))),  (8)

where θ ∈ R is the sparsity target, which needs to be specified, h̄_i^(l) is the average value of the i-th dimension, and h̄^(l) ∈ R^(D_H^(l)) is the average vector composed by

h̄_i^(l) = (1/2)(1 + (1 / N_V) Σ_{t=1}^{N_V} h_{t,i}^(l)),  (9)

where h_{t,i}^(l) is the i-th element of h_t^(l). Since the log(·) function appears in Eq. (8), its argument θ / h̄_i^(l) must lie within (0, +∞). However, the range of the activation function tanh(·) is (−1, 1), so we cannot calculate log(θ / h̄_i^(l)) directly. To resolve this problem, we use the linear mapping of Eq. (9), which maps h̄_i^(l) from (−1, 1) to (0, 1).

Finally, we use the back propagation (BP) method [15] to minimize the error for training the SAE. The BP method partially differentiates the error function of Eq. (7) with respect to the weight matrices W_e^(l), W_d^(l) and biases b_e^(l), b_d^(l). The partial differential equations of the decoder are

∂E^(l) / ∂W_d^(l) = (1 / N_V) Σ_{t=1}^{N_V} γ_t^(l) h_t^(l)T + α W_d^(l),  (10)

∂E^(l) / ∂b_d^(l) = (1 / N_V) Σ_{t=1}^{N_V} γ_t^(l),  (11)

where the vector γ_t^(l) ∈ R^(D_V) is

γ_t^(l) = (r_t^(l)T r_t^(l) − 1)(v_t^(l) − r_t^(l)).  (12)

The partial differential equations of the encoder are

∂E^(l) / ∂W_e^(l) = (1 / N_V) Σ_{t=1}^{N_V} ξ_t^(l) v_t^(l)T + α W_e^(l),  (13)

∂E^(l) / ∂b_e^(l) = (1 / N_V) Σ_{t=1}^{N_V} ξ_t^(l),  (14)

where the vector ξ_t^(l) ∈ R^(D_H^(l)) is

ξ_t^(l) = W_d^(l)T γ_t^(l) + β (1 − r_t^(l)T r_t^(l)) ε,  (15)

and the i-th element of the vector ε ∈ R^(D_H^(l)) is

ε_i = (1 − θ) / (1 − h̄_i^(l)) − θ / h̄_i^(l).  (16)

We want to minimize the SAE error, so we use the results of the partial differentiation to update the weight matrices W_e^(l), W_d^(l) and biases b_e^(l), b_d^(l). The updating equations are shown by Eqs. (17)-(20):

W_e^(l)+ ← W_e^(l) − λ_e ∂E^(l) / ∂W_e^(l),  (17)

b_e^(l)+ ← b_e^(l) − λ_e ∂E^(l) / ∂b_e^(l),  (18)

W_d^(l)+ ← W_d^(l) − λ_d ∂E^(l) / ∂W_d^(l),  (19)

b_d^(l)+ ← b_d^(l) − λ_d ∂E^(l) / ∂b_d^(l),  (20)

where (·)+ means the term has been updated. λ_e and λ_d are the respective learning rates of the encoder and decoder, which control the intensity of every update.

In the BP algorithm, how to set the learning rate is always a difficult problem. The value of the learning rate directly affects the speed and effectiveness of the training process. We therefore use one-dimensional searching to automatically find the optimal value of the learning rate, at which E^(l) is minimal in the differential direction, after calculating the differential. This method can automatically find the best learning rate for each update. For one-dimension searching,
we define the move step s. Then we initialize λ and s with small positive numbers. The best learning rate is λ* = λ + s. The update equation of s in one-dimensional searching is

s+ ← −0.5 s if E^(l)+ > E^(l); s+ ← s if E^(l)+ < E^(l).  (21)

We set the stop condition of one-dimensional searching as the time when the change in error |E^(l)+ − E^(l)| is lower than a certain small value. We also use the same criterion to stop updating the weight matrices and biases of the error equation (Eq. (7)).

At this point, we obtain an optimized l-th SAE. We then use the feature matrix H^(l) = (h_1^(l), ..., h_{N_V}^(l)) ∈ R^(D_H^(l) × N_V) of the l-th SAE as the visible layer V^(l+1) ∈ R^(D_H^(l) × N_V) to train the next, (l+1)-th, SAE; in other words, V^(l+1) = H^(l). In this way, the DSAE is piled up from multiple SAEs which accomplish their training one by one, and it extracts high-level low-dimensional features from high-dimensional human motion data.

Figure 2. The training process of random forests

III. PATTERN RECOGNITION BY RANDOM FORESTS

After extracting low-dimensional feature vectors from human motion data with the DSAE, we apply RFs to the pattern recognition task. RFs were proposed by Breiman [16]; they are a kind of supervised ensemble learning composed of a plurality of weak classifiers.

Figure 2 shows the training process of RFs. The theory of RFs is that the input data is insufficient to completely represent the actual state of the subjects. RFs use bootstrap sampling, which is a method of re-sampling, to generate T sub-datasets. The sum of these sub-datasets represents the actual state of the subjects better than the original dataset does. Un-sampled data are used as Out of Bag (OoB) data to assess the accuracy of classification. We use each sub-dataset to build one decision tree. We define each node of a decision tree to have J dimensions. For each node split, we randomly select L partial dimensions, where L and J are integers greater than 0, and L is substantially less than J. After selecting the L partial dimensions, we choose the one optimal dimension by which the data inside the node is most easily split, according to some evaluation function, e.g., entropy or the Gini coefficient. Afterwards, RFs use the optimal dimension to split the data inside the node. We used two stop conditions for splitting: the first is a maximum depth of the tree, and the second is that the classification error for the OoB data be less than a specified value.

We can train the T trees independently. For prediction, we also input the test data to each tree independently. We then integrate the classification results of the trees and obtain the final result. The evaluation of random forests on the test set is shown in Fig. 3.

Figure 3. Evaluation of random forests for the test set

IV. EXPERIMENTS

Our experiments evaluated two processes: motion feature extraction and pattern recognition of human motion. In the feature extraction process, we used a DSAE with four SAEs to extract three-dimensional features. To evaluate the DSAE performance, we also used principal component analysis (PCA), which is a linear feature-extraction method, and a single SAE, which has a shallow structure. We employed RFs to do the pattern recognition and calculated the accuracy. RFs were applied to the three-dimensional features extracted by PCA, SAE and DSAE, and to the 372-dimensional windowed data.

A. Dataset of human motion information

In this experiment, we used human motion datasets from the CMU Graphics Lab Motion Capture Database¹. The trial numbers of the data used were 13_29, 13_30 and 13_31. These three trials contained seven kinds of action, as shown in Fig. 4. The frame rate of the three trials was 120 fps, and each data time step was 64-dimensional, including root position and joint angles. To reduce the computational cost, we converted the frame rate to 30 fps. The number of time steps, actions and sequence of each converted trial are shown in Table I. Trial 13_29 had 1148 steps, including actions 1,

¹ CMU Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu

Table I
DATASET OF HUMAN MOTION INFORMATION (30 FPS)

Trial No. | Number of time steps | Actions and sequence
13_29     | 1148 steps           | Action 1 → Action 2 → Action 3 → Action 4 → Action 5 → Action 6
13_30     | 617 steps            | Action 1 → Action 2 → Action 5 → Action 6 → Action 7
13_31     | 755 steps            | Action 2 → Action 5 → Action 4 → Action 7 → Action 3 → Action 1

2, 3, 4, 5 and 6. Trial 13_30 had 617 steps, which included actions 1, 2, 5, 6 and 7. Trial 13_31 had 755 steps, including actions 2, 5, 4, 7, 3 and 1.

Afterwards, we divided the trials into three groups, with three trials in each group (two for training and one for testing), and used cross-validation to investigate the generalization of the proposed method. The training set of group 1 included trials 13_29 and 13_30, and trial 13_31 was the test set. In group 2, we used trials 13_29 and 13_31 as the training set, and let trial 13_30 be the test set. Trials 13_31 and 13_30 were the training set and trial 13_29 the test set of group 3.

B. Experiment 1: feature extraction

We employed three feature extraction methods: PCA, SAE and DSAE. First, we used a windowing process as a pretreatment of the human-motion data to get the features of the time-series data. We set the window size as w = 6; i.e., six time steps, or 0.2 s at 30 fps. Thus, the input data changed from 64 to 372 dimensions. For SAE and DSAE, we empirically set the parameters: the strength of the penalty term was α = 0.3, the strength of the sparse term was β = 0.7, and the sparsity target was θ = 0.5 in Eq. (7). The convergence condition was that the difference in the gradient between two loops of the BP method be less than 0.00001. PCA and SAE reduced the data from 372 dimensions to 3 dimensions directly. For DSAE, we extracted features from the 372-dimensional data in the order of 140 dimensions, 50 dimensions, 20 dimensions and 3 dimensions.

The experimental results of feature extraction for groups 1, 2 and 3 are shown in Figs. 5, 6 and 7, respectively. Figures 5a, 6a and 7a show the three-dimensional features extracted by PCA. Figures 5b, 6b and 7b show the three-dimensional features extracted by SAE. The three-dimensional features extracted by DSAE are shown in Figs. 5c, 6c and 7c. In each of these figures, the upper left graph represents the distribution of extracted features in three-dimensional space, and the other three graphs represent the distributions of extracted features projected onto the three surfaces of the three-dimensional space. Different colors of the points represent different action labels (the points were colored artificially).

Figure 4. Seven kinds of action used in our experiments: (a) Action 1, (b) Action 2, (c) Action 3, (d) Action 4, (e) Action 5, (f) Action 6, (g) Action 7
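The layer-by-layer construction described in Section II (train one SAE, then use its hidden layer as the next visible layer, V^(l+1) = H^(l)) pairs with the 372, 140, 50, 20, 3 dimension ladder used here. The following is a simplified numpy sketch on synthetic data, not the paper's implementation: each layer is trained by plain gradient descent on the squared reconstruction error only, omitting the ridge and sparse terms of Eq. (7) and the one-dimensional learning-rate search.

```python
import numpy as np

def train_sae(V, DH, lr=0.01, epochs=100, seed=0):
    """Train one autoencoder layer by batch gradient descent on the
    squared reconstruction error (sparse and ridge terms omitted)."""
    rng = np.random.default_rng(seed)
    DV, NV = V.shape
    We = rng.normal(scale=0.1, size=(DH, DV)); be = np.zeros((DH, 1))
    Wd = rng.normal(scale=0.1, size=(DV, DH)); bd = np.zeros((DV, 1))
    for _ in range(epochs):
        H = np.tanh(We @ V + be)             # encode, Eq. (5)
        R = np.tanh(Wd @ H + bd)             # decode, Eq. (6)
        G = (R - V) * (1.0 - R * R) / NV     # error through decoder tanh
        Xi = (Wd.T @ G) * (1.0 - H * H)      # error through encoder tanh
        Wd -= lr * (G @ H.T);  bd -= lr * G.sum(axis=1, keepdims=True)
        We -= lr * (Xi @ V.T); be -= lr * Xi.sum(axis=1, keepdims=True)
    return We, be

def dsae_features(V, ladder=(140, 50, 20, 3)):
    """Pile up SAEs: the hidden layer of layer l becomes the visible
    layer of layer l+1, i.e. V(l+1) = H(l)."""
    for DH in ladder:
        We, be = train_sae(V, DH)
        V = np.tanh(We @ V + be)
    return V

rng = np.random.default_rng(1)
F = dsae_features(rng.uniform(-1.0, 1.0, size=(372, 300)))  # 3 x 300 features
```

The greedy scheme means each layer is a self-contained optimization problem, which is what allows the SAEs to "accomplish their training one by one".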

Figure 5. Feature extraction for group 1 by PCA, SAE and DSAE with labels: (a) three-dimensional features by PCA, (b) three-dimensional features by SAE, (c) three-dimensional features by DSAE

Figure 6. Feature extraction for group 2 by PCA, SAE and DSAE with labels: (a) three-dimensional features by PCA, (b) three-dimensional features by SAE, (c) three-dimensional features by DSAE
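As a reference for the comparison in this experiment, the PCA baseline projects the centered 372-dimensional windowed data onto its first three principal components; a minimal sketch with synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(372, 900))            # stand-in windowed data, D_V x N_V

Vc = V - V.mean(axis=1, keepdims=True)     # center each dimension
# Principal directions are the left singular vectors of the centered data
U, S, _ = np.linalg.svd(Vc, full_matrices=False)
F = U[:, :3].T @ Vc                        # three-dimensional PCA features
```

Because the projection is linear, any linear offset between trials is preserved in the feature space, which is consistent with the scattered PCA distributions observed in Figs. 5a, 6a and 7a.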


Because of that, the data collected for each trial had some offsets for the same kind of action, such as the direction the human was facing and different motion trajectories. From the experimental results for groups 1, 2 and 3, we find that the features extracted by PCA for the same actions were distributed in different parts of the three-dimensional feature space. However, even though there were offsets between trials, the features extracted by SAE and DSAE for the same action were distributed within roughly the same area of the three-dimensional feature space. This shows that SAE and DSAE are good at feature extraction from nonlinear data, and they could correct some linear offsets. In addition, we believe that the distributions of features for the different actions extracted by SAE and DSAE could be well classified; i.e., the features for each action had almost no overlapping distribution in the three-dimensional feature space. But the distribution of features by DSAE was wider than the distribution of features by SAE. Thus, in pattern recognition, DSAE achieves better generalization than SAE for new data. We use the experimental results for pattern recognition to prove this conclusion in subsection IV-C.

C. Experiment 2: pattern recognition

We employed RFs for the pattern recognition and calculated the accuracy for the 372-dimensional windowed data and the three-dimensional features extracted by PCA, SAE and DSAE. We used the three groups of data for cross-validation. We wanted to find out how different the pattern recognition accuracy rate was between using feature extraction or not (windowed data vs. PCA, SAE and DSAE), between linear and nonlinear feature extraction methods (PCA vs. SAE and DSAE), and between shallow and deep nonlinear feature extraction methods (SAE vs. DSAE).

We used the RFs code from OpenCV². We set the maximum number of sub-datasets (trees) to T = 200. The stop-splitting conditions were a maximum depth of 50 for each tree and an OoB classification error below 0.01%. We also set the CvRTParams³ function to calculate variable importance and compute surrogate splits. In addition, for the 372-dimensional windowed data, the number of randomly selected partial dimensions was set to L = 10, and for the three-dimensional features extracted by PCA, SAE and DSAE, it was set to L = 1.

Based on the above experimental conditions, the results of pattern recognition are shown in Table II. To confirm whether the RFs had learned the training sets of the three groups well, after completing the training we input the training set data to the trained model and calculated the accuracy rates.

² OpenCV, http://opencv.org
³ CvRTParams is a function in OpenCV for setting the training parameters of random forests. http://docs.opencv.org/modules/ml/doc/random_trees.html#cvrtparams

Figure 7. Feature extraction for group 3 by PCA, SAE and DSAE with labels: (a) three-dimensional features by PCA, (b) three-dimensional features by SAE, (c) three-dimensional features by DSAE
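The bootstrap sampling and Out-of-Bag mechanism that these stop conditions rely on (Section III) can be illustrated numerically: each bootstrap sub-dataset leaves out, on average, a fraction of about exp(−1) ≈ 36.8% of the training samples as OoB data. A small numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 1000, 200                           # training samples, number of trees

oob_fractions = []
for _ in range(T):
    boot = rng.integers(0, n, size=n)      # bootstrap sample (with replacement)
    oob = n - np.unique(boot).size         # samples never drawn: Out of Bag
    oob_fractions.append(oob / n)

# (1 - 1/n)^n -> exp(-1), so roughly 36.8% of samples are OoB per tree
mean_oob = float(np.mean(oob_fractions))
```

This sizeable held-out portion per tree is what makes the OoB error a usable internal validation signal without a separate validation set.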

Table II
PATTERN RECOGNITION RESULTS BY RFS

Group   | Method         | Training set accuracy [%] | Test set accuracy [%]
Group 1 | Windowing data | 99.9                      | 44.0
        | PCA            | 100                       | 17.2
        | SAE            | 99.7                      | 69.3
        | DSAE           | 99.4                      | 74.4
Group 2 | Windowing data | 100                       | 50.8
        | PCA            | 99.4                      | 24.0
        | SAE            | 99.7                      | 81.0
        | DSAE           | 99.6                      | 80.1
Group 3 | Windowing data | 99.9                      | 63.5
        | PCA            | 99.6                      | 31.8
        | SAE            | 99.6                      | 69.6
        | DSAE           | 99.1                      | 70.9
Average | Windowing data | 99.9 ± 4.44×10⁻²          | 52.8 ± 7.16
        | PCA            | 99.7 ± 2.20×10⁻¹          | 24.3 ± 5.02
        | SAE            | 99.6 ± 4.55×10⁻²          | 73.3 ± 5.15
        | DSAE           | 99.4 ± 1.57×10⁻¹          | 75.1 ± 3.30

From Table II, we find that the accuracy rates for the three groups' training sets with the four methods all exceeded 99%. This shows that the RFs trained on the training sets with the four methods were excellent.

For the test sets of the three groups, DSAE achieved the highest averaged accuracy rate, 75.1%. PCA had the lowest averaged accuracy rate, 24.3%. This shows that the nonlinear feature extraction methods are superior to the linear feature extraction method for human motion data.

The accuracy rate of the test set when using PCA was lower than the accuracy rate of the test set for the windowed data. We think this is because PCA, being a linear feature extraction method, undermined the potentially non-linear distribution of the human motion data. From (a) in Figs. 5, 6 and 7, we can see that the distributions of the three-dimensional features by PCA were in different positions for some of the same actions, and some features of different actions were distributed in small parts of the feature space. Therefore, the lowest accuracy rate was obtained from PCA.

We also compared the test-set accuracies of the windowed data, SAE and DSAE. The accuracy provided by the windowed data was clearly lower than that when using SAE or DSAE. This illustrates the effectiveness of feature extraction methods using an AE. The averaged accuracy rate when using SAE was 73.3%, and that when using DSAE was 75.1%. This shows that a feature extraction method with a deep structure works better than one with a shallow structure. Because SAE extracted three-dimensional features from the 372-dimensional data directly, the change in dimensions was too great, and SAE would lose some useful information through the dimension compression. Since DSAE has a deep structure, the change of dimensions between each layer was smaller than with SAE. This kind of ladder-type feature extraction method can reduce the information loss in dimension compression.

Finally, we look at the standard deviations of the accuracy rates for the test sets. The highest standard deviation, ±7.16%, was for the windowed data. This high value arose because the RFs had to randomly select partial dimensions from the 372-dimensional windowed data before splitting a node; due to the random selection, the chances of choosing partial dimensions that were easily split and ones that were not were both great. On the other hand, the smallest standard deviation for the test set, ±3.30%, was obtained when using DSAE. This shows that DSAE applied to new data has good generalization and excellent stability.

V. CONCLUSION AND FUTURE WORK

In this paper, we have focused on feature extraction methods for human-motion recognition. We used the DSAE as a feature extraction method, which has a deep structure and can extract features layer by layer. We used RFs to perform pattern recognition of human motion on the basis of the extracted features. We also used the 372-dimensional windowed data, three-dimensional features extracted by PCA, and three-dimensional features extracted by SAE for recognition by RFs, to compare their recognition accuracies with that of the three-dimensional features extracted by DSAE. In the experiments, the highest averaged accuracy rate, 75.1%, was obtained using DSAE. The smallest standard deviation, ±3.30%, was also obtained using DSAE. The proposed method thus has a higher accuracy rate, better generalization and more stability than the other methods.

In our future work, we will focus on improving the accuracy rate of the proposed method; e.g., by changing the data normalization method, using whitening, and redesigning the model to obtain a new deep structure.

REFERENCES

[1] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 2, pp. 283–298, 2008.

[2] N. D. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data," Advances in neural information processing systems, vol. 16, pp. 329–336, 2004.

[3] T. Hirose and T. Taniguchi, "Abstraction multimodal low-dimensional representation from high-dimensional posture information and visual images," Journal of Robotics and Mechatronics, vol. 25, no. 1, pp. 80–88, 2013.

[4] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "Sharing features among dynamical systems with beta processes," in Advances in Neural Information Processing Systems, 2009, pp. 549–557.

[5] T. Taniguchi, K. Hamahata, and N. Iwahashi, "Unsupervised segmentation of human motion data using sticky HDP-HMM and MDL-based chunking method for imitation learning," Advanced Robotics, vol. 25, no. 17, pp. 2143–2172, 2011.

[6] T. Taniguchi and S. Nagasaka, "Double articulation analyzer for unsegmented human motion using Pitman-Yor language model and infinite hidden Markov model," in System Integration (SII), 2011 IEEE/SICE International Symposium on. IEEE, 2011, pp. 250–255.

[7] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[8] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group, Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 194–281.

[9] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann machines," in International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.

[10] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[11] J. Håstad and M. Goldmann, "On the power of small-depth threshold circuits," Computational Complexity, vol. 1, no. 2, pp. 113–129, 1991.

[12] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," Advances in neural information processing systems, vol. 19, p. 1345, 2007.

[13] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in Human Behavior Understanding. Springer, 2011, pp. 29–39.

[14] A. Ng, "Sparse autoencoder," CS294A Lecture notes, 2011.

[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[16] L. Breiman, "Random forests," Machine learning, vol. 45, no. 1, pp. 5–32, 2001.

APPENDIX

We compared the performance of RFs with that of support vector machines (SVMs). We employed LIBSVM⁴ for the SVMs. The experimental conditions of the SVMs were the same as those of the RFs (please refer to subsection IV-C). We used the radial basis kernel function k(h, h′) = exp(−γ ||h − h′||_2^2) for the SVMs, and set its parameter γ to 10.

⁴ LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/

The results obtained by using the SVMs are shown in Table III. From Table III, we find that the highest averaged accuracy rate for the training set was 98.1 ± 3.34, with the windowing data, and the lowest one was 58.2 ± 9.95, with PCA. On the other hand, for the test set, the highest averaged accuracy rate was 74.8 ± 11.3, with SAE, and the lowest one was 20.0 ± 6.08, which was also with the windowing data. For the windowing data, the averaged accuracy rate of the training set was the highest, but that of the test set was the lowest. Those results show that the SVM over-fitted to the training set, so that its generalization performance was poor. For PCA, the averaged accuracy rates of the training set and test set were both lower than those of SAE and DSAE. This once again shows that the nonlinear feature extraction methods were superior to the linear feature extraction method for human motion data. SAE and DSAE both performed well. However, the averaged accuracy rates of SAE (training set: 96.7 ± 1.11, test set: 74.8 ± 11.3) were a little higher than those of DSAE (training set: 96.0 ± 1.24, test set: 72.8 ± 11.0). We notice that their standard deviations for the test set were large (about ±11.0). Although the averaged accuracy rate of DSAE was lower than that of SAE, we could not conclude that SAE was superior to DSAE.

We also compared the accuracy rates of the RFs (Table II) and the SVMs (Table III). We find that the standard deviations of the SVMs were much larger than those of the RFs. This shows that the generalization performance of the SVMs was less stable than that of the RFs for this task. Because RFs produce many sub-datasets through re-sampling from the input dataset, they can greatly increase the expression of the dataset for the true states. The RFs could thus exhibit good generalization performance in the recognition task on the test set.

Table III
PATTERN RECOGNITION RESULTS BY SVM

Group   | Method         | Training set accuracy [%] | Test set accuracy [%]
Group 1 | Windowing data | 100                       | 19.5
        | PCA            | 67.4                      | 34.1
        | SAE            | 96.8                      | 72.8
        | DSAE           | 97.1                      | 73.2
Group 2 | Windowing data | 100                       | 26.3
        | PCA            | 47.6                      | 21.7
        | SAE            | 95.5                      | 86.9
        | DSAE           | 96.2                      | 83.7
Group 3 | Windowing data | 94.2                      | 14.2
        | PCA            | 59.5                      | 16.2
        | SAE            | 97.7                      | 64.6
        | DSAE           | 94.7                      | 61.7
Average | Windowing data | 98.1 ± 3.34               | 20.0 ± 6.08
        | PCA            | 58.2 ± 9.95               | 24.0 ± 9.19
        | SAE            | 96.7 ± 1.11               | 74.8 ± 11.3
        | DSAE           | 96.0 ± 1.24               | 72.8 ± 11.0
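The radial basis kernel k(h, h′) = exp(−γ ||h − h′||_2^2) used for the SVMs in this appendix is simple to compute directly; a minimal sketch with γ = 10 and two hypothetical 3-D feature vectors, checking that k(h, h) = 1 and that the kernel is symmetric:

```python
import numpy as np

def rbf_kernel(h1, h2, gamma=10.0):
    # k(h, h') = exp(-gamma * ||h - h'||_2^2)
    return float(np.exp(-gamma * np.sum((h1 - h2) ** 2)))

a = np.array([0.1, -0.2, 0.3])   # hypothetical 3-D feature vectors
b = np.array([0.2, 0.0, -0.1])
k_aa = rbf_kernel(a, a)          # equals 1.0, since ||a - a|| = 0
k_ab = rbf_kernel(a, b)          # in (0, 1) for distinct inputs
```

With a large γ such as 10, the kernel decays quickly with distance, so well-separated three-dimensional feature clusters remain nearly orthogonal under this similarity measure.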

