E14 - Feature Extraction and Pattern Recognition For Human Motion
Abstract—Human motion data is high-dimensional time-series data, and it usually contains measurement error and noise. Recognizing human motion on the basis of such high-dimensional raw measurement data is often difficult, and high generalization performance cannot be expected. To increase generalization performance in a human motion pattern recognition task, we employ a deep sparse autoencoder to extract low-dimensional features, which can efficiently represent the characteristics of each motion, from the high-dimensional human motion data. After extracting low-dimensional features by using the deep sparse autoencoder, we employ random forests to classify the low-dimensional features representing human motion. In experiments, we compared using the raw data and three types of feature extraction methods - principal component analysis, a shallow sparse autoencoder, and a deep sparse autoencoder - for pattern recognition. The experimental results show that the deep sparse autoencoder outperformed the other methods with the highest averaged accuracy rate, 75.1%, and the lowest standard deviation, ±3.30%. The proposed method, application of a deep sparse autoencoder, thus enabled a higher accuracy rate, better generalization and more stability than could be achieved with the other methods.

Keywords-Human motion; Feature extraction; Pattern recognition; Deep sparse autoencoder; Deep learning

I. INTRODUCTION

We focus in this paper on the preprocessing of human motion data for pattern recognition. Measured raw human motion data is high-dimensional time-series data. Therefore, pattern recognition and the segmentation of human motion data are often difficult if we use the measured raw data directly.

Wang et al. proposed using a Gaussian process dynamical model (GPDM) to learn low-dimensional models of human poses and motion from high-dimensional motion capture data [1]. GPDM is based on Gaussian process latent variable models (GPLVM) [2], which build a Markov chain in latent space through a Gaussian process (GP). Hirose et al. proposed a computation method for bottom-up construction of a low-dimensional representation, which used kernel canonical correlation analysis (KCCA) to combine posture information and visual images [3]. However, [1], [2] and [3] used kernel methods. Such kernel methods usually require a large Gram matrix when the dataset is very large, which greatly increases the computation cost, and a human motion recognition task usually comes with a large amount of high-dimensional measured motion capture data. Fox et al. proposed the beta-process autoregressive hidden Markov model (BP-AR-HMM), which enables good results in the unsupervised segmentation of visual motion capture data; BP-AR-HMM has a beta process prior and Dirichlet transition probabilities in a hidden Markov model (HMM) [4]. Taniguchi et al. proposed a motion segmentation method for an imitation learning architecture for unsegmented human motion. They used singular value decomposition (SVD) to reduce the dimensionality of human motion and extract a unit human motion embedded in a low-dimensional sub-space, and used a sticky hierarchical Dirichlet process HMM (sticky HDP-HMM) together with an unsupervised chunking method, based on a Gibbs sampler and the minimal description length (MDL) principle, to segment unsegmented human motion data [5]. Taniguchi et al. also proposed a concept called the Double Articulation Analyzer on the basis of the method previously proposed in [5]; they replaced the MDL with a Nested Pitman-Yor language model (NPYLM), so that the increase in computational time with NPYLM is less than that with MDL [6]. The methods of [4], [5] and [6] are based on Bayesian models and use Gibbs sampling. The performance of Gibbs sampling, though, depends crucially on the distribution of the input data or features: if the input features are poorly extracted, the unsupervised learning outputs poor results. This problem is shared with supervised learning methods. We therefore want extracted features that can efficiently represent high-dimensional human motion data.

Recently, feature extractors applying a deep learning approach, which uses neural networks having a deep structure with three or more layers, have attracted attention. The human brain has a multi-layer structure composed of neurons, and such a deep structure is believed to be the main reason why humans can realize the abilities of recognition, cogitation and memory. Deep learning focuses on the deep structure of neural networks, with the aim of realizing a machine which
We also use the decode function, Eq. (6), to reconstruct the t-th vector of the reconstruction layer $r_t^{(l)} \in \mathbb{R}^{D_V^{(l)}}$ from the hidden layer:

$$r_t^{(l)} = \tanh\left(W_d^{(l)} h_t^{(l)} + b_d^{(l)}\right). \quad (6)$$

In Eq. (5), $W_e^{(l)} \in \mathbb{R}^{D_H^{(l)} \times D_V^{(l)}}$ is the weight matrix of the encoder, and $b_e^{(l)} \in \mathbb{R}^{D_H^{(l)}}$ is the bias vector of the encoder. $D_H^{(l)}$ is the dimensionality of the hidden layer in the l-th SAE. In Eq. (6), $W_d^{(l)} \in \mathbb{R}^{D_V^{(l)} \times D_H^{(l)}}$ and $b_d^{(l)} \in \mathbb{R}^{D_V^{(l)}}$ are the weight matrix and bias vector, respectively, of the decoder.

To ensure the reconstruction-layer data is close to the visual-layer data, we use the squared error to set the error function for all data, as shown by Eq. (7):

$$E^{(l)} = \frac{1}{2N_V} \sum_{t=1}^{N_V} \left\| r_t^{(l)} - v_t^{(l)} \right\|_2^2 + \frac{\alpha}{2}\left( \left\| W_e^{(l)} \right\|_2^2 + \left\| W_d^{(l)} \right\|_2^2 \right) + \beta \sum_{i=1}^{D_H^{(l)}} \mathrm{KL}\left(\theta \,\middle\|\, \bar{h}_i^{(l)}\right). \quad (7)$$

The general AE easily overfits to the given dataset when the elements of $W_e^{(l)}$ and $W_d^{(l)}$ become very large. We use ridge regression, which limits the elements of $W_e^{(l)}$ and $W_d^{(l)}$, with the L2 norm as a penalty term in Eq. (7). The strength of this penalty is controlled by the coefficient α.

To minimize the error function Eq. (7), we take its partial derivatives with respect to the weight matrices $W_e^{(l)}$, $W_d^{(l)}$ and the biases $b_e^{(l)}$, $b_d^{(l)}$. The partial differential equations of the decoder are

$$\frac{\partial E^{(l)}}{\partial W_d^{(l)}} = \frac{1}{N_V} \sum_{t=1}^{N_V} \gamma_t^{(l)} h_t^{(l)\top} + \alpha W_d^{(l)}, \quad (10)$$

$$\frac{\partial E^{(l)}}{\partial b_d^{(l)}} = \frac{1}{N_V} \sum_{t=1}^{N_V} \gamma_t^{(l)}, \quad (11)$$

where the vector $\gamma_t^{(l)} \in \mathbb{R}^{D_V^{(l)}}$ is

$$\gamma_t^{(l)} = \left(r_t^{(l)\top} r_t^{(l)} - 1\right)\left(v_t^{(l)} - r_t^{(l)}\right). \quad (12)$$

The partial differential equations of the encoder are

$$\frac{\partial E^{(l)}}{\partial W_e^{(l)}} = \frac{1}{N_V} \sum_{t=1}^{N_V} \xi_t^{(l)} v_t^{(l)\top} + \alpha W_e^{(l)}, \quad (13)$$

$$\frac{\partial E^{(l)}}{\partial b_e^{(l)}} = \frac{1}{N_V} \sum_{t=1}^{N_V} \xi_t^{(l)}, \quad (14)$$

where the vector $\xi_t^{(l)} \in \mathbb{R}^{D_H^{(l)}}$ is

$$\xi_t^{(l)} = W_d^{(l)\top} \gamma_t^{(l)} + \beta \left(1 - r_t^{(l)\top} r_t^{(l)}\right) \varepsilon^{(l)}, \quad (15)$$

and the i-th element of the vector $\varepsilon^{(l)} \in \mathbb{R}^{D_H^{(l)}}$ is
we define the move step s. Then we initialize λ and s with small positive numbers. The best learning rate is obtained when λ* = λ + s. The update equation of s in the one-dimensional search is

$$s \leftarrow \begin{cases} -0.5\,s & E^{(l)+} > E^{(l)} \\ s & E^{(l)+} < E^{(l)} \end{cases} \quad (21)$$

where $E^{(l)+}$ denotes the error evaluated at λ + s.
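A minimal sketch of this one-dimensional search, under the assumption that Eq. (21) is applied whenever the proposed move λ + s fails to reduce the error (the function and variable names are ours, not the paper's):

```python
def update_step(s, E_new, E_old):
    """Eq. (21): reverse and halve the step if the error grew,
    otherwise keep it unchanged."""
    return -0.5 * s if E_new > E_old else s

def line_search(f, lam, s, iters=100):
    """Hypothetical driver: accept lam + s when it lowers f, and
    update the step s by Eq. (21) after each rejected proposal."""
    for _ in range(iters):
        E_old, E_new = f(lam), f(lam + s)
        if E_new < E_old:
            lam += s                            # move accepted, s kept
        else:
            s = update_step(s, E_new, E_old)    # first case of Eq. (21)
    return lam
```

For example, `line_search(lambda x: (x - 2.0) ** 2, 0.0, 0.5)` converges to the minimizer λ* = 2, the rejected proposals repeatedly halving and reversing s around it.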
Table I
DATASET OF HUMAN MOTION INFORMATION (30 FPS)

| Trial No. | Number of time steps | Actions and sequence |
|-----------|----------------------|----------------------|
| 13 29 | 1148 steps | Action 1 → Action 2 → Action 3 → Action 4 → Action 5 → Action 6 |
| 13 30 | 617 steps  | Action 1 → Action 2 → Action 5 → Action 6 → Action 7 |
| 13 31 | 755 steps  | Action 2 → Action 5 → Action 4 → Action 7 → Action 3 → Action 1 |
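The 372-dimensional "windowing data" used later is formed by concatenating consecutive frames of these trials into one vector per time step. The exact window width and per-frame dimensionality are not given in this excerpt, so the sketch below only shows the generic sliding-window construction; the names and shapes are illustrative.

```python
import numpy as np

def make_windows(frames, width):
    """Concatenate `width` consecutive motion-capture frames into one
    feature vector per time step; `frames` has shape (T, D), and the
    result has shape (T - width + 1, width * D)."""
    T, D = frames.shape
    return np.stack([frames[t:t + width].ravel()
                     for t in range(T - width + 1)])
```

For instance, a hypothetical split of D = 31 channels per frame and width = 12 would yield 372-dimensional vectors; the paper's actual factorization of 372 is not stated here.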
[Figure: three-dimensional features extracted for each trial, plotted on axes Feature 1, Feature 2 and Feature 3 with action labels; panel (a) shows the three-dimensional features by PCA with labels.]
[Figure panels: three-dimensional features with action labels on axes Feature 1, Feature 2 and Feature 3; panel (a) shows the three-dimensional features by PCA with labels.]

Because of that, the data collected for each trial had some offsets for the same kind of action, such as the direction the human was facing and different motion trajectories. From the experimental results for groups 1, 2 and 3, we find that features extracted by PCA for the same actions were distributed in different parts of the three-dimensional feature space. However, even though there were offsets between trials, the features extracted by SAE and DSAE for the same action were distributed within roughly the same area of the three-dimensional feature space. This shows that SAE and DSAE are good at feature extraction from nonlinear data, and that they could correct some linear offsets. In addition, we believe that the distributions of features for the different actions extracted by SAE and DSAE could be well classified; i.e., the features for each action had almost no overlap in the three-dimensional feature space. However, the distribution of features by DSAE was wider than that by SAE; thus, in pattern recognition, DSAE achieves better generalization than SAE for new data. We use the experimental results for pattern recognition to confirm this.

C. Experiment 2: pattern recognition

We employed RFs for the pattern recognition and calculated accuracy for the 372-dimensional windowing data and for the three-dimensional features extracted by PCA, SAE and DSAE. We used the three groups of data for cross-validation, and set the options for variable importance and surrogate splits in the CvRTParams function. In addition, for the 372-dimensional windowing data, the number of randomly selected partial dimensions was set to L = 10, and for the three-dimensional features extracted by PCA, SAE and DSAE, it was set to L = 1.

Based on the above experimental conditions, the results of pattern recognition are shown in Table II. To confirm whether the RFs had been well trained on the training sets for the three groups, after completing the training we fed the training-set data into the trained model and calculated accuracy rates.
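The paper configures OpenCV's random trees through CvRTParams. As an illustrative stand-in (not the authors' implementation), the NumPy sketch below shows the two mechanisms the text relies on: bootstrap re-sampling of the dataset for each tree, and restricting each tree to L randomly selected dimensions, here with single-split trees for brevity. All names and the toy data are ours.

```python
import numpy as np

def fit_stump(X, y, dims):
    """Exhaustively pick the best threshold split over the candidate
    dimensions `dims`, with majority-vote labels in each leaf."""
    best = None
    for d in dims:
        for thr in np.unique(X[:, d]):
            left, right = y[X[:, d] <= thr], y[X[:, d] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            lo, hi = np.bincount(left).argmax(), np.bincount(right).argmax()
            err = np.sum(left != lo) + np.sum(right != hi)
            if best is None or err < best[0]:
                best = (err, d, thr, lo, hi)
    return best[1:]

def fit_forest(X, y, n_trees=15, L=1, seed=0):
    """Forest of stumps: bootstrap the data for each tree and restrict
    it to L randomly selected dimensions (the paper's parameter L)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))            # re-sampling
        dims = rng.choice(X.shape[1], size=L, replace=False)
        trees.append(fit_stump(X[idx], y[idx], dims))
    return trees

def predict(trees, X):
    """Majority vote over the trees."""
    votes = np.array([np.where(X[:, d] <= thr, lo, hi)
                      for d, thr, lo, hi in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

In a full random forest the per-split random selection of L dimensions is repeated at every node of every tree; the stumps above collapse that to one split per tree to keep the sketch short.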
Table II
PATTERN RECOGNITION RESULTS BY RFS

| Group   | Method         | Accuracy rate [%], training set | Accuracy rate [%], test set |
|---------|----------------|---------------------------------|-----------------------------|
| Group 1 | Windowing data | 99.9 | 44.0 |
|         | PCA            | 100  | 17.2 |
|         | SAE            | 99.7 | 69.3 |
|         | DSAE           | 99.4 | 74.4 |
| Group 2 | Windowing data | 100  | 50.8 |
|         | PCA            | 99.4 | 24.0 |
|         | SAE            | 99.7 | 81.0 |
|         | DSAE           | 99.6 | 80.1 |
| Group 3 | Windowing data | 99.9 | 63.5 |
|         | PCA            | 99.6 | 31.8 |
|         | SAE            | 99.6 | 69.6 |
|         | DSAE           | 99.1 | 70.9 |
| Average | Windowing data | 99.9 ± 4.44×10⁻² | 52.8 ± 7.16 |
|         | PCA            | 99.7 ± 2.20×10⁻¹ | 24.3 ± 5.02 |
|         | SAE            | 99.6 ± 4.55×10⁻² | 73.3 ± 5.15 |
|         | DSAE           | 99.4 ± 1.57×10⁻¹ | 75.1 ± 3.30 |

From Table II, we find that the accuracy rates on the three groups' training sets exceeded 99% for all four methods. This shows that the RFs were trained well on the training sets for all four methods.

On the test sets of the three groups, DSAE achieved the highest averaged accuracy rate, 75.1%, while PCA had the lowest, 24.3%. This shows that the nonlinear feature extraction methods are superior to the linear feature extraction method for human motion data.

The test-set accuracy rate when using PCA was lower than that for the windowing data. We think this is because PCA, being a linear feature extraction method, undermined the potentially nonlinear distribution of the human motion data. From (a) in Figs. 5, 6 and 7, we can see that the distributions of the three-dimensional features by PCA were in different positions for some of the same actions, and some features of different actions were concentrated in small parts of the feature space. Therefore, the lowest accuracy rate was obtained with PCA.

We also compared the test-set accuracies of the windowing data, SAE and DSAE. The accuracy provided by the windowing data was clearly lower than that obtained with SAE or DSAE, which illustrates the effectiveness of AE-based feature extraction. The averaged accuracy rate was 73.3% with SAE and 75.1% with DSAE. This shows that a feature extraction method with a deep structure works better than one with a shallow structure. Because SAE extracted three-dimensional features directly from 372-dimensional data, the change in dimensionality was too great, and SAE lost some useful information through the dimension compression. Since DSAE has a deep structure, the change of dimensionality between adjacent layers was smaller than with SAE; this kind of ladder-type feature extraction can reduce the information loss in dimension compression.

Finally, we look at the standard deviation of the accuracy rate on the test sets. The highest standard deviation, ±7.16%, was for the windowing data. This high value arose because the RFs had to randomly select partial dimensions from the 372-dimensional windowing data before each split node; depending on the random selection, the chances of choosing partial dimensions that were easy or hard to split on were both great. On the other hand, the smallest standard deviation, ±3.30% on the test set, was obtained when using DSAE. This shows that DSAE applied to new data has good generalization and excellent stability.

V. CONCLUSION AND FUTURE WORK

In this paper, we have focused on feature extraction methods for human-motion recognition. We used DSAE, which has a deep structure and can extract features layer by layer, as the feature extraction method, and we used RFs to perform pattern recognition of human motion on the basis of the extracted features. To compare recognition accuracies with those of the three-dimensional features extracted by DSAE, we also performed recognition by RFs on 372-dimensional windowing data, three-dimensional features extracted by PCA, and three-dimensional features extracted by SAE. In the experiments, the highest averaged accuracy rate, 75.1%, and the smallest standard deviation, ±3.30%, were both obtained using DSAE. The proposed method thus has a higher accuracy rate, better generalization and more stability than the other methods.

In our future work, we will focus on improving the accuracy rate of the proposed method, e.g., by changing the data normalization method, using whitening, and redesigning the network to obtain a new deep structure.

REFERENCES

[1] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 283-298, 2008.

[2] N. D. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data," Advances in Neural Information Processing Systems, vol. 16, pp. 329-336, 2004.

[3] T. Hirose and T. Taniguchi, "Abstraction multimodal low-dimensional representation from high-dimensional posture information and visual images," Journal of Robotics and Mechatronics, vol. 25, no. 1, pp. 80-88, 2013.

[4] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "Sharing features among dynamical systems with beta processes," in Advances in Neural Information Processing Systems, 2009, pp. 549-557.

[5] T. Taniguchi, K. Hamahata, and N. Iwahashi, "Unsupervised segmentation of human motion data using sticky HDP-HMM and MDL-based chunking method for imitation learning," Advanced Robotics, vol. 25, no. 17, pp. 2143-2172, 2011.
[6] T. Taniguchi and S. Nagasaka, "Double articulation analyzer for unsegmented human motion using Pitman-Yor language model and infinite hidden Markov model," in System Integration (SII), 2011 IEEE/SICE International Symposium on. IEEE, 2011, pp. 250-255.

[7] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[8] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 194-281.

[9] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann machines," in International Conference on Artificial Intelligence and Statistics, 2009, pp. 448-455.

[10] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.

[11] J. Håstad and M. Goldmann, "On the power of small-depth threshold circuits," Computational Complexity, vol. 1, no. 2, pp. 113-129, 1991.

[12] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," Advances in Neural Information Processing Systems, vol. 19, p. 1345, 2007.

[13] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in Human Behavior Understanding. Springer, 2011, pp. 29-39.

Regarding the SVM results in Table III: for the windowing data, the averaged accuracy rate on the training set was the highest, but that on the test set was the lowest. These results show that the SVM over-fitted the training set, so its generalization performance was poor. For PCA, the averaged accuracy rates on both the training set and the test set were lower than those of SAE and DSAE; this once again shows that the nonlinear feature extraction methods were superior to the linear feature extraction method for human motion data. SAE and DSAE both performed well. However, the averaged accuracy rates of SAE (training set: 96.7 ± 1.11; test set: 74.8 ± 11.3) were slightly higher than those of DSAE (training set: 96.0 ± 1.24; test set: 72.8 ± 11.0). We note that their standard deviations on the test set were large (about ±11.0). Although the averaged accuracy rate of DSAE was lower than that of SAE, we cannot conclude that SAE is superior to DSAE.

We also compared the accuracy rates of the RFs (Table II) and the SVMs (Table III). We find that the standard errors of the SVMs were much larger than those of the RFs. This shows that the generalization performance of the SVMs was less stable than that of the RFs for this task. Because RFs produce many sub-datasets through re-sampling from the input dataset, they can greatly increase how well the dataset expresses the true states, and they thus exhibited good generalization performance in the recognition task on the test set.