
Combining Convolutional Neural Network and Support Vector Machine for Gait-based Gender Recognition
Taocheng Liu
College of Intelligence Science and Engineering
National University of Defense Technology
Changsha, China
liutaocheng16@nudt.edu.cn

Xiangbin Ye
College of Intelligence Science and Engineering
National University of Defense Technology
Changsha, China
13873181055@163.com

Bei Sun
College of Intelligence Science and Engineering
National University of Defense Technology
Changsha, China
beys1990@163.com

Abstract—Recently, deep learning based on convolutional neural networks (CNN) has achieved state-of-the-art performance in many fields such as image classification, semantic analysis and biometric recognition. Normally, the Softmax activation function is used as the classifier in the last layer of a CNN. However, some studies have tried to replace the Softmax layer with a support vector machine (SVM) in an artificial neural network architecture and have achieved good results. Inspired by these works, we study the performance of a CNN with a linear SVM classifier on gender recognition based on the CASIA-B dataset. In the first model, the input image's descriptors are extracted from the fully connected layer of the pre-trained VGGNet-16 model as the features to train the SVM. In the second model, we adjust VGGNet-16 with a hinge loss function using an L2 norm to create a new architecture, namely VGGNet-SVM. The results show that the SVM performs better than Softmax in VGGNet-16 for solving the gait-based gender recognition problem.

Keywords—CNN, SVM, gender recognition, GEI, deep-features

I. INTRODUCTION

Gender is one of the important basic attributes of human beings. Identifying the gender of a subject is an important function for a long-distance intelligent monitoring system, as it can effectively improve the system's understanding of the monitored environment. At present, most gender recognition methods are based on the face, but the acquisition of face images is easily restricted: influenced by factors such as distance and resolution, it is not easy for a camera to capture a clear face image. Compared with the face, gait has the advantages of being non-invasive, contactless and easy to collect, which gives gait recognition an irreplaceable role in long-distance identification [1].

With the development of image processing and recognition technology, silhouette-based gait identification has become the major approach. The support vector machine (SVM) plays a key role in machine learning. Generally, SVM classifiers are trained on the descriptors of gait feature images to obtain a classification model that estimates the class of an input gait feature image [6-8]. The key step of this approach is the selection of image descriptors. Recently, deep learning based on convolutional neural networks (CNN) has achieved state-of-the-art performance in image classification. The advantage of a CNN is that it omits the feature extraction steps of traditional image processing. There have been several studies applying deep learning methods to gait analysis [2-5]. Moreover, Razavian et al. [9] proposed that the deep fully connected layer units of an off-the-shelf CNN can be used as descriptors of the input image, and that an SVM can be trained on the deep features extracted from a pre-trained deep convolutional neural network. Many researchers have used a linear SVM to replace the Softmax layer in a CNN, creating a novel architecture, namely CNN-SVM, to address different image recognition tasks [10-12].

Yichuan Tang [11] investigated deep learning using linear support vector machines for face classification. Inspired by Tang's work, this paper studies the use of a linear SVM in deep learning for gait-based gender recognition. We examine the performance of the linear SVM classifier on two models, the deep-features model and the fine-tuning model. In the former model, the SVM is trained on image descriptors extracted from the fully connected layers of the pre-trained deep convolutional model VGGNet-16; the weights of all layers are frozen, meaning that the pre-trained VGGNet-16 is only used as a tool for feature extraction. In the latter model, our fine-tuning network adds a 2-unit fully connected layer with a hinge loss function using an L2 norm after fc8, and only the weights of the fully connected layers are optimized during training. Figure 1 shows this architecture. To compare our network with the network using Softmax, we evaluate the performance on the CASIA-B dataset. The experimental results show that the proposed architecture achieves a better classification rate than the traditional architecture under normal, carrying and wearing conditions in five views (54°, 72°, 90°, 108°, 126°).

Fig. 1. The architecture of VGGNet-SVM.

Fig. 2. Samples of GEIs of a male subject under different conditions.

II. RELATED WORK

A. The VGGNet-16 model

A state-of-the-art deep convolutional model, VGGNet-16 [13], is used in this paper. The VGGNet-16 model includes 16 weight layers, consisting of 13 convolutional layers and 3 fully connected layers. The input to VGGNet-16 is an RGB image of size 224×224 pixels. The input image is passed through a stack of convolutional layers with 3×3 filters.

The convolutional layers are followed by three fully connected layers: the first two have 4096 units each, and the third has 1000 units corresponding to the classification outputs. The final layer is a Softmax layer. In this work, the VGGNet-16 model is pre-trained on the ImageNet dataset.

B. Input data

In this work, we use the Gait Energy Image (GEI) [14] as the gait feature image and input it to the CNN. The GEI is the most common feature model in silhouette-based approaches. It uses one gait cycle to represent the gait energy, reflecting both the static and the dynamic information of human walking. In a GEI, the luminance of each pixel (the pixel value) reflects how frequently the body appears at that pixel over one cycle. The GEI is calculated by the following formula:

GEI(x, y) = \frac{1}{N} \sum_{t=1}^{N} I_t(x, y)    (1)

where N is the number of frames in one gait cycle, I_t(x, y) is the t-th silhouette image of the gait sequence, and t is the frame index. Figure 2 shows samples of GEIs.
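As an illustration of equation (1), the following sketch computes a GEI by averaging the binarized silhouette frames of one gait cycle. It assumes the silhouettes are already size-normalized and aligned, and it uses NumPy only; the paper does not specify an implementation.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Compute a Gait Energy Image (GEI) from one gait cycle.

    silhouettes: iterable of 2-D binary (0/1) arrays of identical shape,
    covering exactly one gait cycle of N frames.
    Returns a float array in [0, 1]: the per-pixel body-appearance frequency.
    """
    frames = np.stack([np.asarray(s, dtype=np.float32) for s in silhouettes])
    # Equation (1): pixel-wise average over the N frames of the cycle.
    return frames.mean(axis=0)

# Example with random stand-ins for real, aligned silhouette frames.
cycle = [np.random.randint(0, 2, (224, 224)) for _ in range(30)]
gei = gait_energy_image(cycle)
print(gei.shape, float(gei.min()), float(gei.max()))
```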
C. Deep-features as image descriptors

In traditional machine learning, the input of a classifier is a feature vector that is extracted manually, whereas a CNN takes the original image as input and performs feature extraction through its successive convolutional layers. Razavian et al. [9] demonstrated that the deep fully connected layer units of an off-the-shelf CNN can be used as descriptors of the input image. In order to extract the deep-features, the size of the input GEIs must be made compatible with VGGNet-16's input size of 224×224. The deep-features are then calculated by forward propagating the resized GEIs through the pre-trained VGGNet-16 model. We extract the 4096-dimensional fc6 and fc7 features and the 1000-dimensional fc8 features as the deep-features used to train an SVM, as listed in Table I.

TABLE I. THE DEEP-FEATURES.

Name               Layer   Dimensionality
Deep-features-I    fc6     4096
Deep-features-II   fc7     4096
Deep-features-III  fc8     1000
fully connected layer units of an off-the-shelf CNN can be
used as the input image’s descriptors. In order to extract the And the & w &2 is the L2 norm, with the squared hinge
deep-features, the size of input GEIs must be compatible
with VGGNet-16’s input size which is 224×224. Then deep- loss, N is the number of examples, C is the penalty
features are calculated by forward propagating the size-fixed parameter, xn is the input n-th feature vector.
GEIs through the pre-trained VGGNet-16 model. We can

Fig. 3. The freeze configurations for fine-tuning VGGNet-SVM.

B. Deep learning with SVM

The use of a linear SVM in deep learning takes two forms: A) the deep-features model and B) the fine-tuning model. In the former, the deep-features are extracted from the fully connected layers of the pre-trained VGGNet-16 model. Every deep-feature is normalized to the range 0 to 1 to eliminate scale differences between features and to balance the importance of each feature, and the linear SVM is trained on these deep-features. It is important to emphasize that there is no optimization of the CNN model by backpropagation: the pre-trained VGGNet-16 is only used as a tool for image feature extraction.

In the latter model, we study the effect of fine-tuning the pre-trained VGGNet-16 on recognition. At the newly added fc9 layer, the hinge loss with an L2 norm is applied, as shown in equation (2). The weights of all fully connected layers are learned by backpropagating from the L2-SVM, while the weights of the other layers are frozen. An essential step is that the label of each input GEI must be translated into the form y ∈ {+1, −1}.
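One possible way to assemble this fine-tuning model (VGGNet-SVM), again assuming PyTorch/torchvision; the added 2-unit layer is named fc9 to match the text, and the mapping of gender labels to per-class {+1, -1} targets is an illustrative choice.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pre-trained VGGNet-16 and freeze all convolutional layers.
vggnet_svm = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in vggnet_svm.features.parameters():
    p.requires_grad = False

# Append a 2-unit fully connected layer (fc9) after fc8; its raw outputs feed
# the L2-SVM loss of equation (2) instead of a Softmax layer.
vggnet_svm.classifier.add_module("fc9", nn.Linear(1000, 2))

def to_pm1_targets(gender_labels):
    """Map {0: female, 1: male} labels to per-class {+1, -1} targets (equation (3))."""
    t = -torch.ones(len(gender_labels), 2)
    t[torch.arange(len(gender_labels)), gender_labels] = 1.0
    return t

# Forward pass on a dummy batch of 3-channel GEIs.
scores = vggnet_svm(torch.rand(4, 3, 224, 224))        # shape (4, 2)
targets = to_pm1_targets(torch.tensor([0, 1, 1, 0]))   # shape (4, 2)
```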
IV. EXPERIMENTS AND RESULTS

A. Experiments

The CASIA-B dataset, containing 124 subjects (31 females and 93 males), is used in this work. Each subject has 10 walking sequences: 6 normal sequences (nm), 2 carrying-bag sequences (bg) and 2 wearing-coat sequences (cl), captured from 11 views (0°, 18°, 36°, …, 180°). Because gender recognition is a two-class problem, the 31 females and 31 randomly selected males are used for the training set. For each training subject, 4 normal-condition sequences are selected, and the remaining sequences form the test set. Table II shows the distribution of sequences in the training and test sets.
TABLE II. DATASET DISTRIBUTION.

Dataset    Female   Male   Total
Training   620      620    1240
Testing    930      1890   3800

For the fine-tuning model, the hyperparameter C from equation (2) is set to 1. In the training phase, because VGGNet-16 is pre-trained, the weight parameters of fc6, fc7 and fc8 are learned using Adam with 200 epochs, a batch size of 20 and a learning rate of 10^-5, while the learning rate of the new fc9 layer is 10^-3. By optimizing the weight parameters, the network tries to produce image descriptors that are more suitable for the linear SVM to classify. We freeze the parameters of the different fully connected layers in turn to determine the ability of each layer to extract features. The detailed freeze configurations are shown in Figure 3.
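A sketch of the corresponding training setup, assuming the vggnet_svm model and to_pm1_targets helper sketched in Section III. The two learning-rate groups (fc6-fc8 at 1e-5, fc9 at 1e-3), C = 1, 200 epochs and batch size 20 follow the text; train_loader is a placeholder for a DataLoader over the GEI training set.

```python
import torch

# Two parameter groups: pre-trained fc6-fc8 at 1e-5, the newly added fc9 at 1e-3.
fc9_params = list(vggnet_svm.classifier.fc9.parameters())
fc_params = [p for p in vggnet_svm.classifier.parameters()
             if p.requires_grad and all(p is not q for q in fc9_params)]
optimizer = torch.optim.Adam([
    {"params": fc_params, "lr": 1e-5},
    {"params": fc9_params, "lr": 1e-3},
])

def squared_hinge(scores, targets, C=1.0):
    # Data term of equation (2); the (1/2)||w||^2 term can be handled via weight decay.
    return C * torch.clamp(1.0 - scores * targets, min=0.0).pow(2).sum()

for epoch in range(200):                       # 200 epochs, as stated in the text
    for geis, genders in train_loader:         # placeholder loader, batch size 20
        optimizer.zero_grad()
        loss = squared_hinge(vggnet_svm(geis), to_pm1_targets(genders))
        loss.backward()
        optimizer.step()
```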
B. Results and discussion

1) Model A results

From Table III, the best results were obtained using Deep-features-I. This means that the deep-features extracted from the fc6 layer can effectively represent the characteristics of gait, which benefits gender classification with the linear SVM. In addition, we find that the recognition rate gradually decreases from Deep-features-I to Deep-features-III. We can infer that the deep-features extracted from the fully connected layer closest to the convolutional layers retain more of the original features of the input image. This may be because the deep-features extracted from fully connected layers closer to the classifier are more specialized for a particular task, so the image features they represent are very specific, their richness is reduced, and they are less broadly representative.

TABLE III. THE ACCURACY OF THE DEEP-FEATURES MODEL.

Method                    Accuracy (%)
Deep-features-I + SVM     87.94
Deep-features-II + SVM    81.92
Deep-features-III + SVM   80.79

2) Model B results

Figure 4 shows the loss output during training, where the vertical axis represents the loss value and the horizontal axis represents the training epoch. It can be seen that the loss stabilizes after about 10 epochs when all fully connected layers are fine-tuned.

Fig. 4. The output of the loss during training.

By comparison, it is found that the more fully connected layers are frozen, the more difficult it is for the loss to converge to a low value. This is because the ability of the network to modify and adjust the features is degraded, so the extracted features are not well adapted to the needs of the task.

When the FC6 layer is frozen, the loss output changes sharply at around 50 training epochs and then slowly stabilizes. Whether the FC8 layer is frozen directly affects whether the loss can converge to a low value. It can therefore be inferred that the FC6 and FC8 layers play a key role in the feature extraction and adjustment of the entire fully connected block.

In Table IV, VGGNet-Fune_all is a method using Softmax with the cross-entropy loss function. The results show that the hinge loss with an L2 norm outperforms the Softmax function with cross-entropy loss. It can also be observed that the more fully connected layers are fine-tuned, the higher the accuracy. This result is attributed to the fact that optimizing the weights of the fully connected layers by backpropagation is very effective, which makes the extracted deep-features more suitable for the new recognition task.

TABLE IV. TEST ACCURACY OF THE FINE-TUNING MODEL.

Method                           Accuracy (%)
VGGNet-Fune_all                  87.10
VGGNet-SVM-Fune_all              89.62
VGGNet-SVM-Freeze_FC6            82.72
VGGNet-SVM-Freeze_FC6FC7         82.22
VGGNet-SVM-Freeze_FC6FC7FC8      77.84

At the same time, we can see that the accuracy drops greatly from VGGNet-Fune_all to VGGNet-SVM-Freeze_FC6 and from VGGNet-SVM-Freeze_FC6FC7 to VGGNet-SVM-Freeze_FC6FC7FC8. This further confirms that, in the VGGNet-SVM structure, the FC6 and FC8 layers play a key role in feature extraction and adjustment. Freezing the parameters of these layers has a great influence on the result, which is consistent with the loss curves.

V. CONCLUSION

In this paper, we investigate replacing the Softmax function in VGGNet-16 with a linear SVM. The experiments show that the linear SVM outperforms the Softmax function on the gait-based gender recognition task. By freezing different layers, we find that FC6 and FC8 play a key role in feature extraction and adjustment in the VGGNet-SVM model. Moreover, we study the performance of deep-features extracted from the fully connected layers of the pre-trained VGGNet-16. The results show that these features can be used as descriptors of the input GEIs, and the best performance is achieved with Deep-features-I, which is extracted from the FC6 layer of VGGNet-16. In future work, we intend to extend our research to other pre-trained CNNs such as AlexNet, ResNet and LeNet.

ACKNOWLEDGMENTS

The research in this paper uses the CASIA Gait Database collected by the Institute of Automation, Chinese Academy of Sciences. This work is supported by the National Natural Science Foundation of China (No. 61503398).

REFERENCES

[1] Jain A. K., Ross A. and Prabhakar S., "An introduction to biometric recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, 2004, pp. 4-20.
[2] Yeoh T. W., Aguirre H. E. and Tanaka K., "Clothing-invariant gait recognition using convolutional neural network," Proc. ISPACS (Xiamen, China), 2017, pp. 1-5.
[3] Shiraga K., Makihara Y., Muramatsu D. et al., "GEINet: View-invariant gait recognition using a convolutional neural network," Proc. Int. Conf. on Biometrics (Halmstad, Sweden), 2016, pp. 1-8.
[4] Shukla R., Shukla R. and Shukla A. et al., "Gender identification in human gait using neural network," J. Model. Educ. & Comput. Sci., vol. 4, 2012, pp. 70-75.
[5] Alotaibi M. and Mahmood A., "Improved gait recognition based on specialized deep convolutional neural networks," Comput. Vision Image Understanding, vol. 164, pp. 103-10.
[6] Yoo J. H., Hwang D. and Nixon M. S., "Gender classification in human gait using support vector machine," ACIVS, vol. 3708, 2005, pp. 138-145.
[7] Juang L. H., Lin S. A. and Wu M. N., "Gender recognition studying by gait energy image classification," Int. IEEE Symposium on Computer, Consumer and Control (Taichung, Taiwan), 2012, pp. 837-840.
[8] El-Alfy E. S. M. and Binsaadoon A. G., "Silhouette-based gender recognition in smart environments using fuzzy local binary patterns and support vector machines," Procedia Comput. Sci., vol. 109, 2017, pp. 164-171.
[9] Razavian A. S., Azizpour H., Sullivan J. et al., "CNN features off-the-shelf: An astounding baseline for recognition," IEEE Conf. on Comput. Vision and Pattern Recognit. (Columbus, USA), 2014, pp. 512-519.
[10] Wolfshaar J. v. d., Karaaba M. F. and Wiering M. A., "Deep convolutional neural networks and support vector machines for gender recognition," Proc. IEEE Symposium Series on Computational Intelligence (Cape Town, South Africa), 2015, pp. 188-195.
[11] Tang Y., "Deep learning using linear support vector machines," arXiv preprint, 2013, 1306.0239.
[12] Agarap A. F., "An architecture combining convolutional neural network (CNN) and support vector machine (SVM) for image classification," arXiv preprint, 2017, 1712.0354.

[13] Simonyan K. and Zisserman A., "Very deep convolutional networks for large-scale image recognition," arXiv preprint, 2014, 1409.1556.
[14] Han J. and Bhanu B., "Individual recognition using gait energy image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, 2006, pp. 316-322.

