
IETE Technical Review

ISSN: 0256-4602 (Print) 0974-5971 (Online) Journal homepage: www.tandfonline.com/journals/titr20

Facial Expression Recognition via Deep Learning

Xiaoming Zhao, Xugan Shi & Shiqing Zhang

To cite this article: Xiaoming Zhao, Xugan Shi & Shiqing Zhang (2015) Facial
Expression Recognition via Deep Learning, IETE Technical Review, 32:5, 347-355, DOI:
10.1080/02564602.2015.1017542

To link to this article: https://doi.org/10.1080/02564602.2015.1017542

Published online: 10 Mar 2015.

Facial Expression Recognition via Deep Learning
Xiaoming Zhao [1], Xugan Shi [1,2] and Shiqing Zhang [1]

[1] Institute of Image Processing and Pattern Recognition, Taizhou University, Taizhou, Zhejiang 318000, China; [2] School of Automatic Control of Mechanical, Zhejiang Sci-Tech University, Hangzhou 310018, China

ABSTRACT
Deep learning is a newly emerged machine learning theory that has received extensive attention in pattern recognition, signal processing, computer vision, etc. Deep belief networks (DBNs), a representative deep learning method, have a strong ability of unsupervised feature learning. In this paper, by combining DBNs with a multi-layer perceptron (MLP), a new method of facial expression recognition based on deep learning is proposed. The proposed method integrates the advantage of DBNs in unsupervised feature learning with the classification advantage of the MLP. Experimental results on two benchmark facial expression databases, i.e., the JAFFE database and the Cohn-Kanade database, demonstrate the promising performance of the proposed method for facial expression recognition, outperforming other state-of-the-art classification methods such as the nearest neighbour, MLP, support vector machine, the nearest subspace, as well as sparse representation-based classification.

Keywords:
Deep belief networks, Deep learning, Facial expression recognition, Feature learning, Multi-layer perceptron, Unsupervised feature learning.

1. INTRODUCTION

Facial expression recognition, which is currently a very active research topic in signal processing, pattern recognition, and artificial intelligence, aims at creating an automatic facial expression recognition system for the purpose of understanding and identifying human emotions, such as happiness, sadness, anger, fear, surprise, disgust, and so on. Since facial expression is one of the most natural media for human intercommunication, developing an automatic facial expression recognition system has many potential applications in human-computer interaction, artificial intelligence, security monitoring, social entertainment, etc.

Generally, an automatic facial expression recognition system has two crucial parts: facial expression feature extraction and facial expression classification. At present, facial feature extraction methods can be classified into two categories: geometric-features-based systems and appearance-features-based systems [1].

Geometric features usually encode the shapes and locations of a large number of facial fiducial points, in terms of angle, distance, position, etc., to form a feature vector. For instance, Bashyal and Venayagamoorthy [2] extracted a set of 34 fiducial location points in a facial image to represent facial features for facial expression images. Zavaschi et al. [3] used a set of 20 fiducial location points derived from 74 different landmarks, characterizing the most significant information regarding the muscle movements, to constitute a feature vector for facial expression recognition. Nevertheless, geometric-features-based systems rely on precise and reliable facial component detection techniques to locate such points in a facial image, making them difficult to adapt to real-world scenarios.

Appearance features aim at modelling the appearance variations of a face by using a holistic spatial analysis. One of the most widely used appearance features is the Gabor wavelets representation [2,4-7], in which a set of Gabor filters is convolved with facial images to produce multi-scale and multi-orientation coefficients. Nevertheless, extracting the Gabor wavelets representation with a large feature vector needs much time and memory due to its computational complexity. Another promising appearance feature is the recently emerged family of texture descriptors, i.e., local binary patterns (LBP) [8] and their variants [9]. Shan et al. [10] gave a comprehensive study of facial expression recognition using LBP features, and showed that LBP features are a kind of promising facial feature for facial expression recognition due to their tolerance against illumination changes and computational simplicity. In our previous work [11], we developed a facial expression recognition method by integrating LBP features with a dimensionality reduction method named discriminant

IETE TECHNICAL REVIEW | VOL 32 | NO 5 | SEP-OCT 2015 | 347


Zhao X, et al.: Facial Expression Recognition via Deep Learning

kernel locally linear embedding. Zhao and Pietikainen [12] developed an extension of LBP named volume local binary patterns to model texture information for facial expression recognition. Other variants of LBP, such as the local directional pattern (LDP) [13] and the local transitional pattern [5], have also been proposed in recent years as facial feature representations for facial expression recognition.

Facial expression classification aims at designing a suitable classification mechanism to identify categories of facial expression based on the extracted features. So far, various classification methods have been applied to facial expression recognition, such as the hidden Markov model [14], artificial neural network (ANN) [15], Bayesian network [16], support vector machines (SVM) [17], K-nearest neighbour [18], and so on. Recently, sparse representation-based classification (SRC) [6,19], based on the newly emerged compressive sensing theory [20], has become a popular facial expression classification method due to its promising classification performance.

Facial expression feature extraction is the key problem in facial expression recognition tasks. In order to characterize human emotion expression, whether the extracted features are suitable or not has an important impact on the subsequent classification performance. It is worth pointing out that the above-mentioned hand-designed feature extraction methods generally rely on manual operations with labelled data. In other words, these methods are supervised. In addition, hand-designed features such as LBP and the Gabor wavelets representation are able to capture low-level information of facial images, but not high-level representations of facial images. However, deep learning, as a recently emerged machine learning theory, has shown how hierarchies of features can be directly learned from original data in an unsupervised manner. This paper describes how to use deep feature learning techniques to perform facial expression recognition.

Deep learning, inspired by the hierarchical architecture of information processing in the primate visual perception system [21], was originally developed by Hinton and Salakhutdinov [22] in 2006. In the primate visual perception system, the mammal brain is regarded as being structured in a deep architecture [21], in which a given input percept is represented at multiple levels of abstraction, and each level corresponds to a different area of cortex. The brain performs information processing through multiple stages of transformation and representation. Human beings also usually illustrate such concepts in a hierarchical way, using multiple levels of abstraction. The nature of deep learning is that by combining multiple levels of low-level features, high-level features can be formed to find a distributed representation of the feature data. Therefore, deep learning is also known as unsupervised feature learning. Different from traditional shallow learning approaches, such as ANN and SVM, deep learning is not only multi-layered, but also highlights the importance of unsupervised feature learning.

In recent years, deep learning, as a newly emerged machine learning theory, has received extensive attention in machine learning [23], signal processing [24], and artificial intelligence [25]. Deep belief networks (DBNs) [26], as a representative deep learning method, have a strong ability of unsupervised feature learning. DBNs have recently been used for acoustic modelling [27], natural language understanding [28], and so on. The basic idea of DBNs is to greedily train one layer at a time, exploiting a restricted Boltzmann machine (RBM) [29] scheme to perform unsupervised learning for each layer. Although DBNs have a strong ability of unsupervised feature learning, they cannot be used directly for classification. To address this issue, we integrate DBNs with a traditional multi-layer perceptron (MLP) model, endowing DBNs with an ability of facial expression recognition. In addition, to the best of our knowledge, there is so far very little reported work about deep learning for facial expression recognition. Motivated by this, in this paper we present a new method of facial expression recognition based on deep learning. First, DBNs are employed to learn from the low-level primitive features (such as the raw pixels) of facial expression images and to extract a higher level of abstract feature representation. Second, the extracted higher-level abstract feature representation is used to initialize the weights of the hidden layer of an MLP model. Finally, we adopt the initialized MLP to perform the classification of facial expression. Experimental results on two benchmark facial expression databases, i.e., the JAFFE database and the Cohn-Kanade database, demonstrate the promising performance of the proposed method for facial expression recognition.

The rest of this paper is organized as follows. Section 2 reviews the principle of DBNs. Section 3 describes the details of the proposed method. Section 4 gives experimental results and analysis. Finally, the conclusions are presented in Section 5.

2. REVIEW OF DEEP BELIEF NETWORKS

Deep belief networks (DBNs) [26] are a typical multi-layered deep learning structure constituted by a sequence of stacked RBMs [29].


2.1 Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) is a typical neural network in which the visible layer (v) and the hidden layer (h) are fully interconnected with each other, but there are no connections within a layer. Essentially, an RBM is a bipartite graph, and its hidden nodes (h) are able to capture higher-order correlations of the input data presented at the visible nodes (v). The RBM model is shown in Figure 1.

Figure 1: The RBM model.

2.2 Deep Belief Networks

The key characteristic of DBNs is that their adjacent layers can be split into several separate RBMs. In other words, a DBN can be considered as an accumulation of multiple stacked RBMs. The principle of DBNs is that the output of the lowest layer of the DBN is taken as the input of the next layer, and the output of that layer is subsequently taken as the input of the next higher layer. Figure 2 gives the structure of DBNs. In practice, the training procedure of DBNs includes two steps, pre-training and fine-tuning, as described below.

Figure 2: The structure of DBNs.

2.2.1 Pre-training

Pre-training is essentially an unsupervised learning procedure from the bottom up. Since a single RBM cannot perfectly model the features of the original data, a higher level of the network is needed to model features. Hinton et al. [26] proved that a greedy layer-wise method for unsupervised training is effective. This method is also called contrastive divergence (CD). The CD learning method effectively improves a lower bound on the likelihood of the training data. As an energy model [30], the RBM characterizes the relationship between the visible layer and the hidden layer by the following energy function:

E(v, h; \theta) = -\sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} b_i v_i - \sum_{j=1}^{H} a_j h_j,   (1)

where v_i and h_j separately represent the states of the visible nodes and hidden nodes, generally 0 or 1, and \theta = \{w_{ij}, a_j, b_i\} is the parameter set of the energy model, in which w_{ij} is the connection weight between v_i and h_j, and a_j and b_i are the corresponding biases for h_j and v_i, respectively. The joint probability over the visible layer generated by this model is

p(v; \theta) = \frac{\sum_{h} e^{-E(v,h)}}{\sum_{v'} \sum_{h} e^{-E(v',h)}}.   (2)

The conditional probabilities between the visible layer and the hidden layer are calculated as follows:

p(h_j = 1 \mid v) = \sigma\left( \sum_{i=1}^{V} w_{ij} v_i + a_j \right),   (3)

p(v_i = 1 \mid h) = \sigma\left( \sum_{j=1}^{H} w_{ij} h_j + b_i \right),   (4)

where \sigma(x) = (1 + e^{-x})^{-1} is the sigmoid function, a nonlinear neuron activation function. Taking the derivative of the log-probability, the weight update of the RBM model can be obtained as

\Delta w_{ij} = \varepsilon \frac{\partial \ln p(v)}{\partial w_{ij}} = \varepsilon \left( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \right),   (5)

where \varepsilon represents the learning rate and \langle \cdot \rangle denotes the expectation under the corresponding distribution. In practice, obtaining unbiased samples from the model distribution is often difficult; therefore, the CD method is widely used to update the network weights by a sampling approximation based on the reconstructed data [30]. During pre-training, the input of each layer comes from the output of the layer below. Generally, the input of the lowest layer comes from the observed variables, i.e., the original data of the training samples, such as the grey-scale pixel values of an image.
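To make the CD update rule of Eqs. (1)-(5) concrete, the following numpy sketch trains a small binary RBM with one-step contrastive divergence (CD-1). This is an illustrative reconstruction under our own naming (sigmoid, cd1_step), not the authors' MATLAB code; the training data here are random binary vectors.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid s(x) = 1 / (1 + exp(-x)) from Eqs. (3) and (4)."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update of a binary RBM.

    v0: (batch, V) visible data; W: (V, H) weights;
    a: (H,) hidden biases; b: (V,) visible biases.
    """
    # Positive phase: p(h = 1 | v0), Eq. (3), then sample binary hidden states.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One-step reconstruction: p(v = 1 | h0), Eq. (4).
    pv1 = sigmoid(h0 @ W.T + b)
    # Negative phase: hidden probabilities given the reconstruction.
    ph1 = sigmoid(pv1 @ W + a)
    n = v0.shape[0]
    # Eq. (5): the <v h>_model term is approximated by reconstruction statistics.
    W = W + lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    a = a + lr * (ph0 - ph1).mean(axis=0)
    b = b + lr * (v0 - pv1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
V, H = 8, 4
W = 0.01 * rng.standard_normal((V, H))
a, b = np.zeros(H), np.zeros(V)
data = (rng.random((32, V)) < 0.5).astype(float)
for _ in range(200):  # 200 pre-training cycles, matching Section 4.2
    W, a, b = cd1_step(data, W, a, b, lr=0.1, rng=rng)
```

Because sampling from the true model distribution would require prolonged Gibbs sampling, CD-1 stops after a single reconstruction step, which is the approximation discussed in the text.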


2.2.2 Fine-tuning

After pre-training, each RBM layer has been initialized. All the RBMs are concatenated in series according to the training order to form a DBN. The fine-tuning process is based on a loss function between the input data and the reconstructed data. The back-propagation algorithm is used to readjust the network parameters, and finally a well-optimized network can be obtained. The loss function used between the input data and the reconstructed data is

L(x, x') = \|x - x'\|_2^2,   (6)

where x is the input data, x' is the reconstructed data, and \|\cdot\|_2 represents the L2-norm of the reconstruction error.

3. THE PROPOSED METHOD

Although DBNs have a strong ability of unsupervised feature learning, they cannot be used directly for classification. To solve this problem, we present a new method of facial expression recognition that integrates DBNs with MLP, thereby endowing DBNs with an ability of facial expression recognition.

The proposed method of facial expression recognition based on combining DBNs with MLP is shown in Figure 3. It consists of three crucial steps: DBNs feature learning, MLP initialization, and facial expression recognition. DBNs feature learning, including the DBN's pre-training and fine-tuning, aims to perform feature learning on the low-level primitive features (such as the raw pixels) of facial expression images, resulting in a higher level of abstract feature representation. Since a DBN is actually a deep learning neural network with multiple hidden layers, the extracted higher-level abstract feature representation is embodied in each hidden layer of the DBN. In this work, the learned abstract feature representation of the topmost hidden layer of the DBN is used to perform the MLP initialization. The MLP initialization employs the higher-level abstract features obtained by the DBN to initialize an MLP model. The initialized MLP model has the same parameters as the DBN, such as the number of hidden layers, the nodes of each hidden layer, and the weights of each hidden layer. Finally, we adopt the initialized MLP as a classifier to perform facial expression recognition and obtain the recognition results.

Figure 3: The diagram of combining DBNs with MLP (training samples feed DBNs feature learning and MLP initialization; testing samples yield the recognition results).

4. EXPERIMENTAL STUDIES

4.1 Databases

To verify the performance of the proposed method on facial expression recognition, two benchmark facial expression databases, i.e., the JAFFE database [31] and the Cohn-Kanade database [32], were employed. Each database has seven emotions: neutral, joy, sadness, surprise, anger, disgust, and fear.

The JAFFE database consists of images of 10 Japanese women, each of whom posed 7 expressions. Each expression has about 3 or 4 images, giving a total of 213 images. Each image has a resolution of 256 × 256 pixels. Some sample images from the JAFFE database are shown in Figure 4.

Figure 4: Examples of facial expression images from the JAFFE database.

The Cohn-Kanade database comprises 100 college students from 18 to 30 years of age, of whom 65% are female, 15% are African-American, and 3% are Asian or Latino. Some sample images from the Cohn-Kanade database are given in Figure 5.

Figure 5: Examples of facial expression images from the Cohn-Kanade database.

4.2 Experimental Setup

The raw pixels of the facial images are directly adopted as facial features and used for DBNs feature learning. Each pixel of a facial image is normalized to variance 1 and mean 0. When using DBNs, the nodes of its
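The three steps above can be sketched as follows: a stack of RBMs is greedily pre-trained, and the resulting top-layer weights are copied into an MLP hidden layer in place of random initialization. This is a minimal numpy sketch under our own naming (pretrain_rbm, pretrain_dbn), not the authors' implementation; layer sizes and data are toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, epochs=50, lr=0.1, seed=0):
    """Unsupervised CD-1 pre-training of one binary RBM layer."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a, b = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + a)                     # p(h=1|v), Eq. (3)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                     # reconstruction, Eq. (4)
        ph1 = sigmoid(pv1 @ W + a)
        n = data.shape[0]
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / n      # Eq. (5), CD-1
        a += lr * (ph0 - ph1).mean(axis=0)
        b += lr * (data - pv1).mean(axis=0)
    return W, a

def pretrain_dbn(x, layer_sizes):
    """Step 1: greedy layer-wise stacking -- each layer's hidden
    activations become the next layer's input."""
    weights, layer_input = [], x
    for n_hidden in layer_sizes:
        W, a = pretrain_rbm(layer_input, n_hidden)
        weights.append((W, a))
        layer_input = sigmoid(layer_input @ W + a)
    return weights

# Step 2: initialize the MLP hidden layer with the DBN's learned weights
# (a plain MLP would draw these at random); step 3 would then train this
# MLP with back-propagation on the expression labels.
x = (np.random.default_rng(1).random((30, 16)) < 0.5).astype(float)
dbn_weights = pretrain_dbn(x, layer_sizes=[8])
mlp_hidden_W, mlp_hidden_bias = dbn_weights[-1]
```

The design point the paper makes is captured in the last two lines: only the initialization differs between MLP and DBNs + MLP, so any accuracy gap is attributable to the unsupervised feature learning.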


visible layer are equal to the feature dimensions of the input facial image samples. The number of hidden layers is set separately to 1, 2, and 3, each of which has 50, 100, 200, 300, 400, or 500 nodes. The DBN's recognition results for each hidden-layer configuration are reported. In order to obtain better convergence, the number of cycles is 200 for the DBN's pre-training and 500 for its fine-tuning. All the algorithms are implemented on the MATLAB 2009 platform.

Considering the reliability of the recognition results, a 10-fold cross-validation scheme is employed in all facial expression recognition experiments. In detail, all facial image samples are divided into 10 parts, 90% of which are used for training and the remainder for testing. The recognition experiments are repeated 10 times, and the average of the 10 recognition results is taken as the finally reported recognition result.

4.3 Experimental Results by Combining DBNs with MLP

4.3.1 Experiments on the JAFFE Database

In this experiment, the original expression images with 256 × 256 pixels in the JAFFE database are down-sampled to 3 different resolutions: 16 × 16, 32 × 32, and 64 × 64. Then facial expression recognition experiments are performed by combining DBNs with MLP (denoted by DBNs + MLP) for the three image resolutions. The recognition results are shown in Tables 1-3, respectively.

Table 1: Recognition accuracy (%) obtained with DBNs + MLP for 16 × 16 images on the JAFFE database

Hidden layers   50      100     200     300     400     500  (hidden nodes)
1               85.23   87.14   88.57   85.71   85.71   87.14
2               82.38   84.76   80.95   78.57   65.71   63.80
3               16.60   15.71   11.90   11.90   10.47   10.47

Table 2: Recognition accuracy (%) obtained with DBNs + MLP for 32 × 32 images on the JAFFE database

Hidden layers   50      100     200     300     400     500  (hidden nodes)
1               80.47   84.28   80.47   84.76   88.10   89.05
2               80.47   80.00   84.28   82.86   85.71   84.76
3               61.90   16.44   55.24   66.67   65.24   14.56

Table 3: Recognition accuracy (%) obtained with DBNs + MLP for 64 × 64 images on the JAFFE database

Hidden layers   50      100     200     300     400     500  (hidden nodes)
1               27.61   88.09   90.95   90.95   87.61   90.47
2               14.28   65.71   81.90   85.23   81.90   85.23
3               6.19    10.47   35.71   44.76   47.61   49.52

From the results in Tables 1-3, the following two points can be observed.

(1) The proposed method of combining DBNs with MLP (i.e., DBNs + MLP) obtains the highest accuracies of 88.57%, 89.05%, and 90.95% for the three image resolutions, i.e., 16 × 16, 32 × 32, and 64 × 64, respectively. Generally, the smaller the resolution of an image is, the less useful information is embedded in the image for classification. However, the recognition performance of DBNs + MLP in this experiment is very close across the three resolutions. This shows that DBNs have a strong ability of unsupervised feature learning. Even at the lowest resolution of 16 × 16, DBNs + MLP still obtains a recognition performance of 88.57%, which is almost identical to the results for the other two resolutions, 32 × 32 and 64 × 64.

(2) The proposed DBNs + MLP method usually performs best with one hidden layer. In detail, when using one hidden layer, DBNs + MLP gives its best performance with 200 hidden nodes for 16 × 16 images, 500 for 32 × 32 images, and 300 for 64 × 64 images, respectively. As the number of hidden layers increases, the recognition performance of DBNs + MLP decreases. In general, however, the more hidden layers there are, the stronger the learning ability of DBNs is [22]. The decrease here is mainly caused by the relatively small size (only 213 samples) of the JAFFE database used in this experiment. This indicates that a DBN with one hidden layer is sufficient when dealing with small samples.

Table 4: Recognition accuracy (%) for each facial expression when DBNs + MLP performs best on the JAFFE database (Ang - anger, Joy - joy, Sad - sadness, Neu - neutral, Fea - fear, Bor - boredom, Dis - disgust)

        Ang     Joy     Sad     Neu     Fea     Bor     Dis
Ang     92.86   0.00    0.00    0.00    3.57    3.57    0.00
Joy     0.00    87.10   3.22    0.00    0.00    0.00    9.68
Sad     0.00    3.34    93.33   0.00    0.00    0.00    3.33
Neu     0.00    3.34    0.00    93.33   0.00    3.33    0.00
Fea     3.45    0.00    3.45    0.00    82.76   10.34   0.00
Bor     0.00    0.00    3.14    3.12    3.12    87.50   3.12
Dis     0.00    0.00    0.00    0.00    0.00    0.00    100
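The experimental protocol of Section 4.2, per-pixel normalization to zero mean and unit variance followed by a 10-fold split with 90% of samples for training, can be sketched as below. This is an illustrative numpy version with hypothetical helper names; the array sizes mimic the 213-image JAFFE set at 16 × 16 resolution, with random values standing in for real images.

```python
import numpy as np

def normalize_pixels(images):
    """Normalize each pixel position to mean 0 and variance 1 across
    the sample set, as in the setup of Section 4.2."""
    mean = images.mean(axis=0)
    std = images.std(axis=0) + 1e-8  # guard against constant pixels
    return (images - mean) / std

def ten_fold_indices(n_samples, seed=0):
    """Shuffle the sample indices and split them into 10 parts; each
    round uses 9 parts (90%) for training and 1 part for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test

# Toy stand-in for the 213 down-sampled JAFFE images.
images = np.random.default_rng(2).random((213, 16 * 16))
x = normalize_pixels(images)
splits = list(ten_fold_indices(len(x)))
```

Averaging the accuracy over the 10 (train, test) splits reproduces the reporting scheme used for all the tables in this section.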


In order to give the recognition accuracy of each facial expression, Table 4 lists the confusion matrix of the seven facial expression recognition results when DBNs + MLP performs best, with an accuracy of 90.95%. The diagonal data in Table 4 represent the correct recognition rate of each facial expression. From the results shown in Table 4, we can see that four expressions, i.e., anger, sadness, neutral, and disgust, are identified well with an accuracy of more than 92%. In particular, disgust is classified with a satisfactory accuracy of 100%. The correct recognition rates of the other expressions, i.e., joy, fear, and boredom, are slightly lower, at 87.10%, 82.76%, and 87.50%, respectively. The main reason is that joy and disgust, as well as fear and boredom, are easily confused with each other.

4.3.2 Experiments on the Cohn-Kanade Database

Tables 5-7 separately present the recognition results for the different image resolutions, 16 × 16, 32 × 32, and 64 × 64, when using DBNs + MLP on the Cohn-Kanade database. The results in Tables 5-7 show that with one hidden layer DBNs + MLP achieves the highest recognition accuracies of 96.67% for 16 × 16 images, 97.19% for 32 × 32 images, and 98.57% for 64 × 64 images, respectively. This again indicates that DBNs have a strong ability of unsupervised feature learning. Table 8 presents the confusion matrix of the seven facial expression recognition results when DBNs + MLP performs best, with an accuracy of 98.57%. The results in Table 8 show that all seven facial expressions except disgust are distinguished very well, with an accuracy of 100%.

Table 5: Recognition accuracy (%) obtained with DBNs + MLP for 16 × 16 images on the Cohn-Kanade database

Hidden layers   50      100     200     300     400     500  (hidden nodes)
1               92.86   96.67   96.67   91.43   88.10   72.85
2               72.86   81.43   78.57   84.28   61.90   57.14
3               38.09   32.86   33.33   32.81   24.28   21.43

Table 6: Recognition accuracy (%) obtained with DBNs + MLP for 32 × 32 images on the Cohn-Kanade database

Hidden layers   50      100     200     300     400     500  (hidden nodes)
1               90.47   93.81   97.19   95.23   91.42   96.19
2               68.09   89.04   91.43   94.76   82.85   90.00
3               23.33   52.85   60.00   47.14   30.95   40.48

Table 7: Recognition accuracy (%) obtained with DBNs + MLP for 64 × 64 images on the Cohn-Kanade database

Hidden layers   50      100     200     300     400     500  (hidden nodes)
1               61.90   15.23   91.90   91.42   97.61   98.57
2               11.42   11.90   87.62   84.76   90.95   89.05
3               10.00   10.47   22.38   48.09   37.14   30.00

Table 8: Recognition accuracy (%) for each facial expression when DBNs + MLP performs best on the Cohn-Kanade database (Ang - anger, Joy - joy, Sad - sadness, Neu - neutral, Fea - fear, Bor - boredom, Dis - disgust)

        Ang     Joy     Sad     Neu     Fea     Bor     Dis
Ang     100     0.00    0.00    0.00    0.00    0.00    0.00
Joy     0.00    100     0.00    0.00    0.00    0.00    0.00
Sad     0.00    0.00    100     0.00    0.00    0.00    0.00
Neu     0.00    0.00    0.00    100     0.00    0.00    0.00
Fea     0.00    0.00    0.00    0.00    100     0.00    0.00
Bor     0.00    0.00    0.00    0.00    0.00    100     0.00
Dis     0.00    0.00    3.33    0.00    3.33    3.33    90

4.3.3 Comparisons with Other Classification Methods

In this experiment, the DBNs + MLP method is compared with five other typical classification methods, including the nearest neighbour (NN), MLP, SVM, the nearest subspace (NS) [33], and SRC [34,35]. Three image resolutions, 16 × 16, 32 × 32, and 64 × 64, are employed in the experiments.

The MLP method used is a traditional neural network classifier containing one hidden layer. As with DBNs, we set the number of hidden nodes in the MLP to 50, 100, 200, 300, 400, and 500, and take the best recognition results as the finally reported results of MLP. The weights of the hidden layer in the MLP are randomly generated. SVM is a classifier based on statistical learning theory. In this experiment, for SVM, we use the "one-versus-one" scheme to handle the multi-class classification problem with a radial basis function (RBF) kernel. The RBF kernel parameters are optimized by a five-fold cross-validation on the training data. NS and SRC are two non-parametric classifiers based on signal reconstruction. The basic idea of NS is that a testing sample is represented as a linear combination of all training samples, and then the optimal solution is employed for classification. The basic idea of SRC is that a sparse representation of a testing sample is sought using all training samples, and then the sparsest solution is used for classification.

Tables 9 and 10 separately show the recognition performance comparisons of the different classification methods on the two facial expression databases, i.e., the JAFFE database and the Cohn-Kanade database, for the three image resolutions 16 × 16, 32 × 32, and 64 × 64. From the results in
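The NS rule described above can be illustrated with a short numpy sketch: each class's training samples span a subspace, the test sample is projected onto each subspace by least squares, and the class with the smallest reconstruction residual wins. The function and variable names are our own, not the implementation of [33], and the two-class toy data are fabricated for the demonstration.

```python
import numpy as np

def nearest_subspace_predict(x_test, x_train, y_train):
    """Nearest subspace (NS) classification: represent each test sample
    as a least-squares linear combination of the training samples of each
    class, and assign the class with the smallest reconstruction residual."""
    classes = np.unique(y_train)
    preds = []
    for x in x_test:
        residuals = []
        for c in classes:
            A = x_train[y_train == c].T  # columns span class c's subspace
            coef, *_ = np.linalg.lstsq(A, x, rcond=None)
            residuals.append(np.linalg.norm(x - A @ coef))
        preds.append(classes[int(np.argmin(residuals))])
    return np.array(preds)

# Toy example: two classes lying near orthogonal coordinate directions.
x_train = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.9, 0.1]])
y_train = np.array([0, 0, 1, 1])
x_test = np.array([[1.0, 0.05, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.05]])
pred = nearest_subspace_predict(x_test, x_train, y_train)
```

SRC differs from NS only in how the combination coefficients are found: instead of an unconstrained least-squares fit per class, it seeks the sparsest coefficient vector over all training samples at once.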


Tables 9 and 10, we can see that the proposed DBNs + MLP method obtains the best recognition performance in all cases among all the classification methods used. This demonstrates the effectiveness of DBNs + MLP for facial expression recognition. It is worth pointing out that DBNs + MLP makes an obvious improvement over MLP. For example, compared with MLP, DBNs + MLP yields improvements of 23.81%, 4.29%, and 4.76% for 16 × 16, 32 × 32, and 64 × 64 images, respectively, on the JAFFE database. This is attributed to the main difference between the two methods: DBNs + MLP first uses DBNs for unsupervised feature learning, and then employs the learned results, which reveal the essential properties of the data, to initialize the weights of the hidden layer of the MLP model, whereas in the plain MLP method the weights are initialized randomly without any feature learning. This demonstrates the importance of the unsupervised feature learning ability of DBNs, and shows that combining DBNs with MLP clearly promotes the performance of MLP.

Table 9: Recognition performance (%) comparison for different classifiers on the JAFFE database

Method        16 × 16   32 × 32   64 × 64
NN            82.85     79.14     80.95
MLP           64.76     84.76     86.19
SVM           80.47     81.43     87.61
NS            83.33     78.90     81.90
SRC           79.52     85.71     83.33
DBNs + MLP    88.57     89.05     90.95

Bold values represent the highest values.

Table 10: Recognition performance (%) comparison for different classifiers on the Cohn-Kanade database

Method        16 × 16   32 × 32   64 × 64
NN            80.71     92.29     94.28
MLP           70.95     85.25     87.62
SVM           85.23     93.80     95.23
NS            87.61     92.74     95.67
SRC           90.14     94.76     96.55
DBNs + MLP    96.67     97.19     98.57

Bold values represent the highest values.

Now, we compare the recognition results of the proposed DBNs + MLP method with previously published work on the two facial expression databases, i.e., the JAFFE database and the Cohn-Kanade database, respectively.

On the JAFFE database, the reported recognition accuracy of DBNs + MLP (88.57% for 16 × 16 images, 89.05% for 32 × 32 images, and 90.95% for 64 × 64 images) on seven-class facial expression recognition tasks is highly competitive with previously reported results. In [2], a 7-class recognition performance of 87.51% was reported based on the Gabor wavelets representation and learning vector quantization. In [10], the highest 7-class recognition accuracy of 81.0% was obtained using SVM and the LBP features. In [13], the highest accuracy of 85.4% was achieved using SVM and the LDP features. In [36], a 7-class recognition accuracy of 90.1% was reported using the Gabor wavelets representation and MLP.

In addition, on the Cohn-Kanade database, the reported recognition accuracy of DBNs + MLP (96.67% for 16 × 16 images, 97.19% for 32 × 32 images, and 98.57% for 64 × 64 images) is also highly competitive with previously published work. In [10], based on the LBP features and SVM, the best 7-class recognition accuracy of 91.4% was obtained. Zhao and Pietikainen [12] employed the NN classifier and LBP histograms from three orthogonal planes, and gave the best 7-class recognition performance of 97.14%. In [13], the best 7-class recognition performance of 93.4% was achieved based on LDP and SVM. In [37], the LBP features and SVM were adopted to produce the best 7-class recognition accuracy of 88.4%.

5. CONCLUSIONS

This paper presents a new method of facial expression recognition based on combining DBNs with MLP (denoted by DBNs + MLP). DBNs are first used to learn from the raw pixels of facial expression images and obtain a higher level of abstract features. Then, with the aid of the DBN's learned results, an MLP model is initialized to perform facial expression classification. Experimental results on the two benchmark facial expression databases, i.e., the JAFFE database and the Cohn-Kanade database, demonstrate the promising performance of the proposed DBNs + MLP method on facial expression recognition tasks, outperforming the other state-of-the-art classification methods used, such as NN, MLP, SVM, NS, and SRC. This can be attributed to the strong unsupervised feature learning ability of DBNs.

Funding

This work is supported by the National Natural Science Foundation of China [grant number 61203257], [grant number 61272261].

REFERENCES

1. Y. Tian, T. Kanade, and J. F. Cohn, "Facial expression recognition," in Handbook of Face Recognition, 2011, pp. 487-519.

2. S. Bashyal, and G. Venayagamoorthy, "Recognition of facial expressions using Gabor wavelets and learning vector

IETE TECHNICAL REVIEW | VOL 32 | NO 5 | SEPOCT 2015 353


Zhao X, et al.: Facial Expression Recognition via Deep Learning

quantization,” Eng. Appl. Artif. Intell., Vol. 21, no. 7, pp. 21. T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T.
105664, Oct. 2008. Poggio, “A quantitative theory of immediate visual recognition,”
3. T. H. Zavaschi, A. S. Britto Jr, L. E. Oliveira, and A. L. Koerich, Prog. Brain Res., Vol. 165, pp. 3356, Oct. 2007.
“Fusion of feature sets and classifiers for facial expression rec- 22. G. E. Hinton, and R. R. Salakhutdinov, “Reducing the
ognition,” Expert Syst. Appl., Vol. 40, no. 2, pp. 64655, Feb. dimensionality of data with neural networks,” Science, Vol.
2013. 313, no. 5786, pp. 5047, Jul. 2006.
4. W. Gu, C. Xiang, Y. Venkatesh, D. Huang, and H. Lin, “Facial 23. Y. Bengio, A. Courville, and P. Vincent, “Representation learn-
expression recognition using radial encoding of local Gabor ing: A review and new perspectives,” IEEE Trans. Pattern Anal.
features and classifier synthesis,” Pattern Recogn., Vol. 45, no. Mach. Intell., Vol. 35, no. 8, pp. 1798828, Aug. 2013.
1, pp. 8091, Jan. 2012. 24. D. Yu, and L. Deng, “Deep learning and its applications to sig-
5. T. Ahsan, T. Jabid, and U.-P. Chong, “Facial expression recog- nal and information processing [exploratory dsp],” IEEE Signal
nition using local transitional pattern on Gabor filtered facial Process. Mag., Vol. 28, no. 1, pp. 14554, Jan. 2011.
images,” IETE Tech. Rev., Vol. 30, no. 1, pp. 4752, Jan. 2013. 25. I. Arel, D. C. Rose, and T. P. Karnowski, “Deep machine learn-
6. M. Mohammadi, E. Fatemizadeh, and M. Mahoor, “PCA-based ing-a new frontier in artificial intelligence research [research
dictionary building for accurate facial expression recognition frontier],” IEEE Comput. Intell. Mag., Vol. 5, no. 4, pp. 138,
via sparse representation,” J. Vis. Commun. Image Represent., Nov. 2010.
Vol. 25, no. 5, pp. 108292, Jul. 2014. 26. G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algo-
7. M.H. Siddiqi, R. Ali, A. Sattar, A.M. Khan, and S. Lee, “Depth rithm for deep belief nets,” Neural Comput., Vol. 18, no. 7, pp.
camera-based facial expression recognition system using mul- 152754, Jul. 2006.
tilayer scheme,” IETE Tech. Rev., Vol. 31, no. 4, pp. 27786, 27. A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling
Jul. 2014. using deep belief networks,” IEEE Trans. Audio, Speech, Lang.
8. T. Ojala, M. Pietik inen, and T. M enp, “Multiresolution gray Process., Vol. 20, no. 1, pp. 1422, Jan. 2012.
scale and rotation invariant texture analysis with local binary 28. R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep
patterns,” IEEE Trans. Pattern Anal. Mach. Intell., Vol. 24, no. 7, belief networks for natural language understanding,” IEEE
pp. 97187, Jul. 2002. Trans. Audio, Speech, Lang. Process., Vol. 22, no. 4, pp.
9. S. Brahnam, L. C. Jain, A. Lumini, and L. Nanni, “Introduction 77884, Apr. 2014.
to local binary patterns: new variants and applications,” in 29. Y. Freund, and D. Haussler, “Unsupervised learning of distribu-
Local Binary Patterns: New Variants and Applications. Springer, tions of binary vectors using two layer networks,” University of
2014, pp. 113. California, Santa Cruz, CA Tech. Rep. UCSC-CRL-94-25, 1994.
10. C. Shan, S. Gong, and P. McOwan, “Facial expression recogni- 30. G. E. Hinton, “A practical guide to training restricted boltzmann
tion based on local binary patterns: A comprehensive study,” machines,” in Neural Networks: Tricks of the Trade. Springer,
Image Vis. Comput., Vol. 27, no. 6, pp. 80316, May 2009. 2012, pp. 599619.
11. X. Zhao, and S. Zhang, “Facial expression recognition using 31. M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding
local binary patterns and discriminant kernel locally linear facial expressions with Gabor wavelets,” in Proceeding of Third
embedding,” EURASIP J. Adv. Signal Process., Vol. 2012, IEEE International Conference on Automatic Face and Gesture
pp. 2002:20, Jan. 2012. doi: 10.1186/1687-6180-2012-20 Recognition, Nara, 1998, pp. 2005.
12. G. Zhao, and M. Pietikainen, “Dynamic texture recognition 32. T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database
using local binary patterns with an application to facial expres- for facial expression analysis,” in Proceeding of Fourth IEEE
sions,” IEEE Trans. Pattern Anal. Mach. Intell., Vol. 29, no. 6, International Conference on Automatic Face and Gesture Rec-
pp. 91528, Jun. 2007. ognition, Grenoble, 2000, pp. 4653.
13. T. Jabid, M. H. Kabir, and O. Chae, “Robust facial expression 33. K.-C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspa-
recognition based on local directional pattern,” ETRI J., Vol. ces for face recognition under variable lighting,” IEEE Trans.
32, no. 5, pp. 78494, Oct. 2010. Pattern Anal. Mach. Intell., Vol. 27, no. 5, pp. 68498, May
14. L. Rabiner, and B. Juang, “An introduction to hidden Markov 2005.
models,” IEEE ASSP Mag., Vol. 3, no. 1, pp. 416, Jan. 1986. 34. J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma,
15. W. S. McCulloch, and W. Pitts, “A logical calculus of the ideas “Robust face recognition via sparse representation,” IEEE
immanent in nervous activity,” Bull. Math. Biophys., Vol. 5, no. Trans. Pattern Anal. Mach. Intell., Vol. 31, no. 2, pp. 21027,
4, pp. 11533, Dec. 1943. Feb. 2009.
16. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Net- 35. J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan,
works of Plausible Inference. Morgan Kaufmann, 1988. “Sparse representation for computer vision and pattern recog-
17. C. Cortes, and V. Vapnik, “Support-vector networks,” Mach. nition,” Proc. IEEE, Vol. 98, no. 6, pp. 103144, Jun. 2010.
learn., Vol. 20, no. 3, pp. 27397, Sep. 1995. 36. Z. Zhang, M. Lyons, M. Schuster, and S. Akamatsu, “Compari-
18. T. Cover, and P. Hart, “Nearest neighbor pattern classification,” son between geometry-based and Gabor-wavelets-based
IEEE Trans. Inform. Theor., Vol. 13, no. 1, pp. 217, Jan. 1967. facial expression recognition using multi-layer perceptron,” in
Proceeding of Third IEEE International Conference on Face
19. S. Zhang, X. Zhao, and B. Lei, “Robust facial expression recog- and Gesture Recognition, Nara, 1998, pp. 4549.
nition via compressive sensing,” Sensors, Vol. 12, no. 3, pp.
374761, Mar. 2012. 37. C. Shan, S. Gong, and P. McOwan, “Robust facial expression
recognition using local binary patterns,” in Proceeding of IEEE
20. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. International Conference on Image Processing (ICIP), Genoa,
Theor., Vol. 52, no. 4, pp. 1289306, Apr. 2006. 2005, pp. 3703.




Authors

Xiaoming Zhao received the BS degree in mathematics from Zhejiang Normal University in 1990 and the MS degree in software engineering from Beihang University in 2006. He is currently a professor in the department of computer science, Taizhou University, China. His research interests include image processing, machine learning, and pattern recognition.

E-mail: tzxyzxm@163.com

Xugan Shi received the BS degree in electronics and information engineering from Zhejiang University of Media and Communications in 2012. He is currently pursuing the MS degree at the School of Automatic Control of Mechanical, Zhejiang Sci-Tech University, China. His research interests include image processing and pattern recognition.

E-mail: zhangshiqing111@126.com

Shiqing Zhang received the BS degree in electronics and information engineering from Hunan University of Commerce in 2003, the MS degree in electronics and communication engineering from Hangzhou Dianzi University in 2008, and the PhD degree from the School of Communication and Information Engineering, University of Electronic Science and Technology of China in 2012. Currently, he works as an assistant professor in the department of physics and electronics engineering, Taizhou University, China. His research interests include image processing, affective computing, and pattern recognition.

E-mail: tzczsq@163.com

DOI: 10.1080/02564602.2015.1017542; Copyright © 2015 by the IETE

