Professional Documents
Culture Documents
Using Deep Autoencoders For Facial Expression Recognition
Using Deep Autoencoders For Facial Expression Recognition
Abstract—Feature descriptors involved in image processing dimensional data is a challenging task. Therefore, techniques
are generally manually chosen and high dimensional in nature. like Principal Component Analysis (PCA) and Local Binary
Selecting the most important features is a very crucial task for Pattern (LBP), [5], [8], Non-Negative Matrix Factorization
systems like facial expression recognition. This paper investigates
the performance of deep autoencoders for feature selection and (NMF), etc., are being used to overcome high dimensionality
dimension reduction for facial expression recognition on multiple problem by representing the most relevant features in lower-
levels of hidden layers. The features extracted from the stacked dimensions.
autoencoder outperformed when compared to other state-of-the-
art feature selection and dimension reduction techniques.
I. I NTRODUCTION
Emotion recognition is an important area of research to
enable effective human-computer interaction. Human emo- Fig. 1: Facial expression recognition (FER) block diagram
can be trained by using greedy methods for each additional Indian facial expression databases. They attained above 75%
layer. As a result, when the data is passed through the multiple recognition rate by using both techniques.
hidden layers of stacked autoencoders, it encodes the input Recently, a comprehensive study has been made by Liu et
vector in a smaller representation more efficiently [18]. In our al. [8], they also combined HOG with Local Binary Patterns
case, autoencoders network is more suitable as it not only (LBP) features. For dimension reduction of extracted features,
reduces the dimension of data but can also detect most relevant PCA was used. After applying several classifiers on reduced
features. In previous work, Hinton et al. [19] have shown that features, he received 98.3% maximum recognition rate. Simi-
autoencoders networks can be used for effective dimension larly, Xing et al. [24] used local Gabor features with Adaboost
reduction and they can produce more effective representation classifier for FER and achieved 95.1% accuracy with the 10-
than PCA. time reduced dimensionality of traditional Gabor features.
For our experiments, we choose Extended Cohn-Kanade Encouragingly, Jung et al. [25] used deep neural networks
(CK+) [20] dataset which is extensively used for automatic to extract temporal appearance as well as temporal geometric
facial image analysis and emotion classification. The HOG features from RAW data. They tested this technique on several
features are computed from the selected area of facial expres- datasets and obtained higher accuracy than state-of-the-art
sions and their dimensions have been reduced by using stacked techniques. Jang et al. [26] worked on color images and
autoencoders on multiple levels and with multiple hidden attained 85.74% recognition rate by using color channel-wise
layers to get the most optimal encoded features. SVM model in recurrent learning using deep learning. Similarly, Talele et al.
the one-vs-all scenario is used for classification on this reduced [27] used LBP features and ANN for classification and the
form of features. We have performed multiple experiments on maximum success rate was 95.48%.
the selection of optimal dimension (10-500 features) of the Recently, the autoencoders models have been used more
feature vector. The feature vector with length 60, obtained widely for features learning from data and classification prob-
after the introduction of four hidden layers in autoencoders lems. For example, Huang et al. [28] used sparse autoen-
network outperformed as compared to other dimensions. Most coder networks for feature learning and classification. His
importantly, we also use PCA for dimension reduction in technique was good to avoid human interaction but at the
order to compare the baseline results with autoencoders. Our cost of computation complexity. Interestingly, Gupta et al.
proposed method for FER using stacked autoencoders is also [29] developed a multi-velocity autoencoder network by using
outperformed when results were compared with PCA and other the multi-velocity layers for generating velocity-free deep
recent approaches published in this domain. This demonstrates motion features for facial expressions. The proposed technique
the effectiveness of stacked autoencoders for the selection of attained the state-of-the-art accuracy on various FER datasets.
the most relevant features for FER task. An interesting work has been done by Makhzan et al. [30]
The rest of the paper is organized as follows. In Section to investigate the effectiveness of sparsity on MNIST data.
II, we present background and related work. In Section III, They showed that sparse autoencoders are simple to train
the detail on each step of our proposed method is described. and achieve better classification results as compared to the
In Section IV, we explain the experimental procedure and denoising autoencoders and Restricted Boltzmann Machines
obtained results. Finally, we conclude in Section V. (RBMs) as well as networks trained with dropout. Another
study [31] explored the effect of hidden layers in stacked
II. R ELATED W ORK autoencoders on MNIST data. The authors showed that stacked
autoencoders with larger depth have better learning capability
Facial expressions are visually observable non-verbal com- but require more training examples and time.
munication signals that occur in response to a person’s emo-
tions and originate by the change in facial muscle. They are
III. P ROPOSED M ETHOD
the key mechanism for conveying and understanding emotions.
Ekman and Freisen [21] postulated six universal emotions (i.e., Our proposed FER system consists of four steps (as illus-
anger, fear, disgust, joy, surprise, and sadness) with distinct trated in Figure 2). The first step is related to image processing,
content and unique facial expression. Most of the studies in in which, we use the state-of-the-art Viola Jones [32] face
the area of emotion recognition usually focus on classifying detection method for face detection and extraction. This ex-
these six emotions. tracted portion represents the most variance when expression
Much of the efforts have been made to classify facial ex- changes. In the second step, HOG features are computed
pression with various facial feature by using machine learning from the cropped image. In the third step, high-dimensional
algorithms. For example, Anderson et al. [22] developed an HOD features are reduced to lower dimension using stacked
FER system to recognize the six emotions. They use SVM and autoencoders. Finally, in the fourth step, the SVM model
Multilayer Perceptrons and achieved a recognition accuracy of is used on these lower dimension features to classify the
81.82%. In [6], Wang et al. combined HOG and WLD features facial expressions. Extended Cohn-Kanade Dataset (CK+) is
to have missing information about the contour and shape. The used in our experiment. Most importantly, we investigated the
proposed solution attained 95.86% recognition rate by using performance of encoded features of length 5 to 100 using
chi-square distance and the nearest neighbor method to classify different depth of stacked autoencoders. Figure. 2 shows the
the fused features. Lia et al. [23] used k-nearest neighbor to flowchart of our overall experiment. The detail of each step is
compare the performance of PCA and NMF on Taiwanese and given below.
3
B. Feature Extraction
Histogram of oriented gradients (HOG) is a feature de-
scriptor which is widely used in computer vision and image
processing [33]. The technique counts occurrences of gradient Fig. 4: Architecture of an autoencoder network. The input vector X of length m is
orientation in localized portions of an image. HOG is invari- encoded to lower dimensional feature vector a of length n then reconstructed as y
with length m similar to x (n < m)
ant to geometric and photometric transformations, except for
object orientation. The images that are in the database have For example, if autoencoders are inputted with correlated
different expressions and different orientations of eyes, nose, structural data, then the network will discover some of these
and lip corners. HOG is used in our algorithm because it is a correlations [34]. In an autoencoder, the lower dimension a is
powerful descriptor to detect the variations (i.e., when facial represented by
expressions change). In our proposed approach, we applied X
a=f Wx + b (1)
HOG on cropped face images and extracted the feature vectors.
The cropped image of size 128×128 gives a feature vector of Where W is associated weight vector with the input unit and
size 1×8100 using HOG. The feature vectors are concatenated hidden unit, b is the bias associated with the hidden unit and
to form feature matrix as shown in table I. a is the activation of the hidden unit in the network. Similarly,
f (x) is the sigmoid function that is given by
C. Dimension Reduction 1
f (x) = (2)
The main aim of this work is to show that how nonlinear 1 + exp(−z)
machine learning technique can be effectively used to obtain m
the most relevant holistic representation of features in a
X
zi = Wij x + bi (3)
lower dimension. As the extracted HOG features have a high j=1
4
Fig. 6: The structure of five stacked autoencoders used in our experiment. (a) one
hidden layer autoencoder, (b) two hidden layer autoencoder, (c) three hidden layer
autoencoder, (d) four hidden layer autoencoder and (e) five hidden layer autoencoder
Fig. 5: Sample sequence of five images for various emotions: (a) Angry, (b) Disgust,
(c) Fear, (d) Happy, (e) Sad and (f) Surprised
but after that, it decreases. Similarly, we find the same trend
for all number of features but at a different number of hidden
Multiclass SVM in the one-vs-all scheme with Gaussian layers.
kernel has been used for classification of facial expressions
using MATLAB. We have performed multiple experiments on Number of Hidden Hidden Hidden Hidden Hidden
Feature Layer 1 Layer 2 Layer 3 Layer 4 Layer 5
a different length of features obtained by autoencoder and PCA 5 78.6 89.0 90.1 93.4 94.9
as shown in table II. It can be noted that encoded features 10 78.4 93.0 94.9 97.8 97.1
obtained by the stacked autoencoders mostly outperformed 20 77.7 95.2 98.5 98.1 97.8
40 96.2 98.3 98.9 98.5 98.1
the baseline (PCA) performance. By using autoencoder for 60 96.9 98.7 99.2 99.6 98.8
dimension reduction, we achieved the highest recognition 80 97.6 98.9 98.4 98.1 97.3
100 97.9 98.5 98.9 98.4 98.2
rate of 99.60% with 60 dimensions while with PCA 96.44%
success rate is obtained with 80 dimensions. TABLE III: Recognition rate achieved on multiple hidden layers for different dimension
(number of features)
TABLE II: Accuracy using different dimension (number of features) with SVM
IV, previously maximum achieved accuracy is 99.51% using a [10] H.-C. Choi and S.-Y. Oh, “Realtime facial expression recognition using
combination of features (i.e, HOG+LDA+PCA). Similarly, Liu active appearance model and multilayer perceptron,” in SICE-ICASE,
2006. International Joint Conference. IEEE, 2006, pp. 5924–5927.
et al. [8] achieved 98.3% recognition rate using local binary [11] D. Ghimire, S. Jeong, J. Lee, and S. H. Park, “Facial expression
patterns (LBP) and HOG features together. They achieved this recognition based on local region specific features and support vector
accuracy using a combination of features with 80 dimensions. machines,” Multimedia Tools and Applications, vol. 76, no. 6, pp. 7803–
7821, 2017.
While our proposed method has shown significantly better re- [12] M. H. Siddiqi, S. Lee, Y.-K. Lee, A. M. Khan, and P. T. H. Truc, “Hi-
sults while using a single type of features at lower dimensions. erarchical recognition scheme for human facial expression recognition
systems,” Sensors, vol. 13, no. 12, pp. 16 682–16 713, 2013.
Study Year Method Accuracy (%) [13] M. Z. Uddin, J. Lee, and T.-S. Kim, “An enhanced independent
Local Gabor Feature + component-based human facial expression recognition from video,”
Xing et al. [24] 2016 95.1
Adaboost Classifier IEEE Transactions on Consumer Electronics, vol. 55, no. 4, 2009.
Cossetin et al. [39] 2016 Pairwise Feature 98.07 [14] Y. Li, S. Wang, Y. Zhao, and Q. Ji, “Simultaneous facial feature
Kar et al. [40] 2017 HOG+LDA+PCA 99.51 tracking and facial expression recognition,” IEEE Transactions on Image
Local binary patterns
Liu et al. [24] 2017 98.3 Processing, vol. 22, no. 7, pp. 2559–2573, 2013.
(LBP)+HOG with PCA
Our study 2017 HOG + Autoencoders 99.60
[15] M. Schels and F. Schwenker, “A multiple classifier system approach
for facial expressions in image sequences utilizing gmm supervectors,”
TABLE IV: Comparison of some recent paper with our study in Pattern Recognition (ICPR), 2010 20th International Conference on.
IEEE, 2010, pp. 4251–4254.
Although our proposed approach has achieved a state-of- [16] P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via
a boosted deep belief network,” in Proceedings of the IEEE Conference
the-art recognition rate but the time complexity of autoen- on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
coders is linearly dependent on the number of features and [17] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson,
hidden layers. Greater the number of features or hidden layers, “Generating facial expressions with deep belief nets,” in Affective
Computing. InTech, 2008.
the more time that is required to train the model. [18] M. Liu, S. Li, S. Shan, and X. Chen, “Au-inspired deep networks for
facial expression feature learning,” Neurocomputing, vol. 159, pp. 126–
V. C ONCLUSION 136, 2015.
[19] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
The main contribution of this paper is to investigate the data with neural networks,” science, vol. 313, no. 5786, pp. 504–507,
performance of deep autoencoders for lower dimensional 2006.
feature representation. The experiment proves that nonlinear [20] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews,
“The extended cohn-kanade dataset (ck+): A complete dataset for
dimension reduction using autoencoders is more effective than action unit and emotion-specified expression,” in Computer Vision and
linear dimension reduction techniques for FER. We used CK+ Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society
dataset in our experiments and compared our results using Conference on. IEEE, 2010, pp. 94–101.
[21] P. Ekman and W. Friesen, “Facial action coding system: a technique for
features obtained by autoencoder networks with state-of-the- the measurement of facial movement,” Palo Alto: Consulting Psycholo-
art PCA. Most importantly, we explored the effect of an gists, 1978.
increase in the number of hidden layers which enhanced the [22] K. Anderson and P. W. McOwan, “A real-time automated system for the
recognition of human facial expressions,” IEEE Transactions on Systems,
learning capability of the network to provide more robust and Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 1, pp. 96–105,
optimal features for facial expression recognition. 2006.
[23] J. Li and M. Oussalah, “Automatic face emotion recognition system,”
R EFERENCES in Cybernetic Intelligent Systems (CIS), 2010 IEEE 9th International
Conference on. IEEE, 2010, pp. 1–6.
[1] J. Kumari, R. Rajesh, and K. Pooja, “Facial expression recognition: A [24] Y. Xing and W. Luo, “Facial expression recognition using local ga-
survey,” Procedia Computer Science, vol. 58, pp. 486–491, 2015. bor features and adaboost classifiers,” in Progress in Informatics and
[2] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,” Computing (PIC), 2016 International Conference on. IEEE, 2016, pp.
Pattern recognition, vol. 36, no. 1, pp. 259–275, 2003. 228–232.
[3] S. Latif, J. Qadir, S. Farooq, and M. A. Imran, “How 5g (and con- [25] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, “Joint fine-tuning in
comitant technologies) will revolutionize healthcare,” arXiv preprint deep neural networks for facial expression recognition,” in Proceedings
arXiv:1708.08746, 2017. of the IEEE International Conference on Computer Vision, 2015, pp.
[4] P. Carcagnı̀, M. Del Coco, M. Leo, and C. Distante, “Facial expression 2983–2991.
recognition and histograms of oriented gradients: a comprehensive [26] J. Jang, D. H. Kim, H.-I. Kim, and Y. M. Ro, “Color channel-wise re-
study,” SpringerPlus, vol. 4, no. 1, p. 645, 2015. current learning for facial expression recognition,” in Acoustics, Speech
[5] M. Abdulrahman, T. R. Gwadabe, F. J. Abdu, and A. Eleyan, “Gabor and Signal Processing (ICASSP), 2017 IEEE International Conference
wavelet transform based facial expression recognition using pca and on. IEEE, 2017, pp. 1233–1237.
lbp,” in Signal Processing and Communications Applications Conference [27] K. Talele, A. Shirsat, T. Uplenchwar, and K. Tuckley, “Facial expression
(SIU), 2014 22nd. IEEE, 2014, pp. 2265–2268. recognition using general regression neural network,” in Bombay Section
[6] X. Wang, C. Jin, W. Liu, M. Hu, L. Xu, and F. Ren, “Feature fusion Symposium (IBSS), 2016 IEEE. IEEE, 2016, pp. 1–6.
of hog and wld for facial expression recognition,” in System Integration [28] B. Huang and Z. Ying, “Sparse autoencoder for facial expression
(SII), 2013 IEEE/SICE International Symposium on. IEEE, 2013, pp. recognition,” in Ubiquitous Intelligence and Computing and 2015 IEEE
227–232. 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE
[7] M. V. Akinin, N. V. Akinina, A. I. Taganov, and M. B. Nikiforov, 15th Intl Conf on Scalable Computing and Communications and Its
“Autoencoder: Approach to the reduction of the dimension of the vector Associated Workshops (UIC-ATC-ScalCom), 2015 IEEE 12th Intl Conf
space with controlled loss of information,” in Embedded Computing on. IEEE, 2015, pp. 1529–1532.
(MECO), 2015 4th Mediterranean Conference on. IEEE, 2015, pp. [29] O. Gupta, D. Raviv, and R. Raskar, “Multi-velocity neural networks for
171–173. facial expression recognition in videos,” IEEE Transactions on Affective
[8] Y. Liu, Y. Li, X. Ma, and R. Song, “Facial expression recognition with Computing, 2017.
fusion features extracted from salient facial areas,” Sensors, vol. 17, [30] A. Makhzani and B. Frey, “K-sparse autoencoders,” arXiv preprint
no. 4, p. 712, 2017. arXiv:1312.5663, 2013.
[9] Y. Wang, H. Yao, S. Zhao, and Y. Zheng, “Dimensionality reduction [31] Q. Xu, C. Zhang, L. Zhang, and Y. Song, “The learning effect of
strategy based on auto-encoder,” in Proceedings of the 7th International different hidden layers stacked autoencoder,” in Intelligent Human-
Conference on Internet Multimedia Computing and Service. ACM, Machine Systems and Cybernetics (IHMSC), 2016 8th International
2015, p. 63. Conference on, vol. 2. IEEE, 2016, pp. 148–151.
7