Facial Emotion Recognition Using Deep Learning: Ankit Awasthi (Y8084)
CS 676: Computer Vision
Supervisors:
Dr. Amitabha Mukerjee, Department of Computer Science and Engineering, IIT Kanpur
Dr. P. Guha, TCS Labs, Delhi, India
ABSTRACT
Facial emotion recognition is one of the most important cognitive functions that our brain
performs quite efficiently. State-of-the-art facial emotion recognition techniques are mostly
performance driven and do not consider the cognitive relevance of the model. This project
is an attempt to look at the task of emotion recognition using deep belief networks, which
are cognitively very appealing and at the same time have been shown to perform very well for
digit recognition (Hinton et al. 2006). We look at the effects of varying the number of hidden
layers and hidden units on the performance of the model and attempt to develop insights
into the features learnt by the model. We also observe that, in line with various psychological
findings, our model finds lower spatial frequency information more useful for recognizing facial
expressions than higher spatial frequency information.
Contents
1 Introduction 3
2 Motivation 4
5 JAFFE Dataset 5
6 Results 5
6.1 First Hidden Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.2 Effect of Number of Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
6.3 Effect of Number of Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . 8
6.4 Effect of Image Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
7 Discussion and Future Work 10
8 References 11
1 Introduction
Facial expressions are important cues for non-verbal communication among human beings.
This is only possible because humans are able to recognize emotions quite accurately and
efficiently. An automatic facial emotion recognition system is an important component in
human-machine interaction. Apart from the commercial uses of an automatic facial emotion
recognition system, it might be useful to incorporate some cues from the biological system
in the model and use the model to develop further insights into the cognitive processing of
our brain.
State-of-the-art approaches in facial emotion recognition use Active Appearance Models
(AAMs), FACS labels, or some other sophisticated feature extraction scheme. AAMs can
be learned from a set of training images and can be fitted to a new face to generate landmark
positions, which can further be used to design features. Thus, in an automatic setting,
either the availability of landmark points on face images is assumed or they can be obtained by
fitting the model. FACS labels attempt to decompose human emotions in terms of Action
Units (AUs), which correspond to specific muscle movements. The FACS coding system is used
in psychology and animation to classify facial expressions in a consistent and systematic
manner. But as of now, FACS labels can only be given by experts or trained individuals.
One problem with ad hoc feature extraction schemes is that we need to design a separate
feature extraction mechanism for each visual task to be performed. Moreover, it is known
that only some of the filters in the retina are hardcoded, while the other units in V1, V2
and higher areas of visual processing are learned. Hubel and Wiesel showed that irreversible
damage was produced in kittens by sufficient visual deprivation during the so-called "critical
period". Therefore, it makes much more sense to have a generic scheme for learning what
transformations of the input space may lead to good features for performing a particular task.
There is ample evidence that our visual processing architecture is organized in different
levels. Each level transforms the input in a manner that facilitates the visual task to be
performed. Another appealing feature of deep learning models is that features or sub-features
can be shared across tasks. Computationally too, it has been shown that insufficiently deep
architectures can be exponentially inefficient. Deep learning was revolutionized by Hinton
et al. [1] when they came up with a very efficient method for training multilayer neural
networks.
2 Motivation
Deep learning methods have performed very well on the MNIST digit recognition dataset [1].
Our setting is very similar to the task of digit recognition: corresponding to the digit
labels we have emotion labels. But emotion recognition is much more complicated, because
digit images are much simpler than face images depicting various expressions. Moreover,
the variability in the images due to different identities hampers performance. Human
accuracy in facial expression recognition is not as good as in digit recognition, and it is also
aided by other sources of information such as context, prior experience and speech, among others.
An RBM (Restricted Boltzmann Machine) is an energy-based model over visible units v and hidden units h. For binary units the energy function and joint distribution are

E(v, h) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i

P(v, h) = \frac{\exp(-E(v, h))}{Z}

For real-valued (Gaussian) visible units the energy acquires a quadratic term:

E(v, h) = \frac{1}{2}\sum_i v_i^2 - \sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i
The hidden nodes are conditionally independent given the visible layer and vice versa.
In particular, the conditional probabilities are

P(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i W_{ij})

P(v_i = 1 \mid h) = \sigma(c_i + \sum_j W_{ij} h_j)

where \sigma(x) = 1 / (1 + \exp(-x)) is the logistic sigmoid.
The parameters of the RBM can be learned by maximizing the log-likelihood of the training
data using gradient ascent. But the exact gradient of the log-likelihood is intractable, so
contrastive divergence (CD) is used instead, which works fairly well in practice. The exact
gradient and its CD approximation are

exact gradient: \frac{\partial \log p(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty

CD approximation: \frac{\partial \log p(v)}{\partial W_{ij}} \approx \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_n
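The CD-1 update rule above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code used in the project; the learning rate, epoch count and weight initialization are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train a binary RBM with one-step contrastive divergence (CD-1).

    data: (n_samples, n_visible) array with values in [0, 1].
    Returns the weight matrix W and the hidden/visible biases b, c.
    """
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden biases
    c = np.zeros(n_visible)  # visible biases
    for _ in range(epochs):
        v0 = data
        # Positive phase: P(h_j = 1 | v0) = sigmoid(b_j + sum_i v_i W_ij)
        p_h0 = sigmoid(v0 @ W + b)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # Negative phase: one Gibbs step down to the visible layer and up again
        p_v1 = sigmoid(h0 @ W.T + c)
        p_h1 = sigmoid(p_v1 @ W + b)
        # CD-1 update: <v_i h_j>_0 - <v_i h_j>_1, averaged over the batch
        n = v0.shape[0]
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        b += lr * (p_h0 - p_h1).mean(axis=0)
        c += lr * (v0 - p_v1).mean(axis=0)
    return W, b, c
```

Using mean-field probabilities rather than binary samples in the negative phase is a common variance-reduction choice; sampling both phases would also be a valid CD-1 implementation.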
5 JAFFE Dataset
The Japanese Female Facial Expression (JAFFE) database contains 213 images of 7 facial
expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.
Each image has been rated on 6 emotion adjectives by 60 Japanese subjects. Some of the
emotions (e.g. fear) have been reported to not have been expressed very well. But in this
project we work with all six emotions rather than a reduced set of emotions.
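JAFFE encodes the posed expression in each filename, so labels can be recovered without any annotation files. The sketch below assumes that filename convention; the random seed and the exact 150/63 split used in the project are unknown, so the split function is purely illustrative.

```python
import numpy as np

# JAFFE filenames encode the poser and the expression, e.g. 'KA.AN1.39.tiff':
# 'KA' is the model, 'AN1' the first anger pose, '39' the image number.
EXPRESSION_CODES = {
    "AN": "angry", "DI": "disgust", "FE": "fear",
    "HA": "happy", "NE": "neutral", "SA": "sad", "SU": "surprise",
}

def label_from_filename(name):
    """Map a JAFFE-style filename to its posed-expression label."""
    return EXPRESSION_CODES[name.split(".")[1][:2]]

def train_test_split(filenames, n_train=150, seed=0):
    """Shuffle and split into a 150-image training set and the remainder for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(filenames))
    files = [filenames[i] for i in order]
    return files[:n_train], files[n_train:]
```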
6 Results
In this project, we experimented with many different settings of the model hyperparameters
to find out how they affect performance. A few variants of the conventional DBN were tried,
such as sparse DBNs and stacked sparse autoencoders, but the results did not show any
improvement, so the corresponding results have not been reported. In all the experiments,
the models were trained on 150 training images and tested on the remaining 63 images.
Figure 1: Features learned by the first layer of the DBN; image size: 24 x 24, hidden layer: 50 units
Deep belief networks typically require a large amount of data, but in our case we have only 213
images. Thus the results may change significantly if a larger dataset is used. The experiments
were performed at three resolutions: 100 x 100, 50 x 50 and 25 x 25. The results for the various
experiments are stated below and are discussed in the next section.
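The three input resolutions can be produced from the 256 x 256 JAFFE originals by downsampling. The project does not state the exact resizing method, so the block-averaging sketch below is an assumed stand-in:

```python
import numpy as np

def downsample(img, factor):
    """Downsample a grayscale image by block-averaging.

    Each output pixel is the mean of a factor x factor block, which also
    acts as a crude low-pass filter before subsampling.
    """
    h, w = img.shape
    h2, w2 = h - h % factor, w - w % factor  # crop so blocks fit exactly
    img = img[:h2, :w2]
    return img.reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))
```

With 256 x 256 inputs, factors of roughly 2, 5 and 10 would give the three resolutions used (after a small crop).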
Figure 2: Features learned by the first layer of the DBN; image size: 50 x 50, hidden layer: 100 units
Figure 3: Features learned by the first layer of the DBN; image size: 100 x 100, hidden layer: 500 units
Figure 4: Performance of DBNs on 24 x 24 images against number of epochs of supervised finetuning
Increasing the number of hidden layers resulted in a slight improvement. It was also observed
that further increasing the number of layers deteriorates the performance. Results for such
cases have not been reported. One possible explanation is that adding layers increases the
number of parameters to be learned, and with the small dataset that we have it is difficult
to learn so many parameters.
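The effect of depth comes from stacking RBMs: each new layer is trained on the hidden activations of the layer below. A minimal sketch of this greedy layer-wise pretraining (learning rates and epoch counts are illustrative, not the settings used in the project):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_layer(data, n_hidden, epochs=20, lr=0.1, rng=None):
    """One CD-1-trained RBM layer; returns (W, hidden_bias)."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    c = np.zeros(data.shape[1])
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + b)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + c)
        p_h1 = sigmoid(p_v1 @ W + b)
        n = data.shape[0]
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / n
        b += lr * (p_h0 - p_h1).mean(axis=0)
        c += lr * (data - p_v1).mean(axis=0)
    return W, b

def pretrain_dbn(data, layer_sizes):
    """Greedily train each RBM on the activations of the layer below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_layer(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)  # deterministic up-pass feeds the next layer
    return layers
```

Supervised finetuning (as in the figures) would then treat the stacked weights as the initialization of a feed-forward classifier.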
Figure 5: Performance of DBNs on 50 x 50 images against number of epochs of supervised finetuning
Figure 6: Performance of DBNs on 100 x 100 images against number of epochs of supervised finetuning
7 Discussion and Future Work
The accuracy of state-of-the-art facial emotion recognition systems is much better than that
achieved in this project. Considering that the algorithm takes raw images rather than landmark
points or FACS labels as input, it performs fairly well. The dataset used in the project was
quite small, which prohibits any general claim about the success or failure of deep learning
methods. It is expected that a larger dataset would improve the accuracy of the algorithm
and that better features would be learned. This comprises a major portion of our future work
on this project.
Observing the features, one may say that the algorithm is able to extract some meaningful
features. In the absence of any principled way of discriminating between the receptive fields
learned by the model, it becomes difficult to argue about the 'goodness' or 'badness' of a
feature other than by evaluating the classification accuracy that the feature facilitates.
As observed, increasing the number of hidden layers resulted in a slight improvement in
classification, but further increases in depth deteriorated the results. The number of hidden
units in each layer was one of the hyperparameters that was not satisfactorily investigated,
but an important and somewhat counter-intuitive observation that came up was that the
number of hidden units in the first layer should be less than the number of visible units;
in other words, there should be a significant reduction in the amount of information from
the visible layer to the first hidden layer. This is appealing because something very similar
happens in our visual system, where a lot of information is thrown out in successive layers
of processing. This forces the hidden units to learn the most important features. Led by
this observation, we thought that sparsity constraints might lead to even better features and
accuracy, but as it turned out there was no improvement. Again, this might be attributed
to the small dataset we are working with.
One of the important results coming out of this project is the observation that low-resolution
images yielded better classification accuracy than higher-resolution images. Various psycho-
logical experiments done on human beings suggest that we make use of the mid spatial
frequency band for recognizing emotions rather than the high spatial frequency band. Al-
though we do not present any quantitative analysis of spatial frequency versus classification
accuracy here, the few experiments that we performed suggest that lower spatial frequency
information is more useful for recognizing emotions, which speaks for the cognitive relevance
of the model. In our future work we would like to work on quantitative ways of evaluating
the cognitive importance of features, which would help argue for DBNs as a very good model
of our visual system.
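The spatial-frequency hypothesis could be probed directly by low-pass filtering the images before training, rather than relying on resizing alone. A minimal sketch of an ideal Fourier-domain low-pass filter (the cutoff values one would sweep are an assumption, not taken from the project):

```python
import numpy as np

def low_pass(img, cutoff):
    """Keep only spatial frequencies below `cutoff` (in cycles per image)
    using an ideal low-pass filter in the Fourier domain."""
    F = np.fft.fftshift(np.fft.fft2(img))  # move DC to the array center
    h, w = img.shape
    y, x = np.ogrid[:h, :w]
    dist = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2)
    F[dist > cutoff] = 0  # zero out everything above the cutoff radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```

Training the same model on images filtered at several cutoffs would give the quantitative spatial-frequency-versus-accuracy curve proposed above as future work.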
8 References
[1] Geoffrey E. Hinton, Simon Osindero and Yee-Whye Teh, A Fast Learning Algorithm for
Deep Belief Nets. Neural Computation, 18:1527-1554, 2006.
[2] J.M. Susskind, G.E. Hinton, J.R. Movellan and A.K. Anderson, Generating Facial
Expressions with Deep Belief Nets. In Affective Computing, Emotion Modelling,
Synthesis and Recognition, pages 421-440, 2009.
[3] Michael J. Lyons, Shigeru Akamatsu, Miyuki Kamachi and Jiro Gyoba, Coding Facial
Expressions with Gabor Wavelets. Proceedings, Third IEEE International Conference on
Automatic Face and Gesture Recognition, pages 200-205, April 14-16, 1998.
[4] Geoffrey E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines.
Technical Report, 2010.
[5] Honglak Lee, Chaitanya Ekanadham and Andrew Y. Ng, Sparse Deep Belief Net Model
for Visual Area V2. NIPS, 2007.