
2020 International Conference for Emerging Technology (INCET)

Belgaum, India. Jun 5-7, 2020

Facial Emotion Detection Using Deep Learning


Akriti Jaiswal, A. Krishnama Raju, Suman Deb
Department of Electronics Engineering
SVNIT Surat, India
{nainajaiswal96, krishnamraju995}@gmail.com, sumandeb@eced.svnit.ac.in

Abstract—Human emotion detection from images is one of the most powerful and challenging research tasks in social communication. Deep learning (DL) based emotion detection gives better performance than traditional image-processing methods. This paper presents the design of an artificial intelligence (AI) system capable of emotion detection through facial expressions. It discusses the procedure of emotion detection, which consists of three main steps: face detection, feature extraction, and emotion classification. The paper proposes a convolutional neural network (CNN) based deep learning architecture for emotion detection from images. The performance of the proposed method is evaluated on two datasets, the Facial Emotion Recognition Challenge (FERC-2013) dataset and the Japanese Female Facial Expression (JAFFE) dataset. The accuracies achieved with the proposed model are 70.14 and 98.65 percent for the FERC-2013 and JAFFE datasets respectively.

Index Terms—Artificial intelligence (AI), Facial emotion recognition (FER), Convolutional neural networks (CNN), Rectified linear units (ReLU), Deep learning (DL).

I. INTRODUCTION

Emotion is a mental state associated with the nervous system, involving feelings, perceptions, behavioral reactions, and a degree of gratification or displeasure [1]. One current application of artificial intelligence (AI) using neural networks is the recognition of faces in images and videos for various applications. Most techniques process visual data and search for general patterns present in human faces in images or videos. Face detection can be used for surveillance purposes by law enforcement as well as in crowd management. In this paper, we present a method for identifying seven emotions, namely anger, disgust, fear, happiness, neutrality, sadness, and surprise, from facial images. Previous research used deep-learning technology to create emotion-based models of facial expressions in order to identify emotions [2]. Typical human-computer interaction (HCI) ignores the user's emotional state and loses a great deal of information during the interaction; by comparison, emotion-sensitive HCI systems are more efficient and more desirable to users [3]. Nowadays, interest in affective computing has increased with the growing demands of leisure, commerce, physical and psychological well-being, and education-related applications. Because of this, several products based on emotionally sensitive HCI systems have been developed over the past several years, although the ultimate solution for this research field has not yet been found [4].

II. RELATED WORK

A. Human Facial Expressions

The universality of facial expressions and body language is a key feature of human interaction. Charles Darwin had already published on globally common facial expressions in the nineteenth century, noting the important role they play in nonverbal communication [4]. In 1971, Ekman and Friesen declared facial behaviors to be correlated uniformly with specific emotions [5]. Not only humans but also animals produce specific muscle movements that correspond to a certain mental state. Readers interested in research on emotion classification from speech are referred to Nicholson et al. [6].

B. Image Classification Techniques

An image classification system generally consists of a feature extraction stage followed by a classification stage. Fasel and Luettin provided an extensive overview of analytical feature extractors and neural network approaches for facial expression recognition [7]. They concluded that, at the time of writing at the beginning of the twenty-first century, both approaches worked approximately equally well. However, given the current availability of training data and computational power, the expectation is that the performance of neural-network-based models can be substantially improved. Several recent milestones are set out below.

• Krizhevsky and Hinton give a landmark publication on automatic image classification in general [8]. This work shows a deep neural network that resembles the functionality of the human visual cortex. A model to categorize objects in pictures is obtained using a self-developed labeled collection of 60,000 images over 10 classes, the CIFAR-10 dataset. Another important outcome of the research is the visualization of the filters in the network, which makes it possible to assess how the model breaks down the images.

• In 2010, the launch of the annual ImageNet challenges [9] boosted work on image classification, and since then the accompanying gigantic collection of labeled data has often been used in publications. In a later work by Krizhevsky et al. [10], a network of 5 convolutional, 3 max-pooling, and 3 fully connected layers is trained on 1.2 million high-resolution images for the ImageNet LSVRC-2010 contest.

• In particular, with regard to facial expression recognition, Lv et al. [11] present a deep belief network intended primarily for use with the JAFFE and extended Cohn-Kanade (CK+) databases. The results are comparable to the accuracy of about 95 percent obtained on the same databases by other methods, such as support vector machines (SVM) and learning vector quantization (LVQ).

• On the dataset used in this work, the Facial Expression Recognition Challenge (FERC-2013) set [2], a similarly organized deep network obtained an average accuracy of 67.02 percent on emotion classification.

III. PROPOSED MODEL

A. Emotion Detection Using Deep Learning

In this paper we use the open deep learning (DL) library Keras for facial emotion detection, applying a robust CNN to image recognition [12]. We trained our proposed network on two different datasets and evaluated its validation accuracy and validation loss. Images with facial expressions for the seven emotions are extracted from the given dataset, and the expressions are detected by means of an emotion model created by a CNN. Compared to the previous method, we have changed a few steps in the CNN and modified the CNN architecture, which gives better accuracy. We implemented emotion detection in Keras with the proposed network.

B. CNN Architecture

The networks are programmed on top of Keras, running on Python. This environment reduces the code's complexity, since only the neural layers need to be defined rather than every individual neuron. The software also provides real-time feedback on training progress and performance, and makes the trained model easy to save and reuse.

For the FERC-2013 dataset, the input is a 48*48*1 grayscale image extracted from the dataset. The network begins with an input layer of 48 by 48, matching the input data size, which is processed in parallel through two similar sub-models and then concatenated, in order to improve accuracy and capture the image features more completely, as shown in Fig. 1; this is our proposed model, Model-A. The two sub-models for CNN feature extraction share the input and have the same kernel size. The outputs of these feature-extraction sub-models are flattened into vectors, concatenated into one long feature vector, and passed to a fully connected layer for analysis before a final output layer performs the classification.

Each sub-model contains a convolutional layer with 64 filters of size 3*3, followed by a local contrast normalization layer and a max-pooling layer, then one more convolutional layer, max pooling, and a flatten layer. The two similar sub-models are then concatenated and linked to a softmax output layer which classifies the seven emotions. We use a dropout of 0.2 to reduce over-fitting; it is applied to the fully connected layer, and all layers use the rectified linear unit (ReLU) activation function.

In terms of data flow, the input image is first passed to a convolutional layer consisting of 64 filters of size 3 by 3, then through local contrast normalization, which removes the neighbourhood average from each pixel and improves the quality of the feature maps, followed by the ReLU activation function. Max pooling is used to reduce the spatial dimensions so that the processing speed increases, and a second max-pooling layer further reduces the dimensionality. Concatenation is used to capture the image features (eyes, eyebrows, lips, mouth, etc.) more completely, so that the prediction accuracy improves compared to the previous model. This is followed by a fully connected layer and a softmax layer for classifying the seven emotions. We use batch normalization, dropout, the ReLU activation function, the categorical cross-entropy loss, the Adam optimizer, and a softmax activation function in the output layer for the seven-emotion classification.
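The Model-A pipeline described above can be sketched with the Keras functional API roughly as follows. This is a minimal illustration rather than the authors' exact code: the filter count of the second convolutional layer and the size of the fully connected layer are not stated in the paper and are assumed here, and BatchNormalization is used as a stand-in for the local contrast normalization layer.

```python
# Minimal sketch of a Model-A style two-branch CNN (assumptions noted in comments).
from tensorflow.keras import layers, models, optimizers

def feature_branch(inputs):
    """One of the two similar feature-extraction sub-models sharing the input."""
    x = layers.Conv2D(64, (3, 3), padding='same')(inputs)        # 64 filters of size 3x3
    x = layers.BatchNormalization()(x)                            # stand-in for local contrast normalization
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)                            # first spatial down-sampling
    x = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)  # second conv (filter count assumed)
    x = layers.MaxPooling2D((2, 2))(x)                            # second max-pooling layer
    return layers.Flatten()(x)                                    # flatten to a feature vector

inputs = layers.Input(shape=(48, 48, 1))                          # FERC-2013 grayscale input
merged = layers.concatenate([feature_branch(inputs), feature_branch(inputs)])
x = layers.Dense(256, activation='relu')(merged)                  # fully connected layer (size assumed)
x = layers.Dropout(0.2)(x)                                        # dropout of 0.2 against over-fitting
outputs = layers.Dense(7, activation='softmax')(x)                # seven emotion classes

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```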
For the JAFFE dataset, the input image size is adjusted to 128*128*3. The network starts with an input layer of 128 by 128, matching the input data size, processed in parallel through two similar sub-models as shown in Fig. 1; the branches are then concatenated and passed through a softmax layer for emotion classification, and the rest of the procedure is the same as above.

In Model-B, previously proposed by Correa et al. [2], the network starts with a 48 by 48 input layer, which matches the size of the input data. This layer is followed by one convolutional layer, a local contrast normalization layer, and one max-pooling layer. Two more convolutional layers and one fully connected layer, connected to a softmax output layer, complete the network. Dropout is applied to the fully connected layer, and all layers contain ReLU units.
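A corresponding minimal sketch of the Model-B baseline is given below, again with assumed filter counts, dense-layer size, and dropout rate, and with BatchNormalization standing in for the local contrast normalization layer.

```python
# Minimal sketch of the Model-B baseline as described above (layer sizes assumed).
from tensorflow.keras import layers, models

model_b = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),   # first convolutional layer
    layers.BatchNormalization(),                                     # stand-in for local contrast normalization
    layers.MaxPooling2D((2, 2)),                                     # single max-pooling layer
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),    # two further convolutional layers
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),                            # fully connected layer (size assumed)
    layers.Dropout(0.5),                                             # dropout on the FC layer (rate assumed)
    layers.Dense(7, activation='softmax'),                           # seven-way emotion output
])
```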
IV. EXPERIMENT DETAILS

We develop a network based on the concepts from [12], [13] and [14] to assess the two models (Model-A and Model-B) mentioned above on their emotion detection capability. This section describes the data used for training and testing, explains the details of the datasets, and evaluates the results obtained with the two models on the two datasets.
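The face-detection step of the pipeline is not detailed in the paper; a common realization, following the OpenCV Haar-cascade tutorial referenced in [13], is sketched below. The helper name extract_face and the detection parameters are illustrative assumptions, and the crop is resized to the 48x48 grayscale format expected by the FERC-2013 model.

```python
# Sketch of a face-detection step using an OpenCV Haar cascade (cf. [13]).
import cv2
import numpy as np

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def extract_face(image_path, size=48):
    """Detect the first face in an image and return it as a (1, size, size, 1) array."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None                                    # no face found in the image
    x, y, w, h = faces[0]                              # take the first detected face
    face = cv2.resize(gray[y:y + h, x:x + w], (size, size))
    return face.astype('float32')[np.newaxis, :, :, np.newaxis] / 255.0
```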
A. Datasets

Neural networks, and particularly deep networks, need large amounts of training data. In addition, the choice of images used for training is responsible for a large part of the eventual model's performance, which means that a dataset of both high quality and sufficient quantity is needed. Several datasets are available for emotion recognition research, ranging from a few hundred high-resolution photos to tens of thousands of smaller images. The two considered in this work are the Japanese Female Facial Expression (JAFFE) dataset [15] and the Facial Expression Recognition Challenge (FERC-2013) dataset [16], both covering seven emotions: anger, surprise, happiness, sadness, disgust, fear, and neutral.

Fig. 1: Network Architecture

The datasets vary primarily in the quantity, consistency, and cleanness of the images. For example, the FERC-2013 collection has about 32,000 low-resolution images. It can also be noted that the facial expressions in JAFFE (and in the extended Cohn-Kanade set, CK+) are posed, i.e. clean, while the FERC-2013 set displays "in the wild" emotions. This makes the images from the FERC-2013 set harder to interpret, but given the large size of the dataset, this diversity can be beneficial for the robustness of a model.
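A sketch of how the FERC-2013 data can be loaded is given below. It assumes the Kaggle fer2013.csv layout (columns emotion, pixels, Usage) with 48x48 grayscale images stored as space-separated pixel strings; this file layout is an assumption, not something specified in the paper.

```python
# Sketch of loading FERC-2013, assuming the Kaggle fer2013.csv layout.
import numpy as np
import pandas as pd
from tensorflow.keras.utils import to_categorical

def load_ferc2013(csv_path='fer2013.csv'):
    df = pd.read_csv(csv_path)
    # Each row stores the 48*48 pixel values as a space-separated string.
    pixels = np.stack([np.array(p.split(), dtype='float32') for p in df['pixels']])
    images = pixels.reshape(-1, 48, 48, 1) / 255.0                   # normalize to [0, 1]
    labels = to_categorical(df['emotion'].values, num_classes=7)     # seven emotion classes
    return images, labels, df['Usage'].values

images, labels, usage = load_ferc2013()
x_train, y_train = images[usage == 'Training'], labels[usage == 'Training']
x_val, y_val = images[usage != 'Training'], labels[usage != 'Training']
```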
B. Training Details

We train the network on a GPU for 100 epochs to ensure that the accuracy converges to the optimum. The network is trained on a larger set than the one previously described in an attempt to improve the model further: training takes place with 20,000 pictures from the FERC-2013 dataset instead of 9,000 pictures. The FERC-2013 database also provides newly designed verification (2,000 images) and sample (1,000 images) sets, which determine the number of images per emotion in the final testing and validation sets after training and testing the model. The accuracy is higher on all validation and test sets than in previous runs, emphasizing that more data improves the performance of emotion detection with deep convolutional neural networks.
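The training step itself reduces to a single fit call in Keras; a sketch is shown below, reusing the model and data from the earlier sketches. The batch size is not reported in the paper and is an assumption.

```python
# Sketch of the training step (batch size assumed; model and data from the sketches above).
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=100,                 # trained for 100 epochs as described above
    batch_size=64,              # assumption: the batch size is not stated in the paper
    verbose=2,
)
print('best validation accuracy: %.4f' % max(history.history['val_accuracy']))
```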
C. Results Using the Proposed Model

Emotion detection is carried out in three steps, i.e., face detection, feature extraction, and emotion classification using deep learning with our proposed model, which gives better results than the previous model. With the proposed method the computation time is reduced, the validation accuracy increases, and the loss decreases; a further performance evaluation compares our model with the previously existing model. We tested our neural network architectures on the FERC-2013 and JAFFE databases, which contain seven primary emotions: sadness, fear, happiness, anger, neutral, surprise, and disgust.
Fig. 2 shows the proportions of detected emotions for a single image from the FER dataset. Fig. 2(a) shows the image, and the detected emotion proportions are shown in Fig. 2(b). It is clearly observable that neutral has a higher proportion than the other emotions, which means that the emotion detected for the image in Fig. 2(a) is neutral. Similarly, Fig. 3 shows another image and the corresponding emotion proportions. From Fig. 3(b) it is observable that the happy emotion has a higher proportion than the others, which suggests that the image in Fig. 3(a) is detected as happy.
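The emotion proportions plotted in Fig. 2(b) and Fig. 3(b) correspond to the softmax output of the network for a single image. A sketch of how they can be obtained, reusing the model and the hypothetical extract_face helper from the earlier sketches:

```python
# Sketch of producing per-emotion proportions for one image from the softmax output.
EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'sad', 'surprise', 'neutral']  # label order assumed

face = extract_face('test_image.jpg')         # hypothetical test image, preprocessed as above
proportions = model.predict(face)[0]          # seven softmax probabilities summing to 1
for name, p in zip(EMOTIONS, proportions):
    print('%-8s %.2f' % (name, p))
print('detected emotion:', EMOTIONS[int(proportions.argmax())])
```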

Fig. 2: (a) Image, (b) Proportion of emotions.

Fig. 3: (a) Image, (b) Proportion of emotions.

Performance is evaluated in the same way for all the test images of the dataset. We achieved an accuracy of 95 percent for happy, 75 percent for neutral, 69 percent for sad, 68 percent for surprise, 63 percent for disgust, 65 percent for fear, and 56 percent for angry. On average we obtain an accuracy of 70.14 percent with the proposed model.

The confusion matrix of the classification accuracy is shown in TABLE I. We obtain an average validation accuracy of 70.14 percent with the proposed model for facial emotion detection on the FER dataset.
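A confusion matrix such as TABLE I can be derived from the validation predictions; a sketch using scikit-learn, with rows normalized to percentages, is shown below (the variables come from the earlier sketches).

```python
# Sketch of building a row-normalized confusion matrix like TABLE I.
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(x_val).argmax(axis=1)
y_true = y_val.argmax(axis=1)
cm = confusion_matrix(y_true, y_pred)
cm_percent = 100.0 * cm / cm.sum(axis=1, keepdims=True)   # percentage of each true class
print(np.round(cm_percent, 1))
```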

Fig. 4: Prediction of emotion.


Fig. 4 shows the result for a test sample corresponding to the surprise emotion from the JAFFE dataset; our proposed model predicted the same emotion with reduced computation time compared to the previously existing Model-B. Performance is evaluated in the same way for all the test samples of the JAFFE dataset. On the JAFFE dataset we obtain a validation accuracy of 98.65 percent, which is better than the previous result, and the model takes less computation time per step.

V. PERFORMANCE EVALUATION

On the FER dataset we train on 32,298 samples and validate on 3,589 samples, and on the JAFFE dataset we train on 833 samples and validate on 148 samples, in order to calculate the validation accuracy, validation loss, and computation time per step for up to 100 and 50 epochs respectively, as shown in TABLE II.

TABLE I: Confusion matrix (%) for emotion detection using the proposed model

Emotions   Angry  Sad  Happy  Disgust  Fear  Neutral  Surprise
Angry        56    12     3       9      8      11        1
Sad          10    69     2       6      9       2        2
Happy         0     0    95       0      0       3        2
Disgust       7    13     0      63      8       5        4
Fear          9     8     3       2     65      10        3
Neutral       2     1     8       1      7      75        6
Surprise      7     3    11       0      3       8       68

Average accuracy = 70.14 %

TABLE II: Qualitative assessment of the proposed model for emotion detection

Model (Dataset)                 Validation accuracy (%)   Validation loss   Computation time per step (ms)
Proposed Model A (FERC-2013)            70.14                  1.7577                    16
Model B (FERC-2013)                     67.02                  2.0389                    45
Proposed Model A (JAFFE)                98.65                  0.1694                   284
Model B (JAFFE)                         97.97                  0.1426                   462

The aim of the training step is to determine the correct configuration parameters for the neural network, which are: the number of nodes in the hidden layer (HL), the learning rate (LR), the momentum (Mom), and the number of epochs (Ep). Different combinations of these parameters have been tested to find out how to achieve the best recognition rate.
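The configuration search described above can be organized as a simple grid over the listed parameters. The sketch below is only illustrative: it uses a simplified single-branch model for brevity, assumed grid values, and SGD with momentum rather than the Adam optimizer used for the final model, since a momentum parameter is among the searched quantities.

```python
# Illustrative grid search over HL, LR, Mom and Ep (values and model simplified, not from the paper).
import itertools
from tensorflow.keras import layers, models, optimizers

def build_model(hidden_units):
    return models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(hidden_units, activation='relu'),    # hidden-layer size (HL)
        layers.Dense(7, activation='softmax'),
    ])

best = (0.0, None)
for hl, lr, mom, ep in itertools.product([128, 256], [0.01, 0.001], [0.8, 0.9], [50, 100]):
    m = build_model(hl)
    m.compile(optimizer=optimizers.SGD(learning_rate=lr, momentum=mom),
              loss='categorical_crossentropy', metrics=['accuracy'])
    hist = m.fit(x_train, y_train, validation_data=(x_val, y_val),
                 epochs=ep, batch_size=64, verbose=0)
    acc = max(hist.history['val_accuracy'])
    if acc > best[0]:
        best = (acc, (hl, lr, mom, ep))       # keep the best-performing configuration
print('best configuration:', best)
```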
From TABLE II it is observed that our proposed model achieves 70.14% average accuracy, compared to the 67.02% average accuracy reported for Model-B on the FER dataset. For the JAFFE database we achieve an average accuracy of 98.65%, which is also higher than that of Model-B.

VI. CONCLUSION

In this paper, we have proposed a deep learning based facial emotion detection method for images. We discuss our proposed model using two different datasets, JAFFE and FERC-2013. The performance evaluation of the proposed facial emotion detection model is carried out in terms of validation accuracy, computational complexity, detection rate, learning rate, validation loss, and computation time per step. We analyzed our proposed model on training and test sample images and compared its performance with the previously existing model. The experimental results show that the proposed model detects emotions better than previous models reported in the literature and produces state-of-the-art results on both datasets.

REFERENCES

[1] S. Li and W. Deng, "Deep facial expression recognition: A survey," arXiv preprint arXiv:1804.08348, 2018.
[2] E. Correa, A. Jonker, M. Ozo, and R. Stolk, "Emotion recognition using deep convolutional neural networks," Tech. Report IN4015, 2016.
[3] Y. I. Tian, T. Kanade, and J. F. Cohn, "Recognizing action units for facial expression analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97–115, 2001.
[4] C. R. Darwin, The Expression of the Emotions in Man and Animals. London: John Murray, 1872.
[5] P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion," Journal of Personality and Social Psychology, vol. 17, no. 2, p. 124, 1971.
[6] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Computing & Applications, vol. 9, no. 4, pp. 290–296, 2000.
[7] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognition, vol. 36, no. 1, pp. 259–275, 2003.
[8] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical report, 2009.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[11] Y. Lv, Z. Feng, and C. Xu, "Facial expression recognition via deep learning," in Proc. International Conference on Smart Computing (SMARTCOMP), 2014, pp. 303–308.
[12] TFlearn, "TFlearn: Deep learning library featuring a higher-level API for TensorFlow." URL: http://tflearn.org/.
[13] Open Source Computer Vision, "Face detection using Haar cascades." URL: http://docs.opencv.org/master/d7/d8b/tutorialpyfacedetection.html.
[14] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[15] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 94–101.
[16] Kaggle, "Challenges in representation learning: Facial expression recognition challenge," 2013.
