Paper - Deep - Learning - Emotion Detection PDF
Abstract—In recent years, the importance of emotion recognition through facial expressions has been increasing in different fields such as Affective Computing, Social Robotics, and Human-Computer Interaction (HCI), due to the need for these systems to interact naturally with the user and to increase engagement. In addition, the growth of Deep Learning architectures for image classification based on Convolutional Neural Networks (CNNs) makes it possible to implement systems with high performance. However, not all CNNs are feasible to implement in real-time systems or robotic platforms, which usually use embedded systems with limited computational power.

The purpose of this paper is to review three (03) different CNN architectures for emotion recognition based on facial expressions, in order to evaluate their performance in relation to the complexity of the model, in other words the number of parameters; the architectures are studied on different datasets.

Index Terms—Convolutional Neural Network, Deep Learning, Emotion Classification, Facial Expressions.

I. INTRODUCTION

Picard [1] mentions that “affect is a natural and social part of human communication; therefore, people naturally use it when interacting with computers”. For that reason, the author in [2] states that HCI efforts should pay attention to the emotion of the user. A study conducted with children [3] demonstrated that the incorporation of an emotion module increases the usability of the software. However, the applications of emotion recognition are not limited to HCI: its use for biometric surveillance is proposed in [4], for example for tasks that involve high stress levels, such as piloting airplanes or operating nuclear plants. In addition, [5] reviews how this field is widely applied in different areas of robotics such as social service robots, bioengineering, and general-purpose robots.

The paper’s objective is to compare three (03) CNN architectures commonly used for classification tasks and to evaluate their performance on the emotion recognition task, which is based on the six (06) basic emotions of Ekman’s model. The CNNs are tested on different emotion datasets that were adapted to include only Ekman’s emotions. A comparison of model complexity and multi-class classification metrics such as average recall, precision, and F1-score is discussed in order to evaluate the feasibility of applying the algorithms in real-time systems such as social robots.

The article is structured as follows: in Section II, a brief description of emotion classification models using CNNs is presented, along with some studies of deep learning algorithms implemented on systems with small computational power. In Section III, the CNN architectures and datasets used for the experiments are described, as well as the emotion model and the metrics used to evaluate the models. In Section IV, the results of the emotion classification models are presented. In Section V, the performance of each algorithm is compared and an error analysis is discussed. Finally, in Section VI, the conclusions and the proposed future work are presented.

II. RELATED WORK

In this section, a review of facial emotion recognition (FER) is presented, together with some studies about the performance of deep learning algorithms on small computational systems. According to [6], a conventional way to perform FER is based on three (03) steps: (1) face detection and landmark detection, (2) feature extraction, and (3) expression classification, usually with a Support Vector Machine (SVM), Random Forest, or AdaBoost. In contrast, the use of Deep Learning, especially CNNs for frame-based approaches, has been rising. CNNs are algorithms that allow the understanding of image content [7] and perform well in image segmentation, classification, and detection.

An example of the use of CNNs for emotion recognition is shown in [8], where two (02) CNNs were used: one for background removal and the other for classification. Between these models a face vector was created; the model was tested on an Intel i5 with 8 GB of RAM and a 512 GB SSD. Giannopoulos et al. [9] tested two (02) different CNN architectures, AlexNet and GoogLeNet, using the Facial Emotion Recognition 2013 (FER-2013) dataset created by Pierre-Luc Carrier [10] and based on Ekman’s emotions. The results showed that GoogLeNet performed slightly better than AlexNet in the initial stages of the training process. Three (03) different experiments were conducted: (a) emotional content (binary), (b) only Ekman’s emotions, and (c) Ekman’s emotions + neutral. Dhankhar [11] applied the VGG16 and ResNet50 architectures, and an ensemble of both, to classify the emotions of two (02) datasets. The ensemble model consisted of concatenating the feature vectors and feeding them to a Logistic Regression classifier for each emotion; the ensemble model outperformed the baseline, which is based on an SVM. The previous models have a large number of parameters; Arriaga et al. [12] presented a model called mini-Xception, which is based on residual modules and depth-wise separable convolutions. These techniques reduce the number of parameters of the CNN while still achieving an accuracy of 66% on the FER-2013 dataset.
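To see why depth-wise separable convolutions shrink a model, one can compare the weight count of a standard convolution (k·k·C_in·C_out, ignoring biases) with its depth-wise separable counterpart (k·k·C_in for the per-channel spatial step plus C_in·C_out for the 1x1 point-wise step). The layer sizes below (3x3 kernel, 64 to 128 channels) are illustrative only, not taken from mini-Xception:

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: one k x k x c_in kernel per output channel.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depth-wise step: one k x k kernel per input channel,
    # then a 1x1 point-wise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 64, 128)             # 73,728 weights
separable = separable_conv_params(3, 64, 128)  # 576 + 8,192 = 8,768 weights
print(standard, separable, round(standard / separable, 1))  # roughly 8x fewer
```

Stacking many such layers is what lets architectures like mini-Xception stay small enough for embedded hardware.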
Even though there are different CNN architectures for emotion recognition, most of them have a large number of parameters. Real-time systems rarely exceed 10-60 frames per second (fps), but some tracking applications such as autonomous cars or service robots need higher rates [13]; in contrast, typical real-time systems used in robotics only work at 10-12 fps, which is equal to the human processing capacity [14]. Different tests have been conducted to evaluate the performance of deep learning algorithms on small computers or embedded systems, which are usually used for real-time systems. For example, in [15] the inference capacity of the ARM big.LITTLE architecture, which is used in commercial devices, was tested for different types of CNNs (AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet). Also, a framework to optimize the partitioning of work across the cores was implemented on the HiKey 970 platform.

The commercial embedded system Raspberry Pi 3 Model B was tested in [16] using different frameworks (Caffe, TensorFlow, OpenCV, Caffe2) and CNN architectures (GoogLeNet, SqueezeNet, MobileNet, Network in Network). A metric called Figure of Merit, which relates accuracy, fps, and power consumption, was defined; the results showed that the model with the fewest parameters (SqueezeNet) has the best performance.

More specialized platforms developed by NVIDIA have also been evaluated. Canziani et al. [17] used a Jetson TX1 to evaluate different CNNs such as ResNet, VGG, Inception, AlexNet, and a custom Efficient Network (ENet) on the ImageNet challenge. Metrics such as number of operations, power consumption, accuracy, and number of parameters were evaluated, and the results show that ENet has the highest accuracy per million parameters. The NVIDIA TX2 platform was used in [18] for training and inference of deep learning models such as LSTMs, CNNs, DCGANs, and Deep Reinforcement Learning, all implemented using TensorFlow. The experiments on the TX2 showed that a CNN like ResNet50 with a batch size of 64 processes more than 50 samples per second, and XceptionNet with the same configuration processes almost 125 samples per second.

In conclusion, there exist different works that evaluate CNN performance on classification tasks such as the ImageNet challenge, mostly with architectures with a large number of parameters. For this reason, a performance comparison between a small architecture such as mini-Xception and larger ones such as VGG and ResNet is proposed for the specific task of emotion classification.

III. EXPERIMENT DESIGN

In this article, the facial emotion recognition task was evaluated using Ekman’s model [19], which states that the basic emotions are six (06): anger, disgust/contempt, fear, happiness, sadness, and surprise. First, we collected and modified three (03) datasets that include the selected emotions; these are composed of facial images, each annotated with the corresponding emotion (e.g., Fig. 1). Then, this information was used to train three (03) CNN architectures and to compare the results using multi-class metrics such as macro-averaged accuracy and the confusion matrix. Additionally, a metric to evaluate the relation between model complexity and performance was defined.

Fig. 1: Examples of the six basic emotions in the FER-2013 dataset.

A. Datasets

As mentioned before, three (03) datasets were selected and some of them were modified in order to keep only the basic emotions. All the datasets were split into three (03) sets with the proportions 80-10-10 for training, validation, and test, respectively.
1) Karolinska Directed Emotional Faces (KDEF): this dataset contains a total of 4,900 pictures of human facial expressions. The set of pictures contains 70 individuals displaying 7 different emotional expressions, each viewed from 5 different angles [20].
2) FER-2013: this dataset consists of 35,685 examples of 48x48-pixel grayscale images of faces displaying 7 emotional expressions [10].
3) AffectNet: this dataset consists of approximately 440K images distributed in 11 classes that include the basic emotions [21]. The dataset is unbalanced, as some classes have more samples than others; in order to balance it and use only the basic emotions, random sampling was performed. After this process, the dataset was reduced to approximately 50K samples.

B. Models

In the experiment, the concept of transfer learning was applied to two (02) popular CNN architectures (VGG16 and ResNet50). Pre-trained weights of these models were used to create a feature vector with a length of 18,432 for each image. The vectors were used to train a simple neural network of two
(02) fully connected layers with 256 neurons and a dropout of 30%. In addition, a third CNN model called mini-Xception was trained. A brief description of the models follows:
1) VGG16: a model developed by the Visual Geometry Group (VGG) [22]. It is composed of convolutional layers, max-pooling layers, and fully connected layers, for a total of 16 layers arranged in 5 blocks, each block ending with a max-pooling layer.
2) ResNet50: a lighter version of the very deep Microsoft ResNet CNN, which has 152 layers [23]. As the name suggests, ResNet50 contains only 50 layers.
3) mini-Xception: the model developed in [12]; in order to reduce computational costs, it uses residual blocks and depth-wise separable convolutions. Also, instead of fully connected layers, it uses global average pooling over the number of classes.

Before the training, the images were re-sized to 200x200 pixels in order to make the dimensions uniform for VGG16 and ResNet50. In the case of the mini-Xception model, the shape 48x48 was selected in order to reduce the training time.

The overall process of the experiment (data collection, image re-sizing, training, validation, and evaluation) described above is shown in Fig. 2.

Fig. 2: Overall process for training the emotion classification models using CNNs.

IV. EXPERIMENTAL TESTS AND RESULTS

The previous models were implemented using TensorFlow 2.0 and the Keras library in Python. In order to use the pre-trained models of the libraries and apply transfer learning, a folder structure for each dataset was created.

The models were trained for only 20 epochs with a batch size of 128 due to computational limitations; the Adam optimizer was used. In addition, the learning rate was reduced during training using the ReduceLROnPlateau callback. This callback reduces the learning rate by a factor when a monitored variable has not improved in a defined number of epochs.

Because emotion classification is a multi-class task and each class is represented by an integer, the loss function and the evaluation metric were sparse categorical cross-entropy and sparse categorical accuracy, respectively. In the following paragraphs, the results of the training process and the metric evaluation are presented: