
Comparison of CNN Architectures for Emotion Classification for use on Real-Time Systems


Jose Balbuena, Roger Machacca, and Mariela Espinoza

Graduate School, Pontificia Universidad Católica del Perú


San Miguel - Lima, Peru
Email: {jose.balbuena, a20204997, a20204521}@pucp.edu.pe

Abstract—In recent years, the importance of emotion recognition through facial expressions has been increasing in different fields such as Affective Computing, Social Robotics, and Human-Computer Interaction (HCI), due to the need of these systems to interact naturally with the user and to increase engagement. In addition, the growth of different Deep Learning architectures for image classification based on Convolutional Neural Networks (CNNs) allows high-performance systems to be implemented. However, not all CNNs are feasible to implement in real-time systems or robotic platforms, which usually use embedded systems with limited computational power.
The purpose of this paper is to review three (03) different CNN architectures for emotion recognition based on facial expressions, in order to evaluate their performance in relation to the complexity of the model, in other words the number of parameters; the architectures are studied on different datasets.

Index Terms—Convolutional Neural Network, Deep Learning, Emotion Classification, Facial Expressions.

I. INTRODUCTION

Picard [1] mentions that "Affect is a natural and social part of the human communication; therefore, people naturally use it when interacting with computers". For that reason, the author in [2] states that HCI efforts should pay attention to the emotion of the user. A study conducted with children [3] demonstrated that the incorporation of an emotion module increases the usability of the software. But the applications of emotion recognition are not limited to HCI; the use of emotion recognition for biometric surveillance is discussed in [4], for example for tasks that involve high stress levels, such as flying airplanes or operating nuclear plants. In addition, [5] reviews how this field is widely applied in different areas of robotics, such as social service robots, bioengineering, and general-purpose robots.
The paper's objective is to compare three (03) CNN architectures commonly used for classification tasks and to evaluate their performance on the emotion recognition task, which is based on the six (06) basic emotions of Ekman's model. The CNNs are tested on different emotion datasets that were adapted to include only Ekman's emotions. A comparison of the complexity of the models and multi-class classification metrics such as macro-averaged recall, precision, and F1-score is discussed in order to evaluate the feasibility of applying the algorithms in real-time systems such as social robots.
The article is structured as follows: in Section II, a brief description of emotion classification models using CNNs is presented, together with some studies on the implementation of deep learning algorithms in systems with small computational power. In Section III, the CNN architectures and datasets used for the experimentation are described, as well as the emotion model and the evaluation metrics used to evaluate the models. In Section IV, the results of the emotion classification models are presented. In Section V, the performance of each algorithm is compared and an error analysis is discussed. Finally, in Section VI, conclusions and proposed future work are presented.

II. RELATED WORK

In this section, a review of facial emotion recognition (FER) is presented, together with some studies about the performance of deep learning algorithms on systems with small computational power. According to [6], a conventional way to perform FER is based on three (03) steps: (1) face detection and landmark detection, (2) feature extraction, and (3) expression classification, usually using Support Vector Machines (SVM), Random Forests, or AdaBoost. In contrast, the use of Deep Learning, especially CNNs, for frame-based approaches has been rising. CNNs are algorithms that allow the understanding of image content [7] and perform well in image segmentation, classification, and detection.
An example of the use of CNNs for emotion recognition is shown in [8], where two (02) CNNs were used: one for background removal and the other for classification. Between these models a face vector was created; the model was tested on an Intel i5 with 8 GB RAM and a 512 GB SSD. Giannopoulos et al. [9] tested two (02) different CNN architectures, AlexNet and GoogLeNet, on the Facial Emotion Recognition 2013 (FER-2013) dataset created by Pierre-Luc Carrier [10] and based on Ekman's emotions. The results showed that GoogLeNet performed slightly better than AlexNet in the initial stages of the training process; three (03) different experiments were conducted: (a) emotional content (binary), (b) only Ekman's emotions, and (c) Ekman's emotions + neutral. Dhankhar [11] applied the VGG16 and ResNet50 architectures, as well as an ensemble of both, to classify the emotions of two (02) datasets. The ensemble model consisted of concatenating the feature vectors and using them to feed a Logistic Regression for each emotion.
The resulting ensemble model outperforms the baseline, which is based on SVM. The previous models have a large number of parameters; Arriaga et al. [12] presented a model called mini-Xception, which is based on residual modules and depth-wise separable convolutions. These techniques allowed the number of parameters of the CNN to be reduced while still achieving an accuracy of 66% on the FER-2013 dataset.
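As an illustration of those parameter-saving techniques, the following is a minimal sketch of a residual module built from depth-wise separable convolutions, in the spirit of the blocks stacked by mini-Xception; the layer sizes and ordering here are illustrative assumptions, not the exact configuration from [12].

```python
from tensorflow.keras import layers

def residual_separable_block(x, filters):
    """Residual module with depth-wise separable convolutions
    (illustrative sizes, not the exact mini-Xception config)."""
    # Shortcut path: strided 1x1 convolution to match the output shape.
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same",
                             use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main path: two depth-wise separable convolutions, then downsampling.
    y = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPooling2D(3, strides=2, padding="same")(y)

    return layers.Add()([shortcut, y])  # residual connection
```

A SeparableConv2D layer factorizes a k x k convolution into a depth-wise spatial filter plus a 1x1 point-wise projection, which is where most of the parameter reduction comes from.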
Even though different CNN architectures exist for emotion recognition, most of them have a large number of parameters. Real-time systems rarely exceed 10-60 frames per second (fps), but some tracking applications such as autonomous cars or service robots need higher rates [13]; in contrast, typical real-time systems used in robotics work at only 10-12 fps, which is comparable to the human processing capacity [14]. Different tests have been conducted to evaluate the performance of deep learning algorithms on small computers or embedded systems, which are usually used in real-time systems. For example, in [15] the inference capacity of the ARM big.LITTLE, which is used in commercial devices, was tested for different types of CNNs (AlexNet, GoogLeNet, MobileNet, ResNet50, SqueezeNet); also, a framework to optimize the partition of the workload across the cores was implemented on the HiKey 970 platform.
The commercial embedded system Raspberry Pi 3 Model B was tested in [16] using different frameworks (Caffe, TensorFlow, OpenCV, Caffe2) and CNN architectures (GoogLeNet, SqueezeNet, MobileNet, Network in Network). A metric called Figure of Merit, which relates accuracy, fps, and power consumption, was defined; the results showed that the model with the fewest parameters (SqueezeNet) had the best performance.
More specialized platforms developed by NVIDIA have also been evaluated; Canziani et al. [17] used a Jetson TX1 to evaluate different CNNs such as ResNet, VGG, Inception, AlexNet, and a custom Efficient Network (ENet) on the ImageNet challenge. Metrics such as the number of operations, power consumption, accuracy, and number of parameters were evaluated, and the results show that ENet has the highest accuracy per million parameters (accuracy/M-parameters). The NVIDIA TX2 platform was used in [18] for training and inference of deep learning models such as LSTM, CNN, DCGAN, and Deep Reinforcement Learning, all implemented using TensorFlow. The experiments on the TX2 showed that a CNN like ResNet50 with a batch size of 64 processes more than 50 samples per second, and XceptionNet with the same configuration almost 125 samples per second.
In conclusion, there exist different works that evaluate the performance of CNNs on classification tasks such as the ImageNet challenge, mostly using architectures with a large number of parameters. Due to this, a performance comparison between a small architecture such as mini-Xception and larger ones such as VGG and ResNet is proposed for the specific task of emotion classification.

III. EXPERIMENT DESIGN

In this article, the facial emotion recognition task was evaluated using Ekman's model [19], which states that the basic emotions are six (06): anger, disgust/contempt, fear, happy, sad, and surprise. First, we collected and modified three (03) datasets that include the selected emotions; these are composed of facial images with a corresponding annotated emotion (e.g., Fig. 1). Then, this information was used to train three (03) CNN architectures and to compare the results using multi-class metrics such as macro-averaged accuracy and the confusion matrix. Additionally, a metric to evaluate the relation between model complexity and performance was defined.

Fig. 1: Examples of the six basic emotions in the FER-2013 dataset.

A. Datasets

As mentioned before, three (03) datasets were selected and some of them were modified in order to keep only the basic emotions. All the datasets were split into three (03) sets with an 80-10-10 proportion for training, validation, and test, respectively; a minimal sketch of this split is shown after the list.

1) Karolinska Directed Emotional Faces (KDEF): this dataset contains a total of 4900 pictures of human facial expressions. The set of pictures shows 70 individuals displaying 7 different emotional expressions, each expression viewed from 5 different angles [20].
2) FER-2013: this dataset consists of 35,685 examples of 48x48-pixel grayscale images of faces, displaying 7 emotional expressions [10].
3) AffectNet: this dataset consists of ~440K images distributed in 11 classes that include the basic emotions [21]. The dataset is unbalanced, with some classes having more samples than others; in order to balance it and use only the basic emotions, random sampling was performed. After this process, it was reduced to approximately ~50K samples.
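The sketch below shows one common way to obtain the 80-10-10 proportion with scikit-learn, on placeholder data; the actual split implementation is not described in the text, so this is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for one of the datasets:
# 1000 fake sample indices with six emotion labels.
samples = np.arange(1000)
labels = np.random.randint(0, 6, size=1000)

# First hold out 20%, then split that 20% in half,
# giving the 80-10-10 train/validation/test proportion.
x_train, x_rest, y_train, y_rest = train_test_split(
    samples, labels, test_size=0.20, stratify=labels, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```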
B. Models

In the experiment, the concept of transfer learning was applied to two (02) popular CNN architectures (VGG16 and ResNet50). Pre-trained weights of these models were used to create a feature vector with a length of 18,432 for each image. The vectors were used to train a simple neural network of two (02) fully connected layers with 256 neurons and a dropout of 30%.
In addition, a third CNN model, called mini-Xception, was trained. A brief description of the models is presented below; a minimal sketch of the transfer-learning setup follows the list:

1) VGG16: a model developed by the Visual Geometry Group (VGG) [22]. It is composed of convolutional layers, max-pooling layers, and fully connected layers, for a total of 16 layers arranged in 5 blocks, each block ending with a max-pooling layer.
2) ResNet50: a lighter version of the very deep Microsoft ResNet CNN, which has 152 layers [23]. As the name suggests, ResNet50 contains only 50 layers.
3) mini-Xception: the model developed in [12]; in order to reduce computational costs it uses residual blocks and depth-wise separable convolutions. Also, instead of fully connected layers at the top, it uses global average pooling with the number of classes.

Before training, the images were re-sized to 200x200 pixels in order to uniform the input dimensions for VGG16 and ResNet50. In the case of the mini-Xception model, a 48x48 shape was selected in order to reduce the training time.
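The following is a minimal sketch of the transfer-learning setup described above, using the Keras pre-trained VGG16 as a frozen feature extractor and the two-layer top with 256 neurons and 30% dropout. For 200x200 inputs, VGG16's convolutional output flattens to exactly 6x6x512 = 18,432 features, matching the vector length quoted earlier; the corresponding ResNet50 head is assumed to be built analogously.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Frozen VGG16 base; for 200x200 inputs the convolutional output is
# 6x6x512, which flattens to the 18,432-length feature vector.
base = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", input_shape=(200, 200, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),                       # 18,432 features per image
    layers.Dense(256, activation="relu"),   # first fully connected layer
    layers.Dropout(0.3),                    # 30% dropout
    layers.Dense(256, activation="relu"),   # second fully connected layer
    layers.Dropout(0.3),
    layers.Dense(6, activation="softmax"),  # six basic emotions
])
```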

C. Evaluation Metrics

Finally, in order to evaluate the models, some classical metrics were used, such as accuracy, recall, and F1-score; but in order to use them in a multi-class scenario, a macro-averaged approach was selected. Under this approach, each metric is first evaluated as binary for each class and then averaged over the classes; this allows the performance of the model to be evaluated across all the classes.
Also, a multi-class confusion matrix is used in order to understand which classes are predicted wrongly by the models. Finally, in order to evaluate the performance for real-time systems, the metric MAoP, which relates the number of parameters to the accuracy, is defined as follows:

    MAoP = Macro-averaged Accuracy / Millions of parameters        (1)
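As a concrete reading of Eq. (1), the sketch below computes the macro-averaged accuracy from a confusion matrix (each class treated as a binary one-vs-rest problem) and divides it by the parameter count in millions; the small confusion matrix is invented purely for illustration.

```python
import numpy as np

def macro_averaged_accuracy(cm):
    """Average of per-class binary accuracies (TP + TN) / total,
    treating each class as a one-vs-rest problem."""
    total = cm.sum()
    accs = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp
        fn = cm[k, :].sum() - tp
        tn = total - tp - fp - fn
        accs.append((tp + tn) / total)
    return float(np.mean(accs))

def maop(cm, n_parameters):
    """Eq. (1): macro-averaged accuracy over millions of parameters."""
    return macro_averaged_accuracy(cm) / (n_parameters / 1e6)

# Invented 3-class confusion matrix, just to exercise the functions.
cm = np.array([[50, 3, 2],
               [4, 45, 6],
               [1, 5, 49]])
print(maop(cm, 57_414))  # e.g., the miniXception parameter count
```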

The overall process of the experiment (data collection, image re-sizing, training, validation, and evaluation) described above is shown in Fig. 2.

Fig. 2: Overall process for training the emotion classification models using CNNs

IV. EXPERIMENTAL TESTS AND RESULTS

The previous models were implemented using TensorFlow 2.0 and the Keras library in Python. In order to use the libraries' pre-trained models and apply transfer learning, a folder structure for each dataset was created.
The models were trained for only 20 epochs with a batch size of 128, due to computational limitations; the Adam optimizer was used. In addition, the learning rate was reduced during training using the ReduceLROnPlateau callback, which lowers the learning rate by a factor when a monitored value has not improved for a defined number of epochs.
Because emotion classification is a multi-class task and each class is represented by a number, the loss function and the evaluation metric were sparse categorical cross-entropy and sparse categorical accuracy, respectively; this configuration is sketched below. In the following subsections, the results of the training process and the metric evaluation are presented.
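A minimal sketch of that training configuration, assuming a Keras model and the dataset splits already exist; the monitored quantity, factor, and patience of the callback are illustrative, since the text does not specify them.

```python
import tensorflow as tf

# Learning-rate schedule described in the text: reduce the rate by a
# factor when the monitored value plateaus (values here are assumptions).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"])

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20, batch_size=128,   # limits reported in the text
    callbacks=[reduce_lr])
```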
A. FER-2013

The accuracy and loss of the three models on the FER-2013 set are shown in Fig. 3. The results reported correspond to classification accuracy: the highest accuracy was 87.68% for ResNet50, the lowest was 86.39% for VGG16, and the average was 87.07%. The scores of each metric for each model are shown in Table I. In all cases ResNet50 has the highest scores, which means that ResNet50's predictions are better than those of miniXception and VGG16 for the FER-2013 dataset. However, with regard to MAoP, the miniXception model has the best performance (15.178) compared to the other two models.

Fig. 3: Loss and Accuracy during the training - FER2013

In Fig. 4, the confusion matrices of the six-emotion classification (happy, sad, surprise, fear, disgust, anger) are shown for the three models. It was observed that, for all three models, the emotion 'disgust-contempt' was not predicted correctly and most of its samples were classified as 'anger'. The number of samples for this class is also small in comparison to the other classes, which could be the source of error in this dataset. In addition, the best-predicted emotions were 'happy' and 'surprise' in all three (03) models, as the majority of their samples were correctly predicted.

B. KDEF

The accuracy and loss of the three models on the KDEF set are reported in Fig. 5. The results correspond to classification accuracy: the highest accuracy was 89.60% for ResNet50, and the average was 88.71%. The score of each metric for each model is shown in Table II. In all cases ResNet50 has the highest score, which means that ResNet50's predictions are still better than those of VGG16 and miniXception for the KDEF dataset. But in terms of MAoP, miniXception has better performance (15.58) than the other two models.
In Fig. 6 we provide the confusion matrices of the three models. It can be observed that most samples were classified correctly; a possible reason is that KDEF was built in a controlled environment and with a balanced number of images per emotion. The best-classified emotions for KDEF were 'happy', 'surprise', and 'anger'.

TABLE I: FER-2013 Metrics

                              ResNet50     VGG16        MiniXception
Parameters (Body)             23,587,712   14,714,688   57,414
Parameters (Top)              4,786,182    4,786,182    0
Total Parameters (Millions)   28.37        19.50        0.06
Macro-Average Accuracy        87.68%       86.39%       87.15%
Macro-Average Precision       61.45%       67.25%       59.14%
Macro-Average Recall          51.24%       49.67%       54.31%
Macro-Average F1-Score        61.00%       50.06%       55.61%
MAoP                          0.03         0.04         15.178

Fig. 4: Confusion Matrix - FER2013 Dataset: (a) ResNet50. (b) VGG16. (c) miniXception
Fig. 5: Loss and Accuracy during the training - KDEF

C. AffectNet

As mentioned in the previous section, the AffectNet dataset was modified by eliminating the emotions that are not included in Ekman's emotional model; also, in order to balance the samples, the same number of examples was selected randomly for each emotion. A minimal sketch of this balancing step is shown below.
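The sketch below illustrates that per-class random under-sampling with pandas, on a placeholder annotation table; the column names, file names, and label strings are invented for illustration.

```python
import pandas as pd

# Placeholder frame standing in for the AffectNet annotations.
df = pd.DataFrame({
    "path": [f"img_{i}.jpg" for i in range(12)],
    "emotion": ["happy"] * 6 + ["sad"] * 4 + ["fear"] * 2,
})

# Keep only Ekman's emotions, then draw the same number of samples
# per class (the size of the smallest class) at random.
ekman = {"anger", "disgust", "fear", "happy", "sad", "surprise"}
df = df[df["emotion"].isin(ekman)]
n = df["emotion"].value_counts().min()
balanced = (df.groupby("emotion", group_keys=False)
              .apply(lambda g: g.sample(n, random_state=42)))
```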
The loss and accuracy values during the training of the models on AffectNet are shown in Fig. 7. It can be observed that, in the case of the VGG16 model, overfitting starts at approximately epoch 12, while its sparse categorical accuracy remained around 43%. ResNet50 and miniXception reached a plateau, with a constant difference of 15-20% between the training and test values.
After the training process, the confusion matrix for each model was constructed; the confusion matrices are shown in Fig. 8. It was observed that in the case of VGG16 the overfitting shows clearly in the emotion 'sad': the model predicted most of the samples as the 'sad' class. For both ResNet50 and miniXception, the best-predicted emotions were 'happy', 'surprise', and 'anger', but emotions such as 'sad' and 'disgust-contempt' are highly confused with the others, which produces a low F1-score.
Finally, using the confusion matrices, the macro-averaged and MAoP metrics were computed; these are shown in Table III.

TABLE II: KDEF Metrics

                              ResNet50     VGG16        MiniXception
Parameters (Body)             23,587,712   14,714,688   57,414
Parameters (Top)              4,786,182    4,786,182    0
Total Parameters (Millions)   28.37        19.50        0.06
Macro-Average Accuracy        89.60%       87.10%       89.44%
Macro-Average Precision       71.09%       61.59%       69.85%
Macro-Average Recall          68.81%       61.31%       68.33%
Macro-Average F1-Score        69.50%       61.00%       68.75%
MAoP                          0.03         0.05         15.579

Fig. 6: Confusion Matrix - KDEF Dataset: (a) ResNet50. (b) VGG16. (c) miniXception
Fig. 7: Loss and Accuracy during the training - AffectNet

The miniXception and ResNet50 models achieved the highest macro-averaged accuracy, both around 84%; however, ResNet50 had better precision. The custom MAoP metric shows that miniXception performs better than ResNet50, due to its use of far fewer parameters to achieve a similar accuracy.

V. DISCUSSION

The evaluation compared different CNN architectures on three (03) different datasets. In the experiments, ResNet50 and VGG16 were used as feature-vector generators to train a simple neural network, providing a comparison against the miniXception architecture proposed by Arriaga et al. [12]. We evaluated whether this small architecture can achieve results similar to those of common CNN architectures (ResNet50 and VGG16).
It was observed that ResNet50 achieved the highest macro-average accuracy in two of the three datasets; however, miniXception achieved similar values with far fewer parameters. Because of this, miniXception obtains higher values in the custom MAoP metric, which makes this model suitable for use in small embedded systems such as the Raspberry Pi 4.
Another issue observed while testing the models on the different datasets is that the same two emotions were confused with each other: 'sad' and 'fear'.

TABLE III: AffectNet Metrics

                              ResNet50     VGG16        MiniXception
Parameters (Body)             23,587,712   14,714,688   57,414
Parameters (Top)              4,786,182    4,786,182    0
Total Parameters (Millions)   28.37        19.50        0.06
Macro-Average Accuracy        84.08%       81.41%       84.30%
Macro-Average Precision       53.26%       48.96%       52.35%
Macro-Average Recall          51.62%       43.59%       52.44%
Macro-Average F1-Score        52.02%       44.24%       52.36%
MAoP                          0.03         0.04         14.683

Fig. 8: Confusion Matrix - AffectNet Dataset: (a) ResNet50. (b) VGG16. (c) miniXception
This observation confirms that this classification task is a hard problem to solve, even for humans, whose accuracy was around 65% on the FER-2013 dataset [10].
Finally, it is observed that using the same top layers for ResNet50 and VGG16 is not advisable; custom top layers should be implemented according to the problem and the architecture in order to obtain better results. Also, to improve the results, it is recommended to try other regularization techniques such as Batch Normalization or weight regularization, to train for more epochs, and to try different optimizers.

VI. CONCLUSIONS AND FUTURE WORKS

In this project, research was conducted to classify facial emotions over three different datasets using different CNN architectures (ResNet50, VGG16, and miniXception). Emotion classification using facial expressions is a complex problem that has already been approached several times with different techniques. The CNN models used achieved macro-average accuracy values between 80-89% and F1-scores between 50-60%, which shows their effectiveness.
In addition, the use of a small architecture such as miniXception has been demonstrated to be as accurate as ResNet50 or VGG16 for classifying emotions on different datasets containing color and gray-scale images. The datasets were built either in a controlled environment, such as KDEF, or in the wild, such as AffectNet and FER-2013; due to this characteristic, it is observed that the KDEF experiments had higher accuracy values.
As future work, metrics such as frames per second (fps) and power consumption (W) could be compared on two (02) popular embedded systems used in robotics education: the Raspberry Pi 4 and the NVIDIA Jetson Nano.
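A minimal sketch of how such an fps measurement could be taken on either board, assuming TensorFlow runs on the device and a trained model file is available; the file name is a placeholder, and the input shape follows the 48x48 grayscale mini-Xception setup described earlier.

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder path; any of the trained emotion models could be used.
model = tf.keras.models.load_model("mini_xception.h5")

frame = np.random.rand(1, 48, 48, 1).astype("float32")  # dummy frame
model.predict(frame, verbose=0)  # warm-up run, excluded from timing

n = 100
start = time.time()
for _ in range(n):
    model.predict(frame, verbose=0)
print(f"{n / (time.time() - start):.1f} fps")
```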
Á. Rodrı́guez-Vázquez, “Performance analysis of real-time DNN infer-
as AffecNet and FER-2013. Due to this characteristic of the ence on Raspberry Pi,” p. 14, 2018.
dataset is observed that the KDEF experiments had higher [17] A. Canziani, A. Paszke, and E. Culurciello, “An Analysis of Deep Neural
accuracy values. Network Models for Practical Applications,” pp. 1–7, 2016.
[18] J. Liu, J. Liu, W. Du, and D. Li, “Performance analysis and characteri-
As a future work, a comparison of metrics such as frame zation of training deep learning models on mobile device,” Proceedings
per second (fps) and power consumption (W) could be com- of the International Conference on Parallel and Distributed Systems -
pared in two (02) popular embedded systems used in robotic ICPADS, vol. 2019-Decem, pp. 506–515, 2019.
[19] P. Ekman, W. V. Friesen, and P. Ellsworth, Emotion in the Human
education: Raspberry Pi 4 and NVIDIA Jetson Nano. Face: Guidelines for Research and an Integration of Findings, vol. 122.
Finally, the miniXception model could be trained using all Pergamon Press Inc., 1973.
the AffectNet data, which had been reduced from 440K images [20] D. Lundqvist, A. Flykt, and A. Öhman, “The karolinska directed
emotional faces (kdef),” (Accessed on 11/22/2020).
to 50K due to computational power limitations. This training [21] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A Database
should be performed in order to analyze if higher accuracy for Facial Expression, Valence, and Arousal Computing in the Wild,”
could be achieved with larger data, because the KDEF and IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31,
2017.
FER2013 have a few number of images. [22] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” in 3rd International Conference on
R EFERENCES Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.),
[1] R. W. Picard, Affective computing. MIT press, 2000. 2015.
[2] R. W. Picard and J. Klein, “Computers that recognise and respond to [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image
user emotion: Theoretical and practical implications,” Interacting with Recognition,” in Proceedings of the IEEE conference on computer vision
Computers, vol. 14, no. 2, pp. 141–169, 2002. and pattern recognition, 2016.
[3] J. M. Garcia-Garcia, V. M. Penichet, M. D. Lozano, J. E. Garrido, and
E. L. C. Law, “Multimodal affective computing to enhance the user
experience of educational software applications,” Mobile Information
Systems, vol. 2018, 2018.
[4] J. Bullington, “’Affective’ computing and emotion recognition systems:
The future of biometric surveillance?,” Proceedings of the 2005 Infor-
mation Security Curriculum Development Conference, InfoSecCD ’05,
no. September 2005, pp. 95–99, 2005.
[5] F. Cavallo, F. Semeraro, L. Fiorini, G. Magyar, P. Sinčák, and P. Dario,
“Emotion Modelling for Social Robotics Applications: A Review,”
Journal of Bionic Engineering, vol. 15, no. 2, pp. 185–203, 2018.
[6] B. C. Ko, “A brief review of facial emotion recognition based on visual
information,” Sensors (Switzerland), vol. 18, no. 2, 2018.
