Multi Modal Emotion Recognition
Juan D. S. Ortega, Mohammed Senoussaoui, Eric Granger, Marco Pedersoli
Patrick Cardinal and Alessandro L. Koerich
arXiv:1907.03196v1 [cs.CV] 6 Jul 2019

Abstract— This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers which aim to learn the representation for each modality, as well as the best combined [...] AVEC SEWA database, the system proposed in this paper learns the feature representation, and the classification and fusion functions to accurately predict the subject's levels of arousal, valence and liking.

[...] to the specific individual behaviors and capture conditions. Moreover, it is costly to create large, representative audio-video datasets with the reliable expert annotation needed to design recognition models and to accurately detect levels of arousal and valence.

The main intuition of the proposed approach is to jointly learn a discriminant representation for each modality, along with the classification and fusion functions. Each feature subset is first independently processed by one or more hidden layers. This part of the network learns the best feature representation for a given task. Then, the last hidden layers of each block are inter-connected to one or more fully connected layers that perform the classification and fusion of representations. Globally, this network should learn how the input features can be transformed so that they can be classified and combined into one global decision. Training a hybrid classifier to combine these representations could improve the overall accuracy of the recognition system. The proposed architecture exploits three different sources of information – audio (voice), video (face) and text. For more details on the feature sets, please refer to [20].

1) Audio Features: Acoustic features consist of 23 acoustic low-level descriptors (LLDs), such as energy, spectral and cepstral coefficients, pitch, voice quality, and micro-prosodic features, extracted every 10 ms over a short-term frame. Segment-level acoustic features are computed over segments of 6 seconds using a codebook of 1,000 audio words, from which a histogram of audio words is created. Therefore, the resulting feature vector has 1,000 dimensions.

2) Video Features: The video features are extracted for each video frame (frame step 20 ms) and consist of three feature types: normalized face orientation in degrees, pixel coordinates for 10 eye points, and pixel coordinates for 49 facial landmarks. Bag-of-video-words with separate codebooks and histograms were created for each of the three facial feature types, with a codebook size of 1,000 each, resulting in a 3,000-dimensional segment-level feature vector.

3) Text Features: Text features consist of a bag-of-words representation based on the transcription of the speech. The dictionary contains 521 words, where only unigrams are considered. The histograms are created over a segment of 6 s in time. In total, the bag-of-text-words (BoTW) features contain 521 features.

A. DNN Architecture

The proposed DNN architecture for multi-modal fusion is shown in Fig. 1. The DNN processes each source of information – audio, video, and text – separately, using a pair of fully connected layers per modality (a) in order to generate and explore correlations between features of the same type. Then, a second stage (b) merges the outputs of those independent layers using a concatenation function. This layer receives the outputs of the previous step in one block and feeds a fully connected layer (c) that correlates the essence of the modalities. The DNN output is generated by a single linear neuron (d) used as the regression value of the whole network. Finally, a scaling module (e) is used to reduce the gap between the magnitudes of the predictions and the labels. Different scaling functions have been considered to yield the best results – decimal scaling, min-max normalization and standard-deviation scaling. The final prediction (f) is produced via a linear activation function (a weighted sum of the previous layer).

For the training process, the Mean Squared Error (MSE) was used as the loss function:

MSE = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2    (1)

where \hat{y}_i, y_i, and m denote the predicted values, the observed values, and the number of predictions, respectively. The 'checkpoint' functionality (provided by the Keras API [4]) was used to retrieve the best model obtained during the training process. Several training sessions were run for each dimension (arousal, valence, and liking) to ensure a stable configuration of the networks.

TABLE I: Description of the DNN architecture for each emotional dimension (arousal, valence and liking).

Modality        Arousal     Valence     Liking
                L1    L2    L1    L2    L1    L2
Audio           50    50    200   200   50    50
Video           100   100   200   200   100   100
Text            200   200   200   200   100   100
Fusion Layer       100         200         50

On the other hand, the Liking predictions show a better behavior when the delay reaches d = 2.5 (Fig. 2).

Fig. 2: Effect of the delay on the CCC.
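The multi-branch fusion network and its MSE loss can be sketched as a plain forward pass. The following is a minimal NumPy illustration, not the authors' Keras implementation [4]: the branch widths follow the arousal column of Table I, and the weights are random placeholders since the trained parameters are not available.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, n_out, relu=True):
    # One fully connected layer; weights are random placeholders
    # standing in for the trained parameters.
    W = rng.standard_normal((x.shape[-1], n_out)) * 0.01
    h = x @ W
    return np.maximum(0.0, h) if relu else h

def fusion_dnn(audio, video, text):
    # (a) a pair of fully connected layers per modality
    #     (layer sizes from the arousal column of Table I)
    a = fc(fc(audio, 50), 50)
    v = fc(fc(video, 100), 100)
    t = fc(fc(text, 200), 200)
    # (b) concatenation of the independent branches into one block
    merged = np.concatenate([a, v, t], axis=-1)
    # (c) shared fully connected fusion layer
    h = fc(merged, 100)
    # (d) single linear neuron producing the regression value
    return fc(h, 1, relu=False)

def mse(y_hat, y):
    # Eq. (1): mean squared error over m predictions
    return np.mean((y_hat - y) ** 2)

# One 6-s segment: 1,000-d audio, 3,000-d video and 521-d text features.
audio = rng.standard_normal((1, 1000))
video = rng.standard_normal((1, 3000))
text = rng.standard_normal((1, 521))
print(fusion_dnn(audio, video, text).shape)  # (1, 1)
```

The scaling module (e) and final prediction (f) are omitted here; they act on the output of the single linear neuron (d).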
IV. EXPERIMENTAL METHODOLOGY

The experimental protocol used with the RECOLA dataset involves a training set and a development set. For the experiments conducted in this research, the development set has been split into two subsets. The first one contains five randomly chosen subjects out of the 14 subjects in the development set. The nine remaining subjects have been used as a test subset to evaluate the performance of the proposed approach.

For the SVR experimentation, the original protocol has been used for building the unimodal and early-fusion systems. The late-fusion approach used the first development subset to optimize each unimodal system. The fusion function has been optimized using the second development subset.

In the case of the DNN, the smaller development subset has been used to determine the optimal number of layers and neurons for each modality, as well as the number of neurons in the fusion layer. The second development subset has been used for testing the model. Note that each emotional dimension has its own architecture, described in Table I.

The evaluation measure used in the challenge is the CCC, which is defined as:

\rho_{y'y} = \frac{2 s_{y'y}}{s_{y'}^2 + s_y^2 + (\bar{y}' - \bar{y})^2}

where y' and y are the sets for which the correlation is calculated, s_{y'}^2 and s_y^2 are the variances calculated on sets y' and y respectively, and \bar{y}' and \bar{y} are the means of sets y' and y, respectively.

A. Preprocessing

A significant increase in the CCC score appears when a delay compensation function is applied. Fig. 2 shows how the CCC changes as a function of the delay. The suitable value for the delay compensation (optimized on the development partition) was chosen as d = 1.5 for arousal and valence; the behavior of the curves shows that this point is useful to increase the CCC for those dimensions.

B. Postprocessing

The output of the DNN regression layer can be any real value. However, the regression process tends to average the values obtained from the feature vectors in different ways. That process implicitly leads to an attenuation of the predicted values. This issue can be circumvented by scaling the outputs.

Min-Max Scaling. Min-max scaling provides a linear transformation of the original range of predictions into a pre-defined boundary. A modified value \vec{y}_{norm} is obtained using Equation 2:

\vec{y}_{norm} = \frac{\max_l - \min_l}{\max_p - \min_p} \cdot (\vec{y} - \min_p) + \min_l    (2)

where \max_l and \min_l are respectively the maximum and minimum values in the training labels, and \max_p and \min_p are respectively the maximum and minimum values of \vec{y}, the original prediction vector.

Standard-Deviation Ratio. This scaling method has been used in [24], with interesting results. A modified value \vec{y}_{norm} is obtained using Equation 3:

\vec{y}_{norm} = \frac{\sigma_p}{\sigma_l} \otimes \vec{y}    (3)

where \sigma_p is the standard deviation of the predictions, \sigma_l is the standard deviation of the gold standard, \otimes is the element-wise multiplication operation, and \vec{y} is the prediction vector to be scaled.

Decimal Scaling. This scaling method moves the decimal point of the values of the prediction vector \vec{y}. The number of decimal points moved depends on the maximum absolute value in \vec{y}. A modified value \vec{y}_{norm} is obtained using Equation 4:

\vec{y}_{norm} = \frac{\vec{y}}{10^{min_p}}    (4)

where \vec{y} is the original prediction vector, min_p is the smallest integer such that \max|\vec{y}_{norm}| < 1, and the division is performed element-wise.

V. EXPERIMENTAL RESULTS AND DISCUSSION

A. Results with SVR

Some preliminary experiments have been carried out using an SVR in order to provide an advanced baseline. The complexity (as provided in the baseline script), the epsilon (in the range [0.0-0.0001]) and the delay (in the range [0-3] seconds) have been optimized on the development set. Table II shows the results of the experiments conducted with the SVR framework. Note that the protocol used in the baseline paper [20] has been followed, except for the early fusion scheme. The first three lines present the CCC obtained on each individual modality. A first observation is that predicting the liking is quite a hard task, particularly from the audio and video. However, the text seems a much better modality for predicting liking. Moreover, the text also performs well for both the arousal and valence. A major drawback of this modality is the need for transcriptions, whose production is both time consuming and expensive, and not realistic in real-life applications. However, given the accuracy obtained by state-of-the-art speech recognition systems, the use of the text modality is not out of reach. It would be interesting to study how the errors made by a speech recognition system affect the precision of an emotion identification system. Another observation is that the use of early fusion led to a 23% improvement for arousal and 32% for valence. The early fusion scheme is very harmful when applied to the liking dimension. This is mainly due to the poor performance of audio and video features for this dimension.

The most interesting point of this experiment is the relative contribution of each modality to the prediction of the emotion, as shown in Fig. 3. Values are derived from the linear regression coefficients obtained from the multimodal late fusion model. As previously mentioned, it is clear from Fig. 3 that text is the most important modality for all dimensions. It is particularly true for the liking, for which the text contribution is more than 97% of the predicted value. Note that for arousal and valence, the text contributes more than 50% of the prediction value, which shows the importance of this modality. Another interesting point in this figure is the negative contribution of the audio in the prediction of the liking, suggesting that audio (or at least the features used to represent audio recordings) is not suitable for the prediction of the liking.

B. Results with DNNs

The first experiment conducted with a DNN was the early fusion scheme. Table III shows the results obtained on the development set. The performance of the early fusion scheme depends on the modality. For the arousal, using either a DNN or an SVR leads to the same result. However, the SVR outperforms the DNN for the valence prediction by 6%. On the other hand, the DNN is less sensitive to an unusable modality, since it outperforms the SVR by 26%.

The results achieved with the proposed approach are very interesting since they outperform both the DNN and the SVR for the arousal and valence dimensions. However, late fusion is still the better choice for liking prediction. This means that, even with the proposed architecture, the DNN is not able to ignore modalities that are useless or hurtful. However, this architecture works better than early fusion with a DNN. The performance on the test set is a bit low. The best explanation for these results is over-fitting. However, since the results on the development set suggest that the proposed approach provides a relatively high level of performance, more experiments will be conducted to understand this problem.
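The three post-processing scalings described in Section IV-B can be sketched as follows. This is an illustrative implementation, not the authors' code; the function names are ours, and Eq. (3) is implemented with the ratio sigma_p/sigma_l exactly as printed.

```python
import numpy as np

def min_max_scale(y_pred, y_labels):
    # Eq. (2): linearly map the prediction range onto the
    # training-label range.
    max_l, min_l = y_labels.max(), y_labels.min()
    max_p, min_p = y_pred.max(), y_pred.min()
    return (max_l - min_l) / (max_p - min_p) * (y_pred - min_p) + min_l

def std_ratio_scale(y_pred, y_gold):
    # Eq. (3): element-wise scaling by the ratio of the standard
    # deviation of the predictions to that of the gold standard.
    return (np.std(y_pred) / np.std(y_gold)) * y_pred

def decimal_scale(y_pred):
    # Eq. (4): divide by the smallest power of ten that brings every
    # magnitude below 1 (assumes a non-zero prediction vector).
    j = int(np.floor(np.log10(np.abs(y_pred).max()))) + 1
    return y_pred / 10.0 ** j

preds = np.array([0.1, 0.4, 0.25])    # attenuated predictions
labels = np.array([-1.0, 1.0, 0.2])   # training labels
print(min_max_scale(preds, labels))   # stretched back onto [-1, 1]
```

As the example shows, min-max scaling counteracts the attenuation of the regression outputs by restoring the full label range.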
TABLE III: DNN results for emotion recognition on the Development (D) and Test (T) partitions. Performance is measured
in terms of the Concordance Correlation Coefficient.
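The Concordance Correlation Coefficient used throughout the evaluation follows directly from its definition in Section IV; a small sketch (the helper name is ours):

```python
import numpy as np

def ccc(y_prime, y):
    # Concordance Correlation Coefficient as defined in the text:
    # 2*s_{y'y} / (s_{y'}^2 + s_y^2 + (mean(y') - mean(y))^2)
    m_p, m_t = y_prime.mean(), y.mean()
    s_p, s_t = y_prime.var(), y.var()
    s_pt = np.mean((y_prime - m_p) * (y - m_t))
    return 2.0 * s_pt / (s_p + s_t + (m_p - m_t) ** 2)

gold = np.array([0.2, 0.5, 0.1, 0.4])
print(ccc(gold, gold))        # ≈ 1.0: perfect concordance
print(ccc(gold + 0.3, gold))  # < 1.0: penalized by the mean shift
```

Unlike the Pearson correlation, the CCC penalizes both scale and mean differences between predictions and gold standard, which is why the post-processing scalings above directly affect it.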
REFERENCES

[1] A. Ali, N. Dehak, P. Cardinal, S. Khuranam, S. H. Yella, P. Bell, and S. Renals. Automatic dialect detection in Arabic broadcast speech. In Proc. of the 13th Annual Conf. of the Intl Speech Communication Association (Interspeech), 2016.
[2] P. Cardinal, N. Dehak, A. L. Koerich, J. Alam, and P. Boucher. ETS system for AV+EC 2015 challenge. In Proc. of the 5th Intl Workshop on Audio/Visual Emotion Challenge, pages 17–23, New York, NY, USA, 2015.
[3] S. Chen and Q. Jin. Multi-modal dimensional emotion recognition using recurrent neural networks. In Proc. of the 5th Intl Workshop on Audio/Visual Emotion Challenge, pages 49–56, 2015.
[4] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[5] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen, M. T. Padilla, F. Zhou, and F. D. la Torre. Detecting depression from facial actions and vocal prosody. In 3rd Intl Conf. on Affective Computing and Intelligent Interaction and Workshops, pages 1–7, Sept 2009.
[6] M. J. Cossetin, J. C. Nievola, and A. L. Koerich. Facial expression recognition using a pairwise feature selection and classification approach. In International Joint Conference on Neural Networks (IJCNN'2016), pages 5149–5155. IEEE, 2016.
[7] N. Cummins, J. Epps, and E. Ambikairajah. Spectro-temporal analysis of speech affected by depression and psychomotor retardation. In 2013 IEEE Intl Conf. on Acoustics, Speech and Signal Processing, pages 7542–7546, May 2013.
[8] D. J. France, R. G. Shiavi, S. Silverman, M. Silverman, and M. Wilkes. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Trans. on Biomedical Engineering, 47(7):829–837, July 2000.
[9] Y. Guo, G. Zhao, and M. Pietikinen. Dynamic facial expression recognition with atlas construction and sparse representation. IEEE Trans. on Image Processing, 25(5):1977–1992, May 2016.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012.
[11] Z. Huang, T. Dang, N. Cummins, B. Stasak, P. Le, V. Sethu, and J. Epps. An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction. In Proc. of the 5th Intl Workshop on Audio/Visual Emotion Challenge, pages 41–48, 2015.
[12] M. Kächele, M. Glodek, D. Zharkov, S. Meudt, and F. Schwenker. Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression. In Proc. of the 3rd Intl Conf. on Pattern Recognition Applications and Methods, pages 671–678, 2014.
[13] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In Proc. of Intl Conf. on Multimodal Interaction, pages 427–434, New York, NY, USA, 2015.
[14] J. Kumari, R. Rajesh, and K. Pooja. Facial expression recognition: A survey. Procedia Computer Science, 58:486–491, 2015. 2nd Intl Symposium on Computer Vision and the Internet.
[15] H. Meng, D. Huang, H. Wang, H. Yang, M. AI-Shuraifi, and Y. Wang. Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proc. of the 3rd ACM Intl Workshop on Audio/Visual Emotion Challenge, pages 21–30, October 2013.
[16] E. Moore, M. Clements, J. Peifer, and L. Weisser. Analysis of prosodic variation in speech for clinical depression. In Proc. of the 25th Annual Intl Conf. of the IEEE Engineering in Medicine and Biology Society, volume 3, pages 2925–2928, Sept 2003.
[17] M. Nasir, A. Jati, P. G. Shivakumar, S. Nallan Chakravarthula, and P. Georgiou. Multimodal and multiresolution depression detection from speech and facial landmark features. In Proc. of the 6th Intl Workshop on Audio/Visual Emotion Challenge, pages 43–50, 2016.
[18] L. E. S. Oliveira, M. Mansano, A. L. Koerich, and A. S. Britto Jr. 2D principal component analysis for face and facial-expression recognition. Computing in Science & Engineering, 13(3):9–13, 2011.
[19] M. Pantic and I. Patras. Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences. IEEE Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(2):433–449, April 2006.
[20] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic. AVEC 2017 – Real-life depression, and affect recognition workshop and challenge. In Proc. of the 7th Intl Workshop on Audio/Visual Emotion Challenge, Mountain View, USA, October 2017.
[21] J. D. Silva Ortega, P. Cardinal, and A. L. Koerich. Emotion recognition using fusion of audio and video features. In IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1–6, 2019.
[22] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. on Medical Imaging, 35(5):1299–1312, May 2016.
[23] D. L. Tannugi, A. S. Britto Jr., and A. L. Koerich. Memory integrity of CNNs for cross-dataset facial expression recognition. In IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1–6, 2019.
[24] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In IEEE Intl Conf. on Acoustics, Speech and Signal Processing, pages 5200–5204, March 2016.
[25] J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and D. D. Mehta. Vocal and facial biomarkers of depression based on motor incoordination and timing. In Proc. of the 4th Intl Workshop on Audio/Visual Emotion Challenge, pages 65–72, 2014.
[26] T. H. H. Zavaschi, A. S. Britto Jr., L. E. S. Oliveira, and A. L. Koerich. Fusion of feature sets and classifiers for facial expression recognition. Expert Systems with Applications, 40(2):646–655, 2013.
[27] T. H. H. Zavaschi, A. L. Koerich, and L. E. S. Oliveira. Facial expression recognition using ensemble of classifiers. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1489–1492. IEEE, 2011.
[28] B. Zhang, C. Quan, and F. Ren. Study on CNN in the recognition of emotion in audio and images. In IEEE/ACIS 15th Intl Conf. on Computer and Information Science, pages 1–5, June 2016.