

Video-Based Depression Level Analysis by Encoding Deep Spatiotemporal Features

Mohamad Al Jazaery and Guodong Guo*, Senior Member, IEEE

Abstract—As a serious mood disorder, depression causes severe symptoms that affect how people feel, think, and handle daily activities such as sleeping, eating, or working. In this paper, a novel framework is proposed to estimate Beck Depression Inventory II (BDI-II) values from video data. It uses a 3D convolutional neural network to automatically learn spatiotemporal features at two different scales of the face region, and a Recurrent Neural Network (RNN) is then used to learn further from the sequence of spatiotemporal information. This formulation, called RNN-C3D, can model the local and global spatiotemporal information of consecutive facial expressions in order to predict depression levels. Experiments on the AVEC2013 and AVEC2014 depression datasets show that our proposed approach is promising when compared to state-of-the-art visual-based depression analysis methods.

Index Terms—Automated visual-based depression analysis, nonverbal behavior, 3D Convolutional Neural Network (C3D), Recurrent Neural Network (RNN).

I. INTRODUCTION

Major depressive disorder (MDD) is a mental illness characterized by low confidence, loss of interest in normally enjoyable activities, and low energy without a specific reason [1]. MDD affected approximately 216 million people (3% of the world's population) in 2015, and females are affected about twice as often as males. It can also negatively affect a person's work or school life, as well as sleeping and eating habits and general health [1], [2]. Up to 60% of people who die by suicide had depression or another mood disorder [3]. With so many people badly affected, accurate MDD diagnosis becomes a priority. Automated systems therefore need to be developed to give an objective assessment and quick analysis of mood disorders, which can lead to better and more timely MDD therapy.

Machine-based health assessment systems offer an easy way to track a person's depression status online through human-machine interaction [4]. Specifically, automated mental health systems can detect and analyze the audio and visual behaviors that are related particularly to depression. Speech characteristics can be useful for depression analysis [5], [6], and several depression analysis methods have been developed using audio data [7], [8], [9], [10], [11], [12]. Studies show that visual expressions and gestures are also very important for depression analysis [13], [14]. In particular, many methods focus on the face region, because more than half of the visual-based nonverbal behaviors in human activities involve face and head gestures [15], [16], [17], [18], [19], [20], which motivated us to extract deep features using two different facial visual cues. Our first visual cue uses a tightly cropped, aligned face region and focuses mostly on facial expressions. The second cue uses a relatively larger face region with the full head included, which helps capture head gestures and more dynamics around the face region. Even though audio-visual fusion gives better results than the visual data alone for depression analysis [21], [22], [23], our work focuses on visual nonverbal data without utilizing the audio.

The depression diagnosis is usually based on the patient's verbal and action behaviors, as reported by the patients themselves or their friends. The Beck Depression Inventory-II (BDI-II), an estimation method for depression levels [24], has values ranging from 0 to 63, where 0-13 implies no depression, 14-19 mild depression, 20-28 moderate depression, and 29-63 severe depression. As a result, Automatic Depression Level Prediction (ADLP) can be formulated as a regression or a multi-class classification problem.
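For illustration only, the BDI-II ranges above can be written as a small binning function. This is just a restatement of the scale; the proposed method regresses the raw BDI-II value rather than these categories.

```python
def bdi2_severity(score: int) -> str:
    """Map a BDI-II value (0-63) to the severity bands listed above."""
    if not 0 <= score <= 63:
        raise ValueError("BDI-II values range from 0 to 63")
    if score <= 13:
        return "no depression"
    if score <= 19:
        return "mild depression"
    if score <= 28:
        return "moderate depression"
    return "severe depression"

print(bdi2_severity(17))  # -> "mild depression"
```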
The aim of this paper is to predict depression levels from a person's visual expressions. A convolutional 3D network (C3D) [25] is used to model both the dynamics and the appearance in video data of people performing Human-Computer Interaction tasks, responding to a number of questions, such as: What is your favorite dish? What was your best gift, and why? Discuss a sad childhood memory. The ability of the C3D to learn both salient temporal and spatial information from a sequence of frames makes it appropriate for analyzing human visual behavior and predicting depression levels. After the deep spatiotemporal features learned by the C3D net are extracted, a Recurrent Neural Network (RNN) [26], [27] is used to learn the depression levels further. To the best of our knowledge, this is the first work to explore the C3D and RNN methods for ADLP in videos.

Our extensive experiments on both the AVEC2013 and AVEC2014 datasets show that our approach is promising in comparison with state-of-the-art visual-based depression analysis methods.

In the following, the related works and methods are reviewed briefly. Then, in Section III, the proposed framework is presented with details about its architecture and submodules. After that, the evaluation of our method is demonstrated through many experiments using each submodule in our framework. Finally, Sections V and VI present some discussions and draw conclusions.

Mohamad Al Jazaery and Guodong Guo (corresponding author, e-mail: Guodong.Guo@mail.wvu.edu) are with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.


Fig. 1. Tight aligned and loose (larger) face samples from the AVEC2014 dataset. Faces are blurred for privacy concerns.

II. RELATED WORK

The depression recognition competition is part of the Audio-Visual Emotion Challenge 2013 and 2014 (AVEC2013 and AVEC2014). The 2013 and 2014 depression datasets contain video and audio data (see Sections IV-A and IV-B for more details about the datasets). Although the audio data improves depression prediction when utilized [7], [8], [9], our approach focuses on exploring the C3D feature to encode the visual information and does not use any audio cue. In the following, short descriptions of the visual-based approaches in the AVEC2013 and AVEC2014 competitions are presented.

Many types of visual features have been explored for depression prediction. Local Phase Quantization (LPQ), which performs well in facial expression recognition, was used as the baseline feature for AVEC2013. Also in the AVEC2013 challenge, the Motion History Histogram (MHH) was used by Meng et al. [8] to model motion at the pixel level; however, they concluded that the MHH descriptor lacks the ability to encode all temporal information completely. The Space-Time Interest Points (STIP) [28] and Pyramid of Histogram of Gradients (PHOG) [29] features were explored in Cummins et al.'s work [9], and their experimental results showed that the PHOG feature outperforms STIP for depression prediction.

In the AVEC2014 competition, Local Binary Patterns (LBP), the Edge Orientation Histogram (EOH), and LPQ were merged by [30] and used as visual descriptors. Also in AVEC2014, [31] used the combination of LGBP-TOP and LPQ, while [32] combined three different motion features (the motion static image, the motion history image, and the motion average), which achieved a better depression prediction.

Recently, convolutional deep neural networks have achieved state-of-the-art results in computer vision, and several methods have used deep features for depression level prediction. In [16], a two-stream deep framework was proposed that takes facial images and facial flows as input, called the appearance and dynamics deep convolutional neural networks (DCNN), respectively. Different from [16], which uses flow data as the temporal information, [22] proposed the Feature Dynamic History Histogram (FDHH) to capture the dynamics of the temporal movement from the sequence of deep features in the RGB spatial space, achieving a better prediction than [16]. Standard 2D deep neural networks have also been used for depression analysis in [33].

Besides the different features, several techniques have been used for depression prediction. The best was proposed by [23]: instead of using a single complex regression model to predict the depression level, it works in a two-stage manner consisting of a coarse classifier and a fine regressor, and the combination of an LDA classifier and an OLS regressor achieved the best results.

Our approach uses C3D features to model spatial and temporal salient features simultaneously, rather than using two C2D models separately as in [16]. In [27], C3D was explored for emotion recognition and merged with an RNN-C2D model. Different from their approach, in this paper the C3D and RNN-C3D are explored for automatic depression level prediction. In the following, our depression analysis framework design and components are presented in detail.

III. RNN-C3D: A DEEP SPATIOTEMPORAL DEPRESSION ANALYSIS FRAMEWORK

The better the descriptors model the spatial and temporal components of the videos, the better the depression analysis can be. In depression videos, the spatial information is the static appearance of the face in each frame, and the temporal information is the dynamics of the subject's face and head. Based on this perspective, we introduce RNN-C3D, a deep spatiotemporal depression analysis framework that is able to encode both appearance and dynamics in videos. Fig. 2 illustrates the steps and components of our framework. In the following, each component of our framework is discussed in detail.

A. C3D: A Direct Spatiotemporal Model

Different from 2D convolutional deep networks, three-dimensional convolutional networks (C3D) can learn from both spatial and temporal data. From Fig. 3 one can see that in a 3D convolution the depth of the filter kernel is smaller than the depth of the input volume. As a result, the output feature map has three dimensions and contains features in both the spatial and temporal dimensions. Therefore, C3D has the potential to perform well in video-based learning tasks: it can learn not only the appearance but also the motion and dynamics encoded in consecutive frames.

Fig. 3. 2D (top) and 3D (bottom) convolutions, where W, H, and L are the width, height, and number of input frames. The 3D kernel's depth (d) is less than L.

C3D has shown state-of-the-art performance in action recognition and scene classification [25], as well as in face dynamics analysis and emotion detection [27], so it is interesting to explore the C3D model for video-based depression analysis. To handle the limited training data in depression analysis, a pre-trained C3D model is used, which was trained on Sports-1M, a large-scale video dataset of sports activities, and then fine-tuned on UCF101, a human action dataset [25]. This pre-training approach is motivated by the observation that the earlier layers of a ConvNet contain more generic features (e.g., edge or color detectors). The ImageNet dataset [34], which contains 1.2 million images in 1000 categories, is commonly used as pre-training data for C2D deep learning; in our case, Sports-1M and UCF101 are used as the pre-training datasets for our C3D models. Initializing the weights in this way instead of using random weights helps transfer generic spatiotemporal primitive knowledge from the human action model to our depression analysis and overcome the potential over-fitting problem.


Fig. 2. The flow diagram of the proposed method for predicting the Beck Depression Inventory (BDI) score using deep C3D and RNN in videos. Two different preprocessing methods are used to extract face features at two different scales. Then the RNN learns the dynamics further.

The net structure in [25] is used, except that the number of units in the last two fully connected layers is reduced from 4096 to 512 each to avoid over-fitting, considering the limited depression training data. Also, the softmax loss function is replaced with a Euclidean loss to handle the regression in our depression level prediction, where the depression level score is the Beck Depression Inventory-II (BDI-II) score. As a result, our final C3D architecture is as follows: 8 convolutions, 5 max-poolings, and 2 fully connected layers, followed by a Euclidean loss output. All 3D convolution kernels are 3×3×3 with stride 1 in both the spatial and temporal dimensions. All pooling kernels are 2×2×2, except for pool1, which is 1×2×2. Each fully connected layer has 512 output units. The input is a series of 16 RGB video frames with 128×128 pixels in the spatial dimensions. The C3D structure is illustrated in Fig. 4.

Fig. 4. C3D deep structure: 8 convolutions, 5 max-poolings, 2 fully connected layers, and a Euclidean loss output layer.
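For illustration, a minimal PyTorch-style sketch of such a C3D regressor is given below. The kernel sizes, pooling sizes, 512-unit fully connected layers, and Euclidean (mean-squared-error) loss follow the text; the per-block channel widths follow the original C3D design [25] and are an assumption, since the paper does not list them, and this is not the authors' Caffe implementation.

```python
import torch
import torch.nn as nn

class C3DRegressor(nn.Module):
    """Sketch of the modified C3D: 8 convolutions, 5 max-poolings, two
    512-unit fully connected layers, and one BDI-II output trained with a
    Euclidean (MSE) loss. Channel widths are assumed from the original
    C3D [25]; kernel and pooling sizes follow the text."""

    def __init__(self):
        super().__init__()

        def block(cin, cout, n_convs):
            layers = []
            for _ in range(n_convs):
                layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
                cin = cout
            return layers

        self.features = nn.Sequential(
            *block(3, 64, 1),    nn.MaxPool3d((1, 2, 2)),  # pool1 keeps the temporal depth
            *block(64, 128, 1),  nn.MaxPool3d((2, 2, 2)),
            *block(128, 256, 2), nn.MaxPool3d((2, 2, 2)),
            *block(256, 512, 2), nn.MaxPool3d((2, 2, 2)),
            *block(512, 512, 2), nn.MaxPool3d((2, 2, 2)),
        )
        # For 16 x 128 x 128 inputs the last feature map is 512 x 1 x 4 x 4.
        self.fc6 = nn.Linear(512 * 1 * 4 * 4, 512)
        self.fc7 = nn.Linear(512, 512)
        self.out = nn.Linear(512, 1)  # predicted BDI-II score

    def forward(self, clip):          # clip: (batch, 3, 16, 128, 128)
        x = self.features(clip).flatten(1)
        x = torch.relu(self.fc6(x))
        x = torch.relu(self.fc7(x))
        return self.out(x).squeeze(1)

# Training would pair this with nn.MSELoss() (the Euclidean loss) and with
# weights initialized from a Sports-1M/UCF101 pre-trained C3D model.
```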

As shown in [35], [36], head gestures are good visual cues, along with facial expressions, for depression analysis. In addition to the C3D model on tightly aligned faces, another C3D model is therefore trained on larger face regions, in order to explore the dynamics of faces and heads in depression analysis. From now on, the model trained on aligned, tight faces is called the C3D Tight-Face model, and the model trained on larger face regions is called the C3D Loose-Face model.

B. Learning on Spatiotemporal Sequences Using an RNN

Unlike feedforward neural networks, the RNN assumes that the inputs are not independent, so it learns from the input data sequence. RNNs use the Back-Propagation Through Time (BPTT) scheme, where the gradient at each output depends not only on the current time step but also on previous time steps. Given an input sequence $(x_1, x_2, \ldots, x_n)$, the RNN uses the following equations to calculate the output sequence $(o_1, o_2, \ldots, o_n)$ [26], [27]: $h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ and $o_t = g(W_{ho} h_t + b_o)$, where $h_t$ is the hidden state at step $t$, the three connection weight matrices are $W_{xh}$, $W_{hh}$, and $W_{ho}$, the bias vectors are $b_h$ and $b_o$, and $g$ is the activation function, such as a sigmoid or rectified linear unit (ReLU). In our case, we use ReLU as the activation function.
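As a concrete illustration of the recurrence above, a plain NumPy forward pass is sketched below. This is a generic vanilla RNN with ReLU, not the Theano implementation of [38] used in the paper; the sizes are only illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rnn_forward(xs, W_xh, W_hh, W_ho, b_h, b_o):
    """h_t = g(W_xh x_t + W_hh h_{t-1} + b_h), o_t = g(W_ho h_t + b_o), g = ReLU."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in xs:                                  # xs: sequence of input vectors
        h = relu(W_xh @ x + W_hh @ h + b_h)       # hidden state update
        outputs.append(relu(W_ho @ h + b_o))      # per-step output
    return outputs

# Illustrative sizes: 10-dim inputs, 30 hidden units, a scalar output per step,
# and a sequence of length 2 (two consecutive 16-frame clips).
rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((30, 10))
W_hh = np.eye(30)                                 # identity recurrent init as in [39]
W_ho = 0.1 * rng.standard_normal((1, 30))
b_h, b_o = np.zeros(30), np.zeros(1)
sequence = [rng.standard_normal(10), rng.standard_normal(10)]
print(rnn_forward(sequence, W_xh, W_hh, W_ho, b_h, b_o))
```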


We do not use spatial deep features as the input to the RNN for modeling the sequence, which is quite common in many other video analysis works using RNNs. Instead, we feed the deep spatiotemporal features learned by the C3D model to the RNN. In other words, the RNN in our approach is used to model consecutive clips, whereas previous methods used it to model consecutive frames. Our method thus considers not only the sequence of spatial information but also the sequence of temporal frames, such as consecutive facial sub-expressions and head movements. On the other hand, because we fine-tuned a C3D that was trained on inputs of 16 consecutive frames, our temporal features are limited to encoding only 16 frames if no RNN is used. However, since the RNN can deal with variable-length sequences, more dynamic context can be included by using longer spatiotemporal input sequences. According to our experiments, an RNN input sequence of length two, i.e., two consecutive 16-frame clips, achieves the best performance. The final depression prediction score for a video is the median of the RNN outputs over all possible pairs of consecutive 16-frame clips in the video.

Fig. 5. The structure of our RNN for encoding depression features.

Two different RNN models were trained, using the deep spatiotemporal features extracted from the C3D Tight-Face and Loose-Face models, separately. From now on, we call them the RNN-C3D Tight-Face and RNN-C3D Loose-Face models, respectively.

In addition, it is important to mention that we do not use the LSTM [37], as many previous works on action recognition do. We use the simple vanilla version of the RNN [38], since our experiments on the validation set show that the best result can be achieved using a simple RNN structure with a small number of hidden units, low-dimensional input features, and short sequences. Further, some recent works have compared simple RNN structures with LSTMs and found that they can yield comparable results in some tasks simply by using a special weight initialization method and ReLU units to avoid exploding gradient issues [39]. This inspired our selection of the RNN structure for the depression analysis task.

IV. EXPERIMENTS

The experiments are conducted on two datasets: the Audio/Visual Emotion Challenge (AVEC) 2013 [40] and 2014 [41] depression sub-challenge datasets. In this section, we briefly describe the two datasets first and then show the experimental results. Finally, we compare with other state-of-the-art methods.

A. AVEC2013 Depression Dataset

The AVEC2013 depression dataset [40] was released as part of the Audio-Visual Emotion Challenge and Workshop 2013 for the depression sub-challenge. It is a subset of the Audio-Visual Depressive Language Corpus (AViD-Corpus), containing 150 videos from 82 subjects aged 18 to 63 years. The dataset is split into three parts (training, development, and test), each with 50 video clips. The videos are about 25 minutes long on average and show people performing Human-Computer Interaction tasks while being recorded by a webcam and a microphone. Only the "Freeform" task is used in our evaluation. In this task, participants respond in German to one of a number of questions, such as: What is your favorite dish? What was your best gift, and why? Discuss a sad childhood memory. There is only one person per video. The videos are 24-bit color at a sampling rate of 30 Hz and a resolution of 640×480. The Beck Depression Inventory-II (BDI-II) [40] value, a depression severity indicator, is estimated for each video using a standardized depression questionnaire.

B. AVEC2014 Depression Dataset

The AVEC2014 depression dataset [41] is part of the Audio-Visual Emotion Challenge and Workshop 2014 and was used for the depression sub-challenge. The AViD-Corpus has 12 recorded Human-Computer Interaction tasks; as a subset of this corpus, AVEC2014 includes only two tasks, and only the "Freeform" recorded sessions are used in our evaluation. All participants are recorded between one and four times, with a period of two weeks between recordings. The mean age of the subjects is 31.5 years. The dataset has 150 videos, divided into 100 videos for training and development and 50 videos for testing. The average length of these videos is about 2 minutes, which is much shorter than the AVEC2013 videos. One sample from this dataset is shown in Fig. 1.

C. Experimental Settings

1) Video Preprocessing:
- Subsampling: First, we lower the frame rate from 30 fps (frames per second) to 6 fps, which, according to the comparison in [42], yields better video analysis since the model obtains more context from fewer frames. Because the AVEC2014 videos are short, the 6-fps streams are obtained from several different starting points to get more samples; e.g., 16 frames sampled every 4 frames starting at frame 1 are different from those starting at frame 2 in the same video. After that, each video is split into 16-frame subsequences, with 4-frame and 8-frame overlaps between two consecutive clips for AVEC2013 and AVEC2014, respectively. This leads to around 80K and 30K 16-frame training clips for AVEC2013 and AVEC2014, respectively (a sketch of this clip extraction is given below, after the preprocessing steps).


- Face Detection and Alignment: For the C3D Tight-Face model, we used OpenFace [43] to detect the facial landmarks. Then, in each frame, we cropped and aligned the facial region using the mouth, ear, and eye landmarks. This setting is used for both the training and testing data. On the other hand, all available frames from the AVEC2013 and AVEC2014 training datasets are used to train the C3D Loose-Face model.
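The sketch below illustrates the subsampling and clip-splitting step described above: frames are kept at 6 fps and grouped into overlapping 16-frame clips (overlap 4 for AVEC2013, 8 for AVEC2014), with an optional shifted starting point for the short AVEC2014 videos. The index arithmetic is our reading of the text, not the authors' exact preprocessing code.

```python
def subsample_and_split(num_frames, fps_in=30, fps_out=6,
                        clip_len=16, overlap=4, start=0):
    """Return overlapping clips as lists of original frame indices."""
    step = fps_in // fps_out                       # 30 fps -> 6 fps: keep every 5th frame
    kept = list(range(start, num_frames, step))    # subsampled frame indices
    stride = clip_len - overlap                    # 12 for AVEC2013, 8 for AVEC2014
    return [kept[i:i + clip_len]
            for i in range(0, len(kept) - clip_len + 1, stride)]

# e.g. a 2-minute AVEC2014 video recorded at 30 fps, second starting point:
clips = subsample_and_split(num_frames=2 * 60 * 30, overlap=8, start=1)
print(len(clips), clips[0][:4])                    # number of clips, first few indices
```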
2) C3D Training: To initialize the weights of the 3D convolutional layers, we use the weights of the C3D model pre-trained on the Sports-1M dataset and then fine-tuned on UCF101, released by [25]. As mentioned earlier, we changed the last fully connected layers from 4096 to 512 units to avoid over-fitting, and changed the softmax loss function to a Euclidean loss for the regression task. The implementation is based on a modified version of the Caffe toolbox [44] that supports 3D convolutional networks [25]. The learning rate is fixed to 10^-6 and the batch size to 35. The network converged after around 4000 iterations (2 epochs) for AVEC2013 and around 3000 iterations (4 epochs) for AVEC2014.
3) RNN Training: For each pair of consecutive 16-frame clips, we extract the C3D features from the Fc6 layer. The reason for choosing the Fc6 layer rather than Fc7 is that it immediately follows the last 3D convolutional layer, so its features hold some raw spatiotemporal information along with the depression labels. We apply PCA to reduce the feature dimension from 512 to 10. Then, a vanilla RNN [38] with 30 hidden units is used to learn the normalized depression labels from the input sequences of length 2. Specifically, the z-score is used to normalize the depression score labels, after computing the mean and standard deviation of the training labels, in order to make the net converge faster. We set the learning rate (LR) to 5×10^-5 and the LR decay to 0.9. ReLU is used as the activation function. The net converged after 10 epochs.
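The feature and label preparation described above can be sketched as follows with scikit-learn's PCA. The array names and file paths are hypothetical; this is not the authors' training code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical inputs: (num_clips, 512) Fc6 features of the training clips and
# the (num_videos,) BDI-II labels of the training videos.
fc6_train = np.load("fc6_train.npy")
bdi_train = np.load("bdi_train.npy")

# Reduce the 512-dim Fc6 features to 10 dimensions with PCA.
pca = PCA(n_components=10).fit(fc6_train)
rnn_inputs = pca.transform(fc6_train)              # (num_clips, 10)

# z-score the labels with the training mean and standard deviation so the net
# converges faster; test predictions are mapped back with the same statistics.
mu, sigma = bdi_train.mean(), bdi_train.std()
bdi_norm = (bdi_train - mu) / sigma

# A 30-unit vanilla RNN (LR 5e-5, LR decay 0.9, ReLU) would then be trained on
# length-2 sequences of `rnn_inputs` with `bdi_norm` as the targets.
```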
4) Testing and Performance Measure: For the C3D models, the final depression prediction for each test video is the median over all 16-frame sample clips in the video, while for the RNN-C3D models it is the median of the RNN outputs over all possible pairs of consecutive 16-frame clips in the video. Since the median is less affected by outliers than the mean, it is used to suppress outliers among the large number of samples per video. The overall performance is measured using the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). The RMSE is computed by $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$, and the MAE by $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$, where $N$ is the total number of video samples, $y_i$ denotes the ground truth of the $i$-th video sample, and $\hat{y}_i$ is the predicted value of the $i$-th video sample.
sample, and ŷi is the predicted value of the i-th video sample.
tight faces for depression analysis.

D. Performance of the C3D Models

The depression recognition results on AVEC2013 and AVEC2014 (both reported on the test sets) are shown in Tables I and II, respectively.

TABLE I
DEPRESSION RECOGNITION RESULTS ON AVEC2013 (TEST SET).

Our Methods                                                  RMSE    MAE
C3D Tight-Face                                                9.64   7.50
RNN-C3D Tight-Face                                            9.33   7.39
C3D Loose-Face                                               10.04   8.15
RNN-C3D Loose-Face                                            9.84   7.95
C3D Tight-Face & Loose-Face weighted merge (2 models)         9.49   7.40
RNN-C3D Tight-Face & Loose-Face weighted merge (2 models)     9.28   7.37

TABLE II
DEPRESSION RECOGNITION RESULTS ON AVEC2014 (TEST SET).

Our Methods                                                  RMSE    MAE
C3D Tight-Face                                                9.66   7.48
RNN-C3D Tight-Face                                            9.49   7.41
C3D Loose-Face                                                9.81   7.73
RNN-C3D Loose-Face                                            9.76   7.66
C3D Tight-Face & Loose-Face weighted merge (2 models)         9.32   7.29
RNN-C3D Tight-Face & Loose-Face weighted merge (2 models)     9.20   7.22

At first, we explored the performance of the individual C3D deep models, i.e., the Tight-Face and Loose-Face models. From Table I (AVEC2013), when only the C3D Tight-Face model is used, the RMSE and MAE are 9.64 and 7.50, respectively, while they are 10.04 and 8.15 with the C3D Loose-Face model. From Table II (AVEC2014), the RMSE and MAE are 9.66 and 7.48 with the C3D Tight-Face model, and 9.81 and 7.73 with the C3D Loose-Face model. From both the AVEC2013 and AVEC2014 results, one can see that the C3D Tight-Face model is better than the C3D Loose-Face model.

Then, we explored the performance of merging the C3D Tight-Face and Loose-Face models. In AVEC2014, the final merged result for a video is the median over all sub-clips from both models; the RMSE and MAE are 9.32 and 7.29, respectively. On the other hand, to obtain the best merged result for AVEC2013, where the videos are much longer, a double weight is given to the clip predictions of the C3D Tight-Face model. Using this C3D Tight-Face & Loose-Face weighted merge, the RMSE and MAE are 9.49 and 7.40, respectively, which improves over each single model.

These results highlight that tight faces are better than loose (larger) faces for analyzing depression, probably because there is more background noise in the loose face regions. However, the best result is achieved by merging the Tight-Face and Loose-Face models: the dynamics and appearance of the loose face data, which include the head gestures, are useful even though they are less accurate than the tight faces for depression analysis.
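One way to realize the weighted merge described above is to pool the per-clip predictions of both models, repeating the Tight-Face predictions twice for AVEC2013 (and once, i.e., an unweighted pool, for AVEC2014) before taking the median. This is our interpretation of the description, not released code.

```python
import numpy as np

def merged_video_prediction(tight_preds, loose_preds, tight_weight=2):
    """Median over the pooled per-clip predictions of the two models, with the
    Tight-Face predictions repeated `tight_weight` times (2 for AVEC2013,
    1 for AVEC2014 in our reading of the text)."""
    pooled = list(tight_preds) * tight_weight + list(loose_preds)
    return float(np.median(pooled))

# Made-up per-clip scores for one video:
print(merged_video_prediction([9.0, 10.0, 11.0], [14.0, 15.0]))                   # AVEC2013-style
print(merged_video_prediction([9.0, 10.0, 11.0], [14.0, 15.0], tight_weight=1))   # AVEC2014-style
```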


E. Performance of the RNN-C3D Models

After exploring the performance of the C3D models, we examined whether the RNN-C3D models, i.e., the RNN with a sequence of C3D features as input, as shown in Fig. 5, can further improve the results. Firstly, using the RNN-C3D Tight-Face models, RMSE values of 9.33 and 9.49 and MAE values of 7.39 and 7.41 are obtained on AVEC2013 and AVEC2014, respectively. The RNN-C3D Loose-Face model achieved RMSE values of 9.84 and 9.76 and MAE values of 7.95 and 7.66 on AVEC2013 and AVEC2014, respectively. Then, using the same merging idea as for the C3D models, we explored merging the RNN-C3D models as well. Using the RNN-C3D Tight-Face and Loose-Face models, RMSE values of 9.28 and 9.20 and MAE values of 7.37 and 7.22 are obtained on AVEC2013 and AVEC2014, respectively. From these results, we observe that the RNN can learn from the C3D spatiotemporal sequences and achieve better results than the C3D models alone. All the results of our methods are shown in Table I (AVEC2013) and Table II (AVEC2014), respectively.

Fig. 6. Visualization of the C3D Tight-Face and Loose-Face models, using the method from [45]: (a) C3D Tight-Face 16-frame saliency maps; (b) C3D Loose-Face 16-frame saliency maps. C3D captures appearance in the first and last few frames but then only attends to salient motion in the middle frames.

F. Comparison with Previous Methods

We also compare our approach with other visual-based depression analysis methods. All the results are shown in Tables III and IV.

TABLE III
DEPRESSION RECOGNITION RESULT COMPARISON TO OTHER METHODS ON AVEC2013 (TEST SET). ALL ARE BASED ON VIDEO DATA ONLY.

Methods               RMSE    MAE
Baseline [40]         13.61  10.88
Kächele et al. [46]   10.82   8.97
team-australia [9]    10.45   N/A
Brunel-Beihang [8]    11.19   9.14
Wen et al. [47]       10.27   8.22
Kaya et al. [21]       9.72   7.86
Zhu et al. [16]        9.82   7.58
Ma et al. [23]         8.91   7.26
Our Method             9.28   7.37

TABLE IV
DEPRESSION RECOGNITION RESULT COMPARISON TO OTHER METHODS ON AVEC2014 (TEST SET). ALL ARE BASED ON VIDEO DATA ONLY.

Methods               RMSE    MAE
Baseline [41]         10.86   8.86
InaoeBuap [32]         9.84   8.46
Brunel [30]           10.50   8.44
BU-CMPE [31]          10.27   8.20
Zhu et al. [16]        9.55   7.47
Jan et al. [22]        8.01   6.68
Our Method             9.20   7.22

It can be seen that our C3D Tight-Face model, even without the RNN, has comparable results on AVEC2014 and better results on AVEC2013 than [16], which fine-tunes two different deep networks trained separately on the dynamics and the appearance of the faces in the videos. The C3D deep net can model both the dynamics and the appearance of the faces for depression prediction.

Additionally, from the comparison, Ma et al. [23] achieve a better performance on AVEC2013 than our method. The novel part of their method is a two-stage scheme consisting of a coarse classifier followed by a fine regressor, using HOG, HOF, and MBH features; this method could also utilize the C3D features. On AVEC2014, Jan et al. [22] achieved a better depression prediction than ours.


Their method uses the Feature Dynamic History Histogram (FDHH) to capture the dynamics of the temporal movement from the sequence of features in the spatial space. Since it extracts FDHH from C2D features, the FDHH could also be applied to C3D features, which may be examined in the future. The method in [22] was evaluated on only one dataset, AVEC2014. Other than these two methods, the RNN-C3D Tight-Face & Loose-Face merged model, our best approach, performs better than the other state-of-the-art methods on both the AVEC2013 and AVEC2014 datasets.

Fig. 7. Comparison with the AVEC2013 competition results. Note that several of the listed methods utilize the audio data while our method only uses visual data. (V) indicates using video data, while (A) indicates audio only.

Fig. 8. Comparison with the AVEC2014 competition results. Using video only is marked with (V), while audio is marked with (A).

V. DISCUSSIONS

The AVEC2013 and AVEC2014 depression datasets were released as part of the Audio-Visual Emotion Challenge and Workshop 2013 and 2014, respectively. Even though our method uses only the visual data without making use of the audio, we compare our results to the competition results, many of which utilize both the audio and the visual data.

Compared to the competition results in the AVEC2013 and AVEC2014 challenges, as shown in Figures 7 and 8, our method has comparable performance to the top visual-based and audio-visual methods on both datasets. Since the bars in Figures 7 and 8 represent errors, lower values mean better results.

VI. CONCLUSION

We have proposed a new approach to perform depression analysis from visual data. We have shown that the convolutional 3D model can learn and detect salient spatiotemporal features for depression level prediction. We have also shown that the RNN model, as a global descriptor, can learn from the transitions of the local C3D spatiotemporal features and improve the results further. To the best of our knowledge, this is the first work to explore C3D for depression level analysis. Our best result is obtained by merging the models learned from tightly cropped and loose face regions. Experiments on both the AVEC2013 and AVEC2014 datasets have shown that our visual-based approach is promising compared to the state-of-the-art visual-based and audio-visual approaches. In the future, our RNN-C3D can be evaluated on more human behavior understanding problems.

ACKNOWLEDGMENT

The authors thank the organizers of AVEC2013/AVEC2014 for providing the data, Yu Zhu for useful discussions on [16], and the anonymous reviewers for their comments to improve the paper.

REFERENCES

[1] National Institute of Mental Health (NIH). (2016) Depression. [Online]. Available: https://www.nimh.nih.gov/health/topics/depression/index.shtml
[2] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Arlington: American Psychiatric Publishing, 2013.
[3] V. A. Lynch and J. B. Duval, Forensic Nursing Science. Elsevier Health Sciences, 2010.
[4] "A Cross-modal Review of Indicators for Depression Detection Systems," in Proc. of the Fourth Workshop on Computational Linguistics and Clinical Psychology, Vancouver, Canada. [Online]. Available: http://aclweb.org/anthology/W17-3101
[5] D. J. France, R. G. Shiavi, S. Silverman, M. Silverman, and D. M. Wilkes, "Acoustical properties of speech as indicators of depression and suicidal risk," Biomedical Engineering, IEEE Trans. on, vol. 47, no. 7, pp. 829–837, 2000.
[6] J. C. Mundt, A. P. Vogel, D. E. Feltner, and W. R. Lenderking, "Vocal acoustic biomarkers of depression severity and treatment response," Biological Psychiatry, vol. 72, no. 7, pp. 580–587, 2012.
[7] J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. Horwitz, B. Yu, and D. D. Mehta, "Vocal biomarkers of depression based on motor incoordination," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge. ACM, 2013, pp. 41–48.
[8] H. Meng, D. Huang, H. Wang, H. Yang, M. Al-Shuraifi, and Y. Wang, "Depression recognition based on dynamic facial and vocal expression features using partial least square regression," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge, 2013, pp. 21–30.
[9] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, and J. Epps, "Diagnosis of depression by behavioural signals: a multimodal approach," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge, 2013, pp. 11–20.
[10] Y. Yang, C. Fairbairn, and J. F. Cohn, "Detecting depression severity from vocal prosody," Affective Computing, IEEE Trans. on, vol. 4, no. 2, pp. 142–150, 2013.
[11] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen, M. T. Padilla, F. Zhou, and F. D. La Torre, "Detecting depression from facial actions and vocal prosody," in Affective Computing and Intelligent Interaction and Workshops. IEEE, 2009, pp. 1–7.
[12] L. Chao, J. Tao, M. Yang, Y. Li, and J. Tao, "Multi task sequence learning for depression scale prediction from video," in 2015 Int'l Conf. on Affective Computing and Intelligent Interaction, 2015, pp. 526–531.

[13] I. H. Jones and M. Pansa, "Some nonverbal aspects of depression and schizophrenia occurring during the interview," The Journal of Nervous and Mental Disease, vol. 167, no. 7, pp. 402–409, 1979.
[14] H. Ellgring, Non-verbal Communication in Depression. Cambridge University Press, 2007.
[15] R. L. Birdwhistell, "Toward analyzing American movement," Nonverbal Communication, pp. 134–143, 1974.
[16] Y. Zhu, Y. Shang, Z. Shao, and G. Guo, "Automated depression diagnosis based on deep networks to encode facial appearance and dynamics," IEEE Trans. on Affective Computing, vol. 8, no. 99, pp. 1–9, 2017.
[17] J. M. Girard, J. F. Cohn, M. H. Mahoor, S. M. Mavadati, Z. Hammal, and D. P. Rosenwald, "Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses," Image and Vision Computing, vol. 32, no. 10, pp. 641–647, 2014.
[18] N. Firth, "Computers diagnose depression from our body language," New Scientist, vol. 217, no. 2910, pp. 18–19, 2013.
[19] S. Alghowinem, R. Goecke, M. Wagner, G. Parker, and M. Breakspear, "Eye movement analysis for depression detection," in the 20th IEEE Int'l Conf. on Image Processing, 2013, pp. 4220–4224.
[20] A. Pampouchidou, P. Simos, K. Marias, F. Meriaudeau, F. Yang, M. Pediaditis, and M. Tsiknakis, "Automatic assessment of depression based on visual cues: A systematic review," IEEE Trans. on Affective Computing, vol. PP, no. 99, pp. 1–1, 2017.
[21] H. Kaya and A. A. Salah, "Eyes whisper depression: A CCA based multimodal approach," in Proc. of the 22nd ACM Int'l Conf. on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 961–964. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654978
[22] A. Jan, H. Meng, Y. F. A. Gaus, and F. Zhang, "Artificial intelligent system for automatic depression level analysis through visual and vocal expressions," IEEE Trans. on Cognitive and Developmental Systems, vol. PP, no. 99, pp. 1–1, 2017.
[23] X. Ma, D. Huang, Y. Wang, and Y. Wang, "Cost-sensitive two-stage depression prediction using dynamic visual clues," in Computer Vision – ACCV 2016, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, Eds. Cham: Springer Int'l Publishing, 2017, pp. 338–351.
[24] A. McPherson and C. R. Martin, "A narrative review of the Beck Depression Inventory (BDI) and implications for its use in an alcohol-dependent population," Journal of Psychiatric and Mental Health Nursing, vol. 17, no. 1, pp. 19–30, 2010. [Online]. Available: http://dx.doi.org/10.1111/j.1365-2850.2009.01469.x
[25] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: generic features for video analysis," CoRR, vol. abs/1412.0767, 2014. [Online]. Available: http://arxiv.org/abs/1412.0767
[26] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, "Recurrent neural networks for emotion recognition in video," in Proc. of the 2015 ACM Int'l Conf. on Multimodal Interaction.
[27] Y. Fan, X. Lu, D. Li, and Y. Liu, "Video-based emotion recognition using CNN-RNN and C3D hybrid networks," in Proc. of the 18th ACM Int'l Conf. on Multimodal Interaction.
[28] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in IEEE Conf. on CVPR, 2008, pp. 1–8.
[29] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proc. of the 6th ACM Int'l Conf. on Image and Video Retrieval, 2007, pp. 401–408.
[30] A. Jan, H. Meng, Y. F. A. Gaus, F. Zhang, and S. Turabzadeh, "Automatic depression scale prediction using facial expression dynamics and regression," in Proc. of the ACM 4th Int'l Workshop on Audio/Visual Emotion Challenge, 2014, pp. 73–80.
[31] H. Kaya, F. Çilli, and A. A. Salah, "Ensemble CCA for continuous emotion prediction," in Proc. of the ACM 4th Int'l Workshop on Audio/Visual Emotion Challenge, 2014, pp. 19–26.
[32] H. Pérez Espinosa, H. J. Escalante, L. Villaseñor-Pineda, M. Montes-y Gómez, D. Pinto-Avedaño, and V. Reyez-Meza, "Fusing affective dimensions and audio-visual features from segmented video for depression recognition," in Proc. of the ACM 4th Int'l Workshop on Audio/Visual Emotion Challenge, 2014, pp. 49–55.
[33] R. Gupta, S. Sahu, C. Espy-Wilson, and S. Narayanan, "An affect prediction approach through depression severity parameter incorporation in neural networks," pp. 3122–3126, 08 2017.
[34] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conf. on CVPR, June 2009, pp. 248–255.
[35] J. Joshi, A. Dhall, R. Goecke, and J. F. Cohn, "Relative body parts movement for automatic depression analysis," in 2013 Humaine Association Conf. on Affective Computing and Intelligent Interaction, Sept 2013, pp. 492–497.
[36] J. Joshi, R. Goecke, G. Parker, and M. Breakspear, "Can body expressions contribute to automatic depression analysis?" in 2013 10th IEEE Int'l Conf. and Workshops on Automatic Face and Gesture Recognition (FG), April 2013, pp. 1–7.
[37] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] G. Taylor, "Recurrent neural network implemented with Theano," https://github.com/gwtaylor/theano-rnn, 2012.
[39] Q. V. Le, N. Jaitly, and G. E. Hinton, "A simple way to initialize recurrent networks of rectified linear units," CoRR, vol. abs/1504.00941, 2015. [Online]. Available: http://arxiv.org/abs/1504.00941
[40] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, "AVEC 2013: the continuous audio/visual emotion and depression recognition challenge," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge. ACM, 2013, pp. 3–10.
[41] M. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic, "AVEC 2014: 3D dimensional affect and depression recognition challenge," in Proc. of the 4th Int'l Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 3–10.
[42] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," CoRR, vol. abs/1503.08909, 2015. [Online]. Available: http://arxiv.org/abs/1503.08909
[43] T. Baltrusaitis, P. Robinson, and L.-P. Morency, "Constrained local neural fields for robust facial landmark detection in the wild," in Proc. of the 2013 IEEE Int'l Conf. on Computer Vision Workshops.
[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[45] R. Kotikalapudi and contributors, "keras-vis," https://github.com/raghakot/keras-vis, 2017.
[46] M. Kächele, M. Glodek, D. Zharkov, S. Meudt, and F. Schwenker, "Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression," in Proc. of the 3rd Int'l Conf. on Pattern Recognition Applications and Methods.
[47] L. Wen, X. Li, G. Guo, and Y. Zhu, "Automated depression diagnosis based on facial dynamic analysis and sparse coding," IEEE Trans. on Information Forensics and Security, vol. 10, no. 7, pp. 1432–1441, 2015.

Mohamad Al Jazaery received the B.E. degree in computer science from Damascus University, Damascus, Syria, in 2012. He is currently a CS graduate student at West Virginia University (WVU), working as a Research Assistant in the Computer Sciences Department. His research interests include computer vision, deep learning, machine learning, and face recognition.

Guodong Guo (M'07-SM'07) received the B.E. degree in automation from Tsinghua University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in computer science from the University of Wisconsin-Madison, Madison, WI, USA. He is an Associate Professor with the Department of Computer Science and Electrical Engineering, West Virginia University (WVU), Morgantown, WV, USA. In the past, he visited and worked in several places, including INRIA, Sophia Antipolis, France; Ritsumeikan University, Kyoto, Japan; Microsoft Research, Beijing, China; and North Carolina Central University. He authored a book, Face, Expression, and Iris Recognition Using Learning-based Approaches (2008), co-edited two books, Mobile Biometrics (2017) and Support Vector Machines Applications (2014), and published about 100 technical papers. His research interests include computer vision, machine learning, and multimedia. He received the North Carolina State Award for Excellence in Innovation in 2008, Outstanding Researcher (2017-2018, 2013-2014), and New Researcher of the Year (2010-2011) at CEMR, WVU. He was selected the "People's Hero of the Week" by BSJB under the Minority Media and Telecommunications Council (MMTC) on July 29, 2013. Two of his papers were selected as "The Best of FG'13" and "The Best of FG'15", respectively.
