Multimodal Spatiotemporal Representation For Automatic Depression Level Detection
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 14, NO. 1, JANUARY-MARCH 2023
Abstract—Physiological studies have shown that there are some differences in speech and facial activities between depressive and
healthy individuals. Based on this fact, we propose a novel spatio-temporal attention (STA) network and a multimodal attention feature
fusion (MAFF) strategy to obtain the multimodal representation of depression cues for predicting the individual depression level.
Specifically, we first divide the speech amplitude spectrum/video into fixed-length segments and input these segments into the STA
network, which not only integrates the spatial and temporal information through attention mechanism, but also emphasizes the audio/
video frames related to depression detection. The audio/video segment-level feature is obtained from the output of the last full
connection layer of the STA network. Second, this article employs the eigen evolution pooling method to summarize the changes of
each dimension of the audio/video segment-level features to aggregate them into the audio/video level feature. Third, the multimodal
representation with modal complementary information is generated using the MAFF and input into the support vector regression
predictor for estimating depression severity. Experimental results on the AVEC2013 and AVEC2014 depression databases illustrate the
effectiveness of our method.
Index Terms—Multimodal depression detection, spatio-temporal attention, audio/video segment-level feature, eigen evolution pooling, audio/
video level feature, multimodal attention feature fusion
1949-3045 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: MIT-World Peace University. Downloaded on March 01,2023 at 09:42:04 UTC from IEEE Xplore. Restrictions apply.
NIU ET AL.: MULTIMODAL SPATIOTEMPORAL REPRESENTATION FOR AUTOMATIC DEPRESSION LEVEL DETECTION 295
TABLE 1
The BDI-II Score and Corresponding Depression Degree
Fig. 2. The illustration of the multimodal spatiotemporal representation framework for automatic depression level detection. Our method first inputs
spectrum/video segments into the STA network and takes the output of the last full connection layer as the ASLF or VSLF. Then, we use the EEP
method to aggregate these ASLFs or VSLFs into the ALF or VLF. Finally, the multimodal representation is generated using the MAFF strategy and
input into the SVR to predict the BDI-II score.
combined the Dirichlet process with the FV encoding to automatically learn the number of Gaussian components from the observed data and obtained the video level feature for depression detection. In the work of [31], Niu et al. presented that average-pooling and max-pooling are special cases of ℓp-norm pooling. Thus, they combined the ℓp-norm pooling with the Least Absolute Shrinkage and Selection Operator (LASSO) to find the suitable parameter p for the task of depression detection. Moreover, the aggregation result was obtained by calculating the ℓp-norm of each dimension of the segment-level features.

2.3 Multimodal Fusion Strategies for Depression Detection

Gupta et al. [45] combined the audio baseline features of the AVEC2014 with additional acoustic features proposed in [54] to predict the depression score. Meanwhile, they combined the video baseline features provided by the AVEC2014 with some additional video representations (including LBP-TOP, the optical flow feature and the motion of the facial landmarks) to estimate the depression score. At last, the multimodal result was obtained by linearly fusing the prediction scores of the audio and video modalities. In the work of [30], on the one hand, Perez et al. generated predictions for affective dimensions and used them as the attributes of the audio segment-level feature. On the other hand, they used the facial landmarks to extract the motion and velocity information from the video segments. Finally, a majority strategy was implemented for the predicted results from all segments.

Jain et al. [20] used the Principal Component Analysis (PCA) to reduce the dimension of LLDs and FV encoding to obtain the audio feature. Meanwhile, they used the PCA and FV encoding to process the LBP-TOP and dense trajectory features to gain the video feature. The multimodal representation was generated by concatenating the audio and video features for depression detection. In [6], the audio feature contained LLDs and MFCCs. The video feature contained some hand-crafted descriptors (i.e., LBP, LPQ and Edge Orientation Histogram) and the deep representation extracted by VGG-Face. Finally, the concatenation of these two modal features was input into Linear Regression (LR) and Partial Linear Regression (PLR), respectively. The results of the two regressors were weighted as the corresponding individual depression score.

Different from the above approaches, this paper extracts the high-level representation of a spectrum or video segment with the STA network, which uses the attention mechanism to generate the spatiotemporal representation and emphasize the frames related to depression detection. For aggregation, we employ the EEP method to summarize the changes of each dimension of the segment-level features. In addition, we propose the MAFF fusion strategy to capture the complementary information between modalities to improve the quality of the multimodal representation. Experimental results on the AVEC2013 and AVEC2014 depression databases illustrate the effectiveness of our method.

3 MULTIMODAL SPATIOTEMPORAL REPRESENTATION FRAMEWORK FOR DEPRESSION DETECTION

The fact that there are some differences in speech and facial activity between depressive and healthy individuals has been confirmed by physiological studies [3], [4]. Therefore, this paper proposes a multimodal representation framework to predict the individual depression level. In our approach, we first divide the long-term speech amplitude spectrum/video into fixed-length segments and input them into the STA network for obtaining the ASLFs and VSLFs. Then, the EEP method is employed to aggregate the ASLFs and VSLFs into the ALF and VLF. Finally, the multimodal representation with modal complementarity is generated by the proposed MAFF strategy and input into the SVR for predicting the BDI-II score. Fig. 2 shows the whole flow of our framework.
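As a rough illustration of the segmentation step at the head of this pipeline, the following sketch (assuming NumPy arrays with time on the last axis; the 64-frame length and 50 percent overlap mirror the audio settings reported in Section 4.2) divides a long feature sequence into fixed-length segments:

```python
import numpy as np

def split_segments(x, seg_len, shift):
    """Divide a long feature sequence into fixed-length, overlapping
    segments; x has shape (n_features, T), with time on the last axis."""
    return [x[:, s:s + seg_len]
            for s in range(0, x.shape[1] - seg_len + 1, shift)]

# A 129-dimensional amplitude spectrum of 320 frames, segmented into
# 64-frame segments with a 32-frame shift (50 percent overlap).
spectrum = np.zeros((129, 320))
segments = split_segments(spectrum, seg_len=64, shift=32)
```

Each segment is then processed independently by the STA network, so the same helper applies unchanged to a video represented as a frame-feature sequence.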
Fig. 3. The STA network is to extract the ASLF. "⊗" means the matrix multiplication. FC refers to the full connection layer. And the output of FC3 (i.e., the part surrounded by the red box) is the ASLF. The 2D CNN, LSTM and FC layers are trained jointly with the loss function of RMSE.
3.1 The STA Network for Audio/Video Segment-Level Feature Extraction

In this section, we describe in detail the process of extracting the ASLF and VSLF using the STA network. Figs. 3 and 4 show the specific network architectures.

For the ASLF extraction, on the one hand, we use the 2D CNN to examine the spatial structure of an amplitude spectrum segment and record the output as o_A^C ∈ R^{n×1}, where n is the dimension of the spatial feature. On the other hand, the LSTM is adopted to extract the temporal sequence representation, which characterizes the dynamic changes in the amplitude spectrum segment and is denoted as o_A^L ∈ R^{n×T_A}, where n and T_A are the dimension and length of the output sequence of the LSTM, respectively. Then, we can obtain the spatiotemporal attention weight w_A ∈ R^{T_A×1} by Eq. (1). Moreover, the result r_A ∈ R^{n×1} of the spatiotemporal attention can be calculated using Eq. (3). At last, we input r_A into three full connection layers and take the output of the last full connection layer (i.e., the part circled by the red box in Fig. 3) as the ASLF. Note that, in the process of model training, the 2D CNN, LSTM and full connection layers are jointly trained with the loss function of RMSE as shown in Eq. (13).

w_A = softmax(ŵ_A) = [exp(ŵ^1), …, exp(ŵ^{T_A})]ᵀ / Σ_{t_A=1}^{T_A} exp(ŵ^{t_A}),   (1)

where "ᵀ" refers to matrix transposition and ŵ_A ∈ R^{T_A×1} is calculated by Eq. (2).

ŵ_A = (o_A^L)ᵀ ⊗ o_A^C = [ŵ^1, …, ŵ^{T_A}]ᵀ,   (2)

where "⊗" represents the matrix multiplication operation, which captures the correlation between o_A^C and each frame feature of o_A^L for integrating the spatiotemporal information and generating the spatiotemporal attention weight.

r_A = o_A^L ⊗ w_A.   (3)

For the VSLF extraction, if the video segment is regarded as a kind of 3D data, the 3D CNN can be used to extract the spatial information of the video segment, just like the 2D CNN is able to examine the spatial structure of an image. Therefore, as shown in Fig. 4, we use the 3D CNN to extract the spatial feature of the video segment and record it as o_V^C ∈ R^{n×1}, where n is the dimension of the spatial feature. For extracting the temporal sequence representation, we first train the 2D CNN using video frames with the loss function of RMSE and regard the output of the last full connection layer (i.e., the FC_3 layer in Fig. 4) as the encoding result of a video frame. Note that the process of training the 2D CNN is independent and the label of each frame is the corresponding video label. In this way, each video segment can be encoded into a vector sequence ô_V^L ∈ R^{n×T_V}, where n and T_V are the dimension of the encoding vector and the number of frames in the video segment, respectively. Moreover, we input this vector sequence into the LSTM to extract the temporal feature and refer to the result as o_V^L ∈ R^{n×T_V}. Then, similar to the procedure of extracting the ASLF, we replace o_A^L and o_A^C in Eq. (2) with o_V^L and o_V^C to obtain ŵ_V ∈ R^{T_V×1}, which has the same property as ŵ_A. The spatiotemporal attention weight w_V ∈ R^{T_V×1} can be gained by replacing ŵ_A in Eq. (1) with ŵ_V. Similarly, the result r_V ∈ R^{n×1} of the spatiotemporal attention can be obtained by Eq. (3) using o_V^L and w_V. After that, r_V is fed into three full connection layers and the output of the last full connection layer (i.e., the part circled by the red box in Fig. 4) is considered as the VSLF. Note that, in the process of model training, the 2D CNN is trained independently; then, the 3D CNN, LSTM and full connection layers are jointly trained with the loss function of RMSE.

Intuitively, from the extraction processes of the ASLF and VSLF, one can see that we use the attention mechanism to embed the spatial feature (o_A^C or o_V^C) into the temporal sequence representation (o_A^L or o_V^L) for integrating the spatiotemporal information of the amplitude spectrum or video segment. At the same time, the STA network can emphasize the audio or video frames related to depression detection by assigning different weight coefficients.

3.2 The EEP Method for Audio/Video Level Feature Generation

It is necessary to predict an individual depression level through examining the long-term performance in the complete audio and video. In other words, we need to aggregate the ASLFs and VSLFs into the ALF and VLF. In this paper, we employ the EEP method to summarize the changes of each
Fig. 4. The STA network is to extract the VSLF. "⊗" means the matrix multiplication. FC refers to the full connection layer. And the output of FC3 (i.e., the part surrounded by the red box) is the VSLF. Note that the 2D CNN is first trained independently using the video frames with the loss function of RMSE. Then, the 3D CNN, LSTM and full connection layers are jointly trained using the loss function of RMSE.
dimension of these segment-level features to obtain the representations of the corresponding audio and video.

In a formal way, we let the matrix composed of the segment-level features be S = [s_1, …, s_M] ∈ R^{n×M}, where n is the dimension of the segment-level feature (i.e., ASLF or VSLF) and M is the number of segments divided from the corresponding audio or video. If we denote d_i ∈ R^{1×M} (i = 1, …, n) as the ith row of S, then S = [d_1ᵀ, …, d_nᵀ]ᵀ. In other words, d_i can be treated as a time series of the segment-level features in the ith dimension. In order to maintain the changes of S in each dimension as much as possible, we attempt to find K (< M) standard orthogonal vectors g_1, …, g_K (g_k ∈ R^{M×1}, k = 1, …, K) to form the base matrix G ∈ R^{M×K} for reconstructing the d_iᵀ (i = 1, …, n). In this way, the coordinate of d_iᵀ under the base matrix G can be expressed as Gᵀd_iᵀ. Therefore, we can optimize the objective function shown in Eq. (4) to reconstruct d_iᵀ for obtaining the required standard orthogonal base matrix G* = [g_1, …, g_K] ∈ R^{M×K}.

G* = arg min_{GᵀG = I_K} Σ_{i=1}^{n} ‖G Gᵀ d_iᵀ − d_iᵀ‖²,   (4)

where I_K is the identity matrix of order K.

To solve for G*, we equivalently convert Eq. (4) to Eq. (5) and rearrange it to get Eq. (6).

G* = arg min_{GᵀG = I_K} Σ_{i=1}^{n} Σ_{k=1}^{K} (−g_kᵀ d_iᵀ d_i g_k).   (5)

G* = arg max_{GᵀG = I_K} Σ_{k=1}^{K} g_kᵀ (Sᵀ S) g_k.   (6)

Furthermore, according to the orthogonal decomposition theorem of real symmetric matrices, we get Eq. (7) and bring it into Eq. (6) to obtain Eq. (8).

Sᵀ S = Σ_{m=1}^{M} λ_m q_m q_mᵀ,  λ_1 ≥ … ≥ λ_M,   (7)

where λ_m (m = 1, …, M) is the mth eigenvalue of SᵀS and q_m ∈ R^{M×1} is the corresponding eigenvector. Note that q_mᵀ q_m = 1 and q_rᵀ q_t = 0 for r ≠ t.

G* = arg max_{GᵀG = I_K} Σ_{k=1}^{K} Σ_{m=1}^{M} λ_m (q_mᵀ g_k)².   (8)

In this way, according to the property that the arithmetic mean does not exceed the square mean, we scale the objective of Eq. (8) to Eq. (9).

Σ_{k=1}^{K} Σ_{m=1}^{M} λ_m (q_mᵀ g_k)² ≤ Σ_{k=1}^{K} Σ_{m=1}^{M} M λ_m Σ_{i=1}^{M} (q_{mi} g_{ki})²,   (9)

where q_{mi} and g_{ki} are the ith elements of q_m and g_k, respectively. Note that the equal sign holds if and only if q_m = g_k. In other words, g_1, …, g_K are K vectors among q_1, …, q_M. Thus, G* is composed of the eigenvectors corresponding to the first K eigenvalues of SᵀS, because Eq. (8) obtains its maximum value Σ_{k=1}^{K} λ_k at this time.

In this paper, we find that the ratio of the largest eigenvalue λ_max of SᵀS to the sum of all eigenvalues is more than 85 percent. Since the small eigenvalues correspond to the noise component [11], we use the projection of S on g_max (i.e., S g_max) as the result of segment-level feature aggregation, where g_max is the eigenvector corresponding to the largest eigenvalue λ_max. In short, if S is the ASLFs or VSLFs, then S g_max is the corresponding ALF or VLF.

From the above process, we can see that the EEP method maintains the changes of the feature sequence in each dimension as much as possible through Eq. (4). That is to say, the ALF summarizes the dynamic changes of all spectrum segments in the range of the complete audio. The VLF also has the same property.
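As an illustration, the aggregation described above can be sketched as follows (a minimal NumPy sketch; only the projection onto the single top eigenvector g_max is kept, as in the text):

```python
import numpy as np

def eep(S):
    """Eigen evolution pooling sketch: project the segment-level feature
    matrix S (n x M, one column per segment) onto the eigenvector of
    S^T S with the largest eigenvalue (Eqs. (4)-(9))."""
    eigvals, eigvecs = np.linalg.eigh(S.T @ S)  # eigenvalues in ascending order
    g_max = eigvecs[:, -1]                      # eigenvector for lambda_max
    return S @ g_max                            # the (n,) audio/video level feature

rng = np.random.default_rng(1)
S = rng.standard_normal((64, 20))   # 20 segment-level features of dimension 64
feature = eep(S)
```

Because g_max is a unit vector, the squared norm of the pooled feature equals λ_max, which is the quantity the 85 percent energy ratio above refers to.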
Fig. 5. The multimodal representation with modal complementary information (i.e., the part surrounded by a red box) is generated by the MAFF strategy. VAAF and AAVF are modal complementary information.

3.3 The MAFF for Multimodal Representation With Complementary Information

As mentioned above, depressive patients differ from healthy individuals in speech and facial activity [3]. Therefore, it is reasonable to predict the individual depression level via fusing the audio and video features. For this purpose, this paper proposes a novel multimodal fusion strategy named MAFF to capture the complementary information between modalities and improve the quality of the representation of depression cues. Fig. 5 shows the fusion process.

In particular, we adopt the attention mechanism between the ASLFs and the VLF to obtain the VAAF. Eq. (10) gives the calculation formula of the VAAF. Similarly, we only need to replace ASLF_i in Eq. (11) and VLF in Eq. (12) with VSLF_i and ALF to obtain the AAVF. Note that the length of all features is normalized to 1 before the fusion. Then, the multimodal fusion can be generated by concatenating the VLF, VAAF, AAVF and ALF.

VAAF = S_ASLF ⊗ a,   (10)

where S_ASLF is obtained through Eq. (11) and aᵀ = [a_1, …, a_{N_AS}] can be calculated using Eq. (12).

S_ASLF = [ASLF_1, …, ASLF_{N_AS}],   (11)

where ASLF_i (i = 1, …, N_AS) refers to the feature corresponding to the ith spectrum segment and N_AS is the number of segments divided from the speech amplitude spectrum.

a_i = e^{⟨VLF, ASLF_i⟩} / Σ_{j=1}^{N_AS} e^{⟨VLF, ASLF_j⟩},  i = 1, 2, …, N_AS,   (12)

where ⟨·,·⟩ means the inner product operation of two vectors and the VLF is obtained by aggregating the VSLFs using the EEP method as mentioned above.

Mathematically, Eqs. (3) and (10) have the same expression. But unlike Eq. (3), which integrates the spatiotemporal information and emphasizes key frames, the a in Eq. (10) actually reflects the similarity between the two modalities. Therefore, the VAAF contains the information in the audio modality that is similar to the video modality. In other words, the VAAF provides the supplement of the video modality to the audio modality, and the experiments in Section 5.3 also confirm this view. Likewise, the AAVF has the same property. Therefore, in this paper, we take the VAAF and AAVF as the modal complementary information. Furthermore, the multimodal representation with complementary information is generated by concatenating the VAAF, VLF, ALF and AAVF, and is input into the SVR to predict the individual BDI-II score.

4.1 Databases and Evaluation Measure

In this paper, all experiments are conducted on two publicly available datasets, i.e., the AVEC2013 and AVEC2014 depression datasets.

The AVEC2013 depression corpus is a subset of the audio-visual depression language corpus (AViD-Corpus), which is recorded by a webcam and microphone. Specifically, each subject needs to perform 14 different tasks according to the instructions on the computer screen. These 14 tasks include sustained vowel phonation, sustained loud vowel phonation, sustained smiling vowel phonation, speaking out loud while solving a task, counting from 1 to 10, etc. The mean age of the subjects is 31.5 years with a standard deviation of 12.3 years and a range of 18-63 years. All subjects are German speakers. There are 150 video clips from 82 subjects, and these recordings have been divided into three parts by the publisher: training, development and test set. Each has 50 samples. It should be noted that, in the AVEC2013 database, all behavior performance of a subject is included in the same recording without being separated by the publisher when he or she performs the tasks. These videos are set to 30 frames per second with a resolution of 640×480 pixels. Each sample is labeled with a BDI-II score.

The AVEC2014 depression corpus is a subset of the AVEC2013 corpus. Thus, they are similar in collection settings, age distribution and language characteristics. In the AVEC2014 database, only two tasks named "Northwind" and "FreeForm" are involved. For "Northwind", the subjects need to read an excerpt of the fable "Die Sonne und der Wind" (The North Wind and the Sun) in German. For "FreeForm", the subjects respond to one of a number of questions, such as "What is your favorite dish?", or discuss a sad childhood memory in German. In each task, there are 150 recordings from 84 subjects. And these recordings are divided equally into training, development and test set. The duration ranges from 6 seconds to 4 minutes. In our experiments, unless otherwise specified, we combine the training, development and test sets of these two tasks as the new database. Namely, there are 100 samples in the training, development and test set, respectively. In addition, it is not difficult to predict that there would be similar findings in both databases due to their similarity. The following experimental results confirm this view.

In this paper, we conduct experiments on the AVEC2013 and AVEC2014 databases, respectively. For each database, the training set is to train the model. The development set is to adjust experimental parameters and validate the effectiveness
of each module in our model. The test set is used for comparing our method with other works.

Fig. 6. Depression detection performance on the development sets of the AVEC2013 (a) and AVEC2014 (b) using the spectrum segments with different lengths.

Fig. 7. Depression detection performance on the development sets of AVEC2013 using four sampling schemes. The index1, index2, index3, index4 and index5 refer to sampling frame 15, frame (8,15,23), frame (5,10,15,20,25), frame (3,7,11,15,19,23,27) in every 30 consecutive frames.

At present, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are widely used to evaluate the performance of depression detection algorithms. Eqs. (13) and (14) show the calculation formulas for RMSE and MAE, where N denotes the number of subjects, and y_i and ŷ_i are the ground truth and predicted BDI-II score of the ith subject, respectively.

RMSE = √( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ).   (13)

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|.   (14)

4.2 Experimental Setup

As described above, this paper extracts the representation of depression cues from the audio and video modalities. For the audio modality, we sample the waveforms at 8 kHz and generate the 129-dimensional normalized amplitude spectrums using a short-time Fourier transform with a 32 ms Hamming window and a 16 ms frame shift for the AVEC2013 and AVEC2014 databases. To find the suitable length of a spectrum segment, we use spectrum segments with different lengths to conduct experiments, where the STA network in Fig. 3 is to extract the ASLFs, the EEP aggregates them into the ALF and the SVR is for predicting. Note that the overlapping of two adjacent segments is 50 percent, the label of each segment is the BDI-II score of the corresponding audio, and the structure of the STA network in Fig. 3 will be described below. The prediction performance on the development sets of the AVEC2013 and AVEC2014 databases is shown in Fig. 6. Hence, we set the length of a spectrum segment as 64 frames (about 1 second) and the shift as 32 frames (about 0.5 seconds) for the AVEC2013 and AVEC2014 databases.

For the video modality, we extract the facial images in the videos using Dlib [18] and resize them to 128×128 for the AVEC2013 and AVEC2014 databases. The 2D CNN in Fig. 4 is trained independently using video frames obtained through different sampling schemes, where the label of each video frame is the BDI-II score of the corresponding video. After that, we input the video segment with 60 frames into the STA network (as shown in Fig. 4) to extract the VSLFs; then the EEP aggregates them into the VLF and the SVR is for predicting. Note that the overlapping of two adjacent video segments is 50 percent. The structures of the 2D CNN and STA network in Fig. 4 will be described below. The detection performance on the development set of the AVEC2013 database is shown in Fig. 7. As shown, when the 15th frame is sampled in every 30 consecutive frames, the best detection result is obtained. Note that, for the AVEC2014 database, we fine-tune the 2D CNN trained using the AVEC2013 database due to their similarity. Besides, similar to the process of optimizing the length of the spectrum segment, Fig. 8 presents comparative experiments for finding the suitable length of a video segment, where the overlapping is 50 percent and video frames are encoded using the trained 2D CNN. As shown, we set the length of a video segment and shift to 60 frames (about 2 seconds) and 30 frames (about 1 second) for the AVEC2013 and AVEC2014 databases.

For the STA network shown in Fig. 3 for extracting the ASLF, the 2D CNN has two convolution layers with the same settings: 8 kernels with a size of 5×11 and a stride size of (3, 3), and 64 neurons in the FC layer. The dimension of the LSTM output sequence is 64. Three fully connected layers (i.e., FC1, FC2 and FC3) with 64 neurons follow. The Sigmoid is used as the activation function in the FC3 layer and the ReLU is used in the other layers. The optimizer is SGD with batch size 32 and the learning rate is 0.0002. The loss function is RMSE. The network structure for extracting the ASLF is the same for the AVEC2013 and AVEC2014 databases.

For the STA network shown in Fig. 4 for extracting the VSLF, the 3D CNN has two convolution layers, which have 8 kernels with a size of 3×3×3 and a stride size of (2, 2, 2). The FC layer has 64 neurons. For the frame encoding, the 2D CNN has three convolution layers with the same settings: 8 kernels with a size of 3×3 and a stride size of (2, 2). And the layers FC_1, FC_2 and FC_3 all have 64 neurons. The training process for the frame encoding is independent and the loss function is RMSE. The dimension of the LSTM output sequence is 64. Three fully connected layers (i.e., FC1, FC2 and FC3) have 64 neurons. The Sigmoid is used as the activation function in the FC3 and FC_3 layers. The ReLU is used in the other layers. The optimizer is SGD with batch size 32 and the learning rate is 0.0002. The loss function for the VSLF extraction model is RMSE. The network structure for extracting the VSLF is the same for the AVEC2013 and AVEC2014 databases.

5 RESULTS AND DISCUSSION

In this section, we show the effectiveness of the STA network, EEP and MAFF for depression detection. In addition, we also investigate the influence of different tasks on the detection accuracy. Finally, comparisons between our method and previous works are presented.
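As a quick illustration, the two evaluation measures of Eqs. (13) and (14), used in all the comparisons that follow, can be sketched as follows (a minimal NumPy sketch; the score values are hypothetical):

```python
import numpy as np

def rmse(y, y_hat):
    """Eq. (13): root mean square error over N subjects."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Eq. (14): mean absolute error over N subjects."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))

# Hypothetical ground-truth and predicted BDI-II scores.
y_true = [10, 20, 30]
y_pred = [12, 18, 33]
```

Note that RMSE penalizes large errors more heavily than MAE, which is why both are usually reported together on these benchmarks.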
TABLE 4
Depression Detection Performance Using Different Aggregation Methods on the Development Sets of AVEC2013 and AVEC2014

Databases   Aggregation Methods   Audio            Video
                                  RMSE    MAE      RMSE    MAE
AVEC2013    MP                    10.02   8.23     9.86    7.77
            AP                    9.93    7.98     9.84    7.56
            FV                    9.85    7.90     9.57    7.21
            EEP                   9.26    7.85     8.54    7.70
AVEC2014    MP                    9.97    8.57     9.62    7.29
            AP                    9.88    8.29     9.54    7.41
            FV                    9.76    7.15     9.30    7.10
            EEP                   9.11    7.60     8.31    6.73

MP, AP and FV refer to Max-Pooling, Average-Pooling and Fisher Vector encoding, respectively.

TABLE 6
Depression Detection Performance Using Different Information on the Development Sets of AVEC2013 and AVEC2014

Different fusion information   AVEC2013         AVEC2014
                               RMSE    MAE      RMSE    MAE
ALF                            9.26    7.85     9.11    7.60
VAAF                           8.86    7.76     8.79    7.57
ALF+VAAF                       8.49    7.67     8.61    6.72
VLF                            8.54    7.70     8.31    6.73
AAVF                           8.73    7.97     9.08    7.35
VLF+AAVF                       8.42    7.58     8.35    6.69
VCA                            8.38    7.27     8.32    7.29
VCA+VAAF+AAVF                  8.10    6.38     8.00    6.42

VCA means to concatenate the VLF with the ALF. And "+" refers to the concatenation operation.
the depression score in these experiments. As shown, the EEP is more suitable for the depression detection task than the other methods. This is because max-pooling or average-pooling only calculates the maximum or average value of the sequence, which is not sensitive to the temporal order of the sequence. For FV encoding, each sample is treated independently when creating the dictionary, so the temporal relationship among samples is ignored. Different from them, the EEP maintains as much dynamic information as possible by reconstructing the changes of each dimension of the segment-level features, which leads to better accuracy.

Furthermore, to illustrate the effectiveness of combining the feature aggregation with the SVR, we compare the prediction performance with an end-to-end approach in Table 5. In the end-to-end approach, we input spectrum segments (or video segments) into the STA network to obtain the predicted results of these segments and take the median of these results as the BDI-II score corresponding to the long-term spectrum (or video), as in [22]. Note that, in the end-to-end approach, the training process of the STA network is the same as in our method; that is, the network settings are the same in these experiments. From Table 5, one can observe that the combination of feature aggregation with the SVR provides better experimental accuracy. The reason is that the EEP method can summarize the temporal evolution of each dimension of all segment-level features, so as to completely characterize the long-term spectrum or video. But the median only selects the prediction result of a certain spectrum or video segment, so the dynamic information in the range of the complete audio or video cannot be captured.

5.3 Depression Detection Performance Using Different Information

In this section, we validate the detection performance of different information on the development sets of the AVEC2013 and AVEC2014 databases. Table 6 gives the experimental results. Note that the STA network, EEP and SVR are for the segment-level feature extraction, aggregation and prediction.

From Table 6, one can see that the detection performance of the VAAF is better than that of the ALF. The reason is that more pronounced depression cues are contained in facial activity than in speech. Thus, the usage of the attention mechanism between the VLF and ASLFs can extract the information similar to video from audio and improve the prediction accuracy. A similar reason explains the result that the AAVF is not as good as the VLF. In addition, we find that "ALF+VAAF" and "VLF+AAVF" both obtain better detection performance than "ALF" and "VLF". This result illustrates that the modal complementary information (i.e., VAAF and AAVF) is helpful to improve the experimental accuracy. "VCA+VAAF+AAVF" achieves the best detection result because it contains not only the audio and video features, but also their complementary information.
TABLE 7
Depression Detection Performance Using Different Tasks and Modalities on the Development Set of AVEC2014

"F" and "N" are the tasks of "FreeForm" and "Northwind". "Com" refers to the combination of these two tasks as in the above experiments. "Con" is the concatenation of the features, which are extracted using the models corresponding to "FreeForm" and "Northwind".

TABLE 9
Comparison of Our Method and Previous Works on the Test Set of AVEC2014
NIU ET AL.: MULTIMODAL SPATIOTEMPORAL REPRESENTATION FOR AUTOMATIC DEPRESSION LEVEL DETECTION 305
6 CONCLUSION
Physiological studies have revealed that there are some differences between depressive and healthy individuals in speech and facial activity. Based on this fact, we propose a multimodal spatiotemporal representation framework for automatic depression level detection. The proposed STA network not only integrates spatial and temporal information, but also emphasizes the frames related to depression detection. In addition, the proposed MAFF strategy improves the quality of the multimodal representation by extracting the complementary information between modalities. Experimental results on AVEC2013 and AVEC2014 indicate that our approach achieves good detection performance. In the future, we will segment different tasks and train separate models to improve the detection accuracy. In addition, we will also consider applying this framework to predict other diseases if the data are available.
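The MAFF idea of using one modality's level feature to attend over the other modality's segment-level features can be sketched as follows. This is a hypothetical toy illustration: the names (`vlf`, `aslfs`, `vaaf`), dimensions, and the scaled dot-product attention form are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def cross_modal_attention(query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Use one modality's level feature as the query over the other
    modality's segment-level features.

    The attended sum emphasizes the segments of one modality that are most
    related to the other, which plain concatenation cannot expose.
    """
    d = query.shape[0]
    weights = softmax(keys @ query / np.sqrt(d))  # one weight per segment
    return weights @ keys                         # attended feature, shape (d,)

rng = np.random.default_rng(1)
vlf = rng.normal(size=8)           # video-level feature (toy stand-in)
aslfs = rng.normal(size=(10, 8))   # 10 audio segment-level features (toy)

vaaf = cross_modal_attention(vlf, aslfs)   # "video-attended" audio feature
fused = np.concatenate([vlf, vaaf])        # multimodal representation sketch
```

In the full pipeline, such a fused representation would then be passed to the SVR predictor to estimate the depression level.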
ACKNOWLEDGMENTS
This work was supported by the National Key Research & Development Plan of China under Grant No. 2017YFB1002804, the National Natural Science Foundation of China (NSFC) under Grant Nos. 61831022, 61771472, 61773379, and 61901473, and the Key Program of the Natural Science Foundation of Tianjin under Grant No. 18JCZDJC36300.
Fig. 10. The scatter plot of the ground truth versus predicted value on the test set of AVEC2013 (a) and AVEC2014 (b) based on our proposed method.

also occurs in [22], because the 3D CNN ignores the differences across video frames [60]. The methods in [37], [48] use the attention mechanism to emphasize the key frames associated with the target task. To this end, we repeat their works to predict the depression levels on the AVEC2013 and AVEC2014 databases. But, as shown in Table 3, the performance of the LSTM is better than that of the 3D CNN, and our method pays attention to the output of the LSTM rather than of the 3D CNN as in [37], so better accuracy is gained. For the method in [48], the depression severity is estimated through visual behaviors (i.e., FAU, landmark, head pose, and gaze), which are not sufficient for extracting facial detail texture [49]. Moreover, the works of [34], [56] obtain better performance due to their division of the face into multiple regions. The method in [46] achieves the best accuracy because facial movements are directly examined across the complete video without the division into video segments.

For the multimodal fusion, our method achieves the best prediction performance, especially when combining different tasks in the AVEC2014 database. This is because the concatenation of audio and video features used in [6], [24], [29], [30], [45] or the linear combination of decisions [15], [17], [26], [27], [28] is weak in capturing the complementary information between modalities. Different from them, the proposed MAFF strategy improves the quality of the multimodal representation by using the attention mechanism to extract complementary information between modalities. Furthermore, we draw the scatter plot in Fig. 10 of the ground truth and predicted values to illustrate the prediction performance of our proposed method.

REFERENCES
[1] P. H. Soloff et al., "Self-mutilation and suicidal behavior in borderline personality disorder," J. Pers. Disorders, vol. 8, no. 4, pp. 257–267, 1994.
[2] World Health Organization, "Depression and other common mental disorders: Global health estimates," World Health Organization, pp. 7–24, 2017.
[3] A. J. Flint et al., "Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression," J. Psychiatric Res., vol. 27, no. 3, pp. 309–319, 1993.
[4] A. Korszun, "Facial pain, depression and stress–Connections and directions," J. Oral Pathol. Med., vol. 31, no. 10, pp. 615–619, 2002.
[5] A. McPherson and C. R. Martin, "A narrative review of the beck depression inventory (BDI) and implications for its use in an alcohol-dependent population," J. Psychiatric Mental Health Nursing, vol. 17, no. 1, pp. 19–30, 2010.
[6] A. Jan, H. Meng, Y. F. B. A. Gaus, and F. Zhang, "Artificial intelligent system for automatic depression level analysis through visual and vocal expressions," IEEE Trans. Cogn. Devel. Syst., vol. 10, no. 3, pp. 668–680, Sep. 2018.
[7] L. He, D. Jiang, and H. Sahli, "Automatic depression analysis using dynamic facial appearance descriptor and dirichlet process fisher encoding," IEEE Trans. Multimedia, vol. 21, no. 6, pp. 1476–1486, Jun. 2019.
[8] N. Cummins et al., "Diagnosis of depression by behavioural signals: A multimodal approach," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 11–20.
[9] L. He, D. Jiang, and H. Sahli, "Multimodal depression recognition with dynamic visual and audio cues," in Proc. Int. Conf. Affect. Comput. Intell. Interaction, 2015, pp. 260–266.
[10] L. He and C. Cao, "Automated depression analysis using convolutional neural networks from speech," J. Biomed. Informat., vol. 83, pp. 103–111, 2018.
[11] Y. Wang, V. Tran, and M. Hoai, "Eigen evolution pooling for human action recognition," 2017, arXiv: 1708.05465.
[12] M. Valstar et al., "AVEC 2013: The continuous audio/visual emotion and depression recognition challenge," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 3–10.
[13] M. Valstar et al., "AVEC 2014: 3D dimensional affect and depression recognition challenge," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 3–10.
[14] F. Eyben, M. Wöllmer, and B. Schuller, "OpenEAR–Introducing the munich open-source emotion and affect recognition toolkit," in Proc. Int. Conf. Affect. Comput. Intell. Interaction Workshops, 2009, pp. 1–6.
[15] H. Meng et al., "Depression recognition based on dynamic facial and vocal expression features using partial least square regression," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 21–30.
[16] C. D. Sherbourne et al., "Long-term effectiveness of disseminating quality improvement for depression in primary care," Arch. Gen. Psychiatry, vol. 58, no. 7, pp. 696–703, 2001.
[17] X. Ma et al., "Cost-sensitive two-stage depression prediction using dynamic visual clues," in Proc. Asian Conf. Comput. Vis., 2016, pp. 338–351.
[18] D. Castelli and P. Pagano, "OpenDLib: A digital library service system," in Proc. Int. Conf. Theory Pract. Digit. Libraries, 2002, pp. 292–308.
[19] J. M. Girard and J. F. Cohn, "Automated audiovisual depression analysis," Current Opinion Psychol., vol. 4, pp. 75–79, 2015.
[20] V. Jain et al., "Depression estimation using audiovisual features and fisher vector encoding," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 87–91.
[21] Y. Zhu, Y. Shang, Z. Shao, and G. Guo, "Automated depression diagnosis based on deep networks to encode facial appearance and dynamics," IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 578–584, Fourth Quarter 2018.
[22] M. Al Jazaery and G. Guo, "Video-based depression level analysis by encoding deep spatiotemporal features," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2018.2870884.
[23] L. Wen, X. Li, G. Guo, and Y. Zhu, "Automated depression diagnosis based on facial dynamic analysis and sparse coding," IEEE Trans. Inf. Forensics Security, vol. 10, no. 7, pp. 1432–1441, Jul. 2015.
[24] H. Kaya, F. Çilli, and A. A. Salah, "Ensemble CCA for continuous emotion prediction," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 19–26.
[25] A. Dhall and R. Goecke, "A temporally piece-wise fisher vector approach for depression analysis," in Proc. Int. Conf. Affect. Comput. Intell. Interaction, 2015, pp. 255–259.
[26] M. Kächele et al., "Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression," in Proc. Int. Conf. Pattern Recognit. Appl. Methods, 2014, pp. 671–678.
[27] J. R. Williamson et al., "Vocal and facial biomarkers of depression based on motor incoordination and timing," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 65–72.
[28] M. Senoussaoui et al., "Model fusion for multimodal depression classification and level detection," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 57–63.
[29] N. Cummins et al., "Diagnosis of depression by behavioural signals: A multimodal approach," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 11–20.
[30] H. Perez Espinosa et al., "Fusing affective dimensions and audio-visual features from segmented video for depression recognition: INAOE-BUAP's participation at AVEC'14 challenge," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 49–55.
[31] M. Niu et al., "Automatic depression level detection via lp-norm pooling," in Proc. Conf. Int. Speech Commun. Assoc., 2019, pp. 4559–4563.
[32] M. Niu, J. Tao, and B. Liu, "Local second-order gradient cross pattern for automatic depression detection," in Proc. 8th Int. Conf. Affect. Comput. Intell. Interaction Workshops Demos, 2019, pp. 128–132.
[33] D. Oneata, J. Verbeek, and C. Schmid, "Action and event recognition with fisher vectors on a compact feature set," in Proc. Int. Conf. Comput. Vis., 2013, pp. 1817–1824.
[34] X. Zhou, K. Jin, Y. Shang, and G. Guo, "Visually interpretable representation learning for depression recognition from facial images," IEEE Trans. Affective Comput., vol. 11, no. 3, pp. 542–552, Third Quarter 2020.
[35] Y. Zhu, Y. Shang, Z. Shao, and G. Guo, "Automated depression diagnosis based on deep networks to encode facial appearance and dynamics," IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 578–584, Fourth Quarter 2018.
[36] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[37] J. Lee, S. Kim, S. Kim, and K. Sohn, "Spatiotemporal attention based deep neural networks for emotion recognition," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2018, pp. 1513–1517.
[38] W. C. De Melo, E. Granger, and A. Hadid, "Combining global and local convolutional 3D networks for detecting depression from facial expressions," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2019, pp. 1–8.
[39] S. Poria et al., "A review of affective computing: From unimodal analysis to multimodal fusion," Inf. Fusion, vol. 37, pp. 98–125, 2017.
[40] I. Laptev, "On space-time interest points," Int. J. Comput. Vis., vol. 64, no. 2/3, pp. 107–123, 2005.
[41] B. Schuller et al., "Paralinguistics in speech and language–State-of-the-art and the challenge," Comput. Speech Lang., vol. 27, no. 1, pp. 4–39, 2013.
[42] J. C. Mundt et al., "Vocal acoustic biomarkers of depression severity and treatment response," Biol. Psychiatry, vol. 72, no. 7, pp. 580–587, 2012.
[43] B. Bhushan, "Study of facial micro-expressions in psychology," in Understanding Facial Expressions in Communication. New Delhi, India: Springer, 2015, pp. 265–286.
[44] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, Jun. 2007.
[45] R. Gupta et al., "Multimodal prediction of affective dimensions and depression in human-computer interactions," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 33–40.
[46] S. Song, S. Jaiswal, L. Shen, and M. Valstar, "Spectral representation of behaviour primitives for depression analysis," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2020.2970712.
[47] S. Song, L. Shen, and M. Valstar, "Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 158–165.
[48] Z. Du, W. Li, D. Huang, and Y. Wang, "Encoding visual behaviors with attentive temporal convolution for depression prediction," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2019, pp. 1–7.
[49] I. A. Essa and A. P. Pentland, "Coding, analysis, interpretation, and recognition of facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 757–763, Jul. 1997.
[50] N. Cummins et al., "A review of depression and suicide risk assessment using speech analysis," Speech Commun., vol. 71, pp. 10–49, 2015.
[51] A. Pampouchidou et al., "Automatic assessment of depression based on visual cues: A systematic review," IEEE Trans. Affective Comput., vol. 10, no. 4, pp. 445–470, Fourth Quarter 2019.
[52] N. Cummins, V. Sethu, J. Epps, J. R. Williamson, T. F. Quatieri, and J. Krajewski, "Generalized two-stage rank regression framework for depression score prediction from speech," IEEE Trans. Affective Comput., vol. 11, no. 2, pp. 272–283, Second Quarter 2020.
[53] Y. Kang et al., "Deep transformation learning for depression diagnosis from facial images," in Proc. Chin. Conf. Biometric Recognit., 2017, pp. 13–22.
[54] M. Van Segbroeck et al., "A robust frontend for VAD: Exploiting contextual, discriminative and spectral cues of human voice," in Proc. Conf. Int. Speech Commun. Assoc., 2013, pp. 704–708.
[55] T. F. Quatieri et al., "Multimodal biomarkers to discriminate cognitive state," in The Role of Technology in Clinical Neuropsychology. Oxford, U.K.: Oxford Univ. Press, 2017, pp. 409–443.
[56] M. A. Uddin, J. B. Joolee, and Y. Lee, "Depression level prediction using deep spatiotemporal features and multilayer Bi-LSTM," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2020.2970418.
[57] M. Kächele, M. Schels, and F. Schwenker, "Inferring depression and affect from application dependent meta knowledge," in Proc. Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 41–48.
[58] T. Taguchi et al., "Major depressive disorder discrimination using vocal acoustic features," J. Affect. Disorders, vol. 225, pp. 214–220, 2018.
[59] E. Rejaibi et al., "Clinical depression and affect recognition with EmoAudioNet," 2019, arXiv: 1911.00310.
[60] J. Li, X. Liu, W. Zhang, M. Zhang, J. Song, and N. Sebe, "Spatio-temporal attention networks for action recognition and detection," IEEE Trans. Multimedia, to be published, doi: 10.1109/TMM.2020.2965434.
Mingyue Niu (Student Member, IEEE) received the master's degree from the Department of Applied Mathematics, Northwestern Polytechnical University (NWPU), China, in 2017. He is currently working toward the PhD degree with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. He has published papers in ICASSP and INTERSPEECH. His research interests include affective computing and depression recognition and analysis.

Jian Huang (Student Member, IEEE) received the BE degree from Wuhan University, China. He is currently working toward the PhD degree with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China. He has published papers in INTERSPEECH and ICASSP. His research interests cover affective computing, deep learning, and multimodal emotion recognition.