
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 14, NO. 1, JANUARY-MARCH 2023

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection

Mingyue Niu, Student Member, IEEE, Jianhua Tao, Senior Member, IEEE, Bin Liu, Member, IEEE, Jian Huang, Student Member, IEEE, and Zheng Lian

Abstract—Physiological studies have shown that there are some differences in speech and facial activities between depressive and
healthy individuals. Based on this fact, we propose a novel spatio-temporal attention (STA) network and a multimodal attention feature
fusion (MAFF) strategy to obtain the multimodal representation of depression cues for predicting the individual depression level.
Specifically, we first divide the speech amplitude spectrum/video into fixed-length segments and input these segments into the STA
network, which not only integrates the spatial and temporal information through attention mechanism, but also emphasizes the audio/
video frames related to depression detection. The audio/video segment-level feature is obtained from the output of the last full
connection layer of the STA network. Second, this article employs the eigen evolution pooling method to summarize the changes of
each dimension of the audio/video segment-level features to aggregate them into the audio/video level feature. Third, the multimodal
representation with modal complementary information is generated using the MAFF and input into the support vector regression
predictor for estimating depression severity. Experimental results on the AVEC2013 and AVEC2014 depression databases illustrate the
effectiveness of our method.

Index Terms—Multimodal depression detection, spatio-temporal attention, audio/video segment-level feature, eigen evolution pooling, audio/video level feature, multimodal attention feature fusion

1 INTRODUCTION

DEPRESSION is a psychiatric disorder that makes people possess a very low mood and an inability to participate in social life normally. More seriously, depression can lead to self-mutilation and suicide behaviors [1]. According to the World Health Organization in 2017, there are about 350 million depressive patients worldwide and depression will become the second leading cause of death by 2030 [2]. Fortunately, early diagnosis and treatment can help patients get out of trouble as soon as possible. However, the diagnostic process is usually laborious and mainly relies on the clinical experience of doctors, which can leave some patients unable to get proper treatment in time [34]. Thus, it is necessary to investigate an automatic depression diagnosis method to help doctors improve efficiency.

Physiological studies [3], [4] have revealed that there are some differences in speech and facial activities between depressive patients and healthy individuals. In other words, it is reasonable to take speech and facial activities as the biomarkers to estimate the individual depression level, which can be measured through the Beck Depression Inventory-II (BDI-II) score [5] in Table 1.

Currently, there are many methods to extract the audio and video features to represent the depression cues in speech and facial activities for predicting the BDI-II score. Some of them [10], [34], [53] use the Convolutional Neural Network (CNN) to extract the spatial feature of the speech amplitude spectrum or facial image. However, the dynamic information in the speech or face is lost. The work of [15] captures the dynamic feature of speech using the Motion History Histogram (MHH), but the lack of spatial information affects the detection accuracy [59] and the statistical histogram is not sensitive to discriminative temporal changes [48]. The methods in [46], [47] use the Fourier transform to obtain the spectral representation of each behaviour primitive (facial action units, head pose, gaze directions, etc.) for encoding facial activities, but the primitives are not sufficient for extracting the detailed facial appearance [49] and all frames are treated equally in their methods. A similar issue also occurs in the works using spatiotemporal features for depression detection [7], [22], [45], even if the combination of 3D CNN with Recurrent Neural Network (RNN) is used in [22] and the Three Orthogonal Plane (TOP) framework [44] is adopted in [7], [45]. This is because the 3D CNN processes all consecutive frames equally [60] and the TOP framework, as a histogram feature extraction method, ignores the uneven distribution of salient features in the temporal space [48].

TABLE 1
The BDI-II Score and Corresponding Depression Degree

BDI-II Score    Depression Degree
0-13            None
14-19           Mild
20-28           Moderate
29-63           Severe

- Mingyue Niu is with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China. E-mail: niumingyue2017@ia.ac.cn.
- Jianhua Tao is with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China, the School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China, and also with the CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190, China. E-mail: jhtao@nlpr.ia.ac.cn.
- Bin Liu, Jian Huang, and Zheng Lian are with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China. E-mail: {liubin, jian.huang}@nlpr.ia.ac.cn, lianzheng2016@ia.ac.cn.

Manuscript received 1 Dec. 2019; revised 17 Sept. 2020; accepted 12 Oct. 2020. Date of publication 15 Oct. 2020; date of current version 28 Feb. 2023. (Corresponding author: Jianhua Tao.) Recommended for acceptance by M. Valstar. Digital Object Identifier no. 10.1109/TAFFC.2020.3031345
Fig. 1. Audio and video feature sequences of healthy and depressive individuals. Each column is the feature vector corresponding to an audio or video frame. (a) and (b) are the audio and video feature sequences for the healthy subject No. 203-1 in the AVEC2013 database. (c) and (d) are the corresponding results for the depressive subject No. 236-1 in the AVEC2013 database. The BDI-II scores for subjects No. 203-1 and 236-1 are 3 (None depression) and 23 (Moderate depression). The parts enclosed by red and green boxes show the discriminative and less discriminative feature vectors.

To illustrate that the effects of different audio/video frames on depression detection are not exactly the same, we use the Long Short-Term Memory (LSTM) network to extract the temporal sequence representation of a speech amplitude spectrum with 64 frames. A video segment with 60 consecutive video frames is processed using the structure of 2D CNN+LSTM. The training process of 2D CNN+LSTM is as follows: the 2D CNN is trained independently using the video frames. Then, we input each frame of the video segment into the 2D CNN and take the output of the last full connection layer as the frame feature. In this way, the video segment can be encoded as a temporal feature sequence. Finally, the feature sequence is input into the LSTM to generate the temporal sequence representation. Figs. 1a and 1b show the audio and video temporal sequence representations of a healthy individual; Figs. 1c and 1d show the results of a depressive individual. The imagesc function in MATLAB is used to draw these figures, and each column corresponds to an audio or video frame. From Fig. 1, one can find that the difference between (a) and (c), or between (b) and (d), is not the same in every column. For example, the parts enclosed by red boxes are more discriminative than the counterparts enclosed by green boxes. Therefore, these audio/video frames are not equally important for distinguishing the healthy and depression states.

In addition, for the works [6], [9], [15], [29] using multimodal features to estimate depression scores, feature concatenation and decision weighting are two common fusion methods. However, these two fusion strategies are not enough for examining the modal complementary information, which, in this paper, refers to the information similar to the audio (video) modality in the video (audio) modality. Recently, a temporal attention model [48] was proposed to emphasize the key frames in the feature sequence for depression detection. But, as mentioned above, the behaviour primitives used in [48] are not sufficient for characterizing the facial appearance [49]. Besides, a spatiotemporal representation method was developed in the field of expression recognition [37]. This method not only examines the differences across video frames, but also provides the spatial appearance of the whole face. Considering the advantages of these two works, we propose a novel Spatio-Temporal Attention (STA) network to generate the spatiotemporal representation of the audio/video data and highlight the frames that contribute to depression detection. In addition, we also present a Multimodal Attention Feature Fusion (MAFF) strategy to extract the complementary information between modalities to improve the quality of the multimodal representation. Specifically, our method consists of three steps.

First, we divide the long-term speech amplitude spectrum/video into fixed-length segments and input these segments into the STA network. In this process, on the one hand, a spectrum or video segment is input into the 2D CNN or 3D CNN to extract the spatial feature. On the other hand, the network of LSTM or 2D CNN+LSTM is adopted to obtain the temporal sequence representation corresponding to the spectrum or video segment. After that, we use the attention mechanism between the spatial feature and the temporal sequence representation. This process not only embeds the spatial feature into the temporal sequence representation, but also assigns different weight coefficients to each frame feature in the temporal sequence representation for emphasizing the frames related to depression detection. The Audio/Video Segment-Level Features (ASLF/VSLF) can be obtained from the output of the last full connection layer of the STA network.

Second, to obtain the representation of the long-term speech spectrum or video, we employ the Eigen Evolution Pooling (EEP) method to summarize the changes of each dimension of the segment-level features (ASLF/VSLF) and aggregate them into the Audio/Video Level Feature (ALF/VLF).

Third, the attention mechanism is used between the ALF and VSLFs to obtain the Audio Attention Video Feature (AAVF). In this way, we obtain the information similar to the audio modality from the video modality and realize the supplement of the audio modality to the video modality. Similarly, the Video Attention Audio Feature (VAAF) is obtained through the attention mechanism between the VLF and ASLFs. In this paper, we regard AAVF and VAAF as the modal complementary information. The multimodal representation can be generated by concatenating ALF, AAVF, VLF and VAAF. Moreover, support vector regression (SVR) is adopted to predict the individual depression level. The experimental results on the two publicly available Audio/Visual Emotion Challenge (AVEC) 2013 [12] and AVEC2014 [13] depression databases demonstrate the effectiveness of our method. Since there are some abbreviations involved in this paper, we list them in Table 2 to avoid confusion.


TABLE 2
Abbreviations of Different Features and Their Corresponding Full Names

Abbreviation    Full name
ASLF            Audio Segment-Level Feature
VSLF            Video Segment-Level Feature
ALF             Audio Level Feature
VLF             Video Level Feature
AAVF            Audio Attention Video Feature
VAAF            Video Attention Audio Feature

The main contributions of this paper can be summarized as:

1) We propose a novel Spatio-Temporal Attention network, which not only utilizes the attention mechanism to embed the spatial feature into the temporal sequence representation, but also emphasizes the audio and video frames that are helpful for depression detection.
2) We employ the EEP method to summarize the changes of each dimension of the segment-level features for the purpose of aggregating the ASLFs or VSLFs into the ALF or VLF. To the best of our knowledge, this is the first time the EEP method has been applied to the field of automatic depression detection.
3) We propose a Multimodal Attention Feature Fusion strategy. This method uses the attention mechanism to extract the complementary information between different modalities to improve the quality of the multimodal representation.

The rest of this paper is organized as follows: Section 2 reviews the related works. Section 3 provides a detailed description of our method. Section 4 presents the experimental data and setup. Experimental results and discussion are illustrated in Section 5, and Section 6 gives the conclusion and future works.

2 RELATED WORKS

As mentioned above, segment-level feature extraction, feature aggregation and multimodal fusion are the three main steps of our method. Therefore, in this section, we briefly review the previous works on these three aspects.

2.1 Audio/Video Segment-Level Feature Extraction

Audio Segment-Level Feature Extraction. In the AVEC competitions held in 2013 and 2014, Valstar et al. [12], [13] released two datasets containing audio and video data for depression detection. Meanwhile, they extracted the baseline feature set with 2268 dimensions (spectral Low-Level Descriptors (LLDs) and Mel-frequency Cepstral Coefficients (MFCCs) 11-16) from the divided speech segments. Jan et al. [6] further investigated these baseline features and selected the most dominant feature combination (i.e., Flatness, Band1000, PSY Sharpness, POV, Shimmer, ZCR and MFCC) as the segment-level feature. To use a neural network to automatically obtain the information related to depression from the speech, He et al. [10] proposed a four-stream CNN model to extract the feature from the audio segment. They input the short-term waveform, the corresponding spectrum, LLDs and the Median Robust Extended Local Binary Pattern (MRELBP) extracted from the spectrum into their model to explore the differences of vocal expression among individuals with different depression levels. In the work of [31], they took the MFCCs as the discriminative biomarker according to the physiological study [58] and adopted a CNN, LSTM and feedforward network to extract the spatial, temporal and discriminant information related to depression from MFCC segments. More works about analyzing speech signals for depression detection are reviewed comprehensively in [50].

Video Segment-Level Feature Extraction. Valstar et al. [12] extracted the Local Phase Quantisation (LPQ) feature from each frame of a video segment and used the mean of these features as the video segment-level feature in the AVEC2013 competition. Besides, they extracted the Local Gabor Binary Pattern (LGBP) feature from the XY-T image plane to represent the video segment in the AVEC2014 competition [13]. Dhall et al. [25] divided each video segment into non-overlapping blocks and extracted the Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) from those blocks. Fisher Vector (FV) encoding was performed on those LBP-TOP features to obtain the representation of the video segment. He et al. [7] used the Median Robust Local Binary Pattern (MRLBP) to process each frame of the video segment and the TOP framework to obtain the MRLBP-TOP as the feature of the corresponding video segment. Considering that deep neural networks are able to extract high-level representations, Jazaery et al. [22] combined the 3D CNN with an RNN to extract the spatiotemporal representation of the video segment. Similarly, Melo et al. [38] integrated 3D global average pooling into a 3D CNN to process the video segments including full-face and eye regions, respectively. A comprehensive review about depression analysis based on visual cues can be found in [51].

2.2 Aggregation Methods for Audio/Video Level Feature Generation

To obtain the representation of the complete audio or video for predicting the severity of depression, some methods were proposed to aggregate the segment-level features into the audio or video level feature. Average pooling was adopted in the works of [12], [13]. Meng et al. [15] used the MHH to process each component of the audio segment-level features to aggregate the temporal sequence. Considering the good performance of the Bag-of-Words (BoW) approach in the community of action recognition and affect analysis, Dhall et al. [25] constructed the visual words using the set of video segment-level features, and the aggregation result was generated by calculating the frequency histogram. In addition, they also presented the performance of depression detection using three other statistical aggregation methods (mean, maximum and standard deviation). He et al. [7] pointed out that it was not easy to tune the Gaussian components in the aggregation process using the FV encoding. To overcome this limitation, they combined the Dirichlet process with the FV encoding to automatically learn the number of Gaussian components from the observed data and obtained the video level feature for depression detection.


In the work of [31], Niu et al. showed that average pooling and max pooling are special cases of ℓp-norm pooling. Thus, they combined the ℓp-norm pooling with the Least Absolute Shrinkage and Selection Operator (LASSO) to find the suitable parameter p for the task of depression detection. Moreover, the aggregation result was obtained by calculating the ℓp-norm of each dimension of the segment-level features.

2.3 Multimodal Fusion Strategies for Depression Detection

Gupta et al. [45] combined the audio baseline features of AVEC2014 with additional acoustic features proposed in [54] to predict the depression score. Meanwhile, they combined the video baseline features provided by AVEC2014 with some additional video representations (including LBP-TOP, optical flow features and the motion of the facial landmarks) to estimate the depression score. At last, the multimodal result was obtained by linearly fusing the prediction scores of the audio and video modalities. In the work of [30], on the one hand, Perez et al. generated predictions for affective dimensions and used them as the attributes of the audio segment-level feature. On the other hand, they used the facial landmarks to extract the motion and velocity information from the video segments. Finally, a majority strategy was implemented for the predicted results from all segments.

Jain et al. [20] used Principal Component Analysis (PCA) to reduce the dimension of LLDs and FV encoding to obtain the audio feature. Meanwhile, they used PCA and FV encoding to process the LBP-TOP and dense trajectory features to gain the video feature. The multimodal representation was generated by concatenating the audio and video features for depression detection. In [6], the audio feature contained LLDs and MFCCs. The video feature contained some hand-crafted descriptors (i.e., LBP, LPQ and Edge Orientation Histogram) and the deep representation extracted by VGG-Face. Finally, the concatenation of these two modal features was input into Linear Regression (LR) and Partial Linear Regression (PLR), respectively. The results of the two regressors were weighted as the corresponding individual depression score.

Different from the above approaches, this paper extracts the high-level representation of a spectrum or video segment with the STA network, which uses the attention mechanism to generate the spatiotemporal representation and emphasize the frames related to depression detection. For aggregation, we employ the EEP method to summarize the changes of each dimension of the segment-level features. In addition, we propose a MAFF fusion strategy to capture the complementary information between modalities to improve the quality of the multimodal representation. Experimental results on the AVEC2013 and AVEC2014 depression databases illustrate the effectiveness of our method.

Fig. 2. The illustration of the multimodal spatiotemporal representation framework for automatic depression level detection. Our method first inputs spectrum/video segments into the STA network and takes the output of the last full connection layer as the ASLF or VSLF. Then, we use the EEP method to aggregate these ASLFs or VSLFs into the ALF or VLF. Finally, the multimodal representation is generated using the MAFF strategy and input into the SVR to predict the BDI-II score.

3 MULTIMODAL SPATIOTEMPORAL REPRESENTATION FRAMEWORK FOR DEPRESSION DETECTION

The fact that there are some differences in speech and facial activity between depressive and healthy individuals has been confirmed by physiological studies [3], [4]. Therefore, this paper proposes a multimodal representation framework to predict the individual depression level. In our approach, we first divide the long-term speech amplitude spectrum/video into fixed-length segments and input them into the STA network for obtaining the ASLFs and VSLFs. Then, the EEP method is employed to aggregate the ASLFs and VSLFs into the ALF and VLF. Finally, the multimodal representation with modal complementarity is generated by the proposed MAFF strategy and input into the SVR for predicting the BDI-II score. Fig. 2 shows the whole flow of our framework.
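To make this flow concrete before the detailed description, the short sketch below pushes synthetic features through the same three steps and an SVR regressor, and scores the predictions with the RMSE and MAE measures defined later in Eqs. (13) and (14). It is only an illustration of the data flow in Fig. 2, not the authors' code: the feature dimension, the segment counts, the synthetic labels and the helper functions (sta_segment_features, aggregate, fuse) are placeholders; minimal sketches of the actual STA, EEP and MAFF computations follow Sections 3.1, 3.2 and 3.3 below.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
DIM = 64  # assumed segment-level feature dimension (the FC layers have 64 neurons)

def sta_segment_features(num_segments):
    # Placeholder for the STA network of Section 3.1: one feature per segment.
    return rng.standard_normal((DIM, num_segments))

def aggregate(segment_features):
    # Placeholder for eigen evolution pooling (Section 3.2).
    return segment_features.mean(axis=1)

def fuse(alf, vlf, aslfs, vslfs):
    # Placeholder for the MAFF strategy (Section 3.3).
    return np.concatenate([alf, vlf])

def recording_representation():
    aslfs = sta_segment_features(20)   # features of the spectrum segments
    vslfs = sta_segment_features(30)   # features of the video segments
    return fuse(aggregate(aslfs), aggregate(vslfs), aslfs, vslfs)

X = np.stack([recording_representation() for _ in range(50)])
y = rng.uniform(0, 45, size=50)        # synthetic BDI-II scores
svr = SVR().fit(X, y)
pred = svr.predict(X)
print("RMSE", np.sqrt(np.mean((y - pred) ** 2)), "MAE", np.mean(np.abs(y - pred)))
```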


Fig. 3. The STA network used to extract the ASLF. "⊗" means matrix multiplication. FC refers to the full connection layer, and the output of FC3 (i.e., the part surrounded by the red box) is the ASLF. The 2D CNN, LSTM and FC layers are trained jointly with the RMSE loss function.

Fig. 4. The STA network used to extract the VSLF. "⊗" means matrix multiplication. FC refers to the full connection layer, and the output of FC3 (i.e., the part surrounded by the red box) is the VSLF. Note that the 2D CNN is first trained independently using the video frames with the RMSE loss function. Then, the 3D CNN, LSTM and full connection layers are jointly trained using the RMSE loss function.

3.1 The STA Network for Audio/Video Segment-Level Feature Extraction

In this section, we describe in detail the process of extracting the ASLF and VSLF using the STA network. Figs. 3 and 4 show the specific network architectures.

For the ASLF extraction, on the one hand, we use the 2D CNN to examine the spatial structure of an amplitude spectrum segment and record the output as $o_A^C \in \mathbb{R}^{n \times 1}$, where $n$ is the dimension of the spatial feature. On the other hand, the LSTM is adopted to extract the temporal sequence representation, which characterizes the dynamic changes in the amplitude spectrum segment and is denoted as $o_A^L \in \mathbb{R}^{n \times T_A}$, where $n$ and $T_A$ are the dimension and length of the output sequence of the LSTM, respectively. Then, we can obtain the spatiotemporal attention weight $w_A \in \mathbb{R}^{T_A \times 1}$ by Eq. (1). Moreover, the result $r_A \in \mathbb{R}^{n \times 1}$ of the spatiotemporal attention can be calculated using Eq. (3). At last, we input $r_A$ into three full connection layers and take the output of the last full connection layer (i.e., the part circled by the red box in Fig. 3) as the ASLF. Note that, in the process of model training, the 2D CNN, LSTM and full connection layers are jointly trained with the RMSE loss function shown in Eq. (13).

$w_A = \mathrm{softmax}(\hat{w}_A) = \dfrac{[\exp(\hat{w}_1), \ldots, \exp(\hat{w}_{T_A})]^T}{\sum_{t_A=1}^{T_A} \exp(\hat{w}_{t_A})}, \qquad (1)$

where "T" refers to matrix transposition and $\hat{w}_A \in \mathbb{R}^{T_A \times 1}$ is calculated by Eq. (2):

$\hat{w}_A = (o_A^L)^T \otimes o_A^C = [\hat{w}_1, \ldots, \hat{w}_{T_A}]^T, \qquad (2)$

where "$\otimes$" represents the matrix multiplication operation, which captures the correlation between $o_A^C$ and each frame feature of $o_A^L$ for integrating the spatiotemporal information and generating the spatiotemporal attention weight.

$r_A = o_A^L \otimes w_A. \qquad (3)$

For the VSLF extraction, if the video segment is regarded as a kind of 3D data, the 3D CNN can be used to extract the spatial information of the video segment, just as the 2D CNN is able to examine the spatial structure of an image. Therefore, as shown in Fig. 4, we use the 3D CNN to extract the spatial feature of the video segment and record it as $o_V^C \in \mathbb{R}^{n \times 1}$, where $n$ is the dimension of the spatial feature. For extracting the temporal sequence representation, we first train the 2D CNN using video frames with the RMSE loss function and regard the output of the last full connection layer (i.e., the FC_3 layer in Fig. 4) as the encoding result of a video frame. Note that the process of training the 2D CNN is independent and the label of each frame is the corresponding video label. In this way, each video segment can be encoded into a vector sequence $\hat{o}_V^L \in \mathbb{R}^{n \times T_V}$, where $n$ and $T_V$ are the dimension of the encoding vector and the number of frames in the video segment, respectively. Moreover, we input this vector sequence into the LSTM to extract the temporal feature and denote the result as $o_V^L \in \mathbb{R}^{n \times T_V}$. Then, similar to the procedure of extracting the ASLF, we replace $o_A^L$ and $o_A^C$ in Eq. (2) with $o_V^L$ and $o_V^C$ to obtain $\hat{w}_V \in \mathbb{R}^{T_V \times 1}$, which has the same property as $\hat{w}_A$. The spatiotemporal attention weight $w_V \in \mathbb{R}^{T_V \times 1}$ can be gained by replacing $\hat{w}_A$ in Eq. (1) with $\hat{w}_V$. Similarly, the result $r_V \in \mathbb{R}^{n \times 1}$ of the spatiotemporal attention can be obtained by Eq. (3) using $o_V^L$ and $w_V$. After that, $r_V$ is fed into three full connection layers and the output of the last full connection layer (i.e., the part circled by the red box in Fig. 4) is considered as the VSLF. Note that, in the process of model training, the 2D CNN is trained independently; then the 3D CNN, LSTM and full connection layers are jointly trained with the RMSE loss function.

Intuitively, from the extraction processes of the ASLF and VSLF, one can see that we use the attention mechanism to embed the spatial feature ($o_A^C$ or $o_V^C$) into the temporal sequence representation ($o_A^L$ or $o_V^L$) for integrating the spatiotemporal information of the amplitude spectrum or video segment. At the same time, the STA network can emphasize the audio or video frames related to depression detection by assigning different weight coefficients.
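As a concrete reading of Eqs. (1)-(3), the NumPy sketch below computes the spatiotemporal attention for the audio branch; the values of $n$ and $T_A$ and the random stand-ins for the CNN and LSTM outputs are assumptions for illustration only. The video branch is identical with $o_V^C$, $o_V^L$ and $T_V$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T_A = 64, 64                        # spatial feature dimension and sequence length

o_C = rng.standard_normal((n, 1))      # stand-in for the 2D CNN spatial feature o_A^C
o_L = rng.standard_normal((n, T_A))    # stand-in for the LSTM output sequence o_A^L

# Eq. (2): correlation between the spatial feature and each frame feature
w_hat = o_L.T @ o_C                    # shape (T_A, 1)

# Eq. (1): softmax over the frames gives the spatiotemporal attention weight
w = np.exp(w_hat) / np.exp(w_hat).sum()

# Eq. (3): frame features weighted by their attention coefficients
r = o_L @ w                            # shape (n, 1); fed into FC1-FC3 to produce the ASLF

print(w.shape, float(w.sum()), r.shape)   # (64, 1) 1.0 (64, 1)
```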


3.2 The EEP Method for Audio/Video Level Feature Generation

It is necessary to predict an individual depression level through examining the long-term performance in the complete audio and video. In other words, we need to aggregate the ASLFs and VSLFs into the ALF and VLF. In this paper, we employ the EEP method to summarize the changes of each dimension of these segment-level features to obtain the representations of the corresponding audio and video.

In a formal way, we let the matrix composed of the segment-level features be $S = [s_1, \ldots, s_M] \in \mathbb{R}^{n \times M}$, where $n$ is the dimension of the segment-level feature (i.e., ASLF or VSLF) and $M$ is the number of segments divided from the corresponding audio or video. If we denote $d_i \in \mathbb{R}^{1 \times M}$ ($i = 1, \ldots, n$) as the $i$th row of $S$, then $S = [d_1^T, \ldots, d_n^T]^T$. In other words, $d_i$ can be treated as a time series of the segment-level features in the $i$th dimension. In order to maintain the changes of $S$ in each dimension as much as possible, we attempt to find $K (< M)$ standard orthogonal vectors $g_1, \ldots, g_K$ ($g_k \in \mathbb{R}^{M \times 1}$, $k = 1, \ldots, K$) to form the base matrix $G \in \mathbb{R}^{M \times K}$ for reconstructing $d_i^T$ ($i = 1, \ldots, n$). In this way, the coordinate of $d_i^T$ under the base matrix $G$ can be expressed as $G^T d_i^T$. Therefore, we can optimize the objective function shown in Eq. (4) to reconstruct $d_i^T$ and obtain the required standard orthogonal base matrix $G^* = [g_1, \ldots, g_K] \in \mathbb{R}^{M \times K}$:

$G^* = \arg\min_{G^T G = I_K} \sum_{i=1}^{n} \left\| G G^T d_i^T - d_i^T \right\|^2, \qquad (4)$

where $I_K$ is the identity matrix of order $K$.

To solve for $G^*$, we equivalently convert Eq. (4) to Eq. (5) (under the constraint $G^T G = I_K$, minimizing the reconstruction error is equivalent to maximizing the energy of the projections $G^T d_i^T$) and rearrange it to get Eq. (6). Furthermore, according to the orthogonal decomposition theorem of real symmetric matrices, we get Eq. (7) and substitute it into Eq. (6) to obtain Eq. (8). In this way, according to the property that the arithmetic mean does not exceed the square mean, we scale Eq. (8) to Eq. (9). Note that the equal sign holds if and only if $q_m = g_k$. In other words, $g_1, \ldots, g_K$ are $K$ vectors in $q_1, \ldots, q_M$. Thus, $G^*$ is composed of the eigenvectors corresponding to the first $K$ eigenvalues of $S^T S$, because Eq. (8) attains its maximum value $\sum_{k=1}^{K} \lambda_k$ at this time.

$G^* = \arg\min_{G^T G = I_K} \; -\sum_{i=1}^{n} \sum_{k=1}^{K} g_k^T d_i^T d_i g_k. \qquad (5)$

$G^* = \arg\max_{G^T G = I_K} \sum_{k=1}^{K} g_k^T (S^T S) g_k. \qquad (6)$

$S^T S = \sum_{m=1}^{M} \lambda_m q_m q_m^T, \quad \lambda_1 \geq \cdots \geq \lambda_M, \qquad (7)$

where $\lambda_m$ ($m = 1, \ldots, M$) is the $m$th eigenvalue of $S^T S$ and $q_m \in \mathbb{R}^{M \times 1}$ is the corresponding eigenvector. Note that $q_m^T q_m = 1$ and $q_r^T q_t = 0$, $r \neq t$.

$G^* = \arg\max_{G^T G = I_K} \sum_{k=1}^{K} \sum_{m=1}^{M} \lambda_m (q_m^T g_k)^2. \qquad (8)$

$G^* \leq \arg\max_{G^T G = I_K} \sum_{k=1}^{K} \sum_{m=1}^{M} \left[ M \lambda_m \sum_{i=1}^{M} (q_{mi} g_{ki})^2 \right], \qquad (9)$

where $q_{mi}$ and $g_{ki}$ are the $i$th elements of $q_m$ and $g_k$, respectively.

In this paper, we find that the ratio of the largest eigenvalue $\lambda_{\max}$ of $S^T S$ to the sum of all eigenvalues is more than 85 percent. Since the small eigenvalues correspond to the noise component [11], we use the projection of $S$ onto $g_{\max}$ (i.e., $S g_{\max}$) as the result of segment-level feature aggregation, where $g_{\max}$ is the eigenvector corresponding to the largest eigenvalue $\lambda_{\max}$. In short, if $S$ is composed of the ASLFs or VSLFs, then $S g_{\max}$ is the corresponding ALF or VLF.

From the above process, we can see that the EEP method maintains the changes of the feature sequence in each dimension as much as possible through Eq. (4). That is to say, the ALF summarizes the dynamic changes of all spectrum segments in the range of the complete audio. The VLF also has the same property.
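The derivation above reduces the EEP aggregation to an eigendecomposition of $S^T S$ followed by a projection onto the leading eigenvector. A minimal NumPy sketch, with random segment-level features standing in for the ASLFs or VSLFs, is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 64, 40                           # feature dimension and number of segments
S = rng.standard_normal((n, M))         # columns are the segment-level features

# Eq. (7): eigendecomposition of S^T S; eigh returns eigenvalues in ascending
# order, so the last column of eigvecs is the eigenvector of the largest one.
eigvals, eigvecs = np.linalg.eigh(S.T @ S)
g_max = eigvecs[:, -1]

energy_ratio = eigvals[-1] / eigvals.sum()   # reported to exceed 85% on the real data
level_feature = S @ g_max                    # the ALF (or VLF), an n-dimensional vector

print(level_feature.shape, round(float(energy_ratio), 3))
```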


Fig. 5. The multimodal representation with modal complementary information (i.e., the part surrounded by a red box) is generated by the MAFF strategy. VAAF and AAVF are the modal complementary information.

3.3 The MAFF for Multimodal Representation With Complementary Information

As mentioned above, depressive patients differ from healthy individuals in speech and facial activity [3]. Therefore, it is reasonable to predict the individual depression level via fusing the audio and video features. For this purpose, this paper proposes a novel multimodal fusion strategy named MAFF to capture the complementary information between modalities to improve the quality of the representation of depression cues. Fig. 5 shows the fusion process.

In particular, we adopt the attention mechanism between the ASLFs and the VLF to obtain the VAAF. Eq. (10) gives the calculation formula of the VAAF. Similarly, we only need to replace $\mathrm{ASLF}_i$ in Eq. (11) and $\mathrm{VLF}$ in Eq. (12) with $\mathrm{VSLF}_i$ and $\mathrm{ALF}$ to obtain the AAVF. Note that the length of all features is normalized to 1 before the fusion. Then, the multimodal representation can be generated by concatenating the VLF, VAAF, AAVF and ALF.

$\mathrm{VAAF} = S_{\mathrm{ASLF}} \otimes a, \qquad (10)$

where $S_{\mathrm{ASLF}}$ is obtained through Eq. (11) and $a^T = [a_1, \ldots, a_{N_{AS}}]$ can be calculated using Eq. (12):

$S_{\mathrm{ASLF}} = [\mathrm{ASLF}_1, \ldots, \mathrm{ASLF}_{N_{AS}}], \qquad (11)$

where $\mathrm{ASLF}_i$ ($i = 1, \ldots, N_{AS}$) refers to the feature corresponding to the $i$th spectrum segment and $N_{AS}$ is the number of segments divided from the speech amplitude spectrum.

$a_i = \dfrac{e^{\langle \mathrm{VLF}, \mathrm{ASLF}_i \rangle}}{\sum_{j=1}^{N_{AS}} e^{\langle \mathrm{VLF}, \mathrm{ASLF}_j \rangle}}, \quad i = 1, 2, \ldots, N_{AS}, \qquad (12)$

where $\langle \cdot, \cdot \rangle$ means the inner product operation of two vectors and the VLF is obtained by aggregating the VSLFs using the EEP method as mentioned above.

Mathematically, Eqs. (3) and (10) have the same expression. But unlike Eq. (3), which integrates the spatiotemporal information and emphasizes key frames, the $a$ in Eq. (10) actually reflects the similarity between two modalities. Therefore, the VAAF contains the information similar to the video modality in the audio modality. In other words, the VAAF provides the supplement of the video modality to the audio modality, and the experiments in Section 5.3 also confirm this view. Likewise, the AAVF has the same property. Therefore, in this paper, we take the VAAF and AAVF as the modal complementary information. Furthermore, the multimodal representation with complementary information is generated by concatenating the VAAF, VLF, ALF and AAVF and is input into the SVR to predict the individual BDI-II score.
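A minimal sketch of the fusion in Eqs. (10)-(12), using random unit-normalized stand-ins for the ALF, VLF, ASLFs and VSLFs (the dimensions and segment counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N_AS, N_VS = 64, 30, 45              # feature dimension, #spectrum segments, #video segments

def unit(x, axis=0):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

ALF = unit(rng.standard_normal(n))
VLF = unit(rng.standard_normal(n))
ASLF = unit(rng.standard_normal((n, N_AS)))   # columns are ASLF_1, ..., ASLF_{N_AS}
VSLF = unit(rng.standard_normal((n, N_VS)))   # columns are VSLF_1, ..., VSLF_{N_VS}

def attention_fuse(level_feature, segment_features):
    # Eq. (12): similarity-based weights; Eqs. (10)-(11): weighted sum of the segments.
    scores = np.exp(segment_features.T @ level_feature)
    return segment_features @ (scores / scores.sum())

VAAF = attention_fuse(VLF, ASLF)        # information similar to video, taken from audio
AAVF = attention_fuse(ALF, VSLF)        # information similar to audio, taken from video

multimodal = np.concatenate([VLF, VAAF, AAVF, ALF])   # fed to the SVR
print(multimodal.shape)                  # (256,)
```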


4 EXPERIMENTAL DATABASES AND SETUP

In this section, we introduce the databases and the evaluation measures used in the experiments, and then the experimental setup is given.

4.1 Databases and Evaluation Measure

In this paper, all experiments are conducted on two publicly available datasets, i.e., the AVEC2013 and AVEC2014 depression datasets.

The AVEC2013 depression corpus is a subset of the audio-visual depression language corpus (AViD-Corpus), which is recorded by a webcam and microphone. Specifically, each subject needs to perform 14 different tasks according to the instructions on the computer screen. These 14 tasks include sustained vowel phonation, sustained loud vowel phonation, sustained smiling vowel phonation, speaking out loud while solving a task, counting from 1 to 10, etc. The mean age of the subjects is 31.5 years with a standard deviation of 12.3 years and a range of 18-63 years. All subjects are German speakers. There are 150 video clips from 82 subjects, and these recordings have been divided into three parts by the publisher: training, development and test set, each with 50 samples. It should be noted that, in the AVEC2013 database, all behavior performance of a subject is included in the same recording without being separated by the publisher when he or she performs the tasks. These videos are set to 30 frames per second with a resolution of 640 × 480 pixels. Each sample is labeled with a BDI-II score.

The AVEC2014 depression corpus is a subset of the AVEC2013 corpus. Thus, they are similar in collection settings, age distribution and language characteristics. In the AVEC2014 database, only two tasks named "Northwind" and "FreeForm" are involved. For "Northwind", the subjects read an excerpt of the fable "Die Sonne und der Wind" (The North Wind and the Sun) in German. For "FreeForm", the subjects respond to one of a number of questions, such as "What is your favorite dish?", or discuss a sad childhood memory in German. In each task, there are 150 recordings from 84 subjects, and these recordings are divided equally into training, development and test sets. The duration ranges from 6 seconds to 4 minutes. In our experiments, unless otherwise specified, we combine the training, development and test sets of these two tasks as the new database. Namely, there are 100 samples in the training, development and test set, respectively. In addition, it is not difficult to predict that there would be similar findings in both databases due to their similarity. The following experimental results confirm this view.

In this paper, we conduct experiments on the AVEC2013 and AVEC2014 databases, respectively. For each database, the training set is used to train the model. The development set is used to adjust experimental parameters and validate the effectiveness of each module in our model. The test set is used for comparing our method with other works.

At present, the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are widely used to evaluate the performance of depression detection algorithms. Eqs. (13) and (14) show the calculation formulas for RMSE and MAE, where $N$ denotes the number of subjects, and $y_i$ and $\hat{y}_i$ are the ground truth and predicted BDI-II score of the $i$th subject, respectively.

$\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}. \qquad (13)$

$\mathrm{MAE} = \dfrac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|. \qquad (14)$

4.2 Experimental Setup

As described above, this paper extracts the representation of depression cues from the audio and video modalities. For the audio modality, we sample the waveforms at 8 kHz and generate the 129-dimensional normalized amplitude spectrum using a short-time Fourier transform with a 32 ms Hamming window and a 16 ms frame shift for the AVEC2013 and AVEC2014 databases. To find the suitable length of a spectrum segment, we conduct experiments using spectrum segments with different lengths, where the STA network in Fig. 3 extracts the ASLFs, the EEP aggregates them into the ALF and the SVR is used for prediction. Note that the overlapping of two adjacent segments is 50 percent, the label of each segment is the BDI-II score of the corresponding audio, and the structure of the STA network in Fig. 3 is described below. The prediction performance on the development sets of the AVEC2013 and AVEC2014 databases is shown in Fig. 6. Hence, we set the length of a spectrum segment to 64 frames (about 1 second) with a shift of 32 frames (about 0.5 seconds) for the AVEC2013 and AVEC2014 databases.

Fig. 6. Depression detection performance on the development sets of the AVEC2013 (a) and AVEC2014 (b) using the spectrum segments with different lengths.

Fig. 7. Depression detection performance on the development sets of AVEC2013 using four sampling schemes. The index1, index2, index3, index4 and index5 refer to sampling frame 15, frames (8,15,23), frames (5,10,15,20,25), frames (3,7,11,15,19,23,27) in every 30 consecutive frames.

For the video modality, we extract the facial images in the videos using Dlib [18] and resize them to 128 × 128 for the AVEC2013 and AVEC2014 databases. The 2D CNN in Fig. 4 is trained independently using video frames obtained through different sampling schemes, where the label of each video frame is the BDI-II score of the corresponding video. After that, we input the video segment with 60 frames into the STA network (as shown in Fig. 4) to extract the VSLFs, then the EEP aggregates them into the VLF and the SVR is used for prediction. Note that the overlapping of two adjacent video segments is 50 percent. The structures of the 2D CNN and STA network in Fig. 4 are described below. The detection performance on the development set of the AVEC2013 database is shown in Fig. 7. As shown, when the 15th frame is sampled in every 30 consecutive frames, the best detection result is obtained. Note that, for the AVEC2014 database, we fine-tune the 2D CNN trained using the AVEC2013 database due to their similarity. Besides, similar to the process of optimizing the length of the spectrum segment, Fig. 8 presents comparative experiments for finding the suitable length of a video segment, where the overlapping is 50 percent and the video frames are encoded using the trained 2D CNN. As shown, we set the length of a video segment and the shift to 60 frames (about 2 seconds) and 30 frames (about 1 second) for the AVEC2013 and AVEC2014 databases.

For the STA network shown in Fig. 3 for extracting the ASLF, the 2D CNN has two convolution layers with the same settings: 8 kernels with a size of 5 × 11 and a stride of (3, 3), and 64 neurons in the FC layer. The dimension of the LSTM output sequence is 64. Three fully connected layers (i.e., FC1, FC2 and FC3) with 64 neurons follow. The Sigmoid is used as the activation function in the FC3 layer and the ReLU is used in the other layers. The optimizer is SGD with batch size 32 and a learning rate of 0.0002. The loss function is RMSE. The network structure for extracting the ASLF is the same for the AVEC2013 and AVEC2014 databases.

For the STA network shown in Fig. 4 for extracting the VSLF, the 3D CNN has two convolution layers, which have 8 kernels with a size of 3 × 3 × 3 and a stride of (2, 2, 2). The FC layer has 64 neurons. For the frame encoding, the 2D CNN has three convolution layers with the same setting: 8 kernels with a size of 3 × 3 and a stride of (2, 2). The layers FC_1, FC_2 and FC_3 all have 64 neurons. The training process for the frame encoding is independent and the loss function is RMSE. The dimension of the LSTM output sequence is 64. Three fully connected layers (i.e., FC1, FC2 and FC3) have 64 neurons each. The Sigmoid is used as the activation function in the FC3 and FC_3 layers. The ReLU is used in the other layers. The optimizer is SGD with batch size 32 and a learning rate of 0.0002. The loss function for the VSLF extraction model is RMSE. The network structure for extracting the VSLF is the same for the AVEC2013 and AVEC2014 databases.
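The audio front end described above maps directly onto a standard short-time Fourier transform. A minimal SciPy sketch is shown below; the synthetic waveform and the simple max normalization are placeholders, while the sampling rate, window, shift, number of frequency bins and segmentation follow the stated settings.

```python
import numpy as np
from scipy.signal import stft

fs = 8000                                    # 8 kHz sampling rate
x = np.random.default_rng(0).standard_normal(10 * fs)   # placeholder 10 s waveform

# 32 ms Hamming window (256 samples) with a 16 ms shift (128-sample hop)
f, t, Z = stft(x, fs=fs, window='hamming', nperseg=256, noverlap=128)
spec = np.abs(Z)                             # amplitude spectrum, 129 x num_frames
spec = spec / (spec.max() + 1e-8)            # placeholder normalization

# Cut into 64-frame segments (about 1 s) with a 32-frame shift (50 percent overlap)
seg_len, shift = 64, 32
segments = [spec[:, s:s + seg_len]
            for s in range(0, spec.shape[1] - seg_len + 1, shift)]
print(spec.shape[0], len(segments))          # 129 frequency bins and the number of segments
```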


5 RESULTS AND DISCUSSION

In this section, we show the effectiveness of the STA network, EEP and MAFF for depression detection. In addition, we also investigate the influence of different tasks on the detection accuracy. Finally, the comparisons between our method and previous works are presented.

Fig. 8. Depression detection performance on the development sets of the AVEC2013 (a) and AVEC2014 (b) using the video segments with different lengths.

5.1 Depression Detection Performance Using Different Modalities and Network Structures

Based on the above setup, we examine the detection performance using different modalities and network structures by predicting the depression level on the development sets of AVEC2013 and AVEC2014. The experimental results are shown in Table 3. In this table, CCL means concatenating the outputs of the 2D CNN (3D CNN) and LSTM (2D CNN+LSTM) without using the attention mechanism, where the output of the LSTM (2D CNN+LSTM) is a vector rather than a sequence. Note that the EEP and SVR are used for aggregation and prediction in these experiments.

TABLE 3
Using Different Modalities and Network Structures for Depression Prediction on the Development Sets of AVEC2013 and AVEC2014

                                    AVEC2013        AVEC2014
Modalities  Network Structures      RMSE    MAE     RMSE    MAE
Audio       2D CNN                  10.68   8.99    10.97   9.31
            LSTM                    10.30   8.86    10.33   8.61
            CCL                     9.79    7.97    9.82    8.88
            STA network             9.26    7.85    9.11    7.60
Video       3D CNN                  10.25   8.83    10.36   8.73
            2D CNN+LSTM             10.04   8.95    10.22   8.93
            CCL                     9.62    7.91    9.73    7.81
            STA network             8.54    7.70    8.31    6.73

CCL means to concatenate the outputs of 2D CNN (3D CNN) and LSTM (2D CNN+LSTM), where the output of LSTM (2D CNN+LSTM) is a vector rather than a sequence. The EEP and SVR are for aggregation and prediction in these experiments.

For the detection accuracy using the audio modality, the LSTM performs better than the 2D CNN, which is due to the fact that the spectrum is sequential data and temporal changes are more expressive than the spatial structure in characterizing the depression cues [42]. The reason for the good performance of the CCL and STA network is that they both contain the spatial and temporal information. The result also illustrates that the spatial and temporal features of the spectrum are both helpful for depression detection. Moreover, the STA network performs better than the CCL. The reason is that the STA network uses the attention mechanism to integrate spatiotemporal information and emphasize frames related to depression, while the CCL is limited in selecting key frames.

For the detection accuracy using the video modality, the 3D CNN regards the video as three-dimensional data and extracts the local spatial structure by multiple convolution kernels, but it is relatively weak at capturing facial motions within 60 frames due to the small kernel's support [56]. Instead, the 2D CNN+LSTM uses the 2D CNN to encode the video frames and the LSTM to process the frame sequence, so the facial changes in the video segment can be captured, which leads to better experimental performance. Similar to the audio modality, the CCL and STA network further improve the detection accuracy due to the extraction of spatiotemporal information. Moreover, the STA network gains the best performance because of its focus on key frames.

To further elaborate the effectiveness of the attention mechanism, we use the STA network to process the spectrum and video segments from individuals with different depression levels. The results are shown in Fig. 9. These figures are drawn using the imagesc function in MATLAB. The gray bars are the attention weights, and the darker the color is, the larger the coefficient is. From this figure, one can see that the STA network pays more attention to the discriminative parts (such as the parts surrounded by red boxes) and less attention to the less discriminative parts (such as the parts surrounded by green boxes). At the same time, the result also indicates that the effect of audio and video frames on depression detection is not exactly the same.

Fig. 9. The STA network uses the attention mechanism to emphasize audio and video frames related to depression detection. The color maps are the outputs of the LSTM for the spectrum segments (a-d) and the outputs of the 2D CNN+LSTM for the video segments (e-h). The gray bars are the attention weights and the darker the color is, the larger the coefficient is. The red boxes show the discriminative parts selected through the attention mechanism. The green boxes show the less discriminative parts. The arrow points to the enlarged part. (a)-(d) are the results of subjects No. 209-2, 222-1, 236-2, and 238-3 in the "Northwind" dataset of AVEC2014 using the audio data. (e)-(h) are the corresponding results using the video data. The BDI-II scores of subjects No. 209-2, 222-1, 236-2 and 238-3 are 9 (None), 16 (Mild), 25 (Moderate), and 39 (Severe).

Due to the similarity between the AVEC2013 and AVEC2014 databases, we can see that the video modality obtains better detection performance than the audio modality for these two databases. In addition, this finding can be explained by the fact that facial activities of depressive individuals show a more pronounced difference than speech compared with healthy individuals [19].


5.2 Depression Detection Performance Using Different Aggregation Methods

In this section, we examine the effect of different aggregation methods on depression detection. Table 4 presents the specific experimental results. Note that the STA network is used to extract the segment-level features and the SVR is used to predict the depression score in these experiments.

TABLE 4
Depression Detection Performance Using Different Aggregation Methods on the Development Sets of AVEC2013 and AVEC2014

                                     Audio           Video
Databases   Aggregation Methods      RMSE    MAE     RMSE    MAE
AVEC2013    MP                       10.02   8.23    9.86    7.77
            AP                       9.93    7.98    9.84    7.56
            FV                       9.85    7.90    9.57    7.21
            EEP                      9.26    7.85    8.54    7.70
AVEC2014    MP                       9.97    8.57    9.62    7.29
            AP                       9.88    8.29    9.54    7.41
            FV                       9.76    7.15    9.30    7.10
            EEP                      9.11    7.60    8.31    6.73

MP, AP and FV refer to Max-Pooling, Average-Pooling and Fisher Vector encoding, respectively.

As shown, the EEP is more suitable for the depression detection task than the other methods. This is because max-pooling or average-pooling only calculates the maximum or average value of the sequence, which is not sensitive to the temporal order of the sequence. For FV encoding, each sample is treated independently when creating the dictionary, so that the temporal relationship among samples is ignored. Different from them, the EEP maintains as much dynamic information as possible by reconstructing the changes of each dimension of the segment-level features, which leads to better accuracy.

Furthermore, to illustrate the effectiveness of combining the feature aggregation with the SVR, we compare the prediction performance with an end-to-end way in Table 5. In the end-to-end way, we input the spectrum segments (or video segments) into the STA network to obtain the predicted results of these segments and take the median of these results as the BDI-II score corresponding to the long-term spectrum (or video), like [22]. Note that, in the end-to-end way, the training process of the STA network is the same as in our method, that is, the network settings are the same in these experiments. From Table 5, one can observe that the combination of feature aggregation with the SVR can provide better experimental accuracy. The reason is that the EEP method can summarize the temporal evolution of each dimension of all segment-level features, so as to completely characterize the long-term spectrum or video, whereas the median only selects the prediction result of a certain spectrum or video segment, so the dynamic information in the range of the complete audio or video cannot be captured.

TABLE 5
Depression Detection Using Different Prediction Ways on the Development Sets of AVEC2013 and AVEC2014

                                Audio           Video
Databases   Prediction ways     RMSE    MAE     RMSE    MAE
AVEC2013    End-to-End          10.27   8.21    9.52    7.81
            SES                 9.26    7.85    8.54    7.70
AVEC2014    End-to-End          9.86    8.28    9.24    7.57
            SES                 9.11    7.60    8.31    6.73

In the end-to-end way, we input the spectrum or video segments into the STA network and take the median of these outputs as the prediction score. SES refers to the STA network for extracting ASLFs/VSLFs, EEP for generating ALF/VLF and SVR for prediction.

5.3 Depression Detection Performance Using Different Information

In this section, we validate the detection performance of different information on the development sets of the AVEC2013 and AVEC2014 databases. Table 6 gives the experimental results. Note that the STA network, EEP and SVR are used for segment-level feature extraction, aggregation and prediction.

TABLE 6
Depression Detection Performance Using Different Information on the Development Sets of AVEC2013 and AVEC2014

                                AVEC2013        AVEC2014
Different fusion information    RMSE    MAE     RMSE    MAE
ALF                             9.26    7.85    9.11    7.60
VAAF                            8.86    7.76    8.79    7.57
ALF+VAAF                        8.49    7.67    8.61    6.72
VLF                             8.54    7.70    8.31    6.73
AAVF                            8.73    7.97    9.08    7.35
VLF+AAVF                        8.42    7.58    8.35    6.69
VCA                             8.38    7.27    8.32    7.29
VCA+VAAF+AAVF                   8.10    6.38    8.00    6.42

VCA means to concatenate the VLF with the ALF. And "+" refers to the concatenation operation.

From Table 6, one can see that the detection performance of the VAAF is better than that of the ALF. The reason is that more pronounced depression cues are contained in facial activity than in speech. Thus, the usage of the attention mechanism between the VLF and ASLFs can extract the information similar to video from audio and improve the prediction accuracy. A similar reason explains the result that the AAVF is not as good as the VLF. In addition, we find that "ALF+VAAF" and "VLF+AAVF" both obtain better detection performance than "ALF" and "VLF". This result illustrates that the modal complementary information (i.e., VAAF and AAVF) is helpful for improving the experimental accuracy. "VCA+VAAF+AAVF" achieves the best detection result because it contains not only the audio and video features, but also their complementary information.

5.4 Depression Detection Performance Combining Different Tasks

In this section, we predict the depression level on the development set of the AVEC2014 database to investigate the effect of different tasks on depression detection. Since the publisher does not perform task segmentation in the AVEC2013 database, we only build models for the two tasks (i.e., "FreeForm" and "Northwind") in the AVEC2014 database. Considering that the data of each task is not enough for training the STA network, we use the training set of each task to fine-tune the network, which is trained using the combination of the training sets of "FreeForm" and "Northwind". Table 7 shows the experimental results. Note that "Com" means combining the training sets of the two tasks for training the models and combining the development sets of the two tasks for validation. In other cases, we get results based on separate tasks.


TABLE 7 TABLE 9
Depression Detection Performance Using Different Tasks and Comparison of Our Method and Previous Works on the Test Set
Modalities on the Development Set AVEC2014 of AVEC2014

Audio Video Multimodal


RMSE MAE RMSE MAE RMSE MAE
F 9.31 7.82 8.24 6.83 8.22 6.54
N 9.02 7.47 8.46 6.71 7.77 6.26
Com 9.11 7.60 8.31 6.73 8.00 6.42
Con 8.17 6.43 7.53 6.11 6.68 5.07

“F” and “N” are the tasks of “FreeForm” and “Northwind”. “Com” refers to
the combination of these two tasks like the above experiments. “Con” is the
concatenation of the features, which are extracted using the models correspond-
ing to “FreeForm” and “Northwind”.

cases, we get results based on separate tasks. “Con” indicates


that the features extracted from two tasks are concatenated to
estimate the severity of depression. From this table, we can
find that the accuracy of “Northwind” is better than that of
“FreeForm”. This is can be explained that different tasks can
lead to different speech changes and facial activities, which
contribute unequally for depression detection. The compari-
son of “Com” and “Con” illustrates that the detection accu-
racy can be further improved when multiple tasks are
considered.

5.5 Comparison With Previous Works


In this section, we predict the depression level on the test “A” and “V” refer to audio and video modalities. “A+V” is the fusion of the
audio and video modalities. “Ours with Con” refers to the concatenation of fea-
sets of AVEC2013 and AVEC2014 databases to compare our tures extracted using our method from two tasks of “Northwind” and
approach with previous work. Tables 8 and 9 show these “FreeForm”. ‘/’ indicates that the result is not provided.
comparisons. From them, we can see that our method
achieves good detection performance.
TABLE 8
Comparison of Our Method and Previous Works on the Test Set of AVEC2013

"A" and "V" refer to audio and video modalities. "A+V" is the fusion of the audio and video modalities. '/' indicates that the result is not provided.

For the audio modality, the proposed STA network integrates the spatiotemporal information of the speech spectrum and emphasizes the frames related to depression detection, so good performance is obtained. In the works of [6], [10], [12], [13], [15], [20], only spatial or temporal features are extracted. Besides, the EEP method summarizes the changes of each dimension of the segment-level features and is more suitable than other pooling methods [12], [13], [20], [31]. This is because statistical variables (mean, median, etc.) [12], [13] and the lp-norm [31] are not sensitive to temporal changes. The FV encoding used in [20] does not consider the order of features in the sequence when creating a dictionary. The works in [52], [55] obtain better detection accuracy on the AVEC2013 database. The method in [55] captures the hidden parameters in the auto/cross-correlations of the measured signals. For [52], they consider that the depression score is ordinal, so the score is partitioned and the relationship between features and these partitions is explored. Moreover, for the AVEC2014 database, "Ours with Con" obtains the best precision. This result shows that it is helpful to consider different tasks for depression detection.
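To make the pooling comparison above concrete, the following sketch shows one way to realize the eigen-evolution idea of [11]: learn a few temporal basis functions from training sequences and pool each sequence of segment-level features with those fixed weights. The function names and the PCA-over-time route are our own reading, not the authors' implementation, and all sequences are assumed to be resampled to the same number of segments.

```python
import numpy as np

def eigen_evolution_basis(train_seqs: np.ndarray, k: int = 2) -> np.ndarray:
    """Learn k temporal weighting functions of shape (T, k).

    train_seqs: (N, T, D) stack of segment-level feature sequences.
    The basis vectors are the leading eigenvectors of the temporal
    covariance accumulated over all sequences and feature dimensions.
    """
    n, t, d = train_seqs.shape
    trajectories = train_seqs.transpose(0, 2, 1).reshape(-1, t)  # (N*D, T)
    cov = np.cov(trajectories, rowvar=False)                     # (T, T)
    _, eigvecs = np.linalg.eigh(cov)                             # ascending order
    return eigvecs[:, ::-1][:, :k]                               # top-k columns

def eigen_evolution_pool(seq: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Pool a (T, D) sequence with fixed temporal weights into a (k*D,) vector."""
    return (basis.T @ seq).reshape(-1)

rng = np.random.default_rng(1)
train = rng.normal(size=(30, 20, 64))    # 30 recordings, 20 segments, 64-D each
basis = eigen_evolution_basis(train, k=2)
video_level_feature = eigen_evolution_pool(train[0], basis)  # 128-D summary
```

Unlike the mean or an lp-norm, such a fixed weighted pooling changes when the temporal order of the segments changes, which is exactly the sensitivity the comparison above argues for.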
For the video modality, our approach provides better accuracy than most methods, and the accuracy further improves when different tasks are involved for the AVEC2014 database. In contrast, Kang et al. [53] only explore facial images and ignore the effect of facial activities on depression detection. In the works of [7], [15], [23], [24], [25], video dynamic features are generated by calculating statistical histograms, but a histogram cannot reflect the uneven distribution of features in temporal space [48].


This issue also occurs in [22], because the 3D CNN ignores the differences across video frames [60]. The methods in [37], [48] use the attention mechanism to emphasize the key frames associated with the target task. For comparison, we repeat their work to predict the depression levels on the AVEC2013 and AVEC2014 databases. However, as shown in Table 3, the LSTM performs better than the 3D CNN, and our method applies attention to the output of the LSTM rather than to the 3D CNN as in [37], so better accuracy is obtained. For the method in [48], the depression severity is estimated from visual behaviors (i.e., FAUs, landmarks, head pose, and gaze), which are not sufficient for capturing facial detail texture [49]. Moreover, the works of [34], [56] obtain better performance due to their division of the face into multiple regions. The method in [46] achieves the best accuracy because facial movements are examined directly across the complete video without dividing it into segments.
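As a rough illustration of the distinction drawn above, the sketch below applies a learned softmax attention over LSTM hidden states and regresses a score from the attended summary; it is a generic pattern under assumed dimensions, not the actual STA network.

```python
import torch
import torch.nn as nn

class TemporalAttentionLSTM(nn.Module):
    """Generic frame-level attention over LSTM outputs (not the STA network)."""

    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)   # one attention score per frame
        self.head = nn.Linear(hidden, 1)    # regression to a depression score

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(frames)                 # (B, T, hidden)
        w = torch.softmax(self.score(h), dim=1)  # (B, T, 1) weights over time
        pooled = (w * h).sum(dim=1)              # attention-weighted summary
        return self.head(pooled).squeeze(-1)     # (B,) predicted scores

# Hypothetical batch: 4 segments, 30 frames, 128-D frame features.
model = TemporalAttentionLSTM()
scores = model(torch.randn(4, 30, 128))
```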

For the multimodal fusion, our method achieves the best prediction performance, especially when combining different tasks in the AVEC2014 database. This is because the concatenation of audio and video features used in [6], [24], [29], [30], [45] or the linear combination of decisions used in [15], [17], [26], [27], [28] is weak in capturing the complementary information between modalities. Different from them, the proposed MAFF strategy improves the quality of the multimodal representation by using the attention mechanism to extract complementary information between modalities. Furthermore, we draw the scatter plots in Fig. 10 to show the ground truth and predicted values, illustrating the prediction performance of our proposed method.
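The MAFF module itself is specified earlier in the paper; purely as an illustration of attention-weighted fusion versus plain concatenation, a generic sketch (assumed feature dimensions, not the MAFF architecture) could look like:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Generic attention-weighted fusion of audio and video features.

    Only an illustration of letting learned weights decide each modality's
    contribution; this is not the MAFF module from the paper.
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)  # one weight per modality

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(torch.cat([audio, video], dim=-1)), dim=-1)
        # Re-weight each modality before concatenation, instead of
        # concatenating the raw features directly.
        return torch.cat([w[..., :1] * audio, w[..., 1:] * video], dim=-1)

fusion = AttentionFusion(dim=128)
multimodal = fusion(torch.randn(4, 128), torch.randn(4, 128))  # (4, 256)
```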
Fig. 10. The scatter plot of the ground truth versus predicted value on the test set of AVEC2013 (a) and AVEC2014 (b) based on our proposed method.
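For readers who want to reproduce this kind of diagnostic plot together with the RMSE and MAE metrics, a short sketch with stand-in arrays (not the actual predictions) is:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
ground_truth = rng.uniform(0, 45, size=50)              # stand-in BDI-II labels
predicted = ground_truth + rng.normal(0, 7.0, size=50)  # stand-in predictions

rmse = float(np.sqrt(np.mean((predicted - ground_truth) ** 2)))
mae = float(np.mean(np.abs(predicted - ground_truth)))

plt.scatter(ground_truth, predicted, s=12)
plt.plot([0, 45], [0, 45], linestyle="--")  # perfect-prediction line
plt.xlabel("Ground-truth BDI-II score")
plt.ylabel("Predicted BDI-II score")
plt.title(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}")
plt.show()
```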
6 CONCLUSION

Physiological studies have revealed that there are some differences between depressive and healthy individuals in speech and facial activity. Based on this fact, we propose a multimodal spatiotemporal representation framework for automatic depression level detection. The proposed STA network not only integrates spatial and temporal information, but also emphasizes the frames related to depression detection. In addition, the proposed MAFF strategy improves the quality of the multimodal representation by extracting the complementary information between modalities. Experimental results on AVEC2013 and AVEC2014 indicate that our approach achieves good detection performance. In the future, we will segment different tasks and train separate models to improve the detection accuracy. We will also consider using this framework to predict other diseases if the data are available.

ACKNOWLEDGMENTS

This work was supported by the National Key Research & Development Plan of China under Grant No. 2017YFB1002804, the National Natural Science Foundation of China (NSFC) under Grant Nos. 61831022, 61771472, 61773379, and 61901473, and the Key Program of the Natural Science Foundation of Tianjin under Grant No. 18JCZDJC36300.

REFERENCES

[1] P. H. Soloff et al., "Self-mutilation and suicidal behavior in borderline personality disorder," J. Pers. Disorders, vol. 8, no. 4, pp. 257–267, 1994.
[2] World Health Organization, "Depression and other common mental disorders: Global health estimates," World Health Organization, pp. 7–24, 2017.
[3] A. J. Flint et al., "Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression," J. Psychiatric Res., vol. 27, no. 3, pp. 309–319, 1993.
[4] A. Korszun, "Facial pain, depression and stress–Connections and directions," J. Oral Pathol. Med., vol. 31, no. 10, pp. 615–619, 2002.
[5] A. McPherson and C. R. Martin, "A narrative review of the Beck Depression Inventory (BDI) and implications for its use in an alcohol-dependent population," J. Psychiatric Mental Health Nursing, vol. 17, no. 1, pp. 19–30, 2010.
[6] A. Jan, H. Meng, Y. F. B. A. Gaus, and F. Zhang, "Artificial intelligent system for automatic depression level analysis through visual and vocal expressions," IEEE Trans. Cogn. Devel. Syst., vol. 10, no. 3, pp. 668–680, Sep. 2018.
[7] L. He, D. Jiang, and H. Sahli, "Automatic depression analysis using dynamic facial appearance descriptor and dirichlet process fisher encoding," IEEE Trans. Multimedia, vol. 21, no. 6, pp. 1476–1486, Jun. 2019.
[8] N. Cummins et al., "Diagnosis of depression by behavioural signals: A multimodal approach," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 11–20.
[9] L. He, D. Jiang, and H. Sahli, "Multimodal depression recognition with dynamic visual and audio cues," in Proc. Int. Conf. Affect. Comput. Intell. Interaction, 2015, pp. 260–266.
[10] L. He and C. Cao, "Automated depression analysis using convolutional neural networks from speech," J. Biomed. Informat., vol. 83, pp. 103–111, 2018.
[11] Y. Wang, V. Tran, and M. Hoai, "Eigen evolution pooling for human action recognition," 2017, arXiv: 1708.05465.
[12] M. Valstar et al., "AVEC 2013: The continuous audio/visual emotion and depression recognition challenge," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 3–10.
[13] M. Valstar et al., "AVEC 2014: 3D dimensional affect and depression recognition challenge," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 3–10.


[14] F. Eyben, M. Wöllmer, and B. Schuller, "OpenEAR–Introducing the munich open-source emotion and affect recognition toolkit," in Proc. Int. Conf. Affect. Comput. Intell. Interaction Workshops, 2009, pp. 1–6.
[15] H. Meng et al., "Depression recognition based on dynamic facial and vocal expression features using partial least square regression," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 21–30.
[16] C. D. Sherbourne et al., "Long-term effectiveness of disseminating quality improvement for depression in primary care," Arch. Gen. Psychiatry, vol. 58, no. 7, pp. 696–703, 2001.
[17] X. Ma et al., "Cost-sensitive two-stage depression prediction using dynamic visual clues," in Proc. Asian Conf. Comput. Vis., 2016, pp. 338–351.
[18] D. Castelli and P. Pagano, "OpenDLib: A digital library service system," in Proc. Int. Conf. Theory Pract. Digit. Libraries, 2002, pp. 292–308.
[19] J. M. Girard and J. F. Cohn, "Automated audiovisual depression analysis," Current Opinion Psychol., vol. 4, pp. 75–79, 2015.
[20] V. Jain et al., "Depression estimation using audiovisual features and fisher vector encoding," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 87–91.
[21] Y. Zhu, Y. Shang, Z. Shao, and G. Guo, "Automated depression diagnosis based on deep networks to encode facial appearance and dynamics," IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 578–584, Fourth Quarter 2018.
[22] M. Al Jazaery and G. Guo, "Video-based depression level analysis by encoding deep spatiotemporal features," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2018.2870884.
[23] L. Wen, X. Li, G. Guo, and Y. Zhu, "Automated depression diagnosis based on facial dynamic analysis and sparse coding," IEEE Trans. Inf. Forensics Security, vol. 10, no. 7, pp. 1432–1441, Jul. 2015.
[24] H. Kaya, F. Çilli, and A. A. Salah, "Ensemble CCA for continuous emotion prediction," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 19–26.
[25] A. Dhall and R. Goecke, "A temporally piece-wise fisher vector approach for depression analysis," in Proc. Int. Conf. Affect. Comput. Intell. Interaction, 2015, pp. 255–259.
[26] M. Kächele et al., "Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression," in Proc. Int. Conf. Pattern Recognit. Appl. Methods, 2014, pp. 671–678.
[27] J. R. Williamson et al., "Vocal and facial biomarkers of depression based on motor incoordination and timing," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 65–72.
[28] M. Senoussaoui et al., "Model fusion for multimodal depression classification and level detection," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 57–63.
[29] N. Cummins et al., "Diagnosis of depression by behavioural signals: A multimodal approach," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2013, pp. 11–20.
[30] H. Perez Espinosa et al., "Fusing affective dimensions and audio-visual features from segmented video for depression recognition: INAOE-BUAP's participation at AVEC'14 challenge," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 49–55.
[31] M. Niu et al., "Automatic depression level detection via lp-norm pooling," in Proc. Conf. Int. Speech Commun. Assoc., 2019, pp. 4559–4563.
[32] M. Niu, J. Tao, and B. Liu, "Local second-order gradient cross pattern for automatic depression detection," in Proc. 8th Int. Conf. Affect. Comput. Intell. Interaction Workshops Demos, 2019, pp. 128–132.
[33] D. Oneata, J. Verbeek, and C. Schmid, "Action and event recognition with fisher vectors on a compact feature set," in Proc. Int. Conf. Comput. Vis., 2013, pp. 1817–1824.
[34] X. Zhou, K. Jin, Y. Shang, and G. Guo, "Visually interpretable representation learning for depression recognition from facial images," IEEE Trans. Affective Comput., vol. 11, no. 3, pp. 542–552, Third Quarter 2020.
[35] Y. Zhu, Y. Shang, Z. Shao, and G. Guo, "Automated depression diagnosis based on deep networks to encode facial appearance and dynamics," IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 578–584, Fourth Quarter 2018.
[36] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[37] J. Lee, S. Kim, S. Kim, and K. Sohn, "Spatiotemporal attention based deep neural networks for emotion recognition," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2018, pp. 1513–1517.
[38] W. C. De Melo, E. Granger, and A. Hadid, "Combining global and local convolutional 3D networks for detecting depression from facial expressions," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2019, pp. 1–8.
[39] S. Poria et al., "A review of affective computing: From unimodal analysis to multimodal fusion," Inf. Fusion, vol. 37, pp. 98–125, 2017.
[40] I. Laptev, "On space-time interest points," Int. J. Comput. Vis., vol. 64, no. 2/3, pp. 107–123, 2005.
[41] B. Schuller et al., "Paralinguistics in speech and language–State-of-the-art and the challenge," Comput. Speech Lang., vol. 27, no. 1, pp. 4–39, 2013.
[42] J. C. Mundt et al., "Vocal acoustic biomarkers of depression severity and treatment response," Biol. Psychiatry, vol. 72, no. 7, pp. 580–587, 2012.
[43] B. Bhushan, "Study of facial micro-expressions in psychology," in Understanding Facial Expressions in Communication. New Delhi, India: Springer, 2015, pp. 265–286.
[44] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, Jun. 2007.
[45] R. Gupta et al., "Multimodal prediction of affective dimensions and depression in human-computer interactions," in Proc. ACM Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 33–40.
[46] S. Song, S. Jaiswal, L. Shen, and M. Valstar, "Spectral representation of behaviour primitives for depression analysis," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2020.2970712.
[47] S. Song, L. Shen, and M. Valstar, "Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 158–165.
[48] Z. Du, W. Li, D. Huang, and Y. Wang, "Encoding visual behaviors with attentive temporal convolution for depression prediction," in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2019, pp. 1–7.
[49] I. A. Essa and A. P. Pentland, "Coding, analysis, interpretation, and recognition of facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 757–763, Jul. 1997.
[50] N. Cummins et al., "A review of depression and suicide risk assessment using speech analysis," Speech Commun., vol. 71, pp. 10–49, 2015.
[51] A. Pampouchidou et al., "Automatic assessment of depression based on visual cues: A systematic review," IEEE Trans. Affective Comput., vol. 10, no. 4, pp. 445–470, Fourth Quarter 2019.
[52] N. Cummins, V. Sethu, J. Epps, J. R. Williamson, T. F. Quatieri, and J. Krajewski, "Generalized two-stage rank regression framework for depression score prediction from speech," IEEE Trans. Affective Comput., vol. 11, no. 2, pp. 272–283, Second Quarter 2020.
[53] Y. Kang et al., "Deep transformation learning for depression diagnosis from facial images," in Proc. Chin. Conf. Biometric Recognit., 2017, pp. 13–22.
[54] M. Van Segbroeck et al., "A robust frontend for VAD: Exploiting contextual, discriminative and spectral cues of human voice," in Proc. Conf. Int. Speech Commun. Assoc., 2013, pp. 704–708.
[55] T. F. Quatieri et al., "Multimodal biomarkers to discriminate cognitive state," in The Role of Technology in Clinical Neuropsychology. Oxford, U.K.: Oxford Univ. Press, 2017, pp. 409–443.
[56] M. A. Uddin, J. B. Joolee, and Y. Lee, "Depression level prediction using deep spatiotemporal features and multilayer Bi-LTSM," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2020.2970418.
[57] M. Kächele, M. Schels, and F. Schwenker, "Inferring depression and affect from application dependent meta knowledge," in Proc. Int. Workshop Audio/Visual Emotion Challenge, 2014, pp. 41–48.
[58] T. Taguchi et al., "Major depressive disorder discrimination using vocal acoustic features," J. Affect. Disorders, vol. 225, pp. 214–220, 2018.
[59] E. Rejaibi et al., "Clinical depression and affect recognition with EmoAudioNet," 2019, arXiv: 1911.00310.
[60] J. Li, X. Liu, W. Zhang, M. Zhang, J. Song, and N. Sebe, "Spatio-temporal attention networks for action recognition and detection," IEEE Trans. Multimedia, to be published, doi: 10.1109/TMM.2020.2965434.


Mingyue Niu (Student Member, IEEE) received the master's degree from the Department of Applied Mathematics, Northwestern Polytechnical University (NWPU), China, in 2017. He is currently working toward the PhD degree with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. He has published papers in ICASSP and INTERSPEECH. His research interests include affective computing and depression recognition and analysis.

Jianhua Tao (Senior Member, IEEE) received the PhD degree from Tsinghua University, China, in 2001. He is the winner of the National Science Fund for Distinguished Young Scholars and the deputy director of NLPR, CASIA. He has directed many national projects, including "863" and the National Natural Science Foundation of China. His interests include speech synthesis, affective computing, and pattern recognition. He has published more than eighty papers in journals and proceedings, including the IEEE Transactions on Audio, Speech, and Language Processing, ICASSP, and INTERSPEECH. He also serves as a steering committee member for the IEEE Transactions on Affective Computing and as a chair or program committee member for major conferences, including ICPR, Interspeech, etc.

Bin Liu (Member, IEEE) received the BS and MS degrees from the Beijing Institute of Technology (BIT), Beijing, in 2007 and 2009, respectively, and the PhD degree from the NLPR, CASIA, Beijing, China, in 2015. He is currently an associate professor with the National Laboratory of Pattern Recognition, CASIA, Beijing, China. His current research interests include affective computing and audio signal processing.

Jian Huang (Student Member, IEEE) received the BE degree from Wuhan University, China. He is currently working toward the PhD degree with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China. He has published papers in INTERSPEECH and ICASSP. His research interests cover affective computing, deep learning, and multimodal emotion recognition.

Zheng Lian received the BE degree from the Beijing University of Posts and Telecommunications, China. He is currently working toward the PhD degree with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China. His research interests include affective computing, deep learning, and multimodal emotion recognition.

