Neural Granger Causality
Abstract—While most classical approaches to Granger causality detection assume linear dynamics, many interactions in real-world
applications, like neuroscience and genomics, are inherently nonlinear. In these cases, using linear models may lead to inconsistent
estimation of Granger causal interactions. We propose a class of nonlinear methods by applying structured multilayer perceptrons
(MLPs) or recurrent neural networks (RNNs) combined with sparsity-inducing penalties on the weights. By encouraging specific sets of
weights to be zero—in particular, through the use of convex group-lasso penalties—we can extract the Granger causal structure. To
further contrast with traditional approaches, our framework naturally enables us to efficiently capture long-range dependencies
between series either via our RNNs or through an automatic lag selection in the MLP. We show that our neural Granger causality
methods outperform state-of-the-art nonlinear Granger causality methods on the DREAM3 challenge data. This data consists of
nonlinear gene expression and regulation time courses with only a limited number of time points. The successes we show in this
challenging dataset provide a powerful example of how deep learning can be useful in cases that go beyond prediction on large
datasets. We likewise illustrate our methods in detecting nonlinear interactions in a human motion capture dataset.
Index Terms—Time series, Granger causality, neural networks, structured sparsity, interpretability
[21], are able to detect these nonlinear dependencies between past and future with minimal assumptions about the predictive relationships. However, these estimators have high variance and require large amounts of data for reliable estimation. These approaches also suffer from a curse of dimensionality [22] when the number of series grows, making them inappropriate in the high-dimensional setting.

Neural networks are capable of representing complex, nonlinear, and non-additive interactions between inputs and outputs. Indeed, their time series variants, such as autoregressive multilayer perceptrons (MLPs) [23], [24], [25] and recurrent neural networks (RNNs) like long short-term memory networks (LSTMs) [26], have shown impressive performance in forecasting multivariate time series given their past [27], [28], [29]. While these methods have shown impressive predictive performance, they are essentially black box methods and provide little interpretability of the multivariate structural relationships in the series. A second drawback is that jointly modeling a large number of series leads to many network parameters. As a result, these methods require much more data to fit reliably and tend to perform poorly in high-dimensional settings.

We present a framework for structure learning in MLPs and RNNs that leads to interpretable nonlinear Granger causality discovery. The proposed framework harnesses the impressive flexibility and representational power of neural networks. It also sidesteps the black-box nature of many network architectures by introducing component-wise architectures that disentangle the effects of lagged inputs on individual output series. For interpretability and an ability to handle limited data in the high-dimensional setting, we place sparsity-inducing penalties on particular groupings of the weights that relate the histories of individual series to the output series of interest. We term these sparse component-wise models, e.g., cMLP and cLSTM, when applied to the MLP and LSTM, respectively. In particular, we select for Granger causality by adding group sparsity penalties [13] on the outgoing weights of the inputs.

As in linear methods, appropriate lag selection is crucial for Granger causality selection in nonlinear approaches, especially in highly parametrized models like neural networks. For the MLP, we introduce two more structured group penalties [15], [30], [31] that automatically detect both nonlinear Granger causality and the lags of each inferred interaction. Our proposed cLSTM model, on the other hand, sidesteps the lag selection problem entirely because the recurrent architecture efficiently models long-range dependencies [26]. When the true network of nonlinear interactions is sparse, both the cMLP and cLSTM approaches will select a subset of the time series that Granger-cause the output series, no matter the lag of interaction. To our knowledge, these approaches represent the first set of nonlinear Granger causality methods applicable in high dimensions without requiring precise lag specification.

We first validate our approach and the associated penalties via simulations on both linear VAR and nonlinear Lorenz-96 data [32], showing that our nonparametric approach accurately selects the Granger causality graph in both linear and nonlinear settings. Second, we compare our cMLP and cLSTM models with existing Granger causality approaches [33], [34] on the difficult DREAM3 gene regulatory network recovery benchmark datasets [35] and find that our methods outperform a wide set of competitors across all five datasets. Finally, we use our cLSTM method to explore Granger causal interactions between body parts during natural motion with a highly nonlinear and complex dataset of human motion capture [36], [37]. Our implementation is available online: https://github.com/iancovert/Neural-GC.

Traditionally, the success stories of neural networks have been on prediction tasks in large datasets. In contrast, here, our performance metrics relate to our ability to produce interpretable structures of interaction amongst the observed time series. Furthermore, these successes are achieved in limited data scenarios. Our ability to produce interpretable structures and to train neural network models with limited data can be attributed to our use of structured sparsity-inducing penalties and the regularization such penalties provide, respectively. We note that sparsity-inducing penalties have been used for architecture selection in neural networks [38], [39]. However, the focus of that architecture selection was on improving predictive performance rather than on returning interpretable structures of interaction among observed quantities.

More generally, our proposed formulation shows how structured penalties common in regression [30], [31] may be generalized for structured sparsity and regularization in neural networks. This opens up new opportunities to use these tools in other neural network contexts, especially as applied to structure learning problems. In concurrent work, a similar notion of sparse-input neural networks was developed for high-dimensional regression and classification tasks for independent data [40].

2 LINEAR GRANGER CAUSALITY

Let x_t ∈ R^p be a p-dimensional stationary time series and assume we have observed the process at T time points, (x_1, ..., x_T). Using a model-based approach, as is our focus, Granger causality in time series analysis is typically studied using the vector autoregressive (VAR) model [9]. In this model, the time series at time t, x_t, is assumed to be a linear combination of the past K lags of the series

    x_t = Σ_{k=1}^{K} A^{(k)} x_{t−k} + e_t,                                              (1)

where A^{(k)} is a p × p matrix that specifies how lag k affects the future evolution of the series and e_t is zero mean noise. In this model, time series j does not Granger-cause time series i if and only if for all k, A^{(k)}_{ij} = 0. A Granger causal analysis in a VAR model thus reduces to determining which values in A^{(k)} are zero over all lags. In higher dimensional settings, this may be determined by solving a group lasso regression problem [41]

    min_{A^{(1)}, ..., A^{(K)}}  Σ_{t=K}^{T} || x_t − Σ_{k=1}^{K} A^{(k)} x_{t−k} ||_2^2
                                 + λ Σ_{ij} || (A^{(1)}_{ij}, ..., A^{(K)}_{ij}) ||_2,     (2)

where ||·||_2 denotes the L2 norm. The group lasso penalty over all lags of each (i, j) entry, || (A^{(1)}_{ij}, ..., A^{(K)}_{ij}) ||_2, jointly shrinks all A^{(k)}_{ij} parameters to zero across all lags k [13]. The hyperparameter λ > 0 controls the level of group sparsity.
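As a concrete illustration of the estimator in (2), the following is a minimal sketch of group-lasso VAR estimation by proximal gradient descent. It is an illustration under stated assumptions rather than the paper's implementation, and the function and variable names are hypothetical; the group soft-thresholding step jointly shrinks the lag vector (A^{(1)}_{ij}, ..., A^{(K)}_{ij}) for each pair (i, j) toward zero.

```python
import numpy as np

def fit_var_group_lasso(X, K, lam, lr=1e-3, n_iter=5000):
    """Sketch of the group-lasso VAR estimator in Eq. (2).

    X   : (T, p) array of observations x_1, ..., x_T
    K   : maximum lag
    lam : group-lasso strength (lambda)
    Returns A with shape (K, p, p); series j is inferred not to
    Granger-cause series i when A[:, i, j] is identically zero.
    """
    T, p = X.shape
    A = np.zeros((K, p, p))
    # Build targets and lagged inputs: predict x_t from x_{t-1}, ..., x_{t-K}.
    Y = X[K:]                                                    # (T-K, p)
    Z = np.stack([X[K - k - 1: T - k - 1] for k in range(K)])    # (K, T-K, p)
    for _ in range(n_iter):
        resid = Y - np.einsum('kij,ktj->ti', A, Z)               # prediction residuals
        grad = -2.0 * np.einsum('ti,ktj->kij', resid, Z)
        A = A - lr * grad                                        # gradient step on squared error
        # Proximal step: group soft-threshold each (i, j) lag vector.
        norms = np.sqrt((A ** 2).sum(axis=0))                    # ||(A_ij^(1), ..., A_ij^(K))||_2
        scale = np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
        A = A * scale[None, :, :]
    return A

# Hypothetical usage: GC[i, j] = 1 if series j is estimated to Granger-cause series i.
# A_hat = fit_var_group_lasso(X, K=5, lam=0.1)
# GC = (np.abs(A_hat).sum(axis=0) > 0).astype(int)
```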
Fig. 1. (left) Schematic for modeling Granger causality using cMLPs. If the outgoing weights for series j, shown in dark blue, are penalized to zero, then series j does not Granger-cause series i. (center) The group lasso penalty jointly penalizes the full set of outgoing weights while the hierarchical version penalizes the nested set of outgoing weights, penalizing higher lags more. (right) Schematic for modeling Granger causality using a cLSTM. If the dark blue outgoing weights to the hidden units from an input x_{(t−1)j} are zero, then series j does not Granger-cause series i.
The group penalty in Equation (2) may be replaced with a structured hierarchical penalty [30], [42] that automatically selects the lag of each Granger causal interaction [15]. Specifically, the hierarchical lag selection problem is given by

    min_{A^{(1)}, ..., A^{(K)}}  Σ_{t=K}^{T} || x_t − Σ_{k=1}^{K} A^{(k)} x_{t−k} ||_2^2
                                 + λ Σ_{ij} Σ_{k=1}^{K} || (A^{(k)}_{ij}, ..., A^{(K)}_{ij}) ||_2,   (3)

where λ > 0 now controls the lag order selected for each interaction. Specifically, at higher values of λ there exists a k for each (i, j) pair such that the entire contiguous set of lags (A^{(k)}_{ij}, ..., A^{(K)}_{ij}) is shrunk to zero. If k = 1 for a particular (i, j) pair, then all lags are equal to zero and series j does not Granger-cause series i; thus, this penalty simultaneously selects for Granger non-causality and the lag of each Granger causal pair.

3 MODELS FOR NEURAL GRANGER CAUSALITY

3.1 Adapting Neural Networks for Granger Causality

A nonlinear autoregressive model (NAR) allows x_t to evolve according to more general nonlinear dynamics [25]

    x_t = g(x_{<t1}, ..., x_{<tp}) + e_t,                                                 (4)

where x_{<ti} = (..., x_{(t−2)i}, x_{(t−1)i}) denotes the past of series i and we assume additive zero mean noise e_t.

In a forecasting setting, it is common to jointly model the full nonlinear function g using neural networks. Neural networks have a long history in NAR forecasting, using both traditional architectures [25], [43], [44] and more recent deep learning techniques [27], [29], [45]. These approaches either utilize an MLP where the inputs are x_{<t} = (x_{(t−1)}, ..., x_{(t−K)}), for some lag K, or a recurrent network, like an LSTM.

There are two problems with applying the standard neural network NAR model in the context of inferring Granger causality. The first is that these models act as black boxes that are difficult to interpret. Due to the sharing of hidden layers, it is difficult to specify sufficient conditions on the weights that simultaneously allow series j to Granger-cause series i but not Granger-cause series i′ for i ≠ i′. Second, a joint network over all x_{ti} for all i assumes that each time series depends on the same past lags of the other series. However, in practice, each x_{ti} may depend on different past lags of the other series.

To tackle these challenges, we propose a structured neural network approach to modeling and estimation. First, instead of modeling g jointly across all outputs x_t, as is standard in multivariate forecasting, we instead focus on each output component with a separate model

    x_{ti} = g_i(x_{<t1}, ..., x_{<tp}) + e_{ti}.

Here, g_i is a function that specifies how the past K lags are mapped to series i. In this context, Granger non-causality between two series j and i means that the function g_i does not depend on x_{<tj}, the past lags of series j. More formally,

Definition 1. Time series j is Granger non-causal for time series i if for all (x_{<t1}, ..., x_{<tp}) and all x′_{<tj} ≠ x_{<tj},

    g_i(x_{<t1}, ..., x_{<tj}, ..., x_{<tp}) = g_i(x_{<t1}, ..., x′_{<tj}, ..., x_{<tp});

that is, g_i is invariant to x_{<tj}.

In Sections 3.2 and 3.3 we consider these component-wise models in the context of MLPs and LSTMs. We examine a set of sparsity-inducing penalties as in Equations (2) and (3) that allow us to infer the invariances of Definition 1 that lead us to identify Granger non-causality.
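To make Definition 1 concrete, here is a small illustrative sketch (not part of the paper's method) that probes a fitted component function g_i: it perturbs only the history of series j and checks whether the prediction changes. A finite probe like this can only provide evidence against invariance; the following sections instead certify it exactly through zeroed input weights.

```python
import numpy as np

def probe_invariance(g_i, x_past, j, n_probes=100, tol=1e-8, rng=None):
    """Probe the invariance in Definition 1 for a fitted component model.

    g_i    : callable mapping a (K, p) array of lagged values to a scalar
             prediction of series i (a hypothetical fitted model).
    x_past : (K, p) array holding the lagged values x_{<t,1}, ..., x_{<t,p}.
    j      : index of the candidate Granger-causal input series.
    Returns True if every perturbation of column j left the output unchanged
    (consistent with, but not proof of, Granger non-causality).
    """
    rng = np.random.default_rng(rng)
    base = g_i(x_past)
    for _ in range(n_probes):
        x_alt = x_past.copy()
        x_alt[:, j] = rng.normal(size=x_past.shape[0])  # replace series j's past
        if abs(g_i(x_alt) - base) > tol:
            return False  # g_i depends on x_{<tj}: evidence that j Granger-causes i
    return True
```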
3.2 Sparse Input MLPs

Our first approach is to model each output component g_i with a separate MLP, so that we can easily disentangle the effects from inputs to outputs. We refer to this approach, displayed pictorially in Fig. 1, as a component-wise MLP (cMLP). Let g_i take the form of an MLP with L − 1 layers and let the vector h^l_t ∈ R^H denote the values of the H-dimensional lth hidden layer at time t. The parameters of the neural network are given by weights W and biases b at each layer, W = (W^1, ..., W^L) and b = (b^1, ..., b^L). To draw an analogy with the time series VAR model, we further decompose the weights at the first layer across time lags, W^1 = (W^{11}, ..., W^{1K}). The dimensions of the parameters are given by W^1 ∈ R^{H×pK}, W^l ∈ R^{H×H} for 1 < l < L, W^L ∈ R^H, b^l ∈ R^H for l < L, and b^L ∈ R. Using this notation, the vector of first layer hidden values at time t is given by

    h^1_t = σ( Σ_{k=1}^{K} W^{1k} x_{t−k} + b^1 ),                                        (5)

where σ is an activation function. Typical activation functions are either logistic or tanh functions. The vector of hidden units in subsequent layers is given by a similar form, also with σ activation functions.

A group lasso penalty over all lags of the input weights for series j, applied to each column W^1_{:j} in the penalized objective of Equation (8), shrinks entire columns to zero, with λ controlling the number of estimated Granger causal connections. This group penalty is the neural network analogue of the group lasso penalty across lags in Equation (2) for the VAR case. To detect the lags where Granger causal effects exist, we propose a new penalty called a group sparse group lasso penalty. This penalty assumes that only a few lags of a series j are predictive of series i, and provides both sparsity across groups (a sparse set of Granger causal time series) and sparsity within groups (a subset of relevant lags)

    Ω(W^1_{:j}) = α ||W^1_{:j}||_F + (1 − α) Σ_{k=1}^{K} ||W^{1k}_{:j}||_2.               (10)

3.3 Sparse Input RNNs

For the cLSTM, each component function g_i is modeled with a separate LSTM. In addition to the hidden state h_t, the LSTM maintains a second hidden state variable c_t, referred to as the cell state, giving the full set of hidden parameters as (c_t, h_t). The LSTM model updates its hidden states recursively as

    f_t = σ(W^f x_t + U^f h_{t−1})
    i_t = σ(W^{in} x_t + U^{in} h_{t−1})
    o_t = σ(W^o x_t + U^o h_{t−1})                                                        (13)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ σ(W^c x_t + U^c h_{t−1})
    h_t = o_t ⊙ σ(c_t),

where ⊙ denotes element-wise multiplication and i_t, f_t, and o_t represent input, forget, and output gates, respectively, that control how each component of the cell state, c_t, is updated and then transferred to the hidden state used for prediction, h_t. In particular, the forget gate, f_t, controls the amount that the past cell state influences the future cell state, and the input gate, i_t, controls the amount that the current observation influences the new cell state. The additive form of the cell state update in the LSTM allows it to encode long-range dependencies, since cell states from far in the past may still influence the cell state at time t if the forget gates remain close to one. In the context of Granger causality, this flexible architecture can represent long-range, nonlinear dependencies between time series. As in the cMLP, Granger causality is selected by penalizing the columns of the input weight matrices, giving the penalized objective

    min_W  Σ_{t=2}^{T} ( x_{it} − g_i(x_{<t}) )^2 + λ Σ_{j=1}^{p} ||W^1_{:j}||_2.          (15)

For a large enough λ, many columns of W^1 will be zero, leading to a sparse set of Granger causal connections. An example sparsity pattern in the LSTM parameters is shown in Fig. 3.

4 OPTIMIZING THE PENALIZED OBJECTIVES

4.1 Optimizing the Penalized cMLP Objective

We optimize the nonconvex objectives of Equation (8) using proximal gradient descent [50]. Proximal optimization is important in our context because it leads to exact zeros in the columns of the input matrices, a critical requirement for interpreting Granger non-causality in our framework. Additionally, a line search can be incorporated into the optimization algorithm to ensure convergence to a local minimum [51]. The algorithm updates the network weights W iteratively, starting with W^{(0)}, by

    W^{(m+1)} = prox_{γ^{(m)} Ω}( W^{(m)} − γ^{(m)} ∇L(W^{(m)}) ),                        (16)

where L = Σ_{t=K}^{T} ( x_{ti} − g_i(x_{<t}) )^2 is the neural network prediction loss and prox_Ω is the proximal operator with respect to the sparsity-inducing penalty function Ω.
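To make the component-wise architecture of Section 3.2 and the update in (16) concrete, here is a small PyTorch-style sketch. It is an illustration under assumed shapes and names, not the released code at the repository linked above: a single component model g_i with first-layer weights decomposed by lag as in (5), updated by a gradient step on the squared-error loss followed by group soft-thresholding of the per-series input columns.

```python
import torch
import torch.nn as nn

class ComponentMLP(nn.Module):
    """One component model g_i: predicts x_{t,i} from the past K lags of all p series."""
    def __init__(self, p, K, H):
        super().__init__()
        # First-layer weights decomposed by lag (Eq. (5)); column j of W1[k]
        # holds the outgoing weights of input series j at lag k + 1.
        self.W1 = nn.Parameter(0.01 * torch.randn(K, H, p))
        self.b1 = nn.Parameter(torch.zeros(H))
        self.out = nn.Linear(H, 1)

    def forward(self, x_lags):                      # x_lags: (batch, K, p)
        h1 = torch.tanh(torch.einsum('khp,bkp->bh', self.W1, x_lags) + self.b1)
        return self.out(h1).squeeze(-1)

def proximal_update(model, x_lags, target, gamma, lam):
    """One step of Eq. (16) with the group lasso penalty on the input columns."""
    loss = ((model(x_lags) - target) ** 2).sum()    # prediction loss L(W)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for param, grad in zip(model.parameters(), grads):
            param -= gamma * grad                   # gradient step on all weights
        # Proximal step: group soft-threshold each input column W^1_{:j};
        # the higher layers use the identity proximal map.
        col_norm = (model.W1 ** 2).sum(dim=(0, 1)).sqrt()
        scale = (1.0 - gamma * lam / col_norm.clamp_min(1e-12)).clamp_min(0.0)
        model.W1 *= scale
    return loss.item()

# After convergence, series j is inferred not to Granger-cause series i
# exactly when the column model.W1[:, :, j] is identically zero.
```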
The entries in W^{(0)} are initialized randomly from a standard normal distribution. The scalar γ^{(m)} is the step size, which is either set to a fixed value or determined by line search [51]. While the objectives in Equation (8) are nonconvex, we find that no random restarts are required to accurately detect Granger causality connections.

Since the sparsity-promoting group penalties are only applied to the input weights, the proximal step for weights at the higher levels is simply the identity function. The proximal step for the group lasso penalty on the input weights is given by a group soft-thresholding operation on the input weights [50],

    prox_{γ^{(m)} Ω}(W^1_{:j}) = soft(W^1_{:j}, γ^{(m)} λ)                                 (17)
                               = ( 1 − γ^{(m)} λ / ||W^1_{:j}||_F )_+  W^1_{:j},           (18)

where (x)_+ = max(0, x). For the group sparse group lasso, the proximal step on the input weights is given by group soft-thresholding on the lag-specific weights, followed by group soft-thresholding on the entire resulting input weights for each series; see Algorithm 2. The proximal step on the input weights for the hierarchical penalty is given by iteratively applying the group soft-thresholding operation on each nested group in the penalty, from the smallest group to the largest group [42], and is shown in Algorithm 3.
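For illustration, the two one-pass proximal maps described here (given below as Algorithms 2 and 3) can be sketched as follows; this is a hedged reimplementation under an assumed tensor layout, with hypothetical names, not the paper's code.

```python
import torch

def soft_group(W, thresh):
    """Group soft-thresholding: scale the whole block W toward zero (Eq. (18))."""
    norm = W.norm().clamp_min(1e-12)
    return W * (1.0 - thresh / norm).clamp_min(0.0)

def prox_group_sparse_group_lasso(W1_j, gamma, lam):
    """One-pass proximal map mirroring Algorithm 2.

    W1_j : (K, H) tensor of the input weights for series j, one row block per lag
           (assumed layout). Each lag group is thresholded, then the full group.
    """
    W = torch.stack([soft_group(W1_j[k], gamma * lam) for k in range(W1_j.shape[0])])
    return soft_group(W, gamma * lam)

def prox_hierarchical(W1_j, gamma, lam):
    """One-pass proximal map mirroring Algorithm 3: threshold the nested groups
    (W^{1k}_{:j}, ..., W^{1K}_{:j}) from the smallest (lag K only) to the largest."""
    W = W1_j.clone()
    K = W.shape[0]
    for k in range(K - 1, -1, -1):          # k = K, ..., 1 in the paper's indexing
        W[k:] = soft_group(W[k:], gamma * lam)
    return W
```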
the training of of RNNs [53], [54], [55]. The full optimization
Algorithm 1. Proximal Gradient Descent With Line Search algorithm for training is shown in Algorithm 4.
Algorithm for Solving Equation (8). Proximal Steps Given
in Equation (17) for the Group Lasso Penalty, in Algo- Algorithm 4. Proximal Gradient Descent With Line Search
rithm 2 for the Group Sparse Group Lasso Penalty, and in Algorithm for Solving Equation (15) for the cLSTM With
Algorithm 3 for the Hierarchical Penalty Group Lasso Penalty
Require: > 0 Require: > 0
m ¼ 0, initialize Wð0Þ m ¼ 0, initialize Wð0Þ
while not converged do while not converged do
m¼mþ1 m¼mþ1
determine g by line search compute rL WðmÞ by BPTT (truncated for large T )
for j ¼ 1 to p do determine g by line search.
1ðmþ1Þ 1ðmÞ for j ¼ 1 to p do
W:j ¼ proxgV W:j grW 1 L WðmÞ 1ðmþ1Þ 1ðmÞ
:j W:j ¼ soft W:j grW 1 L WðmÞ ; g
end for :j
Algorithm 2. One Pass Algorithm to Compute the Proxi- 5 COMPARING CMLP AND CLSTM MODELS
mal Map for the Group Sparse Group Lasso Penalty, for FOR GRANGER CAUSALITY
Relevant Lag Selection in the cMLP Model Both the cMLP and cLSTM frameworks model each com-
ponent function gi using independent networks for each i.
Require: > 0; g > 0; W:j11 ; . . . ; W:j1K
for k ¼ K to 1 do For the cMLP model, one needs to specify a maximum pos-
sible model lag K. However, our lag selection strategy
W:j1k ¼ soft W:j1k ; g
(Equation (11)) allows one to set K to a large value and the
end for
weights for higher lags are automatically removed from the
W:j11 ; . . . ; W:j1K ¼ soft W:j11 ; . . . ; W:j1K ; g model. On the other hand, the cLSTM model requires no
maximum lag specification, and instead automatically
return W:j11 ; . . . ; W:j1K
learns the memory of each interaction. As a consequence,
TABLE 1
Comparison of AUROC for Granger Causality Selection Among Different Approaches, as a Function of the Forcing Constant F and the Length of the Time Series T

              F = 10                                  F = 40
Model         T = 250     T = 500     T = 1000        T = 250     T = 500     T = 1000
cMLP          86.6 ± 0.2  96.6 ± 0.2  98.4 ± 0.1      84.0 ± 0.5  89.6 ± 0.2  95.5 ± 0.3
cLSTM         81.3 ± 0.9  93.4 ± 0.7  96.0 ± 0.1      75.1 ± 0.9  87.8 ± 0.4  94.4 ± 0.5
IMV-LSTM      63.7 ± 4.3  76.0 ± 4.5  85.5 ± 3.4      53.6 ± 5.2  59.0 ± 4.5  69.0 ± 4.8
LOO-LSTM      47.9 ± 3.2  49.4 ± 1.8  50.1 ± 1.0      50.1 ± 3.3  49.1 ± 3.2  51.1 ± 3.7

Results are the mean across five initializations, with 95 percent confidence intervals.
The confidence intervals are also relatively narrow, at less than 1 percent AUROC for the cLSTM and cMLP, compared with 3-5 percent for the IMV-LSTM.

To understand the role of the number of hidden units in our methods, we perform an ablation study to test different values of H; the results show that both the cMLP and cLSTM are robust to smaller H values, but that their performance benefits from H = 100 hidden units (see Appendix, available in the online supplemental material). Additionally, we investigate the importance of the optimization algorithm; we found that Adam [58], proximal gradient descent [50], and proximal gradient descent with a line search [51] lead to similar results (see Appendix, available in the online supplemental material). However, because Adam requires a thresholding parameter and the line search is computationally costly, we use standard proximal gradient descent in the remainder of our experiments.

6.1.2 VAR Model

To analyze our methods' performance when the true underlying dynamics are linear, we simulate data from p = 20 VAR(1) and VAR(2) models with randomly generated sparse transition matrices. To generate sparse dependencies for each time series i, we create self dependencies and randomly select three more dependencies among the other p − 1 time series. Where series i depends on series j, we set A^{(k)}_{ij} = 0.1 for k = 1 or k = 1, 2, respectively, and all other entries of A are set to zero. Examining both VAR models allows us to see how well our methods detect Granger causality at longer time lags, even though no time lag is explicitly specified in our models. Our results are the average over five random initializations for a single dependency graph.

The AUROC results are displayed in Table 2 for the cLSTM, cMLP, IMV-LSTM, and LOO-LSTM approaches for three dataset lengths, T ∈ {250, 500, 1000}. The performance of the cLSTM and cMLP improves at larger T, and, as in the Lorenz-96 case, both models outperform the IMV-LSTM and LOO-LSTM by a wide margin. The cMLP remains more robust than the cLSTM with smaller amounts of data, but the cLSTM outperforms the cMLP on several occasions with T = 500 or T = 1000.

The IMV-LSTM consistently underperforms our methods with these datasets, likely because it is not explicitly designed for Granger causality discovery. Our finding that the IMV-LSTM performs poorly at this task is consistent with recent work suggesting that attention mechanisms are not indicative of feature importance [59]. The LOO-LSTM approach consistently achieves poor performance, likely due to two factors: (i) unregularized LSTMs are prone to overfitting in the low-data regime, even when Granger causal time series are held out, and (ii) withholding a single time series will not impact the loss if the remaining time series have dependencies that retain its signal.

6.2 Quantitative Analysis of the Hierarchical Penalty

We next quantitatively compare the three possible structured penalties for Granger causality selection in the cMLP model. In Section 3.2 we introduced the full group lasso (GROUP) penalty over all lags (Equation (8)), the group sparse group lasso (MIXED) penalty (Equation (10)), and the hierarchical (HIER) lag selection penalty (Equation (11)). We compare these approaches across various choices of the cMLP model's maximum lag, K ∈ {5, 10, 20}. We use H = 10 hidden units for data simulated from the nonlinear Lorenz model with F = 20, p = 20, and T = 750. As in Section 6.1, we compute the mean AUROC over five random initializations and display the results in Table 3. Importantly, the hierarchical penalty outperforms both the group and mixed penalties across all model input lags K. Furthermore, performance significantly declines as K increases for both the group and mixed penalties.
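For reference, the AUROC numbers reported throughout this section can be computed by scoring every ordered pair (i, j) against the ground-truth graph. The following is a minimal sketch, assuming the fitted input-weight column norms are used as the ranking statistic; the names are hypothetical and whether self-edges are scored depends on the experimental protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def granger_auroc(weight_norms, true_graph):
    """AUROC for Granger causality selection.

    weight_norms : (p, p) array of importance scores, e.g., the norm of the fitted
                   first-layer columns, where entry (i, j) scores "j causes i".
    true_graph   : (p, p) binary array of ground-truth Granger causal links.
    Self-edges are excluded here for illustration.
    """
    p = true_graph.shape[0]
    mask = ~np.eye(p, dtype=bool)
    return roc_auc_score(true_graph[mask], weight_norms[mask])
```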
TABLE 2
Comparison of AUROC for Granger Causality Selection Among Different Approaches,
as a Function of the VAR Lag Order and the Length of the Time Series T
Results are the mean across five initializations, with 95 percent confidence intervals.
TABLE 3
AUROC Comparisons Between Different cMLP Granger Causality Selection Penalties on Simulated Lorenz-96 Data as a Function of the Input Model Lag, K

Lag K      5            10           20
GROUP      88.1 ± 0.8   82.5 ± 0.3   80.5 ± 0.5
MIXED      90.1 ± 0.5   85.4 ± 0.3   83.3 ± 1.1
HIER       95.5 ± 0.2   95.4 ± 0.5   95.2 ± 0.3

Results are the mean across five initializations, with 95 percent confidence intervals.
Fig. 6. (Top) AUROC and (bottom) AUPR (given in %) results for our proposed regularized cMLP and cLSTM models and the set of methods presented in [33]: OKVAR, LASSO, and G1DBN. These results are for the DREAM3 size-100 networks using the original DREAM3 data sets.
The cLSTM and cMLP methods do much better than all previous approaches, with the cLSTM outperforming the cMLP in three datasets. The raw ROC curves for the cMLP and cLSTM are displayed in Fig. 7.

Fig. 7. ROC curves for the cMLP and cLSTM models on the five DREAM datasets.

These results clearly demonstrate the importance of taking a nonlinear approach to Granger causality detection in a (simulated) real-world scenario. Among the nonlinear approaches, the neural network methods are extremely powerful. Furthermore, the cLSTM's ability to efficiently capture long memory (without relying on long-lag specifications) appears to be particularly useful. This result validates many findings in the literature where LSTMs outperform autoregressive MLPs. An interesting facet of these results, however, is that the impressive performance gains are achieved in a limited data scenario and on a task where the goal is recovery of interpretable structure. This is in contrast to the standard story of prediction on large datasets. To achieve these results, the regularization and induced sparsity of our penalties are critical.

8 DEPENDENCIES IN HUMAN MOTION CAPTURE DATA

We next apply our methodology to detect complex, nonlinear dependencies in human motion capture (MoCap) recordings. In contrast to the DREAM3 challenge results, this analysis allows us to more easily visualize and interpret the learned network. Human motion has been previously modeled using linear dynamical systems [62], switching linear dynamical systems [37], [63], and nonlinear dynamical models based on Gaussian processes [64]. While the focus of previous work has been on motion classification [62] and segmentation [37], our analysis delves into the potentially long-range, nonlinear dependencies between different regions of the body during natural motion behavior. We consider a data set from the CMU MoCap database [36] previously studied in [37]. The data set consists of p = 54 joint angle and body position recordings across two different subjects for a total of T = 2024 time points. In total, there are recordings from 24 unique regions because some regions, like the thorax, contain multiple angles of motion corresponding to the degrees of freedom of that part of the body.

We apply the cLSTM model with H = 8 hidden units to this data set. For computational speed-ups, we break the original series into length-20 segments and fit the penalized cLSTM model from Equation (15) over a range of λ values. To develop a weighted graph for visualization, we let the edge weight w_{ij} between components be the norm of the outgoing cLSTM weights from input series j to output component series i, standardized by the maximum such edge weight associated with the cLSTM for series i. Edges associated with more than one degree of freedom (angle directions for the same body part) are averaged together. Finally, to aid visualization, we further threshold edge weights of magnitude 0.01 and below.

The resulting estimated graphs are displayed in Fig. 9 for multiple values of the regularization parameter, λ. While we present the results for multiple λ, one may use cross validation to select λ if one graph is required. To interpret the presented skeleton plots, it is useful to understand the full set of motion behaviors exhibited in this data set. These behaviors are depicted in Fig. 8, and include instances of jumping jacks, side twists, arm circles, knee raises, squats, punching, various forms of toe touches, and running in place. Due to the extremely limited data for any individual behavior, we chose to learn interactions from data aggregated over the entire collection of behaviors. In Fig. 9, we see many intuitive learned interactions. For example, even in the sparsest graph (largest λ) we learn a directed edge from right knee to left knee and a separate edge from left knee to right.
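The weighted-graph construction described above can be sketched as follows. This is an illustrative reimplementation under assumed tensor layouts (not the released code), where `input_weights[i]` is taken to hold the fitted cLSTM input matrix for output series i with one column per input series.

```python
import numpy as np

def granger_graph(input_weights, groups, thresh=0.01):
    """Build the normalized, thresholded Granger-causality graph of Section 8.

    input_weights : list of p arrays; input_weights[i] has shape (H_in, p) and holds
                    the fitted cLSTM input weights for output series i (assumed layout).
    groups        : list of lists mapping each body region to its column indices
                    (degrees of freedom for the same region are averaged together).
    Returns an (R, R) array of edge weights w[i, j] for R regions, with j -> i edges.
    """
    p = len(input_weights)
    raw = np.zeros((p, p))
    for i, W in enumerate(input_weights):
        col_norms = np.linalg.norm(W, axis=0)             # outgoing weight norm per input j
        raw[i] = col_norms / max(col_norms.max(), 1e-12)  # standardize by series i's max
    # Average edges over the degrees of freedom belonging to the same body region.
    R = len(groups)
    w = np.zeros((R, R))
    for a, rows in enumerate(groups):
        for b, cols in enumerate(groups):
            w[a, b] = raw[np.ix_(rows, cols)].mean()
    w[w <= thresh] = 0.0                                  # drop edges of magnitude <= 0.01
    return w
```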
[52] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM J. Optim., vol. 24, no. 4, pp. 2057–2075, 2014.
[53] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Backpropagation: Theory, Architectures, and Applications, vol. 433. London, U.K.: Psychology Press, 1995.
[54] I. Sutskever, Training Recurrent Neural Networks. Toronto, Canada: Univ. Toronto, 2013.
[55] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.
[56] T. Guo, T. Lin, and Y. Lu, "An interpretable LSTM neural network for autoregressive exogenous model," 2018, arXiv:1804.05251.
[57] T. Guo, T. Lin, and N. Antulov-Fantulin, "Exploring interpretable LSTM neural networks over multi-variable data," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 2494–2504.
[58] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[59] S. Wiegreffe and Y. Pinter, "Attention is not not explanation," 2019, arXiv:1908.04626.
[60] V. Sindhwani, M. H. Quang, and A. C. Lozano, "Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality," 2012, arXiv:1210.4792.
[61] D. Marinazzo, M. Pellicoro, and S. Stramaglia, "Kernel-Granger causality and the analysis of dynamical networks," Physical Rev. E, vol. 77, no. 5, 2008, Art. no. 056215.
[62] E. Hsu, K. Pulli, and J. Popovic, "Style translation for human motion," ACM Trans. Graph., vol. 24, no. 3, pp. 1082–1089, Jul. 2005. [Online]. Available: http://doi.acm.org/10.1145/1073204.1073315
[63] V. Pavlovic, J. M. Rehg, and J. MacCormick, "Learning switching linear models of human motion," in Proc. 13th Int. Conf. Neural Inf. Process. Syst., 2001, pp. 981–987.
[64] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 283–298, Feb. 2008.
[65] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, "A clockwork RNN," 2014, arXiv:1402.3511.
[66] A. V. D. Oord et al., "WaveNet: A generative model for raw audio," 2016, arXiv:1609.03499.

Alex Tank (Member, IEEE) received the PhD degree in statistics from the University of Washington, Seattle, Washington. His research interests include integrating modern machine learning and optimization tools with classical time series analysis.

Nicholas Foti (Member, IEEE) received the bachelor's degree in computer science and mathematics from Tufts University, Medford, Massachusetts, in 2007, and the PhD degree in computer science from Dartmouth College, Hanover, New Hampshire, in 2013. He is currently a research scientist at the Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington. He received a notable paper award at AISTATS in 2013.

Ali Shojaie (Member, IEEE) is currently an associate professor of biostatistics and adjunct associate professor of statistics at the University of Washington, Seattle, Washington. His research interests include the intersection of statistical machine learning for high-dimensional data, statistical network analysis, and applications in biology and social sciences. He and his team have developed multiple award-winning methods and publicly available software tools for network-based analysis of diverse biological and social data, including novel methods for discovering network Granger causality relations, as well as methods for learning and analyzing high-dimensional networks.

Emily B. Fox (Member, IEEE) received the SB and PhD degrees from the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, in 2004 and 2009, respectively. She is an associate professor with the Paul G. Allen School of Computer Science and Engineering and Department of Statistics, University of Washington, Seattle, Washington, and is the Amazon professor of machine learning. She has been awarded a Presidential Early Career Award for Scientists and Engineers (PECASE, 2017), Sloan Research Fellowship (2015), ONR Young Investigator award (2015), NSF CAREER award (2014), National Defense Science and Engineering Graduate (NDSEG) Fellowship, NSF Graduate Research Fellowship, NSF Mathematical Sciences Postdoctoral Research Fellowship, Leonard J. Savage Thesis Award in Applied Methodology (2009), and MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize (2009). Her research interests include large-scale Bayesian dynamic modeling and computations.