Statistical Tests For Detecting Granger Causality

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO.

22, NOVEMBER 15, 2018 5803

Statistical Tests for Detecting Granger Causality


Ribhu Chopra , Member, IEEE, Chandra Ramabhadra Murthy , Senior Member, IEEE, and Govindan Rangarajan

Abstract—Detection of a causal relationship between two or Gaussian distributed time series, the directed mutual informa-
more sets of data is an important problem across various scien- tion reduces to the logarithm of ratio of the prediction error
tific disciplines. The Granger causality index and its derivatives variances of the time series of interest with and without in-
are important metrics developed and used for this purpose. How-
ever, the test statistics based on these metrics ignore the effect of cluding the purported causing time series into the prediction
practical measurement impairments such as subsampling, addi- model [2]. This quantity, known as the Granger causality in-
tive noise, and finite sample effects. In this paper, we model the dex (GCI) [2], [3], has been applied extensively for the detection
problem of detecting a causal relationship between two time series of causal relationships across numerous research areas such as
as a binary hypothesis test with the null and alternate hypotheses neuroscience [4]–[11], physics [12], [13], climate science [14],
corresponding to the absence and presence of a causal relationship,
respectively. We derive the distribution of the test statistic under the econometrics [15], etc. Most of the present studies on Granger
two hypotheses and show that measurement impairments can lead causality assume that the sampling process is noiseless and the
to suppression of a causal relationship between the signals, as well sampling frequency is above the Nyquist rate. As a consequence,
as false detection of a causal relationship, where there is none. We the tests are not designed to achieve a given false alarm/detection
also use the derived results to propose two alternative test statistics rate. It is also typically assumed that the second-order statis-
for causality detection. These detectors are analytically tractable,
which allows us to design the detection threshold and determine tics of the signals of interest are perfectly known. However,
the number of samples required to achieve a given missed detection in practice, the causality detector has to estimate the second-
and false alarm rate. Finally, we validate the derived results using order statistics from a finite number of possibly undersampled
extensive Monte Carlo simulations as well as experiments based and noisy measurements of the signals of interest. In this pa-
on real-world data, and illustrate the dependence of detection per- per, our main focus is on the quantification of the effects of
formance of the conventional and proposed causality detectors on
parameters such as the additive noise variance and the strength of these measurement inaccuracies on the performance of Granger
the causal relationship. causality detectors. This, in turn, allows us to design the detector
to achieve a desired false alarm/detection probability.
Index Terms—Causality detection, granger causality, perfor-
mance analysis, hypothesis testing.
The phenomenon of undersampling arises mainly due to the
inability of the signal acquisition unit to sample the signal of
interest above the Nyquist rate. As shown later in the paper,
I. INTRODUCTION under-sampling can lead to a weakening or even a complete
A. Motivation suppression of the causal relationship between two signals. In
addition to this, additive noise can lead to suppression of an
ONFIRMING the presence or absence of a causative re-
C lationship between different observed phenomena is an
important problem across various disciplines of physical, bio-
existing causal relationship, as well as may cause a false de-
tection [16], [17]. The effect of these inaccuracies is further
compounded by the fact that the second-order statistics of the
logical, and social sciences. It has been argued that if the phe- signal of interest, required to compute the GCI, are not known,
nomena of interest can be represented as a time series, then and have to be estimated using the acquired samples. Therefore,
the corresponding relationship between them can be quantified finite sample effects, in addition to undersampling and addi-
by directed mutual information [1]. It is also known that for tive noise, also contribute to errors in the detection of a causal
relationship between the signals of interest. In this paper, we
Manuscript received December 7, 2017; revised May 21, 2018, July 4, 2018, model the problem of detection of Granger causality as a binary
and August 12, 2018; accepted September 14, 2018. Date of publication Septem-
ber 26, 2018; date of current version October 8, 2018. The associate editor
hypothesis test, and evaluate the effects of the measurement im-
coordinating the review of this manuscript and approving it for publication was pairments listed above on the probabilities of detection and false
Dr. Paolo Braca. The work of C. R. Murthy was supported in part by a re- alarm.
search grant from the Aerospace Network Research Consortium and in part
by the Young Faculty Research Fellowship from the Ministry of Electronics
and Information Technology, Government of India. The work of G. Rangarajan B. Related Work
was supported in part by research grants from J. C. Bose National Fellowship,
DST Centre for Mathematical Biology Phase II, and in part by UGC Centre for In the original formulation of Granger causality in [2], a
Advanced Study. (Corresponding author: Chandra Ramabhadra Murthy). time series y[n] is said to Granger cause another time series
R. Chopra is with the Indian Institute of Technology Guwahati, Assam
781039, India (e-mail:,ribhu@outlook.com). x[n], when the past samples of the y[n] can be used to improve
C. R. Murthy and G. Rangarajan are with the Indian Institute of Science, the estimate of the x[n] over ‘what can be predicted using all
Bangalore 560012, India (e-mail:,cmurthy@iisc.ac.in; rangaraj@iisc.ac.in). other information in the universe’, conventionally considered
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. to be the past samples of x[n]. Therefore, a causal relation-
Digital Object Identifier 10.1109/TSP.2018.2872004 ship between x[n] and y[n] can be inferred by comparing the
1053-587X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
5804 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

mean squared prediction error for x[n] with and without in- binary hypothesis testing framework for causality detection to
cluding y[n] in the prediction model [18]. The prediction model GCI.
considering only the past samples of x[n] is known as the re-
strictive model (R-model), whereas the one including the ef-
C. Contributions
fects of y[n] is known as the unrestricted model (U-model). In
addition to the GCI, several metrics for quantifying Granger In the present work, we consider the problem of detecting
causality have been studied [19], [20]. The system model with a causal relationship between two possibly sub-sampled time
two time series [2] has been also generalized to study causal series, in the presence of additive white Gaussian noise. We
relationships between more than two time series [21]. However, model the time series as two independent pth order autoregres-
all these studies assume the samples are noiseless, and ignore sive (AR(p)) processes in the absence of a causal relation, and as
the finite sample effects on the second-order statistics being a bivariate vector autoregressive (VAR(p)) process in the pres-
estimated. ence of a causal relationship. We model the test for causality
The effect of noise on Granger causality was first discussed as a binary hypothesis testing problem with the null hypothe-
in [22], and was mathematically analyzed in [16] for a bivariate sis corresponding to the absence of a causal relationship, and
VAR-1 process. These results were extended to a generalized the alternate hypothesis to its presence. The system model is
VAR(p) (pth order autoregressive process) in [17], where a described in detail in Section II. Following this, the main ob-
Kalman filtering based technique to alleviate the detrimental jective of this paper is to evaluate the detection performance of
effect of noise was also developed. An alternative test statistic GCI for finite samples of the two time series of interest. Our
for causality detection, known as the phase slope index, was contributions in this direction are as follows:
introduced in [23] and was shown to be more resilient to noise, 1) We derive the effects of down-sampling on the causal
as compared to the GCI. Another metric for causality, known as relationship between two signals. (See Section III.)
the time reversed Granger causality, has been discussed in [18], 2) We derive the GCI for the down-sampled time series in
[24], [25]. The authors in [26] used simulation techniques to the presence of additive noise assuming knowledge of
compare the robustness of GCI, time reversed Granger causality, the second-order statistics of the signals of interest. (See
and phase slope index to correlated noise, and found that time Section IV.)
reversed Granger causality is the most noise resilient. 3) We derive the statistics of the GCI under the two hypothe-
The problem of sub-sampling in Granger causality detection ses for finite number of samples, and use these to derive
has been studied in the literature [27]–[34]. In [27], [28], the au- the probabilities of detection and false alarm. Our analysis
thors study the effects of the choice of the sampling frequency accounts for the estimation of the second-order statistics
on the detectability of a causal relationship between neurolog- of the two signals using a finite number of samples when
ical signals. In [28], these derived results are used to identify the generative model is unknown. (See Section V.)
favorable sampling rates for detecting Granger causality. In [29], 4) Based on the statistics of GCI, we propose two alter-
two methods, based on the expectation maximization (EM) al- native test statistics to detect Granger causality and de-
gorithm and the variational inference framework, are proposed rive the corresponding probabilities of detection and false
to detect causal relationships from low resolution data, and their alarm. (See Section VI.)
applicability to simulated and practical data is tested. The prob- 5) Using detailed simulations and numerical experiments
lem of causal inference from an under-sampled time series by based on real-world data, we validate the derived the-
using a general purpose Boolean constraint solver has been dis- ory and compare the performance of the two proposed
cussed in [32]. It is shown that the method considered in [32] test statistics against the GCI. (See Section VII.)
is independent of the modeling parameters, and can be conve- The results derived in this paper can be used to characterize
niently scaled to any number of variables. In [33], the authors the behavior of GCI and related detectors under different phys-
propose three sampling rate agnostic algorithms for the recov- ical conditions. In addition to this, the two alternative causality
ery of a causal relationship between observed time series. The detectors discussed in this work are easy-to-use replacements
authors in [34] incorporate the effect of both down-sampling for the GCI. We next describe the system model considered in
and additive noise in their data model. However, none of these this work.
studies consider the effect of finite samples on the detection Notation: Boldface lowercase and uppercase letters repre-
performance, or model the problem of detection of Granger sent vectors and matrices, respectively. The kth column of A
causality as a binary hypothesis testing problem. They also do is denoted by ak . (.)H represents the Hermitian operation on a
not analyze the effects of these physical impairments on the vector or a matrix, and (.)† represents the Moore-Penrose pseu-
probabilities of detection and false alarm. doinverse of a matrix. IK , 0K and OK represent the identity
The authors in [1] consider the related problem of estimation matrix, the all-zero vector and the all-zero matrix of dimension
of directed mutual information between two time series, using a K, respectively. The 2 norm of a vector and the Frobenius
finite number of samples. Here, the test for a causal relationship norm of a matrix are denoted by  · 2 and  · F , respectively.
between two time series is modeled as a binary hypothesis test, CN (μ, σ 2 ) represents a circularly symmetric complex Gaussian
and it is shown that the probabilities of missed detection and random variable with mean μ and variance σ 2 . E[·] and var(·)
false alarm asymptotically converge to zero as the number of represent the mean and variance of a random variable. In gen-
samples is increased. In this paper, we propose to extend the eral, x̂ and x̃ denote the MMSE estimate and the corresponding
CHOPRA et al.: STATISTICAL TESTS FOR DETECTING GRANGER CAUSALITY 5805

TABLE I Defining cq [n]  [c[n], c[n −1], . . . , c[n −q +1]]T , dq [n] 


A LIST OF NOTATION FOLLOWED IN THE PAPER
[d[n], d[n −1], . . . , d[n −q +1]]T , ai  [ai,1 , ai,2 , . . . , ai,q ]T ,
i ∈ {1, 2, 3, 4}, we can write (1) as,
   H H    
c[n] a1 a2 cq [n − 1] ηc [n]
= + , (2)
d[n] 0H H
q a4 dq [n − 1] ηd [n]

with 0q being the q dimensional all zero vector. Defining


rcc [τ ]  E[c[n]c∗ [n − τ ]], rcd [τ ]  E[c[n]d∗ [n − τ ]], and sim-
ilarly rdc [τ ] and rdd [τ ], we can write


q 
q
rcc [τ ] = a∗1,k rcc [τ − k] + a∗2,k rdc [τ − k] + ση2c δ[τ ].
k =1 k =1
(3)
Letting rcd,q [τ ] = [rcd [τ ], . . . , rcd [τ − q + 1]]T , and simi-
larly rcc,q [τ ], etc., the above becomes
2
rcc [τ ] = aH
1 rcc,q [τ − 1] + a2 rdc,q [τ − 1] + ση c δ[τ ].
H
(4)

We can similarly write

rcd [τ ] = aH
1 rcd,q [τ − 1] + a2 rdd,q [τ − 1],
H
(5)

and
2
rdd [τ ] = aH
4 rdd,q [τ − 1] + ση d δ[τ ]. (6)

The above equations, along with the information that rcc [0] =
σc2 , and rdd [0] = σd2 can be used to compute rcc [τ ], rcd [τ ], and
rdd [τ ] for different values of τ .
Defining
 
Rcc,q [τ ]  E cq [n]cH
q [n − τ ] , (7)
estimation error of a random variable x. Table I introduces the
 
different symbols used in this paper. Rcd,q [τ ]  E cq [n]dH
q [n − τ ] , (8)

II. SYSTEM MODEL and similarly defining Rdc,p [τ ] and Rdd,q [τ ], it can be shown
using the Weiner-Hopf equations [35] that
We consider two complex valued discrete time zero mean
Gaussian random processes c[n] and d[n] having variances σc2

−1

and σd2 , respectively, and expressed in the form of a bivariate a1 Rcc,q [0] Rcd,q [0] rcc,q [1]
= . (9)
VAR(q) model as a2 Rdc,q [0] Rdd,q [0] rcd,q [1]

q 
q
We can therefore calculate the regression coefficients from the
c[n] = a∗1,k c[n − k] + a∗2,k d[n − k] + ηc [n],
correlation coefficients when the latter are not known. The inter-
k =1 k =1
ested reader is referred to [35, Chapter 2] for a detailed derivation

q 
q
of the Wiener-Hopf equations. We can then use these to compute
d[n] = a∗3,k c[n − k] + a∗4,k d[n − k] + ηd [n], (1)
the estimation error variance as
k =1 k =1

H
−1

with ai,k , i ∈ {1, 2, 3, 4} being the regression coefficients, and r cc,q [1] R cc,q [0] Rcd,q [0] rcc,q [1]
ηc [n] and ηd [n] the innovation components in c[n] and d[n], ση2c = σc2 − .
rcd,q [1] Rdc,q [0] Rdd,q [0] rcd,q [1]
respectively. We assume that c[n] and d[n] are sampled at the (10)
Nyquist rate, and the temporally white innovation process vector Also, it can be observed from (2) that the causal dependence of
η[n] = [ηc [n] , ηd [n]]T is distributed as c[n] on d[n] is determined by the coefficient vector a2 , that is,
  2  there exists a causal relationship only if a2 = 0.
ση c 0
η[n] ∼ CN 0, . Now, in the absence of a causal relationship, c[n] can alter-
0 ση2d
natively be expressed using a univariate AR(q) model,
For simplicity, we assume that only a unidirectional coupling
from d[n] to c[n] can exist, i.e., a3,k = 0 , for 1 ≤ k ≤ q. c[n] = gH cq [n − 1] + ζ[n], (11)
5806 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

with ζ[n] being the zero mean temporally white complex Gaus- Let ru v [τ ] = E[u[n]v ∗ [n − τ ]], ru v ,p [τ ] = E[up [n]v ∗ [n −
sian innovation component having a variance σζ2 . Consequently, τ ]], and Ru v ,p [τ ] = E[up [n]vpH [n − τ ]], with the absence
of the index [τ ] indicating τ = 0. We can write, ru u [τ ] =
rcc [τ ] = gH rcc,q [τ − 1] + σζ2 δ[τ ], (12) E[c[nM ]c∗ [nM − τ M ]] = rcc [τ M ]. Similarly, ru u ,p [τ ] =
g = R−1 [rcc [τ M ], rcc [(τ + 1)M ], . . . , rcc [(τ + p)M ]]T , and
cc,q [0]rcc,q [1], (13)
2
and ru u [τ ] = bH
1 ru u ,p [τ − 1] + b2 rv u ,p [τ − 1] + ση c δ[τ ], (18)
H

σζ2 = σc2 − rH −1 ru v [τ ] = bH
1 ru v ,p [τ − 1] + b2 rv v ,p [τ − 1],
H
cc,q [1]Rcc,q [0]rcc,q [1]. (14) (19)
2
It is to be noted that in the presence of a causal relationship rv v [τ ] = bH
4 rv v ,p [τ − 1] + ση d δ[τ ], (20)
between c[n] and d[n] the mean squared prediction error for c[n]
using the bivariate model, ση2c , will be smaller than the mean and
squared prediction error using the univariate signal model, σζ2 , Ru u ,p
whereas, in the absence of such a causal relationship between ⎡ ⎤
them, σζ2 = ση2c . rcc [0] rcc [M ]
. . . rcc [(p − 1)M ]
⎢ rcc [M ] . . . rcc [(p − 2)M ] ⎥
rcc [0]
The GCI uses this property to quantify the causal relationship, ⎢ ⎥
=⎢ .. .. .. .. ⎥.
and is defined as [2] ⎣ . . . . ⎦

σζ2 rcc [(p − 1)M ] rcc [(p − 2)M ] . . . rcc [0]
TG  log . (15) (21)
ση2c
The weight vectors b1 , b2 , and b4 can then be determined via
In Granger causality, we say that a causal relationship exists
the Wiener-Hopf equations [35] as
between c[n] and d[n] if TG > 0, and no causal relationship


−1

exists if TG = 0. b1 Ru u ,p Ru v ,p ru u ,p [1]
However, both c[n] and d[n] are assumed to be noiseless and = , (22)
b2 Rv u ,p Rv v ,p ru v ,p [1]
sampled at the Nyquist rate. Moreover, it also assumes that the
exact second-order statistics for both c[n] and d[n] are known. and b4 = R−1
v v ,p rv v [1]. (23)
However, these assumptions do not hold in a practical data ac-
quisition system. Therefore, in the subsequent sections, we in- We can then write,
troduce these imperfections, viz. down-sampling, measurement u [n] = u[n] − bH
1 up [n − 1] − b2 vp [n − 1],
H
(24)
noise, and finite sample effects, into a GCI based detector, and
analyze their impact on the detection performance. In the next v [n] = y[n] − bH
4 vp [n − 1]. (25)
section, we discuss the effect of down-sampling on the GCI.
Consequently,

H
−1

III. THE EFFECT OF DOWN-SAMPLING ON GCI ru u ,p [1] Ru u ,p Ru v ,p ru u ,p [1]


r u  u [0] = σc2 − ,
Let u[n] and v[n] respectively be versions of c[n] and d[n] ru v ,p [1] Rv u ,p Rv v ,p ru v ,p [1]
down-sampled by a factor M , such that, (26)
u[n] = c[nM − l], v[n] = d[nM − l]. (16)
r v  v [0] = σc2 − rH −1
v v ,p [1]Rv v ,p rv v ,p [1]. (27)
Since we use the second-order statistics of the observed wide
sense stationary random processes for detection, we can set the r u  v [0] = rcd [0]
offset parameter l to zero without loss of generality. We can now
−1

 H  Ru u ,p Ru v ,p ru v ,p [1]
express u[n] and v[n] in the form of a bivariate linear regression − ru u ,p [1] ru v ,p [1]
H

process of order p as Rv u ,p Rv v ,p rv v ,p [1]







H
u[n] bH1 b2
H
up [n − 1] u [n] ru u ,p [1]
= + , (17) −1
− ru v ,p [1]Rv v ,p rv v ,p [1] +
H
v[n] 0H
p b4
H
vp [n − 1] v [n] rv v ,p [1]
with bi , i ∈ {1, 2, 4} representing new regression coef-
−1

Ru u ,p Ru v ,p Ru v ,p
ficients, and E [u[n − k]∗u [n]] = 0, E [v[n − k]∗v [n]] = 0, × R−1 v v ,p rv v ,p [1].
E [u[n − k]∗v [n]] = 0, and E [v[n − k]∗u [n]] = 0 for k > 0. It Rv u ,p Rv v ,p Rv v ,p
is important to note that the model order p in this case may (28)
be different from the model order q considered in the previous
Hence, the covariance matrix of the innovation process need not
section due to down-sampling.
be diagonal after down-sampling.
Since the GCI is a function of the second-order statistics of
Similarly, the single variable linear regression model for the
c[n] and d[n], we need to derive the second-order statistics of
down-sampled signal u[n] can be written as
u[n] and v[n] in terms of c[n] and d[n] in order to quantify the
effect of down-sampling on the GCI. u[n] = f H up [n − 1] + ξ[n], (29)
CHOPRA et al.: STATISTICAL TESTS FOR DETECTING GRANGER CAUSALITY 5807

with ξ[n] being the innovation component. Therefore, we get and


 
f = R−1
u u ,p ru u ,p [1], (30) Rz z ,p = E zp [n]zH
p [n] . (40)

and rξ ξ [0] = σu2 − rH −1


u u ,p [1]Ru u ,p ru u ,p [1]. (31) Since the additive noise is independent of both u[n] and v[n],
it can be shown that
A causal relationship between u[n] and v[n] can now be deter-
r [0]
mined by considering the ratio TG = log2 ( r  ξ ξ [0] ). rxx [τ ] = ru u [τ ] + σν2x δ[τ ], (41)
u u
To illustrate the effect of down-sampling on the GCI, consider rxy [τ ] = ru v [τ ], (42)
the simple special case: c[n] = ad[n − 1] + ηc [n], and d[n] =
ηd [n]. Letting M = 2, we can write ry y [τ ] = rv v [τ ] + σν2y δ[τ ]. (43)

u[n] = c[2n] = aηd [2n − 1] + ηc [2n], (32) Substituting equations (18)–(20) into (41)–(43) we obtain
2 2
v[n] = d[2n] = ηd [2n]. (33) rxx [τ ] = bH
1 ru u ,p [τ −1] + b2 rv u ,p [τ −1] + (σ u +σν x )δ[τ ],
H

(44)
Consequently, ru u [1] = rcc [2] = 0, ru v [1] = rcd [2] = 0, ru v [0]
= rcd [0] = 0, and rξ ξ [0] = r x  x [0] = σc2 . We see that there ex- rxy [τ ] = bH
1 ru v ,p [τ − 1] + b2 rv v ,p [τ − 1],
H
(45)
ists a causal relationship between c[n] and d[n] for any a = 0, 2 2
ry y [τ ] = bH
4 rv v ,p [τ − 1] + (σ v + σν y )δ[τ ], (46)
but no causal relationship exists between u[n] and v[n]. Thus,
down-sampling can suppress an existing causal relationship be- and similarly,
tween two signals. However, the degree of suppression of the  
Rxx,p Rxy ,p
causal relationship depends on the structure of the VAR model Rz z ,p =
followed by the signals. We next analyze the effect of additive Ry x,p Ry y ,p
   
noise on the performance of the GCI. Ru u ,p Ru v ,p σν2x Ip 0p
= + . (47)
Rv u ,p Rv v ,p 0p σν2y Ip
IV. GCI WITH DOWNSAMPLING AND ADDITIVE NOISE
The optimal weight vector minimizing σϕ2 2 can be obtained
Let us define x[n] and y[n] as the down-sampled signals via the Wiener-Hopf equations as
u[n] and v[n] corrupted by AWGN, such that,
w = R−1
z z ,p rz x,p [1]. (48)
x[n] = u[n] + νx [n],
The minimized value of σϕ2 2 is
y[n] = v[n] + νy [n], (34)
σϕ2 2 = σx2 − rH −1
z x,p [1]Rz z ,p rz x,p [1]. (49)
where νx [n] ∼ CN (0, σν2x ) and νy [n] ∼ CN (0, σν2y ).
Using (17), we can write, However, in case no causal relationship exists between c[n]




and d[n], and consequently between x[n] and y[n], x[n] can be
x[n] bH H
1 b2 up [n − 1] u [n] + νx [n] expressed using the single variable AR model as
= H H + . (35)
y[n] 0p b4 vp [n − 1] v [n] + νy [n]
x[n] = hH xp [n − 1] + ϕ1 [n], (50)
If a causal relationship exists between c[n] and d[n], and is pre-
with ϕ1 [n] being the single variable prediction error, and h being
served in u[n] and v[n] after down-sampling, it should also exist
the optimal weight vector, computed as
between x[n] and y[n], and therefore, the past samples of x[n]
and y[n] can be used to predict x[n] better than its prediction by h = R−1
xx,p rxx,p [1]. (51)
using the past samples of x[n] alone. Letting z[n] = [x[n] y[n]]T
Similar to the two variable case, the minimized value of the
and zp [n] = [xTp [n] ypT [n]]T , we can express x[n] as
prediction error takes the form
x[n] = wH zp [n − 1] + ϕ2 [n], (36) σϕ2 1 = σx2 − rH −1
xx,p [1]Rxx,p rxx,p [1]. (52)
with ϕ2 [n] being the prediction error for the bivariate VAR The GCI therefore becomes
model, and w being the weight vector minimizing the mean
squared value of ϕ2 [n]. σϕ2 1 σx2 − rH −1
xx,p [1]Rxx,p rxx,p [1]
TG = log2 = log2 .
Defining σϕ2 2  E[|ϕ2 [n]|2 ] = E[|x[n] − wH zp [n − 1]|2 ], it σϕ2 2 σx2 − rH −1
z x,p [1]Rz z ,p rz x,p [1]
can be shown that [35], (53)
To demonstrate the effects of noise on the GCI, we again
σϕ2 2 = σx2 − wH rz x,p [1] − rH
z x,p [1]w + w Rz z ,p w,
H
(37)
consider our running example, d[n] = v [n], c[n] = u[n] =
where ad[n − 1] + bd[n − 2] + c [n]. Downsampling these by a fac-
tor M = 2, we obtain, v[n] = v [n], u[n] = bv[n − 1] + u [n],
σx2 = E[x[n]x∗ [n]] = σc2 + σν2x , (38) with, σv2 = σ2v , and σu2 = |b|2 σ2v + σ2u , and consequently,
rz x,p [τ ] = E[zp [n]x∗ [n − τ ]] = [rH
xx [τ ] ry x [τ ]] ,
H H
(39) σx2 = |b|2 σ2v + σ2u + σν2x , σy2 = σ2v + σν2y , (54)
5808 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

rxy [0] = 0, rxy [1] = bσ2v . (55) given as [35]

Therefore, ŵ = Z†p [N ]xN−p [N ]. (61)


 
|b|2 σ2v + σ2u + σν2x 0 Here, ŵ is a finite sample estimate of the true weight vector
Rz z ,1 = , (56) w with Zp [N ] = [zp [N − 1], zp [N − 2], . . . , zp [p]]H , and ()†
0 σ2v + σν2y
represents the Moore-Penrose inverse of a matrix. Since the in-
|b|2 σ2v σν2y novation process and the measurement noise are white and zero
σϕ2 2 = σ2u + σν2x + , (57)
σ2v + σν2y mean, it can be shown that ŵ has the following properties [35]:
1) ŵ is an unbiased estimator of w.
and σ ϕ2 2
2) The covariance matrix of ŵ is cov(ŵ) = N−p R−1z z ,p .
σϕ2 1 = σx2 = |b|2 σ2v + σ2u + σν2x . (58)  σ2 
Therefore, ŵ = w + w̃, with w̃ ∼ CN 0, N−p R−1 ϕ2
z z ,p .
We can simplify the GCI as Similarly, considering the single variable prediction model,
⎛ ⎞ we can write xN −p [N ] in terms of its projection, x̄N −p [N ] on the
⎜ |b| σ2v + σ2u + σν2x
2
⎟ data matrix, Xp [N ] = [xp [N − 1], xp [N − 2], . . . , xp [p]]H ,
TG = log2 ⎝  σ2  ⎠. (59)
and the error component ẋN −p [N ] = xN −p [N ] − x̄N −p [N ].
|b|2 σ2v σ 2 +σνy
2 + σ2u + σν2x
v νy
Again, x̄N −p [N ] = Xp [N ]ĥ with ĥ = X†p [N ]xN −p [N ], and
In the absence of additive noise in both x[n] and y[n], TG
|b|2 σ 2 v +σ 2 u σϕ2 1
reduces to TG = log2 ( σ2 ), as in the previous section. ĥ ∼ CN h, R−1 . (62)
u N − p xx,p
However, as either σν2x , or σν2y get large, the second term in the
expression for TG converges to zero. Thus, additive noise can We next use the above properties of the estimated regression
also lead to the suppression of a causal relationship between two weights to derive the statistics of the finite sample estimate of
signals. the mean squared error for the two prediction models.
The discussion till now assumed that the second-order statis-
tics of both the signals of interest are available at the detec- B. Mean Squared Prediction Errors
tor. However, in practice, we have to estimate the second-order
First, considering the prediction error for the two variable
statistics using a finite number of samples of x[n] and y[n].
model, we can express x̃[n] as
The effects of the use of estimated second-order statistics on
the detection of Granger causality in an under-sampled noisy x̃[n] = x[n] − x̂[n]
environment are discussed in the next section.
= x[n] − wH zK [n − 1] − w̃H zK [n − 1]. (63)
V. FINITE SAMPLE EFFECTS ON THE GCI Since the innovation process of u[n] and the additive noise are
In case only N samples of each of x[n] and y[n] are observed, white, x̃[n] is zero mean i.i.d. Gaussian with variance
and the generative model is unknown, we need to estimate the
optimal predictors for the single and two variable cases, the E[|x̃[n]|2 ] = σϕ2 2 + E[w̃H zp [n − 1]zH
p [n − 1]w̃]

corresponding prediction error variances, and the GCI or an − 2{E[w̃H zp [n − 1]x∗ [n]]}
equivalent test statistic using these samples. In this section, we
first discuss the properties of the optimal predictor weights for − 2{E[w̃H zp [n − 1]zH
p [n − 1]w]}. (64)
the one variable and two variable prediction models. Then, we
It is common in the adaptive filtering literature to assume the
use these weights to calculate the statistics of the finite sample
weight error vector to be independent of the regression vectors
estimates of the corresponding mean square prediction errors,
as well as the desired output [35]. Hence,
and finally derive expressions for the performance of GCI with
finite number of observations. E[w̃H zp [n − 1]x∗ [n]] = E[w̃H zp [n − 1]zH
p [n − 1]w] = 0,
(65)
A. Properties of Optimal Predictor Weights and
It is known that given a finite number of observations, the E[w̃H zp [n − 1]zH
p [n − 1]w̃]
least squares estimate is the best linear unbiased estimator for
x[n] [35]. Letting x[1], . . . , x[N ] be the N available samples = Tr(E[zp [n − 1]zH p [n − 1]E[w̃w̃ ])
H

of x[n], y[1], . . . , y[N ] be the N available samples of y[n], and


x̂[n] = ŵH zp [n − 1] be the least squares (LS) estimate of x[n] σϕ2 2 2p
= Tr Rz z ,p R−1 = σϕ2 2 . (66)
generated using the two variable model, we can write N − p z z ,p N −p

xN−p [N ] = x̂N−p [N ] + x̃N−p [N ], (60) Substituting these in (64), we can write


 
where x̃N−p [N ] is the error term orthogonal to the measurement 2 2 2p
E[|x̃[n]| ] = σϕ 2 1 + . (67)
space. Also, ŵ is the least squares estimate of the weight vector, N −p
CHOPRA et al.: STATISTICAL TESTS FOR DETECTING GRANGER CAUSALITY 5809

We can similarly argue, for the one variable model, that ẋ[n] Therefore, ẋ[n] = x̃[n] + x́[n] − h̃H xp [n − 1] and
is also zero mean i.i.d. Gaussian distributed, with  
    p
p E[t1 ] = σϕ2 2 + σx́2 1 +
E[|ẋ[n]|2 ] = σϕ2 1 1 + . (68) N −p
N −p  
p σϕ2 2
Thus, the estimates of the prediction error variances from the = E[t2 ] + σx́2 2 1 + − . (78)
one variable and two variable regression models of the system N −p N −p
can be expressed as Similarly,
 2
1 N
1 p
t1 = |ẋ[n]|2 , (69) E[t21 ] 2
= E [t1 ] + (σ 2 + σx́2 )2 1+ , (79)
N − p n =p+1 N − p ϕ2 N −p
and,
1 N
t2 = |x̃[n]|2 . (70) 1  4 
N − p n =p+1 var(t1 ) = σ + σx́4 + 2σϕ2 2 σx́2
N − p ϕ2
 
These can now be used to calculate the GCI as p2 2p
  × 1+ +
t1 (N − p)2 N −p
TG = log2 , (71)
t2
1
However, since TG is a logarithm of ratios of random variables, = var(t2 ) + σ 4 + 2σϕ2 2 σx́2
N − p x́
its statistics become hard to determine. Instead, we define TE 

2−T G = tt 21 as a test statistic, for simplicity of analysis. 3p2 2p
4
Since x̃[n] and ẋ[n] are the measurement errors for the same − σϕ 2 + . (80)
(N − p)2 N −p
process, t1 and t2 cannot be considered independent. However,
since both t1 and t2 are sums of a large number of i.i.d. random Consequently, it can be shown that
variables, we can use the central limit theorem to approximate
σϕ4 2
these as Gaussian r.v.s [36]. Note that this approximation is valid cov(t1 , t2 ) = . (81)
when the number of samples N is much larger than the model N −p
order p. Now The GCI can therefore be approximated as the ratio of two
  correlated Gaussian r.v.s t1 and t2 whose statistics are derived
2p
E[t2 ] = σϕ2 2 1 + , (72) above. In the next subsection, we use the above statistics to
N −p
calculate the probabilities of detection and false alarm.
and
 2
1 2p C. Performance of GCI
var(t2 ) = σ4 1+ . (73)
N − p ϕ2 N −p Now, we declare a causal relationship to exist between the
Similarly, two signals if the ratio TE = tt 21 is below a threshold λ. Since
  t1 is the sum of non-negative terms, it is almost surely positive,
p
E[t1 ] = σϕ2 1 1+ . (74) and therefore
N −p
Pr{TE < λ} = Pr{λt1 − t2 > 0}. (82)
However, in the presence of a causal relationship between
y[n] and x[n], wH zp [n − 1] will be a better estimate of x[n] Since t1 and t2 are approximated as correlated Gaussian r.v.s,
than hH xp [n − 1], therefore, wH zp [n − 1] can be written as therefore, from (78), TD  λt1 − t2 can also be approximated
wH zp [n − 1] = hH xp [n − 1] + x́[n], such that, x́[n] is orthog- as a Gaussian r.v. with mean and variance
 
onal to hH xp [n − 1]. Consequently, 2 p p
E[TD ] = (λ − 1)E[t2 ] + λσx́ 1 + − λσϕ2 2 ,
ϕ1 [n] = ϕ2 [n] + x́[n], (75) N −p N −p
(83)
with E[x́[n]ϕ∗2 [n]] = 0, and
and
σϕ2 1 = σϕ2 2 + σx́2 , (76)
var(TD ) = λ2 var(t1 ) + var(t2 ) − 2λ cov(t1 , t2 ), (84)
where σx́2 2
 var(x́ ).
Now, since x́[n] = wH zp [n] − hH xp [n], it can be shown that respectively.
E[x́[n]] = 0, and, Now, under the null hypothesis, H0 , σϕ2 1 = σϕ2 2 and σx́|H
2
0
=
0. Hence, under H0 , the above simplify to
E[|x́[n]|2 ] = wH Rz z ,p w + hH Rxx,p h − 2{wH Rz x h}
μT D ,0  E[TD |H0 ]
−1 −1
z z ,p Rz z ,p rz z ,p + rxx,p Rxx,p rxx,p
= rH H
   
p p
−1 −1 = (λ − 1)σϕ2 2 1+ − σϕ2 2 , (85)
− 2{rH
z z ,p Rz z ,p Rz x,p Rxx,p rxx,p }. (77) N −p N −p
5810 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

σT2 D ,0  var(TD |H0 ) As an example, consider N samples of signals u[n] and v[n]
 2 corrupted by AWGN and observed as x[n] and y[n], respectively,
σϕ4 2 2p such that v[n] = v [n] and
= (λ2 + 1) 1+ 
N −p N −p
u ,0 [n] H0
4   u[n] = , (92)
σϕ 2 3p2 2p σϕ4 2 av [n] + u ,1 [n] H1
− λ2 + − 2λ .
N −p (N − p)2 N −p N −p
(86) with var(v ) = σ2v , var(u ,1 ) = σ2u , and var(u ,0 ) = |a|2 σ2v +
σ2u . Then, K = 1, σϕ2 1 is given by (58), and
Similarly, under the alternate hypothesis, H1 , σϕ2 1 > σϕ2 2 , and ⎧ 2
⎨ σϕ 1 H0
therefore,
σϕ2 2 = 2 2 2 , (93)
  ⎩ σ2 + σν2 + |a|2 σ  v σ2ν y = σϕ2 − Δ H1
p σ +σ 1
μT D ,1  E[TD |H1 ] = λσx́2 1 +
u x u νy

N −p  σ ν2 y

  with Δ = |a|2 σ2v 1 − σ 2 v +σ ν2 y .
2 p p
+ (λ − 1)σϕ 2 1 + − σϕ2 2 , (87) Therefore,
N −p N −p  
 2 λ−2
σϕ4 2 2p μT D ,0 = σϕ2 1 λ−1+ , (94)
2 2
σT D ,1  var(TD |H1 ) = (λ + 1) 1+ N −1
N −p N −p
and
  2 2 2  2  2
σx́4 p σϕ σ p 1 2
+ λ2 1+ + 2λ2 2 x́ 2 1 + σT2 D ,0 = (λ2 + 1)σϕ4 1 1 +
N −p N −p N −p N −p N −1 N −1
4  
σϕ 2 3p2 2p σϕ4 2  

− λ2 + − 2λ . 3 2 σϕ2 1
N − p (N − p)2 N −p N −p −λ 2
σϕ4 1 + − 2λ .
(88) (N − 1)2 N −1 N −1
(95)
Based on these, the probability of false alarm can be calcu-
lated as Consequently, the probability of false alarm is given as (96) at
  the bottom of this page.
μT D ,0 Under the alternate hypothesis,
PFA = Pr{TD > 0|H0 } = Q . (89)
σT D ,0  
2 1
E[TD |H1 ] = (λ − 1)σϕ 1 1 +
Similarly, the probability of correct detection of a causal rela- N −1
tionship, PD , can then be calculated as  
1 1
  −Δ 1+ − (σϕ2 1 − Δ) ,
μT D ,1 N −1 N −1
PD = Pr{TD > 0|H1 } = Q . (90) (97)
σT D ,1
∗ and
Using (89), a detector providing a given false alarm rate of PFA  2
can be designed by selecting a threshold λ that satisfies 1 2 4 1
var(TD |H1 ) = (λ + 1)σϕ 1 1 +
   2 N −1 N −1
p p
σϕ2 2 (λ − 1) 1 + − λσϕ2 2  2  2
N −p N −p 2 2 2 2
+ Δ 1+ − 2σϕ 2 Δ 1 +
 2 N −1 N −1
σϕ4 2 p
−1 ∗
= Q (PF A ) (λ + 1)2
1+  

N −p N −p 3 2
+ σϕ4 1 + . (98)
 
(N − 1)2 N −1
4
2 σϕ 2 3p2 2p σϕ4 2
−λ + − 2λ . (91) The probability of detection can thus be expressed as (99) at the
N − p (N − p)2 N −p N −p
bottom of the next page.
The above reduces to a quadratic equation in λ that can be solved We observe from (99) that the probability of detection de-
to obtain the detection threshold for the detector. pends on the ratio σΔ2 , which can in turn be viewed as the
ϕ1

⎛   ⎞
!
⎜ (N − 1) λ − 1 + Nλ−2 −1 ⎟
PFA = Q ⎜
⎝"  
⎟.
⎠ (96)
  2 σ4
(λ2 + 1)σϕ4 2 1 + N 2−1 − λ2 σϕ4 2 (N −1)
3
2 +
2
N −1 − 2λ N ϕ−1
1
CHOPRA et al.: STATISTICAL TESTS FOR DETECTING GRANGER CAUSALITY 5811

relative difference between the variances of the prediction er- and


ror under the one variable and the two variable models. Conse-  2
1 2p
quently, the difference between the mean square prediction error var(TC |H0 ) = 2σϕ4 2 1 +
for the single and two variable prediction models, i.e., t1 − t2 N −p N −p
can be used as a test statistic to detect the presence of a causal  

relationship. We inspect the use of t1 − t2 as a test statistic in 3p2 2p


− σϕ4 2 + − 2σϕ4 2 .
the next section. (N − p)2 N −p
(105)
 2
VI. GRANGER CAUSALITY DETECTION BASED ON THE
1 2p
DIFFERENCE OF ERROR TERMS var(TD |H1 ) = 2σϕ4 2 1 +
N −p N −p
Let TC  t1 − t2 denote the difference in the finite sample  2  2
estimates of the mean squared errors of the one variable and 4 p 2 2 p
+ σx́ 1+ + 2σϕ 2 σx́ 2 1+
the two variable prediction models. It can be argued that TC N −p N −p
is the difference of two correlated Chi-squared r.v.s., each with  

2(N − p) degrees of freedom. Therefore, the pdf of TC can be 4 3p2 2p 4


− σϕ 2 + − 2σϕ 2 .
written as [37] (N − p)2 N −p
(106)
|y|N −p−1
fT C (y) =
(m − 1)![2var(t1 )var(t2 ) − cov2 (t1 , t2 )γ](N −p) Using the above statistics, it is now straightforward to design
  N  i the detector to achieve a given probability of false alarm, and
−p−1
1 − (N − p + i − 1)! 2 obtain closed-form expressions for the probability of false alarm
× exp α y , y < 0;
4 i=0
(N − p − i − 1)! γ|y| and detection, similar to (89) and (90), respectively. In particular,
when N
p, i.e., the number of samples is much larger than the
model order, it can be easily shown that the detection threshold
|y|N −p−1 λ is well approximated by
fT C (y) = √ −1
(m − 1)![2var(t1 )var(t2 ) − cov2 (t1 , t2 )γ](N −p) 2Q (PF A )σϕ2 2
  N
−p−1  i λ= √ , (107)
1 (N − p + i − 1)! 2 N −1
× exp − α+ y , y ≥ 0,
4 i=0
(N − p − i − 1)! γ|y| and the probability of detection as a function of the target false
(100) alarm probability PF∗ A can be approximated as
⎛ √ √ ⎞
with Q−1 (PF∗ A ) 2σϕ2 2 − N − pσx́2
PD ≈ Q ⎝ # ⎠. (108)
[(var(t1 ) + var(t2 ))2 − 4cov2 (t1 , t2 )] 2
1 σx́4 + 2σϕ2 2 σx́2 2
γ= , (101)
var(t1 )var(t2 ) − cov2 (t1 , t2 )) Based on the above, the number of samples required to attain a
given pair of probabilities of detection and false alarm can be
and
approximated as
[(var(t1 ) − var(t2 ))] $ 2
α± = γ ± . (102) √ σϕ2 2 σϕ2 2
var(t1 )var(t2 ) − cov2 (t1 , t2 )) −1 −1
N ≈ p + Q (PF A ) 2 2 + Q (PD ) 1 + 2 2 .
σx́ σx́
The p.d.f. of TC is in the form of a series, which makes it (109)
difficult to obtain the probabilities of detection and false alarm Revisiting the example considered in the previous sections,
in closed form. However, for large N , TC can be approximated we observe that, σϕ2 1 = σx2 , σϕ2 2 = σx2 − Δ, using which, it is
as a Gaussian r.v., such that, easy to obtain the expressions for the probabilities of false alarm
p and detection. We omit writing out the expressions for the sake
E[TC |H0 ] = − σϕ2 2 , (103) of brevity.
N −p
  We can also use the mean squared value of the difference of
p p
E[TC |H1 ] = σx́2 1+ − σϕ2 2 , (104) error terms as a test statistic to detect the presence of a causal
N −p N −p relationship between x[n] and y[n]. Letting x̌[n]  x̂[n] − x̄[n],

⎛ ⎞
√     
⎜ N − 1 (λ − 1)σϕ2 1 1 + N 1−1 − Δ 1 + N 1−1 − (σϕ2 1 − Δ) N 1−1 ⎟

PD = Q ⎝ " ⎟
 2  2  2  ⎠ . (99)
(λ2 + 1)σϕ4 1 1 + N 1−1 + Δ2 1 + N 2−1 − 2σϕ2 2 Δ 1 + N 2−1 + σϕ4 1 (N −1) 3
2 +
2
N −1
5812 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

Fig. 1. Value of T E at different SNRs, for a = 0.5, with known second-order


statistics.

and
1 N
TA  |x̌[n]|2 , (110)
N − p n =p+1

it can be shown that


 
3p
E[TA |H0 ] = σϕ2 1 , (111)
N −p
and
−1 −1
E[TA |H1 ] = rH
z z ,p Rz z ,p rz z ,p + rxx,p Rxx,p rxx,p
H

2σϕ2 2 p σϕ2 1 p Fig. 2. Probabilities of (a) detection and (b) false alarm as a function of the
− 2{rH R−1 −1
z z ,p Rz x Rxx,p rxx,p } + + . (112) detection threshold λ for the test statistic T E .
N −p N −p
However, the higher order statistics and the exact distribution σ u2
of TA cannot be found in closed form. Therefore, we evaluate inspect the effects of noise on the GCI. We define γ1  σ ν2 1 ,
its performance via simulations. To the best of our knowledge, σ u2
and γ2  as the observation SNRs for x[n] and y[n]. We
σ ν2 2
the test statistics TC and TA have not been used in the literature
see that the mean value of TG converges to a value close to zero
to detect Granger causality.
as the noise variance increases. The reduction in the SNR will
make it harder to detect the presence of a causal relationship
VII. SIMULATION RESULTS between the signals of interest.
In this section, we use Monte Carlo simulations to cor-
roborate our analytical expressions and illustrate the relative B. Finite Sample Effects
performance of different test statistics for detecting Granger
We now consider the effects of using a finite number of sam-
causality. We also illustrate the causality detection performance
ples on the GCI, as discussed in Section V. In Figs. 2(a) and 2(b),
of TC using real-world data with known ground truth, available
we compare the simulated detection and false alarm probabil-
in [38]
ity obtained by using the test statistic TE as a function of the
For the simulation-based study, we consider a system model
detection threshold λ and for different number of samples, with
similar to the example discussed in the previous sections, for
a = 0.25 and γ1 = γ2 = 0 dB. The simulated performance of
different values of the variance of the additive noise and vary-
TE is found to be in close agreement with the values computed
ing number of samples. The probabilities of detection and false
using our analytical expressions in (96) and (99). It is also ob-
alarm are obtained by averaging over 10,000 independent real-
served that the behavior of both the probability of detection and
izations of the signals of interest.
false alarm as a function of the detection threshold becomes
sharper with an increase in the number of samples, and asymp-
A. Effect of Additive Noise
totically approximates a step function. The transition point of the
In Fig. 1, we plot the mean value of the test statistic TG for step function occurs at the mean of the test statistic under the re-
the example considered in Section IV, with b = 0.5. Here we spective hypotheses, which corresponds to the 50% probability
CHOPRA et al.: STATISTICAL TESTS FOR DETECTING GRANGER CAUSALITY 5813

Fig. 3. ROC for detection based on T E , at an SNR of 0 dB, for different


number of samples.

of detection/ false alarm point on Figs. 2(a) and 2(b). Therefore,


the threshold λ = 1 gives a zero error rate when infinite samples
of the two signals of interest are available. However, in case of a
finite number of samples, λ has to be chosen carefully to achieve
reliable performance.
In Fig. 3, we compare the theoretical and simulated receiver
operating characteristics (ROCs) for the example considered in
Section V for a = 0.25 with different values of N , at an SNR
of 0 dB. It is observed that for a small number of samples,
the ROC is close to the PF A = PD line, whereas, for a large
number of samples, the probability of detection becomes almost
independent of PF A . This is in line with the behavior predicted Fig. 4. ROC for detection based on T C , at an SNR of 0 dB, for different
in (96) and (99). number of samples (N ) (a) averaged over 10,000 realizations in all cases,
(b) obtained for a fixed data record length (10,000) divided into windows of
Having established that the use of finite samples indeed causes length N .
a deterioration in the performance of the GCI, we now discuss
the performance of the other two test statistics, viz. TC and
TA , as discussed in Section VI. In Fig. 4(a), we compare the Thus, given a data record of a given size, a good choice for N
theoretical and simulated ROCs for the example considered in is the largest one such that the ROC is averaged over about 100
Section VI, with a = 0.25, for different number of samples, and instantiations.
at an SNR of 0 dB. We observe that the ROC for TC shows a In Fig. 5, we repeat the experiment considered in Fig. 4(a)
behavior similar to that for TE with an increase in the number by fixing the number of samples of x[n] and y[n] at 100 and
of samples. Therefore, an appropriate choice of the detection varying the SNRs. Again, the theoretical and simulated plots are
threshold λ becomes important for TC as well. in close agreement, with the concavity of the ROC increasing
In practice, one is typically provided a given number of sam- with the SNR.
ples of a pair of time-series data to perform detection. In this In Figs. 6 and 7, we evaluate the performance of TA as a test
case, there is a trade-off between the window size N and the statistic for detection of a causal relation between x[n] and y[n]
quality of the ROC obtained. Using a larger N improves the for a setup similar to the example considered in Section V, with
reliability of detection, but the resulting ROC has more “jumps” a = 0.25. Since theoretical expressions for the behavior of TA
because it is computed by averaging over fewer instantiations. are not derived, these are not considered here. In Fig. 6, we plot
We illustrate this in Fig. 4(b), where we plot the ROC for the the ROC of TA for different SNRs for 100 samples of each of
setup of Fig. 4(a) with the total number of available samples x[n] and y[n] being used for detection. The performance of TA
fixed at 10,000. We see that the ROC with N = 100 (averaged under different SNRs is seen to mimic that of TC and TE , with
over 100 instantiations) matches well with the corresponding the concavity of the ROC increasing with the SNR. In Fig. 7,
curve in Fig. 4(a), where 10,000 instantiations were used. For we plot the probabilities of detection for different numbers of
higher N , the performance improves, but the quality of the samples as a function of the signal SNRs. In this case, we use the
ROC deteriorates, because the probability of detection is av- Neyman-Pearson criterion to fix the probability of false alarm at
eraged over a correspondingly fewer number of instantiations. 10%. As expected, an increase in the number of samples leads to
5814 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

Fig. 5. ROC for detection based on T C , with N = 200 samples, for various Fig. 8. ROC of test statistics T E , T C and T A for N = 100 samples at an
values of the SNR. SNR of 0 dB.

Fig. 6. ROC for detection based on T A , with N = 100 samples, for various Fig. 9. Performance of the test statistic T C on the real-world datasets 67–69
values of the SNR. in [38] for different numbers of samples.

performance among the three test statistics considered in this


paper, followed by TC and TE respectively. It is interesting to
note that if the exact knowledge about the second-order statistics
is available, then the performance of the three test statistics
should be identical.

C. Detection Performance of TC With Real World Data


In Figs. 9 and 10, we plot the performance obtained from
the test statistic TC on a real-world dataset with known ground
truth, available in [38]. Fig. 9 corresponds to data pairs 65–67 of
this database. These data pairs consist of the returns from related
Fig. 7. Probability of detection using T A at a fixed false alarm rate of 10% as stocks over a period of five years. Similarly, Fig. 10 corresponds
a function of the SNR.
to data pair 69 of the database, containing temperature measure-
ments from inside and outside a room taken every five minutes.
improved detection performances for the same SNR. Therefore, These four data sets were selected out of the 108 available data
TA can also be used to detect a causal relationship between a sets because:
pair of signals in the finite sample regime. 1) They represent time series data, in accordance with our
In Fig. 8, we compare the simulated performance of TE , system model.
TC and TA for the example considered previously, with b = 2) The number of samples in these time series were suffi-
0.25, at an SNR of 0 dB and for 100 samples. We observe ciently large to estimate of the probabilities of detection
that TA , despite being theoretically intractable, offers the best and false alarm with good accuracy.
CHOPRA et al.: STATISTICAL TESTS FOR DETECTING GRANGER CAUSALITY 5815

In Fig. 11, we plot the performance of TC for real-world


data used in Fig. 10 with different amounts of added noise for
a window length of N = 200. Noisy samples are obtained by
adding real valued white Gaussian noise to the caused sequence,
suitably scaled to achieve a given SNR. These samples are used
to calculate the test statistic TC , which is then compared against
the detection threshold in (107), with σϕ2 1 as the sample vari-
ance of the corrupted caused signal. It is observed that for a fixed
number of samples per window, the addition of noise worsens
the detection performance of TC , and the variation in the detec-
tion performance approximately matches with the Q-function
of the square root of the SNR, as expected.

VIII. CONCLUSIONS
Fig. 10. Performance of the test statistic T C on the real-world dataset 69 In this work, we derived the effects of different data acquisi-
in [38] for different numbers of samples. tion impairments on the detection of a causal relation between
two signals. We showed that these effects, viz. down-sampling,
additive noise, and finite sample effects, could lead to a signifi-
cant deterioration in the performance of the GCI, both individ-
ually as well as cumulatively. We derived the probabilities of
detection and false alarm for the GCI under the impairments.
Following this, we proposed two alternative test statistics, based
on the difference of mean squared error terms, and the mean
squared value of the difference of error terms. Via extensive
simulations, we showed that the derived results corroborate the
simulated as well as real-world data, with the test statistic based
on the mean squared value of the difference of error terms out-
performing the other two statistics. A more precise analysis of
this test statistic is an interesting direction for future research.

ACKNOWLEDGMENT

Fig. 11. Performance of T C for the dataset considered in Fig. 10 for N = 200 The authors are grateful to the anonymous reviewers for their
with different amounts of added noise. feedback, which significantly improved the overall presentation,
and in particular the simulation results in the paper.

For generating the experimental results, we assumed that the REFERENCES


data was uniformly sampled at Nyquist rate. However, neither
[1] I. Kontoyiannis and M. Skoularidou, “Estimating the directed informa-
the model parameters of the data nor the additive noise variance tion and testing for causality,” IEEE Trans. Inf. Theory, vol. 62, no. 11,
are known. Hence, an order 1-VAR generative model is as- pp. 6053–6067, Nov. 2016.
sumed, and each data set is preprocessed to make it zero mean. [2] C. W. J. Granger, “Investigating causal relations by econometric models
and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438,
The detection threshold for a given false alarm rate (PF A ) is 1969.
calculated assuming the two sequences to be independent and [3] D. A. Pierce, “R 2 measures for time series,” J. Amer. Statist. Assoc.,
white, and using the sample variance of the caused sequence as vol. 74, no. 368, pp. 901–910, 1979.
[4] R. Goebel, A. Roebroeck, D.-S. Kim, and E. Formisano, “Investigating
σϕ2 2 in (107). directed cortical interactions in time-resolved fMRI data using vector
The data pairs are divided into windows of length N , and autoregressive modeling and Granger causality mapping,” Magn. Reson.
the test statistic TC is calculated and compared to the threshold Imag., vol. 21, no. 10, pp. 1251261, 2003.
[5] W. Hesse, E. Mller, M. Arnold, and B. Schack, “The use of time-variant
in (107). The results of these comparisons are averaged over all EEG Granger causality for inspecting directed interdependencies of neural
the sample windows to obtain the probability of detection for a assemblies,” J. Neurosci. Methods, vol. 124, no. 1, pp. 27–44, 2003.
given probability of false alarm. The jitters in the plots are due [6] Y. Chen, G. Rangarajan, J. Feng, and M. Ding, “Analyzing multiple non-
linear time series with extended Granger causality,” Phys. Lett. A, vol. 324,
to the averaging over the relatively small number of samples no. 1, pp. 26–35, 2004.
that were available. The performance of data set 69 is poorer [7] M. Dhamala, G. Rangarajan, and M. Ding, “Analyzing information flow
than data sets 65–67, for the same value of N , indicating either in brain networks with nonparametric Granger causality,” NeuroImage,
vol. 41, no. 2, pp. 354–362, 2008.
a weaker causal relationship, or larger amount of additive noise, [8] S. Hu and H. Liang, “Causality analysis of neural connectivity: New tool
or both. Nonetheless, the ROC follows a pattern similar to that and limitations of spectral Granger causality,” Neurocomputing, vol. 76,
predicted by (108). no. 1, pp. 44–47, 2012.
5816 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 66, NO. 22, NOVEMBER 15, 2018

[9] D. Sampath et al., “A study on fear memory retrieval and REM sleep [36] R. Tandra and A. Sahai, “SNR walls for signal detection,” IEEE J. Sel.
in maternal separation and isolation stressed rats,” Behav. Brain Res., Topics Signal Process., vol. 2, no. 1, pp. 4–17, Feb. 2008.
vol. 273, pp. 144–154, 2014. [37] M. K. Simon, Probability Distributions Involving Gaussian Random Vari-
[10] M. Hu and H. Liang, “A copula approach to assessing Granger causality,” ables. New York, NY, USA: Springer, 2002.
NeuroImage, vol. 100, pp. 125–134, 2014. [38] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf,
[11] A. K. Seth, A. B. Barrett, and L. Barnett, “Granger causality analysis in “Distinguishing cause from effect using observational data: Methods and
neuroscience and neuroimaging,” J. Neurosci., vol. 35, no. 8, pp. 3293– benchmarks,” J. Mach. Learn. Res., vol. 17, no. 32, pp. 1–102, 2016.
3297, 2015.
[12] R. Ganapathy, G. Rangarajan, and A. K. Sood, “Granger causality and Ribhu Chopra (S’11–M’17) received the B.E. de-
cross recurrence plots in rheochaos,” Phys. Rev. E, vol. 75, Jan. 2007, Art. gree in electronics and communication engineering
no. 016211. from Panjab University, Chandigarh, India, in 2009,
[13] S. Stramaglia, J. M. Cortes, and D. Marinazzo, “Synergy and redundancy and the M.Tech. and Ph.D. degrees in electronics
in the Granger causal analysis of dynamical networks,” New J. Phys., and communication engineering from the Indian In-
vol. 16, no. 10, 2014, Art. no. 105003. stitute of Technology Roorkee, Roorkee, India, in
[14] E. Kodra, S. Chatterjee, and A. R. Ganguly, “Exploring Granger causal- 2011 and 2016, respectively. He worked as a Project
ity between global average observed time series of carbon dioxide and Associate with the Department of Electrical Com-
temperature,” Theor. Appl. Climatol., vol. 104, no. 3, pp. 325–335, 2011. munication Engineering, Indian Institute of Science,
[15] H. White and D. Pettenuzzo, “Granger causality, exogeneity, cointegration, Bangalore, from August 2015 to May 2016. From
and economic policy analysis,” J. Econometrics, vol. 178, Part 2, pp. 316– May 2016 to March 2017, he worked as an Institute
330, 2014. Research Associate with the Department of Electrical Communication Engi-
[16] H. Nalatore, M. Ding, and G. Rangarajan, “Mitigating the effects of mea- neering, Indian Institute of Science, Bangalore, India. In April 2017, he joined
surement noise on Granger causality,” Phys. Rev. E, vol. 75, Mar. 2007, the Department of Electronics and Electrical Engineering, Indian Institute of
Art. no. 031123. Technology Guwahati, Assam, India. His research interests include statistical
[17] H. Nalatore, S. N, and G. Rangarajan, “Effect of measurement noise on and adaptive signal processing, massive MIMO communications, and cognitive
Granger causality,” Phys. Rev. E, vol. 90, Dec. 2014, Art. no. 062127. communications.
[18] I. Winkler, D. Panknin, D. Bartz, K. R. Mller, and S. Haufe, “Validity of
time reversal for testing Granger causality,” IEEE Trans. Signal Process.,
vol. 64, no. 11, pp. 2746–2760, Jun. 2016. Chandra Ramabhadra Murthy (S’03–M’06–
[19] T. Schreiber, “Measuring information transfer,” Phys. Rev. Lett., vol. 85, SM’11) received the B.Tech. degree in electrical
pp. 461–464, Jul. 2000. engineering from the Indian Institute of Technol-
[20] L. A. Baccalá and K. Sameshima, “Partial directed coherence: A new ogy Madras, Chennai, India, in 1998, the M.S. and
concept in neural structure determination,” Biol. Cybern., vol. 84, no. 6, Ph.D. degrees in electrical and computer engineer-
pp. 463–474, 2001. ing from Purdue University and the University of
[21] E. Siggiridou and D. Kugiumtzis, “Granger causality in multivariate time California, San Diego, USA, in 2000 and 2006, re-
series using a time-ordered restricted vector autoregressive model,” IEEE spectively. From 2000 to 2002, he worked as an
Trans. Signal Process., vol. 64, no. 7, pp. 1759–1773, Apr. 2016. Engineer for Qualcomm, Inc., where he worked on
[22] P. Newbold and N. Davies, “Error mis-specification and spurious regres- WCDMA baseband transceiver design and 802.11b
sions,” Int. Econ. Rev., vol. 19, no. 2, pp. 513–519, 1978. baseband receivers. From August 2006 to August
[23] G. Nolte et al., “Robustly estimating the flow direction of information in 2007, he worked as a Staff Engineer at Beceem Communications, Inc., on
complex physical systems,” Phys. Rev. Lett., vol. 100, Jun. 2008, Art no. advanced receiver architectures for the 802.16e Mobile WiMAX standard. In
234101. September 2007, he joined the Department of Electrical Communication En-
[24] S. Haufe, V. V. Nikulin, and G. Nolte, Alleviating the Influence gineering, Indian Institute of Science, Bangalore, India, where he is currently
of Weak Data Asymmetries on Granger-Causal Analyses. Berlin, working as a Professor.
Heidelberg: Springer, 2012, pp. 25–33. His research interests include the areas of energy harvesting communica-
[25] S. Haufe, V. V. Nikulin, K.-R. Mller, and G. Nolte, “A critical assessment tions, multiuser MIMO systems, and sparse signal recovery techniques applied
of connectivity measures for EEG data: A simulation study,” NeuroImage, to wireless communications. His paper won the best paper award in the Com-
vol. 64, pp. 120–133, 2013. munications Track at NCC 2014 and a paper co-authored with his student won
[26] M. Vinck et al., “How to detect the Granger-causal flow direction in the the student best paper award at the IEEE ICASSP 2018. He has 50+ journal
presence of additive noise?,” NeuroImage, vol. 108, pp. 301–318, 2015. papers and 80+ conference papers to his credit. He was an Associate Edi-
[27] A. K. Seth, P. Chorley, and L. C. Barnett, “Granger causality analysis tor for the IEEE SIGNAL PROCESSING LETTERS during 2012–2016. He is an
of fMRI BOLD signals is invariant to hemodynamic convolution but not elected member of the IEEE SPCOM Technical Committee for the years 2014–
downsampling,” NeuroImage, vol. 65, no. Supplement C, pp. 540–555, 2016, and has been re-elected for the 2017–2019 term. He is the Past Chair
2013. of the IEEE Signal Processing Society, Bangalore Chapter. He is currently
[28] L. Barnett and A. K. Seth, “Detectability of Granger causality for subsam- serving as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PRO-
CESSING, the Sadhana Journal, and an Editor for the IEEE TRANSACTIONS ON
pled continuous-time neurophysiological processes,” J. Neurosci. Meth-
ods, vol. 275, no. Supplement C, pp. 93–121, 2017. COMMUNICATIONS.
[29] M. Gong, K. Zhang, B. Schoelkopf, D. Tao, and P. Geiger, “Discovering
temporal causal relations from subsampled data,” in Proc. 32nd Int. Conf.
Mach. Learn., vol. 37, Jul. 2015, 1898–1906.
[30] M. Gong, K. Zhang, B. Schölkopf, C. Glymour, and D. Tao, “Causal Govindan Ranagarajan received the Integrated
discovery from temporally aggregated time series,” in Proc. Conf. Uncer- M.Sc. degree from Birla Institute of Technology and
tainty Artif. Intell., Aug. 2017, pp. 1–10. Science, Pilani, India, and the Ph.D. degree from the
[31] A. Tank, E. B. Fox, and A. Shojaie, “Identifiability of non-Gaussian struc- University of Maryland, College Park, MD, USA. He
tural VAR models for subsampled and mixed frequency time series,” in then was with the Lawrence Berkeley Lab, Univer-
Proc. SIGKDD Workshop Causal Discovery, 2016, pp. 1–7. sity of California, Berkeley, before returning to India
[32] A. Hyttinen, S. Plis, M. Jrvisalo, F. Eberhardt, and D. Danks, “A constraint in 1992. He is currently a Professor of mathematics
optimization approach to causal discovery from subsampled time series with the Indian Institute of Science, Bangalore. He is
data,” Int. J. Approx. Reasoning, vol. 90, pp. 208–225, 2017. also the Chairman of the Division of Interdisciplinary
[33] S. Plis, D. Danks, C. Freeman, and V. Calhoun, “Rate-agnostic (causal) Research.
structure learning,” Adv. Neural Inf. Process. Syst., vol. 28, pp. 3303–3311, His research interests include nonlinear dynamics
2015. and chaos, time series analysis, and brain–machine interface. He is a J. C. Bose
[34] X. Wen, G. Rangarajan, and M. Ding, “Is Granger causality a viable National Fellow. He is also a Fellow of the Indian Academy of Sciences and the
technique for analyzing fMRI data?,” PLOS ONE, vol. 8, pp. 1–11, 2013. National Academy of Sciences, India. He was knighted by the Government of
[35] S. Haykin, Adaptive Filter Theory. London, U.K.: Pearson Education, France: Knight (Chevalier) of the Order of Palms. He was also a Homi Bhabha
2002. Fellow.

You might also like