Kernel-Based Conditional Independence Test and Application in Causal Discovery
E_YZ; (iii.) E(f̃ g) = 0, ∀ f̃ ∈ E_XZ and g ∈ L²_YZ; (iv.) E(f̃ g̃′) = 0, ∀ f̃ ∈ E_XZ and g̃′ ∈ E′_YZ; (v.) E(f̃ g′) = 0, ∀ f̃ ∈ E_XZ and g′ ∈ L²_Y.

The above result can be considered as a generalization of the partial correlation based characterization of CI for Gaussian variables. Suppose that (X, Y, Z) is jointly Gaussian; then X ⊥⊥ Y | Z is equivalent to the vanishing of the partial correlation coefficient ρ_XY·Z. Here, intuitively speaking, condition (ii) means that any "residual" function of (X, Z) given Z is uncorrelated with that of (Y, Z) given Z. Note that E_XZ (resp. E_YZ) contains all functions of X and Z (resp. of Y and Z) that cannot be "explained" by Z, in the sense that any function f̃ ∈ E_XZ (resp. g̃ ∈ E_YZ) is uncorrelated with any function of Z (Daudin, 1980).

Denote by λ_x,i the eigenvalues of the centralized kernel matrix K̃_X and define the empirical kernel maps ψ_i(x) ≜ √λ_x,i · V_x,i, where V_x,i denotes the ith eigenvector of K̃_X. On the other hand, consider the eigenvalues λ*_X,i and eigenfunctions u_X,i of the kernel k_X w.r.t. the probability measure with the density p(x), i.e., λ*_X,i and u_X,i satisfy

    ∫ k_X(x, x′) u_X,i(x) p(x) dx = λ*_X,i u_X,i(x′).

Here we assume that the u_X,i have unit variance, i.e., E[u²_X,i(X)] = 1. We also sort the λ*_X,i in descending order. Similarly, we define λ*_Y,i and u_Y,i of k_Y. Define

    S_ij ≜ (1/√n) ψ_i(x)ᵀ φ_j(y) = (1/√n) Σ_{t=1}^n ψ_i(x_t) φ_j(y_t),

with ψ_i(x_t) being the t-th component of the vector ψ_i(x). We then have the following results.
Theorem 3 Suppose that we are given arbitrary centred kernels k_X and k_Y with discrete eigenvalues and the corresponding RKHS's H_X and H_Y for sets of random variables X and Y, respectively. We have the following three statements.

1) Under the condition that f(X) and g(Y) are uncorrelated for all f ∈ H_X and g ∈ H_Y, for any L such that λ*_X,L+1 ≠ λ*_X,L and λ*_Y,L+1 ≠ λ*_Y,L, we have

    Σ_{i,j=1}^{L} S_ij²  →_d  Σ_{k=1}^{L²} λ̊*_k z_k²,  as n → ∞,   (5)

where the z_k are i.i.d. standard Gaussian variables (i.e., the z_k² are i.i.d. χ²₁-distributed variables), λ̊*_k are the eigenvalues of E(wwᵀ), and w is the random vector obtained by stacking the L × L matrix N whose (i, j)th entry is N_ij = √(λ*_X,i λ*_Y,j) u_X,i(X) u_Y,j(Y).³

2) In particular, if X and Y are further independent, we have

    Σ_{i,j=1}^{L} S_ij²  →_d  Σ_{i,j=1}^{L} λ*_X,i λ*_Y,j · z_ij²,  as n → ∞,   (6)

where the z_ij² are i.i.d. χ²₁-distributed variables.

3) The results (5) and (6) also hold if L = n → ∞.

³ Equivalently, one can consider λ̊*_k as the eigenvalues of the tensor T with T_ijkl = E(N_ij,t N_kl,t).

All proofs are sketched in the Appendix. We note that

    Tr(K̃_X K̃_Y) = Tr(ψ_x ψ_xᵀ ψ_y ψ_yᵀ) = Tr(ψ_xᵀ ψ_y ψ_yᵀ ψ_x) = n Σ_{i,j=1}^{n} S_ij².   (7)

Hence, the above theorem gives the asymptotic distribution of (1/n) Tr(K̃_X K̃_Y) under the condition that f(X) and g(Y) are always uncorrelated. Combined with the characterizations of (conditional) independence, this inspires the corresponding testing method. We further give the following remarks on Theorem 3. First, X and Y are not necessarily disjoint, and H_X and H_Y can be any RKHS's. Second, in practice the eigenvalues λ̊*_k are not known, and one needs to use the empirical ones instead, as discussed below.
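For concreteness, the following short NumPy snippet (our own illustration, not part of the paper; the synthetic data, the Gaussian kernel, and the width 1 are arbitrary choices) checks identity (7) numerically by building the empirical kernel maps from the eigenvalue decompositions of K̃_X and K̃_Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x, y = rng.standard_normal(n), rng.standard_normal(n)
H = np.eye(n) - np.ones((n, n)) / n                            # centering matrix
Kx = H @ np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) @ H     # centralized K̃_X
Ky = H @ np.exp(-0.5 * (y[:, None] - y[None, :]) ** 2) @ H     # centralized K̃_Y

def kernel_map(K):
    """Empirical kernel map: column i is sqrt(lambda_i) times the ith eigenvector."""
    lam, V = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)          # clip tiny negative eigenvalues from rounding
    return V * np.sqrt(lam)

psi, phi = kernel_map(Kx), kernel_map(Ky)
S = psi.T @ phi / np.sqrt(n)               # S_ij = psi_i(x)^T phi_j(y) / sqrt(n)
print(np.trace(Kx @ Ky), n * np.sum(S ** 2))   # the two numbers coincide
```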
3.2 Unconditional independence testing

As a direct consequence of the above theorem, we have the following result, which allows for kernel-based unconditional independence testing.

Theorem 4 [Independence test]
Under the null hypothesis that X and Y are statistically independent, the statistic

    T_UI ≜ (1/n) Tr(K̃_X K̃_Y)   (8)

has the same asymptotic distribution as

    Ť_UI ≜ (1/n²) Σ_{i,j=1}^{n} λ_x,i λ_y,j z_ij²,   (9)

i.e., T_UI =_d Ť_UI as n → ∞.

This theorem inspires the following unconditional independence testing procedure. Given the samples x and y, one first calculates the centralized kernel matrices K̃_X and K̃_Y and their eigenvalues λ_x,i and λ_y,i, and then evaluates the statistic T_UI according to (8). Next, the empirical null distribution of Ť_UI under the null hypothesis can be simulated in the following way: one draws i.i.d. random samples of the χ²₁-distributed variables z_ij², and then generates samples of Ť_UI according to (9). (Later we will give another way to approximate the asymptotic null distribution.) Finally, the p-value can be found by locating T_UI in the null distribution.
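A minimal NumPy sketch of this procedure (our own illustration, not the released MATLAB implementation; the Gaussian kernel, the width arguments, the 10⁻⁵ eigenvalue cutoff, and the number of Monte Carlo draws are our choices) could look as follows:

```python
import numpy as np

def centralized_gaussian_kernel(x, width):
    """Centralized Gaussian kernel matrix K̃ = HKH for a sample x of shape (n, d)."""
    n = x.shape[0]
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / (2.0 * width ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def uncond_independence_test(x, y, width_x, width_y, n_sim=1000, seed=0):
    """Statistic (8) with a Monte Carlo null distribution simulated via (9)."""
    n = x.shape[0]
    Kx = centralized_gaussian_kernel(x, width_x)
    Ky = centralized_gaussian_kernel(y, width_y)
    T_UI = np.trace(Kx @ Ky) / n                         # statistic (8)
    lam_x = np.linalg.eigvalsh(Kx)
    lam_y = np.linalg.eigvalsh(Ky)
    lam_x = lam_x[lam_x > 1e-5]                          # drop negligible eigenvalues
    lam_y = lam_y[lam_y > 1e-5]
    weights = np.outer(lam_x, lam_y).ravel()             # products λ_{x,i} λ_{y,j}
    rng = np.random.default_rng(seed)
    z2 = rng.chisquare(1, size=(n_sim, weights.size))    # i.i.d. χ²_1 draws
    null_samples = z2 @ weights / n ** 2                 # samples of Ť_UI, eq. (9)
    p_value = np.mean(null_samples >= T_UI)              # locate T_UI in the null
    return T_UI, p_value
```

Keeping all eigenvalue products would make the Monte Carlo samples exact draws from (9); the cutoff only serves to keep the simulation cheap.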
This unconditional independence testing method is closely related to the one based on the Hilbert-Schmidt independence criterion (HSIC) proposed by Gretton et al. (2008). Actually the defined statistics (our statistic T_UI and HSICb in Gretton et al. (2008)) are the same, but the asymptotic distributions are in different forms. In our results the asymptotic distribution only involves the eigenvalues of the two regular kernel matrices K̃_X and K̃_Y, while in their results it depends on the eigenvalues of an order-four tensor, which are more difficult to calculate.

3.3 Conditional independence testing

Here we would like to make use of condition (iv) in Lemma 2 to test for CI, but the considered functional spaces are f(Ẍ) ∈ H_Ẍ, g′(Y) ∈ H_Y, h*_f(Z) ∈ H_Z, and h*_g′(Z) ∈ H_Z, as in Lemma 1. The functions f̃(Ẍ) and g̃′(Y, Z) appearing in condition (iv), whose spaces are denoted by H_Ẍ|Z and H_Y|Z, respectively, can be constructed from the functions f, g′, h*_f, and h*_g′. Suppose that we already have the centralized kernel matrices K̃_Ẍ (which is the centralized kernel matrix of Ẍ = (X, Z)), K̃_Y, and K̃_Z on the samples x, y, and z. We use kernel ridge regression to estimate the regression function h*_f(Z) in (4), and one can easily see that ĥ*_f(z) = K̃_Z(K̃_Z + εI)⁻¹ · f(ẍ), where ε is a small positive regularization parameter (Schölkopf and Smola, 2002). Consequently, f̃(ẍ) can be constructed as f̃(ẍ) = f(ẍ) − ĥ*_f(z) = R_Z · f(ẍ), where

    R_Z = I − K̃_Z(K̃_Z + εI)⁻¹ = ε(K̃_Z + εI)⁻¹.   (10)

Based on the EVD decomposition K̃_Ẍ = V_ẍ Λ_ẍ V_ẍᵀ, we can construct ϕ_ẍ = [ϕ_1(ẍ), ..., ϕ_n(ẍ)] ≜ V_ẍ Λ_ẍ^{1/2} as an empirical kernel map for ẍ. Correspondingly, an empirical kernel map of the space H_Ẍ|Z is given by ϕ̃_ẍ = R_Z ϕ_ẍ. Consequently, the centralized kernel matrix corresponding to the functions f̃(Ẍ) is

    K̃_Ẍ|Z = ϕ̃_ẍ ϕ̃_ẍᵀ = R_Z K̃_Ẍ R_Z.   (11)

Similarly, that corresponding to g̃′ is

    K̃_Y|Z = R_Z K̃_Y R_Z.   (12)

Furthermore, we let the EVD decompositions of K̃_Ẍ|Z and K̃_Y|Z be K̃_Ẍ|Z = V_ẍ|z Λ_ẍ|z V_ẍ|zᵀ and K̃_Y|Z = V_y|z Λ_y|z V_y|zᵀ, respectively. Λ_ẍ|z (resp. Λ_y|z) is the diagonal matrix containing the non-negative eigenvalues λ_ẍ|z,i (resp. λ_y|z,i). Let ψ_ẍ|z = [ψ_ẍ|z,1(ẍ), ..., ψ_ẍ|z,n(ẍ)] ≜ V_ẍ|z Λ_ẍ|z^{1/2} and φ_y|z = [φ_y|z,1(ÿ), ..., φ_y|z,n(ÿ)] ≜ V_y|z Λ_y|z^{1/2}. We then have the following result, on which the proposed KCI-test is based.

Proposition 5 [Conditional independence test]
Under the null hypothesis H₀ (X and Y are conditionally independent given Z), we have that the statistic

    T_CI ≜ (1/n) Tr(K̃_Ẍ|Z K̃_Y|Z)   (13)

has the same asymptotic distribution as

    Ť_CI ≜ (1/n) Σ_{k=1}^{n²} λ̊_k · z_k²,   (14)

where λ̊_k are the eigenvalues of w̌w̌ᵀ and w̌ = [w̌_1, ..., w̌_n], with the vector w̌_t obtained by stacking M̌_t = [ψ_ẍ|z,1(ẍ_t), ..., ψ_ẍ|z,n(ẍ_t)]ᵀ · [φ_y|z,1(ÿ_t), ..., φ_y|z,n(ÿ_t)].⁴

⁴ Note that equivalently, λ̊_k are eigenvalues of w̌ᵀw̌; hence there are at most n non-zero values of λ̊_k.

Similarly to the unconditional independence testing, we can do the KCI-test by generating the approximate null distribution with Monte Carlo simulation. We first need to calculate K̃_Ẍ|Z according to (11), K̃_Y|Z according to (12), and their eigenvalues and eigenvectors. We then evaluate T_CI according to (13) and simulate the distribution of Ť_CI given by (14) by drawing i.i.d. χ²₁ samples and summing them up with weights λ̊_k. (For computational efficiency, in practice we drop all λ_ẍ|z,i, λ_y|z,i, and λ̊_k which are smaller than 10⁻⁵.) Finally, the p-value is calculated as the probability of Ť_CI exceeding T_CI. Approximating the null distribution with a Gamma distribution, which is given next, avoids calculating λ̊_k and simulating the null distribution, and is computationally more efficient.
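Continuing the sketch above (and reusing centralized_gaussian_kernel from it), a minimal implementation of this conditional procedure could look as follows; ε, the shared kernel width, and the 10⁻⁵ cutoff are placeholder choices, not the settings discussed in Section 3.5:

```python
import numpy as np

def kci_test(x, y, z, width, eps=1e-3, n_sim=1000, seed=0):
    """KCI-test sketch: statistic (13) with the null distribution simulated via (14)."""
    n = x.shape[0]
    xz = np.hstack([x, z])                                 # Ẍ = (X, Z); inputs are (n, d) arrays
    Kxz = centralized_gaussian_kernel(xz, width)            # K̃_Ẍ
    Ky = centralized_gaussian_kernel(y, width)               # K̃_Y
    Kz = centralized_gaussian_kernel(z, width / 2)            # K̃_Z (half width, illustrative)
    Rz = eps * np.linalg.inv(Kz + eps * np.eye(n))            # eq. (10)
    Kxz_z = Rz @ Kxz @ Rz                                     # eq. (11): K̃_{Ẍ|Z}
    Ky_z = Rz @ Ky @ Rz                                       # eq. (12): K̃_{Y|Z}
    T_CI = np.trace(Kxz_z @ Ky_z) / n                         # eq. (13)

    def kernel_map(K):
        """Empirical kernel map V Λ^{1/2}, dropping negligible eigenvalues."""
        lam, V = np.linalg.eigh(K)
        keep = lam > 1e-5
        return V[:, keep] * np.sqrt(lam[keep])

    psi, phi = kernel_map(Kxz_z), kernel_map(Ky_z)

    # w̌_t stacks the outer product of the t-th rows of psi and phi; the weights
    # λ̊_k are the (at most n) non-zero eigenvalues of w̌ᵀ w̌.
    w = np.einsum('ti,tj->tij', psi, phi).reshape(n, -1)      # row t is w̌_tᵀ
    lam_ring = np.linalg.eigvalsh(w @ w.T)
    lam_ring = lam_ring[lam_ring > 1e-5]

    rng = np.random.default_rng(seed)
    z2 = rng.chisquare(1, size=(n_sim, lam_ring.size))
    null_samples = z2 @ lam_ring / n                          # samples of Ť_CI, eq. (14)
    p_value = np.mean(null_samples >= T_CI)
    return T_CI, p_value
```

Replacing the Monte Carlo step by the Gamma approximation described next gives the faster variant.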
3.4 Approximating the null distribution by a Gamma distribution

In addition to the simulation-based method to find the null distribution, as in Gretton et al. (2008), we provide approximations to the null distributions with a two-parameter Gamma distribution. The two parameters of the Gamma distribution are related to the mean and variance. In particular, under the null hypothesis that X and Y are independent (resp. conditionally independent given Z), the distribution of Ť_UI given by (9) (resp. of Ť_CI given by (14)) can be approximated by the Γ(k, θ) distribution:

    p(t) = t^{k−1} e^{−t/θ} / (θ^k Γ(k)),

where k = E²(Ť_UI)/Var(Ť_UI) and θ = Var(Ť_UI)/E(Ť_UI) in the unconditional case, and k = E²(Ť_CI)/Var(Ť_CI) and θ = Var(Ť_CI)/E(Ť_CI) in the conditional case. The means and variances of Ť_UI and Ť_CI on the given sample D are given in the following proposition.

Proposition 6 i. Under the null hypothesis that X and Y are independent, on the given sample D, we have

    E(Ť_UI | D) = (1/n²) Tr(K̃_X) · Tr(K̃_Y),  and
    Var(Ť_UI | D) = (2/n⁴) Tr(K̃_X²) · Tr(K̃_Y²).

ii. Under the null hypothesis of X ⊥⊥ Y | Z, we have

    E(Ť_CI | D) = (1/n) Tr(w̌w̌ᵀ),  and
    Var(Ť_CI | D) = (2/n²) Tr[(w̌w̌ᵀ)²].
Proposition 5 [Conditional independence test] ŤU I and ŤCI on the given sample D are given in the
Under the null hypothesis H0 (X and Y are condition- following proposition.
ally independent given Z), we have that the statistic
Proposition 6 i. Under the null hypothesis that X
1 and Y are independent, on the given sample D,
TCI , Tr(K
e
Ẍ|Z KY |Z )
e (13)
n we have
has the same asymptotic distribution as 1
E(ŤU I |D) = e X ) · Tr(K
Tr(K e Y ), and
n 2 n2
1X
ŤCI , λ̊k · zk2 , (14) 2
n Var(ŤU I |D) = e 2 ) · Tr(K
Tr(K e 2 ).
k=1 X Y
n4
where λ̊k are eigenvalues of w̌w̌T and
ii. Under the null hypothesis of X ⊥
⊥ Y |Z, we have
w̌ = [w̌1 , ..., w̌n ], with the vector w̌t obtained
by stacking M̌t = [ψẍ|z,1 (ẍt ), ..., ψẍ|z,n (ẍt )]T · 1
[φy|z,1 (ÿt ), ..., φy|z,n (ÿt )]. 4 E{ŤCI |D} = Tr(w̌w̌T ), and
n
Similarly to the unconditional independence testing, 2
Var(ŤIC |D) = Tr[(w̌w̌T )2 ].
we can do KCI-test by generating the approximate n2
null distribution with Monte Carlo simulation. We
first need to calculate K
e
Ẍ|Z according to (11), KY |Z
e 3.5 Practical issues: On determination of the
according to (12), and their eigenvalues and eigenvec- hyperparameters
tors. We then evaluate TCI according to (13) and sim-
In unconditional independence testing, the hyperpa-
ulate the distribution of ŤCI given by (14) by drawing
rameters are the kernel widths for constructing the
4
Note that equivalently, λ̊k are eigenvalues of w̌T w̌; kernel matrices K
e X and K e Y . We found that the per-
hence there are at most n non-zero values of λ̊k . formance is robust to these parameters within a certain
range; in particular, as in Gretton et al. (2008), we use 4.1 On the effect of the dimensionality of Z
the median of the pairwise distances of the points. and the sample size
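A possible helper for this width choice (our own, hypothetical function name) is:

```python
import numpy as np

def median_heuristic_width(x):
    """Median of the pairwise Euclidean distances of the points in x, shape (n, d)."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    iu = np.triu_indices_from(d2, k=1)      # upper triangle, excluding the zero diagonal
    return np.sqrt(np.median(d2[iu]))
```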
In the KCI-test, the hyperparameters fall into three categories. The first one includes the kernel widths used to construct the kernel matrices K̃_Ẍ and K̃_Y, which are used later to form the matrices K̃_Ẍ|Z and K̃_Y|Z according to (11) and (12). The values of the kernel widths should be such that one can use the information in the data effectively. In our experiments we normalize all variables to unit variance, and use some empirical values for those kernel widths: they are set to 0.8 if the sample size n ≤ 200, to 0.3 if n > 1200, or to 0.5 otherwise. We found that this simple setting always works well in all our experiments.
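For reference, the rule just quoted can be wrapped in a small helper (our own, hypothetical name; it presupposes variables normalized to unit variance as described above):

```python
def empirical_kernel_width(n):
    """Kernel width for unit-variance data, following the empirical rule stated above."""
    if n <= 200:
        return 0.8
    if n > 1200:
        return 0.3
    return 0.5
```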
specified in advance. We generated X and Y from Z1
The other two categories contain the kernel widths for accroding to the post-nonlinear data generating pro-
constructing Ke Z and the regularization parameter ε, cedure (Zhang and Hyvärinen, 2009): they were con-
which are needed in calculating RZ in (10). These pa- structed as G(F (Z1 ) + E) where G and F are random
rameters should be selected carefully, especially when mixtures of linear, cubic, and tanh functions and are
the conditioning set Z is large. If they are too large, different for X and Y , and E is independent across X
the corresponding regression functions h∗f and h∗g0 may and Y and has various distributions. Hence X ⊥ ⊥ Y |Z
underfit, resulting in a large probability of Type I er- holds. In our simulations Zi were i.i.d. Gaussian.
rors (where the CI hypothesis is incorrectly rejected).
On the contrary, if they are too small, these regression A good test is expected to have small probability of
function may overfit, which may increase the probabil- Type II errors. To see how large it is for KCI-test, we
ity of Type II errors (where the CI hypothesis is not also generated the data which do not follow X ⊥ ⊥ Y |Z
rejected although being false). Moreover, if ε is too by adding the same variable to X and Y produced
small, RZ in (10) tends to vanish, resulting in very above. We increased the dimensionality of Z and the
sample size n, and repeated the CI tests 1000 random
small values in Ke
Ẍ|Z and KY |Z , and then the perfor-
e
replications. Figure 1 (a,b) plots the resulting proba-
mance may be deteoriated by rounding errors.
bility of Type I errors and that of Type II errors at the
We found that when the dimensionality of Z is small significance level α = 0.01 and α = 0.05, respectively.
(say, one or two variables), the proposed method works
In Case II, all variables in the conditioning set Z are ef-
well even with some simple empirical settings (say,
fective in generating X and Y . We first generated the
ε = 10−3 , and the kernel width for constructing K eZ
independent variables Zi , and then similarly to Case
equals half of that for constructing KẌ and KY ).
e e
I, to examine Type I errors, X and Y were generated
When Z contains many variables, to make the re- P
as G( i Fi (Zi ) + E). To examine Type II errors, we
gression functions h∗f and h∗g0 more flexible, such that further added the same variable to X and Y such that
they could properly capture the information about Z they became conditionally dependent given Z. Fig-
in f (Ẍ) ∈ HẌ and g 0 (Y ) ∈ HY , respectively, we use ure 1 (c,d) gives the probabilities of Type I and Type
separate regularization parameters and kernel widths II errors of KCI-test obtained on 1000 random repli-
for them, denoted by {εf , σ f } and {εg0 , σ g0 }. To cations.
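The following sketch (our own reading of the description above, not the authors' code) generates such data; for brevity it draws a single random nonlinearity for each of F and G instead of a random mixture, and uses Gaussian noise for E:

```python
import numpy as np

def generate_case(n, dim_z, case="I", dependent=False, seed=0):
    """Synthetic data roughly following the Case I / Case II descriptions above."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, dim_z))                   # i.i.d. Gaussian conditioning set
    funcs = [lambda u: u, lambda u: u ** 3, np.tanh]      # linear, cubic, tanh

    def pick():
        return funcs[rng.integers(len(funcs))]            # one random nonlinearity (simplification)

    def post_nonlinear(s):
        noise = rng.standard_normal(n)                    # independent noise E (Gaussian here)
        return pick()(pick()(s) + noise)                  # G(F(s) + E)

    if case == "I":                                       # only Z_1 is effective
        X, Y = post_nonlinear(Z[:, 0]), post_nonlinear(Z[:, 0])
    else:                                                 # Case II: all Z_i are effective
        X = post_nonlinear(sum(pick()(Z[:, i]) for i in range(dim_z)))
        Y = post_nonlinear(sum(pick()(Z[:, i]) for i in range(dim_z)))
    if dependent:                                         # add the same variable to X and Y
        common = rng.standard_normal(n)
        X, Y = X + common, Y + common
    return X[:, None], Y[:, None], Z
```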
One can see that with the derived null distribution, the 1% and 5% quantiles are approximated very well for both sample sizes, since the resulting probabilities of Type I errors are very close to the significance levels. The Gamma approximation tends to produce slightly larger probabilities of Type I errors, meaning that the two-parameter Gamma approximation may have a slightly lighter tail than the true null distribution. With a fixed significance level α, as D increases, the probability of Type II errors always increases. This is intuitively reasonable: due to the finite sample size effect, as the conditioning set becomes larger and larger, X and Y tend to be considered as conditionally independent.
[Figure 1: (a) Type I error, case I; (b) Type II error, case I. Probabilities of Type I and Type II errors for n = 200 (dashed lines) and n = 400 (solid lines), with the simulated null distribution and the Gamma approximation at α = 0.01 and α = 0.05.]

[Figure: (a) Errors of CIPERM; (b) CPU time by both methods. Type I & II errors and average CPU time for n = 200 (dashed lines) and n = 400 (solid lines), plotted against the sample size; legends include KCI-test, CIPERM, and partial correlation.]

... it is far less sensitive to D.

Figure 3: The chance that the correct Markov equivalence class was inferred with PC combined with different CI testing methods. KCI-test outperforms CIPERM and partial correlation.

4.2.2 Real data

... distribution under the null hypothesis of conditional independence. This distribution can either be generated by Monte Carlo simulation or approximated by a two-parameter Gamma distribution. Compared to discretization-based conditional independence testing methods, the proposed one exploits more complete information of the data, and compared to those methods defining the test statistic in terms of the estimated conditional densities, calculation of the proposed statistic involves less random errors. We applied the method on both simulated and real world data, and the results suggested that the method outperforms existing techniques in both accuracy and speed.
Appendix

Lemma 9 Let {A_L,n} be a double sequence of random variables with indices L and n, {B_L} and {C_n} sequences of random variables, and D a random variable. Assume that they are defined in a separable probability space. Suppose that, for each L, A_L,n →_d B_L as n → ∞ and that B_L →_d D as L → ∞. Suppose further that lim_{L→∞} lim sup_{n→∞} P({|A_L,n − C_n| ≥ ε}) = 0 for each positive ε. Then C_n →_d D as n → ∞.

Sketch of proof of Theorem 3. We first define an L × L block-diagonal matrix P_X as follows. For simple eigenvalues λ*_X,i, P_X,ii = 1, and all other entries in the ith row and column are zero. For the eigenvalues with multiplicity r + 1, say, λ*_X,k, λ*_X,k+1, ..., λ*_X,k+r, the corresponding main diagonal block from the kth row to the (k + r)th row of P_X is an orthogonal matrix. According to Lemmas 7 and 8 and the continuous mapping theorem (CMT) (Mann and Wald, 1943), which states that continuous functions preserve convergence in distribution, [...]

[...] That is,

    (1/n) Σ_{t=1}^n ‖v_t‖² = (1/n) Σ_{t=1}^n ‖v′_t‖²  →_d  Σ_{k=1}^{L²} λ̊*_k z_k².   (16)

Combining (15) and (16) gives (5).

If X and Y are independent, for k ≠ i or l ≠ j, one can see that the non-diagonal entries of Σ are E[√(λ*_X,i λ*_Y,j λ*_X,k λ*_Y,l) u_X,i(x_t) u_Y,j(y_t) u_X,k(x_t) u_Y,l(y_t)] = √(λ*_X,i λ*_Y,j λ*_X,k λ*_Y,l) E[u_X,i(x_t) u_X,k(x_t)] E[u_Y,j(y_t) u_Y,l(y_t)] = 0. The diagonal entries of Σ are λ*_X,i λ*_Y,j · E[u²_X,i(x_t)] E[u²_Y,j(y_t)] = λ*_X,i λ*_Y,j, which are also the eigenvalues of Σ. Substituting this result into (5), one obtains (6).

Finally, consider Lemma 9, and let A_L,n ≜ Σ_{i,j=1}^L S_ij², B_L ≜ Σ_{k=1}^{L²} λ̊*_k z_k², C_n ≜ Σ_{i,j=1}^n S_ij², and D ≜ Σ_{k=1}^∞ λ̊*_k z_k². One can then see that Σ_{i,j=1}^n S_ij² →_d Σ_{k=1}^∞ λ̊*_k z_k² as n → ∞. That is, (5) also holds as L = n → ∞. As a special case of (5), (6) also holds as L = n → ∞.

Sketch of proof of Theorem 4. On the one hand, due to (7), we have T_UI = Σ_{i,j=1}^n S_ij², which, according to (6), converges in distribution to Σ_{i,j=1}^∞ λ*_X,i λ*_Y,j · z_ij² as n → ∞. On the other hand, by extending the proof of Theorem 1 of Gretton et al. (2009), one can show that Σ_{i,j=1}^∞ ((1/n²) λ_x,i λ_y,j − λ*_X,i λ*_Y,j) z_ij² → 0 in probability as n → ∞. That is, Ť_UI converges in probability to Σ_{i,j=1}^∞ λ*_X,i λ*_Y,j z_ij² as n → ∞. Consequently, T_UI and Ť_UI have the same asymptotic distribution.

Sketch of proof of Proposition 5. Here we let X, Y, K̃_X, and K̃_Y in Theorem 3 be Ẍ, (Y, Z), K̃_Ẍ|Z, and K̃_Y|Z, respectively. According to (7) and Theorem 3, one can see that T_CI →_d Σ_{k=1}^∞ λ̊*_k z_k² as n → ∞. Again, we extend the proof of Theorem 1 of Gretton et al. (2009) to show that Σ_{k=1}^∞ ((1/n) λ̊_k − λ̊*_k) z_k² → 0 as n → ∞, or that Σ_{k=1}^∞ (1/n) λ̊_k z_k² → Σ_{k=1}^∞ λ̊*_k z_k² in probability as n → ∞. The key step is to show that Σ_k |(1/n) λ̊_k − λ̊*_k| → 0 in probability as n → ∞. Details are skipped.

Sketch of proof of Proposition 6. As the z_ij² follow the χ² distribution with one degree of freedom, we have E(z_ij²) = 1 and Var(z_ij²) = 2. According to (9), we have E(Ť_UI | D) = (1/n²) Σ_{i,j} λ_x,i λ_y,j = (1/n²) Σ_i λ_x,i Σ_j λ_y,j = (1/n²) Tr(K̃_X) · Tr(K̃_Y). Furthermore, bearing in mind that the z_ij² are independent variables across i and j, and recalling that Tr(K̃_X²) = Σ_i λ_x,i², one can see that Var(Ť_UI | D) = (1/n⁴) Σ_{i,j} λ_x,i² λ_y,j² Var(z_ij²) = (2/n⁴) Σ_i λ_x,i² Σ_j λ_y,j² = (2/n⁴) Tr(K̃_X²) · Tr(K̃_Y²). Consequently (i) is true. Similarly, from (14), one can calculate the mean E(Ť_CI | D) and variance Var(Ť_CI | D), as given in (ii).
References

A. Asuncion and D. J. Newman. UCI machine learning repository. http://archive.ics.uci.edu/ml/, 2007.

C. Baker. The numerical treatment of integral equations. Oxford University Press, 1977.

W. P. Bergsma. Testing conditional independence for continuous random variables, 2004. EURANDOM-report 2004-049.

P. Billingsley. Convergence of Probability Measures. John Wiley and Sons, 1999.

J. J. Daudin. Partial association measures and an application to qualitative regression. Biometrika, 67:581–590, 1980.

A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41:1–31, 1979.

F. Eicker. A multivariate central limit theorem for random linear vector forms. The Annals of Mathematical Statistics, 37:1825–1828, 1966.

K. Fukumizu, F. R. Bach, M. I. Jordan, and C. Williams. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In NIPS 20, pages 585–592, Cambridge, MA, 2008.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS 23, pages 673–681, Cambridge, MA, 2009. MIT Press.

T. M. Huang. Testing conditional independence using maximal nonlinear conditional correlation. Ann. Statist., 38:2047–2091, 2010.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.

A. J. Lawrance. On conditional and partial correlation. The American Statistician, 30:146–149, 1976.

O. Linton and P. Gozalo. Conditional independence restrictions: testing and estimation, 1997. Cowles Foundation Discussion Paper 1140.

H. B. Mann and A. Wald. On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14:217–226, 1943.

D. Margaritis. Distribution-free learning of Bayesian network structure in continuous domains. In Proc. AAAI 2005, pages 825–830, Pittsburgh, PA, 2005.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

K. Song. Testing conditional independence via Rosenblatt transforms. Ann. Statist., 37:4011–4045, 2009.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001.

L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141:807–834, 2007.

L. Su and H. White. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24:829–864, 2008.

X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. In Proc. ICML 2007, pages 855–862. Omnipress, 2007.

R. Tillman, A. Gretton, and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise models. In NIPS 22, Vancouver, Canada, 2009.

K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proc. UAI 25, Montreal, Canada, 2009.