1-2 Advances in Causal Representation Learning
Thanks to: Biwei Huang, Weiran Yao, Feng Xie, Mingming Gong, Petar Stojanov, Yujia Zheng, Haoyue Dai…
Clark Glymour, Peter Spirtes, Bernhard Schölkopf, Aapo Hyvärinen, Ruichu Cai, Jiji Zhang, Joseph Ramsey…
Outline
1. Why causality?
3. Causal representation learning from time series
Causal Thinking
• "Strange" dependence between gender and IQ
• Let's go back 50 years; maybe you would find that female college students are smarter than male college students on average. Why?
• Kant's metaphysics
[Figure 9.1: Kant's picture of metaphysics. The world of experience is constructed from sensory input by our minds; Kant called this process synthesis.]
Outline
1. Why causality?
3. Causal representation learning from time series
• Causal discovery (Spirtes et al., 1993) / causal representation learning (Schölkopf et al., 2021): find such representations with identifiability guarantees
Advances in Causal Representation Learning
[Decision chart, organized by three questions: i.i.d. data? / parametric constraints? / latent confounders? — and what we can get in each case. Outcomes include: (different types of) equivalence class; unique identifiability (under structural conditions); (extended) regression; results more informative than a Markov equivalence class (CD-NOD); possibly unique identifiability; changing subspace identifiable; variables in changing relations identifiable; latent temporal causal processes identifiable. Non-i.i.d. data are split into the "non-identically but independently distributed" and "identically but non-independently distributed" cases.]
Causal Discovery in Archeology: An Example
(Setting chip: i.i.d. data? / parametric constraints? / latent confounders?)
Thanks to Marlijn Noback
(Typical) Constraint-Based Causal Discovery

Data over X1–X5 (sample rows):

  X1    X2    X3    X4    X5
 -1.1   1     1.3   0.2  -0.7
  2.1   2     3.1  -1.3  -1.6
  3.1   4.2  -2.6   0.6   2.1
  2.3  -0.6  -3.5   0.8   2.3
  1.3  -1.7   0.9   2.4  -1.4
 -1.8   0.9  -1.3   0.9   0.7
  …     …     …     …     …

Conditional independence constraints found in the data:
X1⫫X5 | X3,  X2⫫X4 | X1,  X2⫫X5 | X3,  X4⫫X5 | X3,  X1⫫X3 | {X2, X4}

[Graph over X1–X5 consistent with these constraints]

[Archeology example: recovered graph over climate, diet, gender, geodistance, attrition, paramasticatory behavior, and cranial size]
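The conditional-independence tests behind this procedure can be illustrated with a partial-correlation check. This is a minimal sketch, assuming linear-Gaussian data and the hypothetical structure X3 → X1, X3 → X5, so that X1⫫X5 | X3 holds:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
# Hypothetical common cause: X3 -> X1 and X3 -> X5, so X1 _||_ X5 | X3
x3 = rng.normal(size=n)
x1 = 0.8 * x3 + rng.normal(size=n)
x5 = -0.6 * x3 + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after regressing out c (a linear-Gaussian CI test)."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return stats.pearsonr(ra, rb)

r_marg, _ = stats.pearsonr(x1, x5)    # clearly nonzero: marginally dependent
r_part, p = partial_corr(x1, x5, x3)  # near zero: independent given X3
```

Constraint-based methods such as PC run many such tests and keep only the edges that no conditioning set can explain away.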
Gaussian case
(Setting chip: i.i.d. data? / parametric constraints? / latent confounders?)
[Diagram: X → Y vs. X ← Y]
"Independent changes" in P(cause) and P(effect | cause) render the causal direction identifiable.
• Special cases:
  • Linear models
  • Nonlinear additive noise models
  • Multiplicative noise models: Y = X · E = exp(log(X) + log(E))
[Scatter plot: altitude vs. precipitation]
- Zhang, Hyvärinen, "Distinguishing Causes from Effects Using Nonlinear Acyclic Causal Models," PMLR 2008
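The identity Y = X · E = exp(log(X) + log(E)) says a multiplicative-noise model becomes an additive-noise model after a log transform. A small numerical sketch, with a hypothetical log-normal cause and noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
x = rng.lognormal(0.0, 0.5, n)   # positive cause
e = rng.lognormal(0.0, 0.3, n)   # positive multiplicative noise, independent of x
y = x * e                        # multiplicative noise model: Y = X * E
# In log space this is an additive noise model: log Y = log X + log E
resid = np.log(y) - np.log(x)    # recovers log(e) exactly
# the residual is uncorrelated with the cause (a crude proxy for independence)
r = np.corrcoef(np.log(x), resid)[0, 1]
```

So the additive-noise machinery applies unchanged after taking logs, which is why this case is listed among the identifiable special cases.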
All Non-Identifiable Cases (Zhang & Hyvärinen, 2009)

Causal direction is generally identifiable if the data were generated according to X2 = f2(f1(X1) + E). Linear models and nonlinear additive-noise models are special cases.
Discovery: How?
• Find the latent variables Li and their causal relations from the measured variables Xi?
[Figure 1: A causal structure involving 4 latent variables (L1–L4) and 8 observed variables (X1–X8), where each pair of observed variables in {X1, X2, X3, X4} is affected by two latent variables.]
- Xie, Cai, Huang, Glymour, Hao, Zhang, "Generalized Independent Noise Condition for Estimating Linear Non-Gaussian Latent Variable Causal Graphs," NeurIPS 2020
- Cai, Xie, Glymour, Hao, Zhang, "Triad Constraints for Learning Causal Structure of Latent Variables," NeurIPS 2019
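The generalized independent noise (GIN) condition can be sketched numerically: for variable sets (Z, Y), find ω with ω⊤E[Y Z] = 0 and check that ω⊤Y is independent of Z. Below is a toy version with a single latent confounder, using a crude correlation-of-squares proxy in place of a proper independence test; all coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
# One latent confounder with three observed children; non-Gaussian noises
lat = rng.uniform(-1, 1, n)
x1 = 2.0 * lat + rng.uniform(-1, 1, n)
x2 = 1.0 * lat + rng.uniform(-1, 1, n)
x3 = 0.5 * lat + rng.uniform(-1, 1, n)
Y = np.stack([x1, x2])            # Y = (X1, X2), Z = X3
c = np.array([np.cov(x1, x3)[0, 1], np.cov(x2, x3)[0, 1]])
omega = np.array([c[1], -c[0]])   # omega^T E[Y Z] = 0 by construction
eyz = omega @ Y                   # a combination of the noise terms only
# crude proxy for independence of omega^T Y and Z: correlation of squares
r = np.corrcoef(eyz**2, x3**2)[0, 1]
```

In the linear non-Gaussian case, ω⊤Y cancels the latent confounder exactly, so the GIN condition holds for this (Z, Y) pair; the actual methods use a kernel independence test rather than this correlation proxy.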
[Figure: A hierarchical causal structure involving 9 latent variables (shaded nodes, L1–L9) and 11 observed variables (unshaded nodes, X1–X11).]
- Xie, Huang, Chen, He, Geng, Zhang, "Estimation of Linear Non-Gaussian Latent Hierarchical Structure," ICML 2022
- Huang, Low, Xie, Glymour, Zhang, "Latent Hierarchical Causal Structure Discovery with Rank Constraints," NeurIPS 2022
- Adams, Hansen, Zhang, "Identification of Partially Observed Linear Causal Models: Graphical Conditions for the Non-Gaussian and Heterogeneous Cases," NeurIPS 2021
Outline
1. Why causality?
3. Causal representation learning from time series
[The identifiability decision chart is shown again, now highlighting the time-series branch: latent temporal causal processes identifiable.]
Estimating Time-Delayed Causal Models
(Setting chip: i.i.d. data? / parametric constraints? / latent confounders?)
- Swanson, Granger, "Impulse response functions based on a causal approach to residual orthogonalization in vector autoregression," Journal of the American Statistical Association, 1997
- Hyvärinen, Zhang, Shimizu, Hoyer, "Estimation of a structural vector autoregression model using non-Gaussianity," JMLR, 2010
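For the purely time-delayed part, estimation reduces to vector autoregression. A minimal sketch with a hypothetical 2-variable model z_t = A z_{t−1} + ε_t and Laplacian (non-Gaussian) noise:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20000
# Hypothetical time-delayed model: z_t = A z_{t-1} + eps_t
A = np.array([[0.5, 0.0],
              [0.4, 0.3]])        # z1 drives z2 with one lag
z = np.zeros((T, 2))
for t in range(1, T):
    z[t] = A @ z[t - 1] + rng.laplace(scale=0.5, size=2)  # non-Gaussian noise
# Estimate the transition matrix by least-squares autoregression
X, Y = z[:-1], z[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
```

With lagged effects alone, least squares already recovers the coefficients; it is when instantaneous relations are added (the SVAR setting) that non-Gaussianity becomes essential for identifiability (Hyvärinen et al., 2010).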
Results on Video Data

The latent processes follow
x_t = g(z_t),   z_t = Σ_{τ=1}^{L} B_τ z_{t−τ} + ε_t,   with ε_it ∼ p_ε,
where matrix A is a random directed acyclic graph (DAG) containing the coefficients of the linear instantaneous relations, the noises ε_it are sampled from an i.i.d. Laplacian distribution with scale 0.1, and the entries of the state-transition matrices B_τ are uniformly distributed in [−0.5, 0.5].

• For easy interpretation, consider two simple video data sets
• Three public datasets are used: KiTTiMask, Mass-Spring System, and CMU MoCap
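The generating process above can be simulated directly. A sketch with hypothetical transition matrices (chosen small enough that the process is stable) and a stand-in nonlinear mixing g:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d = 1000, 3
# Hypothetical state-transition matrices for two lags; the row-wise absolute
# sums of B1 and B2 total less than 1, which guarantees a stable process
B1 = np.array([[0.3, 0.1, 0.0],
               [0.0, 0.2, 0.1],
               [0.1, 0.0, 0.3]])
B2 = 0.1 * np.eye(d)
z = np.zeros((T, d))
for t in range(2, T):
    z[t] = B1 @ z[t - 1] + B2 @ z[t - 2] + rng.laplace(0.0, 0.1, d)
M = rng.normal(size=(d, d))   # stand-in for an invertible mixing
x = np.tanh(z @ M)            # observed frames: x_t = g(z_t)
```

The identifiability results say the latent z_t can be recovered from x_t up to permutation and componentwise transformation; this sketch only shows the forward (generating) direction.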
Outline
1. Why causality?
3. Causal representation learning from time series
[The identifiability decision chart is shown again, now highlighting the nonstationary/heterogeneous branch: more informative than a Markov equivalence class (CD-NOD); changing subspace identifiable; variables in changing relations identifiable.]
Nonstationary/Heterogeneous Data and Causal Modeling
- Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, Schölkopf, "Causal Discovery from Heterogeneous/Nonstationary Data," JMLR, 2020
- Zhang, Huang, et al., "Discovery and Visualization of Nonstationary Causal Models," arXiv 2015
- Ghassami, et al., "Multi-Domain Causal Structure Learning in Linear Systems," NeurIPS 2018
Causal Discovery from Nonstationary/Heterogeneous Data
(Setting chip: i.i.d. data? / parametric constraints? / latent confounders? — with changing causal models)

• Task:
  • Determine the changing causal modules & estimate the causal skeleton
  • Causal orientation determination benefits from independent changes in P(cause) and P(effect | cause), including invariant mechanism/cause as special cases
  • Visualization of changing modules over time / across domains

[Paper excerpt: assume the underlying causal structure is a directed acyclic graph (DAG) whose causal modules may change over time or across domains. A fixed model fails to reveal the correct causal structure when the data distribution shifts. If the changes in some variables are related, one can imagine an unobservable quantity g(C) that influences all those variables; as a consequence, the conditional independence relations in the distribution-shifted data differ from those implied by the true causal structure. Similarly, if a variable Vi was generated from its direct causes by a functional causal model (e.g., the linear non-Gaussian model of Shimizu et al., 2006) whose parameters change at some point, then fitting a fixed functional causal model from the direct causes to Vi leaves a noise term that is no longer independent of the causes, and the correct causal structure cannot be distinguished from other candidates even asymptotically. Figure 1 illustrates how ignoring changes in the causal model may lead to spurious connections.]

- Hoover, "The Logic of Causal Inference," Economics and Philosophy, 6:207–234, 1990.
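CD-NOD's core trick is to add the time (or domain) index as a surrogate variable: a causal module changes if and only if its variable remains dependent on that index given its parents. A minimal sketch with a hypothetical mechanism shift in P(Y | X):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 4000
t = np.linspace(0.0, 1.0, n)            # surrogate variable C: time index
x = rng.normal(size=n)                  # module P(X) is invariant over time
y = 2.0 * t + x + rng.normal(size=n)    # module P(Y|X) shifts over time
# A variable with a changing module stays dependent on C given its parents:
resid_y = y - np.polyval(np.polyfit(x, y, 1), x)  # Y given its parent X
r_y, _ = stats.pearsonr(resid_y, t)     # clearly nonzero -> P(Y|X) changes
r_x, _ = stats.pearsonr(x, t)           # near zero -> P(X) is invariant
```

The actual method uses kernel-based conditional independence tests rather than residual correlations, but the logic is the same: testing against the surrogate index both flags the changing modules and helps orient edges.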
Driving Forces of Changes: NYSE Stock Returns (07/05/2006 – 12/16/2009)

[Fig. 8: Recovered causal graph from 80 NYSE stocks; each node color represents one sector.]
[Fig. 9: Estimated nonstationary driving forces of six stock returns (USB, JCP, SAN, CHK, PBR, GE) from 07/05/2006 to 12/16/2009.]

The stocks SAN and CHK only have change points around 05/05/2008 (T2); most stocks with change points only at T2 have more direct causes. The change points match the critical times of the financial crisis, as reflected in the TED spread, and parts of the change points (T2 and T3) in HK stock data. Experiments on both synthetic and real data demonstrated that the estimated nonstationary driving forces provide reliable information for causal discovery and useful background knowledge about the variables.

- Huang, Zhang, Zhang, Romero, Glymour, Schölkopf, "Behind Distribution Shift: Mining Driving Forces of Changes and Causal Arrows," ICDM 2017
Application: Domain Adaptation as Inference on Graphical Models

[Figure 1: An augmented DAG over Y and the features Xi. For any variable V with a θ variable/vector as its parent, the conditional distribution P(V | PA(V)) may change across domains; the θ variables take the same value within each domain.]

• Only the relevant features are needed to predict Y
• The augmented graph is learned by CD-NOD
• Independently changing modules (θ variables)
  • Special case: invariant modules
• Domain adaptation: inference on this graphical model
  • Infer the posterior of Y in the target domain
  • Nonparametric methods to model the conditional distributions

[Paper excerpt: given data sets from multiple source domains and a target domain, prediction in the target domain directly benefits from using the causal model for transfer learning. The causal model on its own might not suffice to explain the properties of the data (e.g., because of selection bias), and it is notoriously difficult to find causal relations from observational data without rather strong assumptions (such as faithfulness); on the other hand, it is comparatively easy to find the graphical model purely as a description of the conditional independence relationships among the variables and of the properties of changes in the distribution modules. The underlying causal structure may be very different from the augmented DAG.]

- Zhang*, Gong*, Stojanov, Huang, Liu, Glymour, "Domain Adaptation As a Problem of Inference on Graphical Models," NeurIPS 2020 (Huang et al., ICML 2019 for time-series data)
Partial Disentanglement for Domain Adaptation

• The invariant part Z_C is identifiable block-wise; the changing part Z_S is identifiable up to its subspace
• Use the invariant part Z_C and the transformed changing part Z̃_S for prediction

[Paper excerpt: in the context of out-of-distribution generalization, Lu et al. (2020) extend the identifiability result of iVAE (Khemakhem et al., 2020) to a general exponential family that is not necessarily factorized, but do not exploit the fact that the latent representation contains invariant information that can be disentangled from the part corresponding to changes across domains; much prior work resorts to a conditionally invariant sub-part of z, even though parts of z that are not conditionally invariant may still be relevant for predicting y. This paper makes realistic assumptions on the data-generating process in order to provably identify the latent variables; y is assumed to be generated from z_c and z̃_s, and p(y) stays the same across domains.]

[Experiments: following the setting in [47], a multi-source digit dataset combines MNIST, MNIST-M, SVHN, and SynthDigits. Table 1: classification on PACS (backbone ResNet-18; most baseline results from Yang et al., 2020). Table 2: classification on Office-Home (backbone ResNet-50; baseline results from Park & Lee, 2021). Baselines include DANN, DANN+BSP, DAN, MCD, M3SDA, DCTN, MIAN, LtC-MSDA, and T-SVDNet; iMSDA (Ours) attains the best average accuracy.]

- Kong, Xie, Yao, Zheng, Chen, Stojanov, Akinwande, Zhang, "Partial Disentanglement for Domain Adaptation," ICML 2022
Summary