
Methodological Advances in Causal Representation Learning
Kun Zhang
Carnegie Mellon University & MBZUAI

Thanks to: Biwei Huang, Weiran Yao, Feng Xie, Mingming Gong, Petar Stojanov, Yujia Zheng, Haoyue Dai…
Clark Glymour, Peter Spirtes, Bernhard Schölkopf, Aapo Hyvärinen, Ruichu Cai, Jiji Zhang, Joseph Ramsey…

Causality vs. Dependence


• Causality ➜ dependence, but dependence does not imply causality

An intervention on X changes only the target variable X, leaving any other variable unchanged, at least for the moment.
(http://imgs.xkcd.com/comics/correlation.png)

X and Y are associated iff   ∃ x1 ≠ x2:  P(Y | X = x1) ≠ P(Y | X = x2)
X is a cause of Y iff   ∃ x1 ≠ x2:  P(Y | do(X = x1)) ≠ P(Y | do(X = x2))    (intervention)
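A minimal numpy sketch of this distinction (the confounded structure and all numbers are illustrative assumptions, not from the slides): a common cause makes X and Y associated, yet intervening on X leaves the distribution of Y untouched because X is not a cause of Y.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Observational regime: a common cause Z makes X and Y dependent
    Z = rng.normal(size=n)
    X = Z + 0.5 * rng.normal(size=n)
    Y = Z + 0.5 * rng.normal(size=n)
    print("corr(X, Y) observationally:", np.corrcoef(X, Y)[0, 1])   # clearly nonzero

    # Interventional regime: do(X = x) overrides X's mechanism; Y's mechanism is untouched
    for x in (-2.0, 2.0):
        Y_do = Z + 0.5 * rng.normal(size=n)   # Y is generated exactly as before
        print(f"E[Y | do(X = {x})] =", round(Y_do.mean(), 3))       # the same for both x

So P(Y | X = x1) ≠ P(Y | X = x2) even though P(Y | do(X = x1)) = P(Y | do(X = x2)).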

Outline
1. Why causality?
2. Causal representation learning: IID case
3. Causal representation learning from time series
4. Causal representation learning from heterogeneous/nonstationary data

Causal Thinking

• Dependence vs. causality


• Simpson’s paradox

• “Strange” dependence (figure: gender, IQ, college)
• Let’s go back 50 years; maybe you would find female college students are smarter than male ones on average. Why?

Addressing These ML Problems


• Generalization/adaptation, decision making, recommendations…

• Dealing with adversarial attacks?

(Goodfellow et al., 2014)


Why Causal Representations?

• To benefit downstream tasks (?)
• Why-type explanations
• Recovering the properties of the true causal process
• Kant’s metaphysics

(Figure 9.1: Kant’s picture of metaphysics — the world-in-itself; space, time, and causal order as pure forms of experience; the world of experience, constructed by our minds through what Kant called synthesis.)

Outline
1. Why causality?
2. Causal representation learning: IID case
3. Causal representation learning from time series
4. Causal representation learning from heterogeneous/nonstationary data

How to Find Causality from Observational Data?

• Causal systems have “irrelevant” modules (Pearl, 2000; Spirtes et al., 1993)
  - conditional independence among variables;
  - independent noise condition;
  - minimal (and independent) changes…
  (Example graph: rain → wet_ground → slippery)

• Footprint of causality in data: independent & minimal changes

• Causal discovery (Spirtes et al., 1993) / causal representation learning (Schölkopf et al., 2021): find such representations with identifiability guarantees

• Three dimensions of the problem:

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes                       No                    No
  No                        Yes                   Yes
Advances in Causal Representation Learning

  Parametric      Latent          i.i.d. data?        What can we get?
  constraints?    confounders?
  No              No              Yes                 (Different types of) equivalence class
  Yes             No / Yes        Yes                 Unique identifiability (under structural conditions)
  No              No / Yes        Non-I., but I.D.    (Extended) regression
  Yes             No / Yes        Non-I., but I.D.    Latent temporal causal processes identifiable!
  No              No              I., but non-I.D.    More informative than MEC (CD-NOD)
  Yes             No              I., but non-I.D.    May have unique identifiability
  No              Yes             I., but non-I.D.    Changing subspace identifiable
  Yes             Yes             I., but non-I.D.    Variables in changing relations identifiable
Causal Discovery in Archeology: An Example

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

Thanks to Marlijn Noback

• 8 variables of 250 skeletons collected from different locations
(Typical) Constraint-Based Causal Discovery

• Conditional independence constraints between each variable pair
• Illustration: the PC (Peter-Clark) algorithm
• Extensions: the FCI algorithm…

    X1     X2     X3     X4     X5
    -1.1   1      1.3    0.2    -0.7
    2.1    2      3.1    -1.3   -1.6
    3.1    4.2    -2.6   0.6    2.1
    2.3    -0.6   -3.5   0.8    2.3
    1.3    -1.7   0.9    2.4    -1.4
    -1.8   0.9    -1.3   0.9    0.7
    …      …      …      …      …

  Conditional independences found: X1 ⫫ X5 | X3,  X2 ⫫ X4 | X1,  X2 ⫫ X5 | X3,  X4 ⫫ X5 | X3,  X1 ⫫ X3 | {X2, X4}
  (Resulting graph over X1, …, X5)

- Spirtes, Glymour, and Scheines. Causation, Prediction, and Search. 1993.
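A toy sketch of the skeleton-search step behind such constraint-based methods (not the full PC implementation; the chain X1 → … → X5, the Fisher-z partial-correlation test, and the exhaustive conditioning-set search are simplifying assumptions): an edge is removed whenever some conditioning set renders its endpoints independent.

    import itertools
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 5000

    # Ground-truth chain: X1 -> X2 -> X3 -> X4 -> X5 (linear-Gaussian, for simplicity)
    X1 = rng.normal(size=n)
    X2 = 0.8 * X1 + rng.normal(size=n)
    X3 = 0.8 * X2 + rng.normal(size=n)
    X4 = 0.8 * X3 + rng.normal(size=n)
    X5 = 0.8 * X4 + rng.normal(size=n)
    data = np.column_stack([X1, X2, X3, X4, X5])
    names = ["X1", "X2", "X3", "X4", "X5"]

    def ci_test(i, j, cond, alpha=0.01):
        """Partial-correlation (Fisher z) test of Xi ⫫ Xj | X_cond."""
        if cond:
            Z = np.column_stack([np.ones(n), data[:, list(cond)]])
            ri = data[:, i] - Z @ np.linalg.lstsq(Z, data[:, i], rcond=None)[0]
            rj = data[:, j] - Z @ np.linalg.lstsq(Z, data[:, j], rcond=None)[0]
        else:
            ri, rj = data[:, i], data[:, j]
        r = np.corrcoef(ri, rj)[0, 1]
        z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
        p = 2 * (1 - stats.norm.cdf(abs(z)))
        return p > alpha   # True => treated as independent

    # Skeleton search: start fully connected, drop edges that have a separating set
    edges = {frozenset(e) for e in itertools.combinations(range(5), 2)}
    for i, j in itertools.combinations(range(5), 2):
        others = [k for k in range(5) if k not in (i, j)]
        for size in range(len(others) + 1):
            if any(ci_test(i, j, list(c)) for c in itertools.combinations(others, size)):
                edges.discard(frozenset((i, j)))
                break

    print(sorted({tuple(sorted(names[k] for k in e)) for e in edges}))
    # Expected skeleton of the chain: X1-X2, X2-X3, X3-X4, X4-X5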

Result of PC on the Archeology Data

Thanks to collaborator Marlijn Noback

• By the PC algorithm (Spirtes et al., 1993) + kernel-based conditional independence test (Zhang et al., 2011)

  (Recovered graph over: climate, diet, gender, geodistance, attrition, paramasticatory behavior, cranial size, cranial shape differentiation)
Functional Causal Model-Based Causal Discovery

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

• Gaussian case: linear regression Y = aX + E_Y and linear regression X = bY + E_X — the regression residual looks equally good in both directions, so the direction is not identifiable.
• “Independent changes” renders the causal direction identifiable.
• Linear non-Gaussian model (Shimizu et al., 2006): Y = aX + E (e.g., the uniform-noise case)
• Post-nonlinear causal model (Zhang & Chan, 2006): Y = f2( f1(X) + E )
• Additive noise model (Hoyer et al., 2009): Y = f(X) + E
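A small numerical sketch of the residual-independence idea behind these functional causal models (the uniform noise, the least-squares fit, and the correlation-of-absolute-values surrogate for an independence test are illustrative assumptions): with non-Gaussian noise, only the causal direction yields a residual independent of the regressor.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Linear non-Gaussian pair: X -> Y with uniform noise
    X = rng.uniform(-1, 1, n)
    Y = 0.8 * X + rng.uniform(-1, 1, n)

    def dependence(a, b):
        """Crude dependence score: correlation of |centered values|
        (near zero only if a and b are independent, unlike plain correlation)."""
        a, b = a - a.mean(), b - b.mean()
        return abs(np.corrcoef(np.abs(a), np.abs(b))[0, 1])

    def residual(effect, cause):
        slope = np.cov(cause, effect)[0, 1] / np.var(cause)
        return effect - slope * cause

    print("X -> Y fit, dep(residual, X):", dependence(residual(Y, X), X))  # ≈ 0
    print("Y -> X fit, dep(residual, Y):", dependence(residual(X, Y), Y))  # clearly > 0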

Post-NonLinear (PNL) Causal Model

  X_i = f_i,2( f_i,1(PA_i) + E_i )

• Special cases:
  • Linear models
  • Nonlinear additive noise models
  • Multiplicative noise models: Y = X · E = exp( log(X) + log(E) )
    (Example variables: altitude, precipitation)

- Zhang, Hyvärinen, Distinguishing causes from effects using nonlinear acyclic causal models, PMLR’08
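A two-line check of the multiplicative special case above (the positive supports and distributions are illustrative assumptions): with f1 = log and f2 = exp, the multiplicative model becomes an additive noise model in log space, so the log-space residual carries no trace of the cause.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0.5, 2.0, 100_000)          # positive cause
    E = rng.lognormal(0.0, 0.3, 100_000)        # positive multiplicative noise, independent of X
    Y = X * E

    # PNL form with f1 = log, f2 = exp:  Y = exp(log X + log E)
    assert np.allclose(Y, np.exp(np.log(X) + np.log(E)))
    print(np.corrcoef(np.log(Y) - np.log(X), np.log(X))[0, 1])   # log-space residual vs. X: ≈ 0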
All Non-Identifiable Cases (Zhang & Hyvärinen, 2009)

Causal direction is generally identifiable if the data were generated according to X2 = f2( f1(X1) + E ). Linear models and nonlinear additive noise models are special cases.

Learning Hidden Variables & Their Relations


Parametric Latent
i.i.d. data?
constraints? confounders?
Yes No No
No Yes Yes

• Measured variables (e.g., answer scores in psychometric questionnaires) were generated by


causally related latent variables Latent variables &
their causal structure

L1 X5
Discovery: How? L3
X6

X1 X2 X3 X4

X7
L4
L2 X8

• Find latent variables Li and Figure 1: A causal structure involving 4 latent variables and 8 observed variables, where each pair of
their causal
observed variables in {X1 , Xfrom
relations 2 , X 3 , X 4 }
measured
are affected variables
by two latent X i ?
variables.

GIN for Estimating Linear, Non-Gaussian LV Model

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

• Linear, non-Gaussian latent variable causal model
• Generalized independent noise (GIN) condition:
  • (Y, Z) satisfies GIN iff ∃ w ≠ 0 such that w⊺Y is independent of Z
  • Graphical interpretation: the exogenous set of parents of Y d-separates Y and Z
• Step 1: find causal clusters
• Step 2: find the causal order of the latent variables

  (Figure 1: A causal structure involving 4 latent variables L1–L4 and 8 observed variables X1–X8, where each pair of observed variables in {X1, X2, X3, X4} is affected by two latent variables; the observed variables form three causal clusters.)

- Xie, Cai, Huang, Glymour, Hao, Zhang, "Generalized Independent Noise Condition for Estimating Linear Non-Gaussian Latent Variable Causal Graphs," NeurIPS 2020
- Cai, Xie, Glymour, Hao, Zhang, "Triad Constraints for Learning Causal Structure of Latent Variables," NeurIPS 2019
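A small numerical sketch of the GIN condition under this linear, non-Gaussian model (the specific graph, taking w from the null space of the cross-covariance, and the second-moment surrogate for an independence test are illustrative assumptions): when Y = {X1, X2} and Z = {X5} are connected only through a common latent L1, w⊺Y cancels L1 and becomes independent of Z; sharing a noise term breaks this.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    def skewed(size):
        """Zero-mean, skewed (non-Gaussian) disturbances."""
        return rng.exponential(1.0, size) - 1.0

    L1 = skewed(n)                       # latent common cause
    e1, e2, e5 = skewed(n), skewed(n), skewed(n)
    X1 = 0.9 * L1 + e1
    X2 = 0.7 * L1 + e2
    X5 = 0.8 * L1 + e5
    Y = np.column_stack([X1, X2])

    def gin_check(Y, Z):
        # Pick w with w^T E[Y Z] = 0, i.e. orthogonal to the cross-covariance vector
        c = np.array([np.cov(Y[:, 0], Z)[0, 1], np.cov(Y[:, 1], Z)[0, 1]])
        w = np.array([c[1], -c[0]])
        wY = Y @ w
        linear = np.corrcoef(wY, Z)[0, 1]                     # ~0 by construction in both cases
        higher = np.corrcoef(wY, (Z - Z.mean()) ** 2)[0, 1]   # ~0 only under true independence
        return round(linear, 3), round(higher, 3)

    print("GIN holds,    Y={X1,X2}, Z=X5:      ", gin_check(Y, X5))
    print("GIN violated, Z shares the noise e1:", gin_check(Y, 0.3 * L1 + e1))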

GIN-Based Method: Application to Teacher’s Burnout Data

• Contains 28 measured variables
• Discovered clusters and causal order of the latent variables (from root to leaf)
  (Compared against the hypothesized model by experts; ref. [Byrne, 2010])
• Consistent with the hypothesized model


- Xie, Cai, Huang, Glymour, Hao, Zhang, "Generalized Independent Noise Condition for Estimating Linear Non-Gaussian
Latent Variable Causal Graphs," NeurIPS 2020
- Cai, Xie, Glymour, Hao, Zhang, “Triad Constraints for Learning Causal Structure of Latent Variables,“ NeurIPS 2019

Estimating Latent Hierarchical Structure

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

(Figure: a hierarchical causal structure involving 9 latent variables L1–L9 (shaded nodes) and 11 observed variables X1–X11 (unshaded nodes).)

- Xie, Huang, Chen, He, Geng, Zhang, "Estimation of Linear Non-Gaussian Latent Hierarchical Structure," ICML 2022
- Huang, Low, Xie, Glymour, Zhang, "Latent Hierarchical Causal Structure Discovery with Rank Constraints," NeurIPS 2022
- Adams, Hansen, Zhang, "Identification of Partially Observed Linear Causal Models: Graphical Conditions for the Non-Gaussian and Heterogeneous Cases," NeurIPS 2021
Outline
1. Why causality?
2. Causal representation learning: IID case
3. Causal representation learning from time series
4. Causal representation learning from heterogeneous/nonstationary data

Where Are We?

  Parametric      Latent          i.i.d. data?        What can we get?
  constraints?    confounders?
  No              No              Yes                 (Different types of) equivalence class
  Yes             No / Yes        Yes                 Unique identifiability (under structural conditions)
  No              No / Yes        Non-I., but I.D.    (Extended) regression
  Yes             No / Yes        Non-I., but I.D.    Latent temporal causal processes identifiable!   ← ?
  No              No              I., but non-I.D.    More informative than MEC (CD-NOD)
  Yes             No              I., but non-I.D.    May have unique identifiability
  No              Yes             I., but non-I.D.    Changing subspace identifiable
  Yes             Yes             I., but non-I.D.    Variables in changing relations identifiable
Estimating Time-Delayed Causal Model

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

• Granger causality: conditional independence-based approach + temporal constraints
• Further with instantaneous causal relations:
  • Conditional independence for instantaneous relations (Swanson & Granger, 1997)
  • With a linear, non-Gaussian model (Hyvärinen et al., 2010)

- Swanson, Granger. Impulse response functions based on a causal approach to residual orthogonalization in vector autoregression. Journal of the American Statistical Association, 1997
- Hyvärinen, Zhang, Shimizu, Hoyer, "Estimation of a structural vector autoregression model using non-Gaussianity," JMLR, 2010
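A compact sketch of the Granger-causality idea mentioned above (the toy VAR(1) data, the single lag, and the F-statistic comparison are illustrative assumptions, not the SVAR estimator of the cited papers): X Granger-causes Y if adding X's past improves the prediction of Y beyond Y's own past.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 3000

    # Toy VAR(1): X(t-1) -> Y(t), and Y does not feed back into X
    X = np.zeros(T); Y = np.zeros(T)
    for t in range(1, T):
        X[t] = 0.5 * X[t - 1] + rng.normal()
        Y[t] = 0.5 * Y[t - 1] + 0.8 * X[t - 1] + rng.normal()

    def rss(target, regressors):
        """Residual sum of squares of a least-squares fit (with intercept)."""
        A = np.column_stack([np.ones(len(target))] + regressors)
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.sum((target - A @ beta) ** 2)

    def granger_stat(cause, effect):
        # Restricted model: effect(t) ~ effect(t-1); full model adds cause(t-1)
        y, y1, x1 = effect[1:], effect[:-1], cause[:-1]
        rss_r, rss_f = rss(y, [y1]), rss(y, [y1, x1])
        n = len(y)
        return (rss_r - rss_f) / (rss_f / (n - 3))   # F-statistic with 1 extra regressor

    print("X -> Y:", granger_stat(X, Y))   # large: past X helps predict Y
    print("Y -> X:", granger_stat(Y, X))   # small: past Y does not help predict X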

Learning Latent Causal Dynamics

LEAP: Latent tEmporally cAusal Processes Estimation

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

• Learn the underlying causal dynamics from their mixtures x_t = g(z_t)?
• “Time-delayed” influence renders the latent processes & their relations identifiable
• Temporal VAE with causal prior
  (Architecture: unsupervised representation learning → causal skeleton recovery → learnable causal prior → inference module, exploiting nonstationarity OR functional form)
• The latent temporal causal processes z_t^i follow
  - a completely nonparametric model; or, furthermore,
  - exploiting non-stationary noise or causal influence, or
  - parametric constraints:
      • Nonparametric + nonstationary condition:  z_t^i = f_i( PA(z_t^i), ε_t^i )
      • Linear + Laplacian noise:                 z_t^i = A PA(z_t^i) + ε_t^i
      • PNL + Gaussian noise:                     z_t^i = f_2( f_1(PA(z_t^i)) + ε_t^i )

(Figure 1: Representing latent causal mechanisms from temporal data. By assuming the noises are spatially-temporally independent, the conditional independence condition is embedded within the functional causal model (FCM) in latent space; non-stationarity in the noise distribution and functional or distributional form assumptions are exploited to identify latent causal graphs from temporal observation data.)

- Yao, Chen, Zhang, "Causal Disentanglement for Time Series," NeurIPS 2022
- Yao, Sun, Ho, Sun, Zhang, "Learning Temporally Causal Latent Processes from General Temporal Data," ICLR 2022

Results on Video Data

Synthetic data were generated according to
  x_t = g(z_t),    z_t = A z_t + Σ_{τ=1}^{L} B_τ z_{t−τ} + ε_t,    ε_it ~ p_ε,
where the matrix A is a random directed acyclic graph (DAG) containing the coefficients of the linear instantaneous relations, the noises ε_it are sampled i.i.d. from a Laplacian distribution with parameter 0.1, and the entries of the state transition matrices B_τ are uniformly distributed.

Real-world data: three public datasets are used — KiTTiMask, the Mass-Spring system, and the CMU MoCap database. The observations together with the true temporally causal latent processes are showcased in Fig. B.1; for CMU MoCap, the true latent causal variables and time-delayed relations are unknown.

• For easy interpretation, consider two simple video data sets:
  • KiTTiMask: a video dataset of binary pedestrian masks
  • Mass-spring system: a video dataset with ball movement and invisible springs

(Learned latent processes and their interpretation — KiTTiMask: movement in one direction, movement in an orthogonal direction, and size; Mass-spring system: the x- & y-coordinates of the 5 balls.)

(Figure B.1: Real-world datasets: (a) KiTTiMask is a video dataset of binary pedestrian masks; (b) the Mass-Spring system is a video dataset with ball movement rendered in color and invisible springs; (c) CMU MoCap is a 3D point cloud dataset of skeleton-based signals.)

- Yao, Chen, Zhang, "Learning Latent Causal Dynamics," NeurIPS 2022
- Yao, Sun, Ho, Sun, Zhang, "Learning Temporally Causal Latent Processes from General Temporal Data," ICLR 2022
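A minimal numpy simulation in the spirit of the generating process above (the latent dimension, the number of lags, the fixed lower-triangular A, and the tanh mixing g are simplified, illustrative choices): latent processes with linear instantaneous and time-delayed relations plus Laplacian noise, observed only through a nonlinear mixture.

    import numpy as np

    rng = np.random.default_rng(0)
    dim, lag, T = 3, 2, 1000   # latent dimension, number of lags L, series length

    # Strictly lower-triangular A: a fixed DAG of linear instantaneous relations
    A = np.tril(rng.uniform(-0.5, 0.5, (dim, dim)), k=-1)
    # State transition matrices B_tau for the time-delayed relations
    B = rng.uniform(-0.5, 0.5, (lag, dim, dim))

    def g(z):
        """Nonlinear mixing from latents to observations (illustrative)."""
        M = np.array([[1.0, 0.4, 0.1], [0.2, 1.0, 0.3], [0.1, 0.5, 1.0]])
        return np.tanh(z @ M.T)

    Z = np.zeros((T, dim))
    X = np.zeros((T, dim))
    for t in range(lag, T):
        eps = rng.laplace(scale=0.1, size=dim)                       # Laplacian noise
        delayed = sum(B[tau] @ Z[t - 1 - tau] for tau in range(lag))
        # z_t = A z_t + sum_tau B_tau z_{t-tau} + eps  =>  solve (I - A) z_t = delayed + eps
        Z[t] = np.linalg.solve(np.eye(dim) - A, delayed + eps)
        X[t] = g(Z[t])                                               # x_t = g(z_t)

    print("latent series:", Z.shape, "observed mixtures:", X.shape)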

Outline
1. Why causality?
2. Causal representation learning: IID case
3. Causal representation learning from time series
4. Causal representation learning from heterogeneous/nonstationary data

Where Are We?

  Parametric      Latent          i.i.d. data?        What can we get?
  constraints?    confounders?
  No              No              Yes                 (Different types of) equivalence class
  Yes             No / Yes        Yes                 Unique identifiability (under structural conditions)
  No              No / Yes        Non-I., but I.D.    (Extended) regression
  Yes             No / Yes        Non-I., but I.D.    Latent temporal causal processes identifiable!
  No              No              I., but non-I.D.    More informative than MEC (CD-NOD)   ← ?
  Yes             No              I., but non-I.D.    May have unique identifiability
  No              Yes             I., but non-I.D.    Changing subspace identifiable
  Yes             Yes             I., but non-I.D.    Variables in changing relations identifiable
Nonstationary/Heterogeneous Data and Causal Modeling

• Ubiquity of nonstationary/heterogeneous data
  • Nonstationary time series (brain signals, climate data...)
  • Multiple data sets under different observational or experimental conditions

• Causal modeling & distribution shift heavily coupled
  • P(cause) and P(effect | cause) change independently

- Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, Schölkopf, "Causal Discovery from Heterogeneous/Nonstationary Data," JMLR, 2020
- Zhang, Huang, et al., "Discovery and visualization of nonstationary causal models," arXiv 2015
- Ghassami, et al., "Multi-Domain Causal Structure Learning in Linear Systems," NIPS 2018
Causal Discovery from Nonstationary/Heterogeneous Data

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

• Task:
  • Determine the changing causal modules & estimate the skeleton
  • Causal orientation determination benefits from independent changes in P(cause) and P(effect | cause), including invariant mechanism/cause as special cases
  • Visualization of changing modules over time/across data sets? Kernel nonstationary driving force estimation

(Embedded excerpt from Huang et al., JMLR 2020: ignoring changes in the causal model may lead to spurious connections — if the changes in some variables are related, one can imagine an unobservable quantity influencing all of them, so the conditional independence relations in the distribution-shifted data differ from those implied by the true causal structure; likewise, if the parameters of a variable’s functional causal model change and a fixed model is fitted, the noise term is no longer independent of the causes. Figure 1 there contrasts (a) the true causal graph, including a changing factor g(C), with (b) the conditional independence graph estimated from the observed data in the asymptotic case.)

- Huang et al., "Causal Discovery from Heterogeneous/Nonstationary Data," JMLR, 2020
- Tian, Pearl, "Causal discovery from changes," UAI 2001
- Hoover, "The logic of causal inference," Economics and Philosophy, 6:207–234, 1990
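A toy sketch of the surrogate-variable idea (the two-variable system, the pooled linear regression, and the simple correlation tests are crude stand-ins for the kernel-based machinery of Huang et al., 2020): treat the domain index C as an extra variable; a module whose conditional distribution changes across domains shows up as dependence on C.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_per, n_domains = 2000, 4

    Xs, Ys, Cs = [], [], []
    for c in range(n_domains):
        X = rng.normal(size=n_per)                        # P(X): fixed across domains
        Y = (0.5 + 0.4 * c) * X + rng.normal(size=n_per)  # P(Y | X): slope drifts with the domain
        Xs.append(X); Ys.append(Y); Cs.append(np.full(n_per, float(c)))
    X, Y, C = map(np.concatenate, (Xs, Ys, Cs))

    def module_changes(target, C, parents=None, alpha=1e-3):
        """Does P(target | parents) depend on the domain index C?"""
        if parents is None:
            resid = probe = target - target.mean()
        else:
            A = np.column_stack([np.ones(len(target)), parents])
            beta, *_ = np.linalg.lstsq(A, target, rcond=None)
            resid = target - A @ beta
            probe = resid * parents            # picks up domain-varying slopes
        p1 = stats.pearsonr(probe, C)[1]
        p2 = stats.pearsonr(resid ** 2, C)[1]  # picks up domain-varying noise levels
        return min(p1, p2) < alpha

    print("P(X) changes with domain?    ", module_changes(X, C))             # expected: False
    print("P(Y | X) changes with domain?", module_changes(Y, C, parents=X))  # expected: True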

Causal Analysis of Major Stocks in NYSE (07/05/2006 – 12/16/2009)

(Fig. 8: Recovered causal graph from 80 NYSE stocks; each node color represents one sector.)

• The stocks SAN and CHK only have change points around 05/05/2008 (T2); most stocks that have change points only at T2 have more direct causes.
• The change points match the critical time of the financial crisis — those in the TED spread, as well as parts of the change points (T2 and T3) in the HK stock data.

(Fig. 9: The estimated nonstationary driving force of six stock returns — USB, JCP, SAN, CHK, PBR, and GE — from 07/05/2006 to 12/16/2009.)

- Huang, Zhang, Zhang, Romero, Glymour, Schölkopf, "Behind Distribution Shift: Mining Driving Forces of Changes and Causal Arrows," ICDM 2017
Application: Domain Adaptation as Inference on Graphical Models

• Augmented graph learned by CD-NOD
  • Independently changing modules (the θ variables)
  • Special case: invariant modules
• Domain adaptation: inference on this graphical model
  • Infer the posterior of Y in the target domain
  • Nonparametric methods to model the conditional distributions
• Only the features relevant for predicting Y are needed

(Figure 1: An augmented DAG over Y and the Xi. For any variable V with a θ variable/vector as its parent, the conditional distribution P(V | PA(V)) may change across domains; the θ variables take the same value within each domain. Data sets 1, …, n are the source domains; the target domain provides unlabeled data, on which prediction is performed.)

(Embedded excerpt, "Relation to Causal Graphs": if the causal graph underlying the observed data is known, there is no hidden confounder, and the data are perfect random samples from the populations implied by the causal model, one can directly benefit from using the causal model for transfer learning, and the graphical representation encodes the same conditional independence relations as the causal model. The causal model on its own may not suffice, however (e.g., because of selection bias), and finding causal relations from observational data is notoriously difficult, whereas the graphical model can be found purely as a description of conditional independence relations and of the properties of changes in the distribution modules; the underlying causal structure may be very different from the augmented DAG.)

- Zhang*, Gong*, Stojanov, Huang, Liu, Glymour, "Domain Adaptation As a Problem of Inference on Graphical Models," NeurIPS 2020. (Huang et al., ICML'19 for time series data)

Results on Simulated & Real Data

• Simulations: binary classification data simulated from the graph in Figure 1, with each module modeled by a 1-hidden-layer MLP with 32 nodes; in each replication the MLP parameters and domain-specific θ values are randomly sampled, with 500 points in each source domain and in the target domain; the number of source domains varies among 2, 4, and 9. The proposed method (Infer) is compared with a hypothesis-combination method (simple_adapt), a linear mixture of source conditionals (weigh), comb_classif, pooling SVM (poolSVM), domain-invariant component analysis (DICA), and learning marginal predictors (LMP).
  (Table 1: accuracy on the simulated datasets, averaged over 10 replicates with standard deviations in parentheses; e.g., Infer reaches 83.90 (9.02) with 9 sources and 85.38 (11.31) with 4 sources, outperforming the baselines by a large margin. Wilcoxon signed rank tests against the two strongest baselines give p-values of 0.074, 0.009, 0.203 (against DICA) and 0.067, 0.074, 0.074 (against LMP) for 2, 4, and 9 source domains, respectively.)

• Wi-Fi & Flow data:
  (Table 2: accuracy on the Wi-Fi localization tasks (e.g., t2, t3 → t1) and the Flow data with 3 and 5 sources, comparing DICA, weigh, LMP, poolSVM, Soft-Max, poolNN, and Infer; on t2, t3 → t1 Infer reaches 70.8 (2.7) versus at most 46.80 (1.4) for the baselines, and on the Flow data Infer reaches 96.8 (3.5) and 97.1 (3.5).)

• Multi-source digits adaptation: following the experimental setting in [47], a multi-source dataset is built by combining four digits datasets (MNIST, MNIST-M, SVHN, and SynthDigits); MNIST, MNIST-M, and SVHN are taken in turn as the target domain with the remaining domains as sources, giving three tasks. 20,000 labeled images are sampled for training in the source domains and 9,000 examples are used for testing in the target domain; the graph Y → X (X the image, as in previous work such as [36]) is used for adaptation, and a recently proposed twin auxiliary classifier GAN framework [49] is leveraged to match the conditional distributions of generated and real data. The comparison includes the deep multi-source adaptation method MDAN [47] with its Hard-Max and Soft-Max variants, baselines evaluated in [47] (poolNN, weigh, and poolDANN, which treats the combined source domains as a single source and applies DANN [12]), and poolNN with our network architecture (poolNN_Ours).

  Table 3: Accuracy on the digits data. T: MNIST; M: MNIST-M; S: SVHN; D: SynthDigits.
              weigh   poolNN   poolDANN   Hard-Max   Soft-Max   poolNN_Ours   Infer
  S+M+D/T     75.5    93.8     92.5       97.6       97.9       94.9          96.64
  T+S+D/M     56.3    56.1     65.1       66.3       68.7       59.6          89.89
  M+T+D/S     60.4    77.1     77.6       80.2       81.6       67.8          89.34

  The proposed method achieves much better performance than the alternatives on the two hard tasks, even though the baseline classifier poolNN_Ours performs worse than the poolNN reported in [47]. Figure 3 (generated images in each domain for the digits 0–9, over MNIST, SVHN, SynthDigits, and MNIST-M) indicates that the method successfully transfers knowledge from the source domains and recovers the conditional distribution P(X|Y) (and also P(Y|X)) in the unlabeled target domain; generated images for the other two tasks are in Appendix A8.
Partial Disentanglement for Domain Adaptation

  Parametric constraints?   Latent confounders?   i.i.d. data?
  Yes / No                  No / Yes              No / Yes

• For domain adaptation/generalization: X = g(Z); some components Z_S of the underlying generating process may change across domains
• Changing components Z_S are identifiable; the invariant part Z_C is identifiable up to its subspace
• Using the invariant part Z_C and the transformed changing part Z̃_S for prediction

(Figure 1 of the paper: the data generating process, with gray shading indicating that a variable is observable; the domain index u acts on the changing part through a component-wise mapping z_s = f_u(z̃_s), e.g., turning a generic background pattern into a domain-specific image background.)

(Table 1: classification results on PACS, backbone ResNet-18, most baseline results from (Yang et al., 2020) — iMSDA (Ours) attains 93.44±0.20 (→Art), 91.79±1.52 (→Cartoon), 98.28±0.03 (→Photo), 88.95±0.64 (→Sketch), 93.12 on average, versus 91.25 for the strongest baseline. Table 2: classification results on Office-Home, backbone ResNet-50, baseline results from (Park & Lee, 2021) — iMSDA (Ours) attains 76.39 on average versus 74.11 for the strongest baseline.)

- Kong, Xie, Yao, Zheng, Chen, Stojanov, Akinwande, Zhang, "Partial Disentanglement for Domain Adaptation," ICML 2022
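A toy numpy illustration of why this partition matters (the generative process, the "perfectly recovered" latents, and the nearest-mean classifier are illustrative assumptions, not the iMSDA pipeline): a predictor built on the invariant part z_c transfers to a shifted domain, while one built on the changing part z_s does not.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_domain(n, shift):
        """Toy generative process: label y, invariant latent z_c, changing latent z_s."""
        y = rng.integers(0, 2, n)
        z_c = y + 0.3 * rng.normal(size=n)                  # relates to y the same way in every domain
        z_s = shift + 0.2 * y + 0.3 * rng.normal(size=n)    # distribution shifts with the domain
        x = np.column_stack([z_c + 0.5 * z_s, z_s - 0.2 * z_c])   # observed mixture x = g(z)
        return x, y, z_c, z_s

    x_src, y_src, zc_src, zs_src = make_domain(5000, shift=0.0)
    x_tgt, y_tgt, zc_tgt, zs_tgt = make_domain(5000, shift=2.0)

    def nearest_mean_acc(train_feat, train_y, test_feat, test_y):
        m0, m1 = train_feat[train_y == 0].mean(), train_feat[train_y == 1].mean()
        pred = (np.abs(test_feat - m1) < np.abs(test_feat - m0)).astype(int)
        return (pred == test_y).mean()

    # Pretend the latents were recovered perfectly; compare which part transfers
    print("classifier on z_c, target acc:", nearest_mean_acc(zc_src, y_src, zc_tgt, y_tgt))  # high
    print("classifier on z_s, target acc:", nearest_mean_acc(zs_src, y_src, zs_tgt, y_tgt))  # ~chance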

Summary

• A number of ML tasks involve suitable representations of data


• Decision-making, domain adaptation/generalization, reinforcement
learning, recommender systems, trustworthy AI, explainable AI, fairness…

• Causality, including hidden variables, can be recovered under appropriate assumptions

• Causality not mysterious: (Pseudo-)causal representation learning
