Causal Inference and Stable Learning: Peng Cui Tong Zhang
Training Data: City Hospital
Predictive Model (learned at City Hospital): higher income, higher survival rate.
Testing at University Hospital: survival rate is not so correlated with income.
Correlation is 'unstable'.
• Conditioning on the sample-selection variable S induces a spurious correlation between T and Y.
• Example: T = grass, X = dog nose, Y = label; the grass context correlates with the dog label only through sample selection.
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØConclusions
ATE = E[Y(T=1)] − E[Y(T=0)]
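Under random assignment of T, the ATE above reduces to a simple difference in group means. A minimal sketch with hypothetical data:

```python
def estimate_ate(outcomes, treatments):
    """Difference-in-means estimate of ATE = E[Y(T=1)] - E[Y(T=0)];
    unbiased only when T is randomly assigned."""
    treated = [y for y, t in zip(outcomes, treatments) if t == 1]
    control = [y for y, t in zip(outcomes, treatments) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Hypothetical outcomes for 4 treated and 4 control units
ate = estimate_ate([0.9, 0.8, 1.0, 0.9, 0.5, 0.4, 0.6, 0.5],
                   [1, 1, 1, 1, 0, 0, 0, 0])  # ~0.4
```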
Counterfactual Problem
• Two key points for causal effect estimation:
  • Changing T
  • Keeping everything else constant

Person  T   Y(T=1)  Y(T=0)
P1      1   0.4     ?
P2      0   ?       0.6
P3      1   0.3     ?
P4      0   ?       0.1
P5      1   0.5     ?
P6      0   ?       0.5
P7      0   ?       0.1

• For each person, we observe only one of Y(T=1) and Y(T=0).
• For different groups (T=1 and T=0), everything else is not constant.
Confounding Effect
• Example: the effect of smoking on weight is confounded by age, which influences both.
• Propensity Score Matching pairs units from the T=0 and T=1 groups.
Matching
• Identify pairs of treated (T=1) and control (T=0) units whose confounders X are similar or even identical to each other:
  Distance(X_i, X_j) ≤ ε
• Paired units guarantee that everything else (the confounders) is approximately constant.
• Small ε: less bias, but higher variance.
• Fits low-dimensional settings, but in high-dimensional settings there will be few exact matches.
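The matching rule above can be sketched as greedy nearest-neighbor pairing with caliper ε (a simplified illustration; practical matching implementations are more careful about pair order and ties):

```python
def caliper_match(X_treated, X_control, eps):
    """Greedy one-to-one matching: pair each treated unit with its nearest
    unmatched control unit, accepting the pair only if the confounder
    distance is within the caliper eps (the epsilon of the slide)."""
    pairs, used = [], set()
    for i, x_t in enumerate(X_treated):
        best_j, best_d = None, eps
        for j, x_c in enumerate(X_control):
            if j in used:
                continue
            d = sum((a - b) ** 2 for a, b in zip(x_t, x_c)) ** 0.5
            if d <= best_d:  # small eps: less bias, fewer matches (more variance)
                best_j, best_d = j, d
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs

# Hypothetical 2-D confounders: only the first treated unit finds a match
pairs = caliper_match([(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (10.0, 10.0)], eps=1.0)
```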
• Propensity Score: e(X) = P(T=1 | X)
• Correcting distribution bias by reweighting with the inverse of the propensity score:
  w_i = T_i / e_i + (1 − T_i) / (1 − e_i)
P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
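Inverse propensity reweighting as defined above can be sketched as follows; the clipping threshold is a common practical safeguard against extreme weights, an addition here rather than part of the slide's formula:

```python
def ipw_weights(treatments, propensities, clip=0.05):
    """Inverse propensity weights w_i = T_i/e_i + (1 - T_i)/(1 - e_i).
    Clipping e_i away from 0 and 1 guards against extreme weights."""
    weights = []
    for t, e in zip(treatments, propensities):
        e = min(max(e, clip), 1.0 - clip)
        weights.append(t / e + (1 - t) / (1 - e))
    return weights

def ipw_ate(outcomes, treatments, propensities):
    """Inverse-propensity-weighted difference of outcomes as an ATE estimate."""
    n = len(outcomes)
    return sum(t * y / e - (1 - t) * y / (1 - e)
               for y, t, e in zip(outcomes, treatments, propensities)) / n
```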
Non-parametric solution
• The model-specification problem is inevitable.
• Can we directly learn sample weights that balance the confounders' distribution between the treated and control groups?
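The idea of learning weights directly can be sketched as a small gradient descent that matches the weighted control mean of each confounder to the treated mean. This is a minimal, first-moment-only stand-in for the published balancing methods, which add regularizers and weight both groups:

```python
import numpy as np

def balance_weights(X, T, steps=5000, lr=0.5):
    """Learn nonnegative weights for the control group so that the weighted
    control mean of every confounder matches the treated-group mean."""
    Xc = X[T == 0]
    target = X[T == 1].mean(axis=0)
    w = np.ones(len(Xc))
    for _ in range(steps):
        s = w.sum()
        m = (w @ Xc) / s                     # current weighted control mean
        grad = 2.0 * ((Xc - m) @ (m - target)) / s
        w = np.maximum(w - lr * grad, 1e-8)  # keep weights nonnegative
    return w / w.sum()

# Hypothetical 1-D confounder: treated mean 2.0, unweighted control mean 1.5
X = np.array([[1.5], [2.5], [0.0], [1.0], [2.0], [3.0]])
T = np.array([1, 1, 0, 0, 0, 0])
w = balance_weights(X, T)  # weighted control mean moves to ~2.0
```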
• Propensity Score
• Entropy Balancing

Experiments
• LaLonde dataset
Kun Kuang, Peng Cui, et al. 2017. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing, KDD 2017, 265–274.
Sectional Summary
• Progress has been made to draw causality from big data.
• From single to group
• From binary to continuous
• Weak assumptions
Ready for Learning?
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØFuture Directions and Conclusions
Traditional Learning: driven by prediction performance on the training distribution.
Stable Learning: the learning process targets the true model, so that accuracy remains stable when testing distributions differ from training (cf. transfer learning).
Prediction Stability
• Lasso
Estimation Stability
• Mean regression function:
• Variance of function m:
• Estimation Stability:
Domain Generalization
• Assumption: the conditional probability P(Y|X) is stable or
invariant across different environments.
• Idea: taking knowledge acquired from a number of related domains
and applying it to previously unseen domains
• Theorem: under reasonable technical assumptions, a generalization bound holds with high probability.
Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature representation. ICML 2013.
Invariant Prediction
• Invariant Assumption: there exists a subset S ⊆ X that is causal for the prediction of Y, and the conditional distribution P(Y|S) is stable across all environments.
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
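A crude way to probe the invariant assumption: fit one linear model on the pooled data restricted to a candidate subset S and check whether residuals stay centred in every environment. This is only an illustrative stand-in for the statistical tests of Peters et al. (2016):

```python
import numpy as np

def invariance_score(X_by_env, y_by_env, subset):
    """Worst-environment mean residual of a single linear model fit on the
    pooled data restricted to the candidate subset S. If P(Y|S) is
    invariant, residuals should be centred in every environment."""
    Xs = np.vstack([X[:, subset] for X in X_by_env])
    ys = np.concatenate(y_by_env)
    A = np.hstack([Xs, np.ones((len(ys), 1))])  # add intercept column
    beta, *_ = np.linalg.lstsq(A, ys, rcond=None)
    score = 0.0
    for X, y in zip(X_by_env, y_by_env):
        Ae = np.hstack([X[:, subset], np.ones((len(y), 1))])
        score = max(score, abs(float(np.mean(y - Ae @ beta))))
    return score
```

A small score suggests S behaves invariantly; a stable-feature subset should score lower than one built on an environment-dependent feature.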
Global Balancing
• Given ANY feature as the treatment variable T, balance the distribution of the remaining features X.
• Theoretical guarantee.
Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
Causal Regularizer
Set feature j as the treatment variable.
Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning on Data with Agnostic Bias. ACM MM, 2018.
• Sample-reweighted logistic loss
• Causal contribution
Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning on Data with Agnostic Bias. ACM MM, 2018.
Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Goes to infinity when perfect collinearity exists! Bias term with bound |b(x)| ≤ δ.
Toy Example
• Assume the design matrix X consists of two variables X1, X2, generated from a multivariate normal distribution:
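A quick simulation (with assumed parameters: true coefficients (1, 1), unit noise) shows how collinearity between the two variables inflates the variance of the OLS coefficient estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_coef_std(rho, n=100, trials=500):
    """Empirical std of the first OLS coefficient when X1, X2 are jointly
    normal with correlation rho and y = X1 + X2 + noise (assumed setup)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = []
    for _ in range(trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(0.0, 1.0, size=n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[0])
    return float(np.std(coefs))

# Variance inflation: the std ratio approaches 1/sqrt(1 - rho^2) as rho -> 1
```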
Simulation Results
• Coefficient estimates show large variance across different distributions as collinearity increases.
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
which is equivalent to
Decorrelation
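Decorrelation by sample reweighting can be sketched, for two variables, as gradient descent on the squared weighted covariance; the actual method handles all feature pairs jointly with additional terms:

```python
import numpy as np

def decorrelation_weights(X, steps=5000, lr=1.0):
    """Gradient descent on the squared weighted covariance between the two
    columns of X, yielding sample weights under which the variables are
    approximately uncorrelated. Two-variable sketch only."""
    w = np.ones(len(X))
    for _ in range(steps):
        s = w.sum()
        p = w / s
        m = p @ X                                   # weighted means
        prod = (X[:, 0] - m[0]) * (X[:, 1] - m[1])  # per-sample cross term
        c = p @ prod                                # weighted covariance
        w = np.maximum(w - lr * 2.0 * c * (prod - c) / s, 1e-8)
    return w / w.sum()

# Columns co-occur (correlated); reweighting drives their covariance to ~0
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.], [1., 1.], [0., 0.]])
w = decorrelation_weights(X)
```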
Experimental Results
• Simulation Study
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Experimental Results
• Regression
• Classification
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Yoshua Bengio, From Deep Learning of Disentangled Representations to Higher-level Cognition. (2019). YouTube. Retrieved 22 February 2019.
A policy π, a representation f.
Sectional Summary
• Causal inference provides valuable insights for stable learning.
Benchmark is important!
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØFuture Directions and Conclusions
Existence of Non-I.I.D.-ness
• One metric (NI) for Non-I.I.D.-ness: measures the distribution shift between training and testing features, with a scale term for normalization.
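The exact NI formula is not reproduced on the slide; a sketch in its spirit measures the shift between train/test mean feature vectors, scaled for normalization (this instantiation is an assumption):

```python
import numpy as np

def niid_index(feat_train, feat_test):
    """Shift between the mean feature vectors of training and testing data,
    divided by a pooled scale term for normalization. An assumed
    instantiation of the NI idea, not the exact published formula."""
    mu_tr = feat_train.mean(axis=0)
    mu_te = feat_test.mean(axis=0)
    scale = np.sqrt(feat_train.var(axis=0) + feat_test.var(axis=0)).mean() + 1e-12
    return float(np.linalg.norm(mu_tr - mu_te) / scale)
```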
Related Datasets
• ImageNet, PASCAL VOC, MSCOCO
• NI is ubiquitous, but small on these datasets (average NI: 2.7)
• NI is uncontrollable, which is not friendly for the non-i.i.d. setting
• Summary: small NI, uncontrollable NI
Minimum Bias
• In this setting, random sampling leads to the minimum distribution shift between training and testing distributions, simulating a nearly i.i.d. scenario.
• 8000 samples for training and 2000 samples for testing in each superclass (ConvNet).
Proportional Bias
• Given a class, when sampling positive samples, we use all contexts for both training and testing, but the percentage of each context differs between the training and testing datasets.
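The proportional-bias sampling above can be sketched as follows; the context names and pool structure are illustrative, not the NICO API:

```python
import random

def sample_with_ratio(pool_by_context, n, dominant, ratio):
    """Draw roughly n positive samples for one class so that the dominant
    context appears `ratio` times as often as each other context."""
    weights = {c: (ratio if c == dominant else 1.0) for c in pool_by_context}
    total = sum(weights.values())
    sample = []
    for context, pool in pool_by_context.items():
        k = round(n * weights[context] / total)
        sample += random.choices(pool, k=k)
    return sample

# Dominant ratio 2:1 over two contexts: 20 'grass' vs 10 'snow' images out of 30
```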
Plot: with the testing ratio fixed at 1:1, NI rises from about 4.0 to 4.4 as the dominant ratio in the training data grows from 1:1 to 6:1. We can control NI by varying the dominant ratio.
Compositional Bias
• Given a class, the observed contexts are different between training and testing data.
• Moderate setting (training and testing contexts overlap): NI rises from about 4.0 to 4.4 as the number of contexts in the training data drops from 7 to 3.
• Radical setting (no overlap, plus a dominant ratio): NI rises from about 4.0 to 5.0 as the dominant ratio in the training data grows from 1:1 to 5:1 (testing ratio 1:1).
Summary: large NI, controllable NI.
• https://www.dropbox.com/sh/8mouawi5guaupyb/AAD4fdySrA6fn3PgSmhKwFgva?dl=0
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØConclusions
Conclusions
• Predictive modeling is not only about accuracy.
• Stability is critical for us to trust a predictive model.
• Causality has been demonstrated to be useful in stable prediction.
• How to marry causality with predictive modeling effectively and
efficiently is still an open problem.
Roadmap: Causal Inference (Propensity Score, Direct Confounder Balancing, Debiasing) → Global Balancing → Linear Stable Learning → Disentangled Prediction Learning.
Reference
• Shen Z, Cui P, Kuang K, et al. Causally regularized learning with agnostic data
selection bias[C]//2018 ACM Multimedia Conference on Multimedia Conference.
ACM, 2018: 411-419.
• Kuang K, Cui P, Athey S, et al. Stable prediction across unknown
environments[C]//Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. ACM, 2018: 1617-1626.
• Kuang K, Cui P, Li B, et al. Estimating treatment effect in the wild via differentiated
confounder balancing[C]//Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, 2017: 265-274.
• Kuang K, Cui P, Li B, et al. Treatment effect estimation with data-driven variable
decomposition[C]//Thirty-First AAAI Conference on Artificial Intelligence. 2017.
• Kuang K, Jiang M, Cui P, et al. Steering social media promotions with effective
strategies[C]//2016 IEEE 16th International Conference on Data Mining (ICDM).
IEEE, 2016: 985-990.
Reference
• Pearl J. Causality[M]. Cambridge university press, 2009.
• Austin P C. An introduction to propensity score methods for reducing the effects of
confounding in observational studies[J]. Multivariate behavioral research, 2011, 46(3): 399-
424.
• Johansson F, Shalit U, Sontag D. Learning representations for counterfactual
inference[C]//International conference on machine learning. 2016: 3020-3029.
• Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization
bounds and algorithms[C]//Proceedings of the 34th International Conference on Machine
Learning-Volume 70. JMLR. org, 2017: 3076-3085.
• Johansson F D, Kallus N, Shalit U, et al. Learning weighted representations for generalization
across designs[J]. arXiv preprint arXiv:1802.08598, 2018.
• Louizos C, Shalit U, Mooij J M, et al. Causal effect inference with deep latent-variable
models[C]//Advances in Neural Information Processing Systems. 2017: 6446-6456.
• Thomas V, Bengio E, Fedus W, et al. Disentangling the independently controllable factors of
variation by interacting with the world[J]. arXiv preprint arXiv:1802.09484, 2018.
• Bengio Y, Deleu T, Rahaman N, et al. A Meta-Transfer Objective for Learning to Disentangle
Causal Mechanisms[J]. arXiv preprint arXiv:1901.10912, 2019.
Reference
• Yu B. Stability[J]. Bernoulli, 2013, 19(4): 1484-1500.
• Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[J]. arXiv
preprint arXiv:1312.6199, 2013.
• Volpi R, Namkoong H, Sener O, et al. Generalizing to unseen domains via adversarial data
augmentation[C]//Advances in Neural Information Processing Systems. 2018: 5334-5344.
• Ye N, Zhu Z. Bayesian adversarial learning[C]//Proceedings of the 32nd International
Conference on Neural Information Processing Systems. Curran Associates Inc., 2018: 6892-
6901.
• Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature
representation[C]//International Conference on Machine Learning. 2013: 10-18.
• Peters J, Bühlmann P, Meinshausen N. Causal inference by using invariant prediction:
identification and confidence intervals[J]. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 2016, 78(5): 947-1012.
• Rojas-Carulla M, Schölkopf B, Turner R, et al. Invariant models for causal transfer learning[J].
The Journal of Machine Learning Research, 2018, 19(1): 1309-1342.
• Rothenhäusler D, Meinshausen N, Bühlmann P, et al. Anchor regression: heterogeneous data
meets causality[J]. arXiv preprint arXiv:1801.06229, 2018.
Acknowledgement
Thanks!
Peng Cui
cuip@tsinghua.edu.cn
http://pengcui.thumedialab.com