
Causal Inference and Machine Learning

Juan Carlos Escanciano


Universidad Carlos III de Madrid
Department of Economics

This version: January 2023

Copyright 2023 by Juan Carlos Escanciano. All Rights Reserved. No part of this publication may be reproduced, stored, or distributed in any way (whether manual, electronic, recording, mechanical, photocopying, or by information storage and retrieval systems) without explicit permission from the author.
Contents

I Causal Inference: RCT and Unconfoundedness 1


1 Randomized Control Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Observational data under unconfoundedness . . . . . . . . . . . . . . . . . . . . . 14
2.1 A Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

II Machine Learning for Regression 25


3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Ridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Best subset selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6 Forward stepwise regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8 Trees: Bagging, Random Forest, Boosting . . . . . . . . . . . . . . . . . . . . . . 41
8.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.3 Generalized Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.4 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
9 Kernel Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

III A General Framework for Locally Robust Estimators 48


10 Double-Robust Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.1 Cross-Fitting with DR Moments . . . . . . . . . . . . . . . . . . . . . . . . . 52
11 Constructing Orthogonal Functions in General . . . . . . . . . . . . . . . . . . . . 55
11.1 Cross-Fitting in the general case . . . . . . . . . . . . . . . . . . . . . . . . . 63
11.2 Automatic Estimation of $\alpha_0$ . . . . . . . . . . . . . . . . . . . . . . . . . 65
11.3 Functionals of a Quantile Regression . . . . . . . . . . . . . . . . . . . . . . 68
11.4 Functionals of Nonparametric Instrumental Variables . . . . . . . . . . . . . 72
12 Heterogeneous Treatment Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Part I

Causal Inference: RCT and Unconfoundedness
We introduce the so-called Neyman-Rubin potential outcome notation (Neyman 1923, Rubin 1974). Let $D_i$ be a binary treatment indicator variable for the $i$-th unit, with $D_i = 1$ denoting that the $i$-th individual has been treated and $D_i = 0$ otherwise. Let $Y_i(1)$ be the outcome under treatment and let $Y_i(0)$ be the outcome without treatment.^1 There is a missing data problem, in the sense that we only observe $Y_i$, with $Y_i = Y_i(1) D_i + Y_i(0)(1 - D_i)$, but $(Y_i(1), Y_i(0))$ is not jointly observed. That is, we observe $Y_i = Y_i(1)$ if $D_i = 1$ and $Y_i = Y_i(0)$ if $D_i = 0$. In addition to $(Y_i, D_i)$, we may also observe a vector of covariates $X_i$. The sample is an independent and identically distributed (iid) sample $\{W_i \equiv (Y_i, X_i, D_i)\}_{i=1}^n$ generated from $P$, the joint distribution of $W_i \equiv (Y_i, X_i, D_i)$.

Causal inference refers to inference on causal parameters, as defined below. A causal parameter of interest is the treatment effect for individual $i$, i.e. $Y_i(1) - Y_i(0)$, but this quantity is never observed, since for individual $i$ either $Y_i(1)$ is observed, if $D_i = 1$, or $Y_i(0)$ is observed, if $D_i = 0$, but not both. This is the fundamental missing data problem of causal inference. We shall discuss different settings under which, despite the missing data problem, we identify and learn about causal parameters of the form

$$\tau(C) := E[Y_i(1) - Y_i(0) \mid C],$$

for different sub-populations defined by different conditioning sets $C$. Causal inference refers to inference on parameters such as $\tau(C)$. This setting includes the Average Treatment Effect (ATE) parameter as a special case ($C = \emptyset$), denoted by $\tau_{ATE} := E[Y_i(1) - Y_i(0)]$, or the Average Treatment Effect on the Treated (ATT), denoted by $\tau_{ATT} := E[Y_i(1) - Y_i(0) \mid D_i = 1]$, with $C$ the set of treated units. This setting also allows us to study treatment effect heterogeneity by considering many $C$'s defining different groups (so-called Group ATEs, GATEs). Henceforth, $E$, $C$ and $V$ denote expectation, covariance and variance under the probability $P$.

As a running example we use log earnings $Y_i$ and an indicator for college graduation $D_i$, with $D_i = 1$ indicating graduation from college. Another example could be log earnings $Y_i$, as before, but now $D_i$ is an indicator for participation in a job training program.

^1 An implicit assumption here is no interference, i.e., the potential outcome does not depend on the treatment of others. This assumption may not be reasonable if there are network effects.
There are also applications where $Y_i$ is binary (e.g. an indicator of being employed). Our previous notation also applies to binary dependent variables. There are also many applications where the treatment is multi-valued and even continuous. The previous potential outcomes notation can be generalized to the multiple treatment setting (see examples below). For example, in the multi-valued setting the ATE from treatment $d$ to $d+1$ is given by $\tau_{d+1,d} := E[Y_i(d+1) - Y_i(d)]$, where $Y_i(d)$ denotes the potential outcome under treatment level $d$. The map $d \to E[Y_i(d)]$ is called the Causal Response or Average Structural Function.

We study causal inference under different scenarios. The first setting we investigate is that of Randomized Control Trials (RCT). For simplicity of presentation, we focus for the most part on the binary treatment case.

1 Randomized Control Trials


Consider the assumption:

RCT1: $(Y_i(1), Y_i(0))$ is independent of $D_i$ (in short, $(Y_i(1), Y_i(0)) \perp D_i$).

Under this condition,

$$
\begin{aligned}
\tau_{ATE} &= E[Y_i(1) - Y_i(0)] \\
&= E[Y_i(1)] - E[Y_i(0)] \\
&= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] && \text{(RCT1)} \\
&= E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] && \text{(definition of } Y_i\text{)}.
\end{aligned}
$$

The last equality is an identification result: we have expressed $\tau_{ATE}$ as a functional of the joint distribution $P$ of the observables $(Y_i, D_i)$. This identification suggests the difference-in-means estimator

$$\hat{\tau} = \frac{1}{n_1}\sum_{D_i = 1} Y_i - \frac{1}{n_0}\sum_{D_i = 0} Y_i \qquad (1)$$
$$=: \bar{Y}_1 - \bar{Y}_0,$$

based on an iid sample $\{(Y_i, D_i)\}_{i=1}^n$, where $\bar{Y}_d := n_d^{-1}\sum_{D_i = d} Y_i$, $n_d := \sum_{i=1}^n 1(D_i = d)$, and henceforth $1(A)$ denotes the indicator of the event $A$, i.e. it equals one if $A$ is true and zero otherwise. The number of individuals in the treatment group is $n_1$, and in the control group it is $n_0$. Note that $n_d$ could be random here, but it depends only on the data in $\mathcal{D} = \{D_i\}_{i=1}^n$.

The difference-in-means estimator is unbiased:

$$E[\hat{\tau}] = E[\bar{Y}_1] - E[\bar{Y}_0] = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0],$$

where the last equality follows from (assuming $n_d \neq 0$)

$$
\begin{aligned}
E[\bar{Y}_d] &= E\left[E\left[\frac{1}{n_d}\sum_{D_i = d} Y_i \,\Big|\, \mathcal{D}\right]\right] && \text{(Iterated Expectations)} \\
&= E\left[\frac{1}{n_d}\sum_{i=1}^n E[Y_i 1(D_i = d) \mid \mathcal{D}]\right] \\
&= E\left[\frac{1}{n_d}\sum_{i=1}^n 1(D_i = d)\, E[Y_i \mid \mathcal{D}]\right] \\
&= E\left[\frac{1}{n_d}\sum_{i=1}^n 1(D_i = d)\, E[Y_i \mid D_i]\right] && \text{(no interference)} \\
&= E\left[\frac{1}{n_d}\sum_{i=1}^n 1(D_i = d)\, E[Y_i \mid D_i = d]\right] \\
&= E[Y_i \mid D_i = d]\; E\left[\frac{1}{n_d}\sum_{i=1}^n 1(D_i = d)\right] \\
&= E[Y_i \mid D_i = d] \\
&= E[Y_i(d)].
\end{aligned}
$$

Moreover, using that $Y_i 1(D_i = d) = Y_i(d) 1(D_i = d)$, and

$$V[\bar{Y}_d \mid \mathcal{D}] = \frac{1}{n_d^2}\sum_{i=1}^n V[Y_i 1(D_i = d) \mid \mathcal{D}] = \frac{1}{n_d^2}\sum_{i=1}^n 1(D_i = d)\, V[Y_i(d) \mid \mathcal{D}] = \frac{1}{n_d} V[Y_i(d)],$$

we can write the (conditional) variance as

$$V[\hat{\tau} \mid \mathcal{D}] = \frac{1}{n_0} V[Y_i(0)] + \frac{1}{n_1} V[Y_i(1)].$$

We aim to find the asymptotic properties of $\hat{\tau}$. Our main tool is the Central Limit Theorem (CLT), which states that if $\{\xi_i\}_{i=1}^n$ is a sample of iid random variables with mean zero, $E[\xi_i] = 0$, and finite variance $\sigma^2 = E[\xi_i^2] < \infty$, then

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \xi_i \overset{d}{\to} N(0, \sigma^2).$$

With that goal in mind, write $\sqrt{n}(\hat{\tau} - \tau)$ as the difference of two standardized sample means (see Exercise 1)

$$\frac{n}{n_1}\frac{1}{\sqrt{n}}\sum_{i=1}^n (Y_i - E[Y_i \mid D_i = 1])\, 1(D_i = 1) - \frac{n}{n_0}\frac{1}{\sqrt{n}}\sum_{i=1}^n (Y_i - E[Y_i \mid D_i = 0])\, 1(D_i = 0).$$

Then, by the CLT and the equality

$$E\left[(Y_i - E[Y_i \mid D_i = d])^2 1(D_i = d)\right] = E\left[(Y_i(d) - E[Y_i(d)])^2 1(D_i = d)\right] = V[Y_i(d)]\, P(D_i = d) \quad \text{(by RCT1)},$$

it follows that

$$\sqrt{n}(\hat{\tau} - \tau) \overset{d}{\to} N(0, V_{DM}),$$

where

$$V_{DM} = \frac{1}{p} V[Y_i(1)] + \frac{1}{1-p} V[Y_i(0)]$$
and $p = P(D_i = 1)$. The asymptotic variance $V_{DM}$ can be easily estimated by the sample analogue

$$\hat{V}_{DM} = \frac{1}{\hat{p}}\,\frac{1}{n_1 - 1}\sum_{D_i = 1} (Y_i - \bar{Y}_1)^2 + \frac{1}{1-\hat{p}}\,\frac{1}{n_0 - 1}\sum_{D_i = 0} (Y_i - \bar{Y}_0)^2, \qquad \hat{p} = \frac{n_1}{n},$$

which can be used to do inference on the ATE (e.g. construct asymptotic confidence intervals in the usual way). That is,

$$\lim_{n \to \infty} P\left(\tau_{ATE} \in \hat{\tau} \pm z_{\alpha/2}\sqrt{\hat{V}_{DM}/n}\right) = 1 - \alpha,$$

where $\Phi$ denotes the standard Gaussian CDF and $z_{\alpha} = \Phi^{-1}(1 - \alpha)$ its $1 - \alpha$ quantile. This is our first causal inference result. The estimator $\hat{\tau}$ is the best estimator of $\tau$ based on the data $\{(Y_i, D_i)\}_{i=1}^n$ (we will not discuss issues of efficiency since they are beyond the scope of this course).
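As an illustration, here is a minimal R sketch of the difference-in-means estimator and the confidence interval above, assuming an RCT sample stored in vectors Y and D (function and variable names are illustrative, not from the text):

dm.ate <- function(Y, D, alpha = 0.05) {
  n <- length(Y)
  tau.hat <- mean(Y[D == 1]) - mean(Y[D == 0])
  # sample analogue of V_DM = V[Y(1)]/p + V[Y(0)]/(1 - p)
  p.hat <- mean(D)
  V.hat <- var(Y[D == 1]) / p.hat + var(Y[D == 0]) / (1 - p.hat)
  z <- qnorm(1 - alpha / 2)
  c(estimate = tau.hat,
    lower = tau.hat - z * sqrt(V.hat / n),
    upper = tau.hat + z * sqrt(V.hat / n))
}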
It is important and useful to note that $\hat{\tau}$ and inference based on it can be implemented through a simple outcome regression. In the correctly specified linear model

$$Y_i = \alpha + \tau D_i + \varepsilon_i, \qquad (2)$$

where $E[\varepsilon_i \mid D_i] = 0$, the parameter $\tau$ is the ATE. The ATE is (just-)identified from the least squares normal equations $E[m(Y_i, D_i, \theta_0)] = 0$, with $m(Y_i, D_i, \theta_0) = (Y_i - \alpha - \tau D_i)\tilde{D}_i$, $\tilde{D}_i = (1, D_i)'$ and $\theta_0 = (\alpha, \tau)'$. In fact, $\hat{\tau}$ in (1) is the Ordinary Least Squares (OLS) estimator of $\tau$ in (2). For the purpose of estimating the ATE, the linear regression model makes an efficient use of all the information when only data on $(Y_i, D_i)$ is available. Assumption RCT1 is not testable.
The estimator $\hat{\tau}$ does not use information on covariates, which may also be available. Introducing covariates provides over-identification of the ATE, thereby improving its estimation under the following assumption:

RCT2: $(Y_i(1), Y_i(0), X_i)$ is independent of $D_i$.

For simplicity of exposition, consider a model for potential outcomes such as

$$Y_i(d) = \mu_d(X_i) + \varepsilon_i(d), \qquad (3)$$

where $\mu_d(X_i) = E[Y_i(d) \mid X_i]$, so that $E[\varepsilon_i(d) \mid X_i] = 0$.

We shall show that indeed covariates improve efficiency, but, for simplicity of presentation, we will do so under the additional assumptions that $V[\varepsilon_i(d) \mid X_i] = \sigma^2$ and $p = 1/2$ (these are strong assumptions). Further, assume without loss of generality (wlog) that $E[X_i] = 0$. This assumption is wlog because all estimators we will consider are translation invariant (if you add a constant to all the $X$'s the estimator of the ATE remains the same).

Under Assumption RCT2 and (3), $\mu_d(X_i)$ is identified as

$$\mu_d(X_i) = E[Y_i \mid X_i, D_i = d],$$

and since $E[Y_i(d)] = E[\mu_d(X_i)]$, it follows that

$$\tau_{ATE} = E[\mu_1(X_i) - \mu_0(X_i)].$$
This identification result can be used to construct estimators for the ATE using covariates. A commonly used model is the linear regression model

$$Y_i(d) = c_d + \beta_d' X_i + v_i(d),$$

where we intentionally write $v_i(d)$ instead of $\varepsilon_i(d)$ to emphasize that the linear model may be misspecified, though still $E[v_i(d)] = 0$ by construction. From the zero mean, it follows that misspecification of the linear model does not affect the previous identification of the ATE. Thus,

$$\tau_{ATE} = E[c_1 + \beta_1' X_i - c_0 - \beta_0' X_i]$$

and the OLS (covariate-adjusted) estimator is

$$\hat{\tau}_{OLS} = \hat{c}_1 - \hat{c}_0 + \bar{X}'\left(\hat{\beta}_1 - \hat{\beta}_0\right),$$

with the OLS estimators $(\hat{c}_d, \hat{\beta}_d)$ based on the subsample with $D_i = d$. In R this could be done with the commands:

ols.fit.0 = lm(Y ~ X, subset = (D == 0))

ols.fit.1 = lm(Y ~ X, subset = (D == 1))

Importantly, the OLS-based estimator is of the plug-in form

$$\hat{\tau}_{OLS} = \frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right),$$

where in this example $\hat{\mu}_d(X_i) \equiv \hat{\mu}_{d,OLS}(X_i) = \hat{c}_d + \hat{\beta}_d' X_i$. A drawback of matching with covariates is that it is not fully automatic in general. That is, the estimator $\hat{\tau}_{OLS}$ is a two-step estimator: matching and averaging. Because of this feature, standard errors may be complicated, depending on the fitted values $\hat{\mu}_1$ and $\hat{\mu}_0$ used. However, the estimator $\hat{\tau}_{OLS}$ can also be computed in one single regression with the treatment interacted with the covariates, which facilitates inference in this case. It is important to center the covariates to have the interpretation $c_d = E[Y_i(d)]$ and, in the model with interactions, to interpret the coefficient of $D$ as the ATE (Exercise 2). The R code for this would be:

X.centered = scale(X, center = TRUE, scale = FALSE)

ols.fit = lm(Y ~ D * X.centered)
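If robust standard errors are desired for the interacted regression above, one option (a sketch, assuming the sandwich and lmtest packages are available) is:

library(sandwich)
library(lmtest)
# the coefficient on D is the covariate-adjusted ATE estimate
coeftest(ols.fit, vcov. = vcovHC(ols.fit, type = "HC1"))["D", ]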

We aim to compare the asymptotic variance of $\hat{\tau}_{OLS}$ with that of the difference-in-means estimator $\hat{\tau}$ in (1) under possible misspecification of the linear model. First, note from the results above (since $p = 1/2$)

$$V_{DM} = 2V[Y_i(1)] + 2V[Y_i(0)] = 4\sigma^2 + 2V[\mu_1(X_i)] + 2V[\mu_0(X_i)].$$

From standard OLS theory and $E[X_i] = 0$, it follows that

$$\sqrt{n_d}\begin{pmatrix} \hat{c}_d - c_d \\ \hat{\beta}_d - \beta_d \end{pmatrix} \overset{d}{\to} N\left(0, \begin{pmatrix} MSE_d & 0 \\ 0 & \cdots \end{pmatrix}\right), \qquad (4)$$

where $MSE_d = E[(Y_i(d) - c_d - \beta_d' X_i)^2]$. Additionally, we can show that the estimators $\hat{c}_d$, $\hat{\beta}_d$ and $\bar{X}$ are all asymptotically independent (by $E[X_i] = 0$ and $E[v_i(d) X_i] = 0$).

Define the (squared) norm

$$\|a\|^2 = a' \Sigma a,$$

where $\Sigma = V[X]$. We will show that, by $\tau_{ATE} = c_1 - c_0$, the CLT and the OLS convergence in (4),

$$
\begin{aligned}
\sqrt{n}\left(\hat{\tau}_{OLS} - \tau_{ATE}\right) &= \sqrt{n}(\hat{c}_1 - c_1) - \sqrt{n}(\hat{c}_0 - c_0) + \sqrt{n}\,\bar{X}'\left(\hat{\beta}_1 - \hat{\beta}_0\right) \\
&= \sqrt{n}(\hat{c}_1 - c_1) - \sqrt{n}(\hat{c}_0 - c_0) + \sqrt{n}\,\bar{X}'(\beta_1 - \beta_0) + o_P(1) \\
&\overset{d}{\to} N(0, V_{OLS}),
\end{aligned}
$$

where

$$V_{OLS} = V_{DM} - \|\beta_1 + \beta_0\|^2.$$

To prove this, note $V[\varepsilon_i(d) \mid X_i] = \sigma^2$ and $p = 1/2$, and, making use of the parallelogram law

$$\|\beta_1 + \beta_0\|^2 + \|\beta_1 - \beta_0\|^2 = 2\|\beta_1\|^2 + 2\|\beta_0\|^2$$

and the fact that $C[\mu_d(X_i), \beta_d' X_i] = V[\beta_d' X_i]$, we obtain

$$
\begin{aligned}
V_{OLS} &= 2MSE_0 + 2MSE_1 + \|\beta_1 - \beta_0\|^2 \\
&= 4\sigma^2 + 2V[\mu_1(X_i) - \beta_1' X_i] + 2V[\mu_0(X_i) - \beta_0' X_i] + \|\beta_1 - \beta_0\|^2 \\
&= 4\sigma^2 + 2\left(V[\mu_1(X_i)] - V[\beta_1' X_i]\right) + 2\left(V[\mu_0(X_i)] - V[\beta_0' X_i]\right) + \|\beta_1 - \beta_0\|^2 \\
&= 4\sigma^2 + 2\left(V[\mu_1(X_i)] + V[\mu_0(X_i)]\right) + \|\beta_1 - \beta_0\|^2 - 2\|\beta_1\|^2 - 2\|\beta_0\|^2 \\
&= 4\sigma^2 + 2\left(V[\mu_1(X_i)] + V[\mu_0(X_i)]\right) - \|\beta_1 + \beta_0\|^2 \\
&= V_{DM} - \|\beta_1 + \beta_0\|^2.
\end{aligned}
$$

The reduction in asymptotic variance is given by $\|\beta_1 + \beta_0\|^2$, and it is zero in the worst case where $\beta_1 = \beta_0 = 0$.
Introducing covariates improves the efficiency of estimators and inference on the ATE in RCTs. Fisher (1925) already found the benefits of using covariates to improve upon the standard difference-in-means estimator. For models with a small and fixed number of parameters (low-dimensional asymptotics, model fixed, sample size increasing) it has been shown that regression adjustment is always asymptotically helpful, even in misspecified models, provided we add full treatment-by-covariate interactions to the regression design and use robust standard errors; see, e.g., Lin (2013). Regression adjustment, as explained above, can hurt asymptotically if the sample sizes are different and no interactions are included. When we use models with a large number of parameters, and Machine Learning (ML) methods to estimate these models, the ML methods necessarily have large bias due to regularization and model selection. These regularization and model selection biases are transmitted into the causal estimate, unless the causal parameter is Locally Robust (LR), or insensitive to the ML first steps. We will discuss in this course how to construct LR causal estimates and how to deploy ML methods for estimating and doing inference on causal parameters.

Linear regression may not be appropriate with binary choice or other limited dependent variables. Zhang, Tsiatis and Davidian (2008) provide a general method for regression adjustment in RCTs based on moment restrictions that are LR. We introduce these LR moments as follows. Suppose we aim to improve upon a moment restriction that only uses data on $Y$ and $D$ by means of a moment restriction

$$E[m(Y_i, D_i, \theta_0)] = 0,$$
where the $\theta_0$ identified through this set of moments may include the ATE or other causal parameters of interest. That is, we assume that if $E[m(Y_i, D_i, \theta)] = 0$ then $\theta = \theta_0$. For example, for the difference-in-means estimator and the binary treatment case, $m(Y_i, D_i, \theta_0) = (Y_i - \alpha - \tau D_i)\tilde{D}_i$, $\tilde{D}_i = (1, D_i)'$ and $\theta_0 = (\alpha, \tau)'$. This setting also includes models for binary outcomes, such as, for example, the logistic regression MLE, where

$$m(Y_i, D_i, \theta_0) = \left(Y_i - \Lambda(\alpha + \tau D_i)\right)\tilde{D}_i \qquad (5)$$

and $\Lambda(u) = \exp(u)/(1 + \exp(u))$. More generally, for a multi-valued setting where $D_i$ is a possibly multi-valued treatment, with treatment intensities $d = 0, \ldots, T$, we could take

$$m(Y_i, D_i, \theta_0) = \left(Y_i - \Lambda\left(\sum_{d=0}^T \mu_d 1(D_i = d)\right)\right)\tilde{D}_i, \qquad (6)$$

where $\tilde{D}_i = (1(D_i = 0), \ldots, 1(D_i = T))'$, and $\Lambda$ could be the identity or the logistic distribution, among others. For the identity case $\Lambda(u) = u$, note that $1(D_i = d_1)1(D_i = d_2) = 0$ if $d_1 \neq d_2$, and therefore $E[m(Y_i, D_i, \theta)] = 0$ means, for $d = 0, \ldots, T$,

$$E[Y_i 1(D_i = d) - \mu_d 1(D_i = d)] = 0$$

or

$$\mu_d = \frac{E[Y_i 1(D_i = d)]}{E[1(D_i = d)]} = E[Y_i \mid D_i = d] = E[Y_i(d)].$$

So the coefficient $\mu_d$ is interpreted as the average potential outcome under treatment $d$.
Then, for a given but general $m$, Zhang et al. (2008) recommended inference to be based on the moment

$$\psi(W_i, \theta_0, \phi) = m(Y_i, D_i, \theta_0) - \sum_{d=0}^T \left(1(D_i = d) - p_d\right)\phi_d(X_i),$$

where $p_d = P(D_i = d)$ and

$$\phi_d(X_i) = E[m(Y_i, D_i, \theta_0) \mid X_i, D_i = d]$$

is a nuisance function to be estimated. We shall show later that the estimator based on $E[\psi(W_i, \theta_0, \phi)] = 0$ is very well motivated, since it will be LR. This is a general method that can be applied to any $m$, but we will see how it simplifies for an important example of $m$.

For moments such as (6) with $\Lambda$ the identity, the nuisance parameter $\phi_d(X_i)$ has all components zero but the $d$-th component, which equals

$$\mu_d(X_i) - \mu_d.$$

In this case, $E[\psi(W_i, \theta_0, \phi)] = 0$ means, for $d = 0, \ldots, T$,

$$E\left[Y_i 1(D_i = d) - \mu_d 1(D_i = d) - \left(1(D_i = d) - p_d\right)\left(\mu_d(X_i) - \mu_d\right)\right] = 0,$$

so the solution $\theta_0 = (\mu_0, \ldots, \mu_T)$ is such that

$$
\begin{aligned}
\mu_d &= \frac{1}{p_d} E\left[Y_i 1(D_i = d) - \left(1(D_i = d) - p_d\right)\mu_d(X_i)\right] \\
&= \frac{1}{p_d} E\left[(Y_i - \mu_d(X_i))\, 1(D_i = d) + p_d\, \mu_d(X_i)\right] \\
&= E\left[\mu_d(X_i) + \frac{(Y_i - \mu_d(X_i))}{p_d}\, 1(D_i = d)\right]. \qquad (7)
\end{aligned}
$$

Before, we needed $\mu_d(X_i)$ to satisfy $\mu_d = E[\mu_d(X_i)]$, but now, even if this does not hold, by independence,

$$
\begin{aligned}
E\left[\mu_d(X_i) + \frac{(Y_i - \mu_d(X_i))}{p_d}\, 1(D_i = d)\right] &= E[\mu_d(X_i)] + E\left[\frac{Y_i}{p_d}\, 1(D_i = d)\right] - E\left[\frac{\mu_d(X_i)}{p_d}\, 1(D_i = d)\right] \\
&= E[\mu_d(X_i)] + E[Y_i(d)] - E[\mu_d(X_i)] \\
&= E[Y_i(d)].
\end{aligned}
$$

This is the LR property referred to before, where the moment is not sensitive to misspecification of $\mu_d(\cdot)$.

The sample analog of $\mu_d$ in (7) can be used as its estimator, which requires estimation of $\mu_d(X_i)$. ML methods can be used for estimating $\mu_d(X_i)$, as we shall see in future lectures.

The plug-in (ML-based) estimator for the ATE is the naive (covariate-adjusted) difference-in-means (also called Direct) estimator

$$\hat{\tau}_{DM} = \frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right),$$

for high-dimensional or ML estimators $\hat{\mu}_1(x)$ and $\hat{\mu}_0(x)$ of $\mu_1(x)$ and $\mu_0(x)$, respectively. The estimator $\hat{\tau}_{DM}$ will lead to large biases and it is thus not recommended, because

$$
\hat{\tau}_{DM} = \underbrace{\frac{1}{n}\sum_{i=1}^n \left(\mu_1(X_i) - \mu_0(X_i)\right)}_{\text{Infeasible Estimator}} + \underbrace{\frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \mu_1(X_i)\right)}_{\text{Large with ML } (\gg n^{-1/2})} - \underbrace{\frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_0(X_i) - \mu_0(X_i)\right)}_{\text{Large with ML } (\gg n^{-1/2})}.
$$

Wager, Du, Taylor and Tibshirani (2019) recommend instead the estimator

$$\hat{\tau}_{LR} = \frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right) + \frac{1}{n_1}\sum_{D_i = 1} \left[Y_i - \hat{\mu}_1(X_i)\right] - \frac{1}{n_0}\sum_{D_i = 0} \left[Y_i - \hat{\mu}_0(X_i)\right]. \qquad (8)$$

This choice is nicely motivated from the general construction in Zhang et al. (2008), as we have shown above in (7).
That is, define the estimated (uncentered) influence function

$$\hat{\psi}_i = \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right) + \frac{D_i\left[Y_i - \hat{\mu}_1(X_i)\right]}{\hat{p}} - \frac{(1 - D_i)\left[Y_i - \hat{\mu}_0(X_i)\right]}{1 - \hat{p}},$$

where $\hat{p} = n_1/n$ and $\hat{\mu}_d$ does not use the $i$-th observation. Let $\psi_i$ be defined as $\hat{\psi}_i$ but with known regressions $\mu_d(X_i)$ and $p$, and denote the infeasible LR estimator by

$$\tilde{\tau}_{LR} = \frac{1}{n}\sum_{i=1}^n \psi_i.$$

Then, simple algebra shows

$$
\begin{aligned}
\sqrt{n}\left(\hat{\tau}_{LR} - \tilde{\tau}_{LR}\right) &= \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\psi}_i - \psi_i\right) \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \mu_1(X_i)\right)\left(1 - D_i/p\right) \\
&\quad + \left(\hat{p}^{-1} - p^{-1}\right)\frac{1}{\sqrt{n}}\sum_{i=1}^n D_i\left(Y_i - \mu_1(X_i)\right) \\
&\quad - \left(\hat{p}^{-1} - p^{-1}\right)\frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \mu_1(X_i)\right) D_i \\
&\quad - \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_0(X_i) - \mu_0(X_i)\right)\left(1 - (1 - D_i)/(1 - p)\right) \\
&\quad - \left((1 - \hat{p})^{-1} - (1 - p)^{-1}\right)\frac{1}{\sqrt{n}}\sum_{i=1}^n (1 - D_i)\left(Y_i - \mu_0(X_i)\right) \\
&\quad + \left((1 - \hat{p})^{-1} - (1 - p)^{-1}\right)\frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_0(X_i) - \mu_0(X_i)\right)(1 - D_i) \\
&=: \sum_{j=1}^6 S_{nj}.
\end{aligned}
$$

By RCT1, $E[S_{nj}] = 0$ for $j = 1, 2, 4, 5$, and

$$
\begin{aligned}
V[S_{n1}] &= \frac{(1 - p)}{p}\, E\left[\left(\hat{\mu}_1(X_i) - \mu_1(X_i)\right)^2\right], \\
V[S_{n4}] &= \frac{p}{1 - p}\, E\left[\left(\hat{\mu}_0(X_i) - \mu_0(X_i)\right)^2\right], \\
S_{n2} &= S_{n5} = O_P(\hat{p} - p).
\end{aligned}
$$

So, under the mild consistency condition that $RMSE(\hat{\mu}_d) = \sqrt{E\left[\left(\hat{\mu}_d(X_i) - \mu_d(X_i)\right)^2\right]} \to 0$ and $\hat{p} = p + o_P(1)$, we conclude that $S_{nj} = o_P(1)$ for $j = 1, 2, 4$ and $5$. The more complicated terms are $S_{n3}$ and $S_{n6}$, which can be shown to satisfy

$$
\begin{aligned}
S_{n3} &= O_P\left(\sqrt{n}(\hat{p} - p)\, RMSE(\hat{\mu}_1)\right) = O_P\left(RMSE(\hat{\mu}_1)\right), \\
S_{n6} &= O_P\left(\sqrt{n}(\hat{p} - p)\, RMSE(\hat{\mu}_0)\right) = O_P\left(RMSE(\hat{\mu}_0)\right).
\end{aligned}
$$

Therefore, in an RCT, under mild consistency conditions on the ML method ($RMSE_d \to 0$), we have $\sqrt{n}(\hat{\tau}_{LR} - \tilde{\tau}_{LR}) = o_P(1)$ and hence

$$\sqrt{n}\left(\hat{\tau}_{LR} - \tau_{ATE}\right) \overset{d}{\to} N(0, V),$$

where $V = E\left[(\psi_i - \tau_{ATE})^2\right]$ is consistently estimated by

$$\hat{V} = \frac{1}{n}\sum_{i=1}^n \left(\hat{\psi}_i - \hat{\tau}_{LR}\right)^2.$$

A standard asymptotic 95% confidence interval for $\tau$ based on $\hat{\tau}_{LR}$ is

$$\hat{\tau}_{LR} \pm 1.96\sqrt{\hat{V}/n}.$$
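A minimal R sketch of the LR estimator (8) and its confidence interval follows, assuming vectors of fitted values mu1.hat and mu0.hat from any ML regressions of Y on X within each arm, ideally leaving out observation i (all names are illustrative):

lr.ate.rct <- function(Y, D, mu1.hat, mu0.hat) {
  n <- length(Y)
  p.hat <- mean(D)
  # uncentered influence function psi_i
  psi <- (mu1.hat - mu0.hat) +
    D * (Y - mu1.hat) / p.hat -
    (1 - D) * (Y - mu0.hat) / (1 - p.hat)
  tau.hat <- mean(psi)
  V.hat <- mean((psi - tau.hat)^2)
  c(estimate = tau.hat,
    lower = tau.hat - 1.96 * sqrt(V.hat / n),
    upper = tau.hat + 1.96 * sqrt(V.hat / n))
}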

How about other causal parameters of interest? In an RCT, $\tau_{ATT} = \tau_{ATE}$, since

$$\tau_{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1] = E[Y_i(1) - Y_i(0)].$$

Other causal parameters of interest are Conditional ATEs (CATEs),

$$\tau_{CATE}(x) = E[Y_i(1) - Y_i(0) \mid X_i = x].$$

From our results above, the CATE is identified as

$$\tau_{CATE}(x) = \mu_1(x) - \mu_0(x).$$

Estimating $\tau_{CATE}(x)$ for a vector $X$ containing a continuous covariate is a much harder statistical problem than estimating $\tau_{ATE}$. See, e.g., Künzel, Sekhon, Bickel and Yu (2019) for discussion. We discuss below some methods to investigate treatment effect heterogeneity in more general settings, which include the RCT as a special case.

It is worth investigating further the explicit expression for the asymptotic variance $V$. From the formula for $\psi_i$ above, it follows that

$$V = E\left[\left(\tau_{CATE}(X_i) - \tau_{ATE}\right)^2\right] + \frac{\sigma_1^2}{p} + \frac{\sigma_0^2}{1 - p}.$$

This result can be used to show that $p = \sigma_1/(\sigma_0 + \sigma_1)$ minimizes $V$. When $\sigma_0 = \sigma_1$, this gives the well-known case $p = 1/2$.

RCTs are the gold standard in causal empirical analysis, but they are rare in certain fields or for certain interesting economic situations (e.g. returns to education). For this reason, we will try to generalize these ideas to a broader context. The motivation for introducing covariates in the previous results is to improve efficiency, and it is not based on identification arguments.

2 Observational data under unconfoundedness


Why is the difference-in-means estimator not valid outside the RCT setting? Using that $Y_i = Y_i(0) + \tau_i D_i$ with $\tau_i = Y_i(1) - Y_i(0)$, it follows directly that

$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{Selection bias}} + E[\tau_i \mid D_i = 1].$$

There is no causal interpretation of the left-hand side because the selection bias term may not be zero. If the treatment indicator is a choice variable, as in many economic settings, this choice is likely to be related to $Y_i(0)$. For example, the decision to participate in a job training program may be related to wages in the absence of treatment.
We generalize the setting of the previous section to the case of unconfoundedness and overlap (Rosenbaum and Rubin 1983), defined as the setting where the following assumptions hold:

U1: $(Y_i(1), Y_i(0))$ is independent of $D_i$, conditional on $X_i$.

Define the propensity score $p(X_i) = P(D_i = 1 \mid X_i)$, and assume:

U2: $0 < p(x) < 1$ a.s.

Assumption U1 is unconfoundedness and Assumption U2 is overlap. Different names have been used for U1 and U2 (e.g. ignorability or selection on observables for U1, and common support for U2).

We first prove the identification of the CATE. Under Assumption U1,

$$\tau_{CATE}(X_i) = E[Y_i(1) \mid X_i, D_i = 1] - E[Y_i(0) \mid X_i, D_i = 0] = E[Y_i \mid X_i, D_i = 1] - E[Y_i \mid X_i, D_i = 0].$$

Moreover, under U1-U2 the ATE $\tau_{ATE} := E[Y_i(1) - Y_i(0)]$ is given by (cf. Rosenbaum and Rubin, 1983)

$$
\begin{aligned}
\tau_{ATE} &= E[\tau_{CATE}(X_i)] && \text{(Iterated expectations)} \\
&= E\big[E[Y_i \mid X_i, D_i = 1] - E[Y_i \mid X_i, D_i = 0]\big] \\
&= E\left[\frac{E[Y_i D_i \mid X_i]}{p(X_i)} - \frac{E[Y_i(1 - D_i) \mid X_i]}{1 - p(X_i)}\right] \\
&= E\left[\frac{Y_i D_i}{p(X_i)} - \frac{Y_i(1 - D_i)}{1 - p(X_i)}\right].
\end{aligned}
$$

A natural estimator based on this identifying equation is the plug-in Inverse Probability Weighting (IPW) estimator

$$\hat{\tau}_{IPW} = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i D_i}{\hat{p}(X_i)} - \frac{Y_i(1 - D_i)}{1 - \hat{p}(X_i)}\right),$$

where $\hat{p}(x)$ is an estimator of the propensity score, see Hirano, Imbens and Ridder (2003). As we will see, using an ML or high-dimensional estimator for $\hat{p}(x)$ will lead to large biases in $\hat{\tau}_{IPW}$. We will discuss LR versions of this estimator below. Note that estimating the propensity score is a prediction problem (estimating a conditional mean), because $p(X_i) = E[D_i \mid X_i]$.

Unfortunately, the estimator $\hat{\tau}_{IPW}$ will be very sensitive to the estimation of the propensity score. We will motivate in later sections a LR estimator of the form

$$\hat{\tau}_{LR} = \hat{\tau}_{IPW} + \underbrace{\frac{1}{n}\sum_{i=1}^n \left[\hat{\mu}_1(X_i)\left(1 - \frac{D_i}{\hat{p}(X_i)}\right) - \hat{\mu}_0(X_i)\left(1 - \frac{1 - D_i}{1 - \hat{p}(X_i)}\right)\right]}_{\text{mean zero noise}}.$$

This LR estimator is called the Augmented IPW (AIPW) estimator. It was introduced by Robins, Rotnitzky and Zhao (1994), and developed in a sequence of papers including Robins and Rotnitzky (1995) and Scharfstein, Rotnitzky, and Robins (1999).
Simple algebra shows

$$\hat{\tau}_{LR} = \frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right) + \frac{1}{n}\sum_{i=1}^n \frac{D_i\left[Y_i - \hat{\mu}_1(X_i)\right]}{\hat{p}(X_i)} - \frac{1}{n}\sum_{i=1}^n \frac{(1 - D_i)\left[Y_i - \hat{\mu}_0(X_i)\right]}{1 - \hat{p}(X_i)},$$

which is a generalization of (8) to observational data under unconfoundedness. The estimator $\hat{\tau}_{LR}$ is Double Robust (DR) in a sense that we will describe in later sections, and it is preferred to alternative estimators. Inference with this estimator is straightforward. Define the estimated (uncentered) influence function

$$
\begin{aligned}
\hat{\psi}_i &= \frac{Y_i D_i}{\hat{p}(X_i)} - \frac{Y_i(1 - D_i)}{1 - \hat{p}(X_i)} + \hat{\mu}_1(X_i)\left(1 - \frac{D_i}{\hat{p}(X_i)}\right) - \hat{\mu}_0(X_i)\left(1 - \frac{1 - D_i}{1 - \hat{p}(X_i)}\right) \\
&= \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right) + \frac{D_i\left[Y_i - \hat{\mu}_1(X_i)\right]}{\hat{p}(X_i)} - \frac{(1 - D_i)\left[Y_i - \hat{\mu}_0(X_i)\right]}{1 - \hat{p}(X_i)}.
\end{aligned}
$$

Let $\psi_i$ be defined as $\hat{\psi}_i$ but with known regressions and propensity scores replacing estimated quantities. As before, define the infeasible LR estimator

$$\tilde{\tau}_{LR} = \frac{1}{n}\sum_{i=1}^n \psi_i.$$
Then, as for the RCT, we write $\sqrt{n}(\hat{\tau}_{LR} - \tilde{\tau}_{LR})$ as

$$
\begin{aligned}
&\frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_1(X_i) - \mu_1(X_i)\right)\left(1 - D_i/p(X_i)\right) \\
&\quad + \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{p}^{-1}(X_i) - p^{-1}(X_i)\right) D_i\left(Y_i - \mu_1(X_i)\right) \\
&\quad - \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{p}^{-1}(X_i) - p^{-1}(X_i)\right)\left(\hat{\mu}_1(X_i) - \mu_1(X_i)\right) D_i \\
&\quad - \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_0(X_i) - \mu_0(X_i)\right)\left(1 - (1 - D_i)/(1 - p(X_i))\right) \\
&\quad - \frac{1}{\sqrt{n}}\sum_{i=1}^n (1 - D_i)\left(Y_i - \mu_0(X_i)\right)\left((1 - \hat{p}(X_i))^{-1} - (1 - p(X_i))^{-1}\right) \\
&\quad + \frac{1}{\sqrt{n}}\sum_{i=1}^n \left(\hat{\mu}_0(X_i) - \mu_0(X_i)\right)\left((1 - \hat{p}(X_i))^{-1} - (1 - p(X_i))^{-1}\right)(1 - D_i).
\end{aligned}
$$

In contrast to the RCT, now the rate of convergence for the propensity score estimator may be non-parametric. Nevertheless, under suitable conditions on $\hat{p}(x)$ and $\hat{\mu}_d(x)$ (to be described later), the leading term in the previous expansion is of the order $O_P\left(\sqrt{n}\, RMSE(\hat{p})\, RMSE(\hat{\mu}_1)\right)$, so the main rate condition on the ML methods is

$$RMSE(\hat{p})\, RMSE(\hat{\mu}_d) = o_P\left(n^{-1/2}\right),$$

where $RMSE(\hat{p}) = \sqrt{E\left[\left(\hat{p}(X_i) - p(X_i)\right)^2\right]}$. One such condition is that $\hat{p}(x)$ and $\hat{\mu}_d(x)$ are made of ML estimators that do not use the $i$-th observation when evaluating at $X_i$ (cross-fitting), thereby ensuring that some terms above are conditionally centered. Also, note that DR allows for a trade-off in estimating $\hat{p}(x)$ and $\hat{\mu}_d(x)$. We will investigate these issues in a more general setting below.
Then, under regularity conditions, which include consistent estimation of $\mu_d$ and $p$ by $\hat{\mu}_d$ and $\hat{p}$, respectively, we have $\sqrt{n}(\hat{\tau}_{LR} - \tilde{\tau}_{LR}) = o_P(1)$ and hence

$$\sqrt{n}\left(\hat{\tau}_{LR} - \tau_{ATE}\right) \overset{d}{\to} N(0, V),$$

where $V = E\left[(\psi_i - \tau_{ATE})^2\right]$ is consistently estimated by the sample analogue $\hat{V}$. The asymptotic variance has the expression

$$V = V\left[\mu_1(X_i) - \mu_0(X_i)\right] + E\left[\frac{V[Y_i(1) \mid X_i]}{p(X_i)}\right] + E\left[\frac{V[Y_i(0) \mid X_i]}{1 - p(X_i)}\right].$$

This is the best (i.e. smallest) possible variance for regularly estimating the ATE (the concept of regularity is beyond the scope of this course). That is, the LR estimator is the most efficient estimator (Hahn, 1998). Assumption U2 is often strengthened to bound the second and third terms. If the overlap assumption does not hold, then the variance of the ATE estimator will be high. It is therefore important in practice to check the overlap assumption by, for example, plotting a histogram or density plot of the estimated propensity score.

There are other efficient estimators for the ATE. For example, Hahn (1998) proposes the imputed estimator

$$\hat{\tau}_I = \frac{1}{n}\sum_{i=1}^n \left(\hat{Y}_{1i} - \hat{Y}_{0i}\right),$$

where $\hat{Y}_{1i} = D_i Y_i + (1 - D_i)\hat{\mu}_1(X_i)$ and $\hat{Y}_{0i} = (1 - D_i)Y_i + D_i\hat{\mu}_0(X_i)$. However, the estimator $\hat{\tau}_I$ is not LR. We will see LR ML imputation methods later in the course.

The LR estimator $\hat{\tau}_{LR}$ is preferred to the other estimators because it only requires correct specification of either the outcome regressions $\mu_d$ or the propensity score $p$. A standard asymptotic 95% confidence interval for $\tau$ based on $\hat{\tau}_{LR}$ is

$$\hat{\tau}_{LR} \pm 1.96\sqrt{\hat{V}/n}.$$
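A minimal R sketch of the AIPW / DR estimator of the ATE follows, assuming vectors of fitted values mu1.hat, mu0.hat and p.hat for $E[Y \mid X, D=1]$, $E[Y \mid X, D=0]$ and $P(D=1 \mid X)$, ideally obtained with cross-fitting (all names are illustrative):

aipw.ate <- function(Y, D, mu1.hat, mu0.hat, p.hat) {
  n <- length(Y)
  psi <- (mu1.hat - mu0.hat) +
    D * (Y - mu1.hat) / p.hat -
    (1 - D) * (Y - mu0.hat) / (1 - p.hat)
  tau.hat <- mean(psi)
  se <- sqrt(mean((psi - tau.hat)^2) / n)
  c(estimate = tau.hat, lower = tau.hat - 1.96 * se, upper = tau.hat + 1.96 * se)
}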

More generally, for a multi-valued treatment case,

$$\hat{\mu}_{LR,d} = \frac{1}{n}\sum_{i=1}^n \left(\hat{\mu}_d(X_i) + \frac{1(D_i = d)\left[Y_i - \hat{\mu}_d(X_i)\right]}{\hat{p}_d(X_i)}\right)$$

will be a LR estimator for $\mu_d$ (cf. (7)), where now $\hat{p}_d(X_i)$ estimates a generalized propensity score $p_d(X_i) = E[1(D_i = d) \mid X_i]$.

As before, the motivation to consider LR or DR moments is that we do not need $\mu_d(X_i)$ to be correctly specified to obtain

$$
\begin{aligned}
E\left[\mu_d(X_i) + \frac{(Y_i - \mu_d(X_i))}{p_d(X_i)}\, 1(D_i = d)\right] &= E[\mu_d(X_i)] + E\left[\frac{Y_i}{p_d(X_i)}\, 1(D_i = d)\right] - E\left[\frac{\mu_d(X_i)}{p_d(X_i)}\, 1(D_i = d)\right] \\
&= E[Y_i(d)],
\end{aligned}
$$

provided the propensity score is correctly specified. The same equality is true if $\mu_d(X_i)$ is correctly specified but the propensity score is not, since $(Y_i - \mu_d(X_i))$ then has conditional mean zero. These two properties confer the DR property mentioned before. For RCTs the propensity score was always correctly specified, which resulted in full robustness to misspecification of $\mu_d(X_i)$.
Another parameter of interest is the ATT, which is identified under weaker conditions than the ATE, since

$$
\begin{aligned}
\tau_{ATT} &= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 1] \\
&= E[Y_i \mid D_i = 1] - E\big[E[Y_i(0) \mid X_i, D_i = 1] \mid D_i = 1\big] \\
&= E[Y_i \mid D_i = 1] - E[\mu_0(X_i) \mid D_i = 1] \\
&= E\left[\frac{Y_i D_i}{p}\right] - E\left[\frac{\mu_0(X_i) D_i}{p}\right] \\
&= E\left[\frac{(Y_i - \mu_0(X_i)) D_i}{p}\right].
\end{aligned}
$$

More generally, $\mu_{d_1, d_2} = E[Y_i(d_1) \mid D_i = d_2]$ is identified as the solution to the moment

$$E\left[\left(\mu_{d_1}(X_i) - \mu_{d_1, d_2}\right) 1(D_i = d_2)\right] = 0,$$

and, as we will see later on, a LR moment for $\mu_{d_1, d_2}$ is

$$E\left[\mu_{d_1}(X_i)\, 1(D_i = d_2) + \frac{p_{d_2}(X_i)}{p_{d_1}(X_i)}\left(Y_i - \mu_{d_1}(X_i)\right) 1(D_i = d_1) - 1(D_i = d_2)\,\mu_{d_1, d_2}\right] = 0.$$

With this notation in place, $\tau_{ATT} = \mu_{1,1} - \mu_{0,1}$. As with previous estimands, the plug-in ATT estimator will be very sensitive to the specification of $\mu_0(X_i)$. Instead, we will use the LR estimator of $\tau_{ATT}$ given by

$$\hat{\tau}_{LR,ATT} = \frac{1}{n}\sum_{i=1}^n \underbrace{\left(D_i - (1 - D_i)\frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)}\right)\frac{\left(Y_i - \hat{\mu}_0(X_i)\right)}{\hat{p}}}_{=: \hat{\varphi}_i}.$$

As for the ATE, a consistent variance estimator is given by $\hat{V} = n^{-1}\sum_{i=1}^n \left(\hat{\varphi}_i - \hat{\tau}_{LR,ATT}\right)^2$.
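A minimal R sketch of the LR ATT estimator above follows, assuming fitted values mu0.hat of $E[Y \mid X, D=0]$ and p.hat of the propensity score (all names are illustrative):

lr.att <- function(Y, D, mu0.hat, p.hat) {
  n <- length(Y)
  p.bar <- mean(D)   # estimate of p = P(D = 1)
  phi <- (D - (1 - D) * p.hat / (1 - p.hat)) * (Y - mu0.hat) / p.bar
  tau.hat <- mean(phi)
  se <- sqrt(mean((phi - tau.hat)^2) / n)
  c(estimate = tau.hat, lower = tau.hat - 1.96 * se, upper = tau.hat + 1.96 * se)
}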
Thus, we see that for estimating causal parameters of interest, it is often necessary to estimate conditional means such as $\mu_d(X_i) = E[Y_i \mid X_i, D_i = d]$ and/or $p_d(X_i) = E[1(D_i = d) \mid X_i]$. Estimating these objects is the goal of Machine Learning (ML). See Varian (2014) and Athey and Imbens (2019) for a review of the uses of ML in economics.

We will provide evidence that, although these ML methods provide high quality predictions of these conditional means, causal inference based on plug-in estimators of these quantities is not reliable. We shall see that the problem is the (necessary) bias that ML methods have in high-dimensional settings. To solve the bias problem, we have modified the original identifying moment conditions into LR/DR versions. We will explain below how to obtain LR moments in general.

2.1 A Simulation Study


In this section we carry out a small simulation study to investigate the finite sample performance of the previously described ATE estimators. We consider two data generating processes (DGP):

Linear: $Y_i(0) = X_{i1} + X_{i2} + \varepsilon_{0i}$, $Y_i(1) = X_{i1} + X_{i3} + \varepsilon_{1i}$, $p(X_i) = 1/(1 + \exp(X_{i1}))$.

Nonlinear: $Y_i(0) = 4\cdot 1(X_{i1} > 0) + X_{i2}^2/2 + \varepsilon_{0i}$, $Y_i(1) = 4\cdot 1(X_{i1} > 0) + X_{i3}^2/2 + \varepsilon_{1i}$, $p(X_i) = (1 + \sin(X_{i1}))/2$,

where $(\varepsilon_{0i}, \varepsilon_{1i}, X_i)$ are independent with mean zero, $\varepsilon_{di} \sim N(0, 4)$, and $X_i \sim N(0, I_{20})$. The dimension of the covariates is $p = 20$. We run 1000 Monte Carlo simulations for different sample sizes $n \in \{100, 200, 400, 800, 1600\}$.
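A minimal R sketch of one simulated data set from these two DGPs (the function name and design labels are illustrative):

simulate.dgp <- function(n, design = c("linear", "nonlinear")) {
  design <- match.arg(design)
  X  <- matrix(rnorm(n * 20), n, 20)
  e0 <- rnorm(n, sd = 2)   # eps_0 ~ N(0, 4)
  e1 <- rnorm(n, sd = 2)   # eps_1 ~ N(0, 4)
  if (design == "linear") {
    Y0 <- X[, 1] + X[, 2] + e0
    Y1 <- X[, 1] + X[, 3] + e1
    ps <- 1 / (1 + exp(X[, 1]))
  } else {
    Y0 <- 4 * (X[, 1] > 0) + X[, 2]^2 / 2 + e0
    Y1 <- 4 * (X[, 1] > 0) + X[, 3]^2 / 2 + e1
    ps <- (1 + sin(X[, 1])) / 2
  }
  D <- rbinom(n, 1, ps)
  Y <- D * Y1 + (1 - D) * Y0
  list(Y = Y, D = D, X = X)
}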
In Figures 1 and 2 we report the box-plots for the Linear and Nonlinear DGPs for the DM-ATE based on OLS and Random Forest (RF) fits. We will describe these ML methods in future lectures. OLS is a parametric, correctly specified method for the linear model, and as such, it performs very well in terms of bias. Its variance, although large for small sample sizes, decreases with the sample size (at the rate $\sqrt{n}$). For the nonlinear model, OLS is based on a misspecified functional form, and therefore it leads to biased results which do not improve with the sample size. In contrast, RF, which is a nonparametric method, has large bias for small sample sizes but improves substantially for large sample sizes in both cases, linear and nonlinear.

Figure 1: Linear model. Direct ATE based on OLS (left) and RF (right).

Figure 2: Non-Linear model. Direct ATE based on OLS (left) and RF (right).

We now report in Figures 3 and 4 the results for the DR estimators. OLS remains valid
for the linear model, and its bias in the Non-Linear case reduces substantially due to the DR
property. DR-RF improves considerably for small sample sizes in terms of bias.

Figure 3: Linear model. DR-ATE based on OLS (left) and RF (right).

Figure 4: Non-Linear model. DR-ATE based on OLS (left) and RF (right).

In Figures 5 and 6 we report the results for IPW estimators, and their DR versions. IPW
has large bias for small sample sizes that is corrected with the DR estimator.

Figure 5: Linear model. ATE based on IPW (left) and DR-RF (right).

Figure 6: Non-Linear model. ATE based on IPW (left) and DR-RF (right).

To complete the picture, we report coverage probabilities at 95% for confidence intervals based on asymptotic normality for the different estimators and sample sizes. These results will help us to see which methods are reliable for inference and which methods are not. Table 1 reports the average coverage over 1000 MC simulations for the different sample sizes in the Linear model. As expected, the correctly specified parametric method, OLS, performs the best. The direct estimator based on OLS has small bias. In contrast, the direct RF estimate has a large bias, and its performance gets even worse with the sample size! We can see here how DM estimators are sensitive to the ML method used. In contrast, DR-based estimators perform uniformly well. The IPW estimator performs better than the DM-RF, but still has considerable bias. Table 2 reports the results for the Non-linear model. Now, the parametric OLS method is misspecified, which results in invalid inference. The nonparametric DM-RF inference leads to bias in small samples. DR methods correct most of the bias, including the large misspecification bias of OLS (this is the DR property), though they still have some small distortions (that can be corrected by cross-fitting, as explained below). IPW has large bias and coverage distortions. In sum, DR methods are preferred due to their robustness properties, which lead to uniformly valid performance.

Table 1: Coverage Probabilities at 95% for Linear Model


n \ estimator   DM-OLS   DM-RF   DR-OLS   DR-RF   IPW
100 0.958 0.632 0.713 0.829 0.761
200 0.956 0.452 0.861 0.859 0.715
400 0.935 0.300 0.875 0.867 0.690
800 0.933 0.204 0.905 0.911 0.605
1600 0.925 0.133 0.929 0.913 0.592

Table 2: Coverage Probabilities at 95% for Non-Linear Model


n \ estimator   DM-OLS   DM-RF   DR-OLS   DR-RF   IPW
100 0.875 0.060 0.608 0.888 0.451
200 0.753 0.243 0.812 0.906 0.469
400 0.531 0.700 0.900 0.918 0.496
800 0.234 0.875 0.921 0.923 0.545
1600 0.046 0.942 0.906 0.905 0.516

Part II

Machine Learning for Regression


We aim to learn the conditional mean function of a dependent variable $Y_i$ given a set of $p$ covariates (also called features, regressors or predictors) $X_i$,

$$\mu(x) := E[Y_i \mid X_i = x],$$

when we observe an iid sample $W^n = (W_i)_{i=1}^n$ of $W_i = (Y_i, X_i)$. Here, $X_i$ could contain discrete and continuous variables, as when $X_i = (X_{1i}, D_i)$ and $X_{1i}$ are the pre-treatment variables in observational data. When $Y_i$ is binary this is often called Classification; otherwise it is called Regression. This is a bit of a misnomer, since in either case $\mu(\cdot)$ is the regression function. There are, of course, many other statistical problems for which ML methods have been developed and which are not regression problems, e.g. clustering, and we will not discuss these other problems in this course. We will focus on the so-called supervised regression problems.
A general approach is to learn $\mu$ by solving the optimization problem

$$\hat{\mu}(\cdot) = \arg\min_{m \in \mathcal{M}} \mathbb{E}_n\left[(Y - m(X))^2\right], \qquad (9)$$

where $\mathcal{M}$ is a class of functions and $\mathbb{E}_n$ is the in-sample mean based on the observations $W^n$, i.e. for a function $f$ of $W$,

$$\mathbb{E}_n[f(W)] = \frac{1}{n}\sum_{i=1}^n f(W_i).$$

An important feature of the optimization problem (9) is that it is strictly convex if $\mathcal{M}$ is a convex set. This feature will have important consequences for the uniqueness of (in-sample) predictions $\hat{\mu}(\cdot)$.

The training Mean Squared Error (MSE) is

$$MSE_n = \mathbb{E}_n\left[(Y - \hat{\mu}(X))^2\right] = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{\mu}(X_i))^2,$$

where $\hat{\mu}(\cdot)$ has been computed with the same sample used in the averaging. The training MSE typically decreases with the complexity of the model $\mathcal{M}$ and must not be used as a measure of model accuracy for predictions. By construction, the bigger $\mathcal{M}$, the smaller $MSE_n$.

In ML the goal is often prediction, and the accuracy of the model $\hat{\mu}$ for prediction is evaluated on the basis of its out-of-sample MSE or risk, given by

$$r(\hat{\mu}) = E\left[(Y_0 - \hat{\mu}(X_0))^2\right],$$

where $(Y_0, X_0)$ is an independent copy of $W_i = (Y_i, X_i)$, independent of the sample $W^n = (W_i)_{i=1}^n$ used to construct $\hat{\mu}$, and satisfying

$$Y_0 = \mu(X_0) + \varepsilon,$$

where by construction $\varepsilon$ is the prediction error term $\varepsilon = Y_0 - E[Y_0 \mid X_0]$. The term $r(\hat{\mu})$ is also called the overall expected test MSE in ISLR. Note that $r(\hat{\mu})$ is a non-random scalar, and the expectation in $r(\hat{\mu})$ is joint in the distribution of $(Y_0, X_0, W^n)$.

To motivate the risk $r(\hat{\mu})$, consider first the simpler case where a prediction is to be made for a new, unseen but fixed covariate value $x_0$, with corresponding true value

$$Y(x_0) = \mu(x_0) + \varepsilon.$$

Substituting this equation we obtain

$$r(\hat{\mu}(x_0)) := E\left[(Y(x_0) - \hat{\mu}(x_0))^2\right] = \underbrace{E\left[(\mu(x_0) - \hat{\mu}(x_0))^2\right]}_{\text{Reducible risk}} + \underbrace{E[\varepsilon^2]}_{\text{Irreducible risk}}.$$

The reducible risk can be further decomposed as

$$\underbrace{E\left[(\mu(x_0) - E[\hat{\mu}(x_0)])^2\right]}_{\text{Squared Bias } b^2_{\hat{\mu}}(x_0)} + \underbrace{E\left[(\hat{\mu}(x_0) - E[\hat{\mu}(x_0)])^2\right]}_{\text{Variance } v_{\hat{\mu}}(x_0)},$$

where we have used that

$$E\left[(\mu(x_0) - E[\hat{\mu}(x_0)])(\hat{\mu}(x_0) - E[\hat{\mu}(x_0)])\right] = (\mu(x_0) - E[\hat{\mu}(x_0)])\, E\left[\hat{\mu}(x_0) - E[\hat{\mu}(x_0)]\right] = 0.$$

Coming back to the overall risk $r(\hat{\mu})$, we can condition on $X_0 = x_0$ and apply iterated expectations to obtain

$$r(\hat{\mu}) = E\left[b^2_{\hat{\mu}}(X_0) + v_{\hat{\mu}}(X_0)\right] + E[\varepsilon^2] = \text{Bias}^2(\hat{\mu}) + \text{Variance}(\hat{\mu}) + \text{Irreducible}.$$

The irreducible part is the risk that cannot be controlled by the choice of $\hat{\mu}$. The bias and variance parts are typically decreasing and increasing, respectively, as a function of the complexity of the model $\mathcal{M}$. Thus, there is a bias-variance tradeoff in the choice of $\hat{\mu}$. To optimize this tradeoff, it would be useful to know $r(\hat{\mu})$. Unfortunately, $r(\hat{\mu})$ depends on the data generating process and it is unknown.

To see why $r(\hat{\mu})$ depends on the data generating process, write

$$r(\hat{\mu}) = E\left[E\left[(Y_0 - \hat{\mu}(X_0))^2 \mid W^n\right]\right], \qquad E\left[(Y_0 - \hat{\mu}(X_0))^2 \mid W^n\right] \equiv \int (y_0 - \hat{\mu}(x_0))^2\, dP(y_0, x_0).$$

The term inside the outer expectation is called the conditional prediction risk $R(\hat{\mu})$ and it is unknown because $P$ is unknown. $R(\hat{\mu})$ is random, it changes with the sample $W^n$, and it is an unbiased estimator of $r(\hat{\mu})$.

Although $R(\hat{\mu})$ is unknown, it can be estimated from a hold-out sample $W^T = (W_t)_{t=1}^T$, independent of $W^n$, by

$$MSE_T(\hat{\mu}) = \frac{1}{T}\sum_{t=1}^T (Y_t - \hat{\mu}(X_t))^2 \equiv \mathbb{E}_T\left[(Y - \hat{\mu}(X))^2\right].$$

The quantity $MSE_T(\hat{\mu})$ is called the test MSE, which can be used to select the model $\hat{\mu}$ in $\mathcal{M}$ with the optimal complexity. Here, optimality is in terms of minimizing the risk (bias-variance trade-off).
Splitting the original sample in two, the training sample and the test sample, may lead to high variability in the test MSE estimate and to low precision in the training sample fit. To overcome these limitations, alternative cross-validation methods have been proposed.

One option is Leave-One-Out Cross-Validation (LOOCV). In LOOCV a single observation is used for the validation (test) set, say $(Y_1, X_1)$, while the remaining observations $(Y_2, X_2), \ldots, (Y_n, X_n)$ make up the training set. The ML method is fitted on the $n - 1$ observations, and a prediction $\hat{Y}_1 = \hat{\mu}^{(-1)}(X_1)$ is made for the excluded observation $X_1$ ($\hat{\mu}^{(-1)}$ here is based on the training sample except for $(Y_1, X_1)$). We estimate the test error by $MSE_1 = (Y_1 - \hat{Y}_1)^2$. The estimator $MSE_1$ is approximately unbiased, but it has large variance (it is based on one test observation!). We can repeat the procedure with the second observation and compute $MSE_2 = (Y_2 - \hat{Y}_2)^2$, and so on. The final estimate of the test error by LOOCV is

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_i.$$

This estimator has less bias and variability than the two-fold sample splitting above. The computational cost of $CV_{(n)}$ could be high, however, particularly if computing ML estimates on the training sample is costly, but there is a class of problems for which this cost is highly reduced. Consider a situation where computing the fitted values for a training sample of $n$ observations can be done algebraically as $\hat{\mathbf{Y}} = L_n \mathbf{Y}$, where $\mathbf{Y}$ is the $n \times 1$ vector of dependent variables $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$, $L_n$ is an $n \times n$ matrix and $\hat{\mathbf{Y}} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)'$ is the vector of fitted values $\hat{Y}_i = \hat{\mu}(X_i)$. Such methods are called linear prediction methods, and $L_n$ is called the hat matrix. The number $v = tr(L_n)$ is called the effective degrees of freedom, where $tr(A)$ is the trace of the matrix $A$ (the sum of the elements of the main diagonal). We will assume that $L_n \mathbf{1} = \mathbf{1}$, for $\mathbf{1}$ an $n \times 1$ vector of ones. Then, we will show in Exercise 3 that for a linear method

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat{\mu}(X_i)}{1 - L_{ii}}\right)^2, \qquad (10)$$

where $L_{ii}$ is the $i$-th diagonal element of the hat matrix $L_n$, and here $\hat{\mu}$ uses all the training data. Additionally, an approximation of $CV_{(n)}$, called the generalized cross-validation, replaces $L_{ii}$ by the average $\left(\sum_{i=1}^n L_{ii}\right)/n = v/n$, so that

$$GCV_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat{\mu}(X_i)}{1 - v/n}\right)^2. \qquad (11)$$

Furthermore, using the approximation $(1 - x)^{-2} \approx 1 + 2x$ (valid when $x \approx 0$), we see that the right-hand side of the last display can be approximated by

$$C_p := \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{\mu}(X_i))^2 + \frac{2v\hat{\sigma}^2}{n}, \qquad (12)$$

known as Mallows' $C_p$ statistic. This motivates an alternative and common approach for model selection: minimizing a penalized version of the training risk to avoid over-fitting.

LOOCV is a very general method and it can be used with any ML method, even nonlinear methods, while the formulas based on (10)-(12) are only valid for linear prediction methods for which $\hat{\mathbf{Y}} = L\mathbf{Y}$. Linear prediction methods include linear (in parameters) regression, Ridge, kernel smoothing, kernel machines, splines, and others, while nonlinear methods include Lasso, trees, Random Forest, Deep learning, etc.
An alternative to LOOCV is K-fold CV. This approach involves randomly dividing the set of observations into $K$ groups, or folds, $I_1, \ldots, I_K$, of approximately the same size. The first fold is treated as a validation set, and the method is fit with the other $K - 1$ folds. The estimator $MSE_1$ is then computed with the observations from the first fold as the test sample. This procedure is repeated $K - 1$ times with all the different folds playing the role of test set, and the final estimator is computed as

$$CV_{(K)} = \frac{1}{K}\sum_{i=1}^K MSE_i = \frac{1}{K}\sum_{i=1}^K \frac{1}{n_i}\sum_{j \in I_i} \left(Y_j - \hat{\mu}^{(-i)}(X_j)\right)^2,$$

where $n_i$ is the cardinality of $I_i$, $n_i = \#I_i$, and $\hat{\mu}^{(-i)}$ is computed with all observations but those in $I_i$. LOOCV is a special case of K-fold CV, with $K = n$. In practice, one typically performs K-fold CV with $K = 5$ or $10$, for computational considerations. K-fold CV is also preferred to LOOCV because it has less variance (the averaging in LOOCV is over highly correlated terms).
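A minimal R sketch of K-fold CV for a generic fitting method, where fit.fun and pred.fun are user-supplied training and prediction functions (all names are illustrative):

kfold.cv <- function(Y, X, K = 5, fit.fun, pred.fun) {
  n <- length(Y)
  folds <- sample(rep(1:K, length.out = n))   # random fold assignment
  mse <- numeric(K)
  for (k in 1:K) {
    test  <- which(folds == k)
    model <- fit.fun(Y[-test], X[-test, , drop = FALSE])
    pred  <- pred.fun(model, X[test, , drop = FALSE])
    mse[k] <- mean((Y[test] - pred)^2)
  }
  mean(mse)
}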
Summarizing, unbiased estimation of the risk is an important part of modern ML methods. It is the procedure used in practice for the evaluation of models and for model selection.

3 Linear Regression
Our starting point for ML methods is the classical linear $p$-dimensional regression model

$$Y_i = X_i'\beta_0 + \varepsilon_i, \qquad E[\varepsilon_i \mid X_i] = 0 \text{ a.s.}$$

For simplicity, we start with the strong assumption $E[\varepsilon_i \mid X_i] = 0$, which implies (and is implied by)

$$\mu(x) = x'\beta_0,$$

so $\mu$ is fully determined by $\beta_0$. In this case, we say the linear model is correctly specified. This is indeed a strong assumption with $p$ fixed, which is unlikely to be satisfied, so we will relax it. As we shall see, it is possible to do inference for linear models that is robust to misspecification. On the other hand, the linear model may be quite general because the variables in $X_i$ could be nonlinear transformations of some primitive covariates, and in that case we say the model is linear in parameters but non-linear in variables; e.g. $X_i$ could include powers of primitive variables such as $Z_i^2$, $Z_i^3$, or other transformations such as $\sin(Z_i)$.

The conditional moment restriction $E[\varepsilon_i \mid X_i] = 0$ implies $E[X_i \varepsilon_i] = 0$, that is,

$$E[X_i(Y_i - X_i'\beta_0)] = 0$$

or

$$E[X_i Y_i] = E[X_i X_i']\,\beta_0. \qquad (13)$$

If $E[X_i X_i']$ is non-singular, we can solve

$$\beta_0 = \left(E[X_i X_i']\right)^{-1} E[X_i Y_i].$$

This identification argument suggests the OLS estimator

$$\hat{\beta}_{OLS} = \left(\frac{1}{n}\sum_{i=1}^n X_i X_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n X_i Y_i\right).$$

The Law of Large Numbers (LLN) implies that $\hat{\beta}_{OLS}$ consistently estimates $\beta_0$. Moreover, the CLT and the Slutsky theorem^2 yield

$$\sqrt{n}\left(\hat{\beta}_{OLS} - \beta_0\right) \overset{d}{\to} N(0, V_{OLS}),$$

where $V_{OLS} = E[X_i X_i']^{-1}\, E[\varepsilon_i^2 X_i X_i']\, E[X_i X_i']^{-1}$ is the asymptotic variance, which is consistently estimated by

$$\hat{V}_{OLS} = \mathbb{E}_n[X_i X_i']^{-1}\, \mathbb{E}_n[\hat{\varepsilon}_i^2 X_i X_i']\, \mathbb{E}_n[X_i X_i']^{-1},$$

where $\hat{\varepsilon}_i = Y_i - X_i'\hat{\beta}_{OLS}$. Inference on $\beta_0$ based on $\hat{V}_{OLS}$ is robust to heteroskedasticity and also to model misspecification (i.e. $E[\varepsilon_i \mid X_i] \neq 0$ or $\mu(x) \neq x'\beta_0$, but $E[X_i \varepsilon_i] = 0$). Non-robust standard errors, such as those provided on pg. 66 of ISLR, should NOT be used.

^2 This theorem states that if $A_n \overset{p}{\to} A$ and $B_n \overset{d}{\to} B$, then $A_n B_n \overset{d}{\to} AB$.
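A minimal R sketch of OLS with the heteroskedasticity-robust variance estimator $\hat{V}_{OLS}$ above, computed directly from the sandwich formula (X is the $n \times p$ design matrix, including an intercept column; all names are illustrative):

ols.robust <- function(Y, X) {
  n <- nrow(X)
  XtX.inv  <- solve(crossprod(X) / n)               # E_n[X X']^{-1}
  beta.hat <- XtX.inv %*% (crossprod(X, Y) / n)     # OLS coefficients
  e.hat    <- as.vector(Y - X %*% beta.hat)         # residuals
  meat     <- crossprod(X * e.hat) / n              # E_n[e^2 X X']
  V.hat    <- XtX.inv %*% meat %*% XtX.inv          # robust asymptotic variance
  se <- sqrt(diag(V.hat) / n)
  cbind(estimate = as.vector(beta.hat), std.error = se)
}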
The OLS fitted values are computed as $\hat{\mathbf{Y}} = L_n \mathbf{Y}$, with hat matrix $L_n = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, where $\mathbf{X}$ is the $n \times p$ design matrix (with $i$-th row $X_i'$). OLS is a linear prediction method with degrees of freedom $v = tr(L_n) = tr((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}) = tr(I_p) = p$. The hat matrix $L_n$ is a projection matrix onto the column space of $\mathbf{X}$; therefore, if $\mathbf{X}$ contains an intercept then $L_n \mathbf{1} = \mathbf{1}$, which was needed for the LOOCV formula. The diagonal term $L_{ii}$ is called the leverage and it measures the sensitivity of the prediction to any change in the variable $Y_i$.
The prediction of the linear model for a new observation $x_0$ is $\hat{\mu}_{OLS}(x_0) = x_0'\hat{\beta}_{OLS}$, with a corresponding (approximately) 95% confidence interval

$$\hat{\mu}_{OLS}(x_0) \pm 2\sqrt{x_0'\hat{V}_{OLS}\,x_0/n}.$$

This is different from a prediction interval, which aims at predicting $Y(x_0) = x_0'\beta_0 + \varepsilon_i$ and thus has the corresponding prediction interval

$$\hat{\mu}_{OLS}(x_0) \pm 2\sqrt{x_0'\hat{V}_{OLS}\,x_0/n + \hat{\sigma}^2},$$

where $\hat{\sigma}^2 = \mathbb{E}_n[\hat{\varepsilon}_i^2]$ and $\hat{\varepsilon}_i = Y_i - X_i'\hat{\beta}_{OLS}$. Note that prediction intervals are also centered at the point estimate $\hat{\mu}_{OLS}(x_0)$, but they are wider than confidence intervals. Also note that they should be computed with robust standard errors.
OLS inference relies, however, on the assumption:

OLS1: $E[X_i X_i']$ is non-singular.

The related condition that $\mathbb{E}_n[X_i X_i'] = \mathbf{X}'\mathbf{X}/n$ be non-singular rules out $p > n$ (see Exercise 4). Under Assumption OLS1, $\beta_0$ is just-identified (thus, the linear model with the only restriction being $E[X_i \varepsilon_i] = 0$ cannot be rejected by the data). Under the stronger condition $E[\varepsilon_i \mid X_i] = 0$, the model is over-identified, it can be rejected, and there are more efficient estimators than OLS, see Chamberlain (1987).

We have motivated OLS from a method of moments perspective, but it is clear that OLS falls under our general constrained least squares setting above with a linear class of models $\mathcal{M} = \{m(x) = x'\beta : \beta \in \mathbb{R}^p\}$. The optimal linear predictor $X_i'\beta_0$ solves

$$X_i'\beta_0 = \arg\min_{m:\, m(x) = x'\beta} E\left[(Y_i - m(X_i))^2\right],$$

or equivalently (13), and $X_i'\beta_0$ is uniquely identified even when OLS1 fails. To see this, let $\beta_1$ be another parameter vector satisfying $E[X_i Y_i] = E[X_i X_i']\beta_1$ with $\beta_0 \neq \beta_1$. Then,

$$
\begin{aligned}
E[X_i X_i'](\beta_1 - \beta_0) = 0 &\implies (\beta_1 - \beta_0)'\, E[X_i X_i']\, (\beta_1 - \beta_0) = 0 \\
&\iff E\left[\left((\beta_1 - \beta_0)' X_i\right)^2\right] = 0 \\
&\iff (\beta_1 - \beta_0)' X_i = 0 \\
&\iff \beta_1' X_i = \beta_0' X_i.
\end{aligned}
$$
At a more general level, uniqueness follows from the strict convexity of the least squares objective in $\beta' X_i$. However, note that $\beta_0$ is under-identified when OLS1 fails. The problem of identification and inference on (in-sample) predictions is easier than the problem of identification and inference on slope coefficients. The problem of inference on (out-of-sample) predictions is generally equivalent to the problem of inference on slope coefficients. To give a more precise statement, identification of $\beta_0' x_0$ for a fixed vector of covariates $x_0$ requires $x_0' v = 0$ for all $v$ such that $\Sigma v = 0$. This can be seen to be equivalent to $x_0 = \Sigma a$ for some $a$. In fact, $\beta_0 + v$, with $\Sigma v = 0$, solves the population OLS normal equations, but

$$x_0'(\beta_0 + v) = x_0'\beta_0 + x_0' v = x_0'\beta_0 + a'\Sigma v = x_0'\beta_0.$$
The condition $x_0 = \Sigma a$ also implies (and is implied by)

$$\inf_{\beta:\, x_0'\beta \neq 0} \frac{\beta'\Sigma\beta}{(x_0'\beta)^2} > 0,$$

which can be shown by a generalization of the Cauchy-Schwarz inequality.

The risk of OLS, $r(\hat{\mu}_{OLS})$ with $\hat{\mu}_{OLS}(x) = x'\hat{\beta}_{OLS}$, can be bounded following Theorem 11.3 of Gyorfi, Kohler, Krzyzak and Walk (2002).

Theorem. If $\sup_x E[\varepsilon_i^2 \mid X_i = x] < \infty$ and all random variables are bounded by $L$, then

$$r(\hat{\mu}_{OLS}) \leq 8\,E\left[\left(X_i'\beta_0 - \mu(X_i)\right)^2\right] + L\,\frac{p\left(\log(n) + 1\right)}{n} + \sigma^2.$$

The first term accounts for the approximation bias, the second term for the variance and the last term for the irreducible risk. Note the inequality above is a finite sample bound. The proof is straightforward but very long, and hence it is omitted. Note that the variance term is linear in the number of parameters to estimate. In general, $r(\hat{\mu}_{OLS})$ is at least of the order of $p/n$ in least squares, so when $p$ is large OLS prediction does not behave well. This motivates looking for alternative procedures for High-Dimensional (HD) settings.
The MSE $MSE(\beta_0) := r(\beta_0)$ is estimated by the out-of-sample mean $MSE_T(\hat{\mu}_{OLS}) = \mathbb{E}_T\left[(Y - \hat{\mu}_{OLS}(X))^2\right]$. Though these quantities are routinely employed in ML, inference on $r(\beta_0)$ is often not considered. In contrast, there is a long tradition in econometrics of doing inference on $r(\beta_0)$. In particular, West (1996) obtained the asymptotic distribution theory for the standardized quantity

$$\sqrt{T}\left(MSE_T(\hat{\mu}_{OLS}) - MSE(\beta_0)\right).$$

See also Escanciano and Olmo (2010). A key component in the asymptotic analysis of this quantity is the Estimation Risk (ER) component, defined via the decomposition

$$
\sqrt{T}\left(\mathbb{E}_T\left[(Y - X'\beta_0)^2\right] - E\left[(Y - X'\beta_0)^2\right]\right) + \underbrace{\sqrt{T}\,\mathbb{E}_T\left[\left(Y - X'\hat{\beta}_{OLS}\right)^2 - \left(Y - X'\beta_0\right)^2\right]}_{\text{Estimation Risk}}.
$$

In the classical fixed-$p$ setting, a standard Taylor expansion argument and the asymptotic linearity of $\sqrt{T}(\hat{\beta}_{OLS} - \beta_0)$ yield an asymptotic normal distribution for the ER (cf. West 1996). That is, as long as $\sqrt{T}(\hat{\beta}_{OLS} - \beta_0) = O_P(1)$, under regularity conditions

$$ER \doteq E\left[\dot{f}_t(\beta_0)\right]'\sqrt{T}\left(\hat{\beta}_{OLS} - \beta_0\right),$$

where $\dot{f}_t(\beta_0) = -2(Y - X'\beta_0)X$ is the derivative of the squared error loss $f_t(\beta) = (Y - X'\beta)^2$. Therefore, in the low-dimensional case, since $E[\dot{f}_t(\beta_0)] = 0$, the condition $\sqrt{T}(\hat{\beta}_{OLS} - \beta_0) = O_P(1)$ implies that

$$ER = o_P(1).$$

This result is discussed in West (1996, pg. 1073). As a result, the CLT implies that a 95% asymptotic confidence interval for $MSE(\beta_0)$ is given by

$$MSE_T(\hat{\mu}_{OLS}) \pm 2\sqrt{\hat{V}/T},$$

where

$$\hat{V} = \mathbb{E}_T\left[\left(\left(Y - X'\hat{\beta}_{OLS}\right)^2 - MSE_T(\hat{\mu}_{OLS})\right)^2\right].$$

However, the assumption of $\sqrt{T}$-consistency, $\sqrt{T}(\hat{\beta} - \beta_0) = O_P(1)$, is too strong in the HD setting with $p \to \infty$. We shall see with a simple example that the ER may no longer be asymptotically bounded in a HD setting.
Consider a highly stylized example of a Gaussian prediction error $\varepsilon_i = Y_i - \beta_0' X_i$, with zero mean, independent and identically distributed (iid), independent of $\mathcal{X}$, the $\sigma$-algebra generated by $\{X_i\}_{i=-\infty}^{\infty}$, and with variance $\sigma^2 = E[\varepsilon_i^2]$. Simple algebra (see Exercise 5) shows that the conditional distribution of the ER, conditional on $\mathcal{X}$ and $\hat{\beta}_{OLS}$, is normal, with mean and variance given by

$$ER \mid \mathcal{X}, \hat{\beta}_{OLS} \sim N\left(\sqrt{T}\,\hat{r}^2,\; 4\sigma^2\hat{r}^2\right),$$

where

$$\hat{r}^2 = \mathbb{E}_T\left[\left(X'(\hat{\beta}_{OLS} - \beta_0)\right)^2\right] = \frac{1}{T}\sum_{i=1}^T \left(X_i'(\hat{\beta}_{OLS} - \beta_0)\right)^2 = (\hat{\beta}_{OLS} - \beta_0)'\,\hat{\Sigma}\,(\hat{\beta}_{OLS} - \beta_0),$$

and $\hat{\Sigma} = \mathbb{E}_T[XX']$.

In a HD setting, if $\sqrt{T}\hat{r}^2 \to_P \infty$, then $ER \to_P +\infty$. Heuristically, we expect

$$\sqrt{T}\,\hat{r}^2 \approx \frac{\sqrt{T}\,p}{n} \approx \frac{p}{\sqrt{n}}.$$

Thus, if $p$ is very large relative to $\sqrt{n}$, the risk of OLS may diverge, due to the estimation risk. Having an in-sample size relatively larger than the out-of-sample size may help to control the estimation risk. Typically, $T \approx n$, so we need $p \ll \sqrt{n}$ for the risk to be under control (and inference being valid as usual). This discussion motivates the assumption of sparsity in HD settings, which we will discuss in more detail below when introducing Lasso.

4 Ridge
One approach that works for linear models even when $p \geq n$ is Ridge regression (Hoerl and Kennard 1970). This method has a much older history in mathematics, where it is known as Tikhonov regularization. Here the model includes an $L_2$ penalization of the parameter $\beta$ to reduce the variance of the resulting estimator. The optimal predictor $X_i'\beta_0$ under Ridge is estimated by solving the constrained problem

$$\hat{\mu}_{Ridge} = \arg\min_{m:\, m(x) = x'\beta,\ \|\beta\|_2 \leq k} \mathbb{E}_n\left[(Y_i - m(X_i))^2\right],$$

where $\|\beta\|_2 = \sqrt{\beta'\beta}$ is the standard Euclidean norm. Often this is implemented in the Lagrangian penalization form, such as $\hat{\mu}_{Ridge}(x) = x'\hat{\beta}_{Ridge}$ with

$$\hat{\beta}_{Ridge} = \arg\min_{\beta}\ \mathbb{E}_n\left[(Y_i - X_i'\beta)^2\right] + \frac{\lambda_n}{n}\|\beta\|_2^2,$$

with $\lambda_n/n \downarrow 0$ with the sample size. The first order condition of this optimization problem is

$$-\mathbb{E}_n\left[X_i\left(Y_i - X_i'\hat{\beta}_{Ridge}\right)\right] + \frac{\lambda_n}{n}\hat{\beta}_{Ridge} = 0,$$

which can be solved for $\hat{\beta}_{Ridge}$ in closed form as

$$\hat{\beta}_{Ridge} = \left(\mathbb{E}_n[X_i X_i'] + \frac{\lambda_n}{n}I_p\right)^{-1}\mathbb{E}_n[X_i Y_i],$$

where $I_p$ is the identity matrix of order $p$. Note that when $\lambda_n = 0$ we get back OLS, as expected, assuming $\mathbb{E}_n[X_i X_i']$ is non-singular. When $\mathbb{E}_n[X_i X_i']$ is singular, and OLS cannot be used, $\mathbb{E}_n[X_i X_i']$ is replaced by the positive definite matrix $\mathbb{E}_n[X_i X_i'] + \frac{\lambda_n}{n}I_p$, so that $\hat{\beta}_{Ridge}$ is always well defined.
What is $\hat{\beta}_{Ridge}$ estimating? As usual, it depends on the assumptions considered. If OLS1 holds, then $\hat{\beta}_{Ridge}$ estimates the unique slope of the optimal linear predictor. If OLS1 does not hold, then $\hat{\beta}_{Ridge}$ estimates the minimum norm solution to the set $\{\beta : E[X_i Y_i] = E[X_i X_i']\beta\}$, i.e.

$$\beta_0 = \arg\min_{\beta:\, E[X_i Y_i] = E[X_i X_i']\beta} \|\beta\|_2.$$

That is, although there may be an infinite number of solutions to the OLS normal equations, there is one and only one with minimum $L_2$ norm (see Exercise 6). In any case, $X_i'\hat{\beta}_{Ridge}$ estimates consistently the unique optimal linear predictor $\mu_0(X_i) = X_i'\beta_0$. Ridge allows for inference on predictions in cases where OLS is not applicable (when OLS1 fails).

The risk of Ridge, $r(\hat{\mu}_{Ridge})$ with $\hat{\mu}_{Ridge}(x) = x'\hat{\beta}_{Ridge}$, is bounded in e.g. Hsu, Kakade and Zhang (2014). We define $\Sigma = E[X_i X_i']$.

Theorem. If $\sup_x E[\varepsilon_i^2 \mid X_i = x] < \infty$ and all random variables are bounded by $L$, then

$$r(\hat{\mu}_{Ridge}) - r(\mu_0) \leq \left(1 + O\left(\frac{1 + L^2/\lambda_n}{n}\right)\right)\left(\frac{\lambda_n\|\beta_0\|_2^2}{2n} + \frac{tr(\Sigma)}{2\lambda_n}\right).$$

The first term in the second factor is the bias term, and the second term is the variance term. Increasing $\lambda_n$ increases the bias term but decreases the variance term (since the norm of $\beta$ is more penalized; this is called shrinkage). Ridge introduces regularization into the OLS objective function. Regularization is a common feature of ML methods (regularization to reduce the variance of highly variable prediction methods).
We will obtain the asymptotic normality of Ridge regression as a special case of a more general result given below. Additionally, we will see that Ridge is also a special case of a more general approach called Kernel Machines.

For the choice of $\lambda_n$ in practice we can use CV or GCV (11), with the corresponding hat matrix for Ridge equal to

$$L_n = \mathbf{X}\left(\mathbf{X}'\mathbf{X} + \lambda_n I_p\right)^{-1}\mathbf{X}',$$

where recall $\mathbf{X}$ is the $n \times p$ design matrix.
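A minimal R sketch of the closed-form Ridge estimator with lambda chosen by the GCV criterion (11) and the Ridge hat matrix above (names and the lambda grid are illustrative; in practice the regressors would first be standardized):

ridge.gcv <- function(Y, X, lambdas = 10^seq(-3, 3, length.out = 50)) {
  n <- nrow(X); p <- ncol(X)
  gcv <- sapply(lambdas, function(lam) {
    H <- X %*% solve(crossprod(X) + lam * diag(p), t(X))  # hat matrix L_n
    v <- sum(diag(H))                                     # effective degrees of freedom
    mean(((Y - H %*% Y) / (1 - v / n))^2)                 # GCV criterion (11)
  })
  lam.best <- lambdas[which.min(gcv)]
  beta.hat <- solve(crossprod(X) + lam.best * diag(p), crossprod(X, Y))
  list(lambda = lam.best, beta = beta.hat)
}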

5 Best subset selection


Another approach to the high-dimensionality problem is to do variable selection, where we try to find a good subset of the covariates $X_i = (X_{i1}, \ldots, X_{ip})$. Let $S$ be a subset of $\{1, 2, \ldots, p\}$ and let $X_{iS} = \{X_{ij} : j \in S\}$. Fix $k < \min(p, n)$ and let $\mathcal{S}_k$ be all subsets of size $k$ from $\{1, 2, \ldots, p\}$. Then, we would like to choose $S \in \mathcal{S}_k$ to minimize

$$\mathbb{E}_n\left[\left(Y_i - X_{iS}'\beta_S\right)^2\right].$$

This is equivalent to

$$\min_{\|\beta\|_0 \leq k} \mathbb{E}_n\left[\left(Y_i - X_i'\beta\right)^2\right], \qquad (14)$$

where $\|\beta\|_0$ is the number of non-zero elements of $\beta$. As $k$ increases, the bias decreases but the variance increases. Unfortunately, the above minimization problem is too hard (NP-hard). Two solutions that have been exploited are greedy approximation (forward stepwise regression) and convex relaxation (Lasso), which we analyze below.

6 Forward stepwise regression


Forward stepwise regression is a greedy approximation to the best subset regression discussed
above. It begins with a model containing no predictors, and then adds predictors to the
model, one-at-a-time, until all of the predictors are in the model. The speci…c algorithm for
forward stepwise is given as follows:
Algorithm (Forward Stepwise Regression)

1. Let M0 denote the null model, which contains no predictors.

2. For k = 0; :::; p 1:

36
(a) Consider all p k models that augment the predictors in Mk with one additional
predictor.
(b) Choose the best, de…ned as having the smallest RSS or highest R2 ; among these
p k models, and call it Mk+1 :

3. Select a single best model from among M0 ; :::; Mp using cross-validated prediction
error, Cp ; or other model selection methods.

Unlike best subset selection, which requires fitting $2^p$ models, forward stepwise regression requires fitting only $1 + p(p+1)/2$ models, a much smaller number for moderate and large values of $p$. Forward stepwise regression can be applied in the HD case where $p > n$, though in this case only models $\mathcal{M}_0,\ldots,\mathcal{M}_{n-1}$ can be constructed, since each submodel is fit using OLS, which does not yield a unique solution if $p\ge n$.

7 Lasso
An alternative to greedy approximation (forward stepwise regression) is convex relaxation (Lasso). That is, the problem (14), which is not a convex optimization problem, is relaxed to the convex problem
$$ \min_{\|\beta\|_1\le k} E_n\left[(Y_i - X_i'\beta)^2\right], \qquad (15) $$
where $\|\beta\|_1 = \sum_{j=1}^p|\beta_j|$ is the $L_1$ norm of the coefficients. In a certain sense, which is beyond the scope of these notes, (15) is the closest convex problem to (14). Like Ridge, this is often implemented in a Lagrangian penalization form with the objective function
$$ Q_n(\beta) = \frac{1}{n}\sum_{i=1}^n(Y_i - X_i'\beta)^2 + \frac{\lambda_n}{n}\sum_{j=1}^p|\beta_j|, $$
where $\beta = (\beta_1,\ldots,\beta_p)'$ and $\lambda_n/n$ is a sequence of positive numbers converging to zero. This objective function is convex, which makes the computation fast. The Lasso estimator of Tibshirani (1996) solves
$$ \hat\beta_{Lasso} = \arg\min_\beta Q_n(\beta). $$

Unlike Ridge, this estimator does not have a closed-form solution and requires convex op-
timization routines for its computation. It is also a penalized/regularized version of OLS,
but with a di¤erent norm than Ridge. This choice of norm is responsible for an important
feature of Lasso: the Lasso estimator ^Lasso is often sparse, having many components exactly
equal to zero (feature selection). In contrast, Ridge estimators are not sparse. This feature

of Lasso is considered an advantage, as it improves the interpretation of the resulting model
…t. Both approaches shrink the estimates toward zero. As with Ridge, the parameter n
controls the amount of shrinkage. A large value of n decreases variance but increases bias.
Its choice is often done by CV.
We note that unlike Ridge, the Lasso estimator may not be unique, though the predicted
value in sample is, i.e. Xi0 ^Lasso is unique. However, ^Lasso will be unique under weak
assumptions on the distribution of covariates (e.g. having a continuous component).
As a practical note, $X_i$ in Lasso regressions does not include an intercept; in practice researchers often center the variables, making them zero mean, so that only slope coefficients are penalized. Similarly, Ridge regression fitted values are not scale invariant, which motivates standardizing the regressors (making them zero mean and unit variance) before applying Ridge predictions.
In comparing Ridge and Lasso estimators the following simple example is useful. Consider the case where the design matrix $\mathbf{X}$ is the identity matrix, with $p = n$, and there is no intercept. In this case there is a simple solution for the $j$-th component of OLS, Ridge and Lasso, given by
$$ \hat\beta_{OLS,j} = Y_j, \qquad \hat\beta_{Ridge,j} = \frac{Y_j}{1+\lambda_n}, $$
and
$$ \hat\beta_{Lasso,j} = \begin{cases} Y_j - \lambda_n/2 & \text{if } Y_j > \lambda_n/2, \\ Y_j + \lambda_n/2 & \text{if } Y_j < -\lambda_n/2, \\ 0 & \text{if } |Y_j| \le \lambda_n/2. \end{cases} $$
We see that Ridge shrinks the OLS estimator toward zero by the same proportion for all variables. Lasso instead shrinks the OLS estimator toward zero by the same amount for all variables whose OLS coefficient is larger in absolute value than $\lambda_n/2$, and it shrinks entirely to zero those coefficients whose OLS estimate is smaller than $\lambda_n/2$ in absolute value. This type of shrinkage is called soft-thresholding.
In all this discussion, we can see the importance of the choice of n : For this, and in
the general case, we often use CV (here GCV is not possible because Lasso is not a linear
method).
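A short sketch of the soft-thresholding map and of a cross-validated penalty choice, using scikit-learn's LassoCV as one off-the-shelf option (the data-generating process is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def soft_threshold(y, t):
    """Soft-thresholding: the Lasso solution in the identity-design example above."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

rng = np.random.default_rng(2)
n, p = 200, 100
X = rng.normal(size=(n, p))
beta0 = np.zeros(p); beta0[:5] = [2, -1.5, 1, 0.5, -0.5]     # sparse truth
y = X @ beta0 + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X, y)                               # penalty chosen by 5-fold CV
print("chosen penalty:", lasso.alpha_)
print("number of nonzero coefficients:", np.sum(lasso.coef_ != 0))
```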
There is a well established theory for Lasso, both asymptotic and in …nite samples. The
asymptotic distribution of Lasso for …xed p has been studied by Knight and Fu (2000). These

authors investigated the asymptotic distribution of the estimator
$$ \hat\beta_\gamma = \arg\min_\beta Q_{n,\gamma}(\beta), $$
in the fixed-$p$ case, where
$$ Q_{n,\gamma}(\beta) = \frac{1}{n}\sum_{i=1}^n(Y_i - X_i'\beta)^2 + \frac{\lambda_n}{n}\sum_{j=1}^p|\beta_j|^\gamma, $$
and $\gamma > 0$. Note that Ridge corresponds to $\gamma = 2$, while Lasso corresponds to $\gamma = 1$. For $\gamma\le 1$, the estimator $\hat\beta_\gamma$ has a sparse solution (some components exactly zero). Best subset can be seen as the limit case $\gamma\to 0$, since
$$ \lim_{\gamma\to 0}\sum_{j=1}^p|\beta_j|^\gamma = \sum_{j=1}^p 1(\beta_j\ne 0). $$
Knight and Fu (2000) also considered fixed regressors and a non-singular design matrix for the most part. Set $\Sigma = E[XX']$ and $Z\sim N(0,\sigma^2\Sigma)$.
Theorem (Theorem 2, Knight and Fu 2000). Let OLS1 hold, $\gamma\ge 1$, and $\lambda_n/\sqrt{n}\to\lambda_0\ge 0$. Then
$$ \sqrt{n}\left(\hat\beta_\gamma - \beta_0\right)\xrightarrow{d}\arg\min_u V(u), $$
where, for $\gamma>1$,
$$ V(u) = -2u'Z + u'\Sigma u + \lambda_0\gamma\sum_{j=1}^p u_j\,\mathrm{sgn}(\beta_{0j})|\beta_{0j}|^{\gamma-1}, $$
and for $\gamma=1$,
$$ V(u) = -2u'Z + u'\Sigma u + \lambda_0\sum_{j=1}^p\left[u_j\,\mathrm{sgn}(\beta_{0j})1(\beta_{0j}\ne 0) + |u_j|\,1(\beta_{0j}= 0)\right]. $$

There are some important implications of this result: (i) If $\lambda_0 = 0$, the limiting distribution is the same as for OLS; (ii) unfortunately, the optimal choice of $\lambda_n$ (in terms of minimizing MSE) is such that $\lambda_0 > 0$. Thus, even in the fixed-$p$ case, cross-validated choices of $\lambda_n$ lead to an asymptotic bias that invalidates inference. This is a regularization bias. For example, for Ridge regression, $u^* = \arg\min_u V(u)$ satisfies
$$ -2Z + 2\Sigma u^* + 2\lambda_0\beta_0 = 0, $$
and hence $u^* = \Sigma^{-1}Z - \lambda_0\Sigma^{-1}\beta_0$ and
$$ \sqrt{n}\left(\hat\beta_2 - \beta_0\right)\xrightarrow{d}N(b,\sigma^2\Sigma^{-1}), $$
with asymptotic bias $b = -\lambda_0\Sigma^{-1}\beta_0$ (see Exercise 7). If $b\ne 0$, classical confidence intervals based on asymptotic normality will be invalid; (iii) Lasso with $\lambda_0>0$ is not asymptotically normal when some component of $\beta_0$ is zero, and it is asymptotically biased in general.
For the HD setting where $p\to\infty$ with $n$, the available theory for Lasso requires some strong sparsity and restricted eigenvalue conditions on the distribution of $X$. The out-of-sample predictive risk of Lasso, $\hat r^2$, or its expected value $r^2 = E[(X'(\hat\beta_{Lasso}-\beta_0))^2]$, as well as other related quantities such as $L_2$ rates and in-sample predictive risk, have been extensively investigated in the literature. Greenshtein and Ritov (2004) provided general conditions for consistency in the iid case, i.e. conditions for $r^2\to 0$ as $n,p\to\infty$. These authors called the estimator persistent when such a consistency condition holds. Moreover, under general conditions on the predictors only the so-called slow rates are possible, where
$$ r^2 \lesssim \|\beta_0\|_1\sqrt{\frac{\log p}{n}}. \qquad (16) $$

Our important observation is that in a setting with (i) slow rates of convergence for learners
(as one would expect in a truly high dimensional setting) and (ii) general conditions on the
sample splitting, the ER diverges.
We give the following result on the risk of Lasso in (15) by Greenshtein and Ritov (2004). This result does not require assumptions of correct specification or restricted eigenvalue conditions.
Theorem. If all random variables are bounded by $L$ and $\hat\gamma_{Lasso}$ solves (15), then with probability at least $1-\delta$,
$$ r(\hat\gamma_{Lasso}) - r(\gamma_0) \le \sqrt{\frac{16(L+1)^2k^2}{n}\log\!\left(\frac{\sqrt{2}\,p}{\sqrt{\delta}}\right)}. $$

The discussion of rates of convergence for Lasso in the HD setting, though interesting, is
a bit technical, and thus we leave it out of the scope of this course.

8 Trees: Bagging, Random Forest, Boosting
This section studies tree-based methods. We start with basic decision trees. They are not
competitive in terms of prediction accuracy, but they form the basis for more competitive
methods such as Random Forest or Boosting, which combine multiple trees into a single
prediction. A tree forms a partition of the feature space into $J$ regions $R = (R_1,\ldots,R_J)$. A first approach could be to find the $J$ regions and coefficients $c_1,\ldots,c_J$ that minimize the training MSE, by solving the optimization problem
$$ \hat\gamma(\cdot) = \arg\min_{m_{c,R}} E_n\left[(Y_i - m_{c,R}(X_i))^2\right], \qquad (17) $$
where $m_{c,R}(X_i) = \sum_{j=1}^J c_j 1(X_i\in R_j)$. However, this problem is computationally infeasible in most cases. Decision trees consider a greedy approximation to this problem. Note that when $R$ is fixed, the solution to the OLS problem
$$ \hat c = \arg\min_c E_n\left[(Y_i - m_{c,R}(X_i))^2\right] $$
is simply given by $\hat c = (\hat c_1,\ldots,\hat c_J)$, where $\hat c_j = \sum_{i=1}^n Y_i1(X_i\in R_j)/\sum_{i=1}^n 1(X_i\in R_j) = \bar Y_{R_j}$ is the average outcome in region $R_j$. The corresponding prediction is
$$ \hat\gamma_R(x) = \sum_{j=1}^J\hat c_j1(x\in R_j) = \sum_{i=1}^n\alpha_i(x)Y_i, $$
where
$$ \alpha_i(x) = \sum_{j=1}^J\frac{1(X_i\in R_j)1(x\in R_j)}{\sum_{i'=1}^n1(X_{i'}\in R_j)}. $$
Note that $\sum_{i=1}^n\alpha_i(x) = 1$, because $(R_j)$ is a partition. Moreover, if $j$ is such that $x\in R_j$, denoted $j(x)$, and $L(x) = R_{j(x)}$, then
$$ \alpha_i(x) = \frac{1(X_i\in L(x))}{|L(x)|}, \qquad (18) $$
where $|L(x)| = \sum_{i=1}^n1(X_i\in L(x))$ is the number of data points in $L(x)$. Finally, a tree with a fixed partition is a linear prediction method, with a hat matrix $L_n$ with $(i,j)$ element $\alpha_j(X_i)$.
The computational complication of (17) is the optimization over the (large) space of possible partitions,
$$ \min_R E_n\left[(Y_i - \hat\gamma_R(X_i))^2\right] = \min_R\frac{1}{n}\sum_{j=1}^J\sum_{i\in R_j}\left(Y_i - \bar Y_{R_j}\right)^2. $$
Trees focus on partitions made of boxes or high-dimensional rectangles. More importantly, they consider a greedy approach called recursive binary splitting. First, an optimal partition is found among those made of two elements, $R_1(k,s) = \{X: X_k < s\}$ and its complement $R_2(k,s) = \{X: X_k\ge s\}$, and we seek to find $k$ (variable) and $s$ (threshold) that minimize the previous objective function with $J = 2$ and $R = (R_1,R_2)$, i.e.
$$ \min_{k,s}\left\{\sum_{i\in R_1(k,s)}\left(Y_i - \bar Y_{R_1}\right)^2 + \sum_{i\in R_2(k,s)}\left(Y_i - \bar Y_{R_2}\right)^2\right\}, $$

where YRj is the average of the outcome in region Rj (k; s): Once we …nd the optimal values
k and s ; we repeat the process within each of the resulting regions from the previous
step and select the optimal partition. We now have three regions. Again we look at the
split of one of these three regions that minimize the resulting RSS. The process continues
until a stopping criterion is reached. This process is likely to over-…t the data, so some
regularization is needed. One approach is cost complexity pruning, which for a given large
tree T0 and penalization parameter selects the subtree that solves

jT j
X X 2
min Yi YRm + jT j ;
T T0
m=1 i:xi 2Rm

where jT j indicates the number of terminal nodes of the tree, Rj is the rectangle correspond-
ing to the j th terminal note. We can select by CV.
Algorithm (Building a regression tree)

1. Use recursive binary splitting to grow a large tree on the training data, stopping when each terminal node has fewer than some minimum number of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.

3. Use $K$-fold CV to choose $\alpha$. That is, divide the training observations into $K$ folds. For each $k = 1,\ldots,K$:

   (a) Repeat Steps 1 and 2 on all but the $k$-th fold of the training data.
   (b) Evaluate the mean squared prediction error on the data in the left-out $k$-th fold, as a function of $\alpha$.
   (c) Average the results for each value of $\alpha$ and pick the $\alpha$ that minimizes the average error.

4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$. A code sketch of this procedure is given below.

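A compact sketch of the algorithm using scikit-learn's cost-complexity pruning path (the data, the minimum leaf size, and the choice of 5 folds are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=500)

# Steps 1-2: grow a large tree and get the sequence of pruned subtrees (indexed by alpha)
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)

# Step 3: choose alpha by K-fold CV over the candidate values
cv_mse = [-cross_val_score(DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a, random_state=0),
                           X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]

# Step 4: refit the pruned tree with the chosen alpha
final_tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha, random_state=0).fit(X, y)
```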
Trees are easy to interpret, but they typically lead to high variability and non-robust
results. To improve their performance, researchers have considered combinations of trees
(Ensemble methods) along the following lines.

8.1 Bagging
Bagging or bootstrap aggregation is a general procedure to reduce the variance of a machine
learning method (see Breiman, 1996). A general method to reduce variance is to take
averages. The idea is to draw many training sets, and average the resulting predictions.
This is not practical because typically we do not have access to multiple training sets.
Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
So, what is the idea of the bootstrap? The underlying fundamental result is the Glivenko-
Cantelli’s Theorem, which shows that if fZ1 ; :::; Zn g is a sequence of iid random variables
with CDF F , then the empirical distribution function Fn (x) = En [1(Z x)] is a uniformly
consistent estimator of F (x): Then, we can use Fn instead of F for approximating the
distribution of estimators and statistics. Given sample realizations fz1 ; :::; zn g of fZ1 ; :::; Zn g
the distribution Fn is known, it is a multinomial distribution putting 1=n mass to each point
zi in the sample (if there are no ties). We can draw data from Fn to generate bootstrap
data. We can then obtain predictions with the following algorithm.
Algorithm (Bootstrap aggregation, Bagging)

1. Generate a sample of size $n$ from $F_n$: $Z_1^*,\ldots,Z_n^*$, with $Z_i = (Y_i,X_i)$.

2. Use the bootstrap sample to generate a prediction at $x$, say $f_n^*(x)$.

3. Repeat Step 1 and Step 2 $B$ times, where $B$ is a large number, getting $B$ predictions $f_n^{*1}(x),\ldots,f_n^{*B}(x)$, and compute the average
$$ \bar f_n(x) = \frac{1}{B}\sum_{b=1}^B f_n^{*b}(x). $$

It turns out that it is straightforward to estimate the test error of a bagged model without the need to perform cross-validation. The key idea is that each bagged tree makes use of about two-thirds of the observations on average (see Exercise 8). The remaining one-third of the observations not used to fit a given tree are called out-of-bag (OOB) observations. We can predict the response for the $i$-th observation using each of the trees in which the $i$-th observation was OOB. We can take the average and use the OOB prediction error as an estimate of the test MSE. The idea is similar to LOOCV.
Also, we can report the total amount that the RSS is decreased due to splits over a given
predictor as a measure of the importance of the predictor (averaged over the B trees).
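A short illustration with scikit-learn's BaggingRegressor, whose out-of-bag score implements the OOB idea just described (the default base learner is a regression tree; data are hypothetical):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(500, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=500)

bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB R^2 (a test-fit estimate obtained without cross-validation):", bag.oob_score_)
```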

8.2 Random Forest


Averaging variables that are highly correlated does not reduce the variance much. The idea of Random Forest is to "de-correlate" the trees by drawing a random sample of $m$ predictors (out of the $p$ possible predictors, with $m\le p$) each time a split is considered (see Breiman 2001). Typically $m\approx\sqrt{p}$. Making the trees less correlated improves the variance reduction achieved by averaging. Random Forests have been shown to be a very competitive prediction algorithm in many applications.
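For completeness, a one-line fit with $m\approx\sqrt{p}$ feature subsampling (simulated data for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(size=(500, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", rf.oob_score_)
```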

8.3 Generalized Random Forest


Generalized Random Forest (GRF, henceforth, cf. Athey, Tibshirani and Wager, 2019) is a generalization of Random Forest to general moment restrictions. These authors consider a conditional moment model
$$ E[\psi(W_i;\theta(x),v(x))|X_i = x] = 0, \qquad (19) $$
for a score function $\psi$, a parameter of interest $\theta(x)$, which is a function of $x$, and a nuisance parameter $v(x)$. An example is $\psi(W_i;\theta(x),v(x)) = Y_i - \theta(x)$, which identifies $\theta(x)$ as the conditional mean $E[Y_i|X_i=x]$. To introduce GRF, recall that for a tree we had
$$ \hat E[Y_i|X_i=x] = \sum_{i=1}^n\alpha_i(x)Y_i = \int y\,dP_n(y|x), $$
for the conditional empirical measure
$$ P_n(y|x) = \sum_{i=1}^n\alpha_i(x)\delta_{Y_i}. $$
Applying this logic to the conditional moment (19), we obtain
$$ \hat E[\psi(W_i;\theta(x),v(x))|X_i=x] = \sum_{i=1}^n\alpha_i(x)\psi(W_i;\theta(x),v(x)), $$
which suggests using the localized estimators
$$ (\hat\theta(x),\hat v(x)) = \arg\min_{\theta,v}\left\|\sum_{i=1}^n\alpha_i(x)\psi(W_i;\theta,v)\right\|. $$
GRF uses a Random Forest and not a single tree, and for that reason the weights are given by (cf. (18))
$$ \alpha_i(x) = \frac{1}{B}\sum_{b=1}^B\frac{1(X_i\in L_b(x))}{|L_b(x)|}, $$
where, for each tree $b$, $L_b(x)$ is the set of observations in tree $b$ falling in the same "leaf" as $x$, and $|L_b(x)|$ is its cardinality, cf. Athey, Tibshirani and Wager (2019). Of course, this discussion takes the partitions in each tree as fixed, but Athey, Tibshirani and Wager (2019) introduced a gradient-based binary splitting algorithm to estimate the partition. We will see below how these methods can be used to study heterogeneity in treatment effects.
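A minimal sketch of the forest-weighting idea. An off-the-shelf random forest (not the gradient-based splitting rule of Athey, Tibshirani and Wager, 2019) is used only to produce the leaf memberships $L_b(x)$, and the weighted moment is then solved for the simple score $\psi(W;\theta(x)) = Y - \theta(x)$, so that $\hat\theta(x)$ is just the $\alpha$-weighted mean; data and tuning constants are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x):
    """alpha_i(x) = (1/B) * sum_b 1(X_i in L_b(x)) / |L_b(x)|, computed from fitted leaves."""
    leaves_train = forest.apply(X_train)           # (n, B): leaf index of each training point
    leaves_x = forest.apply(x.reshape(1, -1))[0]   # (B,): leaf containing x in each tree
    same_leaf = (leaves_train == leaves_x)         # (n, B): indicator 1(X_i in L_b(x))
    leaf_sizes = same_leaf.sum(axis=0)             # |L_b(x)| for each tree
    return (same_leaf / leaf_sizes).mean(axis=1)   # average over the B trees

rng = np.random.default_rng(6)
X = rng.uniform(size=(1000, 3))
Y = 2 * X[:, 0] + rng.normal(size=1000)
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=10, random_state=0).fit(X, Y)

x0 = np.array([0.5, 0.5, 0.5])
alpha = forest_weights(forest, X, x0)
theta_hat = np.sum(alpha * Y)    # solves sum_i alpha_i(x0) * (Y_i - theta) = 0
```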

8.4 Boosting
Boosting is a general ensemble method that works for many machine learners. Here we
describe the so-called L2 boosting.
Algorithm (L2 Boosting)

1. Set $\hat f^{(0)}(x) = \bar Y$ or $\hat f^{(0)}(x) = 0$, and $r_i = Y_i$ for all $i$ in the training set. Set $m = 0$.

2. Increase $m$ by 1. Compute the residuals $r_i = Y_i - \hat f^{(m-1)}(X_i)$.

3. Fit the residual vector $r = (r_1,\ldots,r_n)$ to $X$ by a base procedure (e.g. a decision tree) to obtain $\hat g^{(m)}(X_i)$.

4. Update $\hat f$ by adding a shrunken version of the new predictor: $\hat f^{(m)}(x) = \hat f^{(m-1)}(x) + \nu\,\hat g^{(m)}(x)$.

5. Iterate steps 2 to 4 until $m = m_{stop}$ for some stopping iteration $m_{stop}$.

To provide intuition on how Boosting works, consider the situation where the base procedure is a linear prediction method with a hat matrix $L_n$. Assume that $\hat f^{(0)}(x) = 0$ and $\nu = 1$. Then the hat matrix of $L_2$ Boosting after $m$ iterations equals (by step 4)
$$ B_m = B_{m-1} + L_n(I - B_{m-1}) = I - (I - L_n)^m. $$
If $\|I - L_n\| < 1$ for a suitable norm, it follows that $B_m\to I$ as $m\to\infty$. Thus, as $m$ increases the complexity of the model fitted by $L_2$ boosting increases, and it converges to the fully saturated model (if $B_m\to I$, then $\hat{\mathbf{Y}}\to\mathbf{Y}$, and we can fit any response data). Note how $m$ plays the role of a regularization parameter. Under the condition $\|I - L_n\| < 1$, we say $L_2$-boosting has a learning capacity. For further discussion on Boosting see Bühlmann and Hothorn (2007).
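A bare-bones version of the algorithm with tree stumps as the base procedure (the shrinkage $\nu$ and the number of iterations $m_{stop}$ are the tuning knobs; both would be chosen by validation in practice, and the data here are simulated for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost(X, y, m_stop=200, nu=0.1, max_depth=1):
    """L2 boosting: repeatedly fit the current residuals and add a shrunken correction."""
    f = np.zeros(len(y))                  # f^(0) = 0
    learners = []
    for _ in range(m_stop):
        r = y - f                         # step 2: residuals
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # step 3: base fit
        f = f + nu * g.predict(X)         # step 4: shrunken update
        learners.append(g)
    return lambda Xnew: nu * sum(g.predict(Xnew) for g in learners)

rng = np.random.default_rng(7)
X = rng.uniform(size=(500, 5))
y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=500)
boosted_predict = l2_boost(X, y)
```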

9 Kernel Machines
This section is a bit more technical than the rest. A general framework for prediction is provided by kernel machines, where
$$ \hat m(\cdot) = \arg\min_{m\in\mathcal{H},\,\|m\|_{\mathcal H}\le k} E_n\left[(Y_i - m(X_i))^2\right], $$
where $\mathcal H$ is a Reproducing Kernel Hilbert Space (RKHS). As in previous sections, this is often implemented in its Lagrangian form
$$ \hat m(\cdot) = \arg\min_{m\in\mathcal H} Q_n(m) = \frac{1}{n}\sum_{i=1}^n(Y_i - m(X_i))^2 + \frac{\lambda_n}{n}\|m\|_{\mathcal H}. $$

An introduction to RKHS is beyond the scope of these notes, but it suffices to say that corresponding to each $\mathcal H$ there exists a unique positive semi-definite symmetric kernel function $K(x_1,x_2)$ on $\mathcal X\times\mathcal X$, where $\mathcal X$ is the support of the covariates, with the representation
$$ K(x_1,x_2) = \sum_{j=0}^\infty\lambda_j\varphi_j(x_1)\varphi_j(x_2) $$
for a positive sequence of numbers $\lambda_j$ and linearly independent functions $\varphi_j$ (called features), such that each element $m\in\mathcal H$ has the representation
$$ m(x) = \sum_{j=0}^\infty m_j\varphi_j(x) \quad\text{with}\quad \sum_{j=0}^\infty\frac{m_j^2}{\lambda_j} < \infty. $$

Mercer's Theorem guarantees that the space of such $K$'s is really broad.
A feature that makes Kernel Machines practical is that the penalization problem above has a simple closed form solution (for a positive definite kernel), given by
$$ \hat m(x) = \sum_{i=1}^n c_iK(x,X_i), $$
where $c = (c_1,\ldots,c_n)'$ is given by
$$ c = (\mathbf{K} + \lambda_nI_n)^{-1}\mathbf{Y} $$
and $\mathbf{K}$ is the $n\times n$ matrix with $(i,j)$-th element $K(X_i,X_j)$. The corresponding vector of fitted values is
$$ \hat{\mathbf{Y}} = L\mathbf{Y}, $$
where $L = \mathbf{K}(\mathbf{K} + \lambda_nI_n)^{-1}$. Therefore, kernel machines are linear predictors, and the previous discussion on generalized cross-validation applies here. This method generalizes Ridge to general kernels (even with a possibly infinite number of features $\varphi_j$).
A special case of kernel machines is smoothing splines, where
$$ \|m\|_{\mathcal H} = \int(m''(x))^2dx $$
is a measure of the roughness of $m$. Wahba (1990) provides an excellent treatment of this and other cases. Note that
$$ \int(m''(x))^2dx = 0 \implies m(x) = a + bx. $$
Therefore, for this choice and for $\lambda_n$ large, $\hat m$ becomes close to a linear fit (shrinkage is towards the linear model).
A popular choice of kernel is the Squared Exponential kernel
$$ K(x_1,x_2) = \sigma_f^2\exp\left(-|x_1-x_2|^2/(2\ell^2)\right), $$
where $\sigma_f^2$ is the signal variance and $\ell$ the length-scale. In general, these parameters $(\sigma_f^2,\ell)$ are called hyperparameters, and they can also be estimated by cross-validation.
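A direct numpy implementation of the closed-form solution with the squared exponential kernel (the hyperparameters and the penalty are fixed for illustration; in practice they would be tuned by cross-validation):

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=0.5):
    """Squared exponential kernel matrix with entries K(a_i, b_j)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-sq / (2 * ell ** 2))

def kernel_machine_fit(X, y, lam):
    K = se_kernel(X, X)
    c = np.linalg.solve(K + lam * np.eye(len(y)), y)   # c = (K + lambda_n I_n)^{-1} Y
    return lambda Xnew: se_kernel(Xnew, X) @ c         # m_hat(x) = sum_i c_i K(x, X_i)

rng = np.random.default_rng(8)
X = rng.uniform(size=(200, 1))
y = np.sin(6 * X[:, 0]) + 0.2 * rng.normal(size=200)
m_hat = kernel_machine_fit(X, y, lam=0.1)
```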

Part III

A General Framework for Locally


Robust Estimators
In this section we derive a general method to construct Locally Robust (LR)/Debiased/Orthogonal
moments for causal inference with ML methods. We follow Chernozhukov, Escanciano,
Ichimura, Newey and Robins (2016, 2022, CEINR henceforth) for the most part. We con-
sider …rst a simpler but general setting of linear functionals of conditional means and the
corresponding Double-Robust estimators.

10 Double-Robust Estimators
We start with a setting where the parameter of interest is a linear functional of a conditional mean function
$$ \gamma_0(X) := E[Y|X]. $$
That is, the parameter of interest is $\theta_0 = E[m(W;\gamma_0)]$, where $W$ is the data observation, which includes $(Y,X)$, and linearity means that $m$ is such that for all constants $c_1$ and $c_2$ and functions $\gamma_1(X)$ and $\gamma_2(X)$ (with finite variance),
$$ E[m(W;c_1\gamma_1 + c_2\gamma_2)] = c_1E[m(W;\gamma_1)] + c_2E[m(W;\gamma_2)]. $$

We illustrate with several examples.

Example 1: The data observation is $W = (Y,X,Z)$ and the parameter of interest is $\theta_0 = E[Z\gamma_0(X)]$. Trivially, this functional is linear, as $E[Z(c_1\gamma_1(X) + c_2\gamma_2(X))] = c_1E[Z\gamma_1(X)] + c_2E[Z\gamma_2(X)]$.

Example 2 (ATE): The data observation is $W = (Y,X)$ with $X = (X_1,D)$ and the parameter of interest is $\theta_0 = ATE = E[\gamma_0(X_1,1) - \gamma_0(X_1,0)]$. Here, $\gamma_0$ is the long regression of $Y$ on $X = (X_1,D)$.

Example 3 (Counterfactuals): The data observation is $W = (Y,X)$ with $X = (X_1,D)$ as before. The parameter of interest is $\theta_0 = E[\gamma_0(X_1,d_1)1(D_i = d_2)]$.

Under a continuity condition, we can write a linear functional of $\gamma_0$, $\theta_0 = E[m(W;\gamma_0)]$, as
$$ \theta_0 = E[\alpha_0(X)\gamma_0(X)] \quad\text{for all possible } \gamma_0, $$
and for some $\alpha_0(X)$ with finite variance. The function $\alpha_0$ is called the Riesz representer of the functional. We illustrate the calculation of the Riesz representer in the examples.

Example 1: Trivially, $\theta_0 = E[Z\gamma_0(X)] = E[\alpha_0(X)\gamma_0(X)]$, so the Riesz representer is
$$ \alpha_0(X) = E[Z|X]. $$

Example 2 (ATE): Define the propensity score $p_0(X_1) = P(D = 1|X_1)$ and note that
$$\begin{aligned} \theta_0 &= E\left[\gamma_0(X)\frac{D - p_0(X_1)}{p_0(X_1)(1-p_0(X_1))}\right] \\ &= E\left[E\left[\gamma_0(X_1,D)\frac{D - p_0(X_1)}{p_0(X_1)(1-p_0(X_1))}\Big|X_1\right]\right] \\ &= E[\gamma_0(X_1,1) - \gamma_0(X_1,0)]. \end{aligned}$$
Therefore, the corresponding Riesz representer is the Horvitz-Thompson weight
$$ \alpha_0(X) = \frac{D - p_0(X_1)}{p_0(X_1)(1-p_0(X_1))} = \frac{D}{p_0(X_1)} - \frac{1-D}{1-p_0(X_1)}. $$

Example 3 (Counterfactuals): Arguing as for the ATE, but for the multi-valued case with treatments $d = 0,1,\ldots,T$,
$$\begin{aligned} \theta_0 &= E\left[\gamma_0(X_1,D)\frac{1(D_i = d_1)p_{d_2}(X_1)}{p_{d_1}(X_1)}\right] \\ &= E\left[E\left[\gamma_0(X_1,D)\frac{1(D_i = d_1)p_{d_2}(X_1)}{p_{d_1}(X_1)}\Big|X_1\right]\right] \\ &= E[\gamma_0(X_1,d_1)p_{d_2}(X_1)] \\ &= E[\gamma_0(X_1,d_1)E[1(D_i = d_2)|X_1]] \\ &= E[\gamma_0(X_1,d_1)1(D_i = d_2)], \end{aligned}$$
so
$$ \alpha_0(X) = \frac{p_{d_2}(X_1)}{p_{d_1}(X_1)}1(D_i = d_1). $$

In all these cases, we have three ways to write down the parameter:
$$ \theta_0 = E[m(W;\gamma_0)] = E[Y\alpha_0(X)] = E[\alpha_0(X)\gamma_0(X)]. $$
CEINR shows that a LR moment for $\theta_0$ is
$$ \psi(w;\theta,\gamma,\alpha) = m(w;\gamma) - \theta + \alpha(x)(y - \gamma(x)). $$
This is LR and Double Robust in the sense that
$$\begin{aligned} E[\psi(W;\theta,\gamma,\alpha)] &= -\theta + E[\alpha_0(X)\gamma(X)] + E[\alpha(X)\{\gamma_0(X) - \gamma(X)\}] \\ &= \theta_0 - \theta - E[\alpha_0(X)\{\gamma_0(X) - \gamma(X)\}] + E[\alpha(X)\{\gamma_0(X) - \gamma(X)\}] \\ &= \theta_0 - \theta - E[\{\alpha(X) - \alpha_0(X)\}\{\gamma(X) - \gamma_0(X)\}]. \qquad (20) \end{aligned}$$
Equation (20) is the most important equation of the course. It explains why orthogonal/LR moments are useful when ML estimators are used. Departures of $(\gamma,\alpha)$ from $(\gamma_0,\alpha_0)$ have only a second-order effect on the average orthogonal moment function, leading to small bias from first step estimation. In particular, if $\gamma_\tau$ and $\alpha_\tau$ are local deviations around $\gamma_0$ and $\alpha_0$, respectively (indexed by a scalar $\tau$), then by the chain rule and (20),
$$ \frac{d}{d\tau}E[\psi(W;\theta,\gamma_\tau,\alpha_\tau)] = \frac{d}{d\tau}E[\psi(W;\theta,\gamma_\tau,\alpha_0)] + \frac{d}{d\tau}E[\psi(W;\theta,\gamma_0,\alpha_\tau)] = 0. $$
This is the local robustness property of $E[\psi(W;\theta,\gamma,\alpha)]$.


We have started with an identifying moment
$$ g(w;\theta,\gamma) = m(w;\gamma) - \theta, $$
then by adding the adjustment term or First-Step Influence Function (FSIF)
$$ \phi(w;\gamma,\alpha) = (y - \gamma(x))\alpha(x) $$
we construct the LR moment
$$ \psi(w;\theta,\gamma,\alpha) = g(w;\theta,\gamma) + \phi(w;\gamma,\alpha). $$
We arrive at the same LR moment if we start with the identifying moment
$$ g(w;\theta,\alpha) = y\alpha(x) - \theta $$
and add the adjustment term
$$ \phi(w;\gamma,\alpha) = m(w;\gamma) - \alpha(x)\gamma(x). $$
The situation is like that of the IPW and direct estimators of the ATE: both lead to the same LR moment. This moment is LR/DR because we only need either $\gamma(X) = \gamma_0(X)$ or $\alpha(X) = \alpha_0(X)$ for $E[\psi(W;\theta,\gamma,\alpha)] = 0$ to identify $\theta = \theta_0$, as (20) shows.
We discuss LR moments in relation to the literature for the examples above.

Example 1: Recall, $\theta_0 = E[Z\gamma_0(X)] = E[\alpha_0(X)\gamma_0(X)]$ for $\alpha_0(X) = E[Z|X]$. This example is of interest in its own right as the component of the expected conditional covariance $E[\mathrm{Cov}(Z,Y|X)] = E[ZY] - \theta_0$ that depends on unknown functions. This covariance is useful for the analysis of covariance and for estimation of a partially linear model, Robinson (1988). We specify the identifying moment function as $z\gamma(x) - \theta$ and the plim of $\hat\gamma$ to be $E[Y|X]$, so that $\hat\gamma$ is a ML nonparametric regression estimator. Model selection and/or regularization of $\hat\gamma$ will typically lead to large biases in a plug-in estimator $\tilde\theta = \sum_{i=1}^nZ_i\hat\gamma(X_i)/n$ as previously mentioned, for $n$ data observations $W_i$. An orthogonal moment function can be constructed by adding the FSIF, which for the identifying moment function $z\gamma(x) - \theta$ was shown to be $\alpha(x)[y - \gamma(x)]$ in Proposition 4 of Newey (1994a). The orthogonal moment function is the sum of the identifying moment function and the FSIF, given by
$$ \psi(w;\theta,\gamma,\alpha) := z\gamma(x) - \theta + [y - \gamma(x)]\alpha(x), $$
and the true value is identified as
$$ \theta_0 = E[Z\gamma_0(X) + (Y - \gamma_0(X))\alpha_0(X)], $$
where both $\gamma_0(X) = E[Y|X]$ and $\alpha_0(X) = E[Z|X]$ can be estimated by ML methods.

Example 2 (ATE): Note that $\theta_0 = E[Y\alpha_0(X)]$ corresponds to the IPW estimand, since
$$ E\left[\frac{Y_iD_i}{p_0(X_{1i})} - \frac{Y_i(1-D_i)}{1-p_0(X_{1i})}\right] = E\left[Y_i\frac{D_i - p_0(X_{1i})}{p_0(X_{1i})(1-p_0(X_{1i}))}\right], $$
while $\theta_0 = E[\alpha_0(X)\gamma_0(X)]$ corresponds to the direct estimand, since
$$\begin{aligned} E[\alpha_0(X)\gamma_0(X)] &= E[E[\alpha_0(X)\gamma_0(X)|X_1]] \\ &= E[\gamma_0(X_1,1)\alpha_0(X_1,1)p_0(X_1) + \gamma_0(X_1,0)\alpha_0(X_1,0)(1-p_0(X_1))] \\ &= E[\gamma_0(X_1,1) - \gamma_0(X_1,0)]. \end{aligned}$$
From these results, the LR identifying equation is
$$\begin{aligned} \theta_0 &= E[\alpha_0(X)\gamma_0(X) + (Y - \gamma_0(X))\alpha_0(X)] \\ &= E[\gamma_0(X_1,1) - \gamma_0(X_1,0)] + E[(Y - \gamma_0(X))\alpha_0(X)] \\ &= E[\gamma_0(X_1,1) - \gamma_0(X_1,0)] + E\left[\frac{(Y - \gamma_0(X))D}{p_0(X_1)} - \frac{(Y - \gamma_0(X))(1-D)}{1-p_0(X_1)}\right] \\ &= E[\gamma_0(X_1,1) - \gamma_0(X_1,0)] + E\left[\frac{(Y - \gamma_0(X_1,1))D}{p_0(X_1)} - \frac{(Y - \gamma_0(X_1,0))(1-D)}{1-p_0(X_1)}\right], \end{aligned}$$
which corresponds to the DR moment, and where the last equality follows because $\gamma_0(X_1,D)D = \gamma_0(X_1,1)D$ a.s., and similarly $\gamma_0(X_1,D)(1-D) = \gamma_0(X_1,0)(1-D)$ a.s.

Example 3 (Counterfactuals): The LR moment has the form
$$ \theta_0 = E[\gamma_0(X_1,d_1)1(D_i = d_2) + (Y - \gamma_0(X))\alpha_0(X)], $$
where recall
$$ \alpha_0(X) = \frac{1(D_i = d_1)p_{d_2}(X_1)}{p_{d_1}(X_1)}. $$
In addition to the regularization and model selection biases from ML there are also over-fitting biases. To address over-fitting bias we use cross-fitting.

10.1 Cross-Fitting with DR Moments


We combine orthogonal moment functions with cross-fitting, a form of sample splitting, to construct debiased sample moments; e.g. see Bickel (1982), Schick (1986), Klaassen (1987), and Chernozhukov et al. (2018). Partition the observation indices $(i = 1,\ldots,n)$ into $L$ groups $I_\ell$, $(\ell = 1,\ldots,L)$. Consider $\hat\gamma_\ell$ and $\hat\alpha_\ell$, which are constructed using all observations not in $I_\ell$. Debiased estimators are given by
$$ \hat\theta = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\left\{m(W_i;\hat\gamma_\ell) + \hat\alpha_\ell(X_i)[Y_i - \hat\gamma_\ell(X_i)]\right\}. $$

The choice of L = 5 works well based on a variety of empirical examples and in simula-
tions, for medium sized data sets of a few thousand observations; see Chernozhukov et al.
(2018). The choice L = 10 works well for small data sets with the larger L providing more
observations for construction of ^ ` and ^ ` .
The cross-…tting used here, where the average over observations not used to form ^ `
and ^ ` ; eliminates bias due to averaging over observations that are used to construct the
…rst step: Eliminating such "own observation" bias helps remainders converge faster to zero,
e.g. as in Newey and Robins (2017), and can be important in practice, e.g. as in the
jackknife instrumental variables estimators of Angrist and Krueger (1995) and Blomquist
and Dahlberg (1999). It also eliminates the need for Donsker conditions for ^ ` and ^ ` ;
which is important for many machine learning …rst steps that are not known to satisfy such
conditions, as discussed in Chernozhukov et al. (2018).
Under some conditions, we will show that $\sqrt{n}(\hat\theta - \theta_0)$ is asymptotically normal with an asymptotic variance that is consistently estimated by
$$ \hat V = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{i\ell}\hat\psi_{i\ell}', \qquad \hat\psi_{i\ell} = m(W_i;\hat\gamma_\ell) - \hat\theta + \hat\alpha_\ell(X_i)(Y_i - \hat\gamma_\ell(X_i)). $$
To prove this result, we proceed as we did before for RCT and consider the infeasible estimator $\tilde\theta$ based on the true (uncentered) influence function
$$ \psi_i = m(W_i;\gamma_0) + \alpha_0(X_i)(Y_i - \gamma_0(X_i)), $$
and show that
$$ \sqrt{n}\left(\hat\theta - \tilde\theta\right)\xrightarrow{p}0. $$

Consider the following assumptions:

Assumption 1: $E[\psi_i^2] < \infty$ and
$$ (i)\ \int\|m(w;\hat\gamma_\ell) - m(w;\gamma_0)\|^2F_0(dw)\xrightarrow{p}0, \qquad (ii)\ \int\alpha_0(x)^2(\gamma_0(x) - \hat\gamma_\ell(x))^2F_0(dw)\xrightarrow{p}0, $$
$$ (iii)\ \int(\hat\alpha_\ell(x) - \alpha_0(x))^2(y - \gamma_0(x))^2F_0(dw)\xrightarrow{p}0. $$
These are mild mean-square consistency conditions for $\hat\gamma_\ell$ and $\hat\alpha_\ell$ separately.

Assumption 2: For each $\ell = 1,\ldots,L$,
$$ \sqrt{n}\,RMSE(\hat\gamma_\ell)\,RMSE(\hat\alpha_\ell)\xrightarrow{p}0. $$
Assumptions 1-2 only require mean-square convergence rates.

Theorem (Lemma 8 in CEINR). Under Assumptions 1 and 2,
$$ \sqrt{n}\left(\hat\theta - \theta_0\right)\xrightarrow{d}N(0,V), $$
where $V = E[(\psi_i - \theta_0)^2]$.
Proof. Write
$$\begin{aligned} \hat\psi_{i\ell} - \psi_i &= m(W_i;\hat\gamma_\ell) - m(W_i;\gamma_0) \\ &\quad + \alpha_0(X_i)(\gamma_0(X_i) - \hat\gamma_\ell(X_i)) \\ &\quad + (\hat\alpha_\ell(X_i) - \alpha_0(X_i))(Y_i - \gamma_0(X_i)) \\ &\quad + (\hat\alpha_\ell(X_i) - \alpha_0(X_i))(\gamma_0(X_i) - \hat\gamma_\ell(X_i)) \\ &\equiv \hat R_{1\ell i} + \hat R_{2\ell i} + \hat R_{3\ell i} + \hat\Delta_\ell(W_i). \end{aligned}$$
Let $W_\ell^c$ denote the observations not in $I_\ell$, so that $\hat\gamma_\ell$ and $\hat\alpha_\ell$ depend only on $W_\ell^c$. Therefore, by the Riesz representation theorem and the orthogonality condition $E[Y - \gamma_0(X)|X] = 0$,
$$ E[\hat R_{1\ell i} + \hat R_{2\ell i}|W_\ell^c] = 0, \qquad E[\hat R_{3\ell i}|W_\ell^c] = 0. \qquad (21) $$
Also, because the observations in $I_\ell$ are mutually independent conditional on $W_\ell^c$, and by Assumption 1,
$$ E\left[\left\{\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\left(\hat R_{j\ell i} - E[\hat R_{j\ell i}|W_\ell^c]\right)\right\}^2\Bigg|W_\ell^c\right] = \frac{n_\ell}{n}\,Var(\hat R_{j\ell i}|W_\ell^c)\xrightarrow{p}0, \quad (j = 1,2,3). $$
Then by the triangle and conditional Markov inequalities,
$$ \frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\left(\hat R_{1\ell i} + \hat R_{2\ell i} + \hat R_{3\ell i} - E[\hat R_{1\ell i} + \hat R_{2\ell i} + \hat R_{3\ell i}|W_\ell^c]\right)\xrightarrow{p}0. \qquad (22) $$
By equations (21) and (22) and the triangle inequality, $\sum_{i\in I_\ell}(\hat R_{1\ell i} + \hat R_{2\ell i} + \hat R_{3\ell i})/\sqrt{n}\xrightarrow{p}0$. It also follows from Assumption 2 and the Cauchy-Schwarz inequality that $\sum_{i\in I_\ell}\hat\Delta_\ell(W_i)/\sqrt{n}\xrightarrow{p}0$. Therefore,
$$ \sqrt{n}\left(\hat\theta - \tilde\theta\right)\xrightarrow{p}0. $$
Conclude by writing $\sqrt{n}(\hat\theta - \theta_0) = \sqrt{n}(\hat\theta - \tilde\theta) + \sqrt{n}(\tilde\theta - \theta_0)$ and applying the CLT to $\sqrt{n}(\tilde\theta - \theta_0)$.

Example 1: (continued) The LR/debiased estimator is
$$ \hat\theta = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\left\{Z_i\hat\gamma_\ell(X_i) + \hat\alpha_\ell(X_i)[Y_i - \hat\gamma_\ell(X_i)]\right\}, $$
where $\hat\gamma_\ell(X_i)$ and $\hat\alpha_\ell(X_i)$ are estimators of $\gamma_0(X) = E[Y|X]$ and $\alpha_0(X) = E[Z|X]$, respectively. Assumptions 1 and 2 hold under mild conditions.

Example 2 (ATE): (continued) The LR/debiased estimator is
$$ \hat\theta_{ATE} = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{i\ell}, $$
where
$$\begin{aligned} \hat\psi_{i\ell} &= \frac{Y_iD_i}{\hat p_\ell(X_{1i})} - \frac{Y_i(1-D_i)}{1-\hat p_\ell(X_{1i})} - \hat\gamma_\ell(X_{1i},1)\left(\frac{D_i}{\hat p_\ell(X_{1i})} - 1\right) + \hat\gamma_\ell(X_{1i},0)\left(\frac{1-D_i}{1-\hat p_\ell(X_{1i})} - 1\right) \\ &= \hat\gamma_\ell(X_{1i},1) - \hat\gamma_\ell(X_{1i},0) + \frac{D_i[Y_i - \hat\gamma_\ell(X_{1i},1)]}{\hat p_\ell(X_{1i})} - \frac{(1-D_i)[Y_i - \hat\gamma_\ell(X_{1i},0)]}{1-\hat p_\ell(X_{1i})}. \end{aligned}$$
Asymptotic normality of the cross-fitted ATE estimator then follows from mild rate conditions on the fitted values and propensity score estimators.
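A compact sketch of the cross-fitted debiased ATE estimator with generic ML first steps (random forests and the trimming constant are illustrative choices; any learner with a mean-square convergence rate could be plugged in, and the data below are simulated):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import KFold

def debiased_ate(Y, D, X1, n_folds=5):
    """Cross-fitted AIPW/debiased estimator of the ATE and its standard error."""
    n = len(Y)
    psi = np.zeros(n)
    for train, test in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X1):
        # first steps fit on observations NOT in the evaluation fold
        reg1 = RandomForestRegressor(random_state=0).fit(X1[train][D[train] == 1], Y[train][D[train] == 1])
        reg0 = RandomForestRegressor(random_state=0).fit(X1[train][D[train] == 0], Y[train][D[train] == 0])
        ps = RandomForestClassifier(random_state=0).fit(X1[train], D[train])
        g1, g0 = reg1.predict(X1[test]), reg0.predict(X1[test])
        p = np.clip(ps.predict_proba(X1[test])[:, 1], 0.01, 0.99)   # trimming for overlap
        # orthogonal moment evaluated on the held-out fold
        psi[test] = (g1 - g0
                     + D[test] * (Y[test] - g1) / p
                     - (1 - D[test]) * (Y[test] - g0) / (1 - p))
    theta = psi.mean()
    se = np.sqrt(np.mean((psi - theta) ** 2) / n)
    return theta, se

rng = np.random.default_rng(9)
n = 2000
X1 = rng.normal(size=(n, 5))
p0 = 1 / (1 + np.exp(-X1[:, 0]))
D = rng.binomial(1, p0)
Y = D * (1 + X1[:, 0]) + X1[:, 1] + rng.normal(size=n)
print(debiased_ate(Y, D, X1))
```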

In all these applications the Riesz representer 0 can be estimated by plug-in methods.
Nevertheless, we give a general way to construct ML estimators ^ ` with plim 0 below.

11 Constructing Orthogonal Functions in General


We now describe a Generalized Method of Moments (GMM) setting that covers the previ-
ous linear setting and more. To describe debiased GMM in general, let denote a …nite
dimensional parameter vector of interest, be the unknown …rst step function, and W a
data observation with unknown cumulative distribution function (CDF) F0 : We assume that

there is a vector g(w; ; ) of known functions of a possible realization w of W; ; and such
that
E[g(W; 0 ; 0 )] = 0;

where E[ ] is the expectation under F0 and 0 is the probability limit (plim) under F0 of a
…rst step estimator ^ . Here we assume that 0 is identi…ed by these moments, i.e. that 0 is
the unique solution to E[g(W; 0 ; )] = 0 over in some set :
The identifying moment functions g(w; ; ) and ^ can be used to estimate the parameter
of interest 0 : Let W1 ; :::; Wn be a sample of i.i.d. data observations: Estimated sample
moment functions can be formed by plugging the …rst step estimator ^ into g(Wi ; ; ) and
P
averaging over data observations to obtain ni=1 g(Wi ; ^ ; )=n: One could form a "plug-in"
GMM estimator by minimizing a quadratic form in these estimated sample moments, but
such an estimator will be highly biased by …rst step model selection and/or regularization
as discussed before. This bias can be reduced by using orthogonal moment functions.
The orthogonal moment functions we give are based on in‡uence functions. To describe
them we need a few additional concepts and notation. Let F denote a possible CDF for a data
observation W and suppose that the plim of ^ is (F ) when F is the true distribution of a data
observation W . Here (F ) is the plim of ^ under general misspeci…cation, similar to Newey
(1994a), where F is unrestricted except for regularity conditions such as existence of (F )
or the expectation of certain functions of the data. For example, if ^ (x) is a nonparametric
estimator of E[Y jX = x] then (F )(x) = EF [Y jX = x] is the conditional expectation
function when F is the true distribution of W , which is well de…ned under the regularity
condition that EF [jY j] is …nite. We assume that (F0 ) = 0 ; consistent with 0 being the
plim of ^ when F0 is the CDF of W:
Next, let $F_0$ again denote the true distribution of $W$, let $H$ be some alternative distribution that is unrestricted except for regularity conditions, and let $F_\tau = (1-\tau)F_0 + \tau H$ for $\tau\in[0,1]$. We assume that $H$ is chosen so that $\gamma(F_\tau)$ exists for $\tau$ small enough and possibly other regularity conditions are satisfied. We make the key assumption that there exists a function $\phi(W;\gamma,\alpha,\theta)$ such that
$$ \frac{d}{d\tau}E[g(W;\gamma(F_\tau),\theta)] = \int\phi(w;\gamma_0,\alpha_0,\theta)H(dw), \qquad (23) $$
$$ E[\phi(W;\gamma_0,\alpha_0,\theta)] = 0, \qquad E[\phi(W;\gamma_0,\alpha_0,\theta)^2] < \infty, $$
for all $H$ and all $\theta$. Here $\alpha$ is an unknown function, additional to $\gamma$, on which only $\phi(w;\gamma,\alpha,\theta)$ depends, $\alpha_0$ is the $\alpha$ such that equation (23) is satisfied, and $d/d\tau$ is the derivative from the right (i.e. for nonnegative values of $\tau$) at $\tau = 0$. This equation is a well-known Gateaux derivative characterization of the influence function $\phi(w;\gamma_0,\alpha_0,\theta)$ of the functional $F\mapsto E[g(W;\gamma(F),\theta)]$, as in Von Mises (1947), Hampel (1974), and Huber (1981). The restriction that $\gamma(F_\tau)$ exists allows $\phi(w;\gamma_0,\alpha_0,\theta)$ to be the influence function when $\gamma(F)$ is only well defined for certain types of distributions, such as when $\gamma(F)$ is a conditional expectation or pdf. Also, $\phi(w;\gamma_0,\alpha_0,\theta)$ is unique because we are not restricting $H$ except for regularity conditions. Here $\gamma_0$ and $\alpha_0$ can depend on $\theta$; equation (23) is assumed to hold for each $\theta\in\Theta$, and $F_0$ and $H$ do not depend on $\theta$.
We refer to $\phi(w;\gamma,\alpha,\theta)$ as the first step influence function (FSIF) because it characterizes the local effect of the first step plim $\gamma(F)$ on the average moment function as $F$ varies away from $F_0$ in any direction $H$. Further discussion of the FSIF can be found in CEINR and references therein.
We illustrate the results with two important examples, the ATT and the Local Average
Treatment E¤ect (LATE) parameters.

Example 2 (ATE, Partially Linear): A partially linear version of the ATE under unconfoundedness solves $E[g(W;\theta_0,\gamma_0)] = 0$, with
$$ g(w;\theta,\gamma) = (y - \theta d - \gamma(x))d. $$
This is motivated by the model
$$ Y = \theta_0D + \gamma_0(X) + \varepsilon, \qquad E[\varepsilon|D,X] = 0. $$
Computation of the derivative leads to an analysis similar to that for linear functionals, since
$$\begin{aligned} \frac{d}{d\tau}E[g(W;\theta_0,\gamma(F_\tau))] &= -\frac{d}{d\tau}E[\gamma(F_\tau)(X)D] \\ &= -\frac{d}{d\tau}E[\gamma(F_\tau)(X)\alpha_0(X)] \\ &= \frac{d}{d\tau}E[(Y - \theta_0D - \gamma(F_\tau)(X))\alpha_0(X)] \\ &= -\frac{d}{d\tau}E_\tau[(Y - \theta_0D - \gamma_0(X))\alpha_0(X)] \\ &= \int\phi(w;\gamma_0,\alpha_0,\theta_0)H(dw), \end{aligned}$$
with $\phi(w;\gamma,\alpha,\theta) = -(y - \theta d - \gamma(x))\alpha(x)$ and $\alpha_0(X) = E[D|X]$. The second equality above is by iterated expectations, the third by definition (the terms not involving $\gamma(F_\tau)$ do not vary with $\tau$), the fourth by applying the chain rule to the orthogonality condition
$$ E_\tau[(Y - \theta_0D - \gamma(F_\tau)(X))\alpha_0(X)] = 0, $$
and the last by noting that
$$ \frac{d}{d\tau}E_\tau[(Y - \theta_0D - \gamma_0(X))\alpha_0(X)] = \int(y - \theta_0d - \gamma_0(x))\alpha_0(x)H(dw). $$
The LR moment is
$$ \psi(w;\theta,\gamma,\alpha) = (y - \theta d - \gamma(x))(d - \alpha(x)). $$
This is the Double Machine Learning (DML) moment of Chernozhukov et al. (2018), which is obtained
here as a special case of CEINR.
Example 3 (ATT): The ATT solves a moment condition $E[g(W;\theta_0,\gamma_0)] = 0$, with
$$ g(w;\theta,\gamma) = \theta d - (y - \gamma(x_1,0))d. $$
Then, from Example 3 in the previous section, for
$$ \alpha_0(X) = \frac{p(X_1)}{1-p(X_1)}(1-D), $$
it holds that
$$\begin{aligned} \frac{d}{d\tau}E[g(W;\theta,\gamma(F_\tau))] &= \frac{d}{d\tau}E[\gamma(F_\tau)(X_1,0)D] \\ &= \frac{d}{d\tau}E[\gamma(F_\tau)(X)\alpha_0(X)] \\ &= -\frac{d}{d\tau}E[(Y - \gamma(F_\tau)(X))\alpha_0(X)] \\ &= \frac{d}{d\tau}E_\tau[(Y - \gamma_0(X))\alpha_0(X)] \\ &= \int\phi(w;\gamma_0,\alpha_0)H(dw), \qquad (24) \end{aligned}$$
with $\phi(w;\gamma,\alpha) = (y - \gamma(x))\alpha(x)$, where the second equality is by the Riesz representation, the third by definition, the fourth by applying the chain rule to the orthogonality condition
$$ E_\tau[(Y - \gamma(F_\tau)(X))\alpha_0(X)] = 0, $$
and the last equality in (24) by noting that
$$ \frac{d}{d\tau}E_\tau[(Y - \gamma_0(X))\alpha_0(X)] = \int(y - \gamma_0(x))\alpha_0(x)H(dw). $$
Thus, the LR moment is
$$\begin{aligned} \psi(w;\theta,\gamma,\alpha) &= \theta d - (y - \gamma_0(x_1,0))d + (y - \gamma_0(x_1,0))\frac{p(x_1)}{1-p(x_1)}(1-d) \\ &= \theta d - (y - \gamma_0(x_1,0))\left[d - \frac{p(x_1)}{1-p(x_1)}(1-d)\right]. \end{aligned}$$
The LR/debiased estimator is
$$ \hat\theta_{ATT} = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{i\ell}, \qquad \hat\psi_{i\ell} = \hat\psi_{1i\ell}/\hat{\bar p}, $$
where $\hat{\bar p} = n^{-1}\sum_{i=1}^nD_i$ and
$$ \hat\psi_{1i\ell} = (Y_i - \hat\gamma_\ell(X_{1i},0))\left[D_i - (1-D_i)\frac{\hat p_\ell(X_{1i})}{1-\hat p_\ell(X_{1i})}\right]. $$

Example 4 (LATE): We apply the previous results to a setting with Instrumental


Variables (IV). This setting is an alternative to the selection on observables assumption in
observational data. Let Di be a treatment indicator (=1 if treated, 0 otherwise), Xi a p-
dimensional vector of pre-treatment variables, Zi a binary instrument, Yi (1) be the outcome
under treatment, Yi (0) be the outcome without treatment, and Yi = Yi (1)Di + Yi (0)(1 Di ).
Similarly, de…ne potential treatments such that Di = Di (1)Zi + Di (0)(1 Zi ): The data
observation is W = (Y; X; D; Z). The LATE parameter is de…ned as

0 = LAT E := E [Yi (1) Yi (0)j Di (1) > Di (0)] :

Here, the conditioning event C = fi : Di (1) > Di (0)g is the set of compliers. The assump-
tions considered in the literature are:

IV1: (Yi (1); Yi (0); Di (1); Di (0)) and Zi are independent, conditional on Xi ;

IV2: 0 < p(x) < 1; where p(x) = E[Zi jXi = x]; and P (Di (1) = 1jXi ) > P (Di (0) = 1jXi )
a.s.

IV3: P (Di (1) Di (0)jXi ) = 1 a.s.

These assumptions have been extensively discussed in e.g. Imbens and Angrist (1995) and
Abadie (2003). These authors have shown that $\theta_0$ is identified by the moment condition $E[g(W;\theta_0,\gamma_0)] = 0$, where
$$ g(w;\theta,\gamma) = \theta(\gamma_2(x,1) - \gamma_2(x,0)) - (\gamma_1(x,1) - \gamma_1(x,0)), $$
$$ \gamma_{01}(x,z) = E[Y|X = x, Z = z], \qquad \gamma_{02}(x,z) = E[D|X = x, Z = z], \qquad \gamma_0 = (\gamma_{01},\gamma_{02}). $$
That is, the LATE parameter is identified as
$$ \theta_0 = \frac{E[E[Y|X,Z=1] - E[Y|X,Z=0]]}{E[E[D|X,Z=1] - E[D|X,Z=0]]}. $$

To find the corresponding FSIF, let $E_\tau$ denote expectation under $F_\tau = (1-\tau)F_0 + \tau H$, and define $\gamma_j(F_\tau)(x,z) = E_\tau[Y_j|X = x, Z = z]$, for $Y_1 := Y$ and $Y_2 := D$. Note by the chain rule
$$ \frac{d}{d\tau}E[g(W;\theta,\gamma(F_\tau))] = \theta\frac{d}{d\tau}E[\gamma_2(F_\tau)(X,1) - \gamma_2(F_\tau)(X,0)] - \frac{d}{d\tau}E[\gamma_1(F_\tau)(X,1) - \gamma_1(F_\tau)(X,0)]. $$
For $j = 1,2$, and arguing as for the ATE but with the Horvitz-Thompson weight
$$ \alpha_0(X,Z) = \frac{Z}{p_0(X)} - \frac{1-Z}{1-p_0(X)}, \qquad p_0(X) = E[Z|X], $$
we obtain
$$ \frac{d}{d\tau}E[\gamma_j(F_\tau)(X,1) - \gamma_j(F_\tau)(X,0)] = \frac{d}{d\tau}E[\gamma_j(F_\tau)(X,Z)\alpha_0(X,Z)] = \frac{d}{d\tau}E_\tau[(Y_j - \gamma_{0j}(X,Z))\alpha_0(X,Z)] = \int\phi_j(w;\gamma_{0j},\alpha_0)H(dw), \qquad (25) $$
with $\phi_j(w;\gamma_j,\alpha) = (y_j - \gamma_j(x,z))\alpha(x,z)$, where the first equality is by the Riesz representation and the second by applying the chain rule to the orthogonality condition
$$ E_\tau[(Y_j - \gamma_j(F_\tau)(X,Z))\alpha_0(X,Z)] = 0. $$
The FSIF for the LATE is
$$ \phi(w;\gamma_0,\alpha_0,\theta) = \theta\,\phi_2(w;\gamma_{02},\alpha_0) - \phi_1(w;\gamma_{01},\alpha_0). $$

The adjustment term or FSIF is like a directional derivative of the identifying moment with respect to the nuisance parameter. If the FSIF is not zero, the original identifying moment is not locally robust, and inference based on it will be biased when ML first steps are used. Orthogonal moment functions can be constructed by adding the FSIF to the identifying moment functions to obtain
$$ \psi(W;\theta,\gamma,\alpha) = g(W;\theta,\gamma) + \phi(W;\gamma,\alpha,\theta). \qquad (26) $$

Example 4 (LATE): (continued) The LR moment for the LATE is given by
$$ \psi(w;\theta,\gamma,\alpha) = \theta\left[(\gamma_2(x,1) - \gamma_2(x,0)) + \phi_2(w;\gamma_2,\alpha)\right] - \left[(\gamma_1(x,1) - \gamma_1(x,0)) + \phi_1(w;\gamma_1,\alpha)\right]. $$
This DR moment was first derived by Tan (2006).

This vector of moment functions has two key orthogonality properties. The first property is that, for the set $\Gamma$ of possible directions of departure of $\gamma(F)$ from $\gamma_0$, which we assume to be linear,
$$ \frac{d}{dt}E[\psi(W;\theta,\gamma_0 + t\delta,\alpha_0)] = 0 \quad\text{for all } \delta\in\Gamma \text{ and } \theta\in\Theta, \qquad (27) $$
where $t$ is a scalar and the derivative is evaluated at $t = 0$. Here $\delta$ represents a possible direction of deviation of $\gamma(F)$ from $\gamma_0$ and $t$ the size of a deviation. This property means that varying $\gamma$ away from $\gamma_0$ has no effect, locally, on $E[\psi(W;\theta,\gamma,\alpha_0)]$. The second property is that, for the set $A$ of $\alpha_0$ such that equation (23) is satisfied for some $F_0$,
$$ E[\phi(W;\gamma_0,\alpha,\theta)] = 0 \quad\text{for all } \theta\in\Theta \text{ and } \alpha\in A. \qquad (28) $$
Consequently, varying $\alpha$ has no effect, globally, on $E[\psi(W;\theta,\gamma_0,\alpha)] = E[g(W;\theta,\gamma_0)] + E[\phi(W;\gamma_0,\alpha,\theta)] = E[g(W;\theta,\gamma_0)]$. These properties are shown in CEINR.

Example 4 (LATE): (continued) Similarly to (20), one can show

E[ j (W; ; )] = E[ j (W; 0j ; 0 )] E[f (X) 0 (X)gf j (X) 0j (X)g];

which implies the LR property. On the other hand, iterated expectations gives E[ (W; 0; ; )] =
0, so that both of equations (27) and (28) are satis…ed.

Example 4 (LATE, Partially Linear): We now consider the same partially linear version as for the ATE but with an IV, so the parameter solves $E[g(W;\theta_0,\gamma_0)] = 0$, with
$$ g(w;\theta,\gamma) = (y - \theta d - \gamma(x))z. $$
The same calculations for the FSIF go through, but now with $\alpha_0(X) = E[Z|X]$. The LR moment is
$$ \psi(w;\theta,\gamma,\alpha) = (y - \theta d - \gamma(x))(z - \alpha(x)). $$
An alternative LR moment that only uses conditional mean first steps, based on Robinson (1988), is
$$ \psi(w;\theta,\gamma,\alpha) = (y - \gamma_1(x) - \theta[d - \gamma_2(x)])(z - \alpha(x)), $$
where $\gamma = (\gamma_1,\gamma_2)$, $\gamma_{10}(x) = E[Y|X]$ and $\gamma_{20}(x) = E[D|X]$.


Example 5 (MLD): The Mean Logarithmic Deviation (MLD) is an important measure of Inequality of Opportunity, see Terschuur (2022) and references therein. It is defined as
$$ \theta_0 = \ln E[\gamma_0(X)] - E[\ln\gamma_0(X)] \equiv \ln\theta_{01} - \theta_{02}, $$
where the identifying moments for $(\theta_{01},\theta_{02})$ are $E[\gamma_0(X) - \theta_{01}] = 0$ and $E[\ln\gamma_0(X) - \theta_{02}] = 0$. Here $\gamma_0(X) = E[Y|X]$, $Y$ is income and $X$ is a vector of the individual's circumstances (variables out of the control of the individual, such as parental education). We aim to quantify unfair inequality, i.e. inequality due to circumstances.
The LR moment for the first component is $E[Y - \theta_{01}] = 0$. For the second, Terschuur (2022) applies the general approach of CEINR to obtain the LR moment
$$ \psi(W;\theta_2,\gamma,\alpha) = \ln\gamma(X) - \theta_2 + \phi(W;\gamma,\alpha), $$
with $\phi(w;\gamma,\alpha) = (y - \gamma(x))\alpha(x)$ and $\alpha_0(x) = 1/\gamma_0(x)$. Therefore, a LR estimator of $\theta_0$ is given by
$$ \hat\theta = \ln E_n[Y] - \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\left\{\ln\hat\gamma_\ell(X_i) + \hat\alpha_\ell(X_i)[Y_i - \hat\gamma_\ell(X_i)]\right\}, $$
where $\hat\alpha_\ell(X_i) = 1/\hat\gamma_\ell(X_i)$. Escanciano and Terschuur (2022) generalize the setting in CEINR to U-statistics and apply these results to obtain debiased inference for the Gini coefficient to quantify Inequality of Opportunity.
Constructing orthogonal moment functions is greatly facilitated by the wide variety of
known (W; ; ; ): For …rst step least squares projections (including conditional expecta-
tions), density weighted conditional means, and their derivatives (W; ; ; ) is given in
Newey (1994a). Hahn (1998) and Hirano, Imbens, and Ridder (2003) used those results to
obtain (W; ; ; ) for treatment e¤ect estimators. Bajari et al. (2010) derived (W; ; ; )
for some …rst steps used in structural estimation. Hahn and Ridder (2013, 2019) derived
(W; ; ; ) for generated regressors that depend on …rst step conditional expectations; see
Escanciano and Perez-Izquierdo (2022) for more general results with generated regressors,
including automatic estimation of . Chen and Liao (2015) derived (W; ; ; ) for a …rst
step that approximately minimizes the sample average of a function of a data observation
and . Ai and Chen (2007, p. 40) and Ichimura and Newey (2022) give (W; ; ; ) for
…rst step estimators of functions satisfying conditional moment and orthogonality conditions
respectively. Semenova (2018) derived (W; ; ; ) for support functions used in partial
identi…cation. This wide variety of known (W; ; ; ) can be used to construct orthogonal
moment functions in many settings. Singh and Sun (2022) have derived the LR moments
for general conditional moments of compliers, following important work by Abadie (2003)
and generalizing the results of Tan (2006). They based their construction of the results in
CEINR. Argañaraz and Escanciano (2023) have recently obtained expressions for in mod-
els with unobserved heterogeneity, including panel data models with …xed e¤ects, random
coe¢ cient models, etc.

11.1 Cross-Fitting in the general case


As before, partition the observation indices $(i = 1,\ldots,n)$ into $L$ groups $I_\ell$, $(\ell = 1,\ldots,L)$. Consider $\hat\gamma_\ell$, $\hat\alpha_\ell$, and an initial estimator $\tilde\theta_\ell$ that are constructed using all observations not in $I_\ell$. Debiased sample moment functions are
$$ \hat\psi(\theta) = \hat g(\theta) + \hat\phi, \qquad \hat g(\theta) = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}g(W_i;\hat\gamma_\ell,\theta), \qquad \hat\phi = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\phi(W_i;\hat\gamma_\ell,\hat\alpha_\ell,\tilde\theta_\ell). \qquad (29) $$
A debiased GMM estimator is
$$ \hat\theta = \arg\min_{\theta\in\Theta}\hat\psi(\theta)'\hat\Upsilon\hat\psi(\theta), \qquad (30) $$
where $\hat\Upsilon$ is a positive semi-definite weighting matrix and $\Theta$ is the set of parameter values. A choice of $\hat\Upsilon$ that minimizes the asymptotic variance of $\hat\theta$ is $\hat\Upsilon = \hat\Omega^{-1}$, for
$$ \hat\Omega = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{i\ell}\hat\psi_{i\ell}', \qquad \hat\psi_{i\ell} = g(W_i;\hat\gamma_\ell,\tilde\theta_\ell) + \phi(W_i;\hat\gamma_\ell,\hat\alpha_\ell,\tilde\theta_\ell). $$
There is no need to account for the presence of $\hat\gamma_\ell$ and $\hat\alpha_\ell$ in $\hat\psi_{i\ell}$ because of the orthogonality of $\psi(W;\theta,\gamma,\alpha)$. An estimator $\hat V$ of the asymptotic variance of $\sqrt{n}(\hat\theta - \theta_0)$ is
$$ \hat V = (\hat G'\hat\Upsilon\hat G)^{-1}\hat G'\hat\Upsilon\hat\Omega\hat\Upsilon\hat G(\hat G'\hat\Upsilon\hat G)^{-1}, \qquad \hat G = \frac{\partial\hat g(\hat\theta)}{\partial\theta}. \qquad (31) $$
The initial estimator $\tilde\theta_\ell$ can be based only on the identifying moment conditions and constructed as
$$ \tilde\theta_\ell = \arg\min_{\theta\in\Theta}\hat g_\ell(\theta)'\hat\Upsilon_\ell\hat g_\ell(\theta), \qquad \hat g_\ell(\theta) = \frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}g(W_i;\tilde\gamma_{\ell\ell'},\theta), $$
where $\hat\Upsilon_\ell$ uses only observations not in $I_\ell$, $n_\ell$ is the number of observations in $I_\ell$, and $\tilde\gamma_{\ell\ell'}$ uses observations not in $I_\ell$ and not in $I_{\ell'}$.
The e¢ ciency of debiased GMM is entirely determined by the choice of moment functions,
…rst step, and weighting matrix. The matrix ^ 1 is an optimal choice of weighting matrix as
usual for GMM. The presence of ^ in the orthogonal moment functions ^ ( ) does not a¤ect
identi…cation of . The FSIF has mean zero for all possible distributions of W so that ^ will
converge in probability to zero. The sole purpose of including ^ is to remove the local e¤ect
of ^ ` on average moment functions.

Example 4 (LATE): (continued) The LR/debiased estimator is
$$ \hat\theta_{LATE} = \frac{\frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{1i\ell}}{\frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{2i\ell}}, $$
where, for $Y_{1i} = Y_i$ and $Y_{2i} = D_i$,
$$ \hat\psi_{ji\ell} = \hat\gamma_{ji\ell}(1) - \hat\gamma_{ji\ell}(0) + \frac{Z_i\left(Y_{ji} - \hat\gamma_{ji\ell}(1)\right)}{\hat p_{i\ell}} - \frac{(1-Z_i)\left(Y_{ji} - \hat\gamma_{ji\ell}(0)\right)}{1-\hat p_{i\ell}}, $$
$\hat\gamma_{ji\ell}(z) = \hat\gamma_{j\ell}(X_i,z)$, $z = 0,1$, and $\hat p_{i\ell} = \hat p_\ell(X_i)$.
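A sketch of the cross-fitted LATE estimator as the ratio of two debiased reduced-form effects (random forests again stand in for generic first-step learners; the simulated data and the trimming constant are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import KFold

def debiased_late(Y, D, Z, X, n_folds=5):
    """Cross-fitted LATE: ratio of debiased intention-to-treat and first-stage effects."""
    n = len(Y)
    psi1, psi2 = np.zeros(n), np.zeros(n)
    for train, test in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        p_hat = RandomForestClassifier(random_state=0).fit(X[train], Z[train])
        p = np.clip(p_hat.predict_proba(X[test])[:, 1], 0.01, 0.99)
        for j, Yj in enumerate((Y, D)):    # j = 0: outcome equation, j = 1: treatment equation
            m1 = RandomForestRegressor(random_state=0).fit(X[train][Z[train] == 1], Yj[train][Z[train] == 1])
            m0 = RandomForestRegressor(random_state=0).fit(X[train][Z[train] == 0], Yj[train][Z[train] == 0])
            g1, g0 = m1.predict(X[test]), m0.predict(X[test])
            psi = (g1 - g0 + Z[test] * (Yj[test] - g1) / p
                   - (1 - Z[test]) * (Yj[test] - g0) / (1 - p))
            (psi1 if j == 0 else psi2)[test] = psi
    return psi1.mean() / psi2.mean()

rng = np.random.default_rng(10)
n = 2000
X = rng.normal(size=(n, 3))
Z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
D = (Z * rng.binomial(1, 0.8, n)) | rng.binomial(1, 0.1, n)    # imperfect compliance
Y = D * 2.0 + X[:, 1] + rng.normal(size=n)
print(debiased_late(Y, D.astype(float), Z, X))
```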

11.2 Automatic Estimation of 0

The debiased moments require a …rst step estimator ^ ` with plim 0 : When the form of 0 is
known one can plug-in nonparametric estimators of unknown components of 0 to form ^ ` ;
as in the previous examples. We can also use the orthogonality of (W; ; 0 ; ) with respect
to in equation (27) to construct estimators of 0 without knowing the form of 0 : This
approach is "automatic" in only requiring the orthogonal moment function (W; ; ; ) and
data for construction of ^ ` .
Equation (27) can be thought of as a population moment condition for $\alpha_0$, for each $\delta$. We can form a corresponding sample moment function by replacing the expectation by a sample average and $\gamma_0$ and $\theta_0$ by cross-fit estimators, to obtain
$$ \hat\psi_\ell(\delta,\alpha) = \frac{d}{dt}\left.\left\{\frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}\psi(W_i;\tilde\gamma_{\ell\ell'} + t\delta,\alpha,\tilde\theta_{\ell\ell'})\right\}\right|_{t=0}, \quad \delta\in\Gamma, \qquad (32) $$
where $\tilde\gamma_{\ell\ell'}$ and $\tilde\theta_{\ell\ell'}$ do not depend on observations in $I_\ell$ or $I_{\ell'}$, and we assume that $\psi(W_i;\tilde\gamma_{\ell\ell'} + t\delta,\alpha,\tilde\theta_{\ell\ell'})$ is differentiable in $t$. We can then replace $\alpha$ by a sieve (i.e. a parametric approximation) and estimate the sieve parameters using these sample moments for a variety of choices of $\delta$. We can also regularize, to allow for a high dimensional specification of $\alpha$. The sample moments in equation (32) depend only on observations not in $I_\ell$, so that the resulting $\hat\alpha_\ell$ will also, as required for the cross-fitting in debiased GMM.

Example 1: (continued) Here $\alpha_0$ is a function of $X$ that has finite second moment and
$$ \hat\psi_\ell(\delta,\alpha) = \frac{d}{dt}\left.\left\{\frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}\Big(Z_i[\hat\gamma_{\ell\ell'}(X_i) + t\delta(X_i)] - \tilde\theta_{\ell\ell'} + \alpha(X_i)[Y_i - \hat\gamma_{\ell\ell'}(X_i) - t\delta(X_i)]\Big)\right\}\right|_{t=0} = \frac{1}{n-n_\ell}\sum_{i\notin I_\ell}[Z_i - \alpha(X_i)]\delta(X_i). $$
This is a sample moment corresponding to the moment condition $E[\{Z - \alpha_0(X)\}\delta(X)] = 0$, which holds because $\alpha_0(X) = E[Z|X]$. If $\alpha(X)$ is replaced by a linear combination $\rho'b(x)$ of a dictionary $b(x) = (b_1(x),\ldots,b_p(x))'$ of functions and $\delta(X)$ is chosen to be one element $b_j(X)$ of the dictionary, then the sample moment function is
$$ \hat\psi_\ell(b_j,\rho'b) = \frac{1}{n-n_\ell}\sum_{i\notin I_\ell}[Z_i - \rho'b(X_i)]b_j(X_i). $$
The collection of sample moment conditions $\hat\psi_\ell(b_j,\rho'b) = 0$ $(j = 1,\ldots,p)$ are the first order conditions for minimizing the least squares objective function for the regression of $Z_i$ on $b(X_i)$. Adding an $L_1$ penalty to this objective function and minimizing leads to the Lasso least squares estimator $\hat\alpha_\ell(x) = \hat\rho'b(x)$, with
$$ \hat\rho = \arg\min_\rho\left\{\sum_{i\notin I_\ell}[Z_i - \rho'b(X_i)]^2/2 + r\sum_{j=1}^p|\rho_j|\right\}. $$

The construction of ^ ` in Example 1 can be generalized to a wide class of regression


settings that will be useful for many additional examples. This generalization builds on
Example 1 to enable estimation of more complicated objects such as a functional of quantile
regression in CEINR. We generalize from being a conditional expectation to an object
that is restricted to be an element of a linear set : In Example 1 where the conditional
expectation is an unknown function of X the is all functions of X with …nite second
moment. Instead could, for example, be restricted to be a linear combination of a sequence
(b1 (X); b2 (X); :::), corresponding a high dimensional regression. We also generalize (F ) from
a conditional expectation to a function satisfying an orthogonality condition. For a scalar
residual (w; (x)) we consider (F ) 2 satisfying

EF [ (X) (W; (F )(X))] = 0 for all 2 : (33)

The (F )(X) = EF [Y jX] of Example 1 satis…es this equation for (W; ) = Y (X) and
equal to all functions of X with …nite second moment. The HD Logit MLE satis…es this
equation with (W; ) = Y ( (X)); (u) = 1=(1 + exp( u)) and equal to a linear
combinations of (b1 (X); b2 (X); :::). Here 0 (X) = 00 b(X); where 0 corresponds to the limit
of a Logit Lasso
p
1X X
n

n = arg min (Zi ) + j jj


n i=1 j=1

where (Zi ) = Yi log ( 0 b(Xi )) (1 Yi ) log(1 ( 0 b(Xi ))):


A HD quantile regression will satisfy this equation for (w; (x)) = 1(y < (x)) for
0 < < 1 and equal to a linear combinations of (b1 (X); b2 (X); :::): The FSIF for (F ) in
equation (33) is
(w; ; ; ) = (x; ) (w; (x));

as in Ai and Chen (2007, p. 40) for conditional moments when is unrestricted and Ichimura
and Newey (2022) in general, where the formula for (x; ) is given.

Example 1 can be further generalized to consider a general vector of identifying moment functions other than $z\gamma(x) - \theta$. Here there will be one $\alpha_k(x)$ for each component of $g(w;\theta,\gamma)$. Let $(b_1(X),b_2(X),\ldots)$ "span" $\Gamma$, meaning any $\delta\in\Gamma$ can be well approximated in mean-square by a finite linear combination of the $b_j(X)$. We describe an estimator of the $k$-th component $\alpha_k(x)$ of $\alpha(x)$, corresponding to the $k$-th component $g_k(w;\theta,\gamma)$ of $g(w;\theta,\gamma)$. Let $\hat\lambda_i = d\lambda(w_i,\hat\gamma_{\ell\ell'}(X_i) + t)/dt|_{t=0}$ for $i\in I_{\ell'}$ and let $e_j$ denote the $j$-th column of a $p$-dimensional identity matrix. Then for $b(X) = (b_1(X),\ldots,b_p(X))'$,
$$ \hat\psi_{k\ell}(b_j,\rho'b) = \hat M_{jk\ell} + e_j'\hat Q_\ell\rho, $$
$$ \hat M_{jk\ell} = \frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}\frac{d}{dt}g_k(W_i;\hat\gamma_{\ell\ell'} + tb_j,\tilde\theta_{\ell\ell'})\Big|_{t=0}, \qquad \hat Q_\ell = \frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}\hat\lambda_i\,b(X_i)b(X_i)', $$
corresponding to the $k$-th orthogonal moment function $\psi_k(w;\theta,\gamma,\alpha) = g_k(w;\theta,\gamma) + \alpha_k(x)\lambda(w,\gamma(x))$ and the $j$-th component $b_j$ of $b(X)$. Let $\hat M_{k\ell} = (\hat M_{1k\ell},\ldots,\hat M_{pk\ell})'$, so that
$$ \hat\psi_{k\ell}(b_j,\rho'b) = \frac{\partial}{\partial\rho_j}\left\{\hat M_{k\ell}'\rho + \rho'\hat Q_\ell\rho/2\right\}. $$
The collection of sample moment conditions $\hat\psi_{k\ell}(b_j,\rho'b) = 0$, $(j = 1,\ldots,p)$, are the first order conditions for minimizing $-\hat M_{k\ell}'\rho - \rho'\hat Q_\ell\rho/2$. Here we assume that we can normalize $\lambda$ so that $v_\lambda(X) := dE[\lambda(W,\gamma_0(X) + t)|X]/dt|_{t=0} < 0$, so that $-\hat Q_\ell$ is positive semi-definite asymptotically and the minimization is convex. Adding an $L_1$ penalty to this objective function and minimizing leads to the Lasso minimum distance estimator $\hat\alpha_{k\ell}(x) = \hat\rho_{k\ell}'b(x)$, with
$$ \hat\rho_{k\ell} = \arg\min_\rho\left\{-\hat M_{k\ell}'\rho - \rho'\hat Q_\ell\rho/2 + r\sum_{j=1}^p|\rho_j|\right\}. \qquad (34) $$
As usual for Lasso, we assume that each element $b_j(x)$ of the dictionary has been standardized to have standard deviation 1. One choice of $r$ would be $\hat r = \arg\min_r\sum_{\ell=1}^L\{-\hat M_{k\ell}'\hat\rho^r_{k\ell} - \hat\rho^{r\prime}_{k\ell}\hat Q_\ell\hat\rho^r_{k\ell}/2\}$, which minimizes a cross-validation criterion, where $\hat\rho^r_{k\ell}$ is from equation (34) for a given $r$ and the minimization is over a grid of $r$ values.
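A small sketch of how (34) could be solved by proximal gradient (ISTA) once $\hat M_{k\ell}$ and $\hat Q_\ell$ have been computed. Here $A = -\hat Q_\ell$ is assumed positive semi-definite, as in the normalization above, and the inputs below are placeholders rather than cross-fit quantities:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_min_distance(M, A, r, n_iter=5000):
    """Minimize -M'rho + rho'A rho/2 + r*||rho||_1 by ISTA, with A = -Q_hat assumed psd."""
    p = len(M)
    step = 1.0 / max(np.linalg.eigvalsh(A).max(), 1e-12)   # 1 / Lipschitz constant of the gradient
    rho = np.zeros(p)
    for _ in range(n_iter):
        grad = -M + A @ rho
        rho = soft(rho - step * grad, step * r)
    return rho

# placeholder inputs: in practice M and A come from the cross-fit sums defined above
rng = np.random.default_rng(11)
b = rng.normal(size=(500, 20))                     # dictionary evaluated at the data
A = b.T @ b / 500                                  # equals -Q_hat in the conditional-mean case (lambda_hat = -1)
M = A @ np.r_[1.0, -0.5, np.zeros(18)]             # implied moments for a sparse alpha_0
rho_hat = lasso_min_distance(M, A, r=0.05)
alpha_hat = lambda bx: bx @ rho_hat                # alpha_hat(x) = b(x)' rho_hat
```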
This estimator ^ k` (x) of k0 (x; 0 ) generalizes that of Chernozhukov, Newey, and Singh
(2018) to allow for a residual (w; (x)) other than y (x) and a general moment function.
The nested sample splitting used for ^ ``0 (Xi ) requires that the …rst step learner be computed
for L2 subsamples. Use of ^ ` (x) as a starting value for computation of each ^ ``0 (Xi ) may aid

in this computation. The nested cross-…tting allows for a very general …rst step that need
only have a mean-square convergence rate.
Orthogonality was used to estimate unknown components of doubly robust moment func-
tions for the average treatment e¤ect by Vermeulen and Vansteelandt (2015), Tan (2020),
and Avagyan and Vansteelandt (2021) in order to obtain standard errors that are robust
to misspeci…cation. We use orthogonality here to estimate the unknown 0 in the FSIF
for a general identifying moment function and …rst step. The resulting standard errors are
robust to misspeci…cation because the FSIF takes full account of the plim of ^ under general
misspeci…cation.
This approach to estimating the FSIF uses its form (w; ; ; ) to construct an estima-
tor of 0 . This approach is parsimonious in estimating only unknown parts of (w; ; ; )
rather than the whole function. It is also possible to estimate the entire FSIF using just
the …rst step and the identifying moments. Such estimators are available for …rst step series
and kernel estimation. More recently, they have become available for models with unob-
served heterogeneity in Argañaraz and Escanciano (2023). For …rst step series estimation
an estimator of (w; ; ; ) can be constructed by treating the …rst step estimator as if it
were parametric and applying a standard formula for parametric two-step estimators, e.g.
as in Newey (1994a), Ackerberg, Chen, and Hahn (2012), and Chen and Liao (2015). For
parametric maximum likelihood the resulting orthogonal moment functions are the basis of
Neyman’s (1959) C-alpha test. For …rst step kernel estimation one can use the numerical
in‡uence function estimator of Newey (1994b) to estimate (w; ; ; ); as shown in Bravo,
Escanciano, and van Keilegom (2020). It is also possible to estimate the FSIF using a numer-
ical derivative version of equation (23). This approach has been given in Carone, Luedtke,
and van der Laan (2016) and Bravo, Escanciano, and van Keilegom (2020) for construction
of orthogonal moment functions.

11.3 Functionals of a Quantile Regression


The object of interest in this section is an expected linear function of a quantile regression,
$$ \theta_0 = E[m(W;\gamma_0)], \qquad (35) $$
$$ \gamma(F) = \arg\min_{\gamma\in\Gamma}E_F[v_\tau(Y - \gamma(X))], \qquad v_\tau(u) = [\tau - 1(u < 0)]u, \quad 0 < \tau < 1, $$
where $m(w;\gamma)$ is a linear functional of $\gamma$, $Y$ is a dependent variable of interest, $\Gamma$ is a linear set of functions of $x$ (such as all functions of $X$ with finite second moment), and we assume the minimum $\gamma(F)$ exists. An example of $m(w;\gamma)$ is a weighted average derivative of $\gamma$, where $m(w;\gamma) = \int\omega(x)[\partial\gamma(x)/\partial x_1]dx$ for a weight $\omega(x)$. Here the identifying moment function is $g(w;\theta,\gamma) = m(w;\gamma) - \theta$. The first order condition for this $\gamma(F)$ is equation (33) with $\lambda(w,\gamma(x)) = 1(y < \gamma(x)) - \tau$, so the FSIF has the form $\phi(w;\gamma,\alpha) = \alpha(x)\lambda(y,\gamma(x))$.
In this example the automatic estimator $\hat\rho_\ell$ of equation (34) does not exist because $\lambda(y,\gamma(x) + t)$ is not continuous in $t$, and hence not differentiable. We address this complication by using kernel weighting in the construction of a $\hat Q_\ell$ to use in equation (34). Let $\hat\gamma_\ell(x)$ be a learner of $\gamma_0$, computed from observations not in $I_\ell$, and $\hat\gamma_{\ell\ell'}(x)$ be computed from observations not in $I_\ell$ or $I_{\ell'}$. Also let $K(u)$ be a bounded, univariate kernel with $\int K(u)du = 1$ and $\int K(u)u\,du = 0$, $h$ a bandwidth, and $b(x)$ a $p\times 1$ vector of functions of $x$. Then for $b(X) = (b_1(X),\ldots,b_p(X))'$,
$$ \hat\psi_\ell(b_j,\rho'b) = \hat M_{j\ell} + e_j'\hat Q_\ell\rho, $$
$$ \hat M_{j\ell} = \frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}\frac{d}{dt}m(W_i;\hat\gamma_{\ell\ell'} + tb_j)\Big|_{t=0}, \qquad \hat Q_\ell = \frac{1}{n-n_\ell}\sum_{\ell'\ne\ell}\sum_{i\in I_{\ell'}}\frac{1}{h}K\!\left(\frac{Y_i - \hat\gamma_{\ell\ell'}(X_i)}{h}\right)b(X_i)b(X_i)'. $$
The role of the kernel term in this $\hat Q_\ell$ is to smooth over the discontinuity of $\lambda(w,\gamma(x)) = 1(y < \gamma(x)) - \tau$ at $\gamma(x) = y$. This $\hat Q_\ell$ estimates $E[f(0|X)b(X)b(X)']$, where $f(0|X)$ is the conditional pdf of $Y - \gamma_0(X)$ evaluated at $0$. The automatic estimator is $\hat\alpha_\ell(x) = \hat\rho_\ell'b(x)$, with
$$ \hat\rho_\ell = \arg\min_\rho\left\{\hat M_\ell'\rho + \rho'\hat Q_\ell\rho/2 + r\sum_{j=1}^p|\rho_j|\right\}. $$

A debiased GMM estimator of $\theta_0$ can be formed from any learner $\hat\gamma_\ell$ of $\gamma_0$ as
$$ \hat\theta = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\left[m(W_i;\hat\gamma_\ell) + \hat\alpha_\ell(X_i)\lambda(W_i,\hat\gamma_\ell(X_i))\right], \qquad \hat V = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat\psi_{i\ell}^2, $$
$$ \hat\alpha_\ell(x) = b(x)'\hat\rho_\ell, \qquad \hat\rho_\ell = \arg\min_\rho\left\{2\hat M_\ell'\rho + \rho'\hat Q_\ell\rho + 2r\sum_{j=1}^p|\rho_j|\right\}, \qquad (36) $$
$$ \hat M_{\ell j} = \frac{1}{n-n_\ell}\sum_{i\notin I_\ell}m(W_i;b_j), \qquad \hat M_\ell = (\hat M_{\ell 1},\ldots,\hat M_{\ell p})', $$

where $\hat\psi_{i\ell} = m(W_i;\hat\gamma_\ell) + \hat\alpha_\ell(X_i)\lambda(W_i,\hat\gamma_\ell(X_i)) - \hat\theta$. The theory requires that the regularization constant $r$ for $\hat\alpha_\ell$ goes to zero more slowly than the conventional Lasso rate of $\sqrt{\ln(p)/n}$, in order to accommodate the presence of the kernel weighting and of $\hat\gamma_{\ell\ell'}(X_i)$ in $\hat Q_\ell$. Further details and references can be found in CEINR.

Example 6 (ES): As an application and generalization of the results of this section, consider inference on the Expected Shortfall (ES) with a HD quantile. The ES is an important tool in Risk Management. It is defined as
$$ \theta_0 = E[Y|Y < \gamma_0(X)], $$
which can be written as the solution to the moment condition
$$ E[Y1(Y < \gamma_0(X)) - \tau\theta_0] = 0. $$
In this example, $g(w;\theta_0,\gamma_0) = y1(y < \gamma_0(x)) - \tau\theta_0$ is a nonlinear functional of a HD quantile.


Applying iterated expectations
"Z #
(F )
E [Y 1 (Y < (F ))] = E yfY =X (y)dy ;
1

where fY =X is the conditional density of Y given X. Then, by Fubini and the Fundamental
Theorem of Calculus

d d
E[g(W; (F ); )] = E [Y 1 (Y < (F ))]
d d
d
= E (F0 ) fY =X ( (F0 )) (F )
d
d
= E[ 0 (X) (F )];
d

where
0 (X) = (F0 ) fY =X ( (F0 )):

In addition to be nonlinear, the functional g(w; 0 ; 0 ) = y1 (y < 0 (x)) 0 is also non-

smooth, which implies that the automatic estimator for 0 (x) above needs to be modi…ed.

Arguing as for $\lambda(w,\gamma(x)) = 1(y < \gamma(x)) - \tau$, we suggest

$$\hat{\Lambda}(b_j,\rho'b) = \hat{M}_{j\ell} + e_j'\hat{Q}_\ell\rho,$$
$$\hat{M}_\ell = \frac{1}{n-n_\ell}\sum_{\ell'\neq\ell}\sum_{i\in I_{\ell'}} \frac{1}{h}\,Y_i\,K\!\left(\frac{Y_i - \hat{\gamma}_{\ell\ell'}(X_i)}{h}\right) b(X_i),$$
$$\hat{Q}_\ell = \frac{1}{n-n_\ell}\sum_{\ell'\neq\ell}\sum_{i\in I_{\ell'}} \frac{1}{h}\,K\!\left(\frac{Y_i - \hat{\gamma}_{\ell\ell'}(X_i)}{h}\right) b(X_i)b(X_i)'.$$

The automatic estimator in the ES example is, for $b(X) = (b_1(X),\ldots,b_p(X))'$, $\hat{\alpha}_\ell(x) = \hat{\rho}_\ell'b(x)$, with

$$\hat{\rho}_\ell = \arg\min_\rho\Big\{\hat{M}_\ell'\rho + \rho'\hat{Q}_\ell\rho/2 + r\sum_{j=1}^p|\rho_j|\Big\}.$$

A debiased ES estimator and its asymptotic variance can be formed from any learner $\hat{\gamma}_\ell$ of $\gamma_0$ as

$$\hat{\theta} = \frac{1}{\tau n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat{\psi}_{i\ell}, \qquad
\hat{\psi}_{i\ell} = Y_i\,1(Y_i<\hat{\gamma}_\ell(X_i)) + \hat{\alpha}_\ell(X_i)\lambda(W_i,\hat{\gamma}_\ell(X_i)),$$
$$\hat{V} = \frac{1}{\tau^2 n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\big(\hat{\psi}_{i\ell} - \tau\hat{\theta}\big)^2.$$
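A minimal sketch of the resulting debiased ES computation follows, assuming the cross-fitted quantile predictions $\hat{\gamma}_\ell(X_i)$ and the automatic adjustment values $\hat{\alpha}_\ell(X_i)$ have already been obtained. The division by $\tau$ reflects the normalization $\theta_0 = E[Y \mid Y < \gamma_0(X)]$ adopted above, and the function name is illustrative.

```python
# Minimal sketch of the debiased Expected Shortfall estimator.
import numpy as np

def debiased_es(y, gamma_hat, alpha_hat, tau):
    ind = (y < gamma_hat).astype(float)
    lam = ind - tau                              # lambda(W, gamma_hat(X)) = 1(Y < gamma) - tau
    corrected = y * ind + alpha_hat * lam        # orthogonalized signal
    theta_hat = corrected.mean() / tau           # normalization: theta0 = E[Y | Y < gamma0(X)]
    psi = (corrected - tau * theta_hat) / tau    # estimated influence-function values
    se = np.sqrt(np.mean(psi**2) / len(y))
    return theta_hat, se
```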

Recently, Barendse (2022) has proposed LR moments for parametric ES with covariates in a time series framework. Specifically, he proposes the parametric conditional ES model

$$E[Y \mid X,\, Y < s(X,\beta_0)] = s(X,\delta_0),$$

where $s(X,\beta_0)$ is the conditional $q$-quantile and $s(X,\delta_0)$ is the specified parametric ES model. Here, $s$ is known up to the finite-dimensional parameters $\beta_0$ and $\delta_0$. Barendse (2022) suggests using moments of the form

$$\phi(w,\beta,\delta) = \big(s(x,\delta) - s(x,\beta) - q^{-1}\,1(y<s(x,\beta))[y-s(x,\beta)]\big)k(x) \equiv m(w,\beta,\delta)k(x),$$

where $k(x)$ are weights. To see that this moment is LR, note that

$$\frac{d}{d\beta}E[\phi(W,\beta,\delta)] = E\left[\frac{d}{d\beta}E[m(W,\beta,\delta)\mid X]\,k(X)\right],$$

and $dE[m(W,\beta,\delta)\mid X]/d\beta$, evaluated at $\beta = \beta_0$, is given by

$$\begin{aligned}
-\frac{ds(X,\beta)}{d\beta} &- q^{-1}\frac{d}{d\beta}E\big[1(Y<s(X,\beta))[Y-s(X,\beta)]\,\big|\,X\big] \\
&= -\frac{ds(X,\beta)}{d\beta} - q^{-1}\frac{d}{d\beta}\int_{-\infty}^{s(X,\beta)}(y-s(X,\beta))f_{Y|X}(y)\,dy \\
&= -\frac{ds(X,\beta)}{d\beta} + \frac{ds(X,\beta)}{d\beta}\,q^{-1}\int_{-\infty}^{s(X,\beta_0)}f_{Y|X}(y)\,dy \\
&= 0,
\end{aligned}$$

provided $E[1(Y < s(X,\beta_0))\mid X] = q$, i.e. the parametric conditional quantile model is correctly specified. Having the same functional form for quantiles and ES is not necessary here, but the correct specification of the quantiles seems critical.
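To fix ideas, the following sketch works out a just-identified special case of this approach (it is not Barendse's actual estimator, which uses efficient weighting in a time-series setting): we assume linear specifications $s(x,\beta) = x'\beta$ and $s(x,\delta) = x'\delta$, take $k(x) = x$, and treat the quantile coefficients $\hat{\beta}$ (e.g. from a linear quantile regression at level $q$) as given. Setting the sample analogue of $E[m(W,\beta,\delta)k(X)]$ to zero then reduces to an OLS regression of a generated response on $X$.

```python
# Minimal sketch: second-step ES coefficients from the LR moment, linear case.
import numpy as np

def es_from_lr_moment(y, X, beta_hat, q):
    s_beta = X @ beta_hat                         # fitted conditional q-quantile
    hit = (y < s_beta).astype(float)
    # E_n[X (X'delta - s_beta - (1/q) 1(y < s_beta)(y - s_beta))] = 0
    # is the normal equation of an OLS fit of the generated response on X.
    generated = s_beta + hit * (y - s_beta) / q
    delta_hat, *_ = np.linalg.lstsq(X, generated, rcond=None)
    return delta_hat
```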

11.4 Functionals of Nonparametric Instrumental Variables


Consider the Nonparametric Instrumental Variable regression model

$$Y = \gamma_0(X) + \varepsilon, \qquad E[\varepsilon \mid Z] = 0,$$

where the observed data is $W = (Y,X,Z)$. As before, consider first the situation where the parameter of interest is $\theta_0 = E[m(W,\gamma_0)]$, where $m$ is linear in the sense that for all constants $c_1$ and $c_2$ and functions $\gamma_1(X)$ and $\gamma_2(X)$ (with finite variance),

$$E[m(W, c_1\gamma_1 + c_2\gamma_2)] = c_1E[m(W,\gamma_1)] + c_2E[m(W,\gamma_2)].$$

We illustrate with several examples.

Example 1 (BLA): The Best Linear Approximation (BLA) parameter is $\beta_0 = E[AX\gamma_0(X)]$, where $A = E[XX']^{-1}$, which solves

$$\beta_0 = \arg\min_\beta E[(\gamma_0(X) - \beta'X)^2].$$

Escanciano and Li (2021) have proposed a two-step semiparametric estimator for $\beta_0$. That estimator is not LR.

Example 2 (Weighted Average Derivative): Let $X = (X_1, D)$, with $X_1$ a continuous regressor (e.g. a price). The parameter of interest is

$$\theta_0 = E\left[w(X)\frac{\partial\gamma_0(X)}{\partial X_1}\right].$$

Example 3 (Average Policy Effect): The parameter of interest is $\theta_0 = E[\gamma_0(g(X)) - \gamma_0(X)]$, where $g$ is a known transformation of the covariates.

Under a continuity condition, we can write a linear functional of $\gamma_0$, $\theta_0 = E[m(W,\gamma_0)]$, as

$$\theta_0 = E[\alpha_0(Z)\gamma_0(X)] \quad \text{for all possible } \gamma_0,$$

and for some $\alpha_0(Z)$ with finite variance. The function $\alpha_0$ is called the Riesz representer of the functional. Severini and Tripathi (2012) have shown that existence of the Riesz representation is necessary for root-$n$ estimability of $\theta_0$. We illustrate the calculation of the Riesz representer in the first two examples.

Example 1: Suppose we can find $h(Z)$ such that

$$X = E[h(Z)\mid X].$$

Then, trivially, $\beta_0 = E[A\,E[h(Z)\mid X]\,\gamma_0(X)] = E[\alpha_0(Z)\gamma_0(X)]$, so the Riesz representer is

$$\alpha_0(Z) = Ah(Z).$$

Example 2: Let $f_0(X)$ denote the density of $X$, and note that, by integration by parts,

$$\theta_0 = E\left[w(X)\frac{\partial\gamma_0(X)}{\partial X_1}\right] = -E\left[\gamma_0(X)\,\frac{1}{f_0(X)}\frac{\partial\{w(X)f_0(X)\}}{\partial X_1}\right].$$

Assume we can find $\alpha_0(Z)$ such that

$$-\frac{1}{f_0(X)}\frac{\partial\{w(X)f_0(X)\}}{\partial X_1} = E[\alpha_0(Z)\mid X]. \qquad (37)$$

Then, trivially, $\theta_0 = E[\alpha_0(Z)\gamma_0(X)]$, so the Riesz representer is a solution to (37).


The Riesz representer in this class of problems has only an implicit representation, and it solves a complicated ill-posed problem. It is therefore useful to develop automatic estimators for it. CEINR shows that a LR moment for $\theta_0$ is

$$\psi(w,\gamma,\alpha,\theta) = m(w,\gamma) - \theta + \alpha(z)(y - \gamma(x)).$$

Double robustness follows as in the exogenous case. In particular, for all $\Delta$,

$$\frac{d}{dt}E[\psi(W,\gamma_0 + t\Delta,\alpha_0,\theta_0)] = E[m(W,\Delta)] - E[\alpha_0(Z)\Delta(X)] = 0,$$

which follows by the Riesz representation theorem.

The LR moment can also be used to estimate the Riesz representer, as suggested in CEINR. We can replace $\alpha_0$ by a sieve (i.e. a parametric approximation) and estimate the sieve parameters using these sample moments for a variety of choices of $\Delta$. We can also regularize to allow for a high-dimensional specification of $\alpha$. The sample moments in equation (32) depend only on observations not in $I_\ell$, so that the resulting $\hat{\alpha}_\ell$ will also, as required for the cross-fitting in debiased GMM.
In the context of the linear functionals of this section,

$$\hat{\phi}(\Delta,\alpha) = \frac{1}{n-n_\ell}\sum_{i\notin I_\ell}\{m(W_i,\Delta) - \alpha(Z_i)\Delta(X_i)\}.$$

This is a sample moment corresponding to the moment condition $E[m(W,\Delta) - \alpha_0(Z)\Delta(X)] = 0$. If $\alpha(Z)$ is replaced by a linear combination $\rho'b(z)$ of a dictionary $b(z) = (b_1(z),\ldots,b_p(z))'$ of functions, and $\Delta(X)$ is chosen to be an element of another dictionary $d(x) = (d_1(x),\ldots,d_q(x))'$, then the sample moment function is

$$\hat{\phi}(d_j,\rho'b) = \frac{1}{n-n_\ell}\sum_{i\notin I_\ell}\big[m(W_i,d_j) - \rho'b(Z_i)d_j(X_i)\big].$$

The collection of the $q$ sample moment conditions $\hat{\phi}(d_j,\rho'b) = 0$ $(j=1,\ldots,q)$, stacked into the $q \times 1$ vector $\hat{\phi}(\rho) \equiv \hat{\phi}(d,\rho'b)$, can be used to obtain the Lasso least squares estimator $\hat{\alpha}_\ell(z) = \hat{\rho}'b(z)$, with

$$\hat{\rho} = \arg\min_\rho\Big\{\hat{\phi}(\rho)'\hat{\phi}(\rho)/2 + r\sum_{j=1}^p|\rho_j|\Big\}.$$

Debiased estimators are given by

$$\hat{\theta} = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\big\{m(W_i,\hat{\gamma}_\ell) + \hat{\alpha}_\ell(Z_i)[Y_i - \hat{\gamma}_\ell(X_i)]\big\},$$

with an asymptotic variance that is consistently estimated by

$$\hat{V} = \frac{1}{n}\sum_{\ell=1}^L\sum_{i\in I_\ell}\hat{\psi}_{i\ell}\hat{\psi}_{i\ell}', \qquad
\hat{\psi}_{i\ell} = m(W_i,\hat{\gamma}_\ell) + \hat{\alpha}_\ell(Z_i)(Y_i - \hat{\gamma}_\ell(X_i)) - \hat{\theta}.$$

In the exogenous case, $b = d$. The asymptotic properties of $\hat{\theta}$ are investigated in CEINR. There are several proposals for the estimators $\hat{\gamma}_\ell$, including estimators based on machine learning methods; see the references cited in Bakhitov (2022), who investigates the asymptotic properties of $\hat{\theta}$ both theoretically and through simulations.
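The following sketch illustrates the two pieces for NPIV functionals: an $\ell_1$-penalized minimum distance estimator of the Riesz representer coefficients, and the debiased estimator of $\theta_0$ with its standard error. All inputs (the dictionaries $d$ and $b$ evaluated at the data, the functional values $m(W_i,d_j)$ and $m(W_i,\hat{\gamma}_\ell)$, and the cross-fitted $\hat{\gamma}_\ell$) are assumed to be precomputed; the function names and the proximal-gradient solver are illustrative choices, not part of CEINR.

```python
# Minimal sketch: automatic Riesz representer and debiased estimator for NPIV functionals.
import numpy as np

def riesz_rho(m_of_d, dX, bZ, r, n_iter=5000):
    """Minimize 0.5*||phi(rho)||^2 + r*||rho||_1, with phi_j(rho) = M_j - rho'G_j."""
    M = m_of_d.mean(axis=0)                  # q-vector of E_n[m(W, d_j)]
    G = dX.T @ bZ / len(dX)                  # q x p matrix with rows E_n[d_j(X) b(Z)']
    rho = np.zeros(G.shape[1])
    step = 1.0 / max(np.linalg.eigvalsh(G.T @ G).max(), 1e-12)
    for _ in range(n_iter):
        grad = -G.T @ (M - G @ rho)          # gradient of the quadratic part
        z = rho - step * grad
        rho = np.sign(z) * np.maximum(np.abs(z) - step * r, 0.0)
    return rho                               # alpha_hat(z) = b(z)'rho

def debiased_theta(m_of_gamma, y, gamma_hat_x, alpha_hat_z):
    """theta_hat = E_n[m(W, gamma_hat) + alpha_hat(Z)(Y - gamma_hat(X))] and its s.e."""
    psi = m_of_gamma + alpha_hat_z * (y - gamma_hat_x)
    theta = psi.mean()
    se = np.sqrt(np.mean((psi - theta) ** 2) / len(y))
    return theta, se
```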

12 Heterogeneous Treatment Effects

In recent years there has been substantial interest in estimating heterogeneous treatment effects, such as the CATEs

$$\tau(x) := E[Y_i(1) - Y_i(0)\mid X_i = x]$$

and the GATEs for specific groups

$$\tau(g) := E[Y_i(1) - Y_i(0)\mid G_i = g].$$

For a review of the literature and Monte Carlo evidence on some of the existing methods, see Knaus, Lechner and Strittmatter (2021), with a focus on methods that work under selection-on-observables assumptions. Here we review some of the leading methods in this setting. From our previous arguments, it follows that $\tau(x)$ is identified as

$$\tau(x) = E[Y_{i,IPW}\mid X_i = x] = E[Y_{i,DR}\mid X_i = x],$$

where

$$Y_{i,IPW} = \frac{Y_iD_i}{p(X_i)} - \frac{Y_i(1-D_i)}{1-p(X_i)} \quad\text{and}$$
$$Y_{i,DR} = \frac{Y_iD_i}{p(X_i)} - \frac{Y_i(1-D_i)}{1-p(X_i)} - \mu_1(X_i)\left(\frac{D_i}{p(X_i)} - 1\right) + \mu_0(X_i)\left(\frac{1-D_i}{1-p(X_i)} - 1\right).$$

To reduce the curse of dimensionality, some authors have considered estimands such as

$$E[Y_{i,DR}\mid Z_i = z]$$

for a low-dimensional subset of covariates $Z_i \subseteq X_i$; see the references in Knaus, Lechner and Strittmatter (2021). Here we discuss in some detail the Causal Forest, as it is quite popular in economic applications. The idea is to apply GRF to the score

$$\psi(W_i,\theta(x),v(x)) = (Y_i - \theta(x)D_i - v(x))(1, D_i)',$$

so that, from the moment restriction

$$E[\psi(W_i,\theta(x),v(x))\mid X_i = x] = 0$$

and the standard strong ignorability conditions, it follows that $\theta(x) = \tau(x)$ and $v(x) = E[Y_i - \tau(x)D_i\mid X_i = x]$. In fact, substituting $v(x)$ in $\psi$, it follows that

$$E[(\tilde{Y}_i(x) - \tau(x)\tilde{D}_i(x))D_i\mid X_i = x] = 0,$$

where $\tilde{Y}_i(x) = Y_i - E[Y_i\mid X_i = x]$ and $\tilde{D}_i(x) = D_i - E[D_i\mid X_i = x]$. Following the literature on LR, Athey et al. (2019) further suggested using the conditional LR moment

$$E[(\tilde{Y}_i(x) - \tau(x)\tilde{D}_i(x))\tilde{D}_i(x)\mid X_i = x] = 0,$$

which is used in the GRF algorithm. This method is called the Causal Forest.
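To illustrate this orthogonalized local moment (though not the tree-splitting and forest weighting of the GRF algorithm itself, for which the authors' grf R package should be used), the following sketch residualizes $Y$ and $D$ with cross-fitted random forests and then solves the local moment using $k$-nearest-neighbour weights as a crude, illustrative stand-in for forest weights.

```python
# Minimal sketch of the orthogonalized CATE moment behind Causal Forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestNeighbors

def cate_local_moment(y, d, X, x_eval, k=100):
    """tau(x) solving E[(Y_res - tau(x) D_res) D_res | X = x] = 0 with kNN weights."""
    y_res = y - cross_val_predict(RandomForestRegressor(), X, y, cv=5)  # Y - E[Y|X]
    d_res = d - cross_val_predict(RandomForestRegressor(), X, d, cv=5)  # D - E[D|X]
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(x_eval)               # x_eval must be a 2D array of points
    tau = np.empty(len(x_eval))
    for j, neigh in enumerate(idx):
        tau[j] = np.sum(y_res[neigh] * d_res[neigh]) / np.sum(d_res[neigh] ** 2)
    return tau
```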
Athey et al. (2019) also give an example with instrumental variables, generalizing the previous example to an IV setting. The moment is similar to the previous one, but with an instrument playing the role of the exogenous variable, i.e.

$$E[(\tilde{Y}_i(x) - \tau(x)\tilde{D}_i(x))\tilde{Z}_i(x)\mid X_i = x] = 0,$$

where $\tilde{Z}_i(x) = Z_i - E[Z_i\mid X_i = x]$.

Estimation of $\tau(x)$, or more generally of a conditional parameter $\theta(x)$ defined by such moments, is a hard statistical problem, and we should not expect precise inference in high or even moderate dimensions. For this reason, the literature has focused on low-dimensional aspects of $\tau(x)$ and $\theta(x)$. The previous theory can be readily applied to construct LR moments for many functionals of $\tau(x)$ and $\theta(x)$.

For example, under selection on observables,

$$\tau(g) = E[Y_i(1) - Y_i(0)\mid G_i = g] = E[Y_{i,DR}\mid G_i = g] = \frac{E[Y_{i,DR}\,1(G_i = g)]}{P[G_i = g]},$$

provided $P[G_i = g] > 0$. If $G_i$ depends only on $X_i$, the moment function $(\tau(g) - Y_{i,DR})\,1(G_i = g)$ identifies the parameter $\tau(g)$ and is LR.
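A minimal sketch of GATE estimation based on the doubly robust pseudo-outcome $Y_{i,DR}$ is given next, assuming cross-fitted estimates of the propensity score and of the conditional means are available; heuristically, the LR property is what justifies treating the pseudo-outcomes as if the nuisances were known when computing the group means and their standard errors.

```python
# Minimal sketch of GATE estimation from the doubly robust pseudo-outcome Y_DR.
import numpy as np

def y_dr(y, d, p_hat, mu1_hat, mu0_hat):
    """AIPW pseudo-outcome; algebraically identical to the Y_DR formula in the text."""
    return (mu1_hat - mu0_hat
            + d * (y - mu1_hat) / p_hat
            - (1 - d) * (y - mu0_hat) / (1 - p_hat))

def gate(y, d, g, p_hat, mu1_hat, mu0_hat):
    """Group average treatment effects E[Y(1)-Y(0)|G=g] with standard errors."""
    ydr = y_dr(y, d, p_hat, mu1_hat, mu0_hat)
    out = {}
    for level in np.unique(g):
        sel = (g == level)
        out[level] = (ydr[sel].mean(), ydr[sel].std(ddof=1) / np.sqrt(sel.sum()))
    return out
```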
Another way to understand heterogeneity is to look at features, i.e. functionals, of $\tau(x)$ that summarize how it changes with $x$, for example an average derivative functional

$$\theta_0 = E\left[\frac{\partial\tau(X_i)}{\partial x_j}\right]$$

for the $j$-th covariate (assuming it is continuously distributed). For functionals like $\theta_0$ we can use the theory of LR moments.
As an illustration, we consider an example from Argañaraz and Escanciano (2023). This is a model with endogeneity and heterogeneous parameters (i.e. interactions),

$$E[Y_1 - \beta_0 Y_2 - \gamma_0(Y_2,X)\mid W] = 0 \quad\text{a.s.}, \qquad (38)$$

where $W = (X, Z_2)$, with $X$ possibly high dimensional, i.e. the dimension $d_X$ of $X$ can be large, much larger than the sample size, and the function $\gamma_0(\cdot)$ has the representation

$$\gamma_0(Y_2,X) = \delta_{01} + \delta_{02}'(X - \delta_{03}) + \sum_{l=1}^{d_X}\delta_{04,l}\,Y_2(X_l - \delta_{03,l}).$$

The parameters $\delta_{03}$ are the means of $X$, i.e.,

$$E[X - \delta_{03}] = 0. \qquad (39)$$

Models with interactions are commonly used in applied work, particularly when the interest is in understanding heterogeneous (in observables) treatment effects. Argañaraz and Escanciano (2023) have shown the following result for inference on the interaction coefficient $\delta_{04,l}$, for some $l$. For instance, we may be interested in testing whether the causal response function depends on the covariate $X_l$, i.e. testing $H_0: \delta_{04,l} = 0$ vs $H_1: \delta_{04,l} \neq 0$. Our next result characterizes orthogonal moments for $\delta_{04,l}$ in this example.

Define the random vector

$$Q_l = \big(Y_2,\; 1,\; (X - \delta_{03})',\; Y_2(X_{-l} - \delta_{03,-l})'\big)',$$

and its projection onto the exogenous variables

$$\Pi_l \equiv \Pi_l(W) = E[Q_l\mid W] = \big(p_0(W),\; 1,\; (X - \delta_{03})',\; p_0(W)(X_{-l} - \delta_{03,-l})'\big)',$$

where $p_0(W) = E[Y_2\mid W]$, and $X_{-l}$ and $\delta_{03,-l}$ denote all the coordinates of $X$ and $\delta_{03}$ but the $l$-th.

Theorem (Argañaraz and Escanciano (2023)). For the model (38) and (39), orthogonal moments for $\delta_{04,l}$ are given by $(Y_1 - \beta_0Y_2 - \gamma_0(Y_2,X))\varphi(W)$, where the orthogonal instruments are given by

$$\varphi(W) = \zeta(W) - \mathbb{P}_{\Pi_l}[\zeta](W), \qquad (40)$$

for some $\zeta \in L_2(W)$, with $\mathbb{P}_{\Pi_l}[\zeta]$ denoting the linear projection of $\zeta(W)$ onto $\Pi_l(W)$. The orthogonal moment will be informative about $\delta_{04,l}$ if

$$E\big[Y_2(X_l - \delta_{03,l})\varphi(W)\big] \neq 0. \qquad (41)$$

This result implies that, for the heterogeneous parameter $\delta_{04,l}$, relevance can be achieved with just one instrument, provided (41) holds. This relevance condition means that the partial correlation between $Y_2(X_l - \delta_{03,l})$ and $\zeta$, after removing the effect of $\Pi_l$, must be non-zero. When this condition holds, we can identify $\delta_{04,l}$ from the orthogonal moment as

$$0 = E\big[(Y_1 - \beta_0Y_2 - \gamma_0(Y_2,X))\varphi(W)\big]
= E\big[(Y_1 - \eta_l'Q_l - \delta_{04,l}\,Y_2(X_l - \delta_{03,l}))\varphi(W)\big],$$

where $\eta_l = (\beta_0, \delta_{01}, \delta_{02}', \delta_{04,-l}')'$. To implement a LR moment satisfying the minimal relevance condition under (41), we recommend $\varphi(W) = \varphi^*(W)$, where $\zeta_l(W) = p_0(W)(X_l - \delta_{03,l})$ and

$$\varphi^*(W) = \zeta_l(W) - \mathbb{P}_{\Pi_l}[\zeta_l](W).$$

For this choice, under the minimal relevance condition,

$$\delta_{04,l} = \frac{E[(Y_1 - \eta_l'Q_l)\varphi^*(W)]}{E[Y_2(X_l - \delta_{03,l})\varphi^*(W)]}.$$

To implement inference in this example (testing $H_0: \delta_{04,l} = \delta_{4,l}$ vs $H_1: \delta_{04,l} \neq \delta_{4,l}$) based on orthogonal moments, we can use the following algorithm (we assume that, prior to applying the algorithm, we have centered $X$):

Step 1: Run Lasso, Logit Lasso, Random Forest, or any other machine learning method for prediction of $p_0(W)$, denoted $\hat{p}$. Compute $\hat{\Pi}_l(W) = (1, X', \hat{p}, \hat{p}X_{-l}')'$ and $\hat{\zeta}_l(W) = \hat{p}X_l$. In the exogenous case, where $Y_2 = Z_2$, this step is not needed, and $\hat{\Pi}_l(W) = Q_l$ and $\hat{\zeta}_l(W) = Y_2X_l$.

Step 2: Run, e.g., Lasso for estimating $\eta_l$ as the coefficient of $\hat{\Pi}_l$, say $\hat{\eta}_l$, in the projection of $Y_1 - \delta_{4,l}Y_2X_l$ on $\hat{\Pi}_l$. Compute $\hat{Y}_1 = Y_1 - \hat{\eta}_l'Q_l$.

Step 3: Run, e.g., Lasso for estimating $\varphi^*(W)$ in (40) based on $\hat{\Pi}_l(W)$ and $\hat{\zeta}_l(W)$, to obtain $\hat{\varphi} \equiv \hat{\varphi}^*(W)$.

Step 4: Base inference on the sample analog of the orthogonal moment

$$E\big[(\tilde{Y}_1 - \delta_{4,l}\,Y_2(X_l - \delta_{03,l}))\varphi^*(W)\big] = 0,$$

where $\tilde{Y}_1 = Y_1 - \eta_l'Q_l$ is estimated by $\hat{Y}_1$ and $\varphi^*$ by $\hat{\varphi}$.

If the goal is estimation of $\delta_{04,l}$, we can use the following variation of the previous algorithm, where Steps 1 and 3 remain the same, but Steps 2 and 4 change to:

Step 2: Run, e.g., Lasso for estimating $\eta_l$ as the coefficient of $\hat{\Pi}_l$, say $\hat{\eta}_l$, in the projection of $Y_1$ on $\hat{\Pi}_l$ and $\hat{\zeta}_l$. Compute $\hat{Y}_1 = Y_1 - \hat{\eta}_l'Q_l$.

Step 4: Run an IV regression of $\hat{Y}_1$ on $Y_2X_l$ with instrument $\hat{\varphi}$ to estimate $\delta_{04,l}$. A code sketch of this estimation version is given below.
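The sketch below implements the estimation version of the algorithm with Lasso used in every prediction step; the function and variable names are ours, Logit Lasso or random forests could replace LassoCV in Step 1, and $X$ is assumed to have been centered beforehand.

```python
# Minimal sketch of the estimation algorithm for the interaction coefficient delta_{04,l}.
import numpy as np
from sklearn.linear_model import LassoCV

def estimate_interaction(y1, y2, X, W, l):
    """Estimate the coefficient on Y2*X[:, l]; X is assumed centered."""
    X_ml = np.delete(X, l, axis=1)                              # X without the l-th column

    # Step 1: learn p0(W) = E[Y2 | W] and form Pi_l, zeta_l and Q_l.
    p_hat = LassoCV(cv=5).fit(W, y2).predict(W)
    Pi_l = np.column_stack([X, p_hat, p_hat[:, None] * X_ml])   # projected regressors
    zeta_l = p_hat * X[:, l]
    Q_l = np.column_stack([X, y2, y2[:, None] * X_ml])          # endogenous counterparts

    # Step 2: project Y1 on (Pi_l, zeta_l); partial out eta'Q_l (intercept plays the role
    # of the constant in Q_l).
    reg = LassoCV(cv=5).fit(np.column_stack([Pi_l, zeta_l]), y1)
    eta_hat = reg.coef_[:Pi_l.shape[1]]
    y1_tilde = y1 - reg.intercept_ - Q_l @ eta_hat

    # Step 3: orthogonal instrument = Lasso residual of zeta_l on Pi_l.
    phi_hat = zeta_l - LassoCV(cv=5).fit(Pi_l, zeta_l).predict(Pi_l)

    # Step 4: just-identified IV estimate (ratio of sample moments).
    endog = y2 * X[:, l]
    return np.sum(phi_hat * y1_tilde) / np.sum(phi_hat * endog)
```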

REFERENCES

Abadie, A. (2003): "Semiparametric instrumental variable estimation of treatment re-


sponse models," Journal of Econometrics, 113, 231-263.
Ackerberg, Daniel, Xiaohong Chen, and Jinyong Hahn (2012): "A Practical
Asymptotic Variance Estimator for Two-step Semiparametric Estimators," Review of Eco-
nomics and Statistics 94: 481–498.
Ai, Chunrong and Xiaohong Chen (2003): "Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions," Econometrica 71, 1795-1843.
Ai, Chunrong and Xiaohong Chen (2007): "Estimation of Possibly Misspecified Semiparametric Conditional Moment Restriction Models with Different Conditioning Variables," Journal of Econometrics 141, 5-43.
Angrist, Joshua D. and Alan B. Krueger (1995): "Split-Sample Instrumental
Variables Estimates of the Return to Schooling," Journal of Business and Economic Statistics
13, 225-235.
Argañaraz, F. and J.C. Escanciano (2023): "On the Existence and Information of
Orthogonal Moments for Inference," working paper.
Athey, S., Imbens, G., (2019): "Machine learning methods economists should know
about," Annu. Rev. Econ., 11, 685-725.
Athey, Susan, Guido W. Imbens, and Stefan Wager (2018): "Approximate resid-
ual balancing: debiased inference of average treatment e¤ects in high dimensions," Journal
of the Royal Statistical Society, Series B, 80, 597-623.
Athey, Susan, J. Tibshirani, and Stefan Wager (2019): "Generalized Random
Forest," The Annals of Statistics, 47, 1148-1178.
Avagyan, Vahe and Stijn Vansteelandt (2017): "High-dimensional Inference for the Average Treatment Effect Under Model Misspecification Using Penalized Bias-Reduced Doubly-Robust Estimation," Biostatistics and Epidemiology, DOI: 10.1080/24709360.2021.1898730.
Bajari, Patrick, Han Hong, John Krainer, and Denis Nekipelov (2010): "Es-
timating Static Models of Strategic Interactions," Journal of Business and Economic Statis-
tics 28, 469-482.
Bakhitov, E. (2022): "Automatic Debiased Machine Learning in Presence of Endo-
geneity", working paper.
Barendse, S. (2022): "Efficiently Weighted Estimation of Tail and Interquantile Expectations," working paper.
Belloni, Alexandre and Victor Chernozhukov (2011): "`1-Penalized Regres-
sion in High-Dimensional Sparse Models," Annals of Statistics 9, 82-130.

Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian B.
Hansen (2012): “Sparse Models and Methods for Optimal Instruments with an Application
to Eminent Domain,”Econometrica 80, 2369–2429.
Belloni, Alexandre, Victor Chernozhukov, and Christian B. Hansen (2014): "Inference on Treatment Effects after Selection among High-Dimensional Controls," Review of Economic Studies 81, 608-650.
Belloni, Alexandre, Victor Chernozhukov, Ivan Fernandez-Val, and Chris-
tian B. Hansen (2017): "Program Evaluation and Causal Inference with High-Dimensional
Data," Econometrica 85, 233-298.
Bickel, Peter J. (1982): “On Adaptive Estimation,”Annals of Statistics 10, 647–671.
Blomquist, Soren and Matz Dahlberg (1999): "Small Sample Properties of LIML
and Jackknife IV Estimators: Experiments with Weak Instruments," Journal of Applied
Econometrics 14, 69-88.
Bravo, Francesco, Juan Carlos Escanciano, and Ingrid van Keilegom (2020):
"Two-step Semiparametric Likelihood Inference," Annals of Statistics 48, 1-26.
Bühlmann, P. and T. Hothorn (2007): "Boosting Algorithms: Regularization, Prediction and Model Fitting," Statistical Science 22, 477-505.
Breiman, L. (1996): "Bagging predictors," Mach. Learn. 24 123-140.
Breiman, L. (2001): "Random forests," Mach. Learn. 45 5–32.
Carone, Marco, Alexander R. Luedtke, and Mark J. van der Laan (2019): "Toward Computerized Efficient Estimation in Infinite Dimensional Models," Journal of the American Statistical Association 114, 1174-1190.
Chamberlain, G. (1987): "Asymptotic Efficiency in Estimation With Conditional Moment Restrictions," Journal of Econometrics 34, 305-334.
Chen, Xiaohong and Zhipeng Liao (2015): “Sieve Semiparametric GMM Under
Weak Dependence," Journal of Econometrics 189, 163-186.
Chernozhukov, Victor, Juan Carlos Escanciano, Hidehiko Ichimura, Whit-
ney K. Newey (2016): "Locally Robust Semiparametric Estimation," arxiv.1608.00033v1.
Chernozhukov, Victor, Juan Carlos Escanciano, Hidehiko Ichimura, Whit-
ney K. Newey (2022): "Locally Robust Semiparametric Estimation," Econometrica, 90,
1501-1535.
Chernozhukov, Victor, Christian B. Hansen, and Martin Spindler (2015):
"Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach,"
Annual Review of Economics 7: 649–688.
Chernozhukov, V., D. Chetverikov, Mert Demirer, Esther Duflo, Christian B. Hansen, Whitney K. Newey, and James M. Robins (2018): "Debiased/Double Machine Learning for Treatment and Structural Parameters," Econometrics Journal 21, C1-C68.
Chernozhukov, Victor, Whitney K. Newey, and James M. Robins (2018):
"Double/De-Biased Machine Learning Using Regularized Riesz Representers," arxiv.1802.08667v1.
Chernozhukov, V., W.K. Newey, and R. Singh (2018): "Learning L2-Continuous
Regression Functionals via Regularized Riesz Representers," arxiv.1809.05224v1.
Escanciano, Juan Carlos and Olmo, J. (2010): "Backtesting parametric value-at-
risk with estimation risk," Journal of Business & Economic Statistics, 28, 36-51.
Escanciano, J.C. and T. Perez-Izquierdo (2023): "Automatic Locally Robust
Estimation with Generated Regressors", arXiv:2301.10643.
Escanciano, J.C. and J.R. Terschuur (2022): "Debiased Semiparametric U-Statistics:
Machine Learning Inference on Inequality of Opportunity", arXiv:2206.05235.
Farrell, Max (2015): "Robust Inference on Average Treatment Effects with Possibly More Covariates than Observations," Journal of Econometrics 189, 1-23.
Fisher, R. A. (1925): "Statistical methods for research workers," Genesis Publishing
Pvt Ltd.
Foster, Dylan F. and Vasilis Syrgkanis (2019): "Orthogonal Statistical Learn-
ing," https://arxiv.org/pdf/1901.09036.pdf.
Greenshtein, E. and Ritov, Y. (2004): "Persistence in high-dimensional linear pre-
dictor selection and the virtue of overparametrization," Bernoulli 10, 971-988.
Györfi, L. , M. Kohler, A. Krzyzak, H. Walk, (2002): "A Distribution-Free
Theory of Nonparametric Regression," Springer Series in Statistics, Springer, Berlin.
Hahn, Jinyong (1998): "On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects," Econometrica 66, 315-331.
Hahn, Jinyong and Geert Ridder (2013): "Asymptotic Variance of Semiparametric Estimators With Generated Regressors," Econometrica 81, 315-340.
Hahn, Jinyong and Geert Ridder (2019): "Three-stage Semi-Parametric Inference: Control Variables and Differentiability," Journal of Econometrics 211, 262-293.
Hampel, Frank R. (1974): "The Influence Curve and Its Role in Robust Estimation," Journal of the American Statistical Association 69, 383-393.
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder (2003): "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score," Econometrica 71, 1161-1189.
Hoerl, Arthur E., and Robert W. Kennard (1970): "Ridge regression: Biased
estimation for nonorthogonal problems." Technometrics 12, 55-67.
Hotz, V. Joseph and Robert A. Miller (1993): "Conditional Choice Probabilities and the Estimation of Dynamic Models," Review of Economic Studies 60, 497-529.
Hsu, D., S.M. Kakade and T. Zhang (2014): "Random Design Analysis of Ridge Regression," working paper.
Huber, Peter (1981): Robust Statistics, New York: Wiley.
Imbens, G., and J. Angrist (1994): "Identification and Estimation of Local Average Treatment Effects," Econometrica, Vol. 61, No. 2, 467-476.
Ichimura, Hidehiko and Whitney K. Newey (2022): "The Influence Function of Semiparametric Estimators," Quantitative Economics 13, 29-61.
Klaassen, Chris A. J. (1987): "Consistent Estimation of the Influence Function of Locally Asymptotically Linear Estimators," Annals of Statistics 15, 1548-1562.
Knaus, M. C., Lechner, M., and Strittmatter, A. (2021): "Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence," The Econometrics Journal 24(1), 134-161.
Knight, K. and Fu, W. (2000): "Asymptotics for Lasso-Type Estimators," The Annals of Statistics 28, 1356-1378.
Künzel, S., Sekhon, J., Bickel, P., and Yu, B. (2017): "Meta-learners for Estimating Heterogeneous Treatment Effects Using Machine Learning," arXiv:1706.03461.
Lin, W. (2013): "Agnostic Notes on Regression Adjustments to Experimental Data:
Reexamining Freedman’s Critique." The Annals of Applied Statistics 7, 295-318.
Leeb, Hannes and Benedikt M. Potscher (2005): "Model Selection and Inference:
Facts and Fiction," Econometric Theory 21, 21-59.
Newey, Whitney K. (1994a): "The Asymptotic Variance of Semiparametric Estima-
tors," Econometrica 62, 1349-1382.
Newey, Whitney K. (1994b): ”Kernel Estimation of Partial Means and a General
Variance Estimator,”Econometric Theory 10, 233-253.
Newey, Whitney K., Fushing Hsieh, and James M. Robins (1998): "Undersmoothing and Bias Corrected Functional Estimation," MIT Dept. of Economics working paper 72, 947-962, https://economics.mit.edu/files/11219.
Newey, Whitney K., Fushing Hsieh, and James M. Robins (2004): “Twicing
Kernels and a Small Bias Property of Semiparametric Estimators,” Econometrica 72, 947-
962.
Newey, Whitney K. and James M. Robins (2017): "Cross Fitting and Fast Re-
mainder Rates for Semiparametric Estimation," CEMMAP Working paper WP41/17.
Neyman, Jerzy (1923): “Sur les applications de la theorie des probabilites aux ex-
periences agricoles: Essai des principes,” Master’s Thesis. Excerpts reprinted in English,
Statistical Science, Vol. 5, pp. 463–472. (D. M. Dabrowska, and T. P. Speed, Translators.)

Neyman, Jerzy (1959): “Optimal Asymptotic Tests of Composite Statistical Hypothe-
ses,” Probability and Statistics, the Harald Cramer Volume, ed., U. Grenander, New York,
Wiley.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao (1994): "Estimation of Regression Coefficients When Some Regressors Are Not Always Observed," Journal of the American Statistical Association 89, 846-866.
Robins, James M. and Andrea Rotnitzky (2001): Comment on “Semiparametric
Inference: Question and an Answer,” by P.A. Bickel and J. Kwon, Statistica Sinica 11,
863-960.
Robinson, Peter M. (1988): "‘Root-N-consistent Semiparametric Regression," Econo-
metrica 56, 931-954.
Rosenbaum, P. R., and Rubin, D. B. (1983): "Assessing sensitivity to an unobserved
binary covariate in an observational study with binary outcome," Journal of the Royal Sta-
tistical Society: Series B (Methodological), 45(2), 212-218.
Rubin, D. (1974): "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," J. Educ. Psychol. 66(5), 688-701.
Schick, Anton (1986): "On Asymptotically Efficient Estimation in Semiparametric Models," Annals of Statistics 14, 1139-1151.
Semenova, Vira (2018): "Debiased Machine Learning of Set-Identified Linear Models," https://arxiv.org/pdf/1712.10024.pdf.
Singh, Rahul and Liyang Sun (2019): "De-biased Machine Learning in Instrumental Variable Models for Treatment Effects," https://arxiv.org/pdf/1909.05244.pdf.
Tan, Z. (2006): "Regression and weighting methods for causal inference using instru-
mental variables," Journal of the American Statistical Association, 101(476), 1607-1618.
Tan, Z. (2020): "Model-Assisted Inference for Treatment Effects Using Regularized Calibrated Estimation with High-Dimensional Data," Annals of Statistics 48, 811-837.
Tchetgen Tchetgen, Eric J. (2009): "A Commentary on G. Molenberghs' Review of Missing Data Methods," Drug Information Journal 43, 433-435.
Terschuur, J.R. (2022): "Debiased Machine Learning Inequality of Opportunity in
Europe", work in progress.
van der Laan, Mark J. (2014), “Targeted Estimation of Nuisance Parameters to
Obtain Valid Statistical Inference,”International Journal of Biostatistics 10, 29–57.
van der Laan, Mark J. and Sherri Rose (2011): Targeted Learning: Causal In-
ference for Observational and Experimental Data, Springer Science & Business Media.
van der Laan, Mark J. and Daniel Rubin (2006): "Targeted Maximum Likelihood
Learning," The International Journal of Biostatistics 2.

Varian, H. R. (2014): "Big data: New tricks for econometrics," Journal of Economic
Perspectives, 28(2), 3-28.
Vermeulen, Karel and Stijn Vansteelandt (2015): "Bias-Reduced Doubly Ro-
bust Estimation," Journal of the American Statistical Association 110, 1024-1036.
Von Mises, Richard (1947): "On the Asymptotic Distribution of Differentiable Statistical Functions," Annals of Mathematical Statistics 18, 309-34.
Wahba, G. (1990): Spline Models for Observational Data. CBMS-NSF Regional Con-
ference Series in Applied Mathematics, Vol. 59, Philadelphia: Society of Industrial and
Applied Mathematics.
Wager, S., Du, W., Taylor, J., and R.J. Tibshirani (2019): "High-dimensional Regression Adjustments in Randomized Experiments," Proc Natl Acad Sci 113, 12673-12678.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
West, K.D., (1996): "Asymptotic inference about predictive ability," Econometrica 64,
1067–1084.
Zhang, M., Tsiatis, A.A., and Davidian, M. (2008): "Improving Efficiency of Inferences in Random Clinical Trials Using Auxiliary Covariates," Biometrics 64, 707-715.

