
Journal of Economic Literature 2019, 57(4), 835–903

https://doi.org/10.1257/jel.20181361

The Identification Zoo: Meanings of Identification in Econometrics†

Arthur Lewbel*

Over two dozen different terms for identification appear in the econometrics literature, including set identification, causal identification, local identification, generic identification, weak identification, identification at infinity, and many more. This survey: (i) gives a new framework unifying existing definitions of point identification; (ii) summarizes and compares the zooful of different terms associated with identification that appear in the literature; and (iii) discusses concepts closely related to identification, such as normalizations and the differences in identification between structural models and causal, reduced form models. (JEL C01, C20, C50)

1. Introduction

Econometric identification really means just one thing: model parameters or features being uniquely determined from the observable population that generates the data.¹ Yet well over two dozen different terms for identification now appear in the econometrics literature. The goal of this survey is to summarize (identify) and categorize this zooful of different terms associated with identification. This includes providing a new, more general definition of identification that unifies and encompasses previously existing definitions.

This survey then discusses the differences between identification in traditional structural models versus the so-called reduced form (or causal inference, or treatment effects, or program evaluation) literature. Other topics include set versus point identification, limited forms of identification such as local and generic identification, and identification concepts that relate to statistical inference, such as weak identification, irregular identification, and identification at infinity. Concepts that are closely related to identification, including normalizations, coherence, and completeness are also discussed.

The mathematics in this survey is kept relatively simple, with a little more formality provided in the appendix. Each section can be read largely independently of the others, with only a handful of concepts carried over from one section of the zoo to the next.

* Boston College. I would like to thank Steven Durlauf, Jim Heckman, Judea Pearl, Krishna Pendakur, Frederic Vermeulen, Daniel Ben-Moshe, Xun Tang, Juan-Carlos Escanciano, Jeremy Fox, Eric Renault, Yingying Dong, Laurens Cherchye, Matthew Gentzkow, Fabio Schiantarelli, Andrew Pua, Ping Yu, and five anonymous referees for many helpful suggestions. All errors are my own.
† Go to https://doi.org/10.1257/jel.20181361 to visit the article page and view author disclosure statement(s).
¹ The first two sections of this survey use identification in the traditional sense of what would now be more precisely called "point identification." See section 3 for details.


The many terms for identification that appear in the econometrics literature include (in alphabetical order): Bayesian identification, causal identification, essential identification, eventual identification, exact identification, first order identification, frequentist identification, generic identification, global identification, identification arrangement, identification at infinity, identification by construction, identification of bounds, ill-posed identification, irregular identification, local identification, nearly weak identification, nonparametric identification, non-robust identification, nonstandard weak identification, overidentification, parametric identification, partial identification, point identification, sampling identification, semiparametric identification, semi-strong identification, set identification, strong identification, structural identification, thin set identification, underidentification, and weak identification. This survey gives the meaning of each and shows how they relate to each other.

Let θ denote an unknown parameter or a set of unknown parameters (vectors and/or functions) that we would like to learn about, and ideally, estimate. Examples of what θ could include are objects like regressor coefficients, or average treatment effects, or error distributions. Identification deals with characterizing what could potentially or conceivably be learned about parameters θ from observable data. Roughly, identification asks, if we knew the population that data are drawn from, would θ be known? And if not, what could be learned about θ?

The study of identification logically precedes estimation, inference, and testing. For θ to be identified, alternative values of θ must imply different distributions of the observable data (see, e.g., Matzkin 2012). This implies that if θ is not identified, then we cannot hope to find a consistent estimator for θ. More generally, identification failures complicate statistical analyses of models, so recognizing lack of identification, and searching for restrictions that suffice to attain identification, are fundamentally important problems in econometric modeling.

The next section, section 2, begins by providing some historical background. The basic notion of identification (uniquely recovering model parameters from the observable population) is now known as "point identification." Section 3 summarizes the basic idea of point identification. A few somewhat different characterizations of point identification appear in the literature, varying in what is assumed to be observable and in the nature of the parameters to be identified. In section 3 (and in an appendix), this survey proposes a new definition of point identification (and of related concepts like structures and observational equivalence) that encompasses these alternative characterizations or classes of point-identified models that currently appear in the literature.

Section 3 then provides examples of, and methods for obtaining, point identification. This section also includes a discussion of typical sources of non-identification, and of some traditional identification-related concepts like overidentification, exact identification, and rank and order conditions. Identification by functional form is described and examples are provided, including constructed instruments based on second and higher moment assumptions. Appropriate use of such methods is discussed.

Next is section 4, which defines and discusses the concepts of coherence and completeness of models. These are closely associated with existence of a reduced form, which in turn is often used as a starting point for proving identification. This is followed by section 5, which is devoted to discussing identification concepts in what is variously known as the reduced form, or program evaluation, or treatment effects, or causal inference literature. This literature places a particular emphasis on randomization, and is devoted to the identification of parameters that can be given a causal interpretation.

Typical methods and assumptions used to obtain identification in this literature are compared to identification of more traditional structural models. To facilitate this comparison, the assumptions of the popular local average treatment effect (LATE) causal model, which are usually described in potential outcome notation, are here rewritten using a traditional structural notation. The relative advantages and disadvantages of randomization based causal inference methods versus structural modeling are laid out, and a case is made for combining both approaches in practice.

Section 6 describes nonparametric identification, semiparametric identification, and set identification. This section also discusses the related role of normalizations in identification analyses, which has not been analyzed in previous surveys. Special regressor methods are then described, mainly to provide examples of these concepts.

Section 7 describes limited forms of identification, in particular, local identification and generic identification. Section 8 considers forms of identification that have implications for, or are related to, statistical inference. These include weak identification, identification at infinity, ill-posed identification, and Bayesian identification. Section 9 then concludes, and an appendix provides some additional mathematical details.

2. Historical Roots of Identification

Before discussing identification in detail, consider some historical context. I include first names of early authors in this section to promote greater knowledge of the early leaders in this field.

Before we can think about isolating, and thereby identifying, the effect of one variable on another, we need the notion of "ceteris paribus," that is, holding other things equal. The formal application of this concept to economic analysis is generally attributed to Alfred Marshall (1890). However, Persky (1990) points out that usage of the term ceteris paribus in an economic context goes back to William Petty (1662).²

The textbook example of an identification problem in economics, that of separating supply and demand curves, appears to have been first recognized by Philip Wright (1915), who pointed out that what appeared to be an upward sloping demand curve for pig iron was actually a supply curve, traced out by a moving demand curve. Philip's son, Sewall, invented the use of causal path diagrams in statistics.³ Sewall Wright (1925) applied those methods to construct an instrumental variables estimator, but in a model of exogenous regressors that could have been identified and estimated by ordinary least squares. The idea of using instrumental variables to solve the identification problem arising from simultaneous systems of equations first appears in appendix B of Philip Wright (1928). Stock and Trebbi (2003) claim that this is the earliest known solution to an identification problem in econometrics. They apply a stylometric analysis (the statistical analysis of literary styles) to conclude that Philip Wright was the one who actually wrote appendix B, using his son's estimator to solve their identification problem.

² Petty's (1662) use of the term ceteris paribus gives what could be construed as an early identification argument, identifying a determinant of prices. On page 50 of his treatise he writes, "If a man can bring to London an ounce of Silver out of the Earth in Peru, in the same time that he can produce a bushel of Corn, then one is the natural price of the other; now if by reason of new and more easie Mines a man can get two ounces of Silver as easily as formerly he did one, then Corn will be as cheap at ten shillings the bushel, as it was before at five shillings caeteris paribus."
³ Sewall's first application of causal paths was establishing the extent to which fur color in guinea pigs was determined by developmental vs genetic factors. See, e.g., Pearl (2018). So while the father Philip considered pig iron, the son Sewall studied actual pigs.

In addition to two different Wrights, two different Workings also published early papers relating to the subject: Holbrook Working (1925) and, more relevantly, Elmer J. Working (1927). Both wrote about statistical demand curves, though Holbrook is the one for whom the Working-Leser Engel curve is named.

Jan Tinbergen (1930) proposed indirect least squares estimation (numerically recovering structural parameters from linear regression reduced form estimates), but does not appear to have recognized its usefulness for solving the identification problem.

The above examples, along with the later analyses of Trygve Haavelmo (1943), Tjalling Koopmans (1949), Theodore W. Anderson and Herman Rubin (1949), Koopmans and Olav Reiersøl (1950), Leonid Hurwicz (1950), Koopmans, Rubin, and Roy B. Leipnik (1950), and the work of the Cowles Foundation more generally, are concerned with identification arising from simultaneity in supply and demand. Other important early work on this problem includes Abraham Wald (1950), Henri Theil (1953), J. Denis Sargan (1958), and results summarized and extended in Franklin Fisher's (1966) book. Most of this work emphasizes exclusion restrictions for solving identification in simultaneous systems, but identification could also come from restrictions on the covariance matrix of error terms, or combinations of the two, as in Karl G. Jöreskog (1970). Milton Friedman's (1953) essay on positive economics includes a critique of the Cowles foundation work, essentially warning against using different criteria to select models versus criteria to identify them.

A standard identification problem in the statistics literature is that of recovering a treatment effect. Derived from earlier probability theory, identification based on randomization was developed in this literature by Jerzy Splawa-Neyman (1923),⁴ David R. Cox (1958), and Donald B. Rubin (1974), among many others. Pearl (2015) and Heckman and Pinto (2015) credit Haavelmo (1943) as the first rigorous treatment of causality in the context of structural econometric models. Unlike the results in the statistics literature, econometricians historically focused more on cases where selection (determining who is treated or observed) and outcomes may be correlated. These correlations could come from a variety of sources, such as simultaneity as in Haavelmo (1943), or optimizing self selection as in Andrew D. Roy (1951). Another example is Wald's (1943) survivorship bias analysis (regarding airplanes in World War II), which recognizes that even when treatment assignment (where a plane was hit) is random, sample attrition that is correlated with outcomes (only planes that survived attack could be observed) drastically affects the correct analysis. General models where selection and outcomes are correlated follow from James J. Heckman (1978). Causal diagrams (invented by Sewall Wright as discussed above) were promoted by Judea Pearl (1988) to model the connections between treatments and outcomes.

A different identification problem is that of identifying the true coefficient in a linear regression when regressors are measured with error. Robert J. Adcock (1877, 1878) and Charles H. Kummell (1879) considered measurement errors in a Deming regression, as popularized in W. Edwards Deming (1943).⁵ This is a regression that minimizes the sum of squares of errors measured perpendicular to the fitted line. Corrado Gini (1921) gave an example of an estimator that deals with measurement errors in standard linear regression, but Ragnar A. K. Frisch (1934) was the first to discuss the issue in a way that would now be recognized as identification.

⁴ Neyman's birth name was Splawa-Neyman, and he published a few of his early papers under that name, including this one.
⁵ Adcock's publications give his name as R. J. Adcock. I only have circumstantial evidence that his name was actually Robert.

Other early papers looking at measurement errors in regression include Neyman (1937), Wald (1940), Koopmans (1937), Reiersøl (1945, 1950), Roy C. Geary (1948), and James Durbin (1954). Tamer (2010) credits Frisch (1934) as also being the first in the literature to describe an example of set identification.

3. Point Identification

In modern terminology, the standard notion of identification is formally called point identification. Depending on context, point identification may also be called global identification or frequentist identification. When one simply says that a parameter or a function is identified, what is usually meant is that it is point identified.

Early formal definitions of (point) identification were provided by Koopmans and Reiersøl (1950), Hurwicz (1950), Fisher (1966), and Rothenberg (1971). These include the related concepts of a structure and of observational equivalence. See Chesher (2008) for additional historical details on these classical identification concepts.

In this survey I provide a new general definition of identification. This generalization maintains the intuition of existing classical definitions while encompassing a larger class of models than previous definitions. The discussion in the text below will be somewhat informal for ease of reading. More rigorous definitions are given in the appendix.

3.1 Introduction to Point Identification

Recall that θ is the parameter (which could include vectors and functions) that we want to identify and ultimately estimate. We start by assuming there is some information, call it ϕ, that we either already know or could learn from data. Think of ϕ as everything that could be learned about the population that data are drawn from. Usually, ϕ would either be a distribution function, or some features of distributions like conditional means, quantiles, autocovariances, or regression coefficients. In short, ϕ is what would be knowable from unlimited amounts of whatever type of data we have. The key difference between the definition of identification given in this survey and previous definitions in the literature is that previous definitions generally started with a particular assumption (sometimes only implicit) of what constitutes ϕ (examples are the Wright–Cowles identification and distribution-based identification discussed in section 3.3).

Assume also that we have a model, which typically imposes some restrictions on the possible values ϕ could take on. A simple definition of (point) identification is then that a parameter θ is point identified if, given the model, θ is uniquely determined from ϕ.

For example, suppose for scalars Y, X, and θ, our model is that Y = Xθ + e where E(X²) ≠ 0 and E(eX) = 0, and suppose that ϕ, what we can learn from data, includes the second moments of the vector (Y, X). Then we can conclude that θ is point identified, because it is uniquely determined in the usual linear regression way by θ = E(XY)/E(X²), which is a function of second moments of (Y, X).
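
As a quick check of this logic, here is a minimal simulation sketch in Python (not from the article; the normal distributions, the sample size, and the true value θ = 2 are illustrative assumptions), showing that the two second moments E(XY) and E(X²), i.e., this ϕ, pin down θ:

# Minimal sketch: theta is recovered from second moments of (Y, X) alone.
# Illustrative assumptions (not from the article): X ~ N(0,1), e ~ N(0,1),
# true theta = 2, and a large sample standing in for population moments.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                 # large n, so sample moments approximate population moments
theta_true = 2.0              # hypothetical true parameter
X = rng.normal(size=n)
e = rng.normal(size=n)        # independent of X, so E(eX) = 0 holds
Y = X * theta_true + e

E_XY = np.mean(X * Y)         # elements of phi: second moments of (Y, X)
E_XX = np.mean(X ** 2)
print(E_XY / E_XX)            # theta = E(XY)/E(X^2), prints approximately 2.0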

Another example is to let the model be that a binary treatment indicator X is assigned to individuals by a coin flip, and Y is each individual's outcome. Suppose we can observe realizations of (X, Y) that are independent across individuals. We might therefore assume that ϕ, what we can learn from data, includes E(Y | X). It then follows that the average treatment effect θ is identified because, when treatment is randomly assigned, θ = E(Y | X = 1) − E(Y | X = 0), that is, the difference between the mean of Y among people who have X = 1 (the treated) and the mean of Y among people who have X = 0 (the untreated).
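
The same kind of hedged sketch works here (the outcome equation, the effect size of 1.5, and the sample size below are invented purely for illustration): with coin-flip assignment, the two conditional means contained in ϕ determine θ.

# Sketch: under random assignment, E(Y | X = 1) - E(Y | X = 0) recovers theta.
# Illustrative assumptions: true average treatment effect 1.5, N(0,1) noise.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
effect = 1.5                               # hypothetical treatment effect
X = rng.integers(0, 2, size=n)             # coin-flip treatment indicator
Y = 0.5 + effect * X + rng.normal(size=n)  # outcome equation (invented for the sketch)

theta = Y[X == 1].mean() - Y[X == 0].mean()
print(theta)                               # approximately 1.5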

Both of the above examples assume that expectations of observed variables are knowable, and so can be included in ϕ. Since sample averages can be observed, to justify this assumption we might appeal to the consistency of sample averages, given conditions for a weak law of large numbers.

When discussing empirical work, a common question is, "what is the source of the identification?" that is, what feature of the data is providing the information needed to determine θ? This is essentially asking, what needs to be in ϕ?

Note that the definition of identification is somewhat circular or recursive. We start by assuming some information ϕ is knowable. Essentially, this means that to define identification of something, θ, we start by assuming something else, ϕ, is itself identified. Assuming ϕ is knowable or identified to begin with can itself only be justified by some deeper assumptions regarding the underlying data-generating process (DGP).

We usually think of a model as a set of equations describing behavior. But more generally, a model is whatever set of assumptions we make about, and restrictions we place on, the DGP. This includes both assumptions about the behavior that generates the data and about how the data are collected and measured. These assumptions in turn imply restrictions on ϕ and θ. In this sense, identification (even in purely experimental settings) always requires a model.

A common starting assumption is that the DGP consists of n independently, identically distributed (IID) observations of a vector W, where the sample size n goes to infinity. We know (by the Glivenko–Cantelli theorem, see section 3.4 below) that with this kind of data we could consistently estimate the distribution of W. It is therefore reasonable with IID data in mind to start by assuming that what is knowable to begin with, ϕ, is the distribution function of W.

Another common DGP is where each data point consists of a value of X chosen from its support, and conditional upon that value of X, we randomly draw an observation of Y independent from the other draws of Y given X. For example, X could be the temperature at which you choose to run an experiment and Y is the outcome of the experiment. As n → ∞ this DGP allows us to consistently estimate and thereby learn about F(Y | X), the conditional distribution function of Y given X. So if we have this kind of DGP in mind, we could start an identification proof for some θ by assuming that F(Y | X) is knowable. But in this case F(Y | X) can only be known for the values of X that can be chosen in the experiment (e.g., it may be impossible to run the experiment at a temperature X of a million degrees).

With more complicated DGPs (e.g., time series data, or cross section data containing social interactions or common shocks), part of the challenge in establishing identification is characterizing what information ϕ is knowable, and hence appropriate to use as the starting point for proving identification. For example, in a time series analysis we might start by supposing that the mean, variance, and autocovariances of a time series are knowable, but not assume information about higher moments is available. Why not? Either because higher moments might not be needed for identification (as in vector autoregression models), or because higher moments may not be stable over time.

Other possible examples are that ϕ could equal reduced-form linear regression coefficients, or, if observations of W follow a martingale process, ϕ could consist of transition probabilities.

What to include in ϕ depends on the model. For example, in dynamic panel data models, the Arellano and Bond (1991) estimator is based on a set of moments that are assumed to be knowable (since they can be estimated from data) and equal zero in the population. The parameters of the model are identified if they are uniquely determined by the equations that set those moments equal to zero. The Blundell and Bond (1998) estimator provides additional moments (assuming functional form information about the initial time period zero distribution of data) that we could include in ϕ. We may therefore have model parameters that are not identified with Arellano and Bond moments, but become identified if we are willing to assume the model contains the additional information needed for Blundell and Bond moments.
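
For readers who want a concrete picture of the kind of moments being referred to, the display below is a standard textbook case supplied here as an illustration (the specific AR(1) panel model y_it = ρ y_i,t−1 + α_i + ε_it is an assumption of this sketch, not spelled out in the text above):

\begin{aligned}
\text{Arellano--Bond:}\quad & E\big[\,y_{i,t-s}\,\Delta\varepsilon_{it}\,\big]
  = E\big[\,y_{i,t-s}\,(\Delta y_{it}-\rho\,\Delta y_{i,t-1})\,\big]=0, \qquad s\geq 2,\\
\text{Blundell--Bond (additional):}\quad & E\big[\,\Delta y_{i,t-1}\,(\alpha_i+\varepsilon_{it})\,\big]=0.
\end{aligned}

The second set of moments is only valid under the extra initial-conditions (mean stationarity) restriction mentioned above; with those moments added to ϕ, parameters that are not uniquely determined by the Arellano and Bond moments alone may become identified.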

Even in the most seemingly straightforward situations, such as experimental design with completely random assignment into treatment and control groups, additional assumptions regarding the DGP (and hence regarding the model and ϕ) are required for identification of treatment effects. Typical assumptions that are routinely made (and may often be violated) in this literature are assumptions that rule out certain types of measurement errors, sample attrition, censoring, social interactions, and general equilibrium effects.

In practice, it is often useful to distinguish between two types of DGP assumptions. One is assumptions regarding the collection of data, for example, selection, measurement errors, and survey attrition. The other is assumptions regarding the generation of data, for example, randomization or statistical and behavioral assumptions. Arellano (2003) refers to a set of behavioral assumptions that suffice for identification as an identification arrangement. Ultimately, both types of assumptions determine what we know about the model and the DGP, and hence determine what identification is possible.

3.2 Defining Point Identification

Here we define point identification and some related terms, including structure and observational equivalence. The definitions provided here generalize and encompass most previous definitions provided in the literature. The framework here most closely corresponds to Matzkin (2007, 2012). Her framework is essentially the special case of the definitions provided here in which ϕ is a distribution function. In contrast, the traditional textbook discussion of identification of linear supply and demand curves corresponds to the special case where ϕ is a set of limiting values of linear regression coefficients. The relationship of the definitions provided here to other definitions in the literature, such as those given by the Cowles foundation work, or in Rothenberg (1971), Sargan (1983), Hsiao (1983), or Newey and McFadden (1994), is discussed below. In this section, the provided definitions will still be somewhat informal, stressing the underlying ideas and intuition. More formal and detailed definitions are provided in the appendix.

Define a model M to be a set of functions or constants that satisfy some given restrictions. Examples of what might be included in a model are regression functions, error distribution functions, utility functions, game payoff matrices, and coefficient vectors. Examples of restrictions could include assuming regression functions are linear or monotonic or differentiable, or that errors are normal or fat tailed, or that parameters are bounded.

Define a model value m to be one particular possible value of the functions or constants that comprise M. Each m implies a particular DGP. An exception is incoherent models (see section 4), which may have model values that do not correspond to any possible DGP.

Define ϕ to be a set of constants and/or functions about the DGP that we assume are known or knowable from data. Common examples of ϕ might be data distribution functions, conditional mean functions, linear regression coefficients, or time series autocovariances.

Define a set of parameters θ to be a set of unknown constants and/or functions that characterize or summarize relevant features of a model. Essentially, θ can be anything we might want to estimate. Parameters θ could include what we usually think of as model parameters, such as regression coefficients, but θ could also be, for example, the sign of an elasticity, or an average treatment effect.

The set of parameters θ may also include nuisance parameters, which are defined as parameters that are not of direct economic interest, but may be required for identification and estimation of other objects that are of interest. For example, in a linear regression model θ might include not only the regression coefficients, but also the marginal distribution function of identically distributed errors. Depending on context, this distribution might not be of direct interest and would then be considered a nuisance parameter. It is not necessary that nuisance parameters, if present, be included in θ, but they could be.

We assume that each particular value of m implies a particular value of ϕ and of θ (violations of this assumption can lead to incoherence or incompleteness, as discussed in a later section). However, there could be many values of m that imply the same ϕ or the same θ. Define the structure s(ϕ, θ) to be the set of all model values m that yield both the given values of ϕ and of θ.

Two parameter values, θ and θ̃, are defined to be observationally equivalent if there exists a ϕ such that both s(ϕ, θ) and s(ϕ, θ̃) are not empty. Roughly, θ and θ̃ being observationally equivalent means there exists a value ϕ such that, if ϕ is true, then either the value θ or θ̃ could also be true. Equivalently, θ and θ̃ being observationally equivalent means that there exists a ϕ and model values m and m̃ such that model value m yields the values ϕ and θ, and model value m̃ yields the values ϕ and θ̃.

We're now ready to define identification. The parameter θ is defined to be point identified (often just called identified) if there do not exist any pairs of possible values θ and θ̃ that are different but observationally equivalent.

Let Θ denote the set of all possible values that the model says θ could be. One of these values is the unknown true value of θ, which we denote as θ₀. We say that the particular value θ₀ is point identified if θ₀ is not observationally equivalent to any other θ in Θ. However, we don't know which of the possible values of θ (that is, which of the elements of Θ) is the true θ₀. This is why, to ensure point identification, we generally require that no two elements θ and θ̃ in the set Θ having θ ≠ θ̃ be observationally equivalent. Sometimes this condition is called global identification rather than point identification, to explicitly say that θ₀ is point identified no matter what value in Θ turns out to be θ₀.
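
Collecting the definitions above in one place, as a compact restatement in the same notation:

\begin{aligned}
s(\phi,\theta) &= \{\, m \in M : m \text{ yields the values } \phi \text{ and } \theta \,\},\\
\theta \text{ and } \tilde{\theta} \text{ observationally equivalent} &\iff \exists\, \phi \text{ such that } s(\phi,\theta)\neq\varnothing \text{ and } s(\phi,\tilde{\theta})\neq\varnothing,\\
\theta_0 \text{ point identified} &\iff \text{no } \theta\in\Theta \text{ with } \theta\neq\theta_0 \text{ is observationally equivalent to } \theta_0 .
\end{aligned}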

We have now defined what it means to have parameters θ be point identified. We say that the model is point identified when no pairs of model values m and m̃ in M are observationally equivalent (treating m and m̃ as if they were parameters). Since every model value is associated with at most one value of θ, having the model be identified is sufficient, but stronger than necessary, to also have any possible set of parameters θ be identified.

The economist or econometrician defines the model M, so we could in theory enumerate every m ∈ M, list every value of ϕ and θ that is implied by each m, and thereby check every pair s(ϕ, θ) and s(ϕ, θ̃) to see if θ is point identified or not. The difficulty of proving identification in practice is in finding tractable ways to accomplish this enumeration. Note that since we do not know which value of θ is the true one, proving identification in practice requires showing that the definition holds for any possible θ, not just the true value.

We conclude this section by defining some identification concepts closely related to point identification. Later sections will explore these identification concepts in more detail.

The concepts of local and generic identification deal with cases where we can't establish point identification for all θ in Θ. Local identification of θ₀ means that there exists a neighborhood of θ₀ such that, for all values θ ≠ θ₀ in this neighborhood, θ is not observationally equivalent to θ₀. As with point identification, since we don't know ahead of time which value of θ is θ₀, to prove that local identification holds we would need to show that for any θ̃ ∈ Θ there exists a neighborhood of θ̃ such that, for any θ ≠ θ̃ in this neighborhood, θ is not observationally equivalent to θ̃.

Generic identification roughly means that the set of values of θ in Θ that cannot be point identified is a very small subset of Θ. Suppose we took all the values of θ in Θ, and divided them into two groups: those that are observationally equivalent to some other element of Θ, and those that are not. If θ₀ is in the second group, then it's identified, otherwise it's not. Since θ₀ could be any value in Θ, and we don't know which one, to prove point identification in general we would need to show that the first group is empty. The parameter θ is defined to be generically identified if the first group is extremely small (formally, if the first group is a measure zero subset of Θ). Both local and generic identification are discussed in more detail later.

The true parameter value θ₀ is said to be set identified (sometimes also called partially identified) if there exist some values of θ ∈ Θ that are not observationally equivalent to θ₀. So the only time a parameter θ is not set identified is when all θ ∈ Θ are observationally equivalent. For set-identified parameters, the identified set is defined to be the set of all values of θ ∈ Θ that are observationally equivalent to θ₀. Point identification of θ₀ is therefore the special case of set identification in which the identified set contains only one element, which is θ₀.
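
In the same notation, and again only restating the text, write the identified set as Θ_I(θ₀) (the subscript I is just notation introduced here). Then

\begin{aligned}
\Theta_I(\theta_0) &= \{\, \theta\in\Theta : \theta \text{ is observationally equivalent to } \theta_0 \,\},\\
\theta_0 \text{ set identified} &\iff \Theta_I(\theta_0)\neq\Theta,\\
\theta_0 \text{ point identified} &\iff \Theta_I(\theta_0)=\{\theta_0\}.
\end{aligned}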

Parametric identification is where θ is a finite set of constants and all the different possible values of ϕ also correspond to different values of a finite set of constants. Nonparametric identification is where θ consists of functions or infinite sets. Other cases are called semiparametric identification, which includes situations where, for example, θ includes both a vector of constants and nuisance parameters that are functions. As we will see in section 6, sometimes the differences between parametric, semiparametric, and nonparametric identification can be somewhat arbitrary (see Powell 1994 for further discussion of this point).

3.3 Examples and Classes of Point Identification

Consider some examples to illustrate the basic idea of point identification.

Example 1: a median.—Let the model M be the set of all possible distributions of a random variable W having a strictly monotonically increasing distribution function. Our DGP consists of IID draws of W. From this DGP, what is knowable is F(w), the distribution function of W. Let our parameter θ be the median of W. In this simple example, we know θ is identified because it's the unique solution to F(θ) = 1/2. By knowing F, we can determine θ.
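
A minimal numerical sketch of this example (the particular F below, a normal distribution function with mean 1 and standard deviation 2, is chosen purely for illustration): once ϕ = F is treated as known, θ is the unique root of F(θ) − 1/2.

# Sketch: with phi = F known, the median theta is the unique solution of F(theta) = 1/2.
# Illustrative F (an assumption for this sketch): Normal with mean 1 and sd 2,
# so the true median is 1.
from scipy.stats import norm
from scipy.optimize import brentq

F = lambda w: norm.cdf(w, loc=1.0, scale=2.0)      # the "knowable" distribution function
theta = brentq(lambda w: F(w) - 0.5, -50.0, 50.0)  # unique root, since F is strictly increasing
print(theta)                                       # 1.0 up to numerical tolerance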

How does this example fit the general definition of identification? Here, each value of ϕ is a particular continuous, monotonically increasing distribution function F. In this example, each model value m happens to correspond to a unique value of ϕ because each possible distribution of W has a unique distribution function. In this example, for any given candidate value of ϕ and θ, the structure s(ϕ, θ) is either an empty set or it has one element. For a given value of ϕ and θ, if ϕ = F and F(θ) = 1/2 (the definition of a median) the set s(ϕ, θ) contains one element. That element m is the distribution that has distribution function F. Otherwise, if ϕ = F where F(θ) ≠ 1/2, the set s(ϕ, θ) is empty. In this example, it's not possible to have two different parameter values θ and θ̃ be observationally equivalent, because F(θ) = 1/2 and F(θ̃) = 1/2 implies θ = θ̃ for any continuous, monotonically increasing function F. Therefore, θ is point identified, because its true value θ₀ cannot be observationally equivalent to any other value θ.

Example 2: Linear regression.—Consider a DGP consisting of observations of Y, X where Y is a scalar and X is a K-vector. The observations of Y and X might not be independent or identically distributed. Assume the first and second moments of X and Y are constant across observations, and let ϕ be the set of first and second moments of X and Y. Let the model M be the set of joint distributions of e, X that satisfy Y = X′θ + e, where θ is some K-vector of parameters, e is an error term satisfying E(Xe) = 0, and where e, X has finite first and second moments. The structure s(ϕ, θ) is nonempty when the moments comprising ϕ satisfy E[X(Y − X′θ)] = 0 for the given θ. To ensure point identification, we could add the additional restriction on M that E(XX′) is non-singular, because then θ would be uniquely determined in the usual way by θ = E(XX′)⁻¹E(XY). However, if we do not add this additional restriction, then we can find values θ̃ that are observationally equivalent to θ by letting θ̃ = E(XX′)⁻E(XY) for different pseudoinverses E(XX′)⁻.
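
The failure of identification when E(XX′) is singular can be seen in a small sketch (the perfectly collinear two-regressor design below is an invented illustration): distinct values θ̃ satisfy the same moment condition E[X(Y − X′θ̃)] = 0, and so are observationally equivalent given this ϕ.

# Sketch: with a singular E(XX'), E[X(Y - X'theta)] = 0 has many solutions,
# so distinct parameter values are observationally equivalent.
# Illustrative design: the two regressors are identical (perfect collinearity).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1])             # K = 2, columns perfectly collinear
Y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

Exx = X.T @ X / n                         # (nearly) singular second-moment matrix
Exy = X.T @ Y / n

theta_a = np.linalg.pinv(Exx) @ Exy       # Moore-Penrose pseudoinverse solution, about (1, 1)
theta_b = np.array([2.0, 0.0])            # a different value implying the same moments

print(Exy - Exx @ theta_a)                # both are approximately zero, so theta_a and
print(Exy - Exx @ theta_b)                # theta_b cannot be distinguished from phi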

As described here, identification of θ is parametric, because θ is a vector and ϕ can be written as a vector of moments. However, some authors describe linear regression as semiparametric, because it includes errors e that have an unknown distribution function. This distinction depends on how we define ϕ and θ. For example, suppose we had IID observations of Y, X. We could then have defined ϕ to be the joint distribution function of Y, X, and defined θ to include both the coefficients of X and the distribution function of the error term e. Given the same model M, including the restriction that E(XX′) is non-singular, we would then have semiparametric identification of θ.

Example 3: Treatment.—Suppose the DGP consists of individuals who are assigned a treatment of T = 0 or T = 1, and each individual generates an observed outcome Y. Assume Y, T are independent across individuals. In the Rubin (1974) causal notation, define the random variable Y(t) to be the outcome an individual would have generated if he or she were assigned T = t. The observed Y satisfies Y = Y(T). Let the parameter of interest θ be the average treatment effect (ATE), defined by θ = E(Y(1) − Y(0)). The model M is the set of all possible joint distributions of Y(1), Y(0), and T. One possible restriction on the model is Rosenbaum and Rubin's (1983) assumption that (Y(1), Y(0)) is independent of T. This assumption, equivalent to random assignment of treatment, is what Rubin (1990) calls unconfoundedness. Imposing unconfoundedness means that M only contains model values m (i.e., joint distributions) where (Y(1), Y(0)) is independent of T.

The knowable function ϕ from this DGP is the joint distribution of Y and T. Given unconfoundedness, θ is identified because unconfoundedness implies that θ = E(Y | T = 1) − E(Y | T = 0), which is uniquely determined from ϕ. Heckman, Ichimura, and Todd (1997) note that a weaker sufficient condition for identification of θ by this formula is the mean unconfoundedness assumption that E(Y(t) | T) = E(Y(t)).
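
The short derivation behind this formula, using only Y = Y(T) and mean unconfoundedness, is:

\begin{aligned}
E(Y \mid T=1) - E(Y \mid T=0)
  &= E\big(Y(1)\mid T=1\big) - E\big(Y(0)\mid T=0\big)\\
  &= E\big(Y(1)\big) - E\big(Y(0)\big) = \theta,
\end{aligned}

where the first equality uses Y = Y(T) and the second is exactly where E(Y(t) | T) = E(Y(t)) is needed.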

If we had not assumed some form of unconfoundedness, then θ might not equal
