
ST06CH08_Kreuter ARI 17 January 2019 16:5

Annual Review of Statistics and Its Application

Nonprobability Sampling and


Causal Analysis
Ulrich Kohler,1 Frauke Kreuter,2,3,4
and Elizabeth A. Stuart5
Annu. Rev. Stat. Appl. 2019.6:149-172. Downloaded from www.annualreviews.org
Access provided by University of Nottingham on 02/18/20. For personal use only.

1 Faculty of Economics and Social Sciences, University of Potsdam, 14482 Potsdam, Germany; email: ulrich.kohler@uni-potsdam.de
2 Joint Program in Survey Methodology, University of Maryland, College Park, Maryland 20742, USA; email: fkreuter@umd.edu
3 School of Social Sciences, University of Mannheim, 68131 Mannheim, Germany
4 Statistical Methods Research Department, Institute for Employment Research (IAB), 90478 Nuremberg, Germany
5 Department of Mental Health, Department of Biostatistics, and Department of Health Policy and Management, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21205, USA; email: estuart@jhu.edu

Annu. Rev. Stat. Appl. 2019. 6:149–72

First published as a Review in Advance on September 12, 2018

The Annual Review of Statistics and Its Application is online at statistics.annualreviews.org

https://doi.org/10.1146/annurev-statistics-030718-104951

Copyright © 2019 by Annual Reviews. All rights reserved

Keywords

causal inference, generalizability, self-selection, nonprobability sampling, validity, measurement error, heterogeneous treatment effects, big data

Abstract

The long-standing approach of using probability samples in social science research has come under pressure through eroding survey response rates, advanced methodology, and easier access to large amounts of data. These factors, along with an increased awareness of the pitfalls of the nonequivalent comparison group design for the estimation of causal effects, have moved the attention of applied researchers away from issues of sampling and toward issues of identification. This article discusses the usability of samples with unknown selection probabilities for various research questions. In doing so, we review assumptions necessary for descriptive and causal inference and discuss research strategies developed to overcome sampling limitations.


1. INTRODUCTION
Ever since the work of Neyman (1934), data collection on research units selected by probability sampling (hereafter PSg) has been the method of choice for many disciplines, including epidemiology, the social sciences, and survey research. However, this long-standing approach has come under pressure in recent years for the following reasons: (a) eroding survey response rates make less certain the assumption that PSg leads to samples that possess the unbiasedness properties that probabilistic sampling should convey, (b) there is an increased belief that more sophisticated statistical techniques can correct deviations from the blueprint of PSg, (c) newly eased access to large collections of previously inaccessible data is prompting a shift toward research questions that can be answered with big data, and (d) an increased awareness of the pitfalls of attempting to estimate causal effects by comparing intervention and comparison groups that are quite dissimilar from each other has shifted the attention of applied researchers away from issues of sampling and toward issues of causal identification. Against this backdrop, we discuss the usability of nonprobability samples (NPS) for social science research. Before we begin, we highlight a few ideas that shape our arguments throughout the article.

Abbreviations: PS: probability sample; PSg: probability sampling; NPS: nonprobability sample; NPSg: nonprobability sampling; SRS: simple random sample; SRSg: simple random sampling.
• Sampling design and the sample characteristics must be distinguished. When response rates fall below 100%, the sampling probabilities are no longer fixed by the sampling design. Nonresponse—the nonparticipation of units selected for the study—may cause a simple random sampling (SRSg) design to take on the characteristics of a self-selected sample (Groves 2006). Similarly, in certain instances and with certain assumptions, it may be possible to estimate selection probabilities for data of a self-selected sample in order to mimic a probability sample (PS) (Wang et al. 2015). Thus, to keep the sampling design and the characteristics of the data separate, we use the abbreviations PSg and PS to denote probability sampling and probability samples, respectively. Correspondingly, we use NPSg and NPS for nonprobability sampling and nonprobability samples.
• The usability of any data depends on the research question at hand. Following others (King et al. 1994), we distinguish between the description of a specific characteristic of a population and the analysis of the process generating the empirical reality itself. We argue that the usability of NPSs is limited—yet still existent—to describe population characteristics, and that a PS is unnecessary—or potentially even harmful—to establish a causal link. More importantly, we argue that many applied researchers, particularly in the social sciences, strive to estimate causal effects for a population, which in essence describes a causal link for a population. In these cases the problems of descriptive and causal research overlap.
• Statements about the real world from observed data require assumptions. This is true no matter what kind of data we have, and it is explicitly also true for statements based on data stemming from SRSg. The validity of our statements therefore critically hinges on the validity of those assumptions. If our statistical assumptions are wrong for the analysis at hand, the inferential statements will be incorrect no matter how sophisticated the statistical methods are. We thus aim to refocus attention on the assumptions that form the backbone of empirical research.
• Empirical findings are always uncertain. No matter how well our research was designed, it may be the case that the specific estimate we come up with is so far away from the true parameter of interest that the estimated confidence interval does not capture the true parameter of interest (e.g., the population mean, population proportion, regression coefficient, or correlation between two factors of interest). Statistical techniques show us how uncertain the estimate is from the standpoint of a thought experiment in which the parameter of interest is estimated over and over again. The best way to find out the true uncertainty of an empirical finding, however, is to actually do the replications (Selvin 1958).
• Empirical research has goals and constraints. Certainly, the dominant goal of science is to estimate the parameter of interest correctly. But pursuing that target happens under constraints. We might want to obtain the estimate in a timely manner, or we may want to keep the costs of the estimate within the budget of the research project, etc. The research design in general, including the case selection method, should thus be fit for purpose, not perfect (Statistics Canada 2017).

Descriptive inference: estimating some unknown parameter of interest for a well-defined population using the data at hand
The review proceeds as follows. In Section 2 we separate the two prototypical research ques-
tions: descriptive and causal inference. In Section 3 we discuss the role of sampling for descriptive
inference. We reiterate that PSs have highly desirable characteristics for descriptive inference, but
we also reiterate that these theoretical characteristics may not mean much for the specific data at
hand. Moreover, we highlight instances where research using NPS data may still shed light on a
descriptive research question. In Section 4 we discuss the role of sampling for causal inference.
We emphasize that causal inference requires first and foremost a research design that ensures a set of assumptions holds (Section 4.2). We show that case selection is not an issue as long as the size
of the causal effect is assumed to be some kind of a law of nature and is therefore the same for
all research units, in all places, at all times (i.e., homogeneous treatment effects, or at least, any
heterogeneity in effects is negligible). We then continue by discussing two types of research ques-
tions that can be tackled with NPSs even in cases when the causal effects differ between research
units, places, or time points (heterogeneous treatment effects): special populations (Section 4.2.1)
and interactions (Section 4.2.2). Heterogeneous treatment effects, however, almost immediately
create a demand for a summary of the individual causal effects for a specific population of interest,
discussed in detail in Section 4.2.3. In this variant of causal inference, sampling has very much the
same role as in descriptive inference. However, estimating the average treatment effect in a pop-
ulation of interest has both the challenges of making valid descriptive inferences and challenges
around obtaining unbiased estimates of causal effects. The final section concludes with some thoughts about the role of replication and related topics.
Because we seek to address a broad audience of applied social scientists, we try to minimize
the use of formal notation in the text and have confined most formal definitions and statements
to sidebars.

2. TYPES OF RESEARCH QUESTIONS


This section distinguishes between two research questions: descriptive inference and causal infer-
ence. Descriptive inference seeks to estimate some unknown parameter of interest, θ, of a well-
defined population using the data at hand; this parameter of interest is often called the estimand.
Typical examples of research questions that involve descriptive inference are studies estimating the
mean income of the work force in a specific country at some point in time, or the proportion of an
electorate that prefers a certain candidate. The work of national statistical institutes and election
polls are prime examples. However, descriptive inference is not restricted to the description of
a simple summary statistic such as the arithmetic mean or a proportion. Descriptive inference also takes place when a difference between summary statistics is described, or when such differences are compared across various groups, as long as the goal is inference to a well-defined population.
A simple summary statistic would be estimated in, for example, a study describing the average
wage of women in a specific country at some point in time using one of the large general population
surveys that exist in many countries. Similarly, examining a difference between summary statistics


could result in, for example, examining the evolution of women’s wages over time or a comparison
of women’s wages across countries using many different population surveys.
Correspondingly, descriptive inference may also take place in research studying associations between various social structure variables, between social structure variables and attitudes, or between several attitudes. To reiterate, it is not a property of the statistic θ that renders a research question descriptive, but the goal of describing a characteristic of interest of a finite—yet possibly very large—population with the data at hand.

Fundamental problem of causal inference: one cannot observe the outcome under a given scenario and its counterfactual at the same time

Causal inference seeks to identify a so-called causal effect. The causal effect is thereby defined as the difference between the value of some outcome variable Y under two conditions, namely, that a unit of interest, i (a person, school, company, state, etc.), has and has not received a certain treatment; see Section 4 for a formal definition. Because it is impossible for unit i to receive and not receive the treatment at the same point in time, this causal effect is conceptually unobservable; this is usually referred to as the fundamental problem of causal inference (Rubin 1974). Thus, the inferential question is to infer from the data at hand to the unknown causal effect (see Mill
1843, Neyman et al. 1935, Rubin 1974, and Pearl 2009 for important milestones of the potential
outcome framework; also see King et al. 1994, Winship & Morgan 1999, Angrist & Pischke
2009, and Rubin 2001 for explanations and examples addressed to specific disciplines, and Dawid
2015 for another perspective on causality). Using the example from above, an example of a causal
inference question might be whether some program aiming to reduce the gender pay gap actually
does so, measured two years after it was implemented in a set of companies.
Causal inference usually comes in two types. Researchers interested in the first type have
in mind (implicitly or explicitly) a causal effect that operates somewhat like a law of nature,
everywhere and at all times—or at least in a very general way (or with negligible variation across
units). Researchers interested in the second type believe that the causal effect may affect only
some, or can be moderated by circumstances, and thus aim to estimate causal effects for a well-
defined population. These two types have important consequences for the way research units can
be selected and whether NPS can or cannot be used for causal inference. In fact, as we show in
Section 4, proponents of the first type can rely on more or less any kind of data, as long as the
so-called identifying assumptions can be justified. Proponents of the second type, however, first
need to clarify their research question a bit further: Do they want to estimate the causal effect
for a specific population, do they want to study how the causal effect varies between units, or do
they want to estimate an average effect for some specific subpopulation? For questions that relate
to population treatment effects, a researcher needs a clear understanding of how the data arise
from the population (or perhaps a different population). The problems that can then arise when
interested in a population average treatment effect mirror the problems that arise for descriptive
inference. It is thus necessary to review important contributions to concepts and methods for
descriptive inference before talking about causal inference in greater detail.

3. DESCRIPTIVE INFERENCE
Any research that seeks descriptive inference has the characteristic that it uses a specific research
design D to estimate the estimand of interest θ in the data at hand. Generally, researchers want
the value θ̂r that they estimated in a specific realization r of their research design to be correct
in the strict sense that θ̂r = θ. An important question, therefore, is, For which research designs
is this the case? In answering this question we concentrate on the sampling design, i.e., the part of
the research design that is related to the selection of observations. Issues of measurement errors,
calculation errors, etc. are ignored for our current purpose but are no less important (Groves et al. 2011, Biemer et al. 2017). The following success criteria are therefore also related only to the sampling design (see the sidebar titled Success Criteria for Descriptive Inference for a summary).

SUCCESS CRITERIA FOR DESCRIPTIVE INFERENCE

• Correctness of the estimate: The value θ̂_r estimated in a specific replication r of a research design D is correct if

θ̂_r = θ.  (1)

• Unbiasedness of the estimate: Research design D yields unbiased estimates if, across R replications of the design D,

lim_{R→∞} (1/R) Σ_{r=1}^{R} θ̂_r = E(θ̂) = θ.  (2)

Correspondingly, if E(θ̂) ≠ θ,

Bias(θ̂) = E(θ̂) − θ  (3)

is the bias of the estimate.

• Uncertainty of the estimate: The uncertainty of θ̂ estimated with research design D is

lim_{R→∞} √[(1/R) Σ_{r=1}^{R} (θ̂_r − θ)²] = √Var(θ̂) = S.E.(θ̂).  (4)

The reciprocal value of the standard error is called precision throughout this article.

• Mean squared error (MSE): The MSE is a combination of bias and uncertainty, defined as

MSE(θ̂) = Bias(θ̂)² + Var(θ̂).  (5)

An estimator whose MSE tends to zero as the sample size increases is called consistent.
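These criteria can be illustrated by actually carrying out the replications in a simulation. The following sketch uses a hypothetical Gaussian population and the sample mean as θ̂ (none of the specifics come from the article); it approximates the bias, standard error, and MSE of an SRSg design by replicating it many times:

```python
import random
import statistics

random.seed(2019)

# Hypothetical finite population; theta is the population mean of Y.
population = [random.gauss(50, 10) for _ in range(10_000)]
theta = statistics.fmean(population)

# Replicate the design R times: draw an SRS of n = 100, estimate theta.
R, n = 5_000, 100
estimates = [statistics.fmean(random.sample(population, n)) for _ in range(R)]

bias = statistics.fmean(estimates) - theta            # Equation 3
se = statistics.pstdev(estimates)                     # Equation 4
mse = bias**2 + se**2                                 # Equation 5

print(f"bias ≈ {bias:.3f}, S.E. ≈ {se:.3f}, MSE ≈ {mse:.3f}")
```

With σ = 10 and n = 100, the standard error comes out close to the textbook value of σ/√n = 1, and the bias is close to zero, as the unbiasedness of the sample mean under SRSg would suggest.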

3.1. Success Criteria


A sampling design that would—at least in principle—guarantee correctness of a realized estimate
would be to identify mutually exclusive and exhaustive subgroups, called strata, such that within
each stratum the value of the parameter of interest would be constant for all units, and then
arbitrarily select research units from those strata (see Figure 1a). (Note that the number of units
in the population in each stratum would also need to be known.) More formally, if the population could be divided into strata s = 1, …, S such that Σ_{s=1}^{S} N_s = N, and we knew N_s for all the strata, and the parameter of interest were constant for all research units belonging to one stratum, then Σ_{s=1}^{S} p_s θ_s = θ with p_s = N_s/N, regardless of which research units within a stratum we observe. Of course, the population illustrated in Figure 1a is only a limiting case and will never be available
in practice. However, the example is instructive, as it shows how previous knowledge (and thus
smart construction of strata) removes the need for random sampling. Real-world surveys do use
this general idea in setting up a sampling design by, for example, defining strata as geographical
or administrative units within a country or organization (Valliant et al. 2018).1

1 Surveys often form clusters within each stratum and sample from those for organizational or financial reasons. Unlike with strata, only a sample of the defined clusters is then used. Kreuter & Valliant (2007) provide a light introduction.
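The limiting case described above can be sketched numerically. The strata sizes and values below are hypothetical; the point is that when θ is constant within strata and the stratum sizes N_s are known, the weighted estimate Σ p_s θ_s reproduces θ exactly, no matter how units are selected within strata:

```python
# Hypothetical limiting case (as in Figure 1a): the characteristic of
# interest is constant within each of four strata of known size N_s.
strata = {  # stratum id -> (N_s, constant value of Y within the stratum)
    1: (2300, 10.0),
    2: (2700, 20.0),
    3: (1700, 30.0),
    4: (3300, 40.0),
}
N = sum(Ns for Ns, _ in strata.values())

# True population mean: theta = sum_s p_s * theta_s with p_s = N_s / N.
theta = sum((Ns / N) * y for Ns, y in strata.values())

# Arbitrary (non-random) selection: observe a single unit per stratum and
# reweight by the known stratum shares p_s. Any unit would do, because Y
# is constant within each stratum.
observed = {s: y for s, (_, y) in strata.items()}
theta_hat = sum((strata[s][0] / N) * y for s, y in observed.items())

print(theta_hat == theta)  # correct regardless of which units were observed
```

In realistic populations Y varies within strata, so this exactness is lost, which is precisely why the text turns to unbiasedness next.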


Figure 1
Ideal typical populations with four strata of known sizes containing research units with characteristic θ, with values represented by colors. (a) Stratified population: θ is constant within strata. Arbitrary selection of n_s observations from each stratum yields correct estimates. (b) Uncorrelated strata: θ is not correlated with the strata. Random sampling yields unbiased estimates.

In the more realistic case when the sizes of the strata are unknown and/or when the research
units in each stratum vary, correctness of a specific estimate can only be maintained by collecting
data on the entire population. For all other sampling designs we do not know whether a specific
estimate θ̂r really is equal to θ or not. For some sampling designs, however, it is possible to show
that these methods, when replicated an infinite number of times, lead to values of θ̂r that are on
average identical to θ. A method that has this characteristic is called unbiased (see Equation 2).
Examples of unbiased methods are calculations of arithmetic means, correlation coefficients, or
regression coefficients using data without measurement errors collected from research units of a
SRS. Unbiased methods are also available for research designs with known selection probabilities
(PSs), and these methods can also be used when the selection probabilities are unknown but
estimable. In the latter case the results are unbiased only to the extent that the estimated selection
probabilities are on average correct (Valliant et al. 2018).
Notwithstanding that unbiased methods are generally desirable, using an unbiased method
does not mean a given value θ̂r of a specific realization of the method is correct. In fact, there
is always a nonzero probability that θ̂r is far away from the truth, although that probability is
smaller for some designs than for others. Specifically, the probability that a given realization of θ̂
is far from the truth is a direct function of the uncertainty of the estimation method that is being
used (see Equation 5). The larger the uncertainty of D, i.e., the larger the standard error of the
estimate, the higher the probability that θ̂_r is wrong by a fixed amount, and the higher the precision of D, the lower the probability that θ̂_r is wrong. Remember that, while the uncertainty or precision of
an estimation method is theoretically well defined, it is unknown for a specific realization of the
research design. Other things being equal, uncertainty decreases with increasing sample size and
can be estimated. MSE is a useful quantity that combines bias and variance. MSE is defined as the
average of the squared deviations between the estimate and the true value of the estimand, and it
can be calculated as the sum of the variance and the squared bias. When thinking about the quality
of a design and estimation approach in terms of MSE, it becomes clear that a (slightly) biased but
precise estimator may be preferable to an unbiased but very noisy estimator.
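The MSE trade-off can be made concrete with a toy comparison (hypothetical estimators and numbers, not from the article): an unbiased but noisy estimator against a slightly biased but precise one:

```python
import random
import statistics

random.seed(1)

theta = 0.0  # true parameter in this toy setup

# Estimator A is unbiased but noisy; estimator B is biased but precise.
R = 20_000
a = [random.gauss(theta, 2.0) for _ in range(R)]        # bias 0.0, S.E. 2.0
b = [random.gauss(theta + 0.3, 0.5) for _ in range(R)]  # bias 0.3, S.E. 0.5

def mse(estimates, truth):
    """Average squared deviation from the truth: Bias^2 + Var (Equation 5)."""
    return statistics.fmean((e - truth) ** 2 for e in estimates)

print(f"MSE(A) ≈ {mse(a, theta):.2f}")  # ≈ 4.00
print(f"MSE(B) ≈ {mse(b, theta):.2f}")  # ≈ 0.34 -> B wins despite its bias
```

Here B's squared bias (0.09) plus variance (0.25) is far below A's variance (4.0), which is the sense in which a slightly biased but precise estimator can be preferable.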
As mentioned, PSs allow unbiased estimates of parameters for a finite population. The following
subsection discusses instances where descriptive inference might be possible, even in the case of
NPS. Thereby, we concentrate on the extreme case, which is a completely self-selected sample,
where the selection probabilities are unknown and not estimable. We add that in practice, providers
of NPS data often deliver their data with estimates of selection probabilities, or more specifically,
their reciprocal values (weights) (see, e.g., Kennedy et al. 2016). If these weights are correct,
the data, strictly speaking, cannot be considered to be a NPS. Of course, if the weights are
incorrect, the data must be considered a NPS, so that the following statements may still have
practical relevance. An overview of NPS in the survey context can be found in the task force
report from the American Association for Public Opinion Research (Baker et al. 2013), the work
of Callegaro et al. (2014) and Mercer (2018), and the review article from Elliott & Valliant
(2017), as well as in several evaluation studies (see Yeager et al. 2011, Kennedy et al. 2016).
Elliott & Valliant (2017), in particular, address estimating descriptive statistics from NPSs and
highlight the work of Dever et al. (2008), Valliant & Dever (2011), and O’Muircheartaigh &
Hedges (2014). Readers are also directed to Rivers (2007) and Rivers & Bailey (2009) for a highly
influential proposal for combining information from PSg and NPSg. Dutwin & Buskirk (2017)
provide some skepticism. More recent work in this field includes that of Lohr & Raghunathan
(2017), National Academies of Sciences, Engineering, and Medicine (2017a,b), and Robbins et al.
(2017).

3.2. Characteristics of the Population


In this section we show that the possibility of making valid descriptive statements about θ from
NPS depends to some extent on characteristics of the population of interest. To this end, we
compare the bias that arises from nonresponse under a PSg design with the bias
that arises from self-selection under a NPSg design. We thereby assume that we do not know the
probabilities for being included in the sample. In other words, we compare the bias of a NPS that
arises from PSg with the bias of a NPS that arises from NPSg. We also work from the plausible
conjecture that the sampling probabilities are easier to estimate under a PSg design than under
a NPSg design (see the sidebar titled Types of Biases). If this is the case, it will strengthen any
advantages of PSg. For the sake of simplicity, we discuss all this using the sample mean as an
illustration.
The formal definition for the nonresponse bias under PSg is given in Equation 6 (Bethlehem
et al. 2011). Unless the numerator in Equation 6 is zero, it follows that the unit nonresponse bias
of the sample mean—everything else being equal—decreases with
• an increased response rate ρ̄,
• increased homogeneity of the population with respect to the individual response rates, i.e., a smaller SD(ρ),
• increased homogeneity of the population with respect to the variable of interest, i.e., a smaller SD(Y ), and
• a smaller correlation between the individual response rates and the variable of interest, Corr(Y, ρ).


TYPES OF BIASES

• Nonresponse bias: For a PS, the unit nonresponse bias for the sample mean of the variable Y is

Bias(ȳ)_PS ≈ [Corr(Y, ρ) · SD(Y ) · SD(ρ)] / ρ̄,  (6)

with SD being the standard deviation, ρ the individual response probabilities, and ρ̄ the mean of the individual response probabilities (Bethlehem et al. 2011). The latter can be estimated by the sample's response rate, i.e.,

ρ̄ ≈ ρ̄̂ = n_Resp / (n_Resp + n_Non-Resp),  (7)

where n_Resp is the number of respondents and n_Non-Resp is the number of nonrespondents (Bethlehem 2017).

• Self-selection bias: For an NPS, the self-selection bias for the sample mean of the variable Y is

Bias(ȳ)_NPS ≈ [Corr(Y, π) · SD(Y ) · SD(π)] / π̄,  (8)

with π being the individual selection probabilities and π̄ the mean of the individual selection probabilities (Bethlehem et al. 2011). The latter can be estimated by the overall selection rate, i.e.,

π̄ ≈ π̄̂ = n_Resp / N,  (9)

where N is the size of the population (Bethlehem 2017).
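Equations 6–9 are approximations, and one way to build intuition for them is to check Equation 8 against a brute-force simulation. The population, the selection propensities, and their link to Y below are all hypothetical choices for illustration:

```python
import random
import statistics

random.seed(42)

# Hypothetical population in which Y is correlated with the individual
# self-selection propensities pi.
N = 20_000
y = [random.gauss(0, 1) for _ in range(N)]
pi = [min(1.0, max(0.0, 0.05 + 0.03 * yi)) for yi in y]

# Bethlehem's approximation (Equation 8), from population quantities.
my, mp = statistics.fmean(y), statistics.fmean(pi)
sd_y, sd_pi = statistics.pstdev(y), statistics.pstdev(pi)
corr = statistics.fmean((a - my) * (b - mp) for a, b in zip(y, pi)) / (sd_y * sd_pi)
approx_bias = corr * sd_y * sd_pi / mp

# Brute force: average the self-selected sample mean over many replications.
R = 300
means = []
for _ in range(R):
    sample = [yi for yi, p in zip(y, pi) if random.random() < p]
    means.append(statistics.fmean(sample))
actual_bias = statistics.fmean(means) - my

print(f"Equation 8: {approx_bias:.3f}, simulated: {actual_bias:.3f}")
```

Because selection here favors units with large Y, both the formula and the simulation report a clearly positive bias of the self-selected sample mean, and they agree closely.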

The formal definition of the bias that arises from self-selection under NPSg (i.e., the self-
selection bias) is very similar; see Equation 8 (Bethlehem et al. 2011).2 An important difference is
that the denominator in the formula for the unit nonresponse bias can be estimated by the response
rate, whereas the denominator in the formula for the self-selected sample can be estimated by the
overall selection rate. Unlike the response rate, the self-selection rate depends on the population
size (Bethlehem 2017). It follows that for PSg, the size of the population is irrelevant for the
nonresponse bias, while the population size is highly relevant for NPSg. Thus, the same size
sample drawn with NPSg of a large population implies a much larger bias than PSg of the same
population. However, it also follows from the comparison of the two formulas that the bias under
NPSg may be small if any of the following are true (and the more the better):
• the selection probabilities are similar for all individuals,
• the population is small and the number of observations in the sample is large,
• the population is homogeneous with respect to Y , and
• the correlation between the selection probabilities and Y is small.
NPSg might therefore work well for small, homogeneous populations. In that regard the
possibility for drawing approximately correct inferences from NPS depends on the population
of interest. While inference on a large, heterogeneous population based on a NPSg is a risky
undertaking, it is far less problematic for a small, homogeneous population. An example might be
the evaluation of teaching quality among participants in a specific training program, or estimating
average annual earnings for anesthesiologists within the state of Rhode Island. More generally,

2 For a very recent alternative approach to nonignorable selection bias for NPS, see Little et al. (2018).


any study that deals with a small, special population is likely to be less vulnerable to self-selection
biases.
We also learn from Equations 6 and 8 that both the unit nonresponse and self-selection bias
are zero if the variable of interest is not correlated with the unit nonresponse or self-selection
propensities. While this correlation is unknown in practice, an applied researcher who uses self-
selected data should develop a sense for this quantity. In the social sciences it is often important
to check if specific groups have a (monetary or normative) interest in the results of a study.
Bethlehem (2015), for example, illustrates a case in which representatives of a church asked their
parish to participate in a web survey in order to prevent the relaxing of regulations for Sunday
shopping. In election polls, political parties may encourage their partisans to participate in polls in
order to create a bandwagon effect (Lazarsfeld et al. 1948, Leibenstein 1950). In other instances,
the correlation between self-selection probabilities and the variable of interest may be less of a
problem, so self-selection studies may become justifiable. For example, market opinion surveys
asking whether people prefer one candy bar to another are unlikely to be seriously biased due to
unit nonresponse. These examples corroborate the notion that data from a self-selected sample
may be used to study one specific descriptive research question while being misleading for another.
It certainly takes an expert in the field to make reasonably valid assumptions about the correlation
between self-selection and the variable of interest, although data collection can help to address
this empirically.

3.3. Fit for Purpose


Besides the characteristics of the population of interest, another important consideration is whether
a sample is fit for purpose (Biemer 2010). One aspect of this is the amount of uncertainty a
researcher is willing to accept for the results of his or her study.
A researcher needs more observations if his or her research topic requires high precision, while
fewer observations are needed if the research topic allows low precision. An opinion poll predicting the proportion of voters for one candidate to be between, say, 45% and 55% will be considered fairly useless in a close election, whereas knowing that the take-up rate for a new community service lies somewhere between 45% and 55% may well be sufficient.
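How a precision requirement translates into sample size can be sketched with the standard margin-of-error calculation for a proportion under SRSg (a textbook approximation, not a formula from this article):

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a proportion under SRSg."""
    return z * math.sqrt(p * (1 - p) / n)

def n_for_margin(moe: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose margin of error does not exceed moe."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

# A poll reported to within +/- 5 points needs far fewer respondents than
# one reported to within +/- 1 point.
print(n_for_margin(0.05))  # 385
print(n_for_margin(0.01))  # ≈ 9,604
```

The roughly 25-fold jump in required observations for a fivefold gain in precision is why "fit for purpose" matters: a low-precision question can be answered far more cheaply.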
An analogous argument, not yet formalized, could be made with respect to NPSg: A research
topic in which the exact answer is less important justifies the use of NPSg. We suggest that these
are research topics that are entirely new or involve hidden populations. Do we want to learn
something about economic situations of the refugees that immigrated to the Western world?
Are we interested in an initial study about the life satisfaction or work-life balance of 24-hour
caregivers in private households? Using a (well-controlled) NPSg design to study such populations
will arguably increase knowledge much more than insisting on PSg while waiting interminably
for results. However, this is not guaranteed, and care needs to be taken when considering the pros
and cons of different designs for any specific research question.
The National Academy of Sciences provides several arguments for paying more attention to
timeliness in their reports on innovations in federal statistics (National Academies of Sciences,
Engineering, and Medicine 2017a,b), arguing most strongly that “for public policy purposes, it is
particularly important that information be available to decision makers in time to incorporate it
into the decision making process” (National Academies of Sciences, Engineering, and Medicine
2017b, p. 116). A good example is the statistics on consumer price indices. Collected through
careful PS, they suffer from considerable time lag in reporting. Alternatively, data from the Internet
“can be harvested in a timely manner but may miss differentially key sectors of the population
of stores” (National Academies of Sciences, Engineering, and Medicine 2017b, p. 117). As with


precision, which was discussed above, the relative importance of timeliness and coverage needs to
be evaluated for each particular use case. As argued in the report, if the goal is an early warning
system, then the Internet data, though deficient in coverage terms, would function better than
the more comprehensively collected probability-based survey data. “If the purpose is to provide
an estimate that gives an unbiased measure of change in prices for the population as a whole,
the argument would swing in favor of the probability sample of stores” (National Academies of
Sciences, Engineering, and Medicine 2017b, p. 117).

4. CAUSAL INFERENCE
Causal inference was introduced in Section 2 in the form of studies that seek to identify a causal
effect. We define the treatment effect for unit i,
TEi = Yi(1) − Yi(0) = Yi(Ti = 1) − Yi(Ti = 0),    (10)

where Yi is a random variable representing the outcome of interest for research unit i and Ti is
a random variable representing the treatment, indicating whether unit i receives the treatment
(Ti = 1) or not (Ti = 0) (Holland 1986).3 Hence, the treatment effect is the difference in the
outcome for one research unit i receiving the treatment Yi (1) and not receiving the treatment
Yi (0). The causal effect is defined here for one unit of interest. This means that a cause can in
principle have different effects for different units. Once we allow for individual treatment effects, it
is possible to distinguish systematically between scenarios in which the treatment effect is, like a law
of nature, stable across units, places, and times (a homogeneous treatment effect) and scenarios
in which the treatment effect varies between units, places, or times (a heterogeneous treatment
effect).4 To make these ideas of potential outcomes more concrete, a classic example is that of
aspirin and headache pain. Imagine that a unit i has a headache and may or may not take aspirin to
relieve that pain. Yi (1) would be their headache pain in, say, 2 hours if they take aspirin, Yi (0) their
headache pain if they don’t take aspirin, and the causal effect of their taking aspirin (versus not)
on their headache pain two hours later would be a comparison of those two potential outcomes.
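The potential-outcomes notation can be made concrete with a few lines of code. The sketch below is purely illustrative (the pain scale, the effect size of −2, and the sample size are invented); it builds units that carry both potential outcomes, reveals only one per unit, and shows that randomized assignment recovers the effect hidden by the fundamental problem of causal inference:

```python
import random

random.seed(1)

# Each unit carries BOTH potential outcomes; only one is ever observed.
# y0: headache pain (0-10 scale) without aspirin, y1: pain with aspirin.
units = []
for _ in range(1000):
    y0 = random.uniform(4, 9)
    units.append({
        "y0": y0,
        "y1": y0 - 2.0,              # homogeneous effect: aspirin lowers pain by 2
        "t": random.random() < 0.5,  # randomized treatment assignment
    })

# The (unobservable) average of the unit treatment effects.
true_ate = sum(u["y1"] - u["y0"] for u in units) / len(units)

# The fundamental problem: only Y(t) for the realized t is observable.
treated = [u["y1"] for u in units if u["t"]]
control = [u["y0"] for u in units if not u["t"]]
est_ate = sum(treated) / len(treated) - sum(control) / len(control)
# Under randomization, est_ate is close to true_ate.
```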
In many causal inference studies, researchers implicitly or explicitly rest on the assumption
of homogeneous treatment effects (Rothman et al. 2013). In the headache pain example, homo-
geneous treatment effects would mean that aspirin leads to the same decrease (or increase!) in
headache pain in all units. To overcome the fundamental problem of causal inference mentioned
in Section 2, the strategy is to create a research design in which certain identification assumptions
hold and estimate what is called the average treatment effect (ATE) (see the sidebar titled Set of
Identifying Assumptions that Allow for Causal Inference for details). These assumptions do not
rely on sampling, so studies assuming homogeneous treatment effects can rely on most NPS data.
Note, too, that there may be some heterogeneity in effects that is negligible enough to imply that assuming a homogeneous effect is still meaningful and relevant.
If the assumption of homogeneous treatment effects is not justifiable, the issue becomes more
complicated. One still would estimate an ATE, but depending on the specific research interest,
this would be for a special population, for interactions, or a well-defined population (population

3 As an aside, recent attempts have estimated dose-response relationships in cases where the unit effect of a continuous treatment depends on the initial level and the dose (Imbens 2000, Hirano & Imbens 2004, Imai & van Dyk 2004, Bia & Mattei 2008, Cattaneo 2010, Cattaneo & Farrel 2011).
4 Readers should note that normally we would draw expected values around Yi | Xi = 1 and Yi | Xi = 0, following the definition of the mean causal effect for unit i as defined by King et al. (1994). This allows us to average over (fictitious) replications of the process that creates Yi in the real world, acknowledging random variation (Popper 1982, King 1991, Dawid 2015). However, for the focus of this review article—discussing the consequences of PSg and NPSg—this difference is of little relevance.


SET OF IDENTIFYING ASSUMPTIONS THAT ALLOW FOR CAUSAL INFERENCE

The ATE can be estimated if one of the following assumptions is met.


• Unit homogeneity assumption. The value of the outcome variable (Y) for each unit (here i and j) is the same for a particular value t of the treatment variable T, i.e.,
Yi(0) = Yj(0) = Y(0) and Yi(1) = Yj(1) = Y(1).    (11)
• Constant effect assumption. The causal effect will be the same for every unit:
TEi = TEj.    (12)
• Conditional independence assumption (CIA). There is no correlation between the selection into the treatment group and the outcome itself:
Yi(0), Yi(1) ⊥⊥ T | X.    (13)

average treatment effect, or PATE). Each has its own implications about whether or not NPS can
be used.

4.1. Identifying Assumptions Under Homogeneity


Unlike descriptive inference, causal inference does not infer from data to a finite population but
from data to a general data generating process, or, as Rothman et al. (2013) put it, to a statement
about the way nature works. It infers a causal relationship. In its simplest form, research designs
for causal inference seek to identify a variant of the ATE defined as the average of the differences
in the outcomes of some variable of interest between a realized and a counterfactual situation (see
Equation 10).
Average causal effects are inherently not observable; in terms of Equation 10, this is because
we cannot observe, at the same time, the value of Y for research unit i under treatment (Ti = 1) and the value of Y for the same unit under no treatment (Ti = 0). At a particular point in time, a unit with a
headache either takes an aspirin or does not take an aspirin. In addition to that, expected values are
inherently not observable anyway. However, the ATE can be estimated to the extent that either the
unit homogeneity assumption, the constant effect assumption, or the conditional independence
assumption (CIA) is met (Holland 1986).
The CIA is therefore of high practical importance as it can be secured, to some degree, by the
research design. The CIA implies
1. the assignment to the treatment variable T is not caused by Y,
2. the assignment to the treatment variable is not caused by covariates X that are themselves a cause of Y (confounders), and
3. the selection of the research units i is not affected by the potential outcomes Y (King et al. 1994).

These conditions can be ensured in several ways. Experimental research uses a mixture of randomization and blocking to secure the CIA. Randomization means randomly assigning research units to one of at least two treatment groups; blocking refers to performing randomization within strata that are defined by a set of observed pretreatment covariates (Fisher 1935, Imai et al. 2008).


For example, we might randomly assign a group of people with a headache to take aspirin, versus
not take aspirin. Blocked randomization would mean doing that randomization within blocks, say,
level of headache pain at the beginning of the study, which would allow researchers to investigate
whether the effect of aspirin varies based on initial severity of headache pain. The condition that case selection be independent of Y is always met in experiments because Y is not realized before the experiment starts.
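A sketch of blocked randomization, under the assumption that blocks are formed from an observed pretreatment covariate (the function name `block_randomize` and the `severity` field are ours, not from the literature):

```python
import random

def block_randomize(units, block_key, seed=0):
    """Assign half of each block to treatment, at random within the block."""
    rng = random.Random(seed)
    blocks = {}
    for u in units:
        blocks.setdefault(block_key(u), []).append(u)
    treated_ids = set()
    for members in blocks.values():
        rng.shuffle(members)                     # randomization within the stratum
        for u in members[: len(members) // 2]:
            treated_ids.add(u["id"])
    return treated_ids

# Hypothetical example: block on baseline headache severity.
people = [{"id": i, "severity": "high" if i % 2 else "low"} for i in range(20)]
treated = block_randomize(people, block_key=lambda u: u["severity"])
```

Because assignment is balanced within each block, stratum-specific treatment effects can later be estimated without confounding by the blocking variable.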
In observational studies it is more difficult to ensure the CIA. Random sampling can be used
to secure condition 3, but arbitrarily selecting cases with a characteristic that was present before
nature created the outcome would work, too. However, the two other conditions are usually very
difficult to maintain. In order to understand this, it is necessary to realize that any observed association
between two variables T and Y can have three (and only three) explanations (Elwert 2013):
1. Causality, i.e., T causes Y
2. Confounding, i.e., there is a variable X that causes both T and Y

3. Endogenous selection,5 i.e., any conditioning on Y or on a consequence of Y. Conditioning here means any operation that limits the variance of a variable, be it by inserting the variable into a regression or by selecting observations based on that variable.
Generally, the task for causal inference is to strip away the two noncausal sources for a statistical
association between the treatment T and the outcome Y, so that the remaining association is due
to causality alone. This can be done by controlling for all causes of T without controlling for
any direct or indirect consequence of T (Shpitser et al. 2010), which can be done either by
multivariate statistical techniques (Stuart et al. 2001, Angrist & Pischke 2009, Wooldridge 2009),
or by restricting the analysis to specific research units (Rosenbaum 2015). If not all causes of T have been observed, there might still be a narrower adjustment set that is sufficient for the identification
of a causal effect. Such adjustment sets have been developed in the growing body of literature on
graphical causal models (Pearl 2009); readers are directed to Elwert (2013) for an introduction.
These criteria have in common that they all rely on a graphical representation of the theoretical
model of the data generating process. Graphical rules then make it relatively easy to identify
adjustment sets that satisfy, for example, the backdoor criterion (Pearl 1993, 2009), the parents
of treatment criterion (Pearl 1995), the confounder selection criterion (VanderWeele & Shpitser
2011), and several others (see Elwert 2013). It should be noted, however, that all these criteria
identify the causal effect only to the extent that the causal model is correct.
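To see how this plays out, the following simulation (all numbers invented; the true effect of T is set to 2, with one binary confounder X) contrasts the naive difference in means with a backdoor adjustment that stratifies on X:

```python
import random

random.seed(7)

n = 20000
data = []
for _ in range(n):
    x = random.random() < 0.5                   # binary confounder
    t = random.random() < (0.8 if x else 0.2)   # X causes T
    y = 2.0 * t + 3.0 * x + random.gauss(0, 1)  # true effect of T on Y is 2
    data.append((x, t, y))

def diff_means(rows):
    y1 = [y for _, t, y in rows if t]
    y0 = [y for _, t, y in rows if not t]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

naive = diff_means(data)  # confounded: overstates the causal effect

# Backdoor adjustment: compare within strata of X, then pool by stratum share.
adjusted = sum(
    diff_means([r for r in data if r[0] == x]) * sum(r[0] == x for r in data) / n
    for x in (True, False)
)
```

The stratified estimate recovers the causal effect because, within each stratum of X, treatment assignment is as good as random; the naive contrast mixes in the confounding path through X.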
In the context of the topic of this review, it is important to see that a PS is not required for
any of these identification assumptions. In this regard, data from NPS (regardless of whether
it originated from PSg or NPSg) can be used for making causal inferences. However, this does
not mean any self-selection into a causal study is irrelevant for causal inference. In fact, any self-
selection process that is (directly or indirectly) causally affected in some way by the treatment itself
will lead to biased estimates of the ATE (also see Elwert & Winship 2014).
Regardless of whether the ATE is estimated by observational studies or by experiments, random
sampling, PSg, or any other sampling strategy aimed to secure the so-called representativeness
of a sample does not play any theoretical role. Rothman et al. (2013) argue that a search for a
fundamental law of nature should focus on best designing a study that can achieve that, rather
than trying to mimic the population via a sample; after all, if the phenomenon truly is universal,
the population studied will not change the outcome. In this regard, representativeness should be

5 Other terms for this source of association are selection bias (Hernán et al. 2004), collider stratification bias (Greenland 2003), M-bias (Greenland 2003), Berkson’s bias (Berkson 1946), the explaining away effect (Kim & Pearl 1983), and conditioning bias (Morgan & Winship 2007).


avoided (Rothman et al. 2013). This, however, is a matter of considerable debate, with others
(including Keiding & Louis 2018) arguing strenuously that such a fundamental law of nature
almost never exists. From our perspective, it is indisputably important to ensure that the selection
into the sample is independent of the outcome of interest.
As a consequence of assuming homogeneous treatment effects, observational researchers seldom use weighting variables even if they use data stemming from samples with known over- or underrepresentation of certain groups. Likewise, experimental researchers frequently use highly special populations in order to block as many confounders as possible. An extreme case of
experiments on special populations is randomized experiments on hamsters with identical genes in
immunological research (see Rothman et al. 2013). In scenarios where it is unreasonable to assume
homogeneous treatment effects (e.g., if there is strong evidence or belief that the effectiveness of
aspirin varies across units due to different levels of headache pain severity or other biological
differences) researchers may follow one of the following strategies to proceed.

4.2. Research Strategies Under Heterogeneity


In the previous section we argued that questions about sampling can be ignored if treatment effects
are homogeneous. When treatment effects are homogeneous, an experiment with randomized
assignment of units into the treatment and control groups essentially solves all problems regarding
identification of the treatment effect; the same is true for observational studies using multivariate statistics (such as propensity scores) to control for the (sufficient) adjustment set.
If homogeneous treatment effects cannot be assumed, case selection will no longer be
ignorable—even in experiments with randomization. This can be shown clearly in a clinical study
example: Kaizar (2018) analyzes whether insulin pumps improve metabolic control compared with
insulin injections. In her research design, study members were selected from a frame of people regularly showing up at a clinic to have their metabolic status checked. In this case it is very plausible that these people are those who more reliably use the insulin injections, so there may be little margin
for improvement with insulin pumps. The people rarely showing up are likely to be those who less
reliably use injections and for whom the treatment effect might therefore be larger. Hence, the
argument here is that treatment effects are not stable (constant) and the study members consist
predominantly of people with a small treatment effect. The estimated causal effect is therefore not
representative of the entire population of people with diabetes.
We suggest that causal researchers unwilling to assume homogeneous treatment effects may
follow one of the following strategies to proceed. Note that only the last of these paths requires
data from PS.
• Study special populations: This strategy is possible only for specific groups if unit homogeneity cannot be assumed in general. Of course, one should only follow this path if the treatment effect for that specific group answers an important research question.
• Investigate interactions: This strategy aims to find causes of effect heterogeneity. It involves an entirely new research question and thus new reasoning on the plausibility of treatment effect homogeneity.
• Estimate the PATE: The third way to study causality in the presence of heterogeneous treatment effects is to describe the distribution of the individual treatment effects for a well-defined finite population of interest using one or more summary statistics. This approach is a combination of descriptive and causal studies.
In the following we discuss each of these strategies in some detail. Any of the strategies we
discuss presumes assumptions discussed in the previous section are justified.


4.2.1. Special populations. Studying special populations means estimating the ATE for a specific
subpopulation or for a specific condition within which the ATE does not vary. The situation is thus
analogous to the one illustrated in Figure 1a, except that the θ are now the individual treatment
effects. In Figure 1a the treatment effects vary for the entire population but are constant for
each special population (the strata). In this situation any arbitrary selection of research units
from the same special population or stratum will produce an unbiased estimate of the ATE for
that population or stratum. For example, maybe the effect of aspirin varies only across levels of
baseline headache severity; randomization within a stratum of persons with, say, high levels of
headache would allow estimation of unbiased effects for that special population.
Aside from the trivial statement that a research design that is capable of justifying the CIA for
the special population allows internally valid inference of the ATE for that special population,
three practical issues arise for the applied researcher: First, why should one be interested in a
special population; second, how do we know that the individual treatment effects are constant

within the specific population; and third, can the results of the special population be transported

to other populations?
Beginning with the first issue, studies of subpopulations can be of direct interest even in the
presence of effect heterogeneity, although examples more frequently come from the natural sci-
ences. In drug development, research in early stages is often only focused on finding evidence
that a certain drug is doing something in some people. Of course it is also generally interesting
to know that a specific drug is effective for persons of a specific genotype, as a medical doctor
can then use this drug to treat a patient having that genotype. In a survey context, a question
might be whether there is evidence that the survey mode (phone versus Internet) changes some people’s willingness to participate, as this may allow customization of the contact attempt for such people
(Tourangeau et al. 2017). Also, questions about tailoring job training programs to certain kinds
of individuals and making predictions about which program helps to integrate these individuals in
the labor market can be answered using data on special populations. Similarly, Zubizarreta et al.
(2014) restrict attention to people with either very high or very low exposure to an earthquake
in order to determine whether there is any association between earthquake exposure and risk of
post-traumatic stress disorder; if one were found, it could then be investigated in more detail
in more representative populations. Rosenbaum (2015) discusses other design elements that can
increase the ability to isolate causal effects, sometimes in special populations.
Another argument for studying special populations arises from the idea of crucial case studies
(Eckstein 1975). If it can be assumed that the ATE for the study population is smaller than for
any other subpopulation, it would even be possible to make claims about other subpopulations
since their PATE will be at least as high as in the study population. Likewise, if it can be assumed
the ATE for the study population is larger than for any other subpopulation, it can be said the
PATE cannot be higher than the one estimated. Finally, showing that a broadly accepted, or at
least highly plausible, theory predicting a causal effect of some T on some Y is not true for the
study population is certainly an important finding (Popper 1962).
An obvious solution to the second issue is to simply assume that the individual treatment effects
are constant for the population of interest. Of course, it may sound a bit contradictory to insist
on homogeneous treatment effects for a special population after having given up that assumption
for a large population. But, in the end, the assumption is often very natural and common for
experimental researchers to make.
If the treatment homogeneity assumption is not correct for the special population under study,
the estimated treatment effect will no longer be independent of the selection process. In fact, the
estimated treatment effect will be the average of the individual treatment effects of the research
units in the study population. The statements on descriptive inference made above then apply.


For the reasons just stated, it is sensible to look for empirical evidence for the correctness of the
homogeneity assumption. This can be done relatively easily when the data set at hand is large and
other variables besides the outcome and the treatment have been measured. Particularly in survey
experiments (Mutz 2011), it is easy to separate the study population into groups and empirically
check whether the estimated treatment effects are the same for all groups. This kind of replication
across groups is clearly possible for any larger data set and is not restricted to PS. It should be
performed especially in NPS, since in those samples the average of the individual treatment
effects is not an unbiased estimate of the average of the treatment effects of the special population
under study.6 As PS are rarely achieved in the social sciences, group-specific replications are also
advisable for data stemming from PSg.
The third issue to discuss concerns the transportability of the results on a special population
(Miettinen 1985, Keiding & Louis 2016, Pearl & Bareinboim 2014). To state it clearly, are the ex-
perimental results from a study of hamsters with identical genes transportable to natural hamsters,

rodents, or even Mammalia in general?



Yet again, one possible strategy is to simply assume the treatment effects are constant for all
research units at all times. Expressed in terms of Figure 1a, this assumption would mean the
colors would not change between the subpopulations (strata). Clearly, if this assumption holds,
we can easily transport results from a study done with, say, first-semester economics students to
market actors, or first-semester psychology students to human beings in general. It is up to the researcher whether he or she is willing to make this assumption for the research question at hand.
If the causal effect is not constant across populations, i.e., in the situation illustrated in
Figure 1a, one can still find a transport function under a set of assumptions formally defined
and proven by Pearl & Bareinboim (2014). These assumptions are too demanding to repeat here,
but it suffices to say that if these assumptions are met, there is no further need to revert to PSg to estimate the causal effect in another population. However, it would be necessary to know the distribution of a set of confounders in both the study population and the population to which the causal effect is transported; see Pearl & Bareinboim (2014) for details. Naturally, the distribution of the set
of confounders needs to be observed, perhaps most promisingly using data from PS.
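Although the formal conditions are beyond this review, the arithmetic of a transport function can be sketched with invented numbers: stratum-specific effects estimated in the study population are recombined using the confounder distribution of the target population:

```python
# Stratum-specific ATEs estimated in the study population (invented numbers).
stratum_ate = {"young": 1.0, "old": 3.0}

# Confounder distributions: the study oversamples the young; the target does not.
study_share = {"young": 0.8, "old": 0.2}
target_share = {"young": 0.5, "old": 0.5}

# Pooling with the study's own shares gives the study-population effect;
# pooling with the target's shares transports the effect to the target.
ate_in_study = sum(stratum_ate[s] * study_share[s] for s in stratum_ate)
ate_transported = sum(stratum_ate[s] * target_share[s] for s in stratum_ate)
```

The two pooled effects differ (1.4 versus 2.0) precisely because the treatment effect is heterogeneous across strata and the two populations weight the strata differently.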

4.2.2. Studying interactions. Investigating interactions means finding the causes of effect het-
erogeneity. An example given by Rothman et al. (2013) is instructive here: Consumption of con-
taminated shellfish sometimes causes hepatitis A, but sometimes does not, so in this regard, the
ATE is heterogeneous. Further studies investigated the reasons for the heterogeneity and found
that the causal relationship between eating contaminated shellfish and hepatitis A is strongly de-
creased by consumption of beverages containing at least 10% alcohol along with the shellfish
(Desenclos et al. 1992). It seems clear that the alcohol is a cause for the effect heterogeneity, or,
in other words, the effect of contaminated shellfish interacts with alcohol.
An ideal typical example of an interaction is shown in Figure 1a. In the figure, the statistic of
interest—the ATE—is constant within each stratum but differs between strata; that is to say, the
effect of T depends on strata, or, colloquially, T and strata interact in their effect on Y .
In order to study interactions it is necessary to study the ATE in various subpopulations
(for an example, see Caliendo & Kühn 2011). In this case, the estimation of the ATE for each
subpopulation must be unbiased. In the discussion of the previous subsection, we have seen that
this is the case if the ATEs were internally valid and the ATEs are homogeneous within each
subpopulation. If both were the case, PS would not be necessary.

6 Survey Research Methods, the journal of the European Survey Research Association, requires replications across groups for all publications that use experiments in NPS data; see Kohler (2015).


Assuming homogeneous treatment effects for each subpopulation may be harder to justify in
the presence of a supposed interaction. This is, first, because the very presence of interaction
already makes clear that the “laws of nature” (Rothman et al. 2013, p. 1013) are not constant, and
second, because one has to justify the assumption for more than one subpopulation.
If the assumption of effect homogeneity within strata cannot be made, it is still possible to
analyze the interactions of the subpopulation ATE. In this case it would be necessary to estimate
the PATE for each subpopulation separately. However, as is shown in the next section, estimating
this quantity would require a solid knowledge of the relation between the data and the finite
population.

4.2.3. Population average treatment effects. The third way to study causality in the presence
of heterogeneous treatment effects is to describe the distribution of the individual treatment
effects for a well-defined finite population of interest using one or more summary statistics. This

approach is a combination of descriptive and causal studies. Just like the studies that estimate

the ATE, studies that summarize individual treatment effects must ensure the CIA (or any of the
other assumptions necessary to estimate the ATE). At the same time the estimation of the PATE
requires the establishment of a link between the population of interest and the data at hand. This
resembles the inferential problem of descriptive studies (discussed in Section 3). As Stuart et al.
(2015) comment, when assessing generalizability to a target population, it is critical to identify
that population and to have data on it.
Of course, summarizing the individual treatment effects for a finite population can be done
with any of the statistics that are also used for descriptions. It is possible, at least in principle, to
estimate the median, or the variance, or even the kurtosis of the individual treatment effects. In
practice, however, studies almost always use the arithmetic mean of the individual treatment effects
as a descriptive device. This is particularly true for analyses that use some variant of the general
linear model. For the sake of simplicity, we therefore restrict our discussion to the estimation of
the population average of the individual treatment effects—hence the PATE.
Aiming to estimate the PATE as a research goal should not be an automatic reaction to hetero-
geneous treatment effects. If policy recommendations are the aim of causal research, one should
not necessarily base such recommendations on the policy’s effect on average in the population.
In the presence of heterogeneous treatment effects, there can be instances in which a program
or a medical drug only helps some people and does harm to others. One should then not recommend the policy for all individuals just because it helps on average. Instead, the
researcher should make clear statements about which persons, or under what conditions, a policy
helps. Scenarios in which the PATE would be of interest include settings where health insurance
companies might be interested in whether a new medical drug cures a disease better on average
than the established alternative, or a state policymaker wanting to predict average effects if a new
policy is implemented for all individuals within the state. Also, if a researcher did not have data
on variables that interact with the treatment variable of interest, the estimation of the PATE
would be a sensible fallback solution. Of course, the same is true if a researcher does not have any
hypotheses about possible interactions, although in this case one might also want to assume that
the individual treatment effects are homogeneous.
Once the CIA can be justified through a clever research design or sophisticated statistical
methods, the problems with estimating the PATE are largely identical to those discussed in
Section 3. This becomes obvious by conceiving a data set that holds, for each observation, a valid
measure of the individual treatment effects, TEi. Taking the sample mean TE = (1/n) Σi TEi will then estimate the PATE to the extent that the average of any other variable in that data set is also an estimator of the
corresponding population parameter. At this point it is sensible to reconsider Equations 6 and 8


for the nonresponse and/or self-selection biases. When using these formulas for the PATE, Y
refers to the variable that (hypothetically) holds the individual treatment effects in the population.
We see from both equations that the bias will be zero for homogeneous treatment effects because
SD(TE) = 0 in this case. This is another way to make clear that sampling does not matter in the
case of homogeneous treatment effects.
However, it may be argued that effects are never truly homogeneous; the question then becomes
when the heterogeneity is large enough to make an analysis that assumes homogeneous effects
inappropriate. If the individual treatment effects do not vary substantially across research units,
any deviation from PS induces only limited bias. Without further research, though, fairly little
can be said about the heterogeneity of treatment effects in a population of interest, although
expert knowledge may help move us forward.
As soon as SD(TE) ≠ 0, the correlation between treatment effects and the individual response
or selection probabilities becomes important. Ceteris paribus, the bias increases with this
correlation. One may speculate that the correlation between the treatment effect and nonresponse
is smaller than the correlation between the treatment effect and self-selection, but whether this
is in fact the case should be decided for the research question at hand using prior knowledge.
In one specific situation, however, it is clear that the correlation is large: program evaluations
in which people select themselves into a program based on their prediction of whether the
program will be successful for them. Because of this, researchers often revert to estimating the
population average treatment effect on the treated (the average treatment effect among those
individuals who actually received the treatment) instead of the PATE.
In practice, propensity-score-based metrics are used to quantify the similarity of the participants
in a randomized trial and a target population (Stuart et al. 2011). Moreover, recently developed
machine learning techniques help to control for a large number of covariates in more flexible
ways (Athey & Imbens 2017), in particular under conditions of both moderate nonadditivity and
moderate nonlinearity (Lee et al. 2010).
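As a sketch of the inverse-probability-of-selection logic behind such approaches (assuming, for simplicity, that the selection probabilities are known; in practice they would be estimated, e.g., by a logistic regression of trial membership on covariates, and all numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
te = 1.0 + 0.8 * x                        # heterogeneous individual effects
p_trial = 1 / (1 + np.exp(-(x - 1)))      # trial over-recruits high-x units
in_trial = rng.random(n) < p_trial

naive = te[in_trial].mean()               # biased for the PATE
w = 1 / p_trial[in_trial]                 # inverse probability of selection
ipw = np.average(te[in_trial], weights=w) # reweights trial toward the population

print(te.mean(), naive, ipw)
```

The weighted mean moves the trial-based estimate back toward the population average treatment effect, at the price of increased variance when some selection probabilities are small.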

5. A WAY OUT
The discussion in the previous sections may reinforce the notion that causal research relies solely
on identifying assumptions and, in the case of the PATE, on assumptions about sampling. Given
the new data landscape, other solutions may arise that employ the concept of replication to
overcome situations in which PS cannot be ensured. The availability of many more data sets
allows replication at a new scale. A causal effect can be studied not just in one data set but in
many different ones. Also, with large sample sizes, effects can be analyzed for many subgroups,
allowing researchers to search for evidence against the treatment homogeneity assumption.
Systematic replications are not just useful for correcting mistakes in previous studies. They can
also be seen as a cumulative enterprise of isolating a plausible interval for the PATE. When it
comes to causal inference, any estimate obtained under a plausibly maintained CIA is a valid
estimate of the causal effect for the study population, i.e., for the units in the data at hand. Under
the assumption of heterogeneous treatment effects, each such estimate thus adds to our knowledge
about the range within which the true PATE might lie. Instead of trying to estimate a plausible
interval for the PATE based on just one data set, replication could mean using many different
special study populations to isolate the PATE. Bayesian statistics could then offer a systematic
way to update our prior knowledge with new information based on data from yet another special
population.
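A sketch of such updating under a conjugate normal-normal model (the estimates and standard errors are hypothetical; note that this fixed-effect pooling itself presumes homogeneous effects across the study populations):

```python
import numpy as np

# Effect estimates and standard errors from several hypothetical studies,
# each assumed internally valid for its own study population
estimates = np.array([1.9, 2.3, 2.0, 2.6])
std_errs  = np.array([0.4, 0.3, 0.5, 0.6])

# Sequential conjugate (normal-normal) updating of a vague prior for the PATE
mean, var = 0.0, 100.0**2
for y, se in zip(estimates, std_errs):
    precision = 1 / var + 1 / se**2       # posterior precision after this study
    mean = (mean / var + y / se**2) / precision
    var = 1 / precision

print(mean, var**0.5)
```

Each new study population narrows the posterior interval for the PATE; systematic dispersion beyond what the standard errors imply would instead signal effect heterogeneity.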
Another advantage of such an approach would be that, aside from isolating the PATE, one
would also gain knowledge about the validity of the constant effect assumption. If estimates are
very similar across two extremely different data sets, effect homogeneity becomes more plausible.
If the results differ between differently composed data sets, then the assumption of effect
homogeneity becomes highly debatable.
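One simple way to quantify "very similar" is Cochran's Q statistic for heterogeneity, sketched here with hypothetical estimates from differently composed data sets:

```python
import numpy as np

# Hypothetical effect estimates from differently composed data sets
est = np.array([1.2, 2.9, 0.8, 3.1])
se  = np.array([0.4, 0.3, 0.5, 0.6])

w = 1 / se**2                             # precision weights
pooled = np.sum(w * est) / np.sum(w)      # fixed-effect pooled estimate
Q = np.sum(w * (est - pooled) ** 2)       # Cochran's Q, ~ chi-square with k-1 df

# 7.81 is the 0.95 quantile of the chi-square distribution with 3 df
print("effect homogeneity debatable" if Q > 7.81 else "homogeneity plausible")
```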
Such an approach would also provide a welcome and needed shift away from the focus on
p-values and significance levels. This debate is too large to cover fully in this review; we refer
interested readers to Gelman & Stern (2006), Gelman (2014), Nuzzo (2015), ASA (2016),
Lakens (2017), Wasserstein & Lazar (2017), and McShane et al. (2018). Relying on replication
goes hand in hand with the need to publish nonsignificant findings (Dewald et al. 1986, King
1995, Dafoe 2014).7
So, in the end, NPS are a huge problem for any attempt to describe a feature of a finite
population, including the average of individual treatment effects in that population. But many
different NPS, as well as combinations of PS and NPS, give us weapons to fight these problems.
Let us use them properly.

SUMMARY POINTS
1. Probability samples do not exist in practice. It is thus important to understand what can
be learned from nonprobability samples, regardless of whether or not they stem from a
probability sampling design.
2. Nonprobability samples can be good enough for descriptive inferences of small homo-
geneous populations.
3. Nonprobability samples can be used for estimating treatment effects that are
homogeneous.
4. In the case of heterogeneous treatment effects, nonprobability samples may be used to
study the treatment effects for a subpopulation within which the treatment effects are
homogeneous. It is then important that the subpopulation is of substantial interest on its
own.
5. In the case of heterogeneous treatment effects, large nonprobability samples may be used
for studying differences in treatment effects between groups.
6. In the case of heterogeneous treatment effects, estimation of the population average
treatment effect requires valid estimation of selection probabilities of the research units
(e.g., as can be obtained from a probability sample).
7. Ceteris paribus, the bias in the estimation of the population average treatment effect using
data from nonprobability samples increases with (a) the heterogeneity of the individual
treatment effects, (b) the heterogeneity of the self-selection process, (c) the correlation
between the individual treatment effects and the self-selection probabilities, and (d ) the
size of the population of interest. All that being equal, the bias decreases with the sample
size.
8. Nonprobability samples are a problem for any attempts to describe a feature of a finite
population, including the average of individual treatment effects in that population. But
many different nonprobability samples may also provide solutions to these problems.

7 Also see the Transparency and Openness Guidelines of the Center for Open Science (https://cos.io/).


FUTURE ISSUES
1. The estimation of the sampling probabilities for probability sampling designs with unit
nonresponse is a topic of ongoing debate. The estimation of selection probabilities for
nonprobability sampling designs is much more difficult and still awaits a generally ac-
cepted solution.
2. Thus, an important topic is how best to model the dependence of sample inclusion
probabilities and outcome variables on covariates.
3. Experimental research has gained attention in disciplines that predominantly used observational
data in the past (such as economics). This brings back to the forefront questions
about the trade-off between internal and external validity and whether there even is, in fact, a
trade-off.

4. Researchers will need diagnostic tools to help them determine whether a nonprobability
sample can be used for inference to a full population.
5. Future work should consider how best to combine probability samples and nonprobability
samples.
6. With large data sets (big data), the role of sampling variance becomes less and less
important. Criteria for judging the success of an estimation procedure have not yet been
developed for this situation.

DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that
might be perceived as affecting the objectivity of this review.

ACKNOWLEDGMENTS
We thank the following readers for suggestions and critical comments: Vlad Achimescu, Ruben
Bach, Georg Haas, Christoph Kern, Florian Keusch, Stas Kolenikov, Mariel Leonard, Tom Louis,
Jason McMillan, Andrew Mercer, Patrick Schenk, and Malte Schierholz. We especially thank
Richard Valliant for discussions about future issues. Stuart’s and Kreuter’s time was supported
by the National Institutes of Health [R01 MH099010-01A1 to E.A.S.]. Stuart’s time was also
supported by the Institute of Education Sciences [R305D150003 to E.A.S. and Robert Olsen].
This work was also supported by the Mannheim Center for European Social Research (MZES).

LITERATURE CITED
Angrist JD, Pischke JS. 2009. Mostly Harmless Econometrics. Princeton, NJ: Princeton Univ. Press
ASA (Am. Stat. Assoc.). 2016. ASA statement on statistical significance and p-values. Am. Stat. 70:131–
33
Athey S, Imbens GW. 2017. The state of applied econometrics: causality and policy evaluation. J. Econ. Perspect.
31:3–32
Baker R, Brick JM, Bates NA, Battaglia M, Couper MP, et al. 2013. Report of the AAPOR task force on non-
probability sampling. Rep., Am. Assoc. Public Opin. Res., Oakbrook Terrace, IL. https://www.aapor.
org/Education-Resources/Reports/Non-Probability-Sampling.aspx
Berkson J. 1946. Limitations of the application of fourfold table analysis to hospital data. Biom. Bull. 2:47–53


Bethlehem J. 2015. Essay: Sunday shopping—the case of three surveys. Surv. Res. Methods 9:221–30
Bethlehem J. 2017. The perils of non-probability sampling. Presentation at Inference from Non Probability Sam-
ples, Paris, March 16–17. https://www.europeansurveyresearch.org/conference/non-probability
Bethlehem J, Cobben F, Schouten B. 2011. Handbook of Nonresponse in Household Surveys. New York: Wiley
Bia M, Mattei A. 2008. A Stata package for the estimation of the dose-response function through adjustment
for the generalized propensity score. Stata J. 8:354–73
Biemer PP. 2010. Total survey error: design, implementation, and evaluation. Public Opin. Q. 74:817–48
Biemer PP, de Leeuw E, Eckman S, Edwards B, Kreuter F, et al. 2017. Total Survey Error in Practice.
New York: Wiley
Caliendo M, Kühn S. 2011. Start-up subsidies for the unemployed: long-term evidence and effect heterogene-
ity. J. Public Econ. 95:311–31
Callegaro M, Baker RP, Bethlehem J, Goritz AS, Krosnick JA, Lavrakas PJ, eds. 2014. Online Panel Research:
A Data Quality Perspective. New York: Wiley
Cattaneo M. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. J. Econom. 155:138–54
Cattaneo M, Farrell MH. 2011. Efficient estimation of the dose-response function under ignorability using subclassification on the covariates. In Missing Data Methods: Cross-sectional Methods and Applications, ed. DM Drukker, pp. 93–127. Bingley, UK: Emerald
Dafoe A. 2014. Science deserves better: the imperative to share complete replication files. PS Political Sci.
Politics 47:60–66
Dawid AP. 2015. Statistical causality from a decision-theoretic perspective. Annu. Rev. Stat. Appl. 2:273–
303
Desenclos J, Klontz K, Wilder M, Gunn R. 1992. The protective effect of alcohol on the occurrence of
epidemic oyster-borne hepatitis A. Epidemiology 3:371–74
Dever J, Rafferty A, Valliant R. 2008. Internet surveys: Can statistical adjustments eliminate coverage bias?
Surv. Res. Methods 2:47–62
Dewald WG, Thursby JG, Anderson RG. 1986. Replication in empirical economics: the Journal of Money,
Credit and Banking project. Am. Econ. Rev. 76:587–603
Dutwin D, Buskirk TD. 2017. Apples to oranges or gala versus golden delicious? Comparing data quality of
nonprobability internet samples to low response rate probability samples. Public Opin. Q. 81:213–49
Eckstein H. 1975. Case study and theory in political science. In Handbook of Political Science, Vol. 1: Political
Science: Scope and Theory, ed. FI Greenstein, NW Polsby, pp. 117–76. Boston: Addison-Wesley
Elliott MR, Valliant R. 2017. Inference for nonprobability samples. Stat. Sci. 32:249–64
Elwert F. 2013. Graphical causal models. In Handbook of Causal Analysis for Social Research, ed. S Morgan, pp. 245–73. Dordrecht, Neth.: Springer [A friendly introduction to graphical causal models.]
Annu. Rev. Sociol. 40:31–53
Fisher R. 1935. The logic of inductive inference. J. R. Stat. Soc. A 98:39–54
Gelman A. 2014. The statistical crisis in science. Am. Sci. 102:460–65
Gelman A, Stern H. 2006. The difference between “significant” and “not significant” is not itself statistically
significant. Am. Stat. 60:328–31
Greenland S. 2003. Quantifying biases in causal models: classical confounding versus collider-stratification
bias. Epidemiology 14:300–5
Groves RM. 2006. Nonresponse rates and nonresponse bias in household surveys. Public Opin. Q. 70:646–
75
Groves RM, Fowler FJ Jr., Couper MP, Lepkowski JM, Singer E, Tourangeau R. 2011. Survey Methodology.
New York: Wiley
Hernán M, Hernández-Diaz S, Robins J. 2004. A structural approach to selection bias. Epidemiology 15:615–25
Hirano K, Imbens GW. 2004. The propensity score with continuous treatment. In Applied Bayesian Modelling
and Causal Inference from Missing Data Perspectives, ed. A Gelman, X Meng, pp. 73–84. New York: Wiley
Holland P. 1986. Statistics and causal inference. J. Am. Stat. Assoc. 81:945–60


Imai K, King G, Stuart EA. 2008. Misunderstandings between experimentalists and observationalists about causal inference. J. R. Stat. Soc. A 171:481–502 [Discusses problems similar to those in the present article.]
Imai K, van Dyk D. 2004. Causal inference with general treatment regimes: generalizing the propensity score. J. Am. Stat. Assoc. 99:854–66
Imbens GW. 2000. The role of the propensity score in estimating dose-response functions. Biometrika 87:706–10
Kaizar EE. 2018. Combining data in a single analysis. Paper presented at AAAS Annual Meeting, Austin, TX,
Feb. 18
Keiding N, Louis TA. 2016. Perils and potentials of self-selected entry to epidemiological studies and surveys.
J. R. Stat. Soc. A 179:319–76
Keiding N, Louis TA. 2018. Web-based enrollment and other types of self-selection in surveys and studies: consequences for generalizability. Annu. Rev. Stat. Appl. 5:25–47 [Can be read as a refutation of Rothman et al. (2013).]
Kennedy C, Mercer A, Keeter S, Hatley N, McGeeney K, Gimenez A. 2016. Evaluating Online Nonprobability Surveys. Washington, DC: Pew Res. Cent.
Kim J, Pearl J. 1983. A computational model for causal and diagnostic reasoning in inference systems. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence, Vol. 1, pp. 190–93. San Francisco: Morgan Kaufmann
King G. 1991. Stochastic variation: a comment on Lewis-Beck and Skalaban’s “The R-square.” Political Anal.
2:158–200
King G. 1995. Replication, replication. PS Political Sci. Politics 18:443–99
King G, Keohane RO, Verba S. 1994. Designing Social Inquiry. Princeton, NJ: Princeton Univ. Press
Kohler U. 2015. Editorial: maintaining quality. Surv. Res. Methods 9:139–40
Kreuter F, Valliant R. 2007. A survey on survey statistics: what is done and can be done in Stata. Stata J. 7:1–
21
Lakens D. 2017. Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Soc. Psychol.
Personal. Sci. 8:355–62
Lazarsfeld PF, Berelson B, Gaudet H. 1948. The People’s Choice: How the Voter Makes Up His Mind in a
Presidential Campaign. New York: Columbia Univ. Press
Lee BK, Lessler J, Stuart EA. 2010. Improving propensity score weighting using machine learning. Stat. Med.
29:337–46
Leibenstein H. 1950. Bandwagon, snob, and Veblen effects in the theory of consumers' demand. Q. J. Econ. 64:183–207
Little RJ, West BT, Boonstra PS, Hu J. 2018. Measures of the degree of departure from ignorable sample selection.
Paper presented at 73rd Annual Conference of the American Association for Public Opinion Research,
Denver, CO, May 16–19
Lohr SL, Raghunathan TE. 2017. Combining survey data with other data sources. Stat. Sci. 32:293–312
McShane BB, Gal D, Gelman A, Robert C, Tackett JL. 2018. Abandon statistical significance.
arXiv:1709.07588 [stat.ME]
Mercer A. 2018. Selection bias in nonprobability surveys: a causal inference approach. PhD thesis, Univ. Maryland [Book-length treatise on nonprobability surveys.]
Miettinen O. 1985. Theoretical Epidemiology. New York: Wiley
Mill JS. 1843. A System of Logic, Ratiocinative and Inductive. Vol. 1. London: John W. Parker
Morgan S, Winship C. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research.
Cambridge, UK: Cambridge Univ. Press
Mutz DC. 2011. Population-Based Survey Experiments. Princeton, NJ: Princeton Univ. Press
National Academies of Sciences, Engineering, and Medicine. 2017a. Federal Statistics, Multiple Data Sources,
and Privacy Protection: Next Steps. Washington, DC: Natl. Acad. Press
National Academies of Sciences, Engineering, and Medicine. 2017b. Innovations in Federal Statistics: Combining
Data Sources while Protecting Privacy. Washington, DC: Natl. Acad. Press
Neyman JS. 1934. On the two different aspects of the representative method: the method of stratified sampling
and the method of purposive selection. J. R. Stat. Soc. 97:558–625
Neyman JS, Iwaszkiewicz K, Kolodziejczyk S. 1935. Statistical problems in agricultural experimentation. Suppl.
J. R. Stat. Soc. 2:107–80


Nuzzo R. 2015. How scientists fool themselves—and how they can stop. Nature 526:182–85
O’Muircheartaigh C, Hedges L. 2014. Generalizing from unrepresentative experiments: a stratified propensity
score approach. J. R. Stat. Soc. C 63:195–210
Pearl J. 1993. Comment: graphical models, causality, and interventions. Stat. Sci. 8:266–69
Pearl J. 1995. Causal diagrams for empirical research. Biometrika 82:669–710
Pearl J. 2009. Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge Univ. Press. 2nd ed.
Pearl J, Bareinboim E. 2014. External validity: from do-calculus to transportability across populations. Stat.
Sci. 29:579–95
Popper KR. 1962. Conjectures and Refutations: the Growth of Scientific Knowledge. London: Routledge
Popper KR. 1982. The Open Universe: An Argument for Indeterminism. Postscript to The Logic of Scientific Discovery,
Vol. 2. London: Hutchinson
Rivers D. 2007. Sampling for web surveys. Paper presented at the 2007 Joint Statistical Meetings, Salt Lake
City, UT
Rivers D, Bailey D. 2009. Inference from matched samples in the 2008 US national elections. Paper presented at the American Association for Public Opinion Research Annual Conference, Hollywood, Florida, March 14–17
Robbins M, Ghosh-Dastidar B, Ramchand R. 2017. Blending of probability and convenience samples as applied to a survey of military caregivers. Presentation at Inference from Nonprobability Samples, Washington, DC, September 25
Rosenbaum PR. 2015. How to see more in observational studies: Some new quasi-experimental devices. Annu.
Rev. Stat. Appl. 2:21–48
Rothman KJ, Gallacher JE, Hatch EE. 2013. Why representativeness should be avoided. Int. J. Epidemiol. 42:1012–14 [Argues very strongly against probability samples for causal inference.]
Rubin DB. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66:688–701
Rubin DB. 2001. Estimating the causal effects of smoking. Stat. Med. 20:1395–414
Selvin HC. 1958. Durkheim’s Suicide and problems of empirical research. Am. J. Sociol. 63:607–19
Shpitser I, VanderWeele TJ, Robins JM. 2010. On the validity of covariate adjustment for estimating causal effects. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence. Corvallis, OR: AUAI Press
Statistics Canada. 2017. Data quality toolkit. Statistics Canada. https://www.statcan.gc.ca/eng/data-quality-
toolkit
Stuart EA, Bradshaw CP, Leaf PJ. 2015. Assessing the generalizability of randomized trial results to target
populations. Prev. Sci. 16:475–85
Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. 2011. The use of propensity scores to assess the generalizability of results from randomized trials. J. R. Stat. Soc. A 174:369–86
Tourangeau R, Brick JM, Li J. 2017. Adaptive and responsive survey designs: a review and assessment. J. R.
Stat. Soc. A 180:202–23
Valliant R, Dever J. 2011. Estimating propensity adjustments for volunteer web surveys. Sociol. Methods Res.
40:105–37
Valliant R, Dever J, Kreuter F. 2018. A Practical Guide to Designing and Weighting Survey Samples. New York:
Springer. 2nd ed.
VanderWeele T, Shpitser I. 2011. A new criterion for confounder selection. Biometrics 67:1406–13
Wang W, Rothschild D, Goel S, Gelman A. 2015. Forecasting elections with non-representative polls. Int. J.
Forecast. 31:980–91
Wasserstein RL, Lazar NA. 2017. The ASA’s statement on p-values: context, process, and purpose. Am. Stat.
70:129–31
Winship C, Morgan S. 1999. The estimation of causal effects from observational data. Annu. Rev. Sociol. 25:659–707 [Introduces various techniques to estimate causal effects from observational data.]
Wooldridge JM. 2009. Introductory Econometrics: A Modern Approach. Mason, OH: South-Western. 4th ed.
Yeager DS, Krosnick JA, Chang L, Javitz HS, Levendusky MS, et al. 2011. Comparing the accuracy of RDD
telephone surveys and Internet surveys conducted with probability and non-probability samples. Public
Opin. Q. 75:709–47


Zubizarreta JR, Small DS, Rosenbaum PR. 2014. Isolation in the construction of natural experiments. Ann.
Appl. Stat. 8:2096–121

RELATED RESOURCES

Web Resources
Pew Research Center methods section (http://www.pewresearch.org/methods/): Regularly
publishes on this topic in their methods section
Causal Analysis in Theory and Practice (http://causality.cs.ucla.edu/blog/): Blog that discusses
questions about the identification of causal effects
Transparency Initiative of the American Association for Public Opinion Research (https://www.aapor.org/Standards-Ethics/Transparency-Initiative/FAQs.aspx): A good resource for survey data collection reporting
Survey Research Methods Section of the American Statistical Association (https://community.amstat.org/surveyresearchmethodssection/home): Regularly updates their links to webinars and training programs related to surveys

Related Software
Tools for designing and weighting survey samples: PracTools, https://cran.r-project.org/
web/packages/PracTools/index.html
R packages relevant for NPS estimation:

MatchIt, https://cran.r-project.org/web/packages/MatchIt/index.html
rstan, https://cran.r-project.org/web/packages/rstan/index.html
rstanarm, https://cran.r-project.org/web/packages/rstanarm/index.html

Various software for finding adjustment sets to identify causal effects based on graphical causal
models:

TETRAD, http://www.phil.cmu.edu/projects/tetrad/
DAGitty, http://www.dagitty.net/
DAG program, https://epi.dife.de/dag/
dagR, https://cran.r-project.org/web/packages/dagR/index.html

Textbooks
Cunningham S. 2018. Causal Inference: The Mixtape (V. 1.6). http://scunning.com/cunningham_
mixtape.pdf
Imbens GW, Rubin DB. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An
Introduction. Cambridge, UK: Cambridge Univ. Press
Morgan SL, Winship C. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social
Research. Cambridge, UK: Cambridge Univ. Press
Valliant R, Dever JA. 2018. Survey Weights: A Step-by-step Guide to Calculation. College Station, TX: Stata Press


Online Courses
The joint survey programs at the University of Michigan and the University of Maryland offer a
massive open online course on survey data collection and analysis through Coursera. Course 4,
"Sampling People, Networks and Records" (https://www.coursera.org/learn/sampling-methods),
directly targets sampling, and course 5, "Dealing with Missing Data" (https://www.coursera.org/learn/missing-data),
handles weighting, adjustment for nonresponse, and calibration to population totals.