American Political Science Review (2023) 117, 4, 1275–1290

doi:10.1017/S0003055422001411 © The Author(s), 2023. Published by Cambridge University Press on behalf of the American Political
Science Association. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://
creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is
properly cited.

Statistically Valid Inferences from Privacy-Protected Data


GEORGINA EVANS Harvard University, United States
GARY KING Harvard University, United States
MARGARET SCHWENZFEIER Harvard University, United States
ABHRADEEP THAKURTA Google Brain, United States

Unprecedented quantities of data that could help social scientists understand and ameliorate the
challenges of human society are presently locked away inside companies, governments, and other
organizations, in part because of privacy concerns. We address this problem with a general-
purpose data access and analysis system with mathematical guarantees of privacy for research subjects,
and statistical validity guarantees for researchers seeking social science insights. We build on the standard
of “differential privacy,” correct for biases induced by the privacy-preserving procedures, provide a
proper accounting of uncertainty, and impose minimal constraints on the choice of statistical methods and
quantities estimated. We illustrate by replicating key analyses from two recent published articles and show
how we can obtain approximately the same substantive results while simultaneously protecting privacy.
Our approach is simple to use and computationally efficient; we also offer open-source software that
implements all our methods.

Georgina Evans, PhD Candidate, Department of Government, Harvard University, United States, georgieaevans@gmail.com.
Gary King, Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Harvard University, United States, King@Harvard.edu.
Margaret Schwenzfeier, PhD Candidate, Department of Government, Harvard University, United States, schwenzfeier@g.harvard.edu.
Abhradeep Thakurta, Senior Research Scientist, Google Brain, athakurta@google.com.
Received: January 31, 2021; revised: August 22, 2021; accepted: November 16, 2022. First published online: February 08, 2023.

INTRODUCTION

Just as more powerful telescopes empower astronomers, the accelerating influx of data about the political, social, and economic worlds has enabled considerable progress in understanding and ameliorating the challenges of human society. Yet, although we have more data than ever before, we may now have a smaller fraction of the data in the world than ever before because huge amounts are now locked up inside private companies, governments, political campaigns, hospitals, and other organizations, in part because of privacy concerns. In a study of corporate data sharing with academics, "Privacy and security were cited as the top concern for companies that hold personal data because of the serious risk of re-identification," and government regulators obviously agree (FPF 2017, 11; see also King and Persily 2020). If we are to do our jobs as social scientists, we have no choice but to find ways of unlocking these datasets, as well as sharing more seamlessly with other researchers. We might hope that government or society will take actions to support this mission, but we can also take responsibility ourselves and begin to develop technological solutions to these political problems.

Among all the academic fields, political science scholarship may be especially affected by the rise in concerns over privacy because so many of the issues are inherently political. Consider an authoritarian government seeking out its opponents; a democratic government creating an enemies list; tax authorities (or an ex-spouse in divorce proceedings) searching for an aspiring politician's unreported income sources; an employer trying to weed out employees with certain political beliefs; or even political candidates searching for private information to tune advertising campaigns. The same respondents may also be hurt by political, financial, sexual, or health scams made easier by using illicitly obtained personal information to construct phishing email attacks. Political science access to data from business, governments, and other organizations may be particularly difficult to ensure because the subject we study—politics—is also the reason for our lack of access, as is clear from analyses of decades of public opinion polls, laws, and regulations (Robbin 2001).

We develop methods to foster an emerging change in the paradigm for sharing research information. Under the familiar data sharing regime, data providers protect the privacy of those in the data via de-identification (removing readily identifiable personal information such as names and addresses) and then simply giving a copy to trusted researchers, perhaps with a legal agreement. Yet, with the public's increasing concerns over privacy, an increasingly hard line taken by regulators worldwide, and data holders' (companies, governments, researchers, and others) need to respond, this regime is failing. Fueling these concerns is a new computer science subfield showing that, although de-identification surely makes it harder to learn the personal information of some research subjects, it offers no guarantee of thwarting a determined attacker (Henriksen-Bulmer and Jeary 2016). For example, Sweeney (1997) demonstrates that knowing only a respondent's gender, zip code, and birth date is sufficient to personally identify 87% of the U.S. population, and most datasets of interest to political scientists are far more informative. For another example, the U.S. Census was able to re-identify the personal answers of as many as three-quarters of Americans from publicly released and supposedly anonymous 2010 census data (Abowd 2018; see also https://bit.ly/privCC). Unfortunately, other privacy-protection techniques used in the social sciences—such as aggregation, legal agreements, data clean rooms, query auditing, restricted viewing, and paired programmer models—can often be broken by intentional attack (Dwork and Roth 2014). And not only does the venerable practice of trusting researchers to follow the rules fail spectacularly at times (like the Cambridge Analytica scandal, sparked by the behavior of a single social scientist), but it turns out that even trusting a researcher who is known to be trustworthy does not always guarantee privacy (Dwork and Ullman 2018). Fast-growing recent evidence is causing many other major organizations to change policies as well (e.g., Al Aziz et al. 2017; Carlini et al. 2020).

An alternative approach to the data sharing regime that may help persuade some data holders to allow academic research is the two-part data access regime. In the first part, the confidential data reside on a trusted computer server protected by best practices in cybersecurity, just as they do before sharing under the data sharing regime. The distinctive aspect of the data access regime is its second step, which treats researchers as potential "adversaries" (i.e., who may try to violate respondents' privacy while supposedly doing their job seeking knowledge for the public good) and thus adds mathematical guarantees of the privacy of research subjects. To provide these guarantees, we construct a "differentially private" algorithm that makes it possible for researchers to discover population-level insights but impossible to reliably detect the effect of the inclusion or exclusion of any one individual in the dataset or the value of any one person's variables. Researchers are permitted to run statistical analyses on the server and receive "noisy" results computed by this privacy-preserving algorithm (but are limited in the total number of runs so they cannot repeat the same query and average away the noise).

Differential privacy is a widely accepted mathematical standard for data access systems that promises to avoid some of the zero-sum policy debates over balancing the interests of individuals with the public good that can come from research. It also seems to satisfy regulators and others. Differential privacy was introduced by Dwork et al. (2006) and generalizes the social science technique of "randomized response" to elicit sensitive information in surveys; it does this by randomizing the answer to a question rather than the question itself (see Evans et al. Forthcoming). See Dwork and Roth (2014) and Vadhan (2017) for overviews and Wood et al. (2018) for a nontechnical introduction.

A fast-growing literature has formed around differential privacy, seeking to balance privacy and utility, but the current measures of "utility" provide little utility to social scientists or other statistical analysts. Statistical inference in our field usually involves choosing a target population of interest, identifying the data generation process, and then using the resulting dataset to learn about features of the population. Valid inferences require methods with known statistical properties (such as identification, unbiasedness, and consistency) and honest assessments of uncertainty (e.g., standard errors). In contrast, privacy researchers typically begin with the choice of a target (confidential) dataset, add privacy-protective procedures, and then use the resulting differentially private dataset or analyses to infer to the confidential dataset—usually without regard to the data generation process or valid population inferences. This approach is useful for designing privacy algorithms but, as Wasserman (2012, 52) puts it, "I don't know of a single statistician in the world who would analyze data this way." Because social scientists understand not to confuse internal and external validity, they will aim to infer, not to the data they would see without privacy protections, but to the world from which the data were generated.1

1 In other words, "In statistical inference the sole source of randomness lies in the underlying model of data generation, whereas the estimators themselves are a deterministic function of the dataset. In contrast, differentially private estimators are inherently random in their computation. Statistical inference that considers both the randomness in the data and the randomness in the computation is highly uncommon" (Sheffet 2017, 3107). As Karwa and Vadhan (2017) write, "Ignoring the noise introduced for privacy can result in wildly incorrect results at finite sample sizes…this can have severe consequences." On the essential role of inference and uncertainty in science, see King, Keohane, and Verba (1994).

To make matters worse, although the privacy-protective procedures introduced by differential privacy work well for their intended purpose, they induce severe bias in estimating population quantities of interest to social scientists. These procedures include adding random error, which induces measurement error bias, and censoring (known as "clamping" in computer science), which when uncorrected induces selection bias (Blackwell, Honaker, and King 2017; Stefanski 2000; Winship and Mare 1992). We have not found a single prior study that tries to correct for both (although some avoid the effects of censoring in theory at the cost of considerable additional noise and far larger standard errors in practice; Karwa and Vadhan 2017; Smith 2011), and few have uncertainty estimates. This is crucial because inferentially invalid data access systems can harm societies, organizations, and individuals—such as by inadvertently encouraging the distribution of misleading medical, policy, scientific, and other conclusions—even if they successfully protect individual privacy.
For these reasons, output from existing differentially private systems needs correction before being used in social science analysis.2

2 Inferential issues also affect differential privacy applications outside of data access systems (the so-called central model). These include "local model" systems where private calculations are made on a user's system and sent back to a company in a way that prevents it from making reliable inferences about individuals—including Google's Chrome (Erlingsson, Pihur, and Korolova 2014) and their other products (Wilson et al. 2019), Apple's MacOS (Tang et al. 2017), and Microsoft's Windows (Ding, Kulkarni, and Yekhanin 2017)—and "hybrid" systems to release differentially private datasets such as from Facebook (Evans and King 2023) and the U.S. Census Bureau (Garfinkel, Abowd, and Powazek 2018).

Social scientists and others need algorithms on data access systems with both inferential validity and differential privacy. We offer one such algorithm that is approximately unbiased with sufficient prior information, has lower variance than uncorrected estimates, and comes with accurate uncertainty estimates. The algorithm also turns censoring from a problem that severely increases bias or inefficiency in statistical estimation in order to protect privacy into an attractive feature that greatly reduces the amount of noise needed to protect privacy while still leaving estimates approximately unbiased. The algorithm is easy to implement, computationally efficient even for very large datasets, and, because the entire dataset never needs to be stored in the same place, may offer additional security protections.

Our algorithm is generic, designed to minimally restrict the choice among statistical procedures, quantities of interest, data generating processes, and statistical modeling assumptions. The algorithm may therefore be especially well suited for building research data access systems. When valid inferential methods exist or are developed for more restricted use cases, they may sometimes allow less noise for the same privacy guarantee. As such, one productive plan is to first implement our algorithm and then gradually add these more specific approaches when they become available as preferred choices.3

3 For example, Karwa and Vadhan (2017) develop finite-sample confidence intervals with proper coverage for the mean of a normal density; Barrientos et al. (2019) offer differentially private significance tests for linear regression; Gaboardi et al. (2016) propose chi-squared goodness-of-fit tests for multinomial data and independence between two categorical variables; Wang, Lee, and Kifer (2015) propose accurate p-values for chi-squared tests of independence between two variables in tabular data; Wang, Kifer, and Lee (2018) develop differentially private confidence intervals for objective or output perturbation; and Williams and McSherry (2010) provide an elegant marginal likelihood approach for moderate sized datasets.

We begin, in the next section, with an introduction to differential privacy and a description of the inferential challenges in analyzing data from a differentially private data access system. We then give a generic differentially private algorithm which, like most such algorithms, is statistically biased. We therefore introduce bias corrections and variance estimators which, together with the private algorithm, accomplish our goals. Technical details appear in the Supplementary Material. We illustrate the performance of our approach in finite samples via Monte Carlo simulations; we then replicate key results from two recent published articles and show how to obtain almost the same conclusions while guaranteeing the privacy of all research subjects. The limitations of our methodology must be understood to use it appropriately, a topic we take up throughout the paper, while discussing practical advice for implementation, and in our Supplementary Material. As a companion to this paper, we offer open-source software (called UnbiasedPrivacy) to illustrate all the methods described herein.

DIFFERENTIAL PRIVACY AND ITS INFERENTIAL CHALLENGES

We now define the differential privacy standard, describe its strengths, and highlight the challenges it poses for proper statistical inference. Throughout, we modify notation standard in computer science so that it is more familiar to social scientists.

Definitions

Begin with a confidential dataset D, defined as a collection of N rows of numerical measurements constructed so that each individual whose privacy is to be protected is represented in at most one row.4

4 Hierarchical data structures, or dependence among units, are allowed within but not between rows. For example, rows could represent a family with variables for different family members. Although N could leak privacy in unusual situations (in which case it could be disclosed in a differentially private manner with our algorithm), for simplicity of exposition, we do not treat N as private.

Statistical analysts would normally calculate a statistic s (such as a count, mean, or parameter estimate) from the data D, which we write as s(D). In contrast, under differential privacy, we construct a privacy-protected version of the same statistic, called a "mechanism" M(s, D), by injecting noise and censoring at some point before returning the result (so even if we treat s(D) as fixed, M(s, D) is random). As we will show, the specific types of noise and censoring are specially designed to satisfy differential privacy.

The core idea behind the differential privacy standard is to prevent a researcher from reliably learning anything different from a dataset regardless of whether an individual has been included or excluded. To formalize this notion, consider two datasets D and D′ that differ in at most one row. Then, the standard requires that the probability (or probability density) of any analysis result m from dataset D, Pr[M(s, D) = m], be indistinguishable from the probability that the same result is produced by the same analysis of dataset D′, Pr[M(s, D′) = m], where the probabilities take D as fixed and are computed over the noise.

We write an intuitive version of the differential privacy standard (using the fact that e^ϵ ≈ 1 + ϵ for small ϵ) by defining "indistinguishable" as the ratio of the probabilities falling within ϵ of equality (which is 1). Thus, a mechanism is said to be ϵ-differentially private if

   Pr[M(s, D) = m] / Pr[M(s, D′) = m] ∈ 1 ± ϵ,   (1)

where ϵ is a pre-chosen level of possible privacy leakage, with smaller values potentially giving away less privacy (by requiring more noise or censoring).5

5 Defining "indistinguishable" via this multiplicative (ratio) metric is much more protective of privacy than some others, such as an additive (difference) metric. For example, consider an obviously unacceptable mechanism: "choose one individual uniformly at random and disclose all of his or her data." This mechanism is not differentially private (the ratio can be infinite and thus greater than any finite ϵ), but it may appear safe on an additive metric because the impact of adding or removing one individual on the difference in the probability distribution of a statistical output is proportional to at most 1/N.

Many variations and extensions of Equation 1 have been proposed (Desfontaines and Pejó 2019). We use the most popular, known as "(ϵ, δ)-differential privacy" or "approximate differential privacy," which adds a small chosen offset δ to the numerator of the ratio in Equation 1, with ϵ remaining the main privacy parameter. This second privacy parameter, which the user sets to a small value such that δ < 1/N, allows mechanisms with (statistically convenient) Gaussian (i.e., Normal) noise processes. This relaxation also has Bayesian interpretations, with the posterior distribution of M(s, D) close to that of M(s, D′), and also implies that an (ϵ, δ)-differentially private mechanism is ϵ-differentially private with probability at least 1−δ (Vadhan 2017, 355ff). We can also express approximate differential privacy more formally as requiring that each of the probabilities be bounded by a linear function of the other:6

   Pr[M(s, D) = m] ≤ δ + e^ϵ · Pr[M(s, D′) = m].   (2)

6 Our algorithms below also satisfy a strong version of approximate differential privacy known as Rényi differential privacy (see Mironov 2017). See also related privacy concepts from the social sciences (Chetty and Friedman 2019) and statistics (Reiter 2012).

Consistent with political science research showing that secrecy is best thought of as continuous rather than dichotomous (Roberts 2018), the differential privacy standard quantifies privacy leakage of statistics released via a randomized mechanism through the choice of ϵ (and δ). Differential privacy is expressed in terms of the maximum possible privacy loss, but the expected privacy loss is considerably less than this worst case analysis, often by orders of magnitude (Carlini et al. 2019; Jayaraman and Evans 2019). It also protects small groups in the same way as individuals, with the maximum risk kϵ rising linearly in group size k.

Example

To fix ideas, consider a confidential database of donations given by individuals to an organization bent on defeating an authoritarian leader, with the mean donation in a region as the quantity of interest: ȳ = (1/N) Σ_{i=1}^{N} y_i. (The mean is a simple statistic, but we will use the mean to generalize to most other commonly used statistics.) Suppose also that the region includes many poor, and one more wealthy, individual. Our goal is to enable researchers to infer the mean without any material possibility of revealing information about any individual in the region (even if not in the dataset).

Consider three methods of privacy protection. First, aggregation to the mean alone fails as a method of privacy protection: the differential privacy standard requires that every individual is protected even in the worst-case scenario, but clearly disclosing ȳ may be enough to reveal whether the wealthy individual made a large contribution.

Second, suppose instead we disclose only the sum of ȳ and noise (a draw from a mean-zero normal distribution). This mechanism is unbiased, with larger confidence intervals being the cost of privacy protection. The difficulty with this approach is that we may need to add so much noise to protect the one wealthy outlier that our confidence intervals may be too large to be useful.

Finally, we describe the intuitive Gaussian mechanism that will solve the problem for this example and will aid in the development of our general purpose methodology. The idea is to first censor donations that are larger than some pre-chosen value Λ, set the censored values to Λ, average the (partially censored) donations, and finally add noise to the average. Censoring has the advantage of requiring less noise to protect individual outliers, but has the disadvantage of inducing statistical bias. (We show how to correct this bias in a subsequent section, turning censoring into an attractive tool enabling both privacy and utility.)

The bound on privacy loss can be entirely quantified via the noise variance, S², even though privacy protection comes from both censoring and random noise. This is because, if we choose to censor less by setting Λ to be high, then we must add more random noise. To set the noise variance, we consider the values of ϵ and δ that would be satisfied in Equation 2. Since we are converting one number, S², into two, there are many such (ϵ, δ) pairs. (A related notion of differential privacy, known as zero-concentrated differential privacy, captures the privacy loss of the Gaussian mechanism in a single parameter denoted by ρ, but requires a more complicated definition of differential privacy [Bun and Steinke 2016]. Some applications of differential privacy report the privacy loss in terms of ρ, rather than ϵ and δ, and a simple formula can be used to convert between the two.)
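For readers who encounter privacy losses reported as ρ, one standard conversion to an (ϵ, δ) guarantee (Bun and Steinke 2016) can be computed in a few lines. The snippet below is illustrative only and is not part of our software:

```python
# Standard conversion (Bun and Steinke 2016): rho-zCDP implies
# (rho + 2*sqrt(rho*log(1/delta)), delta)-differential privacy for any delta > 0.
import math

def zcdp_to_eps(rho, delta):
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

print(zcdp_to_eps(rho=0.5, delta=1e-5))  # approximately 5.3
```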
To guide intuition about how the noise variance must change to satisfy a particular ϵ privacy level at a fixed δ, it is useful to write

   S(Λ, ϵ, N; δ) ∝ Λ / (Nϵ).   (3)

Thus, Equation 3 shows that ensuring differential privacy allows us to add less noise if (1) each person is submerged in a sea of many others (larger N), (2) less privacy is required (we choose a larger ϵ), or (3) more censoring is used (we choose a smaller Λ).
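For concreteness, a toy version of this censor-average-noise mechanism might look as follows. The δ-dependent constant below is the textbook analytic calibration for the Gaussian mechanism and is used here only as a stand-in for the exact S(Λ, ϵ, N; δ) derived in Appendix A of the Supplementary Material:

```python
# A toy Gaussian mechanism for the donations example: censor each donation at a
# prechosen Lambda, average, then add mean-zero Gaussian noise whose scale is
# proportional to Lambda/(N*epsilon), as in Equation 3.
import numpy as np

def private_mean(donations, lam, epsilon, delta, seed=0):
    rng = np.random.default_rng(seed)
    y = np.clip(np.asarray(donations, dtype=float), None, lam)  # censor large donations at Lambda
    sensitivity = lam / len(y)   # one person can move the censored mean by at most Lambda/N
    noise_sd = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon  # illustrative calibration
    return y.mean() + rng.normal(0, noise_sd)

# Smaller Lambda, larger N, or larger epsilon all mean less noise, as Equation 3 indicates.
print(private_mean([10, 25, 5, 15, 100000], lam=50, epsilon=1.0, delta=1e-5))
```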

With censoring, θ̂ is a biased estimate of ȳ: E(θ̂) ≠ ȳ. To reduce censoring bias, we can choose larger values of Λ, but that unfortunately increases the noise, statistical variance, and uncertainty; however, reducing noise by choosing a smaller value of Λ increases censoring and thus bias. We resolve this tension in our section on ensuring valid inference.

Inferential Challenges

We now discuss four issues differential privacy poses for statistical inference. For some, we offer corrections; for others, we suggest how to adjust statistical analysis practices.

First, censoring induces selection bias. Avoiding censoring by setting Λ large enough adds more noise but is unsatisfactory because any amount of noise induces bias in estimates of all but the simplest quantities of interest. Moreover, even for unbiased estimators (like the uncensored mean above), the added noise makes unadjusted standard errors statistically inconsistent. Ignoring either measurement error bias or selection bias is a major inferential mistake as it may change substantive conclusions, estimator properties, or the validity of uncertainty estimates, often in negative, unknown, or surprising ways.7

7 For example, adding mean-zero noise to one variable or its mean induces no bias for the population mean, but adding noise to its variance induces bias. The estimated slope coefficient in a regression of y on x, with random measurement error in x, is biased toward zero; if we add variables with or without measurement error to this regression, the same coefficient can be biased by any amount and in any direction. Censoring sometimes attenuates causal effects, but it can also exaggerate them; predictions and estimates of other quantities can be too high, too low, or have their signs changed.

Second, it would be sufficient for proper scientific statements to have (1) an estimator θ̂ with known statistical properties (such as unbiasedness, consistency, and efficiency) and (2) accurate uncertainty estimates. Unfortunately, as Appendix B of the Supplementary Material demonstrates, accurate uncertainty estimates cannot be constructed by using differentially private versions of classical uncertainty estimates, meaning we must develop new approaches.

Third, to avoid researchers rerunning the same analysis many times and averaging away the noise, their analyses must be limited. This limitation is formalized via a differential privacy property known as composition: if mechanism k is (ϵ_k, δ_k)-differentially private, for k = 1, …, K, then disclosing all K estimates is (Σ_{k=1}^{K} ϵ_k, Σ_{k=1}^{K} δ_k)-differentially private.8 The composition property enables the data provider to enforce a privacy budget by allocating a total value of ϵ to a researcher, who can then divide it up and run as many analyses, of whatever type, as they choose, so long as the sum of all the ϵs across all their analyses does not exceed their total privacy budget.

8 Alternatively, if the K quantities are disclosed simultaneously and returned in a batch, then we could choose to set the variance of the error for all together at a higher individual level but lower collective level (Bun and Steinke 2016; Mironov 2017).
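A data access system can enforce this constraint with simple bookkeeping. The sketch below is illustrative; the class and method names are not from the UnbiasedPrivacy software:

```python
# Enforcing a composition-based privacy budget: each released statistic spends
# part of the total (epsilon, delta) allocation, and a request is refused once
# either running sum would exceed its total.
class PrivacyBudget:
    def __init__(self, total_epsilon, total_delta):
        self.total_epsilon, self.total_delta = total_epsilon, total_delta
        self.spent_epsilon = self.spent_delta = 0.0

    def spend(self, epsilon, delta):
        if (self.spent_epsilon + epsilon > self.total_epsilon or
                self.spent_delta + delta > self.total_delta):
            raise RuntimeError("privacy budget exhausted")
        self.spent_epsilon += epsilon
        self.spent_delta += delta

budget = PrivacyBudget(total_epsilon=1.0, total_delta=1e-5)
budget.spend(0.5, 5e-6)   # first analysis
budget.spend(0.5, 5e-6)   # second analysis; a third request would now be refused
```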
Enabling the researcher to decide how to allocate a privacy budget enables scarce information to be better directed to scholarly goals than a central authority such as the data provider could. However, when the total privacy budget is used up, no researcher can run new analyses unless the data provider chooses to increase the budget. This constraint is useful to protect privacy but utterly changes the nature of statistical analysis. To see this, note that best practice recommendations have long included trying to avoid being fooled by the data—by running every possible diagnostic and statistical check, fully exploring the dataset—and by the researcher's personal biases—such as by preregistration to constrain the number of analyses run to eliminate "p-hacking" or correcting for "multiple comparisons" ex post (Monogan 2015). One obviously needs to avoid being fooled in any way, and so researchers normally try to balance the resulting contradictory advice. In contrast, differential privacy tips the scales: remarkably, it makes solving the second problem almost automatic (Dwork et al. 2015), but it also reduces the probability of serendipitous discovery and increases the odds of being fooled by unanticipated data problems. Successful data analysis with differential privacy thus requires careful planning, although less stringently than with preregistration. In a sense, differential privacy turns the best practices for analyzing a private observational dataset (which might otherwise be inaccessible) into something closer to the best practices for designing a single, expensive field experiment.

In order to ensure researchers can follow the replication standard (King 1995), and to preserve their privacy budget, we recommend that differentially private systems cache results so that rerunning the same analysis adds the identical noise every time and reproduces the identical estimate and standard errors. (Researchers could of course choose to rerun the same analysis with a fresh draw of noise to reduce standard errors at the cost of spending more from their privacy budget.) In addition, authors of politically sensitive studies now often omit information from replication datasets entirely; this information could instead be made available in differentially private ways.
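One way to implement such caching is to key each release on the full query description, so that identical queries return identical noisy results. The following sketch is illustrative rather than a description of any deployed system:

```python
# Cache noisy releases by query description: rerunning the identical analysis
# reproduces the identical result instead of spending more of the budget.
import hashlib
import numpy as np

_cache = {}

def cached_release(query_id, true_value, noise_sd):
    if query_id not in _cache:
        # Derive the noise seed deterministically from the query description so
        # the result is also reproducible across sessions (hashing scheme illustrative).
        seed = int.from_bytes(hashlib.sha256(query_id.encode()).digest()[:8], "big")
        rng = np.random.default_rng(seed)
        _cache[query_id] = true_value + rng.normal(0, noise_sd)
    return _cache[query_id]

a = cached_release("ols: comment ~ homeowner, Lambda=0.075, eps=0.5", 0.05, 0.01)
b = cached_release("ols: comment ~ homeowner, Lambda=0.075, eps=0.5", 0.05, 0.01)
assert a == b  # the same query always reproduces the same estimate
```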
Finally, learning about population quantities of interest is not an individual-level privacy violation (e.g., https://bit.ly/Noviol). A researcher can even learn about an individual from a differentially private mechanism, but no more than if that individual were excluded from the dataset. For example, suppose research indicates that women are more likely to share fake news with friends on social media than men; then, if you are a woman, everyone knows you have a higher risk of sharing fake news. But the researcher would have learned this social science generalization whether or not you were included in the dataset, and you have no privacy-related reason to withhold your information.

A key point, however, is that we ensure a chosen differentially private mechanism is inferentially valid or else we will draw the wrong conclusions about social science generalizations.
In particular, if researchers use privacy-preserving mechanisms without bias corrections, social scientists and ultimately society can be misled. (In fact, it is older people, not women, who are more likely to share fake news! See Guess, Nagler, and Tucker [2019].) Fortunately, all of differential privacy's properties are preserved under post-processing, so no privacy loss will occur when, below, we correct for inferential biases (or if results are published or mixed with any other data sources). In particular, for any data analytic function f not involving private data D, if M(s, D) is differentially private, then f[M(s, D)] is differentially private, regardless of assumptions about potential adversaries or threat models.

Although differential privacy may seem to follow a "do no more harm" principle, careless use of it can harm individuals and society if we do not also provide inferential validity. The biases from ignoring measurement error and selection can each separately or together reverse, attenuate, exaggerate, or nullify statistical results. Helpful public policies could be discarded. Harmful practices may be promoted. Of course, when providing access to confidential data, not using differential privacy may also have grave costs to individuals. Data providers must therefore ensure that data access systems are both differentially private and inferentially valid.

A DIFFERENTIALLY PRIVATE GENERIC ESTIMATOR

We now introduce an approximately unbiased approach which, like our software, we call UnbiasedPrivacy (or UP); it has two parts: (1) a differentially private mechanism, introduced in this section, and (2) a bias correction of the differentially private result, using the algorithm in the next section, along with accurate uncertainty estimates in the form of standard errors. Our work builds on the algorithm in Smith (2011) but results in an estimator with a root-mean-square error that, in simulations, is hundreds of times smaller (see Appendix C of the Supplementary Material).

Let 𝒟 denote a population data matrix, from which N observations are selected to form our observed data matrix D. Our goal is to estimate some (fixed scalar) quantity of interest, θ = s(𝒟), with the researcher's choice of statistical procedure s. The statistical procedure s(·) refers to the results of running a chosen statistical methodology from among the many statistical methods in widespread use in political science—based on maximum likelihood, Bayesian, nonparametric, or other approaches—and finally computing a quantity of interest from the results—such as a prediction, probability, first difference, or forecast. Multiple quantities of interest can be estimated by repeating the algorithm (dividing ϵ among the runs).

Our procedure is "generic," by which we mean that researchers may choose among a wide variety of methods, including most now in use across the social sciences, but of course there are constraints. For example, the data must be arranged so that each research subject whose privacy we wish to protect contributes information to only one row of D (see Footnote 4) and, technically, we must select from among the many methods that are statistically valid under bootstrapping.9

9 That is, we allow any statistic with a positive bounded second Gateaux derivative and Hadamard differentiability (Wasserman 2006, 35), excluding statistics such as the maximum. We also assume that the parameter value does not fall at a boundary of its support, the distribution of the underlying estimator has a finite mean and variance, and N is growing faster than P. This condition is commonly met, but there are exceptions. For instance, the synthetic control estimator can be nonnormal (Li 2020). It is therefore important to consider whether asymptotic normality applies in any particular application.

Let θ̂ = s(D) denote an estimate of θ using statistical estimator s(·), which we would use if we could see the private data; also denote a differentially private estimate of θ by θ̂^dp. To compute θ̂^dp, the researcher chooses a statistical method, a quantity of interest estimated from the statistical method (causal effect, risk difference, predicted value, etc.), and values for each of the privacy parameters (Λ, ϵ, and δ; we discuss how to make these choices in practice below). Below, we give the details of our proposed mechanism M(s, D) = θ̂^dp, followed by the privacy and then inferential properties of this strategy.

Mechanism

We give here a generic differentially private estimator based on a simple version of the Gaussian mechanism (described above) applied to estimates from subsets of the data rather than to individual observations. To be more specific, the algorithm uses a partitioning version of the "sample and aggregate" algorithm (Nissim, Raskhodnikova, and Smith 2007) to ensure the differential privacy standard will apply for almost any statistical method and quantity of interest. We also incorporate an optional application of the computationally efficient "bag of little bootstraps" algorithm (Kleiner et al. 2014) to ensure an aspect of inferential validity generically, by not having to worry about differences in how to scale up different statistics from each partition to the entire dataset.

FIGURE 1. Differentially Private Algorithm (Before Bias Correction)

Figure 1 gives a visual representation of the algorithm we now detail. We first randomly sort observations from D into P separate partitions {D_1, …, D_P}, each of size n ≈ N/P (we discuss the choice of P below), and then follow this algorithm.

1. From the data in partition p (p = 1, …, P):
   (a) Compute an estimate θ̂_p (using the same method as we would apply to the full private data, and scaling up to the full dataset, or via the general purpose bag of little bootstraps algorithm; see Appendix D of the Supplementary Material).
   (b) Censor the estimate θ̂_p as c(θ̂_p, Λ).
2. Compute θ̂^dp by averaging the censored estimates (over partitions instead of observations) and adding mean-zero noise:

   θ̂^dp = θ̄ + e,   (4)

   where

   θ̄ = (1/P) Σ_{p=1}^{P} c(θ̂_p, Λ),   e ~ N(0, S²),   S = S(Λ, ϵ, δ, P),   (5)

   and where S is explained intuitively here and formally in Appendix A of the Supplementary Material.
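A minimal sketch of this partitioned mechanism (before bias correction) appears below. The noise calibration shown is the classical Gaussian-mechanism formula applied to the sensitivities 2Λ/P (for the censored mean) and 1/P (for the censored share), and is only a placeholder for the S(Λ, ϵ, δ, P) derived in Appendix A of the Supplementary Material; the function names are illustrative, not the UnbiasedPrivacy interface:

```python
# Sample-and-aggregate sketch of steps 1-2: partition, estimate within each
# partition, censor ("clamp"), average, and add Gaussian noise.
import numpy as np

def gaussian_release(value, sensitivity, epsilon, delta, rng):
    """Analytic-Gaussian-style calibration, standing in for the paper's S(.)"""
    sd = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0, sd), sd

def up_mechanism(data, estimator, P, lam, epsilon, delta, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(data)), P)      # random, disjoint partitions
    ests = np.array([estimator(data[idx]) for idx in parts])   # step 1a: per-partition estimates
    censored = np.clip(ests, -lam, lam)                        # step 1b: censor at [-Lambda, Lambda]
    # Step 2: one individual sits in one partition, so it can move the mean of the
    # censored estimates by at most 2*lam/P and the censored share by at most 1/P.
    # Half the (epsilon, delta) budget is spent on each release, as suggested below.
    theta_dp, sd = gaussian_release(censored.mean(), 2 * lam / P, epsilon / 2, delta / 2, rng)
    alpha2_dp, _ = gaussian_release((ests > lam).mean(), 1 / P, epsilon / 2, delta / 2, rng)
    return theta_dp, alpha2_dp, sd
```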

Privacy Properties

Privacy is ensured in this algorithm by each individual appearing in at most one partition, and by the censoring and noise in the aggregation mechanism ensuring that data from any one individual can have no meaningful effect on the distribution of possible outputs. Each partition can even be sequestered on a separate server, which may reduce security risks.

The advantage of always using the censored mean of estimates across partitions, regardless of the statistical method used within each partition, is that the variance of the noise can be calibrated generically to the sensitivity of this mean rather than having to derive the sensitivity anew for each estimator. The cost of this strategy is additional noise, because P rather than N appears in the denominator of the variance. Thus, from the perspective of reducing noise, we should set P as large as possible, subject to the constraints that (1) the number of units in each partition, n ≈ N/P, gives valid statistical results in each bootstrap and a sensible estimate (such as regression covariates being of full rank) and (2) n is large enough and growing faster than P (to ensure the applicability of the central limit theorem or, for a precise rate of convergence, the Berry–Esseen theorem). See also our simulations below. Subsampling itself increases the variance of nonlinear estimators, but usually much less than the increased variance due to privacy-protective procedures; see Mohan et al. (2012) for methods of optimizing P.

Inferential Properties

To study the statistical properties of our point estimator and its uncertainty estimates, consider two conditions: (1) an assumption, which we maintain until the next section, that Λ is large enough so that censoring has no effect (c(θ̂_p, Λ) = θ̂_p) and (2) a method that is unbiased when applied to the private data. If these two conditions hold, our point estimates are unbiased:

   E(θ̂^dp) = (1/P) Σ_{p=1}^{P} E(θ̂_p) + E(e) = θ.   (6)

In practice, however, choosing the bounding parameter Λ involves a bias-variance trade-off: if Λ is set large enough, censoring has no effect and θ̂^dp is unbiased, but the noise and resulting uncertainty estimates will be large (see Equation 5). Alternatively, choosing a smaller value of Λ will reduce noise and the resulting uncertainty estimates, but it increases censoring and bias. Of course, if the chosen estimator is biased when applied to the private data, perhaps due to violating statistical assumptions or merely being a nonlinear function of the data like logit or an event count model, then our algorithm will not magically remove the bias, but it will not add bias.

Uncertainty estimators, in contrast, require adjustment even if both conditions are met. For example, the variance of the differentially private estimator is V(θ̂^dp) = V(θ̂) + S², but its naive variance estimator (the differentially private version of an unbiased nonprivate variance estimator) is biased:

   E[V̂(θ̂^dp)] = E[V̂(θ̂) + e] = V(θ̂) + E(e) = V(θ̂) ≠ V(θ̂^dp).   (7)

Fortunately, we can compute an unbiased estimate of the variance of the differentially private estimator by simply adding back in the (known) variance of the noise, V̂(θ̂^dp) = V̂(θ̂) + S², which is unbiased: E[V̂(θ̂^dp)] = V(θ̂^dp). Of course, because censoring will bias our estimates, we must bias correct and then compute the variance of the corrected estimate, which will ordinarily require a more complicated expression.
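A quick Monte Carlo check of Equation 7, under the no-censoring assumption, illustrates why adding back the known noise variance S² is the right correction; the numerical values are arbitrary:

```python
# The naive variance estimate targets V(theta_hat); adding the known noise
# variance S^2 matches the variance of the differentially private estimate itself.
import numpy as np

rng = np.random.default_rng(1)
v_theta_hat, s2 = 0.04, 0.09          # V(theta_hat) and the known noise variance S^2
theta_hat = rng.normal(0.0, np.sqrt(v_theta_hat), size=100_000)
theta_dp = theta_hat + rng.normal(0.0, np.sqrt(s2), size=theta_hat.size)

print(theta_dp.var())                 # ~0.13 = V(theta_hat) + S^2
print(v_theta_hat)                    # what the naive estimator recovers, ~0.04
print(v_theta_hat + s2)               # corrected estimate, ~0.13
```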
ENSURING VALID STATISTICAL INFERENCE

We now correct our differentially private estimator for the bias due to censoring, which has the effect of reducing the impact of the choice of Λ. The correction thus allows users to choose smaller values of Λ, and consequently reduce the variance and use less of the privacy budget (see also our section on practical suggestions below). Because we introduce the correction by post-processing, we retain the same privacy-preserving properties. It even turns out that the variance of this bias-corrected estimate is actually smaller than that of the uncorrected estimate, which is unusual for bias corrections. We know of no prior attempt to correct for biases due to censoring in any differentially private mechanism. We now describe our bias-corrected point estimator, θ̃^dp, and variance estimate, V̂(θ̃^dp).

Bias Correction

Our goal here is to correct the bias due to censoring in our estimate of θ. Figure 2 helps visualize the underlying distributions and notation we will introduce, with the original uncensored distribution in blue and the censored distribution in orange, which is made up of an unnormalized truncated distribution and spikes (which replace the area in the tails) at −Λ and Λ. Although Λ and −Λ are of course symmetric around zero, the uncensored distribution is centered at θ.

FIGURE 2. Underlying Distributions (Before Estimation). The censored distribution includes the orange area and spikes at −Λ and Λ.

Our strategy is to approximate the distribution of θ̂_p by a normal distribution, which is correct asymptotically for most commonly used statistical methods in political science. Thus, the distribution of θ̂_p across partitions, before censoring at [−Λ, Λ], is approximately N(θ, σ²/n), where n = N/P. The proportion left and right censored, respectively (the area under the distribution's tails), is therefore approximated by

   α₁ = ∫_{−∞}^{−Λ} N(t | θ, σ²/n) dt,   α₂ = ∫_{Λ}^{∞} N(t | θ, σ²/n) dt.   (8)

Under technical regularity conditions (see Appendix E of the Supplementary Material for a proof), we can bound the error on these terms such that Pr(θ̂_p < −Λ) = α₁ ± O(1/√n) and Pr(θ̂_p > Λ) = α₂ ± O(1/√n). This means that the difference between the true proportion censored and our approximation of it, α₁ + α₂, is decreasing proportionally to 1/√n, which is typically very small. We thus ignore this (generally negligible) error when constructing our estimator.

We then write the expected value of θ̂^dp as the weighted average of the mean of the truncated normal, the spikes at −Λ and Λ, and the term we ignore:

   E(θ̂^dp) = −α₁Λ + (1 − α₂ − α₁)θ_T + α₂Λ ± O(1/√n),   (9)

with truncated normal mean

   θ_T = θ + (σ/√n) · [N(−Λ | θ, σ²/n) − N(Λ | θ, σ²/n)] / (1 − α₂ − α₁).

We then construct a plug-in estimator by substituting our point estimate θ̂^dp for the expected value on the left of Equation 9. We are left with three equations (Equation 9 and the two in Equation 8) and four unknowns (θ, σ, α₁, and α₂). We therefore use some of the privacy budget to obtain an estimate of α₂ (or α₁) as α̂₂ = (1/P) Σ_p 1(θ̂_p > Λ), which we release via the Gaussian mechanism.10 How to use the privacy budget is up to the user, but we find that splitting the expenditure equally between the original quantity of interest and this parameter works well.

10 Since a proportion is bounded between 0 and 1, the sensitivity of this estimator is bounded too and so we can avoid censoring.

With the same three equations, and our remaining three unknowns (θ, σ, and α₁), our open-source software gives a fast numerical solution. The result is θ̃^dp, the approximately unbiased estimate of θ (and estimates σ̂^dp and α̂₁^dp). See Appendix F of the Supplementary Material.
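The following sketch shows the kind of numerical solution involved, using standard censored-normal moment formulas in place of Equations 8 and 9; it is illustrative only, and the UnbiasedPrivacy software implements the full version described in Appendix F of the Supplementary Material:

```python
# Plug-in bias correction sketch: given the two privately released quantities
# (theta_dp, the censored-and-noised mean; alpha2_hat, the estimated share of
# partition estimates above +Lambda), solve for (theta, s), where s plays the
# role of sigma/sqrt(n).
import numpy as np
from scipy.optimize import fsolve
from scipy.stats import norm

def censored_mean(theta, s, lam):
    """Expected value of a N(theta, s^2) draw censored to [-lam, lam], and alpha2."""
    a, b = (-lam - theta) / s, (lam - theta) / s
    alpha1, alpha2 = norm.cdf(a), 1 - norm.cdf(b)
    keep = 1 - alpha1 - alpha2
    trunc_mean = theta + s * (norm.pdf(a) - norm.pdf(b)) / keep
    return -lam * alpha1 + keep * trunc_mean + lam * alpha2, alpha2

def debias(theta_dp, alpha2_hat, lam, start_scale=1.0):
    """Solve the two moment equations for (theta, s) given the released quantities."""
    def equations(par):
        theta, log_s = par
        mean, alpha2 = censored_mean(theta, np.exp(log_s), lam)
        return [mean - theta_dp, alpha2 - alpha2_hat]
    theta, log_s = fsolve(equations, x0=[theta_dp, np.log(start_scale)])
    return theta, np.exp(log_s)

# Example: a private release pulled toward zero by censoring at Lambda = 1.
theta_tilde, s_hat = debias(theta_dp=0.8, alpha2_hat=0.30, lam=1.0)
```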
We use the term "approximate unbiasedness" to mean unbiasedness with respect to the private estimator (which itself may not be unbiased), rather than the true parameter. This unbiasedness is approximate for two reasons. First, most plug-in estimators, including ours, are guaranteed to be strictly unbiased only asymptotically. Second, as discussed, our estimating equations have a small error from approximating the partition distribution as normal. Our simulations below show that in finite samples the bias introduced from these two sources is negligible in practice, and considerably smaller than the bias introduced by censoring.

Variance Estimation

We now derive a procedure for computing an estimate of the variance of our estimator, V̂(θ̃^dp), without any additional privacy budget expenditure. We have the two directly estimated quantities, θ̂^dp and α̂₂^dp, and the three (deterministically post-processed) functions of these computed during bias correction: θ̃^dp, σ̂²_dp, and α̂₁^dp. We then use standard simulation methods (King, Tomz, and Wittenberg 2000): we treat the estimated quantities as random variables, bias correct to generate the others, and take the sample variance of the simulations of θ̃^dp. Thus, to represent estimation uncertainty, we draw the random quantities from a multivariate normal with plug-in parameter estimates, none of which need to be newly disclosed, and so the procedure does not use more of the privacy budget. Appendix G of the Supplementary Material provides all the derivations.
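A rough sketch of this simulation step appears below; it reuses the debias() function from the previous sketch, and the variances assigned to the two released quantities are illustrative placeholders rather than the expressions derived in Appendix G of the Supplementary Material:

```python
# Simulation-based standard error: draw the released quantities from a normal
# with plug-in parameters, re-apply the bias correction to each draw, and take
# the standard deviation of the corrected estimates.
import numpy as np

def simulation_se(theta_dp, alpha2_hat, lam, noise_var, alpha2_var,
                  n_sims=1000, seed=0):
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_sims):
        t = rng.normal(theta_dp, np.sqrt(noise_var))
        a2 = np.clip(rng.normal(alpha2_hat, np.sqrt(alpha2_var)), 1e-4, 1 - 1e-4)
        theta_tilde, _ = debias(t, a2, lam)   # debias() from the previous sketch
        draws.append(theta_tilde)
    return np.std(draws, ddof=1)
```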

SIMULATIONS

We now evaluate the finite-sample properties of our estimator. We show that while (uncorrected) differentially private point estimates are inferentially invalid, our estimators are approximately unbiased and come with accurate uncertainty estimates. In addition, in part because our bias correction uses an additional disclosed parameter estimate (α̂₂), the variance of our estimator is usually lower than the variance of the uncorrected estimator. (Simulations under alternative assumptions appear in Appendix H of the Supplementary Material; replication information is available in Evans et al. [2023].)

The results appear in Figure 3, which we discuss after first detailing the data generation process. For four different types of simulations (in separate panels of the figure), we draw data for each row i from an independent linear regression model: y_i ~ N(1 + 3x_i, 10²), with x_i ~ N(0, 7²) drawn once and fixed across simulations. Our chosen quantity of interest is the coefficient on x_i, with value θ = 3. We study the bias of the (uncorrected) differentially private estimator θ̂^dp, and our corrected version, θ̃^dp, as well as their standard errors.11

11 We have tried different parameter values, functional forms, distributions, statistical models, and quantities of interest, all of which led to similar substantive conclusions. In Figure 3, for censoring (top-left panel), we let α₁ ≈ 0, α₂ = {0.1, 0.25, 0.375, 0.5, 0.625, 0.75}, N = 100,000, P = 1,000, and ϵ = 1. For privacy (top-right panel) and standard errors (bottom-right panel), we let ϵ = {0.1, 0.15, 0.20, 0.30, 0.50, 1} while setting N = 100,000, P = 1,000, and with Λ set so that α₁ ≈ 0 and α₂ = 0.25. For sample size (bottom left), we set N = {10,000; 25,000; 50,000; 100,000; 250,000; 500,000; 1,000,000}, P = 1,000, ϵ = 1, and determine Λ so that α₁ ≈ 0 and α₂ = 0.25. We ran 1,000 Monte Carlo simulations, except for N ≤ 50,000 where we ran 4,000, and 2,000 for ϵ = 0.25. The values of ϵ reported correspond to δ = 0.01 given the Gaussian noise variance; we could instead have set δ << 1/N, in which case the corresponding ϵ would be slightly higher.
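This data generation process can be reproduced in a few lines; pairing it with the up_mechanism() and debias() sketches above conveys the flavor of these experiments, though not the exact settings reported in footnote 11:

```python
# Simulation design sketch: linear regression with y_i ~ N(1 + 3*x_i, 10^2),
# x_i ~ N(0, 7^2) drawn once, and the slope on x (theta = 3) as the quantity of interest.
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
x = rng.normal(0, 7, size=N)            # drawn once and held fixed
y = rng.normal(1 + 3 * x, 10)           # true coefficient theta = 3
data = np.column_stack([x, y])

def ols_slope(d):
    """Slope from regressing the second column on the first."""
    return np.cov(d[:, 0], d[:, 1], ddof=1)[0, 1] / d[:, 0].var(ddof=1)

# One draw of the uncorrected private estimate, reusing up_mechanism() from the
# earlier sketch (settings illustrative):
# theta_dp, alpha2_dp, sd = up_mechanism(data, ols_slope, P=1_000, lam=3.5,
#                                        epsilon=1.0, delta=1e-5)
```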
FIGURE 3. Monte Carlo Simulations: Bias of the Uncorrected (θ̂^dp) and Corrected (θ̃^dp) Estimates, and (in the Bottom-Right Panel) the Standard Error of the True Uncorrected (SE_θ̂^dp), True Corrected (SE_θ̃^dp), and Estimated Corrected (ŜE_θ̃^dp) Estimates (the Latter Two Having Almost Identical Values). [Four panels: bias against α₂ (top left), bias against ϵ (top right), bias against N (bottom left), and standard errors against ϵ (bottom right).] Note: "True" SEs refer to the actual standard deviation of a point estimate.

Begin with the top-left panel, which plots bias on the vertical axis (with zero bias indicated by a dashed horizontal line at zero, near the top) and the degree of censoring on the horizontal axis, increasing from left to right (quantified by α₂). The orange line in this panel vividly shows how statistical bias in the (uncorrected) differentially private estimator θ̂^dp sharply increases with censoring. In contrast, our (bias-corrected) estimate, in blue, θ̃^dp, is approximately unbiased regardless of the level of censoring.

The top-right and bottom-left panels also plot bias on the vertical axis, with zero bias indicated by a horizontal dashed line. The bottom-left panel shows the bias in the uncorrected estimate (in orange) for sample sizes from 10,000 to 1 million, and the top-right panel shows the same for different values of ϵ. Our corrected estimate (in blue) is approximately zero in both panels, regardless of the value of N or ϵ.

Finally, the bottom-right panel reveals that the standard error of θ̃^dp is approximately correct (i.e., equal to the true standard deviation across estimates, which can be seen because the blue and gray lines are almost on top of one another). It is even smaller for most of the range than the standard error of the uncorrected estimate θ̂^dp. (Appendix I of the Supplementary Material gives an example and intuition for how to avoid analyses where our approach does not work as expected.) These simulations suggest that θ̃^dp is to be preferred to θ̂^dp with respect to bias and variance in finite samples.

We also show in Figure 4 how our procedure performs across different partition sizes (P) for fixed ϵ, n, and Λ, using the same data generation process. With small P, both θ̃^dp and θ̂^dp are biased because our estimate of the censoring level is noisy in this case (note, however, that in simulations where we use the bag of little bootstraps to estimate the proportion of censored partitions, we perform more favorably with low P, since the asymptotics are in n rather than P). However, when P > 100, our procedure corrects the bias from censoring. Our procedure even performs well when P = 500 and so n = 20. Such a small sample size can be less optimal for other estimators and in skewed data. Analysts should consider the trade-off between noise and partition sample size when choosing P.

FIGURE 4. Performance across P (Number of Data Partitions) for Fixed Privacy Budget (ϵ = 1) and Sample Size (N = 100k). [Bias of θ̂^dp and θ̃^dp plotted against P from 0 to 500.]

EMPIRICAL EXAMPLES

We now show that the same quantities of interest to political scientists can still be accurately estimated even while guaranteeing the privacy of their respondents. We do this by replicating two important recent articles from major journals. We treat these datasets as if they were private and not accessible to researchers, except through our algorithm. We then use the algorithm to estimate the same quantities and show that we can recover the same estimates. We also quantify the costs of our approach by the size of the standard errors. See Evans et al. (2023) for replication information.

Home Ownership and Local Political Participation

We begin with Yoder (2020), a study of the effect of home ownership on participation in local politics that uses an unusually informative and diverse array of datasets the author combined via probabilistic matching. Although all the information used in this article is publicly available in separate datasets, combining datasets can be exponentially more informative about each person represented. Some people represented in data like these might well rankle at a researcher being able to easily obtain their name, address, how much of a fuss they made at various city council meetings, all the times during the last 18 years in which they voted or failed to turn out, the dollar value of their home, and which candidates and how much they contributed to each. Moreover, with this profile on any individual, it would be easy to add other variables from other sources.

FIGURE 5. Original versus Privacy-Protected Data Analyses

0.100 0.3

Effect of affirmative action


Effect of homeownership

0.075
0.2

0.050
0.1

0.025
0.0

0.000
Original Privatized Privatized Original Privatized Privatized
(no correction) (no correction)

(a) Yoder (2020) (b) Bhavnani and Lee (2019)

able to easily obtain their name, address, how much of a a private query for an estimate of the 0:6 quantile of the
fuss they made at various city council meetings, all the absolute value of the partition-level estimates of our
times during the last 18 years in which they voted or quantity of interest. We dedicate ϵ ¼ 0:2 of our privacy
failed to turn out, the dollar value of their home, and budget to this query, and the returned value was 0:07
which candidates and how much they contributed to (this means approximately 40% of partition-level statis-
each. Moreover, with this profile on any individual, it tics were greater in magnitude than 0:07). Then, for our
would be easy to add other variables from other main algorithm, we set Λ ¼ 0:075. Below, we also discuss
sources. the sensitivity of our estimator to this choice.
Yoder (2020) followed current best practices by appropriately de-identifying the data, which meant being forced to strip the replication dataset of many substantively interesting variables, hence limiting the range of discoveries other researchers can make. And yet, we now know that even these procedures do not always protect research subjects, as re-identification remains possible.

In a dataset with n = 83,580 observations, the author regresses a binary indicator for whether an individual comments at a local city council meeting on an indicator for home ownership, controlling for year and zip code fixed effects, and correcting for uncertainties in the data matching procedure. This causal estimate, which we replicated exactly, indicates that owning a home increases the probability of commenting at a city council meeting by 5 percentage points (0.05) (Yoder 2020, Table 2, Model 1). We focus on this main effect, not the large number of other statistics in the original paper, many of which are of secondary relevance to the core argument. To release all these statistics would require either a large privacy budget (and possibly an unacceptable privacy loss) or excessively large standard errors. Differential privacy hence changes the nature of research, necessitating choices about which statistics are published.

Since there is limited prior research on this question, it is difficult to make an informed choice about the parameter Λ, which defines where to censor. We therefore dedicate a small portion of the privacy budget to private quantile estimation (Smith 2011). Specifically, we make a private query for an estimate of the 0.6 quantile of the absolute value of the partition-level estimates of our quantity of interest. We dedicate ϵ = 0.2 of our privacy budget to this query, and the returned value was 0.07 (this means approximately 40% of partition-level statistics were greater in magnitude than 0.07). Then, for our main algorithm, we set Λ = 0.075. Below, we also discuss the sensitivity of our estimator to this choice.
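To make the quantile step concrete, the sketch below shows one standard way to answer such a query under differential privacy, using the exponential mechanism in the spirit of Smith (2011). It is an illustration of the idea rather than the implementation behind the numbers reported here; the function name, the clamping bounds `lower` and `upper`, and the use of NumPy are our own assumptions.

```python
import numpy as np

def dp_quantile(values, q, epsilon, lower, upper, rng=None):
    """Differentially private estimate of the q-th quantile.

    Sketch of the exponential-mechanism quantile: values are clamped to
    [lower, upper]; each gap between adjacent order statistics is scored
    by how close its rank is to q * n, and a gap is sampled with
    probability proportional to its width times exp(epsilon * score / 2).
    The reported value is drawn uniformly inside the chosen gap.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.sort(np.clip(np.asarray(values, dtype=float), lower, upper))
    n = len(x)
    edges = np.concatenate(([lower], x, [upper]))   # n + 2 edge points
    widths = np.diff(edges)                         # n + 1 candidate gaps
    score = -np.abs(np.arange(n + 1) - q * n)       # sensitivity-1 utility
    logw = np.log(np.maximum(widths, 1e-12)) + 0.5 * epsilon * score
    logw -= logw.max()                              # stabilize exponentiation
    probs = np.exp(logw)
    probs /= probs.sum()
    k = rng.choice(n + 1, p=probs)
    return rng.uniform(edges[k], edges[k + 1])
```

In the application above, the inputs would be the absolute values of the partition-level estimates with q = 0.6 and epsilon = 0.2; the returned value can then be rounded when choosing Λ.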
The main causal estimate is portrayed in Figure 5a as a dark blue dot, in the middle of a vertical line representing a 95% confidence interval. When we estimate the same quantity using our algorithm, the result is a point estimate and confidence interval portrayed in the middle of Figure 5a. As can be seen, the point estimate (in light blue) is almost the same as in the original, and the confidence interval is similar but widened somewhat, leaving essentially the same overall substantive conclusion as in the original about how home ownership increases the likelihood of commenting in a city council meeting. The wider confidence interval is the cost necessary to guarantee privacy for the research subjects. This inferential cost could be overcome, if desired, by a proportionately larger sample.12 (For comparison, we also present the biased privatized point estimate without correction on the right side of the same graph.)

12. For simplicity, we replaced the large number of zip code fixed effects in Yoder (2020) with an equivalent mean differences model. Since we randomly partition the data, we cannot guarantee that the full model with fixed effects will be estimable in every partition, since by chance certain zip codes may not appear. By de-meaning the data instead of including a fixed effect for each zip code, we retain the core logic of comparing outcome variability within a zip code, but do so in a way that guarantees the model is estimable in each partition and significantly reduces the number of parameters to be estimated. The estimated coefficient of interest and its standard error in our simplified model are identical (to three decimal points) to the author's reported estimates. For our algorithm, we use Λ = 0.06, ϵ_θ = ϵ_α = 0.5, δ = 1/N where N = 5,978, and P = 300.
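The simplification described in footnote 12 amounts to de-meaning the outcome and the treatment within each zip code before running a short regression. The sketch below illustrates that logic only; the column names, the single grouping variable, and the use of pandas and statsmodels are hypothetical and do not reproduce the original specification (which also includes year effects and corrections for matching uncertainty).

```python
import pandas as pd
import statsmodels.api as sm

def demeaned_model(df, outcome="comment", treatment="homeowner", group="zipcode"):
    """Replace group fixed effects with within-group de-meaning.

    Subtracting the group mean from the outcome and the treatment keeps
    the within-zip-code comparison while guaranteeing the model can be
    estimated inside every random partition of the data.
    """
    d = df.copy()
    for col in (outcome, treatment):
        d[col + "_dm"] = d[col] - d.groupby(group)[col].transform("mean")
    X = sm.add_constant(d[[treatment + "_dm"]])
    return sm.OLS(d[outcome + "_dm"], X).fit()
```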
FIGURE 6. Distribution of Estimate across 200 Runs of Our Algorithm, at Varying Levels of Λ [vertical axis: estimated effect of homeownership, roughly 0.025 to 0.075; horizontal axis: Λ = 0.04, 0.07, 0.1, 0.15, 0.3, and 0.5]

In Figure 6, we show how our estimate varies over 200 runs of the algorithm for different values of Λ (which dictates the amount of censoring and noise). We see a trade-off between pre-noise information, which decreases in the level of censoring, and the variance of the noise. In this example, we find that censoring approximately 25% of partitions (corresponding to Λ = 0.1) induces the lowest estimate variability. However, the results are not significantly or substantively different if we set Λ to somewhat lower (0.07) or higher (0.15) values. But if we set Λ to be very high (0.3 or 0.5), so that censoring is nearly nonexistent, then we have to add large amounts of noise. This demonstrates a key benefit of our estimator: it allows us to obtain approximately unbiased estimates in the presence of censoring and with less noise.
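For intuition about the trade-off in Figure 6, the following is a deliberately simplified sketch of a subsample-and-aggregate style release: partition the data at random, estimate the quantity of interest within each partition, censor the partition-level estimates at plus or minus Λ, and release their mean with calibrated noise. The paper's actual algorithm goes further, correcting the bias that censoring induces and privately disclosing the censoring proportions, none of which appears below; the `estimator` argument and the specific Laplace calibration are illustrative assumptions.

```python
import numpy as np

def private_partition_estimate(data, estimator, lam, epsilon, P, rng=None):
    """Simplified subsample-and-aggregate release (no bias correction).

    data: NumPy array whose rows are observations; estimator: callable
    returning the statistic of interest for a subset of rows. Each
    observation lands in exactly one random partition, so changing one
    observation moves at most one censored partition estimate by at most
    2 * lam, and the mean by 2 * lam / P; Laplace noise with scale
    2 * lam / (P * epsilon) therefore suffices for epsilon-DP by the
    standard subsample-and-aggregate argument.
    """
    rng = np.random.default_rng() if rng is None else rng
    parts = np.array_split(rng.permutation(len(data)), P)   # random partitioning
    theta = np.array([estimator(data[idx]) for idx in parts])
    theta_censored = np.clip(theta, -lam, lam)              # censor at +/- lam
    # (The censoring proportions would themselves need privatizing before release.)
    return theta_censored.mean() + rng.laplace(scale=2 * lam / (P * epsilon))
```

Smaller values of Λ censor more partition-level estimates but permit less noise, which is exactly the trade-off traced out in Figure 6.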
Effect of Affirmative Action on Bureaucratic Performance in India

We also replicate Bhavnani and Lee (2021), a study showing that affirmative action hires do not reduce bureaucratic output in India, using "unusually detailed data on the recruitment, background, and careers of India's elite bureaucracy" (5).

The key analysis in this study involves a regression of bureaucratic output on the proportion of bureaucrats who were affirmative action hires. Bureaucratic output was defined as the standardized log of the number of households that received 100 or more days of employment under MGNREGA, India's (and the world's largest) poverty program. The causal estimate, based on n = 2,047 observations, is positive and confirms the authors' hypothesis.

We easily replicate these results, which appear in dark blue on the left side of Figure 5b. As above, we now treat the data as private and accessible only through our algorithm and estimate the same quantity. The result appears in light blue in the middle. The overall substantive conclusion is essentially unchanged—indicating the absence of any evidence that affirmative action hires decrease bureaucratic output. The increase in the size of the confidence interval reflects the cost of the privacy protections we chose to apply.13

13. We used parameters Λ = 0.35, ϵ_θ = ϵ_α = 1, δ = 1/N, and P = 150. Notably, this implies that n ≈ 13 in each of the partitions, which is a relatively small sample size. This makes it a hard test for our procedure, since we rely on an asymptotic approximation. Our results show that the procedure nevertheless performs well in this dataset, but we warn against assuming that the partition-level distribution is well approximated by a normal distribution in all small sample sizes.

In both applications, our algorithm provides privacy in the form of deniability for any person who is or could be in the data: reidentification is essentially impossible, regardless of how much external information an attacker may have. It also enables scholars to produce approximately the same results and the same substantive conclusion as without privacy protections. The inferential cost of the procedure is the increase in confidence intervals and standard errors, as a function of the user-determined choice of ϵ. This cost can be compensated for by collecting a larger sample. Of course, in many situations, not paying this "cost" may mean no data access at all.

PRACTICAL SUGGESTIONS AND LIMITATIONS

Like any data analytic approach, how the methods proposed here are used in practice can be as important as their formal properties. We discuss here issues of reducing the societal risks of differential privacy, choosing ϵ, choosing Λ, and theory and practice differences. See also Appendix F of the Supplementary Material on suggestions for software design.

Reducing Differential Privacy's Societal Risks

Data access systems with differential privacy are designed to reduce privacy risks to individuals. Correcting the biases due to noise and censoring, and adding proper uncertainty estimates, greatly reduces the remaining risks to researchers and, in turn, to society.

There is, however, another risk we must tackle: consider a firm seeking public relations benefits by making data available for academics to create public good but, concerned about bad news for the firm that might come from the research, takes an excessively conservative position on the total privacy budget. In this situation, the firm would effectively be providing a big pile of useless random numbers while claiming public credit for making data available. No public good could be created, no bad news for the firm could come from the research results because all causal estimates would be approximately zero, and still the firm would benefit from great publicity.

To avoid this unacceptable situation, we quantify the statistical cost to researchers of differential privacy or, equivalently, how much information the data provider is actually providing to the scholarly community. To do this, we note that making population-level inferences in a differentially private data access system is equivalent
to an ordinary data access system with some specific proportion of the data discarded. This means we can provide an intuitive statistic, the proportion of observations effectively lost due to the privacy-protective procedures. Appendix J of the Supplementary Material formally defines and shows how to estimate this quantity, which we call L for loss.

Because L can be computed afterward from the results of our algorithm, without any additional expenditure from the privacy budget, we recommend it be regularly reported publicly by data providers or researchers using differentially private data access systems. For example, in the applications we replicate, the proportion of observations effectively lost is L = 0.36 for the study of home ownership and L = 0.43 for the study of affirmative action in India. These could have been changed by adjusting the privacy parameters ϵ, δ, and Λ in how the data were disclosed.
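Appendix J of the Supplementary Material contains the formal definition and estimator for L, which we do not reproduce here. Purely to convey the intuition that a privacy-protected estimate behaves like an ordinary estimate computed on a fraction of the data, the sketch below backs out an effective proportion lost from variance inflation; this operationalization is our own assumption, not necessarily the quantity defined in the appendix.

```python
def effective_loss(var_original, var_private):
    """Proportion of observations effectively lost to privacy protection.

    Heuristic: if estimator variance scales as 1/n, a privacy-protected
    variance of var_private corresponds to an effective sample size of
    n * var_original / var_private, so the effective loss is
    1 - var_original / var_private (clipped to [0, 1]).
    """
    return min(max(1.0 - var_original / var_private, 0.0), 1.0)
```

Under this heuristic, for example, L = 0.36 would correspond to a privatized confidence interval about 25% wider than the original.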
Choosing ϵ

From the point of view of the statistical researcher, ϵ directly influences the standard error of the quantity of interest, although with our algorithm this choice will not affect the degree of bias. Because we show above that typically $\hat{V}(\hat\theta_{dp}, \Lambda) < \hat{V}(\tilde\theta_{dp}) < \hat{V}[c(\hat\theta_{dp})]$, we can simplify and provide some intuition by writing an upper bound on the standard error $SE_{\tilde\theta_{dp}} \equiv \sqrt{\hat{V}(\tilde\theta_{dp})}$ as

$SE_{\tilde\theta_{dp}} < \sqrt{V(\hat\theta_{dp}) + S(\Lambda, \epsilon, \delta, P)^2}$.   (10)

A researcher can use this expression to judge how much of their allocation of ϵ to assign to the next run by using their prior information about the likely value of $\hat{V}(\hat\theta_{dp})$ (as they would in a power calculation), plugging in the chosen values of Λ, δ, and P, and then trying different values of ϵ. See also Hsu et al. (2014) and Abowd and Schmutte (2019) for an economic theory approach.
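In practice, a researcher might evaluate Equation 10 over a grid of candidate ϵ values before spending any budget, as in the sketch below. The noise-scale function S(Λ, ϵ, δ, P) is treated as supplied by the user (its form is defined earlier in the paper and not reproduced here), and the prior guess of $V(\hat\theta_{dp})$ plays the same role as an anticipated variance in a power calculation.

```python
import math

def se_upper_bound(var_prior, S, lam, epsilon, delta, P):
    """Upper bound on the standard error from Equation 10.

    var_prior: prior guess of V(theta_hat_dp), as in a power calculation.
    S: callable returning the noise scale S(lam, epsilon, delta, P),
       whose exact form is defined in the paper and supplied by the user.
    """
    return math.sqrt(var_prior + S(lam, epsilon, delta, P) ** 2)

# Example (my_S is the user's implementation of the paper's noise scale):
# for eps in (0.1, 0.25, 0.5, 1.0):
#     print(eps, se_upper_bound(var_prior=0.02**2, S=my_S, lam=0.06,
#                               epsilon=eps, delta=1e-5, P=300))
```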
Choosing Λ

Although our bias correction procedure makes the choice of Λ less consequential, researchers with extra knowledge should use it. In particular, reducing Λ increases the chance of censoring while reducing noise, whereas larger values reduce censoring but increase noise. This Heisenberg-like property is an intentional feature, designed to keep researchers from being able to see certain information with too much precision.

We can, however, choose, among the unbiased estimators our method produces, the one with the smallest variance. To do that, researchers should set Λ by trying to capture the point estimate of the mean. Although this cannot be done with certainty, researchers can often do this without seeing the data. For example, consider the absolute value of coefficients from any real application of logistic regression. Although technically unbounded, empirical regularities in how researchers typically scale their input variables lead to logistic regression coefficients reported in the literature rarely having absolute values above about five. Similar patterns are easy to identify across many statistical procedures. A good software interface would thus not only include appropriate defaults, but also enable users to enter asymmetric censoring intervals. Then the software, rather than the user, takes responsibility in the background for rescaling the variables as necessary.

Applied researchers are good at making choices like this as they have considerable experience with scaling variables, a task that is an essential part of most data analyses. Researchers also frequently predict the values of their quantities of interest both informally, when deciding what analysis to run next, and formally for power calculations. If the data surprise us, we will learn this because the $\hat\alpha_1^{dp}$ and $\hat\alpha_2^{dp}$ are disclosed as part of our procedure. If either quantity is more than about 60%, we recommend researchers consider adjusting Λ and rerunning their analysis (see Appendix K of the Supplementary Material) or adapting Smith's (2011) procedure to learn the value of Λ where censoring begins (see Appendix C of the Supplementary Material).
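The division of labor suggested here, in which the user states a plausible interval and the software recenters it behind the scenes, can be sketched as follows. The helper below is hypothetical rather than part of any released software: it converts a possibly asymmetric interval [a, b] supplied by the researcher into a shift and a symmetric censoring limit, so the core algorithm can continue to censor at plus or minus Λ.

```python
def symmetric_censoring(a, b):
    """Convert an asymmetric censoring interval [a, b] into (shift, lam).

    Partition-level estimates are shifted by -shift before censoring at
    +/- lam and shifted back after aggregation, so an interval such as
    [0, 5] for a logistic regression coefficient becomes shift=2.5, lam=2.5.
    """
    shift = (a + b) / 2.0
    lam = (b - a) / 2.0
    return shift, lam
```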
Implementation Choices

In the literature, differential privacy theorists tend to be conservative in setting privacy parameters and budgets; practitioners take a more lenient perspective. Both perspectives make sense: theorists analyze worst-case scenarios using mathematical certainty as the standard of proof, and are ever wary of scientific adversaries hunting for loopholes in their mechanisms. This divergence even makes sense both theoretically, because privacy bounds are much higher than expected in practice (Erlingsson et al. 2019), and empirically, because those responsible for implementing data access systems have little choice but to make some compromises in turning math into physical reality. Common implementations thus sometimes allow larger values of ϵ for each run or reset the privacy budget periodically.

We add two practical suggestions. First, although the data sharing regime can be broken by intentional attack, because re-identification from de-identified data is often possible, de-identification is still helpful in practice. It is no surprise that university Institutional Review Boards have rephrased their regulations from "de-identified" to "not readily identifiable" rather than disallowing data sharing entirely. Adding our new privacy-protective procedures to de-identified data provides further protection. Other practical steps can be prudent, such as disallowing repeated runs, with new draws of noise, of any one analysis.

Second, potential data providers and regulators should ask themselves: Are these researchers trustworthy? They almost always have been. When this fact provides insufficient reassurance, we can move to the data access regime. However, a middle ground exists by trusting researchers (perhaps along with auxiliary protections, such as data use agreements by university employers, sanctions for violations, and auditing of analyses to verify compliance). With trust, researchers
can be given full access, be allowed to run any analyses, but be required to use the algorithm proposed here to disclose data publicly. The data holder could then maintain a strict privacy budget summed over published analyses, which is far more useful for scientific research than counting every exploratory data analysis run against the budget. This plan approximates the differential privacy ideal more closely than the typical data access regime, as the privacy protections for published results are then backed by mathematical guarantees. There are theoretical risks (Dwork and Ullman 2018), but the advantages to the public good that can come from research with fewer constraints may be substantial.
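The bookkeeping for such a budget can be very simple. The sketch below tracks cumulative privacy loss under basic sequential composition and refuses any analysis that would push the total past the agreed cap; the class and its method names are illustrative only, and real deployments may prefer tighter, advanced composition accounting.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon, total_delta=0.0):
        self.total_epsilon = total_epsilon
        self.total_delta = total_delta
        self.spent_epsilon = 0.0
        self.spent_delta = 0.0

    def authorize(self, epsilon, delta=0.0):
        """Approve a query only if the summed budget stays within the cap."""
        if (self.spent_epsilon + epsilon > self.total_epsilon or
                self.spent_delta + delta > self.total_delta):
            return False
        self.spent_epsilon += epsilon
        self.spent_delta += delta
        return True
```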

Limits of Differential Privacy

Finally, differential privacy is a new, rapidly advancing technology. Off-the-shelf methods to optimally balance privacy and utility do not exist for many dataset types. Although we provide a generic method, with an approximately unbiased estimator, methods tuned to specific datasets with better properties may sometimes be feasible. Some applications of differential privacy make compromises by setting high privacy budgets, post-processing in intentionally damaging ways, or periodically resetting the budget (Domingo-Ferrer, Sánchez, and Blanco-Justicia 2021), actions that render privacy or utility guarantees more limited than the math implies. These and other shortcomings of how differential privacy has been applied pose a valuable opportunity for researchers to develop tools that reduce the utility-privacy trade-off and statistical methods of analyzing these data that are capable of answering important social science questions. For other more specific limitations of our methodology, see also Appendices F, I, and K of the Supplementary Material.

CONCLUDING REMARKS

The differential privacy literature focuses appropriately on the utility-privacy trade-off. We propose to revise the definition of "utility," so it offers value to researchers who seek to use confidential data to learn about the world, beyond inferences to the inaccessible private data. A scientific statement is not one that is necessarily correct, but one that comes with known statistical properties and an honest assessment of uncertainty. Utility to scholarly researchers involves inferential validity, the ability to make these informative scientific statements about populations. While differential privacy can guarantee privacy to individuals, researchers also need inferential validity to make a data access system safe for drawing proper scientific statements, for society using the results of that research, and for individuals whose privacy must be protected. Inferential validity without differential privacy may mean beautiful theory without data access, but differential privacy without inferential validity may result in biased substantive conclusions that mislead researchers and society at large.

Together, approaches that are differentially private and inferentially valid may begin to convince companies, governments, and others to let researchers access their unprecedented storehouses of informative data about individuals and societies. If this happens, it will generate guarantees of privacy for individuals, scholarly results for researchers, and substantial value for society at large.

SUPPLEMENTARY MATERIAL

To view supplementary material for this article, please visit https://doi.org/10.1017/S0003055422001411.

DATA AVAILABILITY STATEMENT

Research documentation and data that support the findings of this study are openly available in the APSR Dataverse at https://doi.org/10.7910/DVN/ASUFTY.

ACKNOWLEDGMENTS

Many thanks for helpful comments to Adam Breuer, Mercè Crosas, Cynthia Dwork, Max Goplerud, Ruobin Gong, Andy Guess, Chase Harrison, Kosuke Imai, Dan Kifer, Patrick Lam, Xiao-Li Meng, Solomon Messing, Nate Persily, Aaron Roth, Paul Schroeder, Adam Smith, Salil Vadhan, Sergey Yekhanin, and Xiang Zhou. Thanks also for help from the Alexander and Diviya Maguro Peer Pre-Review Program at Harvard's Institute for Quantitative Social Science.

CONFLICT OF INTEREST

The authors declare no ethical issues or conflicts of interest in this research.

ETHICAL STANDARDS

The authors affirm this research did not involve human subjects.

REFERENCES

Abowd, John M. 2018. "Staring-Down the Database Reconstruction Theorem." In Joint Statistical Meetings, Vancouver, BC. https://bit.ly/census-reid.
Abowd, John M., and Ian M. Schmutte. 2019. "An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices." American Economic Review 109 (1): 171–202.
Al Aziz, Md Momin, Reza Ghasemi, Md Waliullah, and Noman Mohammed. 2017. "Aftermath of Bustamante Attack on Genomic Beacon Service." BMC Medical Genomics 10 (2): 43–54.
Barrientos, Andrés F., Jerome Reiter, Ashwin Machanavajjhala, and Yan Chen. 2019. "Differentially Private Significance Tests for Regression Coefficients." Journal of Computational and Graphical Statistics 28 (2): 440–53.
Bhavnani, Rikhil R., and Alexander Lee. 2021. "Does Affirmative Action Worsen Bureaucratic Performance? Evidence from the Indian Administrative Service." American Journal of Political Science 65 (1): 5–20.
Blackwell, Matthew, James Honaker, and Gary King. 2017. "A Unified Approach to Measurement Error and Missing Data: Overview." Sociological Methods and Research 46 (3): 303–41.
Bun, Mark, and Thomas Steinke. 2016. "Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds." In Theory of Cryptography Conference, eds. Martin Hirt and Adam Smith, 635–58. Berlin: Springer.
Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. "The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks." 28th USENIX Security Symposium (USENIX Security 19).
Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, et al. 2020. "Extracting Training Data from Large Language Models." Preprint, arXiv:2012.07805.
Chetty, Raj, and John N. Friedman. 2019. "A Practical Method to Reduce Privacy Loss When Disclosing Statistics Based on Small Samples." Journal of Privacy and Confidentiality 9 (2): 1–23. https://doi.org/10.29012/jpc.716.
Desfontaines, Damien, and Balázs Pejó. 2019. "SoK: Differential Privacies." Preprint, arXiv:1906.01337.
Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. 2017. "Collecting Telemetry Data Privately." In Advances in Neural Information Processing Systems, 3571–80.
Domingo-Ferrer, Josep, David Sánchez, and Alberto Blanco-Justicia. 2021. "The Limits of Differential Privacy (and Its Misuse in Data Release and Machine Learning)." Communications of the ACM 64 (7): 33–5.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. "The Reusable Holdout: Preserving Validity in Adaptive Data Analysis." Science 349 (6248): 636–38.
Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. "Calibrating Noise to Sensitivity in Private Data Analysis." In Theory of Cryptography Conference, eds. Shai Halevi and Tal Rabin, 265–84. Berlin: Springer.
Dwork, Cynthia, and Aaron Roth. 2014. "The Algorithmic Foundations of Differential Privacy." Foundations and Trends in Theoretical Computer Science 9 (3–4): 211–407.
Dwork, Cynthia, and Jonathan Ullman. 2018. "The Fienberg Problem: How to Allow Human Interactive Data Analysis in the Age of Differential Privacy." Journal of Privacy and Confidentiality 8 (1): 1–10. https://doi.org/10.29012/jpc.687.
Erlingsson, Úlfar, Ilya Mironov, Ananth Raghunathan, and Shuang Song. 2019. "That Which We Call Private." Preprint, arXiv:1908.03566.
Erlingsson, Úlfar, Vasyl Pihur, and Aleksandra Korolova. 2014. "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response." In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 1054–67. https://doi.org/10.1145/2660267.2660348.
Evans, Georgina, and Gary King. 2023. "Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset." Political Analysis, 1–21. doi:10.1017/pan.2022.1.
Evans, Georgina, Gary King, Margaret Schwenzfeier, and Abhradeep Thakurta. 2023. "Replication Data for: Statistically Valid Inferences from Privacy-Protected Data." Harvard Dataverse. Dataset. https://doi.org/10.7910/DVN/ASUFTY.
Evans, Georgina, Gary King, Adam D. Smith, and Abhradeep Thakurta. Forthcoming. "Differentially Private Survey Research." American Journal of Political Science.
FPF. 2017. "Understanding Corporate Data Sharing Decisions: Practices, Challenges, and Opportunities for Sharing Corporate Data with Researchers." Technical Report, Future of Privacy Forum. https://bit.ly/fpfpriv.
Gaboardi, Marco, Hyun-Woo Lim, Ryan M. Rogers, and Salil P. Vadhan. 2016. "Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing." Proceedings of the 33rd International Conference on Machine Learning, PMLR 48: 2111–20.
Garfinkel, Simson L., John M. Abowd, and Sarah Powazek. 2018. "Issues Encountered Deploying Differential Privacy." In Proceedings of the 2018 Workshop on Privacy in the Electronic Society, 133–37. https://doi.org/10.1145/3267323.3268949.
Guess, Andrew, Jonathan Nagler, and Joshua Tucker. 2019. "Less than You Think: Prevalence and Predictors of Fake News Dissemination on Facebook." Science Advances 5 (1): eaau4586.
Henriksen-Bulmer, Jane, and Sheridan Jeary. 2016. "Re-Identification Attacks—A Systematic Literature Review." International Journal of Information Management 36 (6): 1184–92.
Hsu, Justin, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C. Pierce, and Aaron Roth. 2014. "Differential Privacy: An Economic Method for Choosing Epsilon." In 2014 IEEE 27th Computer Security Foundations Symposium, 398–410. Piscataway, NJ: IEEE.
Jayaraman, Bargav, and David Evans. 2019. "Evaluating Differentially Private Machine Learning in Practice." 28th USENIX Security Symposium (USENIX Security 19).
Karwa, Vishesh, and Salil Vadhan. 2017. "Finite Sample Differentially Private Confidence Intervals." Preprint, arXiv:1711.03908.
King, Gary. 1995. "Replication, Replication." PS: Political Science and Politics 28 (3): 443–99.
King, Gary, Robert O. Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press.
King, Gary, and Nathaniel Persily. 2020. "A New Model for Industry–Academic Partnerships." PS: Political Science and Politics 53 (4): 703–9.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. "Making the Most of Statistical Analyses: Improving Interpretation and Presentation." American Journal of Political Science 44 (2): 341–55.
Kleiner, Ariel, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. 2014. "A Scalable Bootstrap for Massive Data." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (4): 795–816.
Li, Kathleen T. 2020. "Statistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods." Journal of the American Statistical Association 115 (532): 2068–83.
Mironov, Ilya. 2017. "Rényi Differential Privacy." In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), 263–75. Piscataway, NJ: IEEE.
Mohan, Prashanth, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David Culler. 2012. "GUPT: Privacy Preserving Data Analysis Made Easy." In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 349–60. https://doi.org/10.1145/2213836.2213876.
Monogan, James E. 2015. "Research Preregistration in Political Science: The Case, Counterarguments, and a Response to Critiques." PS: Political Science and Politics 48 (3): 425–9.
Nissim, Kobbi, Sofya Raskhodnikova, and Adam Smith. 2007. "Smooth Sensitivity and Sampling in Private Data Analysis." In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 75–84. https://doi.org/10.1145/1250790.1250803.
Reiter, Jerome P. 2012. "Statistical Approaches to Protecting Confidentiality for Microdata and Their Effects on the Quality of Statistical Inferences." Public Opinion Quarterly 76 (1): 163–81.
Robbin, Alice. 2001. "The Loss of Personal Privacy and Its Consequences for Social Research." Journal of Government Information 28 (5): 493–527.
Roberts, Margaret E. 2018. Censored: Distraction and Diversion Inside China's Great Firewall. Princeton, NJ: Princeton University Press.
Sheffet, Or. 2017. "Differentially Private Ordinary Least Squares." In Proceedings of the 34th International Conference on Machine Learning, 70: 3105–14.
Smith, Adam. 2011. "Privacy-Preserving Statistical Estimation with Optimal Convergence Rates." In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, 813–22. https://doi.org/10.1145/1993636.1993743.
Stefanski, Len A. 2000. "Measurement Error Models." Journal of the American Statistical Association 95 (452): 1353–58.
Sweeney, Latanya. 1997. "Weaving Technology and Policy Together to Maintain Confidentiality." The Journal of Law, Medicine and Ethics 25 (2–3): 98–110.
Tang, Jun, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. 2017. "Privacy Loss in Apple's Implementation of Differential Privacy on MacOS 10.12." Preprint, arXiv:1709.02753.
Vadhan, Salil. 2017. "The Complexity of Differential Privacy." In Tutorials on the Foundations of Cryptography, ed. Yehuda Lindell, 347–450. Cham, CH: Springer.
Wang, Yue, Daniel Kifer, and Jaewoo Lee. 2018. "Differentially Private Confidence Intervals for Empirical Risk Minimization." Preprint, arXiv:1804.03794.
Wang, Yue, Jaewoo Lee, and Daniel Kifer. 2015. "Differentially Private Hypothesis Testing, Revisited." Preprint, arXiv:1511.03376.
Wasserman, Larry. 2006. All of Nonparametric Statistics. New York: Springer Science & Business Media.
Wasserman, Larry. 2012. "Minimaxity, Statistical Thinking and Differential Privacy." Journal of Privacy and Confidentiality 4 (1): 51–63.
Williams, Oliver, and Frank McSherry. 2010. "Probabilistic Inference and Differential Privacy." In Proceedings of the 23rd International Conference on Neural Information Processing Systems 2: 2451–59.
Wilson, Royce J., Celia Yuxin Zhang, William Lam, Damien Desfontaines, Daniel Simmons-Marengo, and Bryant Gipson. 2019. "Differentially Private SQL with Bounded User Contribution." Preprint, arXiv:1909.01917.
Winship, Christopher, and Robert D. Mare. 1992. "Models for Sample Selection Bias." Annual Review of Sociology 18: 327–50.
Wood, Alexandra, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, James Honaker, Kobbi Nissim, David R. O'Brien, Thomas Steinke, and Salil Vadhan. 2018. "Differential Privacy: A Primer for a Non-Technical Audience." Vanderbilt Journal of Entertainment and Technology Law 21 (1): 209–75.
Yoder, Jesse. 2020. "Does Property Ownership Lead to Participation in Local Politics? Evidence from Property Records and Meeting Minutes." American Political Science Review 114 (4): 1213–29.