
American Economic Review: Papers & Proceedings 2015, 105(5): 471–475

http://dx.doi.org/10.1257/aer.p20151016

Learning from Experiments when Context Matters†

By Lant Pritchett and Justin Sandefur*

* Pritchett: Harvard Kennedy School, 79 JFK Street, Cambridge, MA 02138, and Center for Global Development (e-mail: lpritchett@cgdev.org); Sandefur: Center for Global Development, 2055 L Street NW, Washington, DC 20036 (e-mail: jsandefur@cgdev.org).
† Go to http://dx.doi.org/10.1257/aer.p20151016 to visit the article page for additional materials and author disclosure statement(s).

Suppose a policymaker in context j is contemplating expanding some social program in hopes of raising welfare as measured by Y. She confronts two types of evidence. First, researchers have used observational data from a representative sample of eligible individuals, i, some of whom self-selected into the program, to estimate the effect of program participation, T, using the following regression:

(1)   Y_{ij} = \mu_j + \beta_j T_{ij} + \eta_{ij}

The OLS estimates of β_j clearly suffer potential bias. Second, the policymaker reviews available experimental studies of similar programs in other contexts (β_k for k ≠ j) that use random assignment to overcome this bias.

This is a stylized description of many real-world decisions, in which policymakers must weigh observational data from the right context against experimental evidence from a different program and context. This paper attempts to measure the trade-off between internal and external validity empirically, using examples from development economics.

The root mean squared error (RMSE) of the prediction of program impact encompasses threats to both internal and external validity. We calculate the RMSE for observational estimates from the relevant context by comparing them to the experimental estimate from within the same study:

(2)   \mathrm{RMSE}(\hat{\beta}_j^o) = \sqrt{\mathrm{Var}(\hat{\beta}_j^o) + (\hat{\beta}_j^o - \hat{\beta}_j^e)^2}

Superscripts o and e denote parameters estimated with observational and experimental variation, respectively. The first term of the RMSE reflects sampling error in the observational estimate, while the second term reflects selection bias from OLS, taking the experimental estimate in context j as the truth.

Equation (3) is the RMSE of the prediction errors a policymaker in context j would make if she relied on experimental evidence from other contexts k ≠ j:

(3)   \mathrm{RMSE}(\hat{\beta}^{e \neq j}) = \sqrt{\mathrm{Var}(\hat{\beta}^{e \neq j}) + (\hat{\beta}^{e \neq j} - \hat{\beta}_j^e)^2}

Again the experimental estimate of impact in context j is our benchmark, and the RMSE consists of both sampling variance and parameter heterogeneity when comparing to experiments in one or more other contexts.

The next section illustrates these calculations for the case of microcredit, showing that the RMSE is lower using nonexperimental estimates within context than using a few experimental estimates from other contexts. Section II builds on Pritchett and Sandefur (2014), which shows similar empirical results, and discusses their broader relevance to development policymaking, where the heterogeneity of true impact across contexts and program design is large and experimental evidence is sparse.
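
To make the two error measures concrete, here is a minimal sketch in Python of equations (2) and (3), treating the squared standard error as the estimate of Var(β̂); the numerical values are illustrative placeholders, not results from the paper.

```python
import math

def rmse(candidate, se, benchmark):
    """RMSE of predicting the benchmark (the within-context experimental
    estimate) with a candidate estimate: the square root of sampling
    variance plus squared bias, as in equations (2) and (3)."""
    return math.sqrt(se**2 + (candidate - benchmark)**2)

# Illustrative numbers only (standardized effect sizes):
beta_o, se_o = 0.19, 0.08       # observational estimate in context j
beta_e_j = 0.18                 # experimental benchmark in context j
beta_e_k, se_e_k = -0.01, 0.15  # experimental estimate from context k != j

print(rmse(beta_o, se_o, beta_e_j))      # eq. (2): within-context OLS
print(rmse(beta_e_k, se_e_k, beta_e_j))  # eq. (3): cross-context experiment
```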

I.  Case Study: Microcredit

A recent crop of experimental studies on the impact of microcredit, summarized by Banerjee, Karlan, and Zinman (2015), provides a rare opportunity within development economics to compare cleanly identified treatment effects from broadly similar interventions across very disparate contexts. While the studies were not coordinated ex ante, the authors have conformed ex post to parallel reporting formats to facilitate direct comparison of the results. All six studies have transparently released their survey data and programs, making it possible to compute nonexperimental estimates with the same data.

We compare impacts on two outcome variables that feature prominently in all or most of the studies and are measured fairly consistently: business profits and household consumption. Profits are measured in all six studies,[1] and household consumption in five. To aid comparability, we standardize both dependent variables using the mean and standard deviation of the control group.[2]

Our nonexperimental measure of the treatment effect is taken from an OLS regression as shown in equation (1), with one significant modification. We control for the random assignment variable, Z_i, so that our estimate of β^o "naïvely" compares people who did or did not self-select into microcredit within the treatment group (i.e., within the group invited or encouraged to participate for experimental purposes), and likewise within the control group.[3]

Estimated impacts from this modified version of equation (1) are modest but varied (Table 1, columns 1 and 3). For business profits, effect sizes range from a positive and statistically significant 0.2 standard deviations in Bosnia and Morocco to a very small, but negative and statistically significant, effect in Mongolia. For consumption, the range of effects is even wider (−0.4 to 0.2 s.d.).

As our experimental benchmark, we estimate the effect of treatment on the treated (TOT) by instrumenting T_i in equation (1) with the random assignment variable, Z_i.[4] We focus on the TOT, rather than the intention-to-treat (ITT) effect reported in the original studies, even though this requires additional assumptions not imposed by the original authors, because it is conceptually closest to the treatment effect produced using observational data under the assumption of ignorability of treatment. In both cases, we attempt to measure the effect of treatment for people who self-select into treatment.

IV estimates of equation (1) for each study are reported in columns 2 and 4 of Table 1. Due to modest take-up rates in many studies, confidence intervals on the experimental IV estimates are wider than those on the observational estimates. For business profits, effects range from positive and statistically significant in Morocco (0.3 s.d.) to a small, negative, and insignificant effect in Mongolia (−0.01 s.d.). For consumption, the only statistically significant effect is negative (−0.2 s.d. in Bosnia). The highest estimate is huge (0.65 s.d. in Mongolia) but very imprecisely estimated.

Calculations based on equation (2) show that, on average, observational analysis produces estimates with an RMSE of 0.18 standard deviations for profit and 0.36 for consumption (Table 1, bottom panel). This bias is not trivial, since most of the estimated experimental effects are within 0.1 standard deviation of zero.

However, the RMSE based on equation (3), from using an experimental estimate from another context, is in most cases even greater than the bias afflicting observational data analysis. The reliability of the experimental evidence improves as more experiments accumulate, both because of a decline in sampling error and because meta-analysis averages out the contextual variation in parameters. With only one experiment from another context, the RMSE for profit is 0.33 standard deviations and for consumption, 0.49. Contextual variation is, in this sense, much larger than selection bias. For the profit variable, the observational data analysis continues to outperform experimental estimates even when averaging over five experimental estimates. For consumption, however, once three other experiments are available, policymakers would incur a smaller RMSE by relying on the experimental literature than on observational data within context.

[1] Where multiple measures are available, we use the most comprehensive measure. Unlike the other studies, profits in the Bosnia and Herzegovina study (Augsburg et al. 2015) refer to the client's main business only.
[2] Of the six studies, only one (Attanasio et al. 2015) uses a logarithmic transformation of the consumption variable in the original specification. For comparability, we exponentiate this variable before standardizing.
[3] All of the regression specifications described below include the same set of exogenous covariates employed in the original studies. We began by replicating the intention-to-treat (ITT) estimates for profit (Table 3 in each study) and consumption (Table 6). For each study, T is defined as a binary indicator of participation in the specific microcredit program under experimental evaluation.
[4] For simplicity, our notation ignores the clustered nature of most of the experiments, though we adjust standard errors accordingly where appropriate.
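
A minimal sketch of the two estimators described above (the "naïve" OLS that conditions on random assignment, and the IV/TOT benchmark) might look as follows. This is our illustration rather than the authors' replication code: the file name, covariate names, and column labels are hypothetical, and the 2SLS step uses the linearmodels package.

```python
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS

# Hypothetical microdata for one study: raw outcome y_raw, self-selected
# participation T, random assignment Z, and exogenous covariates x1, x2.
df = pd.read_csv("study_microdata.csv")

# Standardize the outcome with the control group's mean and s.d.
ctrl = df.loc[df["Z"] == 0, "y_raw"]
df["y"] = (df["y_raw"] - ctrl.mean()) / ctrl.std()

# "Naive" observational estimate (equation (1) plus a control for Z):
# OLS compares self-selected participants and non-participants within
# the encouraged group and within the control group.
exog = sm.add_constant(df[["T", "Z", "x1", "x2"]])
beta_o = sm.OLS(df["y"], exog).fit().params["T"]

# Experimental benchmark: treatment-on-the-treated via 2SLS,
# instrumenting participation T with random assignment Z.
iv_fit = IV2SLS(df["y"], sm.add_constant(df[["x1", "x2"]]),
                df["T"], df["Z"]).fit()  # cov_type="clustered" where needed
beta_e = iv_fit.params["T"]

print(beta_o, beta_e)
```

The paper's own estimates additionally use each study's full covariate set and adjust standard errors for clustering (see footnotes [3] and [4]).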

Table 1—Treatment Effects From Microcredit: Within- and Between-Study Comparisons

                                                  Profit                     Consumption
                                        Observational  Experimental  Observational  Experimental
                                             (1)            (2)           (3)            (4)

Bosnia, Augsburg et al. (2015) data        0.193**        0.179        −0.0368        −0.233**
                                          (0.0793)      (0.144)       (0.0690)       (0.119)
Ethiopia, Tarozzi, Desai, and             −0.0280        0.510
  Johnson (2015) data                     (0.0502)      (0.429)
India, Banerjee, Karlan, and               0.0415        0.262        −0.0174         0.0515
  Zinman (2015) data                      (0.0295)      (0.235)       (0.0367)       (0.301)
Mexico, Angelucci, Karlan, and             0.0845*       0.000583      0.0113        −0.0376
  Zinman (2015) data                      (0.0429)      (0.198)       (0.0172)       (0.210)
Mongolia, Attanasio et al. (2015) data    −0.0317***    −0.0105       −0.372          0.654
                                          (0.00764)     (0.0119)      (0.381)        (0.513)
Morocco, Crépon et al. (2015) data         0.194**       0.281*        0.199**       −0.140
                                          (0.0895)      (0.169)       (0.0932)       (0.142)

Within-study RMSE based on
  Nonexperimental data                         0.18                        0.36
Between-study RMSE based on
  One other experiment                         0.33                        0.49
  Four other experiments                       0.22                        0.32

Notes: Each coefficient reports a separate treatment effect from a separate regression, with the dependent variable listed in the first row and the data source listed in the first column. All results are based on reanalysis of microdata from the original studies listed, using publicly available replication files provided by the authors. The TOT effects reported here correspond most closely to the ITT estimates from Tables 3 and 6 in each of the original studies.
*** Significant at the 1 percent level.
 ** Significant at the 5 percent level.
  * Significant at the 10 percent level.
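
As a rough check on how the summary rows of Table 1 can be built from the coefficients above, here is a sketch for the profit columns. It assumes Var(β̂) is the squared standard error and that the reported figures are simple averages of per-comparison RMSEs; the paper does not spell out the averaging here, so exact values may differ slightly.

```python
import math

# (estimate, standard error) pairs transcribed from Table 1, profit:
obs = {"Bosnia": (0.193, 0.0793), "Ethiopia": (-0.0280, 0.0502),
       "India": (0.0415, 0.0295), "Mexico": (0.0845, 0.0429),
       "Mongolia": (-0.0317, 0.00764), "Morocco": (0.194, 0.0895)}
exp = {"Bosnia": (0.179, 0.144), "Ethiopia": (0.510, 0.429),
       "India": (0.262, 0.235), "Mexico": (0.000583, 0.198),
       "Mongolia": (-0.0105, 0.0119), "Morocco": (0.281, 0.169)}

def rmse(est, se, benchmark):
    return math.sqrt(se**2 + (est - benchmark)**2)

# Within-study RMSE (eq. 2): observational estimate vs. the experimental
# benchmark from the same context, averaged over the six studies.
within = [rmse(b, se, exp[j][0]) for j, (b, se) in obs.items()]
print(round(sum(within) / len(within), 2))    # approx. 0.18

# Between-study RMSE with one other experiment (eq. 3): context k's
# experiment used to predict context j's, averaged over ordered pairs.
between = [rmse(*exp[k], exp[j][0])
           for j in exp for k in exp if k != j]
print(round(sum(between) / len(between), 2))  # approx. 0.33
```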

II.  General Discussion

We argue here and in Pritchett and Sandefur (2014) that this failure of internally valid evidence of program impact in one context to improve on predictive accuracy over simple nonexperimental estimates in a different context is not specific to microcredit, but is likely a generic (if not universal) feature of development policies and programs.

First, the empirical heterogeneity across contexts in nonexperimental estimates of treatment effects of broad classes of interventions in development economics is large. For instance, suppose one wanted to estimate the wage gain to movers from a marginal relaxation of barriers to labor mobility in rich countries. Nonexperimental estimates from Clemens, Montenegro, and Pritchett (2008) imply an average wage gain of 1.48 log points (287 percent), encompassing an enormously broad range from 0.7 to 2.7. Similarly, the average OLS estimate of the Mincer coefficient for a year of secondary schooling for males across 103 countries is 6.8 percent (median 5.5 percent), but the inter-quartile range is 5.0 (Montenegro and Patrinos 2014).

Second, in the presence of such large variability in nonexperimental estimates, external validity of all the relevant behavioral parameters is impossible and external validity of treatment effects is improbable. For migration, the only experimental evidence of the gains from a program promoting low-skill migration comes from McKenzie, Stillman, and Gibson (2010), who exploit a lottery rationing access to New Zealand agricultural jobs for Tongan workers. The estimated log wage differential using observational data is 1.64, while the lottery produces an estimate of the wage gain of 1.36 log points—hence β̂^e = 1.36, and the selection bias, which we define as ω ≡ β̂^o − β̂^e, is equal to 1.64 − 1.36 = 0.28. One cannot claim external validity for both the experimental β and the ω, as they have to add up to reproduce the variability of the observational estimates.
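
One way to make this adding-up argument explicit, in our notation (a sketch, not a derivation from the paper): in each context j the observational estimate decomposes into the experimental parameter plus selection bias, so dispersion in observational estimates across contexts must show up in the experimental parameters, in the selection biases, or in both.

```latex
\hat{\beta}^{\,o}_{j} = \hat{\beta}^{\,e}_{j} + \omega_{j}
\quad\Longrightarrow\quad
\operatorname{Var}_{j}\!\left(\hat{\beta}^{\,o}_{j}\right)
  = \operatorname{Var}_{j}\!\left(\hat{\beta}^{\,e}_{j}\right)
  + \operatorname{Var}_{j}\!\left(\omega_{j}\right)
  + 2\,\operatorname{Cov}_{j}\!\left(\hat{\beta}^{\,e}_{j},\,\omega_{j}\right)
```

If both β^e and ω were externally valid, that is, constant across contexts j, the left-hand side would be zero, which the wide dispersion of observational estimates (0.7 to 2.7 log points in the migration example) rules out.
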
Third, we don't know what "context" means. Parameters that are known with engineering precision, like the boiling point of water or the tensile strength of steel, are subject to context in the sense of a denumerable and hopefully short list of interacting variables: air pressure on boiling points. Social programs, in contrast, are embedded in contexts which encompass a long list of unknown factors which interact in often unknown ways. Pritchett and Sandefur (2014) note that experimental evidence shows lower class sizes produce greater student learning in the United States (Krueger 1999)—for some ages, for some students, in one state, in some subjects. But there is also good experimental evidence that class size has next to no impact in Kenya and India (Banerjee et al. 2007; Duflo, Dupas, and Kremer 2012). This lack of external validity may be explained by context, but some theory is required to know what context means. Teacher quality? System accountability? Subject matter? Student background?

Fourth, we doubt the construct validity (Shadish, Cook, and Campbell 2001) of program classes like "microcredit," "pay for performance," "information campaigns," "conditional cash transfers," or "expanding contraceptive access" for comparing program or policy interventions (Vivalt 2014). In the course of design and implementation, programs have to make choices from a high-dimensional space of design attributes. A microcredit program, for instance, must choose intended beneficiary borrowers, interest rates, repayment frequencies, group or individual liability, and how to hire loan officers.[5] A program is just an instance of a class. What one can infer about a class from an evaluation of a few instances depends on the dimensionality of the design space and the shape of the impact function over the design space.

For instance, there is evidence that NGO versus government implementation is a key element of the design space for social programs in the developing world—as demonstrated by the contrasting performance of parallel NGO- and government-led experiments in Bold et al. (2013) for Kenyan education; by the contrast in results between Duflo, Hanna, and Ryan (2012), Banerjee, Duflo, and Glennerster (2008), and Dhaliwal and Hanna (2014) from attendance monitoring experiments in Indian NGO and civil service contexts; and as confirmed by meta-analysis in Vivalt (2014). Denizer, Kaufmann, and Kraay (2013) review the success and failure of World Bank projects and find that the quality of the task manager assigned to the project has as much impact on project success as country or project characteristics. If the name of the project manager is a contextual variable that explains program success, what does this say about the external validity of program evaluations?

External versus internal validity is an empirical question. Allcott (2012) compares observational and experimental analysis of energy conservation programs and finds internal validity is a much greater concern than external validity. But Allcott (2012) compares identical treatments across US municipalities, while development policymakers confront evidence from broad classes of policy interventions (e.g., microcredit) across contexts that differ in income by an order of magnitude, and with big differences in social, institutional, political, and infrastructure conditions.

[5] Banerjee, Karlan, and Zinman (2015) implicitly make this point by avoiding any calculation of an average effect across the six microcredit studies reviewed above, while emphasizing the wide variations in program design across the studies.

III.  Conclusion

We analyze the trade-off between internal and external validity faced by a hypothetical policymaker weighing experimental and nonexperimental evidence. Empirically, we find that for several prominent questions in development economics, relying on observational data analysis from within context produces treatment effect estimates with lower mean squared error than relying on experimental estimates from another context. Our results suggest that as policymakers draw lessons from experimental impact evaluations, they would do well to focus attention on heterogeneity in program design, context, and impacts, and may learn little from meta-analyses or "systematic reviews" that focus exclusively on rigorous estimates of average effects for broad classes of interventions.

REFERENCES

Allcott, Hunt. 2012. "Site Selection Bias in Program Evaluation." National Bureau of Economic Research Working Paper 18373.

Angelucci, Manuela, Dean Karlan, and Jonathan Zinman. 2015. "Microcredit Impacts: Evidence from a Randomized Microcredit Program Placement Experiment by Compartamos Banco." American Economic Journal: Applied Economics 7 (1): 151–82.
Attanasio, Orazio, Britta Augsburg, Ralph De Haas, Emla Fitzsimons, and Heike Harmgart. 2015. "The Impacts of Microfinance: Evidence from Joint-Liability Lending in Mongolia." American Economic Journal: Applied Economics 7 (1): 90–122.

Augsburg, Britta, Ralph De Haas, Heike Harmgart, and Costas Meghir. 2015. "The Impacts of Microcredit: Evidence from Bosnia and Herzegovina." American Economic Journal: Applied Economics 7 (1): 183–203.

Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. "Remedying Education: Evidence from Two Randomized Experiments in India." Quarterly Journal of Economics 122 (3): 1235–64.

Banerjee, Abhijit V., Esther Duflo, and Rachel Glennerster. 2008. "Putting a Band-Aid on a Corpse: Incentives for Nurses in the Indian Public Health Care System." Journal of the European Economic Association 6 (2–3): 487–500.

Banerjee, Abhijit, Dean Karlan, and Jonathan Zinman. 2015. "Six Randomized Evaluations of Microcredit: Introduction and Further Steps." American Economic Journal: Applied Economics 7 (1): 1–21.

Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng'ang'a, and Justin Sandefur. 2013. "Scaling-up What Works: Experimental Evidence on External Validity in Kenyan Education." Center for Global Development Working Paper 321.

Clemens, Michael A., Claudio E. Montenegro, and Lant Pritchett. 2008. "The Place Premium: Wage Differences for Identical Workers Across the US Border." The World Bank, Policy Research Working Paper 4671.

Crépon, Bruno, Florencia Devoto, Esther Duflo, and William Parienté. 2015. "Estimating the Impact of Microcredit on Those Who Take It Up: Evidence from a Randomized Experiment in Morocco." American Economic Journal: Applied Economics 7 (1): 123–50.

Denizer, Cevdet, Daniel Kaufmann, and Aart Kraay. 2013. "Good Countries or Good Projects? Macro and Micro Correlates of World Bank Project Performance." Journal of Development Economics 105: 288–302.

Dhaliwal, Iqbal, and Rema Hanna. 2014. "Deal with the Devil: The Successes and Limitations of Bureaucratic Reform in India." National Bureau of Economic Research Working Paper 20482.

Duflo, Esther, Pascaline Dupas, and Michael Kremer. 2012. "School Governance, Teacher Incentives, and Pupil-Teacher Ratios: Experimental Evidence from Kenyan Primary Schools." National Bureau of Economic Research Working Paper 17939.

Duflo, Esther, Rema Hanna, and Stephen P. Ryan. 2012. "Incentives Work: Getting Teachers to Come to School." American Economic Review 102 (4): 1241–78.

Krueger, Alan B. 1999. "Experimental Estimates of Education Production Functions." Quarterly Journal of Economics 114 (2): 497–532.

McKenzie, David, Steven Stillman, and John Gibson. 2010. "How Important Is Selection? Experimental vs. Non-experimental Measures of the Income Gains from Migration." Journal of the European Economic Association 8 (4): 913–45.

Montenegro, Claudio E., and Harry Anthony Patrinos. 2014. "Comparable Estimates of Returns to Schooling Around the World." The World Bank, Policy Research Working Paper 7020.

Pritchett, Lant, and Justin Sandefur. 2014. "Context Matters for Size: Why External Validity Claims and Development Practice Do Not Mix." Journal of Globalization and Development 4 (2): 161–97.

Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2001. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Wadsworth Cengage Learning.

Tarozzi, Alessandro, Jaikishan Desai, and Kristin Johnson. 2015. "The Impacts of Microcredit: Evidence from Ethiopia." American Economic Journal: Applied Economics 7 (1): 54–89.

Vivalt, Eva. 2014. "How Much Can We Generalize from Impact Evaluations." http://evavivalt.com/wp-content/uploads/2014/12/Vivalt-JMP-latest.pdf (accessed March 12, 2015).