The ANNALS of the American Academy of Political and Social Science
http://ann.sagepub.com

Misleading Evidence and Evidence-Led Policy: Making Social Science more Experimental
Lawrence W. Sherman
The ANNALS of the American Academy of Political and Social Science 2003; 589; 6
DOI: 10.1177/0002716203256266

The online version of this article can be found at:


http://ann.sagepub.com/cgi/content/abstract/589/1/6



PREFACE

Misleading Evidence and Evidence-Led Policy: Making Social Science More Experimental

By LAWRENCE W. SHERMAN

Lawrence W. Sherman is the Albert M. Greenfield Professor of Human Relations at the University of Pennsylvania, where he is also director of the Fels Institute of Government and its Jerry Lee Center of Criminology. His research focuses on causes and effects of governmental and civil society strategies to obtain public compliance with law.

Increasing demands by government for “evidence-led” policy raise the risk that research evidence will mislead government rather than leading to an unbiased conclusion. The need for unbiased research conclusions has never been greater, yet few consumers of research understand the statistical biases with which science must always struggle. This article introduces the volume’s discussion of those issues with an explanation of the major threats of bias in social science research and a map of the differing scientific opinions on how to deal with those threats. The thesis of the volume is that many of these threats could be reduced by making social science more experimental. The fact that even experimental evidence contains threats of bias does not alter that claim but merely suggests another: that educated consumers of social science may be the best defense against misleading evidence of all kinds.

The political fault lines of twenty-first-century social science resemble a map of Reformation Europe: one big division, with countless subdivisions, around obscure issues that inflame great passions. The big division today is between social scientists who define their “client” as public policy makers—including voters—versus those who do not. Further divisions abound on each side of this chasm, none of which are consistent with the conventional boundaries of Left versus Right ideological camps. Political debates increasingly feature discussion of “evidence” of the consequences of public—and especially domestic—policy choices. Whether such evidence leads or misleads policy decisions may depend on intelligent consumers understanding the logical distinctions rather than ideological divisions in the methods and epistemologies of contemporary social science.

This volume offers partial guidance to citizens and officials in pursuit of “evidence-led”
policy. But it does so out of great concern about the risk of “evidence-misled” pol-
icy. That concern brought together a group of distinguished British and American
social scientists for a symposium at Nuffield College, Oxford University’s only col-
lege devoted to social science, in June 2002 (Nuffield College annual report 2002,
3-5). The event was convened in honor of a distinguished police executive, Sir
Charles Pollard, who helped to promote government funding of the first controlled
test of crime reduction programs in England in almost three decades (see
Farrington 2003 [this issue]). As a visiting fellow of Nuffield College from 1992 to
2000, Charles Pollard had heard a great deal about the risks of misleading evidence
and evidence-led policy. And as a government official, he was able to advocate
better social science in the service of better policy making.
The Pollard Symposium on Randomized Controlled Trials in Social Science was
cosponsored by the American Academy of Political and Social Science, which has
more than a century of experience in bringing evidence to policy, and by two orga-
nizations born in the present century. One is the Jerry Lee Center of Criminology
at the University of Pennsylvania, founded by a corporate executive in an industry
(broadcasting) that is highly dependent on good social science research. The other
organization is the International Campbell Collaboration, founded by scholars
from twelve countries in honor of the late Donald T. Campbell, a social psycholo-
gist world-renowned for his treatises on how to evaluate government programs in
an unbiased way (see, e.g., Campbell and Stanley 1963).
The Pollard Symposium came together in an era in which the idea of “evidence-
led policy” is rapidly spreading around the globe. The political use of this term,
which reached the United States in George W. Bush’s program for educational
reform, apparently began in England in the 1990s, shortly after the concept of
“evidence-based medicine” was developed at Oxford. In both medicine and poli-
tics, the definition of “evidence” has little to do with courtroom discussions of fin-
gerprints, eyewitness testimony, or DNA. The term refers instead to its common
usage in science to distinguish data from theory, where evidence is defined as
“facts . . . in support of a conclusion, statement or belief” (Shorter Oxford English
dictionary 2002).
The concept of evidence-based medicine was itself somewhat shocking, since
many consumers of medicine assumed that most medical practice was based on
solid evidence of what works. Yet in medicine as in government, much of what is
done proceeds from theory, conjecture, and untested new ideas. A 1983 study by
the U.S. Office of Technology Assessment estimated that 85 percent of everyday
medical treatments had never been scientifically tested (as cited in Millenson
1997, 4). Sir Iain Chalmers, a contributor to this volume who was knighted for his
work on systematic evaluation of medical practices, now estimates that the evi-
dence on medical practices is far more extensive but still far from complete. What
remains a challenge in medicine is to get practice to conform more with best evi-
dence on what works. What remains a double challenge in government is both to
produce more evidence and to make better use of what evidence there is.
In the judgment of most of the scholars attending the Nuffield event, social sci-
ence is failing government on both challenges. Too much social science evidence
may mislead policy with statistically biased conclusions due to weak research
designs, and too little social science evidence is presented in a way that allows gov-
ernment to assess the risk of bias. The two problems are related. If governments
cannot rely on social science evidence as unbiased, they have less reason to invest in
producing more evidence.
The extent to which evidence guides practice may depend heavily on whether
the consumers of evidence trust the social scientists who produce it and the meth-
ods they use. Such trust is unlikely to arise from a consensus among social scien-
tists. The divisions within social science are so strong that few would say, “Trust us,
we’re social scientists.” It is more likely that social scientists will draw strong dis-
tinctions between different scientific methods and, ultimately, different schools of
thought. This volume attempts to inform social science consumers—and even pro-
ducers—about such distinctions, with the goal of clarifying their logical basis and
their consequences for evidence-led policy.
A useful starting point for the volume is a short glossary of key terms that
describe the differences in methods social scientists use to generate “facts . . . in
support of a conclusion, statement or belief.” While any such glossary must sim-
plify many complexities in these distinctions, the progress of the social sciences
may depend on the clarity that may develop from such attempts to economize on
words.

Leading and Misleading Evidence: A Short Glossary

The most widely misunderstood distinction of social science method is found
between qualitative and quantitative evidence, although it may be the least impor-
tant distinction for evidence-led policy. Qualitative evidence is commonly drawn
from interviews, direct observations of human behavior, reviews of governmental
records, transcripts and archives, and focus group discussions. Yet quantitative evi-
dence is also produced from exactly the same sources. The difference between the
two approaches is found primarily in the method of analysis and less in the kinds of
raw data collected: quantitative analysis by definition employs some form of count-
ing, while qualitative analysis may be restricted to words. While any conclusion
limited to a single case is anecdotal evidence, both qualitative and quantitative anal-
ysis can be founded in methods of drawing samples that reduce bias and support
more general conclusions. Both forms of analysis, done properly, can be extremely
informative to government, and the conflict in logic between the two is highly exag-
gerated. The analytic methods are complementary rather than competitive, a claim
supported by the growing number of social scientists who use both forms of analy-
sis in their work (Massey and Denton 1993).
Far more important for the logic of science is the distinction between systematic
versus unsystematic methods of data collection and analysis. Both quantitative and
qualitative research can be conducted in unsystematic ways, producing misleading
evidence about the conclusions the study reaches concerning the sample of data
collected and misleading claims about how far the conclusions may be generalized
to other populations. Quantitative social scientists such as Thomas Cook, a contrib-
utor to this volume and a widely cited methodologist, commonly refer to these twin
concerns, respectively, as threats to internal validity about the conclusions drawn
within the sample and external validity of the conclusions to be generalized outside
the sample (e.g., Cook and Campbell 1979). An unsystematic method of drawing a
sample—such as asking CNN viewers to respond by e-mail to an opinion poll—
offers little internal validity as a reflection of what CNN viewers think, let alone any
external validity in using the results to conclude what all Americans think. The con-
fusion of the factors motivating people to write to CNN and what opinions they
hold is a prime example of selection bias, one of the many kinds of bias in social sci-
ence that can mislead policy makers. Controlling such bias is the primary aim of the
science of sample survey research, which uses systematic methods to estimate the
connection between one survey’s result and the true opinions of the full population
being sampled.
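
To make the idea of selection bias concrete, the short Python sketch below contrasts a self-selected write-in poll with a simple random sample drawn from the same population. Every number in it (the population split, the response propensities) is an invented assumption for illustration only, not data from any actual poll.

    import random

    random.seed(0)

    # Invented population: 40 percent hold opinion "A", 60 percent hold opinion "B".
    population = ["A"] * 40_000 + ["B"] * 60_000

    # Self-selected poll: "A" holders are assumed far more motivated to write in.
    write_in = [p for p in population
                if random.random() < (0.08 if p == "A" else 0.01)]

    # Systematic alternative: a simple random sample from the same population.
    random_sample = random.sample(population, 1_000)

    def share_a(sample):
        """Proportion of a sample holding opinion "A"."""
        return sum(1 for p in sample if p == "A") / len(sample)

    print(f"true share holding 'A':      {share_a(population):.2f}")    # 0.40
    print(f"self-selected poll estimate: {share_a(write_in):.2f}")      # roughly 0.84
    print(f"random-sample estimate:      {share_a(random_sample):.2f}") # close to 0.40

The write-in estimate lands far from the true share because willingness to respond is entangled with the opinion held; the random sample, by construction, is not.
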
The concept of bias in statistics refers to a “systematic distortion of a result, aris-
ing from a neglected factor” (Shorter Oxford English dictionary 2002). Bias is the
central idea in evidence-misled policy. Any evidence from social research that a
policy causes desirable or undesirable results must be examined for the possible
bias affecting the methods leading to that conclusion. That concern is the premise
of this volume, which draws primary attention to the distinction between observa-
tional and experimental methods in science. That distinction is usually made in the
logical claim that experimental methods offer a lower risk of bias than observa-
tional methods.
The term observational in this context simply means that the researchers took
no action to manipulate or change any of the things being studied, or variables. In
this sense, observation does not mean eyeball surveillance of the phenomena in
question. It simply means that the research left the subject matter alone to function
in its “natural” state. A study of income tax rates and economic growth over the past
century, for example, is an observational study, simply because the social scientists
themselves had no hand in allocating income tax rates to different people, states, or
time periods. Experimental research, in contrast, consists of any study in which the
researcher has some direct or indirect control over one of the variables that is
thought to cause a certain effect. This control, often implemented by government
at the recommendation of social scientists, allows variables to be structured in ways
that logically reduce the number of competing explanations of any results.
The contributors to this volume generally share the view that experimental
methods, used properly, can control bias better than observational methods. The
logic of this claim—illustrated by an experiment in nutrition reported in the Old
Testament Book of Daniel, chapter 1—was first formalized by Sir Francis Bacon
in 1620, when he challenged conclusions based on everyday experience as fraught
with potential for bias (Bacon [1620] 1994). Only by systematically comparing
observations with one condition (such as giving streptomycin to people with tuber-
culosis) to observations with another condition (such as not giving tuberculosis
patients streptomycin) did Bacon think a valid test of the effects of any condition
could be conducted. And because a patient’s own choice to take a new medicine
could be confounded with the likelihood of the patient recovering, Bacon’s logic
suggested that the application of the medicine should be made statistically unre-
lated to any characteristic of the patient.
Streptomycin, of course, had not been invented when Bacon was writing, but it
was available to Sir Austin Bradford Hill in the late 1940s when he led the first
modern randomized controlled trial (RCT) of a medical treatment (Streptomycin
Tuberculosis Trials Committee 1948). This landmark is an unheralded contribu-
tion of social science, since Hill was an economist, not a doctor. His application of
statistics to medicine has since been replicated in up to 1 million clinical tests
(Millenson 1997, 130). But it has had far less impact on social science itself, least of
all on his fellow economists.
The econometric solution to the problem of bias is not to use researcher control
of variables in advance but to use mathematical calculations to create logical con-
trol of comparisons based on observations of naturally occurring phenomena. This
approach is derived from the tradition of Descartes, whose preference for mathe-
matical precision over experimental control led him to question Galileo’s experi-
ments (Collins 1998). But in the precise formulations of Sir David Cox (1958) (who
offered insightful comments at the Nuffield event) and Sir Ronald Fisher (1935),
the potential for mathematical control of bias after observational data are collected
is not nearly as great as the potential for logical control of bias by experimental
design in advance of data collection.
The central issue in the debate between the econometric and experimental
approaches to the control of bias is specification error. The concept of “specifica-
tion” refers to the list of causes included or excluded from a theory (or “model”) of
cause and effect. The econometric approach assumes that a reasonably complete
set of causes can be “specified” in any mathematical formulation of the effects of
different variables on the same outcome, such as the growth rate of the economy—
and many other subjects that would be hard or impossible to study with systematic
experimentation. The danger in such models, as the textbooks make clear, is the
definition of bias itself: that some key causal factor has been neglected in the speci-
fication of causes included in the model (Wonnacott and Wonnacott 1970, 156,
160).
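
The danger can be shown numerically. In the hypothetical Python sketch below (all coefficients and variable names are invented for illustration), an unmeasured factor drives both the “policy” variable and the outcome. A model that omits the factor attributes part of its influence to the policy, and the size of the distortion matches the textbook omitted-variable bias term.

    import random
    import statistics

    random.seed(1)

    # Invented data-generating process: outcome = 2.0*policy + 3.0*omitted + noise,
    # where the omitted factor also influences who adopts the policy.
    n = 20_000
    omitted = [random.gauss(0, 1) for _ in range(n)]
    policy = [0.8 * z + random.gauss(0, 1) for z in omitted]
    outcome = [2.0 * x + 3.0 * z + random.gauss(0, 1) for x, z in zip(policy, omitted)]

    def slope(xs, ys):
        """Least-squares slope of y on x for a single regressor: cov(x, y) / var(x)."""
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
        return cov / statistics.variance(xs)

    naive = slope(policy, outcome)                  # mis-specified: omits the factor
    predicted = 2.0 + 3.0 * slope(policy, omitted)  # true effect plus the bias term
    print(f"true effect 2.0, naive estimate {naive:.2f}, "
          f"estimate predicted by the bias formula {predicted:.2f}")
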
The danger of specification error is arguably reduced in the random assignment
of one condition to half of a large population (such as five hundred out of one thou-
sand cancer patients) by a formula that makes it equally likely that each patient will
receive one treatment or another. Used with relatively large samples, this method
generally creates equivalent distributions in each of the two groups of factors that
could affect results: age, weight, gender, history of cigarette smoking, severity of ill-
ness, and so on. Random assignment is generally thought to control for bias not
only from causes that researchers have thought of but also for causes they have not
thought of. By avoiding the need to “specify” the causes of curing cancer, or
reforming welfare, or teaching reading, an RCT can reduce bias in the test of how
effective a single treatment may be.
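
A companion sketch in the same kind of invented setting shows why. Because assignment is a coin flip, the unmeasured factor ends up with nearly identical averages in the two groups, and a simple difference in mean outcomes recovers the true effect without the factor ever being specified.

    import random
    import statistics

    random.seed(2)

    # Invented setup: an unmeasured factor affects the outcome, but random
    # assignment keeps it unrelated (on average) to who is treated.
    n = 10_000
    unmeasured = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.random() < 0.5 for _ in range(n)]      # coin-flip assignment
    outcome = [1.5 * t + 3.0 * u + random.gauss(0, 1)        # true effect = 1.5
               for t, u in zip(treated, unmeasured)]

    def group_mean(values, flags):
        return statistics.mean(v for v, f in zip(values, flags) if f)

    balance_t = group_mean(unmeasured, treated)
    balance_c = group_mean(unmeasured, [not t for t in treated])
    effect = group_mean(outcome, treated) - group_mean(outcome, [not t for t in treated])

    print(f"unmeasured factor, treated vs. control means: {balance_t:.3f} vs. {balance_c:.3f}")
    print(f"difference-in-means estimate: {effect:.2f}  (true effect 1.5)")
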
Experimentalists do not question the premise that elaborate econometric mod-
els provide substantial reduction of bias. What they do claim is that when both
approaches are possible, a properly controlled experiment provides greater pro-
tection from bias than an econometric model. The two approaches are not neces-
sarily incompatible, and economists are increasingly employing experimental
designs in building econometric models. But there is still some difference of opin-
ion as to whether an RCT provides any less bias in estimating causation than a well-
specified econometric model. That issue is addressed by Steven Glazerman and his
colleagues in their article in this volume that compares experimental and
nonexperimental evaluations of the same programs, showing that they very often
obtain different results. Which method is “right” more often remains a subject of
debate, but it is increasingly clear that these two approaches do not yield the same
“evidence” about whether government programs are effective.

Misleading Evidence from Randomized Trials: Consumer Alerts

Where experimentalists often divide their own ranks is the point at which an
RCT breaks down. People—or groups—can be randomly assigned to many
options, but it is hard to make any group of humans behave in perfect consistency.
Medical patients do not always take their pills, doctors may not always withhold
surgery from a control group, and potential voters may not be at home when some-
one knocks on the door to ask them to vote on election day. When the “take-up” of
the randomly assigned condition being tested is less than 100 percent, there are
two major options for analysis. One is to use econometric methods to analyze the
effects of treatment received (TR) in a statistical model that attempts to “control”
for the factors predicting why some were treated (or accepted treatment) and oth-
ers were not. The other approach is to analyze the cases as they were randomized,
based on the intention-to-treat (ITT) principle (Peto et al. 1977).

The ITT principle holds that an RCT can test the effects of trying to get some-
one to take a treatment and, thus, provides a valid inference about the effect of the
attempt, as distinct from the actual treatment received. The effect of knocking on
ten thousand voters’ doors can be estimated by an ITT analysis, but the effect of
actually talking to a voter if he or she is home can only be estimated by a TR analy-
sis. This contrast is sometimes described as the difference between testing a policy
of providing something to a group, such as the chance to use a government voucher
to attend a private school, and a theory about the effect of actually taking up the
offer, such as attending private versus public schools. David Cox (1958) has sug-
gested that the best experimental solution to this problem is to push the point of
random assignment as far into the sequence of decisions as possible, such as only
offering the voucher to half of the parents who elected to accept it. Yet even then,
some parents may change their minds, or the student may drop out of private
school after one year of a three-year test.
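
The two analyses can be contrasted in a small simulation; every quantity below is an invented assumption. Everyone is randomized to an offer or to a control group, only some of those offered take up the treatment, and take-up is related to an unmeasured trait. The ITT comparison estimates the effect of making the offer; the naive “as treated” comparison, which ignores how people sorted themselves, drifts away from the true effect of the treatment itself.

    import random
    import statistics

    random.seed(3)

    # Invented setup: the treatment itself raises the outcome by 2.0 units.
    n = 20_000
    motivation = [random.gauss(0, 1) for _ in range(n)]       # unmeasured trait
    offered = [random.random() < 0.5 for _ in range(n)]       # random assignment
    # Only those offered can take up, and more motivated people are more likely to.
    took_up = [off and random.random() < 0.3 + 0.2 * max(min(m, 1.0), -1.0)
               for off, m in zip(offered, motivation)]
    outcome = [2.0 * t + 1.5 * m + random.gauss(0, 1)
               for t, m in zip(took_up, motivation)]

    def group_mean(values, flags):
        return statistics.mean(v for v, f in zip(values, flags) if f)

    itt = group_mean(outcome, offered) - group_mean(outcome, [not o for o in offered])
    as_treated = group_mean(outcome, took_up) - group_mean(outcome, [not t for t in took_up])

    print(f"take-up among those offered: {sum(took_up) / sum(offered):.2f}")
    print(f"ITT estimate (effect of the offer): {itt:.2f}")
    print(f"naive 'as treated' estimate: {as_treated:.2f}  (true treatment effect 2.0)")
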

The econometric solution to this problem is to apply an instrumental variable
(IV) to a TR analysis (Angrist, Imbens, and Rubin 1996). The important goal of the
analysis is to estimate the average effect of treatment on the treated. The solution
assumes that some variable (the IV) can be found that is unrelated to the causal
processes under study or “exogenous” to the econometric model. This variable can
then be used to compute the difference between the result obtained with and with-
out the actual delivery of treatment. In the case of voter recruitment, for example,
whether someone is home at the time a recruiter knocks on the door is arguably
unrelated to her or his propensity to vote. But that is indeed just an argument, or an
assumption, of the kind that some experimentalists are reluctant to make. Thus,
when Donald Green and Alan Gerber, distinguished experimentalists and contri-
butors to this volume, test the effects of voter recruitment, they report the results
both ways: an ITT analysis and a TR analysis using an instrumental variable in the
causal model (see Gerber and Green 2000).
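
In the simplest case, where the random assignment itself serves as the instrument, the calculation reduces to dividing the ITT difference by the difference in take-up rates (often called the Wald estimator). The sketch below uses the same kind of invented data as above; it is a toy illustration of the arithmetic, not a reconstruction of the Gerber and Green analysis, and it leans on the assumption that assignment affects outcomes only through take-up.

    import random
    import statistics

    random.seed(4)

    # Invented setup: assignment is random; take-up is voluntary; true effect = 2.0.
    n = 50_000
    motivation = [random.gauss(0, 1) for _ in range(n)]
    assigned = [random.random() < 0.5 for _ in range(n)]
    took_up = [a and random.random() < 0.4 + 0.1 * m for a, m in zip(assigned, motivation)]
    outcome = [2.0 * t + 1.5 * m + random.gauss(0, 1) for t, m in zip(took_up, motivation)]

    def group_mean(values, flags):
        return statistics.mean(v for v, f in zip(values, flags) if f)

    not_assigned = [not a for a in assigned]
    itt_effect = group_mean(outcome, assigned) - group_mean(outcome, not_assigned)
    take_up_gap = group_mean(took_up, assigned) - group_mean(took_up, not_assigned)

    # Wald / instrumental-variable estimate of the effect of treatment on the treated,
    # valid here only because assignment influences outcomes solely through take-up.
    iv_estimate = itt_effect / take_up_gap
    print(f"ITT effect {itt_effect:.2f}, take-up gap {take_up_gap:.2f}, "
          f"IV estimate {iv_estimate:.2f}  (true 2.0)")
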
The cost-benefit analysis of any expenditure may be far more favorable in the
TR analysis than in the ITT analysis. If a treatment has any benefit when delivered,
the average effects on those who get it and do not get it delivered would logically be
smaller than the effects on only those who get it. But the costs associated with offer-
ing the treatment can often not be limited to those who ultimately take up the treat-
ment. Thus, there is an important point for consumers of social science to remem-
ber: experimentalists assign much greater certainty to the ITT approach, and more
uncertainty to the TR approach, in drawing conclusions about the effect of any pro-
gram. In medical research, for example, there is acceptance of the limitations of the
ITT approach but reluctance to accept the IV solution (Piantadosi 1997).
There is far less disagreement in principle about another problem of misleading
evidence from randomized experiments: selective reporting of one out of several
measures of results. This problem may arise, for example, in a drug abuse preven-
tion program in public schools, with outcome measures covering the use of
tobacco, alcohol, marijuana, cocaine, heroin, and other drugs. The more tests for
statistically significant effects of a program the researchers perform, the greater
the odds that one of the effects will appear to be “significant” by chance alone. Yet
reports on randomized experiments often conclude that programs are effective on
selective reporting of one significant result rather than on a full account of all the
nonsignificant results (Gorman 2002). Taken together, the full array of results may
provide far less evidence to conclude that the program “worked” than if the one
best result is taken (selectively) as a biased presentation of the evidence.
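
The arithmetic behind this warning is straightforward. With several independent outcome measures and the conventional 5 percent significance threshold, the chance that at least one measure looks “significant” for a program with no true effect at all grows quickly, as the purely illustrative sketch below shows (the independence of the tests is an assumption made to keep the example simple).

    import random

    random.seed(5)

    ALPHA = 0.05     # conventional significance threshold
    OUTCOMES = 6     # e.g., tobacco, alcohol, marijuana, cocaine, heroin, other drugs
    TRIALS = 10_000  # simulated evaluations of a program with zero true effect

    # Analytic answer for k independent tests of a true null: 1 - (1 - alpha)^k.
    analytic = 1 - (1 - ALPHA) ** OUTCOMES

    # Simulation: under the null hypothesis each outcome's p-value is uniform on [0, 1].
    false_alarms = sum(
        any(random.random() < ALPHA for _ in range(OUTCOMES)) for _ in range(TRIALS)
    )

    print("chance that at least one outcome is 'significant' by luck alone:")
    print(f"  analytic {analytic:.2f}, simulated {false_alarms / TRIALS:.2f}")  # about 0.26
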
A variant of this problem is subgroup analysis. The question of “what works for
whom” remains an important one in all program evaluation, especially when a pol-
icy has opposite effects on different kinds of people. For example, the effects of
arrest on misdemeanor domestic violence may be, on average, no different from
those of not making arrests, but that may mask the fact that arrest reduces future
violence among employed batterers while increasing domestic violence among
unemployed batterers (Sherman 1992). But when an evaluation selectively
emphasizes the benefits of a program for one small subgroup of the randomly
assigned test group and ignores the evidence about the rest of the sample, it can
bias the consumer’s reaction as a form of selective reporting. This issue is especially
important if the subgroup was formed after random assignment (such as failure to
take up treatment) rather than before random assignment or at baseline (such as
employment status at time of random assignment). Subgroups that consist of those
who have not dropped out may lose the logical benefit of random assignment, in
that their survival may be due to factors other than the treatment—a form of bias.
The widely accepted “success” of one major drug abuse prevention program, for
example, may be based on a “high-fidelity sample” that constitutes a TR analysis
rather than what the reader may interpret as an ITT analysis (Gorman 2002).

Preventing Misleading Evidence with Systematic Reviews

One way social science can alert consumers of evidence to these problems is an
increasingly available tool: the systematic review (Peto 1987). These reviews
attempt to integrate all the evidence on the effectiveness of any program or prac-
tice whenever more than one study is available. The review thus reduces the
importance of any one test by putting it in the context of all available tests. This pro-
vides a useful antidote to the consumer’s tendency to give greater value to the “lat-
est” result than to the “average” result across all tests. By giving far greater empha-
sis to locating the full population of tests of any hypothesis, the systematic review
also helps to control publication bias, or the greater likelihood that any result will
be published if it reports a statistically significant difference than if it does not; or it
reports the first test of a hypothesis rather than a replication, which has just as
much value to science (MacCoun 1998). As Herbert Turner, Robert Boruch, and
their colleagues discuss in their article in this volume, one of the best defenses
against publication bias is the use of a “registry” of all attempts to test a program at
the time the test begins and before the results are yet known. This method may
increase public access to findings of nonsignificant results, which can often provide
key pieces to a puzzle of what the pattern of evidence truly shows.
Generally undertaken by independent analysts, systematic reviews also attempt
to identify the rules for including or excluding each study depending on whether it
meets certain criteria. For example, a review could exclude any evaluation that is
not an RCT, as is the common practice in medicine (see, e.g., Assendelft et al.
2003). And in a review limited to randomized trials, the results examined could be
limited to ITT analyses with at least 75 percent of the treatment group actually
treated, for example (Strang and Sherman 2003). Or there could be an additional
limitation about the percentage of cases followed up beyond random assignment,
such as the exclusion of any test that lost more than 30 percent of its subjects during
outcome assessment.
The benefits of systematic reviews cut both ways: they help to detect both overly
optimistic and overly negative conclusions about program effects. They are an
excellent defense against selective reporting: the use of one standard outcome
measure across multiple tests compared in a systematic and transparent fashion
can help reject conclusions based on chance effects in one study. But systematic
reviews also may reveal benefits that appear consistently across multiple studies,
even though the effects were not statistically “significant” (ruling out chance
effects) in any of the individual studies. The Cochrane Collaboration for evidence-
based medicine even uses an example of such a review as its logo: a statistical
graphic that portrays a health benefit that had been undetected before a systematic
review was done (see the Collaboration’s Web site at www.cochrane.org).
The Cochrane logo is an example of the forest graph, a new kind of graph that
systematic reviewers developed to help consumers see the overall pattern of evi-
dence. Sir Iain Chalmers, a founder of the Cochrane Collaboration, uses a forest
graph in his article in this volume to show the effects of a medical treatment that
was not known to be effective until the graph was plotted. It is a tool that should be,
but is not yet, familiar to every government official who is charged with making
evidence-based policy.
The forest graph is a remarkably simple tool that was not designed by any “Pro-
fessor Forest” but that helps the reader “see the forest from the trees.” By showing
the evidence from each individual test that a program or practice leaves people
better off or worse off, the reader can immediately see what the pattern is. Each
“tree” shows the confidence intervals of each result—the statistical range of the
estimated possible effects of the hypothesized cause on the apparent effect. Each
line that crosses zero is a result that, by itself, could have been due to chance. But if
most of the space within the confidence intervals in most of the tests falls on one
side or the other of the line between benefits and harms, then the chances of that
pattern being due to chance itself go down substantially.
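
A toy calculation makes the point. The five study results below are invented, and they are pooled with a simple fixed-effect (inverse-variance) average of the kind used in many meta-analyses; each study's 95 percent confidence interval crosses zero, yet the pooled interval does not.

    import math

    # Invented effect estimates and standard errors from five small studies.
    studies = [(0.20, 0.15), (0.15, 0.12), (0.25, 0.18), (0.10, 0.14), (0.22, 0.16)]

    print("individual studies (95 percent confidence intervals):")
    for i, (effect, se) in enumerate(studies, start=1):
        low, high = effect - 1.96 * se, effect + 1.96 * se
        verdict = "crosses zero" if low <= 0 <= high else "does not cross zero"
        print(f"  study {i}: {effect:+.2f}  [{low:+.2f}, {high:+.2f}]  ({verdict})")

    # Fixed-effect pooled estimate: weight each study by the inverse of its variance.
    weights = [1 / se ** 2 for _, se in studies]
    pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
    print(f"pooled estimate: {pooled:+.2f}  [{low:+.2f}, {high:+.2f}]  (does not cross zero)")
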
This fact is perhaps the best reminder to consumers of the major difference
between statistical significance—the probability of a single effect being due to
chance—and substantive significance: the effect that a program might have if it
were adopted nationwide. Forest graphs focus the spotlight where it belongs, esti-
mating the substantive significance of a program from all the evidence rather than
the narrow question of whether any one result had more than a 5 percent chance of
being a fluke. Many policy makers would be happy to take much greater risks that a
test result was a fluke—perhaps 10 or even 20 percent—if the potential benefits in
lives saved or jobs generated would justify that risk. A forest graph changes the sub-
ject from making those bets on a single piece of research and implies that most bets
should be based on multiple tests wherever ethically feasible. In the world of evidence-
based policy, the translation of “where’s the beef?” may become “where’s the forest
graph?”
Systematic reviews, of course, are also prone to the sectarian divisions of social
science. The major division, mild though it may be, is between the “lumpers” and
the “splitters.” The lumpers in this instance are those scientists who rely on a single
estimate from all available studies taken together, a procedure usually called meta-
analysis (see Glass 1976; Cooper and Hedges 1994). The splitters in research syn-
thesis are those who distrust the conclusions based on glossing over the many dif-
ferences across evaluations in experimental design and the actual delivery of pro-
grams and therefore resist—or at least supplement—the use of an “average effect
size” to distill the overall effect of a program.
One aspect of this debate is a reprise of the ITT versus TR debate, since some
approaches to meta-analysis either employ TR findings in their results or develop
models that take TR into account. Other approaches to meta-analysis, however,
limit the average effect size statistics to ITT estimates. That is where the two major
issues of differences in analytic approach to research synthesis become most
apparent. These issues are the use of differing control groups and Thomas Edison’s
problem of the light-bulb test.
The problem of differing control groups is basic to experimental design. To say
that a program increases literacy by 50 percent is to imply a comparison: a 50 per-
cent increase relative to what? If the comparison is to students enrolled at an elite
private school (to use an extreme example), the size of the effect may be quite dif-
ferent from a comparison of the program to students enrolled in a poverty-area
public school. Most tests of such programs employ a range of comparison groups in
between such extremes. The question for systematic reviews is how to take the dif-
ferences in control group conditions into account in estimating an “average” effect
of a program relative to a haphazard collection of comparisons. That is, if the com-
parisons are not in any way “representative” of the larger world, then the external
validity of conclusions from the studies employing those comparisons is severely
limited.
The splitters deal with this problem by separating the estimates of program
effects into those derived from each kind of comparison group. In a medical review
of a specific treatment for back pain, for example, results of thirty-nine RCTs were
divided into five groups of comparisons, one to a “sham” or placebo treatment and
four to other common treatments for back pain (Assendelft et al. 2003). The chal-
lenge of using this method in social science is that there are few, if any, programs
that have been subjected to thirty-nine randomized trials. The splitters who con-
front six or seven RCTs containing three different comparison groups are left to
advocate caution in reaching conclusions and a need to conduct further tests. A
review of restorative justice, for example, confronts comparison groups as diverse
as prosecution in a juvenile court in Australia and diversion from prosecution to a
range of more than fifty diversion programs in Indianapolis (Strang and Sherman
2003). A single review may thus address both lumpers and splitters by reporting
results in multiple ways. Yet the conclusions may still part company with the lump-
ers who are ready to compute an average effect size across all comparisons, arguing
that such averages may be meaningless. Both camps, however, could come
together with more randomized trials available for analysis, so that there is still
room for “lumping” average effect sizes within each group after the “splitting” of all
results according to the comparison group.
The most important issue for systematic reviews may be the light-bulb problem.
Thomas Edison famously said that his many failures to invent a long-burning light
bulb produced a useful compilation of knowledge about how not to design such a
device. Only the last of his hundreds of attempts was successful. Had those findings
been subjected to meta-analysis, the conclusion would have been, on average, that
long-burning light bulbs do not work. That is, the synthesis of research would show
that light bulbs do not work. Here again, the splitters are ready to part company
with the lumpers: they would ask why this single result differed from the average
effect and whether that single result can be replicated.
It is at this point that the value of systematic reviews extends beyond reaching
conclusions about what is known so far into conclusions about what is most impor-
tant to study next. An evidence-based analysis of policy choices that shows very
promising results from one approach should not lead immediately to a national
implementation of that approach—such as the legislatures in most U.S. states
mandating arrest for misdemeanor domestic violence on the basis of one RCT
(Sherman 1992). Rather, the careful consumer of social science evidence should do
what the National Institute of Justice actually did and proceed to conduct multiple
replications of the single promising approach. While those replications may com-
plicate the conclusions, they can provide a more realistic basis for assessing what
works for whom. That, in turn, can make policies more successful in achieving their
goals—or “meeting their targets,” as the current British government would (and
does) put it.
The International Campbell Collaboration will soon provide systematic reviews
of evidence about a wide range of government programs. Many consumers may
treat these reviews as final conclusions and base government funding recommen-
dations on the results. The best value of such reviews, however, will be to become
living documents, accessible on the Web and updated with every newly available
research result. In this respect, as in so many others, the “evidence” in science dif-
fers markedly from the evidence in a court of law. In the latter, the jury goes out,
then comes back, and the case is closed. In science, the presentation of evidence
may never end, and the jury may never go home.

Making Social Science More Experimental


Much of this volume focuses on the causes and possible cures for the dearth of
experimental evidence in social science. This analysis spans the North Atlantic.
The articles by Ann Oakley et al. and David Farrington examine the depth of hostil-
ity to experimental designs among British social scientists, offering qualitative and
quantitative evidence for their conclusions. They imply that it is social scientists
who have led British government away from unbiased research rather than vice
versa. In the United States, the implication is similar if not as clear-cut. The article
by Thomas Cook suggests that both educators and university-based education
researchers have opposed controlled experiments on many grounds. Whether that
situation will be remedied by the new U.S. Institute for Education Sciences
remains unclear, but that is clearly the intent of the Institute’s new director, Grover
Whitehurst (see www.ed.gov/offices/IES/NCEE). Donald Green and Alan Gerber
suggest that even basic research in political science suffers from a lack of experi-
mental evidence and that contemporary experiments can inform issues generally
approached as historical claims, such as the decline of political machines.
Much of the 2002 discussion at Nuffield College focused on graduate training.
For most graduate students in social science, the no-parking sign hung on a private
residence garage door in Washington might be apt: “Don’t even THINK of parking
here!” The message of graduate training may be, in most Ph.D. programs, “Don’t
even THINK of doing field experiments!” Whether this situation can be remedied,
or whether it should be, the authors of this volume leave up to the readers. But one
economic hypothesis is that government will get the quality of evidence that it
demands and that social science markets will supply what the customer wants.
Thus, if we are to improve the capacity of social science to lead, and not mislead,
governmental policy, creating educated consumers may be the first step. If this vol-
ume can contribute to that education, then it will have helped to accomplish the
mission of the American Academy of Political and Social Science since 1889: “to
promote the progress of the political and social sciences.”

References
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. Identification of causal effects using
instrumental variables. Journal of the American Statistical Association 91:444-55.
Assendelft, Willem J. J., Sally C. Morton, Emily I. Yu, Marika J. Suttorp, and Paul G. Shekelle. 2003. Spinal
manipulative therapy for low back pain: A meta-analysis of effectiveness relative to other therapies.
Annals of Internal Medicine 138:871-81.
Bacon, Francis. [1620] 1994. Novum Organum. Translated and edited by Peter Urbach and John Gibson.
Chicago: Open Court.
Campbell, Donald T., and Julian C. Stanley. 1963. Experimental and quasi-experimental designs for
research. Chicago: Rand-McNally.
Collins, Randall. 1998. Sociology of the philosophies. Cambridge, MA: Harvard University Press.
Cook, Thomas D., and Donald T. Campbell. 1979. Quasi-experimentation: Design and analysis issues for
field settings. Boston: Houghton-Mifflin.
Cooper, Harris, and Larry V. Hedges, eds. 1994. The handbook of research synthesis. New York: Russell Sage
Foundation.
Cox, Sir David. 1958. Planning of experiments. New York: John Wiley.
Farrington, David P. 2003. British randomized experiments on crime and justice. Annals of the American
Academy of Political and Social Science 589:150-69.
Fisher, Sir Ronald. 1935. Design of experiments. Edinburgh, UK: Oliver and Boyd.
Gerber, Alan S., and Donald P. Green. 2000. The effects of canvassing, direct mail and telephone contact on
voter turnout: A field experiment. American Political Science Review 94:653-63.
Glass, Gene V. 1976. Primary, secondary, and meta-analysis. Educational Researcher 5:3-8.
Gorman, Dennis M. 2002. The “science” of drug and alcohol prevention: The case of the randomized trial of
the Life Skills Training Program. International Journal of Drug Policy 13:21-26.
MacCoun, Robert J. 1998. Biases in the interpretation and use of research results. Annual Review of Psychol-
ogy 49:259-87.
Massey, Douglas A., and Nancy Denton. 1993. American apartheid: Segregation and the making of the
underclass. Cambridge, MA: Harvard University Press.
Millenson, Michael. 1997. Demanding medical excellence: Doctors and accountability in the Information
Age. Chicago: University of Chicago Press.
Nuffield College annual report: Academic report 2001-2002. 2002. Oxford: Nuffield College.
Peto, Richard R. 1987. Why do we need systematic overviews of randomized trials? Statistics in Medicine
6:233-40.
Peto, R., C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, S. V. Howard, N. Mantel, K. McPherson, J. Peto, and
P. G. Smith. 1977. Design and analysis of randomized clinical trials requiring prolonged observations of
each patient. II. Analysis and examples. British Journal of Cancer 35:1-39.
Piantadosi, Steven. 1997. Clinical trials: A methodological perspective. New York: John Wiley.
Sherman, Lawrence W. 1992. Policing domestic violence: Experiments and dilemmas. New York: Free Press.
Shorter Oxford English dictionary: On historical principles. 2002. 5th ed. Oxford: Oxford University Press.
Strang, Heather, and Lawrence Sherman. 2003. Effects of restorative justice on repeat offending and victim
satisfaction: A systematic review for the Campbell Collaboration. Manuscript. Philadelphia: Jerry Lee
Center of Criminology, University of Pennsylvania.
Streptomycin Tuberculosis Trials Committee. 1948. Streptomycin treatment of pulmonary tuberculosis: A
Medical Research Council investigation. British Medical Journal 20 (30 October):769-82.
Turner, Herbert, Robert Boruch, Anthony Petrosino, Julia Lavenberg, Dorothy de Moya, and Hannah
Rothstein. 2003. Populating an international web-based randomized trials register in the social, behav-
ioral, criminological, and education sciences. Annals of the American Academy of Political and Social Sci-
ence 589:203-25.
Wonnacott, Ronald J., and Thomas H. Wonnacott. 1970. Econometrics. New York: John Wiley.
