http://www.jmde.com/  Occasional Paper Series
Journal of MultiDisciplinary Evaluation, Volume 7, Number 15, February 2011 (ISSN 1556-8180)

Assessing the Impact of Planned Social Change*

Originally published as Paper #8, Occasional Paper Series, December, 1976

* Reprinted with permission of The Public Affairs Center, Dartmouth College.¹

Donald T. Campbell

It is a special characteristic of all modern societies that we consciously decide on and plan projects designed to improve our social systems. It is our universal predicament that our projects do not always have their intended effects. Very probably we all share in the experience that often we cannot tell whether the project had any impact at all, so complex is the flux of historical changes that would have been going on anyway, and so many are the other projects that might be expected to modify the same indicators.

It seems inevitable that in most countries this common set of problems, combined with the obvious relevance of social science research procedures, will have generated a methodology and methodological specialists focused on the problem of assessing the impact of planned social change. It is an assumption of this paper that, in spite of differences in forms of government and approaches to social planning and problem-solving, much of this methodology can be usefully shared–that social project evaluation methodology is one of the fields of science that has enough universality to make scientific sharing mutually beneficial. As a part of this sharing, this paper reports on program impact assessment methodology as it is developing in the United States today.

¹ Preparation of this paper has been supported in part by grants from the Russell Sage Foundation and the National Science Foundation, Grant Number SOC-7103704-03. A version of this paper was presented to the Visegrad, Hungary, Conference on Social Psychology, May 5-10, 1974.


The most common name in the U.S. for this developing specialty is evaluation research, which now almost always implies program evaluation (even though the term "evaluation" has a well-established usage in assessing the adequacy of persons in the execution of specific social roles). Already there are a number of anthologies and textbooks in this area (Suchman, 1967; Caro, 1971; Weiss, 1972a, 1972b; Rivlin, 1971; Rossi & Williams, 1972; Glaser, 1973; Fairweather, 1967; Wholey, et al., 1970; Caporaso & Roos, 1973; Riecken, Boruch, Campbell, Caplan, Glennan, Pratt, Rees, & Williams, 1974). There is a journal, Evaluation, which is being given free distribution during a trial period and after three issues seems likely to survive (address: 501 S. Park Ave., Minneapolis, Minnesota 55415). There is also Evaluation Comment: The Journal of Educational Evaluation (address: 145 Moore Hall, University of California, Los Angeles, 90024). Two other journals which frequently have materials of this sort are Social Science Research, edited by two leaders in the field in the U.S., James Coleman and Peter Rossi, and Law & Society Review, founded by Richard D. Schwartz. Many other journals covering social science research methods carry important contributions to this area.

The participants in this new area come from a variety of social science disciplines. Economists are well represented. Operations research and other forms of "scientific management" contribute. Statisticians, sociologists, psychologists, political scientists, social service administration researchers, and educational researchers all participate. The similarity of what they all end up recommending and doing testifies to the rapid emergence of a new and separate discipline that may soon have its own identity divorced from this diverse parentage.

Since my own disciplinary background is social psychology, I feel some need to comment on the special contribution that this field can make, even though I regard what I am now doing as "applied social science" rather than social psychology. First, of all of the contributing disciplines, psychology is the only one with a laboratory experimental orientation, and social psychologists in particular have had the most experience in extending laboratory experimental design to social situations. Since the model of experimental science emerges as a major alternative in reducing equivocality about what caused what in program evaluation (from Suchman's, 1967, founding book onward), this is a very important contribution of both general orientation and specific skills.


Second, psychologists are best prepared with appropriately critical and analytic measurement concepts. Economists have an admirably skeptical book on monetary records (Morgenstern, 1963), but most economists treat the figures available as though they were perfect. Sociologists have a literature on interviewer bias and underenumeration, but usually treat census figures as though they were unbiased. Psychology, through its long tradition of building and criticizing its own measures, has developed concepts and mathematical models of reliability and validity which are greatly needed in program evaluation, even though they are probably not yet adequate for the study of the cognitive growth of groups differing in ability. The concept of bias, as developed in the older psychophysics in the distinction between "constant error" (bias) and "variable error" (unreliability), and the more recent work in personality and attitude measurement on response sets, halo effects, social desirability factors, index correlations, methods factors, etc. (Cronbach, 1946, 1950; Edwards, 1957; Jackson & Messick, 1962; Campbell, Siegman & Rees, 1967; Campbell & Fiske, 1959), is also very important and is apt to be missing from the concept of validity if that is defined in terms of a correlation coefficient with a criterion. All this, of course, is not our monopoly. Indeed, it is the qualitative sociologists who do studies of the conditions under which social statistics get laid down (e.g., Becker, et al., 1968, 1970; Douglas, 1967; Garfinkel, 1967; Kitsuse & Cicourel, 1963; Beck, 1970) who best provide the needed skepticism of such measures as suicide rates and crime rates. But even here, it is psychologists who have had the depth of experience sufficient to distinguish degrees of validity lying between total worthlessness and utter perfection, and who have been willing to use, albeit critically, measures they knew were partially biased and errorful.

Third, many of the methodological problems of social implementation and impact measurement have to do with the social psychology of interaction between citizens and projects, or between citizens and modes of experimental implementation (randomization, control groups), or between citizens and the special measurement procedures introduced as a part of the evaluation. These are special problems of attitude formation and of the effects of attitudes on responses, and are clearly within the domain of our intended competence.

Having said something about U.S. evaluation research in its professional aspects, I would like to spend the rest of the time telling about the problems we have encountered so far and the solutions we have proposed. It is with regret that I note that we have not progressed very far from my earlier review (Campbell, 1969b); however, I will attempt to provide new illustrations.

The focus of what follows is so much on troubles and problems that I feel the necessity of warning and apologizing. If we set out to be methodologists, we set out to be experts in problems and, hopefully, inventors of solutions. The need for such a specialty would not exist except for the problems. From this point of view, no apology is needed. But I would also like to be engaged in recruiting members to a new profession and in inspiring them to invest great effort in activities with only long-range payoff. For potential recruits, or for those already engaged in it, a full account of our difficulties, including the problem of getting our skills used in ways we can condone, is bound to be discouraging. We cannot yet promise a set of professional skills guaranteed to make an important difference. In the few success stories of beneficial programs unequivocally evaluated, society has gotten by, or could have gotten by, without our help. We still lack instances of important contributions to societal innovation which were abetted by our methodological skills. The need for our specialty, and the specific recommendations we make, must still be justified by promise rather than by past performance. They are a priori in that they represent extrapolations into a new context not yet cross-validated in that context. I myself believe that the importance of the problem of social system reality-testing is so great that our efforts and professional commitment are fully justified by promise. I believe that the problems of equivocality of evidence for program effectiveness are so akin to the general problems of scientific inference that our extrapolations into recommendations about program evaluation procedures can be, with proper mutual criticism, well-grounded. Nonetheless, motivated in part by the reflexive consideration that promising too much turns out to be a major obstacle to meaningful program evaluation, I aim, however ambivalently, to present an honestly pessimistic picture of the problem.


A second problem with the problem focus comes from the fact that inevitably many of the methodological difficulties are generated from the interaction of aspects of the political situation surrounding programs and their evaluation. Thus the U.S. experience in evaluation research, combined with the focus on problems, may make my presentation seem inappropriately and tactlessly critical of the U.S. system of government and our current political climate. Ideally, this would be balanced out by the sharing of experiences from many nations. In the absence of this, I can only ask you not to be misled by this by-product of the otherwise sensible approach of focusing on problems.

It is in the area of methodological problems generated by political considerations that the assumptions of universality for the methodological principles will fail as we compare experiences from widely differing social, economic, and political systems. You listeners will have to judge the extent, if any, to which the same politico-methodological problems would emerge in your program evaluation settings. Most international conferences of scientists can avoid the political issues which divide nations by concentrating on the scientific issues which unite them as scientists. On the topic of assessing the impact of planned social change we do not have that luxury. Even so, I have hopes for a technology that would be useful to any political system. I believe that much of the methodology of program evaluation will be independent of the content of the program, and politically neutral in this sense. This stance is augmented by emphasizing the social scientist's role in helping society keep track of the effects of changes that its political process has initiated and by playing down the role of the social scientist in the design of program innovations. Whether this independence of ideology is possible, and even if it is, how it is to be integrated with our social scientist's duty to participate in the development of more authentic human consciousness and more humane forms of social life, are questions I have not adequately faced, to say nothing of resolved.

In what follows, I have grouped our problems under three general headings, but have made little effort to keep the discussion segregated along these lines. First come issues that are internal to our scientific community and would be present even if scientist program evaluators were running society solely for the purpose of unambiguous program evaluation. These are: 1) Metascientific issues; and 2) Statistical issues. The remaining heading involves interaction with the societal context. Under 3) Political system problems, I deal with issues that specifically involve political processes and governmental institutions, some of which are perhaps common to all large, bureaucratic nations, others unique to the U.S. setting.

Metascientific Issues

Quantitative vs. Qualitative Methodology

A controversy between "qualitative" versus "quantitative" modes of knowing, between geisteswissenschaftlich and naturwissenschaftlich approaches, between "humanistic" and "scientistic" approaches, is characteristic of most of the social sciences in the U.S.A. today. In fields such as sociology and social psychology, many of our ablest and most dedicated graduate students are increasingly opting for the qualitative, humanistic mode. In political science, there has been a continuous division along these lines. Only economics and geography seem relatively immune.


Inevitably, this split has spilled over into evaluation research, taking the form of a controversy over the legitimacy of the quantitative-experimental paradigm for program evaluation (e.g., Weiss & Rein, 1969, 1970; Guttentag, 1971, 1973; Campbell, 1970, 1973). The issue has not, to be sure, been argued in quite these terms. The critics taking what I am calling the humanistic position are often well-trained in quantitative-experimental methods. Their specific criticisms are often well-grounded in the experimentalist's own framework: experiments implementing a single treatment in a single setting are profoundly ambiguous as to what caused what; there is a precarious rigidity in the measurement system, limiting recorded outcomes to those dimensions anticipated in advance; process is often neglected in an experimental program focused on the overall effect of a complex treatment, and thus knowing such effects has only equivocal implications for program replication or improvement; broad-gauge programs are often hopelessly ambiguous as to goals and relevant indicators; change of treatment program during the course of an ameliorative experiment, while practically essential, makes input-output experimental comparisons uninterpretable; social programs are often implemented in ways that are poor from an experimental design point of view; even under well-controlled situations, experimentation is a profoundly tedious and equivocal process; experimentation is too slow to be politically useful; etc. All these are true enough, often enough, to motivate a vigorous search for alternatives. So far, the qualitative-knowing alternatives suggested (e.g., Weiss & Rein, 1969, 1970; Guttentag, 1971, 1973) have not been persuasive to me. Indeed, I believe that naturalistic observation of events is an intrinsically equivocal arena for causal inference, by qualitative or quantitative means, because of the ubiquitous confounding of selection and treatment. Any efforts to reduce that equivocality will have the effect of making conditions more "experimental." "Experiments" are, in fact, just that type of contrived observational setting optimal for causal inference. The problems of inference surrounding program evaluation are intrinsic to program settings in ongoing social processes. Experimental designs do not cause these problems and, in fact, alleviate them, though often only slightly so.

In such protests, there often seems implicitly a plea for the substitution of qualitative clairvoyance for the indirect and presumptive processes of science. But while I must reject this aspect of the humanistic protest, there are other aspects of it that have motivated these critics in which I can wholeheartedly join. These other criticisms may be entitled "neglect of relevant qualitative contextual evidence" or "overdependence upon a few quantified abstractions to the neglect of contradictory and supplementary qualitative evidence."


Too often quantitative social scientists, under the influence of missionaries from logical positivism, presume that in true science, quantitative knowing replaces qualitative, common-sense knowing. The situation is in fact quite different. Rather, science depends upon qualitative, common-sense knowing even though at best it goes beyond it. Science in the end contradicts some items of common sense, but it only does so by trusting the great bulk of the rest of common-sense knowledge. Such revision of common sense by common sense can, paradoxically, only be done by trusting more common sense. Let us consider as an example the Müller-Lyer illustration (Figure 1).

If you ask the normal resident of a "carpentered" culture (Segall, et al., 1966) which line is longer, a or b, he will reply b. If you supply him with a ruler, or allow him to use the edge of another piece of paper as a makeshift ruler, he will eventually convince himself that he is wrong, and that line a is longer. In so deciding he will have rejected as inaccurate one product of visual perception by trusting a larger set of other visual perceptions. He will also have made many presumptions, inexplicit for the most part, including the assumption that the lengths of the lines have remained relatively constant during the measurement process, that the ruler was rigid rather than elastic, that the heat and moisture of his hand have not changed the ruler's length in such a coincidental way as to produce the different measurements, expanding it when approaching line a and contracting it when approaching line b, etc.

Let us take as another example a scientific paper containing theory and experimental results demonstrating the particulate nature of light, in dramatic contrast to common-sense understanding. Or a scientific paper demonstrating that what ordinary perception deems "solids" are in fact open lattices. Were such a paper to limit itself to mathematical symbols and purely scientific terms, omitting ordinary language, it would fail to communicate to another scientist in such a way as to enable him to replicate the experiment and verify the observations. Instead, the few scientific terms have been imbedded in a discourse of elliptical prescientific ordinary language which the reader is presumed to (and presumes to) understand. And in the laboratory work of the original and the replicating laboratory, a common-sense, prescientific language and perception of objects, solids, and light was employed and trusted in coming to the conclusions that thus revise the ordinary understanding. To challenge and correct the common-sense understanding in one detail, common-sense understanding in general had to be trusted.


Related to this is the epistemological emphasis on qualitative pattern identification as prior to an identification of quantifiable atomic particles, in reverse of the logical atomist's intuition, still too widespread (Campbell, 1966). Such an epistemology is fallibilist, rather than clairvoyant, emphasizing the presumptive error-proneness of such pattern identification, rather than perception as a dependable ground of certainty. But it also recognizes this fallible, intuitive, presumptive, ordinary perception to be the only route. This is not to make perceptions uncriticizable (Campbell, 1969a), but they are, as we have seen, only criticizable by trusting many other perceptions of the same epistemic level.

If we apply such an epistemology to evaluation research, it immediately legitimizes the "narrative history" portion of most reports and suggests that this activity be given formal recognition in the planning and execution of the study, rather than only receiving attention as an afterthought. Evaluation studies are uninterpretable without this, and most would be better interpreted with more. That this content is subjective and guilty of perspectival biases should lead us to better select those who are invited to record the events, and to prepare formal procedures whereby all interested participants can offer additions and corrections to the official story. The use of professionally trained historians, anthropologists, and qualitative sociologists should be considered. The narrative history is an indispensable part of the final report, and the best qualitative methods should be used in preparing it.

We should also recognize that participants and observers have been evaluating program innovations for centuries without benefit of quantification or scientific method. This is the common-sense knowing which our scientific evidence should build upon and go beyond, not replace. But it is usually neglected in quantitative evaluations, unless a few supporting anecdotes haphazardly collected are included. Under the epistemology I advocate, one should attempt to systematically tap all the qualitative common-sense program critiques and evaluations that have been generated among the program staff, program clients and their families, and community observers. While quantitative procedures such as questionnaires and rating scales will often be introduced at this stage for reasons of convenience in collecting and summarizing, non-quantitative methods of collection and compiling should also be considered, such as hierarchically organized discussion groups. Where such evaluations are contrary to the quantitative results, the quantitative results should be regarded as suspect until the reasons for the discrepancy are well understood. Neither is infallible, of course. But for many of us, what needs to be emphasized is that the quantitative results may be as mistaken as the qualitative. After all, in physical science laboratories, the meters often work improperly, and it is usually qualitative knowing, plus assumptions about what the meter ought to be showing, that discovers the malfunction. (This is a far cry from the myth that meter readings operationally define theoretical parameters.)


It is with regret that I report that in U.S. program evaluations, this sensible joint use of modes of knowing is not yet practiced. Instead, there seems to be an all-or-none flip-flop. Where, as in Model Cities evaluation, anthropologists have been used as observers, this has often been in place of, rather than in addition to, quantitative indicators, pretests, posttests, and control-group comparisons. A current example is the use of anthropologists in the "Experimental Schools" program, started in the U.S. Office of Education and now in the National Institute of Education. In this program, school-system initiative is encouraged, and winning programs receive substantial increments to their budgets (say 25%) for use in implementing the innovations. To evaluate some of these programs, very expensive contracts have been let for anthropological process evaluations of single programs. In one case, this was to involve a team of five anthropologists for five years, studying the school system of a unique city with a population of 100,000 persons. The anthropologists have no prior experience with any other U.S. school system. They have been allowed no base-line period of study before the program was introduced; they arrived instead after the program had started. They were not scheduled to study any other comparable school system not undergoing this change. To believe that under these disadvantaged observational conditions these qualitative observers could infer what aspects of the processes they observe were due to the new program innovation requires more faith than I have, although I should withhold judgment until I see the products. Furthermore, the emphasis of the study is on the primary observations of the anthropologists themselves, rather than on their role in using participants as informants. As a result there is apt to be a neglect of the observations of other qualitative observers better placed than the anthropologists. These include the parents who have had other children in the school prior to the change; the teachers who have observed this one system before, during, and after the change; the teachers who have transferred in with prior experience in otherwise comparable systems; and the students themselves. Such observations one would perhaps want to mass produce in the form of questionnaires. If so, one would wish that appropriate questions had also been asked prior to the experimental program, and on both occasions in some comparable school system undergoing no such reform, thus reestablishing experimental design and quantitative summaries of qualitative judgments. (For a more extended discussion of the qualitative-quantitative issues, see Campbell, 1975.)

While the issue of quantitative vs. qualitative orientations has important practical implications, it is still, as I see it, primarily an issue among us social scientists and relatively independent of the larger political process. Whether one or the other is used has pretty much been up to the advice of the segment of the social science community from which advice was sought, motivated in part by frustration with a previously used model. The issue, in other words, is up to us to decide.

The remaining issues in the metascientific group are much more involved with extrascientific issues of human nature, social systems, and political process. I have classified them here only because I judge that a first step in their resolution would be developing a consensus among evaluation methodologists, and such a consensus would involve agreement on metascientific issues rather than on details of method.


Separation of Implementation and Evaluation

A well-established policy in those U.S. government agencies most committed to program evaluation is to have program implementation organizationally separated from program evaluation. This recommendation comes from the academic community of scientific management theory, proliferated in governmental circles of the late 1960's as "Programming, Planning, and Budgeting System," or PPBS, in which these functions, plus program monitoring or evaluation, were to be placed in a separate organizational unit independent of the operating agencies. (Williams & Evans, 1969, provide one relevant statement of this policy.) This recommendation is based on an organizational control theory of checks and balances. It is supported not only by general observations on human reluctance to engage in self-criticism, but more particularly by observations of a long-standing self-defeating U.S. practice in which progress reports and other program evaluations are of necessity designed with the primary purpose of justifying the following year's budget. For the typical administrator of an ameliorative program in the U.S.A., be it a new experimental program or one of long standing, budgets must be continually justified, and are usually on a year-to-year basis, with six months or more lead-time rare. For such an administrator, program evaluations can hardly be separated from this continual desperate battle. In this context, it makes excellent sense to turn program evaluations over to a separate unit having no budgetary constraints on an honest evaluation. And so far, the policy is unchallenged.

My own observations, however, lead me to the conclusion that this policy is not working either. The separation works against modes of implementation that would optimize interpretability of evaluation data. There are such, and low-cost ones too, but these require advance planning and close implementer/evaluator cooperation. The external evaluators also tend to lack the essential qualitative knowledge of what happened. The chronic conflict between evaluators and implementers, which will be bad enough under a unified local direction, tends to be exacerbated. Particularly when combined with U.S. research contracting procedures, the relevance of the measures to local program goals and dangers is weakened. Evaluation becomes a demoralizing influence and a source of distracting conflict. It might be hoped that through specialization, more technically proficient methodologists would be employed. If there is such a gain, it is more than lost through reduced experimental control.

These problems are, of course, not entirely due to the separation of implementation and evaluation. And the reasons that argue for the separation remain strong. Yet the problems are troublesome and related enough to justify reconsidering the principle, particularly when it is noted that the separation seems totally lacking in experimental science. This raises the metascientific issue of how objectivity in science is obtained in spite of the partisan bias of scientists, and of the relevance of this model for objectivity in program evaluation.


In ordinary science, the one who designs the experiment also reads the meter. Comparably biasing motivational problems exist. Almost inevitably, the scientist is a partisan advocate of one particular outcome. Ambiguities of interpretation present themselves. Fame and careers are at stake. Errors are made, and not all get corrected before publication, with the hypothesis-supporting errors much less likely to be caught, etc. The puzzle of how science gets its objectivity (if any) is a metascientific issue still unresolved. While scientists are probably more honest, cautious, and self-critical than most groups, this is more apt to be a by-product of the social forces that produce scientific objectivity than the source. Probably the tradition and possibility of independent replication is a major factor. As the philosophers and sociologists of science better clarify this issue, evaluation research methodologists should be alert to the possibility of models applicable to their area. Jumping ahead speculatively, I come to the following tentative stance.

Ameliorative program implementation and evaluation in the U.S.A. today need more zeal, dedication, and morale. These would be increased by adopting the scientist's model of experimenter-evaluator. If the conditions for cross-validating replication could be established, and if budgetary jeopardy from negative evaluations could be removed (for example, by allowing program implementers to shift to alternative programs in pursuit of the same goal), then the policy separation of implementation and evaluation should be abandoned.

The issue does not have to be one or the other. External evaluations can be combined with in-house evaluations. Certainly, even under present budgeting systems, program implementers should be funded to do their own evaluations and to argue their validity in competition with external evaluations. The organizational arrangement separating evaluation from implementation is borrowed from the model of external auditors, and it should be remembered that in accounting, auditors check on the internal records rather than creating new data. Perhaps some such evaluation methodologist's audit of internal evaluation records would be enough of an external evaluation.

Maximizing Replication and Criticism

Continuing the same metascience theme as in the previous section: a number of other recommendations about policy research emerge, some of which run counter to current U.S. orthodoxy and practice.

At present, the preference is for single, coordinated, national evaluations, even where the program innovation is implemented in many separate, discrete sites. If one were to imitate science's approach to objectivity, it would instead seem optimal to split up the big experiments and evaluations into two or more contracts with the same mission, so that some degree of simultaneous replication would be achieved. Our major evaluations of compensatory education programs (e.g., Head Start, Follow Through) offer instances which were of such magnitude that costs would not have been appreciably increased by this process. We could often, if we so planned, build in some of the competitive replication that keeps science objective.

One positive feature of the U.S. evaluation research scene in this regard is the widespread advocacy and occasional practice of reanalysis by others of program evaluation data. The Russell Sage Foundation has funded a series of these, including one on the "Sesame Street" preschool television programs (Cook, et al., 1975). The original governmental evaluation of the Head Start compensatory preschool program (Cicirelli, 1969) has been reanalyzed by Smith and Bissell (1970) and Barnow (1973), and others are in progress. Similarly for several other classic bodies of evaluation data, although this is still a rare activity and many sets of data are not made available.


One needed change in research customs or ethics is toward the encouragement of "minority reports" from the research staff. The ethic that the data should be available for critical reanalysis should be explicitly extended to include the staff members who did the data collection and analysis and who very frequently have the detailed insight to see how the data might be assembled to support quite different conclusions than the official report presents. At present, any such activity would be seen as reprehensible organizational disloyalty. Because of this, an especially competent source of criticism, and through this a source of objectivity, is lost. An official invitation by sponsor and administrator to every member of the professional evaluation team to prepare minority reports would be of considerable help in reducing both guilt and censure in this regard.

In this regard, we need to keep in mind two important models of social experimentation. On the one hand there is the big-science model, exemplified in the Negative Income Tax experiments to be discussed below (see also Kershaw's paper in this volume). On the other hand, there is the low-budget "administrative experiment" (Campbell, 1967; Thompson, 1974), in which an administrative unit such as a city or state (or factory, or school) introduces a new policy in such a way as to achieve experimental or quasi-experimental tests of its efficacy. Wholey's paper (in this volume) describes such studies, and the Urban Institute in general has pioneered in this regard. Hatry, Winnie, and Fisk's Practical Program Evaluation for State and Local Government Officials (1973) exemplifies this emphasis. For administrative experimentation to produce objectivity, cross-validating diffusion is needed, in which those cities or states, etc., adopting a promising innovation confirm its efficacy by means of their own evaluation effort.

Decentralization of decision-making has the advantage of creating more social units that can replicate and cross-validate social ameliorative inventions or that can explore a wide variety of alternative solutions simultaneously. Even without planning, the existence in the U.S.A. of state governments creates quasi-experimental comparisons that would be unavailable in a more integrated system. Zeisel (1971) has argued this well, and it is illustrated in the study of Baldus (1973) cited more extensively below. If factories, schools, and units of similar size are allowed independent choice of programs, and if borrowed programs are evaluated as well as novel ones, the contagious borrowing of the most promising programs would provide something of the validation of science.

Evaluation Research as Normal Rather than Extraordinary Science

The metascientific points so far have shown little explicit reference to the hot metascientific issues in the U.S. today. Of these, the main focus of discussion is still Thomas Kuhn's Structure of Scientific Revolutions (1970). While I would emphasize the continuity and the relative objectivity of science more than he does (as you have already seen), I recognize much of value in what he says, and some of it is relevant here. To summarize: There are normal periods of scientific growth during which there is general consensus on the rules for deciding which theory is more valid. There are extraordinary or revolutionary periods in science in which the choices facing scientists have to be made on the basis of decision rules which are not part of the old paradigm. Initially, the choice of the new dominant theory after such a revolution is unjustified in terms of the decision rules of the prior period of normal science.


For evaluation research, the Kuhnian metaphor of revolution can be returned to the political scene. Evaluation research is clearly something done by, or at least tolerated by, a government in power. It presumes a stable social system generating social indicators that remain relatively constant in meaning so that they can be used to measure the program's impact. The programs which are implemented must be small enough not to seriously disturb the encompassing social system. The technology I am discussing is not available to measure the social impact of a revolution. Even within a stable political continuity, it may be limited to the relatively minor innovations, as Zeisel has argued in the case of experimentation with the U.S. legal system. (Needless to say, I do not intend this to constitute a valid argument against making changes of a magnitude that precludes evaluation.)

Statistical Issues

In this section I will get into down-to-earth issues where we quantitative evaluation methodologists feel most at home. Here are issues that clearly call for a professional skill. Here are issues that both need solving and give promise of being solvable. These statistical issues are ones that assume a solution to the metascientific issues in favor of a quantitative experimental approach. In this section, I will start with a useful common-sense method–the interrupted time-series. Next will come some popular but unacceptable regression approaches to quasi-experimental design. Following that, problems with randomization experiments will be discussed, and following that, a novel compromise design.

The Interrupted Time-Series Design

By this term I cover the formalization of the widespread common practice of plotting a time-series on some social statistic and attempting to interpret it. This practice, the problem encountered, and the solution have been developed independently in many nations. I will start from some non-U.S. examples.

Figure 2 shows data on sex crimes in Denmark and possibly the effect of removing restrictions on the sale of pornography (Kutchinsky, 1973). Kutchinsky is cautious about drawing causal conclusions, emphasizing that changes in the tolerance of citizens in lodging complaints, and of policemen, may have produced a drop in the number of reported offenses without a drop in actual offenses. By studying attitudes of citizens and police over time, and by other subtle analyses, he concludes that for child molestation these other explanations do not hold, and one must conclude that a genuine drop in the frequency of this crime occurred. However, the graphic portrayal of these trends in Figure 3 is less convincing than Figure 2 because of a marked downward trend prior to the increased availability of pornography. In both cases interpretation of effects is made more difficult by the problem of when to define onset. In 1965 hard-core pornographic magazines became readily available. In 1969 the sale of pornographic pictures to those 16 or older was legalized, etc. Kutchinsky's presentation is a model of good quasi-experimental analysis in its careful searching out of other relevant data to evaluate plausible rival hypotheses.


Figure 4 shows the impact in Romania of the October 1966 population policy change which greatly restricted the use of abortion, reduced the availability of contraceptives, and provided several new incentives for large families (David & Wright, 1971; David, 1970). The combined effect is clear and convincing, with the change in the abortion law probably the main factor, particularly for the July-September 1967 peak. Presumably the subsequent decline represents a shift to other means of birth control. While clear visually, the data offer problems for the application of tests of significance. The strong seasonal trend rules out the application of the best statistical models (Glass, Willson, & Gottman, 1972), and there are not enough data points plotted here upon which to base a good seasonal adjustment. The point of onset for computing purposes is also ambiguous. Is it October 1, 1966, or six months later as per the prior rule permitting abortions in the first three months? Or nine months later? A shift to annual data obviates these two problems, but usually there are too few years or too many other changes to permit the use of tests of significance.

Figure 5 shows annual data and also provides an opportunity to look for the effect of legalizing abortion in 1957. This occurred at a time when the rate of use of all means of birth control, including abortion, was increasing, and there is no graphic evidence that the 1957 law accelerated that trend. In other data not presented here, there is illustrated a chronic methodological problem with this design: Social systems react to abrupt changes by using all of the discretionary decision points to minimize that change. Thus the abrupt onset of the October 1966 decrees also produced an immediate increase in stillbirths, many of which were no doubt substitutes for the newly outlawed abortions. Such compensation was, however, too minimal to prevent an immediate increase in births.


Figure 6 shows the effect of the British Breathalyser crackdown of 1967, illustrated here more dramatically than it has yet appeared in any British publication. The British Ministry of Transport dutifully reported strong results during the year following. Their mode of report was in terms of the percentage of decline in a given month compared with the same month one year earlier. This is better than total neglect of seasonal effects, but it is an inefficient method because unusual "effects" are often due as much to the eccentricity of the prior period as to that of the current one. It also precludes presentation of the over-all picture. The newspapers duly noted the success, but interest soon faded, and today most British social scientists are unaware of the program's effectiveness. In Figure 6 the data have been adjusted to attempt to correct for seasonal trend, uneven number of days per month, uneven number of weekends per month, and, for the month of October 1967, the fact that the crackdown did not begin until October 9. All of these adjustments have problems and alternative solutions. In this particular case, the effects are so strong that any approach would have shown them, but in many instances this will not be so. The data on commuting hours serve as a control for weekend nights.
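To make the sort of adjustment just described concrete, here is a minimal Python sketch. It is not the procedure actually used for Figure 6: the casualty series is invented, and the sketch handles only two of the adjustments mentioned (unequal month lengths and month-of-year seasonality), with the seasonal factors estimated from the pre-crackdown years so that the intervention does not contaminate them.

```python
# Minimal sketch of a seasonal adjustment of the kind described above,
# applied to an invented monthly casualty series (not the actual British data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.period_range("1964-01", "1969-12", freq="M")
seasonal = 1 + 0.2 * np.sin(2 * np.pi * (np.asarray(months.month) - 1) / 12)
crackdown = pd.Period("1967-10", freq="M")
level = np.where(months >= crackdown, 700.0, 1000.0)        # assumed true drop
casualties = pd.Series(level * seasonal + rng.normal(0, 40, len(months)),
                       index=months)

# Put counts on a per-day basis to handle unequal month lengths.
per_day = casualties / np.asarray(casualties.index.days_in_month)

# Month-of-year factors estimated from pre-crackdown months only.
pre = per_day[per_day.index < crackdown]
factors = pre.groupby(pre.index.month).mean()
factors = factors / factors.mean()

adjusted = per_day / factors.reindex(per_day.index.month).to_numpy()
print(adjusted.groupby(adjusted.index >= crackdown).mean())  # pre vs. post level
```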


Figure 7 shows data from Baldus (1973) on the substantial effects of a law that Baldus believes to be evil just because it is effective. This law requires that, when a recipient of old age assistance (charity to the poor from the government) dies and leaves money or property, the government must be repaid. In our capitalist ideology, shared even by the poor, many old people will starve themselves just to be able to leave their homes to their children. Baldus has examined the effects of such laws in some 40 cases where states have initiated them and in some 40 other cases where states have discontinued them. In each case, he has sought out nearby, comparable states that did not change their laws to use as comparisons (a toy numerical sketch of this comparison logic follows below). One such instance is shown in Figure 7.

Figure 8 is a weak example of a time-series study because it has so few time periods. It has a compensatory strength because the several comparison groups are themselves constituted on a quantitative dimension. I include it primarily because it seems to indicate that the U.S. Medicaid legislation of 1964 has had a most dramatic effect on the access to medical attention of the poorest group of U.S. citizens.
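As a toy illustration of the comparison-state logic in the Baldus example (invented numbers, not his data or his analysis), the sketch below contrasts the pre-to-post change in a law-change state with the change over the same years in a nearby comparison state, so that shared regional trends drop out of the estimate.

```python
# Toy illustration of the comparison-state logic described above (hypothetical
# figures): shared regional trends cancel in the difference of the two changes.
law_state = {"before": 1000, "after": 700}    # old-age-assistance recipients, law-change state
comparison = {"before": 980, "after": 950}    # nearby comparison state, no law change

change_law = law_state["after"] - law_state["before"]      # -300
change_comp = comparison["after"] - comparison["before"]   # -30
estimated_effect = change_law - change_comp                # -270

print(f"Change in law-change state:  {change_law}")
print(f"Change in comparison state:  {change_comp}")
print(f"Estimated effect of the law: {estimated_effect}")
```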


The interrupted time-series design is of the very greatest importance for program evaluation. It is available where the new program affects everyone and where, therefore, a proper control group can usually not be constituted. If comparison group data are available, it ranks as the strongest of all quasi-experimental designs (Campbell & Stanley, 1966). It can often be reconstructed from archival data. Graphically presented, it is readily understood by administrators and legislators. Therefore, it is well worth the maximum of technical development. Following is a brief list of its methodological problems as we have encountered them.

1. Tests of significance are still a problem. Ordinary least squares estimation is usually inapplicable because of autoregressive error; therefore moving-average models seem most appropriate. Glass, Willson, and Gottman (1972) have assembled the best approach, which builds on the work of Box and Tiao (1965) and Box and Jenkins (1970). These models require that systematic cycles in the data be absent, but all methods of removing them tend to under-adjust. They also require a large number of time-points, and will sometimes fail to confirm an effect which is compelling visually as graphed. They will also occasionally find a highly significant impact where visual inspection shows none. (A small illustrative sketch of such a test is given at the end of this section.)

2. Removing seasonal trends remains a problem. Seasonal trends are themselves unstable and require a moving-average model. The month-to-month change coincident with the program change should not be counted as purely seasonal; thus the series has to be split at this point for estimating the seasonal pattern. Therefore, the parts of the series just before and just after the program initiation become series ends, and corrections for these are much poorer than for mid-series points (Kepka, 1971; McCain, in preparation).

3. There is a tendency for new administrations that initiate new programs to make other changes in the record-keeping system. This often makes changes in indicators uninterpretable (Campbell, 1969b, pp. 414-415). Where possible, this should be avoided.

4. Where programs are initiated in response to an acute problem (e.g., a sudden change for the worse in a social indicator), ameliorative effects of the program are confounded with "regression artifacts" due to the fact that in an unstable series, points following an extreme deviation tend to be closer to the general trend (Campbell, 1969b, pp. 413-414).

5. Gradually introduced changes are usually impossible to detect by this design. If an administrator wants to optimize evaluability using this design, program initiation should be postponed until preparations are such that it can be introduced abruptly. The British Breathalyser crackdown exemplifies this optimal practice (see Figure 6, above).


6. Because long series of observations are required, we tend to be limited to indicators that are already being recorded for other purposes. While these are often relevant (e.g., births and deaths), and while even the most deliberately designed indicators are never completely relevant, this is a serious limitation. Particularly lacking are reports on the participants' experiences and perceptions. On the other hand, it seems both impossible and undesirable to attempt to anticipate all future needs and to initiate bookkeeping procedures for them. Some intermediate compromise is desirable, even at the expense of adding to the forms to be filled out and the records to be kept. For institutional settings, it would be valuable to receive from all participants "Annual Reports for Program Evaluation" (Gordon & Campbell, 1971). In educational settings, teachers, students, and parents could file such a report. Note that at present the school system records how the pupil is doing but never records the pupil's report on how the school is doing. Teachers are annually rated for efficiency but never get a chance to systematically rate the policies they are asked to implement. Some first steps in this direction are being explored (Weber, Cook, & Campbell, 1971; Anderson, 1973). In the U.S. social welfare system, both social worker and welfare recipient would be offered the opportunity to file reports (Gordon & Campbell, 1971). All ratings would be restricted to the evaluation of programs and policies, not persons, for reasons to be discussed below.
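Before turning from the interrupted time-series design, the sketch below illustrates the significance-testing issue raised in problem 1 of the list above. It is a hypothetical example, not an analysis of any series shown in the figures: the monthly data, the AR(1) disturbance, and the size of the step are all invented, and statsmodels' ARIMA with an exogenous step regressor stands in for the Box and Tiao / Box and Jenkins transfer-function models that Glass, Willson, and Gottman assemble.

```python
# Hypothetical illustration of problem 1: testing an abrupt-onset effect in a
# series with autocorrelated error. OLS on the step dummy ignores the serial
# dependence and overstates precision; an AR(1) model with the step as an
# exogenous regressor accounts for it. All numbers are invented.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
n, onset = 120, 80                        # 120 monthly points, program begins at t = 80
step = (np.arange(n) >= onset).astype(float)

noise = np.zeros(n)
for t in range(1, n):                     # AR(1) disturbance
    noise[t] = 0.6 * noise[t - 1] + rng.normal(0, 1)
y = 50.0 - 5.0 * step + noise             # true effect: a drop of 5 units

# Naive OLS test of the step (standard error too small under autocorrelation).
ols = sm.OLS(y, sm.add_constant(step)).fit()

# AR(1) model with the step as an exogenous regressor.
arima = ARIMA(pd.Series(y), exog=pd.DataFrame({"step": step}),
              order=(1, 0, 0)).fit()

print(f"OLS estimate of step:   {ols.params[1]:.2f} (s.e. {ols.bse[1]:.2f})")
print(f"ARIMA estimate of step: {arima.params['step']:.2f} (s.e. {arima.bse['step']:.2f})")
```

Both fits recover a drop of roughly five units; the point of the comparison is that the model with autocorrelated error gives the more honest standard error for judging significance.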


Regression Adjustments as Substitutes for Randomization

The commonest evaluation design in U.S. practice consists in administering a novel program to a single intact institution or administrative unit, with measures before and after. While this leaves much to be desired in the way of controls, it is still informative enough to be worth doing. Almost as frequently, this design is augmented by the addition of a comparison group which is also measured before and after. This is typically another intact social unit which does not receive the new program and is judged comparable in other respects. It usually turns out that these two groups differ even before the treatment, and a natural tendency is to try to adjust away the difference. In statistical practice in the U.S. today, the means by which this is done are, in my opinion, almost always wrong. What has happened is that a set of statistical tools developed for and appropriate to prediction are applied to causal inference purposes for which they are inappropriate. Regression analysis, multivariate statistics, and covariance analysis are some of the names of the statistical tools I have in mind. Whether from educational statistics or economics, the choice of methods seems to be the same. The economists have a phrase for the problem: "error in variables" or, more specifically, "error in independent variables." But while theoretically aware of the problem, they are so used to regarding their indicators as essentially lacking in error that they neglect the problem in practice. What they forget is that irrelevant systematic components of variance create the same problem as does random error, leading to the same bias of underadjustment. Note that the presence of error and unique variance has a systematic effect, i.e., it operates as a source of bias rather than as a source of instability in estimates. This fact, too, the economists and others neglect. Thus efforts to correct for pretreatment differences by "regression adjustments" on the means, or by "partialing out" pretest differences, or by covariance adjustments all lead to underadjustment unless the pretest (or other covariate) is a perfect measure of what pretest and posttest have in common. The older technique of using only cases matched on pretest scores is well known to produce "regression artifacts" (Thorndike, 1942; Campbell & Stanley, 1966). Covariance turns out to produce the same bias, the same degree of underadjustment, only with greater precision (Lord, 1960, 1969; Porter, 1967; Campbell & Erlebacher, 1970), and also for multiple regression and partial correlation (e.g., Cook & Campbell, 1975). Essentially the same problem emerges in ex post facto studies where, although there is no pretest, other covariates are available for adjustment. A common version of the problem occurs where some persons have received a treatment and there is a larger population of untreated individuals from which "controls" are sought and a comparison group assembled.

In U.S. experience it has become important to distinguish two types of setting in which this type of quasi-experimental design and these types of adjustments are used, since the social implications of the underadjustment are opposite. On the one hand, there are those special opportunity programs, like university education, which are given to those who need them least, or as more usually stated, who deserve them most or who are more likely to be able to profit from them. Let us call these "distributive" programs, in contrast with the "compensatory" programs, or those special opportunities given to those who need them most.

For the distributive programs, the treatment group will usually be superior to the control group or the population from which the quasi-experimental controls are chosen. In this setting the inevitable underadjustment due to unique variance and error in the pretest and/or other covariates (the "regression artifacts") works to make the treatment seem effective if it is actually worthless and to exaggerate its effectiveness in any case. For most of us, this seems a benign error, confirming our belief in treatments we know in our hearts are good. (It may come as a surprise, but the U.S. Sesame Street preschool educational television program is "distributive," in that children from better-educated parents watch it more.) (Cook, et al., 1975.)

For compensatory programs usually, although not always, the control group starts out superior to the treatment group, or is selected from a larger population whose average is superior. In this setting, the biases of underadjustment, the regression artifacts, are in the direction of underestimating program effectiveness and of making our programs seem harmful when they are merely worthless. This quasi-experimental research setting has occurred for our major evaluations of compensatory education programs going under the names of Head Start, Follow Through, Performance Contracting, Job Corps (for unemployed young men), and many others. In the major Head Start evaluation (Cicirelli, 1969; Campbell & Erlebacher, 1970), this almost certainly accounts for the significantly harmful effects shown in the short three-month, ten-hours-a-week program. I am persuaded that the overwhelming prevalence of this quasi-experimental setting and adjustment procedure is one of the major sources of the pessimistic record of such compensatory education efforts. The very few studies in compensatory education which have avoided this problem by random assignment of children to experimental and control conditions have shown much more optimistic results.

1973). These require major revisions of our test score practice. When various scoring models are applied to a single population on a single occasion, all scoring procedures correlate so highly that one might as well use the simplest. But when two groups that differ initially are measured at two different times in a period of rapid growth, our standard test score practices have the effect of making the gap appear to increase, if, as is usual, test reliability is increasing. The use of a correction for guessing becomes important. The common model that assumes “true score” and “error” are independent needs to be abandoned, substituting one that sees error and true score negatively correlated across persons (the larger the error component, the smaller the true score component).
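A small simulation makes the guessing point concrete. The sketch below is a hypothetical illustration, not from the paper: the test length, the number of answer choices, and the ability distribution are invented. It shows that on a multiple-choice test the examinees who know the least guess the most, so their scores carry the largest error components (true score and error are negatively related across persons), and that the classical correction for guessing removes the resulting bias.

```python
# Hypothetical multiple-choice sketch: guessing error is largest for low-ability
# examinees, and the classical correction R - W/(c-1) removes the bias on average.
import random
import statistics

random.seed(3)
k, choices = 40, 4                                       # 40 items, 4 alternatives each
examinees = []
for _ in range(2000):
    known = random.randint(0, k)                         # items the examinee truly knows
    guessed_right = sum(random.random() < 1 / choices for _ in range(k - known))
    raw = known + guessed_right                          # observed raw score
    wrong = k - raw
    corrected = raw - wrong / (choices - 1)              # classical correction for guessing
    examinees.append((known, raw - known, corrected - known))

errors_low = [e for known, e, _ in examinees if known < k // 2]
errors_high = [e for known, e, _ in examinees if known >= k // 2]
print("mean guessing error, low-ability examinees: ", round(statistics.mean(errors_low), 2))
print("mean guessing error, high-ability examinees:", round(statistics.mean(errors_high), 2))
print("mean bias of corrected scores:              ",
      round(statistics.mean(c for _, _, c in examinees), 2))
```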
Problems with Randomized Experiments

The focal example of a good social experiment in the U.S. today is the New Jersey Negative Income Tax Experiment (Watts & Rees, 1973; The Journal of Human Resources, Vol. 9, No. 2, Spring 1974; Kershaw’s paper in this volume; Kershaw, 1972; Kershaw & Fair, 1973). This is an experiment dealing with a guaranteed annual income as an alternative to present U.S. welfare systems. It gets its name from the notion that when incomes fall below a given level, the tax should become negative, that is, the government should pay the citizen rather than the citizen paying a tax to the government. It also proposes substituting income-tax-like procedures for citizen reports of income in place of the present social worker supervision. In this experiment some 600 families with a working male head of household received income support payments bringing their income up to some level between $3,000 and $4,000 per year for a family of four, under one of eight plans which differ as to support level and incentive for increasing own earnings. Another 600 families received no income support but cooperated with the quarterly interviews. The experiment lasted for three years, and preliminary final results are now available. This study when completed will have cost some $8,000,000, of which $3,000,000 represented payments to participants and necessary administrative costs, and $5,000,000 research costs, the costs of program evaluation. Before this study was half completed, three other negative income tax experiments were started, some much bigger (rural North Carolina and Iowa, Gary, and Seattle and Denver). The total U.S. investment for these experiments now totals $65,000,000. It is to me amazing, and inspiring, that our nation achieved, for a while at least, this great willingness to deliberately “experiment” with policy alternatives using the best of scientific methods.

This requires a brief historical note. The key period is 1964-68. L.B. Johnson was President and proclaimed a “Great Society” program and a “War on Poverty.” In Washington, D.C. administrative circles, the spirit of scientific management (the PPBS I’ve already criticized and will again) had already created a sophisticated interest in hard-headed program evaluation. Congress was already writing into its legislation for new programs the requirement that 1% (or some other proportion) of the program budget be devoted to evaluation of effectiveness. In a new agency, the Office of Economic Opportunity, a particularly creative and dedicated group of young economists was recruited, and these scientist-evaluators were given an especially strong role in
guiding over-all agency policy. This OEO initiated the first two of the Negative Income Tax experiments. (Two others were initiated from the Department of Health, Education and Welfare.) Under the first Nixon administration, 1968-72, OEO programs were continued, although on a reduced scale. Under the second Nixon administration, OEO itself was dismantled, although several programs were transferred to different agencies. I believe that all four of the Negative Income Tax Experiments are still being carried out much as planned. Initiation of new programs has not entirely ceased, but it is greatly reduced. My general advice to my fellow U.S. social scientists about the attitude they should take toward this historical period is as follows: Let us use this experience so as to be ready when this political will returns again. In spite of the good example provided by the New Jersey Negative Income Tax Experiment, over-all we were not ready last time. Competent evaluation researchers were not available when Model Cities Programs, Job Corps Programs, etc. went to their local universities for help. Perhaps 90% of the funds designated for program evaluation were wasted; at any rate, 90% of the programs came out with no interpretable evidence of their effectiveness. The available evaluation experts grossly overestimated the usefulness of statistical adjustments as substitutes for good experimental design, including especially randomized assignment to treatment.

In this spirit I would like to use the experience of the New Jersey Negative Income Tax Experiment to elucidate the methodological problems remaining to be solved in the best of social experiments. In this spirit, my comments are apt to sound predominantly critical. My over-all attitude, however, is one of highest approval. Indeed, in lectures in the U.S. I often try to shock audiences with the comment that the New Jersey experiment is the greatest example of applied social science since the Russian Revolution.

The major finding of the NJNITE is that income guarantees do not reduce the effective work effort of employed poor people. This finding, if believed, removes the principal argument against such a program–for on a purely cost basis, it would be cheaper than the present welfare system unless it tempted many persons now employed to cease working. The major methodological criticisms of the study are focused on the credibility of the belief that this “laboratory” finding would continue to hold up if the support program became standard, permanent U.S. policy. These are questions of “external validity” (Campbell & Stanley, 1966) or of “construct validity,” as Cook (Cook & Campbell, 1975) applies the term developed initially for measurement theory. Two specific criticisms are prominent: One, there was a “Hawthorne Effect” or a “Guinea-pig Effect.” The experimental families knew they were exceptional participants in an artificial arrangement and that the spotlight of public attention was upon them. Therefore, they behaved in a “good,” industrious, respectable way, producing the results obtained. Such motivation would be lacking once the program was universal. Two features in the NJNITE implementation can be supposed to accentuate this. There was publicity about the experiment at its start, including television interviews with selected experimental subjects; and the random selection was by families rather than neighborhoods, so each experimental family was surrounded by equally poor neighbors, who were not getting this beneficence. The second common criticism, particularly among economists,
might be called the time-limit effect. Participants were offered the support for exactly three years. It was made clear that the experiment would terminate at that time. This being the case, prudent participants would hang on to their jobs unless they could get better ones, so that they would be ready for a return to their normal financial predicament.

It should be recognized that these two problems are in no way specific to randomized experiments, and would also have characterized the most casual of pilot programs. They can only be avoided by the evaluation of the adoption of NIT as a national policy. Such an evaluation will have to be quasi-experimental, as by time-series, perhaps using several Canadian cities as comparisons. Such evaluations would be stronger on external, construct validity, but weaker on internal validity. It is characteristic of our national attitudes, however, that this quasi-experimental evaluation is not apt to be done well, if at all–once we’ve chosen a policy we lose interest in evaluating it. Had the NJNITE shown a reduction in work effort, national adoption of the policy would have been very unlikely. For this reason alone, it was well worth doing and doing well.

The details of the experiment draw attention to a number of problems of method that need detailed attention from creative statisticians and social psychologists. These will only be mentioned here, but are being treated more extensively elsewhere (Riecken, et al., 1974). The issue of the unit of randomization has already been raised in passing. Often there is a choice of randomizing larger social units than persons or families–residential blocks, census tracts, classrooms, schools, etc. are often usable. For reasons of statistical efficiency, the smaller, more numerous units are to be preferred, maximizing the degrees of freedom and the efficacy of randomization. But the use of larger units often increases construct validity.

Problems of losses due to refusals and later attrition interact with the choice of the stage of respondent recruitment at which to randomize. NJNITE used census statistics on poverty areas and sample survey approaches to locate eligible participants. Rethinking their problem shows the usefulness of distinguishing two types of assent required, for measurement and for treatment; thus two separate stages for refusals occur. There emerge three crucial alternative points at which randomization could be done:
In NJNITE, alternative 1 was employed. Subsequently control group respondents were asked to participate in the measurement, and experimental subjects were asked to participate in both measurement and treatment. As a result, there is the possibility that the experimental group contains persons who would not have put up with the bother of measurement had they by chance been invited into the control group. Staging the invitations separately, and randomizing from among those who had agreed to the survey (i.e., to the control group condition), would have ensured comparability. In NJNITE there were some refusals of experimental treatment because of unwillingness to accept charity. This produces dissimilarity, but the bias due to such differential refusal can be estimated if those refusing treatment continue in the measurement activity. Alternative 2 is the point we would now recommend.

One could consider deferring the randomizing still further, to alternative 3. Under this procedure, all potential participants would be given a description of each of the experimental conditions and his chances for each. He would then be asked to agree to participate no matter what outcome he drew by chance. From those who agreed to all this, randomized assignment would be made. This alternative is one that is bound to see increased use. The opportunity for differential refusal is minimized (though some will still refuse when they learn their lot). This seems to maximize the “informed consent” required by the U.S. National Institutes of Health for all of the medical and behavioral research funded by them. The U.S. Social Science Research Council’s Committee on Experimentation as a Method for Planning and Evaluating Social Programs (Riecken, et al., 1974) failed to recommend alternative 3, however. In net, they judge informed consent to be adequately achieved when the participant is fully informed of the treatment he is to receive. Informing the control group participants of the benefits that others were getting and that they almost got would have caused discontent, and have made the control treatment an unusual experience rather than representative of the absence of treatment. The tendency of control subjects to drop out more frequently than experimental subjects would have been accentuated, and thus this approach to greater comparability would be in the end self-defeating. These are reasonable arguments on both sides. More discussion and research are needed on the problem.
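The procedural difference between the alternatives can be sketched in a few lines of code. The following is a hypothetical illustration of alternative 2 only, not a reconstruction of the NJNITE procedure; the data fields and function name are invented for the example.

```python
# Hypothetical sketch of "alternative 2": secure agreement to measurement first,
# then randomize to treatment only among those who agreed, so that refusal of
# measurement cannot differ between the eventual experimental and control groups.
import random

def alternative_2_assignment(candidates, n_treatment, seed=1974):
    """candidates: list of dicts with an 'agrees_to_survey' flag (assumed field)."""
    rng = random.Random(seed)
    panel = [c for c in candidates if c["agrees_to_survey"]]   # stage 1: measurement consent
    rng.shuffle(panel)                                         # stage 2: randomize within the panel
    return panel[:n_treatment], panel[n_treatment:]            # (experimental, control)
```

Refusal of the treatment itself can still occur after assignment, but as noted above its bias can be estimated so long as the refusers stay in the measurement panel.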
Attrition, and in particular differential attrition, becomes a major problem on which the work of inventive statisticians is still needed. In the NJNITE, attrition rates over the three-year period range from 25.3% in the control group to only 6.5% in the most remunerative
experimental group. These differences are large enough to create pseudo-effects in post-test values. The availability of pretest scores provides some information on the probable direction of bias, but covariance on these values underadjusts and is thus not an adequate correction. Methods for bracketing maximum and minimum biases under various specified assumptions need to be developed. Where there is an encompassing periodic measurement framework that still retains persons who have ceased cooperating with the experiment, other alternatives are present that need developing.

Such measurement frameworks appropriate to NJNITE would include the Social Security Administration’s records on earnings subject to social security withholding and claims on unemployment insurance, the Internal Revenue Service records on withholding taxes, hospitalization insurance records, etc. These records are occasionally usable in evaluation research (e.g., Levenson & McDill, 1966; Bauman, David & Miller, 1970; Fischer, 1972; Heller, 1972), but the facilities for doing so are not adequately developed, and the concept of such usage may seem to run counter to the current U.S. emphasis on preserving the confidentiality of administrative records (e.g., Ruebhausen & Brim, 1965; Sawyer & Schechter, 1968; Goslin, 1970; Miller, 1971; Westin, 1967; Wheeler, 1969). Because research access to administrative files would make possible so much valuable low-cost follow-up on program innovations, this issue is worthy of discussion somewhere in this paper. For convenience, if not organizational consistency, I will insert the discussion here.

There is a way to statistically relate research data and administrative records without revealing confidential data on individuals (Schwartz & Orleans, 1967; Campbell, Boruch, Schwartz & Steinberg, 1975; Boruch & Campbell, 1974). Let us call this “mutually insulated interfile exchange.” It requires that the administrative file have the capacity for internal statistical analysis of its own records. Without going into detail, I would nonetheless like to communicate the nub of the idea. Figure 9 shows a hypothetical experiment with one experimental group and one control group. In this case there are enough cases to allow a further breakdown by socio-economic level. From these data some 26 lists are prepared, ranging from 8 to 14 persons in length. These lists are assigned designations at random (in this case A to Z) so that list designation communicates no information to the administrative file. The list provides the name of the person, his Social Security number, and perhaps his birth date and birth place. These lists are then turned over to the administrative file, which deletes one person at random from each list, retrieves the desired data from the files on each of the others for whom it is available, and computes mean, variance, and number of cases with data available for each list for each variable. These values for each list designation are then returned to the evaluation researchers, who reassemble them into statistically meaningful composites and then compute experimental and control group means and variances, correlations, interactions with socio-economic level, etc. Thus neither the research file nor the administrative file has learned individual data from the other file, yet the statistical estimates of program effectiveness can be made. In the U.S. it would be a great achievement were this facility for program evaluation to become feasible for regular use.
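The list-exchange arithmetic just described is simple enough to sketch in code. The sketch below is a hypothetical illustration of the idea, not an implementation from the paper or from NJNITE; the function names, list sizes, label pool, and use of Python dictionaries are all invented for the example.

```python
# Hypothetical sketch of "mutually insulated interfile exchange": the researcher
# sends only randomly labeled name lists; the administrative file returns only
# per-list summary statistics; neither side sees the other's individual data.
import random
import statistics

def make_lists(group_ids, label_pool, size=10, seed=26):
    """Researcher side: split one group's identifiers into short lists carrying
    arbitrary, randomly drawn labels that convey nothing to the other file."""
    rng = random.Random(seed)
    ids = list(group_ids)
    rng.shuffle(ids)
    chunks = [ids[i:i + size] for i in range(0, len(ids), size)]
    labels = rng.sample(label_pool, len(chunks))
    return dict(zip(labels, chunks))                     # e.g. {"Q": [...], "C": [...]}

def administrative_summary(lists, records, seed=7):
    """Administrative-file side: delete one person at random from each list and
    return only the mean, variance, and n of the requested variable per list."""
    rng = random.Random(seed)
    out = {}
    for label, ids in lists.items():
        dropped = rng.choice(ids)
        values = [records[i] for i in ids if i != dropped and i in records]
        out[label] = (statistics.mean(values), statistics.pvariance(values), len(values))
    return out

def group_mean(summaries, labels):
    """Researcher side: recombine the per-list statistics for one group."""
    n = sum(summaries[lab][2] for lab in labels)
    return sum(summaries[lab][0] * summaries[lab][2] for lab in labels) / n

# usage sketch: exp_lists = make_lists(experimental_ids, list("ABCDEFGHIJKLM"))
```

The random deletion of one person per list, and the restriction to list-level means and variances, are what keep either file from reconstructing an individual record from the exchange.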
To return to attrition problems in randomized experiments, not only do we need new statistical tools for the attrition problem, we also need social-psychological inventions. In long-term experiments such as that of Ikeda, Yinger, and Laycock (1970), in which a university starts working with underprivileged twelve-year-olds during the summers, trying to motivate and guide their high school activities (ages 14 to 18) to be university preparatory, two of the reasons why attrition is so differential are that more recent home addresses are available for the experimentals (who have been in continuous contact) and that the experimentals answer more follow-up inquiries because of gratitude to the project. This suggests that controls in long-term studies might be given some useful service on a continuing basis–less than the experimentals but enough to motivate keeping the project informed of address changes and cooperating with follow-up inquiries (Ikeda, et al., 1970). If one recognizes that comparability between experimental and control groups is more important than completeness per se, it becomes conceivable that comparability might be improved by deliberately degrading experimental group data to the level of the control group. In an exploration of the possibility, Ikeda, Richardson, and I (in preparation) are conducting an extra follow-up of this same study using five-year-old addresses, a remote unrelated inquiring agency, and questions that do not refer specifically to the Ikeda, Yinger and Laycock program. (I offer this unpromising example to communicate my feeling that we need wide-ranging explorations of possible solutions to this problem.)

It is apparent that through refusal and attrition, true experiments tend to become quasi-experiments. Worse than that, starting with randomization makes the many potential sources of bias more troubling in that it focuses awareness on them. I am convinced, however, that while the biases are more obvious, they are in fact considerably less than those accompanying more casual forms of selecting comparison groups. In addition, our ability to estimate the biases is immeasurably greater. Thus, we should, in my judgment, greatly increase our use of random assignment, including in regular admissions procedures in ongoing programs having a surplus of applicants. To do this requires that we develop practical procedures and rationales that overcome the resistance to randomization met with in those settings. Just to communicate to you that there are problems to be solved in these areas, I will briefly sketch several of them.

Administrators raise many objections to randomization (Conner, 1974). While at one time lotteries were used to “let God decide,” now a program administrator feels he is “playing God” himself when he uses a randomization procedure, but not when he is using his own incompetent and partisan judgment based on inadequate and irrelevant information (Campbell, 1971). Participants also resist randomization, though less so when they themselves choose the capsule from the lottery bowl than when the administrator does the randomization in private (Wortman, et al., 1974). Collecting a full list of eligible applicants and then randomizing often causes burdensome delays, and it may be better to offer a 50-50 lottery to each applicant as he applies, closing off all applications when the program openings have been filled, at which time the controls would be approximately the same in number. For settings like specially equipped old people’s homes, the control group ceases
to be representative of non-experimental conditions if those losing the lottery are allowed to get on the waiting list–waiting for an opening forestalls normal problem-solving. For such settings, a three-outcome lottery is suggested: (1) admitted; (2) waiting list; (3) rejected. Group 3 would be the appropriate control. For agencies having a few new openings each week or so, special “trickle processing” procedures are needed rather than large-batch randomization. Where the program is in genuinely short supply, one might think that the fact that most people were going without it would reconcile control group subjects to their lot; however, experimental procedures including randomization and measurement may create an acute focal deprivation, making control status itself an unusual treatment. This may result in compensatory striving or low morale (Cook & Campbell, 1975).
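The applicant-by-applicant lottery mentioned above is procedurally trivial, which is part of its appeal. The sketch below is only a hypothetical illustration of that "trickle processing" idea; the function name, the openings count, and the closing rule are invented, and real programs would need the three-outcome variant and record-keeping described in the text.

```python
# Hypothetical sketch of trickle-processing randomization: give each applicant a
# 50-50 lottery at the moment of application and close intake once the openings
# are filled, leaving roughly as many lottery losers to serve as controls.
import random

def trickle_lottery(applicant_stream, openings, seed=50):
    rng = random.Random(seed)
    admitted, controls = [], []
    for applicant in applicant_stream:
        if len(admitted) >= openings:        # close off all further applications
            break
        if rng.random() < 0.5:
            admitted.append(applicant)
        else:
            controls.append(applicant)
    return admitted, controls
```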
Regression-Discontinuity Design

The arguments against randomizing admissions to an ameliorative program (one with more eligible applicants than there is space for) include the fact that there are degrees of eligibility, degrees of need or worthiness, and that the special program should go to the most eligible, needy, or worthy. If eligibility can be quantified (e.g., through ranks, ratings, scores, or composite scores) and if admission for some or all of the applicants can be made on the basis of a strict application of this score, then a powerful quasi-experimental design, regression-discontinuity, is made possible. General explanation and discussion of administrative details are to be found in Campbell (1969b) and Riecken, et al. (1974). Sween (1971) has provided appropriate tests of significance. Goldberger (1971), working from an econometric background, has made an essentially equivalent recommendation. The application of quantified eligibility procedures usually involves at least as great a departure from ordinary admission procedures as does randomization. Developing specific routines appropriate to the setting is necessary. But once instituted, their economic costs would be low and would be more than compensated for by increased equity of the procedures. Resistance, however, occurs. Administrators like the freedom to make exceptions even to the rules they themselves have designed. “Validity” or “reliability” for the quantified eligibility criterion is not required; indeed, as it approaches zero reliability, it becomes the equivalent of randomization.
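To make the logic concrete, the sketch below simulates a regression-discontinuity analysis under invented numbers: the cutoff, effect size, sample size, and the straight-line outcome model are all assumptions of the sketch, not prescriptions from the paper. Admission is granted by strict application of a quantified eligibility score, and the program effect is read off as the jump in the fitted outcome regression at the cutoff.

```python
# Hypothetical regression-discontinuity sketch: strict admission by eligibility
# score, then estimate the outcome jump at the cutoff. All numbers are invented.
import numpy as np

rng = np.random.default_rng(42)
n, cutoff, true_effect = 2000, 0.0, 0.4

need = rng.normal(0, 1, n)                   # quantified eligibility (e.g., composite need score)
admitted = (need >= cutoff).astype(float)    # strict application of the score
outcome = 0.8 * need + true_effect * admitted + rng.normal(0, 1, n)

# Fit separate intercepts and slopes on each side of the cutoff; the coefficient
# on `admitted` is the estimated discontinuity (program effect) at the cutoff.
centered = need - cutoff
X = np.column_stack([np.ones(n), centered, admitted, centered * admitted])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"estimated effect at cutoff: {coef[2]:.3f}  (true value {true_effect})")
```

Because assignment depends only on the measured score, the same score can be modeled on both sides of the cutoff, and the pre-existing group difference is handled by design rather than by the fallible statistical adjustments criticized earlier.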
Political/Methodological Problems

Resistance to Evaluation

In the U.S., one of the pervasive reasons why interpretable program evaluations are so rare is the widespread resistance of institutions and administrators to having their programs evaluated. The methodology of evaluation research should include the reasons for this resistance and ways of overcoming it.

A major source of this resistance in the U.S. is the identification of the administrator and the administrative unit with the program. An evaluation of a program under our political climate becomes an evaluation of the agency and its directors. In addition, the machinery for evaluating programs can be used deliberately to evaluate administrators. Combined with this, there are a number of
factors that lead administrators to correctly anticipate a disappointing outcome. As Rossi (1969) has pointed out, the special programs that are the focus of evaluation interests have usually been assigned the chronically unsolvable problems, those on which the usually successful standard institutions have failed. This in itself provides a pessimistic prognosis. Furthermore, the funding is usually inadequate, both through the inevitable competition of many worthy causes for limited funds and because of a tendency on the part of our legislatures and executives to generate token or cosmetic efforts designed more to convince the public that action is being taken than to solve the problem. Even for genuinely valuable programs, the great effort required to overcome institutional inertia in establishing any new program leads to grossly exaggerated claims. This produces the “overadvocacy trap” (Campbell, 1969b, 1971), so that even good and effective programs fall short of what has been promised, which intensifies fear of evaluation.

The seriousness of these and related problems can hardly be exaggerated. While I have spent more time in this presentation on more optimistic cases, the preceding paragraph is more typical of evaluation research in the U.S. today. As methodologists, we in the U.S. are called upon to participate in the political process in efforts to remedy the situation. But before we do so, we should sit back in our armchairs in our ivory towers and invent political/organizational alternatives which would avoid the problem. This task we have hardly begun, and it is one in which we may not succeed. Two minor suggestions will illustrate. I recommend that we evaluation research methodologists should refuse to use our skills in ad hominem research. While the expensive machinery of social experimentation can be used to evaluate persons, it should not be. Such results are of very limited generalizability. Our skills should be reserved for the evaluation of policies and programs that can be applied in more than one setting and that any well-intentioned administrator with proper funding could adopt. We should meticulously edit our opinion surveys so that only attitudes toward program alternatives are collected and such topics as supervisory efficiency excluded. This prohibition on ad hominem research should also be extended to program clients. We should be evaluating not students or welfare recipients but alternative policies for dealing with their problems. It is clear that I feel such a prohibition is morally justified. But I should also confess that in our U.S. settings it is also recommended out of cowardice. Program administrators and clients have it in their power to sabotage our evaluation efforts, and they will attempt to do so if their own careers and interests are at stake. While such a policy on our part will not entirely placate administrators’ fears, I do believe that if we conscientiously lived up to it, it would initiate a change toward a less self-defeating political climate.

A second recommendation is for advocates to justify new programs on the basis of the seriousness of the problem rather than the certainty of any one answer, and to combine this with the emphasis on the need to go on to other attempts at solution should the first one fail (Campbell, 1969b). Shaver and Staines (1971) have challenged this suggestion, arguing that for an administrator to take this attitude of scientific tentativeness constitutes a default of leadership. Conviction, zeal, enthusiasm, faith are required for any effective effort to change
traditional institutional practice. To acknowledge only a tentative faith in the new program is to guarantee a half-hearted implementation of it. But the problem remains; the overadvocacy trap continues to sabotage program evaluation. Clearly, social-psychological and organization-theoretical skills are needed.

Corrupting Effect of Quantitative Indicators

Evaluation research in the U.S.A. is becoming a recognized tool for social decision-making. Certain social indicators, collected through such social science methods as sample surveys, have already achieved this status; for example, the unemployment and cost-of-living indices of the Bureau of Labor Statistics. As regular parts of the political decision process, it seems useful to consider them as akin to voting in political elections (Gordon & Campbell, 1971; Campbell, 1971). From this enlarged perspective, which is supported by qualitative sociological studies of how public statistics get created, I come to the following pessimistic laws (at least for the U.S. scene): The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor. Let me illustrate these two laws with some evidence which I take seriously, although it is predominantly anecdotal.

Take, for example, a comparison between voting statistics and census data in the city of Chicago: Surrounding the voting process, there are elaborate precautionary devices designed to ensure its honesty; surrounding the census-taking process, there are few, and these could be easily evaded. Yet, in our region, the voting statistics are regarded with suspicion while the census statistics are widely trusted (despite underenumeration of young adult black males). I believe this order of relative trust to be justified. The best explanation for it is that votes have continually been used–have had real implications as far as jobs, money, and power are concerned–and have therefore been under great pressure from efforts to corrupt. On the other hand, until recently our census data were unused for political decision-making. (Even the constitutional requirement that electoral districts be changed to match population distribution after every census was neglected for decades.)

Another example: In the spirit of scientific management, accountability, the PPBS movement, etc., police departments in some jurisdictions have been evaluated by “clearance rates,” i.e., the proportion of crimes solved, and considerable administrative and public pressure is generated when the rate is low. Skolnick (1966) provides illustrations of how this pressure has produced both corruption of the indicator itself and a corruption of the criminal justice administered. Failure to record all citizens’ complaints, or postponing recording them unless solved, are simple evasions which are hard to check, since there is no independent record of the complaints. A more complicated corruption emerges in combination with “plea-bargaining.” Plea-bargaining is a process whereby the prosecutor and court bargain with the prisoner and agree on a crime and a punishment to which the prisoner is willing to plead guilty, thus saving the cost and delays of a trial. While this is only a semilegal custom, it is probably not undesirable in most instances. However, combined with the clearance rate,
Skolnick finds the following miscarriage of justice. A burglar who is caught in the act can end up getting a lighter sentence the more prior unsolved burglaries he is willing to confess to. In the bargaining, he is doing the police a great favor by improving the clearance rate, and in return, they provide reduced punishment. Skolnick believes that in many cases the burglar is confessing to crimes he did not in fact commit. Crime rates are in general very corruptible indicators. For many crimes, changes in rates are a reflection of changes in the activity of the police rather than changes in the number of criminal acts (Gardiner, 1969; Zeisel, 1971). It seems to be well documented that a well-publicized, deliberate effort at social change–Nixon’s crackdown on crime–had as its main effect the corruption of crime-rate indicators (Seidman & Couzens, 1972; Morrissey, 1972; Twigg, 1972), achieved through underrecording and by downgrading the crimes to less serious classifications.

For other types of administrative records, similar use-related distortions are reported (Kitsuse & Cicourel, 1963; Garfinkel, 1967). Blau (1963) provides a variety of examples of how productivity standards set for workers in government offices distort their efforts in ways deleterious to program effectiveness. In an employment office, evaluating staff members by the number of cases handled led to quick, ineffective interviews and placements. Rating the staff by the number of persons placed led to concentration of efforts on the easiest cases, neglecting those most needing the service, in a tactic known as “creaming” (Miller, et al., 1970). Ridgeway’s pessimistic essay on the dysfunctional effects of performance measures (1956) provides still other examples.

From the experimental program in compensatory education comes a very clear-cut illustration of the principle. In the Texarkana “performance contracting” experiment (Stake, 1971), supplementary teaching for undereducated children was provided by “contractors” who came to the schools with special teaching machines and individualized instruction. The corruption pressure was high because the contractors were to be paid on the basis of the achievement test score gains of individual pupils. It turned out that the contractors were teaching the answers to specific test items that were to be used on the final pay-off testing. Although they defended themselves with a logical-positivist, operational-definitionalist argument that their agreed-upon goal was defined as improving scores on that one test, this was generally regarded as scandalous. However, the acceptability of tutoring the students on similar items from other tests is still being debated. From my own point of view, achievement tests may well be valuable indicators of general school achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways. (Similar biases of course surround the use of objective tests in courses or as entrance examinations.) In compensatory education in general there are rumors of other subversions of the measurement process, such as administering pretests in a way designed to make scores as low as possible so that larger gains will be shown on the post-test, or limiting treatment to those scoring lowest on the pretest so that regression to the mean will provide apparent gains. Stake (1971) lists still
other problems. Achievement tests are, in fact, highly corruptible indicators.

That this serious methodological problem may be a universal one is demonstrated by the extensive U.S.S.R. literature (reviewed in Granick, 1954, and Berliner, 1957) on the harmful effects of setting quantitative industrial production goals. Prior to the use of such goals, several indices were useful in summarizing factory productivity–e.g., monetary value of total product, total weight of all products produced, or number of items produced. Each of these, however, created dysfunctional distortions of production when used as the official goal in terms of which factory production was evaluated. If monetary value, then factories would tool up for and produce only one product to avoid the production interruptions of retooling. If weight, then factories would produce only their heaviest item (e.g., the largest nails in a nail factory). If number of items, then only their easiest item to produce (e.g., the smallest nails). All these distortions led to overproduction of unneeded items and underproduction of much needed ones.

To return to the U.S. experience in a final example. During the first period of U.S. involvement in Viet Nam, the estimates of enemy casualties put out by both the South Vietnamese and our own military were both unverifiable and unbelievably large. In the spirit of McNamara and PPBS, an effort was then instituted to substitute a more conservative and verifiable form of reporting, even if it underestimated total enemy casualties. Thus the “body count” was introduced, an enumeration of only those bodies left by the enemy on the battlefield. This became used not only for overall reflection of the tides of war, but also for evaluating the effectiveness of specific battalions and other military units. There was thus created a new military goal, that of having bodies to count, a goal that came to function instead of or in addition to more traditional goals, such as gaining control over territory. Pressure to score well in this regard was passed down from higher officers to field commanders. The realities of guerrilla warfare participation by persons of a wide variety of sexes and ages added a permissive ambiguity to the situation. Thus poor Lt. Calley was merely engaged in getting bodies to count for the weekly effectiveness report when he participated in the tragedy at My Lai. His goals had been corrupted by the worship of a quantitative indicator, leading both to a reduction in the validity of that indicator for its original military purposes, and a corruption of the social processes it was designed to reflect.

I am convinced that this is one of the major problems to be solved if we are to achieve meaningful evaluations of our efforts at planned social change. It is a problem that will get worse, the more common quantitative evaluations of social programs become. We must develop ways of avoiding this problem if we are to move ahead. We should study the social processes through which corruption is being uncovered and try to design social systems that incorporate these features. In the Texarkana performance-contracting study, it was an “outside evaluator” who uncovered the problem. In a later U.S. performance-contracting study, the Seattle Teachers’ Union provided the watchdog role. We must seek out and institutionalize such objectivity-preserving features. We should also study the institutional form of those indicator systems, such as the census or the cost-of-living index in the U.S., which seem relatively immune to distortion. Many
commentators, including myself (1969b), assume that the use of multiple indicators, all recognized as imperfect, will alleviate the problem, although Ridgeway (1956) doubts this.

There are further problems that can be anticipated in the future. A very challenging group centers on the use of public opinion surveys, questionnaires, or attitude measures in program evaluation. Trends in the U.S. are such that before long, it will be required that all participants in such surveys, before they answer, will know the uses to which the survey will be put, and will receive copies of the results. Participants will have the right to use the results for their own political purposes. (Where opinion surveys are used by the U.S. Government, our present freedom of information statutes should be sufficient to establish this right now.) Under these conditions, using opinion surveys to evaluate local government service programs can be expected to produce the following new problems when employed in politically sophisticated communities such as we find in some of our poorest urban neighborhoods: There will be political campaigns to get respondents to reply in the particular ways the local political organizations see as desirable, just as there are campaigns to influence the vote. There will be efforts comparable to ballot-box stuffing. Interviewer bias will become even more of a problem. Bandwagon effects–i.e., conformity influence from the published results of prior surveys–must be anticipated. New biases, like exaggerated complaint, will emerge.

In my judgment, opinion surveys will still be useful if appropriate safeguards can be developed. Most of these are problems that we could be doing research on now in anticipation of future needs. (Gordon & Campbell, 1971, provide a detailed discussion of these problems in a social welfare service program evaluation setting.)

Summary Comment

This has been a condensed overview of some of the problems encountered in the U.S. experience with assessing the impact of planned social change. The sections of the paper dealing with the problems related to political processes have seemed predominantly pessimistic. While there are very serious problems, somehow the overall picture is not as gloomy as this seems. Note that the sections on time-series and on randomized designs contain success stories worthy of emulation. And many of the quasi-experimental evaluations that I have scolded could have been implemented in better ways–had the social science methodological community insisted upon it–within the present political system. There are, however, new methodological problems which emerge when we move experimentation out of the laboratory into social program evaluation. In solving these problems, we may need to make new social-organizational inventions.

References

Anderson, J. K. Evaluation as a process in social systems. Unpublished doctoral dissertation, Northwestern University, Department of Industrial Engineering & Management Sciences, 1973.
Baldus, D. C. Welfare as a loan: An empirical study of the recovery of public assistance payments in the United States. Stanford Law Review, 1973, 25.123-250. (No. 2)
Barnow, B. S. The effects of Head Start and socioeconomic status on
cognitive development of disadvantaged children. Unpublished doctoral dissertation, University of Wisconsin, Department of Economics, 1973. (252 pp.)
Bauman, R. A., David, M. H., & Miller, R. F. Working with complex data files: II. The Wisconsin assets and incomes studies archive. In R. L. Bisco (Ed.), Data bases, computers, and the social sciences. New York: Wiley-Interscience, 1970, 112-136.
Beck, B. Cooking the welfare stew. In R. W. Habenstein (Ed.), Pathways to data: Field methods for studying ongoing social organizations. Chicago: Aldine, 1970.
Becker, H. M., Geer, B., & Hughes, E. C. Making the grade. New York: Wiley, 1968.
Becker, H. W. Sociological work: Method and substance. Chicago: Aldine, 1970.
Berliner, J. S. Factory and manager in the U.S.S.R. Cambridge, Mass.: Harvard University Press, 1957.
Blau, P. M. The dynamics of bureaucracy. (Rev. ed.) Chicago: University of Chicago Press, 1963.
Boruch, R. F., & Campbell, D. T. Preserving confidentiality in evaluative social research: Intrafile and interfile data analysis. Paper presented at the meeting of the American Association for the Advancement of Science, San Francisco, February/March 1974.
Box, G. E. P., & Tiao, G. C. A change in level of non-stationary time series. Biometrika, 1965, 52.181-192.
Box, G. E. P., & Jenkins, G. M. Time-series analysis: Forecasting and control. San Francisco: Holden Day, 1970.
Campbell, D. T. Pattern matching as an essential in distal knowing. In K. R. Hammond (Ed.), The psychology of Egon Brunswik. New York: Holt, Rinehart & Winston, 1966, 81-106.
Campbell, D. T. Administrative experimentation, institutional records, and nonreactive measures. In J. C. Stanley & S. M. Elam (Eds.), Improving experimental design and statistical analysis. Chicago: Rand McNally, 1967, 257-291. Reprinted in W. M. Evan (Ed.), Organizational experiments: Laboratory and field research. New York: Harper & Row, 1971, 169-179.
Campbell, D. T. A phenomenology of the other one: Corrigible, hypothetical and critical. In T. Mischel (Ed.), Human action: Conceptual and empirical issues. New York: Academic Press, 1969a, 41-69.
Campbell, D. T. Reforms as experiments. American Psychologist, 1969b, 24.409-429. (No. 4, April)
Campbell, D. T. Considering the case against experimental evaluations of social innovations. Administrative Science Quarterly, 1970, 15.110-113. (No. 1, March)
Campbell, D. T. Methods for an experimenting society. Paper presented to the Eastern Psychological Association, April 17, 1971, and to the American Psychological Association, September 5, 1971, Shoreham Hotel, Washington, D.C. To appear when revised in the American Psychologist.
Campbell, D. T. Experimentation revisited: A conversation with Donald T. Campbell (by S. Salasin). Evaluation, 1973, 1.7-13. (No. 1)
Campbell, D. T. Qualitative knowing in action research. Journal of Social Issues, probably 1975.
Campbell, D. T., & Boruch, R. F. Making the case for randomized assignment to treatments by considering the
alternatives: Six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects. Lecture delivered at the conference on Central Issues in Social Program Evaluation, C. A. Bennett & A. Lumsdaine, coordinators, Battelle Human Affairs Research Center, Seattle, Washington, July 1973. Publication pending.
Campbell, D. T., Boruch, R. F., Schwartz, R. D., & Steinberg, J. Confidentiality-preserving modes of useful access to files and to interfile exchange for statistical analysis. National Research Council, Committee on Federal Agency Evaluation Research, Final Report, Appendix A, 1975.
Campbell, D. T., & Erlebacher, A. E. How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), Disadvantaged child. Vol. 3, Compensatory education: A national debate. New York: Brunner/Mazel, 1970.
Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56.81-105. (Also Bobbs-Merrill Reprint Series in the Social Sciences, S-354.)
Campbell, D. T., Siegman, C. R., & Rees, M. B. Direction-of-wording effects in the relationships between scales. Psychological Bulletin, 1967, 68.293-303. (No. 5)
Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research. Chicago: Rand McNally, 1966.
Caporaso, J. A., & Roos, L. L., Jr. (Eds.) Quasi-experimental approaches: Testing theory and evaluating policy. Evanston, Ill.: Northwestern University Press, 1973.
Caro, F. G. (Ed.) Readings in evaluation research. New York: Russell Sage Foundation, 1972.
Cicirelli, V. G., et al. The impact of Head Start: An evaluation of the effects of Head Start on children’s cognitive and affective development. A report presented to the Office of Economic Opportunity pursuant to Contract B89-4536, June 1969. Westinghouse Learning Corporation, Ohio University. (Distributed by Clearinghouse for Federal Scientific and Technical Information, U.S. Department of Commerce, National Bureau of Standards, Institute for Applied Technology. PB 184 328.)
Conner, R. F. A methodological analysis of twelve true experimental program evaluations. Unpublished doctoral dissertation, Northwestern University, August 1974.
Cook, T. D., Appleton, H., Conner, R., Shaffer, A., Tamkin, G., & Weber, S. J. Sesame Street revisited: A case study in evaluation research. New York: Russell Sage Foundation, 1975.
Cook, T. D., & Campbell, D. T. The design and conduct of quasi-experiments and true experiments in field settings. In M. D. Dunnette & J. P. Campbell (Eds.), Handbook of industrial and organizational research. Chicago: Rand McNally, 1975.
Cronbach, L. J. Response sets and test validity. Educational and Psychological Measurement, 1946, 6.475-494.
Cronbach, L. J. Further evidence on response sets and test design. Educational and Psychological Measurement, 1950, 10.3-31.
David, H. P. Family planning and abortion in the socialist countries of
Central and Eastern Europe. New York: The Population Council, 1970. Romanian data based primarily on Anuarul Statistic al Republicii Socialiste Romania.
Douglas, J. D. The social meanings of suicide. Princeton, N.J.: Princeton University Press, 1967.
Edwards, A. L. The social desirability variable in personality assessment and research. New York: Dryden, 1957.
Fairweather, G. W. Methods for experimental social innovation. New York: Wiley, 1967.
Fischer, J. L. The uses of Internal Revenue Service data. In M. E. Borus (Ed.), Evaluating the impact of manpower programs. Lexington, Mass.: D. C. Heath, 1972, 177-180.
Gardiner, J. A. Traffic and the police: Variations in law-enforcement policy. Cambridge, Mass.: Harvard University Press, 1969.
Garfinkel, H. “Good” organizational reasons for “bad” clinic records. In H. Garfinkel, Studies in ethnomethodology. Englewood Cliffs, N.J.: Prentice Hall, 1967, 186-207.
Glaser, D. Routinizing evaluation: Getting feedback on effectiveness of crime and delinquency programs. Washington, D.C.: Center for Studies of Crime and Delinquency, NIMH, 1973. (Available from U.S. Government Printing Office, Washington, D.C. 20402; publication number 1724-00319, $1.55.)
Glass, G. V., Willson, V. L., & Gottman, J. M. Design and analysis of time-series experiments. Boulder, Colo.: Laboratory of Educational Research, University of Colorado, 1972.
Goldberger, A. S. Selection bias in evaluating treatment effects: Some formal illustrations. Discussion Papers, 123-72. Madison: Institute for Research on Poverty, University of Wisconsin, 1972.
Gordon, A. C., Campbell, D. T., et al. Recommended accounting procedures for the evaluation of improvements in the delivery of state social services. Duplicated paper, Center for Urban Affairs, Northwestern University, 1971.
Goslin, D. A. Guidelines for the collection, maintenance, and dissemination of pupil records. New York: Russell Sage Foundation, 1970.
Granick, D. Management of the industrial firm in the U.S.S.R. New York: Columbia University Press, 1954.
Guttentag, M. Models and methods in evaluation research. Journal for the Theory of Social Behavior, 1971, 1.75-95. (No. 1, April)
Guttentag, M. Evaluation of social intervention programs. Annals of the New York Academy of Sciences, 1973, 218.3-13.
Guttentag, M. Subjectivity and its use in evaluation research. Evaluation, 1973, 1.60-65. (No. 2)
Hatry, H. P., Winnie, R. E., & Fisk, D. M. Practical program evaluation for state and local government officials. Washington, D.C.: The Urban Institute (2100 M Street NW, Washington, D.C. 20037), 1973.
Heller, R. N. The uses of Social Security Administration data. In M. E. Borus (Ed.), Evaluating the impact of manpower programs. Lexington, Mass.: D. C. Heath, 1972, 197-201.
Ikeda, K., Yinger, J. M., & Laycock, F. Reforms as experiments and experiments as reforms. Paper presented to the meeting of the Ohio Valley Sociological Society, Akron, May 1970.
Jackson, D. N., & Messick, S. Response styles on the MMPI: Comparison of
clinical and normal samples. Journal of Abnormal and Social Psychology, 1962, 65.285-299.
Kepka, E. J. Model representation and the threat of instability in the interrupted time series quasi-experiment. Doctoral dissertation, Northwestern University, Department of Psychology, 1971.
Kershaw, D. N. A negative income tax experiment. Scientific American, 1972, 227.19-25.
Kershaw, D. N. The New Jersey negative income tax experiment: A summary of the design, operations, and results of the first large-scale social science experiment. Dartmouth/OECD Seminar on “Social Research and Public Policies,” Sept. 13-15, 1974 (this volume).
Kershaw, D. N., & Fair, J. Operations, surveys, and administration. Volume IV of the Final Report of the New Jersey Graduated Work Incentive Experiment. Madison, Wisconsin: Institute for Research on Poverty, December 1973. Duplicated, 456 pp. (Mathematica, Inc., Princeton, N.J., as Vol. IV of the Report of the New Jersey Negative Income Tax Experiment, December 1973. To be published by Academic Press.)
Kitsuse, J. K., & Cicourel, A. V. A note on the uses of official statistics. Social Problems, 1963, 11.131-139. (Fall)
Kuhn, T. S. The structure of scientific revolutions. (2nd ed.) Chicago: University of Chicago Press, 1970. (Vol. 2, No. 2, International Encyclopedia of Unified Science.)
Kutchinsky, B. The effect of easy availability of pornography on the incidence of sex crimes: The Danish experience. Journal of Social Issues, 1973, 29.163-181. (No. 3)
Levenson, B., & McDill, M. S. Vocational graduates in auto mechanics: A follow-up study of Negro and white youth. Phylon, 1966, 27.347-357. (No. 4)
Lohr, B. W. An historical view of the research on the factors related to the utilization of health services. Duplicated research report, Bureau for Health Services Research and Evaluation, Social and Economic Analysis Division, Rockville, Md., January 1973, 34 pp. In press, U.S. Government Printing Office.
Lord, F. M. Large-scale covariance analysis when the control variable is fallible. Journal of the American Statistical Association, 1960, 55.307-321.
Lord, F. M. Statistical adjustments when comparing preexisting groups. Psychological Bulletin, 1969, 72.336-337.
McCain, L. J. Data analysis in the interrupted time series quasi-experiment. Master’s thesis, Northwestern University, Department of Computer Sciences, in preparation.
Miller, A. R. The assault on privacy. Ann Arbor, Mich.: University of Michigan Press, 1971.
Miller, S. M., Roby, P., & Steenwijk, A. A. V. Creaming the poor. Trans-action, 1970, 7.38-45. (No. 8, June)
Morgenstern, O. On the accuracy of economic observations. (2nd ed.) Princeton, N.J.: Princeton University Press, 1963.
Morrissey, W. R. Nixon anti-crime plan undermines crime statistics. Justice Magazine, 1972, 1.8-11, 14. (No. 5/6, June/July) (Justice Magazine, 922 National Press Building, Washington, D.C. 20004)
Porter, A. C. The effects of using fallible variables in the analysis of covariance. Unpublished doctoral dissertation,
University of Wisconsin, 1967. (University Microfilms, Ann Arbor, Michigan, 1968)
Ridgeway, V. Dysfunctional consequences of performance measures. Administrative Science Quarterly, 1956, 1.240-247. (No. 2, September)
Riecken, H. W., Boruch, R. F., Campbell, D. T., Caplan, N., Glennan, T. K., Pratt, J., Rees, A., & Williams, W. Social experimentation: A method for planning and evaluating social intervention. New York: Academic Press, 1974.
Rivlin, A. M. Systematic thinking for social action. Washington, D.C.: The Brookings Institution, 1971.
Ross, H. L. Law, science and accidents: The British Road Safety Act of 1967. Journal of Legal Studies, 1973, 2.1-75. (No. 1; American Bar Foundation)
Rossi, P. H. Practice, method, and theory in evaluating social-action programs. In J. L. Sundquist (Ed.), On fighting poverty. New York: Basic Books, 1969, 217-234. (Chapter 10)
Rossi, P. H., & Williams, W. (Eds.) Evaluating social programs: Theory, practice, and politics. New York: Seminar Press, 1972.
Ruebhausen, O. M., & Brim, O. G., Jr. Privacy and behavioral research. Columbia Law Review, 1965, 65.1184-1211.
Sawyer, J., & Schechter, H. Computers, privacy, and the National Data Center: The responsibility of social scientists. American Psychologist, 1968, 23.810-818.
Schwartz, R. D., & Orleans, S. On legal sanctions. University of Chicago Law Review, 1967, 34.274-300.
Seidman, D., & Couzens, M. Crime statistics and the great American anti-crime crusade: Police misreporting of crime and political pressures. Paper presented at the meeting of the American Political Science Association, Washington, D.C., September 1972. (To appear in Law & Society Review.)
Segall, M. H., Campbell, D. T., & Herskovits, M. J. The influence of culture on visual perception. Indianapolis, Ind.: Bobbs-Merrill, 1966.
Shaver, P., & Staines, G. Problems facing Campbell’s “Experimenting Society.” Urban Affairs Quarterly, 1971, 7.173-186. (No. 2, December)
Skolnick, J. H. Justice without trial: Law enforcement in democratic society. New York: Wiley, 1966. (Chapter 8, 164-181)
Smith, M. S., & Bissell, J. S. Report analysis: The impact of Head Start. Harvard Educational Review, 1970, 40.51-104.
Stake, R. E. Testing hazards in performance contracting. Phi Delta Kappan, 1971, 52.583-588. (No. 10, June)
Suchman, E. Evaluation research. New York: Russell Sage Foundation, 1967.
Sween, J. A. The experimental regression design: An inquiry into the feasibility of non-random treatment allocation. Unpublished doctoral dissertation, Northwestern University, 1971.
Thompson, C. W. N. Administrative experiments: The experience of fifty-eight engineers and engineering managers. IEEE Transactions on Engineering Management, 1974, EM-21, 42-50. (No. 2, May)
Thorndike, R. L. Regression fallacies in the matched groups experiment. Psychometrika, 1942, 7.85-102.
Twigg, R. Downgrading of crimes verified in Baltimore. Justice Magazine, 1972, 1, 15, 18. (No. 5/6, June/July) (Justice
Watts, H. W., & Rees, A. (Eds.) Final report of the New Jersey graduated work incentive experiment. Volume I: An overview of the labor supply results and of central labor-supply results (700 pp.); Volume II: Studies relating to the validity and generalizability of the results (250 pp.); Volume III: Response with respect to expenditure, health, and social behavior, and technical notes (300 pp.). Madison, Wisconsin: Institute for Research on Poverty, University of Wisconsin, December 1973. (Duplicated)
Weber, S. J., Cook, T. D., & Campbell, D. T. The effects of school integration on the academic self-concept of public school children. Paper presented at the meeting of the Midwestern Psychological Association, Detroit, 1971.
Weiss, C. H. (Ed.) Evaluating action programs: Readings in social action and education. Boston: Allyn & Bacon, 1972a.
Weiss, C. H. Evaluation research. Englewood Cliffs, N.J.: Prentice-Hall, 1972b.
Weiss, R. S., & Rein, M. The evaluation of broad-aim programs: A cautionary case and a moral. Annals of the American Academy of Political and Social Science, 1969, 385, 133-142.
Weiss, R. S., & Rein, M. The evaluation of broad-aim programs: Experimental design, its difficulties, and an alternative. Administrative Science Quarterly, 1970, 15, 97-109.
Westin, A. F. Privacy and freedom. New York: Atheneum, 1967.
Wheeler, S. (Ed.) Files and dossiers in American life. New York: Russell Sage Foundation, 1969.
Wholey, J. S., Nay, J. N., Scanlon, J. W., & Schmidt, R. E. If you don't care where you get to, then it doesn't matter which way you go. Dartmouth/OECD Seminar on Social Research and Public Policies, Sept. 13-15, 1974 (this volume).
Wholey, J. S., Scanlon, J. W., Duffy, H. G., Fukumoto, J., & Vogt, L. M. Federal evaluation policy. Washington, D.C.: The Urban Institute, 1970.
Wilder, C. S. Physician visits, volume and interval since last visit, U.S. 1969. Rockville, Md.: National Center for Health Statistics, Series 10, No. 75, July 1972. (DHEW Pub. No. [HSM] 72-1064)
Williams, W., & Evans, J. W. The politics of evaluation: The case of Head Start. The Annals, 1969, 385, 118-132. (September)
Wortman, C. B., Hendricks, M., & Hillis, J. Some reactions to the randomization process. Duplicated research report, Northwestern University, Department of Psychology, 1974.
Zeisel, H. And then there were none: The diminution of the Federal jury. University of Chicago Law Review, 1971, 38, 710-724.
Zeisel, H. The future of law enforcement statistics: A summary view. In Federal statistics: Report of the President's Commission. Vol. II, 1971, 527-555. Washington, D.C.: U.S. Government Printing Office. (Stock no. 4000-0269)