Professional Documents
Culture Documents
Assessing The Impact of Plan de Social Change. Donald Campbell PDF
Assessing The Impact of Plan de Social Change. Donald Campbell PDF
Assessing The Impact of Plan de Social Change. Donald Campbell PDF
www.jmde.com/ Occasion
nal Paper S
Series
Asseessing the Im
mpact of
o Plan
nned S
Social C
Chang
ge*
Originallly publisheed as Paper #8, Occasiional Paperr Series, Deccember, 19776
* Reprin
nted with peermission of
o The Public Affairs C
Center, Darttmouth Colllege1.
Donald
d T. Camp
pbell
metho
h of this meethodology can be useefully
shareed–that so
odology is o
ocial project evaluaation
one of the ffields of sciience
improvee our sociial systemss. It is ou ur that has enouggh universsality to m make
universaal predicam ment that our projeccts scienttific sharingg mutually beneficial. As a
do not always
a havee their intended effectts. part oof this sharring, this paper reportts on
Very probably
p we
w all sh hare in th he progrram impactt assessmen nt methodo ology
experiennce that ofteno we cannot teell as it is develop ping in thee United S States
whetherr the projecct had any impact
i at all,
a todayy.
so com mplex is the t flux of o historiccal Thhe most co ommon nam me in the U.S.
changes that wou uld have been goin ng for th
his developing specialty is evaluaation
anyway, and so many aree the other researrch, whicch now aalmost alw ways
projects that mightt be expected to modiify impli es program m evaluation (even tho ough
the samee indicatorss. the term “evvaluation has a w well-
It seems
s inevvitable tha at in mo ost establlished ussage in assessing the
countriees this com mmon set of o problemms, adequ uacy of perrsons in th he executio on of
combineed with the obvious relevance of speciffic social roles). Alreaady there aare a
social science research proccedures, will w numb ber of anth hologies annd textbook ks in
have generated
g a methodology an nd this aarea. (Such hman, 196 67; Caro, 11971;
methodo ological speecialists foccused on thhe Weisss, 1972a, 19 972b; Rivlinn, 1971; Rossi &
problemm of asseessing the impact of Williaams, 1972; Glaser, 19773; Fairweaather,
planned d social channge. It is an
n assumptioon 1967; Wholey, et al., 1970;Caporasso &
of this paper
p that, in
i spite of differences
d in Roos,, 1973; Rieecken, Borruch, Camp pbell,
forms ofo government and ap pproaches to Caplaan, Gennan n, Pratt, Reees, & Williiams,
social planning and prob blem-solvinng, 1974. ) There is a journ nal, Evaluaation,
1Preparattion of this pa
aper has been
n supported in
n part by gran
nts from the RRussell Sage F
Foundation an
nd the
National Science
S Founndation, Gran
nt Number SO OC-7103704-0 03. A versionn of this paper was presentted to
the Visegrrad, Hungary, Conference on Social Psyychology, Mayy 5-10, 1974.
1970) who best provide the needed activities with only long-range payoff. For
skepticism of such measures as suicide potential recruits, or for those already
rates and crime rates. But even here, it is engaged in it, a full account of our
psychologists who have had the depth of difficulties, including the problem of
experience sufficient to distinguish getting our skills used in ways we can
degrees of validity lying between total condone, is bound to be discouraging. We
worthlessness and utter perfection, and cannot yet promise a set of professional
who have been willing to use, albeit skills guaranteed to make an important
critically, measures they knew were difference. In the few success stories of
partially biased and errorful. beneficial programs unequivocally
Third, many of the methodological evaluated, society has gotten by, or could
problems of social implementation and have gotten by, without our help. We still
impact measurement have to do with the lack instances of important contributions
social psychology of interaction between to societal innovation which were abetted
citizens and projects, or between citizens by our methodological skills. The need for
and modes of experimental our specialty, and the specific
implementation (randomization, control recommendations we make, must still be
groups), or between citizens and the justified by promise rather than by past
special measurement procedures performance. They are a priori in that
introduced as a part of the evaluation. they represent extrapolations into a new
These are special problems of attitude context not yet cross-validated in that
formation and of the effects of attitudes context. I myself believe that the
on responses, and are clearly within the importance of the problem of social
domain of our intended competence. system reality-testing is so great that our
Having said something about U.S. efforts and professional commitment are
evaluation research in its professional fully justified by promise. I believe that
aspects, I would like to spend the rest of the problems of equivocality of evidence
the time telling about the problems we for program effectiveness are so akin to
have encountered so far and the solutions the general problems of scientific
we have proposed. It is with regret that I inference that our extrapolations into
note that we have progressed very far recommendations about program
from my earlier review (Campbell, 1969b); evaluation procedures can be, with proper
however, I will attempt to provide new mutual criticism, well-grounded.
illustrations. Nonetheless, motivated in part by the
The focus of what follows is so much reflexive consideration that promising too
on troubles and problems that I feel the much turns out to be a major obstacle to
necessity of warning and apologizing. If meaningful program evaluation, I aim,
we set out to be methodologists, we set however ambivalently, to present an
out to be experts in problems and, honestly pessimistic picture of the
hopefully, inventors of solutions. The problem.
need for such a specialty would not exist A second problem with the problem
except for the problems. From this point focus comes from the fact that inevitably
of view, no apology is needed. But I would many of the methodological difficulties
also like to be engaged in recruiting are generated from the interaction of
members to a new profession and in aspects of the political situation
inspiring them to invest great effort in surrounding programs and their
evaluation. Thus the U.S. experience in human consciousness and more humane
evaluation research, combined with the forms of social life are questions I have
focus on problems, may make my not adequately faced, to say nothing of
presentation seem inappropriately and resolved.
tactlessly critical of the U.S. system of In what follows, I have grouped our
government and our current political problems under three general headings,
climate. Ideally, this would be balanced but have made little effort to keep the
out by the sharing experiences from many discussion segregated along these lines.
nations. In the absence of this, I can only First comes issues that are internal to our
ask you not to be misled by this by- scientific community and would be
product of the otherwise sensible present even if scientist program
approach of focusing on problems. evaluators were running society solely for
It is in the area of methodological the purpose of unambiguous program
problems generated by political evaluation. These are: 1) Metascientific
considerations that the assumptions of issues; and 2) Statistical issues. The
universality for the methodological remaining heading involves interaction
principles will fail as we compare with the societal context. Under 3)
experiences from widely differing social, Political system problems, I deal with
economic, and political systems. You issues that specifically involve political
listeners will have to judge the extent, if processes and governmental institutions,
any, to which the same politico- some of which are perhaps common to all
methodological problems would emerge large, bureaucratic nations, other unique
in your program evaluation settings. Most to the U.S. setting.
international conferences of scientists can
avoid the political issues which divide Metascientific Issues
nations by concentrating on the scientific
issues which unite them as scientists. On
Quantitative vs. Qualitative
the topic of assessing the impact of
planned social change we do not have that Methodology
luxury. Even so, I have hopes for a
technology that would be useful to any A controversy between “qualitative”
political system. I believe that much of the versus “quantitative” modes of knowing,
methodology of program evaluation will between geisteswissenschaftlich and
be independent of the content of the naturewissenchaftlich approaches,
program, and politically neutral in this between “humanitistic” and “scientistic”
sense. This stance is augmented by approaches is characteristic of most of the
emphasizing the social scientist’s role in social sciences in the U.S.A. today. In
helping society keep track of the effects of fields such as sociology and social
changes that its political process has psychology, many of our ablest and most
initiated and by playing down the role of dedicated graduate students are
the social scientist in the design of increasingly opting for the qualitative,
program innovations. Whether this humanitistic mode. In political science,
independence of ideology is possible, and there has been a continuous division
even if it is, how it is to be integrated with along these lines. Only economics and
our social scientist’s duty to participate in geography seem relatively immune.
the development of more authentic
If yo
ou ask thee normal resident of a omittting ordinaary languag ge, it would d fail
“carpenttered” cultu ure (Segall,, et al., 1966) to co ommunicatee to anoth her scientisst in
which lin ne is longerr, a or b, hee will reply b. such a way as tto enable h him to repllicate
If you supply
s him
m with a ru uler, or alloow the experimeent and verify the
him to use the ed dge of anotther piece of obserrvations. In nstead, thee few scien ntific
paper as a a mak keshift ruller, he will w termss have been n imbedded d in a disco ourse
eventuallly convincce himselff that he is of elliptical prescienttific ordiinary
wrong, and that line l a is loonger. In so s languuage which the reader is presumeed to
decidingg he willl have rejected as (and p presumes tto) understand. And in n the
inaccuraate one product of visu ual laboraatory worrk of the original and
perceptiion by trustting a larger set of other repliccating labo oratory, a ccommon-seense,
visual peerceptions. He will alsso have mad de presccientific lannguage and d perceptio on of
many presumptio
p ns, inexpllicit for th he objectts, solids, and light was emplloyed
most pa art, includinng the assu umption th hat and ttrusted in ccoming to tthe conclussions
the len ngths of lines hav ve remaineed that thus revise tthe ordiinary
relativelly consttant du
uring th
he underrstanding. To challen nge and co orrect
measureement proccess, that th he ruler was the co ommon-sen nse undersstanding in n one
rigid ratther than ellastic, that the heat an nd detaill, common n-sense un nderstandin ng in
moisturee of his han nd have nott changed th he generral had to be trusted.
ruler’s length in su uch a coincidental wa ay Reelated to th his is the epistemolo ogical
as to prooduct the different
d meeasurementts, emph hasis onn qualitaative patttern
expanding it when approachin ng line a annd identiification ass prior to aan identificaation
contractting it wheen approacching line b, of quaantifiable aatomic partticles, in revverse
etc. of thee logical aatomist’s in ntuition, stiill to
Let us take as a anotherr example a widesspread (Caampbell, 19 966). Such h an
scientific paper containing
c theory an nd episteemology iss fallibilistt, rather than
experimmental resu ults demon nstrating th he clairvvoyant, emp phasizing th he presump ptive
particulaate nature of light, in dramattic error--pronenesss of ssuch patttern
contractt to commo on-sense un nderstandin ng. identiification, raather than perception n as a
Or a sciientific pap per demonstrating th hat depen ndable gro ound of ceertainty. Bu ut it
what orrdinary perrception deeems “solid ds” also recognizess this falllible, intuiitive,
are in fact open lattices. Were W such a presuumptive, orrdinary peerception to o be
paper to t limit itself to mathematic
m cal the o only routee. This is not to m make
symbolss and pu urely scien ntific term ms, perceeptions u uncriticizablle (Camp pbell,
1969a), but they are, as we have seen, only procedures such as questionnaires and
criticizable by trusting many other rating scales will often be introduced at
perceptions of the same epistemic level. this stage for reasons of convenience in
If we apply such an epistemology to collecting and summarizing, non-
evaluation research, it immediately quantitative methods of collection and
legitimizes the “narrative history” portion compiling should also be considered, such
of most reports and suggests that this as hierarchically organized discussion
activity be given formal recognition in the groups. Where such evaluations are
planning and execution of the study, contrary to the quantitative results, the
rather than only receiving attention as an quantitative results should be regarded as
afterthought. Evaluation studies are suspect until the reasons for the
uninterpretable without this, and most discrepancy are well understood. Neither
would be better interpreted with more. is infallible, of course. But for many of us,
That this content is subjective and guilty what needs to be emphasized is that the
of perspectival biases should lead us to quantitative results may be as mistaken as
better select those who are invited to the qualitative. After all, in physical
record the events, and to prepare formal science laboratories, the meters often
procedures whereby all interested work improperly, and it is usually
participants can offer additions and qualitative knowing, plus assumptions
corrections to the official story. The use of about what the meter ought to be
professionally trained historians, showing, this discovers the malfunction.
anthropologists, and qualitative (This is a far cry from the myth that meter
sociologists should be considered. The readings operationally define theoretical
narrative history is indispensable parameters.)
sociologists should be considered. The It is with regret that I report that in
narrative history is an indispensable part U.S. program evaluations, this sensible
of the final report, and the best qualitative joint use of modes of knowing is not yet
methods should be used in preparing it. practiced. Instead, there seems to be an all
We should also recognize that or none flip-flop. Where, as in Model
participants and observers have been Cities evaluation, anthropologists have
evaluating program innovations for been used as observers, this has often
centuries without benefit of quantification been in place of, rather than in addition
or scientific method. This is the common- to, quantitative indicators, pretests,
sense knowing which our scientific posttests, and control-group comparisons.
evidence should build upon and go A current example of the use of
beyond, not replace. But it is usually anthropologists in the “Experimental
neglected in quantitative evaluations, Schools” program started in the U.S.
unless a few supporting anecdotes Office of Education and now in the
haphazardly collected are included. Under national Institute of Education. In this
the epistemology I advocate, one should program, school-system initiative is
attempt to systematically tap all the encouraged, and winning programs
qualitative common-sense program receive substantial increments to their
critiques and evaluations that have been budgets (say 25%) for use in
generated among the program staff, implementing the innovations. To
program clients and their families, and evaluate some of these programs, very
community observers. While quantitative expensive contracts have been let for
applicable to their area. Jumping ahead At present, the preference is for single,
speculatively, I come to the following coordinated, national evaluations, even
tentative stance. where the program innovation is
Ameliorative program implementation implemented in many separate, discrete
and evaluation in the U.S.A. today need sites. If one were to imitate science’s
more zeal, dedication, and morale. These approach to objectivity, it would instead
would be increased by adopting the seem optimal to split up the big
scientist’s model of experimenter- experiments and evaluations into two or
evaluator. If the conditions for cross- more contracts with the same mission so
validating replication could be that some degree of simultaneous
established, and if budgetary jeopardy replication would be achieved. Our major
from negative evaluations could be evaluations of compensatory education
removed (for example, by allowing programs (e.g., Head Start, Follow
program implementers to shift to Through) offer instances which were of
alternative programs in pursuit of the such magnitude that costs would not have
same goal), then the policy separation of been appreciably increased by this
implementation and evaluation should be process. We could often, if we so planned,
abandoned. build in some of the competitive
The issue does not have to be one or replication that keeps science objective.
the other. External evaluations can be One positive feature of the U.S.
combined with in-house evaluations. evaluation research scene in this regard is
Certainly, even under present budgeting the widespread advocacy and occasional
systems, program implementers should be practice of reanalysis by others of
funded to do their own evaluations and to program evaluation data. The Russell
argue their validity in competition with Sage Foundation has funded a series of
external evaluations. The organizational these, including one on the “Sesame
arrangement separating evaluation from Street” preschool television programs
implementation is borrowed from the (Cook, et al., 1975). The original
model of external auditors, and it should governmental evaluation of the Head
be remembered that in accounting, Start compensatory preschool program
auditors check on the internal records, (Circirelli, 1969) has been reanalyzed by
rather than creating new data. Perhaps Smith and Bissell (1970) and Barnow
some such evaluation methodologist’s (1973), and others are in progress.
audit of internal evaluation records would Similarly, for several other classic bodies
be enough of an external evaluation. of evaluation data, although this is still a
rare activity and many sets of data are not
Maximizing Replication and made available.
Criticism One needed change in research
customs or ethics is toward the
Continuing the same metascience theme encouragement of “minority reports” from
as in the previous section: a number of the research staff. The ethic that the data
other recommendations about policy should be available for critical reanalysis
research emerge, some of which run should be explicitly extended to include
counter to current U.S. orthodoxy and the staff members who did the data
practice. collection and analysis and who very
frequently have the detailed insight to see
Figure 4 shows the impact in Romania later? A shift to annual data obviates these
of the October 1966 population policy two problems, but usually there are too
change which greatly restricted the use of few years or too many other changes to
abortion, reduced the availability of permit the use of tests of significance.
contraceptives, and provided several new Figure 5 shows annual data and also
incentives for large families. (David & provides an opportunity to look for the
Wright, 1971, David, 1970.) The combined effect of legalizing abortion in 1957. This
effect is clear and convincing, with the occurred at a time when the rate of use of
change in the abortion law probably the all means of birth control, including
main factor, particularly for the July- abortion, was increasing, and there is no
September 1967 peak. Presumably the graphic evidence that the 1957 law
subsequent decline represents a shift to accelerated that trend. In other data not
other means of birth control. While clear presented here, there is illustrated a
visually, the data offer problems for the chronic methodological problem with this
application of tests of significance. The design: Social systems react to abrupt
strong seasonal trend rules out the changes by using all of the discretionary
application of the best statistical models decision points to minimize that change.
(Glass, Willson, & Gottman, 1972), and Thus the abrupt onset of the October 1966
there are not enough data points plotted decrees also produced an immediate
here upon which to base a good seasonal increase in stillbirths, many of which were
adjustment. The point of onset for no doubt substitutes for the newly
computing purposes is also ambiguous. Is outlawed abortions. Such compensation
it October 1, 1966, or six months later as was, however, too minimal to prevent an
per the prior rule permitting abortions in immediate increase in births.
the first three months? Or nine months
Figure 6 shows the effect of the British success, but interest soon faded, and
Breathalyser crackdown of 1967, today most British social scientists are
illustrated here more dramatically than it unaware of the program’s effectiveness. In
has yet appeared in any British figure 6 the data have been adjusted to
publication. The British Ministry of attempt to correct for seasonal trend,
Transport dutifully reported strong results uneven number of days per month,
during the year following. Their mode of uneven number of weekends per month,
report was in terms of the percentage of and, for the month of October 1969, the
decline in a given month compared with fact that the crackdown did not begin until
the same month one year earlier. This is October 9. All of these adjustments have
better than total neglect of seasonal problems and alternative solutions. In this
effects, but it is an inefficient method particular case, the effects are so strong
because unusual “effects” are often due to that any approach would have shown
much to the eccentricity of the prior them, but in many instances this will not
period as to that of the current one. It is be so. The data on commuting hours
also precludes presentation of the over-all serves as a control for weekend nights.
picture. The newspapers duly noted the
Figure 7 shows data from Baldus comparable states that did not change
(1973) on the substantial effects of a law their laws to use as comparisons. One
that Baldus believes to be evil just because such instance is shown in Figure 7.
it is effective. This law requires that, when Figure 8 is a weak example of a time-
a recipient of old age assistance (charity to series study because it has so few time
the poor from the government) dies and periods. It has a compensatory strength
leaves money or property, the government because the several comparison groups
must be repaid. In our capitalist ideology, are themselves constituted on a
shared even by the poor, many old people quantitative dimension. I include it
will starve themselves just to be able to primarily because it seems to indicate that
leave their homes to their children. Baldus the U.S. Medicaid legislation of 1964 has
has examined the effects of such laws in had a most dramatic effect on the access
some 40 cases where states have initiated to medical attention of the poorest group
them and in some 40 other cases where of U.S. citizens.
states have discontinued them. In each
case, he has sought out nearby,
out and the records to be kept. For which this is done are, in my opinion,
institutional settings, it would be valuable almost always wrong. What has happened
to receive from all participants “Annual is that a set of statistical tools developed
Reports for Program Evaluation” (Gordon for and appropriate to prediction are
& Campbell, 1971). In educational settings applied to causal inference purposes for
teachers, students, and parents could file which they are inappropriate. Regression
such a report. Note that at present the analysis, multivariate statistics,
school system records how the pupil is covariance analysis are some of the names
doing but never records the pupil’s report of the statistical tools I have in mind.
on how the school is doing. Teachers are Whether from educational statistics or
annually rated for efficiency but never get economics, the choice of methods seems
a chance to systematically rate the policies to be the same. The economists have a
they are asked to implement. Some first phrase for the problem “error in
steps in this direction are being explored. variables” or, more specifically, “error in
(Weber, Cook, & Campbell, 1971; independent variables.” But while
Anderson, 1973). In the U.S. social welfare theoretically aware of the problem, they
system, both social worker and welfare are so used to regarding their indicators
recipient would be offered the opportunity as essentially lacking in error that they
to file reports (Gordon & Campbell, 1971). neglect the problem in practice. What they
All ratings would be restricted to the forget is that irrelevant systematic
evaluation of programs and policies, not components of variance create the same
persons, for reasons to be discussed problem as does random error, leading to
below. the same bias of underadjustment. Note
that the presence of error and unique
Regression Adjustments as variance has a systematic effect, i.e.
Substitutes for Randomization operates as a source of bias rather than as
a source of instability in estimates. This
The commonest evaluation design in U.S. fact, too, the economists and other
practice consists in administering a novel neglect. Thus efforts to correct for
program to a single intact institution or pretreatment differences by “regression
administrative unit, with measures before adjustments” on the means or by
and after. While this leaves much to be “partialing out” pretest differences or by
desired in the way of controls, it is still covariance adjustments all lead to
informative enough to be worth doing. underadjustment unless the pretest (or
Almost as frequently, this design is other covariate) is a perfect measure of
augmented by the addition of comparison what pretest and posttest have in
group which is also measured before and common. The older technique of using
after. This is typically another intact social only cases matched on pretest scores is
unit which does not receive the new well known to produce “regression
program and is judged comparable in artifacts” (Thorndike, 1942; Campbell &
other respects. It usually turns out that Stanley, 1966). Covariance turns out to
these two groups differ even before the produce the same bias, the same degree of
treatment, and a natural tendency is to try underadjustment only with greater
to adjust away the difference. In statistical precision (Lord, 1960, 1969; Porter, 1967;
practice in the U.S. today, the means by Campbell & Erlebacher, 1970), and also
for multiple regression and partial
correlation (e.g., Cook & Campbell, 1975). from better-educated parents watch it
Essentially the same problem emerges in more.) (Cook, et al., 1975.)
ex post facto studies where, although For compensatory programs usually,
there is no pretest, other covariates are although not always, the control group
available for adjustment. A common start out superior to the treatment group,
version of the problem occurs where some or are selected from a larger population
persons have received a treatment and whose average is superior. In this setting,
there is a larger population of untreated the biases of underadjustment, the
individuals from which “controls” are regression artifacts, are in the direction of
sought and a comparison group underestimating program effectiveness
assembled. and of making our program seem harmful
In U.S. experience it has become when they are merely worthless. This
important to distinguish two types of quasi-experimental research setting has
setting in which this type of quasi- occurred for our major evaluations of
experimental design and these types of compensatory education programs going
adjustments are used, since the social under the names of Head Start, Follow
implications of the underadjustment are Through, Performance Contracting, Job
opposite. On the one hand, there are those Corps (for unemployed young men), and
special opportunity programs like many others. In the major Head Start
university education which are given to evaluation (Cicirelli, 1969; Campbell &
those who need them least, or as more Erlebacher, 1970), this almost certainly
usually stated, who deserve them most or accounts for the significantly harmful
who are more likely to be able to profit effects shown in the short three month,
from them. Let us call these “distributive” ten-hours-a-week program. I am
programs in contrast with the persuaded that the overwhelming
“compensatory” programs, or those prevalence of this quasi-experimental
special opportunities given to those who setting and adjustment procedures is one
need them most. of the major sources of the pessimistic
For the regressive programs, the record for such programs of
treatment group will usually be superior compensatory education efforts. The very
to the control group or the population few studies in compensatory education
from which the quasi-experimental which have avoided this problem by
controls are chosen. In this setting the random assignment of children
inevitable underadjustment due to unique experimental and control conditions have
variance and error in the pretest and/or shown much more optimistic results.
other covariates (the “regression In the compensatory education
artifacts”) works to make the treatment situation, there are several other problems
seem effective if it is actually worthless which also work to make the program look
and to exaggerate its effectiveness in any harmful in quasi-experimental studies.
case. For most of us, this seems a benign These include tests that are too difficult,
error, confirming our belief in treatments differential growth rates combined with
we know in our hearts are good. (It may age-based, grade-equivalent, absolute, or
come as a surprise, but the U.S. Sesame raw scores, and the fact that test reliability
Street preschool educational television is higher for the post-test than for the
program is “distributive,” in that children pretest, and higher for the control group
than for the experimental (Campbell,
1973). These require major revisions of income support payments bringing their
our test score practice. When various income up to some level between $3,000
scoring models are applied to a single and $4,000 per year for a family of four,
population on a single occasion, all under one of eight plans which differ as to
scoring procedures correlate so highly support level and incentive for increasing
that one might as well use the simplest. own earnings. Another 600 families
But when two groups that differ initially received no income support but
are measured at two different times in a cooperated with the quarterly interviews.
period of rapid growth, our standard test The experiment lasted for three years, and
score practices have the effect of making preliminary final results are now
the gap appear to increase, if, as is usual, available. This study when completed will
test reliability is increasing. The use of a have cost some $8,000,000 of which
correction for guessing becomes $3,000,000 represented to participants
important. The common model that and necessary administrative costs, and
assumes “true score” and “error” are $5,000,000 research costs, the costs of
independent needs to be abandoned, program evaluation. Before this study was
substituting one that sees error and true half completed, three other negative
score negatively correlated across persons income tax experiments were started,
(the larger the error component, the some much bigger (rural North Carolina
smaller the true score component). and Iowa, Gary, and Seattle and Denver).
The total U.S. investment for these
Problems with Randomized experiments now totals $65,000,000. It is
Experiments to me amazing, and inspiring, that our
nation achieved, for a while at least, this
The focal example of a good social great willingness to deliberately
experiment in the U.S. today is the New “experiment” with policy alternatives
Jersey Negative Income Tax Experiment. using the best of scientific methods.
(Watts & Rees, 1973; The Journal of This requires a brief historical note.
Human Resources, Vol. 9, No. 2, Spring The key period is 1964-68. L.B. Johnson
1974; Kershaw’s paper in this volume; was President and proclaimed a “Great
Kershaw, 1972; Kershaw & Fair, 1973). Society” program and a “War on Poverty.”
This is an experiment dealing with a In Washington, D.C. administrative
guaranteed annual income as an circles, the spirit of scientific management
alternative to present U.S. welfare (the PPBS I’ve already criticized and will
systems. It gets its name from the notion again) had already created a sophisticated
that when incomes fall below a given level, interest in hard-headed program
the tax should become negative, that is, evaluation. Congress was already writing
the government should pay the citizen into its legislation for new programs the
rather than the citizen paying a tax to the requirement that 1% (or some other
government. It also proposes substituting proportion) of the program budget be
income-tax like procedures for citizen devoted to evaluation of effectiveness. In a
reports of income in place of the present new agency, the Office of Economic
social worker supervision. In this Opportunity, a particularly creative and
experiment some 600 families with a dedicated group of young economist was
working male head of household received recruited, and these scientist-evaluators
were given an especially strong role in
guiding over-all agency policy. This OEO often try to shock audiences with the
initiated the first two of the Negative comment that the New Jersey experiment
Income Tax experiments. (Two others is the greatest example of applied social
were initiated from the Department of science since the Russian Revolution.
Health, Education and Welfare.) Under The major finding of the NJNITE is
the first Nixon administration, 1968-72, that income guarantees do not reduce the
OEO programs were continued, although effective work effort of employed poor
on a reduced scale. Under the second people. This finding, if believed, removes
Nixon administration, OEO itself was the principal argument against such a
dismantled, although several programs program–-for on a purely cost basis, it
were transferred to different agencies. I would be cheaper than the present welfare
believe that all four of the Negative system unless it tempted many persons
Income Tax Experiments are still being now employed to cease working. The
carried out much as planned. Initiation of major methodological criticisms of the
new programs has not entirely ceased, but study are focused on the credibility of the
it is greatly reduced. My general advice to belief that this “laboratory” finding would
my fellow U.S. social scientists about the continue to hold up if the support
attitude they should take toward this program became standard, permanent
historical period is as follow: Let us use U.S. policy. These are questions of
this experience so as to be ready when this “external validity” (Campbell & Stanley,
political will returns again. In spite of the 1966) or of “construct validity,” as Cook
good example provided by the New Jersey (Cook & Campbell, 1975) applies the term
Negative Income Tax Experiment, over all developed initially for measurement
we were not ready last time. Competent theory. Two specific criticisms are
evaluation researchers were not available prominent: One, there was a “Hawthorne
when Model Cities Programs, Job Corps Effect” or a “Guinea-pig Effect.” The
Programs, etc. went to their local experimental families knew they were the
universities for help. Perhaps 90% of the exceptional participant in an artificial
funds designated for program evaluation arrangement and that the spotlight of
were wasted; at any rate, 90% of the public attention was upon them.
programs came out with no interpretable Therefore, they behaved in a “good”
evidence of their effectiveness. The industrious, respectable way, producing
available evaluation experts grossly the results obtained. Such motivation
overestimated the usefulness of statistical would be lacking once the program was
adjustments as substitutes for good universal. Two features in the NJNITE
experimental design, including especially implementation can be supposed to
randomized assignment to treatment. accentuate this. There was publicity about
In this spirit I would like to use the the experiment at its start, including
experience of the New Jersey Negative television interviews with selected
Income Tax Experiment to elucidate the experimental subjects; and the random
methodological problems remaining to be selection was by families rather than
solved in the best of social experiments. In neighborhoods, so each experimental
this spirit, my comments are apt to sound family was surrounded by equally poor
predominately critical. My over-all neighbors, who were not getting this
attitude, however, is one of highest beneficence. The second common
approval. Indeed, in lectures in the U.S. I criticism, particularly among economists,
might be called the time-limit effect. method that need detailed attention from
Participants were offered the support for creative statisticians and social
exactly three years. It was made clear that psychologists. These will only be
the experiment would terminate at that mentioned here, but are being treated
time. This being the case, prudent more extensively elsewhere (Riecken, et
participants would hang on to their jobs al., 1974). The issue of the unit of
unless they could get better ones, so that randomization has already been raised in
they would be ready for a return to their passing. Often there is a choice of
normal financial predicament. randomizing larger social units than
It should be recognized that these two persons or families–residential blocks,
problems are in no way specific to census tracts, classrooms, schools, etc. are
randomized experiments, and would also often usable. For reasons of statistical
have characterized the most casual of pilot efficiency, the smaller, more numerous
programs. They can only be avoided by units are to be preferred, maximizing the
the evaluation of the adoption of NIT as a degrees of freedom and the efficacy of
national policy. Such an evaluation will randomization. But the use of larger units
have to be quasi-experimental, as by time- often increases construct validity.
series, perhaps using several Canadian Problems of losses due to refusals and
cities as comparisons. Such evaluations later attrition interact with the choice of
would be stronger on external, construct the stage of respondent recruitment at
validity, but weaker on internal validity. It which to randomize. NJNITE used census
is characteristic of our national attitudes, statistics on poverty areas and sample
however, that this quasi-experimental survey approaches to locate eligible
evaluation is not apt to be done well, if at participants. Rethinking their problem
all–-once we’ve chosen a policy we lose shows the usefulness of distinguishing
interest in evaluating it. Had the NJNITE two types of assent required, for
shown a reduction in work effort, national measurement and for treatment; thus two
adoption of the policy would have been separate stages for refusals occur. There
very unlikely. For this reason alone, it was emerge three crucial alternative points at
well worth doing and doing well. which randomization could be done:
The details of the experiment draw
attention to a number of problems of
experimental group. These differences are individual (Schwartz & Orleans, 1967;
large enough to create pseudo effects in Campbell, Boruch, Schwartz & Steinberg,
post-test values. The availability of pretest 1975; Boruch & Campbell, 1974). Let us
scores provides some information on the call this “mutually insulated interfile
probable direction of bias, but covariance exchange.” It requires that the
on these values underadjusts and is thus administrative file have the capacity for
not an adequate correction. Methods for internal statistical analysis of its own
bracketing maximum and minimum records. Without going into detail, I would
biases under various specified assumption nonetheless like to communicate the nub
need to be developed. Where there is an of the idea. Figure 9 shows a hypothetical
encompassing periodic measurement experiment with one experimental group
framework that still retains persons who and one control group. In this case there
have ceased cooperating with the are enough cases to allow a further
experiment, other alternatives are present breakdown by socio-economic level. From
that need developing. these data some 26 lists are prepared
Such measurement frameworks ranging from 8 to 14 persons in length.
appropriate to NJNITE would include the These lists are assigned designations at
Social Security Administration’s records random (in this case A to Z) so that list
on earnings subject to social security designation communicates no
withholding and claims on unemployment information to the administrative file. The
insurance, the Internal Revenue Service list provides the name of the person, his
records on withholding taxes, Social Security number, and perhaps his
hospitalization insurance records, etc. birth date and birth place. These lists are
These records are occasionally usable in then turned over to the administrative file
evaluation research (e.g., Levenson & which deletes one person at random from
McDill, 1966; Bauman, David & Miller, each list, retrieves the desired data from
1979; Fischer, 1972, Heller, 1972) but the the files on each of the others for whom it
facilities for doing so are not adequately is available, and computes mean,
developed, and the concept of such usage variance, and number of cases with data
may seem to run counter to the current available for each list for each variable.
U.S. emphasis on preserving the These values for each list designation are
confidentiality of administrative records then returned to the evaluation
(e.g., Reubhausen & Brim, 1965; Sawyer & researchers who reassemble them into
Schechter, 1968; Goslin, 1970; Miller, statistically meaningful composites and
1971; Westin, 1967; Wheeler, 1969). then compute experimental and control
Because research access to administrative group means and variances, correlations,
files would make possible so much interactions with socio-economic level,
valuable low cost follow-up on program etc. Thus neither the research file nor the
innovations, this issue is worthy of administrative file has learned individual
discussion somewhere in this paper. For data from the other file, yet the statistical
convenience, if not organizational estimates of program effectiveness can be
consistency, I will insert the discussion made. In the U.S. it would be a great
here. achievement were this facility for program
There is a way to statistically relate evaluation to become feasible for regular
research data and administrative records use.
without revealing confidential data on
other problems. Achievement tests are, in specific battalions and other military
fact, highly corruptible indicators. units. There was thus created a new
That this serious methodological military goal, that of having bodies to
problem may be a universal one is count, a goal that came to function instead
demonstrated by the extensive U.S.S.R. of or in addition to more traditional goals,
literature (reviewed in Granick, 1954; and such as gaining control over territory.
Berliner, 1957) on the harmful effects of Pressure to score well in this regard was
setting quantitative industrial production passed down from higher officers to field
goals. Prior to the use of such goals, commanders. The realities of guerrilla
several indices were useful in warfare participation by persons of a wide
summarizing factory productivity–e.g., variety of sexes and ages added a
monetary value of total product, total permissive ambiguity to the situation.
weight of all products produced, or Thus poor Lt. Calley was merely engaged
number of items produced. Each of these, in getting bodies to count for the weekly
however, created dysfunctional effectiveness report when he participated
distortions of production when used as in the tragedy at My Lai. His goals had
the official goal in terms of which factory been corrupted by the worship of a
production was evaluated. If monetary quantitative indicator, leading both to a
value, then factories would tool up for and reduction in the validity of that indicator
produce only one product to avoid the for its original military purposes, and a
production interruptions of retooling. If corruption of the social processes it was
weight, then factories would produce only designed to reflect.
their heaviest item (e.g., the largest nails I am convinced that this is one of the
in a nail factory). If number of items, then major problems to be solved if we are to
only their easiest item to produce (e.g., achieve meaningful evaluations of our
the smallest nails). All these distortions efforts at planned social change. It is a
led to overproduction of unneeded items problem that will get worse, the more
and underproduction of much needed common quantitative evaluations of social
ones. programs become. We must develop ways
To return to the U.S. experience in a of avoiding this problem if we are to move
final example. During the first period of ahead. We should study the social
U.S. involvement in Viet Nam, the processes through which corruption is
estimates of enemy casualties put out by being uncovered and try to design social
both the South Vietnamese and our own systems that incorporate these features.
military were both unverifiable and In the Texarkana performance-
unbelievably large. In the spirit of contracting study, it was an “outside
McNamara and PPBS, an effort was then evaluator” who uncovered the problem. In
instituted to substitute a more a later U.S. performance-contracting
conservative and verifiable form of study, the Seattle Teachers’ Union
reporting, even if it underestimated total provided the watchdog role. We must seek
enemy casualties. Thus the “body count” out and institutionalize such objectivity-
was introduced, an enumeration of only preserving features. We should also study
those bodies left by the enemy on the the institutional form of those indicator
battlefield. This became used not only for systems, such as the census or the cost-of-
overall reflection of the tides of war, but living index in the U.S., which seem
also for evaluating the effectiveness of relatively immune to distortion. Many
Central and Eastern Europe. New Papers, 123-72. Madison: Institute for
York: The Population Council, 1970. Research on Poverty, University of
Romanian data based primarily on Wisconsin, 1972.
Anuarul Statistic al Republicii Gordon, A. C., Campbell, D. T., et al.
Socialiste Romania. Recommended accounting procedures
Douglas, J. D. The social meanings of for the evaluation of improvements in
suicide. Princeton, N.J.: Princeton the delivery of state social services.
University Press, 1967. Duplicated paper, Center for Urban
Edwards, A. L. The social desirability Affairs, Northwestern University, 1971.
variable in personality assessment Goslin, D. A. Guidelines for the collection,
and research. New York: Dryden, maintenance, and dissemination of
1957. pupil records. New York: Russell Sage
Fairweather, G. W. Methods for Foundation, 1970.
experimental social innovation. New Granick, D. Management of the industrial
York: Wiley, 1967 firm in the U.S.S.R. New York:
Fischer, J. L. The uses of Internal Columbia University Press, 1954.
Revenue Service data. In M.E. Borus Guttentag, M. Models and methods in
(Ed.), Evaluating the impact of evaluation research. Journal for the
manpower programs. Lexington, Theory of Social Behavior, 1971, 1.75-
Mass.: D. C. Health, 1972, 177-180 95. (No. 1, April)
Gardiner, J. A. Traffic and the police: Guttentag, M. Evaluation of social
Variations in law-enforcement policy. intervention programs. Annals of the
Cambridge, Mass.: Harvard University New York Academy of Sciences, 1973,
Press, 1969. 218.3-13.
Garfinkel, H. “Good” organizational Guttentag, M. Subjectivity and its use in
reasons for “bad” clinic records. In H. evaluation research. Evaluation, 1973,
Garfinkel, Studies in 1.60-65. (No. 2)
ethnomethodology. Englewood Cliffs, Hatry, H. P., Winnie, R. E., & Fisk, D. M.
N.J.: Prentice Hall, 1967, 186-207. Practical program evaluation for
Glaser, D. Routinizing evaluation: state and local government officials.
Getting feedback on effectiveness of Washington, D.C.: The Urban Institute
crime and delinquency programs. (2100 M. Street NW, Washington, DC
Washington, D.C.: Center for Studies 20037), 1973.
of Crime and Delinquency, NIMH., Heller, R. N. The uses of social security
1973. (Available from U.S. administration data. In M. E. Borus
Government Printing Office, (Ed.), Evaluating the impact of
Washington, D.C. 20402; publication manpower programs. Lexington,
number 1724-00319, $1.55.) Mass.: D. C. Heath, 1972, 197-201.
Glass, G. V., Willson, V. L., & Gottman, J. Ikeda, K., Yinger, J. M., & Laycock, F.
M. Design and analysis of time-series Reforms as experiments and
experiments. Boulder, Colo.: experiments as reforms. Paper
Laboratory of Educational Research, presented to the meeting of the Ohio
University of Colorado, 1972. Valley Sociological Society, Akron,
Goldberger, A. S. Selection bias in May 1970.
evaluating treatment effects: Some Jackson, D. N., & Messick, S. Response
formal illustrations. Discussion styles on the MMPI: Comparison of
clinical and normal samples. Journal Levenson, B., & McDill, M. S. Vocational
of Abnormal and Social Psychology, graduates in auto mechnaics: A follow-
1962, 65.285-299. up study of Negro and white youth.
Kepka, E. J. Model representation and Phylon, 1966, 27.347-357. (No. 4)
the threat of instability in the Lohr, B. W. An historical view of the
interrupted time series quasi- research on the factors related to the
experiment. Doctoral dissertation, utilization of health services.
Northwestern University, Department Duplicated Research Report, Bureau
of Psychology, 1971. for Health Services Research and
Kershaw, D. N. A negative income tax Evaluation, Social and Economic
experiment. Scientific American, 1972, Analysis Division, Rockvill, Md.,
227.19-25. January 1973, 34 pp. In press, U.S.
Kershaw, D. N. The New Jersey negative Government Printing Office.
income tax experiment: a summary of Lord, F. M. Large-scale covariance
the design, operations, and results of analysis when the control variable is
the first large-scale social science fallible. Journal of the American
experiment. Dartmouth/OECD Statistical Association, 1960, 55.307-
Seminar on “Social Research and 321.
Public Policies,” Sept. 13-15, 1974 (this Lord, F. M. Statistical adjustments when
volume). comparing preexisting groups.
Kershaw, D. N., & Fair, J. Operations, Psychological Bulletin, 1969, 72.336-
surveys, and administration, Volume 337.
IV of the Final Report of the New McCain, L. J. Data analysis in the
Jersey Graduated Work Incentive interrupted time series quasi-
Experiment. Madison, Wisconsin: experiment. Master’s thesis,
Institute for Research on Poverty, Northwestern University, Department
December 1973. Duplicated, 456 pp. of Computer Sciences, in preparation.
(Mathematica, Inc., Princeton, N.J. as Miller, A. R. The assault on privacy. Ann
Vol. IV of the Report of the New Jersey Arbor, Mich.: University of Michigan
Negative Income Tax Experiment) Press, 1971.
December, 1973. (To be published by Miller, S. M., Roby, P., & Steenwijk, A. A.
Academic Press) V. Creaming the poor. Trans-action,
Kitsuse, J. K., & Circourel, A. V. A note on 1970, 7.38-45. (No. 8, June)
the uses of official statistics. Social Morgenstern, O. On the accuracy of
Problems, 1963, 11.131-139. (Fall) economic observations. (2nd ed.)
Kuhn, T. S. The structure of scientific Princeton, N.J.: Princeton University
revolutions. (2nd ed.) Chicago: Press, 1963.
University of Chicago Press, 1979. Morrissey, W. R. Nixon anti-crime plan
(Vol. 2, No. 2, International undermines crime statistics. Justice
Encyclopedia of Unified Science.) Magazine, 1972, 1.8-11, 14. (No. 5/6,
Kutchinsky, B. The effect of easy June/July) (Justice Magazine, 922
availability of pornography on the National Press Building, Washington,
incident of sex crimes: The Danish D.C. 20004)
experience. Journal of Social Issues, Porter, A. C. The effects of using fallible
1973, 29.163-181. (No. 3) variables in the analysis of covariance.
Unpublished doctoral dissertation,
Magazine, 922 National Press Wholey, J. S., Nay, J. N., Scanlon, J. W.,
Building, Washington, D.C. 20004) Schmidt, R. E. If you don’t care where
Watts, H. W., & Rees, A. (Eds.) Final you get to, then it doesn’t matter which
report of the New Jersey graduated way you go. Dartmouth/OECD
work incentive experiment. Volume I. Seminar on Social Research and Public
An overview of the labor supply results Policies, Sept. 13-15, 1974 (this
and of Central labor-supply results volume).
(700 pp.), Volume II. Studies relating Wholey, J. S., Scanlon, J. W., Duffy, H. G.,
to the validity and generalizability of Fukumoto, J., & Vogt, L. M. Federal
the results (250 pp.), Volume III. evaluation policy. Washington, D.C.:
Response with respect to expenditure, The Urban Institute, 1970.
health, and social behavior and Wilder, C. S. Physician Visits, volume and
technical notes (300 pp.) Madison, interval since last visit, U.S. 1969.
Wisconsin: Institute for Research on Rockville, Md.: National Center for
Poverty, University of Wisconsin. Health Statistics, series 10, No. 75,
December, 1973. (Duplicated) July 1972 (DHEW Pub. No. [HSM] 72-
Weber, S. J., Cook, T. D., & Campbell, D. 1064).
T. The effects of school integration on Williams, W., & Evans, J. W. The politics
the academic self-concept of public of evaluation: The case of Head Start.
school children. Paper presented at the The Annals, 1969, 385.118-132.
meeting of the Midwestern (September)
Psychological Association, Detroit, Wortman, C. B., Hendricks, M., & Hillis,
1971. J. Some reactions to the
Weiss, C. H. (Ed.) Evaluating action randomization process. Duplicated
programs: Readings in social action research report, Northwestern
and education. Boston: Allyn & Bacon, University, Department of Psychology,
1972a. 1974.
Weiss, C. H. Evaluation research. Zeisel, H. And then there were none: The
Englewood Cliffs, N.J.: Prentice-Hall, diminunition of the Federal jury.
1972b. University of Chicago Law Review,
Weiss, R. S., & Rein, M. The evaluation of 1971, 38.710-724.
broad-aim programs: A cautionary Zeisel, J. The future of law enforcement
case and a moral. Annals of the statistics: A summary view. In Federal
American Academy of Political and statistics: Report of the President’s
Social Science, 1969, 385. 133-142. Commission. Vol. II, 1971, 527-555.
Weiss, R. S., & Rein, M. The evaluation of Washington, D.C.: U.S. Government
broad-aim programs: Experimental Printing Office. (Stock no. 4000-0269)
design, its difficulties, and an
alternative. Administrative Science
Quarterly, 1970, 15.97-109.
Westin, A. F. Privacy and freedom. New
York: Atheneum, 1967.
Wheeler, S. (Ed.) Files and dossiers in
American life. New York: Russell Sage
Foundation, 1969.