
Multiple Comparison Procedures: The Practical Solution

Author(s): D. J. Saville
Source: The American Statistician, Vol. 44, No. 2 (May, 1990), pp. 174-180
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2684163 .
Accessed: 17/06/2014 13:51

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp.
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.

American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to The
American Statistician.

http://www.jstor.org

This content downloaded from 138.26.188.186 on Tue, 17 Jun 2014 13:51:16 PM


All use subject to JSTOR Terms and Conditions
COMMENTARIES

Commentaries are informative essays dealing with viewpoints of statistical practice, statistical education, and other topics considered to be of general interest to the broad readership of The American Statistician. Commentaries are similar in spirit to Letters to the Editor, but they involve longer discussions of background, issues, and perspectives. All commentaries will be refereed for their merit and compatibility with these criteria.

Multiple Comparison Procedures: The Practical Solution


D. J. SAVILLE*

A practicing statistician looks at the multiple comparison controversy and related issues through the eyes of the users. The concept of consistency is introduced and discussed in relation to five of the more common multiple comparison procedures. All of the procedures are found to be inconsistent except the simplest procedure, the unrestricted least significant difference (LSD) procedure (or multiple t test). For this and other reasons the unrestricted LSD procedure is recommended for general use, with the proviso that it should be viewed as a hypothesis generator rather than as a method for simultaneous hypothesis generation and testing. The implications for Scheffé's test for general contrasts are also discussed, and a new recommendation is made.

KEY WORDS: Comparisonwise error rate; Duncan's multiple range test; Experimentwise error rate; Power; Teaching of statistics; Tukey's honest significant difference procedure; Waller-Duncan k-ratio test.

1. INTRODUCTION

Procedures such as the least significant difference (LSD) test, Duncan's multiple range test, and Waller and Duncan's k-ratio LSD test are used by applied researchers for the analysis of data from experiments in which the treatments have no easily definable structure. These and similar procedures, referred to generally as multiple comparison procedures, have been the subject of controversy for over half a century.

Many statisticians dislike the idea of simultaneously testing a multitude of unplanned and interrelated hypotheses, and some question the usefulness of multiple comparison procedures (e.g., Nelder 1971; Plackett 1971; Preece 1982). In spite of this, multiple comparison procedures continue to be widely used and sometimes misused by researchers. Their misuse in experiments in which the treatments possess a factorial structure, or are quantitative in nature, has been highlighted by many excellent papers in applied science journals (e.g., Chew 1980; Cousens 1988; Little 1978; Perry 1986; Petersen 1977). These writers have rightly pointed out the greater appropriateness of the methods of orthogonal contrasts and regression analysis for the analysis of data from such experiments.

In most research work, the objectives are sufficiently well defined that the questions of interest can be answered using appropriate contrasts. In a minority of cases, such as pesticide screening trials or cultivar evaluation trials, a least significant difference is useful as a yardstick with which to assess the strength of evidence for any particular pairwise difference. In these cases I recommend usage of the simplest multiple comparison procedure, the multiple t test, or unrestricted LSD procedure.

The purpose of this article is to outline some of the reasoning behind this recommendation. Particular attention is paid to the "inconsistency" of the alternatives to the unrestricted LSD procedure. This refers to the fact that a given procedure can return a verdict of not significant for a given difference in one experiment, but return a verdict of 1% significant for the same difference in a second experiment, with no change in the standard error of the difference or the number of error degrees of freedom.

The format of the article is to define the unrestricted LSD procedure (Sec. 2), set the scene by discussing its practical usage (Sec. 3), and discuss the common objections to its use (Sec. 4). The notion of inconsistency is then introduced and discussed, with examples of the inconsistency of Fisher's restricted LSD procedure, Tukey's honest significant difference (HSD) procedure, and Waller and Duncan's k-ratio LSD procedure (Sec. 5). In a discussion (Sec. 6), three main points are covered. First, the advantages of the unrestricted LSD procedure are summarized. Second, the distinction between hypothesis generation and testing is highlighted, leading to the suggestion that multiple comparison procedures should be treated as hypothesis generators rather than simultaneous generators and testers. Third, the implications for the analysis of general, unplanned contrasts are spelled out, leading to a recommendation of the usual F or t test instead of the inconsistent Scheffé test. To summarize, the "practical solution" is given in Section 7.

*D. J. Saville is Biometrician, Ministry of Agriculture and Fisheries, P.O. Box 24, Lincoln, Canterbury, New Zealand. The author is grateful to F. Jackson Hills for stimulating the work and to Peter Heffernan, Harold Henderson, Mike Ryan, Chris Dyson, and Karen Baird for constructive criticism. Samuel G. Carmer is also thanked for his early encouragement and for assistance with the preparation of this article, including the computation of the equivalent significance levels in Section 5. The ideas in the article were presented at the thirteenth International Biometric Conference in Seattle in 1986.

174 The American Statistician, May 1990, Vol. 44, No. 2 © 1990 American Statistical Association

2. DEFINITION OF LSD

The 5%-level "multiple t test" consists of all pairwise comparisons of the population means using 5%-level t tests based on a pooled variance estimate, $s^2$. That is, for each pair of populations i and j, the absolute value of

$t = (\bar{y}_i - \bar{y}_j)/\sqrt{2s^2/n} = \text{difference}/\text{SE(difference)}$

is compared with the 97.5 percentile, or 5% critical value, of the $t_\nu$ distribution, $t_\nu(.975)$, where $\nu$ is the number of degrees of freedom associated with the variance estimate $s^2$, $\bar{y}_i$ and $\bar{y}_j$ are the ith and jth sample means, and n is the common sample size; SE represents standard error.

This is equivalent to comparing $|\bar{y}_i - \bar{y}_j|$ with the quantity $t_\nu(.975) \times \text{SED}$, where SED is the standard error of the difference between two means. This quantity is thus known as the 5%-level unrestricted least significant difference, or LSD(5%). The name unrestricted LSD procedure is used, since there is no restriction requiring a significant preliminary overall F test.

3. PRACTICAL USAGE

In some teaching institutions, procedures for the testing of hypotheses are presented as "black and white" in character; the significance level for the test is decided before the data are examined, and the test yields a clear-cut decision: yes, the hypothesis is true, or no, the hypothesis is false.

In practice, most statisticians recognize that there are also shades of gray, and many adopt the alternative strategy of measuring the strength of the evidence against a hypothesis by varying the significance level of the test and quoting the most stringent test for which the hypothesis is rejected. In the present context, this means that critical values such as the LSD(10%), LSD(5%), LSD(1%), and LSD(.1%) are used as distances on a yardstick that measure the strength of the evidence against a hypothesis $H_0\colon \mu_i = \mu_j$.

To put the usage of the unrestricted LSD procedure in perspective, we now present two sample data sets. The first sample data set consists of the results from a typical weed-control experiment (Table 1). In this case, the main questions of interest are answered by the testing of three orthogonal contrasts. The LSD(5%) is included to indicate the level of variability in the experiment and to allow an indication of the strength of evidence for unanticipated questions that may be asked by fellow researchers, such as "Did the coded chemical depress yield when applied at twice normal instead of the normal rate?"

The second sample data set consists of data from a series of wheat cultivar evaluation trials in an agricultural district (Table 2). In this case the LSD(5%)'s are provided to give an indication of the level of variability within each trial and to provide an indication of the strength of evidence concerning any particular pairwise difference within a trial. If, however, the researcher wishes to make an overall comparison between, say, the cultivars Rongotea and Oroua for the agricultural district, the LSD's are not especially useful. A more appropriate method of analysis is to calculate the differences (Rongotea - Oroua) = .23, .40, .30, .23, .49, and .39 tonnes/hectare in the respective trials and then test the hypothesis $H_0$: Rongotea = Oroua using a simple t test. With a t value of 8.0 with 5 degrees of freedom, there would be little difficulty in concluding that Rongotea is on average a higher-yielding cultivar than Oroua.

In both of these data sets, the LSD is just one piece of information used in arriving at a sensible interpretation of the results. In all cases the most important information is provided by the treatment means and the observed differences between these means. These provide the best guesses as to the true differences between the means. If the observed differences are large enough and important enough to be of potential interest, the researcher would be wise to conduct further work to establish more accurately the true magnitude of the differences. [A good way of expressing the uncertainty associated with the estimate of a particular difference is to quote the 95% confidence interval for the true difference, which is given by observed difference ± LSD(5%).]

4. OBJECTIONS TO LSD

In the statistical literature, the main objections to the use of the unrestricted LSD procedure appear to be as follows.

1. Too many of the pairwise differences are correlated.
2. There are too many chances of spuriously declaring a zero difference to be nonzero (i.e., of committing a Type I error).

Table 1. Report on a Typical Weed-Control Experiment

Treatment                                        Wheat yield (tonnes/hectare)
1. Control                                       5.2
2. Coded chemical, half-normal rate (.5N)        7.7
3. Coded chemical, normal rate (N)               8.8
4. Coded chemical, twice-normal rate (2N)        8.9
5. Standard chemical, normal rate (N)            8.3
LSD(5%)                                           .9

Contrasts                                                          Significance
Control versus treated (treatment 1 versus treatments 2-5)         ** (a)
Linear trend within coded chemical (treatments 2-4)                * (b)
Coded (N) versus standard (N) (treatment 3 versus treatment 5)     Not significant

NOTE: Treatment means, LSD(5%), and the significance of the contrasts of interest are presented for an herbicide experiment conducted using a randomized complete block design with four replicates.
(a) 1% significance.
(b) 5% significance.
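
As a rough illustration of the "yardstick" use of the LSD described in Section 3, the sketch below compares every pairwise difference among the Table 1 treatment means with the quoted LSD(5%) of .9 tonnes/hectare. The short treatment labels and the function name `lsd_comparisons` are illustrative choices made for this sketch and do not come from the original report.

```python
from itertools import combinations

# Treatment means (tonnes/hectare) and LSD(5%) as quoted in Table 1.
# The short labels are illustrative abbreviations, not from the source.
means = {
    "control": 5.2,
    "coded_0.5N": 7.7,
    "coded_N": 8.8,
    "coded_2N": 8.9,
    "standard_N": 8.3,
}
LSD_5 = 0.9


def lsd_comparisons(means, lsd):
    """Return {(a, b): (difference, exceeds_lsd)} for every pair of treatments."""
    results = {}
    for a, b in combinations(means, 2):
        # Round to the precision of the reported means to avoid float noise.
        diff = round(means[b] - means[a], 2)
        results[(a, b)] = (diff, abs(diff) > lsd)
    return results


comparisons = lsd_comparisons(means, LSD_5)
for (a, b), (diff, exceeds) in comparisons.items():
    verdict = "exceeds LSD(5%)" if exceeds else "within LSD(5%)"
    print(f"{b} - {a}: {diff:+.1f}  {verdict}")
```

For the unanticipated question quoted in Section 3 (twice-normal versus normal rate of the coded chemical), the observed difference falls well inside the LSD(5%), so the data give no real indication that the higher rate changed yield.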

Table 2. A Typical Series of Six Wheat Cultivar Evaluation Trials in an Agricultural District

Trial              Cultivar       Grain yield (tonnes/hectare)
Marshall, 1981     Aotea          3.15
                   Oroua          3.81
                   Rongotea       4.04
                   LSD(5%)         .25
McCraw, 1982       Durum 1        3.05
                   Durum 2        3.11
                   Oroua          4.10
                   Rongotea       4.50
                   LSD(5%)         .31
Reed, 1982         Oroua          5.20
                   CRD 1015       5.36
                   CRD 1018       5.37
                   Rongotea       5.50
                   LSD(5%)         .27
Lochhead, 1983     Durum 1        2.27
                   Durum 2        2.32
                   Hilgendorf     2.37
                   Oroua          2.98
                   Rongotea       3.21
                   Karamu         3.52
                   LSD(5%)         .25
Kyle, 1983         Durum 2        3.07
                   Durum 1        3.24
                   Hilgendorf     3.75
                   Oroua          4.33
                   Rongotea       4.82
                   Karamu         5.10
                   LSD(5%)         .30
Whittaker, 1984    Oroua          2.68
                   Rongotea       3.07
                   Karamu         3.22
                   LSD(5%)         .35

NOTE: Wheat grain yields are means for each cultivar in each trial. Trials are labeled by farmer and year. All trials were randomized complete block designs with four replicates.

3. The procedure is based on comparisonwise error rates rather than experimentwise error rates. (This is a less common objection.)

How serious is the first objection? Or, more specifically, how large is the correlation, and what proportion of differences is correlated? The first answer is that the correlation is precisely ±1/2 for pairs of differences with one treatment in common to both differences and 0 for pairs of differences with no treatment in common. The second answer is that for an experiment involving k treatments, the proportion of correlated pairs of pairwise differences is 4/(k + 1), which decreases rapidly as k increases. Hence, when there are 7 treatments, half of the pairs of pairwise differences are correlated, but when there are 39 treatments only 10% of the pairs are correlated.

Does the unrestricted LSD procedure allow more correlated comparisons than the alternative procedures? The answer is no! All procedures are equal in this respect. This is because the problem of correlation can only be solved at the design stage, by prespecifying orthogonal contrasts corresponding to questions of interest to the researcher. At the analysis stage, the best that the data analyst can do is to bear in mind the fact that each comparison is correlated to certain other comparisons, so Type I errors tend to occur in bunches, being more frequent than usual in some experiments but less frequent than usual in other experiments. For example, suppose in an alfalfa cultivar evaluation trial that the cultivar Rere proved significantly higher yielding than Wairau, Caliverde, and Atra. The analyst would then bear in mind that the three comparisons of Rere with each of the other cultivars were correlated, so three wrong decisions could have been made "for the price of one"; for example, the field plots assigned to Rere may have been unusually fertile, leading to a spurious conclusion that Rere was a superior cultivar. Apart from this sort of informed doubt about the results, it is a fact of life that a statistician cannot use statistical reasoning to decide when or where these "bunches of errors" occur.

The second objection is that the unrestricted LSD procedure allows too many chances of making a Type I error. In this context, statisticians know that the more work they do, or the greater the number of comparisons made, the greater the chance of making at least one Type I error. This is another fact of life for statisticians, who must learn to live with the uncertainty that is the very reason for their existence. In the case of, say, a wheat cultivar evaluation trial in which the aim is to compare each cultivar with every other cultivar, the chance of making at least one Type I error will increase with the number of cultivars included in the trial. So too will the chance of making at least one Type II error. However, no redesign of the analysis procedure can overcome this problem.

How common are Type I errors? The answer depends on the true set of means, so I shall consider a few cases. First, what if all treatment means are truly equal? On average, how many Type I errors would we expect to make in each experiment? The answer is 5% of the number of pairwise comparisons. Thus, in an experiment with five treatments, the average number of Type I errors is 10 × (.05) = .50, or one Type I error in every second experiment. Second, what if, say, only 4 of the 10 pairs are truly equal? Then the average number of Type I errors is 4 × (.05) = .20, or one Type I error in every five experiments. Last, what if no two treatment means are truly equal? In this case there is absolutely no chance of a Type I error!

In real experiments the case "all $\mu_i$'s unequal" occurs more frequently than the case $\mu_1 = \mu_2 = \cdots = \mu_k$. Thus in general there is more opportunity for Type II errors to occur than Type I errors. This means that from a practical viewpoint it is desirable to put more weight on minimizing Type II errors than Type I errors. Since most of the alternative procedures have a higher Type II error rate than the unrestricted LSD procedure (Carmer and Swanson 1971, 1973), this points to the unrestricted LSD procedure as the procedure of choice.

The preoccupation with Type I errors among theoretical statisticians presumably arises from the comparative mathematical simplicity of the null case $\mu_1 = \mu_2 = \cdots = \mu_k$, which has led to a predominance of work on this case at the expense of equally important alternative hypotheses.

The third objection sometimes raised is that the unrestricted LSD procedure is wrongly based on a comparisonwise Type I error rate and should instead be based on an experimentwise Type I error rate. The problem here is that holding the experimentwise Type I error rate constant, say at 5%, causes a rapid increase in the probability of Type II

error with increasing size of experiment. This is highly undesirable; it takes the control of Type I and II errors out of the hands of the researcher in such a way that an undesirable balance of Type I and II errors is usually obtained.

The natural conceptual unit is the comparison, not the experiment. An experiment is no more a natural unit than a project consisting of several experiments or a research program consisting of several projects. Clearly, it is unsatisfactory to have the size of the experiment, or the number of experiments in a project, influencing the probability of detecting a particular pairwise difference.

5. INCONSISTENCY

A basic defect in most multiple comparison procedures is the lack of consistency in the verdicts they return concerning whether two population means differ. In this section I define the term inconsistency and then consider several of the better-known alternative procedures, showing by example that each procedure is inconsistent.

5.1 Formal Definition

A procedure is called inconsistent if for any two population means $\mu_i$ and $\mu_j$ the probability of judging them to be different depends on either the number of populations or the values of the sample means for the other populations.

Thus a procedure is consistent if the decision it generates as to whether two population means are different is dependent only on (a) the difference between the two sample means, (b) the standard error of this difference, (c) the number of error degrees of freedom, and (d) the significance level at which the procedure is operated.

5.2 Tukey's HSD Procedure

Tukey's HSD procedure (Tukey 1949) is based on the distribution of the range. The procedure declares two populations to be different, say at p = .05, if their sample means differ by more than HSD(5%) = $q_{k,\nu}(.95) \times \text{SEM}$, where $q_{k,\nu}(.95)$ denotes the 95th percentile of the distribution of the range of k N(0, 1) random variables, k is the number of populations in the study, $\nu$ is the number of error degrees of freedom, and SEM is the standard error of the mean.

In Figure 1 we may suppose that three research colleagues (Alison, Sue, and Graeme) have each carried out an experiment. The experiments include 2, 4, and 8 treatments, respectively, but each experiment includes treatments A and B. All designs were completely randomized, with 13, 7, and 4 replicates, respectively, so the number of error degrees of freedom was 24 in each case. The SEM was 1.00, and the observed difference between treatments A and B was 4.3 in all cases.

In our hypothetical situation none of the researchers have any knowledge to enable them to prespecify meaningful contrasts, so their statistician advises them to use Tukey's procedure to analyze their data. Imagine the statistician's embarrassment when they compare notes and discover that the significance of the difference between treatments A and B has varied from not significant, to 5% significant, to 1% significant. This means that the evidence concerning treatments A and B has varied from "no proof of a difference," to "suggestive of an effect," to "convincing evidence of a difference" [to use the words of O'Brien (1983)].

[Figure 1 graphic not reproduced. Each panel plots the treatment means on an axis marked at 20 and 25, with a horizontal bar for the HSD(5%). Significance of A vs. B: Alison, HSD = 2.92, **; Sue, HSD = 3.90, *; Graeme, HSD = 4.68, ns.]

Figure 1. Significance of the Difference Between Populations Receiving Treatments A and B in Three Studies Statistically Analyzed Using Tukey's HSD Procedure. The horizontal bar represents the HSD(5%), ns is not significant, a single asterisk denotes 5% significance, and a double asterisk denotes 1% significance.

The "inconsistency" in the decisions returned by Tukey's procedure is in practice unacceptable. The phenomenon is a reflection of the increase in the critical HSD value with an increasing number of experimental treatments; this is well known and is the main reason why Tukey's procedure is not widely used.

5.3 Restricted LSD Procedure

The restricted (or Fisher's) LSD procedure is a two-stage procedure. First, the overall F value is examined. If the overall F value is not significant, say at p = .05, all population means are declared to be identical, and no pairwise tests are conducted. If the overall F value is significant, all 5%-level pairwise t, or LSD, tests are conducted. Similarly, if the overall F value is 1% significant, all 1%-level pairwise t, or LSD, tests are conducted.

In Figure 2 we suppose that another three research colleagues (Ron, Mary, and Dave) have carried out experiments. These experiments all have four treatments, including a common pair, A and B. All designs were randomized complete block designs with six replicates, so the number of error degrees of freedom was 15 in each case. The SEM was 1.00, and the observed difference between treatments A and B was 4.3 in all cases.

In this example the statistician advises his or her colleagues to use the restricted LSD procedure to analyze their data sets. As displayed in Figure 2, this advice also leads to inconsistent results and embarrassment for the consulting statistician. The phenomenon in this case arises because the significance of the difference between treatments A and B is tied to the significance of the overall F test, with the latter varying from 5.62 (**) to 4.21 (*) to 3.14 (ns).

The restricted LSD procedure received favorable mention in the reviews carried out by Carmer and Swanson (1971, 1973) and Chew (1976, 1977), and has many followers. However, it is clearly not a consistent procedure.
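
The two-stage rule of Section 5.3 can be sketched in a few lines, using the Ron, Mary, and Dave example (difference of 4.3, SEM of 1.00, 15 error degrees of freedom, overall F values of 5.62, 4.21, and 3.14). This is only a minimal sketch: the t and F critical values are standard table values supplied here as assumptions rather than numbers quoted in the article, and the function names are hypothetical.

```python
# Critical values are standard table values, assumed for this sketch:
#   t with 15 df (two-sided): 2.131 at 5%, 2.947 at 1%
#   F with (3, 15) df:        3.29 at 5%,  5.42 at 1%
T_CRIT = {0.05: 2.131, 0.01: 2.947}
F_CRIT = {0.05: 3.29, 0.01: 5.42}

SEM = 1.00
SED = 2 ** 0.5 * SEM      # standard error of a difference between two means
DIFF_AB = 4.3             # observed A - B difference in all three studies


def unrestricted_lsd(diff, sed):
    """Most stringent level (1%, then 5%) at which |diff| exceeds the LSD."""
    for level in (0.01, 0.05):
        if abs(diff) > T_CRIT[level] * sed:
            return level
    return None  # not significant


def restricted_lsd(diff, sed, overall_f):
    """Fisher's two-stage rule: a pairwise test at a given level is made
    only if the overall F value is itself significant at that level."""
    for level in (0.01, 0.05):
        if overall_f > F_CRIT[level] and abs(diff) > T_CRIT[level] * sed:
            return level
    return None  # not significant


overall_f = {"Ron": 5.62, "Mary": 4.21, "Dave": 3.14}
verdicts = {
    name: (restricted_lsd(DIFF_AB, SED, f), unrestricted_lsd(DIFF_AB, SED))
    for name, f in overall_f.items()
}
for name, (restricted, unrestricted) in verdicts.items():
    print(f"{name}: restricted={restricted}, unrestricted={unrestricted}")
```

Under these assumed critical values, the restricted verdict for the same difference moves from 1% (Ron) to 5% (Mary) to not significant (Dave), whereas the unrestricted verdict is 1% in every study, because it depends only on the difference, the SED, and the error degrees of freedom.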

[Figure 2 graphic not reproduced. Each panel plots the treatment means on an axis marked at 20 and 25, with the LSD shown as a horizontal bar. Significance of A vs. B: Ron, LSD(1%) = 4.17, **; Mary, LSD(5%) = 3.01, *; Dave, LSD = ∞, ns.]

Figure 2. Significance of the Difference Between Populations Receiving Treatments A and B in Three Studies Statistically Analyzed Using the Restricted LSD Procedure. ns is not significant, a single asterisk denotes 5% significance, and a double asterisk denotes 1% significance.

5.4 Waller and Duncan's Procedure

In Waller and Duncan's k-ratio LSD procedure (Waller and Duncan 1969) the overall F value is used in calculating the LSD. If the overall F value is large the calculated LSD is smaller than if the F value is small. A k-ratio of 100 approximates a 5%-level test and a k-ratio of 500 approximates a 1%-level test. In general, critical values must be obtained from tables (Waller and Duncan 1969). It is informative, however, to note that for large experiments, with ≥15 treatments and ≥30 degrees of freedom for error, the LSD(5%) = LSD(k = 100) can be approximated by $1.72 \times \sqrt{F/(F - 1)} \times \text{SED}$ (Duncan 1965). In this expression, as $F \to \infty$ the LSD(5%) approaches 1.72 × SED, and as $F \to 1$ the LSD(5%) approaches ∞.

In Figure 3 another three colleagues (Harry, John, and Skip) have each conducted experiments with 7 treatments and 20 replicates, laid out in a completely randomized fashion. Each researcher included a common pair of treatments, A and B, and each observed a difference of 3.6 between treatments A and B. The SEM was 1.00, so the SED was 1.414, and the number of error degrees of freedom was 133 in all cases.

In this hypothetical example the statistician advises his or her colleagues to use Waller and Duncan's procedure as a method of analysis, using a k ratio of 100 to approximate a 5%-level test and a k ratio of 500 to approximate a 1%-level test. As shown in Figure 3, the same inconsistency arises as with the other two procedures. The explanation here is that the critical value depends on the overall F value; small F values mean large critical values, and large F values mean small critical values (in Fig. 3 the F values are 25.25, 3.01, and 1.40).

[Figure 3 graphic not reproduced. Each panel plots the treatment means on an axis marked at 15, 20, and 25, with the LSD(5%) shown as a horizontal bar. Significance of A vs. B: Harry, LSD = 2.52, **; John, LSD = 3.07, *; Skip, LSD = 3.87, ns.]

Figure 3. Significance of the Difference Between Populations Receiving Treatments A and B in Three Studies Statistically Analyzed Using Waller and Duncan's k-Ratio LSD Procedure (and their tabular values). A k ratio of 100 was used to derive LSD(5%), which is shown as a horizontal bar. ns is not significant, a single asterisk denotes 5% significance, and a double asterisk denotes 1% significance.

This procedure is regarded by the reviewers mentioned in the last section as one of the better procedures. In agricultural research it has received some acceptance as a successor to Duncan's multiple range test.

5.5 Duncan's Multiple Range Test

Duncan's multiple range test (Duncan 1955) arises from two modifications to Tukey's HSD procedure. These are (a) to use the critical value for the distribution of the range for the number of treatments, p, in a reduced group, not the total number of treatments, and (b) to vary the protection level with the number of treatments so that it is consistent with the structured case in which (p - 1) orthogonal contrasts can be specified. Critical values are tabulated in many statistical textbooks; these values inflate quite slowly as the group size (p) increases, so in practice the procedure often yields results similar to those obtained from the unrestricted LSD procedure.

With Duncan's multiple range test it is more difficult to construct examples as extreme as those shown in Figures 1-3; these do exist, but only for relatively large experiments. This indicates that Duncan's multiple range test is less inconsistent than the three preceding procedures.

5.6 Unrestricted LSD Procedure

In each of Figures 1-3 the unrestricted LSD procedure is entirely consistent in the verdict it returns as to whether the populations receiving treatments A and B have different means. In fact, the definition of the word consistent means that the unrestricted LSD is always a consistent procedure and is the only consistent procedure.

5.7 Comparison of Procedures

In Figures 1-3 the decision returned by a particular procedure has been shown to vary from experiment to experiment, with no change in the observed difference, the SED, or the number of error degrees of freedom. Similar examples can be constructed for all procedures except the unrestricted LSD procedure. Some procedures, however, are more inconsistent than others.

Tukey's procedure is the most inconsistent of the alternative procedures mentioned in this article. With this procedure the significance can vary from not significant to .1% significant in studies of only modest size. In Figure 1, the quoted HSD(5%) values are equivalent to the following LSD values: HSD(Alison) = LSD(5%); HSD(Sue) = LSD(1.1%); HSD(Graeme) = LSD(.3%). In other words, each researcher has in effect used the unrestricted LSD procedure but has also allowed the number of treatments to select the significance level. Tukey's procedure, however, is at least predictable in its inconsistency, since the HSD depends simply on the number of treatments included in the

study and not on the observed sample means.

The restricted LSD procedure can be very inconsistent, since examples of studies of modest size can be constructed where the significance varies from not significant to .1% significant. In practice, however, the inconsistency of this procedure is much less serious than that of Tukey's procedure, since in many instances the overall F value is substantial.

With Waller and Duncan's procedure the limit of the variation in the significance is from not significant to 1% significant. In Figure 3, the quoted LSD(5%) values are equivalent to the following LSD values: LSD(Harry) = LSD(7.7%); LSD(John) = LSD(3.2%); LSD(Skip) = LSD(.7%). In other words, each researcher has in effect used the unrestricted LSD procedure, but at different significance levels determined by the data. The example shown in Figure 3, however, had to be carefully constructed to exhibit this level of variation. This suggests that the inconsistency of Waller and Duncan's procedure is usually only moderate.

With Duncan's multiple range test it is the number of treatment means that are intermediate between the two means being compared that determines the effective significance level of the particular comparison. With this test it is even more difficult to construct examples with the level of variation shown in Figures 1-3. In practice, this procedure appears to be the most consistent of the alternatives to the unrestricted LSD procedure.

It is interesting that it is Duncan's multiple range test that, until recently, enjoyed the greatest acceptance among agricultural researchers. In addition to its greater level of consistency, Duncan's test is the least conservative of the alternatives to the unrestricted LSD procedure and is the procedure that is most similar to the latter. In fact, one can speculate that the t test, or unrestricted LSD, is subconsciously accepted by researchers as the "standard," so any procedure that is too different is unacceptable. More recently, many researchers have changed back to using the unrestricted LSD procedure.

6. DISCUSSION

In practice, an applied statistician cannot afford to recommend an inconsistent procedure, so the unrestricted LSD procedure is the only procedure that can be safely recommended to researchers. In addition to its consistency, however, the unrestricted LSD procedure has many practical advantages over the alternative procedures (Table 3). It is simple, provides a natural extension to the two-population case, and is flexible enough to cater easily for unequal replication or heterogeneity of variance. The calculated LSD can also be used to derive a confidence interval for the difference between two population means, namely observed difference ± LSD. It has a known and constant Type I error rate that is "true to label." If conservatism is required, it can be operated at a 1% or .1% level of significance rather than at a 5% level. Its power is determined by the number of replications and the underlying level of variability, and the number of replications required to achieve a certain level of power for a given true difference can be easily computed. The unrestricted LSD procedure often has maximum power among the procedures, exceeded only by Waller and Duncan's test under some assumed sets of true means (Carmer and Swanson 1971, 1973). The calculations are easier to check than with most of the alternative procedures, and usage of these latter procedures has undoubtedly led to an increase in the number of wrong conclusions based on undetected computational errors. In summary, if a cost-benefit analysis were to be performed, the unrestricted LSD procedure would shine through as the clear leader in the field.

Carmer and Walker (1982, 1985) also presented arguments for the conclusion that use of the LSD is appropriate whenever a pairwise multiple comparison procedure is in order. They argued that the researcher is in the best position to fix the significance level prior to the hypothesis test, to achieve the best balance between Type I and Type II errors. This is an alternative view to that put forward in this article; I prefer to carry out the test at a range of significance levels, in effect using the LSD's to measure the strength of the evidence, rather than to set up a "black versus white" type of hypothesis test.

In the more general hypothesis testing context, the scenario that is most acceptable to statisticians is that of a well-designed study in which orthogonal contrasts are prespecified, corresponding to a "vision of reality" that will, it is hoped, be supported by the data. If this vision is not supported by the data, however, it is sometimes found that another set of orthogonal contrasts provides a good description of the data. This description generates a new vision of reality, which will then need to be confirmed in subsequent studies.

Multiple comparison procedures go against this basic philosophy in that they appear to formulate and test hypotheses in the same study simultaneously. In fact, the multiple comparison controversy is resolved if the procedures are thought of as hypothesis generators rather than as methods for simultaneous generation and testing. When viewed in this light, the 5%-level unrestricted LSD procedure is seen to have the following simple characteristics:

Table 3. Comparison of Unrestricted LSD Procedure WithAlternative


Multiple Comparison Procedures

Unrestricted Alternative
Characteristics LSD procedure procedures
Consistency? Consistent Inconsistent
Simplicity? Simple More complex
Flexibility? Flexible Less flexible
Type I error rate? Constant Variable
Power? Maximum power Lower power
Required sample size? Easy to calculate Hard to calculate
Easy to check? Easily checked Harder to check

1. If in fact $\mu_i = \mu_j$, the hypothesis $H_A\colon \mu_i \neq \mu_j$ will be generated with probability .05.
2. If $\mu_i \neq \mu_j$, this probability depends only on the size of the standardized difference $(\mu_i - \mu_j)/\mathrm{SED}$.

In any field of research an individual study is merely one component to be interpreted in the light of other studies and other knowledge. The unrestricted LSD procedure may generate a false hypothesis in one study, but this is unlikely to be confirmed by subsequent studies or may be recognizable as false from previous studies or previous knowledge. In other words, the problem of false hypotheses is not particularly serious for research workers, who are attuned to the problems of working with variable materials and are aware of the need for confirmation of unexpected results using independent data sources.

Scheffé's test (Scheffé 1953), which has been advocated for the analysis of general unplanned contrasts, is in a position analogous to multiple comparison procedures in that it also represents an attempt at simultaneously formulating and testing hypotheses. Scheffé's test is inconsistent, since the significance of a given contrast value with a given standard error can vary from not significant to .1% significant, depending simply on the number of populations included in the study. For this reason Scheffé's test is widely recognized as excessively conservative and is seldom used in practice. The consistent method of analysis is to use the same F or t test as for planned contrasts. I would again suggest, however, treating this as a hypothesis generating procedure rather than a procedure for simultaneous generation and testing.

In summary, the only consistent way to analyze unplanned contrasts, both pairwise and general, is to use the ordinary single-degree-of-freedom F test or the equivalent t test. That is, one should use the same procedures as in the planned case but should treat them as hypothesis generating procedures rather than hypothesis generating and confirming procedures.

The implications for teachers of statistics are important. These new, simplified recommendations mean that the large and confusing body of material on multiple comparison procedures, and the smaller amount on Scheffé's test, can be replaced by material that stresses the distinction between hypothesis formulation and testing. This makes a much tidier and more consistent teaching package than is currently available.

7. THE PRACTICAL SOLUTION

My recommendation is to use only the simplest formal procedures and to rely upon improved common sense, better statistical education, and the incorporation of information from other sources as safeguards against errors of interpretation.

7.1 Multiple Comparisons

1. Do not attempt to simultaneously formulate and test hypotheses concerning pairwise differences in a single study.
2. When analyzing the data from a study in which the populations have no discernible structure, seek out interesting or important differences between populations. Use the unrestricted LSD's for, say, 10%-, 5%-, 1%-, and .1%-level tests, to determine the strength of the evidence for each of these differences; this gives an indication as to which of the observed differences are likely to be real.
3. Confirm these differences in subsequent studies.

7.2 General Unplanned Contrasts

1. Do not attempt to simultaneously formulate and test hypotheses concerning contrasts in a single study.
2. When analyzing the data from a study in which several sets of contrasts could have been prespecified, formulate hypotheses in terms of those contrasts that best describe the data using the ordinary single-degree-of-freedom F or t tests (not Scheffé's more conservative test).
3. Confirm any interesting hypotheses in subsequent studies.

[Received July 1987. Revised September 1989.]

REFERENCES

Carmer, S. G., and Swanson, M. R. (1971), "Detection of Differences Between Means: A Monte Carlo Study of Five Pairwise Multiple Comparison Procedures," Agronomy Journal, 63, 940-945.
Carmer, S. G., and Swanson, M. R. (1973), "An Evaluation of Ten Pairwise Multiple Comparison Procedures by Monte Carlo Methods," Journal of the American Statistical Association, 68, 66-74.
Carmer, S. G., and Walker, W. M. (1982), "Baby Bear's Dilemma: A Statistical Tale," Agronomy Journal, 74, 122-124.
Carmer, S. G., and Walker, W. M. (1985), "Pairwise Multiple Comparisons of Treatment Means in Agronomic Research," Journal of Agronomic Education, 14, 19-26.
Chew, V. (1976), "Comparing Treatment Means: A Compendium," HortScience, 11, 348-357.
Chew, V. (1977), "Comparisons Among Treatment Means in an Analysis of Variance," Agricultural Research Service Technical Bulletin H-6, U.S. Department of Agriculture, Washington, D.C.
Chew, V. (1980), "Testing Differences Among Means: Correct Interpretation and Some Alternatives," HortScience, 15, 467-470.
Cousens, R. (1988), "Misinterpretations of Results in Weed Research Through Inappropriate Use of Statistics," Weed Research, 28, 281-289.
Duncan, D. B. (1955), "Multiple Range and Multiple F Tests," Biometrics, 11, 1-42.
Duncan, D. B. (1965), "A Bayesian Approach to Multiple Comparisons," Technometrics, 7, 171-222.
Little, T. M. (1978), "If Galileo Published in HortScience," HortScience, 13, 504-506.
Nelder, J. A. (1971), Discussion of "The Present State of Multiple Comparison Methods," by R. O'Neill and G. B. Wetherill, Journal of the Royal Statistical Society, Ser. B, 33, 244-246.
O'Brien, P. C. (1983), "The Appropriateness of Analysis of Variance and Multiple Comparison Procedures," Biometrics, 39, 787-794.
Perry, J. N. (1986), "Multiple-Comparison Procedures: A Dissenting View," Journal of Economic Entomology, 79, 1149-1155.
Petersen, R. G. (1977), "Use and Misuse of Multiple Comparison Procedures," Agronomy Journal, 69, 205-208.
Plackett, R. L. (1971), Discussion of "The Present State of Multiple Comparison Methods," by R. O'Neill and G. B. Wetherill, Journal of the Royal Statistical Society, Ser. B, 33, 242-244.
Preece, D. A. (1982), "The Design and Analysis of Experiments: What Has Gone Wrong?" Utilitas Mathematica, Ser. A, 21, 201-244.
Scheffé, H. (1953), "A Method for Judging All Contrasts in the Analysis of Variance," Biometrika, 40, 87-104.
Tukey, J. W. (1949), "Comparing Individual Means in the Analysis of Variance," Biometrics, 5, 99-114.
Waller, R. A., and Duncan, D. B. (1969), "A Bayes Rule for the Symmetric Multiple Comparison Problem," Journal of the American Statistical Association, 64, 1484-1503; Corrigendum (1972), 67, 253-255.
