Multiple Comparison Procedures
1/18/2001
This example is based on Klesges, R. C., et al. (1998). The prospective relationship between smoking and weight in a young, biracial cohort: The Coronary Artery Risk Development in Young Adults study. Journal of Consulting and Clinical Psychology, 66, 987-993.
The study looked at weight changes over a seven year period in subjects who did, and did not,
stop smoking. The authors broke down subjects by smoking condition, race, and sex, but the way they presented their data did not allow me to include sex as a variable in my example.
One reason for choosing this example is that it involved very large samples. We usually use
small samples for examples, and I thought it would be useful to look at a case with thousands of
subjects.
At baseline, data were collected on weight, smoking behavior (Never, Former, and Smoker), and
other variables for over 5000 subjects. Seven years later data were obtained from 3868 subjects
on their smoking status (Never, Former, Quitter, Intermittent, Initiator, and Continuous), their
weight, and their weight gain. Data were also collected on alcohol use and caloric intake.
For this example I am going to run one-way analyses for smoking behavior on the pretest and
the posttest data separately. I will ignore race and sex. I may come back to those variables later.
The data are found in Klesges.sav for 3868 subjects. The variables are Race, Basesmoke, Endsmoke, Alcohol, Basewt, Endweight, WtChange, and Fatpercn, though not in that order in the file. I generated these data myself based on the authors' results. Weight is given in kilograms.
First we'll look at differences in weight of the three smoking conditions at the beginning of the
study. What would students predict?
1 de 14 07/05/2019 21:29
Multiple Comparison Procedures https://www.uvm.edu/~dhowell/gradstat/psych341/lectures/AnovaRevi...
Notice that the overall Anova is significant, but when we run the multiple comparisons the only significant difference is the 2.58 kilogram difference between Nonsmokers and Smokers. Interestingly, the ex-smokers fall in the middle and don't differ from either group. So it is not true that quitting smoking led to weight gain--at least over the long term.
There is a whole set of procedures for making comparisons between groups subsequent to the overall analysis of variance. I cover many more in the text than I can cover here, but the basic ideas are very simple.
Error Rates
In the text I distinguish between two kinds of error rates. One is the probability that any
particular comparison will be a Type I error, and the other is the probability that any set of
comparisons will contain at least one Type I error.
Error rate per comparison (PC)

This is the probability that any given test will be significant if the null hypothesis is true. If we just ran a simple t test between two means at alpha (α) = .05, then the probability that a Type I error would occur is .05.
Familywise error rate (FW)

This is the probability that a whole set of comparisons will contain at least one Type I error. It should be apparent that the more tests you run, the greater the likelihood that you will make a Type I error someplace. [The more you have sex, the more likely you are to get pregnant. If you have sex once a night for a week, the question is the likelihood that you will be pregnant at the end of the week, and we aren't concerned about which night.]
The major point behind almost all multiple comparison procedures is to reduce
FW to something reasonable, such as .05, and we do that by reducing the
significance level for any particular comparison to a small value. [In this case, if we ran each of three tests at α = .016667, the familywise error rate would come out to about 1 − (1 − .016667)³ ≈ .05.]
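That arithmetic is easy to check directly. A minimal sketch, which also inverts the relation to find the per-test alpha that holds FW at exactly a target value:

```python
def familywise_rate(per_test_alpha, n_tests):
    """FW = 1 - (1 - alpha')^m: chance of at least one Type I error."""
    return 1 - (1 - per_test_alpha) ** n_tests

def per_test_alpha(target_fw, n_tests):
    """Invert the relation to hold FW at a target level."""
    return 1 - (1 - target_fw) ** (1 / n_tests)

# Three pairwise comparisons run at alpha = .016667 each:
print(round(familywise_rate(0.016667, 3), 4))   # ~.0492
# Per-test alpha needed to hold FW at exactly .05 over three tests:
print(round(per_test_alpha(0.05, 3), 6))        # ~.016952
```

Note that the relation assumes independent tests; pairwise comparisons among the same means are not quite independent, which is why the text says "about" .05.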
A Priori versus Post Hoc Comparisons

I hate to discuss this issue because it seems to get people all twisted around. The
basic idea is rather simple. If you plan out your comparisons before you run your
experiment, you get to use more liberal procedures. If you plan your comparisons
after you have looked at the data, even if you run just a few tests, it is as if you
were running all possible comparisons among the means. This will require much
more conservative procedures.
In actual practice, the vast majority of the situations that I see involve post-hoc
tests. People don't really plan out everything ahead of time. They wait until they
get to their data, and then they decide what they want to test.
I'm not sure that this distinction, while completely defensible on theoretical
grounds, is the best one to make here. I prefer to think of it a bit differently.
In the case of truly a priori tests, I recommend that you just run simple t tests
between the means that you want to compare. The one difference is that I would
use MSerror from the overall Anova in place of the individual group variances,
unless you have good reason to believe that the variances are heterogeneous.
In each case, the df are the same as the df for error from the overall Anova.
Example
The means are given above as 71.65 and 69.52 for the 333 ex-smokers and the 1450 Smokers, respectively. The within-group
variances are all very similar, so I'll use the error term from the
Anova.
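A minimal sketch of that protected t test in Python, using the means and ns quoted above. The MSerror value below is a made-up placeholder, since the actual error term from the Anova table isn't quoted in the text:

```python
import math

def protected_t(mean1, n1, mean2, n2, ms_error):
    """t test between two group means, pooling error via the Anova's MSerror."""
    se = math.sqrt(ms_error * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# Ex-smokers vs. Smokers at baseline (means and ns from the text).
# MS_ERROR is a hypothetical stand-in for the real error term.
MS_ERROR = 180.0
t = protected_t(71.65, 333, 69.52, 1450, MS_ERROR)
print(round(t, 2))
# The df for this t are the df for error from the overall Anova (N - k).
```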
Contrasts
Bonferroni Procedure
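The core of the Bonferroni logic is simply to divide the familywise alpha across the m comparisons you run, which caps FW at no more than m times the per-test rate. A minimal sketch:

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Run each of n_tests comparisons at family_alpha / n_tests."""
    return family_alpha / n_tests

# Three pairwise comparisons among three means, holding FW near .05:
per_test = bonferroni_alpha(0.05, 3)
print(round(per_test, 6))   # 0.016667
# FW is then at most n_tests * per_test = .05; by 1 - (1 - per_test)**3
# it actually comes out slightly below .05.
```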
Multistage tests
Fisher's LSD

We have talked about this test before. Fisher argued that if the overall Anova is significant, you can go ahead and run multiple t tests between any and all groups. This is the most liberal of the multiple comparison tests, and it only keeps the familywise error rate at alpha if the complete (omnibus) null is true--i.e. if all populations have equal means.

This is the only test that requires a significant overall F before proceeding!!!
I have been pushing this test for years for the situation in which you have only 3
groups, but people don't like it. Finally I came across a paper by Levin, Serlin,
and Seaman (1994, Psychological Bulletin, 115, 153-159) that says the same
thing.
The Studentized Range Statistic

Many procedures use what is called the Studentized Range Statistic. It was
originally designed as a statistic to compare the two extreme means in a set of
means. If there are a lot of means, the extremes are likely to be more different
than if there are just a few means. But that means that it is more likely to come up
with a "significant" difference when testing those means. So the test was
designed to adjust the critical value, making it larger when there are more means to choose from.
For some reason that I have never seen explained, they came up with a slightly different test statistic than the normal t statistic. There is no reason why they had to do so; the t would do as well. But the statistic is

q = (Xlargest − Xsmallest) / sqrt(MSerror / n)

whereas the corresponding t for equal ns would be t = (X1 − X2) / sqrt(2MSerror / n).
Notice that q is just the same as t except that the "2" is missing from in front of
MSerror. This isn't a problem, because the critical values are altered in the same
way.
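That relation is easy to verify numerically: for two means with equal ns, q works out to exactly √2 times t, so either statistic leads to the same decision once the critical values are scaled to match. A quick sketch (all the numbers here are made up for illustration):

```python
import math

# Hypothetical two-group setup with equal ns.
mean_hi, mean_lo, n, ms_error = 74.0, 70.0, 100, 160.0

t = (mean_hi - mean_lo) / math.sqrt(2 * ms_error / n)
q = (mean_hi - mean_lo) / math.sqrt(ms_error / n)

print(round(q / t, 6))   # 1.414214, i.e. sqrt(2)
```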
This testing approach is used in many of the tests which follow, which is why I
discussed it in the first place.
The Newman-Keuls Test

I happen to like this test, but lots of people complain about it. I have laid out the
reasons in the text, but I'll simplify them here.
When there are three means, the Newman-Keuls holds the familywise error
rate at .05, just as we would like.
When there are four or five means, the error rate is held at (approx) .10.
When there are six or seven means, the max FW is .15, etc.
It is rare to have more than five groups in an experiment, and when we do
it is also very likely that at least some null hypothesis is not true. It is hard
to imagine an experiment where we really believe that all five means are
equal.
Thus I think the arguments against the Newman-Keuls are not really fair.
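The pattern in those maximum error rates follows from the fact that, in the worst case, k means can contain about ⌊k/2⌋ disjoint pairs that are each truly null and each tested at α. A rough sketch of the arithmetic (my reading of the standard argument, not a formula from the text):

```python
def max_familywise(k, alpha=0.05):
    """Worst-case FW for the Newman-Keuls with k means: about
    floor(k/2) disjoint null pairs can each be tested at alpha."""
    return 1 - (1 - alpha) ** (k // 2)

for k in (3, 4, 5, 6, 7):
    print(k, round(max_familywise(k), 3))
# 3 -> .05; 4 and 5 -> ~.10; 6 and 7 -> ~.14, matching the rates above.
```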
I go over how to apply this test by hand in the text, but people don't often do that, and I will probably cut that back drastically in the next edition.
In SPSS this test has a somewhat different printout than we have seen. I'm not
sure why they do that. Basically they show you those groups that are
homogeneous. The first example is the same set of data as the examples above.
None of these groups are different from any others. I don't know what the sig =
.053 means, although it may be the significance level of the most extreme
comparisons.
This is a good example, because the overall F was significant, but the test does
not find any differences.
Here you can see that five of the groups are homogeneous, but the sixth group (Quitter) is different from the others.
What I can't display (because of the data I have) is the very common situation in
which two homogeneous sets of groups have some overlap.
The formula that I gave above assumes equal sample sizes. When
you have unequal sample sizes, you can take the harmonic mean of
the n's and use that for all cases. You can see from the printout above
that this is what SPSS has done.
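The harmonic-mean substitution is easy to compute. The 333 Former and 1450 Smoker ns are quoted earlier in these notes; the 2085 for the Never group is inferred from the 3868 total, so treat it as approximate:

```python
def harmonic_mean(ns):
    """Harmonic mean of the group sizes: k / sum(1/n_i)."""
    return len(ns) / sum(1 / n for n in ns)

# Baseline group sizes (Never inferred, not quoted directly).
n_h = harmonic_mean([2085, 333, 1450])
print(round(n_h, 1))
# The harmonic mean is pulled toward the smallest group, so it is
# well below the arithmetic mean of about 1289.
```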
When the variances are unequal, you can use the Games and Howell
(1976) procedure. (Unfortunately, a different Howell) SPSS will
implement this procedure.
Tukey's test
Tukey's test is a very close relative of the Newman-Keuls test. The difference is
that all comparisons are done as if the groups were maximally far apart. In other
words, with 6 groups, two means that are adjacent in an ordered series are still
tested as if they were the largest and smallest of 6 means.
This test holds the familywise error rate at alpha regardless of what null
hypothesis(ses) are true.
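In practice the Tukey test reduces to a single yardstick: any pair of means farther apart than HSD = qcrit × sqrt(MSerror/n) is declared significant. A sketch, where qcrit stands in for a studentized-range table value and MSerror and n are placeholder numbers, not values from these data:

```python
import math

def tukey_hsd(q_crit, ms_error, n):
    """Honestly Significant Difference: the minimum mean difference
    declared significant, applied identically to every pair."""
    return q_crit * math.sqrt(ms_error / n)

# Illustrative values only: q_crit would come from a table for k
# means and df_error; ms_error and n are hypothetical.
Q_CRIT = 3.31
hsd = tukey_hsd(Q_CRIT, ms_error=180.0, n=719)
print(round(hsd, 2))
# Any two means differing by more than this are declared different.
```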
The following very curious printout comes from an analysis of the three original
groups at baseline.
Homogeneous Subsets
Notice that there are no significant differences using this test on these data.
I don't know why we get the difference between the two tables.
This shows that the Tukey is a somewhat more conservative test than the Newman-Keuls. I think this test is a bit too conservative, but lots of people like it.
The Ryan Procedure (REGWQ)

Sort of like the Bonferroni logic, except that each test is run at αr = α(r/k), where k = the number of means in the experiment and r is the number of means in the range from which these two are the largest and smallest. Einot, Gabriel, and Welsch fiddled with this just a little bit, but the basic idea is still right.
This test keeps FW at alpha regardless of the true null, but is less conservative
than Tukey's test.
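A sketch of those stage-wise alphas for k = 6 means, using the Einot-Gabriel-Welsch form with full alpha retained for the two widest ranges; treat the exact formula as my reading of the procedure rather than SPSS's documented implementation:

```python
def regwq_alpha(r, k, alpha=0.05):
    """Per-range significance level: full alpha for the two widest
    ranges, shrunk geometrically for narrower ones."""
    if r >= k - 1:
        return alpha
    return 1 - (1 - alpha) ** (r / k)

# Alphas for each range size, widest first:
for r in range(6, 1, -1):
    print(r, round(regwq_alpha(r, k=6), 4))
# Narrower ranges are tested at smaller alphas, which is what keeps
# FW at alpha without being as conservative as Tukey.
```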
SPSS will run this test. For our example the printout is shown in the following tables for the baseline and the endpoint data.
Scheffé's Test
This test is the most conservative of the lot, and I do not recommend it. Only the
purists like it.
Bonferroni
I think that the Bonferroni is not a good test as a post-hoc test. I would only use it
as an a priori test. Explain why.
Descriptives
Again, I got conflicting results with the Tukey. Students can do that on their own.
The assignment is to take the means, etc. from what we have here, sit down with a pencil and paper and my book, and see
what is going on with the conflicting Tukey results. It may have to do with different ways of treating unequal sample sizes.
Hint: You can find an exact probability of a t, for example, by COMPUTING a new variable named tprob.
From the menu choose cdf(q,df). Put the actual t value in where the "q" is (I don't know why they don't
call it "t", but they don't.) Put the df for error in place of df. The result will be the one-tailed probability
value for a t > the obtained t. (I know that it is annoying that it will calculate that value for every case, but I don't know a way around it.) If you wanted to know the value of t that cut off the lowest 2.5%, you could use the same compute statement except substitute idf(p,df), where p is the lower-tail probability (e.g., .025) and df is the dferror. The result will be the critical value of t, and if you drop the sign it will be the two-tailed value. I think that this will help you solve the problem, but unfortunately you can't get a probability for q in the same way.
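If you would rather check a t probability outside SPSS, the upper-tail value can be computed directly. A minimal sketch using only the standard library (numerical integration of the t density; plenty accurate for the large dfs in this example):

```python
import math

def t_sf(t, df, upper=60.0, steps=20000):
    """P(T > t) for Student's t with df degrees of freedom, by
    numerically integrating the t density from t upward."""
    # Log of the normalizing constant; lgamma avoids overflow at large df.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    # Simpson's rule on [t, t + upper]; the tail beyond is negligible.
    h = upper / steps
    total = pdf(t) + pdf(t + upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(t + i * h)
    return total * h / 3

# One-tailed p for a t of 2.61 on the Anova's error df (3868 - 3 here):
print(round(t_sf(2.61, 3865), 4))
```

Doubling the result gives the two-tailed probability, matching the sign-dropping trick described above for the critical value.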