J Clin Epidemiol Vol. 52, No. 3, pp. 229–235, 1999
0895-4356/99/$–see front matter
Copyright © 1999 Elsevier Science Inc. All rights reserved.
PII S0895-4356(98)00168-1

Increasing Physicians’ Awareness of the Impact of Statistics on Research Outcomes: Comparative Power of the t-test and Wilcoxon Rank-Sum Test in Small Samples Applied Research

Patrick D. Bridge¹,* and Shlomo S. Sawilowsky²
¹Department of Family Medicine, Wayne State University School of Medicine and ²Department of Theoretical and Behavioral Foundations, College of Education, Wayne State University, Detroit, Michigan

ABSTRACT. To effectively evaluate medical literature, practicing physicians and medical researchers must
understand the impact of statistical tests on research outcomes. Applying inefficient statistics not only increases
the need for resources, but more importantly increases the probability of committing a Type I or Type II error.
The t-test is one of the most prevalent tests used in the medical field and is the uniformly most powerful
unbiased test (UMPU) under normal curve theory. But does it maintain its UMPU properties when assumptions
of normality are violated? A Monte Carlo investigation evaluates the comparative power of the independent
samples t-test and its nonparametric counterpart, the Wilcoxon Rank-Sum (WRS) test, to violations from
population normality, using three commonly occurring distributions and small sample sizes. The t-test was more
powerful under relatively symmetric distributions, although the magnitude of the differences was moderate.
Under distributions with extreme skews, the WRS held large power advantages. When distributions consist of
heavier tails or extreme skews, the WRS should be the test of choice. In turn, when population characteristics are
unknown, the WRS is recommended, based on the magnitude of these power differences in extreme skews, and
the modest variation in symmetric distributions. J CLIN EPIDEMIOL 52;3:229–235, 1999. © 1999 Elsevier Science Inc.

KEY WORDS. Research methods; t-test; Wilcoxon Rank-Sum test; nonparametric statistics; parametric statistics;
power

*Address for correspondence: Patrick D. Bridge, University Health Center, 4201 St. Antoine, Room 4J, Wayne State University, Detroit, MI 48201.
Accepted for publication on 6 November 1998.

INTRODUCTION

The use of statistics in medical research has increased considerably in the past 60 years, and the types of statistics used have become much more complex [1]. Concomitant with this influx in the use of statistics, unfortunately, was an increase in statistical errors [2–12]. For a practicing physician trying to stay current with the literature, or a medical researcher adding new knowledge to the field, it is important to understand the application and efficiency of statistical tests and their impact on the outcomes of research.

Many researchers view the application of statistical tests as a simple and clear-cut process. Nevertheless, in many situations the appropriate application of statistical tests may be unclear, and in other situations, controversial. For example, consider the independent samples t-test, which is one of the most prevalent statistics used in medicine, psychology, and education research [13–16]. The t-test is derived under the assumption of normality and is therefore the uniformly most powerful unbiased test (UMPU) when data are normally distributed. (UMPU means that by definition, when the data are normally distributed, no other test has greater ability to detect true differences for a given sample size.) But does it maintain its UMPU properties when violations from normality occur for small samples typical of applied research?

In medicine and social and behavioral science, normal theory tests (e.g., t-test, ANOVA) have been used more extensively than nonparametric statistics (which do not appeal to the population shape as part of their derivation) [13–15, 17–19]. Yet statisticians and researchers question the frequency of normally distributed data in real world problems [17, 20–24]. For example, Micceri [17] examined 440 psychometric and ability measures and found that all of the data sets were nonnormal according to the Kolmogorov-Smirnov test for normality at the 0.01 alpha level. Only 3% were even remotely similar to the normal curve (i.e., smooth symmetric with light tails). The problem of non-normality in applied data is also prevalent in medical research. For example, a literature search was conducted to
identify studies in which data violated normal distribution theory. The various disciplines where this problem was prevalent included surgery [25], epidemiology [26–28], emergency medicine [29], rehabilitation [30], chemistry [31], pediatrics [32], biomedical science [33], anesthesiology [34], and dentistry [35], among others.

In consideration of how rarely normality occurs in applied research practice, the persistence and prevalence in the use of the t-test is questionable. Does the t-test, despite its property of being the UMPU test, maintain its superior power properties (i.e., the ability to detect a treatment effect if it exists) when normal theory assumptions are violated? Blair [36] stated, “One might assume that because the t-test is the uniformly most powerful (UMP) unbiased test under normal theory, it will naturally be more powerful than other tests in the non-normal situation, provided that its normal theory power is preserved. But this is fallacious reasoning because the optimal power properties associated with the t-test under normal theory are no longer in force once the normality stipulation has been abandoned.”

STATISTICAL ISSUES

A Type I error (alpha) occurs when the null hypothesis is rejected, when in fact it is true. A Type II error (beta) occurs by failing to reject a null hypothesis when it is actually false. Robustness with respect to Type I error means that if nominal alpha was set to 0.05, then the actual Type I error rate is reasonably close to this rate even if normality (or other underlying assumptions) is not met.

Robustness with respect to Type II error means that a test achieves approximately the same rejection rate (i.e., ability to detect a difference) when normality (or other underlying assumptions) is not met. Statistical power, commonly referred to as the Pitman Efficiency, is the test’s ability to detect a false null hypothesis. Note, however, that even if a test is robust with respect to Type II errors, there still may be a competitor that is more powerful when normality is not met [37].

Traditionally, statisticians’ first criterion of the quality of a test is its robustness (i.e., insensitivity to violation of assumptions). The t-test’s robustness properties have been examined carefully over the past few decades. The current thinking on the subject was provided by Sawilowsky and Blair [38], who examined the robustness of the independent samples t-test when normality was violated under conditions using real world data. The t-test was found to be robust when sample sizes are nearly equal, fairly large (25–30), and a two-tailed test rather than a one-tailed test is used.

Despite the t-test’s remarkable robustness powers, some statisticians caution readers to refrain from their hurried use of the t-test, because a nonparametric competitor, such as the Wilcoxon Rank-Sum (WRS) test, may be a more powerful statistic. Robustness with respect to Type I errors (i.e., false positives) is assured with the nonparametric Wilcoxon statistic. As noted by Mansfield [39], “The hallmark of these tests is that they avoid the assumption of normality.” Thus, Sawilowsky [40] concluded, “the real issue of the effects of nonnormality is on the comparative power, not robustness of the t-test.”

The researcher’s recognition of the power properties of a test enhances the ability to detect treatment differences, promotes replication of the study under the same conditions, and diminishes the wasting of resources through the use of efficient statistical tests. Remarkably, however, until recently, the power of a statistic (or power analysis) has been a neglected methodological tool in medical research. (An anonymous reviewer pointed out that work in clinical trials is a notable exception, and power analysis has been considered a requirement for funded research in clinical trials for quite some time.) For example, Friedman et al. [41] reviewed 71 negative clinical trials and found that 67 of the trials had a greater than 10% risk of missing a 25% therapeutic improvement and 50 trials may have missed a 50% therapeutic improvement. Mengel [42] canvassed three major family practice journals in 1988, and found that only 5 of 86 studies calculated statistical power. Although it turned out that 80% of the studies had sufficient power to detect medium and large effect sizes, very few could detect small effect sizes. In some situations, even large effect sizes could not be detected. Similarly, Silagy [11] reviewed 55 randomized control trials in four peer reviewed family medicine journals from 1987–1991, and found that statistical power was reported in only five studies. Williams [12] reviewed 44 negative clinical trials in dermatology and found that all but one had a 10% chance of missing a 25% treatment difference; and 31 of the 44 negative trials had such a small sample size that they had a greater than 10% chance of missing a 50% treatment difference. Edlund [43] echoed similar results.

The failure to recognize the need for power analysis has mystified many. Kraemer and Thiemann [44] attributed the neglect to over-training in dealing with significance levels and under-training in the use of power. Cohen [45] stated, “One possible reason for the continued neglect of statistical power analysis in research in the behavioral sciences is the inaccessibility of or difficulty with the standard material.” In addition, the cost and resources associated with conducting research and obtaining a large sample size have also contributed to the low power in the medical literature. With readily available resources, such as Cohen’s [45, 46] references, the failure to conduct a power analysis is no longer acceptable.

PURPOSE

Small samples research (i.e., repeated sampling Monte Carlo results) conducted on the comparative power of the t-test and the Wilcoxon test indicates that the Wilcoxon test holds promise for increased statistical power when the assumption of normality is violated [47–53]. However, these studies were conducted using fabricated data, which were distributed according to well known mathematical curves (e.g., half-normal, Laplace, exponential, mixed normal). Micceri [17] raised the question of the generalizability of such studies because the real data sets found to occur in practice are far less “tame” and “mathematically expedient” than theoretical curves. Thus, despite the plethora of research on the comparative power of the t-test and the Wilcoxon test, the question remains as to how these two procedures fare with real data sets.
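
The repeated-sampling studies described above share a common recipe: draw two small samples from a known population, shift one of them by a multiple of the population standard deviation, apply both tests, and count rejections. The following is a minimal sketch of that recipe, not a reproduction of any cited study. Its assumptions are ours: the population is an exponential curve with rate 1 (so its standard deviation is 1), both samples have 10 observations, the rank-based test is carried out as a t-test on the pooled ranks, and 2.101 is used as the two-tailed 0.05 critical value of t with 18 degrees of freedom. All function names are hypothetical.

```python
import random
import statistics


def pooled_t(x, y):
    """Independent-samples t statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    if sp2 == 0:
        return 0.0  # no variability at all: treat as no evidence of a difference
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5


def midranks(values):
    """Ranks 1..N, with tied values sharing the average of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rejection_rates(shift_in_sd, n=10, reps=2000, t_crit=2.101, seed=1):
    """Estimate rejection rates of the t-test and the t-test on ranks.

    Assumed population: exponential with rate 1 (standard deviation 1).
    t_crit is the assumed two-tailed 0.05 critical value for 2n - 2 = 18 df,
    so it is only appropriate for the default n = 10.
    """
    rng = random.Random(seed)
    sigma = 1.0
    hits_t = hits_rank = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) + shift_in_sd * sigma for _ in range(n)]
        y = [rng.expovariate(1.0) for _ in range(n)]
        if abs(pooled_t(x, y)) > t_crit:
            hits_t += 1
        r = midranks(x + y)  # rank the pooled scores, then split back into groups
        if abs(pooled_t(r[:n], r[n:])) > t_crit:
            hits_rank += 1
    return hits_t / reps, hits_rank / reps
```

Calling `rejection_rates(0.0)` estimates the Type I error rate of each test under this population, and a positive argument such as `rejection_rates(0.5)` estimates power for a shift of half a standard deviation. The estimates depend on the seed and the number of repetitions, so they should be read as approximate.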
The purpose of this study, therefore, is to assess the comparative power of the independent samples t-test and the Wilcoxon Rank-Sum test under violations of normality using real world data sets, and to guide physicians and medical researchers in the appropriate use of these tests. Hopefully, when conducting research or evaluating the medical literature for clinical practice, the impact of this study will have educated physicians and medical researchers in understanding the effect statistical tests have on research outcomes.

METHODS

Using a Gateway 2000 4DX2-66V and Microsoft FORTRAN 5.1 program, Monte Carlo techniques were used to analyze the comparative power properties of the independent samples t-test with the Wilcoxon Rank-Sum test. For ease of computation, the Wilcoxon Rank-Sum test was conducted by applying the independent samples t-test on the ranks of the original scores. Zimmerman and Zumbo [54], Conover and Iman [55], and others have shown the Wilcoxon Rank-Sum test on original scores is “alpha equivalent” to conducting an independent samples t-test on the ranks of those scores. Thus, the results obtained by applying the Wilcoxon Rank-Sum test on original scores would be the same as those results obtained by applying the t-test on the ranks of those original scores.

The observations were randomly sampled with replacement from three real data sets from Micceri [17] using the International Mathematical and Statistical Libraries [56] RNSET, RNUND, and RNKSM subroutines. Sample sizes (n1, n2) = (5,15), (10,10), (15,45), and (30,30) were used, with nominal alpha set at 0.05. Four treatment alternatives of varying sizes were analyzed for each distribution and sample size to measure shift in location parameters. As is customary in Monte Carlo studies where treatments are modeled as a shift in location, effect sizes (c) were a function of a distribution’s population standard deviation, in multiples of 0.25σ. This is appropriate in the current study due to Micceri’s [17] proposition that the data sets obtained in his study were sufficiently large to be taken as representative of the population. (Another common, but subjective and less thorough approach is to model small, medium, and large treatment effects, instead of multiples of the population standard deviation [45].) The simulated treatment effect (c) was then added to each of the n1 scores, producing the shift in location in comparison with the n2 scores.

Three data sets from Micceri [17] were considered prototypical and representative of real behavioral science data. They are based on measures such as the Minnesota Multiphasic Personality Inventory (MMPI), and other well known instruments purporting to measure “anger, anxiety, curiosity, locus of control, masculinity/femininity, satisfaction, sociability, visual hallucinations, and so forth” [38]. Micceri [17] labeled the distribution shapes most deviant from normality as multimodal lumpy, discrete mass at zero with gap, and extreme asymmetry.

Multimodal lumpy distributions are common where two (or more) dominant sub-populations exist in the same patient pool. For example, a study on developmental disabilities might produce two sub-populations, where one has a physical disability and the other has a mental disability. The discrete mass at zero with gap occurs with “onset” or “first use” variables. For example, a survey of adolescents might be conducted to determine the age of first suicide attempt or age when first began smoking. The extreme asymmetry shape is also prolific, such as the purported decline in attention deficit hyperactivity disorder from infancy into adulthood [57, 58]. (Note that asymmetry is not synonymous with exponentiality [59].) Histograms of the distributions studied are presented in Figures 1–3, and descriptive statistics are compiled in Table 1.

FIGURE 1. Multimodal and lumpiness distribution.

The independent samples t-test and Wilcoxon Rank-Sum test were performed on each sample. Power levels (and hence, advantages of one procedure over the other) were obtained by comparing the rejection rate for each test. Ten thousand repetitions were conducted for each distribution, sample size, and treatment effect interval.

RESULTS

Results of the comparative power analysis are shown in Tables 2–4. Columns 1 and 2 identify the sample size (n) and
test statistic. The remainder of the table depicts the effect size (c) examined, the resulting power level for each statistic, and the difference in comparative power (D) between both statistics for each condition studied.

FIGURE 2. Mass at zero with gap distribution.

FIGURE 3. Extreme asymmetry distribution.

Multimodal and Lumpiness

Comparative power results for the t-test and the WRS for the multimodal and lumpy distribution are compiled in Table 2. This distribution is somewhat symmetric (as is the normal curve), and therefore, as expected, the t-test held a slight power advantage for most of the comparisons. The average power difference, however, was only 0.03, with the largest advantage of 0.05 occurring when (n1, n2) = 5, 15. It should also be noted that neither test was particularly powerful in detecting small treatment effects.

Mass at Zero with Gap

The WRS produced considerable power advantages in 94% (15 of 16) of the conditions studied when the data were sampled from the mass at zero with gap data set. The WRS held an impressive average power advantage of 0.49, as noted in Table 3. Power advantages were predominant for the smaller effect sizes (e.g., c = .25σ) for all sample sizes considered. The most remarkable power advantage occurred for (n1, n2) = 15, 45, with the Wilcoxon test displaying a power advantage of 0.803. The power advantages remained substantial for moderate effect sizes (c = .50σ), averaging approximately 0.52.

Extreme Asymmetry

As noted in Table 4, the WRS demonstrated advantages for all 16 conditions studied, with power differences over the t-test ranging from 0.029 to 0.618. The power advantages were at consistent levels favoring the Wilcoxon Rank-Sum test for the smaller sample sizes (i.e., n1, n2 = 5, 15 and 10, 10). The comparative power converged for the two statistics for the larger sample sizes (i.e., n1, n2 = 15, 45 and 30, 30) when the effect sizes were large (i.e., c = 1.0σ).

DISCUSSION

The t-test is based on the assumption of normality and is one of the most prevalent tests used in the medical field, as well as other professions. This test assumes normally distributed data, but it is now obvious that the occurrence of “bell curve” data is extremely uncommon. The t-test has been found to be robust with respect to violations from normality to Type I and Type II errors, but this does not preclude the use of a nonparametric alternative that may be more powerful under these conditions.

This study assessed the comparative power of the Wilcoxon Rank-Sum test and the independent samples t-test in measuring for shift in location parameters for three real data sets, and demonstrated the impact nonparametric statistics can have on research outcomes when data are nonnormally distributed. The results indicate that the t-test was more powerful only under a distribution that was relatively symmetric, although the magnitude of the differences was trivial. In contrast, the Wilcoxon Rank-Sum test held
TABLE 1. Summary data for the three data sets from Micceri [17]

Distribution                      Mean    Median   Standard deviation   Skew   Kurtosis
Multimodal and lumpiness          21.15   18.00    11.90                0.19   1.80
Discrete mass at zero with gap     1.85    0.00     3.80                1.65   3.98
Extreme asymmetry                 13.67   11.00     5.75                1.64   4.52
TABLE 2. Rejection rates and power comparison of the Wilcoxon Rank-Sum (WRS) test (t-test on ranks) with the independent samples t-test for the multimodal and lumpiness distribution [17], α = 0.05; 10,000 repetitions

                       Effect size
Sample size  Test    0.25σ    0.50σ    0.75σ    1.0σ
5,15         t       0.065    0.145    0.265    0.426
             WRS     0.060    0.140    0.246    0.377
             D =     0.005    0.005    0.019    0.049
10,10        t       0.073    0.173    0.338    0.543
             WRS     0.073    0.179    0.331    0.502
             D =     0.000   -0.006    0.007    0.041
15,45        t       0.128    0.372    0.691    0.912
             WRS     0.114    0.352    0.648    0.863
             D =     0.014    0.020    0.043    0.049
30,30        t       0.153    0.473    0.817    0.971
             WRS     0.137    0.450    0.768    0.933
             D =     0.016    0.023    0.049    0.038

Note: D = power difference.

TABLE 4. Rejection rates and power comparison of the Wilcoxon Rank-Sum (WRS) test (t-test on ranks) with the independent samples t-test for the extreme asymmetry distribution [17], α = 0.05; 10,000 repetitions

                       Effect size
Sample size  Test    0.25σ    0.50σ    0.75σ    1.0σ
5,15         t       0.092    0.142    0.232    0.463
             WRS     0.266    0.391    0.491    0.661
             D =    -0.174   -0.249   -0.259   -0.198
10,10        t       0.088    0.167    0.295    0.580
             WRS     0.368    0.496    0.593    0.735
             D =    -0.280   -0.329   -0.298   -0.155
15,45        t       0.136    0.275    0.526    0.913
             WRS     0.716    0.887    0.955    0.993
             D =    -0.580   -0.612   -0.429   -0.080
30,30        t       0.165    0.358    0.659    0.964
             WRS     0.783    0.914    0.963    0.993
             D =    -0.618   -0.556   -0.304   -0.029

Note: D = power difference.

huge power advantages for data sets which presented skewness or heavy tails.

If the characteristics of a population are known to be relatively symmetric with light tails, the t-test should be applied. However, if the population characteristics are unknown, which is generally the case in applied research, and the hypothesis being tested is one of shift in means (or other location parameter) the Wilcoxon Rank-Sum is recommended. The logic motivating this suggestion is straightforward. There is little to lose in terms of statistical power (average of 0.03) should the population turn out to be normally distributed, but there is considerable to gain (as much as 0.803) if the population is skewed or heavy tailed.

To emphasize this point, consider the relationship between statistical power and sample size. Suppose a medical researcher conducted an experiment on a treatment and control group, where the sample size was (n1, n2) = 15, 45; the treatment effect was small (0.25σ), as is common in medical research; and the construct of interest was an “onset” variable. As noted in Table 3, the power of the Wilcoxon Rank-Sum test was 0.803 under these conditions. According to Cohen [45], the t-test would require an approximate sample size of (n1, n2) = 77, 231 to obtain similar power levels.

To summarize, a small treatment effect that could be detected by the Wilcoxon Rank-Sum having a sample size with a harmonic mean of 22.5 (i.e., n1, n2 = 15, 45) would require a sample size having a harmonic mean of 115.5 (i.e., n1, n2 = 77, 231) before the t-test could detect the same treatment.¹ The expense, in terms of time, effort, and cost of having to increase the sample size more than five-fold (115.5/22.5) could be avoided by simply using the nonparametric Wilcoxon Rank-Sum test instead of the classical parametric t-test. Finally, and most important, smaller sample sizes represent a reduction in the number of patients who must face the risk of participating in an experimental study.

TABLE 3. Rejection rates and power comparison of the Wilcoxon Rank-Sum (WRS) test (t-test on ranks) with the independent samples t-test for the mass at zero with gap distribution [17], α = 0.05; 10,000 repetitions

                       Effect size
Sample size  Test    0.25σ    0.50σ    0.75σ    1.0σ
5,15         t       0.101    0.264    0.467    0.681
             WRS     0.849    0.877    0.879    0.880
             D =    -0.748   -0.613   -0.412   -0.199
10,10        t       0.036    0.140    0.428    0.660
             WRS     0.700    0.720    0.811    0.826
             D =    -0.664   -0.580   -0.383   -0.066
15,45        t       0.161    0.475    0.820    0.974
             WRS     0.964    0.963    0.965    0.964
             D =    -0.803   -0.488   -0.145    0.010
30,30        t       0.200    0.605    0.914    0.993
             WRS     0.996    0.997    0.997    0.997
             D =    -0.796   -0.392   -0.083   -0.004

Note: D = power difference.

¹ As pointed out by an anonymous reviewer, this illustration assumes the distribution is known (and Gaussian), which is rarely the case in applied research. The point was to illustrate the potential deleterious effects of ignoring power properties in terms of sample size, a frame of reference appreciated by applied researchers. Also, another anonymous reviewer stated that the use of the harmonic mean is “alien to readers of this journal.” We point out that using 0.5(n1 + n2) would result in too many degrees of freedom. For example, samples of size (15,45) would result in n1 + n2 - 2 = 58 df using the arithmetic mean, whereas the harmonic mean results in 2mh - 2 = 43 df, where mh = harmonic mean. Clearly, the latter is more indicative of the unbalanced samples in the layout.
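
The sample size arithmetic in the summary and in the footnote can be checked directly. The short sketch below recomputes the two harmonic means, their ratio, and the two degrees-of-freedom figures. The function name is hypothetical, and the formula used for two groups, 2·n1·n2 / (n1 + n2), is the standard harmonic mean of two numbers.

```python
def harmonic_mean_n(n1, n2):
    """Harmonic mean of two group sizes: 2 * n1 * n2 / (n1 + n2)."""
    return 2 * n1 * n2 / (n1 + n2)


wrs_n = harmonic_mean_n(15, 45)      # sample sizes used with the Wilcoxon Rank-Sum test
t_n = harmonic_mean_n(77, 231)       # sample sizes quoted for the t-test
fold_increase = t_n / wrs_n          # the "more than five-fold" increase

df_arithmetic = 15 + 45 - 2          # n1 + n2 - 2
df_harmonic = 2 * wrs_n - 2          # 2 * mh - 2

print(wrs_n, t_n, round(fold_increase, 2), df_arithmetic, df_harmonic)
# prints: 22.5 115.5 5.13 58 43.0
```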
References

1. Altman DG, Goodman SN. Transfer of technology from statistical journals to the biomedical literature. JAMA 1994; 272(2): 129–132.
2. Altman DG. Statistics in medical journals. Stat Med 1982; 1: 59–71.
3. Altman DG. The scandal of poor medical research. BMJ 1994; 308: 283–284.
4. Altman DG. Statistics in medical journals: Developments in the 1980’s. Stat Med 1991; 10: 1897–1913.
5. Avram MJ, Shanks CA. Statistical methods in anesthesia articles: An evaluation of two American journals during two six-month periods. Anesth Analg 1985; 64: 607–611.
6. Glantz SA. Biostatistics: How to detect, correct and prevent errors in the medical literature. Circulation 1980; 61(1): 1–7.
7. Lent V, Langenbach A. A retrospective quality analysis of 102 randomized trials in four leading urological journals from 1984–1989. Urol Res 1996; 24(2): 119–122.
8. Mosteller JP, Gilbert JP. Reporting standards and research strategies for controlled trials. Control Clin Trials 1980; 1: 37–58.
9. Mulrow CD. The medical review article: State of the science. Ann Intern Med 1987; 106: 485–488.
10. Pocock PJ, Huges MD, Lee RJ. Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med 1987; 317(7): 426–487.
11. Silagy CA, Jewell D, Mant D. An analysis of randomized controlled trials published in the US family medicine literature, 1987–1991. J Fam Pract 1994; 39(3): 236–242.
12. Williams HC, Seed P. Inadequate size of “negative” clinical trials in dermatology. Br J Dermatol 1993; 128: 317–326.
13. Emerson JD, Colditz GA. Use of statistical analysis in the New England Journal of Medicine. N Engl J Med 1983; 309(12): 709–713.
14. Fromm BS, Snyder VL. Research design and statistical procedures used in the Journal of Family Practice. J Fam Pract 1986; 23(6): 564–566.
15. Longnecker DE. Support versus illumination: Trends in medical statistics. J Anesthesiol 1982; 57(2): 73–74.
16. Sawilowsky SS. Nonparametric tests of interaction in experimental design. Rev Educ Res 1990; 60(1): 91–126.
17. Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychol Bull 1989; 105(1): 156–166.
18. Harwell MR. A general approach to hypothesis testing for nonparametric tests. J Exp Educ 1990; 58(2): 143–156.
19. Hunter MA, May RB. Some myths concerning parametric and nonparametric tests. Can Psychol 1993; 34(4): 384–389.
20. Pearson K. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philo Transac R Soc 1895; Series A: 343–414.
21. Geary RC. Testing for normality. Biometrika 1947; 34: 209–242.
22. Nunnally J. Psychometric Theory, 2nd ed. New York: McGraw-Hill; 1978.
23. Pearson ES, Please NW. Relation between the shape of population distribution and the robustness of four simple test statistics. Biometrika 1975; 62(2): 223–241.
24. Tan WY. Sampling distributions and robustness of t, f, and variance-ratio in two samples and ANOVA models with respect to departures from normality. Commun Stat 1982; A11: 2485–2511.
25. Dobi RA, Wilson MJ. A comparison of t-test, F-test, and coherence methods of detecting steady-state auditory-evoked potentials, distortion-product otoacoustic emission, or other sinusoids. J Acoust Soc Am 1996; 100(4/1): 2236–2246.
26. Bohning D, Hempfling A, Schelp FP, Schlattmenn P. The area between curves (ABC) measure in nutritional anthropometry. Stat Med 1992; 11(10): 1289–1304.
27. Gilbert GH, Longmate J, Branch LG. Factors influencing the effectiveness of mailed health surveys. Public Health Rep 1992; 107(5): 576–584.
28. Elashoff JD, Cantor RM, Shain S. Power and validity of methods to identify variability genes. Genet Epidemiol 1991; 8(6): 381–388.
29. Lucke JF. Student’s t test and the Glasgow Coma Scale. Ann Emerg Med 1996; 28(4): 408–413.
30. Maxfield M, Schweitzer J, Gouvier WD. Measures of central tendency, variability, and relative standing in nonnormal distributions: Alternatives to the mean and standard score. Arch Phys Med Rehabil 1988; 69(6): 406–409.
31. Zellner D, Frankewitsch T, Simon S, Keller F. Statistical analysis of heterogeneous pharmacokinetic data from the literature. Eur J Clin Chem Clin Biochem 1996; 34(7): 585–589.
32. Niklasson A, Ericson A, Fryer JG. An update of the Swedish reference standards for weight, length and head circumference at birth for given gestational age (1977–1981). Acta Paediatrica Scandinavica 1991; 80(8-9): 756–762.
33. Valenta HL, Fischer SK. A Monte Carlo simulation of range for an invasive impedance respiration monitor. Biomed Sci Instrum 1990; 26: 181–184.
34. Colliver JA, Manchikanti L, Markwell SJ. Evaluation and comparison of the distributions of gastric pH and hydrogen ion concentration. Anesthesiol 1987; 67(3): 391–394.
35. Sullivan LM, D’Agostino RB. Robustness of the t test applied to data distorted from normality by floor effects. J Dent Res 1992; 71(12): 1938–1943.
36. Blair RC. A reaction to “consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance.” Rev Educ Res 1981; 51(4): 499–507.
37. Scheffe H. The Analysis of Variance. New York: Wiley; 1958.
38. Sawilowsky SS, Blair RC. A more realistic look at the robustness and type II error properties of the t test to departures from population normality. Psychol Bull 1992; 111(2): 352–360.
39. Mansfield E. Basic Statistics with Applications. New York: W. W. Norton; 1986.
40. Sawilowsky SS. Comments on using alternatives to normal theory statistics in social and behavioural science. Can Psychol 1993; 34(4): 432–439.
41. Friedman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 “negative” trials. N Engl J Med 1978; 299: 690–694.
42. Mengel MB, Davis AB. The statistical power of family practice research. Fam Pract Res J 1993; 13(2): 105–111.
43. Edlund MJ, Overall JE, Rhoades HM. Beta, or type II error in psychiatric controlled clinical trials. J Psychiat Res 1985; 19(4): 563–567.
44. Kraemer HC, Thiemann S. How Many Subjects?: Statistical Power Analysis in Research. Newbury Park, CA: Sage Publications; 1987.
45. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum; 1988.
46. Cohen JA. Power primer. Psychol Bull 1992; 112(1): 155–159.
47. Dixon WJ. Power under normality of several nonparametric tests. Ann Mathemat Stat 1954; 25: 610–614.
48. Chernoff H, Savage IR. Asymptotic normality and efficiency of certain nonparametric test statistics. Ann Mathemat Stat 1958; 29: 972–999.
49. Neave HR, Granger CWJ. A Monte Carlo study comparing various two-sample tests for differences in mean. Technometrics 1968; 10: 509–522.
50. Lehmann EL. Nonparametrics. San Francisco, CA: Holden-Day; 1975.
51. Randles RH, Wolfe DA. Introduction to the Theory of Nonparametric Tests. New York: John Wiley; 1979.
52. Blair RC, Higgins JJ. A comparison of the power of the Wilcoxon’s rank-sum statistic to that of Student’s t statistic under various non-normal distributions. J Educ Stat 1980; 5(4): 309–335.
53. Blair RC, Higgins JJ. A note on the asymptotic relative efficiency of the Wilcoxon rank-sum test relative to the independent means t test under mixtures of two normal distributions. Br J Math Stat Psychol 1981; 31: 124–128.
54. Zimmerman DW, Zumbo BD. Effects of outliers on the relative power of parametric and nonparametric statistical tests. Percep Motor Skills 1990; 71: 339–349.
55. Conover WJ, Iman RL. Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat 1981; 35: 124–129.
56. International Mathematical and Statistical Libraries. IMSL Library Reference Manual. 10th ed. Houston, TX: 1987.
57. Hill JC, Schoener EP. The age-dependent decline of ADHD. ADHD Report 1998; 6(1): 4–5.
58. Barkley RA. Age-dependent decline in ADHD: A final rejoinder. ADHD Report 1998; 6(1): 7–8.
59. Sawilowsky S, Musial JL. Modeling ADHD as exponential decay. ADHD Report 1998; 6(1): 10–11.