Bai Doc Them 1
ABSTRACT. To effectively evaluate medical literature, practicing physicians and medical researchers must
understand the impact of statistical tests on research outcomes. Applying inefficient statistics not only increases
the need for resources, but more importantly increases the probability of committing a Type I or Type II error.
The t-test is one of the most prevalent tests used in the medical field and is the uniformly most powerful
unbiased test (UMPU) under normal curve theory. But does it maintain its UMPU properties when assumptions
of normality are violated? A Monte Carlo investigation evaluates the comparative power of the independent
samples t-test and its nonparametric counterpart, the Wilcoxon Rank-Sum (WRS) test, to violations from
population normality, using three commonly occurring distributions and small sample sizes. The t-test was more
powerful under relatively symmetric distributions, although the magnitude of the differences was moderate.
Under distributions with extreme skews, the WRS held large power advantages. When distributions consist of
heavier tails or extreme skews, the WRS should be the test of choice. In turn, when population characteristics are
unknown, the WRS is recommended, based on the magnitude of these power differences in extreme skews, and
the modest variation in symmetric distributions. J CLIN EPIDEMIOL 52;3:229–235, 1999. © 1999 Elsevier Science Inc.
KEY WORDS. Research methods; t-test; Wilcoxon Rank-Sum test; nonparametric statistics; parametric statistics;
power
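The Monte Carlo design described in the abstract can be illustrated with a minimal sketch. It is not the authors' program: it assumes an exponential population (standard deviation 1) in place of the three real data sets, equal samples of 10, and a t-test on the pooled ranks as a stand-in for the Wilcoxon Rank-Sum test. The function names, the seed, and the number of repetitions are hypothetical, and the critical value is hard-coded from standard t tables for 18 degrees of freedom.

```python
import math
import random

# Two-tailed critical value of Student's t for alpha = 0.05 and df = 18
# (n1 = n2 = 10), taken from standard t tables (assumption: stdlib only).
T_CRIT_DF18 = 2.101


def mean_var(v):
    """Sample mean and unbiased sample variance."""
    m = sum(v) / len(v)
    return m, sum((a - m) ** 2 for a in v) / (len(v) - 1)


def pooled_t(x, y):
    """Independent-samples t statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    (m1, v1), (m2, v2) = mean_var(x), mean_var(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def midranks(values):
    """Ranks 1..N of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rejection_rates(shift, n=10, reps=2000, seed=1):
    """Return (t-test rate, rank-test rate) for a location shift in sigma units."""
    rng = random.Random(seed)
    t_hits = r_hits = 0
    for _ in range(reps):
        control = [rng.expovariate(1.0) for _ in range(n)]  # skewed, sigma = 1
        treated = [rng.expovariate(1.0) + shift for _ in range(n)]
        if abs(pooled_t(treated, control)) > T_CRIT_DF18:
            t_hits += 1
        ranks = midranks(treated + control)  # rank the pooled sample
        if abs(pooled_t(ranks[:n], ranks[n:])) > T_CRIT_DF18:
            r_hits += 1
    return t_hits / reps, r_hits / reps
```

Calling `rejection_rates(0.0)` estimates the Type I error rate of each test, since the null hypothesis is true when the shift is zero. A positive shift such as `rejection_rates(0.5)` estimates power at an effect size of 0.5 standard deviations.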
identify studies in which data violated normal distribution theory. The various disciplines where this
problem was prevalent included surgery [25], epidemiology [26–28], emergency medicine [29],
rehabilitation [30], chemistry [31], pediatrics [32], biomedical science [33], anesthesiology [34], and
dentistry [35], among others.

In consideration of how rarely normality occurs in applied research practice, the persistence and
prevalence in the use of the t-test is questionable. Does the t-test, despite its property of being the
UMPU test, maintain its superior power properties (i.e., the ability to detect a treatment effect if it
exists) when normal theory assumptions are violated? Blair [36] stated, “One might assume that because the
t-test is the uniformly most powerful (UMP) unbiased test under normal theory, it will naturally be more
powerful than other tests in the non-normal situation, provided that its normal theory power is preserved.
But this is fallacious reasoning because the optimal power properties associated with the t-test under
normal theory are no longer in force once the normality stipulation has been abandoned.”

STATISTICAL ISSUES

A Type I error (alpha) occurs when the null hypothesis is rejected, when in fact it is true. A Type II
error (beta) occurs by failing to reject a null hypothesis when it is actually false. Robustness with
respect to Type I error means that if nominal alpha was set to 0.05, then the actual Type I error rate is
reasonably close to this rate even if normality (or other underlying assumptions) is not met.

Robustness with respect to Type II error means that a test achieves approximately the same rejection rate
(i.e., ability to detect a difference) when normality (or other underlying assumptions) is not met.
Statistical power, commonly referred to as the Pitman Efficiency, is the test’s ability to detect a false
null hypothesis. Note, however, that even if a test is robust with respect to Type II errors, there still
may be a competitor that is more powerful when normality is not met [37].

Traditionally, statisticians’ first criterion of the quality of a test is its robustness (i.e.,
insensitivity to violation of assumptions). The t-test’s robustness properties have been examined
carefully over the past few decades. The current thinking on the subject was provided by Sawilowsky and
Blair [38], who examined the robustness of the independent samples t-test when normality was violated
under conditions using real world data. The t-test was found to be robust when sample sizes are nearly
equal, fairly large (25–30), and a two-tailed test rather than a one-tailed test is used.

Despite the t-test’s remarkable robustness, some statisticians caution readers to refrain from their
hurried use of the t-test, because a nonparametric competitor, such as the Wilcoxon Rank-Sum (WRS) test,
may be a more powerful statistic. Robustness with respect to Type I errors (i.e., false positives) is
assured with the nonparametric Wilcoxon statistic. As noted by Mansfield [39], “The hallmark of these
tests is that they avoid the assumption of normality.” Thus, Sawilowsky [40] concluded, “the real issue of
the effects of nonnormality is on the comparative power, not robustness of the t-test.”

The researcher’s recognition of the power properties of a test enhances the ability to detect treatment
differences, promotes replication of the study under the same conditions, and diminishes the wasting of
resources through the use of efficient statistical tests. Remarkably, however, until recently, the power
of a statistic (or power analysis) has been a neglected methodological tool in medical research. (An
anonymous reviewer pointed out that work in clinical trials is a notable exception, and power analysis has
been considered a requirement for funded research in clinical trials for quite some time.) For example,
Friedman et al. [41] reviewed 71 negative clinical trials and found that 67 of the trials had a greater
than 10% risk of missing a 25% therapeutic improvement and 50 trials may have missed a 50% therapeutic
improvement. Mengel [42] canvassed three major family practice journals in 1988, and found that only 5 of
86 studies calculated statistical power. Although it turned out that 80% of the studies had sufficient
power to detect medium and large effect sizes, very few could detect small effect sizes. In some
situations, even large effect sizes could not be detected. Similarly, Silagy [11] reviewed 55 randomized
controlled trials in four peer-reviewed family medicine journals from 1987–1991, and found that
statistical power was reported in only five studies. Williams [12] reviewed 44 negative clinical trials in
dermatology and found that all but one had a 10% chance of missing a 25% treatment difference; and 31 of
the 44 negative trials had such a small sample size that they had a greater than 10% chance of missing a
50% treatment difference. Edlund [43] echoed similar results.

The failure to recognize the need for power analysis has mystified many. Kraemer and Thiemann [44]
attributed the neglect to over-training in dealing with significance levels and under-training in the use
of power. Cohen [45] stated, “One possible reason for the continued neglect of statistical power analysis
in research in the behavioral sciences is the inaccessibility of or difficulty with the standard
material.” In addition, the cost and resources associated with conducting research and obtaining a large
sample size have also contributed to the low power in the medical literature. With readily available
resources, such as Cohen’s [45, 46] references, the failure to conduct a power analysis is no longer
acceptable.

PURPOSE

Small samples research (i.e., repeated sampling Monte Carlo results) conducted on the comparative power of
the t-test and the Wilcoxon test indicates the Wilcoxon test holds promise for increased statistical power
when the as-
Impact of Statistics on Research Outcomes 231
FIGURE 2. Mass at zero with gap distribution.
FIGURE 3. Extreme asymmetry distribution.

test statistic. The remainder of the table depicts the effect size (c) examined, the resulting power level
for each statistic, and the difference in comparative power (D) between both statistics for each condition
studied.

Multimodal and Lumpiness
Comparative power results for the t-test and the WRS for the multimodal and lumpy distribution are
compiled in Table 2. This distribution is somewhat symmetric (as is the normal curve), and therefore, as
expected, the t-test held a slight power advantage for most of the comparisons. The average power
difference, however, was only 0.03, with the largest advantage of 0.05 occurring when (n1, n2) = (5, 15).
It should also be noted that neither test was particularly powerful in detecting small treatment effects.

Mass at Zero with Gap
The WRS produced considerable power advantages in 94% (15 of 16) of the conditions studied when the data
were sampled from the mass at zero with gap data set. The WRS held an impressive average power advantage
of 0.49, as noted in Table 3. Power advantages were predominant for the smaller effect sizes (e.g.,
c = 0.25σ) for all sample sizes considered. The most remarkable power advantage occurred for
(n1, n2) = (15, 45), with the Wilcoxon test displaying a power advantage of 0.803. The power advantages
remained substantial for moderate effect sizes (c = 0.50σ), averaging approximately 0.52.

Extreme Asymmetry
As noted in Table 4, the WRS demonstrated advantages for all 16 conditions studied, with power differences
over the t-test ranging from 0.029 to 0.618. The power advantages were at consistent levels favoring the
Wilcoxon Rank-Sum test for the smaller sample sizes (i.e., (n1, n2) = (5, 15) and (10, 10)). The
comparative power converged for the two statistics for the larger sample sizes (i.e., (n1, n2) = (15, 45)
and (30, 30)) when the effect sizes were large (i.e., c = 1.0σ).

DISCUSSION
The t-test is based on the assumption of normality and is one of the most prevalent tests used in the
medical field, as well as other professions. This test assumes normally distributed data, but it is now
obvious that the occurrence of “bell curve” data is extremely uncommon. The t-test has been found to be
robust with respect to violations from normality to Type I and Type II errors, but this does not preclude
the use of a nonparametric alternative that may be more powerful under these conditions.

This study assessed the comparative power of the Wilcoxon Rank-Sum test and the independent samples t-test
in measuring for shift in location parameters for three real data sets, and demonstrated the impact
nonparametric statistics can have on research outcomes when data are non-normally distributed. The results
indicate that the t-test was more powerful only under a distribution that was relatively symmetric,
although the magnitude of the differences was trivial. In contrast, the Wilcoxon Rank-Sum test held
TABLE 1. Summary data for the three data sets from Micceri [17]
Distribution   Mean   Median   Standard deviation   Skew   Kurtosis

TABLE 2. Rejection rates and power comparison of the Wilcoxon Rank-Sum (WRS) test (t-test on ranks) with
the independent samples t-test for the multimodal and lumpiness distribution [17], α = 0.05; 10,000
repetitions
                            Effect size
Sample size   Test    0.25σ    0.50σ    0.75σ    1.0σ
5,15          t       0.065    0.145    0.265    0.426
              WRS     0.060    0.140    0.246    0.377
              D =     0.005    0.005    0.019    0.049
10,10         t       0.073    0.173    0.338    0.543
              WRS     0.073    0.179    0.331    0.502
              D =     0.000   -0.006    0.007    0.041
15,45         t       0.128    0.372    0.691    0.912
              WRS     0.114    0.352    0.648    0.863
              D =     0.014    0.020    0.043    0.049
30,30         t       0.153    0.473    0.817    0.971
              WRS     0.137    0.450    0.768    0.933
              D =     0.016    0.023    0.049    0.038

TABLE 4. Rejection rates and power comparison of the Wilcoxon Rank-Sum (WRS) test (t-test on ranks) with
the independent samples t-test for the extreme asymmetry distribution [17], α = 0.05; 10,000 repetitions
                            Effect size
Sample size   Test    0.25σ    0.50σ    0.75σ    1.0σ
5,15          t       0.092    0.142    0.232    0.463
              WRS     0.266    0.391    0.491    0.661
              D =    -0.174   -0.249   -0.259   -0.198
10,10         t       0.088    0.167    0.295    0.580
              WRS     0.368    0.496    0.593    0.735
              D =    -0.280   -0.329   -0.298   -0.155
15,45         t       0.136    0.275    0.526    0.913
              WRS     0.716    0.887    0.955    0.993
              D =    -0.580   -0.612   -0.429   -0.080
30,30         t       0.165    0.358    0.659    0.964
              WRS     0.783    0.914    0.963    0.993
              D =    -0.618   -0.556   -0.304   -0.029
huge power advantages for data sets which presented skewness or heavy tails.

If the characteristics of a population are known to be relatively symmetric with light tails, the t-test
should be applied. However, if the population characteristics are unknown, which is generally the case in
applied research, and the hypothesis being tested is one of shift in means (or other location parameter)
the Wilcoxon Rank-Sum is recommended. The logic motivating this suggestion is straightforward. There is
little to lose in terms of statistical power (average of 0.03) should the population turn out to be
normally distributed, but there is considerable to gain (as much as 0.803) if the population is skewed or
heavy tailed.

To emphasize this point, consider the relationship between statistical power and sample size. Suppose a
medical researcher conducted an experiment on a treatment and control group, where the sample size was
(n1, n2) = (15, 45); the treatment effect was small (0.25σ), as is common in medical research; and the
construct of interest was an “onset” variable. As noted in Table 3, the power of the Wilcoxon Rank-Sum
test was 0.803 under these conditions. According to Cohen [45], the t-test would require an approximate
sample size of (n1, n2) = (77, 231) to obtain similar power levels.

To summarize, a small treatment effect that could be detected by the Wilcoxon Rank-Sum having a sample
size with a harmonic mean of 22.5 (i.e., (n1, n2) = (15, 45)) would require a sample size having a
harmonic mean of 115.5 (i.e., (n1, n2) = (77, 231)) before the t-test could detect the same treatment.¹
The expense, in terms of time, effort, and cost of having to increase the sample size more than five-fold
(115.5/22.5) could be avoided by simply using the nonparametric Wilcoxon Rank-Sum test instead of the
classical parametric t-test. Finally, and most important, smaller sample sizes represent a reduction in
the number of patients who must face the risk of participating in an experimental study.

TABLE 3. Rejection rates and power comparison of the Wilcoxon Rank-Sum (WRS) test (t-test on ranks) with
the independent samples t-test for the mass at zero with gap distribution [17], α = 0.05; 10,000
repetitions
                            Effect size
Sample size   Test    0.25σ    0.50σ    0.75σ    1.0σ
5,15          t       0.101    0.264    0.467    0.681
              WRS     0.849    0.877    0.879    0.880
              D =    -0.748   -0.613   -0.412   -0.199
10,10         t       0.036    0.140    0.428    0.660
              WRS     0.700    0.720    0.811    0.826
              D =    -0.664   -0.580   -0.383   -0.066
15,45         t       0.161    0.475    0.820    0.974
              WRS     0.964    0.963    0.965    0.964
              D =    -0.803   -0.488   -0.145    0.010
30,30         t       0.200    0.605    0.914    0.993
              WRS     0.996    0.997    0.997    0.997
              D =    -0.796   -0.392   -0.083   -0.004
Note: D = power difference.

¹ As pointed out by an anonymous reviewer, this illustration assumes the distribution is known (and
Gaussian), which is rarely the case in applied research. The point was to illustrate the potential
deleterious effects of ignoring power properties in terms of sample size, a frame of reference appreciated
by applied researchers. Also, another anonymous reviewer stated that the use of the harmonic mean is
“alien to readers of this journal.” We point out that using 0.5(n1 + n2) would result in too many degrees
of freedom. For example, samples of size (15, 45) would result in n1 + n2 - 2 = 58 df using the arithmetic
mean, whereas the harmonic mean results in 2mh - 2 = 43 df, where mh = harmonic mean. Clearly, the latter
is more indicative of the unbalanced samples in the layout.
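The sample-size comparison and the degrees-of-freedom argument in the footnote both rest on the harmonic mean of two group sizes, 2·n1·n2/(n1 + n2). The short sketch below reproduces that arithmetic for the sample sizes quoted in the text. The function names are hypothetical and introduced only for illustration.

```python
def harmonic_mean_n(n1, n2):
    """Harmonic mean of two group sizes: 2*n1*n2 / (n1 + n2)."""
    return 2 * n1 * n2 / (n1 + n2)


def df_arithmetic(n1, n2):
    """Degrees of freedom from the total sample size: n1 + n2 - 2."""
    return n1 + n2 - 2


def df_harmonic(n1, n2):
    """Degrees of freedom from the harmonic mean: 2*mh - 2."""
    return 2 * harmonic_mean_n(n1, n2) - 2


mh_wrs = harmonic_mean_n(15, 45)    # 22.5
mh_t = harmonic_mean_n(77, 231)     # 115.5
fold_increase = mh_t / mh_wrs       # about 5.13, i.e. "more than five-fold"

print(mh_wrs, mh_t, round(fold_increase, 2))
print(df_arithmetic(15, 45), df_harmonic(15, 45))  # 58 versus 43.0
```

The unbalanced (15, 45) layout is penalized by the harmonic mean, which is pulled toward the smaller group, whereas the arithmetic mean of 30 would treat it like a balanced design.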
234 P. D. Bridge and S. S. Sawilowsky
50. Lehmann EL. Nonparametrics. San Francisco, CA: Holden-Day; 1975.
51. Randles RH, Wolfe DA. Introduction to the Theory of Nonparametric Tests. New York: John Wiley; 1979.
52. Blair RC, Higgins JJ. A comparison of the power of the Wilcoxon’s rank-sum statistic to that of
    Student’s t statistic under various non-normal distributions. J Educ Stat 1980; 5(4): 309–335.
53. Blair RC, Higgins JJ. A note on the asymptotic relative efficiency of the Wilcoxon rank-sum test
    relative to the independent means t test under mixtures of two normal distributions. Br J Math Stat
    Psychol 1981; 31: 124–128.
54. Zimmerman DW, Zumbo BD. Effects of outliers on the relative power of parametric and nonparametric
    statistical tests. Percep Motor Skills 1990; 71: 339–349.
55. Conover WJ, Iman RL. Rank transformations as a bridge between parametric and nonparametric statistics.
    Am Stat 1981; 35: 124–129.
56. International Mathematical and Statistical Libraries. IMSL Library Reference Manual. 10th ed. Houston,
    TX: 1987.
57. Hill JC, Schoener EP. The age-dependent decline of ADHD. ADHD Report 1998; 6(1): 4–5.
58. Barkley RA. Age-dependent decline in ADHD: A final rejoinder. ADHD Report 1998; 6(1): 7–8.
59. Sawilowsky S, Musial JL. Modeling ADHD as exponential decay. ADHD Report 1998; 6(1): 10–11.