Redefining the Critical Value of Significance Level (0.005 Instead of 0.05): The Bayes Trace
Russian Text © The Author(s), 2018, published in Radiatsionnaya Biologiya. Radioekologiya, 2018, Vol. 58, No. 5, pp. 453–462.
METHODOLOGY OF SCIENTIFIC SEARCH
Abstract—In 2017, a group of leading mathematical statisticians published a manifesto paper with an extremely simple message: the common critical level of p-values should be decreased by an order of magnitude (0.005 instead of 0.05) (Benjamin et al., 2017). In this review, the arguments of the proponents and opponents of this proposal are discussed, along with problems related to the "reproducibility crisis" of scientific results. The corresponding argumentation cannot be understood without considering the fundamentals of the theory of statistical inference. In this connection, the precise meaning of some concepts, such as the p-value, the Bayes factor, and the minimum a posteriori probability of the null hypothesis, is discussed in the review, mainly with examples involving the comparison of frequencies. It is shown that, when using p-values, particular attention should be paid to the comparison of low frequencies in very large samples. Some practical recommendations on the application of Bayesian analysis are given.
Table 1. Minimum Bayes factor (local minimum for priors in H0) and the minimum probability of H0 corresponding to p-values from the "gray zone"

p-value    min_H0 BF    min P(H0|data)
0.05       0.407        0.289
0.04       0.350        0.259
0.03       0.286        0.222
0.02       0.213        0.175
0.01       0.125        0.111
0.005      0.072        0.067

…significance level of 0.05, the number of cases of confirmation of false hypotheses will be 990 × 0.05 ≈ 50. Thus, the percentage of published false results will be 50/(8 + 50) = 86.2%. It is clear that, by lowering the critical level of the p-value by an order of magnitude, we can achieve a significant reduction in the proportion of false positives: 5/(8 + 5) = 38.5%.

2. Effects at p-values close to 0.05 cannot be statistically significant, because this contradicts the results of Bayesian analysis, which will be discussed in Section 5. For now we will only mention the following fairly general statement. Suppose that, before the experiment, we assumed our chances of success to be 50:50; in other words, the a priori probability of the null hypothesis was 50%. Then, after an experiment yielding p-value = 0.05, the probability of the null hypothesis is at least 29% (see Table 1). This rigorous mathematical statement, obtained using the Bayesian approach, killed the enthusiasm for considering effects at p-value = 0.05 statistically significant. Indeed, our experiment did not make the null hypothesis improbable (say, at the 5% level, as a naive user often assumes); it still holds at 29% or higher. For comparison, in the same situation at p-value = 0.005, the minimum a posteriori probability of the null hypothesis is 6.7% (Table 1).

The authors of RSS claimed that their statement was dictated primarily by concern over the extremely low reproducibility of biological, medical, and other scientific research. Indeed, the "reproducibility crisis" has literally struck the science of the 21st century and has been repeatedly discussed in the mass media and in scientific publications (see, e.g., [2, 3]). By proposing to lower the critical level of p-values by an order of magnitude, the authors of RSS postulate that this simple step will immediately improve the reproducibility of research results in many fields. This phrase subsequently became the main target of a flurry of critical attacks by statisticians and practicing experimenters, although the authors of RSS emphasized that the p < 0.05 dichotomy is not the only cause of low reproducibility.

It should also be noted that the authors of RSS proposed using the threshold value of 0.005 only for newly discovered effects, maintaining the critical level of 0.05 for repeated (verifying) tests. In general, the authors of RSS propose no longer considering results with p-values in the range (0.005–0.05) statistically significant and labeling them "suggestive" (i.e., thought-provoking).

Of course, the RSS manifesto is not the first attempt to revise Fisher's critical significance level. As early as the middle of the last century, academician A.N. Kolmogorov, referring to the "three sigma" rule, repeatedly proposed using a critical level of 0.003 or even 0.001 [4]. In the 1960s, lowering the critical level of p-values to 0.01 was passionately advocated by A. Melton, Chief Editor of the Journal of Experimental Psychology [5]. A notable step in rethinking the role of p-values in scientific research was the 2016 statement of the Board of the American Statistical Association (ASA) [6]. However, the RSS publication has had the greatest response in the scientific world: in contrast to the earlier statements, RSS lays down specific and very substantial changes that can easily be implemented by the editors of scientific journals and by funding institutions.

2. "ALPHA-WARS" IN 2017

The RSS manifesto caused an unprecedented (in scale) discussion in blogs and scientific publications, which came to be called the "alpha-war." Immediately after the publication of the manifesto, a young Dutch psychologist, Daniel Lakens, announced in his blog a collection of signatures among opponents of revising the critical level p < 0.05. As a result, in September 2017 the preprint "Justify your alpha," signed by 87 co-authors, was published [7]. According to the authors of this publication, the threshold p < 0.005 is as arbitrary as p < 0.05. The threshold cannot be fixed once and for all; it should depend on what is already known about the subject of research and on the risks associated with obtaining an incorrect answer. A relatively high probability of false-positive results can be accepted in a preliminary study, whereas the final test of a drug may require lower p-values.

In addition, the authors of [7] were the first to calculate the cost of lowering the critical p-value level by an order of magnitude: to maintain the accepted testing power, sample sizes must be increased, on average, by 70%. Such a requirement may be too much for the budgets of many research groups.

With respect to overcoming the "reproducibility crisis," the authors of [7] noted that, according to [8], among results with p-values in the range (0.005–0.05), the proportion of confirmed results was 24%. However, among results with p-value < 0.005, the proportion of confirmed results was also not too high (49%), which does not correspond to the expectations of the authors of RSS.

The next prominent paper, entitled "Abandon statistical significance," one of whose authors is Andrew Gelman, a well-known expert in Bayesian analysis, was published in September 2017 [9]. The authors of this article recommend abandoning null hypothesis significance testing altogether and treating the p-value as only one of many informative indices, without a privileged role in decisions about the significance of a phenomenon and the possibility of publishing it. With regard to classifying research into new (p < 0.005) and repeated (p < 0.05), the authors of [9] consider this recommendation quite impractical, especially in fields where research is gradual and cumulative; moreover, since they (the authors of RSS) are not able to determine what a new effect is, the proposed policy will lead to inconsistency in the practice of reproducing results.

Like many subsequent commentators [10–14], the authors of [9] believe that RSS offers virtually no evidence that p < 0.05 is one of the leading causes of the poor reproducibility of scientific research. In the opinion of the authors of [9], the leading cause is the absence of correction for multiple comparisons (both actual and potential), which has become the norm in applied research. In [13], it is also emphasized that lowering the threshold of p-values will drastically increase publication bias, i.e., shift the pool of published works toward larger effects. The harshest statements against RSS appeared in the paper by Harry Crane "Why "redefining statistical significance" will not improve reproducibility and could make the replication crisis worse," published in November 2017 [14]. Crane regards the RSS proposal as extremely erroneous, presented under false pretenses, and supported by flawed analysis, and suggests that it should not be accepted. Crane also reports that, according to his calculations, there are several possible scenarios in which the cutoff p < 0.005 will make the situation with reproducibility even worse.

3. RSS AND THE REPRODUCIBILITY CRISIS: p-HACKING

Deborah Mayo, a well-known statistician and philosopher of science, wrote in her blog errorstatistics.com that almost everyone knows the real causes of nonreproducibility: selective publication of the most effective results, ignoring the multiplicity of tests, sequential testing of hypotheses until one of them proves significant, and selective presentation of methods and of the results of their application.

In modern biostatistics, the tendency described by Deborah Mayo is called "p-hacking" [15]. This term covers a wide range of ways of artificially overestimating the statistical significance of results. These include, for example:

1. A posteriori formulation of hypotheses: a researcher puts forward hypotheses after obtaining the data, claiming that all assumptions preceded the experiment.

2. Incomplete presentation of experimental data: publication of only the most favorable results.

3. Data editing by eliminating outliers and grouping subsamples; persistent fragmentation of a sample in search of a stratification that provides "significant" differences between subsamples.

4. The use of various statistical tests with subsequent publication of the most favorable results.

5. The use of the stopping-rule effect, i.e., gradually increasing the sample size until a significant result at the level p < 0.05 is obtained [15].

6. Incorrect or incomplete accounting for the multiplicity of comparisons; incorrect construction of permutation tests.

7. Transforming and normalizing data instead of using nonparametric statistics.

8. The use of multivariate statistical analysis without proper validation of statistical significance, e.g., the selection of predictors using the stepwise regression algorithm [16].

9. The use of total risk-factor scores selected from a large number of predictors [16].

10. The use of a test sample for "clarification" of the results obtained on a training sample, a very common mistake in two-stage (discovery set + validation set) searches for effective predictors [16, 17].

The "effectiveness" of all these approaches indeed depends little on the critical significance level. For example, manipulations of types 8–10 make it possible to overcome thresholds of p < 10⁻⁵ or even 10⁻¹⁰ [16]. For this reason, the majority of critics of RSS believe that revising the threshold p-value will not significantly improve the reproducibility of results. The author of these lines basically shares this belief. Throughout the past century, the threshold value of 0.05 was used everywhere, yet there was no talk of a "reproducibility crisis." The situation changed at the turn of the century with the advent of new technologies that lead to experimental designs with a huge number of independent, predictor, or grouping variables (e.g., microarrays, GWAS, RNA-seq, various omics technologies). Processing the corresponding data inevitably leads to frequent misuse of points 6–10 of the above list of p-hacking manifestations and, ultimately, to low reproducibility of the announced effects. According to some estimates, the false discovery rate (FDR) due to p-hacking is currently no less than 60% [6, 15].
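The inflation produced by item 5 (optional stopping) is easy to demonstrate by simulation. Below is a minimal sketch, not taken from the review: it repeatedly enlarges a fair-coin sample and applies a normal-approximation z test after each batch, so every reported "discovery" is a false positive. The batch size, maximum sample size, and choice of test are my own illustrative assumptions.

```python
import math
import random

def z_test_p(heads, n):
    """Two-sided normal-approximation p-value for H0: P(heads) = 1/2."""
    if n == 0:
        return 1.0
    z = (heads - n / 2) / math.sqrt(n / 4)
    # two-sided tail probability from the standard normal CDF via erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def optional_stopping_trial(max_n=1000, step=10, alpha=0.05, rng=random):
    """Enlarge a fair-coin sample batch by batch, testing after each batch.

    Returns True if the trial ever reaches 'significance', even though
    the null hypothesis is true by construction (a fair coin).
    """
    heads = n = 0
    while n < max_n:
        heads += sum(rng.random() < 0.5 for _ in range(step))
        n += step
        if z_test_p(heads, n) < alpha:
            return True  # researcher stops here and reports p < 0.05
    return False

random.seed(1)
trials = 2000
false_hits = sum(optional_stopping_trial() for _ in range(trials))
print(f"nominal alpha = 0.05, realized false-positive rate = {false_hits / trials:.2f}")
```

Although each individual test holds the nominal α = 0.05, peeking after every batch drives the realized false-positive rate far above 5%, which is exactly the stopping-rule effect described above.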
…data probability under H1:

P(data|H1) = C(n1+n2, n1) · p^n1 · (1 − p)^n2,

where the p value corresponding to our data is substituted: p = n1/(n1 + n2). A similar formula for the comparison of mean values with Student's t test has a very simple form [22]:

min BF = e^(−t²/2).

This minimum, corresponding to the "acute" prior at the data, we will call the global minimum, because any other prior yields a higher BF value.

For some classes of priors there may be local minima, which are naturally always higher than the global min BF. For example, for unimodal priors localized at H0 there is always a local minimum of BF, which can be found numerically. For this minimum a good approximation is known, proposed by T. Sellke [22]: for a broad class of unimodal priors whose mean value is localized at H0, the local minimum of the Bayes factor is approximately

min_H0 BF ≈ −e·p·ln p.   (6)

Here (and only here) p ≡ p-value < 1/e, and e is the base of the natural logarithm.

The pattern of the dependence of BF on the degree of "smearing" of the prior (σ) depends on the position of its mean value, i.e., on the prior localization (Fig. 3). For priors localized at the data, BF increases monotonically with increasing σ, starting from the point corresponding to the global minimum. A characteristic feature of priors localized at H0 is the existence of a local minimum, defined by formula (6). In this case, for an "acute" prior localized at H0, always BF = 1. At large σ, for both classes of priors, BF increases with increasing σ according to (4).

Fig. 2. Dependence of BF on the prior selection. Bayes factor estimates for five priors differing in the position of the mean value (localization) and variance (σ²) are shown:

Prior    BF      Localization    Smearing (σ)
1        0.17    data            0.08
2        0.32    H0              0.08
3        0.72    H0              0.29
4        1.09    H0              0.35
5        3.85    H0              0.46

Fig. 3. BF dependence on the degree of prior "smearing" (σ²—prior variance): (1) priors localized at H0 and (2) priors localized at the data. [The figure marks the global min BF and the position of the local min BF on the σ axis.]

The min BF estimate provides a unique opportunity to assess the minimum probability of the null hypothesis regardless of the selected prior. It suffices to substitute the lower bound of BF (6) into formula (3). As a result, we obtain

P(H0|data) ≥ (−e·p·ln p)/(1 − e·p·ln p),   (7)

where p ≡ p-value. This value can be called the "minimum a posteriori probability of the null hypothesis" corresponding to a given p-value.

As above, we will consider the comparison 39/100 vs. 1/2 as an example (Fig. 3). The "noninformative" Jeffreys prior indicates that, in this case, the data are more probable under the null hypothesis, although p-value = 0.035. A conservative experimenter may not agree with this conclusion, since BF depends on the prior. However, according to (7), the minimum a posteriori probability of the null hypothesis is min P(H0 | 39/100) = 0.24, and this conclusion is almost independent of the selected prior.

The recalculation of p-values into minimum probabilities of the null hypothesis is shown in Table 1. In essence, this table is the main argument of the supporters of revising the critical significance level. As can be seen from Table 1, an acceptable level of the minimum a posteriori probability of the null hypothesis is achieved only at p-value = 0.005.

7. RISK ZONES IN USING p-VALUES: LOW FREQUENCIES AND LARGE SAMPLES

In conclusion, we will discuss situations in which the use of p-values from the "gray zone" (0.01–0.05) is especially dangerous. Most of these situations are
characterized by the formula "low frequencies and large samples." In Section 5 we already mentioned that, for large samples, the results of the traditional analysis based on p-values may contradict the conclusions drawn from the Bayesian analysis (the Jeffreys–Lindley paradox).

Suppose we compare the frequencies of a certain event in two samples of equal size, e.g., 1/n vs. 9/n, where n is the sample size. Then, in calculating the Bayes factor, it is appropriate to use the "noninformative" Jeffreys prior, as was done for coin flipping. Indeed, under the null hypothesis, each of the 1 + 9 = 10 events falls at random into one of the two samples with a probability of 1/2. Hence, the two-tailed p-value for the comparison 1/n vs. 9/n is practically independent of n and asymptotically approaches the doubled probability of getting fewer than two "heads" in ten coin tosses. However, the BF value always increases with increasing n, as generally described by formula (4).

Table 2. Traditional and Bayesian analyses for the comparison 1/n vs. 9/n at different sample sizes (n)

n        p-value    BF
50       0.016      0.116
100      0.018      0.197
500      0.021      0.504
1000     0.021      0.724
5000     0.021      1.639
10000    0.021      2.322
50000    0.021      5.198

p-Values for the two-tailed Fisher's exact test and Bayes factors for the Jeffreys prior are shown. The cases of rejection of the alternative hypothesis H1 about the existence of differences are highlighted.

Table 2 shows the comparison of p-values and Bayes factors at different sample sizes n. In all cases presented in Table 2, the frequency of events in the second group is 9 times higher than in the first group, and p-value ≈ 0.02. However, at n = 100 the Bayesian analysis confirms the traditional approach (BF = 0.2 < 1/3), whereas at n = 50000 it confidently rejects it (BF = 5.2 > 3). Thus, when comparing low frequencies in large samples, the use of p-values is associated with an increased risk of obtaining false-positive results.

The situations described in Table 2 often occur in cytogenetics and radiation epidemiology. Suppose, for example, that a comparison of 1000 persons exposed to radiation with a control sample of the same size showed nine cases of leukemia among the exposed persons and only one case in the control group. The relative risk is RR = 9. As a rule, an epidemiologist in such cases reports a substantial and significant increase in the incidence of the disease among the exposed individuals. However, the Bayesian analysis does not confirm this conclusion: BF = 0.72 > 1/3, which, according to (3), corresponds to a null hypothesis probability of 42% (at an a priori probability of 50%).

In cytogenetic studies, low frequencies in huge samples are also often compared. This is due to the malpractice of pooling all scored metaphases for each of the compared groups. As a result, situations like those described in Table 2 occur. For example, one dicentric per 5000 metaphases was detected in the control group, and nine such chromosomal aberrations per the same number of cells were detected in the exposed individuals. In this case, p-value ≈ 0.02 and BF = 1.6; that is, the observed outcome is more probable in the absence of differences.

CONCLUSIONS

The proposals formulated in RSS are probably timely, but the scientific community is clearly not ready to accept them, as follows from the ongoing heated discussion and from the results of the survey conducted on Twitter by the journal Nature News & Comment. The question "Should we lower the critical level of p-values?" received 562 "yes" and 540 "no" votes (https://twitter.com/naturenews/status/890530105554087936).

Of course, lowering the threshold p-value will not lead to the significant improvement of reproducibility promised by the authors of RSS. Nevertheless, we believe that the following recommendations are relevant.

1. It is mandatory to calculate the Bayes factor when p-values fall into the "gray zone" (0.01–0.05). This can easily be done using the free JASP software [11] or on-line calculators:
https://jasp-stats.org,
http://www.stat.umn.edu/geyer/5102/examp/bayes.html,
http://pcl.missouri.edu/bf-binomial.
All calculations presented in this review can be reproduced using the specified software.

2. In addition to the BF estimates offered by the calculators, the min BF value (according to Sellke) and the minimum a posteriori probability of the null hypothesis should be estimated using formulas (6) and (7).

3. Particular attention should be given to the comparison of low frequencies in large samples. In such situations, p-values at the level of 0.02–0.05 mean nothing and lead to false results.

FUNDING

This work was supported by the Russian Foundation for Basic Research (project no. 16-06-0046517).
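As a practical footnote to recommendation 2: formulas (6) and (7) reduce to one-line computations, so the min BF and min P(H0|data) columns of Table 1 can be re-derived directly. A minimal sketch (the function names are my own; the 50:50 prior odds implicit in (7) are those assumed in the text):

```python
import math

def min_bayes_factor(p):
    """Sellke's lower bound on the Bayes factor, formula (6): -e*p*ln(p)."""
    assert 0 < p < 1 / math.e  # the bound is valid only for p < 1/e
    return -math.e * p * math.log(p)

def min_posterior_h0(p):
    """Minimum a posteriori probability of H0, formula (7): BF/(1 + BF)."""
    bf = min_bayes_factor(p)
    return bf / (1 + bf)

# Reproduce the rows of Table 1
for p in (0.05, 0.04, 0.03, 0.02, 0.01, 0.005):
    print(f"p = {p:<6} min BF = {min_bayes_factor(p):.3f}  "
          f"min P(H0|data) = {min_posterior_h0(p):.3f}")
```

For p = 0.05 this gives min BF ≈ 0.407 and min P(H0|data) ≈ 0.289, matching the first row of Table 1.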
COMPLIANCE WITH ETHICAL STANDARDS

The author declares that he has no conflict of interest. This article does not contain any studies involving animals or human participants performed by the author.

REFERENCES

1. Benjamin, D.J., Berger, J., Johannesson, M., Nosek, B., Wagenmakers, E., Berk, R., et al., Redefine statistical significance, Nat. Hum. Behav., 2018, no. 2, pp. 6–10.
2. Ioannidis, J., Why most published research findings are false, PLoS Med., 2005, vol. 2, e124.
3. Buck, S., Solving reproducibility, Science, 2015, vol. 348, no. 6242, p. 1403.
4. Kolmogorov, A.N., Probability theory, in Veroyatnost' i matematicheskaya statistika. Entsiklopediya (Probability and Mathematical Statistics. Encyclopedia), Prokhorov, Yu.V., Editor-in-Chief, Moscow: Bol'shaya Rossiiskaya Entsiklopediya, 1999; Moscow: Drofa, 2003, pp. 874–875.
5. Melton, A.W., Editorial, J. Exp. Psychol., 1962, vol. 64, pp. 553–557.
6. Wasserstein, R.L. and Lazar, N.A., The ASA's statement on p-values: context, process, and purpose, Am. Statistician, 2016, vol. 70, no. 2, pp. 129–133.
7. Lakens, D., Adolfi, F.G., Albers, C.J., Anvari, F., Apps, M.A., et al., Justify your alpha, 2018. psyarxiv.com/9s3y6.
8. Open Science Collaboration, Science, 2015, vol. 349, no. 6251, pp. 1–8.
9. McShane, B.B., Gal, D., Gelman, A., Robert, C., and Tackett, J.L., Abandon statistical significance, 2017. arXiv:1709.07588 [stat.ME].
10. Trafimow, D., Amrhein, V., Areshenkoff, C.N., et al., Manipulating the alpha level cannot cure significance testing: comments on "Redefine statistical significance," PeerJ Preprints, 2017, vol. 5, e3411v1. https://peerj.com/preprints/3411/.
11. Perezgonzalez, J.D. and Frías-Navarro, M.D., Retract p < 0.005 and propose using JASP, instead, F1000Research, 2017, vol. 6, p. 2122.
12. Amrhein, V. and Greenland, S., Remove, rather than redefine, statistical significance, Nat. Hum. Behav., 2018, vol. 2, p. 4.
13. Esarey, J., Replication data for: lowering the threshold of statistical significance to p < 0.005 to encourage enriched theories of politics, Polit. Methodologist, 2017, vol. 24, no. 2, pp. 13–20. https://thepoliticalmethodologist.com/v24-n2-fix/.
14. Crane, H., Why "redefining statistical significance" will not improve reproducibility and could make the replication crisis worse, 2017. arXiv:1711.07801v1 [stat.AP].
15. Head, M.L., Holman, L., Lanfear, R., Kahn, A.T., and Jennions, M.D., The extent and consequences of p-hacking in science, PLoS Biol., 2015, vol. 13, no. 3, e1002106.
16. Rubanovich, A.V. and Khromov-Borisov, N.N., Genetic risk assessment of the joint effect of several genes: critical appraisal, Russ. J. Genet., 2016, vol. 52, no. 7, pp. 757–769.
17. Wray, N.R., Yang, J., Hayes, B.J., et al., Pitfalls of predicting complex traits from SNPs, Nat. Rev. Genet., 2013, vol. 14, no. 7, pp. 507–515.
18. Goodman, S., A dirty dozen: twelve p-value misconceptions, Semin. Hematol., 2008, vol. 45, pp. 135–140.
19. Dienes, Z., How Bayes factors change scientific practice, J. Math. Psychol., 2016, vol. 72, pp. 78–89.
20. Held, L. and Ott, M., On p-values and Bayes factors, Annu. Rev. Stat. Appl., 2018, vol. 5, pp. 393–419.
21. Jeffreys, H., An invariant form for the prior probability in estimation problems, Proc. R. Soc. London, Ser. A, 1946, vol. 186, no. 1007, pp. 453–461.
22. Sellke, T., Bayarri, M.J., and Berger, J.O., Calibration of p values for testing precise null hypotheses, Am. Statist., 2001, vol. 55, pp. 62–71.

Translated by M. Batrukova