ISSN 1062-3590, Biology Bulletin, 2019, Vol. 46, No. 11, pp. 1449–1457. © Pleiades Publishing, Inc., 2019.

Russian Text © The Author(s), 2018, published in Radiatsionnaya Biologiya. Radioekologiya, 2018, Vol. 58, No. 5, pp. 453–462.

METHODOLOGY
OF SCIENTIFIC SEARCH

Redefining the Critical Value of Significance Level (0.005 instead of 0.05): The Bayes Trace
A. V. Rubanovich*
Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991 Russia
*e-mail: rubanovich@vigg.ru
Received April 2, 2018

Abstract—In 2017, a group of leading mathematical statisticians published a manifesto paper with an extremely simple message: the common critical level of p-values should be lowered by an order of magnitude (0.005 instead of 0.05) (Benjamin et al., 2017). In this review, the arguments of proponents and opponents of this proposal are discussed, together with the problems related to the "reproducibility crisis" of scientific results. This argumentation cannot be understood without considering the fundamentals of the theory of statistical inference. In this connection, the precise meaning of concepts such as the p-value, the Bayes factor, and the minimum a posteriori probability of the null hypothesis is also discussed, mainly using examples related to the comparison of frequencies. It is shown that, when using p-values, particular attention should be paid to comparisons of low frequencies in very large samples. Some practical recommendations on the application of Bayesian analysis are given.

Keywords: p-value, critical level of significance, Bayes factor, reproducibility crisis


DOI: 10.1134/S1062359019110086

INTRODUCTION

In summer 2017, a large group of leading mathematical statisticians (D.J. Benjamin, J. Berger, S. Goodman, J. Ioannidis, D. Moore, T. Sellke, et al.) published a small preprint-manifesto entitled "Redefine statistical significance" (RSS) on the site PsyArXiv.com. The message of this paper is extremely simple: the common critical level of p-values should be reduced by an order of magnitude (0.005 instead of 0.05). In addition to statisticians, the group of authors of RSS includes well-known researchers in various fields, including economics, sociology, physiology, anthropology, medicine, epidemiology, ecology, and philosophy (a total of 72 persons). This indicates strong support for this innovation in the scientific community. In January 2018, the document was reprinted in Nature Human Behaviour [1].

It is not surprising that the RSS manifesto caused a strong response in scientific journals, online publications, and mass media. The revision of the critical level of significance affects the interests of all researchers, including radiobiologists, whose discussions around "low doses," "thresholds," or, say, the effects of electromagnetic radiation are largely reduced to arguments about the statistical significance of the discovered phenomena.

In this review, we present a compilation of responses to the RSS manifesto and discuss the arguments of proponents and opponents of lowering the critical significance level. This argumentation cannot be understood without considering the foundations of statistical inference theory. In this connection, we will discuss the precise meaning of concepts such as the p-value, the Bayes factor, and the minimum a posteriori probability of the null hypothesis. In general, this will be done using examples associated with the comparison of frequencies. We demonstrate that, when using p-values, special attention should be given to the comparison of low frequencies in large samples. Some practical recommendations on the use of Bayesian analysis will be given.

1. ARGUMENTS OF THE AUTHORS OF THE RSS MANIFESTO

The pros of the revision of the critical significance level in the RSS manifesto can mostly be reduced to the following two provisions.

1. Lowering the critical level of the p-value will significantly reduce the proportion of publications with false results. This provision is based on the simple reasoning that first appeared in the famous essay by John Ioannidis "Why most published research findings are false" [2]. Suppose we analyze 1000 hypotheses, of which only 10 are valid (1%) and 990 are wrong. At a standard test power (80%), we have 10 × 0.8 = 8 confirmed correct hypotheses. At the same time, at a significance level of 0.05, the number of cases of confirmation of false hypotheses will be 990 × 0.05 ≈ 50. Thus, the percentage of published false results will be 50/(8 + 50) = 86.2%. It is clear that, by lowering the critical level of the p-value by an order of magnitude, we can achieve a significant reduction in the proportion of false positives: 5/(8 + 5) = 38.5%.

2. Effects at p-values close to 0.05 cannot be statistically significant because this is contrary to the results of Bayesian analysis, which will be discussed in Section 5. For now we will only mention the following fairly general statement. Suppose that, before the experiment, we assumed that our chances of success were 50 to 50; in other words, the a priori probability of the null hypothesis was equal to 50%. Then, after an experiment which showed a p-value = 0.05, the probability of the null hypothesis is still at least 29% (see Table 1). This rigorous mathematical statement, obtained by using the Bayesian approach, killed enthusiasm for considering effects at p-value = 0.05 as statistically significant. Indeed, our experiment did not make the null hypothesis improbable (say, at a level of 5%, as a naive user often assumes); it still holds at 29% or higher. For comparison, in the same situation at p-value = 0.005, the minimum a posteriori probability of the null hypothesis is 6.7% (Table 1).

Table 1. Minimum Bayes factor (local minimum for the priors in H0) and the minimum probability of H0 corresponding to p-values from the "gray zone"

p-value | min_H0 BF | min P(H0|data)
0.05    | 0.407     | 0.289
0.04    | 0.350     | 0.259
0.03    | 0.286     | 0.222
0.02    | 0.213     | 0.175
0.01    | 0.125     | 0.111
0.005   | 0.072     | 0.067

The authors of RSS claimed that their action was dictated primarily by concern about the extremely low reproducibility of biological, medical, and other scientific research. Indeed, the "reproducibility crisis" has literally struck the science of the 21st century and has been repeatedly discussed in mass media and scientific publications (see, e.g., [2, 3]). By proposing to lower the critical level of p-values by an order of magnitude, the authors of RSS postulate that this simple step will immediately improve the reproducibility of research results in many fields. Subsequently, this claim became the main target of a flurry of critical attacks by statisticians and active experimenters, although the authors of RSS emphasized that the dichotomy of p < 0.05 is not the only cause of the low reproducibility of research.

It should also be noted that the authors of RSS proposed to use the threshold value of 0.005 only for newly discovered effects, maintaining the critical level of 0.05 for repeated (verifying) tests. In general, the authors of RSS propose to no longer consider results with p-values from the range (0.005–0.05) statistically significant and to define them as "suggestive" (setting one thinking).

Of course, the RSS manifesto is not the first attempt to revise Fisher's critical significance level. As early as the middle of the last century, academician A.N. Kolmogorov, referring to the rule of "three sigmas," repeatedly proposed to use a critical level of 0.003 or even 0.001 [4]. In the 1960s, the lowering of the critical level of p-values to 0.01 was passionately advocated by A. Melton, Chief Editor of the Journal of Experimental Psychology [5]. A notable step in rethinking the role of p-values in scientific research was the statement of the Board of the American Statistical Association (ASA) in 2016 [6]. However, the publication of RSS had the greatest response in the scientific world. In contrast to the earlier statements, RSS lays down specific and very significant changes that can be easily implemented by editors of scientific journals and funding institutions.

2. "ALPHA-WARS" IN 2017

The RSS manifesto caused an unprecedented (in terms of scale) discussion in blogs and scientific publications, which was called the "alpha-war." Immediately after the publication of the manifesto, a young Dutch psychologist, Daniel Lakens, announced in his blog a collection of signatures among the opponents of the revision of the critical level of p < 0.05. As a result, in September 2017, the preprint "Justify your alpha" was published, signed by 87 colleagues [7]. According to the authors of this publication, the threshold p < 0.005 is as arbitrary as p < 0.05. The threshold cannot be fixed in general and should depend on what is already known about the subject of research and on the risks associated with obtaining an incorrect answer. A relatively high probability of false positive results can be accepted in a preliminary study, whereas the final test of a drug may require lower p-values.

In addition, the authors of [7] were the first to figure out the cost of lowering the critical p-value level by an order of magnitude: to maintain the accepted testing power, sample sizes should be increased, on average, by 70%. Such a requirement may be too much for the budgets of many research groups.

With respect to overcoming the "reproducibility crisis," the authors of [7] noted that, according to [8], among the results at p-values from the range (0.005–0.05), the proportion of confirmed results was 24%. However, among the results at p-value < 0.005, the proportion of confirmed results was also not


too high (49%), which does not correspond to the expectations of the authors of RSS.

The next prominent paper, entitled "Abandon statistical significance," one of whose authors is Andrew Gelman, a well-known expert in Bayesian analysis, was published in September 2017 [9]. The authors of this article recommend completely abandoning null hypothesis significance testing and considering the p-value as only one of many informative indices, without a privileged role in decisions about the significance of a phenomenon and the possibility of its publication. With regard to classifying research into new (p < 0.005) and repeated (p < 0.05), the authors of [9] believe that this recommendation is quite impractical, especially in fields where research is gradual and cumulative: since they (the authors of RSS) are not able to determine what a new effect is, the proposed policy will lead to inconsistency in the practice of reproducing results.

Similarly to many subsequent commentators [10–14], the authors of [9] believe that RSS offers virtually no evidence that p < 0.05 is one of the leading causes of the poor reproducibility of scientific research. In the opinion of the authors of [9], the leading cause is the absence of correction for multiple comparisons (both actual and potential), which has become the norm in applied research. In [13], it is also emphasized that lowering the threshold of p-values will lead to a drastic increase in publication bias, i.e., to a shift of the pool of published works towards greater effects. The harshest statements against RSS were made in the paper by Harry Crane "Why "redefining statistical significance" will not improve reproducibility and could make the replication crisis worse," published in November 2017 [14]. Crane regards the proposal of RSS as extremely erroneous, presented under false pretenses, and supported by flawed analysis, and suggests that it should not be accepted. H. Crane also reports that, according to his calculations, there are several possible scenarios in which the cutoff p < 0.005 will make the situation with reproducibility even worse.

3. RSS AND THE REPRODUCIBILITY CRISIS: p-HACKING

Deborah Mayo, a well-known statistician and philosopher of science, wrote in her blog errorstatistics.com that almost everyone knows the real causes of nonreproducibility: selective publication of the most effective results, ignoring the multiplicity of tests, sequential testing of hypotheses until one of them turns out significant, as well as selective presentation of methods and the results of their application.

In modern biostatistics, the tendency described by Deborah Mayo is called "p-hacking" [15]. This term covers a wide range of variants of artificial overestimation of the statistical significance of results. These include, for example:

1. A posteriori formulation of hypotheses. A researcher puts forward hypotheses after obtaining data, stating that all assumptions preceded the experiment.

2. Incomplete representation of experimental data. Publication of only the most favorable results.

3. Data editing by eliminating outliers and grouping subsamples. Persistent fragmentation of a sample in search of a stratification providing "significant" differences between subsamples.

4. The use of various statistical tests with subsequent publication of the most favorable results.

5. The use of the stopping-rule effect, i.e., gradually increasing the sample size until a significant result at a level of p < 0.05 is obtained [15].

6. Incorrect or incomplete accounting for the multiplicity of comparisons. Incorrect construction of permutation tests.

7. Transforming and normalizing data instead of using nonparametric statistics.

8. The use of multivariate statistical analysis without proper validation of statistical significance, e.g., the selection of predictors using the "stepwise" regression algorithm [16].

9. The use of total estimates of risk factors selected from a large number of predictors [16].

10. The use of a test sample for "clarification" of the results obtained using a training sample, which is a very common mistake in a two-stage (discovery set + validation set) search for effective predictors [16, 17].

The "effectiveness" of all these approaches, indeed, depends little on the critical level of significance. For example, manipulations using items 8–10 make it possible to overcome thresholds of p < 10⁻⁵ or even 10⁻¹⁰ [16]. For this reason, the majority of critics of RSS believe that the revision of the threshold p-value will not result in a significant improvement in the reproducibility of results. The author of these lines basically shares this belief. Throughout the past century, the threshold value of 0.05 was used everywhere, but there was no talk of a "reproducibility crisis." The situation changed at the turn of the century with the advent of new technologies leading to experimental designs with a huge number of independent, predictor, or grouping variables (e.g., microarrays, GWAS, RNA-seq, various omics technologies, etc.). Processing the corresponding data inevitably leads to frequent misuse of points 6–10 of the above list of p-hacking manifestations and, ultimately, to low reproducibility of the announced effects. According to some data, the false discovery rate (FDR) due to p-hacking is currently no less than 60% [6, 15].

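Item 5, the stopping rule, is easy to demonstrate numerically. Below is a minimal simulation sketch (the parameters are illustrative assumptions: 20 interim looks at a fair coin, 50 flips per look, a two-sided z-test): "peeking" after every batch inflates the false positive rate well above the nominal 5%.

```python
import math
import random

def z_test_p(heads, n):
    """Two-sided p-value for H0: P(heads) = 1/2 (normal approximation)."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_experiments, looks, batch, peeking, seed=1):
    """Fraction of fair-coin experiments declared 'significant' at p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        heads = n = 0
        for _ in range(looks):
            heads += sum(rng.random() < 0.5 for _ in range(batch))
            n += batch
            if peeking and z_test_p(heads, n) < 0.05:
                hits += 1   # stop as soon as 'significance' is reached
                break
        if not peeking and z_test_p(heads, n) < 0.05:
            hits += 1       # a single test at the final sample size
    return hits / n_experiments

single = false_positive_rate(1000, 20, 50, peeking=False)  # one test at n = 1000
peeked = false_positive_rate(1000, 20, 50, peeking=True)   # test after every batch
print(single, peeked)  # the peeking rate is several times the nominal 0.05
```

All data here are fair coins, so every "discovery" is a false positive; the single-look rate stays near 0.05 while the sequential rate does not.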

4. THE LOGIC OF p-VALUES

Next, we proceed to the discussion of the key concepts of statistical inference theory, which are absolutely necessary to fully understand the arguments put forward by the authors of RSS.

The largest number of errors and false interpretations is directly related to the concept of the "p-value" itself [18]. What is a p-value actually, and what is it not? Unfortunately, many users believe that a p-value is equal to the probability of the null hypothesis of no effect. Of course, this is not valid: p-value ≠ P(H0). Moreover, p-value ≠ P(H0|data); i.e., the p-value is not equal to the conditional probability of the null hypothesis given the observed data scenario (data). In the definition of the p-value, the "transposed" conditional probability appears:

p-value = P(data|H0). (1)

Strictly speaking, definition (1) is also not quite correct. In defining the p-value, data should be understood not only as the obtained data scenario but also as the aggregate of all scenarios with an even more extreme deviation from the null hypothesis. However, we will omit this point so as not to clutter the presentation.

Thus, the p-value is the conditional probability of the observed data given that the null hypothesis is valid, rather than the conditional probability of the null hypothesis itself. This probability is related to the null hypothesis only very indirectly. We compute p-value = P(data|H0); however, to make a decision about statistical significance, we need the transposed probability P(H0|data). This situation is similar to one that occurs constantly in studies with the "case–control" design: we estimate a marker frequency among patients but primarily need estimates of the frequency of patients among the marker carriers.

Typically, judging by small p-values, an experimenter concludes that the alternative hypothesis H1 (the presence of an effect) is valid. This conclusion is not always correct. The fact is that p-value = P(data|H0) may be small, but the value of P(data|H1) may be even smaller. This possibility is well illustrated by a famous anecdotal example. Let the statement "You are a woman" be the null hypothesis H0 and the statement "You are a man" be the alternative hypothesis H1. Then P(H0) = P(H1) = 1/2. It is known that about 3% of all women living at the moment are pregnant. That is, taking data = "Pregnancy," we have P(data|H0) = 0.03. In this case, P(data|H1) ≅ 0 ≪ P(data|H0). Transposing the common practice of using p-values to this situation, we have:

p-value = P(data|H0) = 0.03 < 0.05 → H0 is rejected and H1 is valid,

which in this case means: "Are you pregnant? So you are a man!"

This example clearly illustrates a number of simple but important provisions to keep in mind when using p-values:

1. A low p-value is a necessary but not a sufficient condition for the validity of the alternative hypothesis H1.

2. If the probability of observing the event data under a certain hypothesis is small, this does not necessarily mean that the probability of the hypothesis that generated data is also small.

3. In some cases, the problem of adopting the null hypothesis or the alternative hypothesis should be solved by comparing the conditional probabilities P(data|H0) and P(data|H1).

The last provision is fully implemented in the framework of the so-called Bayesian approach, in which the decision about adopting the null hypothesis is made on the basis of estimates of the Bayes factor (BF):

BF = P(data|H0)/P(data|H1). (2)

It is generally accepted that the null hypothesis is confidently adopted at BF > 3 and is rejected in favor of the alternative hypothesis at BF < 1/3. The key argument for an immediate revision of the critical significance level of 0.05 is the following observation: at such p-values, the data obtained are more probable under the null hypothesis than under the alternative hypothesis (BF > 1). It is clear that, in this case, it is quite unacceptable to consider the observed effect statistically significant. Relevant examples are discussed in detail in the next section.

5. BAYESIAN APPROACH: ADVANTAGES, DISADVANTAGES, AND PRINCIPAL LIMITATIONS

The advantages of the Bayesian approach are quite obvious and can be reduced to the following provisions.

1. In contrast to the p-value, the Bayes factor BF has a simple and clear meaning: it shows how many times more probable the observed data are under the null hypothesis than in the presence of differences.

2. The estimation of BF makes it possible to estimate the a posteriori probability of the null hypothesis, which is impossible in principle when using p-values. Indeed, according to the Bayes formula, we have

P(H0|data)/P(H1|data) = [P(data|H0)/P(data|H1)] × [P(H0)/P(H1)] = BF × P(H0)/P(H1).

This ratio makes it possible to perform the highly desirable "transposition" of the conditional probabilities, i.e., to proceed from P(data|H0) to P(H0|data). For this purpose, it is usually assumed that, a priori, P(H0) = P(H1) = 1/2. Then, keeping in mind that P(H1|data) = 1 − P(H0|data), we obtain:


P(H0|data) = BF/(BF + 1). (3)

3. In contrast to p-values, BF is symmetric with respect to the hypotheses H0 and H1. This means that BF > 3 is an argument in favor of hypothesis H0 and BF < 1/3 is an argument against H0 in favor of H1. For comparison: p-value = 0.002 is an argument against H0; however, p-value = 0.2 is not an argument in favor of the null hypothesis.

4. BF does not require test power estimates because, in some sense, it takes type 2 errors into account. It can easily be seen that BF is of the order of p-value/test power. At a low test power, the BF value will obviously be high [19].

5. Numerical experiments showed that the use of BF almost completely eliminates the stopping-rule effect [19].

6. BF naturally allows taking into account the multiplicity of comparisons [19, 20].

To implement the Bayesian approach, it is necessary to estimate the ratio of two conditional probabilities, P(data|H0) and P(data|H1). The former is almost always calculated simply; the corresponding calculations, in fact, are similar to the calculation of the p-value. The main difficulties are associated with the estimation of the conditional probability P(data|H1), for which it is necessary to assume the distribution pattern of data given that the alternative hypothesis H1 is valid. This a priori and usually unknown distribution is called the prior.

The main points associated with calculating Bayes factors are illustrated in Fig. 1 using the analysis of coin tossing ("heads and tails") as an example. The null hypothesis is that the frequency of "heads" corresponds to 1/2 (a "proper" coin), against the assumption H1 that the coin is asymmetrical. If hypothesis H1 is valid, the probability of "heads" (p) is considered a random variable with a "noninformative" (e.g., uniform) distribution (all p values are equiprobable). This distribution was used as early as by Bayes and Laplace. Nowadays, the "objectively noninformative" Jeffreys prior (the so-called arcsine distribution) is considered more appropriate [21].

Figure 1 shows that p-values from the "gray zone" (0.01–0.05) may correspond to relatively high values of the Bayes factor. For example, the variant 39/100 corresponds to p-value = 0.035; however, the Bayesian analysis shows that such data are more probable under the null hypothesis (BF = 1.1 > 1 for the Jeffreys prior). If we then proceed to the probability of the null hypothesis validity using formula (3), we obtain P(H0|39/100) = 1.1/2.1 ≈ 0.52. Meanwhile, the traditional analysis assures us that the difference of 39/100 from 1/2 is statistically significant (p-value = 0.035).

There are more striking examples of such situations. For example, at a large number of coin tosses, the variant 4870/10000 corresponds to p-value = 0.01. However, the Bayesian analysis shows that, in this case, the null hypothesis is much more probable: BF = 4.27 > 3, which corresponds to P(H0|4870/10000) ≈ 0.81. The discrepancies between the results of traditional and Bayesian analyses (the Jeffreys–Lindley paradox) are particularly noticeable for large samples.

As expected, the estimates of the Bayes factor essentially depend on the selection of the prior. Figure 2 shows the density distributions of five priors differing in the position of the mean value μ (localization) and the standard deviation σ. The corresponding BF estimates are shown in the accompanying table. Priors 1 and 2 are nearly Gaussian distributions whose mean values are 0.39 and 1/2, respectively. In the first case, we will say that the prior is localized in data; in the second case, it is localized in H0. Priors 3 and 4 are located in H0 and are "noninformative" (see Fig. 1). The other priors carry certain a priori information about the distribution of p under condition H1.

Figure 2 shows that, depending on the prior selection, one may come to opposite conclusions: prior 1 allows adopting hypothesis H1 (BF = 0.17 < 1/3), whereas prior 5 leads to the conclusion that the null hypothesis is valid (BF = 3.85 > 3). It is quite another matter that both of these priors clearly do not correspond to our situation, for which it is more natural to use the "noninformative" priors 3 and 4. Nevertheless, in many cases the dependence of the Bayes factor on the selected prior creates principal difficulties. It can be shown [22] that, irrespective of the prior, the Bayes factor for large samples (n) increases as

BF ~ nσ. (4)

In essence, this means that, by postulating a fairly "smeared" prior (large σ), it is always possible to obtain high BF values and then confidently adopt the null hypothesis.

It may seem that this negates any possibility of using Bayesian analysis in cases where we do not have a sufficient basis for prior selection (e.g., from preliminary experiments). However, the situation is partially improved by the fact that, by selecting the prior, the Bayes factor can be made as large as desired but cannot be made arbitrarily small: the BF value is always bounded from below.

6. MINIMUM BAYES FACTOR

It can easily be seen that the minimum Bayes factor is achieved for an "acute" prior (σ = 0) localized in data. For example, in the case of coin tosses (Fig. 1), we obtain

min BF = 2^−(n1+n2) / [p^n1 (1 − p)^n2], (5)


Bayesian coin toss analysis

H0: frequency of "heads" = 1/2; H1: frequency of "heads" ≠ 1/2; data = {n1, n2}, the numbers of "heads" and "tails."

Probability of data under H0: P(data|H0) = C(n1+n2, n1) (1/2)^n1 (1/2)^n2.
Probability of data under H1: P(data|H1) = C(n1+n2, n1) p^n1 (1 − p)^n2, where p is an unknown random variable with noninformative distribution f(p) (the so-called prior).

BF = 2^−(n1+n2) / ∫₀¹ p^n1 (1 − p)^n2 f(p) dp.

Noninformative priors:
1. Homogeneous (uniform) prior (all p values are equiprobable): f(p) = 1.
2. Jeffreys objectively noninformative prior: f(p) = 1/[π√(p(1 − p))].

100 tosses: comparison of n1/100 vs. 1/2

n1 | p-value | BF, homogeneous prior | BF, Jeffreys prior
35 | 0.004   | 0.087                 | 0.130
36 | 0.007   | 0.158                 | 0.236
37 | 0.012   | 0.272                 | 0.411
38 | 0.021   | 0.452                 | 0.686
39 | 0.035   | 0.718                 | 1.095
40 | 0.057   | 1.095                 | 1.678
41 | 0.089   | 1.603                 | 2.465
42 | 0.133   | 2.252                 | 3.474

Fig. 1. The simplest case of estimation of the Bayes factor—testing a coin for symmetry. The algorithm for calculating BF for two types of "noninformative" priors is shown. In the table, BF values are compared to p-values estimated using the two-tailed exact binomial test. The cases where the Bayesian analysis did not confirm the significance of differences from 1/2 are highlighted.
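With the Jeffreys prior, the integral in the BF formula of Fig. 1 has a closed form, B(n1 + 1/2, n2 + 1/2)/π, so the Jeffreys column of the table can be reproduced with the standard library alone. A minimal sketch:

```python
import math

def log_beta(a, b):
    """ln B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf_jeffreys(n1, n2):
    """Bayes factor for H0: P(heads) = 1/2 vs. H1 with the Jeffreys prior.

    BF = (1/2)^(n1 + n2) * pi / B(n1 + 1/2, n2 + 1/2).
    """
    log_bf = (math.log(math.pi) - (n1 + n2) * math.log(2)
              - log_beta(n1 + 0.5, n2 + 0.5))
    return math.exp(log_bf)

def posterior_h0(bf):
    """Formula (3): P(H0|data) for prior odds 1:1."""
    return bf / (bf + 1)

bf = bf_jeffreys(39, 61)
print(round(bf, 3), round(posterior_h0(bf), 2))  # ~1.095 and ~0.52, as in the text
print(bf_jeffreys(4870, 5130) > 3)               # the 4870/10000 example favors H0
```

Working in log space via `lgamma` avoids the overflow that a direct evaluation of the binomial terms would cause for large n.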

Estimates of the Bayes factor significantly depend on the prior selection: 100 coin tosses, five priors, comparison of 39/100 vs. 1/2 (p-value = 0.035).

Prior | BF   | Localization | Smearing (σ)
1     | 0.17 | data         | 0.08
2     | 0.32 | H0           | 0.08
3     | 0.72 | H0           | 0.29
4     | 1.09 | H0           | 0.35
5     | 3.85 | H0           | 0.46

Fig. 2. Dependence of BF on the prior selection. Bayes factor estimates for five priors differing in the mean value position (localization) and variance (σ²) are shown.
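One way to see the trend of Fig. 2 without reproducing its exact priors is to use symmetric Beta(a, a) priors centered on H0 (a simplifying assumption of this sketch): decreasing a "smears" the prior and BF grows, in line with relation (4). The uniform and Jeffreys priors of Fig. 1 are the special cases a = 1 and a = 1/2, so their table values serve as a check.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf_beta_prior(n1, n2, a):
    """BF for H0: p = 1/2 against H1 with a symmetric Beta(a, a) prior on p.

    BF = (1/2)^(n1 + n2) * B(a, a) / B(n1 + a, n2 + a).
    """
    log_bf = (-(n1 + n2) * math.log(2)
              + log_beta(a, a) - log_beta(n1 + a, n2 + a))
    return math.exp(log_bf)

# 39 heads in 100 tosses, as in Figs. 1 and 2
print(round(bf_beta_prior(39, 61, 1.0), 3))  # uniform prior, ~0.718
print(round(bf_beta_prior(39, 61, 0.5), 3))  # Jeffreys prior, ~1.095
```

The broader Jeffreys prior yields a larger BF than the uniform one, which is exactly the "smearing favors H0" effect the text describes.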


Dependence of BF on the degree of prior "smearing" for (1) priors localized in H0 and (2) priors localized in data. Comparison of 39/100 vs. 1/2:

p-value: 0.035
Global min BF: 0.087
Local min BF: 0.317
Local min BF according to Sellke (−ep ln p): 0.320
BF for Jeffreys prior: 1.095

Fig. 3. BF dependence on the degree of prior "smearing" (σ²—prior variance): (1) priors localized in H0 and (2) priors localized in data.
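The numbers in the Fig. 3 panel and in Table 1 follow directly from the Sellke approximation (6) together with relation (3). A minimal sketch:

```python
import math

def sellke_min_bf(p):
    """Formula (6): local-minimum Bayes factor -e*p*ln(p), valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

def min_posterior_h0(p):
    """Formula (7): minimum a posteriori probability of H0 at prior odds 1:1."""
    bf = sellke_min_bf(p)
    return bf / (1 + bf)

# Reproduce the two numeric columns of Table 1
for p in (0.05, 0.04, 0.03, 0.02, 0.01, 0.005):
    print(p, round(sellke_min_bf(p), 3), round(min_posterior_h0(p), 3))
```

For the running example p = 0.035 this gives min BF ≈ 0.32 and min P(H0|39/100) ≈ 0.24, the values quoted in the text.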

where the value of p corresponding to our data is selected: p = n1/(n1 + n2). A similar formula for the comparison of mean values with Student's t test has a very simple form [22]:

min BF = e^(−t²/2).

This minimum, corresponding to the "acute" prior in data, we will call the global minimum, because any other prior yields a higher BF value.

For some classes of priors, there may be local minima, which are naturally always higher than the global min BF. For example, for unimodal priors localized in H0, there is always a local minimum of BF, which can be found numerically. For this minimum, a good approximation proposed by T. Sellke is known [22]: for a broad class of unimodal priors whose mean value is localized in H0, the local minimum of the Bayes factor is approximately equal to

min_H0 BF ≈ −ep ln p. (6)

Here (and only here), p ≡ p-value < 1/e, and e is the base of the natural logarithm.

The pattern of the dependence of BF on the degree of prior "smearing" (σ) depends on the position of the mean value (i.e., on the prior localization) (Fig. 3). For priors localized in data, BF increases monotonically with increasing σ, starting from the point corresponding to the global minimum. A characteristic feature of priors localized in H0 is the existence of a local minimum defined by formula (6); in this case, for an "acute" prior localized in H0, always BF = 1. At large σ, for both classes of priors, BF increases with increasing σ according to (4).

The min BF estimate provides a unique opportunity to assess the minimum probability of realization of the null hypothesis regardless of the selected prior. It suffices to substitute the lower limit of BF (6) into formula (3). As a result, we obtain

P(H0|data) ≥ −ep ln p / (1 − ep ln p), (7)

where p ≡ p-value. This value can be called the "minimum a posteriori probability of the null hypothesis" corresponding to a given p-value.

As above, we will consider the comparison of 39/100 vs. 1/2 as an example (Fig. 3). The "noninformative" Jeffreys prior indicates that, in this case, the data are more probable under the null hypothesis, although p-value = 0.035. A conservative experimenter may not agree with this conclusion, since BF depends on the prior. However, according to (7), the minimal a posteriori probability of the null hypothesis is min P(H0|39/100) = 0.24, and this conclusion is almost independent of the selected prior.

The recalculation of p-values into the minimum probabilities of the null hypothesis is shown in Table 1. In essence, this table is the main argument of the supporters of the revision of the critical significance level. As can be seen from Table 1, only at p-value = 0.005 is an acceptable level of the minimum a posteriori probability of the null hypothesis achieved.

7. RISK ZONES IN USING p-VALUES: LOW FREQUENCIES AND LARGE SAMPLES

In conclusion, we will discuss situations in which the use of p-values from the "gray zone" (0.01–0.05) is especially dangerous. Most of these situations are

characterized by the formula "low frequencies and large samples." In Section 5, we already mentioned that, for large samples, the results of the traditional analysis based on p-values may contradict the conclusions drawn from the Bayesian analysis (the Jeffreys–Lindley paradox).

Suppose we compare the frequencies of a certain event in two samples of equal size, e.g., 1/n vs. 9/n, where n is the sample size. In calculating the Bayes factor, it is then appropriate to use the "noninformative" Jeffreys prior, as was done for coin flipping. Indeed, under the null hypothesis, each of the 1 + 9 = 10 events falls at random into one of the two samples with a probability of 1/2. Hence, the two-tailed p-value for the comparison of 1/n vs. 9/n is practically independent of n and asymptotically approaches twice the probability of getting fewer than two "heads" in ten coin tosses. However, the BF value always increases with increasing n, as is generally described by formula (4).

Table 2 compares p-values and Bayes factors at different sample sizes n.

Table 2. Traditional and Bayesian analyses for the comparison 1/n vs. 9/n at different sample sizes (n)

    n      p-value     BF
   50       0.016     0.116
  100       0.018     0.197
  500       0.021     0.504
 1000       0.021     0.724
 5000       0.021     1.639
10000       0.021     2.322
50000       0.021     5.198

p-Values are for the two-tailed Fisher's exact test; Bayes factors are for the Jeffreys prior. The cases of rejection of the alternative hypothesis H1 about the existence of differences are highlighted.

In all cases presented in Table 2, the frequency of events in the second group is 9 times higher than in the first, and p-value ≈ 0.02. However, at n = 100 the Bayesian analysis confirms the traditional approach (BF = 0.2 < 1/3), while at n = 50000 it confidently rejects it (BF = 5.2 > 3). Thus, when comparing low frequencies in large samples, the use of p-values is associated with an increased risk of obtaining false-positive results.

The situations described in Table 2 often occur in cytogenetics and radiation epidemiology. Suppose, for example, that a comparison of 1000 persons exposed to radiation with a control sample of the same size showed nine cases of leukemia among the exposed persons and only one case in the control group. The relative risk is RR = 9. As a rule, an epidemiologist in such cases reports a substantial and significant increase in the incidence of the disease in the exposed individuals. However, the Bayesian analysis does not confirm this conclusion: BF = 0.72 > 1/3, which, according to (3), corresponds to a null hypothesis probability of 42% (at an a priori probability of 50%).

In cytogenetic studies, low frequencies in very large samples are also often compared. This is due to the malpractice of pooling all scored metaphases within each of the compared groups. As a result, situations like those described in Table 2 arise. Suppose, for example, that one dicentric per 5000 metaphases was detected in the control group, and nine such chromosomal aberrations per the same number of cells were detected in the exposed individuals. In this case, p-value ≈ 0.02 and BF = 1.6; that is, the observed outcome is more probable in the absence of differences.

CONCLUSIONS

The proposals formulated in the RSS are probably timely, but the scientific community is clearly not ready to accept them, as follows from the ongoing heated discussion and from the survey conducted on Twitter by the journal Nature News & Comment. The question "Should we lower the critical level of p-values" received 562 "yes" and 540 "no" votes (https://twitter.com/naturenews/status/890530105554087936).

Of course, lowering the threshold p-value will not lead to a significant improvement in reproducibility, as promised by the authors of the RSS. Nevertheless, we believe that the following recommendations are relevant.

1. It is mandatory to calculate the Bayes factor when p-values fall into the "gray zone" (0.01–0.05). This can easily be done using the JASP freeware [11] or the on-line calculators
https://jasp-stats.org,
http://www.stat.umn.edu/geyer/5102/examp/bayes.html,
http://pcl.missouri.edu/bf-binomial.
All calculations presented in this review can be reproduced with the specified software.

2. In addition to the BF estimates offered by the calculators, the min BF value (according to Sellke) and the minimum a posteriori probability of the null hypothesis should be estimated using formulas (6) and (7).

3. Particular attention should be given to the comparison of low frequencies in large samples. In these situations, p-values at the level of 0.02–0.05 mean nothing and lead to false results.

FUNDING

This work was supported by the Russian Foundation for Basic Research (project no. 16-06-0046517).
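For recommendation 2, formulas (6) and (7) are straightforward to script. Below is a minimal Python sketch (our own illustration; the function names are ours): the Sellke lower bound min BF from formula (6), the conversion of a Bayes factor BF = P(data|H0)/P(data|H1) into a posterior probability of H0, and the resulting minimum a posteriori probability of H0 from formula (7).

```python
import math

def min_bayes_factor(p):
    """Sellke's lower bound on the Bayes factor in favor of H0
    (formula (6)); valid only for 0 < p-value < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("formula (6) requires 0 < p < 1/e")
    return -math.e * p * math.log(p)

def posterior_h0(bf, prior=0.5):
    """Posterior probability of H0 given a Bayes factor
    BF = P(data|H0)/P(data|H1) and a prior probability of H0."""
    odds = bf * prior / (1 - prior)
    return odds / (1 + odds)

def min_posterior_h0(p, prior=0.5):
    """Minimum a posteriori probability of H0 (formula (7))."""
    return posterior_h0(min_bayes_factor(p), prior)

for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: min BF = {min_bayes_factor(p):.3f}, "
          f"min P(H0|data) = {min_posterior_h0(p):.3f}")
```

For p = 0.05 this gives min P(H0|data) ≈ 0.289, and for p = 0.005 it gives ≈ 0.067 (cf. Table 1); likewise, posterior_h0(0.72) ≈ 0.42 reproduces the leukemia example.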

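The numbers in Table 2 can also be checked directly. Below is a small pure-Python sketch (ours, not the authors' code) of the two-tailed Fisher's exact test and of a Jeffreys-prior Bayes factor for comparing two binomial proportions; the latter is our reading of the model behind Table 2 (independent Beta(1/2, 1/2) priors under H1, a common Beta(1/2, 1/2) prior under H0).

```python
from math import comb, lgamma, exp

def fisher_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    def prob(k):  # hypergeometric probability of k events in the first sample
        return comb(row1, k) * comb(row2, col1 - k) / denom
    p_obs = prob(a)
    return sum(prob(k)
               for k in range(max(0, col1 - row2), min(col1, row1) + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf01_two_binomials(x1, x2, n):
    """Bayes factor P(data|H0)/P(data|H1) for comparing x1/n vs x2/n
    with Jeffreys Beta(1/2, 1/2) priors on the binomial proportions."""
    num = log_beta(x1 + x2 + 0.5, 2 * n - x1 - x2 + 0.5) + log_beta(0.5, 0.5)
    den = log_beta(x1 + 0.5, n - x1 + 0.5) + log_beta(x2 + 0.5, n - x2 + 0.5)
    return exp(num - den)

for n in (50, 100, 500, 1000, 5000, 10000, 50000):
    p = fisher_two_sided(1, n - 1, 9, n - 9)
    bf = bf01_two_binomials(1, 9, n)
    print(f"n = {n:6d}: p-value = {p:.3f}, BF = {bf:.3f}")
```

Up to rounding, the printed values reproduce Table 2: the p-value is nearly constant (approaching 2 × 11/2^10 ≈ 0.021, i.e., twice the probability of fewer than two "heads" in ten tosses), while BF grows steadily with n.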

COMPLIANCE WITH ETHICAL STANDARDS

The author declares that he has no conflict of interest. This article does not contain any studies involving animals or human participants performed by the author.

REFERENCES

1. Benjamin, D.J., Berger, J., Johannesson, M., Nosek, B., Wagenmakers, E., Berk, R., et al., Redefine statistical significance, Nat. Hum. Behav., 2018, vol. 2, pp. 6–10.
2. Ioannidis, J., Why most published research findings are false, PLoS Med., 2005, vol. 2, e124.
3. Buck, S., Solving reproducibility, Science, 2015, vol. 348, no. 6242, p. 1403.
4. Kolmogorov, A.N., Probability theory, in Veroyatnost' i matematicheskaya statistika. Entsiklopediya (Probability and Mathematical Statistics. Encyclopedia), Prokhorov, Yu.V., Editor-in-Chief, Moscow: Bol'shaya Rossiiskaya Entsiklopediya, 1999; Moscow: Drofa, 2003, pp. 874–875.
5. Melton, A.W., Editorial, J. Exp. Psychol., 1962, vol. 64, pp. 553–557.
6. Wasserstein, R.L. and Lazar, N.A., The ASA's statement on p-values: context, process, and purpose, Am. Stat., 2016, vol. 70, no. 2, pp. 129–133.
7. Lakens, D., Adolfi, F.G., Albers, C.J., Anvari, F., Apps, M.A., et al., Justify your alpha, 2018. psyarxiv.com/9s3y6.
8. Open Science Collaboration, Estimating the reproducibility of psychological science, Science, 2015, vol. 349, no. 6251, pp. 1–8.
9. McShane, B.B., Gal, D., Gelman, A., Robert, C., and Tackett, J.L., Abandon statistical significance, 2017. arXiv:1709.07588 [stat.ME].
10. Trafimow, D., Amrhein, V., Areshenkoff, C.N., et al., Manipulating the alpha level cannot cure significance testing: comments on "Redefine statistical significance," PeerJ Preprints, 2017, vol. 5, e3411v1. https://peerj.com/preprints/3411/.
11. Perezgonzalez, J.D. and Frías-Navarro, M.D., Retract p < 0.005 and propose using JASP, instead, F1000Research, 2017, vol. 6, p. 2122.
12. Amrhein, V. and Greenland, S., Remove, rather than redefine, statistical significance, Nat. Hum. Behav., 2018, vol. 2, p. 4.
13. Esarey, J., Lowering the threshold of statistical significance to p < 0.005 to encourage enriched theories of politics, Polit. Methodologist, 2017, vol. 24, no. 2, pp. 13–20. https://thepoliticalmethodologist.com/v24-n2-fix/.
14. Crane, H., Why "redefining statistical significance" will not improve reproducibility and could make the replication crisis worse, 2017. arXiv:1711.07801v1 [stat.AP].
15. Head, M.L., Holman, L., Lanfear, R., Kahn, A.T., and Jennions, M.D., The extent and consequences of p-hacking in science, PLoS Biol., 2015, vol. 13, no. 3, e1002106.
16. Rubanovich, A.V. and Khromov-Borisov, N.N., Genetic risk assessment of the joint effect of several genes: critical appraisal, Russ. J. Genet., 2016, vol. 52, no. 7, pp. 757–769.
17. Wray, N.R., Yang, J., Hayes, B.J., et al., Pitfalls of predicting complex traits from SNPs, Nat. Rev. Genet., 2013, vol. 14, no. 7, pp. 507–515.
18. Goodman, S., A dirty dozen: twelve p-value misconceptions, Semin. Hematol., 2008, vol. 45, pp. 135–140.
19. Dienes, Z., How Bayes factors change scientific practice, J. Math. Psychol., 2016, vol. 72, pp. 78–89.
20. Held, L. and Ott, M., On p-values and Bayes factors, Annu. Rev. Stat. Appl., 2018, vol. 5, pp. 393–419.
21. Jeffreys, H., An invariant form for the prior probability in estimation problems, Proc. R. Soc. London, Ser. A, 1946, vol. 186, no. 1007, pp. 453–461.
22. Sellke, T., Bayarri, M.J., and Berger, J.O., Calibration of p values for testing precise null hypotheses, Am. Stat., 2001, vol. 55, pp. 62–71.

Translated by M. Batrukova
