Five ways to fix statistics

As debate rumbles on about how and how much poor statistics is to blame for poor reproducibility, Nature asked influential statisticians to recommend one change to improve science. The common theme? The problem is not our maths, but ourselves.

JEFF LEEK
Adjust for human cognition
Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland

To use statistics well, researchers must study how scientists analyse and interpret data and then apply that information to prevent cognitive mistakes.

In the past couple of decades, many fields have shifted from data sets with a dozen measurements to data sets with millions. Methods that were developed for a world with sparse and hard-to-collect information have been jury-rigged to handle bigger, more-diverse and more-complex data sets. No wonder the literature is now full of papers that use outdated statistics, misapply statistical tests and misinterpret results. The application of P values to determine whether an analysis is interesting is just one of the most visible of many shortcomings.

It's not enough to blame a surfeit of data and a lack of training in analysis (J. T. Leek and R. D. Peng Proc. Natl Acad. Sci. USA 112, 1645–1646; 2015). It's also impractical to say that statistical metrics such as P values should not be used to make decisions. Sometimes a decision (editorial or funding, say) must be made, and clear guidelines are useful.

The root problem is that we know very little about how people analyse and process information. An illustrative exception is graphs. Experiments show that people struggle to compare angles in pie charts yet breeze through comparative lengths and heights in bar charts (W. S. Cleveland and R. McGill J. Am. Stat. Assoc. 79, 531–554; 1984). The move from pies to bars has brought better understanding.
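
The effect is easy to demonstrate. Below is a minimal sketch (the data and labels are invented for illustration) that draws the same values as a pie chart and as a bar chart; ranking the slices by angle is noticeably harder than ranking the bars by height.

```python
# Plot the same invented values as a pie chart and a bar chart to compare
# angle judgements with length judgements.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
values = [23, 25, 27, 25]          # deliberately similar, hard to rank by angle

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 4))
ax_pie.pie(values, labels=labels)  # angles: which of B and D is larger?
ax_pie.set_title("Angles (pie)")
ax_bar.bar(labels, values)         # lengths: the ranking is immediate
ax_bar.set_title("Lengths (bar)")
plt.show()
```
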
We need to appreciate that data analysis is not purely computational and algorithmic — it is a human behaviour. In this case, the behaviour is made worse by training that was developed for a data-poor era. This framing will enable us to address practical problems. For instance, how do we reduce the number of choices an analyst has to make without missing key features in a data set? How do we help researchers to explore data without introducing bias?

The first step is to observe: what do people do now, and how do they report it? My colleagues and I are doing this and taking the next step: running controlled experiments on how people handle specific analytical challenges in our massive online open courses (L. Myint et al. Preprint at bioRxiv http://dx.doi.org/10.1101/218784; 2017).

We need more observational studies and randomized trials — more epidemiology on how people collect, manipulate, analyse, communicate and consume data. We can then use this evidence to improve training programmes for researchers and the public. As cheap, abundant and noisy data inundate analyses, this is our only hope for robust information.

BLAKELEY B. MCSHANE AND ANDREW GELMAN
Abandon statistical significance
Northwestern University, Evanston, Illinois; Columbia University, New York

In many fields, decisions about whether to publish an empirical finding, pursue a line of research or enact a policy are considered only when results are 'statistically significant', defined as having a P value (or similar metric) that falls below some pre-specified threshold. This approach is called null hypothesis significance testing (NHST). It encourages researchers to investigate so many paths in their analyses that whatever appears in papers is an unrepresentative selection of the data.

Worse, NHST is often taken to mean that any data can be used to decide between two inverse claims: either 'an effect' that posits a relationship between, say, a treatment and an outcome (typically the favoured hypothesis) or 'no effect' (defined as the null hypothesis).

In practice, this often amounts to uncertainty laundering. Any study, no matter how poorly designed and conducted, can lead to statistical significance and thus a declaration of truth or falsity. NHST was supposed to protect researchers from over-interpreting noisy data. Now it has the opposite effect.

This year has seen a debate about whether tightening the threshold for statistical significance would improve science. More than 150 researchers have weighed in (D. J. Benjamin et al. Nature Hum. Behav. http://doi.org/cff2; 2017; D. Lakens et al. Preprint at PsyArXiv http://doi.org/cgbn; 2017). We think improvements will come not from tighter thresholds, but from dropping them altogether. We have no desire to ban P values. Instead, we wish them to be considered as just one piece of evidence among many, along with prior knowledge, plausibility of mechanism, study design and data quality, real-world costs and benefits, and other factors. For more, see our article with David Gal at the University of Illinois at Chicago, Christian Robert at the University of Paris-Dauphine and Jennifer Tackett at Northwestern University (B. B. McShane et al. Preprint at https://arxiv.org/abs/1709.07588; 2017).

For example, consider a claim, published in a leading psychology journal in 2011, that a single exposure to the US flag shifts support towards the Republican Party for up to eight months (T. J. Carter et al. Psychol. Sci. 22, 1011–1018; 2011). In our view, this finding has no backing from political-science theory or polling data; the reported effect is implausibly large and long-lasting; the sample sizes were small and nonrepresentative; and the measurements (for example, those of voting and political ideology) were noisy. Although the authors stand by their findings, we argue that their P values provide very little information.

Statistical-significance thresholds are perhaps useful under certain conditions: when effects are large and vary little under the conditions being studied, and when variables can be measured accurately. This may well describe the experiments for which NHST and canonical statistical methods were developed, such as agricultural trials in the 1920s and 1930s examining how various fertilizers affected crop yields. Nowadays, however, in areas ranging from policy analysis to biomedicine, changes tend to be small, situation-dependent and difficult to measure. For example, in nutrition studies, it can be a challenge to get accurate reporting of dietary choices and health outcomes.

Open-science practices can benefit science by making it more difficult for researchers to make overly strong claims from noisy data, but cannot by themselves compensate for poor experiments. Real advances will require researchers to make predictions more capable of probing their theories and to invest in more precise measurements, featuring, in many cases, within-person comparisons.

A crucial step is to move beyond the alchemy of binary statements about 'an effect' or 'no effect' with only a P value dividing them. Instead, researchers must accept uncertainty and embrace variation under different circumstances.

DAVID COLQUHOUN
State false-positive risk, too
University College London

To demote P values to their rightful place, researchers need better ways to interpret them. What matters is the probability that a result that has been labelled as 'statistically significant' turns out to be a false positive. This false-positive risk (FPR) is always bigger than the P value.

How much bigger depends strongly on the plausibility of the hypothesis before an experiment is done — the prior probability of there being a real effect. If this prior probability were low, say 10%, then a P value close to 0.05 would carry an FPR of 76%. To lower that risk to 5% (which is what many people still believe P < 0.05 means), the P value would need to be 0.00045.

So why not report the false-positive risk instead of the easily misinterpreted P value? The problem is that researchers usually have no way of knowing what the prior probability is.

The best solution is to specify the prior probability you would need to believe in order to achieve an FPR of 5%, as well as providing the P value and confidence interval. Another approach is to assume, arbitrarily, a prior probability of 0.5 and calculate the minimum FPR for the observed P value. (The calculations can be done easily with an online calculator, see http://fpr-calc.ucl.ac.uk.) This is one strategy that combines familiar statistics with Bayes' theorem, which updates prior probabilities using the evidence from an experiment. Of course, there are assumptions behind these calculations (D. Colquhoun Preprint at https://www.biorxiv.org/content/early/2017/10/25/144337; 2017), and no automated tool can absolve a researcher from careful thought.
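
To make the arithmetic concrete, here is a minimal sketch of that Bayesian calculation under the 'p-equals' interpretation. It assumes a two-sided test, a normal approximation and a power of 0.8; the function name and those assumptions are illustrative, not from the article, and the exact t-test calculations behind the online calculator give slightly different numbers (76% rather than the roughly 79% printed here).

```python
# Sketch of the false-positive risk (FPR) calculation, assuming a two-sided
# z-test (normal approximation). Colquhoun's calculator at
# http://fpr-calc.ucl.ac.uk uses exact t-tests, so its results differ slightly.
from scipy.stats import norm

def fpr(p_value, prior, power=0.8, alpha=0.05):
    """Probability that a result with this observed P value is a false
    positive, given the prior probability of a real effect ('p-equals'
    interpretation: likelihood ratio evaluated at the observed P value)."""
    z_obs = norm.isf(p_value / 2)                      # observed |z| for two-sided p
    delta = norm.isf(alpha / 2) - norm.ppf(1 - power)  # effect size giving this power
    like_h1 = norm.pdf(z_obs - delta) + norm.pdf(-z_obs - delta)  # density under H1
    like_h0 = 2 * norm.pdf(z_obs)                                 # density under H0
    posterior_odds_h1 = (like_h1 / like_h0) * (prior / (1 - prior))
    return 1 / (1 + posterior_odds_h1)

print(fpr(0.05, prior=0.10))     # ~0.79; the exact t-test version gives the 76% quoted
print(fpr(0.05, prior=0.50))     # ~0.29: minimum FPR under the arbitrary 0.5 prior
print(fpr(0.00045, prior=0.10))  # ~0.05, matching the 0.00045 figure above
```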


The hope is that my proposal might help to break the deadlock among statisticians about how to improve reproducibility. Imagine the healthy scepticism readers would feel if, when reporting a just-significant P value, a value close to 0.05, they also reported that the results imply a false-positive risk of at least 26%. And that to reduce this risk to 5%, you'd have to be almost (at least 87%) sure that there was a real effect before you did the experiment.

MICHÈLE B. NUIJTEN
Share analysis plans and results
Tilburg University, the Netherlands

Better than rules about how to analyse data are conventions that keep researchers accountable for analyses. A set of rigorous rules won't work to improve statistical practices because there will be too many situations to account for. Even a seemingly simple research question (does drug A work better than drug B?) can lead to a surfeit of different analyses. How should researchers account for variables such as gender or age, if they do so at all? Which extreme data points should be excluded, and when? The plethora of options creates a hazard that statistician Andrew Gelman has dubbed the garden of forking paths, a place where people are easily led astray. In the vast number of routes, at least one will lead to a 'significant' finding simply by chance. Researchers who hunt hard enough will turn up a result that fits statistical criteria — but their discovery will probably be a false positive.

Planning and openness can help researchers to avoid false positives. One technique is to preregister analysis plans: scientists write down (and preferably publish) how they intend to analyse their data before they even see them. This eliminates the temptation to hack out the one path that leads to significance and afterwards rationalize why that path made the most sense. With the plan in place, researchers can still actively try out several analyses and learn whether results hinge on a particular variable or a narrow set of choices, as long as they clearly state that these explorations were not planned beforehand.

The next step is to share all data and results of all analyses as well as any relevant syntax or code. That way, people can judge for themselves if they agree with the analytical choices, identify innocent mistakes and try other routes.
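
What a preregistered plan looks like in code varies by field. The sketch below is purely hypothetical (the file name, variables and models are invented) and illustrates only the convention described above: the confirmatory analysis is fixed before the data are seen, and anything beyond it is labelled exploratory.

```python
# Hypothetical preregistered analysis script: the confirmatory section is
# written (and ideally published) before the data are seen; anything else
# must be reported as exploratory.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("trial_data.csv")  # invented file name

# --- Preregistered (confirmatory) analysis -------------------------------
# Prespecified in advance: outcome, covariates and the exclusion rule.
data = data[data["outcome"].notna()]                 # prespecified exclusion
confirmatory = smf.ols("outcome ~ treatment + age + gender", data=data).fit()
print(confirmatory.summary())

# --- Exploratory analyses (label them as such when reporting) ------------
exploratory = smf.ols("outcome ~ treatment * age", data=data).fit()
print(exploratory.params)  # does the result hinge on an interaction?
```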

STEVEN N. GOODMAN
Change norms from within
Stanford University, California

It is not statistics that is broken, but how it is applied to science. This varies in myriad ways from subfield to subfield. Unfortunately, disciplinary conventions die hard, even when they contribute to shaky conclusions. Statisticians can be allies — the American Statistical Association, for instance, broke from tradition to warn against misuse of P values (R. L. Wasserstein and N. A. Lazar Am. Stat. 70, 129–133; 2016) — but they cannot fix the cultures of other fields.

When training scientists in the use of quantitative methods, I and others often feel pressure to teach the standard approaches that peers and journals expect rather than to expose the problems. Explaining to young scientists why they should be able to argue for a real finding when P = 0.10, or for its non-existence when P = 0.01, does not enhance their professional prospects, and usually takes more time than we have. Many scientists want only enough knowledge to run the statistical software that allows them to get their papers out quickly and to look like all the others in their field.

Norms are established within communities partly through methodological mimicry. In a paper published last month on predicting suicidality (M. A. Just et al. Nature Hum. Behav. http://dx.doi.org/10.1038/s41562-017-0234-y; 2017), the authors justified their sample size of 17 participants per group by stating that a previous study of people on the autism spectrum had used those numbers. Previous publication is not a true justification for the sample size, but it does legitimize it as a model. To quote from a Berwick report on system change, "culture will trump rules, standards and control strategies every single time" (see go.nature.com/2hxo4q2).

Disparate norms govern what kinds of result are sufficient to claim a discovery. Biomedical research generally uses the 2 sigma (P ≤ 0.05) rule; physics requires at least 3 sigma (P ≤ 0.003). In clinical research, the idea that a small randomized trial could establish therapeutic efficacy was discarded decades ago. In psychology, the notion that one randomized trial can establish a bold theory had been the norm until about five years ago. Even now, replicating a psychology study is sometimes taken as an affront to the original investigator.
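
These sigma labels are simply two-sided tail areas of the normal distribution, which is where the rounded P values in the text come from; a quick check:

```python
# Two-sided P values corresponding to k-sigma thresholds of a normal distribution.
from scipy.stats import norm

for k in (2, 3):
    print(f"{k} sigma -> two-sided P = {2 * norm.sf(k):.4f}")
# 2 sigma -> two-sided P = 0.0455  (rounded to the P <= 0.05 rule)
# 3 sigma -> two-sided P = 0.0027  (rounded to the P <= 0.003 rule)
```
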
No single approach will address problems in all fields. The challenge must be taken up by funders, journals and, most importantly, the leaders of the innumerable subdisciplines. Once the process starts, it could be self-reinforcing. Scientists will follow practices they see in publications; peer reviewers will demand what other reviewers demand of them.

The time is ripe for reform. The 'reproducibility crisis' has shown the cost of inattention to proper design and analysis. Many young scientists today are demanding change; field leaders must champion efforts to properly train the next generation and re-train the existing one. Statisticians have an important, but secondary, role. Norms of practice must be changed from within. ■