P Value


What is a P-value?

P-value ≠ Pr(H0 is true)!


Pr(A|B) ≠ Pr(B|A)
(“the probability of A given B” is not “the probability of B given A”)

Pr(data|H0) ≠ Pr(H0|data)
Pr(clouds|rain) ≠ Pr(rain|clouds)

2
Doing a hypothesis test means making a decision.

              Reject H0        Retain H0
H0 true       Type I error     OK
H0 false      OK               Type II error

α = Type I error rate = Pr(reject H0 | H0 is true)
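
A minimal R sketch (my own illustration, not from the deck) of what α means in practice: simulate many two-sample t tests with H0 true and check that about 5% of them reject at the 0.05 level.

set.seed(1)
# Both samples come from the same normal distribution, so H0 is true.
pvals <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(pvals < 0.05)   # observed Type I error rate: close to 0.05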

3
Examples (?)
George stands trial for a crime (e.g., burglary).
What is H0, Ha? Type I error? Type II error?
H0: George is innocent
Ha: He is guilty
Type I error: innocent man convicted
Type II error: guilty man set free
Note: “not guilty” ≠ “innocent”
Note: A grand jury might have looked at several possible
defendants and only agreed to let the DA bring forward
George’s case. I.e., George was not chosen randomly to
stand trial. If we were to choose defendants at random, then we
would make lots of Type I errors over the many trials.

4
Susan goes to her doctor because she thinks she is ill.
What is H0, Ha? Type I error? Type II error?
H0: Susan is well
Ha: She is sick

Type I error: give unnecessary drugs


Type II error: fail to treat a sick person

5
Fred reads that aliens landed at Roswell, NM in 1947.
Should he believe this?
What is H0, Ha? Type I error? Type II error?
H0: No such thing happened
Ha: There is a conspiracy of silence

Type I error: believe nonsense (be gullible)


Type II error: miss real news (be too
skeptical)

6
Expert witness work
Consider the question asked, then give one of
the six acceptable answers:
Yes
No
I don’t know
I don’t remember
Could you please repeat the question?
Green
Not “The car was a green Honda with a sunroof, NY
license plates, and the radio was blaring.”
Q: What color was the car? A: Green
7
Hypothesis test
No matter what question you wish the test
would answer, a hypothesis test only answers
one question.
Not “This model is probably true.”
Not “The effect of the drug is large.”
Not “People should care about the difference I have
found.”
Q: Are the data consistent with the model (such that
any deviation from the model could reasonably have
happened by chance)? A: Yes (or No)

8
See the Dance of the P-values
https://www.youtube.com/watch?v=ez4DgdurRPg

See the Amer. Stat. Assoc. paper on P-values:
http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108
One way to introduce H0 testing:
See “Introduction to Hypothesis Testing” in
Activity Based Statistics.
Basic idea: Pretend to be doing something
that is random (e.g., tossing a fair coin) while
secretly doing something that is fixed (e.g.,
tossing a two-tailed coin). “Weird” data arise
under the unstated null hypothesis of
fairness. Students often become suspicious
around the time that the P-value approaches
1/32.
See http://www.openintro.org/why05 for a nice activity.
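
As a side note (my own arithmetic, not from the deck), the 1/32 is just the chance of five heads in a row from a fair coin:

# Pr(k heads in a row from a fair coin) = (1/2)^k
(1/2)^(1:6)   # 0.5 0.25 0.125 0.0625 0.03125 0.015625  -- 1/32 at k = 5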
Large n makes everything “significant”
Suppose 31% of American households have a cat.
Suppose we test H0: p =1/3 vs HA: p ≠ 1/3.
sample p̂      n        Z       P-value
  0.31      1000     -1.56     0.117
  0.31      2000     -2.21     0.027
  0.31      3000     -2.71     0.0067
  0.31     10000     -4.9      0.0000007

Very strong evidence for HA, but do we care that 31% ≠ 1/3?

So we should think about effect size! That’s a topic for
another day, but see http://rpsychologist.com/d3/cohend/
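
For reference, a short R sketch (my own, assuming the usual one-sample z test for a proportion with the null standard error) that reproduces the table:

p_hat <- 0.31
p0    <- 1/3
n     <- c(1000, 2000, 3000, 10000)
z     <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)              # z statistic with the null SE
data.frame(n, Z = round(z, 2), P = signif(2 * pnorm(z), 2))  # two-sided P-values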
11
The difference between “significant” and “not
significant” is not, itself, statistically
significant (Gelman and Stern, 2006)

12
Consider testing whether an effect is zero.

                        mean    SE     H0?
Group 1                  25     10     Reject
Group 2                  10     10     Retain
Group 1 vs Group 2       15     14     Retain!
There is evidence that µ1 is not zero.
There is not sufficient evidence to say
that µ2 is not zero.
There is not sufficient evidence to say
that µ1 is different from µ2.
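
The arithmetic behind the table, as a quick R check (my own sketch, assuming a simple z test based on the reported means and SEs; the 14 in the last row is just sqrt(10² + 10²), the SE of the difference):

z1 <- 25 / 10                     # Group 1: z = 2.5  -> reject H0
z2 <- 10 / 10                     # Group 2: z = 1.0  -> retain H0
se_diff <- sqrt(10^2 + 10^2)      # SE of the difference between the groups, ~14
z_diff  <- (25 - 10) / se_diff    # z ~ 1.06 -> retain H0
round(2 * pnorm(-abs(c(z1, z2, z_diff))), 3)   # two-sided P-values: ~0.012, 0.317, 0.289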
13
“# older brothers” is “sig”; “# older sisters” is “not sig.”
But we can’t say that “# older brothers” and “# older sisters” are different.

14
Two (different) Ideas

Fisher: Determine whether the data are consistent with a
given model/hypothesis.
Neyman and Pearson: Choose between two hypotheses.
STAT 101: Blend Fisher’s idea and the Neyman/Pearson
idea into a single method – as if this made sense…

See Chance 2008 article on tables: Student and Pearson
had tables that gave probabilities for given df and test stat.
Fisher didn’t want to ask for copyright permission and so
created tables with critical values presented for given df
and alpha. This contributed to fixed-level (alpha) testing.
R.A. Fisher in 1925:
“… An observation is judged significant, if it would rarely have
been produced, in the absence of a real cause of the kind we
are seeking. It is a common practice to judge a result significant,
if it is of such a magnitude that it would have been produced by
chance not more frequently than once in twenty trials. This is an
arbitrary, but convenient, level of significance for the practical
investigator, but it does not mean that he allows himself to be
deceived once in every twenty experiments. The test of
significance only tells him what to ignore, namely all
experiments in which significant results are not obtained. He
should only claim that a phenomenon is experimentally
demonstrable when he knows how to design an experiment so
that it will rarely fail to give a significant result. Consequently,
isolated significant results which he does not know how to
reproduce are left in suspense pending further investigation.”
Digression: Bayesian Inference
A Two-sample t-Test

H0: μsoap = μcon
HA: μsoap ≠ μcon
> favstats(bacteria ~ Group, data=SoapData)


Group min Q1 median Q3 max mean sd n missing
1 control 21 33.75 37 49.5 66 41.750 15.636 8 0
2 soap 6 21.00 27 38.0 76 32.429 22.832 7 0
> t.test(bacteria ~ Group, data=SoapData)
Welch Two Sample t-test
data: bacteria by Group
t = 0.91, df = 10.4, p-value = 0.38
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.387 32.030
sample estimates:
mean in group control mean in group soap
41.750 32.429
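
As a sanity check (my own addition), the Welch t statistic can be reproduced from the favstats output above:

se_diff <- sqrt(15.636^2 / 8 + 22.832^2 / 7)   # SE of the difference in sample means
(41.750 - 32.429) / se_diff                    # t ≈ 0.91, matching t.test()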

This is the standard (“frequentist”) method.

One way to make a Bayesian inference is to use
the R package called BayesianFirstAid and the
command bayes.t.test. This uses Markov Chain
Monte Carlo…
We could also go to
http://www.sumsar.net/best_online/ and use the
applet there.
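
A sketch of what that call might look like (my own, untested here; it assumes JAGS is installed and that bayes.t.test mirrors the t.test formula interface, as the package advertises):

# install.packages("devtools")
# devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
fit <- bayes.t.test(bacteria ~ Group, data = SoapData)   # MCMC fit of a BEST-style model
fit          # posterior summaries for the group means and their difference
plot(fit)    # posterior distributions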
A Bayesian perspective

In loose terms, the p-value is about an order of magnitude
smaller than Pr{H0 is true} when comparing a typical H0 to HA
under a Bayesian analysis with a vague prior on HA.
That is, if P = 0.01, then think "There is about a 10% -- not
a 1% -- chance that H0 is true (i.e., that using H0 as the
model is better than using HA as the model)."
If P = 0.05, then there is perhaps a 50-50 chance that the
drug is really better than the placebo; etc.
See Berger and Sellke (1987) “Testing a Point Null
Hypothesis: The Irreconcilability of P Values and
Evidence.” JASA, vol 82, Issue 397, 112 – 122.
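One concrete calibration along these lines (my addition; it uses the later Sellke, Bayarri, and Berger (2001) bound rather than the 1987 paper's exact calculations): for p < 1/e, the minimum Bayes factor in favor of H0 is −e·p·ln(p), which with 50-50 prior odds gives a lower bound on Pr(H0 | data).

p       <- c(0.05, 0.01, 0.001)
bf_h0   <- -exp(1) * p * log(p)    # minimum Bayes factor in favor of H0 (valid for p < 1/e)
post_h0 <- bf_h0 / (1 + bf_h0)     # lower bound on Pr(H0 | data) with 50-50 prior odds
round(post_h0, 3)                  # ~0.289, 0.111, 0.018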
A large P-value doesn’t mean that H0 is true. It means that
such data might arise by chance alone if H0 is true. But it
might well be that H0 is false and the data arose under HA
(and were compatible with both H0 and HA).
Beware of data snooping…
See http://xkcd.com/1478/
Publication bias refers to the fact that
(statistically) significant results are much more
likely to be published than are non-sig results.
The file drawer problem… Non-significant results end up
“in the file drawer” rather than being published, or
even submitted for publication.

23
Publication bias
One study looked at 10 years of papers
indexed in PubMed and identified 4970
observational studies of medical treatments.
82% of them had statistically significant results
at the 0.05 level.
Another study looked at 1046 research
articles in three clinical psychology journals.
86% of them used statistical tests; 94% of
these rejected H0 at the 0.05 level.

24
Ben Goldacre TED MED talk

25
2005 paper in PLoS Medicine

A simplified view of the Ioannidis paper:
Imagine 1000 research projects. We might expect H0 to be
true 920 times. We might expect power to be 50%.

920 H0 true  → 920 * 0.05 = 46 false alarms
 80 H0 false →  80 * 0.50 = 40 true alarms

86 alarms in all, and fewer than half of them are true.
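
The same bookkeeping in a few lines of R (my own sketch of the slide's arithmetic):

n_studies <- 1000
pr_h0     <- 0.92    # 920 of the 1000 nulls assumed true
alpha     <- 0.05
power     <- 0.50
false_alarms <- n_studies * pr_h0 * alpha         # 46
true_alarms  <- n_studies * (1 - pr_h0) * power   # 40
true_alarms / (true_alarms + false_alarms)        # ~0.47: under half the alarms are real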

26
2013 paper, Statistics in Medicine
“Our experiment provides evidence that the majority of
observational studies would declare statistical
significance when no effect is present. Empirical
calibration was found to reduce spurious results to the
desired 5% level. Applying these adjustments to literature
suggests that at least 54% of findings with p < 0.05 are not
actually statistically significant and should be
reevaluated.”

Schuemie et al., Statist. Med. 2014, 33: 209–218

27
Garden of Forking Paths
(“researcher degrees of freedom”)

See Gelman and Loken (2014), American
Scientist, vol. 102, no. 6, page 460+
Do I include that outlier?
Should I do a separate analysis for women?
What about for people over age 40?
It makes sense to exclude participants we later
found out were not native English speakers, right?
Etc.
28
The Reproducibility Project
Attempting to reproduce 100 research findings in
three major psychology journals. Only 39 of them
were classified as replications. (Of the 61 non-
replications, 24 had “at least moderately similar
findings” and 37 failed to meet even that standard.)
97 of the 100 original studies had “statistically
significant” results, but only 36 of the replications
did.

29
Note: The Reproducibility Project has its critics.
See http://science.sciencemag.org/content/351/6277/1037.2

And a response:
https://hardsci.wordpress.com/2016/03/03/evaluating-a-
new-critique-of-the-reproducibility-project/

And also: http://andrewgelman.com/2016/03/05/29195/

30
NIH new (2015) stat guidelines
See http://www.nih.gov/about/reporting-preclinical-research.htm
for a statement of “principles with the aim of facilitating the
interpretation and repetition of experiments”

31
