P Valuejaw 2

What is a P-value?
Jeff Witmer
21 April 2016
P-value ≠ Pr(H0 is true)!
Pr(A|B) ≠ Pr(B|A)
“Probability of B given A”
Pr(data|H0) ≠ Pr(H0|data)
Pr(clouds|rain) ≠ Pr(rain|clouds)
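A small numerical sketch of why the two conditional probabilities differ. The three input probabilities are hypothetical, chosen only for illustration; they do not come from the slides.

```python
# Hypothetical inputs, for illustration only.
p_rain = 0.10                # Pr(rain) on a given day
p_clouds_given_rain = 0.99   # Pr(clouds | rain)
p_clouds_given_dry = 0.40    # Pr(clouds | no rain)

# Law of total probability: Pr(clouds).
p_clouds = p_clouds_given_rain * p_rain + p_clouds_given_dry * (1 - p_rain)

# Bayes' rule: Pr(rain | clouds) = Pr(clouds | rain) * Pr(rain) / Pr(clouds).
p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds

print(f"Pr(clouds | rain) = {p_clouds_given_rain:.2f}")
print(f"Pr(rain | clouds) = {p_rain_given_clouds:.2f}")
```

Under these assumed numbers it is almost always cloudy when it rains, yet it rains on only about one cloudy day in five.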
Doing a hypothesis test means making a
decision.
            Reject H0       Retain H0
H0 true     Type I error    OK
H0 false    OK              Type II error
α = Type I error rate = Pr(Type I error if H0 is true)
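The definition of α can be checked by simulation. The sketch below assumes a two-sided z-test of H0: µ = 0 with known σ = 1 and a 1.96 cutoff; the function name, sample size, and seed are illustrative choices, not part of the slides.

```python
import random
import statistics

def type_i_error_rate(n_tests=5000, n=25, z_crit=1.96, seed=1):
    """Fraction of two-sided z-tests that reject H0: mu = 0 when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Data are generated with H0 true: mu = 0, sigma = 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z = xbar / (sigma / sqrt(n)) with sigma = 1.
        z = statistics.fmean(sample) * n ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_tests

rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

Every rejection here is a Type I error, because H0 is true in every simulated data set; the observed rate should land near 0.05.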
Examples (?)
George stands trial for a crime (e.g., burglary).
What is H0, Ha? Type I error? Type II error?
H0: George is innocent
Ha: He is guilty
Type I error: innocent man convicted
Type II error: guilty man set free
Note: “not guilty” ≠ “innocent”
Note: A grand jury might have looked at several possible
defendants and only agreed to let the DA bring forward
George’s case. I.e., George was not chosen randomly to
stand trial. If we were to randomly choose defendants, then we
would make lots of Type I errors over the many trials.
Susan goes to her doctor because she thinks she is ill.
H0: Susan is well
Ha: She is sick
Type I error: give unnecessary drugs

Type II error: fail to treat a sick person
Fred reads that aliens landed at Roswell, NM in 1947.
Should he believe this?
H0: No such thing happened
Ha: There is a conspiracy of silence
Type I error: believe nonsense (be gullible)

Type II error: miss real news (be too
skeptical)
Expert witness work
Consider the question asked, then give one of
the six acceptable answers:
Yes
No
I don’t know
I don’t remember
Could you please repeat the question?
Green
Not “The car was a green Honda with a sunroof, NY
license plates, and the radio was blaring.”
Q: What color was the car? A: Green
Hypothesis test
No matter what question you wish the test
would answer, a hypothesis test only answers
one question.
Not “This model is probably true.”
Not “The effect of the drug is large.”
Not “People should care about the difference I have
found.”
Q: Are the data consistent with the model (such that
any deviation from the model could reasonably have
happened by chance)? A: Yes (or No)
See the Dance of the P-values
https://www.youtube.com/watch?v=ez4DgdurRPg
See the Amer. Stat. Assoc. paper on P-values
http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108
One way to introduce H0 testing:
See “Introduction to Hypothesis Testing” in
Activity Based Statistics.
Basic idea: Pretend to be doing something
that is random (e.g., tossing a fair coin) while
secretly doing something that is fixed (e.g.,
tossing a two-tailed coin). “Weird” data arise
under the unstated null hypothesis of
fairness. Students often become suspicious
around the time that the P-value approaches
1/32.
See http://www.openintro.org/why05 for a nice activity.
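The 1/32 figure is what a fair coin gives for five tails in five tosses. A minimal sketch, assuming the P-value is taken as the one-sided probability of an unbroken run of tails under H0 (the function name is illustrative):

```python
from fractions import Fraction

def prob_all_tails(k):
    """Pr(k tails in k tosses) under H0: the coin is fair."""
    return Fraction(1, 2) ** k

for k in range(1, 7):
    p = prob_all_tails(k)
    print(f"{k} tails in a row: P = {p} = {float(p):.4f}")
```

Suspicion setting in around the fifth toss corresponds to P = 1/32, a little over 0.03.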
Large n makes everything “significant”
Suppose 31% of American households have a cat.
Suppose we test H0: p = 1/3 vs HA: p ≠ 1/3.
sample proportion   n       Z       P-value
0.31                1000    -1.56   0.117
0.31                2000    -2.21   0.027
0.31                3000    -2.71   0.0067
0.31                10000   -4.95   0.0000007
Very strong evidence for HA, but do we care that 31% ≠ 1/3?
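The table can be recomputed with a one-proportion z-test. This sketch assumes the usual normal approximation with the standard error evaluated at the null value p0 = 1/3; the function name is illustrative.

```python
from statistics import NormalDist

def one_proportion_z_test(p_hat, p0, n):
    """Two-sided z-test of H0: p = p0, using the null standard error."""
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

for n in (1000, 2000, 3000, 10000):
    z, p = one_proportion_z_test(0.31, 1 / 3, n)
    print(f"n = {n:>5}   Z = {z:6.2f}   P-value = {p:.2g}")
```

The observed proportion never changes; only n does, and the P-value falls by orders of magnitude.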
So we should think about effect size! That’s a topic for
another day, but see http://rpsychologist.com/d3/cohend/
The difference b/w “significant” and “not
significant” is not, itself, statistically
significant (Gelman and Stern, 2006)
Consider testing whether an effect is zero.
                      mean   SE   H0?
Group 1               25     10   Reject
Group 2               10     10   Retain
Group 1 vs Group 2    15     14   Retain!
There is evidence that µ1 is not zero.
There is not sufficient evidence to say
that µ2 is not zero.
There is not sufficient evidence to say
that µ1 is different than µ2.
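The three verdicts follow from the means and standard errors in the table. A sketch, assuming independent groups (so the variances of the two estimates add) and a two-sided 1.96 cutoff; the helper name is illustrative.

```python
def z_and_verdict(estimate, se, z_crit=1.96):
    """z statistic for H0: effect = 0, and the resulting decision."""
    z = estimate / se
    return z, ("Reject" if abs(z) > z_crit else "Retain")

mean1, se1 = 25, 10
mean2, se2 = 10, 10

# Difference of independent estimates: the variances add.
diff = mean1 - mean2
se_diff = (se1 ** 2 + se2 ** 2) ** 0.5

for label, est, se in [("Group 1", mean1, se1),
                       ("Group 2", mean2, se2),
                       ("Group 1 vs Group 2", diff, se_diff)]:
    z, verdict = z_and_verdict(est, se)
    print(f"{label:<20} z = {z:4.2f}  SE = {se:5.2f}  {verdict}")
```

One group clears the cutoff and the other does not, yet the difference between them sits about one standard error from zero.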
# older brothers is “sig”
# older sisters is “not sig”
But we can’t say that “# older brothers” and “# older sisters” are different.
Two (different) Ideas
Fisher: Determine whether the data are consistent with a
given model/hypothesis.
Neyman and Pearson: Choose between two hypotheses.
STAT 101: Blend Fisher’s idea and the Neyman/Pearson
idea into a single method – as if this made sense…
See Chance 2008 article on tables: Student and Pearson
had tables that gave probabilities for given df and test stat.
Fisher didn’t want to ask for copyright permission and so
created tables with critical values presented for given df
and alpha. This contributed to fixed-level (alpha) testing.
R.A. Fisher in 1925:
“… An observation is judged significant, if it would rarely have
been produced, in the absence of a real cause of the kind we
are seeking. It is a common practice to judge a result significant,
if it is of such a magnitude that it would have been produced by
chance not more frequently than once in twenty trials. This is an
arbitrary, but convenient, level of significance for the practical
investigator, but it does not mean that he allows himself to be
deceived once in every twenty experiments. The test of
significance only tells him what to ignore, namely all
experiments in which significant results are not obtained. He
should only claim that a phenomenon is experimentally
demonstrable when he knows how to design an experiment so
that it will rarely fail to give a significant result. Consequently,
isolated significant results which he does not know how to
reproduce are left in suspense pending further investigation.”
Digression: Bayesian Inference
A Two-sample t-Test
H0: μsoap = μcon
HA: μsoap ≠ μcon
> favstats(bacteria ~ Group, data=SoapData)
    Group min    Q1 median   Q3 max   mean     sd n missing
1 control  21 33.75     37 49.5  66 41.750 15.636 8       0
2    soap   6 21.00     27 38.0  76 32.429 22.832 7       0
> t.test(bacteria ~ Group, data=SoapData)
        Welch Two Sample t-test
data:  bacteria by Group
t = 0.91, df = 10.4, p-value = 0.38
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.387  32.030
sample estimates:
mean in group control    mean in group soap
               41.750                32.429
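The t statistic and degrees of freedom in this output can be recovered from the favstats summary alone. A sketch in Python, using the standard Welch formulas on the rounded means, SDs, and sample sizes shown above; the function name is illustrative, and the P-value is left out because the standard library has no t distribution.

```python
def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic and Welch-Satterthwaite df from summary stats."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard errors of each mean
    se = (v1 + v2) ** 0.5                   # SE of the difference in means
    t = (mean1 - mean2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, se

# control: mean 41.750, sd 15.636, n 8; soap: mean 32.429, sd 22.832, n 7
t, df, se = welch_t(41.750, 15.636, 8, 32.429, 22.832, 7)
print(f"t = {t:.2f}, df = {df:.1f}, SE of difference = {se:.2f}")
```

The result agrees with the t = 0.91 and df = 10.4 that R reports.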
This is the standard (“frequentist”) method.

One way to make a Bayesian inference is to use
the R package called BayesianFirstAid and the
command bayes.t.test. This uses Markov Chain
Monte Carlo…
We could also go to
http://www.sumsar.net/best_online/ and use the
applet there.
A Bayesian perspective
In loose terms, the p-value is about an order of magnitude
smaller than Pr{H0 true} when comparing a typical H0 to HA
under a Bayesian analysis with a vague prior on HA.
That is, if P=0.01, then think "There is about a 10% -- not
a 1% -- chance that H0 is true (i.e., that using H0 as the
model is better than using HA as the model)."
If P=0.05 then there is perhaps a 50-50 chance that the
drug is really better than the placebo; etc.
See Berger and Sellke (1987) “Testing a Point Null
Hypothesis: The Irreconcilability of P Values and
Evidence.” JASA, vol 82, Issue 397, 112 – 122.
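One way to put rough numbers on this is the calibration -e p ln(p), a lower bound on the Bayes factor in favor of H0 that holds for p < 1/e. This is not the calculation in the 1987 paper cited above; it is a later closed-form calibration due to Sellke, Bayarri, and Berger (2001), swapped in here because it is easy to compute. The 50-50 prior on H0 and the function name are assumptions of this sketch.

```python
import math

def min_posterior_prob_h0(p_value, prior_h0=0.5):
    """Lower bound on Pr(H0 | data) from the -e * p * ln(p) calibration."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    # Lower bound on the Bayes factor for H0 versus HA.
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = bayes_factor_bound * prior_odds
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(f"P = {p}: Pr(H0 | data) is at least {min_posterior_prob_h0(p):.3f}")
```

With P = 0.01 the bound is about 0.11, in line with the "about 10%" rule of thumb; with P = 0.05 it is about 0.29, and since it is only a lower bound, a figure nearer 50-50 is not ruled out.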
A large P-value doesn’t mean that H0 is true. It means that
such data might arise by chance alone if H0 is true. But it
might well be that H0 is false and the data arose under HA
(and were compatible with both H0 and HA).
Beware of data snooping…
See http://xkcd.com/1478/
Publication bias refers to the fact that
(statistically) significant results are much more
likely to be published than are non-sig results.
The file drawer problem… Non-significant results
end up “in the file drawer” rather than being
published, or even submitted for publication.
Publication bias
One study looked at 10 years of papers
indexed in PubMed and identified 4970
observational studies of medical treatments.
82% of them had statistically significant results
at the 0.05 level.
Another study looked at 1046 research
articles in three clinical psychology journals.
86% of them used statistical tests; 94% of
these rejected H0 at the 0.05 level.
Ben Goldacre TED MED talk
2005 paper in PLoS Medicine
A simplified view of the Ioannidis paper:
Imagine 1000 research projects. We might expect H0 to be
true 920 times. We might expect power to be 50%.
920 H0 true → 920 * 0.05 = 46 false alarms
80 H0 false → 80 * 0.50 = 40 true alarms
86 alarms, fewer than half are true
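The same arithmetic as a small function, so the inputs can be varied. The defaults are the slide's numbers (1000 projects, H0 true in 920, α = 0.05, power 50%); the function name is illustrative.

```python
def alarm_breakdown(n_projects=1000, n_h0_true=920, alpha=0.05, power=0.50):
    """Expected false alarms, true alarms, and the share of alarms that are true."""
    n_h0_false = n_projects - n_h0_true
    false_alarms = n_h0_true * alpha    # H0 true but rejected
    true_alarms = n_h0_false * power    # H0 false and rejected
    share_true = true_alarms / (false_alarms + true_alarms)
    return false_alarms, true_alarms, share_true

false_alarms, true_alarms, share_true = alarm_breakdown()
print(f"{false_alarms:.0f} false alarms, {true_alarms:.0f} true alarms")
print(f"Share of alarms that are true: {share_true:.1%}")
```

With the slide's inputs about 47% of the alarms are true; raising the assumed power or the share of false nulls pushes that figure up.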
2013 paper, Statistics in Medicine
“Our experiment provides evidence that the majority of
observational studies would declare statistical significance when
no effect is present. Empirical calibration was found to reduce
spurious results to the desired 5% level. Applying these
adjustments to literature suggests that at least 54% of findings with
p < 0.05 are not actually statistically significant and should be
reevaluated.”
Schuemie et al., Statist. Med. 2014, 33 209–218
Garden of Forking Paths
(“researcher degrees of freedom”)
See Gelman and Loken (2014), American
Scientist, vol. 102, no. 6, page 460+
Do I include that outlier?
Should I do a separate analysis for women?
What about for people over age 40?
It makes sense to exclude participants we later
found out are not native English speakers, right?
Etc.
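A rough sketch of why these choices matter. It assumes, as a deliberate simplification, that each choice amounts to an independent test at α = 0.05 with H0 true throughout; real forking paths reuse the same data, so the analyses are correlated, and the researcher may only ever run one of them. The function name is illustrative.

```python
def prob_at_least_one_significant(k, alpha=0.05):
    """Pr(at least one of k independent tests rejects) when every H0 is true."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 5, 10, 20):
    p = prob_at_least_one_significant(k)
    print(f"{k:>2} possible analyses: Pr(some 'significant' result) = {p:.3f}")
```

Under this simplification, twenty available analyses give roughly a 64% chance that at least one comes out "significant" by chance alone.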
The Reproducibility Project
Attempting to reproduce 100 research findings in
three major psychology journals. Only 39 of them
were classified as replications. (Of the 61 non-
replications, 24 had “at least moderately similar
findings” and 37 failed to meet even that standard.)
97 of the 100 original studies had “statistically
significant” results, but only 36 of the replications
did.
Note: The Reproducibility Project has its critics.
See http://science.sciencemag.org/content/351/6277/1037.2
And a response:
https://hardsci.wordpress.com/2016/03/03/evaluating-a-new-critique-of-the-reproducibility-project/
And also: http://andrewgelman.com/2016/03/05/29195/
NIH new (2015) stat guidelines
See
http://www.nih.gov/about/reporting-preclinical-research.htm
for a statement of “principles with the aim of
facilitating the interpretation and repetition of
experiments”