Chapter 3: Statistics: Springerbriefs in Applied Sciences and Technology May 2017
Joost C.F. de Winter · Dimitra Dodou
Human Subject Research for Engineers: A Practical Guide
Department of BioMechanical Engineering, Faculty of Mechanical, Maritime and
Materials Engineering, Delft University of Technology, Delft, The Netherlands
Abstract After the measurements have been completed, the data have to be
statistically analysed. This chapter explains how to analyse data and how to conduct
statistical tests. We explain the differences between a population and a sample, data
distributions, descriptive statistics (i.e., statistics describing a sample: central
tendency, variability, and effect sizes, including Cohen's d and correlation
coefficients), and inferential statistics (i.e., statistics used to infer characteristics of a
population based on a sample taken from that population: the standard error of the
mean, null hypothesis significance testing, univariate and multivariate statistics).
We draw attention to pitfalls that may occur in statistical analyses, such as
misinterpretations of null hypothesis significance testing and false positives. Attention
is also drawn to questionable research practices and their remedies. Replicability of
research is also discussed, and recommendations for maximizing replicability are
provided.
The results section of a research paper usually includes descriptive statistics and
inferential statistics. The aim of descriptive statistics is to summarize the charac-
teristics of the data, whereas inferential statistics are used to test hypotheses or to
make estimates about a population.
The chapter covers the essentials of statistics in a concise manner; it does not
offer a comprehensive guide to statistics. The website http://stats.stackexchange.com
provides answers to many statistical questions. Well-known textbooks are
Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by
Cohen et al. (1983; 3rd ed. 2003), with 149,222 citations in Google Scholar as of
19 February 2017, Using Multivariate Statistics by Tabachnick and Fidell (1989;
6th ed. 2012) with 65,150 citations, and Discovering Statistics Using SPSS by Field
(2000; 4th ed. 2013) with 30,988 citations. Note that in this book we cover only
frequentist and not Bayesian inference.
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (3.1)

s = \sqrt{s^2}    (3.2)

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2    (3.3)
In human subject research, the unit of analysis is usually the participant. Thus, in
Eqs. (3.1) and (3.2), a data point xi is the score of a participant on a measure, and
n is the number of participants. For example, if there are 10 scores per participant
(e.g., 10 reaction times per participant) and 20 participants, then n = 20, not
n = 200; one should first calculate aggregate scores per participant (e.g., the mean
across the 10 reaction times) and subsequently calculate the mean and standard
deviation across the 20 participants.
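This aggregation step can be sketched as follows; a minimal Python/NumPy illustration of the MATLAB workflow described above, with made-up reaction-time data (20 participants, 10 trials each):

```python
import numpy as np

rng = np.random.default_rng(0)
rt = rng.normal(0.5, 0.1, size=(20, 10))  # 20 participants x 10 reaction times (s); made-up data

# First aggregate within each participant, then describe across participants:
participant_means = rt.mean(axis=1)       # one aggregate score per participant
m = participant_means.mean()              # sample mean, Eq. (3.1), with n = 20
s = participant_means.std(ddof=1)         # sample standard deviation, Eqs. (3.2)-(3.3)
print(m, s)
```

Note that `ddof=1` gives the n − 1 denominator of Eq. (3.3); pooling all 200 trials into one vector would wrongly treat n as 200.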
Other descriptive measures are the median (equivalent to the 50th percentile),
skewness, and kurtosis (median, prctile, skewness, and kurtosis). The
median is a robust measure of central tendency, which means that it is insensitive to
outliers. Skewness is a measure of the symmetry of the distribution, and kurtosis is
a measure of the tailedness of the distribution (DeCarlo 1997). A normal distri-
bution has a skewness of 0 and a kurtosis of 3 (note that kurtosis minus 3 is also
called ‘excess kurtosis’). A distribution with kurtosis less than 3 is called
platykurtic, whereas a distribution with kurtosis greater than 3 is called leptokurtic.
Figure 3.1 shows a Student’s t distribution and an exponential distribution, which
are both leptokurtic distributions, meaning that these distributions have heavy tails
relative to the normal distribution.
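These kurtosis values can be checked numerically; a Python/NumPy sketch in which skewness and kurtosis are computed from the standardized third and fourth central moments (kurtosis here is the plain fourth moment, not excess kurtosis):

```python
import numpy as np

def skewness(x):
    """Third standardized moment (0 for a symmetric distribution)."""
    z = np.asarray(x) - np.mean(x)
    return (z**3).mean() / (z**2).mean()**1.5

def kurtosis(x):
    """Fourth standardized moment (3 for a normal distribution)."""
    z = np.asarray(x) - np.mean(x)
    return (z**4).mean() / (z**2).mean()**2

rng = np.random.default_rng(1)
normal_sample = rng.standard_normal(1_000_000)
exponential_sample = rng.exponential(1.0, 1_000_000)

print(skewness(normal_sample), kurtosis(normal_sample))            # near 0 and 3
print(skewness(exponential_sample), kurtosis(exponential_sample))  # near 2 and 9
```

The exponential distribution has population skewness 2 and kurtosis 9, i.e., it is leptokurtic, consistent with its heavy right tail in Fig. 3.1.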
Fig. 3.1 Probability density function of (1) a normal distribution (which is equivalent to a
t distribution with infinite degrees of freedom), (2) a Student's t distribution with five degrees of
freedom (df = 5), (3) a Student's t distribution with df = 5, but now scaled so that the variance
equals 1 (if a distribution with high kurtosis is scaled to unit variance, the high kurtosis appears as
heavy tails), and (4) an exponential distribution
Next to measures of central tendency (e.g., mean) and spread (e.g., standard
deviation), it is customary to report effect sizes in a paper.
3.2.2.1 Cohen’s d
A common measure of effect size is Cohen’s d, which describes how much two
samples (x1 and x2) differ on a variable of interest with respect to each other. d is
calculated as the difference in means divided by the pooled standard deviation of
the two samples (Eq. (3.4)). In MATLAB, d is calculated as follows:
n1=length(x1); n2=length(x2);
d=(mean(x1)-mean(x2))/(sqrt(((n1-1)*std(x1)^2+(n2-1)*std(x2)^2)/(n1+n2-2))).
d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}}    (3.4)
Additionally, researchers often report the correlation matrix among the variables
involved in the study. A correlation matrix allows one to gauge how strongly the
variables are related to each other.
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}    (3.5)
Fig. 3.2 Two normally distributed variables (n = 1000) sampled from two populations having
different Pearson correlation coefficients (the population correlation coefficient is designated by the
symbol ρ and corresponds to the slope of the magenta dashed line). x and y have been drawn from
a normal distribution population with μ = 0 and σ = 1
Cohen’s d represents the magnitude of the difference between two samples (x1, x2),
whereas r is the association between two variables (x, y) for the same sample.
However, r can also be used to describe the magnitude of the difference between
two samples, in which case it is called the point-biserial correlation coefficient. The
point-biserial correlation coefficient is calculated from Eq. (3.5), with one variable
being dichotomous (i.e., containing zeros and ones, which represent the group the
data point belongs to) and the other variable being the pooled vectors of both
samples. In MATLAB, the point-biserial correlation is calculated as follows:
rpb=corr([ones(n1,1);zeros(n2,1)],[x1;x2]). The point-biserial
correlation is related to d according to Eq. (3.6) (Hedges and Olkin 1985; for more
conversions between effect size measures, see Aaron et al. 1988; Rosenthal 1994).
In MATLAB the point-biserial correlation can be calculated based on d as follows:
rpb=d/sqrt(d^2+(n1+n2)*(n1+n2-2)/(n1*n2)).
r_{pb} = \frac{d}{\sqrt{d^2 + \frac{(n_1+n_2)(n_1+n_2-2)}{n_1 n_2}}}    (3.6)
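Eqs. (3.4) and (3.6) can be checked numerically; a Python/NumPy sketch of the MATLAB snippets above, with made-up samples x1 and x2:

```python
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.normal(1.0, 1.0, 40), rng.normal(0.0, 1.0, 60)  # made-up samples
n1, n2 = len(x1), len(x2)

# Cohen's d, Eq. (3.4): mean difference over the pooled standard deviation
sp = np.sqrt(((n1-1)*x1.var(ddof=1) + (n2-1)*x2.var(ddof=1)) / (n1+n2-2))
d = (x1.mean() - x2.mean()) / sp

# Point-biserial r: Pearson correlation of a group indicator with the pooled data
g = np.concatenate([np.ones(n1), np.zeros(n2)])
r_direct = np.corrcoef(g, np.concatenate([x1, x2]))[0, 1]

# Conversion from d, Eq. (3.6)
r_from_d = d / np.sqrt(d**2 + (n1+n2)*(n1+n2-2)/(n1*n2))
print(r_direct, r_from_d)  # the two values agree
```

The agreement is exact (up to floating-point error), because Eq. (3.6) is an algebraic identity when d uses the pooled standard deviation.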
Just like the median is more robust than the mean, so is the Spearman rank-order
correlation more robust than the Pearson correlation. The Spearman correlation is
calculated in the same way as the Pearson correlation, except that the data are first
converted to ranks (De Winter et al. 2016). Thus, corr(tiedrank(x),tiedrank(y))
and corr(x,y,'type','spearman') give identical results. It is
advisable to use the Spearman correlation when one expects that the variables have
high kurtosis or when outliers may be present.
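The same rank-then-correlate logic can be expressed outside MATLAB; a Python/NumPy sketch with made-up data and one injected outlier (a simple ranking function without tie handling, which suffices for continuous data):

```python
import numpy as np

def ranks(x):
    """Rank from 1..n (no tie handling; adequate for continuous, tie-free data)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = x + rng.normal(size=200)
y[0] = 50.0  # inject an outlier

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]  # Pearson on ranks = Spearman
print(pearson, spearman)  # Spearman is far less affected by the outlier
```

For tie-free ranks this also matches the textbook shortcut formula 1 − 6Σd²/(n(n² − 1)), where d is the rank difference per pair.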
Other effect size measures are risk ratios and odds ratios, which are particularly
used in the medical field (De Winter et al. 2016). Referring to the example of
Fig. 2.4, where using a monocular display was the risk factor and experiencing
visual complaints was the outcome, one could create a 2 × 2 contingency table
(Table 3.1) describing the exposure of the participants to the risk factor and their
status with respect to the outcome.
The risk ratio (RR) is defined as the ratio of the probability of the outcome being
present in the group of participants exposed to the risk factor to the probability of
the outcome being present in the group of participants not exposed to the risk factor
(Eq. (3.7)):
RR = \frac{A/(A+B)}{C/(C+D)}    (3.7)
The odds ratio (OR) is defined as the ratio of the odds of the outcome being present
in the group of participants exposed to the risk factor to the odds of the outcome
being present in the group of participants not exposed to the risk factor, where the
odds is defined as the number of participants with the outcome present divided by
the number of participants with the outcome not being present (Eq. (3.8)):
OR = \frac{A/B}{C/D}    (3.8)
Table 3.1 Contingency table of participant counts based on their status with respect to the risk
factor and the outcome variable

                               Outcome
Risk factor                    Experiencing visual complaints   Not experiencing visual complaints
Using monocular displays       A                                B
Not using monocular displays   C                                D
OR and RR should not be confused with each other. If the probability (prevalence)
of the outcome is low (i.e., A/(A + B) < 20%), then OR can be approximated with
RR; if the probability of the outcome is high, however, OR is considerably higher
than RR (Davies et al. 1998; Schmidt and Kohlmann 2008). OR can be converted to
RR according to Eq. (3.9) (Zhang and Yu 1998):

RR = \frac{OR}{1 - \frac{C}{C+D} + \frac{C}{C+D} \, OR}    (3.9)
r_{pb} = \phi = \frac{AD - BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}    (3.10)
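Eqs. (3.7) to (3.10) are easy to verify with a made-up table (the counts A = 30, B = 70, C = 10, D = 90 below are illustrative, not data from this book):

```python
import math

A, B, C, D = 30, 70, 10, 90  # hypothetical 2 x 2 contingency table counts

rr = (A / (A + B)) / (C / (C + D))                    # Eq. (3.7): risk ratio
odds_ratio = (A / B) / (C / D)                        # Eq. (3.8): odds ratio
p0 = C / (C + D)                                      # outcome risk in the unexposed group
rr_from_or = odds_ratio / (1 - p0 + p0 * odds_ratio)  # Eq. (3.9): OR converted back to RR
phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))  # Eq. (3.10)

print(rr, odds_ratio, rr_from_or, phi)
```

Here the outcome is not rare (30% among the exposed), so OR (about 3.86) exceeds RR (3.0), as the text warns; Eq. (3.9) recovers RR exactly given the risk in the unexposed group.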
Descriptive statistics in tables are useful but provide an incomplete picture of the
raw data. The importance of figures is nicely illustrated by Anscombe’s quartet
(Fig. 3.3; Anscombe 1973). Each of these four datasets has the same means for
x and y (9 and 7.5, respectively), the same variances for x and y (11 and 4.12,
respectively), and the same correlation between x and y (r = 0.82) (see Matejka
and Fitzmaurice 2017 for further examples).
Useful figures for describing data are the histogram (histc or histcounts),
the boxplot (boxplot), time series, Fourier analysis (using fft), and scatter plots
(plot or scatter). No matter how data are plotted, it is important that not only
the central tendency (the mean or median) can be distinguished, but also the
variability (standard deviation, percentile values, or raw data).
Fig. 3.4 Results of a simulation where the sample mean is calculated for 1,000,000 samples
drawn from a normal distribution population with μ = 0 and σ = 1
Fig. 3.5 Results of a simulation where the sample mean is calculated for 1,000,000 samples
drawn from an exponentially distributed population with μ = 1 and σ = 1
Table 3.2 The standard deviation of the sample mean as observed from the above simulations, in
comparison with the expected value of 1/√n

                                        n = 1   n = 2   n = 5   n = 20   n = 50
1/√n                                    1.000   0.707   0.447   0.224    0.141
Figure 3.4 (Normal distribution)        1.000   0.707   0.447   0.224    0.142
Figure 3.5 (Exponential distribution)   0.998   0.707   0.447   0.224    0.142
distribution, which is in agreement with the central limit theorem. Again, the
SEM decreases according to the square root of n.
Table 3.2 shows the SEMs observed in the above simulations in com-
parison with the expected value. It can be seen that Eq. (3.11) holds
regardless of the distribution of the population (e.g., normal or exponential).
Note that in the simulations, the standard deviation of the population was
known (σ = 1). In reality, the standard deviation is unknown and must be
estimated from the sample. Because the sample standard deviation (s) is a biased estimate of
the population standard deviation (see Textbox 3.1), the SEM based on the
sample standard deviation (Eq. (3.11)) is a biased estimate of the SEM based
on the population standard deviation.
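The simulation behind Table 3.2 can be reproduced in a few lines; a Python/NumPy sketch (using 100,000 rather than 1,000,000 samples, to keep it quick):

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 100_000  # number of samples per condition

for n in (1, 2, 5, 20, 50):
    # Draw `reps` samples of size n and take the mean of each sample
    means_normal = rng.standard_normal((reps, n)).mean(axis=1)
    means_expon = rng.exponential(1.0, (reps, n)).mean(axis=1)
    # Observed SD of the sample means versus the expected SEM of 1/sqrt(n)
    print(n, 1/np.sqrt(n), means_normal.std(ddof=1), means_expon.std(ddof=1))
```

Both observed columns track 1/√n regardless of whether the population is normal or exponential, as Table 3.2 shows.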
Hypothesis testing can take different forms. The most common form is that of null
hypothesis significance testing, in which there are four possibilities: (1) correctly
rejecting the null hypothesis, (2) correctly accepting the null hypothesis, (3) rejecting
the null hypothesis when it is true (Type I error), and (4) accepting the null
hypothesis when it is false (Type II error) (Table 3.3; see also Fig. 3.6). The
probability of rejecting the null hypothesis when it is false is the statistical power, or
1 − β, where β is the probability of a Type II error. In other words, the statistical
power is the probability of not making a Type II error. The significance level α is
the probability of a Type I error, that is, the probability of rejecting the null
hypothesis when it is true.
Fig. 3.6 Left Illustration of a Type I error (false positive; i.e., to report there is something while
there is nothing). Right Illustration of a Type II error (false negative; i.e., to report there is nothing
while there is something). Photo on the left taken from Wikimedia Commons (https://commons.
wikimedia.org/wiki/File:Toyota_Curren_ST-206_1996_parking.jpg). Author: Qurren. Created: 26
April 2006. Photo on the right taken from Wikimedia Commons (https://commons.wikimedia.org/
wiki/File:Parking_violation_Vaughan_Mills.jpg). Author: Challisrussia. Created: 20 November
2011. Photo of the policeman adapted from Wikimedia Commons (https://commons.wikimedia.
org/wiki/File:British_Policeman.jpg). Author: Southbanksteve. Created: 15 November 2006
If the data are sampled from a population having a normal distribution with equal
variances, then the Student’s t test is the most powerful unbiased test. This means
that the t test gives the maximum probability of correctly rejecting the null
hypothesis (it maximizes 1 − β) while maintaining the nominal Type I error rate (α).
The independent-samples Student’s t test works as follows. It first calculates a
t statistic according to Eq. (3.12):
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}    (3.12)
The t statistic is larger (1) when the difference between the means of the two
samples is larger, (2) when the standard deviations of the samples are smaller, and
(3) when the sample size is larger. The t statistic describes the distance between the
two groups and is related to Cohen’s d (Eq. (3.4)) according to Eq. (3.13) (Aaron
et al. 1988; Rosenthal 1994):
t = \frac{d}{\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}    (3.13)
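Eqs. (3.12) and (3.13) can be checked against each other numerically; a Python/NumPy sketch with made-up samples:

```python
import numpy as np

rng = np.random.default_rng(5)
x1, x2 = rng.normal(1.0, 1.0, 12), rng.normal(0.0, 1.0, 15)  # made-up samples
n1, n2 = len(x1), len(x2)

# Pooled standard deviation, shared by Eq. (3.4) and Eq. (3.12)
sp = np.sqrt(((n1-1)*x1.var(ddof=1) + (n2-1)*x2.var(ddof=1)) / (n1+n2-2))

t = (x1.mean() - x2.mean()) / (sp * np.sqrt(1/n1 + 1/n2))  # Eq. (3.12)
d = (x1.mean() - x2.mean()) / sp                           # Eq. (3.4)
t_from_d = d / np.sqrt(1/n1 + 1/n2)                        # Eq. (3.13)
print(t, t_from_d)  # identical
```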
From the t statistic, the p value is calculated using the Student’s t distribution. The
Student’s t distribution resembles the normal distribution but has heavier tails,
especially when the sample size is small (Fig. 3.1). If the p value is smaller than α,
the effect is said to be statistically significant.
Consider a situation where we want to test whether males have a different height
than females. Let us assume that males and females are on average 182.5 and
168.7 cm tall respectively (NCD Risk Factor Collaboration 2016) and that the
standard deviation of both populations equals 7.1 cm (Fig. 3.7). Of course, in
reality, we do not have access to the population distributions, and so we cannot
know the population means and standard deviations; we only obtain data from
samples. Let us sample 10 men and 10 women. In MATLAB, a t test can be
performed as follows: [~,p,~,stats]=ttest2(x1,x2), with x1 being a
vector of length n1 with the heights of the males, and x2 being a vector of length n2
with the heights of females. The result of the t test is a p value, defined as the
probability of obtaining a result equal to or more extreme than the observed result,
assuming that the null hypothesis of equal means is true. For example, p = 0.020
means that, assuming two random samples were drawn from the same normal
distribution, in only 2% of the cases one would find such a large difference (see also
Textbox 3.3).
Fig. 3.7 Probability density function of the assumed population distributions of males and females.
μwomen = 168.7 cm, μmen = 182.5 cm, σmen = σwomen = 7.1 cm
Fig. 3.8 Simulation results when submitting a sample of n men and n women to an
independent-samples Student's t test (μmen = 182.5 cm, μwomen = 168.7 cm, σmen = σwomen =
7.1 cm). The horizontal dashed line is drawn at α = 0.05. Axes: p value (vertical) versus test
number, sorted on p value (horizontal)
Fig. 3.9 Simulation results when submitting a sample of 10 men and 10 women to an
independent-samples Student's t test. Here, it was assumed that men and women have equal height
(μmen = 182.5 cm, μwomen = 182.5 cm, σmen = σwomen = 7.1 cm). The horizontal dashed line is
drawn at α = 0.05. Axes: p value (vertical) versus test number, sorted on p value (horizontal)
Note that if the population means of both distributions are equal, then the
p value is uniformly distributed. In other words, if men and women had equal
height, the simulation results would look like those in Fig. 3.9. It can be seen
that a Type I error is made in 5% of the cases.
In a scientific paper, it is important to not only report the p value, but also the
t statistic and degrees of freedom of the Student’s t distribution, as well as the means
and standard deviations of the two samples. In the aforementioned example about
the height of men and women, the results can be reported as follows: ‘Men were
taller (M = 182.5 cm, SD = 8.2 cm) than women (M = 169.1 cm, SD = 6.1 cm),
t(18) = 4.14, p < 0.001'. Here, (18) is the degrees of freedom of the Student's t dis-
tribution, which equals n1 + n2 − 2.
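The two-sided p value for t(18) = 4.14 can be reproduced without statistical software by numerically integrating the Student's t density; a Python sketch using only the standard library (Simpson's rule; this is an illustration, not how ttest2 computes it internally):

```python
import math

def t_pdf(t, df):
    """Probability density of the Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=10_000):
    """2 * P(T > |t|), via Simpson integration of the density over [0, |t|]."""
    a, b = 0.0, abs(t)
    h = (b - a) / steps
    s = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 1 - 2 * (s * h / 3)

print(two_sided_p(2.101, 18))  # ~0.050: the familiar critical value for df = 18
print(two_sided_p(4.14, 18))   # well below 0.001, consistent with reporting p < 0.001
```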
An independent-samples t test is used for comparing two groups, for example males
with females, or the results of a between-subjects experiment. For a within-subject
design, a paired-samples t test can be used ([~,p,~,stats]=ttest(x1,x2)).
Here, the t statistic is a function of the change of the scores for participants
between two conditions (Eq. (3.14)). A paired-samples t test is usually more
powerful than an independent-samples t test, because participants are compared
with themselves (see also Sect. 2.3). Specifically, the denominator in Eq. (3.14) is
smaller than the denominator in Eq. (3.12) when the two samples are positively
correlated (see Eq. (3.15)). The results of a paired t test can be reported as: ‘The
task completion time was larger with the traditional walking aid (M = 51.7 s,
SD = 6.1 s) than with the exoskeleton (M = 43.9 s, SD = 7.6 s), t(9) = 3.34,
p = 0.009’. Here, (9) is the number of degrees of freedom, being equal to n − 1.
t = \frac{\bar{x}_1 - \bar{x}_2}{s_{12} \sqrt{\frac{1}{n}}}    (3.14)

s_{12} = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}    (3.15)
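Eqs. (3.14) and (3.15) can be verified against the direct paired computation (a one-sample t test on the difference scores); a Python/NumPy sketch with made-up, positively correlated within-subject data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10
x1 = rng.normal(51.7, 6.0, n)      # condition 1 scores (made-up data)
x2 = x1 - rng.normal(7.8, 4.0, n)  # condition 2, positively correlated with condition 1

# Direct route: one-sample t on the difference scores
diff = x1 - x2
t_direct = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Route via Eqs. (3.14) and (3.15)
r = np.corrcoef(x1, x2)[0, 1]
s12 = np.sqrt(x1.var(ddof=1) + x2.var(ddof=1) - 2 * r * x1.std(ddof=1) * x2.std(ddof=1))
t_eq = (x1.mean() - x2.mean()) / (s12 * np.sqrt(1/n))
print(t_direct, t_eq)  # identical
```

The positive correlation shrinks s12 relative to the pooled denominator of Eq. (3.12), which is why the paired test is usually more powerful.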
A statistical test can be one-tailed or two-tailed. A one-tailed test is used for testing
a hypothesis in one direction, whereas a two-tailed test examines the hypothesis in
both directions. For example, a two-tailed test can be used to examine whether a
new exoskeleton is less or more efficient than a traditional walking aid. In one-tailed
tests, only one of the two directions is tested; for example, to test whether a new
exoskeleton is more efficient than a traditional walking aid (and not whether the
exoskeleton is less efficient than the traditional walking aid). In MATLAB, a
two-tailed test is the default. A one-tailed t test can be conducted as follows:
[~,p]=ttest2(x1,x2,'tail','right') or [~,p]=ttest2(x1,x2,'tail','left').
If the test is one-sided, the p value is the probability of obtaining
a result as extreme or more extreme in the selected direction, whereas the two-tailed
probability is the one-tailed probability (for the nearest rejection side) multiplied by
two. It is easier to reach significance (p < α) when using a one-tailed test as
compared to a two-tailed test, but this should never be the reason for opting for
one-tailed tests. In human subject research, it is customary to use two-tailed tests.
When its assumptions are violated, the t test may have low power (low 1 − β) or
yield a Type I error rate that deviates from the nominal α. There are various
alternative tests, such as the Welch test (Eq. (3.16); ttest2(x1,x2,[],[],'unequal')),
which is robust to Type I errors if sample sizes are unequal in combination with
unequal population variances:
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (3.16)
There are also non-parametric variants of the t test, such as the Mann-Whitney
U test (also called Wilcoxon rank-sum test; ranksum(x1,x2)) and the Wilcoxon
signed rank test (signrank(x1,x2)). When there are more than two groups, an
analysis of variance (ANOVA) can be used (anova1 for a one-way ANOVA) or
its non-parametric equivalent, the Kruskal-Wallis test (kruskalwallis).
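The rank-sum idea behind the Mann-Whitney U test can be sketched directly; a Python/NumPy illustration computing the U statistics only (MATLAB's ranksum additionally returns a p value), with made-up tie-free data:

```python
import numpy as np

rng = np.random.default_rng(7)
x1, x2 = rng.normal(1.0, 1.0, 12), rng.normal(0.0, 1.0, 15)  # made-up samples
n1, n2 = len(x1), len(x2)

# Rank the pooled data (continuous data, so ties are not a concern here)
pooled = np.concatenate([x1, x2])
ranks = np.empty(len(pooled))
ranks[np.argsort(pooled)] = np.arange(1, len(pooled) + 1)

R1, R2 = ranks[:n1].sum(), ranks[n1:].sum()  # rank sums per group
U1 = R1 - n1 * (n1 + 1) / 2                  # Mann-Whitney U for group 1
U2 = R2 - n2 * (n2 + 1) / 2                  # Mann-Whitney U for group 2
print(U1, U2)  # U1 + U2 always equals n1 * n2
```

Because only ranks enter the statistic, the test is insensitive to outliers and monotone transformations of the data.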
The t test is a univariate test, meaning that one variable (x) is analysed per par-
ticipant. Multivariate statistical methods analyse more than one variable simulta-
neously. The correlation coefficient and simple linear regression are relatively
simple bivariate methods, involving two variables (x and y) per participant.
A regression analysis with multiple predictor variables and one criterion variable is
called multiple regression (regress). Examples of more sophisticated multi-
variate statistical techniques are: (1) multivariate regression (mvregress; for
predicting multiple criterion variables), (2) exploratory factor analysis (factoran;
a linear system in which the predictor variables are not directly observed, and the
number of predictor variables is smaller than the number of criterion variables),
(3) principal component analysis (pca; a data reduction technique which resembles
factor analysis), (4) structural equation modelling (a combination of multivariate
regression and factor analysis), and (5) multivariate analysis of variance (manova;
this resembles multivariate regression).
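A minimal multiple-regression sketch, using Python/NumPy least squares in place of MATLAB's regress (the coefficients and data below are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)        # two predictor variables
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 0.1, n)  # one criterion variable plus noise

X = np.column_stack([np.ones(n), x1, x2])              # design matrix with intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)              # least-squares coefficients
print(b)  # close to the generating values [2.0, 1.5, -0.5]
```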
As explained above, if p < 0.05 (for α = 0.05), then the effect is declared statistically
significant. However, a statistically significant finding does not imply that the
effect is strong or important. Suppose that the true effect is small (e.g., a difference
of 0.1 cm between the height of men and women) but the sample size is very large
(e.g., nmen = nwomen = 1,000,000), then it is very likely that p < 0.05. That is, the
effect is statistically significant because the sample size is very large, but the size of
the effect is small (a difference of 0.1 cm; d = 0.014) and therefore does not nec-
essarily have practical relevance.
Furthermore, a statistically significant finding does not imply that the alternative
hypothesis is true. Null hypothesis significance testing cannot establish whether a
hypothesis is true or false. After all, a p value represents the probability of obtaining
data as extreme as or more extreme than those observed, assuming that the null
hypothesis holds. Such a failure of significance testing was
demonstrated by the work of Bem (2011) who, based on a number of experiments
that yielded p < 0.05, claimed that people are able to ‘feel the future’. Based on
what is known from physics and numerous counterexamples such as the fact that
casinos still make money, however, it is extremely unlikely that people can really
feel the future. Therefore, the statistically significant findings reported by Bem
(2011) have to be false positives (Wagenmakers et al. 2015).
In a highly cited article, Ioannidis (2005) claimed that “most published research
findings are false”. Since then, he has been proven right in various areas, such as
medicine (Begley and Ellis 2012; Freedman et al. 2015; Ioannidis 2007), experi-
mental economics (Ioannidis and Doucouliagos 2013), and psychology (Open
Science Collaboration 2015; Textbox 3.4).
Ioannidis' (2005) argument is as follows. He first points out that if a researcher
reports 'there is an effect, p < α', it can be either a true positive or a false positive
(see also Table 3.3). The pre-study probability that a research finding is true is
called π. π depends strongly on the research field. For example, in the area of
research into clairvoyance, π is extremely close to 0. Conversely, if the research
field targets highly probable effects (such as the hypothesis that males are taller than
females), then π is extremely close to 1.
Fig. 3.10 p values in original studies versus replication studies. Dashed lines run across p = 0.05.
See also Open Science Collaboration (2015, 2016)
from the original studies (with the protocols of the replications being
endorsed only by 69% of the original authors). It has been further argued that
because of the small sample sizes of the original studies, failure to replicate
(even by means of a high-powered replication) cannot tell us much about the
original results (Etz and Vandekerckhove 2016; Morey and Lakens 2016).
The expected number of true positives is the product of the statistical power
(1 − β) and π. Similarly, the expected number of false positives is α times the
probability that a research hypothesis is false (1 − π). Thus, the probability that a
statistically significant research finding is false (the False Positive Report
Probability; FPRP) equals the expected number of false positives divided by the
expected number of true positives plus the expected number of false positives
(Eq. (3.17); Wacholder et al. 2004).

FPRP = \frac{\alpha(1-\pi)}{\alpha(1-\pi) + (1-\beta)\pi}    (3.17)
According to Eq. (3.17), research findings are more likely to be true when π is
higher, and when the statistical power is higher. Ioannidis (2005) argues that in
confirmatory research, such as randomized controlled trials, the FPRP is probably
less than 0.5. However, if research is exploratory (i.e., discovery-oriented), then it
becomes likely that a positive research finding is false. Figure 3.11 illustrates the
perils of research with low π. In the case presented, π = 0.02, 1 − β = 0.8, and
α = 0.05, yielding an FPRP of 75% [see also Eq. (3.17)].
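Eq. (3.17) for this case works out as follows; a pure-Python sketch (the confirmatory-research value of 0.5 for the pre-study probability is an illustrative assumption):

```python
def fprp(alpha, power, pi):
    """False Positive Report Probability, Eq. (3.17) (Wacholder et al. 2004)."""
    return alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)

# Exploratory research: low pre-study probability that the hypothesis is true
print(round(fprp(alpha=0.05, power=0.8, pi=0.02), 2))  # 0.75: most positives are false

# Confirmatory research: high pre-study probability
print(round(fprp(alpha=0.05, power=0.8, pi=0.5), 3))   # 0.059: most positives are true
```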
3.4.2 Bias
In his paper, Ioannidis (2005) describes another risk: bias. Bias is the tendency of
researchers to 'tweak' a p value so that it becomes statistically significant when it
otherwise would not have been. There is evidence that researchers 'like' statistically
significant results (e.g., Bakker and Wicherts 2011): a p < 0.05 might please
the sponsors (Lexchin et al. 2003), get the work more easily accepted into a journal
(Mobley et al. 2013), attract media attention (Boffetta et al. 2008), or reflect the
researchers’ tendency to confirm their own hypothesis (i.e., experimenter’s
expectancy, see Sect. 2.5.2). Questionable research practices during statistical
analysis leading to false positives (i.e., Type I errors) are (for an overview, see
Banks et al. 2016; Forstmeier et al. in press):
Fig. 3.12 Distribution of p values when ‘strategically’ selecting a non-parametric test when the
parametric test yields a result that is not statistically significant. In this simulation, 1,000,000
independent-samples t tests were run with a sample size of 25 per group
A priori power analysis can be used to compute the required sample size for a given
level of significance, desired power, and expected effect size. An excellent power
analysis tool is G*Power, which can be downloaded for free: http://www.gpower.hhu.de.
Another useful effect size calculator, which does not require installing software but
runs in Microsoft Excel, is provided by Lakens (2013).
Note that this book focuses on frequentist inference (e.g., null hypothesis signifi-
cance testing and p values). However, the use of frequentist inference has been
criticized by many, because it dichotomizes research into significant and
non-significant findings, and because p values are easily misinterpreted. Nowadays,
Bayesian inference is gaining popularity (Cumming 2013; Poirier 2006), because it
does not suffer from the same problems as frequentist inference does
(Wagenmakers et al. 2008). Bayesian statistical methods are available in several
software packages, including WinBUGS (Lunn et al. 2000) and Mplus (Kaplan and
Depaoli 2012). However, frequentist inference still seems to be dominant. In an
analysis of abstracts published in biomedical journals between 1990 and 2015,
Chavalarias et al. (2016) found that, out of 796 abstracts of papers with empirical
data, 15.7% of the abstracts reported p values, 13.9% reported effect sizes, 2.3%
reported confidence intervals, and 0% reported a Bayes factor.
References
Aaron, B., Kromrey, J. D., & Ferron, J. (1988). Equating r-based and d-based effect size indices:
Problems with a commonly recommended formula. Paper presented at the 43rd Annual
Meeting of the Florida Educational Research Association, Orlando, FL.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21.
https://doi.org/10.1080/00031305.1973.10478966
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., et al.
(2013). Recommendations for increasing replicability in psychology. European Journal of
Personality, 27, 108–119. https://doi.org/10.1002/per.1919
Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology
journals. Behavior Research Methods, 43, 666–678. https://doi.org/10.3758/s13428-011-0089-5
Bakker, M., & Wicherts, J. M. (2014). Outlier removal and the relation with reporting errors and
quality of psychological research. PLOS ONE, 9, e103360. https://doi.org/10.1371/journal.
pone.0103360
Banks, G. C., O’Boyle, E. H., Pollack, J. M., White, C. D., Batchelor, J. H., Whelpley, C. E., et al.
(2016). Questions about questionable research practices in the field of management: A guest
commentary. Journal of Management, 42, 5–20. https://doi.org/10.1177/0149206315619011
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer
research. Nature, 483, 531–533. https://doi.org/10.1038/483531a
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences
on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. https://
doi.org/10.1037/a0021524
Boffetta, P., McLaughlin, J. K., La Vecchia, C., Tarone, R. E., Lipworth, L., & Blot, W. J. (2008).
False-positive results in cancer epidemiology: A plea for epistemological modesty. Journal of
the National Cancer Institute, 100, 988–995. https://doi.org/10.1093/jnci/djn191
Bolch, B. W. (1968). The teacher’s corner: More on unbiased estimation of the standard deviation.
The American Statistician, 22, 27. https://doi.org/10.1080/00031305.1968.10480476
Burt, C. (1957). Distribution of intelligence. British Journal of Psychology, 48, 161–175. https://
doi.org/10.1111/j.2044-8295.1957.tb00614.x
Chavalarias, D., Wallach, J. D., Li, A. H. T., & Ioannidis, J. P. (2016). Evolution of reporting
p values in the biomedical literature, 1990–2015. JAMA, 315, 1141–1148. https://doi.org/10.
1001/jama.2016.1952
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-
values. Royal Society Open Science, 1, 140216. https://doi.org/10.1098/rsos.140216
Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 25, 7–29. https://doi.org/10.1177/0956797613504966
Davies, H. T. O., Crombie, I. K., & Tavakoli, M. (1998). When can odds ratios mislead? BMJ,
316, 989–991. https://doi.org/10.1136/bmj.316.7136.989
De Winter, J. C. F. (2015). A commentary on “Problems in using text-mining and p-curve analysis
to detect rate of p-hacking”. https://sites.google.com/site/jcfdewinter/Bishop%20short%
20commentary.pdf?attredirects=0&d=1
De Winter, J. C. F., Gosling, S. D., & Potter, J. (2016). Comparing the Pearson and Spearman
correlation coefficients across distributions and sample sizes: A tutorial using simulations and
empirical data. Psychological Methods, 21, 273–290. https://doi.org/10.1037/met0000079
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292–307.
https://doi.org/10.1037/1082-989X.2.3.292
Etz, A., & Vandekerckhove, J. (2016). A Bayesian perspective on the reproducibility project:
Psychology. PLOS ONE, 11, e0149794. https://doi.org/10.1371/journal.pone.0149794
Field, A. (2013). Discovering statistics using IBM SPSS statistics. London, UK: Sage Publications.
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13, e1002165. https://doi.org/10.1371/journal.pbio.1002165
Forstmeier, W., Wagenmakers, E. J., & Parker, T. H. (in press). Detecting and avoiding likely false-positive findings: A practical guide. Biological Reviews. https://doi.org/10.1111/brv.12315
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the
reproducibility of psychological science”. Science, 351, 1037. https://doi.org/10.1126/science.
aad7243
Goel, S., & Tashakkori, R. (2015). Correlation between body measurements of different genders
and races. In J. Rychtár, M. Chhetri, S. N. Gupta, & R. Shivaji (Eds.), Collaborative
mathematics and statistics research (pp. 7–17). Springer International Publishing. https://doi.
org/10.1007/978-3-319-11125-4_2
Gross, E., & Vitells, O. (2010). Trial factors for the look elsewhere effect in high energy physics.
The European Physical Journal C, 70, 525–530. https://doi.org/10.1140/epjc/s10052-010-
1470-8
Guilford, J. P., & Perry, N. C. (1951). Estimation of other coefficients of correlation from the phi
coefficient. Psychometrika, 16, 335–346. https://doi.org/10.1007/BF02310556
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited.
Journal of Applied Psychology, 85, 869–879. https://doi.org/10.1037/0021-9010.85.6.869
Ioannidis, J. P. (2005). Why most published research findings are false. PLOS Medicine, 2, e124.
https://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. (2007). Non-replication and inconsistency in the genome-wide association setting.
Human Heredity, 64, 203–213. https://doi.org/10.1159/000103512
Ioannidis, J., & Doucouliagos, C. (2013). What’s to know about the credibility of empirical
economics? Journal of Economic Surveys, 27, 997–1004. https://doi.org/10.1111/joes.12032
Jager, L. R., & Leek, J. T. (2013). An estimate of the science-wise false discovery rate and
application to the top medical literature. Biostatistics, 15, 1–12. https://doi.org/10.1093/
biostatistics/kxt007
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In R. Hoyle (Ed.),
Handbook of structural equation modeling (pp. 650–673). New York: Guilford Press.
Krawczyk, M. (2015). The search for significance: A few peculiarities in the distribution of
p values in experimental psychology literature. PLOS ONE, 10, e0127872. https://doi.org/10.
1371/journal.pone.0127872
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A
practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. https://doi.org/10.3389/
fpsyg.2013.00863
Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just
significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66,
2303–2309. https://doi.org/10.1080/17470218.2013.863371
Lexchin, J., Bero, L. A., Djulbegovic, B., & Clark, O. (2003). Pharmaceutical industry
sponsorship and research outcome and quality: Systematic review. BMJ, 326, 1167–1170.
https://doi.org/10.1136/bmj.326.7400.1167
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS-a Bayesian modelling
framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
https://doi.org/10.1023/A:1008929526011
Matejka, J., & Fitzmaurice, G. (2017). Same stats, different graphs: Generating datasets with
varied appearance and identical statistics through simulated annealing. Proceedings of the 2017
CHI Conference on Human Factors in Computing Systems, 1290–1294. https://doi.org/10.
1145/3025453.3025912
Meyer, G. J., Finn, S. E., Eyde, L. D., Kay, G. G., Moreland, K. L., Dies, R. R., … & Reed, G. M.
(2001). Psychological testing and psychological assessment: A review of evidence and issues.
American Psychologist, 56, 128–165. https://doi.org/10.1037/0003-066X.56.2.128
Mobley, A., Linder, S. K., Braeuer, R., Ellis, L. M., & Zwelling, L. (2013). A survey on data
reproducibility in cancer research provides insights into our limited ability to translate findings
from the laboratory to the clinic. PLOS ONE, 8, e63221. https://doi.org/10.1371/journal.pone.
0063221
Morey, R. D., & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. https://
raw.githubusercontent.com/richarddmorey/psychology_resolution/master/paper/response.pdf
NCD Risk Factor Collaboration. (2016). A century of trends in adult human height. ELife, 5,
e13410. https://doi.org/10.7554/eLife.13410
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science.
Science, 349, aac4716. https://doi.org/10.1126/science.aac4716
Open Science Collaboration. (2016). RPPdataConverted.xlsx. https://osf.io/ytpuq/
Plomin, R., & Deary, I. J. (2015). Genetics and intelligence differences: Five special findings.
Molecular Psychiatry, 20, 98–108. https://doi.org/10.1038/mp.2014.105
Poirier, D. J. (2006). The growth of Bayesian methods in statistics and economics since 1970.
Bayesian Analysis, 1, 969–979.
Rasch, D., Kubinger, K. D., & Moder, K. (2011). The two-sample t test: Pre-testing its
assumptions does not pay off. Statistical Papers, 52, 219–231. https://doi.org/10.1007/s00362-
009-0224-x
Reeves, S. L., Varakamin, C., & Henry, C. J. (1996). The relationship between arm-span
measurement and height with special reference to gender and ethnicity. European Journal of
Clinical Nutrition, 50, 398–400.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The
handbook of research synthesis (pp. 231–244). New York, NY: Russell Sage Foundation.
Schmidt, C. O., & Kohlmann, T. (2008). When to use the odds ratio or the relative risk?
International Journal of Public Health, 53, 165–167. https://doi.org/10.1007/s00038-008-
7068-3
Tabachnick, B. G., & Fidell, L. S. (1989). Using multivariate statistics. New York: Harper & Row.
Thorndike, R. L. (1947). Research problems and techniques (Report No. 3). Washington DC:
Army Air Forces.
Wacholder, S., Chanock, S., Garcia-Closas, M., & Rothman, N. (2004). Assessing the probability
that a positive report is false: An approach for molecular epidemiology studies. Journal of the
National Cancer Institute, 96, 434–442. https://doi.org/10.1093/jnci/djh075
Wagenmakers, E. J., Lee, M., Lodewyckx, T., & Iverson, G. J. (2008). Bayesian versus frequentist
inference. In H. Hoijtink, I. Klugkist, & P. A. Boelen (Eds.), Bayesian evaluation of
informative hypotheses (pp. 181–207). New York: Springer.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., Kievit, R. A., & Van der Maas, H. L. (2015).
A skeptical eye on psi. In E. C. May & S. B. Marwaha (Eds.), Extrasensory perception:
Support, skepticism, and science (Volume I) (pp. 153–176). Santa Barbara, CA: ABC-CLIO
LLC.
Zhang, J., & Yu, K. F. (1998). What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA, 280, 1690–1691. https://doi.org/10.1001/jama.280.19.1690
MATLAB Scripts
See Figs. 1.1, 2.2, 2.5, 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8, 3.9, 3.10 and 3.12.
Fig. 2.5 Weight judgement: Wisdom of the crowd (Gordon 1924; Eysenck 1939)
clear variables
r=.41^2;
n1=[1 5 10 20 50 ];
robs=[.41 .68 .79 .86 .94];
n2=1:200;
R=sqrt((n2.*r)./(1+(n2-1).*r));
figure('Name','Figure 2.5','NumberTitle','off');hold on
plot(n1,robs,'ko','Markersize',14,'Markerfacecolor','k')
plot(n2,R,'Linewidth',2)
set(gca,'color', 'None');grid on;box on
set(gca,'xlim',[0 100],'ylim',[0 1])
legend('Observed correlation','Predicted correlation','location','southeast')
xlabel('Number of participants per group');
ylabel('Correlation with true weights')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
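The curve plotted by the Fig. 2.5 script is R = sqrt(n r^2 / (1 + (n - 1) r^2)) with r = 0.41, the predicted correlation between the mean of n independent judgements and the true values. For readers who want to check it outside MATLAB, here is a hypothetical Python port (not one of the book's scripts):

```python
import math

def predicted_crowd_correlation(n, r=0.41):
    # Predicted correlation between the mean of n independent judgements
    # and the true values, given that each judge correlates r with the
    # truth (the curve plotted in the Fig. 2.5 script).
    r2 = r * r
    return math.sqrt(n * r2 / (1 + (n - 1) * r2))

# Group sizes used in the script; compare with robs = [.41 .68 .79 .86 .94]
for n in [1, 5, 10, 20, 50]:
    print(n, round(predicted_crowd_correlation(n), 2))
```

The predicted values track the observed correlations closely, which is the point of the figure.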
Fig. 3.1 Probability density functions of a normal, a t, a scaled t, and an exponential distribution
clear variables
x = -50:.001:50;y=NaN(3,length(x));
y(1,:)=tpdf(x,5);y(2,:)=tpdf(x,inf);y(3,:)=exppdf(x,1);
figure('Name','Figure 3.1','NumberTitle','off');hold on
set(gca, 'LooseInset', [0.01 0.01 0.01 0.01]);
plot(x,y(2,:),'k','Linewidth',2)
plot(x,y(1,:),'g','Linewidth',2)
plot(x./sqrt(5/3),y(1,:).*sqrt(5/3),'-','color',[255 165 0]/255,'Linewidth',2)
plot(x,y(3,:),'m:','Linewidth',2)
h=legend('(1) Normal : variance = 1, skewness = 0, kurtosis = 3', ...
    '(2) \it{t}\rm : variance = 5/3, skewness = 0, kurtosis = 9', ...
    '(3) \it{t}\rm (scaled) : variance = 1, skewness = 0, kurtosis = 9', ...
    '(4) Exponential : variance = 1, skewness = 2, kurtosis = 9', ...
    'location','northeast','orientation','vertical');
set(h,'color','none')
xlabel('Value')
ylabel('Density')
set(gca,'color', 'None')
set(gca,'xlim',[-5 10],'ylim',[0 1.01],'xtick',-10:1:10,'FontSize',24);
h=rectangle('position',[3 0 2 .06],'facecolor','none','Linewidth',1);
pan on
set(h,'Clipping','off')
plot([3.7 4],[.06 .23],'k-','Linewidth',1)
ah=axes('position',[.5465 .30 .35 .35]);
hold on;box on
plot(x,y(1,:),'g','Linewidth',2)
plot(x./sqrt(5/3),y(1,:).*sqrt(5/3),'-','color',[255 165 0]/255,'Linewidth',2)
plot(x,y(2,:),'k','Linewidth',2)
plot(x,y(3,:),'m:','Linewidth',2)
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'xlim',[3 5],'ylim',[0 .06],'FontSize',16,'color','none')
set(gca, 'LooseInset', [0.01 0.01 0.01 0.01]);
Fig. 3.2 Scatter plots of samples drawn with various population correlations R
clear variables;rng('default')
N=1000;RR=[0 .2 .4 .6 .8 .9];
figure('Name','Figure 3.2','NumberTitle','off');hold on
for i=1:length(RR)
subplot(2,3,i);hold on
set(gca,'color','none')
plot([-10 10],[-10 10]*RR(i),'m--','Linewidth',2)
h=legend(['\rm\it{R}\rm = ' num2str(RR(i))],'location','southeast');
set(h,'color','none')
x=randn(N,1);y=RR(i)*x+sqrt((1-RR(i)^2))*randn(N,1);
plot(x,y,'ko');
xlabel('\itx');ylabel('\ity');
axis equal
set(gca,'xlim',[-5 5],'ylim',[-5 5],'xtick',[-5 0 5],'ytick',[-5 0 5])
box on
end
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',18)
Fig. 3.3 Anscombe's quartet: four datasets with near-identical means, variances, and correlations
clear variables
d=[10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89];
figure('Name','Figure 3.3','NumberTitle','off');hold on
for i=1:4
subplot(2,2,i)
plot(d(:,i*2-1),d(:,i*2),'ko','Markersize',10,'Markerfacecolor','k')
xlabel('\itx');ylabel('\ity');
set(gca,'xlim',[3 20],'ylim',[4 14]);
set(gca,'color','none')
end
disp('Means')
fprintf('%8.3f',mean(d(:,1:2:end)));fprintf('\n')
fprintf('%8.3f',mean(d(:,2:2:end)));fprintf('\n')
disp('Variances')
fprintf('%8.3f',var(d(:,1:2:end)));fprintf('\n')
fprintf('%8.3f',var(d(:,2:2:end)));fprintf('\n')
disp('Correlations')
fprintf('%8.3f',[corr(d(:,1),d(:,2)) corr(d(:,3),d(:,4)) corr(d(:,5),d(:,6)) corr(d(:,7),d(:,8))]);fprintf('\n')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca, 'LooseInset', [0.01 0.01 0.01 0.01]);
Fig. 3.4 Distribution of the sample mean for various sample sizes (normal population)
clear variables;rng('default')
reps=10^6;nn=[1 2 5 20 50];V=-100:0.01:100;SE=NaN(length(nn),1);
figure('Name','Figure 3.4','NumberTitle','off');
for i=1:length(nn) % loop over 5 sample sizes
M=mean(randn(nn(i),reps),1); % sample mean (vector of length reps)
D=histc(M,V);
Dnorm=D./sum(D)/mean(diff(V));
SE(i)=std(M);
plot(V+mean(diff(V)),Dnorm,'-o','linewidth',2);hold on
end
h=legend('\it{n}\rm = 1','\it{n}\rm = 2','\it{n}\rm = 5','\it{n}\rm = 20','\it{n}\rm = 50');
set(h,'color','none')
set(gca,'xlim',[-3 3])
xlabel('Sample mean')
ylabel('Density')
set(gca,'color','None')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca, 'LooseInset', [0.01 0.01 0.01 0.01]);
disp('1/sqrt(n)')
for i=1:length(SE);fprintf('%8.3f',1/sqrt(nn(i)));fprintf('\n');end
disp('Standard deviation of the sample mean')
for i=1:length(SE);fprintf('%8.3f',SE(i));fprintf('\n');end
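The 1/sqrt(n) behaviour that the Fig. 3.4 script prints can be reproduced without MATLAB. The following standard-library Python sketch (an illustration, not one of the book's scripts) estimates the standard deviation of the sample mean by simulation:

```python
import random
import statistics

random.seed(0)

def sd_of_sample_mean(n, reps=20000):
    # Simulate 'reps' samples of size n from a standard normal population
    # and return the standard deviation of their sample means; theory
    # predicts a value close to 1/sqrt(n).
    means = [statistics.fmean(random.gauss(0, 1) for _ in range(n))
             for _ in range(reps)]
    return statistics.pstdev(means)

for n in [1, 4, 25]:
    print(n, round(sd_of_sample_mean(n), 2), round(n ** -0.5, 2))
```

With 20,000 repetitions the simulated values agree with 1/sqrt(n) to about two decimal places, mirroring the table the MATLAB script prints.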
Fig. 3.5 Distribution of the sample mean for various sample sizes (exponential population)
clear variables;rng('default')
reps=10^6;nn=[1 2 5 20 50];V=-100:0.01:100;SE=NaN(length(nn),1);
figure('Name','Figure 3.5','NumberTitle','off')
for i=1:length(nn)
M=mean(exprnd(1,nn(i),reps),1); % sample mean (vector of length reps)
D=histc(M,V);
Dnorm=D./sum(D)/mean(diff(V));
SE(i)=std(M);
plot(V+mean(diff(V)),Dnorm,'-o','linewidth',2);hold on
end
h=legend('\it{n}\rm = 1','\it{n}\rm = 2','\it{n}\rm = 5','\it{n}\rm = 20','\it{n}\rm = 50');
set(h,'color','none')
set(gca,'xlim',[-.1 3])
xlabel('Sample mean')
ylabel('Density')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca,'color','none','looseInset', [0.01 0.01 0.01 0.01])
disp('Standard deviation of the sample mean')
for i=1:length(SE);fprintf('%8.3f',SE(i));fprintf('\n');end
Fig. 3.7 Probability density functions of the population height distributions of males and females
clear variables
V=0:.1:300;
d_men_c=normpdf(V,182.5,7.1);
d_women_c=normpdf(V,168.7,7.1);
figure('Name','Figure 3.7','NumberTitle','off');hold on
plot(V,d_men_c,'Linewidth',3)
plot(V,d_women_c,'--','color',[216 82 24]./255,'Linewidth',3)
box on
xlabel('Height (cm)')
ylabel('Density')
set(gca,'xlim',[130 230],'FontSize',24)
h=legend('Males','Females');
set(h,'color','none')
set(gca,'color','none','looseInset', [0.01 0.01 0.01 0.01])
Figs. 3.8 and 3.9 p values for two normal distributions with unequal means and equal means, respectively
clear variables;rng('default')
reps=10000;nn=[3 6 10];
pu=NaN(reps,length(nn));
pe=NaN(reps,length(nn));
for i=1:length(nn)
n=nn(i);
disp(n)
for i2=1:reps
disp(i2)
height_pollm=randn(n,1)*7.1+182.5;
height_pollw=randn(n,1)*7.1+168.7;
height_pollw2=randn(n,1)*7.1+182.5; % now assume that men and women have equal height
[~,pu(i2,i)]=ttest2(height_pollm,height_pollw);
[~,pe(i2,i)]=ttest2(height_pollm,height_pollw2);
end
end
figure('Name','Figure 3.8','NumberTitle','off');hold on
plot(sort(pu),'o','Linewidth',2,'Markersize',4);hold on
plot([0 reps],[.05 .05], 'k--','Linewidth',2)
h=legend('{\itn_m_e_n} = {\itn_w_o_m_e_n} = 3', ...
    '{\itn_m_e_n} = {\itn_w_o_m_e_n} = 6', ...
    '{\itn_m_e_n} = {\itn_w_o_m_e_n} = 10','location','northwest');
set(h,'color','none')
xlabel('Test number (sorted on {\itp} value)')
ylabel('{\itp} value')
box on
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'color','none','looseInset', [0.01 0.01 0.01 0.01])
figure('Name','Figure 3.9','NumberTitle','off');hold on
plot(sort(pe(:,3)),'o','Linewidth',2,'Markersize',4)
plot([0 reps],[.05 .05], 'k--','Linewidth',2)
xlabel('Test number (sorted on {\itp} value)')
ylabel('{\itp} value')
box on
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'color','none','looseInset', [0.01 0.01 0.01 0.01])
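Fig. 3.9 shows that when the null hypothesis is true, p values are uniformly distributed, so about 5% fall below .05. That can be checked with a standard-library Python sketch; to stay self-contained it uses a two-sample z test with the population standard deviation (7.1 cm) treated as known, rather than the t test of the script above, so this is an illustrative simplification:

```python
import math
import random

random.seed(1)

def two_sample_z_p(n=10, sd=7.1, mu=182.5):
    # Two groups drawn from the same height distribution (H0 true);
    # returns the two-sided p value of a z test with known population sd.
    a = [random.gauss(mu, sd) for _ in range(n)]
    b = [random.gauss(mu, sd) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / (sd * math.sqrt(2 / n))
    return math.erfc(abs(z) / math.sqrt(2))

ps = [two_sample_z_p() for _ in range(10000)]
# Under H0 the p values are uniform, so about 5% fall below .05
print(round(sum(p < 0.05 for p in ps) / len(ps), 3))
```

Because the population standard deviation is assumed known, the z statistic is exactly standard normal under H0 and the resulting p values are exactly uniform; the t test in the MATLAB script behaves the same way asymptotically.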
Fig. 3.10 p values in original studies and replication studies in the Open Science Collaboration project
clear variables
pO=xlsread('RPPdataConverted.xlsx','DH2:DH168');
pR=xlsread('RPPdataConverted.xlsx','DT2:DT168');
figure('Name','Figure 3.10','NumberTitle','off');hold on
plot(pO,pR,'kx','Linewidth',2)
plot([0 1],[.05 .05],'m--','Linewidth',2)
plot([0.05 0.05],[0 1],'m--','Linewidth',2)
set(gca,'xlim',[0 0.06],'ylim',[0 1])
box on
xlabel('Original study {\itp} value')
ylabel('Replication study {\itp} value')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca,'color','none','looseInset', [0.01 0.01 0.01 0.01])
Fig. 3.12 Distribution of p values when the more favourable of a t test and a rank-sum test is reported
clear variables;rng('default')
reps=10^6;n=25;
pp=NaN(reps,1);
V=0:0.005:1;
for i=1:reps
if rem(i/1000,1)==0;fprintf('Percentage completed = %5.3f',100*i/reps);fprintf('\n');end
x=randn(n,1);y=randn(n,1);
[~,pp(i)]=ttest2(x,y);p2=ranksum(x,y);
if pp(i)>.05 && p2 < .05
pp(i)=p2;
end
end
figure('Name','Figure 3.12','NumberTitle','off');hold on
D=histc(pp,V);Dnorm=D./sum(D)/mean(diff(V));
plot(V+mean(diff(V)),Dnorm,'k-o','Linewidth',2)
box on
xlabel('\itp\rm value');ylabel('Density')
set(gca,'xlim',[0 .4])
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'color','none','looseInset', [0.01 0.01 0.01 0.01])
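The Fig. 3.12 script illustrates one questionable practice: reporting whichever of two tests happens to be significant. A related practice, optional stopping, can be sketched with the same kind of simulation. The Python below is a hypothetical illustration (not one of the book's scripts) using a one-sample z test with known standard deviation; peeking at the data halfway and again at the end, and declaring "significant" if either look yields p < .05, pushes the false-positive rate above the nominal 5%:

```python
import math
import random

random.seed(2)

def one_sample_z_p(xs):
    # Two-sided p value for H0: mean = 0, population sd known to be 1.
    z = sum(xs) / math.sqrt(len(xs))
    return math.erfc(abs(z) / math.sqrt(2))

reps = 20000
false_positives = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(50)]  # H0 is true
    # look after 25 observations and again after all 50
    if one_sample_z_p(x[:25]) < 0.05 or one_sample_z_p(x) < 0.05:
        false_positives += 1
print(round(false_positives / reps, 3))  # noticeably above 0.05
```

The two looks are correlated (the first 25 observations are part of the full sample), so the combined false-positive rate is below 2 x 5% but still clearly inflated, which is the same mechanism that inflates the p value distribution in Fig. 3.12.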