Chapter 3
Statistics

Abstract After the measurements have been completed, the data have to be statistically analysed. This chapter explains how to analyse data and how to conduct statistical tests. We explain the differences between a population and a sample, data distributions, descriptive statistics (i.e., statistics describing a sample: central tendency, variability, effect sizes, including Cohen's d and correlation coefficients), and inferential statistics (i.e., statistics used to infer characteristics of a population based on a sample taken from this population: standard error of the mean, null hypothesis significance testing, univariate and multivariate statistics). We draw attention to pitfalls that may occur in statistical analyses, such as misinterpretations of null hypothesis significance testing and false positives. Attention is also drawn to questionable research practices and their remedies. Finally, replicability of research is discussed, and recommendations for maximizing replicability are provided.

3.1 What This Chapter Does (Not) Cover

The results section of a research paper usually includes descriptive statistics and inferential statistics. The aim of descriptive statistics is to summarize the characteristics of the data, whereas inferential statistics are used to test hypotheses or to make estimates about a population.
This chapter covers the essentials of statistics in a concise manner; it does not offer a comprehensive guide on statistics. The website http://stats.stackexchange.com provides answers to many statistical questions. Well-known textbooks are Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by Cohen et al. (1983; 3rd ed. 2003), counting 149,222 citations in Google Scholar as of 19 February 2017, Using Multivariate Statistics by Tabachnick and Fidell (1989; 6th ed. 2012) with 65,150 citations, and Discovering Statistics Using SPSS by Field (2000; 4th ed. 2013) with 30,988 citations. Note that in this book we cover only frequentist and not Bayesian inference.


3.2 Descriptive Statistics

3.2.1 Central Tendency and Variability

In a paper, it is customary to include a table with arithmetic means (Eq. (3.1), mean) and standard deviations (Eq. (3.2), std) of the variables of interest. The standard deviation provides an indication of the variation of the data points and equals the square root of the variance (Eq. (3.3), var).

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (3.1)

s = \sqrt{s^2}    (3.2)

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2    (3.3)

In human subject research, the unit of analysis is usually the participant. Thus, in
Eqs. (3.1) and (3.2), a data point xi is the score of a participant on a measure, and
n is the number of participants. For example, if there are 10 scores per participant
(e.g., 10 reaction times per participant) and 20 participants, then n = 20, not
n = 200; one should first calculate aggregate scores per participant (e.g., the mean
across the 10 reaction times) and subsequently calculate the mean and standard
deviation across the 20 participants.
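As a minimal MATLAB sketch of this aggregation step (the matrix RT and its simulated values are assumptions for illustration):

% RT: 20 x 10 matrix of simulated reaction times (s), one row per participant
RT = 0.5 + 0.1*randn(20,10);
participant_means = mean(RT,2);  % first aggregate the 10 scores per participant
disp([mean(participant_means) std(participant_means)])  % then compute across n = 20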

Textbox 3.1 Sample variance and standard deviation as estimators of the corresponding population values

The term 1/(n − 1) in Eq. (3.3) ensures that the sample variance is an unbiased estimate of the population variance (i.e., the variance if the sample size were infinite). For example, if n = 5, not using the 1/(n − 1) term would lead to an underestimation of the variance by 20%. This can be verified with the following MATLAB simulation:
clear variables; clc
reps = 10000; n = 5;
s2_uncorrected = NaN(reps,1);
s2_corrected = NaN(reps,1);
for i = 1:reps
    x = randn(n,1);
    s2_uncorrected(i) = sum((x - mean(x)).^2)/length(x);     % equivalent to var(x,1)
    s2_corrected(i) = sum((x - mean(x)).^2)/(length(x) - 1); % equivalent to var(x)
end
disp([mean(s2_uncorrected) mean(s2_corrected)])

Because the standard deviation is the square root of the variance (Eq. (3.2)), the standard deviation of the sample is not an unbiased estimate of the standard deviation of the population [despite the fact that the term 1/(n − 1) is included in Eq. (3.3)]. There exists no general correction factor for the sample standard deviation; the correction that is required depends on the distribution of the variable and the sample size. For a normal distribution, in order to obtain an accurate estimate of the population standard deviation, the standard deviation calculated via Eqs. (3.2) and (3.3) has to be multiplied by ≈1.064 when n = 5, by ≈1.028 when n = 10, and by ≈1.0025 when n = 100 (Bolch 1968).
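The correction factor for n = 5 can be checked with a simulation along the following lines (the number of repetitions is an arbitrary choice):

reps = 100000; n = 5;
s = NaN(reps,1);
for i = 1:reps
    s(i) = std(randn(n,1));  % sample standard deviation; population sigma = 1
end
disp(1/mean(s))  % approximately 1.064 for n = 5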

Other descriptive measures are the median (equivalent to the 50th percentile), skewness, and kurtosis (median, prctile, skewness, and kurtosis). The median is a robust measure of central tendency, which means that it is insensitive to outliers. Skewness is a measure of the symmetry of the distribution, and kurtosis is a measure of the tailedness of the distribution (DeCarlo 1997). A normal distribution has a skewness of 0 and a kurtosis of 3 (note that kurtosis minus 3 is also called 'excess kurtosis'). A distribution with kurtosis less than 3 is called platykurtic, whereas a distribution with kurtosis greater than 3 is called leptokurtic. Figure 3.1 shows a Student's t distribution and an exponential distribution, which are both leptokurtic distributions, meaning that these distributions have heavy tails relative to the normal distribution.

Fig. 3.1 Probability density function of (1) a normal distribution (which is equivalent to a t distribution with infinite degrees of freedom), (2) a Student's t distribution with five degrees of freedom (df = 5), (3) a Student's t distribution with df = 5, but now scaled so that the variance equals 1 (if a distribution with high kurtosis is scaled to unit variance, the high kurtosis appears as heavy tails), and (4) an exponential distribution

3.2.2 Effect Sizes

Next to measures of central tendency (e.g., mean) and spread (e.g., standard
deviation), it is customary to report effect sizes in a paper.

3.2.2.1 Cohen’s d

A common measure of effect size is Cohen's d, which describes how much two samples (x1 and x2) differ from each other on a variable of interest. d is calculated as the difference in means divided by the pooled standard deviation of the two samples (Eq. (3.4)). In MATLAB, d can be calculated as follows: n1=length(x1); n2=length(x2); d=(mean(x1)-mean(x2))/sqrt(((n1-1)*std(x1)^2+(n2-1)*std(x2)^2)/(n1+n2-2)).

d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}}    (3.4)

Additionally, researchers often report the correlation matrix among the variables
involved in the study. A correlation matrix allows one to gauge how strongly the
variables are related to each other.

3.2.2.2 Pearson Product-Moment Correlation Coefficient

The Pearson product-moment correlation coefficient between variables x and y is calculated according to Eq. (3.5) (corr(x,y)).

r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}    (3.5)

The absolute value of r is an indicator of the strength of the linear relationship between x and y. r can take values between −1 and 1. If r = −1, the relationship between the two variables is perfectly negatively linear; if r = 1, the relationship is perfectly positively linear. The square of the correlation coefficient (r2) can be interpreted as the proportion of variance of y accounted for by x. Figure 3.2 illustrates six correlation coefficients.
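A minimal sketch of the variance interpretation of r2 (the data-generating model is an assumption for illustration):

rng(1)  % for reproducibility
n = 1000;
x = randn(n,1);
y = 0.6*x + 0.8*randn(n,1);  % population correlation of 0.6
r = corr(x,y);
b = polyfit(x,y,1);  % simple linear regression of y on x
disp([r^2 var(polyval(b,x))/var(y)])  % r^2 equals the proportion of variance explained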

Fig. 3.2 Two normally distributed variables (n = 1000) sampled from two populations having different Pearson correlation coefficients (the population correlation coefficient is designated by the symbol R and corresponds to the slope of the magenta dashed line). x and y have been drawn from a normal distribution population with μ = 0 and σ = 1

3.2.2.3 Point-Biserial Correlation

Cohen's d represents the magnitude of the difference between two samples (x1, x2), whereas r is the association between two variables (x, y) for the same sample. However, r can also be used to describe the magnitude of the difference between two samples, in which case it is called the point-biserial correlation coefficient. The point-biserial correlation coefficient is calculated from Eq. (3.5), with one variable being dichotomous (i.e., containing zeros and ones, which represent the group a data point belongs to) and the other variable being the pooled vector of both samples. In MATLAB, the point-biserial correlation is calculated as follows: rpb=corr([ones(n1,1);zeros(n2,1)],[x1;x2]). The point-biserial correlation is related to d according to Eq. (3.6) (Hedges and Olkin 1985; for more conversions between effect size measures, see Aaron et al. 1988; Rosenthal 1994). In MATLAB, the point-biserial correlation can be calculated based on d as follows: rpb=d/sqrt(d^2+(n1+n2)*(n1+n2-2)/(n1*n2)).

r_{pb} = \frac{d}{\sqrt{d^2 + \frac{(n_1+n_2)(n_1+n_2-2)}{n_1 n_2}}}    (3.6)
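A minimal sketch checking that both routes give the same value (the two samples are simulated; their sizes and means are assumptions):

rng(2)
n1 = 12; n2 = 15;
x1 = randn(n1,1) + 0.8;  % group 1
x2 = randn(n2,1);        % group 2
d = (mean(x1)-mean(x2))/sqrt(((n1-1)*var(x1)+(n2-1)*var(x2))/(n1+n2-2));
rpb_direct = corr([ones(n1,1); zeros(n2,1)], [x1; x2]);
rpb_from_d = d/sqrt(d^2 + (n1+n2)*(n1+n2-2)/(n1*n2));  % Eq. (3.6)
disp([rpb_direct rpb_from_d])  % identical up to rounding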

3.2.2.4 Spearman Rank-Order Correlation

Just like the median is more robust than the mean, so is the Spearman rank-order correlation more robust than the Pearson correlation. The Spearman correlation is calculated in the same way as the Pearson correlation, except that the data are first converted to ranks (De Winter et al. 2016). Thus, corr(tiedrank(x),tiedrank(y)) and corr(x,y,'type','spearman') give identical results. It is advisable to use the Spearman correlation when one expects that the variables have high kurtosis or when outliers may be present.
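The effect of a single outlier can be illustrated with a sketch such as the following (the data are simulated assumptions):

rng(3)
x = randn(50,1); y = randn(50,1);  % uncorrelated data
x(end) = 10; y(end) = 10;          % one extreme outlier
disp([corr(x,y) corr(x,y,'type','spearman')])
% the Pearson correlation is inflated by the outlier; the Spearman correlation is not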

3.2.2.5 Risk Ratios and Odds Ratios

Other effect size measures are risk ratios and odds ratios, which are used particularly in the medical field (De Winter et al. 2016). Referring to the example of Fig. 2.4, where using a monocular display was the risk factor and experiencing visual complaints was the outcome, one could create a 2 × 2 contingency table (Table 3.1) describing the exposure of the participants to the risk factor and their status with respect to the outcome.
The risk ratio (RR) is defined as the ratio of the probability of the outcome being present in the group of participants exposed to the risk factor to the probability of the outcome being present in the group of participants not exposed to the risk factor (Eq. (3.7)):

RR = \frac{A/(A+B)}{C/(C+D)}    (3.7)

The odds ratio (OR) is defined as the ratio of the odds of the outcome being present in the group of participants exposed to the risk factor to the odds of the outcome being present in the group of participants not exposed to the risk factor, where the odds are defined as the number of participants with the outcome present divided by the number of participants with the outcome absent (Eq. (3.8)):
OR = \frac{A/B}{C/D}    (3.8)

Table 3.1 Contingency table of participant counts based on their status with respect to the risk factor and the outcome variable

Risk factor                     Outcome
                                Experiencing visual complaints   Not experiencing visual complaints
Using monocular displays        A                                B
Not using monocular displays    C                                D

OR and RR should not be confused with each other. If the probability (prevalence) of the outcome is low (i.e., A/(A + B) < 20%), then OR can be approximated with RR; if the probability of the outcome is high, however, OR is considerably higher than RR (Davies et al. 1998; Schmidt and Kohlmann 2008). OR can be converted to RR according to Eq. (3.9) (Zhang and Kai 1998):

RR = \frac{OR}{1 - \frac{C}{C+D} + \frac{C}{C+D} \cdot OR}    (3.9)

An important property of OR is its invertibility. The odds ratio for a positive outcome (for the contingency table above: the odds of experiencing visual complaints when using monocular displays divided by the odds of experiencing visual complaints when not using monocular displays) and the odds ratio for a negative outcome (i.e., the odds of not experiencing visual complaints when using monocular displays divided by the odds of not experiencing visual complaints when not using monocular displays; also called 'odds for survival' in epidemiology) are reciprocal. This invertibility property does not hold for RR. Another attractive property of OR is that it can be used in case-control studies in which persons having a positive outcome (e.g., the persons having visual complaints) are over-sampled. The RR cannot be estimated in case-control studies, because in such studies the probability of the outcome is not known.
The point-biserial correlation for two binary variables corresponds to the phi coefficient and can be calculated according to Eq. (3.10) (Guilford and Perry 1951; Thorndike 1947):

r_{pb} = \phi = \frac{AD - BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}    (3.10)
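A minimal sketch computing RR, OR, the OR-to-RR conversion of Eq. (3.9), and phi for a hypothetical 2 × 2 table (the counts A, B, C, D are assumptions):

A = 30; B = 70; C = 10; D = 90;  % hypothetical counts
RR = (A/(A+B))/(C/(C+D));
OR = (A/B)/(C/D);
RR_from_OR = OR/(1 - C/(C+D) + C/(C+D)*OR);  % Eq. (3.9); equals RR
phi = (A*D - B*C)/sqrt((A+B)*(C+D)*(A+C)*(B+D));  % Eq. (3.10)
disp([RR RR_from_OR OR phi])  % 3.00  3.00  3.86  0.25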

3.2.3 What is a Strong Effect?

In human subject research, correlations close to −1 or 1 ought not to be expected. A d between 0.2 and 0.3 (corresponding to an rpb between 0.10 and 0.15, assuming equal sample sizes between groups) is typically considered a small effect, a d around 0.5 (corresponding to rpb = 0.24) is considered a medium effect, whereas a d greater than 0.8 (corresponding to rpb > 0.37) is considered a large effect (Cohen 1988; Meyer et al. 2001). For example, it has been found that the personality trait conscientiousness correlates moderately with job performance across various occupations (correlations between 0.17 and 0.29; Hurtz and Donovan 2000). If variables are free from measurement error and have a strong conceptual similarity, then correlations can be stronger than 0.4. For example, the height and arm span of humans correlate strongly (about r = 0.7–0.9; Goel and Tashakkori 2015; Reeves et al. 1996).

Fig. 3.3 Anscombe’s quartet

3.2.4 Why Tables Are Not Enough

Descriptive statistics in tables are useful but provide an incomplete picture of the
raw data. The importance of figures is nicely illustrated by Anscombe’s quartet
(Fig. 3.3; Anscombe 1973). Each of these four datasets has the same means for
x and y (9 and 7.5, respectively), the same variance for x and y (11 and 4.12,
respectively), and the same correlation between x and y (r = 0.82) (and see Matejka
and Fitzmaurice 2017, for more examples).
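The summary statistics can be verified for the first dataset of the quartet (values from Anscombe 1973):

x = [10 8 13 9 11 14 6 4 12 7 5]';
y = [8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68]';
disp([mean(x) mean(y) var(x) var(y) corr(x,y)])
% approximately: 9.0  7.5  11.0  4.13  0.82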
Useful figures for describing data are the histogram (histc or histcounts),
the boxplot (boxplot), time series, Fourier analysis (using fft), and scatter plots
(plot or scatter). No matter how data are plotted, it is important that not only
the central tendency (the mean or median) can be distinguished, but also the
variability (standard deviation, percentile values, or raw data).
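As a minimal sketch of a figure that shows both central tendency and variability (the data are simulated, and the plotting choices are one option among many):

data = randn(100,1);  % assumed data
histogram(data); hold on
q = prctile(data, [25 50 75]);
plot([q(2) q(2)], ylim, 'r-', 'LineWidth', 2)  % median
plot([q(1) q(1)], ylim, 'r--'); plot([q(3) q(3)], ylim, 'r--')  % quartiles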

3.3 Inferential Statistics

3.3.1 Sample Versus Population

Typically, in human subject research, researchers undertake a study on samples (i.e., a subset of the population), with the aim of estimating the true effect in the population. The larger the sample size, the more accurately the parameters of the sample reflect the population parameters. The standard deviation of the sample mean is called the standard error of the mean (SEM) and is estimated according to Eq. (3.11). The simulations in Textbox 3.2 illustrate that the SEM decreases as the sample size increases, regardless of the distribution of the variables. In other words, the larger the sample size, the closer the sample mean approximates the population mean.
SEM = \frac{s}{\sqrt{n}}    (3.11)

Fig. 3.4 Results of a simulation where the sample mean is calculated for 1,000,000 samples drawn from a normal distribution population with μ = 0 and σ = 1

Textbox 3.2 Illustration of the central limit theorem

Figure 3.4 shows the distribution of the mean of the sample for five sample sizes. The population is a normal distribution with mean μ = 0 and standard deviation σ = 1. The SEM (i.e., the standard deviation of the mean of the sample) decreases according to the square root of n (Eq. (3.11)).
Figure 3.5 shows the distribution of the sample mean, but now the parent population has an exponential distribution with μ = 1 and σ = 1. It can be seen that, as the sample size increases, the distribution of the sample mean approaches a normal distribution, in agreement with the central limit theorem. Again, the SEM decreases according to the square root of n.

Fig. 3.5 Results of a simulation where the sample mean is calculated for 1,000,000 samples drawn from an exponentially distributed population with μ = 1 and σ = 1

Table 3.2 The standard deviation of the sample mean as observed from the above simulations, in comparison with the expected value of 1/n^0.5

                                         n = 1   n = 2   n = 5   n = 20  n = 50
1/n^0.5                                  1.000   0.707   0.447   0.224   0.141
Figure 3.4 (normal distribution)         1.000   0.707   0.447   0.224   0.142
Figure 3.5 (exponential distribution)    0.998   0.707   0.447   0.224   0.142

Table 3.2 shows the SEMs observed in the above simulations in comparison with the expected value. It can be seen that Eq. (3.11) holds regardless of the distribution of the population (e.g., normal or exponential).
Note that in the simulations, the standard deviation of the population was known (σ = 1). In reality, the standard deviation is unknown and must be estimated from the sample. Because the sample standard deviation (s) is a biased estimate of the population standard deviation (see Textbox 3.1), the SEM based on the sample standard deviation (Eq. (3.11)) is a biased estimate of the SEM based on the population standard deviation.
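A minimal sketch of the exponential case (with fewer repetitions than the 1,000,000 used for Fig. 3.5):

reps = 100000; n = 20;
m = NaN(reps,1);
for i = 1:reps
    m(i) = mean(exprnd(1,n,1));  % sample mean; exponential with mu = 1
end
disp(std(m))   % close to 1/sqrt(20) = 0.224
histogram(m)   % approximately normal, per the central limit theorem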

3.3.2 Hypothesis Testing

Hypothesis testing can take different forms. The most common form is that of null hypothesis significance testing, in which there are four possibilities: (1) correctly rejecting the null hypothesis, (2) correctly accepting the null hypothesis, (3) rejecting the null hypothesis when it is true (Type I error), and (4) accepting the null hypothesis when it is false (Type II error) (Table 3.3; see also Fig. 3.6). The probability of rejecting the null hypothesis when it is false is the statistical power, or 1 − β, where β is the probability of a Type II error. In other words, the statistical power is the probability of not making a Type II error. The significance level α is the probability of a Type I error, that is, the probability of rejecting the null hypothesis when it is true.

Table 3.3 Four possibilities in null hypothesis significance testing

Outcome of the            State of the world
statistical test          There is an effect                There is no effect
                          (the null hypothesis is false)    (the null hypothesis is true)
There is an effect        Correct rejection                 Type I error (false positive)
(p < α)                   Probability: 1 − β                Probability: α
There is no effect        Type II error (false negative)    Correct acceptance
(p ≥ α)                   Probability: β                    Probability: 1 − α

Note p is the probability of getting a result equal to or more extreme than the observed result, under the assumption that the null hypothesis is true

Fig. 3.6 Left Illustration of a Type I error (false positive; i.e., to report there is something while
there is nothing). Right Illustration of a Type II error (false negative; i.e., to report there is nothing
while there is something). Photo on the left taken from Wikimedia Commons (https://commons.
wikimedia.org/wiki/File:Toyota_Curren_ST-206_1996_parking.jpg). Author: Qurren. Created: 26
April 2006. Photo on the right taken from Wikimedia Commons (https://commons.wikimedia.org/
wiki/File:Parking_violation_Vaughan_Mills.jpg). Author: Challisrussia. Created: 20 November
2011. Photo of the policeman adapted from Wikimedia Commons (https://commons.wikimedia.
org/wiki/File:British_Policeman.jpg). Author: Southbanksteve. Created: 15 November 2006


3.3.3 Independent-Samples t Test

If the data are sampled from normally distributed populations with equal variances, then the Student's t test is the most powerful unbiased test. This means that the t test gives the maximum probability of correctly rejecting the null hypothesis (it maximizes 1 − β) while maintaining the nominal Type I error rate (α).
The independent-samples Student's t test works as follows. It first calculates a t statistic according to Eq. (3.12):

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}    (3.12)

The t statistic is larger (1) when the difference between the means of the two samples is larger, (2) when the standard deviations of the samples are smaller, and (3) when the sample size is larger. The t statistic describes the distance between the two groups and is related to Cohen's d (Eq. (3.4)) according to Eq. (3.13) (Aaron et al. 1988; Rosenthal 1994):

t = \frac{d}{\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}    (3.13)

From the t statistic, the p value is calculated using the Student's t distribution. The Student's t distribution resembles the normal distribution but has heavier tails, especially when the sample size is small (Fig. 3.1). If the p value is smaller than α, the effect is said to be statistically significant.
Consider a situation where we want to test whether males have a different height than females. Let us assume that males and females are on average 182.5 and 168.7 cm tall, respectively (NCD Risk Factor Collaboration 2016), and that the standard deviation of both populations equals 7.1 cm (Fig. 3.7). Of course, in reality, we do not have access to the population distributions, and so we cannot know the population means and standard deviations; we only obtain data from samples. Let us sample 10 men and 10 women. In MATLAB, a t test can be performed as follows: [~,p,~,stats]=ttest2(x1,x2), with x1 being a vector of length n1 with the heights of the males, and x2 being a vector of length n2 with the heights of the females. The result of the t test is a p value, defined as the probability of obtaining a result equal to or more extreme than the observed result, assuming that the null hypothesis of equal means is true. For example, p = 0.020 means that, assuming two random samples were drawn from the same normal distribution, one would find such a large difference in only 2% of the cases (see also Textbox 3.3).
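A minimal sketch of this example (population values as assumed above; rng is set only for reproducibility):

rng(4)
x1 = 182.5 + 7.1*randn(10,1);  % sample of 10 men
x2 = 168.7 + 7.1*randn(10,1);  % sample of 10 women
[~, p, ~, stats] = ttest2(x1, x2);
disp([stats.tstat stats.df p])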

Fig. 3.7 Probability density function of the assumed population distributions of males and females. μ_women = 168.7 cm, μ_men = 182.5 cm, σ_men = σ_women = 7.1 cm

Textbox 3.3 p value, sample size, and statistical power

Let us do a simulation with a sample size of n = 3, 6, and 10 per group. Figure 3.8 shows all obtained p values when the procedure is repeated 10,000 times. Researchers declare a finding 'statistically significant' when the p value is smaller than α.
Figure 3.8 shows that the p value is generally smaller when the sample size is larger. Specifically, the statistical power (1 − β) is 0.44 when n = 3 per group, 0.86 when n = 6 per group, and 0.98 when n = 10 per group. Thus, using Eq. (3.12) and Fig. 3.8, it becomes clear that statistical power is a function of the following:
1. The effect size in the population (i.e., the difference in means with respect to the pooled standard deviation). In other words, the larger the difference in height between men and women, the more likely it is that the null hypothesis is rejected.
2. The sample size. The larger the sample size, the greater the statistical power.
3. The α value. Usually α is set at 0.05. When setting a more conservative α (e.g., 0.001), the statistical power decreases. The benefit of a low α is the protection against false positives (see Table 3.3).
4. The number of tails of the statistical test. Power is higher in one-tailed tests than in two-tailed tests (see Sect. 3.3.5). The simulation above was conducted with two-tailed t tests.
5. Measurement error. As explained in Sect. 2.9.1, measurement error, caused for example by unreliable sensors, reduces statistical power.
Fig. 3.8 Simulation results when submitting a sample of n men and n women to an independent-samples Student's t test (μ_men = 182.5 cm, μ_women = 168.7 cm, σ_men = σ_women = 7.1 cm; n_men = n_women = 3, 6, or 10). The p value is plotted against the test number (sorted on p value); the horizontal dashed line is drawn at α = 0.05

Fig. 3.9 Simulation results when submitting a sample of 10 men and 10 women to an independent-samples Student's t test. Here, it was assumed that men and women have equal height (μ_men = 182.5 cm, μ_women = 182.5 cm, σ_men = σ_women = 7.1 cm). The p value is plotted against the test number (sorted on p value); the horizontal dashed line is drawn at α = 0.05

Note that if the population means of both distributions are equal, then the
p value is uniformly distributed. In other words, if men and women had equal
height, the simulation results would look like those in Fig. 3.9. It can be seen
that a Type I error is made in 5% of the cases.

In a scientific paper, it is important to report not only the p value, but also the t statistic and the degrees of freedom of the Student's t distribution, as well as the means and standard deviations of the two samples. In the aforementioned example about the height of men and women, the results can be reported as follows: 'Men were taller (M = 182.5 cm, SD = 8.2 cm) than women (M = 169.1 cm, SD = 6.1 cm), t(18) = 4.14, p < 0.001'. Here, (18) is the degrees of freedom of the Student's t distribution, which equals n1 + n2 − 2.

3.3.4 Paired-Samples t Test

An independent-samples t test is used for comparing two groups, for example males with females, or the results of a between-subjects experiment. For a within-subject design, a paired-samples t test can be used ([~,p,~,stats]=ttest(x1,x2)). Here, the t statistic is a function of the change of the scores of participants between two conditions (Eq. (3.14)). A paired-samples t test is usually more powerful than an independent-samples t test, because participants are compared with themselves (see also Sect. 2.3). Specifically, the denominator in Eq. (3.14) is smaller than the denominator in Eq. (3.12) when the two samples are positively correlated (see Eq. (3.15)). The results of a paired t test can be reported as: 'The

task completion time was larger with the traditional walking aid (M = 51.7 s,
SD = 6.1 s) than with the exoskeleton (M = 43.9 s, SD = 7.6 s), t(9) = 3.34,
p = 0.009’. Here, (9) is the number of degrees of freedom, being equal to n − 1.

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{12}\sqrt{\frac{1}{n}}}    (3.14)

s_{12} = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}    (3.15)
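A minimal sketch contrasting the two tests on the same simulated within-subject data (all values are assumptions for illustration):

rng(5)
n = 10;
condition1 = 50 + 6*randn(n,1);              % score per participant
condition2 = condition1 - 8 + 3*randn(n,1);  % correlated with condition 1
[~, p_paired] = ttest(condition1, condition2);  % paired-samples t test
[~, p_indep] = ttest2(condition1, condition2);  % ignores the pairing
disp([p_paired p_indep])  % the paired p value is typically smaller here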

3.3.5 One-Tailed Versus Two-Tailed Tests

A statistical test can be one-tailed or two-tailed. A one-tailed test is used for testing a hypothesis in one direction, whereas a two-tailed test examines the hypothesis in both directions. For example, a two-tailed test can be used to examine whether a new exoskeleton is less or more efficient than a traditional walking aid. In one-tailed tests, only one of the two directions is tested; for example, to test whether a new exoskeleton is more efficient than a traditional walking aid (and not whether the exoskeleton is less efficient than the traditional walking aid). In MATLAB, a two-tailed test is the default. A one-tailed t test can be conducted as follows: [~,p]=ttest2(x1,x2,'tail','right') or [~,p]=ttest2(x1,x2,'tail','left'). If the test is one-sided, the p value is the probability of obtaining a result as extreme or more extreme in the selected direction, whereas the two-tailed probability is the one-tailed probability (for the nearest rejection side) multiplied by two. It is easier to reach significance (p < α) when using a one-tailed test as compared to a two-tailed test, but this should never be the reason for opting for one-tailed tests. In human subject research, it is customary to use two-tailed tests.

3.3.6 Alternatives to the t Test

Many human attributes, such as intelligence, are approximately normally distributed across participants (Burt 1957; Plomin and Deary 2015). However, in the case of diseases or disorders, non-normal distributions are common. Moreover, if the measurement scales are flawed, the normal distribution may not arise either. A ceiling effect occurs, for example, when almost all respondents answer 'totally agree' to a questionnaire item.
The t test is optimal when the two populations are normally distributed, and it is robust to deviations from normality. However, when variables have high kurtosis, when outliers are present, or when variables have unequal variances combined with unequal sample sizes, the t test can be suboptimal, which means that it

has low power (low 1 − β) or yields a Type I error rate that deviates from the nominal α. There are various alternative tests, such as the Welch test (Eq. (3.16); ttest2(x1,x2,[],[],'unequal')), which maintains the nominal Type I error rate when sample sizes are unequal in combination with unequal population variances:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}    (3.16)

There are also non-parametric variants of the t test, such as the Mann-Whitney U test (also called the Wilcoxon rank-sum test; ranksum(x1,x2)) and the Wilcoxon signed-rank test (signrank(x1,x2)). When there are more than two groups, an analysis of variance (ANOVA) can be used (anova1 for a one-way ANOVA), or its non-parametric equivalent, the Kruskal-Wallis test (kruskalwallis).
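A minimal usage sketch (simulated groups with unequal variances and unequal sizes; 'Vartype' is the name-value form of the Welch option):

rng(6)
x1 = randn(30,1);    % group 1: n = 30, sigma = 1
x2 = 2*randn(10,1);  % group 2: n = 10, sigma = 2
[~, p_student] = ttest2(x1, x2);                      % assumes equal variances
[~, p_welch] = ttest2(x1, x2, 'Vartype', 'unequal');  % Welch test
p_ranksum = ranksum(x1, x2);                          % Mann-Whitney U test
disp([p_student p_welch p_ranksum])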

3.3.7 Multivariate Statistics

The t test is a univariate test, meaning that one variable (x) is analysed per participant. Multivariate statistical methods analyse more than one variable simultaneously. The correlation coefficient and simple linear regression are relatively simple bivariate methods, involving two variables (x and y) per participant. A regression analysis with multiple predictor variables and one criterion variable is called multiple regression (regress). Examples of more sophisticated multivariate statistical techniques are: (1) multivariate regression (mvregress; for predicting multiple criterion variables), (2) exploratory factor analysis (factoran; a linear system in which the predictor variables are not directly observed, and the number of predictor variables is smaller than the number of criterion variables), (3) principal component analysis (pca; a data reduction technique which resembles factor analysis), (4) structural equation modelling (a combination of multivariate regression and factor analysis), and (5) multivariate analysis of variance (manova; this resembles multivariate regression).
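A minimal sketch of multiple regression with regress (the data-generating model is an assumption for illustration):

rng(7)
n = 100;
X = randn(n,2);  % two predictor variables
y = 1 + 0.5*X(:,1) - 0.3*X(:,2) + randn(n,1);  % one criterion variable
b = regress(y, [ones(n,1) X]);  % the intercept requires a column of ones
disp(b')  % estimates close to [1 0.5 -0.3]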

3.4 Pitfalls of Null Hypothesis Significance Testing

As explained above, if p < 0.05 (for α = 0.05), then the effect is declared statistically significant. However, a statistically significant finding does not imply that the effect is strong or important. Suppose that the true effect is small (e.g., a difference of 0.1 cm between the heights of men and women) but the sample size is very large (e.g., n_men = n_women = 1,000,000); then it is very likely that p < 0.05. That is, the effect is statistically significant because the sample size is very large, but the size of the effect is small (a difference of 0.1 cm; d = 0.014) and therefore does not necessarily have practical relevance.
Furthermore, a statistically significant finding does not imply that the alternative
hypothesis is true. Null hypothesis significance testing cannot establish whether a
hypothesis is true or false. After all, a p value represents the likelihood of the data,
assuming that the null hypothesis holds. Such a failure of significance testing was
demonstrated by the work of Bem (2011) who, based on a number of experiments
that yielded p < 0.05, claimed that people are able to ‘feel the future’. Based on
what is known from physics and numerous counterexamples such as the fact that
casinos still make money, however, it is extremely unlikely that people can really
feel the future. Therefore, the statistically significant findings reported by Bem
(2011) have to be false positives (Wagenmakers et al. 2015).

3.4.1 Most Published Research Findings Are False

In a highly cited article, Ioannidis (2005) claimed that "most published research findings are false". Since then, he has been proven right in various areas, such as medicine (Begley and Ellis 2012; Freedman et al. 2015; Ioannidis 2007), experimental economics (Ioannidis and Doucouliagos 2013), and psychology (Open Science Collaboration 2015; Textbox 3.4).
Ioannidis' (2005) argument is as follows. He first points out that if a researcher reports 'there is an effect, p < α', it can be either a true positive or a false positive (see also Table 3.3). The pre-study probability that a research finding is true is called π. π depends strongly on the research field. For example, in the area of research into clairvoyance, π is extremely close to 0. Conversely, if the research field targets highly probable effects (such as the hypothesis that males are taller than females), then π is extremely close to 1.

Textbox 3.4 Replicability

A study is successfully replicated if a repetition of the study in highly similar conditions but with different participants leads to equivalent results (Asendorpf et al. 2013). The Open Science Collaboration Project was the collaborative effort of 270 authors who repeated 97 published psychological studies having reported significant results, in order to investigate whether the statistical results would replicate (Open Science Collaboration 2015). The results, published in Science, showed that, of the 97 replications, only 35 were statistically significant (Fig. 3.10).

Fig. 3.10 p values in original studies versus replication studies. Dashed lines run across p = 0.05. See also Open Science Collaboration (2015, 2016)

The project received extensive media attention, partly in the form of praise for the effort to replicate a large number of studies. Criticisms appeared as well. Gilbert et al. (2016) questioned the results of the Open Science Collaboration, arguing that the replication studies differed in important ways from the original studies (with the protocols of the replications being endorsed by only 69% of the original authors). It has been further argued that, because of the small sample sizes of the original studies, failure to replicate (even by means of a high-powered replication) cannot tell us much about the original results (Etz and Vandekerckhove 2016; Morey and Lakens 2016).

The expected number of true positives is the product of the statistical power (1 − β) and π. Similarly, the expected number of false positives is α times the probability that a research hypothesis is false (1 − π). Thus, the probability that a statistically significant research finding is false (the False Positive Report Probability; FPRP) equals the expected number of false positives divided by the expected number of true positives plus the expected number of false positives (Eq. (3.17); Wacholder et al. 2004).

FPRP = \frac{\alpha(1-\pi)}{\alpha(1-\pi) + (1-\beta)\pi}    (3.17)

According to Eq. (3.17), research findings are more likely to be true when π is higher and when the statistical power is higher. Ioannidis (2005) argues that in confirmatory research, such as randomized controlled trials, the FPRP is probably less than 0.5. However, if research is exploratory (i.e., discovery-oriented), then it becomes likely that a positive research finding is false. Figure 3.11 illustrates the perils of research with low π. In the case presented, π = 0.02, 1 − β = 0.8, and α = 0.05, yielding an FPRP of 75% [see also Eq. (3.17)].
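The 75% can be reproduced directly from Eq. (3.17):

p_pre = 0.02; power = 0.8; alpha = 0.05;  % pi is a built-in constant, hence p_pre
FPRP = alpha*(1-p_pre)/(alpha*(1-p_pre) + power*p_pre);
disp(FPRP)  % 0.7538, i.e., about 75%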

Fig. 3.11 When the pre-study probability that a hypothesis is true is 2% (π = 0.02), the significance level α = 0.05, and the statistical power 1 − β = 0.8, then 25% of the reported statistically significant results are true positives, whereas the remaining 75% are false positives. Figure based on Jager and Leek (2013). [Flowchart: of 1,000 tested hypotheses, 2% (20) are true and 980 are false; 80% of the true ones (16) and 5% of the false ones (49) are declared significant, so 16/(16+49) = 25% of significant results are true positives and 49/(16+49) = 75% are false positives]

3.4.2 Bias

In his paper, Ioannidis (2005) describes another risk: bias. Bias is the tendency of
researchers to ‘tweak’ a p value so that it becomes statistically significant while it
should not have been significant. There is evidence that researchers ‘like’ statisti-
cally significant results (e.g., Bakker and Wicherts 2011): a p < 0.05 might please
the sponsors (Lexchin et al. 2003), get the work more easily accepted into a journal
(Mobley et al. 2013), attract media attention (Boffetta et al. 2008), or reflect the
researchers’ tendency to confirm their own hypothesis (i.e., experimenter’s
expectancy, see Sect. 2.5.2). Questionable research practices during statistical
analysis leading to false positives (i.e., Type I errors) are (for an overview, see
Banks et al. 2016; Forstmeier et al. in press):

• Recruiting participants until statistically significant results are obtained (also called optional stopping). For example, suppose a researcher has tested 20 participants and observes that p > 0.05; he then lets a few more people participate and tests again whether the effect is statistically significant.
• Excluding/modifying data after looking at the impact of doing so. Because it turns out that the results are not statistically significant, a researcher drops or aggregates measures or observations, for example, by trying out different outlier removal criteria (Bakker and Wicherts 2014).
• Trying out different statistical tests. With statistical software such as SPSS, statistics can become misleadingly easy. With a few clicks of a mouse button, it is possible to run complex statistical analyses. As Leggett et al. (2013) put it: "Modern statistical software now allows for simple and instantaneous calculations, permitting researchers to monitor their data continuously while collecting it." If one tries out multiple options and selects the 'best-looking' result, this result may well be a false positive. Figure 3.12 shows the results of a simulation in which a researcher conducts an independent-samples t test and, if the results are not significant (p > 0.05), applies a Wilcoxon rank-sum test; if this yields p < 0.05, he reports the results of the latter (a sketch of this simulation follows the caption of Fig. 3.12). In these simulations, the null hypothesis of equal means was true. Because of this questionable practice, a peak of p values just below 0.05 arises, whereas a uniform distribution is expected. Accordingly, the Type I error rate has increased to 6%, whereas it should have been 5% (see also De Winter 2015). Similarly, pre-testing the assumptions of the statistical test, for example by goodness-of-fit tests for normality, does not pay off and may actually lead to increased Type I and Type II error rates (Rasch et al. 2011).

Fig. 3.12 Distribution of p values when ‘strategically’ selecting a non-parametric test when the
parametric test yields a result that is not statistically significant. In this simulation, 1,000,000
independent-samples t tests were run with a sample size of 25 per group
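A minimal sketch of this simulation (with fewer repetitions than the 1,000,000 in the caption):

rng(8)
reps = 100000; n = 25;
p = NaN(reps,1);
for i = 1:reps
    x1 = randn(n,1); x2 = randn(n,1);  % null hypothesis of equal means is true
    [~, p(i)] = ttest2(x1, x2);        % parametric test first
    if p(i) > 0.05
        p2 = ranksum(x1, x2);          % then the non-parametric test
        if p2 < 0.05
            p(i) = p2;                 % report the 'better-looking' result
        end
    end
end
disp(mean(p < 0.05))  % Type I error rate above the nominal 0.05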

3.4.3 Recommendations in Order to Maximize the Replicability of a Work

Above, we illustrated some of the pitfalls of null hypothesis significance testing. To maximize the replicability of a work, the following advice is offered:
• Improve your understanding of π (Eq. (3.17)). In other words, try to assess how likely it is that the alternative hypothesis is true, prior to collecting the data. Such information can be acquired through a literature review and by asking colleagues in the field. π determines how sceptical a researcher should be towards one's own results. 'Surprising' results are a reason for concern.
• When possible, analyse the data blind to the experimental condition. In Sect. 2.5.2, experimenter expectancy bias while conducting an experiment was discussed. Experimenter expectancy bias can also affect the statistical analysis of the data: researchers may fall victim to confirmation bias when analysing their own data.
• Do not 'chase' statistical significance. By analysing over 135,000 p values, Krawczyk (2015) concluded that authors "round the p values down more eagerly than up". Researchers should try to remain unbiased and not try out things in order to achieve p < 0.05. That is, it is not acceptable to inspect the data and, if an outlier is present, switch to a different test or remove the outliers (Bakker and Wicherts 2014). The choice of statistical procedure should be based on 'extra-data' sources, such as the results of pilot tests or theoretical considerations prior to conducting the study. Only if data points are erroneous (e.g., due to a participant who misunderstood the task instructions or a sensor failure) may these data points be removed (and this removal should be reported in the paper). In a paper, it is important to report all results, not only the significant ones.
• Correct for multiple testing where appropriate. Testing multiple hypotheses increases the Type I error rate, because the more tests one conducts, the higher the probability that at least one of these tests produces a statistically significant result. In particle physics, this is called the look-elsewhere effect (Gross and Vitells 2010). The α value can be adjusted downward, making statistical significance testing more conservative. The Bonferroni correction is a well-known (but perhaps overly conservative) adjustment method. In a Bonferroni correction, α is divided by the number of statistical tests that have been conducted. Similarly, it has been argued that researchers should use α = 0.001 instead of the more common α = 0.05 in order to prevent false positives (Colquhoun 2014).
• Use large sample sizes. The larger the sample size, the higher the statistical power (1 − β). This means that if the null hypothesis is false, the larger the sample size, the more likely it is that the null hypothesis is indeed rejected. Moreover, the higher the statistical power, the more likely it is that a research finding is in fact true (Eq. (3.17)). By means of a power analysis, it is possible to compute the required sample size for a given level of significance, desired power, and expected effect size (see the sketch after this list). An excellent power analysis tool is G*Power, which can be downloaded for free: http://www.gpower.hhu.de. Another useful tool, an effect size calculator which does not require installing software but runs in Microsoft Excel, is provided by Lakens (2013).
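With the Statistics Toolbox, a power analysis for the two-sample t test of the height example can be sketched as follows (α = 0.05 is the function's default):

n = sampsizepwr('t2', [168.7 7.1], 182.5, 0.8);  % [mu0 sigma], mu1, desired power
disp(n)  % required n per group; small here, because d is about 1.9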

3.5 Final Note

Note that this book focuses on frequentist inference (e.g., null hypothesis significance testing and p values). However, the use of frequentist inference has been criticized by many, because it dichotomizes research into significant and non-significant findings, and because p values are easily misinterpreted. Nowadays, Bayesian inference is gaining popularity (Cumming 2013; Poirier 2006), because it does not suffer from the same problems as frequentist inference does (Wagenmakers et al. 2008). Bayesian statistical methods are available in several software packages, including WinBUGS (Lunn et al. 2000) and Mplus (Kaplan and Depaoli 2012). However, frequentist inference still seems to be dominant. In an analysis of abstracts published in biomedical journals between 1990 and 2015, Chavalarias et al. (2016) found that, out of 796 abstracts of papers with empirical data, 15.7% of the abstracts reported p values, 13.9% reported effect sizes, 2.3% reported confidence intervals, and 0% reported a Bayes factor.

References

Aaron, B., Kromrey, J. D., & Ferron, J. (1988). Equating r-based and d-based effect size indices:
Problems with a commonly recommended formula. Paper presented at the 43rd Annual
Meeting of the Florida Educational Research Association, Orlando, FL.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21.
https://doi.org/10.1080/00031305.1973.10478966
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., et al.
(2013). Recommendations for increasing replicability in psychology. European Journal of
Personality, 27, 108–119. https://doi.org/10.1002/per.1919
Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology
journals. Behavior Research Methods, 43, 666–678. https://doi.org/10.3758/s13428-011-0089-5
Bakker, M., & Wicherts, J. M. (2014). Outlier removal and the relation with reporting errors and
quality of psychological research. PLOS ONE, 9, e103360. https://doi.org/10.1371/journal.
pone.0103360
Banks, G. C., O’Boyle, E. H., Pollack, J. M., White, C. D., Batchelor, J. H., Whelpley, C. E., et al.
(2016). Questions about questionable research practices in the field of management: A guest
commentary. Journal of Management, 42, 5–20. https://doi.org/10.1177/0149206315619011
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer
research. Nature, 483, 531–533. https://doi.org/10.1038/483531a

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences
on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. https://
doi.org/10.1037/a0021524
Boffetta, P., McLaughlin, J. K., La Vecchia, C., Tarone, R. E., Lipworth, L., & Blot, W. J. (2008).
False-positive results in cancer epidemiology: A plea for epistemological modesty. Journal of
the National Cancer Institute, 100, 988–995. https://doi.org/10.1093/jnci/djn191
Bolch, B. W. (1968). The teacher’s corner: More on unbiased estimation of the standard deviation.
The American Statistician, 22, 27. https://doi.org/10.1080/00031305.1968.10480476
Burt, C. (1957). Distribution of intelligence. British Journal of Psychology, 48, 161–175. https://
doi.org/10.1111/j.2044-8295.1957.tb00614.x
Chavalarias, D., Wallach, J. D., Li, A. H. T., & Ioannidis, J. P. (2016). Evolution of reporting
p values in the biomedical literature, 1990–2015. JAMA, 315, 1141–1148. https://doi.org/10.
1001/jama.2016.1952
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-
values. Royal Society Open Science, 1, 140216. https://doi.org/10.1098/rsos.140216
Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 25, 7–29. https://doi.org/10.1177/0956797613504966
Davies, H. T. O., Crombie, I. K., & Tavakoli, M. (1998). When can odds ratios mislead? BMJ,
316, 989–991. https://doi.org/10.1136/bmj.316.7136.989
De Winter, J. C. F. (2015). A commentary on “Problems in using text-mining and p-curve analysis
to detect rate of p-hacking”. https://sites.google.com/site/jcfdewinter/Bishop%20short%
20commentary.pdf?attredirects=0&d=1
De Winter, J. C. F., Gosling, S. D., & Potter, J. (2016). Comparing the Pearson and Spearman
correlation coefficients across distributions and sample sizes: A tutorial using simulations and
empirical data. Psychological Methods, 21, 273–290. https://doi.org/10.1037/met0000079
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292–307.
https://doi.org/10.1037/1082-989X.2.3.292
Etz, A., & Vandekerckhove, J. (2016). A Bayesian perspective on the reproducibility project:
Psychology. PLOS ONE, 11, e0149794. https://doi.org/10.1371/journal.pone.0149794
Field, A. (2013). Discovering statistics using IBM SPSS statistics. London, UK: Sage Publications.
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13, e1002165. https://doi.org/10.1371/journal.pbio.1002165
Forstmeier, W., Wagenmakers, E. J., & Parker, T. H. (in press). Detecting and avoiding likely false-positive findings: A practical guide. Biological Reviews. https://doi.org/10.1111/brv.12315
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the
reproducibility of psychological science”. Science, 351, 1037. https://doi.org/10.1126/science.
aad7243
Goel, S., & Tashakkori, R. (2015). Correlation between body measurements of different genders
and races. In J. Rychtár, M. Chhetri, S. N. Gupta, & R. Shivaji (Eds.), Collaborative
mathematics and statistics research (pp. 7–17). Springer International Publishing. https://doi.
org/10.1007/978-3-319-11125-4_2
Gross, E., & Vitells, O. (2010). Trial factors for the look elsewhere effect in high energy physics.
The European Physical Journal C, 70, 525–530. https://doi.org/10.1140/epjc/s10052-010-
1470-8
Guilford, J. P., & Perry, N. C. (1951). Estimation of other coefficients of correlation from the phi
coefficient. Psychometrika, 16, 335–346. https://doi.org/10.1007/BF02310556
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.

Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited.
Journal of Applied Psychology, 85, 869–879. https://doi.org/10.1037/0021-9010.85.6.869
Ioannidis, J. P. (2005). Why most published research findings are false. PLOS Medicine, 2, e124.
https://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. (2007). Non-replication and inconsistency in the genome-wide association setting.
Human Heredity, 64, 203–213. https://doi.org/10.1159/000103512
Ioannidis, J., & Doucouliagos, C. (2013). What’s to know about the credibility of empirical
economics? Journal of Economic Surveys, 27, 997–1004. https://doi.org/10.1111/joes.12032
Jager, L. R., & Leek, J. T. (2013). An estimate of the science-wise false discovery rate and
application to the top medical literature. Biostatistics, 15, 1–12. https://doi.org/10.1093/
biostatistics/kxt007
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In R. Hoyle (Ed.),
Handbook of structural equation modeling (pp. 650–673). New York: Guilford Press.
Krawczyk, M. (2015). The search for significance: A few peculiarities in the distribution of
p values in experimental psychology literature. PLOS ONE, 10, e0127872. https://doi.org/10.
1371/journal.pone.0127872
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A
practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. https://doi.org/10.3389/
fpsyg.2013.00863
Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just
significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66,
2303–2309. https://doi.org/10.1080/17470218.2013.863371
Lexchin, J., Bero, L. A., Djulbegovic, B., & Clark, O. (2003). Pharmaceutical industry
sponsorship and research outcome and quality: Systematic review. BMJ, 326, 1167–1170.
https://doi.org/10.1136/bmj.326.7400.1167
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS-a Bayesian modelling
framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
https://doi.org/10.1023/A:1008929526011
Matejka, J., & Fitzmaurice, G. (2017). Same stats, different graphs: Generating datasets with
varied appearance and identical statistics through simulated annealing. Proceedings of the 2017
CHI Conference on Human Factors in Computing Systems, 1290–1294. https://doi.org/10.
1145/3025453.3025912
Meyer, G. J., Finn, S. E., Eyde, L. D., Kay, G. G., Moreland, K. L., Dies, R. R., … & Reed, G. M.
(2001). Psychological testing and psychological assessment: A review of evidence and issues.
American Psychologist, 56, 128–165. https://doi.org/10.1037/0003-066X.56.2.128
Mobley, A., Linder, S. K., Braeuer, R., Ellis, L. M., & Zwelling, L. (2013). A survey on data
reproducibility in cancer research provides insights into our limited ability to translate findings
from the laboratory to the clinic. PLOS ONE, 8, e63221. https://doi.org/10.1371/journal.pone.
0063221
Morey, R. D., & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. https://
raw.githubusercontent.com/richarddmorey/psychology_resolution/master/paper/response.pdf
NCD Risk Factor Collaboration. (2016). A century of trends in adult human height. ELife, 5,
e13410. https://doi.org/10.7554/eLife.13410
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science.
Science, 349, aac4716. https://doi.org/10.1126/science.aac4716
Open Science Collaboration. (2016). RPPdataConverted.xlsx. https://osf.io/ytpuq/
Plomin, R., & Deary, I. J. (2015). Genetics and intelligence differences: Five special findings.
Molecular Psychiatry, 20, 98–108. https://doi.org/10.1038/mp.2014.105
Poirier, D. J. (2006). The growth of Bayesian methods in statistics and economics since 1970.
Bayesian Analysis, 1, 969–979.
Rasch, D., Kubinger, K. D., & Moder, K. (2011). The two-sample t test: Pre-testing its
assumptions does not pay off. Statistical Papers, 52, 219–231. https://doi.org/10.1007/s00362-
009-0224-x
Reeves, S. L., Varakamin, C., & Henry, C. J. (1996). The relationship between arm-span
measurement and height with special reference to gender and ethnicity. European Journal of
Clinical Nutrition, 50, 398–400.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The
handbook of research synthesis (pp. 231–244). New York, NY: Russell Sage Foundation.
Schmidt, C. O., & Kohlmann, T. (2008). When to use the odds ratio or the relative risk?
International Journal of Public Health, 53, 165–167. https://doi.org/10.1007/s00038-008-
7068-3
Tabachnick, B. G., & Fidell, L. S. (1989). Using multivariate statistics. New York: Harper & Row.
Thorndike, R. L. (1947). Research problems and techniques (Report No. 3). Washington DC:
Army Air Forces.
Wacholder, S., Chanock, S., Garcia-Closas, M., & Rothman, N. (2004). Assessing the probability
that a positive report is false: An approach for molecular epidemiology studies. Journal of the
National Cancer Institute, 96, 434–442. https://doi.org/10.1093/jnci/djh075
Wagenmakers, E. J., Lee, M., Lodewyckx, T., & Iverson, G. J. (2008). Bayesian versus frequentist
inference. In H. Hoijtink, I. Klugkist, & P. A. Boelen (Eds.), Bayesian evaluation of
informative hypotheses (pp. 181–207). New York: Springer.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., Kievit, R. A., & Van der Maas, H. L. (2015).
A skeptical eye on psi. In E. C. May & S. B. Marwaha (Eds.), Extrasensory perception:
Support, skepticism, and science (Volume I) (pp. 153–176). Santa Barbara, CA: ABC-CLIO
LLC.
Zhang, J., & Yu, K. F. (1998). What’s the relative risk? A method of correcting the odds ratio in
cohort studies of common outcomes. JAMA, 280, 1690–1691. https://doi.org/10.1001/jama.
280.19.1690
MATLAB Scripts

See Figs. 1.1, 2.2, 2.5, 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8 and 3.9, 3.10, and 3.12.
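Note that most of the scripts below rely on functions from MATLAB's Statistics (and Machine Learning) Toolbox, such as tpdf, exppdf, normpdf, exprnd, ttest2, corr, and ranksum; the remaining functions are part of base MATLAB.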


Fig. 2.5 Weight judgement: Wisdom of the crowd (Gordon 1924; Eysenck 1939)

clear variables
r=.41^2;                        % squared correlation of a single judgement (.41)
n1=[1 5 10 20 50];              % group sizes for the observed data points
robs=[.41 .68 .79 .86 .94];     % observed correlations with the true weights
n2=1:200;
R=sqrt((n2.*r)./(1+(n2-1).*r)); % predicted correlation for group size n2
figure('Name','Figure 2.5','NumberTitle','off');hold on
plot(n1,robs,'ko','Markersize',14,'Markerfacecolor','k')
plot(n2,R,'Linewidth',2)
set(gca,'color','None');grid on;box on
set(gca,'xlim',[0 100],'ylim',[0 1])
legend('Observed correlation','Predicted correlation','location','southeast')
xlabel('Number of participants per group')
ylabel('Correlation with true weights')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
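The predicted curve applies a Spearman-Brown-type aggregation formula: if a single judgement correlates .41 with the true weights (so r = .41^2 of the variance is shared), the mean of n independent judgements is predicted to correlate sqrt(nr/(1+(n-1)r)) with the truth. As a minimal worked example (our own addition, not part of the original script), the prediction for a hypothetical group of 10 judges:

% Worked example (our own addition): predicted correlation for n = 10 judges
r=.41^2;                   % shared variance of a single judgement
n=10;
R10=sqrt(n*r/(1+(n-1)*r))  % approximately 0.82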

Fig. 3.1 Probability density functions for four different distributions

clear variables
x=-50:.001:50;y=NaN(3,length(x));
y(1,:)=tpdf(x,5);   % t distribution with 5 degrees of freedom
y(2,:)=tpdf(x,inf); % t with infinite df, i.e., the standard normal
y(3,:)=exppdf(x,1); % exponential distribution with mean 1
figure('Name','Figure 3.1','NumberTitle','off');hold on
set(gca,'LooseInset',[0.01 0.01 0.01 0.01]);
plot(x,y(2,:),'k','Linewidth',2)
plot(x,y(1,:),'g','Linewidth',2)
% rescale the t(5) pdf to unit variance (its variance is 5/3)
plot(x./sqrt(5/3),y(1,:).*sqrt(5/3),'-','color',[255 165 0]/255,'Linewidth',2)
plot(x,y(3,:),'m:','Linewidth',2)
h=legend('(1) Normal : variance = 1, skewness = 0, kurtosis = 3',...
    '(2) \it{t}\rm : variance = 5/3, skewness = 0, kurtosis = 9',...
    '(3) \it{t}\rm (scaled) : variance = 1, skewness = 0, kurtosis = 9',...
    '(4) Exponential : variance = 1, skewness = 2, kurtosis = 9',...
    'location','northeast','orientation','vertical');
set(h,'color','none')
xlabel('Value')
ylabel('Density')
set(gca,'color','None')
set(gca,'xlim',[-5 10],'ylim',[0 1.01],'xtick',-10:1:10,'FontSize',24);
% rectangle marking the right-tail region that is magnified in the inset
h=rectangle('position',[3 0 2 .06],'facecolor','none','Linewidth',1);
pan on
set(h,'Clipping','off')
plot([3.7 4],[.06 .23],'k-','Linewidth',1) % connector between rectangle and inset
ah=axes('position',[.5465 .30 .35 .35]);   % inset axes
hold on;box on
plot(x,y(1,:),'g','Linewidth',2)
plot(x./sqrt(5/3),y(1,:).*sqrt(5/3),'-','color',[255 165 0]/255,'Linewidth',2)
plot(x,y(2,:),'k','Linewidth',2)
plot(x,y(3,:),'m:','Linewidth',2)
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'xlim',[3 5],'ylim',[0 .06],'FontSize',16,'color','none')
set(gca,'LooseInset',[0.01 0.01 0.01 0.01]);

Fig. 3.2 Illustration of correlations

clear variables;rng('default')
N=1000;RR=[0 .2 .4 .6 .8 .9]; % target correlations, one per subplot
figure('Name','Figure 3.2','NumberTitle','off');hold on
for i=1:length(RR)
    subplot(2,3,i);hold on
    set(gca,'color','none')
    plot([-10 10],[-10 10]*RR(i),'m--','Linewidth',2) % line y = Rx
    h=legend(['\rm\it{R}\rm = ' num2str(RR(i))],'location','southeast');
    set(h,'color','none')
    % generate y so that the population correlation between x and y is RR(i)
    x=randn(N,1);y=RR(i)*x+sqrt((1-RR(i)^2))*randn(N,1);
    plot(x,y,'ko');
    xlabel('\itx');ylabel('\ity');
    axis equal
    set(gca,'xlim',[-5 5],'ylim',[-5 5],'xtick',[-5 0 5],'ytick',[-5 0 5])
    box on
end
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',18)
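Each panel's scatter is generated as y = Rx + sqrt(1 - R^2)e, with x and e independent standard normal variables, which makes the population correlation between x and y exactly R. A quick sanity check of this construction (our own addition, using a large sample so the estimate is stable):

% Verify the construction for one target correlation (our own addition)
rng('default')
R_target=.6;N=10^6;
x=randn(N,1);
y=R_target*x+sqrt(1-R_target^2)*randn(N,1);
corr(x,y) % close to 0.60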

Fig. 3.3 Anscombe’s quartet

clear variables
% Anscombe's quartet: columns are x1 y1 x2 y2 x3 y3 x4 y4
d=[10.0  8.04 10.0 9.14 10.0  7.46  8.0  6.58
    8.0  6.95  8.0 8.14  8.0  6.77  8.0  5.76
   13.0  7.58 13.0 8.74 13.0 12.74  8.0  7.71
    9.0  8.81  9.0 8.77  9.0  7.11  8.0  8.84
   11.0  8.33 11.0 9.26 11.0  7.81  8.0  8.47
   14.0  9.96 14.0 8.10 14.0  8.84  8.0  7.04
    6.0  7.24  6.0 6.13  6.0  6.08  8.0  5.25
    4.0  4.26  4.0 3.10  4.0  5.39 19.0 12.50
   12.0 10.84 12.0 9.13 12.0  8.15  8.0  5.56
    7.0  4.82  7.0 7.26  7.0  6.42  8.0  7.91
    5.0  5.68  5.0 4.74  5.0  5.73  8.0  6.89];
figure('Name','Figure 3.3','NumberTitle','off');hold on
for i=1:4 % one subplot per dataset
    subplot(2,2,i)
    plot(d(:,i*2-1),d(:,i*2),'ko','Markersize',10,'Markerfacecolor','k')
    xlabel('\itx');ylabel('\ity');
    set(gca,'xlim',[3 20],'ylim',[4 14]);
    set(gca,'color','none')
end
% the four datasets share (nearly) identical summary statistics
disp('Means')
fprintf('%8.3f',mean(d(:,1:2:end)));fprintf('\n')
fprintf('%8.3f',mean(d(:,2:2:end)));fprintf('\n')
disp('Variances')
fprintf('%8.3f',var(d(:,1:2:end)));fprintf('\n')
fprintf('%8.3f',var(d(:,2:2:end)));fprintf('\n')
disp('Correlations')
fprintf('%8.3f',[corr(d(:,1),d(:,2)) corr(d(:,3),d(:,4)) ...
    corr(d(:,5),d(:,6)) corr(d(:,7),d(:,8))]);fprintf('\n')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca,'LooseInset',[0.01 0.01 0.01 0.01]);
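Despite the radically different scatter plots, the printed means, variances, and correlations are nearly identical across the four datasets; Matejka and Fitzmaurice (2017) generalize this idea by generating arbitrary point clouds with matching summary statistics.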

Fig. 3.4 Central limit theorem—normal distribution

clear variables;rng('default')
reps=10^6;nn=[1 2 5 20 50];V=-100:0.01:100;SE=NaN(length(nn),1);
figure('Name','Figure 3.4','NumberTitle','off');
for i=1:length(nn) % loop over 5 sample sizes
    M=mean(randn(nn(i),reps),1);   % sample means (vector of length reps)
    D=histc(M,V);                  % histogram of the sample means
    Dnorm=D./sum(D)/mean(diff(V)); % normalize counts to a density
    SE(i)=std(M);                  % empirical SD of the sample mean
    plot(V+mean(diff(V)),Dnorm,'-o','linewidth',2);hold on
end
h=legend('\it{n}\rm = 1','\it{n}\rm = 2','\it{n}\rm = 5',...
    '\it{n}\rm = 20','\it{n}\rm = 50');
set(h,'color','none')
set(gca,'xlim',[-3 3])
xlabel('Sample mean')
ylabel('Density')
set(gca,'color','None')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca,'LooseInset',[0.01 0.01 0.01 0.01]);
% the empirical SD of the sample mean should match the theoretical 1/sqrt(n)
disp('1/sqrt(n)')
for i=1:length(SE);fprintf('%8.3f',1/sqrt(nn(i)));fprintf('\n');end
disp('Standard deviation of the sample mean')
for i=1:length(SE);fprintf('%8.3f',SE(i));fprintf('\n');end

Fig. 3.5 Central limit theorem—exponential distribution

clear variables;rng('default')
reps=10^6;nn=[1 2 5 20 50];V=-100:0.01:100;SE=NaN(length(nn),1);
figure('Name','Figure 3.5','NumberTitle','off')
for i=1:length(nn)
    M=mean(exprnd(1,nn(i),reps),1); % sample means of exponential(1) draws
    D=histc(M,V);
    Dnorm=D./sum(D)/mean(diff(V));  % normalize counts to a density
    SE(i)=std(M);
    plot(V+mean(diff(V)),Dnorm,'-o','linewidth',2);hold on
end
h=legend('\it{n}\rm = 1','\it{n}\rm = 2','\it{n}\rm = 5',...
    '\it{n}\rm = 20','\it{n}\rm = 50');
set(h,'color','none')
set(gca,'xlim',[-.1 3])
xlabel('Sample mean')
ylabel('Density')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca,'color','none','looseInset',[0.01 0.01 0.01 0.01])
disp('Standard deviation of the sample mean')
for i=1:length(SE);fprintf('%8.3f',SE(i));fprintf('\n');end
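Because the exponential distribution with mean 1 also has standard deviation 1, the printed values should again be close to 1/sqrt(n), i.e., approximately 1.000, 0.707, 0.447, 0.224, and 0.141, even though the parent distribution is strongly skewed.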

Fig. 3.7 Probability density functions of the population height distributions of males and females

clear variables
V=0:.1:300; % height grid (cm)
d_men_c=normpdf(V,182.5,7.1);   % males: mean 182.5 cm, SD 7.1 cm
d_women_c=normpdf(V,168.7,7.1); % females: mean 168.7 cm, SD 7.1 cm
figure('Name','Figure 3.7','NumberTitle','off');hold on
plot(V,d_men_c,'Linewidth',3)
plot(V,d_women_c,'--','color',[216 82 24]/255,'Linewidth',3)
box on
xlabel('Height (cm)')
ylabel('Density')
set(gca,'xlim',[130 230],'FontSize',24)
h=legend('Males','Females');
set(h,'color','none')
set(gca,'color','none','looseInset',[0.01 0.01 0.01 0.01])
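With a common standard deviation of 7.1 cm, the 13.8 cm difference between the two means corresponds to a very large standardized effect. A quick computation of Cohen's d (our own addition, using the usual formula for a standardized mean difference):

% Standardized mean difference for the two height distributions (our addition)
d=(182.5-168.7)/7.1 % Cohen's d, approximately 1.94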

Figs. 3.8 and 3.9 p values for samples drawn from two normal distributions with unequal means and with equal means, respectively

clear variables;rng('default')
reps=10000;nn=[3 6 10];  % participants per group
pu=NaN(reps,length(nn)); % p values when the population means are unequal
pe=NaN(reps,length(nn)); % p values when the population means are equal
for i=1:length(nn)
    n=nn(i);
    disp(n)
    for i2=1:reps
        if rem(i2,1000)==0;disp(i2);end % progress indicator
        height_pollm=randn(n,1)*7.1+182.5;  % sample of n men
        height_pollw=randn(n,1)*7.1+168.7;  % sample of n women
        height_pollw2=randn(n,1)*7.1+182.5; % now assume that men and women have equal height
        [~,pu(i2,i)]=ttest2(height_pollm,height_pollw);
        [~,pe(i2,i)]=ttest2(height_pollm,height_pollw2);
    end
end
figure('Name','Figure 3.8','NumberTitle','off');hold on
plot(sort(pu),'o','Linewidth',2,'Markersize',4);hold on
plot([0 reps],[.05 .05],'k--','Linewidth',2)
h=legend('{\itn_m_e_n} = {\itn_w_o_m_e_n} = 3',...
    '{\itn_m_e_n} = {\itn_w_o_m_e_n} = 6',...
    '{\itn_m_e_n} = {\itn_w_o_m_e_n} = 10','location','northwest');
set(h,'color','none')
xlabel('Test number (sorted on {\itp} value)')
ylabel('{\itp} value')
box on
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'color','none','looseInset',[0.01 0.01 0.01 0.01])

figure('Name','Figure 3.9','NumberTitle','off');hold on
plot(sort(pe(:,3)),'o','Linewidth',2,'Markersize',4) % n = 10 per group
plot([0 reps],[.05 .05],'k--','Linewidth',2)
xlabel('Test number (sorted on {\itp} value)')
ylabel('{\itp} value')
box on
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'color','none','looseInset',[0.01 0.01 0.01 0.01])
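After the script above has run, the fraction of p values below .05 in each column of pu estimates the statistical power at n = 3, 6, and 10 per group, while the same fraction for pe estimates the false-positive rate, which should stay near the nominal .05. A short follow-up (our own addition, assuming the workspace from the script above):

% Run after the script above (our own addition)
power_est=mean(pu<.05) % estimated power for n = 3, 6, 10 per group
alpha_est=mean(pe<.05) % should be close to .05 in all three columns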

Fig. 3.10 p values in original studies and replication studies in the Open Science Collaboration
Project

clear variables
% p values of the original and replication studies (Open Science
% Collaboration 2016); the .xlsx file must be on the MATLAB path
pO=xlsread('RPPdataConverted.xlsx','DH2:DH168'); % original studies
pR=xlsread('RPPdataConverted.xlsx','DT2:DT168'); % replication studies
figure('Name','Figure 3.10','NumberTitle','off');hold on
plot(pO,pR,'kx','Linewidth',2)
plot([0 1],[.05 .05],'m--','Linewidth',2)   % alpha = .05, replications
plot([0.05 0.05],[0 1],'m--','Linewidth',2) % alpha = .05, originals
set(gca,'xlim',[0 0.06],'ylim',[0 1])
box on
xlabel('Original study {\itp} value')
ylabel('Replication study {\itp} value')
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',24)
set(gca,'color','none','looseInset',[0.01 0.01 0.01 0.01])
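This script assumes that RPPdataConverted.xlsx (Open Science Collaboration 2016) is available in the current folder; the file can be downloaded from https://osf.io/ytpuq/. Ranges DH2:DH168 and DT2:DT168 hold the p values of the original and replication studies, respectively, as read into pO and pR above.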

Fig. 3.12 p-hacking. This simulation takes a few minutes to complete

clear variables;rng('default')
reps=10^6;n=25; % one million simulated experiments, n = 25 per group
pp=NaN(reps,1);
V=0:0.005:1;
for i=1:reps
    if rem(i/1000,1)==0;fprintf('Percentage completed = %5.3f',100*i/reps);fprintf('\n');end
    x=randn(n,1);y=randn(n,1); % two groups drawn from the same population
    [~,pp(i)]=ttest2(x,y);p2=ranksum(x,y);
    % p-hacking: if the t test is not significant but the rank-sum test
    % is, report the rank-sum p value instead
    if pp(i)>.05 && p2<.05
        pp(i)=p2;
    end
end
figure('Name','Figure 3.12','NumberTitle','off');hold on
D=histc(pp,V);Dnorm=D./sum(D)/mean(diff(V));
plot(V+mean(diff(V)),Dnorm,'k-o','Linewidth',2)
box on
xlabel('\itp\rm value');ylabel('Density')
set(gca,'xlim',[0 .4])
fig=gcf;set(findall(fig,'-property','FontSize'),'FontSize',20)
set(gca,'color','none','looseInset',[0.01 0.01 0.01 0.01])
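Because both groups are drawn from the same population, every 'significant' result here is a false positive. A short follow-up (our own addition, run after the simulation above) shows that cherry-picking between the two tests inflates the false-positive rate above the nominal 5% level:

% Run after the simulation above (our own addition)
fp_rate=mean(pp<.05) % exceeds the nominal .05 level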
