Sequential Analyses Workshop Groningen


Sequential Analyses
Performing High-Powered Studies Efficiently

[Diagram: the empirical cycle: Better Questions, Better Theories, Better Methods, Better Statistical Inferences, Better Knowledge.]

[Diagram: the same cycle, annotated with what Sequential Analyses contribute: Pre-registration, a Smallest Effect Size of Interest, Type 1 error control, Higher Power, and Lower p-values.]
Overview of the Workshop

• Variability in Small Samples
• Power Analysis: Limitations of Current Approaches
• Sequential Analyses
• A Practical Example
• Adaptive Designs, Internal Pilot Studies, & Conditional Power
• Defining the Smallest Effect Size of Interest (SESOI)
• Some Remarks on NHST
• Pre-registration
• Drinks at Intermate!
Sample Size Planning

How Do You Plan For a Sample Size?

• > 20 per cell (Simmons, Nelson, & Simonsohn, 2011)
• > 50 per cell (Simmons, Nelson, & Simonsohn, 2013)
• As many as possible (Simmons, Nelson, & Simonsohn, 2014)

• The last recommendation is correct from a statistical viewpoint (also, if you manage to collect data from the entire population, you no longer need statistics!).
But there are also ethical issues to take into account.

Continuing data collection after the desired level of confidence is reached, or after it is sufficiently clear the expected effects are not present, is a waste of the time of participants and the money provided by taxpayers.

Also, running fewer participants is more efficient.
Variability in Small Samples

• Let's start by playing around in R to see what the problem is when we perform studies that have low power.

• I want to highlight that the problem is not the p-value, but the variability in the data.
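A minimal sketch of the kind of simulation we will run (the sample size, effect size, and number of replications are arbitrary choices):

# Simulate many low-powered studies (n = 20 per condition, true d = 0.5)
# and look at how much effect sizes and p-values vary between studies.
set.seed(42)
n <- 20   # participants per condition
d <- 0.5  # true population effect size

sim <- replicate(1000, {
  x <- rnorm(n, mean = 0)  # control condition
  y <- rnorm(n, mean = d)  # experimental condition
  obs_d <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)
  c(d = obs_d, p = t.test(y, x, var.equal = TRUE)$p.value)
})

hist(sim["d", ], main = "Observed effect sizes", xlab = "Cohen's d")
mean(sim["p", ] < .05)  # observed power: roughly 1 in 3 studies 'works'
range(sim["d", ])       # estimates vary wildly around the true d = 0.5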
Repeatedly Analyzing Your Data

• One could easily argue that psychological researchers have an ethical obligation to repeatedly analyze accumulating data, and stop when the data are convincing enough.
• But wait: doesn't repeatedly analyzing data increase the chance of Type 1 errors?
• No, not necessarily. It is possible to control the Type 1 error rate by lowering the significance level.
• Before we see how, first a reminder of why high-powered studies are important.
Power Analyses
Limitations of Current Approaches
High-Powered Studies are Important

• The probability of rejecting the null hypothesis when it is false is known as the power of a statistical test (Cohen, 1988).

• The statistical power of a study is determined by the size of the effect, the sample size of the study (and the reliability of the sample result), and the significance criterion (typically α = .05).
High-Powered Studies are Important

• Average power in psychology is estimated to be around 50%.
• A recent paper estimated average power in neuroscience at 21% (Button et al., 2013).
• Power is 35% if you use 21 participants per condition and the effect size is d = 0.5 (the average in psychology is d = 0.43).
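These numbers are easy to check with base R's power.t.test():

# Power with 21 participants per condition and d = 0.5: about 35%
power.t.test(n = 21, delta = 0.5, sig.level = .05)$power  # ~0.35

# Sample size per condition needed for 80% power at the average d = 0.43
power.t.test(delta = 0.43, power = .80)$n  # ~86 per condition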
High-Powered Studies are Important

A-priori power analyses are often recommended when designing studies, to provide at least some indication of the required sample size (e.g., Lakens, 2013).

That's easier said than done: notice how these people (e.g., Lakens, 2013) never tell you HOW to do a power analysis!

You can do a pilot study (but to get an accurate effect size estimate, the sample size needs to be large; see Lakens & Evers, 2014).

You can take a guess based on a related study (but there's no way to know whether you are right or not).

Rock > YOU < Hard Place


But wait! Let’s take the easy way out!

'we are at risk of becoming methodological fetishists' which 'would reverse the means and the ends of doing research and stifles the creativity that is essential to the advancement of science' (Ellemers, 2013)

Methodological rigor and creativity are orthogonal.

Nevertheless, researchers might become more conservative if they have to collect large samples.
Sequential Analyses
How you will become a better researcher
The main idea

• With a symmetrical two-sided test and α = .05, a test should yield a Z-value larger than 1.96 (or smaller than -1.96) for the observed effect to be considered significant (which has a probability smaller than .025 for each tail, assuming the null hypothesis is true).

Data Collection → Statistical test: Z > 1.96
The main idea

• When using sequential analyses with a single planned interim analysis and a final analysis when all data are collected, one test is performed after n (e.g., 80) of the planned N (e.g., 160) observations have been collected, and another test is performed after all N observations are collected.

Data Collection → Statistical test: Z > c1 → Data Collection → Statistical test: Z > c2
We need to select critical boundary Z-values c1 and c2 (for the first and the second analysis) such that (for the upper boundary) the probability (Pr) that the null hypothesis is rejected, either because Zn ≥ c1 in the first analysis, or (when Zn < c1 in the first analysis) because ZN ≥ c2 in the second analysis, equals .025. In formal terms:

Pr{Zn ≥ c1} + Pr{Zn < c1, ZN ≥ c2} = 0.025

So how do we determine the critical values (and their accompanying nominal α levels)? There are different approaches, each with its own rationale.
For example, the Pocock boundary lowers the alpha level for each interim analysis. With 2 looks, α = .0294 for each analysis.

These boundaries require an a-priori specification of the times at which the analyses are performed.

Spending functions have been developed that mirror these boundary functions, but give more flexibility.
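A sketch of how such boundaries can be computed in R with the gsDesign package (the workshop itself uses WinDL and GroupSeq; gsDesign is an alternative that should give comparable numbers):

library(gsDesign)

# Classic Pocock design: 2 equally spaced looks, two-sided alpha = .05
design <- gsDesign(k = 2, test.type = 2, alpha = 0.025, sfu = "Pocock")
design$upper$bound                   # critical Z-values at each look
2 * (1 - pnorm(design$upper$bound))  # nominal alpha: ~.0294 per look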
Applying the Pocock boundary with 2 looks, the nominal α = .0294 for each analysis.

If after the first analysis you find t(79) = 2.30, p = .024: because p < .0294, you terminate the data collection (and take the rest of the day off!).
The Risk of Early Stopping

• Stopping a trial early has clear benefits, such as saving time and money when the available data are considered convincing.
• It also has disadvantages, as noted by Pocock (1992). Effect size estimates from small studies are sometimes still 'sailing the seas of chaos' (Lakens & Evers, 2014), and the large variation in effect size estimates makes results from small studies less convincing, because the study might have stopped at a 'random high'.
• Obviously, this argument holds for any small study, regardless of whether sequential analyses were used.
The Risk of Early Stopping

• It is therefore recommended to use sequential analyses and significance testing to identify which effects in a line of research hold promise. Follow-up studies with larger sample sizes, or meta-analyses, can be used to provide more accurate effect size estimates.

• You can also stop early when there seems to be no effect (sometimes called 'stopping for futility'). In these situations there is a risk of Type 2 errors.
The Benefit of Early Stopping

• Remember that statistical power is a downward concave function of the sample size.
• This means an increase in power from 70 to 90% will often require more participants than an increase from 0 to 70% power.
• Having a 50% or 70% chance to find an effect is not desirable if you can only do 1 analysis. But if there is an effect, being able to stop after 120 instead of 240 participants is nice.
[Figure: power curves as a function of sample size per condition (10 to 100), for effect sizes δ = 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8.]
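Curves like these can be reproduced with base R's power.t.test() (the plotting details below are my own choices):

# Power of a two-sample t-test across sample sizes and effect sizes
n <- seq(10, 100, 5)
deltas <- seq(0.3, 0.8, 0.1)
power <- sapply(deltas, function(d)
  sapply(n, function(ni) power.t.test(n = ni, delta = d, sig.level = .05)$power))

matplot(n, power, type = "l", lty = 1, col = 1:6, ylim = c(0, 1),
        xlab = "Sample size per condition", ylab = "Power")
legend("bottomright", legend = paste("d =", deltas), lty = 1, col = 1:6)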
The Benefit of Early Stopping

                       80% power                               90% power
           Sequential N  Non-Sequential N  Reduction  Sequential N  Non-Sequential N  Reduction
δ = 0.8        39.17            52          24.67%        45.91            68          32.49%
δ = 0.5       100.82           128          21.23%       117.88           172          31.47%
δ = 0.43      135.73           172          21.09%       158.82           230          30.95%
δ = 0.3       278.38           352          20.91%       326.33           470          30.57%
δ = 0.2       625.09           788          20.67%       733.30          1054          30.43%
Questions?
After questions, let’s do a practical
example.
A practical example

• Suppose a researcher is interested in whether a relatively well-established effect in The Netherlands can also be observed in Japan. A meta-analysis of the studies performed in The Netherlands has yielded a meta-analytic effect size estimate of Cohen's d = .5, and she expects the effect to generalize to the Japanese population. Due to practical constraints, the researcher determines she is willing to collect a maximum of 180 observations, which means the study has a power of .92 to observe an effect of dpop = 0.5 in a single statistical test performed after all participants are collected.
A practical example

• The researcher decides to perform two-sided interim analyses after collecting 60, 120, and 180 participants.
• Calculate the nominal alpha levels using a linear spending function (in WinDL indicated by the Power Family function with a Phi of 1, in GroupSeq indicated by the alpha*t^phi function).
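If you prefer R over the WinDL or GroupSeq interfaces, a sketch with the gsDesign package (assuming its sfPower spending function with rho = 1 is the same alpha*t^phi function with phi = 1):

library(gsDesign)

# Three equally spaced looks, two-sided alpha = .05, linear spending
design <- gsDesign(k = 3, test.type = 2, alpha = 0.025,
                   sfu = sfPower, sfupar = 1)
design$upper$bound                   # critical Z-values at looks 1, 2, 3
2 * (1 - pnorm(design$upper$bound))  # nominal alpha levels per look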
A practical example

• Let's assume that after 60 participants in each condition (N = 120), the second interim analysis reveals a Cohen's ds = 0.46, 95% CI [0.09, 0.82], t(118) = 2.49, p = .014. This falls below the nominal significance level of .022, and the data collection is terminated.
• Calculate the p-value adjusted for sequential analyses.
• Assume the SD for both groups is 20. Calculate the mean difference and 95% CI adjusted for sequential analyses.
Sequential Analyses when d is unknown

• If you do extremely novel research, sequential analyses are an even better solution than a-priori power analyses.
• Since there is no way to know the true effect size, let's imagine what might happen if the effect is small (d = 0.3), medium (d = 0.5), or large (d = 0.8) when a researcher performs interim analyses after n = 100, n = 200, and N = 400 participants.
• Acceptable levels of power (> .80) are reached at the first look for large effects, at the second look for medium effects, and at the final look for small effects.
Sequential Analyses when d is unknown

• Under these circumstances, sequential analyses are especially efficient. If there seems to be an effect, you can continue. If not, you save yourself from running participants 101 to 400.
• Remember that early stopping always entails the risk of a Type 2 error (well, actually so does late stopping).
• Never interpret a p-value > .05 as support for the null hypothesis. (Remember, p-values give P(D|H0), not P(H0|D).)
• I like to think of the absence of a statistical difference in null-hypothesis significance testing as mu (無), a concept that plays an important role in Zen Buddhism, which means neither yes nor no.
Adaptive Designs, Internal Pilot Studies, & Conditional Power

But wait, there's even more! If you order now, you get increased flexibility in your sample size planning!
Adaptive Designs

• According to the European Medicines Agency (2006): "a study design is 'adaptive' if statistical methodology allows the modification of a design element (e.g., sample-size, randomization ratio, number of treatment arms) at an interim analysis with full control of Type I error rate."

• With interim analyses, it becomes possible to use data collected early in a study as an internal pilot study (Wittes & Brittain, 1990), and use the effect size estimate from an interim analysis to determine the sample size for the full study.
In medical sciences, such adaptive designs are controversial (see Chow & Chang, 2008), because they can lead to a statistically significant result without any practical significance, and thus a new treatment can claim to be 'better' without having a practical benefit.

However, in psychology, that seems to be the least of our problems.
Adaptive Designs

• If researchers use an adaptive design, interim analyses allow researchers to calculate conditional power: the conditional probability of a statistically significant benefit at the end of a study, given the data collected so far.

• Keep in mind that effect sizes calculated from small samples have wide confidence intervals, and conditional power analyses based on small samples should not be given too much weight.
Adaptive Designs

• A useful recommendation is to use sequential analyses based on a reasonable effect size estimate, but include the possibility of an adaptive design if the effect size is lower, but still of interest to the researcher.
Defining the Smallest Effect Size of Interest (SESOI)
Because Size Always Matters
SESOI

• What's your SESOI (Smallest Effect Size Of Interest)?

SESOI

• In applied research, the SESOI can often be determined based on practical limitations, via a cost-benefit analysis. For example, if an intervention costs more money than it saves, the effect size is too small to be of practical significance.
SESOI

• In theoretical research, the SESOI might be determined by a theoretical model that is detailed enough to make falsifiable predictions about the hypothesized size of effects. Such theoretical models are rare, and therefore researchers often state they are interested in any effect size that is reliably different from zero.

SESOI

• Even so, because you can only reliably examine an effect that your study is adequately powered to observe, researchers are always limited by the number of participants that are willing to participate in their experiment, or the number of observations they have the resources to collect.
SESOI

• What is the maximum number of participants you are willing to run in a lab study?
SESOI

• I would personally, and in general, consider a sample size of 500 participants for a lab experiment consisting of individual sessions too much effort to be worthwhile. As a consequence, assuming a lower limit on statistical power of .80 (or .90), the SESOI would be Cohen's ds = .25 (or Cohen's ds = .29) for a between-subjects t-test with an alpha of .05. The SESOI for a paired-samples t-test would be Cohen's dz = .13 (or Cohen's dz = .15).
• Moving beyond the rather hollow statement that researchers are in principle interested in an effect of any size that is reliably different from zero is useful, because it allows a researcher to stop the data collection for futility whenever an effect size is smaller than the SESOI.
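These SESOI values can be checked with power.t.test() by solving for the smallest reliably detectable effect given the maximum sample size:

# Between subjects: 500 participants = 250 per group
power.t.test(n = 250, power = .80, sig.level = .05)$delta  # ~0.25
power.t.test(n = 250, power = .90, sig.level = .05)$delta  # ~0.29

# Within subjects: 500 participants = 500 pairs
power.t.test(n = 500, power = .80, type = "paired")$delta  # ~0.13
power.t.test(n = 500, power = .90, type = "paired")$delta  # ~0.15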
SESOI

• Lan and Trost (1997) suggest researchers stop for futility when conditional power drops below a lower limit, continue with the study when conditional power is above an upper limit, and, whenever it lies between the lower and upper limit, extend the study (i.e., increase the planned sample size) to achieve conditional power that equals the upper limit.
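Such a rule could look like the following R sketch, which computes conditional power under the 'current trend' assumption (the interim estimate is taken as the true effect). The function and the .20/.80 limits are illustrative assumptions, not values prescribed by Lan and Trost:

conditional_power <- function(z_interim, n, N, z_crit = 1.96) {
  t <- n / N                    # information fraction at the interim look
  drift <- z_interim / sqrt(t)  # drift implied by the interim Z-value
  1 - pnorm((z_crit - z_interim * sqrt(t) - drift * (1 - t)) / sqrt(1 - t))
}

cp <- conditional_power(z_interim = 1.2, n = 80, N = 160)  # ~0.36
if (cp < .20) "stop for futility" else
  if (cp > .80) "continue as planned" else "extend the planned sample size"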
Some Remarks on NHST
Why lowering α levels isn't a cost, but a necessity.
NHST

• Although Type 1 errors are the alpha of statistical inferences, they are not its alpha and omega. The goal of research is not to control Type 1 error rates, but to discover what is likely to be true.

• Inferences from data are only one aspect of the empirical cycle, and good theory and methodology are equally essential for scientific progress.
NHST

• One might feel that Type 1 error control is not necessary, and that everything is allowed, as long as people clearly report what they have done. This idea assumes that reviewers can decide whether the results of a study provide convincing support for a hypothesis or not. However, since people are notoriously bad at thinking about probabilities, and specifically bad at thinking about conditional probabilities such as p-values, letting reviewers decide whether a chosen procedure leads to a result that is convincing might be asking for trouble.
NHST

• You might think that lowering your nominal significance level as a function of the number of looks at the data is a bad thing.
• However, the significance level you choose should be a decreasing function of the sample size, whether you use sequential analyses or not.
• Decreasing the significance level as the sample size increases needs to be done anyway (even though no one does it), and with sequential analyses you get a number of interim analyses for free.
But wait: if no one reduces their significance level as a function of sample size, why should I?

Because we are moving forward as a discipline, and are becoming better at statistics. And now that you know this, there's no way to unlearn it.
NHST

• Let's focus on the logic behind the need to reduce the significance level as a function of the sample size.
• This is not really essential for a workshop on sequential analyses, but because I just learned this myself, I'd like to discuss it anyway. :)
The distribution of p-values under the null hypothesis

[Figure: the distribution of p-values when the null hypothesis is true (uniform).]

The distribution of p-values when the alternative hypothesis is true depends on the power of a study. The power of a study depends (among other things) on the sample size. So (all other things being equal) the p-values you should observe are a function of the sample size. See where we're going?

The distribution of p-values under the alternative hypothesis

[Figure: the distribution of p-values when the alternative hypothesis is true (right-skewed; more strongly so with higher power).]
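Both distributions are easy to simulate in R (the sample and effect sizes are arbitrary choices):

# p-values from two-sample t-tests when H0 is true (d = 0) and when
# the alternative is true (d = 0.5), with n = 50 per condition
set.seed(1)
sim_p <- function(d, n, nsim = 10000) {
  replicate(nsim, t.test(rnorm(n, mean = d), rnorm(n))$p.value)
}

par(mfrow = c(1, 2))
hist(sim_p(d = 0, n = 50), main = "H0 true", xlab = "p")    # uniform
hist(sim_p(d = 0.5, n = 50), main = "H1 true", xlab = "p")  # right-skewed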
Where Bayes factors get it right,
and p-values get it wrong
Lowering the significance level for large samples

• As Good (1992, p. 600) remarks:

'Thus we have empirical evidence that sensible P values are related to weights of evidence and, therefore, that P values are not entirely without merit. The real objection to P values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.'
Why? What's wrong with a p-value?

P-values give you Pr(Data|H0): the probability of the data, given that H0 is true. People are inclined to interpret them as an indication of Pr(H0|Data), but p-values overstate the evidence against H0 if interpreted like that.
Lower Bounds on the Bayes factor:

B(p) = -e · p · log(p)
Lowering the significance level for large samples

• Based on the observation by Jeffreys (1939) that, under specific circumstances, the Bayes factor against the null hypothesis is approximately inversely proportional to √N, Good (1982) suggests a standardized p-value to bring p-values in closer relationship with weights of evidence:

pstan = p · √(N / 100)
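Both formulas are one-liners in R:

# Good's standardized p-value: the same p counts for less as N grows
p_stan <- function(p, N) p * sqrt(N / 100)
p_stan(p = .04, N = 400)  # 0.08: no longer 'significant' at this N

# Lower bound on the Bayes factor for H0 given a p-value (for p < 1/e)
bf_bound <- function(p) -exp(1) * p * log(p)
bf_bound(.05)  # ~0.41: at best about 2.5-to-1 evidence against H0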
Lowering the significance level for large samples

• There are different formulas, but Good's is Best (or easiest).

• Because the Bayes factor is not influenced by repeatedly looking at your data, setting a lower significance level for sequential analyses brings p-values in closer relationship with weights of evidence for free.

Fun fact: Bayesians can analyze their data after every participant without any cost.
Pre-registration

If you use sequential analyses, it's important to determine the number of looks and the rule for terminating data collection beforehand.

Pre-registration is therefore essential. It has the added benefit of allowing you to remember which hypotheses you came up with a-priori, and which are post-hoc ideas (p-values are only useful for a-priori hypotheses).
Pre-Registration

• A meta-analysis across 4 earlier studies revealed the following: Testing the predicted decrease in the size of the congruency effect between the key assignment that afforded a spatial grouping (AS KL) and the condition that did not afford creating two spatial subgroups (FGHJ), a one-sided t-test yielded a small but significant reduction of the congruency effect, Hedges's gs = 0.25, t(236) = 1.84, p = .034, 95% CI [-56, 2]. We plan to follow a sequential analysis procedure (based on recommendations by Lakens, 2014):
Pre-Registration

• Based on an expected effect size of Hedges's g = 0.25, a power analysis indicated that for a two-sided t-test with an alpha of .05, a desired statistical power of .9, and four looks using a linear spending function, a total of 758 participants is needed. If the expected difference is significant at an interim analysis (planned at equally spaced distances, but due to practical issues concerning the data collection, the analyses will be performed at the closest possible moment in time), the data collection will be terminated. The data collection will also be terminated when the observed effect size is smaller than the SESOI, which is set at d = 0.21 based on the researcher's willingness to collect at most 800 participants for this study, and the fact that with four interim analyses 800 participants provide .8 power to detect an effect of d = 0.21.
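A sketch of how this sample size might be reproduced with the gsDesign R package (assuming its sfPower spending function with rho = 1 matches the linear spending function; exact numbers may differ slightly between programs):

library(gsDesign)

# Fixed-design total N for d = 0.25, two-sided alpha = .05, power = .9
n_fix <- 2 * power.t.test(delta = 0.25, sig.level = .05, power = .9)$n

# Inflate for four looks with a linear (alpha * t) spending function
design <- gsDesign(k = 4, test.type = 2, alpha = 0.025, beta = 0.1,
                   sfu = sfPower, sfupar = 1, n.fix = n_fix)
ceiling(design$n.I)  # cumulative total N required at each look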
Pre-Registration

• If an interim analysis reveals an effect size larger than 0.25 (while p > the nominal alpha specified below), the data collection will be continued. If the effect size lies between the SESOI (d = 0.21) and the expected effect size (d = 0.25), the planned sample size will be increased based on a conditional power analysis to achieve a power of .9 (or to a maximum of 800 participants in total). The interim analyses and nominal alpha levels are specified in the following table:
Details when Spending = (Alpha)(Time), N1 = 379, N2 = 379, S1 = 1.00, S2 = 1.00, Diff = 0.25

Look  Time  Lower Bndry  Upper Bndry  Nominal Alpha  Inc Alpha  Total Alpha  Inc Power  Total Power
1     0.25  -2.49771     2.49771      0.012500       0.012500   0.012500     0.218601   0.218601
2     0.50  -2.40712     2.40712      0.016079       0.012500   0.025000     0.314400   0.533001
3     0.75  -2.32079     2.32079      0.020298       0.012500   0.037500     0.236549   0.769550
4     1.00  -2.24475     2.24475      0.024784       0.012500   0.050000     0.131135   0.900685
[Diagram: the empirical cycle again, annotated with what Sequential Analyses contribute: Pre-registration, a Smallest Effect Size of Interest, Type 1 error control, Higher Power, and Lower p-values.]
Questions?

Thanks for Your Attention!
And good luck being a better and more efficient researcher.
