Inferential statistics in Language Teaching Research: A review and ways forward

Language Teaching Research, 2016, Vol. 20(6) 741–768
© The Author(s) 2016
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1362168816649979
ltr.sagepub.com
Seth Lindstromberg
Hilderstone College, Broadstairs, UK
Abstract
This article reviews all (quasi)experimental studies appearing in the first 19 volumes (1997–
2015) of Language Teaching Research (LTR). Specifically, it provides an overview of how statistical
analyses were conducted in these studies and of how the analyses were reported. The overall
conclusion is that there has been a tight adherence to traditional methods and practices, some
of which are suboptimal. Accordingly, a number of improvements are recommended. Topics
covered include the implications of small average sample sizes, the unsuitability of p values as
indicators of replicability, statistical power and implications of low power, the non-robustness
of the most commonly used significance tests, the benefits of reporting standardized effect sizes
such as Cohen’s d, options regarding control of the familywise Type I error rate, analytic options
in pretest–posttest designs, ‘meta-analytic thinking’ and its benefits, and the mistaken use of a
significance test to show that treatment groups are equivalent at pretest. An online companion
article elaborates on some of these topics plus a few additional ones and offers guidelines,
recommendations, and additional background discussion for researchers intending to submit to
LTR an article reporting a (quasi)experimental study.
Keywords
Effect sizes, L2 quantitative research, pretest–posttest designs, (quasi)experimental studies,
robust methods, small sample sizes, statistical analysis, statistical power, testing for baseline
balance
I Introduction
This article reports a survey of all issues of Language Teaching Research (LTR) from the
first issue in 1997 through the latest issue (at the time of writing) of 2015, including
special issues. The main aims are (1) to outline how LTR authors have subjected data
Corresponding author:
Seth Lindstromberg, Hilderstone College, 18 Saint Peters Road, Broadstairs, Kent CT10 2JW, UK
Email: lindstromberg@gmail.com
Figure 1. Loess and robust regression lines (dotted and solid, respectively), showing the
increasing representation of intervention studies in LTR, 1997–2015.2
Of the 96 studies surveyed, four reported interventions that researchers had found rather
than designed. An example is a study comparing learners who had participated
in a study-abroad program with similar learners who had not done so. Figure 1
shows a clear rising trend over the period of the study in the number of articles per five-
article batch (i.e. about one LTR issue) that report an intervention study.
In this article, I follow mainstream usage in applying the term ‘experiment’ – or,
more exactly, ‘experimental study’ – only to a quantitative study featuring both a con-
trol or a comparison condition and random assignment of available participants to con-
ditions. (The survey came across only one study that also included prior random
selection of participants from a large pool.) Experiment-like studies not featuring ran-
dom assignment to groups are hereafter referred to as ‘quasi-experiments’. Importantly,
all but one of the surveyed studies (the exception was in the context of a ‘found’ inter-
vention) featured some kind of control or comparison group/condition. (For further
relevant discussion with respect to second language (L2) research, see Hudson and
Llosa, 2015; Mackey and Gass, 2005.) Of the 96 studies surveyed only 20 (21%) were
conducted in a laboratory. Sixteen of the 96 studies had a purely within-participants design.
Comparing notes from years of reviewing quantitative research articles for applied linguistics
journals, common challenges that we had encountered quickly became apparent, including:
inadequate experimental design and control coupled with claims of causality, effectiveness, and
impact; impoverished sample sizes, as a rule, in combination with increasing use of multivariate
analyses; inattention to assumptions about numeric data in the selection of fitting inferential
techniques; dramatic and persistent over- and misinterpretation of p values and statistical
significance testing, coupled with inattention to effect sizes of all sorts; incomplete reporting of
descriptive and inferential statistics; a willingness to generalize on the basis of one or a few
cases; and so on. (Norris, Ross, & Schoonen, 2015, pp. 1–2)
Plonsky and Gass (2011) reviewed 174 observational and (quasi)experimental studies of
the effects of learner interaction on L2 development. Topics of their review included
types of study design, statistical procedures, the character and thoroughness with which
statistics were reported, the relationship between the magnitude of observed effects on
the one hand and research designs and reporting practices on the other, and trends with
respect to study designs, reporting practices, and magnitude of typically observed effects
(e.g. the typical size of the difference between the mean score of learners in two groups).
These reviewers found considerable room for improvement in reporting practices and
recommended, in particular, that researchers should report all means and associated
standard deviations (SDs), exact p values, and all associated test statistics (e.g. t and F),
and that they should report estimates of effect size (ES) and give confidence intervals for
these estimates (see also American Psychological Association, 2010). Plonsky (2013) asked:
To what extent has L2 research included various design features and statistical procedures, both
(a) generally or descriptively and (b) with respect to those associated with methodological and
experimental rigor? What types of data have been reported in L2 research, both (a) generally or
descriptively and (b) with respect to those associated with transparency?
Regarding matters especially relevant to the survey reported here, Plonsky (2013) found
that it was, unfortunately, common for test statistics and p values not to be reported for
tests when p > α, and that about a third of studies failed to give a SD for each reported
mean. In only 17% of surveyed articles had researchers mentioned whether they had
checked to see if the assumptions (or validity conditions) of their statistical procedures
were in place, and in only 1% had an appropriate power analysis been reported. His
estimate of the median number of significance tests per study was 18. Finally, Plonsky
remarked that, like earlier reviewers (e.g. Gass, 2009), he had found that the great
majority of L2 studies had used means-based statistical procedures, especially anova
and t-tests.
Various other scholars have also pointed out deficiencies and missed opportunities in
the statistical practices of L2 researchers and called for improvements (e.g. Chaudron,
2001; Larson-Hall & Herrington, 2009; Nassaji, 2012; Norris, 2015; Norris & Ortega,
2000, 2006). This review joins the chorus. However, one difference between this review
and most previous ones is that this one goes into somewhat greater technical detail with
respect to improvements that L2 researchers ought to adopt and pitfalls they ought to avoid.
III Findings of the present survey, and key basic terms and
concepts
1 Significance tests used
As Table 1 shows, LTR authors have relied heavily on a rather narrow range of inferential
procedures, which have been in widespread use for several generations. Prominent among
these procedures are two types of means-based parametric tests of statistical significance:
first, t-tests, including post-hoc pairwise t-tests and t-tests of the significance of correla-
tions and, second, F-tests conducted in the context of anova, ancova, or manova. Somewhat
less prominent are two commonly used rank-based nonparametric tests, namely, the
Wilcoxon–Mann–Whitney (WMW) and the Wilcoxon Signed Ranks tests. Modern robust
procedures such as randomization tests and bootstraps were not used at all. Such proce-
dures will be discussed further below; but it should be said here that a robust statistical
procedure is one that continues to yield accurate estimates of parameters such as means
and mean differences and to maintain the desired Type I error rate (e.g. .05) when a para-
metric or a well-known nonparametric test such as the WMW test fails to do so because
its assumptions have been violated (Wilcox, 2012a, 2012b); regarding the nonrobustness
of the WMW test, see Lindstromberg, 2016: Section IV.3.
Table 1. The inferential statistical procedures used in the 96 experiments and quasi-
experiments covered by the survey (1997–2015) along with noteworthy absences.
Notes. a Only three articles report use of multiple regression; however, modern software is likely automatically to take a regression approach to anova and ancova. b Also known as mixed-level, mixed-effects, or hierarchical linear modeling.
2 Sample size
The survey found that the typical sample (e.g. the number of learners per group) was
small, particularly in the case of studies with a between-participants or mixed design
(Table 2), where the typical sample size was very similar to the mean sample sizes that
Plonsky and Gass (2011) reported in their multi-journal survey of L2 interaction studies
(i.e. 23.5 for treatment groups and 19.3 for comparison groups) and similar also to the
overall median sample size of 19 reported by Plonsky (2013). (Trends are noted near the
end of this article.)
With respect to common procedures such as t-tests and anova, the vague traditional
consensus of applied statisticians has been that a sample is small if it consists of fewer
than 30 data points. Certainly, a sample of that size or smaller is likely to include a higher
proportion of atypical values than a much larger sample from the same population (e.g.
Wilcox, 2012b). A facet of this characteristic is that a small sample tends to have a
greater standard deviation (SD) than does a larger sample that is taken from the same
population. In short, for any parameter of interest, such as a population mean, a small
sample is relatively likely to yield an estimate that is not only imprecise and so relatively
uninformative (with the imprecision being reflected in a wide confidence interval) but
also inaccurate (see Lindstromberg, 2016: Section III.2). Thus, a finding stemming from
any given innovative small scale study − even if it was well conceived, well designed,
and well conducted − is unlikely to support a firm conclusion about anything that was not
Table 2 (notes). For each study, a mean sample size was calculated. The statistics in this table are based on those means; so 'mean = 35.2' gives the mean of the 15 mean sample sizes of the purely within-participants studies. No thorough attempt was made to weight each sample size by the number of significance tests in which it played a role. a One early study is not included here because learners were treated as individuals in the statistical analysis even though they had worked in pairs. b IQR = inter-quartile range.
already evident to the larger community of researchers.3 That said, firm bases for inter-
esting conclusions can be constructed in small scale research via multiple replications,
particularly if their findings are subjected to statistical meta-analysis, as explained
in detail by, for example, Asendorpf, Conner, de Fruyt, et al., 2013; Borenstein, Hedges,
Higgins, and Rothstein, 2009; Cumming, 2012a, 2014. We return to this matter further
below.
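The imprecision of small samples discussed above is easy to demonstrate by simulation. The following Python sketch (illustrative only; the population parameters and numpy calls are my own, not the article's) compares how much sample means wander when groups of 19 learners, roughly the median size reported by Plonsky (2013), are drawn, versus groups of 100:

```python
import numpy as np

rng = np.random.default_rng(1)
pop_mean, pop_sd = 50.0, 10.0  # hypothetical test-score population

def sd_of_sample_means(n, reps=10_000):
    """Draw `reps` samples of size n; return the SD of their means,
    i.e. an empirical estimate of the standard error."""
    means = rng.normal(pop_mean, pop_sd, size=(reps, n)).mean(axis=1)
    return means.std(ddof=1)

small = sd_of_sample_means(19)   # roughly pop_sd / sqrt(19), i.e. about 2.3
large = sd_of_sample_means(100)  # roughly pop_sd / sqrt(100), i.e. about 1.0
```

A mean from a group of 19 is thus more than twice as variable across replications as one from a group of 100, which is exactly the imprecision that a wide confidence interval makes visible.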
Finally, with respect to the issue of sample size we have so far explicitly considered
only scores deriving from learners. But in 38 (40%) of the 96 studies surveyed, test
scores were matched not with learners but with items such as words or formulaic
sequences. The typical sample size here was especially small: median = 15 (range,
2–170). For relevant discussion, see Lindstromberg, 2016: Section IV.10.
3 Replication and p
A replication study (or ‘replication’) is exact if it addresses precisely the same research
questions that were addressed by an earlier ‘original study’ through using the same pro-
cedure and materials but with different participants (see Cumming, 2008). A fairly exact
replication might follow the original procedure but use new materials (e.g. new vocabu-
lary items) as well as new learners. A so-called ‘partial’ replication displays a clear fam-
ily resemblance to an original study but shows one or more clear differences over and
above ones already mentioned (Cumming, 2008); in particular, the research questions
are likely to be somewhat different. There is no fixed rule about how, or how often, an
original study should be replicated. On the other hand, there is no serious theory-based
denial that replication is a crucial element of scientific activity.
It has been remarked that L2 research has shown a weak commitment to replication
(Polio, 2012; Porte, 2012). This observation is corroborated by the finding that only
seven (7%) of the 96 studies covered by the survey were explicitly characterized by the
researchers who conducted them as having been planned as (non-exact) replications. (Of
three further studies it is said only in concluding discussion that these studies ‘replicated’
a finding reported earlier in the literature.) Among the possible reasons for the dearth of
replication studies in L2 research is one that directly relates to the practice of statistical
analysis: Researchers may be too inclined to think that a research question has been con-
clusively answered if the question was once tested experimentally and a significant
p value was found (J. Cohen, 1994; Cumming, 2012b; Cumming, Williams, & Fidler,
2004; Lai, Fidler, & Cumming, 2009; Nickerson, 2000, pp. 256–257; Tversky &
Kahneman, 1971). In fact, it is very seldom warranted to draw such a conclusion from a
single result, especially when samples are small, since sheer random variation is all too
likely to yield p ⩽ α when a null hypothesis is in fact completely true (Cumming, 2008,
2012a; Schönbrodt, 2015).
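The point that random variation alone produces 'significant' p values under a completely true null hypothesis can be checked directly. In this Python sketch (my own illustration, assuming scipy is available), two groups are repeatedly drawn from the same population, so any p ⩽ .05 is a false alarm:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, reps, n = .05, 4000, 20

false_alarms = 0
for _ in range(reps):
    # Both groups come from the SAME population: H0 is exactly true.
    a = rng.normal(50, 10, n)
    b = rng.normal(50, 10, n)
    if stats.ttest_ind(a, b).pvalue <= alpha:
        false_alarms += 1

rate = false_alarms / reps  # hovers around alpha, i.e. roughly .05
```

One study in twenty reaches significance here despite there being no effect at all, which is why a single significant result cannot settle a research question.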
As it happens, the measures of ES most commonly noted in the survey − r, (partial) eta2,
and d − are means-based and therefore not robust, which is to say that they are liable to
be inaccurate when they relate to populations of data that do not conform to the assump-
tions of a parametric method (e.g. a more or less normal distribution; see Section IV.5
below). A robust estimator of POV discussed by Wilcox (2012b, pp. 377–384) may have
advantages, but it is not easy to calculate by hand. The POS, on the other hand, is not
only fairly robust and easy to calculate, but it may also be easier for many researchers to
interpret than, say, r, r2, eta2, and maybe even d. For more on the POS see, perhaps in this
order, Wuensch (2015), McGraw and Wong (1992), Vargha and Delaney (2000), and
Grissom and Kim (2012).
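The ease of calculating the POS can be shown concretely. The sketch below (my own illustration; the scores are hypothetical, not from any surveyed study) estimates the probability that a randomly chosen treatment-group learner outscores a randomly chosen control-group learner, counting ties as half:

```python
import numpy as np

def prob_of_superiority(x, y):
    """Estimate POS = P(X > Y) + 0.5 * P(X = Y) over all n_x * n_y pairs."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diffs = x[:, None] - y[None, :]          # all pairwise differences
    wins = (diffs > 0).sum() + 0.5 * (diffs == 0).sum()
    return wins / diffs.size

# Hypothetical scores:
treatment = [14, 15, 15, 17, 18, 20]
control   = [11, 12, 15, 15, 16, 17]
pos = prob_of_superiority(treatment, control)  # roughly 0.71 for these scores
```

A POS of about .71 reads directly as 'a treatment-group learner beats a control-group learner about 71% of the time', which is arguably more transparent than an r or eta2 of equivalent size.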
Although the survey found no instance of the POS being reported, all three of the other
types of ES measure have been reported, with at least one ES appearing in slightly over half
of the reports. However, common lapses have been failure to supply ESs consistently and
failure to verbally interpret the ones that are given. When verbal interpretations are
given, they are often brief and highly generic in character (e.g. ‘This is a large effect’).
Interestingly, occurrences of r and rSP have almost never been explicitly referred to as
measures of ES. Extreme idiosyncrasies, though, have been very rare. For instance, I
found only one instance of each of the following: using eta2 as the estimator of ES in
connection with a paired t-test, calling p a measure of ES, and reporting an ES in order to
dispel the notion that a study’s results ‘were distorted by the small sample sizes’, which
is not something an ES can give information about.
To conclude this section, I would like to say a little more about meta-analysis, particu-
larly CCMA. The latter, as already hinted, is an application of meta-analysis that pro-
ceeds as follows. Previous and new estimates of a given effect are quantitatively
synthesized to yield a best overall interim estimate. As new estimates come in, these are
added into the meta-analysis to yield a series of increasingly precise interim estimates
(Braver et al., 2014). The word ‘interim’, which is key here, relates to the fact that a
CCMA project can begin with as few as two studies. CCMA is therefore already highly
feasible in L2 research (see, e.g., Ellis & Sagarra, 2011; Lindstromberg & Eyckmans,
2014), notwithstanding the fact that in our field the reporting of ESs is a fairly new thing.
Indeed, what CCMA means is that any L2 researcher can also be a meta-analyst, on a
small scale at least. However, even the small-scale meta-analyst must heed certain cave-
ats. For example, a meta-analysis based only on published studies may well yield an
overestimate of the true effect size. Because small, or even negative, effects are most
likely to have been observed in unpublished studies, the meta-analyst should strive to
find any of these that exist for the case at hand (e.g. Borenstein et al., 2009; Larson-Hall
& Plonsky, 2015; Plonsky & Oswald, 2010). Of course, the meta-analyst must also try to
ensure that methodologically poor studies are filtered out (for an excellent beginner’s
guide, see Field, 1999).
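The core synthesis step of a small-scale (cumulative) meta-analysis can be sketched in a few lines. The code below (my own illustration; the two studies and their d values are hypothetical) combines estimates of Cohen's d by inverse-variance weighting under a fixed-effect model, using the approximate variance formula given by Borenstein et al. (2009):

```python
import math

def var_d(d, n1, n2):
    """Approximate sampling variance of Cohen's d (Borenstein et al., 2009)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def fixed_effect(estimates):
    """Inverse-variance weighted mean of (d, n1, n2) tuples, with a 95% CI."""
    w = [1 / var_d(d, n1, n2) for d, n1, n2 in estimates]
    d_bar = sum(wi * d for wi, (d, _, _) in zip(w, estimates)) / sum(w)
    se = math.sqrt(1 / sum(w))
    return d_bar, (d_bar - 1.96 * se, d_bar + 1.96 * se)

# Two hypothetical small studies of the same effect:
studies = [(0.60, 20, 20), (0.35, 25, 24)]
d_bar, ci = fixed_effect(studies)
```

As each new replication reports its d, it is simply appended to the list and the interim estimate is recomputed, which is the whole of the CCMA procedure in miniature.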
5 Confidence intervals*
Although applied statisticians have been recommending for years that educational and
behavioral researchers furnish observed ESs with confidence intervals (Wilkinson &
Task Force on Statistical Inference, 1999), this is a practice that L2 researchers have been
very slow to adopt (Norris et al., 2015; Plonsky & Gass, 2011; Plonsky, 2013). Indeed,
the survey found only two articles whose results sections give confidence intervals (CIs)
for estimates of population effect sizes (e.g. a difference in mean gains). One likely rea-
son for this rather extreme conservatism is poor awareness of how informative CIs can
be. For relevant discussion and references, see Lindstromberg, 2016.
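One small Python illustration of how informative a CI can be (the gain scores below are hypothetical and the pooled-SD formula is the standard Student one, not taken from any surveyed study):

```python
import math
from statistics import mean, stdev

def mean_diff_ci(x, y, t_crit):
    """95% CI for an independent-groups difference in means, pooled-SD formula.
    t_crit is the .975 quantile of t with len(x) + len(y) - 2 df."""
    n1, n2 = len(x), len(y)
    md = mean(x) - mean(y)
    sp = math.sqrt(((n1 - 1) * stdev(x)**2 + (n2 - 1) * stdev(y)**2)
                   / (n1 + n2 - 2))
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    return md - t_crit * se, md + t_crit * se

# Hypothetical gain scores; t_crit = 2.101 for df = 18
eg = [4, 6, 7, 7, 9, 10, 11, 12, 13, 15]
cg = [2, 3, 5, 5, 6, 7, 8, 9, 10, 11]
lo, hi = mean_diff_ci(eg, cg, t_crit=2.101)
```

For these data the interval runs from about −0.2 to about 5.8: the observed difference of 2.8 points is compatible with anything from no effect to a very large one, which is precisely the uncertainty that a bare p value conceals.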
over-control the FWER and so reduce statistical power (Romano et al., 2010). Because
statistical power tends already to be too low in many fields, including ours (Plonsky,
2013; Plonsky & Gass, 2011), some authorities have recommended against any correc-
tion of p and against any corresponding correction of CIs as well. They favor, instead,
letting readers form their own judgments about the credibility of unadjusted p values
(e.g. O’Keefe, 2003; for additional references, see Keselman, Cribbie, & Holland, 2004).
However, readers can only form such judgments when they have been informed about all
the tests that were conducted; and, as mentioned, some of the surveyed reports are vague
in this respect.
About 56% of the reports of studies involving anova say explicitly whether and how
the FWER was controlled across the post-hoc tests. The four methods of control most
often mentioned (a few reports mention two) are: Tukey’s HSD (12), standard Bonferroni
(8), Scheffé (4), and Fisher’s LSD (4). A small number of researchers simply reduced α
across the board to .025 or .01, which smacks of guesswork. The relatively powerful
methods of Hochberg and Rom (see Wilcox, 2012b) were not used, and control of the
False Discovery Rate is mentioned nowhere although it may be advisable when the
number of tests is very large (Benjamini & Hochberg, 2000; Field, Miles, & Field,
2012). As a final point of possible interest, virtually all estimates of correlation were
tested for significance although it did not always make sense for such tests to be carried
out. In fact, r is often amply serviceable as a purely descriptive statistic (Baguley, 2010;
Howell, 2010).
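The power differences among the correction methods just discussed are easy to see in code. This sketch (my own, self-contained illustration with hypothetical p values) implements Holm's step-down procedure, which controls the FWER more powerfully than standard Bonferroni, and Benjamini–Hochberg, which controls the False Discovery Rate instead:

```python
def holm_reject(pvals, alpha=.05):
    """Holm step-down: same FWER control as Bonferroni, more power."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p values fail too
    return reject

def bh_reject(pvals, q=.05):
    """Benjamini-Hochberg: controls the False Discovery Rate, not the FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank  # largest rank whose p value clears its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject

ps = [.001, .012, .020, .041, .300]  # hypothetical p values from 5 post-hoc tests
```

On these five p values, standard Bonferroni (α/5 = .01) rejects one null hypothesis, Holm rejects two, and Benjamini–Hochberg rejects three, which illustrates the ordering of power mentioned above.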
The absence of any report of assumption checking does not necessarily mean that assumptions were not checked. It
would be better, though, for authors to give details.
A well-known problem with using a significance test (e.g. the Shapiro–Wilk normal-
ity test) to test an assumption of another significance test (such as a t-test) is that in any
given research situation it cannot be known for sure whether the assumption test has
enough power to detect a violation or whether instead it has so much power that it is
liable to find p ⩽ α when a violation is too small to matter (e.g. Baguley, 2012;
Field et al., 2012; Wilcox, 2012a). A problem that is much less well known is that Type I
error is increased by making use of a nonparametric or robust test conditional on a test of
normality finding p ⩽ α (for references, see Baguley, 2012, p. 324). A way of avoiding
both problems is to use robust procedures since in this case assumptions tests are irrel-
evant (see Erceg-Hurn & Mirosevich, 2008; Wilcox, 2012a, 2012b). (Note: No one rec-
ommends that researchers forego other means of judging the character of their data, such
as creating and examining histograms, density plots, and QQ plots.) Recently, though,
Keselman, Othman, and Wilcox (2013, 2014) have found that the Cramér–von-Mises
and especially the Anderson–Darling normality tests are sufficiently reliable for samples
as small as 20 or perhaps less (although a sample of 8, for instance, would be too small),
provided that one sets α at .15 or even .20. (They also propose a method for controlling
the FWER.5) However, there remain strong reasons to avoid other tests of normality
(Keselman et al., 2013, 2014) as well as commonly used tests of the equality of variances
(Zimmerman, 2004a).
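In practice, applying the Anderson–Darling test at the liberal α = .15 recommended by Keselman et al. amounts to reading scipy's output as follows (my own sketch, assuming scipy; the sample of 20 scores is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
scores = rng.normal(60, 8, size=20)  # hypothetical sample of 20 test scores

res = stats.anderson(scores, dist='norm')
# scipy reports critical values at the 15%, 10%, 5%, 2.5%, and 1% levels;
# following Keselman et al. (2013), compare against the liberal 15% level.
crit_15 = res.critical_values[list(res.significance_level).index(15.0)]
normality_rejected = res.statistic > crit_15
```

Even so, the graphical checks mentioned above (histograms, density plots, QQ plots) remain worth doing alongside any formal test.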
For these data the standard (i.e. Student's) IS t-test gives: t(34) = 2.207, p = .034.
But suppose that on checking the score sheets we see that the final EGr score is 28
instead of 20. Although this looks like even better evidence that the EGr outperformed
the CGr, on rerunning the t-test we find: t(34) = 1.992, p = .054; and, unlike the earlier CI
(not given here), the new one, CI95% [–0.04, 3.81], includes zero, meaning that the null
hypothesis, MD = 0, cannot be excluded. What happened? Recall that one formula for t
is: t = [(√n) × MD] / SDpooled. So, as a result of changing 20 to 28, the revised SD of the
experimental group (i.e. 3.70) – which influences the denominator – is 1.62 times bigger
than before, whereas the corresponding revised MD (1.89), which is part of the new
numerator, is only 1.31 times bigger. This disproportionately large influence of an outlier
on the denominator of the equation for the test statistic of a parametric means-based test
(e.g. t and F) is a trait that makes parametric tests non-robust (e.g. Wilcox, 2010, 2012a).
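Since the article's raw scores are not printed here, the same phenomenon can be reproduced with hypothetical data of my own. Raising one experimental-group score enlarges the mean difference yet weakens the t-test, because the outlier inflates the denominator faster than the numerator:

```python
import numpy as np
from scipy import stats

eg = np.array([12, 13, 14, 15, 15, 15, 16, 17, 18, 20], float)
cg = np.array([10, 11, 12, 13, 14, 15, 15, 15, 16, 17], float)

t1, p1 = stats.ttest_ind(eg, cg)   # Student's IS t-test (pooled variance)

eg_outlier = eg.copy()
eg_outlier[-1] = 28.0              # suppose the top score was really 28, not 20
t2, p2 = stats.ttest_ind(eg_outlier, cg)
# The larger top score INCREASES the mean difference but inflates the
# pooled SD even more, so t falls and p rises.
```

For these made-up scores t drops from about 1.65 to about 1.58, and p rises accordingly, even though the evidence of superiority has, on its face, improved.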
Let us now consider the performance of a robust alternative, specifically a bootstrap
of the Yuen–Welch t-test (with 4,999 simulated replications) based on a 10% trimmed
mean (Wilcox 2012b, pp. 339–343).7 On applying this bootstrap first to the original data
and then to the revised data we find: original data, t = 1.855, p = .046, CI [0.03, 2.72];
revised data: t = 1.855, p = .040, CI [0.07, 2.68]. Note that the output from the second
bootstrap, compared to the first, fairly reflects the fact that the revised data constitute
improved evidence that the EGr outperformed the CGr yet, at the same time, the revised
statistics show no drastic shift. This is indicative of the superior stability of the trimmed
mean compared to the mean. Also, we see that the bootstrap CI is narrower (i.e. more
precise) than the one associated with the t-test.8 As a final point regarding our running
example, note the relatively poor performance of the Wilcoxon–Mann–Whitney (WMW)
test even on the original data: W = 101.5, p = .051. This is because the WMW test loses
power in the presence of tied scores: For instance, there are three scores of 15 in each
example data set. (The WMW test can show loss of performance also in other ways and
for other reasons; Zimmerman, 1998, 2003, 2004b). For an interesting account of why
complacency about the non-robustness of parametric tests took firm root in the behavio-
ral sciences, see Osborne (2013).
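A simplified cousin of the procedure used above, a percentile bootstrap of the difference in trimmed means rather than a bootstrapped Yuen–Welch test, can be written in a few lines. This sketch is my own (scipy assumed; 20% trimming rather than the 10% used in the running example; hypothetical scores with one extreme value):

```python
import numpy as np
from scipy.stats import trim_mean

def boot_trimmed_diff_ci(x, y, trim=0.2, b=4999, seed=0):
    """Percentile-bootstrap 95% CI for the difference in trimmed means."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    diffs = np.empty(b)
    for i in range(b):
        bx = rng.choice(x, size=x.size, replace=True)
        by = rng.choice(y, size=y.size, replace=True)
        diffs[i] = trim_mean(bx, trim) - trim_mean(by, trim)
    return np.percentile(diffs, [2.5, 97.5])

# Hypothetical scores; note the extreme value 35 in the first group:
eg = [12, 13, 14, 15, 15, 16, 17, 18, 19, 35]
cg = [10, 11, 12, 12, 13, 14, 14, 15, 16, 17]
lo, hi = boot_trimmed_diff_ci(eg, cg)
```

Because trimming discards the two most extreme scores in each tail before averaging, the score of 35 has no leverage over the estimate, which is the stability property praised above.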
Finally, we saw above that just one atypical value can inflate the SD more than
the MD. Recall that d equals a MD divided by a SD (the pooled inferential SD, in the
case of Cohen’s d) and that a SD is a measure of the deviation of data points from the
mean. Thus, d and other means-based measures of ES (such as r) can be distorted by
outliers in the same way that a p value can be. For discussion, see Grissom and Kim
(2012); Kelley and Preacher (2012). P. Ellis (2010) clarifies the different versions of
d that there are.
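The distortion of d by a single outlier follows directly from its definition, d = MD / SDpooled. Reusing the hypothetical scores from the t-test illustration above (my own data, not the article's):

```python
import math
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d with the pooled (inferential) SD in the denominator."""
    nx, ny = len(x), len(y)
    sp = math.sqrt(((nx - 1) * stdev(x)**2 + (ny - 1) * stdev(y)**2)
                   / (nx + ny - 2))
    return (mean(x) - mean(y)) / sp

eg = [12, 13, 14, 15, 15, 15, 16, 17, 18, 20]
cg = [10, 11, 12, 13, 14, 15, 15, 15, 16, 17]
d_clean = cohens_d(eg, cg)
d_outlier = cohens_d(eg[:-1] + [28], cg)  # one score replaced by an extreme value
# The outlier enlarges the mean difference but inflates the pooled SD
# even more, so d shrinks rather than grows.
```

Here d falls from roughly .74 to roughly .71 despite the apparently stronger result, mirroring the behaviour of t and p in the running example.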
Table 3. Statistical analysis in 49 pretest–posttest studies, 1997–2015: overview of the tests of significance used.

Split-plot (i.e. mixed between-by-within) anova on raw scores, with or without follow-on one-way anova and/or paired t-tests: 14 (28.5%)
Split-plot anova on gain scores, with or without follow-on one-way anova and/or paired t-tests: 5 (10%)
manova on raw scores, with time as a factor, with follow-on split-plot and/or one-way anova on raw scores: 2 (5%)
manova as above plus calculation of Pearson's r between gain scores and raw scores on a test of language aptitude: 1 (2%)
manova on raw pretest scores with follow-on one-way and split-plot anova on raw scores: 1
One-way independent samples (IS) anova on raw posttest scores: 3
Ancova on raw scores: 3
Multi-level modeling on raw scores: 2
One-way anova on raw scores and on gain scores: 2
Paired and IS t-tests on gain scores: 2
WMW tests on raw scores: 2
WMW and paired t-tests on raw scores: 1
WMW, chi-square, and McNemar tests on raw scores: 1
Split-plot anova on raw scores and ancova on raw scores: 1
Split-plot anova on raw scores & ancova on both raw scores and gain scores: 1
Latent growth curve analysis on raw scores: 1
Three-way mixed anova: 1
mancova, manova, ancova, and one-way anova: 1
Fisher exact test on scores representing differences in percentage gained: 1
Two-way IS anova on gain scores divided by the maximum gain possible: 1*
Multiple regression plus manova on gain scores: 1
Multiple regression and split-plot anova plus partial correlation analysis: 1

Notes. Not included are independent samples pairwise post-hoc tests (mainly t-tests) carried out as follow-ons from omnibus F-tests. For simplicity the terms anova, ancova, manova, and mancova are given here in the singular form, even though in most studies tests of these types were conducted multiple times.
* In this study, the data were mathematically transformed before anova so as to correct for inequality of variances. Among the total of 96 studies surveyed, only one other study featured a data transformation intended to mitigate the negative consequences of the violation of an assumption of a significance test.
is especially jeopardized whenever the mean pretest scores are not (virtually) identical
for all groups (e.g. Bonate, 2000). Split-plot anova based on pretest and posttest scores
can be used when groups have not been formed by random assignment, but as this
approach yields the same results as gain score analysis (the key statistics being those
for the interaction; e.g. Knapp & Schafer, 2009), I will say no more about it for the
time being. Now, on the understanding that we will just be scratching the surface of a
hugely complex area, let us consider the research questions addressed by the two
remaining basic options, gain score analysis and ancova.9
•• Gain score analysis: ‘How do groups differ in score change from pretest to
posttest?’
•• Ancova: ‘Is there an effect of treatment on the posttest scores that is not predict-
able from pretest scores?’ (Knapp & Schafer, 2009). In other words ‘Are there
between-group differences in posttest scores when we compare individuals from
the different groups who had the same pretest score?’ (see Fitzmaurice, 2001).
To sum up, gain score analysis focuses on amount of change whereas ancova focuses on
the end level of score.
Among applied statisticians the consensus seems to be that if groups have been
formed by random assignment of participants, then ancova has more advantages than
gain score analysis even though the results of a gain score analysis are likely to be easier
to interpret and may better address the question that researchers really want to have
answered (Knapp & Schafer, 2009). A major advantage of ancova is that it tends to
afford more statistical power than analysis of gain scores (e.g. Bonate, 2000; Vickers &
Altman, 2001). In particular, when there is no ceiling or floor effect in the posttest scores,
gain score analysis and ancova have similar statistical power; but when there is a ceiling
or floor effect in these scores, ancova retains power when gain score analysis does not
(Cribbie & Jamieson, 2004). It must be stressed, though, that it is highly controversial to
use ancova when groups have not been formed by random assignment (Cribbie &
Jamieson 2004; Miller & Chapman, 2001). Regarding gain score analysis, a common
concern is that participants who score low on the pretest will have more available gain
on the posttest than will participants whose pretest score was nearer the maximum (e.g.
Vickers & Altman, 2001). Such bias is not at all inevitable (Bonate, 2000). Nevertheless,
when participants have not been assigned randomly to groups it has been relatively com-
mon for researchers in some fields to try to circumvent the (perceived) problem of
between group differences in mean available gain by analysing ‘relative gain’ scores
(also known as ‘fractional’, ‘proportional’, or ‘percent’ scores) instead of raw gain
scores. Indeed, the survey found three studies in which relative gain (RG) scores were
used in inferential analysis, although these scores were calculated in three different ways.
To give an example, a RG score may be derived as follows: RG score = (Posttest score
– Pretest score) / Pretest score. (For this and nine other equations, see Bonate, 2000.)
Because a RG score is always a proportion, multiplying a RG score by 100 changes it
into an explicit ‘percentage’ gain score.10 An advantage of RG scores and explicit per-
centage gain scores is that they can be relatively easy to interpret (e.g. Bonate, 2000). In
general, however, applied statisticians seem unenthusiastic about basing inferential anal-
yses on either of these two types of score, and some authorities (Bonate, 2000; Vickers,
2001) have recommended against using them in significance testing at all, on the grounds
that they are not superior to gain scores (in particular, it seems that they do not actually
accomplish the goal of correcting for cross-group differences in available gain) and that
they are highly likely to be nonnormally distributed. Additionally, these two types of
scores may afford relatively low statistical power (Vickers, 2001). Adjustment for cross
group imbalance in available gain (owing to unbalanced floor or ceiling effects) can be
achieved by ancova. But, again, ancova should be reserved for use when groups have
been formed by random assignment (e.g. Cribbie & Jamieson, 2004).
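The two analytic questions can be set side by side in code. In the sketch below (my own illustration with simulated data; numpy and scipy assumed; the ancova is expressed as an ordinary regression of posttest on pretest plus a group dummy), the simulated treatment effect is 4 points:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 20
pre = rng.normal(50, 8, 2 * n)
group = np.repeat([0.0, 1.0], n)                 # 0 = control, 1 = treatment
post = 10 + 0.8 * pre + 4.0 * group + rng.normal(0, 5, 2 * n)

# Gain score analysis: do the groups differ in pre-to-post change?
gain = post - pre
t_gain, p_gain = stats.ttest_ind(gain[group == 1], gain[group == 0])

# Ancova as regression: is there a group effect on posttest scores
# after conditioning on pretest scores?
X = np.column_stack([np.ones(2 * n), pre, group])
coef, *_ = np.linalg.lstsq(X, post, rcond=None)
group_effect = coef[2]   # adjusted treatment effect; 4.0 was the value simulated
```

With randomly formed groups, as here, both analyses target the same treatment effect; the difference lies in the question asked (amount of change versus end level adjusted for starting level) and, usually, in the power available to answer it.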
Let us now consider Table 3. Among other things it shows that LTR authors have
approached the analysis of pretest–posttest data in a wide variety of ways. Some of these
ways were discussed above, where it was noted that in some cases two different
approaches yield the same results when applied to the same data. While we cannot exam-
ine all the additional approaches listed in Table 3, a word or two must be said about some
of them. First, mixed design manova can handle situations in which there are multiple
dependent variables. Otherwise, it accomplishes much the same goals as split-plot anova,
although possibly with reduced statistical power and less interpretable output. Split-plot
anova should certainly be preferred over mixed manova when ns are small and there is
only one posttest (Baguley, 2012). Second, latent growth curve analysis (being a version
of structural equation modeling) requires large samples, which limits its applicability in
L2 research. Third, multi-level modeling offers solid advantages in the case of mixed
designs (e.g. when ns are unequal and there are missing values): L2 researchers should
undoubtedly use this approach much more often than they have done in the past (Linck
& Cunnings, 2015). Finally, much of the considerable variety seen in Table 3 is no doubt
due to dissimilarities between studies. At the same time, it is probable that a good deal of
the variety is attributable to researchers’ uncertainty about how best to analyse certain
kinds of pretest–posttest data.
Bonate (2000) comments that testing for baseline balance not only inflates the familywise Type I error rate (as each additional
significance test has the potential to do) but that it does so needlessly. Bonate’s argument
here is that even when baseline testing does find a significant difference between group
means, it is nevertheless valid for a researcher to proceed with statistical analysis as long
as the results are interpreted with due circumspection. However, there is more to be said
about why testing for baseline balance is dubious. Considering the case of two independ-
ent groups, let us look at why accepting H0 as just described is never necessary, almost
always futile, and full of potential to be hazardous. Our example data is two sets of mock
pretest scores (nA = nB = 20) created using the statistical freeware R (R Core Team, 2016)
by random selection from normal populations of data points that were then rounded
down to whole numbers resembling typical test scores: Means, 27.85A vs. 28.55B; SDs,
2.92A vs. 2.82B; MD = 0.70. The IS t-test gives t(38) = 0.606, p = .548; the CI95% for the
MD is [–2.39 < 0.70 < 1.29]. Because this CI includes zero, H0 may well be true.
However, within this CI it is not zero but rather 0.70 that is the most plausible location
of the true MD. Moreover, 62% of the CI includes plausible values for the true MD that
are even further from zero than is |0.70|. In short, the high p value (i.e. .548) is due to low
statistical power rather than to the substantive triviality of the MD: Triviality has not
been demonstrated (cf. Norris, 2015). Actually, compared to stating the MD arithmeti-
cally (e.g. as a percentage), the act of carrying out a significance test and then finding
p = .548 constitutes no progress whatsoever toward showing that the groups’ mean scores
were equivalent at pretest. In fact, if the two groups are made large enough while the
means and the score ranges stay the same, p is certain to fall below α eventually. In short,
the more data one has (and thus the more statistical power), the less likely it is that the
ploy of accepting H0 will succeed even superficially. Anyway, significance tests are
applied to samples in order to make inferences about populations with respect to effects
posited on the basis of theory. But in Scenario 1 the focus is on the samples themselves,
and it may be difficult to imagine what theoretically interesting effect might be at issue.
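The inevitability of p falling below α as n grows follows directly from the formula for t. The sketch below (in Python; note that the rounded summary statistics yield a slightly different t than the reported raw-data t(38) = 0.606) holds the mock means and SDs fixed while n increases:

```python
import math

# Pooled-variance t statistic for two equal-sized independent groups,
# computed from summary statistics alone.
def t_stat(mean_a, sd_a, mean_b, sd_b, n):
    pooled_var = (sd_a**2 + sd_b**2) / 2   # equal ns, so a simple average
    se = math.sqrt(pooled_var * 2 / n)     # SE of the mean difference
    return (mean_b - mean_a) / se

# Hold the mock pretest statistics fixed and let the group size grow:
for n in (20, 200, 2000):
    print(n, round(t_stat(27.85, 2.92, 28.55, 2.82, n), 2))
# t grows in proportion to the square root of n (0.77 -> 2.44 -> 7.71 here),
# so p is guaranteed to sink below any fixed alpha once n is large enough.
```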
The authors of the 37 studies that featured baseline testing invariably reported p > α
and then stated that pretest means could be assumed to be equal (in one study this was
even done when p was very near to .05). The danger in drawing this unwarranted conclu-
sion is that once it is fixed in a researcher’s mind, the researcher could easily go on to
over-optimistically interpret the results of the main statistical analysis, particularly if this
is an analysis of the posttest scores only. For further discussion of the issues here see,
perhaps in this order, P. Cohen (1996), J. Cohen (1994), and Norris (2015).
We now turn to Scenario 2 which, in contrast to Scenario 1, can be very fruitful.
This scenario was encountered in the survey only once (Nakata, 2015), but I summa-
rize it anyway in the hope of encouraging other researchers to follow Nakata’s lead.
Scenario 2 arises when researchers have found evidence that a theoretically interesting
effect is very small; for instance, they have observed an MD near zero. The researchers
seek evidence that the population MD is too small to be of practical importance.
Success depends on their being able to deploy such high statistical power (e.g. through
recruitment of a very large number of experimental participants) that a CI centered
around the observed MD not only includes zero but is also so narrow that it includes
no values large enough to be of any practical significance (Hoenig & Heisey, 2001;
Wuensch, 2009).11
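The logic of Scenario 2 can be sketched as follows; the equivalence bound of ±0.5 test points is a purely hypothetical threshold chosen for illustration, and the large-sample (normal) critical value 1.96 is used for simplicity:

```python
import math

# 95% CI for a mean difference from summary statistics (normal approximation,
# equal group sizes assumed).
def ci_for_md(md, sd_a, sd_b, n, crit=1.96):
    se = math.sqrt((sd_a**2 + sd_b**2) / n)
    return (md - crit * se, md + crit * se)

md, sd, bound = 0.05, 2.9, 0.5   # near-zero observed MD; hypothetical bound
for n in (20, 500, 2000):
    lo, hi = ci_for_md(md, sd, sd, n)
    within_bounds = -bound < lo and hi < bound
    print(n, (round(lo, 2), round(hi, 2)), within_bounds)
# Every CI here includes zero, but only with hundreds of participants per
# group is it narrow enough to exclude all practically important values.
```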
Notes. (a) rS = Spearman's rho. Also, the well-known default benchmark interpretations of Pearson's r seem
broadly applicable here; that is, the values .10, .30, and .50 indicate correlations that are, respectively, weak,
medium, and strong (e.g. P. Ellis, 2010). (b) The mean n was calculated for each study; this is the median of
those means. (c) Not included here are the four studies that report 'found' interventions.
9 Trends
Because of the heavy use of anova and the virtual certainty that this is not a good thing,
I checked to see if there was a trend in the use of anova over the period 1997–2015. To
do so, I ordered the 96 surveyed studies according to their position in the volumes and
issues, the aim being to order the studies more or less chronologically (from older to
newer). I then represented each study with 1 or 0 depending on whether it did or did not
involve at least one use of anova. To represent the variable of time, I created a data set
consisting of the integers 1, 2, 3, … 96. The correlation between ‘time’ and ‘anova use’
(Spearman’s rho = .05) is consistent with a fairly steady reliance on anova throughout the
period of the survey. Additional estimates of trends (calculated by the approach just indi-
cated) are summarized in Table 4. These trends are, on the whole, encouraging. In par-
ticular, sample sizes grew and estimates of ES appeared more frequently and were
interpreted more often. Three other welcome trends, albeit weak ones, are that the median
number of research questions per study fell, the incidence of replication studies rose, and
random assignment of participants to groups became more common.
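For readers wishing to replicate this kind of trend check, the computation can be sketched as Pearson's r on tie-adjusted ranks, which is what a 0/1 'anova use' indicator requires (a Python sketch; the survey's own 0/1 vector is not reproduced here):

```python
from statistics import mean

def ranks(xs):
    """Average ranks, handling ties (essential for a 0/1 indicator)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho = Pearson's r computed on the tie-adjusted ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A tiny mock version of the check: a chronological index against a 0/1
# usage indicator. With the survey's 96 coded studies, the call would be
# spearman_rho(list(range(1, 97)), anova_use).
print(round(spearman_rho([1, 2, 3, 4], [0, 0, 1, 1]), 2))   # 0.89
```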
that has to do with statistics must be made regarding the last of these three positive char-
acteristics. Namely, the literature reviews in the surveyed articles are invariably tradi-
tional, which is to say that when an effect is touched on (e.g. the observed effect of a type
of instruction) authors rarely progress from a binary, yes-or-no consideration of the evi-
dence that the effect exists to a consideration of the likely size of the effect in terms, say,
of d or r. (Authors’ concluding discussions show a similar, if somewhat less strong, ten-
dency even when the matter at issue is their own results.) While failure to discuss, or
even mention, effect sizes may not invalidate discussion of a given effect’s place in a
theory, such omissions undoubtedly limit what can be said about an effect’s importance
in practice. For example, a researcher who has observed an effect of d = .15 might go on
to say whether and why an effect of this size should merit the attention of L2 teachers and
materials writers. Thus, if the effect is one that manifests itself only in a kind of circum-
stance that is brief and uncommon, the researcher may be justified in saying that the
effect should rank low on a scale of effects worth taking into practical account. If, in
contrast, the effect operates in situations that are common or long-lasting, the researcher
may be justified in claiming that practitioners should not ignore it, even though effects of
d = .15 are typically regarded as negligible (see Abelson, 1985; P. Ellis, 2010). Of course,
potentially useful interpretation of this kind is possible only when estimates of effect size
are available in the first place. To shift now to a historical perspective, widespread failure
to state and verbally interpret effect sizes limits the degree to which later studies can cog
into and follow on from earlier ones. Suppose, for example, that the literature on a given
pedagogical technique furnishes estimates of effect size for several versions of the tech-
nique (e.g. several ways to typographically highlight targeted vocabulary). Suppose also
that a researcher is planning an experiment with the goal of finding out which version of
the technique is the best. The researcher’s task of selecting which two or three versions
to test first is bound to be facilitated by the availability of estimates of effect size, as com-
pared to a situation in which the literature furnishes few such estimates or none at all.12
As to the future, there is no practical reason why the performance and reporting of
statistical analyses in LTR cannot show a number of conspicuous improvements within
just a few years, provided that these improvements are promoted by reviewers and edi-
tors. One of these improvements, as just discussed, would be for authors to plan, carry
out, and report their research with a much greater emphasis than hitherto on effect sizes
and a diminished emphasis on p values. Another improvement – relevant to studies based
on small sample sizes – would be that the conclusions that authors draw from their
results would reflect the fact that small samples may well have atypical characteristics.
In this newer and better world, LTR authors would also do the following: show greater
and more explicit commitment to meta-analytic thinking (Cumming, 2012a, 2014) and
to CCMA (Braver et al., 2014); use modern robust procedures of statistical analysis
relatively routinely (e.g. Wilcox, 2012b); and grasp opportunities to move away from
complex (m)anova designs toward slimmer, more focused, more coherent, and more
powerful designs through implementation of contrast analysis (Lindstromberg, 2016:
Section IV.6).
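As one illustration of the last point, a single planned contrast can replace an omnibus F test. The sketch below uses entirely hypothetical summary statistics, with weights chosen to ask one focused question (do two treatments jointly outperform a control?), and computes the contrast t from group means, SDs, and ns:

```python
import math

# Contrast t for independent groups from summary statistics. The weights
# define the focused question and must sum to zero.
def contrast_t(means, sds, ns, weights):
    assert abs(sum(weights)) < 1e-9, "contrast weights must sum to zero"
    df = sum(n - 1 for n in ns)
    # pooled error variance (MSE) across all groups
    mse = sum((n - 1) * sd**2 for n, sd in zip(ns, sds)) / df
    psi = sum(w * m for w, m in zip(weights, means))   # contrast estimate
    se = math.sqrt(mse * sum(w**2 / n for w, n in zip(weights, ns)))
    return psi / se, df

# Treatment A, treatment B, control (mock summary statistics):
t, df = contrast_t(means=[14.0, 13.0, 11.0], sds=[3.0, 3.0, 3.0],
                   ns=[20, 20, 20], weights=[0.5, 0.5, -1.0])
print(round(t, 2), df)   # 3.04 57
```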
Additionally, it would be easy and very beneficial for all authors to clearly state in
their results sections what factors and levels are involved in their (m)anova designs.
Certain other possible improvements seem likely to take longer to become established, if
they ever do, purely because they appear to require researchers (reviewers included) to
undergo a good deal of extra training. One such improvement, relevant when measure-
ments have been taken at three or more points in time, would be increased use of rela-
tively modern approaches to the analysis of change; for example, approaches that fit a
growth model either within a multi-level framework (Linck & Cunnings, 2015) or within
a structural equation modeling framework (e.g. Acock & Li, no date; Curran, Obeidat, &
Losardo, 2010; Hox & Stoel, 2014). As mentioned, and as Table 1 shows, the survey did
come across a small number of recent studies featuring use of such approaches (albeit
under a variety of names). An improvement that we may not see any time soon is a
marked rise in average sample sizes: The practical obstacles may be too great. Fortunately,
some of the most serious problems that arise from reliance on small samples can be
addressed by adopting a meta-analytic approach to research.
Acknowledgements
I am indebted to the editors for their many helpful comments and to Luke Plonsky for generously
providing me with key articles.
Notes
1 The survey initially took in all 315 main articles, that is, every article with an abstract. It did
not include editors’ introductions, short reports, book reviews, or miscellaneous notices.
2 The batches (of five articles each) do not precisely align with successive issues of LTR
since, over the time of the survey, issues have included from three to eight main articles
(median = 4.56). The approach used to derive the correlation comes from Howell (2010).
3 Researchers should not be complacent if n > 30 since even samples larger than 100 are all too
likely to show awkward traits such as skew and outliers (Hesterberg, 2008; Wilcox, 2012a,
2012b).
4 Power can also be understood as the chance that a given type of significance test will find a
targeted effect on a particular occasion, again with p ⩽ α. When significance tests are con-
ducted across multiple sets of data, three versions of power come into play. For example, in a
2 × 2 anova where the null hypotheses (H0) regarding the two main effects and the interaction
are all false, there is (1) the probability of rejecting a prespecified false H0, (2) the probability
of rejecting a single, unspecified false H0, and (3) the probability of rejecting all false H0; see
especially Maxwell, 2004.
5 The normality tests mentioned here can be implemented using functions in the R packages
‘fBasics’ and ‘nortest’ (R Core Team, 2016).
6 For discussions of attractive rank-based robust methods, see Wilcox, 2012a, 2012b.
7 Very roughly, a 10% trimmed mean is calculated by ordering all the values in a data set from
low to high and then deleting both the lowest 10% and the highest 10% in order to focus only
on the values likely to be the most typical. A trimmed mean is a compromise between the
mean and the median, the latter being equivalent to a 50% trimmed mean (Wilcox, 2012a).
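Note 7's procedure can be sketched as follows (mock scores; the simple slice-based version below does not handle the 50% case, where the median itself would be needed):

```python
# 10% trimmed mean: order the scores, drop the lowest and highest 10%,
# then average what remains.
def trimmed_mean(xs, prop=0.10):
    xs = sorted(xs)
    g = int(prop * len(xs))                  # values cut from each tail
    kept = xs[g:len(xs) - g] if g else xs
    return sum(kept) / len(kept)

scores = [3, 14, 15, 15, 16, 16, 17, 17, 18, 28]
print(sum(scores) / len(scores))   # 15.9  (ordinary mean, pulled by 3 and 28)
print(trimmed_mean(scores))        # 16.0  (averages the middle eight values)
```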
8 Even if the outlying value of 28 was not mistakenly present (e.g. because of a recording
error), some researchers might remove it anyway. In the running example, doing so would
raise the p value to .67 by the t-test and to .63 or so by the bootstrap. However, unless outliers
are genuine errors (e.g. recording errors), removing them by guesswork or even by a sup-
posedly trustworthy traditional method increases Type I and Type II error rates (Bakker &
Wicherts, 2014).
9 For a quick introduction to the main controversies about gain score analysis, see Smolkowski,
2013; informative as well are Cribbie and Jamieson, 2004; Rogosa, 1988; Williams and
Zimmerman, 1996. It must be stressed that early claims that gain scores are inherently
untrustworthy have been comprehensively refuted.
10 One LTR study used the unusual equation, RG score = (Posttest score – Pretest score) /
(maximum score possible – Pretest score), where the denominator is the gain available to be
made on the posttest. Within applied linguistics the earliest use of this plausible equation may
be Horst, Cobb, and Meara (1998). I have not yet found a discussion of it in the literatures of
theoretical or applied statistics.
11 Acceptance of the null hypothesis occurs also when a researcher (1) uses a significance test to
test an assumption of a parametric test, (2) finds p > α, (3) accepts the null hypothesis that all
is well with the data, and (4) goes on to use a parametric test for the main statistical analysis.
This is one of the reasons why the use of assumption tests is controversial.
12 For between-participants designs, d is calculable from descriptive statistics (i.e. ns, means,
and SDs). For calculation of some other types of effect size, including r, ordinary descriptive
statistics may not suffice.
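Note 12's point can be made concrete with a sketch that computes Cohen's d from ns, means, and SDs via the pooled SD (the figures reuse the mock pretest statistics from the discussion of baseline testing):

```python
import math

# Cohen's d for two independent groups from descriptive statistics alone.
def cohens_d(n1, m1, sd1, n2, m2, sd2):
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# The mock pretest groups (28.55 vs. 27.85, SDs 2.82 and 2.92, n = 20 each):
print(round(cohens_d(20, 28.55, 2.82, 20, 27.85, 2.92), 2))   # 0.24
```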
References
Abelson, R. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin,
97, 129–133.
Abelson, R. (1995). Statistics as principled argument. New York: Psychology Press.
Acock, A., & Li, F. (no date). Latent growth curve analysis: A gentle introduction. Available at:
http://oregonstate.edu/dept/hdfs/papers/lgcgeneral.pdf (accessed May 2016).
American Psychological Association (APA). (2010). Publication manual of the American
Psychological Association. Washington, DC: American Psychological Association.
An, Q., Xu, D., & Brooks, G. (2013). Type I error rates and power of multiple hypothesis testing
procedures in factorial ANOVA. Multiple Linear Regression Viewpoints, 39, 1–16.
Asendorpf, J., Conner, M., De Fruyt, F., et al. (2013). Recommendations for increasing replicabil-
ity in psychology. (Target article.) European Journal of Personality, 27, 108–119. See also
‘Authors’ response’, pp. 138–144.
Baguley, T. (2010). When correlations go bad. The Psychologist, 23, 122–123. Available at:
https://thepsychologist.bps.org.uk/volume-23/edition-2/methods-when-correlations-go-bad
(accessed May 2016).
Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences.
Basingstoke: Palgrave Macmillan.
Bakker, M., & Wicherts, J. (2014). Outlier removal, sum scores, and the inflation of the Type I
error rate in independent samples t tests: The power of alternatives and recommendations.
Psychological Methods, 19, 409–427.
Benjamini, Y., & Hochberg, Y. (2000). On the adaptive control of the false discovery rate in mul-
tiple testing with independent statistics. Journal of Educational and Behavioral Statistics,
25, 60–83.
Bonate, P. (2000). Analysis of pretest–posttest designs. Boca Raton, FL: Chapman and Hall/CRC.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis.
Oxford: Wiley.
Braver, S., Thoemmes, F., & Rosenthal, R. (2014). Continuously accumulating meta-analysis and
replicability. Perspectives on Psychological Science, 9, 333–342. Available at: http://www.
human.cornell.edu/hd/qml/upload/Braver_Thoemmes_2014.pdf (accessed May 2016).
Chaudron, C. (2001). Progress in language classroom research: Evidence from The Modern
Language Journal, 1916–2000. The Modern Language Journal, 85, 57–76.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. Available at:
http://en.wikipedia.org/wiki/Jacob_Cohen_(statistician) (accessed May 2016).
Cohen, P. (1996). Getting what you deserve from data. IEEE Expert, 11, 12–14. Available at:
http://w3.sista.arizona.edu/~cohen/Publications/papers/cohenIEEE96.pdf (accessed May
2016).
Cribbie, R., & Jamieson, J. (2004). Decreases in posttest variance and the measurement of change.
Methods of Psychological Research Online, 9, 37–55. Available at: http://www.dgps.de/fachgruppen/methoden/mpr-online (accessed May 2016).
Cumming, G. (2008). Replication and p intervals: P values predict the future only vaguely; but
confidence intervals do much better. Perspectives on Psychological Science, 3, 286–300.
Cumming, G. (2012a). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. Hove: Routledge.
Cumming, G. (2012b). Researchers underestimate the variability of p values over replication.
Methodology, 8, 61–62.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29. Available
at: http://pss.sagepub.com/content/25/1/7.full.pdf+html (accessed May 2016).
Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers’ understanding of
confidence intervals and standard error bars. Understanding Statistics, 3, 299–311.
Curran, P., Obeidat, K., & Losardo, D. (2010). Twelve frequently asked questions about growth
curve modeling. Journal of Cognition and Development, 11, 121–136.
Ellis, N. (2015). Foreword. Language Learning, 65, Supplement 1, v–vi.
Ellis, N., & Sagarra, N. (2011). Learned attention in adult language acquisition: A replication
and generalization study and meta-analysis. Studies in Second Language Acquisition, 33,
589–624.
Ellis, P. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the inter-
pretation of research results. Cambridge: Cambridge University Press.
Erceg-Hurn, D., & Mirosevich, V. (2008). Modern robust statistical methods: An easy way to
maximize the accuracy and power of your research. American Psychologist, 63, 591–601. Available at:
http://www.unt.edu/rss/class/mike/5700/articles/robustAmerPsyc.pdf (accessed May 2016).
Field, A. (1999). A bluffer’s guide to meta-analysis. Newsletter of the mathematical, statistical
and computing section of the British Psychological Society, 7, 16–25. Available at: http://
users.sussex.ac.uk/~andyf/meta.pdf (accessed May 2016).
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage.
Fitzmaurice, G. (2001). A conundrum in the analysis of change. Nutrition, 17, 360–361.
García, L. (2004). Escaping the Bonferroni iron claw in ecological studies. OIKOS, 105, 657–663.
Gass, S. (2009). A historical survey of SLA research. In W. Ritchie, & T. Bhatia (Eds.), Handbook
of second language acquisition (pp. 3–28). Bingley: Emerald.
Grissom, R., & Kim, J. (2012). Effect sizes for research: Univariate and multivariate applications,
2nd edition. Hove: Routledge.
Hesterberg, T. (2008). It's time to retire the 'n ≥ 30' rule. Proceedings of the American Statistical
Association, statistical computing section [CD-ROM]. Available at: http://www.timhesterberg.net/articles/JSM08-n30.pdf (accessed May 2016).
Hoenig, J., & Heisey, D. (2001). The abuse of power: The pervasive fallacy of power calcula-
tions for data analysis. The American Statistician, 55, 1–6. Available at: http://www.tc.umn.
edu/~alonso/Hoenig_AmericanStat_2001.pdf (accessed May 2016).
Horst, M., Cobb, T., & Meara, P. (1998). Beyond ‘A Clockwork Orange’: Acquiring second lan-
guage vocabulary through reading. Reading in a Foreign Language, 11, 207–223.
Howell, D. (2010). Statistical methods for psychology. 7th edition. Belmont, CA: Cengage
Wadsworth.
Hox, J., & Stoel, R. (2014). Multilevel and SEM approaches to growth curve modeling. Wiley
StatsRef: Statistics Reference Online. Originally published online in 2005 in Encyclopedia of
Statistics in Behavioral Science. Oxford: Wiley.
Hudson, T., & Llosa, L. (2015). Design issues and inference in experimental L2 research. Language
Learning, 65, 76–96.
Kelley, K., & Preacher, K. (2012). On effect size. Psychological Methods, 17, 137–152. Available
at: http://www.quantpsy.org/pubs/kelley_preacher_2012.pdf (accessed May 2016).
Keselman, H., Cribbie, R., & Holland, B. (2004). Pairwise multiple comparison test procedures:
An update for clinical child and adolescent psychologists. Journal of Clinical Child and
Adolescent Psychology, 33, 623–645. Available at: http://home.cc.umanitoba.ca/~kesel/
jccap-4.pdf (accessed May 2016).
Keselman, H., Othman, A., & Wilcox, R. (2013). Preliminary testing for normality: Is this a good
practice? Journal of Modern Applied Statistical Methods, 12, Article 2. Available at: http://
digitalcommons.wayne.edu/jmasm/vol12/iss2/2 (accessed May 2016).
Keselman, H., Othman, A., & Wilcox, R. (2014). Testing for normality in the multi-group prob-
lem: Is this a good practice? Clinical Dermatology, 2, 29–43.
Keselman, H., Huberty, C., Lix, L., et al. (1998). Statistical practices of educational researchers:
An analysis of their ANOVA, MANOVA and ANCOVA analyses. Review of Educational
Research, 68, 350–386. Available at: http://home.cc.umanitoba.ca/~kesel/rer1998.pdf
(accessed May 2016).
Kline, R. (2013). Beyond significance testing: Statistics reform in the behavioral sciences. 2nd
edition. Washington, DC: American Psychological Association.
Knapp, T., & Schafer, W. (2009). From gain score t to ANCOVA F (and vice versa). Practical
Assessment, Research & Evaluation, 14. Available at: http://pareonline.net/getvn.asp?v=14&n=6 (accessed May 2016).
Lai, J., Fidler, F., & Cumming, G. (2009). Subjective p intervals: Researchers underestimate the
variability of p values over replication. Methodology: European Journal of Research Methods
for the Behavioral and Social Sciences, 8, 51–62.
Larson-Hall, J., & Herrington, R. (2009). Improving data analysis in second language acquisition
by utilizing modern developments in applied statistics. Applied Linguistics, 31, 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings:
What gets reported and recommendations for the field. Language Learning, 65, 127–159.
Lenth, R. (2000). Two sample-size practices that I don’t recommend: Comments from the panel
discussion at the 2000 Joint Statistical Meetings. Unpublished article, presented at the 2000
Joint Statistical Meetings, Indianapolis, IN, USA. Available at: http://homepage.stat.uiowa.
edu/~rlenth/Power/2badHabits.pdf (accessed May 2016).
Linck, J., & Cunnings, I. (2015). The utility and application of mixed-effects models in second
language research. Language Learning, 65, 185–207.
Lindstromberg, S., & Eyckmans, J. (2014). How big is the positive effect of assonance on the
near-term recall of L2 collocations? ITL International Journal of Applied Linguistics, 165,
19–45.
Lindstromberg, S. (2016). Guidelines, recommendations, and supplementary discussion: For
researchers planning to submit a report of an intervention study to Language Teaching
Research. Language Teaching Research. DOI: 10.1177/1362168816651895.
Lix, L., & Sajobi, T. (2010). Testing multiple outcomes in repeated measures designs. Psychological
Methods, 15, 268–280.
Mackey, A., & Gass, S. (2005). Language research: Methodology and design. Mahwah, NJ:
Lawrence Erlbaum.
Maxwell, S. (2004). The persistence of underpowered studies in psychological research: Causes,
consequences, and remedies. Psychological Methods, 9, 147–163.
McGraw, K., & Wong, S. (1992). A common language effect size statistic. Psychological Bulletin,
111, 361–365.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological
Bulletin, 105, 156–166. Available at: http://isites.harvard.edu/fs/docs/icb.topic988008.files/
micceri89.pdf (accessed May 2016).
Miller, G., & Chapman, J. (2001). Misunderstanding analysis of covariance. Journal of Abnormal
Psychology, 110, 40–48.
Nakata, T. (2015). Effects of feedback timing on second language vocabulary learning: Does
delaying feedback increase learning? Language Teaching Research, 19, 416–434.
Nassaji, H. (2012). Statistical significance tests and result generalizability: Issues, misconcep-
tions, and a case for replication. In G. Porte (Ed.), Replication research in applied linguistics,
(pp. 92–115). Cambridge: Cambridge University Press.
Nickerson, R. (2000). Null hypothesis significance testing: A review of an old and continu-
ing controversy. Psychological Methods, 5, 241–301. Available at: http://psych.colorado.
edu/~willcutt/pdfs/Nickerson_2000.pdf (accessed May 2016).
Norris, J. (2015). Statistical significance testing in second language research: Basic problems and
suggestions for reform. Language Learning, 65, 97–126.
Norris, J., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantita-
tive meta-analysis. Language Learning, 50, 417–528.
Norris, J., & Ortega, L. (2006). The value and practice of research synthesis for language learning
and teaching. In J. Norris, & L. Ortega (eds), Synthesizing research on language learning and
teaching (pp. 3–50). Amsterdam: John Benjamins.
Norris, J., Ross, S., & Schoonen, R. (2015). Improving second language quantitative research.
Language Learning, 65, 1–8.
O’Keefe, D. (2003). Against familywise alpha adjustment. Human Communication Research, 29,
431–447. Available at: http://www.dokeefe.net/pub/okeefe07cmm-posthoc.pdf (accessed
May 2016).
O’Keefe, D. (2007). Post hoc power, observed power, a priori power, retrospective power, achieved
power: Sorting out appropriate uses of statistical power analyses. Communication Methods
and Measures, 1, 291–299. Available at: http://www.dokeefe.net/pub/okeefe07cmm-posthoc.pdf (accessed May 2016).
Osborne, J. (2013). Is data cleaning and the testing of assumptions relevant in the 21st century?
Frontiers in Psychology, 4, 4–7.
Oswald, F., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and chal-
lenges. Annual Review of Applied Linguistics, 30, 85–110.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting prac-
tices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The
case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Plonsky, L., Egbert, J., & Laflair, G. (2014). Bootstrapping in applied linguistics: Assessing its
potential using shared data. Applied Linguistics, 36, 591–610.
Polio, C. (2012). Replication in published applied linguistics research: A historical perspec-
tive. In G. Porte (Ed.), Replication research in applied linguistics (pp. 47–91). Cambridge:
Cambridge University Press.
Porte, G. (2012). Introduction. In G. Porte (Ed.), Replication research in applied linguistics (pp.
1–17). Cambridge: Cambridge University Press.
R Core Team. (2016). R: A language and environment for statistical computing [computer soft-
ware]. Vienna: R Foundation for Statistical Computing. Available at: http://www.R-project.
org (accessed May 2016).
Rogosa, D. (1988). Myths about longitudinal research. In K. Schaie, R. Campbell, W. Meredith,
& S. Rawlings (Eds), Methodological issues in aging research (pp. 171–209). New York:
Springer.
Romano, J., Shaikh, A., & Wolf, M. (2010). Hypothesis testing in econometrics. Annual Review of
Economics, 2, 75–104. Available at: http://home.uchicago.edu/amshaikh/webfiles/testingreview.pdf (accessed May 2016).
Schönbrodt, F. (2015). What’s the probability that a significant p-value indicates a true effect?
Blogpost, 3 Nov. Available at: http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect (accessed May 2016).
Smolkowski, K. (2013). Gain score analysis. Available at: http://homes.ori.org/~keiths/Tips/
Stats_GainScores.html (accessed May 2016).
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin,
76, 105–110.
Valentine, J., Pigott, T., & Rothstein, H. (2010). How many studies do you need? A primer on
statistical power for meta-analysis. Journal of Educational and Behavioral Statistics, 34,
215–247.
Vargha, A., & Delaney, H. (2000). A critique and improvement of the CL Common Language
Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics,
25, 101–132.
Vickers, A. (2001). The use of percentage change from baseline as an outcome in a controlled trial
is statistically inefficient: A simulation study. BMC Medical Research Methodology, 1, 6.
Available at: http://www.biomedcentral.com/content/pdf/1471-2288-1-6.pdf (accessed May
2016).
Vickers, A., & Altman, D. (2001). Analysing controlled trials with baseline and follow up meas-
urements. British Medical Journal, 323, 1123–1124. Available at: http://www.ncbi.nlm.nih.
gov/pmc/articles/PMC1121605 (accessed May 2016).
Westfall, P., & Young, S. (1993). Resampling-based multiple testing: Examples and methods for
p-value adjustment. New York: Wiley.
Wilcox, R. (1998). How many discoveries have been lost by ignoring modern statistical methods?
American Psychologist, 53, 300–314. Available at: http://psych.colorado.edu/~willcutt/pdfs/
Wilcox_1998.pdf (accessed May 2016).
Wilcox, R. (2010). Fundamentals of modern statistical methods. 2nd edition. Heidelberg: Springer.
Wilcox, R. (2012a). Introduction to robust estimation and hypothesis testing. 3rd edition. Waltham,
MA: Academic Press.
Wilcox, R. (2012b). Modern statistics for the social and behavioral sciences: A practical introduc-
tion. Boca Raton, FL: CRC Press.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychological
journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Williams, R., & Zimmerman, D. (1996). Are simple gain scores obsolete? Applied Psychological
Measurement, 20, 59–69. Available at: http://conservancy.umn.edu/handle/11299//119063
(accessed May 2016).
Wuensch, K. (2009). An overview of power analysis. Greenville, NC: East Carolina University.
Available at: http://core.ecu.edu/psyc/wuenschk/StatHelp/PowerAnalysis_Overview.pdf
(accessed May 2016).
Wuensch, K. (2015). CL: The Common Language Effect Size Statistic. Available at: http://core.
ecu.edu/psyc/wuenschk/docs30/CL.pdf (accessed May 2016).
Zimmerman, D. (1998). Invalidation of parametric and nonparametric statistical tests by concur-
rent violation of two assumptions. Journal of Experimental Education, 67, 55–68.
Zimmerman, D. (2003). A warning about the large-sample Wilcoxon–Mann–Whitney test.
Understanding Statistics, 2, 267–280.
Zimmerman, D. (2004a). A note on preliminary tests of equality of variances. British Journal of
Mathematical and Statistical Psychology, 57, 173–181.
Zimmerman, D. (2004b). A note on the influence of outliers on parametric and nonparametric
tests. Journal of General Psychology, 121, 391–401.