Inferential statistics in Language Teaching Research: A review and ways forward

Language Teaching Research, 2016, Vol. 20(6) 741–768
© The Author(s) 2016
DOI: 10.1177/1362168816649979

Seth Lindstromberg
Hilderstone College, Broadstairs, UK

Corresponding author:
Seth Lindstromberg, Hilderstone College, 18 Saint Peters Road, Broadstairs, Kent CT10 2JW, UK
Email: lindstromberg@gmail.com

Abstract
This article reviews all (quasi)experimental studies appearing in the first 19 volumes (1997–
2015) of Language Teaching Research (LTR). Specifically, it provides an overview of how statistical
analyses were conducted in these studies and of how the analyses were reported. The overall
conclusion is that there has been a tight adherence to traditional methods and practices, some
of which are suboptimal. Accordingly, a number of improvements are recommended. Topics
covered include the implications of small average sample sizes, the unsuitability of p values as
indicators of replicability, statistical power and implications of low power, the non-robustness
of the most commonly used significance tests, the benefits of reporting standardized effect sizes
such as Cohen’s d, options regarding control of the familywise Type I error rate, analytic options
in pretest–posttest designs, ‘meta-analytic thinking’ and its benefits, and the mistaken use of a
significance test to show that treatment groups are equivalent at pretest. An online companion
article elaborates on some of these topics plus a few additional ones and offers guidelines,
recommendations, and additional background discussion for researchers intending to submit to
LTR an article reporting a (quasi)experimental study.

Keywords
Effect sizes, L2 quantitative research, pretest–posttest designs, (quasi)experimental studies,
robust methods, small sample sizes, statistical analysis, statistical power, testing for baseline
balance

I Introduction
This article reports a survey of all issues of Language Teaching Research (LTR) from the
first issue in 1997 through the latest issue (at the time of writing) of 2015, including
special issues. The main aims are (1) to outline how LTR authors have subjected data
from (quasi)experimental intervention studies to inferential statistical analysis, (2) to
portray authors’ practices in reporting the results of these analyses, and (3) to offer con-
structive criticism and suggestions for improvement. A few pertinent topics that could
not be discussed in this article owing to space constraints are dealt with in the online
companion article (e.g. interpretation of confidence intervals and the language-as-fixed-
effect fallacy; Lindstromberg, 2016). A number of other relevant topics have had to be
left completely undiscussed (e.g. graphical presentation of information and estimation of
the reliability of measurements) or only briefly touched on (e.g. rater agreement). Even
so, this article touches on so many (mostly basic) matters of a statistical nature that few
of them can be explained ‘from the ground up’. It is necessary, therefore, to assume that
readers have a serviceable recollection of the main elements of a typical introductory
statistics course, including the gist of the following terms: sample, population, statistic,
parameter, distribution (of data points), dependent variable, independent variable, meas-
ure of central tendency (e.g. the mean and the median), variance, standard deviation
(SD), sampling variation, a Type I error (i.e. a false discovery), and a Type II error (i.e. a
missed discovery). It is assumed that readers are broadly aware of the difference between
a parametric test of statistical significance and a nonparametric one, that they recall the
key assumptions of commonly used parametric tests (e.g. that the samples of data have
been drawn from populations of data that are normally distributed and that the variances
are equal), that they know that statistics such as the observed mean and variance of a
sample can be used to estimate the corresponding parameters (i.e. the population mean
and variance), that they are aware of the advantages that can come from randomly assign-
ing participants to the treatment groups in an experiment, that they know more or less
why a paired samples t-test usually has more statistical power (i.e. can find a smaller
p value) than an independent samples (IS) t-test when applied to the same data, that they
have a general idea of what a rank-based test is, and that they know that a t-test can be
conducted on one sample of data or on two, that the (post-hoc) pairwise comparisons
generally carried out following an omnibus F-test (e.g. in the context of anova) are gen-
erally t-tests, and that the standard way of testing the significance of a correlation also
involves a t-test. In one respect, though, this article may clash with what is said or implied
in some introductory statistics textbooks, particularly old ones. Namely, over the last 60
years or so a very large body of research in theoretical and applied statistics has demon-
strated that t-tests, anova, ancova, and other commonly used parametric procedures are
in fact not ‘robust’: That is to say, in real research situations the assumptions of paramet-
ric tests are frequently violated in ways that make using other kinds of tests an attractive
alternative. More is said about this vital matter below and in Lindstromberg (2016), but
for detailed discussion and copious references see Wilcox (2010, 2012a).
From among the main articles initially looked at, the survey ultimately included only
those 90 articles (28.5%) that reported experimental or quasi-experimental studies of
pedagogical interventions.1 (Inclusion of purely observational studies would have
required this article to be much longer.) As three of these 90 articles reported multiple
studies, the survey covered 96 separate studies in all. A study was considered to be an
intervention study if different groups of learners experienced different learning condi-
tions, or treatments, or if a single group of learners was assigned to more than one condi-
tion (e.g. they were asked to learn vocabulary items of two different kinds). Among the
96 studies surveyed, four reported interventions that researchers had found rather than
designed. An example here is a comparative study comparing learners who had partici-
pated in a study-abroad program with similar learners who had not done so. Figure 1
shows a clear rising trend over the period of the study in the number of articles per five-
article batch (i.e. about one LTR issue) that report an intervention study.

Figure 1.  Loess and robust regression lines (dotted and solid, respectively), showing the increasing representation of intervention studies in LTR, 1997–2015.2

In this article, I follow mainstream usage in applying the term ‘experiment’ – or,
more exactly, ‘experimental study’ – only to a quantitative study featuring both a con-
trol or a comparison condition and random assignment of available participants to con-
ditions. (The survey came across only one study that also included prior random
selection of participants from a large pool.) Experiment-like studies not featuring ran-
dom assignment to groups are hereafter referred to as ‘quasi-experiments’. Importantly,
all but one of the surveyed studies (the exception was in the context of a ‘found’ inter-
vention) featured some kind of control or comparison group/condition. (For further
relevant discussion with respect to second language (L2) research, see Hudson and
Llosa, 2015; Mackey and Gass, 2005.) Of the 96 studies surveyed only 20 (21%) were
conducted in a laboratory. Sixteen of the 96 studies had a purely within-participants
design, meaning that disadvantages of the absence of random assignment of participants
could be somewhat mitigated. Of the 76 studies that were neither purely within-partici-
pants nor based on found interventions, 30 (39.5%) were (true) experimental studies as
these have been characterized above. The majority of the 96 studies, then, were quasi-
experiments using intact groups of students, which is typical of L2 classroom research
owing to well-known practical obstacles to random assignment of learners to groups. In
only eight (27%) of the lab-based studies were groups formed by random assignment,
which might run counter to expectation. In a small number of articles it was stated that
classes were randomly assigned to treatments; but this practice should not be mistaken
for the kind of random assignment that distinguishes an experimental study from a
quasi-experiment. We return to the topic of random assignment in Section IV.6 below.
Finally, some section headings below are followed by an asterisk. This means that the
online supplement (Lindstromberg, 2016) includes relevant guidelines, recommenda-
tions, words of caution, and/or additional discussion.

II  Key findings of other, recent journal surveys


It has been remarked that there tends to be a huge gap between the statistical practices
recommended in the technical literature of applied statistics and the practices prevalent
in a wide range of client disciplines including educational research (Keselman, Huberty,
Lix, et al., 1998). Without going into as much technical detail as Keselman et al. (1998),
reviewers of published reports of quantitative L2 research have tended to com-
ment in the same vein, for example:

Comparing notes from years of reviewing quantitative research articles for applied linguistics
journals, common challenges that we had encountered quickly became apparent, including:
inadequate experimental design and control coupled with claims of causality, effectiveness, and
impact; impoverished sample sizes, as a rule, in combination with increasing use of multivariate
analyses; inattention to assumptions about numeric data in the selection of fitting inferential
techniques; dramatic and persistent over- and misinterpretation of p values and statistical
significance testing, coupled with inattention to effect sizes of all sorts; incomplete reporting of
descriptive and inferential statistics; a willingness to generalize on the basis of one or a few
cases; and so on. (Norris, Ross, & Schoonen, 2015, pp. 1–2)

Plonsky and Gass (2011) reviewed 174 observational and (quasi)experimental studies of
the effects of learner interaction on L2 development. Topics of their review included
types of study design, statistical procedures, the character and thoroughness with which
statistics were reported, the relationship between the magnitude of observed effects on
the one hand and research designs and reporting practices on the other, and trends with
respect to study designs, reporting practices, and magnitude of typically observed effects
(e.g. the typical size of the difference between the mean score of learners in two groups).
These reviewers found considerable room for improvement in reporting practices and
recommended, in particular, that researchers should report all means and associated
standard deviations (SDs), exact p values, and all associated test statistics (e.g. t and F),
and that they should report estimates of effect size (ES) and give confidence intervals for
these estimates (see also American Psychological Association, 2010). Plonsky (2013)
surveyed 606 primary quantitative observational and (quasi)experimental studies that
appeared in Language Learning and Studies in Second Language Acquisition from 1990
to 2010. His aim was to address two questions (p. 9):

To what extent has L2 research included various design features and statistical procedures, both
(a) generally or descriptively and (b) with respect to those associated with methodological and
experimental rigor? What types of data have been reported in L2 research, both (a) generally or
descriptively and (b) with respect to those associated with transparency?

Regarding matters especially relevant to the survey reported here, Plonsky (2013) found
that it was, unfortunately, common for test statistics and p values not to be reported for
tests when p > α, and that about a third of studies failed to give a SD for each reported
mean. In only 17% of surveyed articles had researchers mentioned whether they had
checked to see if the assumptions (or validity conditions) of their statistical procedures
were in place, and in only 1% had an appropriate power analysis been reported. His
estimate of the median number of significance tests per study was 18. Finally, Plonsky
remarked that, like earlier reviewers (e.g. Gass, 2009), he had found that the great
majority of L2 studies had used means-based statistical procedures, especially anova
and t-tests.
Various other scholars have also pointed out deficiencies and missed opportunities in
the statistical practices of L2 researchers and called for improvements (e.g. Chaudron,
2001; Larson-Hall & Herrington, 2009; Nassaji, 2012; Norris, 2015; Norris & Ortega,
2000, 2006). This review joins the chorus. However, one difference between this review
and most previous ones is that this one goes into somewhat greater technical detail with
respect to improvements that L2 researchers ought to adopt and pitfalls they ought to avoid.

III  Findings of the present survey, and key basic terms and
concepts
1  Significance tests used
As Table 1 shows, LTR authors have relied heavily on a rather narrow range of inferential
procedures, which have been in widespread use for several generations. Prominent among
these procedures are two types of means-based parametric tests of statistical significance:
first, t-tests, including post-hoc pairwise t-tests and t-tests of the significance of correla-
tions and, second, F-tests conducted in the context of anova, ancova, or manova. Somewhat
less prominent are two commonly used rank-based nonparametric tests, namely, the
Wilcoxon–Mann–Whitney (WMW) and the Wilcoxon Signed Ranks tests. Modern robust
procedures such as randomization tests and bootstraps were not used at all. Such proce-
dures will be discussed further below; but it should be said here that a robust statistical
procedure is one that continues to yield accurate estimates of parameters such as means
and mean differences and to maintain the desired Type I error rate (e.g. .05) when a para-
metric or a well-known nonparametric test such as the WMW test fails to do so because
its assumptions have been violated (Wilcox, 2012a, 2012b); regarding the nonrobustness
of the WMW test, see Lindstromberg 2016: Section IV.3.

Table 1.  The inferential statistical procedures used in the 96 experiments and quasi-experiments covered by the survey (1997–2015), along with noteworthy absences.

Traditional parametric and rank-based procedures                  Number of studies using the procedure
Anova                                                             57
Independent samples (IS) t-test of group differences              49
Paired samples t-test                                             17
Manova                                                            15
IS t-test applied to a correlation or a regression coefficient    12
Chi Square test or Fisher exact test                              11
Ancova                                                            10
Wilcoxon–Mann–Whitney rank sum test                                9
Wilcoxon signed ranks test                                         9

Other traditional procedures: Multiple linear regression,^a Kruskal–Wallis test (2), Friedman test (2), Sign test (2), mancova (1), McNemar test (1)
Contrast analysis in the context of an anova design: (0)
Modern robust procedures: Bootstrap (0), Randomization (also known as Permutation) tests (0)
Other, relatively modern procedures: Factor analysis (4), Multi-level modeling^b (3), Structural equation modeling (also known as Latent growth curve analysis) (2), Cluster analysis (1), Principal components analysis (1), and Time series analysis using Revusky’s Rn test (1).

Notes. ^a Only three articles report use of multiple regression; however, modern software is likely automatically to take a regression approach to anova and ancova. ^b Also known as Mixed-level, Mixed-effects, or Hierarchical linear modeling.

2  Sample size
The survey found that the typical sample (e.g. the number of learners per group) was
small, particularly in the case of studies with a between-participants or mixed design
(Table 2), where the typical sample size was very similar to the mean sample sizes that
Plonsky and Gass (2011) reported in their multi-journal survey of L2 interaction studies
(i.e. 23.5 for treatment groups and 19.3 for comparison groups) and similar also to the
overall median sample size of 19 reported by Plonsky (2013). (Trends are noted near the
end of this article.)

Table 2.  Sizes of learner groups in LTR intervention studies, 1997–2015.

                      Designed interventions                            Found interventions
                      Between-participants     Within-participants     Between-participants
                      and mixed designs        designs                 and mixed designs
Number of studies     75^a                     15                      4
Mean                  24.6                     35.2                    37.8
Median                20.0                     26.0                    36.3
Range                 5.0–63.7                 8.0–171.0               19.5–59.0
IQR^b                 15.0–32.9                21.0–32.5               22.1–52.0
SD                    13.6                     38.5                    19.5

Notes. For each study, a mean sample size was calculated. The statistics in this table are based on those means; so ‘mean = 35.2’ gives the mean of 15 mean sample sizes of the purely within-participants studies. No thorough attempt was made to weight each sample size by the number of significance tests that it played a role in. ^a One early study is not included here because learners were treated as individuals in statistical analysis even though they had worked in pairs. ^b IQR = inter-quartile range.

With respect to common procedures such as t-tests and anova, the vague traditional
consensus of applied statisticians has been that a sample is small if it consists of fewer
than 30 data points. Certainly, a sample of that size or smaller is likely to include a higher
proportion of atypical values than a much larger sample from the same population (e.g.
Wilcox, 2012b). A facet of this characteristic is that a small sample tends to have a
greater standard deviation (SD) than does a larger sample that is taken from the same
population. In short, for any parameter of interest, such as a population mean, a small
sample is relatively likely to yield an estimate that is not only imprecise and so relatively
uninformative (with the imprecision being reflected in a wide confidence interval), but
also inaccurate (see Lindstromberg, 2016: Section III.2). Thus, a finding stemming from
any given innovative small scale study − even if it was well conceived, well designed,
and well conducted − is unlikely to support a firm conclusion about anything that was not
already evident to the larger community of researchers.3 That said, firm bases for inter-
esting conclusions can be constructed in small scale research via multiple replications,
particularly if their findings are subjected to statistical meta-analysis, as explained
in detail by, for example, Asendorpf, Conner, de Fruyt, et al., 2013; Borenstein, Hedges,
Higgins, and Rothstein, 2009; Cumming, 2012a, 2014. We return to this matter further
below.
Finally, with respect to the issue of sample size we have so far explicitly considered
only scores deriving from learners. But in 38 (40%) of the 96 studies surveyed, test
scores were matched not with learners but with items such as words or formulaic
sequences. The typical sample size here was especially small: median = 15 (range,
2–170). For relevant discussion, see Lindstromberg, 2016: Section IV.10.

3  Replication and p
A replication study (or ‘replication’) is exact if it addresses precisely the same research
questions that were addressed by an earlier ‘original study’ through using the same pro-
cedure and materials but with different participants (see Cumming, 2008). A fairly exact
replication might follow the original procedure but use new materials (e.g. new vocabu-
lary items) as well as new learners. A so-called ‘partial’ replication displays a clear fam-
ily resemblance to an original study but shows one or more clear differences over and
above ones already mentioned (Cumming, 2008); in particular, the research questions
are likely to be somewhat different. There is no fixed rule about how, or how often, an
original study should be replicated. On the other hand, there is no serious theory-based
denial that replication is a crucial element of scientific activity.
It has been remarked that L2 research has shown a weak commitment to replication
(Polio, 2012; Porte, 2012). This observation is corroborated by the finding that only
seven (7%) of the 96 studies covered by the survey were explicitly characterized by the
researchers who conducted them as having been planned as (non-exact) replications. (Of
three further studies it is said only in concluding discussion that these studies ‘replicated’
a finding reported earlier in the literature.) Among the possible reasons for the dearth of
replication studies in L2 research is one that directly relates to the practice of statistical
analysis: Researchers may be too inclined to think that a research question has been con-
clusively answered if the question was once tested experimentally and a significant
p value was found (J. Cohen, 1994; Cumming, 2012b; Cumming, Williams, & Fidler,
2004; Lai, Fidler, & Cumming, 2009; Nickerson, 2000, p. 256–257; Tversky &
Kahneman, 1971). In fact, it is very seldom warranted to draw such a conclusion from a
single result, especially when samples are small, since sheer random variation is all too
likely to yield p ⩽ α when a null hypothesis is in fact completely true (Cumming, 2008,
2012a; Schönbrodt, 2015).
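
To get a feel for how unstable p is from one replication to the next, the following minimal simulation (invented data; Python with SciPy, not anything used in the surveyed studies) repeatedly ‘replicates’ a two-group study with a genuine medium-sized effect (d = 0.5) and 20 learners per group, recording the p value each time. The spread of p values is very wide, and only about a third of the replications reach p < .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, reps = 20, 0.5, 10_000   # assumed scenario: 20 learners per group, true effect d = 0.5

pvals = np.empty(reps)
for i in range(reps):
    treatment = rng.normal(d, 1, n)    # population mean shifted upward by d standard deviations
    comparison = rng.normal(0, 1, n)
    pvals[i] = stats.ttest_ind(treatment, comparison).pvalue

print("quartiles of p across simulated replications:",
      np.round(np.percentile(pvals, [25, 50, 75]), 3))
print("proportion of replications with p < .05:", round(np.mean(pvals < .05), 2))  # about .34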

4  Estimation of effect sizes and the role of statistical meta-analysis


Our field’s obsession with p may at long last be nearing the beginning of its end (e.g. N.
Ellis, 2015). This is good news, as there can now be little doubt that this obsession has
been a seriously unfruitful distraction from the quantitative researcher’s proper concern.
That is, instead of asking only the question, ‘Is there an effect, p ⩽ α?’, the spotlight
should be on questions such as, ‘How big is the effect?’ (more specifically, ‘How big is the
difference or association between A and B?’), ‘What is the substantive importance of this
difference or association?’ (e.g. J. Cohen, 1994; Cumming, 2012a, 2014; Kline, 2013),
and ‘How does the present quantitative estimate of the effect compare to others in the
literature?’ (e.g. Borenstein et al., 2009; Cumming, 2012a; 2014; P. Ellis, 2010; and, with
specific regard to L2 research, Larson-Hall & Plonsky, 2015; Oswald & Plonsky, 2010;
Plonsky & Oswald, 2014). Keeping these sorts of substantive questions in the forefront of
one’s mind is a cornerstone of two complementary, non-dichotomous modes of thought
that Cumming has called ‘estimation thinking’ and ‘meta-analytic thinking’ (2012a, p. 9).
His proposal is that, in combination, these ways of thinking should have the following
three routine results. First, every quantitative study yields a pertinent estimate of the size
of the effect(s) seen. Second, every worthwhile quantitative study is replicated multiple
times. Third, estimates of effect size (ES) derived from an original study and from any
sufficiently close replications are averaged using procedures of statistical meta-analysis
(e.g. Borenstein et al., 2009; Valentine, Pigott, & Rothstein, 2010). Here it must be stressed
that an effect can be the subject of a meta-analysis if as few as two studies have furnished
estimates, on the understanding that for greater precision and credibility the meta-analysis
may be extended again and again as new replications yield new estimates of effect size
(Asendorpf et al., 2013; Braver, Thoemmes, & Rosenthal, 2014; Cumming, 2012a, 2014).
This program of action is referred to by Braver et al. as ‘Continuously Cumulating Meta-
Analysis’ (CCMA). We return to CCMA not far below. But because the concept of an ES
is so pivotal, it may be worthwhile to remark beforehand that the literature recognizes four
main conceptions of an effect and its magnitude:

1. the strength of an association between phenomena, which is most commonly
estimated by Pearson’s r and Spearman’s rho (or rSP);
2. the ‘proportion of variation’ (POV) explained, commonly estimated by r2 and
eta2, where the denominator is all observed variation, including unexplained var-
iation, and the numerator is the amount of variation in the dependent variable
(e.g. test scores) that is associated with or attributable to variation in the inde-
pendent variable (e.g. Grissom & Kim, 2012, pp. 135–136);
3. a difference between means or medians or other measures of central tendency,
estimated usually by the raw difference or by d (e.g. Cohen’s d);
4. a ‘difference in probabilities’ expressed, say, in terms of the Probability of
Superiority (POS), also known as the Common Language Effect Size. For exam-
ple, ‘POS = 60%’ means that if a participant were randomly chosen from Group
A and their score were compared with a randomly chosen participant from Group
B, and if this were done many times, then in 60% of those comparisons the score
of the Group A participant would be the higher one.
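
For readers who like to see such measures computed, here is a minimal sketch of all four conceptions for two invented sets of scores; the data and the Python/SciPy tooling are purely illustrative and are not drawn from any of the surveyed studies.

import numpy as np
from scipy import stats

# Invented scores for two groups of learners (purely illustrative).
a = np.array([11, 13, 14, 14, 15, 16, 17, 18], dtype=float)
b = np.array([10, 11, 12, 13, 13, 14, 15, 16], dtype=float)

# 1. Strength of association: point-biserial r between group membership and score.
group = np.concatenate([np.ones_like(a), np.zeros_like(b)])
scores = np.concatenate([a, b])
r = stats.pearsonr(group, scores)[0]

# 2. Proportion of variation explained: here simply r squared.
pov = r ** 2

# 3. Difference between means, standardized as Cohen's d (pooled SD in the denominator).
n1, n2 = len(a), len(b)
pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
d = (a.mean() - b.mean()) / pooled_sd

# 4. Probability of Superiority: share of all cross-group pairs in which the
#    group-A score is the higher one (ties counted as half).
diffs = a[:, None] - b[None, :]
pos = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

print(f"r = {r:.2f}, r^2 = {pov:.2f}, d = {d:.2f}, POS = {pos:.0%}")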

As it happens, the measures of ES most commonly noted in the survey − r, (partial) eta2,
and d − are means-based and therefore not robust, which is to say that they are liable to
be inaccurate when they relate to populations of data that do not conform to the assump-
tions of a parametric method (e.g. a more or less normal distribution; see Section IV.5
below). A robust estimator of POV discussed by Wilcox (2012b, pp. 377–384) may have
advantages, but it is not easy to calculate by hand. The POS, on the other hand, is not
only fairly robust and easy to calculate, but it may also be easier for many researchers to
interpret than, say, r, r2, eta2, and maybe even d. For more on the POS see, perhaps in this
order, Wuensch (2015), McGraw and Wong (1992), Vargha and Delaney (2000), and
Grissom and Kim (2012).
Although the survey found no instance of the POS being reported, all the other three
types of ES measure have been, with at least one ES being reported in slightly over half
of the reports. However, common lapses have been failure to supply ESs consistently and
failure to verbally interpret the ones that are given. When verbal interpretations are
given, they are often brief and highly generic in character (e.g. ‘This is a large effect’).
Interestingly, occurrences of r and rSP have almost never been explicitly referred to as
measures of ES. Extreme idiosyncrasies, though, have been very rare. For instance, I
found only one instance of each of the following: using eta2 as the estimator of ES in
connection with a paired t-test, calling p a measure of ES, and reporting an ES in order to
dispel the notion that a study’s results ‘were distorted by the small sample sizes’, which
is not something an ES can give information about.
To conclude this section, I would like to say a little more about meta-analysis, particu-
larly CCMA. The latter, as already hinted, is an application of meta-analysis that pro-
ceeds as follows. Previous and new estimates of a given effect are quantitatively
synthesized to yield a best overall interim estimate. As new estimates come in, these are
added into the meta-analysis to yield series of increasingly precise interim estimates
(Braver et al., 2014). The word ‘interim’, which is key here, relates to the fact that a
CCMA project can begin with as few as two studies. CCMA is therefore already highly
feasible in L2 research (namely Ellis & Sagarra, 2011; Lindstromberg & Eyckmans,
2014), notwithstanding the fact that in our field the reporting of ESs is a fairly new thing.
Indeed, what CCMA means is that any L2 researcher can also be a meta-analyst, on a
small scale at least. However, even the small-scale meta-analyst must heed certain cave-
ats. For example, a meta-analysis based only on published studies may well yield an
overestimate of the true effect size. Because small, or even negative, effects are most
likely to have been observed in unpublished studies, the meta-analyst should strive to
find any of these that exist for the case at hand (e.g. Borenstein et al., 2009; Larson-Hall
& Plonsky, 2015; Plonsky & Oswald, 2010). Of course, the meta-analyst must also try to
ensure that methodologically poor studies are filtered out (for an excellent beginner’s
guide, see Field, 1999).
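
As a concrete, if simplified, illustration of how two estimates can be pooled, the sketch below runs a minimal fixed-effect meta-analysis of two hypothetical Cohen’s d values using inverse-variance weighting (the study labels, ds, and sample sizes are invented; serious CCMA work would normally also consider random-effects models and use dedicated meta-analysis software).

import numpy as np

# Hypothetical original study and one replication (invented numbers).
studies = [
    {"label": "original",    "d": 0.62, "n1": 20, "n2": 20},
    {"label": "replication", "d": 0.31, "n1": 25, "n2": 24},
]

def var_d(d, n1, n2):
    # Approximate sampling variance of Cohen's d (Borenstein et al., 2009).
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

ds = np.array([s["d"] for s in studies])
ws = np.array([1 / var_d(s["d"], s["n1"], s["n2"]) for s in studies])  # inverse-variance weights

d_combined = np.sum(ws * ds) / np.sum(ws)
se_combined = np.sqrt(1 / np.sum(ws))
ci = (d_combined - 1.96 * se_combined, d_combined + 1.96 * se_combined)

print(f"interim combined d = {d_combined:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")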

5  Confidence intervals*
Although applied statisticians have been recommending for years that educational and
behavioral researchers furnish observed ESs with confidence intervals (Wilkinson &
Task Force on Statistical Inference, 1999), this is a practice that L2 researchers have been
very slow to adopt (Norris et al., 2015; Plonsky & Gass, 2011; Plonsky, 2013). Indeed,
the survey found only two articles whose results sections give confidence intervals (CIs)
for estimates of population effect sizes (e.g. a difference in mean gains). One likely rea-
son for this rather extreme conservatism is poor awareness of how informative CIs can
be. For relevant discussion and references, see Lindstromberg, 2016.

IV  Further observations and discussion


1  Statistical power and the power problem in L2 research
Statistical power is the probability, in a particular circumstance, that a given statistical
test will detect an existing effect and at the same time find p ⩽ α (e.g. Wilcox, 2012b).4
Thus, power is increased by anything that makes it easier to find a significant p value
when the null hypothesis is false and decreased by anything that makes finding a signifi-
cant p value harder. Most notably, power rises with (1) the sample size, (2) the size of the
relevant effect, and (3) the magnitude of α (e.g. it is easier to find p ⩽ α when α = .05
than when α = .01). Power goes up too if a one-sided test is used (but the survey found
no instance of such a test being used). But power sinks with an increase of noise, fuzzi-
ness, or ambiguity in the data: Suppose, for instance, that a researcher plans to use the IS
t-test. If the samples are expected to have large variances, power can be expected to be
correspondingly low; and this is because high variances magnify the denominator of the
formula for t. Power can be influenced by other factors as well. For example, the power
of a parametric test can be affected by the degree to which the data conform to the test’s
assumptions of distributional normality and equality of variances. If these assumptions
and other assumptions are met sufficiently well, a small amount of extra power may be
gained by using the test rather than a non-parametric analog. But if these assumptions are
not met, use of the parametric test may entail a sacrifice of power (Wilcox, 2012a,
2012b). Further, the power of a parametric test (e.g. the IS t-test and IS anova) falls as
sample sizes become more unequal (e.g. Wilcox, 2012a, 2012b). This is relevant because
80% of the surveyed between-participants or mixed design studies included samples of
different sizes. Fortunately, though, the discrepancies were usually small.
As mentioned, previous reviewers have identified low statistical power as a serious
problem in L2 research. The problem arises mainly because sample sizes in L2 research
tend to be small and because of the excessive importance that has been attached to
p values (Plonsky, 2013). Recall that the survey found 20 to be the median sample size
in (quasi)experimental studies with a between-participants or mixed design (Table 2).
Let us briefly explore what this means for statistical power when α = .05, all test assump-
tions are tenable, and the null hypothesis is false. Suppose that a team of researchers
plans to use the IS t-test to analyse the scores of 40 learners (n1 = n2 = 20). Suppose also
that the effect being investigated happens to be of average size (d = 0.50, let us say). By
using the free online statistical power calculator G*Power (http://www.gpower.hhu.de/
en.html), we can find that these researchers will deploy statistical power of about .34.
What this means is that the researchers will have only a 34% chance of detecting the
effect against a 66% chance of failing to detect it. In the soft sciences, power of .80 is
often taken to be the minimum acceptable level of power (P. Ellis, 2010; Howell, 2010).
Plainly, power of .34 is miserable. For power of .80 the researchers in our running exam-
ple would need 128 learners (n1 = n2 = 64), which is a sample size far above the average
in the intervention studies surveyed here (see Table 2). Given a medium effect (eta2 ≈ .06
or R2 ≈ .13) and n1 = n2 = n3 = 20 (N = 60), at α = .05 the power of a one-way anova
omnibus F test is also abysmal (ca. .37). In short, the levels of power deployed by LTR
researchers have typically been nowhere near .80. Plonsky (2013) estimated the average
level of power in L2 research to be about .57, which is certainly much nearer the mark.
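
For readers who prefer to script such calculations, the same figures can be reproduced, at least approximately, with the statsmodels package in Python; the inputs below simply restate the scenarios just described, and G*Power returns essentially the same answers through its graphical interface.

from statsmodels.stats.power import TTestIndPower, FTestAnovaPower

# Independent samples t-test: d = 0.50, 20 learners per group, two-tailed alpha = .05.
t_power = TTestIndPower().power(effect_size=0.5, nobs1=20, alpha=0.05,
                                ratio=1.0, alternative='two-sided')

# Sample size per group needed for power of .80 under the same assumptions.
n_needed = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                       ratio=1.0, alternative='two-sided')

# One-way anova: three groups of 20; Cohen's f = 0.25 corresponds to the 'medium' eta2 of about .06.
f_power = FTestAnovaPower().power(effect_size=0.25, nobs=60, alpha=0.05, k_groups=3)

print(f"power of the t-test: {t_power:.2f}")              # about .34
print(f"n per group needed for .80 power: {n_needed:.0f}")  # about 64
print(f"power of the anova F-test: {f_power:.2f}")        # cf. the ca. .37 cited above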
Owing to space constraints no detailed or wide-ranging discussion of our field’s
power problem is possible here; for this, see Plonsky, 2013; Plonsky and Oswald, 2014.
However, the following facts should be stressed. First, widespread low power means that
a L2 (quasi)experimental researcher is all too likely to make Type II errors. That is, the
researcher routinely runs a regrettably high risk of overlooking a valid discovery,
whereby their study is likely to go unpublished, however informative it may otherwise
be (see Wilcox 1998). Second, an underpowered study is especially likely to report an
unusually large effect that will not be detected by a replication study because the obser-
vation of the effect was purely a result of random sampling variation and therefore spuri-
ous (Maxwell, 2004; Oswald & Plonsky, 2010). These two facts together mean that L2
research is likely to have over-reported large effects and under-reported small and
medium ones (e.g. Plonsky & Oswald, 2014). All these problems could be mitigated if
L2 researchers were to carry out many more replication studies.
For additional discussion, recommendations, and useful references regarding
statistical power, see Lindstromberg (2016). Here I add two remarks only. First, I came
across no clear evidence that LTR authors have conducted prospective (or a priori)
power analyses to see how many participants they would need for their planned study
to have reasonable statistical power. Wuensch (2009) provides very clear instructions
about how to do this using G*power, and Howell (2010) gives a readable statement of
rationale. Second, in two of the surveyed studies, authors reported ‘observed statistical
power’ (i.e. power calculated from the ES seen in the data already gathered). But there
is no point in reporting observed power since it gives no information not already given
by the observed p value. Lenth (2000) explains the problem very clearly; see also
O’Keefe, 2007.

2  Familywise Type I error*


As the number of a study’s research questions and the complexity of its design increase,
the number of significance tests that it includes is likely to go up as well. For the period
of the survey the median number of research questions per study is three (range: 1 to
about 12). Accordingly, complex designs – especially ones involving factorial anova
and manova – are common and the median number of significance tests per study is
high, even without taking tests of assumptions (e.g. normality), instrument reliability,
and rater agreement into account. The estimated median number, which is 18 (estimated
range: 2–227), is certainly an under-estimate since several reports imply or allude to
significance tests for which they give no information, not even the total number of these
additional tests; Plonsky (2013) expressed the same opinion about his estimate, which
was also 18. But even 18 significance tests per study is enough to raise the issue of
‘familywise Type I error probability’ or, for short, ‘familywise error’. This concerns the
likelihood that the overall incidence, or ‘rate’, of false positive findings (Type I errors)
may rise further and further above α as the number of significance tests in a group, or
‘family’, of tests goes up. The issues here are extraordinarily complex, and authorities
have differed greatly in their answers even to a question as fundamental as what a fam-
ily is (An, Xu, & Brooks, 2013). Westfall and Young (1993) have offered a guideline
that seems a fair average of the wide range of definitions that have been given: A family
of tests is all the tests conducted in a study that relate simultaneously to a single coher-
ent body of evidence. But regardless of how the term ‘family’ is defined, there is agree-
ment that the ‘familywise error rate’ (FWER) is likely to be dramatically higher than α
if a family includes many significance tests, particularly if these tests are based on
uncorrelated data sets. If tests correlate positively, the FWER is held down but to a
degree that is difficult to estimate (Romano, Shaikh, & Wolf, 2010). Howell (2010,
p. 365) has remarked that ‘in most reasonable cases’ the FWER can be approximated by
multiplying α by c, where c is the number of tests. So, for example, when
α = .05 and a family consists of six tests, c times α gives the very rough estimate, FWER
≈ .30. The result of this six-fold inflation is that any significant p value in the family has
little credibility. It seems likely, though, that Howell’s guideline could lead to consider-
able over-estimation of FWER when tests are positively correlated to a nontrivial
degree, which may well be the case in repeated measures designs (e.g. García, 2004;
Lix & Sajobi, 2010).
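
A small worked example may make the arithmetic concrete. For six independent tests at α = .05 the exact familywise rate is 1 − (1 − .05)^6 ≈ .26, slightly below Howell’s rough product c × α = .30. The sketch below (with invented p values) computes both figures and then, via statsmodels, applies the Bonferroni adjustment and, for comparison, the Benjamini–Hochberg false-discovery-rate adjustment discussed later in this section.

from statsmodels.stats.multitest import multipletests

alpha, c = 0.05, 6

# Familywise error rate if the six tests were independent and all null hypotheses true.
fwer_exact = 1 - (1 - alpha) ** c          # about .265
fwer_rough = c * alpha                     # Howell's rough approximation, .30

# Invented p values from a hypothetical family of six tests.
pvals = [0.003, 0.021, 0.049, 0.120, 0.260, 0.470]

reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=alpha, method='bonferroni')
reject_fdr,  p_fdr,  _, _ = multipletests(pvals, alpha=alpha, method='fdr_bh')

print(f"exact FWER = {fwer_exact:.3f}, rough c x alpha = {fwer_rough:.2f}")
print("Bonferroni-adjusted p values:", [round(p, 3) for p in p_bonf])
print("Benjamini-Hochberg-adjusted p values:", [round(p, 3) for p in p_fdr])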
It has been remarked that there is an inconsistency in the common practice of
using, say, the Bonferroni procedure to control the FWER across a set of pairwise
t-tests following a global, or ‘omnibus’, anova F-test, but failing to do so across mul-
tiple F-tests themselves no matter how many there have been (Abelson, 1995;
Baguley, 2012; O’Keefe, 2003). In the same vein, Howell (2010, note 7, p. 480)
observes, with respect to a 2 × 3 × 2 anova, that while its seven F-tests mean that the
FWER would be quite high without appropriate correction, the problem is one that
most researchers ignore. Indeed, the survey found only one instance of FWER control
across F-tests (Nakata, 2015).
Importantly, the classic Bonferroni procedure is based on the often improbable
assumption that all tests are uncorrelated, whereby in common situations it is likely to
over-control the FWER and so reduce statistical power (Romano et al., 2010). Because
statistical power tends already to be too low in many fields, including ours (Plonsky,
2013; Plonsky & Gass, 2011), some authorities have recommended against any correc-
tion of p and against any corresponding correction of CIs as well. They favor, instead,
letting readers form their own judgments about the credibility of unadjusted p values
(e.g. O’Keefe, 2003; for additional references, see Keselman, Cribbie, & Holland, 2004).
However, readers can only form such judgments when they have been informed about all
the tests that were conducted; and, as mentioned, some of the surveyed reports are vague
in this respect.
About 56% of the reports of studies involving anova say explicitly whether and how
the FWER was controlled across the post-hoc tests. The four methods of control most
often mentioned (a few reports mention two) are: Tukey’s HSD (12), standard Bonferroni
(8), Scheffé (4), and Fisher’s LSD (4). A small number of researchers simply reduced α
across the board to .025 or .01, which smacks of guesswork. The relatively powerful
methods of Hochberg and Rom (see Wilcox, 2012b) were not used, and control of the
False Discovery Rate is mentioned nowhere although it may be advisable when the
number of tests is very large (Benjamini & Hochberg, 2000; Field, Miles, & Field,
2012). As a final point of possible interest, virtually all estimates of correlation were
tested for significance although it did not always make sense for such tests to be carried
out. In fact, r is often amply serviceable as a purely descriptive statistic (Baguley, 2010;
Howell, 2010).

3  Checking the assumptions of parametric tests*


A key assumption of common parametric significance tests is that each set of data points
has been sampled from a population of data points that is normally distributed. When this
validity condition does not hold, the probabilities of Type I and, especially, Type II error
(i.e. the error of overlooking a legitimate discovery) may be very different from the
desired levels (e.g. Wilcox, 2012b). Much the same can be said of other validity condi-
tions of parametric tests such as approximate equality of group variances, sphericity (for
repeated-measures anova with more than two repeated measures), and equality of regres-
sion slopes (for ancova). An important fact to note here is that problems arising from
violations of key assumptions are likely to increase as the number of groups in an experi-
ment increases (Wilcox, 2012a, 2012b). One might expect that researchers using para-
metric tests would check whether the assumptions of these tests appear to be tenable
since, as Keselman et al. (1998, p. 351) bluntly put it: ‘The applied researcher who rou-
tinely adopts a traditional procedure without giving thought to its associated assumptions
may unwittingly be filling the literature with nonreplicable results.’ As it happens,
though, only 21 (22%) of the surveyed articles include explicit mention of any check
on the tenability of even one assumption (almost always the assumption of normality).
Where assumption checking is mentioned, there is typically no detail about how this was
done (e.g. that a Q–Q plot was examined) or about what was found (e.g. rightward skew),
and only three reports (3%) state explicitly that multiple assumptions were checked. That
said, 10 reports that make no mention of assumption checking nevertheless state that
one or more nonparametric tests were carried out. This implies that failure to mention
assumption checking does not necessarily mean that assumptions were not checked. It
would be better, though, for authors to give details.
A well-known problem with using a significance test (e.g. the Shapiro–Wilk normal-
ity test) to test an assumption of another significance test (such as a t-test) is that in any
given research situation it cannot be known for sure whether the assumption test has
enough power to detect a violation or whether instead it has so much power that it is
liable to find p ⩽ α when a violation is too small to matter (e.g. Baguley, 2012;
Field et al., 2012; Wilcox, 2012a). A problem that is much less well known is that Type I
error is increased by making use of a nonparametric or robust test conditional on a test of
normality finding p ⩽ α (for references, see Baguley, 2012, p. 324). A way of avoiding
both problems is to use robust procedures since in this case assumption tests are irrel-
evant (see Erceg-Hurn & Mirosevich, 2008; Wilcox, 2012a, 2012b). (Note: No one rec-
ommends that researchers forego other means of judging the character of their data, such
as creating and examining histograms, density plots, and QQ plots.) Recently, though,
Keselman, Othman, and Wilcox (2013, 2014) have found that the Cramér–von-Mises
and especially the Anderson–Darling normality tests are sufficiently reliable for samples
as small as 20 or perhaps less (although a sample of 8, for instance, would be too small),
provided that one sets α at .15 or even .20. (They also propose a method for controlling
the FWER.5) However, there remain strong reasons to avoid other tests of normality
(Keselman et al., 2013, 2014) as well as commonly used tests of the equality of variances
(Zimmerman, 2004a).
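
For researchers who do decide to examine normality, the sketch below (invented scores) shows one way of obtaining a Shapiro–Wilk test, an Anderson–Darling test, and a Q–Q plot in Python; conveniently, SciPy reports the Anderson–Darling critical value for a 15% significance level, which matches the liberal α recommended above. None of this replaces robust procedures; it is simply one way of looking at the data.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = rng.normal(loc=14, scale=2, size=20)   # invented sample of 20 scores

# Shapiro-Wilk test of normality (widely used, but see the caveats above).
w, p_sw = stats.shapiro(scores)

# Anderson-Darling test: compare the statistic with the tabulated critical values.
ad = stats.anderson(scores, dist='norm')

print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_sw:.3f}")
print(f"Anderson-Darling: A2 = {ad.statistic:.3f}")
print("critical values:", ad.critical_values, "at significance levels (%):", ad.significance_level)

# Visual check: a Q-Q plot of the scores against the normal distribution.
stats.probplot(scores, dist='norm', plot=plt)
plt.show()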

4  Two kinds of robust methods: Randomization and bootstrapping


Two important kinds of robust method are randomization (also known as permutation)
tests and bootstraps.6 What makes these methods different from parametric tests is that
they do not require the researcher to make the possibly unwarranted assumption that a set
of data at hand represents a population of data points with a well-understood distribution
such as a so-called ‘normal’ distribution. Rather, the researcher begins by regarding the
data, whatever its distribution, as the best available evidence of the nature of the relevant
population(s). (For a basic account of how randomization tests and bootstraps work, see
Lindstromberg, 2016: Sections IV.7 and IV.8.)
Numerous applied statisticians (e.g. Hesterberg, 2008; Keselman et al., 1998;
Wilcox, 2012a, 2012b) have argued that reliance on traditional parametric statistical
procedures is dangerous because these procedures are likely to produce erroneous
results under conditions of inequality of variances (or ‘heteroscedasticity’) and distri-
butional nonnormality (e.g. in the form of skew and the presence of outliers). These
conditions, once thought to be rare and generally harmless, are now known to be com-
mon (e.g. Micceri, 1989) and all too likely to be troublesome (Wilcox, 2012a, 2012b).
An abundance of software for implementing robust methods has been freely available
for over 10 years (especially in R), but as indicated in Table 1 the survey found no
mention of a robust procedure having been used. For discussions of benefits of using
robust methods in L2 research, see Larson-Hall and Herrington (2009); Plonsky,
Egbert, and Laflair (2014).
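
By way of illustration, here is a bare-bones sketch of the two kinds of method named in this section, applied to invented scores: a randomization (permutation) test of a difference between group means and a percentile bootstrap confidence interval for that difference. Wilcox (2012a, 2012b) describes more refined robust versions (e.g. based on trimmed means); the code below is only meant to show the underlying logic.

import numpy as np

rng = np.random.default_rng(7)
a = np.array([11, 12, 12, 14, 15, 16, 17, 19, 22])   # invented scores, group A
b = np.array([10, 11, 12, 12, 13, 14, 15, 15, 16])   # invented scores, group B
observed = a.mean() - b.mean()

# Randomization (permutation) test: reshuffle the group labels many times and see how
# often a difference at least as large as the observed one arises by chance alone.
pooled = np.concatenate([a, b])
n_perm, count = 9999, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:len(a)].mean() - perm[len(a):].mean()
    if abs(diff) >= abs(observed):
        count += 1
p_perm = (count + 1) / (n_perm + 1)   # two-sided permutation p value

# Percentile bootstrap: resample each group with replacement, collect the mean
# differences, and take the middle 95% of them as a confidence interval.
boot = np.array([
    rng.choice(a, size=len(a), replace=True).mean()
    - rng.choice(b, size=len(b), replace=True).mean()
    for _ in range(9999)
])
ci = np.percentile(boot, [2.5, 97.5])

print(f"observed difference = {observed:.2f}, permutation p = {p_perm:.3f}")
print(f"95% percentile-bootstrap CI [{ci[0]:.2f}, {ci[1]:.2f}]")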

5  The vulnerability of means-based tests to outliers


Although the survey found evidence of massive reliance on parametric tests, the terms
‘outlier’, ‘extreme value’, and ‘extreme score’ do not occur in any of the surveyed arti-
cles in a context indicating that authors had looked for outliers in their data. This is
concerning given that a single outlier can readily cause a parametric test to produce
either a Type I or a Type II error. Consider the following two sets of invented test scores
where n1 = n2 = 18 and MD = 1.44, where CGr stands for ‘Control Group’ and EGr stands
for ‘Experimental Group’, and where each subscript indicates the number of times a
value occurs in the score set in question.

•• CGr: 11₂, 12₄, 13₅, 14₂, 15₃, 16₂ (Mn = 13.33, SD = 1.57, with slight rightward skew)
•• EGr: 11₁, 12₂, 13₂, 14₄, 15₃, 16₂, 17₂, 18₁, 20₁ (Mn = 14.77, SD = 2.29, with slight rightward skew)

For these data the standard (i.e. Student’s) IS t-test gives: t = 2.207(34), p = .034.
But suppose that on checking the score sheets we see that the final EGr score is 28
instead of 20. Although this looks like even better evidence that the EGr outperformed
the CGr, on rerunning the t-test we find: t = 1.992(34), p = .054; and, unlike the earlier CI
(not given here), the new one, CI95% [–0.04, 3.81], includes zero, meaning that the null
hypothesis, MD = 0, cannot be excluded. What happened? Recall that, for two groups of
equal size n, one formula for t is: t = MD / (SDpooled × √(2/n)). So, as a result of changing 20 to 28, the revised SD of the
experimental group (i.e. 3.70) – which influences the denominator – is 1.62 times bigger
than before, whereas the corresponding revised MD (1.89), which is part of the new
numerator, is only 1.31 times bigger. This disproportionately large influence of an outlier
on the denominator of the equation for the test statistic of a parametric means-based test
(e.g. t and F) is a trait that makes parametric tests non-robust (e.g. Wilcox, 2010, 2012a).
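
The two t-test results just quoted are easy to check by script; the sketch below reconstructs the invented score sets from their subscripts and reruns Student’s independent samples t-test before and after the single score of 20 is changed to 28.

import numpy as np
from scipy import stats

# Reconstruct the invented score sets (each value repeated 'count' times).
cgr = np.repeat([11, 12, 13, 14, 15, 16], [2, 4, 5, 2, 3, 2])
egr = np.repeat([11, 12, 13, 14, 15, 16, 17, 18, 20], [1, 2, 2, 4, 3, 2, 2, 1, 1])

# Student's (equal-variance) independent samples t-test on the original data.
t1, p1 = stats.ttest_ind(egr, cgr, equal_var=True)          # t about 2.21, p about .034

# Change the single highest EGr score from 20 to 28 and rerun the test.
egr_revised = egr.copy()
egr_revised[egr_revised == 20] = 28
t2, p2 = stats.ttest_ind(egr_revised, cgr, equal_var=True)  # t about 1.99, p about .054

print(f"original data: t = {t1:.3f}, p = {p1:.3f}")
print(f"revised data:  t = {t2:.3f}, p = {p2:.3f}")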
Let us now consider the performance of a robust alternative, specifically a bootstrap
of the Yuen–Welch t-test (with 4,999 simulated replications) based on a 10% trimmed
mean (Wilcox 2012b, pp. 339–343).7 On applying this bootstrap first to the original data
and then to the revised data we find: original data, t = 1.855, p = .046, CI [0.03, 2.72];
revised data: t = 1.855, p = .040, CI [0.07, 2.68]. Note that the output from the second
bootstrap, compared to the first, fairly reflects the fact that the revised data constitute
improved evidence that the EGr outperformed the CGr yet, at the same time, the revised
statistics show no drastic shift. This is indicative of the superior stability of the trimmed
mean compared to the mean. Also, we see that the bootstrap CI is narrower (i.e. more
precise) than the one associated with the t-test.8 As a final point regarding our running
example, note the relatively poor performance of the Wilcoxon–Mann–Whitney (WMW)
test even on the original data: W = 101.5, p = .051. This is because the WMW test loses
power in the presence of tied scores: For instance, there are three scores of 15 in each
example data set. (The WMW test can show loss of performance also in other ways and
for other reasons; Zimmerman, 1998, 2003, 2004b). For an interesting account of why
complacency about the non-robustness of parametric tests took firm root in the behavio-
ral sciences, see Osborne (2013).
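
Readers without access to Wilcox’s routines can approximate the robust comparison with standard Python tools; the sketch below computes a percentile bootstrap confidence interval for the 10% trimmed-mean difference and runs the WMW test on the original invented data. This is not the bootstrap-t Yuen–Welch procedure used above, so its output will differ somewhat from the figures quoted there.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cgr = np.repeat([11, 12, 13, 14, 15, 16], [2, 4, 5, 2, 3, 2])
egr = np.repeat([11, 12, 13, 14, 15, 16, 17, 18, 20], [1, 2, 2, 4, 3, 2, 2, 1, 1])

def trimmed_diff(x, y, prop=0.10):
    # Difference between the 10% trimmed means of the two samples.
    return stats.trim_mean(x, prop) - stats.trim_mean(y, prop)

# Percentile bootstrap CI for the trimmed-mean difference (4,999 resamples).
boot = np.array([
    trimmed_diff(rng.choice(egr, len(egr), replace=True),
                 rng.choice(cgr, len(cgr), replace=True))
    for _ in range(4999)
])
ci = np.percentile(boot, [2.5, 97.5])

# Wilcoxon-Mann-Whitney test on the original data, for comparison.
wmw_u, wmw_p = stats.mannwhitneyu(egr, cgr, alternative='two-sided')

print(f"trimmed-mean difference = {trimmed_diff(egr, cgr):.2f}")
print(f"95% percentile-bootstrap CI [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"WMW: U = {wmw_u:.1f}, p = {wmw_p:.3f}")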

Finally, we saw above that just one atypical value can inflate the SD more than
the MD. Recall that d equals a MD divided by a SD (the pooled inferential SD, in the
case of Cohen’s d) and that a SD is a measure of the deviation of data points from the
mean. Thus, d and other means-based measures of ES (such as r) can be distorted by
outliers in the same way that a p value can be. For discussion, see Grissom and Kim
(2012); Kelly and Preacher (2012). P. Ellis (2010) clarifies the different versions of
d that there are.

6  Formation of groups of learners by random assignment


Experimental studies can involve pre-existing, or ‘intact’, groups of participants (e.g.
students in pre-existing EFL language classes) or they can involve temporary groups
formed by random assignment (RA), as already noted further above. (Other options for
group formation are of marginal relevance here and are not discussed.) Of the 96 studies
surveyed, 31 (32%) featured RA. (The survey found only one instance of the desirable
additional step of drawing participants at random from a pool of candidates much larger
than the number of participants used in the study.) As is well known, RA is especially
advantageous in studies of between-participants or mixed design (e.g. Bonate, 2000).
The survey found 76 such studies, of which 30 (39.5%) featured RA.

7  Statistical analysis of pretest–posttest designs with independent groups*


Fifty-four of the intervention studies reported in LTR followed a pretest–posttest
design. As this design involves issues of considerable complexity, let us simplify dis-
cussion by focusing initially on the biggest group of these studies. To do this, we tem-
porarily set aside five pretest–posttest studies in which the pretests were used simply
to filter out vocabulary items that learners already knew (four studies) or to screen out
learners who already knew the language items selected (one study). This leaves 49
studies. As a brief study of Table 3 may suggest, statistical analysis of data from these
pretest–posttest studies was by no means uniform. So let’s simplify things further by
considering only cases where there was just one pretest and one posttest, where partici-
pants were randomly assigned to groups prior to the pretest, and where the posttest was
identical to the pretest except for the ordering of items. For such cases, the main tradi-
tional approaches are: (1) analysis only of the raw posttest scores using the independ-
ent samples (IS) t-test, one-way IS anova or one of its nonparametric analogs (e.g. the
Kruskal–Wallis test); (2) use of two-way mixed between-by-within-participants, or
‘split plot’, anova to analyse both the raw pretest and the raw posttest scores; (3) use
of the IS t-test or IS anova to analyse the so-called ‘gain’ (or ‘change’ or ‘difference’)
scores of participants in the parallel groups, where each gain score is a participant’s
posttest score (X2) minus their pretest score (X1); and (4) ancova applied to raw pretest
and raw posttest scores. These four approaches vary considerably in versatility and in
the advantages they offer. Typically, analysis of only the posttest scores affords inferior
statistical power (e.g. Bonate, 2000). Moreover, this approach is fundamentally ques-
tionable when groups have not been formed by random assignment; and its tenability
is especially jeopardized whenever the mean pretest scores are not (virtually) identical
for all groups (e.g. Bonate, 2000). Split-plot anova based on pretest and posttest scores
can be used when groups have not been formed by random assignment, but as this
approach yields the same results as gain score analysis (the key statistics being those
for the interaction; e.g. Knapp & Schafer, 2009), I will say no more about it for the
time being. Now, on the understanding that we will just be scratching the surface of a
hugely complex area, let us consider the research questions addressed by the two
remaining basic options, gain score analysis and ancova.9

•• Gain score analysis: ‘How do groups differ in score change from pretest to
posttest?’
•• Ancova: ‘Is there an effect of treatment on the posttest scores that is not predict-
able from pretest scores?’ (Knapp & Schafer, 2009). In other words ‘Are there
between-group differences in posttest scores when we compare individuals from
the different groups who had the same pretest score?’ (see Fitzmaurice, 2001).

To sum up, gain score analysis focuses on amount of change whereas ancova focuses on
the end level of score.
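
The contrast between the two questions can be made concrete with a small simulation. The sketch below generates invented pretest and posttest scores for two randomly assigned groups and runs both analyses on the same data: an independent samples t-test on gain scores (approach 3 above) and an ancova with the pretest score as covariate (approach 4), here fitted as an ordinary regression in statsmodels; the variable names and effect sizes are invented.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(42)
n = 20                                           # invented: 20 learners per group
pre = rng.normal(20, 4, size=2 * n)              # pretest scores
group = np.repeat(["control", "treatment"], n)   # random assignment assumed
gain = rng.normal(3, 3, size=2 * n) + np.where(group == "treatment", 2, 0)
post = pre + gain

df = pd.DataFrame({"group": group, "pre": pre, "post": post, "gain": post - pre})

# Gain score analysis: IS t-test on posttest-minus-pretest scores.
t, p = stats.ttest_ind(df.loc[df.group == "treatment", "gain"],
                       df.loc[df.group == "control", "gain"])

# Ancova: posttest scores modelled from group, adjusting for pretest scores.
ancova = smf.ols("post ~ pre + C(group)", data=df).fit()

print(f"gain score t-test: t = {t:.2f}, p = {p:.3f}")
print(ancova.summary().tables[1])   # inspect the coefficient for C(group)[T.treatment]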
Among applied statisticians the consensus seems to be that if groups have been
formed by random assignment of participants, then ancova has more advantages than
gain score analysis even though the results of a gain score analysis are likely to be easier
to interpret and may better address the question that researchers really want to have
answered (Knapp & Schafer, 2009). A major advantage of ancova is that it tends to
afford more statistical power than analysis of gain scores (e.g. Bonate, 2000; Vickers &
Altman, 2001). In particular, when there is no ceiling or floor effect in the posttest scores,
gain score analysis and ancova have similar statistical power; but when there is a ceiling
or floor effect in these scores, ancova retains power when gain score analysis does not
(Cribbie & Jamieson, 2004). It must be stressed, though, that it is highly controversial to
use ancova when groups have not been formed by random assignment (Cribbie &
Jamieson 2004; Miller & Chapman, 2001). Regarding gain score analysis, a common
concern is that participants who score low on the pretest will have more available gain
on the posttest than will participants whose pretest score was nearer the maximum (e.g.
Vickers & Altman, 2001). Such bias is not at all inevitable (Bonate, 2000). Nevertheless,
when participants have not been assigned randomly to groups it has been relatively com-
mon for researchers in some fields to try to circumvent the (perceived) problem of
between group differences in mean available gain by analysing ‘relative gain’ scores
(also known as ‘fractional’, ‘proportional’, or ‘percent’ scores) instead of raw gain
scores. Indeed, the survey found three studies in which relative gain (RG) scores were
used in inferential analysis, although these scores were calculated in three different ways.
To give an example, a RG score may be derived as follows: RG score = (Posttest score
– Pretest score) / Pretest score. (For this and nine other equations, see Bonate, 2000.)
Because a RG score is always a proportion, multiplying a RG score by 100 changes it
into an explicit ‘percentage’ gain score.10 An advantage of RG scores and explicit per-
centage gain scores is that they can be relatively easy to interpret (e.g. Bonate, 2000). In
general, however, applied statisticians seem unenthusiastic about basing inferential anal-
yses on either of these two types of score, and some authorities (Bonate, 2000; Vickers,
2001) have recommended against using them in significance testing at all, on the grounds
that they are not superior to gain scores (in particular, it seems that they do not actually
accomplish the goal of correcting for cross-group differences in available gain) and that
they are highly likely to be nonnormally distributed. Additionally, these two types of
scores may afford relatively low statistical power (Vickers, 2001). Adjustment for cross
group imbalance in available gain (owing to unbalanced floor or ceiling effects) can be
achieved by ancova. But, again, ancova should be reserved for use when groups have
been formed by random assignment (e.g. Cribbie & Jamieson, 2004).
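
For readers who would like to see these options side by side, here is a minimal R sketch applied to one set of invented two-group pretest–posttest scores (all numbers and variable names are illustrative only, not drawn from any surveyed study):

  # Mock pretest-posttest data for two groups of 20
  set.seed(1)
  group <- factor(rep(c("A", "B"), each = 20))
  pre   <- round(rnorm(40, mean = 28, sd = 3))
  post  <- round(pre + ifelse(group == "B", 2, 0) + rnorm(40, mean = 3, sd = 3))

  # (1) Gain score analysis: how do the groups differ in change from pretest to posttest?
  gain <- post - pre
  t.test(gain ~ group)                # equivalent to a one-way anova on the gain scores

  # (2) Ancova: is there a treatment effect on posttest scores not predictable from pretest scores?
  summary(aov(post ~ pre + group))    # pretest entered first as the covariate

  # (3) Relative gain, here RG = (posttest - pretest) / pretest, one of several possible equations
  rg <- (post - pre) / pre
  t.test(rg ~ group)                  # often non-normal and relatively low-powered; see text

With randomly formed groups, option (2) will typically be the most powerful of the three; without random assignment, option (1) is the less controversial choice, for the reasons given above.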
Let us now consider Table 3. Among other things it shows that LTR authors have
approached the analysis of pretest–posttest data in a wide variety of ways. Some of these
ways were discussed above, where it was noted that in some cases two different
approaches yield the same results when applied to the same data. While we cannot exam-
ine all the additional approaches listed in Table 3, a word or two must be said about some
of them. First, mixed design manova can handle situations in which there are multiple
dependent variables. Otherwise, it accomplishes much the same goals as split-plot anova,
although possibly with reduced statistical power and less interpretable output. Split-plot
anova should certainly be preferred over mixed manova when ns are small and there is
only one posttest (Baguley, 2012). Second, latent growth curve analysis (being a version
of structural equation modeling) requires large samples, which limits its applicability in
L2 research. Third, multi-level modeling offers solid advantages in the case of mixed
designs (e.g. when ns are unequal and there are missing values): L2 researchers should
undoubtedly use this approach much more often than they have done in the past (Linck
& Cunnings, 2015). Finally, much of the considerable variety seen in Table 3 is no doubt
due to dissimilarities between studies. At the same time, it is probable that a good deal of
the variety is attributable to researchers’ uncertainty about how best to analyse certain
kinds of pretest–posttest data.

8 Accepting the null hypothesis


The logic of null hypothesis significance testing is notoriously difficult to follow (e.g. J.
Cohen, 1994; Nassaji, 2012; Nickerson, 2000). Nickerson (2000) reviewed nine current
false or controversial beliefs about the outcome of a significance test. One of these is the
‘belief that failing to reject the null hypothesis is equivalent to demonstrating it’ (pp.
260–262). Our consideration of this belief, which statisticians sometimes label ‘accept-
ing the null hypothesis’, focuses on two scenarios that were encountered in the survey.
Scenario 1 occurs whenever researchers conduct a significance test, find p > α, and
then conclude that the effect they were investigating does not exist. This is like using a
low magnification telescope to search the night sky for asteroids and then, when none are
seen, concluding that there are no asteroids there. The flaw in this kind of reasoning may
seem obvious, and yet a version of Scenario 1 occurred in 37 (39%) of the 96 studies
surveyed, mostly just once but up to 30 times per study. So what exactly was done?
Especially when participants have not been randomly assigned to treatment groups, it
is desirable for the parallel groups to have more or less the same mean pretest score.
However, it is common for group means not to be the same. When the pretest means do
differ, researchers sometimes attempt to dispel the problem they fear this can entail for
later analysis by running a significance test on the pretest scores in the hope of finding
p > α. If p > α is found, the researchers declare that the null hypothesis (H0) of no differ-
ence between pretest mean scores has been upheld and that consequently the groups can
be regarded as having been equivalent at pretest. Once H0 has been accepted in this way,
the main analysis is conducted. For instance, the IS t-test or one-way anova is applied to
the groups’ posttest scores (again, on the assumption that the means are equal) and then,
if the main analysis shows p ⩽ α, it is concluded that the intervention had a significant
effect. Bonate (2000), who refers to this practice as ‘testing for baseline balance’,
comments that it not only inflates the familywise Type I error rate (as each additional
significance test has the potential to do) but that it does so needlessly. Bonate’s argument
here is that even when baseline testing does find a significant difference between group
means, it is nevertheless valid for a researcher to proceed with statistical analysis as long
as the results are interpreted with due circumspection. However, there is more to be said
about why testing for baseline balance is dubious. Considering the case of two independ-
ent groups, let us look at why accepting H0 as just described is never necessary, almost
always futile, and full of potential to be hazardous. Our example data consist of two sets of mock
pretest scores (nA = nB = 20) created using the statistical freeware R (R Core Team, 2016)
by random selection from normal populations of data points that were then rounded
down to whole numbers resembling typical test scores: Means, 27.85A vs. 28.55B; SDs,
2.92A vs. 2.82B; MD = 0.70. The IS t-test gives t(38) = 0.606, p = .548; the CI95% for the
MD is [–2.39 < 0.70 < 1.29]. Because this CI includes zero, H0 may well be true.
However, within this CI it is not zero but rather 0.70 that is the most plausible location
of the true MD. Moreover, 62% of the CI includes plausible values for the true MD that
are even further from zero than is |0.70|. In short, the high p value (i.e. .548) is due to low
statistical power rather than to the substantive triviality of the MD: Triviality has not
been demonstrated (cf. Norris, 2015). Actually, compared to stating the MD arithmeti-
cally (e.g. as a percentage), the act of carrying out a significance test and then finding
p = .548 constitutes no progress whatsoever toward showing that the groups’ mean scores
were equivalent at pretest. In fact, if the two groups are made large enough while the
means and the score ranges stay the same, p is certain to fall below α eventually. In short,
the more data one has (and thus the more statistical power), the less likely it is that the
ploy of accepting H0 will succeed even superficially. Anyway, significance tests are
applied to samples in order to make inferences about populations with respect to effects
posited on the basis of theory. But in Scenario 1 the focus is on the samples themselves,
and it may be difficult to imagine what theoretically interesting effect might be at issue.
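The computations just described are easy to reproduce in R; the following sketch uses freshly generated mock scores, so its exact t, p, and CI values will not match the ones quoted above:

  # Two mock pretest samples drawn from normal populations, then rounded down
  set.seed(2)
  preA <- floor(rnorm(20, mean = 28, sd = 3))    # group A
  preB <- floor(rnorm(20, mean = 28.5, sd = 3))  # group B

  res <- t.test(preA, preB, var.equal = TRUE)    # IS t-test for 'baseline balance'
  res$p.value    # p > .05 here does not demonstrate that the groups are equivalent
  res$conf.int   # the CI95% for the MD is wide, so large differences remain plausible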
The authors of the 37 studies that featured baseline testing invariably reported p > α
and then stated that pretest means could be assumed to be equal (in one study this was
even done when p was very near to .05). The danger in drawing this unwarranted conclu-
sion is that once it is fixed in a researcher’s mind, the researcher could easily go on to
over-optimistically interpret the results of the main statistical analysis, particularly if this
is an analysis of the posttest scores only. For further discussion of the issues here see,
perhaps in this order, P. Cohen (1996), J. Cohen (1994), and Norris (2015).
We now turn to Scenario 2 which, in contrast to Scenario 1, can be very fruitful.
This scenario was encountered in the survey only once (Nakata, 2015), but I summa-
rize it anyway in the hope of encouraging other researchers to follow Nakata’s lead.
Scenario 2 arises when researchers have found evidence that a theoretically interesting
effect is very small; for instance, they have observed an MD near zero. The researchers
seek evidence that the population MD is too small to be of practical importance.
Success depends on their being able to deploy such high statistical power (e.g. through
recruitment of a very large number of experimental participants) that a CI centered
around the observed MD not only includes zero but is also so narrow that it includes
no values large enough to be of any practical significance (Hoenig & Heisey, 2001;
Wuensch, 2009).11
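As a rough sketch of the logic of Scenario 2 in R (the equivalence bound of ±2 raw-score points and the very large sample sizes below are purely illustrative):

  set.seed(3)
  scoresA <- rnorm(1000, mean = 50.0, sd = 10)   # mock scores, very large groups
  scoresB <- rnorm(1000, mean = 50.2, sd = 10)

  ci <- t.test(scoresA, scoresB)$conf.int
  ci                      # a narrow CI95% for the MD
  all(ci > -2 & ci < 2)   # should be TRUE here: every plausible value of the MD is
                          # smaller than the pre-chosen bound of practical importance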
Table 4.  Overview of selected trends, 1997–2015.

Change over time in…  rS a

The sample size (n) in each between-participants and mixed design study (median = 20, N = 76) b, c  .46
The smallest n featuring in each between-participants and mixed design study (median = 16.5, N = 76) c  .36
N in within-participants designs in designed interventions (median = 26, N = 15) c  .10
Random assignment of participants to groups, yes or no (yes in 41.6% of the 76 relevant studies) c  .17
Reporting of effect sizes, yes or no (yes in 49% of 96)  .32
Interpretation of effect sizes, yes or no (yes in 55% of 47)  .25
The number of significance tests per study (median = 18, N = 96)  .05
Use of anova at least once in a study (N = 96)  .06
The number of research questions per study (median = 3, N = 96)  –.11
The incidence of laboratory studies  .13
The incidence of replication studies  .13

Notes. a rS = Spearman’s rho. Also, the well-known default benchmark interpretations of Pearson’s r seem
broadly applicable here. That is, the values .10, .30, and .50 indicate correlations that are, respectively, weak,
medium, and strong (e.g. P. Ellis, 2010). b The mean n was calculated for each study; this is the median of
those means. c Not included here are the four studies that report ‘found’ interventions.

9 Trends
Because of the heavy use of anova and the virtual certainty that this is not a good thing,
I checked to see if there was a trend in the use of anova over the period 1997–2015. To
do so, I ordered the 96 surveyed studies according to their position in the volumes and
issues, the aim being to order the studies more or less chronologically (from older to
newer). I then represented each study with 1 or 0 depending on whether it did or did not
involve at least one use of anova. To represent the variable of time, I created a data set
consisting of the integers 1, 2, 3, … 96. The correlation between ‘time’ and ‘anova use’
(Spearman’s rho = .05) is consistent with a fairly steady reliance on anova throughout the
period of the survey. Additional estimates of trends (calculated by the approach just indi-
cated) are summarized in Table 4. These trends are, on the whole, encouraging. In par-
ticular, sample sizes grew and estimates of ES appeared more frequently and were
interpreted more often. Three other welcome trends, albeit weak ones, are that the median
number of research questions per study fell, the incidence of replication studies rose, and
random assignment of participants to groups became more common.
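A sketch of this kind of trend calculation in R (the 0/1 vector below is invented for illustration; in the survey it recorded actual anova use):

  set.seed(4)
  time       <- 1:96                        # chronological position of each study
  anova_used <- rbinom(96, 1, prob = 0.8)   # mock 1/0 indicator of anova use
  cor.test(time, anova_used, method = "spearman", exact = FALSE)   # Spearman's rho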

V  Summary and conclusions


By traditional standards, LTR authors have almost always performed and reported their
statistical analyses at least adequately. That said, the quality of the surveyed articles with
respect to statistics does not in general match their impressive quality in other regards,
most notably their informativeness about instructed second language learning, their
relevance to L2 pedagogy, and their linkage to past research. However, a qualification
that has to do with statistics must be made regarding the last of these three positive char-
acteristics. Namely, the literature reviews in the surveyed articles are invariably tradi-
tional, which is to say that when an effect is touched on (e.g. the observed effect of a type
of instruction) authors rarely progress from a binary, yes-or-no consideration of the evi-
dence that the effect exists to a consideration of the likely size of the effect in terms, say,
of d or r. (Authors’ concluding discussions show a similar, if somewhat less strong, ten-
dency even when the matter at issue is their own results.) While failure to discuss, or
even mention, effect sizes may not invalidate discussion of a given effect’s place in a
theory, such omissions undoubtedly limit what can be said about an effect’s importance
in practice. For example, a researcher who has observed an effect of d = .15 might go on
to say whether and why an effect of this size should merit the attention of L2 teachers and
materials writers. Thus, if the effect is one that manifests itself only in a kind of circum-
stance that is brief and uncommon, the researcher may be justified in saying that the
effect should rank low on a scale of effects worth taking into practical account. If, in
contrast, the effect operates in situations that are common or long-lasting, the researcher
may be justified in claiming that practitioners should not ignore it, even though effects of
d = .15 are typically regarded as negligible (see Abelson, 1985; P. Ellis, 2010). Of course,
potentially useful interpretation of this kind is possible only when estimates of effect size
are available in the first place. To shift now to a historical perspective, widespread failure
to state and verbally interpret effect sizes limits the degree to which later studies can mesh
with and follow on from earlier ones. Suppose, for example, that the literature on a given
pedagogical technique furnishes estimates of effect size for several versions of the tech-
nique (e.g. several ways to typographically highlight targeted vocabulary). Suppose also
that a researcher is planning an experiment with the goal of finding out which version of
the technique is the best. The researcher’s task of selecting which two or three versions
to test first is bound to be facilitated by the availability of estimates of effect size, as com-
pared to a situation in which the literature furnishes few such estimates or none at all.12
As to the future, there is no practical reason why the performance and reporting of
statistical analyses in LTR cannot show a number of conspicuous improvements within
just a few years, provided that these improvements are promoted by reviewers and edi-
tors. One of these improvements, as just discussed, would be for authors to plan, carry
out, and report their research with a much greater emphasis than hitherto on effect sizes
and a diminished emphasis on p values. Another improvement – relevant to studies based
on small sample sizes – would be that the conclusions that authors draw from their
results would reflect the fact that small samples may well have atypical characteristics.
In this newer and better world, LTR authors would also do the following: show greater
and more explicit commitment to meta-analytic thinking (Cumming, 2012a, 2014) and
to CCMA (Braver et al., 2014); use modern robust procedures of statistical analysis
relatively routinely (e.g. Wilcox, 2012b); and grasp opportunities to move away from
complex (m)anova designs toward slimmer, more focused, more coherent, and more
powerful designs through implementation of contrast analysis (Lindstromberg, 2016:
Section IV.6).
Additionally, it would be easy and very beneficial for all authors to clearly state in
their results sections what factors and levels are involved in their (m)anova designs.
Certain other possible improvements seem likely to take longer to become established, if
they ever do, purely because they appear to require researchers (reviewers included) to
undergo a good deal of extra training. One such improvement, relevant when measure-
ments have been taken at three or more points in time, would be increased use of rela-
tively modern approaches to the analysis of change; for example, approaches that fit a
growth model either within a multi-level framework (Linck & Cunnings, 2015) or within
a structural equation modeling framework (e.g. Acock & Li, no date; Curran, Obeidat, &
Losardo, 2010; Hox & Stoel, 2014). As mentioned, and as Table 1 shows, the survey did
come across a small number of recent studies featuring use of such approaches (albeit
under a variety of names). An improvement that we may not see any time soon is a
marked rise in average sample sizes: The practical obstacles may be too great. Fortunately,
some of the most serious problems that arise from reliance on small samples can be
addressed by adopting a meta-analytic approach to research.

Acknowledgements
I am indebted to the editors for their many helpful comments and to Luke Plonsky for generously
providing me with key articles.

Notes
  1 The survey initially took in all 315 main articles, that is, every article with an abstract. It did
not include editors’ introductions, short reports, book reviews, or miscellaneous notices.
  2 The batches (of five articles each) do not precisely align with successive issues of LTR
since, over the time of the survey, issues have included from three to eight main articles
(median = 4.56). The approach used to derive the correlation comes from Howell (2010).
  3 Researchers should not be complacent if n > 30 since even samples larger than 100 are all too
likely to show awkward traits such as skew and outliers (Hesterberg, 2008; Wilcox, 2012a,
2012b).
  4 Power can also be understood as the chance that a given type of significance test will find a
targeted effect on a particular occasion, again with p ⩽ α. When significance tests are con-
ducted across multiple sets of data, three versions of power come into play. For example, in a
2 × 2 anova where the null hypotheses (H0) regarding the two main effects and the interaction
are all false, there is (1) the probability of rejecting a prespecified false H0, (2) the probability
of rejecting a single, unspecified false H0, and (3) the probability of rejecting all false H0; see
especially Maxwell, 2004.
  5 The normality tests mentioned here can be implemented using functions in the R packages
‘fBasics’ and ‘nortest’ (R Core Team, 2016).
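For example (a sketch on an invented sample x, shown here with two functions from ‘nortest’):

  library(nortest)
  x <- rnorm(25, mean = 30, sd = 4)   # mock scores
  ad.test(x)                          # Anderson-Darling test of normality
  lillie.test(x)                      # Lilliefors (Kolmogorov-Smirnov) test of normality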
  6 For discussions of attractive rank-based robust methods, see Wilcox, 2012a, 2012b.
  7 Very roughly, a 10% trimmed mean is calculated by ordering all the values in a data set from
low to high and then deleting both the lowest 10% and the highest 10% in order to focus only
on the values likely to be the most typical. A trimmed mean is a compromise between the
mean and the median, the latter being equivalent to a 50% trimmed mean (Wilcox, 2012a).
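In R, for example:

  x <- c(12, 14, 15, 15, 16, 16, 17, 18, 19, 45)   # mock scores with one high outlier
  mean(x)              # ordinary mean, pulled upward by the outlier
  mean(x, trim = 0.1)  # 10% trimmed mean: lowest and highest 10% of values are dropped
  median(x)            # equivalent to a 50% trimmed mean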
  8 Even if the outlying value of 28 were not there by mistake (e.g. owing to a recording
error), some researchers might remove it anyway. In the running example, doing so would
raise the p value to .67 by the t-test and to .63 or so by the bootstrap. However, unless outliers
are genuine errors (e.g. recording errors), removing them by guesswork or even by a sup-
posedly trustworthy traditional method increases Type I and Type II error rates (Bakker &
Wicherts, 2014).
  9 For a quick introduction to the main controversies about gain score analysis, see Smolkowski,
2013; informative as well are Cribbie and Jamieson, 2004; Rogosa, 1988; Williams and
Zimmerman, 1996. It must be stressed that early claims that gain scores are inherently
untrustworthy have been comprehensively refuted.
10 One LTR study used the unusual equation, RG score = (Posttest score – Pretest score) /
(maximum score possible – Pretest score), where the denominator is the gain available to be
made on the posttest. Within applied linguistics the earliest use of this plausible equation may
be Horst, Cobb, and Meara (1998). I have not yet found a discussion of it in the literatures of
theoretical or applied statistics.
11 Acceptance of the null hypothesis occurs also when a researcher (1) uses a significance test to
test an assumption of a parametric test, (2) finds p > α, (3) accepts the null hypothesis that all
is well with the data, and (4) goes on to use a parametric test for the main statistical analysis.
This is one of the reasons why the use of assumption tests is controversial.
12 For between-participants designs, d is calculable from descriptive statistics (i.e. ns, means,
and SDs). For calculation of some other types of effect size, including r, ordinary descriptive
statistics may not suffice.
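For example, a minimal (pooled-SD) version of this calculation in R, applied purely for illustration to the mock descriptive statistics from Section 8:

  # Cohen's d from group sizes, means, and SDs (pooled-SD version)
  cohens_d <- function(n1, m1, sd1, n2, m2, sd2) {
    sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
    (m1 - m2) / sd_pooled
  }
  cohens_d(20, 27.85, 2.92, 20, 28.55, 2.82)   # roughly -0.24 for the mock pretest data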

References
Abelson, R. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin,
97, 129–133.
Abelson, R. (1995). Statistics as principled argument. New York: Psychology Press.
Acock, A., & Li, F. (no date). Latent growth curve analysis: A gentle introduction. Available at:
http://oregonstate.edu/dept/hdfs/papers/lgcgeneral.pdf (accessed May 2016).
American Psychological Association (APA). (2010). Publication manual of the American
Psychological Association. Washington, DC: American Psychological Association.
An, Q., Xu, D., & Brooks, G. (2013). Type I error rates and power of multiple hypothesis testing
procedures in factorial ANOVA. Multiple Linear Regression Viewpoints, 39, 1–16.
Asendorpf, J., Conner, M., De Fruyt, F., et al. (2013). Recommendations for increasing replicabil-
ity in psychology. (Target article.) European Journal of Personality, 27, 108–119. See also
‘Authors’ response’, pp. 138–144.
Baguley, T. (2010). When correlations go bad. The Psychologist, 23, 122–123. Available at:
https://thepsychologist.bps.org.uk/volume-23/edition-2/methods-when-correlations-go-bad
(accessed May 2016).
Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences.
Basingstoke: Palgrave Macmillan.
Bakker, M., & Wicherts, J. (2014). Outlier removal, sum scores, and the inflation of the Type I
error rate in independent samples t tests: The power of alternatives and recommendations.
Psychological Methods, 19, 409–427.
Benjamini, Y., & Hochberg, Y. (2000). On the adaptive control of the false discovery rate in mul-
tiple testing with independent statistics. Journal of Educational and Behavioral Statistics,
25, 60–83.
Bonate, P. (2000). Analysis of pretest–posttest designs. Boca Raton, FL: Chapman and Hall/CRC.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis.
Oxford: Wiley.
Braver, S., Thoemmes, F., & Rosenthal, R. (2014). Continuously accumulating meta-analysis and
replicability. Perspectives on Psychological Science, 9, 333–342. Available at: http://www.
human.cornell.edu/hd/qml/upload/Braver_Thoemmes_2014.pdf (accessed May 2016).
Chaudron, C. (2001). Progress in language classroom research: Evidence from The Modern
Language Journal, 1916–2000. The Modern Language Journal, 85, 57–76.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. Available at:
http://en.wikipedia.org/wiki/Jacob_Cohen_(statistician) (accessed May 2016).
Cohen, P. (1996). Getting what you deserve from data. IEEE Expert, 11, 12–14. Available at:
http://w3.sista.arizona.edu/~cohen/Publications/papers/cohenIEEE96.pdf (accessed May
2016).
Cribbie, R., & Jamieson, J. (2004). Decreases in posttest variance and the measurement of change.
Methods of Psychological Research Online, 9, 37–55. Available at: http://www.dgps.de/fach-
gruppen/methoden/mpr-online (accessed May 2016).
Cumming, G. (2008). Replication and p intervals: P values predict the future only vaguely; but
confidence intervals do much better. Perspectives on Psychological Science, 3, 286–300.
Cumming, G. (2012a). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. Hove: Routledge.
Cumming, G. (2012b). Researchers underestimate the variability of p values over replication.
Methodology, 8, 61–62.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29. Available
at: http://pss.sagepub.com/content/25/1/7.full.pdf+html (accessed May 2016).
Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers’ understanding of
confidence intervals and standard error bars. Understanding Statistics, 3, 299–311.
Curran, P., Obeidat, K., & Losardo, D. (2010). Twelve frequently asked questions about growth
curve modeling. Journal of Cognition and Development, 11, 121–136.
Ellis, N. (2015). Foreword. Language Learning, 65, Supplement 1, v–vi.
Ellis, N., & Sagarra, N. (2011). Learned attention in adult language acquisition: A replication
and generalization study and meta-analysis. Studies in Second Language Acquisition, 33,
589–624.
Ellis, P. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the inter-
pretation of research results. Cambridge: Cambridge University Press.
Erceg-Hurn, D., & Mirosevich, V. (2008). Modern robust statistical methods: An easy way to maximize the accuracy and power of your research. American Psychologist, 63, 591–601. Available at: http://www.unt.edu/rss/class/mike/5700/articles/robustAmerPsyc.pdf (accessed May 2016).
Field, A. (1999). A bluffer’s guide to meta-analysis. Newsletter of the mathematical, statistical
and computing section of the British Psychological Society, 7, 16–25. Available at: http://
users.sussex.ac.uk/~andyf/meta.pdf (accessed May 2016).
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Thousand Oaks, CA: Sage.
Fitzmaurice, G. (2001). A conundrum in the analysis of change. Nutrition, 17, 360–361.
García, L. (2004). Escaping the Bonferroni iron claw in ecological studies. Oikos, 105, 657–663.
Gass, S. (2009). A historical survey of SLA research. In W. Ritchie, & T. Bhatia (Eds.), Handbook
of second language acquisition (pp. 3–28). Bingley: Emerald.
Grissom, R., & Kim, J. (2012). Effect sizes for research: Univariate and multivariate applications,
2nd edition. Hove: Routledge.
Hesterberg, T. (2008). It’s time to retire the ‘n > = 30’ rule. Proceedings of the American Statistical
Association, statistical computing section [CD-ROM]. Available at: http://www.timhester-
berg.net/articles/JSM08-n30.pdf (accessed May 2016).
Hoenig, J., & Heisey, D. (2001). The abuse of power: The pervasive fallacy of power calcula-
tions for data analysis. The American Statistician, 55, 1–6. Available at: http://www.tc.umn.
edu/~alonso/Hoenig_AmericanStat_2001.pdf (accessed May 2016).
Horst, M., Cobb, T., & Meara, P. (1998). Beyond ‘A Clockwork Orange’: Acquiring second lan-
guage vocabulary through reading. Reading in a Foreign Language, 11, 207–223.
Howell, D. (2010). Statistical methods for psychology. 7th edition. Belmont, CA: Cengage
Wadsworth.
Hox, J., & Stoel, R. (2014). Multilevel and SEM approaches to growth curve modeling. Wiley
StatsRef: Statistics Reference Online. Originally published online in 2005 in Encyclopedia of
Statistics in Behavioral Science. Oxford: Wiley.
Hudson, T., & Llosa, L. (2015). Design issues and inference in experimental L2 research. Language
Learning, 65, 76–96.
Kelley, K., & Preacher, K. (2012). On effect size. Psychological Methods, 17, 137–152. Available
at: http://www.quantpsy.org/pubs/kelley_preacher_2012.pdf (accessed May 2016).
Keselman, H., Cribbie, R., & Holland, B. (2004). Pairwise multiple comparison test procedures:
An update for clinical child and adolescent psychologists. Journal of Clinical Child and
Adolescent Psychology, 33, 623–645. Available at: http://home.cc.umanitoba.ca/~kesel/
jccap-4.pdf (accessed May 2016).
Keselman, H., Othman, A., & Wilcox, R. (2013). Preliminary testing for normality: Is this a good
practice? Journal of Modern Applied Statistical Methods, 12, Article 2. Available at: http://
digitalcommons.wayne.edu/jmasm/vol12/iss2/2 (accessed May 2016).
Keselman, H., Othman, A., & Wilcox, R. (2014). Testing for normality in the multi-group prob-
lem: Is this a good practice? Clinical Dermatology, 2, 29–43.
Keselman, H., Huberty, C., Lix, L., et al. (1998). Statistical practices of educational researchers:
An analysis of their ANOVA, MANOVA and ANCOVA analyses. Review of Educational
Research, 68, 350–386. Available at: http://home.cc.umanitoba.ca/~kesel/rer1998.pdf
(accessed May 2016).
Kline, R. (2013). Beyond significance testing: Statistics reform in the behavioral sciences. 2nd
edition. Washington, DC: American Psychological Association.
Knapp, T., & Schafer, W. (2009). From gain score t to ANCOVA F (and vice versa). Practical
Assessment, Research & Evaluation, 14. Available at: http://pareonline.net/getvn.asp?v=14&n=6 (accessed May 2016).
Lai, J., Fidler, F., & Cumming, G. (2009). Subjective p intervals: Researchers underestimate the
variability of p values over replication. Methodology: European Journal of Research Methods
for the Behavioral and Social Sciences, 8, 51–62.
Larson-Hall, J., & Herrington, R. (2009). Improving data analysis in second language acquisition
by utilizing modern developments in applied statistics. Applied Linguistics, 31, 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings:
What gets reported and recommendations for the field. Language Learning, 65, 127–159.
Lenth, R. (2000). Two sample-size practices that I don’t recommend: Comments from the panel
discussion at the 2000 Joint Statistical Meetings. Unpublished article, presented at the 2000
Joint Statistical Meetings, Indianapolis, IN, USA. Available at: http://homepage.stat.uiowa.
edu/~rlenth/Power/2badHabits.pdf (accessed May 2016).
Linck, J., & Cunnings, I. (2015). The utility and application of mixed-effects models in second
language research. Language Learning, 65, 185–207.
Lindstromberg, S., & Eyckmans, J. (2014). How big is the positive effect of assonance on the
near-term recall of L2 collocations? ITL International Journal of Applied Linguistics, 165,
19–45.
Lindstromberg, S. (2016). Guidelines, recommendations, and supplementary discussion: For
researchers planning to submit a report of an intervention study to Language Teaching
Research. Language Teaching Research. DOI: 10.1177/1362168816651895.
Lix, L., & Sajobi, T. (2010). Testing multiple outcomes in repeated measures designs. Psychological
Methods, 15, 268–280.
Mackey, A., & Gass, S. (2005). Second language research: Methodology and design. Mahwah, NJ:
Lawrence Erlbaum.
Maxwell, S. (2004). The persistence of underpowered studies in psychological research: Causes,
consequences, and remedies. Psychological Methods, 9, 147–163.
McGraw, K., & Wong, S. (1992). A common language effect size statistic. Psychological Bulletin,
111, 361–365.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological
Bulletin, 105, 156–166. Available at: http://isites.harvard.edu/fs/docs/icb.topic988008.files/
micceri89.pdf (accessed May 2016).
Miller, G., & Chapman, J. (2001). Misunderstanding analysis of covariance. Journal of Abnormal
Psychology, 110, 40–48.
Nakata, T. (2015). Effects of feedback timing on second language vocabulary learning: Does
delaying feedback increase learning? Language Teaching Research, 19, 416–434.
Nassaji, H. (2012). Statistical significance tests and result generalizability: Issues, misconcep-
tions, and a case for replication. In G. Porte (Ed.), Replication research in applied linguistics,
(pp. 92–115). Cambridge: Cambridge University Press.
Nickerson, R. (2000). Null hypothesis significance testing: A review of an old and continu-
ing controversy. Psychological Methods, 5, 241–301. Available at: http://psych.colorado.
edu/~willcutt/pdfs/Nickerson_2000.pdf (accessed May 2016).
Norris, J. (2015). Statistical significance testing in second language research: Basic problems and
suggestions for reform. Language Learning, 65, 97–126.
Norris, J., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantita-
tive meta-analysis. Language Learning, 50, 417–528.
Norris, J., & Ortega, L. (2006). The value and practice of research synthesis for language learning
and teaching. In J. Norris, & L. Ortega (eds), Synthesizing research on language learning and
teaching (pp. 3–50). Amsterdam: John Benjamins.
Norris, J., Ross, S., & Schoonen, R. (2015). Improving second language quantitative research.
Language Learning, 65, 1–8.
O’Keefe, D. (2003). Against familywise alpha adjustment. Human Communication Research, 29,
431–447. Available at: http://www.dokeefe.net/pub/okeefe07cmm-posthoc.pdf (accessed
May 2016).
O’Keefe, D. (2007). Post hoc power, observed power, a priori power, retrospective power, achieved
power: Sorting out appropriate uses of statistical power analyses. Communication Methods
and Measures, 1, 291–299. Available at: http://www.dokeefe.net/pub/okeefe07cmm-post-
hoc.pdf (accessed May 2016).
Osborne, J. (2013). Is data cleaning and the testing of assumptions relevant in the 21st century?
Frontiers in Psychology, 4, 4–7.
Oswald, F., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and chal-
lenges. Annual Review of Applied Linguistics, 30, 85–110.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting prac-
tices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The
case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Plonsky, L., Egbert, J., & Laflair, G. (2014). Bootstrapping in applied linguistics: Assessing its
potential using shared data. Applied Linguistics, 36, 591–610.
Polio, C. (2012). Replication in published applied linguistics research: A historical perspec-
tive. In G. Porte (Ed.), Replication research in applied linguistics (pp. 47–91). Cambridge:
Cambridge University Press.
Porte, G. (2012). Introduction. In G. Porte (Ed.), Replication research in applied linguistics (pp.
1–17). Cambridge: Cambridge University Press.
R Core Team. (2016). R: A language and environment for statistical computing [computer soft-
ware]. Vienna: R Foundation for Statistical Computing. Available at: http://www.R-project.
org (accessed May 2016).
Rogosa, D. (1988). Myths about longitudinal research. In K. Schaie, R. Campbell, W. Meredith,
& S. Rawlings (Eds), Methodological issues in aging research (pp. 171–209). New York:
Springer.
Romano, J., Shaikh, A., & Wolf, M. (2010). Hypothesis testing in econometrics. Annual Review of
Economics, 2, 75–104. Available at: http://home.uchicago.edu/amshaikh/webfiles/testingre-
view.pdf (accessed May 2016).
Schönbrodt, F. (2015). What’s the probability that a significant p-value indicates a true effect?
Blogpost, 3 Nov. Available at: http://www.nicebread.de/whats-the-probability-that-a-signif-
icant-p-value-indicates-a-true-effect (accessed May 2016).
Smolkowski, K. (2013). Gain score analysis. Available at: http://homes.ori.org/~keiths/Tips/
Stats_GainScores.html (accessed May 2016).
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin,
76, 105–110.
Valentine, J., Pigott, T., & Rothstein, H. (2010). How many studies do you need? A primer on
statistical power for meta-analysis. Journal of Educational and Behavioral Statistics, 34,
215–247.
Vargha, A., & Delaney, H. (2000). A critique and improvement of the CL Common Language
Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics,
25, 101–132.
Vickers, A. (2001). The use of percentage change from baseline as an outcome in a controlled trial
is statistically inefficient: A simulation study. BMC Medical Research Methodology, 1, 6.
Available at: http://www.biomedcentral.com/content/pdf/1471-2288-1-6.pdf (accessed May
2016).
Vickers, A., & Altman, D. (2001). Analysing controlled trials with baseline and follow up meas-
urements. British Medical Journal, 323, 1123–1124. Available at: http://www.ncbi.nlm.nih.
gov/pmc/articles/PMC1121605 (accessed May 2016).
Westfall, P., & Young, S. (1993). Resampling-based multiple testing: Examples and methods for
p-value adjustment. New York: Wiley.
Wilcox, R. (1998). How many discoveries have been lost by ignoring modern statistical methods?
American Psychologist, 53, 300–314. Available at: http://psych.colorado.edu/~willcutt/pdfs/
Wilcox_1998.pdf (accessed May 2016).
Wilcox, R. (2010). Fundamentals of modern statistical methods. 2nd edition. Heidelberg: Springer.
Wilcox, R. (2012a). Introduction to robust estimation and hypothesis testing. 3rd edition. Waltham,
MA: Academic Press.
Wilcox, R. (2012b). Modern statistics for the social and behavioral sciences: A practical introduc-
tion. Boca Raton, FL: CRC Press.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychological
journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Williams, R., & Zimmerman, D. (1996). Are simple gain scores obsolete? Applied Psychological
Measurement, 20, 59–69. Available at: http://conservancy.umn.edu/handle/11299//119063
(accessed May 2016).
Wuensch, K. (2009). An overview of power analysis. Greenville, NC: East Carolina University.
Available at: http://core.ecu.edu/psyc/wuenschk/StatHelp/PowerAnalysis_Overview.pdf
(accessed May 2016).
Wuensch, K. (2015). CL: The Common Language Effect Size Statistic. Available at: http://core.
ecu.edu/psyc/wuenschk/docs30/CL.pdf (accessed May 2016).
Zimmerman, D. (1998). Invalidation of parametric and nonparametric statistical tests by concur-
rent violation of two assumptions. Journal of Experimental Education, 67, 55–68.
Zimmerman, D.W. (2003). A warning about the large-sample Wilcoxon–Mann–Whitney test.
Understanding Statistics, 2, 267–280.
Zimmerman, D. (2004a). A note on preliminary tests of equality of variances. British Journal of
Mathematical and Statistical Psychology, 57, 173–181.
Zimmerman, D. (2004b). A note on the influence of outliers on parametric and nonparametric
tests. Journal of General Psychology, 121, 391–401.
