Quantitative Approaches For Second Language Education Research
Contents

Preface
Independent variables
Moderator variables
Levels of measurement
Distributions
ANCOVA
MANOVA
Parametric correlations
Linear Regression
Reliability Tests
Cohen's Kappa
Factor Analysis
Preface
Second Language (L2) education researchers generally design their studies on the basis
of qualitative and quantitative data. While the analysis of the data from a qualitative
perspective entails completely different procedures, quantitative data has to be
processed and analyzed using statistical methods. In this sense, L2 education
researchers usually work with data from multifaceted tests (e.g. language tests,
working memory tests, among many others) whose purpose is to quantify a specific
construct, such as the level of proficiency. The data obtained from these procedures
are, in fact, observed data that are counted and classified. Statistical approaches are still regarded by many L2 education researchers as an unexplored and even intimidating area, owing to the limited overlap with language-related fields of research. Nevertheless, without statistics it would not be possible to determine whether the results of a test after an intervention are due to chance or to the intervention itself. Herein lies the relevance of statistics; even so, L2 education researchers often avoid quantitative approaches in their studies, mainly because of a lack of knowledge and training. In part, this book has
emerged out of the teaching notes that I was preparing as a supervisor of several
students who were writing their Bachelor's, Master's, and Ph.D. theses. In an
attempt to provide them with a comprehensible guide to statistics in L2 education
research, I opted for converting it into a tool that could be used not only by them but
by other researchers as well.
Likewise, the statistical software package of reference in this book is JASP (JASP Team, 2020), which is free, open-source software. All references and figures showing how to implement these statistical analyses are based on JASP.
Finally, my ultimate endeavor has been to accompany the statistical explanations and
procedures with sample L2 education research designs. These examples may serve as
guides for the L2 education researcher who struggles to understand how a specific
statistical procedure or test would be applied to their study.
Chapter 1
Doing research in any field requires a series of tools and skills whose main motivation
is to lead us further, and confirm or reject our hypotheses. Among these tools and skills,
one may adopt a purely qualitative approach, which entails the use of varied elements
of data collection such as interviews or observation methods. The adoption of a quantitative approach, in contrast, requires that the researcher gather numerical data – but also quantify qualitative data – with the aim of answering the proposed research questions. Most importantly, there are a number of essential aspects to bear in mind where the research design converges with the research questions, directly intersecting with quantitative data processing and analysis. Throughout this first chapter, the most important concepts in statistics will be explained. This first contact with the basics of statistics will be made through the description of concepts (e.g. types of variables) and numerous examples drawing on what is the norm in Applied Linguistics and Second Language (L2) education.
Variables
Variables refer to any information that can be codified. For instance, a variable would be L2
proficiency – which would have categories such as A1, A2, B1... –, gender, age, or, for
instance, anxiety. In short, variables define the characteristics or attributes of an
individual or a specific category. Variables may be classified as quantitative or
qualitative. Quantitative variables are those which are continuous, in which the
numbers are indicative of some amount. In this case, a quantitative variable in L2
education research would be equated with a test score or level of anxiety. Conversely,
qualitative variables are categories to which values are assigned. In that case, L2
proficiency would be a qualitative variable (also, categorical variable) since, in order to
quantify it, values should be assigned: for instance, A1 = 1, A2 = 2, B1 = 3. Within the
category of qualitative variables, dichotomous variables are commonplace in social
sciences research, and hence, in L2 education research. These dichotomous variables
are useful for survey-based research or when observations are classified and quantified.
Aside from these characteristics, statistics are highly dependent on the role that these
variables play in the research design. Hence, within a research design, two important
variables play a significant role: independent variables and dependent variables.
Besides these variables, others are equally relevant: control variables, moderator
variables, intervening variables, confounding variables, and extraneous variables.
Independent variables
This type of variable implies that, when isolated – that is, independent – it may exert some influence on changes or variation in dependent variables. In
other words, independent variables are systematically manipulated in order to observe
whether their variation contributes to further changes in other variables (Heiman, 2011;
Larson-Hall, 2010). An L2 researcher may manipulate the independent variable in an attempt to observe the extent to which this manipulation influences the dependent variable.
For instance, one may look into how changing the teaching methodology (independent
variable) may or may not contribute to improving vocabulary scores (the dependent variable).
Nevertheless, research designs are not always composed of one independent variable,
but they may involve more than one. These designs include several independent
variables which correspond to a number of factors. In L2 education research, one independent variable could be the methodology used (e.g. CLIL vs non-CLIL classroom), and another the L2 proficiency level (LOW vs HIGH).
Finally, the aim of an experimental design could be different from observing the
influence of the independent variable on the dependent variable. Hence, the objective
may be directed towards predicting a certain outcome on the basis of this independent
variable and several dependent variables. For instance, an L2 education researcher may
be interested in measuring test duration, and how it can predict students' scores on a
language proficiency test. In this case, the test duration would be the predictor or
explanatory variable, and the test scores would be considered the criterion (or outcome
variables).
Dependent variables
As anticipated in the previous section, dependent variables are the variables that the researcher observes in order to clarify or determine the effect of the
independent variable. In essence, dependent variables are the elements or
characteristics that change or are modified as a result of the manipulation of the
independent variable.
Control variables
Another important type of variable that needs to be included in some research designs
is control variables. In essence, control variables are not considered of interest in our
study, but they should be controlled given the purported influence they may exert over
the outcomes or the dependent variables. These control variables are advised to be
controlled in an attempt to provide more internal validity to our study. In L2 education
research, a common control variable is related to the participants' age or, more
specifically, the years studying the L2 as well as the number of hours of L2 lessons
received.
A solution for controlling this type of variable lies in the research design itself. The L2 researcher has two options: (1) randomize the groups, that is, regardless of the mentioned control variables (e.g. years studying the L2), participants are randomly assigned to either the control or the experimental group; or (2) standardize the procedures by adapting, for instance, the research design. For L2 proficiency, subgroups could be
created referring to LOW and HIGH proficient L2 learners. Nevertheless, including
more subgroups may lead to the necessity of enlarging the sample size.
Figure 1. The connection between the control variables and the independent and dependent
variables.
Moderator variables
To contextualize how a moderator variable may have an influence on an L2 research
design, let us imagine that we decide to study how belonging to a CLIL or non-CLIL
classroom (the independent variable) may exert an influence on anxiety levels
(dependent variable) by taking into account the proficiency level (moderator variable).
Intervening variables
Intervening variables are abstract theoretical variables which somehow mediate the
relationship between the independent variable and dependent variables. However,
intervening variables are not directly observable, and they may be only inferred from
what is observed. Another important characteristic of intervening variables is that they tend to be unexpected in the research design; their emergence usually results from further reasoning about the link between the independent variable and the dependent variables.
Figure 3. Intervening variable in the research design.
Confounding variables
When a variable affects the dependent and independent variables and may act as a
distractor or may confound the relationship between them, one may be dealing with a
confounding variable. This type of variable has to meet two principal conditions: (1)
there should exist a correlation with the independent variable, and (2) it should be
causally related to the dependent variable. The L2 researcher has to take into
consideration the pivotal role that confounding variables may play in research designs;
in this regard, results may not be robust and could be misleading if a confounding
variable is not considered. Equally important, the internal validity of the research could
be questioned.
Figure 4. A confounding variable in the research design.
Levels of measurement
To measure these variables, they are generally classified into different scales of measurement: (1) nominal, (2) ordinal, (3) interval, and (4) ratio.
The first type of scale of measurement, nominal, implies that the variable has different
categories (e.g. Experimental Group; Primary 5; Feedback Group). In the case of
ordinal variables, these entail values whose meaningfulness is ordered (e.g. strongly
disagree to strongly agree). In L2 education research, ordinal variables tend to be used
to classify individuals according to their L2 proficiency level, but also to organize L2
background (e.g. 1-3 years; 4-9 years). Interval variables are those whose distance between scores is equal, and they measure continuous variables. In L2 education research, interval variables are measured with interval scales, such as test duration or holistic ratings. Finally, ratio variables share the properties of interval variables but also have a true zero point (e.g. the number of errors in a writing task).
Types of research design
Prior to deciding on the type of statistical approach that L2 research has to take, the research design has to be purposefully devised. Firstly, in an experimental design, the
sample is divided into two equal or unequal groups. In both groups, a series of variables
will be explored, and one of them is going to be manipulated in order to observe
whether there is an effect over the other variables. In L2 education research, an example
could be two different types of methodologies to teach the same content. Hence, part
of the sample (the experimental group) would receive a novel methodology, while the
other part (the control group) would be taught using the traditional methodology. In
this case, the independent variable consists of two levels (1= novel methodology; 2=
traditional methodology), and the dependent variables would be determined by the
score in a test evaluating the content taught. However, experimental designs are usually
conducted in laboratory settings. When an experimental design is carried out in the actual educational environment, that is, a naturally occurring setting, it is called a quasi-experimental design. In this case, the researcher has access to intact classrooms. The L2 education researcher may then opt for selecting
a random class in which the experimental intervention (i.e. the novel methodology) is
implemented. Likewise, the traditional methodology would be applied in a normal
classroom but using the contents specified by the researcher.
Another type of research design is the correlational research design, in which the researcher does not manipulate any independent variable, and the assignment of participants does not follow any specific sampling procedure. The main objective of this type of design is to observe how two variables are correlated with each other. In L2 education research, data about
participants' anxiety levels and scores in the Speaking Cambridge examination may be
collected. To that end, a series of statistical analyses may be performed. Suppose that, when the values in anxiety levels and the test scores are correlated, there is a negative correlation between the two. This would help us gain insight into the relationship between anxiety and speaking skills in this specific examination.
Nevertheless, this type of research design cannot lead the researcher to conclude that
this occurs because there is causality (Urdan, 2017).
Another important research design that is framed within the wide array of quantitative
methods is the survey research design. As its name indicates, the data collection
procedure in this design involves delivering surveys to a population of people in order
to carefully describe trends. Survey research is commonplace in L2 education research,
and to illustrate this, let us imagine that we are interested in exploring how a group of
Primary students perceive a specific type of intervention in the L2 classroom. For this
purpose, a survey should be used to gather this perception data.
Finally, education research tends to rely on another type of combined research design
that includes both quantitative and qualitative approaches. Action research aims to
explore on-the-spot how a specific issue or problem may be addressed. Through
constant monitoring, and using a variety of methodological tools (e.g. questionnaire,
diary study, or observation), the data collected guide the researcher through the process of adjusting or modifying the intervention (Richards, 2003). In L2 education
research, action research designs are manifold and may involve several procedures in which not only one teacher-researcher but also others are likely to take part. In a sense, action research is a collaboration between the researcher and the teacher in the classroom. For instance, an L2 education researcher might be
interested in examining how the task-based language approach differs from the
communicative language teaching approach. In a sort of pre-research phase, both
approaches could be implemented in the L2 classroom whilst the teacher keeps a
journal to observe whether there are modifications or adjustments to be made in these
approaches. Simultaneously, the researcher may collect purely quantitative data from
tests. Putting qualitative information (from the journals and teacher's observations) in
parallel with the quantitative data ensures that the statistical differences between both
approaches may be supplemented with subjective data.
Chapter 2
DESCRIPTIVE STATISTICS
One of the very first steps in statistics corresponds to the description of raw data. When
quantitative data is collected, descriptive statistics are used to organize, summarize,
and describe the characteristics of the data obtained. Although they are described as
statistical procedures (Porte, 2010) since a number of calculations are involved, the
inferences that may be drawn are not as reliable as in the case of more advanced
statistical tests (see Chapters 3 and 4).
The most commonly used statistic in L2 education research is the mean. In essence, it
is the arithmetic average of the distribution of scores, and it allows researchers to
summarize the information obtained from this specific variable. Despite the valuable
piece of information that the mean constitutes, it does not inform about the particular
distribution of the scores as well as which scores are closer to the mean. Let us imagine
that we want to observe how different the scores in the Primary 5 and Primary 6 classes
are. Using JASP, we click on 'Descriptives', then add the between-group variable (i.e.
the grouping variable) into 'Split', and add the variable from which we are interested in
obtaining the mean into 'Variables' (see Figure 5 below).
Figure 5. Inserting variables in the 'Descriptives' module in JASP.
The output will show the number of participants in each group ('Valid'), and the
'Mean'. Below, the values for 'Minimum' and 'Maximum' are also provided. These
calculations are equally relevant since they allow us to identify the tendency in the
scores.
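Although this book relies on JASP's point-and-click interface, the same descriptives can be reproduced with a short script. The scores below are invented for illustration and do not correspond to the dataset shown in the figures.

```python
# Hypothetical 'Writing_Pre' scores for two intact classes.
scores = {
    "5P": [6.5, 7.0, 5.5, 8.0, 6.0, 7.5, 5.0, 6.5],
    "6P": [6.0, 8.5, 7.0, 5.5, 6.5, 9.0, 6.0, 7.5],
}

def descriptives(values):
    """Summary comparable to JASP's 'Descriptives' output."""
    return {
        "Valid": len(values),
        "Mean": round(sum(values) / len(values), 3),
        "Minimum": min(values),
        "Maximum": max(values),
    }

for group, values in scores.items():
    print(group, descriptives(values))
```

Splitting by group, as with JASP's 'Split' box, simply means computing the summary once per class.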
Before continuing with the following measures of central tendency, the importance of
confidence intervals must be highlighted. By definition, a confidence interval is a range of values, with a lower and an upper limit, that is expected to contain a specific population parameter. Confidence intervals are usually presented alongside the mean: M = 6.526, 95% CI [6.12, 6.93]. In this case, 95% indicates that, with 95% confidence, the mean of the population lies between 6.12 and 6.93.
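The computation behind such an interval can be sketched in a few lines. The scores and the critical t value (2.365 for df = 7, taken from a t table) are illustrative assumptions, not the data behind the figures above.

```python
import math

# Hypothetical sample of eight writing scores.
scores = [6.5, 7.0, 5.5, 8.0, 6.0, 7.5, 5.0, 6.5]

n = len(scores)
mean = sum(scores) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
se = sd / math.sqrt(n)

# Critical t value for 95% confidence with df = n - 1 = 7.
t_crit = 2.365
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"M = {mean:.3f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```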
Another measure of central tendency is the median. In mathematical terms, the median
is the middle value that is situated in the 50th percentile. To compute the median, scores
are arranged from the lowest to the highest. As can be observed in Figure 7, the medians
for both groups are the same, which indicates that scarce differences exist between both
Primary 5 and 6.
Figure 7. JASP output for 'Descriptives' with the mean and median scores.
Descriptive Statistics (Writing_Pre)

          5P      6P
Valid     19      21
Missing   0       0
Median    6.500   6.500
Mean      6.526   6.643
Minimum   4.500   4.000
Maximum   8.500   10.000
Finally, the mode is another measure of central tendency, though its use is not as widespread as that of the previous measures. The mode is the most frequently occurring score in a distribution of scores (Tavakoli, 2013). Because of the limited amount of information that it provides, the mode is not generally the preferred measure of central tendency in L2 education research. As can be observed in Figure 8, the mode points to the most frequently occurring scores in both Primary 5 and Primary 6. However, it is not as representative of the data as the mean and the median.
When dealing with categorical variables (nominal scales), the mode would be the most
appropriate measure of central tendency. In L2 education research, a common
categorical variable would be L2 proficiency as expressed with the Common European
Framework of Reference for Languages (CEFRL).
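For such a categorical variable, the mode is simply the most frequent category. A minimal sketch with invented CEFRL levels:

```python
from collections import Counter

# Hypothetical CEFRL levels for one Primary class.
levels = ["A2", "B1", "A2", "A1", "B1", "A2", "B1", "A2"]

# most_common(1) returns the single most frequent category and its count.
mode, frequency = Counter(levels).most_common(1)[0]
print(mode, frequency)
```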
Measures of Dispersion
Parallel to the rich information that measures of central tendency provide, measures of dispersion (or variability) complement what the mean or the median tell us. Measures of dispersion clarify the amount of variability
among the scores in the variables of interest within the sample. If there is a wide spread
of these scores, there will thus be a large dispersion in the data.
One of the most common measures of variability is the standard deviation. In essence, this measure reflects how much the individual scores in a distribution deviate, on average, from the mean of the distribution. The standard deviation is calculated as the square root of the variance (see below). Usually, the standard deviation accompanies the
mean when reporting results.
Figure 9. JASP output for the mean and standard deviation (Std. Deviation).
Descriptive Statistics (Writing_Pre)

                 5P      6P
Valid            19      21
Missing          0       0
Mean             6.526   6.643
Std. Deviation   1.307   1.468
As can be observed in Figure 9, a standard deviation that is small and close to 0 points to the absence of large variation within the data of the sample. The L2
researcher may observe the standard deviation to gain more insight into how different,
for instance, scores in a specific class are. In this case, there seems to be little
variability.
Another measure of dispersion is the variance, which is calculated "by summing the
squared deviations of the data values about the mean" (Tavakoli, 2013, p. 701), and
then dividing it by the number of participants minus one. Looking from another
perspective, the variance is the squared value of the standard deviation. If the variance
is large, the observations would be more scattered on average (Urdan, 2017). As may
be observed in Figure 10, the variance is larger in Primary 6.
Figure 10. JASP outcome with the mean and the variance.
Descriptive Statistics (Writing_Pre)

           5P      6P
Valid      19      21
Missing    0       0
Mean       6.526   6.643
Variance   1.708   2.154
Nevertheless, this measure of dispersion is not reported as commonly as the standard deviation. In general terms, it is part of the calculation of other
statistical analyses, such as ANOVA (Urdan, 2017).
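The relationship between the variance and the standard deviation can be verified directly; the four scores below are invented:

```python
import math

scores = [4.0, 6.0, 7.0, 9.0]  # hypothetical test scores

n = len(scores)
mean = sum(scores) / n
# Variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
# The standard deviation is the square root of the variance.
sd = math.sqrt(variance)
print(round(variance, 3), round(sd, 3))
```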
Finally, another common measure of dispersion used together with the median is the
interquartile range (IQR), which is the difference between the score marking the 75th
percentile (Q3) and the score marking the 25th percentile (Q1). The formula for this
measure of dispersion is IQR = Q3 – Q1.
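The formula can be sketched as follows. Note that statistical packages use slightly different quartile conventions, so results may differ marginally from JASP's output; the scores are invented.

```python
def quartiles(values):
    """Median-based quartiles (Tukey's hinges); other conventions differ slightly."""
    s = sorted(values)

    def median(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

    half = len(s) // 2
    lower = s[:half]                                   # scores below the median
    upper = s[half + 1:] if len(s) % 2 else s[half:]   # scores above it
    return median(lower), median(s), median(upper)

scores = [4.5, 5.0, 6.0, 6.5, 6.5, 7.0, 8.0, 8.5]  # hypothetical scores
q1, q2, q3 = quartiles(scores)
print("IQR =", q3 - q1)
```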
Figure 11. JASP outcome for the median and the IQR.
Descriptive Statistics (Writing_Pre)

          5P      6P
Valid     19      21
Missing   0       0
Median    6.500   6.500
IQR       2.250   1.500
The use of IQR is most appropriate for ordinal scaled test scores, and it is generally
more adequate with non-normal distributions (Larson-Hall, 2010; Tavakoli, 2013).
However, when the data is normally distributed, the IQR does not appear to be the most
appropriate option. For instance, the data in both Primary 5 and Primary 6 for the
variable 'Writing_Pre' (see Figure 11 above) are normally distributed. If we compare
the information provided by the IQR with that provided by the SD, we will observe that they do not lead to the same interpretation. That is why one has to be particularly attentive
to the dichotomy normal vs non-normal distribution.
Distributions
Before performing any statistical analyses on the data, one of the assumptions that has to be checked is whether or not the data follow a normal distribution. A normal distribution implies that the data rise smoothly from a small number of scores in the tails (i.e. the extremes) to a higher number of scores in the middle of the distribution. In a normal distribution, scores fall symmetrically around the mean, at distances above and below it that are measured in standard deviations. There are a number of graphical
representations – besides the measure of the standard deviation – that allow for
observing how distributed the data are. These are called distribution plots, and they
offer a more visual account of the data and the scores of a specific variable.
As can be observed in Figure 12, the data is normally distributed since, on both sides,
the data contains lower values or scores than in the middle of the distribution.
Conversely, in Figure 13 below, the distribution is not normal. The values are
accumulated on both sides, and these do not produce a bell-shaped curve.
However, the assumption of normality may also be checked with a statistical test such
as the Shapiro-Wilk test. This test must only be used when the sample size contains
fewer than 50 subjects. In JASP, this test of normality may be performed under the
'Descriptives' module, in the "Distribution" section.
Figure 14. 'Statistics' section in the 'Descriptives' module in JASP.
Once the checkbox of the Shapiro-Wilk test is marked, JASP gives the following result,
as in Figure 15:
The results of the Shapiro-Wilk test may be interpreted as follows. Firstly, the value provided is the W statistic, whose size determines whether the data are normally distributed or not. When this value is lower than about 0.9, it is very likely that the sample corresponding to this variable is not normally distributed. Then, the value under 'P-value of Shapiro-Wilk' corresponds to the p-value. Although this will be explained in depth in the forthcoming chapters, the p-value indicates the probability of obtaining a W statistic at least as extreme as the one observed if the data were in fact normally distributed. When the p-value is below 0.05, the test is statistically significant and, in the case of the Shapiro-Wilk test, the data would not be normally distributed. Should the p-value be above 0.05, the result would not be statistically significant. Relying on the results in Figure 15 above, the Shapiro-Wilk tests are not statistically significant (that is, the p-value is above 0.05), and thus the data are normally distributed.
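The same check can be run outside JASP, for instance with SciPy's `shapiro` function (assuming SciPy is installed; the scores are invented):

```python
from scipy import stats

# Hypothetical 'Writing_Pre' scores for a class of 20 pupils (n < 50).
scores = [6.5, 7.0, 5.5, 8.0, 6.0, 7.5, 5.0, 6.5, 6.0, 7.0,
          5.5, 6.5, 7.5, 6.0, 8.5, 4.5, 7.0, 6.5, 6.0, 7.0]

w, p = stats.shapiro(scores)
# p > 0.05: the hypothesis of normality is not rejected.
print(f"W = {w:.3f}, p = {p:.3f}")
```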
When the sample size is over 50 participants, normality must be checked with the
Kolmogorov-Smirnov test. In JASP, this normality test is found in the 'Distribution' >
'Normal' module. Once there, go to section 'Assess Fit' and click on 'Kolmogorov-
Smirnov' in 'Statistics', as in Figure 16 below.
Bear in mind that, before choosing the corresponding statistics test, you are supposed
to introduce the data about the descriptives – mean and variance – as in Figure 17.
Figure 17. JASP module 'Show Distribution' to introduce the mean and the variance.
When the data is introduced, and the statistical tests are carefully selected (see Figure
18), JASP will yield this result:
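Outside JASP, a comparable check can be sketched with SciPy's `kstest` (an assumption on my part; JASP's module likewise works from the mean and variance you introduce). Bear in mind that estimating the mean and standard deviation from the same sample makes the standard Kolmogorov-Smirnov p-value only approximate; Lilliefors' correction addresses this.

```python
from statistics import mean, stdev

from scipy import stats

# Hypothetical scores for a sample of more than 50 participants.
scores = [5.0 + 0.05 * i for i in range(60)]

# Compare the data against a normal distribution whose mean and SD
# are taken from the sample, as in JASP's 'Distribution' module.
d, p = stats.kstest(scores, "norm", args=(mean(scores), stdev(scores)))
print(f"D = {d:.3f}, p = {p:.3f}")
```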
While distribution plots present a visual description of the distribution of the data (see
Figures 19 and 20), box plots provide us with valuable information about outliers.
Figure 20. Boxplots for the variable 'Writing_Pre' in Primary 5 and 6.
Whatever the visual impression given by the plots, it is always advisable to test for normality using the appropriate statistical tests
(i.e. Kolmogorov-Smirnov or Shapiro-Wilk).
Another way of presenting data visually is the interval plot, which allows us to observe and compare the confidence intervals of the group means. The mean is represented by a dot (see Figure 23), while the lines represent the range of values that is likely to include the population mean. Elongated lines point to a wide spread within the groups.
The dot plot is another type of visual representation of the data, used to display the distribution of continuous variables. As can be seen in Figure 24 below, the values are arranged along the horizontal X-axis, and each dot represents a specific number of observations. The usefulness of dot plots lies in the visualization of the shape and spread of the data, and they serve a similar purpose to histograms. In the case of Figure 24 below, the dot plot shows that values were more concentrated around '5' and '7.5'.
Chapter 3
PARAMETRIC STATISTICS
Throughout this chapter, the tests used in parametric statistics will be presented. One of the most common families of parametric tests is the t-test. These tests compare two means in order to observe whether they are significantly different from each other. Additionally, the t stands for the t family of distributions, a probability distribution whose shape is highly dependent on the size of the sample.
Before performing this type of test, there are a number of assumptions that need to be
complied with: (1) the data is continuous; (2) the observations are independent; (3) the data is normally distributed; and (4) no outliers are present.
Back to the initial definition, a one sample t-test is used when the researcher wants to compare the mean of a specific sample to a population mean. Imagine, for instance, that you want to compare the scores of A2 level students from all Primary classes in the Cambridge: Key (A2, CEFRL) writing module to the scores of A2 level students in one particular class. To do that, I select 30 students from different classes, all of whom have an A2 command of English and have sat mock tests for the writing part of the Cambridge: Key exam. Subsequently, I calculate the scores obtained by these students and obtain a population mean. Then, I calculate the test scores of the particular class.
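The statistic itself is straightforward to compute by hand. The eight scores below are invented, so the numbers differ from the JASP output reported later; only the population mean of 7.8 is taken from the example.

```python
import math

# Hypothetical class scores; population mean taken as 7.8.
scores = [6.5, 7.0, 5.5, 8.0, 6.0, 7.5, 5.0, 6.5]
population_mean = 7.8

n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# t statistic: difference between means divided by the standard error.
t = (mean - population_mean) / (sd / math.sqrt(n))
# Cohen's d for a one sample test: the difference in SD units.
d = (mean - population_mean) / sd
print(f"t({n - 1}) = {t:.3f}, d = {d:.3f}")
```

In practice, the same result (plus the p-value) is obtained with `scipy.stats.ttest_1samp(scores, population_mean)`.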
To perform this statistical test in JASP, you have to click on 'T-Tests', and select
'Classical' > 'One Sample T-Test'. Subsequently, the variable of interest has to be added
to the 'Variables' box, and in 'Test' section, indicate the 'Test value', which would be
the population mean (see Figure 25 below):
Once this is selected, and the corresponding population mean is introduced, JASP
yields the following results:
Table 1. JASP outcome for the one sample T-test along with the descriptives.

One Sample T-Test

                  t        df    p         Cohen's d   95% CI for Cohen's d
Writing_Pre      -5.325    27    < .001    -1.006      [-1.457, -0.543]

Note. For the Student t-test, effect size is given by Cohen's d.
Note. For the Student t-test, the alternative hypothesis specifies that the mean is different from 7.8.
Note. Student's t-test.

Descriptives

               N     Mean    SD      SE
Writing_Pre    28    6.875   0.919   0.174
There are a number of aspects to be highlighted here: (1) p-value. As anticipated in the previous chapter, the p-value is core to hypothesis testing. That is, through this one sample T-test, we attempted to test the hypothesis that the sample mean was equal to the population mean. However, as the p-value is below 0.05, this hypothesis is rejected, indicating that the two means are different (p < .001); (2) Cohen's d. Although the relevance of effect sizes is going to be put under scrutiny in Chapter 7, an effect
size measures the magnitude of the effect. In this case, the comparison between the
sample mean and the population mean indicates that the difference is large (d= –1.006);
(3) df stands for degrees of freedom, that is, the number of independent units of information in the sample whose values are free to vary when the statistic (e.g. F or Z) is calculated.
Additionally, degrees of freedom are used to measure the amount of information which
is available to estimate population parameters; (4) SE stands for standard error, which
is a statistic that determines the degree to which the population parameter may differ
from the computed sample statistic. In essence, the standard error provides the
researcher with useful information about the degree of accuracy of the population
parameter. If the standard error is small, then, the sample statistic is better to estimate
the population parameter.
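The values in Table 1 can be reproduced directly from the reported descriptives. The plain-Python sketch below recomputes t, df, Cohen's d, and the standard error from the mean, SD, and sample size (tiny rounding differences from the JASP output are expected, since the descriptives themselves are rounded):

```python
import math

# Descriptives reported by JASP (Table 1) and the population mean
mean, sd, n, mu = 6.875, 0.919, 28, 7.8

se = sd / math.sqrt(n)        # standard error of the mean
t = (mean - mu) / se          # one-sample t statistic
df = n - 1                    # degrees of freedom
d = (mean - mu) / sd          # Cohen's d (effect size)

print(f"t({df}) = {t:.3f}, d = {d:.3f}, SE = {se:.3f}")
```

With the Table 1 descriptives this prints t(27) = -5.326, d = -1.007, and SE = 0.174, matching the JASP table up to rounding.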
Another important aspect to bear in mind for publication purposes is how to report
these results following citation guidelines, such as those of the American Psychological
Association (APA):
A one sample T-Test showed that participants having an A2 level in Primary 5 scored
significantly lower in the writing test than the overall A2 level students in the school (t
(27) = –5.325, p<.001).
Paired Samples T-Tests
The second of the parametric tests is called paired (or dependent) samples t-test. In
this type of test, the values in one sample are related to the values in the other sample.
In other words, the individuals in both samples are related or equal. In L2 education
research, this type of parametric test is commonly used to verify, for instance, the
efficiency of some educational interventions or the implementation of certain
methodologies. Paired Samples T-Tests are used, in general terms, when individuals in
the sample are measured at two points in time, that is, in experimental research designs
including a pre-test and a post-test.
To access this test in JASP (see Figure 26), we have to click on 'T-Tests' > 'Paired
Samples T-Test'.
Before selecting the variables, we have to make sure that we have filtered out the group
of interest in case two groups (e.g. Primary 5 or 6; Control or Experimental Group) are
present in the research design.
Let us imagine that we are interested in testing how effective writing instruction is in
a Primary classroom. To check the effect of this intervention, participants perform a
pre-test and a post-test. In Figure 27, you can see how these variables are introduced
in the corresponding areas of JASP.
Figure 27. JASP interface for Paired Samples T-Test.
As anticipated in the previous subsection, the effect size is equally marked in order to
observe the magnitude of the effect. Figure 28 displays the outcomes for the paired
samples t-test.
Figure 28. JASP outcome for paired samples t-test, descriptives, and normality test (Shapiro-Wilk
test).
Paired Samples T-Test
95% CI for Cohen's d
Measure 1 Measure 2 t df p Cohen's d Lower Upper
Writing_Pre - Writing_Post 3.139 18 0.006 0.720 0.206 1.219
Note. Student's t-test.
Descriptives
N Mean SD SE
Writing_Pre 19 6.526 1.307 0.300
Writing_Post 19 4.421 2.567 0.589
Similar to what was mentioned in the previous chapter, the sample meets the
assumptions: the data are normally distributed (see the Shapiro-Wilk test with a p-value
of 0.949). Regarding the statistical test per se, it compares how different the pre-test
(Writing_Pre) is from the post-test (Writing_Post). The p-value (p = 0.006) is below
the 0.05 benchmark, which allows us to confirm that there is a variation from pre-test
to post-test. The effect size (Cohen's d) suggests a medium effect (see Chapter 7 for
further information). If we take a look at the descriptives, we may conclude that writing
instruction did not contribute to improving the mean of the scores. It seems to have had
a counter-effect, since the score decreased (from 6.526 to 4.421).
It is important to bear in mind that Paired Samples T-Tests are used when (1) the
assumptions are met, and (2) when we are interested in comparing data from the same
sample at two different times.
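Under those two conditions, the paired statistic is simply a one-sample t-test on the difference scores. A minimal stdlib-Python sketch with hypothetical pre/post scores (not the chapter's data) makes the computation explicit:

```python
import math
import statistics

# Hypothetical pre- and post-test scores for the same 8 learners
pre = [7.0, 6.0, 8.0, 5.0, 6.0, 7.0, 8.0, 6.0]
post = [5.0, 4.0, 6.0, 4.0, 5.0, 5.0, 6.0, 5.0]

diffs = [a - b for a, b in zip(pre, post)]   # pre minus post, as in JASP
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)               # sample SD of the differences

t = mean_d / (sd_d / math.sqrt(n))           # paired-samples t statistic
d = mean_d / sd_d                            # Cohen's d for paired data
print(f"t({n - 1}) = {t:.3f}, d = {d:.3f}")
```

Because the pre-test scores here are systematically higher, t comes out positive, mirroring the pattern in the chapter's example.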
On average, participants scored less after the writing instruction. A paired samples t-
test showed this decrease to be significant (t (18)= 3.13, p = 0.006). Cohen's d suggests
that there is a medium effect (d = 0.72).
scores). In some sense, the dependent variable depends on the value of the independent
variable, which may cause the dependent variable to change.
For example, suppose that we propose an educational intervention in which one of the
groups (experimental group) tries a series of listening activities using a newly
conceived app. The other group – the control group – follows the traditional teaching
approach to listening. To verify whether the listening activities have had an effect on
the listening scores in the experimental group, an independent t-test is performed. As
seen in Figure 29 below, the variables of interest have to be introduced in the
corresponding section. Similarly, this type of test requires that the independent variable
is equally included (i.e. the grouping variable).
Figure 29. JASP interface to introduce the variables of interest and the grouping variable.
Once all the data are introduced, it is necessary to test the assumptions for these
statistical tests: (1) group independence; (2) normality of the dependent variables; and
(3) homogeneity of variance. While assumption (1) should be taken into consideration
before the data are processed, (2) and (3) are to be verified using statistical tests. In
JASP, both assumptions may be tested as in Figure 30:
Figure 30. JASP options for Assumption Checks in Independent Samples T-Tests.
When the statistical tests for these assumptions are performed, JASP provides the
following output:
Figure 31. Assumption Checks results (Independent Samples T-Test).
Test of Normality (Shapiro-Wilk)
W p
Listening_Post EG 0.933 0.197
CG 0.925 0.110
Note. Significant results suggest a deviation from normality.
Neither test yielded a statistically significant result, as may be observed in the p-
values. This indicates that the assumptions of normality and equality of variances
are successfully met. Such an outcome allows us to freely proceed with the
statistical T-Test.
In Figure 32 below, the results of the t-statistics are presented along with the
descriptives.
Figure 32. JASP outcome for independent samples t-test, descriptives, and descriptive plot.
Independent Samples T-Test
t df p Cohen's d
Listening_Post 1.028 38 0.311 0.325
As can be observed, the p-value indicates that there is no statistically significant
difference between groups (p = 0.311), and Cohen's d suggests this is a small effect
(below the 0.40 benchmark). The group descriptives also indicate that, despite the
apparent difference between the EG and CG, no statistical difference is found between
them.
Figure 33. JASP descriptive plot for independent samples t-test.
The descriptive plot as shown in Figure 33 above clearly depicts the tendency that the
EG obtained a higher score than the CG. However, this difference is not deemed
significant.
The descriptives show that the experimental group performed better than the control
group in the listening post-test. However, an independent t-test showed that this
difference was not significant (t(38) = 1.028, p = 0.31), and Cohen's d suggests this is
a small effect (d= 0.32).
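For readers who want to see the mechanics, the Student's t statistic reported above can be recomputed from raw scores using a pooled standard deviation. A stdlib-only Python sketch with two small hypothetical groups (not the chapter's data, and assuming equal variances, i.e. Student's rather than Welch's test):

```python
import math
import statistics

# Hypothetical listening post-test scores
eg = [7.0, 8.0, 6.5, 7.5, 8.5, 7.0]   # experimental group
cg = [6.0, 6.5, 7.0, 5.5, 6.0, 7.5]   # control group

n1, n2 = len(eg), len(cg)
m1, m2 = statistics.mean(eg), statistics.mean(cg)
v1, v2 = statistics.variance(eg), statistics.variance(cg)

# Pooled variance weights each group's variance by its degrees of freedom
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
d = (m1 - m2) / math.sqrt(sp2)        # Cohen's d with the pooled SD
print(f"t({n1 + n2 - 2}) = {t:.3f}, d = {d:.3f}")
```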
One-Way ANOVA
Another parametric test of interest is the one-way analysis of variance or One-Way
ANOVA. The main objective of this test is to compare two or more groups (i.e. the
levels of the grouping or independent variable) on one dependent variable to observe
whether the existing differences are statistically significant. With two groups, the
purposes of a one-way ANOVA are identical to those of an Independent Samples T-Test,
and the results yielded by both tests are expected to be equivalent. One may wonder
when it is more appropriate to use one or the other if their purposes are identical. In
fact, one-way ANOVA allows for comparing two or more groups in a single test, while
with T-Tests one is forced to perform several independent t-tests (e.g. three tests for
three groups), thus inflating the probability of finding differences that are not
meaningful.
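The arithmetic behind this inflation is simple: if each of k independent tests is run at α = .05, the chance of at least one false positive across the family of tests is 1 - (1 - α)^k. A quick check in Python:

```python
alpha = 0.05

# Probability of at least one false positive across k independent tests
for k in (1, 3, 10):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k} tests -> familywise Type I error = {familywise:.3f}")
```

With three pairwise t-tests the familywise error rate already exceeds 14%, which is why a single ANOVA followed by corrected post-hoc comparisons is preferred.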
1) The independent variable must be categorical.
2) The dependent variable must be continuous or scale-based.
3) The grouping variable should be independent.
4) The dependent variable should be normally distributed.
5) No outliers should be present.
6) Homogeneity of variance.
JASP offers an entire module for ANOVA tests, as can be observed in Figure 34 below:
Figure 36. JASP outcome for one-way ANOVA.
ANOVA - Listening_Post
Cases Sum of Squares df Mean Square F p η²
Group 7.941 1 7.941 1.056 0.311 0.027
Residuals 285.659 38 7.517
Note. Type III Sum of Squares
Descriptives - Listening_Post
Group Mean SD N
5P 6.368 2.499 19
6P 5.476 2.943 21
Unlike in T-Tests, where the t statistic is taken as reference, in ANOVAs the F
statistic is used. As can be observed in Figure 36, the p-value is above 0.05, which
means that there are no differences among the groups. The eta-squared
(η²) points to a small effect size (see Chapter 7 for further information on effect
sizes).
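The F statistic in this table is the ratio of between-group to within-group variance. A minimal, stdlib-only Python sketch with hypothetical scores (not the chapter's raw data) shows how F and η² are obtained:

```python
import statistics

# Hypothetical scores for three classes
groups = [
    [6.0, 7.0, 5.5, 6.5, 7.5],
    [8.0, 7.5, 8.5, 9.0, 8.0],
    [6.5, 6.0, 7.0, 5.5, 6.0],
]

all_scores = [x for g in groups for x in g]
grand_mean = statistics.mean(all_scores)
k, n = len(groups), len(all_scores)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

df_between, df_within = k - 1, n - k
f_stat = (ss_between / df_between) / (ss_within / df_within)
eta_sq = ss_between / (ss_between + ss_within)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, eta^2 = {eta_sq:.3f}")
```

Note that η² is simply the between-group sum of squares as a proportion of the total, which is why it is read as the share of variance explained by the grouping variable.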
To report one-way ANOVA following the APA guidelines, let us take a look at the
following example:
Aside from using one-way ANOVA with two groups, in which case the outcome is the
same as if we performed an independent T-Test, the situation is different when three
groups are used. This is because, besides the general F statistic, the differences
between the three groups have to be corroborated with what is called a post-hoc
comparison, using a series of statistical procedures, as will be further explained.
potential effect. The measure is a continuous variable in which the L2 contents taught
in the Montessori methodology were tested.
The steps to include the variables in the ANOVA module in JASP are the same; however,
it is necessary to adjust the Post Hoc Tests beforehand, as in Figure 37 below.
A post-hoc test is a comparison used after the data have been analyzed and examined
(Tavakoli, 2013). When this comparison is statistically significant, then the researcher
examines the combination of means of the different groups. In other words, a post-hoc
test is a follow-up statistical test performed after a comparison of three or more groups
has yielded a significant F statistic. In our previous example, our independent variable
had more than two levels: CLIL, NO-CLIL, and CONTROL GROUP. Thus, as the F
statistic is significant, the subsequent stage would be to compare all the groups, that
is: CLIL vs NO-CLIL, CLIL vs CONTROL GROUP, and NO-CLIL vs CONTROL
GROUP.
To do so, the researcher may opt for a wide variety of post-hoc tests: Tukey's test,
Bonferroni's test, or Holm's test. JASP offers several other options, but our focus will
be placed on those three post-hoc tests.
(1) Tukey's test (also, Tukey HSD test) is a post-hoc test used when pairwise
comparisons are of interest, and it assumes that the group sizes are equal. As indicated
by Tavakoli (2013), Tukey's test is more conservative, as it reduces the likelihood of a
Type I error. As a consequence, it has less statistical power, although it is robust to
nonnormality (Cohen et al., 2011; Mackey & Gass, 2005).
(2) Bonferroni's test adjusts for multiple comparisons by dividing the significance
level by the number of comparisons performed, which makes it a conservative
correction.
(3) Holm test (also Holm-Bonferroni test) is a sequential method, based upon
Bonferroni, which is less conservative. In a stepwise manner, the Holm test computes
the significance levels depending on the p-value-based rank of the hypotheses (Chen
et al., 2017).
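To make the difference concrete, both corrections can be sketched in a few lines of Python. The p-values below are illustrative, not taken from the chapter's data; Bonferroni multiplies every p-value by the number of tests, while Holm multiplies the i-th smallest by progressively smaller factors and enforces monotonicity:

```python
# Illustrative raw p-values from three pairwise comparisons
pvals = [0.001, 0.02, 0.04]
m = len(pvals)

# Bonferroni: multiply every p-value by the number of tests (capped at 1)
bonf = [min(p * m, 1.0) for p in pvals]

# Holm: sort ascending, multiply the i-th smallest by (m - rank),
# then keep each adjusted value at least as large as the previous one
order = sorted(range(m), key=lambda i: pvals[i])
holm = [0.0] * m
running_max = 0.0
for rank, i in enumerate(order):
    adj = min(pvals[i] * (m - rank), 1.0)
    running_max = max(running_max, adj)
    holm[i] = running_max

print("Bonferroni:", bonf)
print("Holm:      ", holm)
```

Here Bonferroni yields adjusted values of 0.003, 0.06 and 0.12, while Holm yields 0.003, 0.04 and 0.04 — the middle comparison survives at α = .05 under Holm but not under Bonferroni, which is exactly what "less conservative" means in practice.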
Figure 38. JASP outcome for one-way ANOVA with three groups.
ANOVA - Montessori
As can be observed, the ANOVA yielded a statistically significant result (p < .001)
with a large effect size (η² = 0.22).
Post Hoc Comparisons - Group
Mean Difference SE t ptukey pbonf pholm
CG CLIL -1.719 0.384 -4.471 < .001 < .001 < .001
CG NOCLIL -0.702 0.388 -1.808 0.174 0.224 0.075
Descriptives - Montessori
Group Mean SD N
CG 6.458 2.085 24
CLIL 8.177 0.847 26
NOCLIL 7.160 0.787 25
Figure 39 above shows the post-hoc comparisons between the three groups that have
been tested in the ANOVA parametric test. In order to illustrate what variations are
present between the post-hoc tests explained previously, we have opted for including
the three tests. As can be observed, there is a statistically significant difference between
CG and CLIL, and between CLIL and NO-CLIL. Likewise, it is important to observe
the Mean Difference column, since it provides us with an idea of how different the
means of the groups are.
Independent one-way ANOVA showed a significant effect of the Montessori method on
three different classes (F (2, 72) = 37.47, p < .001, η² = 0.22).
Post-hoc testing using Tukey's correction revealed that the CLIL group obtained
significantly greater scores than the CG (p < .001) and the NO-CLIL group (p < .05).
There were no significant differences between NO-CLIL and CG (p = .174).
Two-Way ANOVA
Another type of ANOVA parametric test is two-way ANOVA (or Factorial ANOVA),
in which the effects of two categorical independent variables, each with two or more
levels, are estimated on a single, continuous dependent variable (Tavakoli, 2013).
Two-way ANOVA also tests the interaction between these variables.
In the example that we propose, there are two independent variables: (1) group: CLIL
or CONTROL GROUP, and (2) Type of Learning: Cooperative Learning (CL) and
Project-Based Learning (PBL). In other words, both Factor 1 (Group) and Factor 2
(Type of Learning) have two levels. The two-way ANOVA tests two different
hypotheses (Goss-Sampson, 2020):
As with one-way ANOVA, two-way ANOVA also requires that a series of assumptions
are met:
In JASP, the same ANOVA module should be used to perform this two-way ANOVA.
In this case, both independent variables – categorical in nature – should be included in
'Fixed Factors' while only a dependent variable must be selected (see Figure 40 below).
Figure 40. ANOVA module in JASP with the two independent variables and the dependent
variable.
After introducing the data into JASP, the statistical analyses yield the results displayed
in Figure 41.
Descriptives - App
Group TypeLearning Mean SD N
CG CL 6.538 1.761 13
CG PBL 5.364 2.501 11
CLIL CL 9.731 0.388 13
CLIL PBL 9.308 0.630 13
As can be observed, the ANOVA table shows that there are significant effects for
Group (p < .001) with a large effect size. In this case, there was not a significant
difference for Type of Learning (p = 0.07) or the interaction between Group and Type
of Learning (p = 0.38). This suggests that differences were not considerable as regards
the type of learning, but rather as a result of the difference in Factor 1, i.e. the group.
Even though the interaction (Group * TypeLearning) is not significant according to the
ANOVA, post-hoc tests may still reveal significant pairwise differences between the
individual cells. In Figure 42, these post-hoc comparisons are presented.
Figure 42. JASP outcome for Post Hoc tests in two-way ANOVA.
Post Hoc Comparisons - Group ✻ TypeLearning
Mean Difference SE t ptukey pbonf pholm
CG CL CLIL CL -3.192 0.596 -5.353 < .001 < .001 < .001
CG PBL 1.175 0.623 1.886 0.248 0.394 0.131
CLIL PBL -2.769 0.596 -4.643 < .001 < .001 < .001
CLIL CL CG PBL 4.367 0.623 7.011 < .001 < .001 < .001
CLIL PBL 0.423 0.596 0.709 0.893 1.000 0.482
CG PBL CLIL PBL -3.944 0.623 -6.332 < .001 < .001 < .001
Note. P-value adjusted for comparing a family of 4
A two-way ANOVA was used to examine whether the effect of using an App varied
depending on the type of learning used in a CLIL group or a traditional group. There
were significant main effects for group (F (1, 46) = 68.476, p < .001, η² = .55).
Tukey's post-hoc correction showed that scores for cooperative learning in the CLIL
group were significantly higher than those for cooperative learning and project-based
learning in the control group (t = –3.192, p < .001 and t = 4.367, p < .001, respectively).
Like the rest of the ANOVA tests, it uses the F-statistic. If this statistic is large, it may
be interpreted as the independent variable having a significant effect on the dependent
variable.
To illustrate one-way repeated measures ANOVA, suppose that we are interested in
observing how a CLIL class performs in a language test when being taught under three
different conditions: Montessori methodology (Montessori), Communicative
Language Teaching (CLT), and a language App (App). To test how these conditions
differ from each other, and whether the scores obtained in the tests are statistically
significant, open the ANOVA module in JASP and select 'Repeated Measures
ANOVA' (see Figure 43).
Figure 44. Interface to introduce the data for Repeated Measures ANOVA.
when the epsilon is <0.75, Greenhouse-Geisser correction is preferred (Goss-Sampson,
2020).
As can be observed in Figure 45, the test of sphericity is statistically significant (p <
.001). Hence, the Greenhouse-Geisser correction has to be applied to the ANOVA test.
In Figure 46 below, two options are provided: 'None' (no correction applied) and
'Greenhouse-Geisser'. The F statistic is large, and the p-value of the
repeated-measures ANOVA test is statistically significant (p < .001). Hence, we may
proceed to check which combination of conditions is significant through a post-hoc
comparison test.
Descriptives
RM Factor 1 Mean SD N
Montessori 8.177 0.847 26
CLT 9.231 0.620 26
App 9.519 0.556 26
In Figure 47 below, the post-hoc comparison tests (using the Bonferroni adjustment
and the Holm-Bonferroni test) are presented. The figure clearly shows that there is a
statistically significant difference between Montessori and CLT (p < .001), in which
the CLT condition had higher scores. Likewise, the differences between Montessori
and the App condition are statistically significant (p < .001), once again with the latter
condition holding higher values. The difference between CLT and App is also
statistically significant, although only marginally so (p = .04), and the means are not
considerably different (CLT = 9.23 and App = 9.51).
Post-hoc testing using the Bonferroni correction revealed that in the Montessori
condition, participants scored lower than in the CLT condition (mean difference =
–1.054, p < .001) and the App condition (mean difference = –1.342, p < .001).
ANCOVA
The analysis of covariance or ANCOVA is a statistical procedure that allows us to
observe group differences on a continuous dependent variable, with one or more
continuous independent variables – covariates – which are controlled for. In summary,
ANCOVA allows for one or more categorical independent variables, one continuous
dependent variable, and one or more covariates. Therefore, the covariate is, in essence,
another independent variable (Tavakoli, 2013).
To illustrate how ANCOVA may be used, the following research design will be
presented as an example:
Once this information is introduced, several features must be marked in JASP. Firstly,
in 'Display', it is important to mark both eta-squared and omega-squared as estimates
of effect size. Marking 'descriptive statistics' will equally allow us to observe the extent
of the differences. Subsequently, in 'Post Hoc Tests', in 'Type', 'Effect size' should be
marked to observe the magnitude of the effect of the post-hoc comparison tests. As for
the 'Correction', mark Tukey, Bonferroni, and Holm.
As can be observed in Figure 50, the output for ANCOVA indicates that there are
statistically significant differences (p < .001 and p = .002, respectively) for both years
of learning the L2 and group, that is, the covariate and the independent variable.
Descriptives - Post_Test
Group Mean SD N
CLIL 9.231 0.620 26
NOCLIL 7.440 1.253 25
Another way to observe these results is by creating a descriptive plot that includes
confidence intervals. In Figure 52, a JASP-generated descriptive plot based on the
sample study presented for ANCOVA is shown. As seen, the CLIL group obtained
higher marks than the NOCLIL group which, in turn, had been learning the L2 for
fewer years than the CLIL group.
The covariate, years learning the L2, was significantly related to the group variable,
F (1, 48) = 22.443, ω² = 0.260.
Post hoc testing using Tukey's correction revealed that the CLIL group obtained higher
scores in comparison to the NO-CLIL group (p = .002).
MANOVA
The MANOVA parametric test, which stands for multivariate analysis of variance,
refers to a situation where several continuous dependent variables are included.
Researchers using MANOVA are interested in examining whether the combinations of
continuous dependent variables are significantly different from one group to the other.
In other words, MANOVA serves to identify which groups differ from each other and
to what extent, as well as the dependent variables in which these differences exist
(Tavakoli, 2013).
For instance, a researcher may want to examine whether students in a CLIL and a
traditional group (control), that is, the independent variable with two levels, vary
depending on the education method used: communicative-language teaching
(dependent variable 1) and an app-based teaching approach (dependent variable 2).
The assumptions that must be taken into consideration when deciding on using
MANOVA are as follows:
These tests of assumptions may be checked in JASP, after going to ANOVA >
MANOVA, as can be seen in Figure 53.
The output of these assumptions tests in JASP provides the following information:
Once these assumptions have been checked, it is time to select the appropriate test for
our MANOVA. JASP provides four different options: Pillai, Wilks, Hotelling-Lawley,
and Roy (see Figure 55).
Figure 55. Selection of tests for MANOVA.
There are several explanations as to what each test provides the researcher with, and
which one would be more appropriate or at least, more common in L2 education
research.
Pillai's test will be used to illustrate an example. In this case, suppose that, as an L2
education researcher, you are interested in observing how two different groups (CLIL
and a control group) react to two different methodologies: CLT and App (the dependent
variables). After performing the analysis in JASP, Figure 56 displays the results:
As can be observed, there is a statistically significant difference between both CLT and
App, F (1, 48) = 31.69, p < .001; TracePillai = 0.574.
Subsequently, if we decide to follow up with some stepwise comparisons, it is possible
to display the ANOVA tables (in JASP, check the 'ANOVA tables' box in 'Additional
Options'). This allows us to observe in which dependent variables differences are
found, and whether these are statistically significant or not.
Figure 57. JASP 'ANOVAtables' for the dependent variables of the MANOVA analysis.
ANOVA: CLT
Cases Sum of Squares df Mean Square F p
(Intercept) 3136.320 1 3136.320 1779.149 < .001
Group 93.065 1 93.065 52.793 < .001
Residuals 84.615 48 1.763
ANOVA: App
Cases Sum of Squares df Mean Square F p
(Intercept) 3065.445 1 3065.445 1271.305 < .001
Group 154.565 1 154.565 64.101 < .001
Residuals 115.740 48 2.411
As can be observed in Figure 57 above, the p-value is below 0.05 in both cases, and
thus statistically significant. This will lead us to review the descriptives in order to
observe what the differences are:
Figure 58. Descriptive statistics for both dependent variables (CLT and App).
Descriptive Statistics
CLT App
CG CLIL CG CLIL
Valid 24 26 24 26
Missing 0 0 0 0
Mean 6.500 9.231 6.000 9.519
Std. Deviation 1.806 0.620 2.167 0.556
Minimum 3.000 8.000 2.000 8.500
Maximum 9.500 10.000 9.000 10.000
Parametric correlations
One of the most basic tests used for association among variables is the correlation
coefficient (Urdan, 2017). Correlations are statistical indices that point to the strength
and direction of the relationship between two variables. This relationship determines
how strongly a pair of variables is associated, but also, whether this association is
considered statistically significant or not. Correlation coefficients are calculated
from quantifiable variables holding continuous or ordinal data. These data are based
upon observation, that is, they summarize qualities of a certain variable. When a
researcher is interested in using correlation coefficients, the first characteristic to pay
attention to is the direction of the correlation, that is, whether it is positive (+) or
negative (–). A positive correlation occurs when variable Y increases as variable X
increases, implying a direct linear relationship in which both variables move in the
same direction. A negative correlation, on the other hand, occurs when variable Y
decreases as variable X increases, that is, a change in one variable is associated with a
change in the other variable in the opposite direction.
There are several correlation coefficients, and one of the most commonly used
parametric tests for correlations is Pearson's correlation coefficient. It is also
represented as r. The coefficient implies that, the closer the r-statistic is to +1 or –1,
the more the two variables are related. As with many other parametric tests presented
in this chapter, Pearson's r can only be computed when some assumptions are met:
(1) the data must be normally distributed, and (2) the relationship between the
variables must be linear.
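Pearson's r itself is just the covariance of the two variables scaled by their standard deviations. A stdlib-Python sketch with hypothetical paired observations (not the chapter's dataset) shows the computation:

```python
import math
import statistics

# Hypothetical paired observations (e.g. years of L2 study vs. test score)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.5, 6.0, 6.8, 7.0, 7.9, 8.4]

mx, my = statistics.mean(x), statistics.mean(y)
# Sum of cross-products over the geometric mean of the sums of squares
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
print(f"r = {r:.3f}")   # close to +1: a strong positive correlation
```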
Figure 59. Options to select in Correlation in JASP.
Suppose that we are interested in observing whether the number of years spent
learning English may be correlated with Anxiety levels and the test scores in an L2
test during a two-week intervention with the Montessori methodology.
When the variables are introduced in the 'Variables' textbox, a series of tables of results
appear. In Figure 60, correlations are presented pairwise while in Figure 61 these
appear in a correlation matrix. As can be observed, although the correlation coefficients
are reported to be statistically significant for Years L2 and Montessori as well as for
Anxiety Levels and Montessori (p = .007 and p = .025, respectively), the coefficients
are still weak (r = .310 and r = .259). This indicates that more years learning the
L2 is positively associated with the scores obtained under the Montessori methodology,
and that higher Anxiety Levels are likewise weakly associated with these scores.
Proportions, Chi-Squared Test, and Contingency Tables
In L2 education research, another way of gathering data and conducting research
studies is through surveys, which are generally used to obtain valuable information
about students' perceptions. The statistical analysis of these data tends to be subject to
our research interests, but the most common procedures include proportions and
contingency tables together with chi-squared tests.
Proportions allow us to observe how the responses to each survey item are distributed.
JASP offers the possibility to observe these data in 'Descriptives', selecting the
'Frequency tables' option:
It is important to remember that the variables (in this case, the survey items or
questions) have to be introduced into the boxes. Subsequently, JASP offers the
following output for proportions:
Selecting frequency tables in the Descriptives module only provides us with a
descriptive view of the responses given by the students for each question. Nevertheless,
in order to observe whether there are differences in a survey item across a grouping
variable, the use of the chi-squared test (χ²) is necessary. The chi-squared test is a
nonparametric test, and a test of significance, since it tests a hypothesis. It is used to
compare actual or observed frequencies with expected frequencies to discern whether
they differ in statistical terms. As a note of caution, the chi-squared test can only be
used when there is a between-group comparison, that is, when an independent or
grouping variable is present in the research design. To perform this statistical test in
JASP, go to the Frequencies module, and then select 'Contingency Tables'.
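Behind JASP's output, the χ² statistic compares each observed cell count with the count expected under independence (row total × column total / grand total). The stdlib sketch below recomputes the statistic from the counts shown in Figure 67:

```python
# Observed counts from Figure 67 (rows: response options; columns: year groups)
observed = [
    [2, 0],    # Very little
    [6, 4],    # Little
    [22, 10],  # Normal
    [9, 11],   # A lot
    [4, 10],   # Quite a lot
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi2({df}) = {chi2:.3f}")   # reproduces the JASP value of 8.945
```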
Figure 66. Cells selection of percentages.
Thus, JASP will generate the output shown in Figure 67. As seen, the
percentage for each option for this survey question is displayed, as well as the
information per group. This information is valuable for descriptive procedures in L2
education research, especially when we aim to discern tendencies.
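The same descriptive view can be reproduced with a simple frequency count. A sketch using `collections.Counter` on hypothetical survey responses (not the chapter's data):

```python
from collections import Counter

# Hypothetical responses to one survey item
responses = ["Normal", "A lot", "Normal", "Little", "Quite a lot",
             "Normal", "A lot", "Very little", "Normal", "Little"]

counts = Counter(responses)
total = len(responses)

# Frequency table: option, count, and percentage of all responses
for option, count in counts.most_common():
    print(f"{option:12s} {count:3d} {100 * count / total:6.1f} %")
```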
Figure 67. Contingency table for a survey question divided by the grouping variable.
Contingency Tables - P_1 by Group (Count / % within column)
Very Little: First Year 2 (4.651 %), Second Year 0 (0.000 %), Total 2 (2.564 %)
Little: First Year 6 (13.953 %), Second Year 4 (11.429 %), Total 10 (12.821 %)
Normal: First Year 22 (51.163 %), Second Year 10 (28.571 %), Total 32 (41.026 %)
A lot: First Year 9 (20.930 %), Second Year 11 (31.429 %), Total 20 (25.641 %)
Quite a lot: First Year 4 (9.302 %), Second Year 10 (28.571 %), Total 14 (17.949 %)
Total: First Year 43 (100.000 %), Second Year 35 (100.000 %), Total 78 (100.000 %)
The chi-squared test performed on the results of survey question 1 is shown in Figure
68 below. As can be seen, the chi-squared significance is above 0.05 (p = 0.062), which
indicates that there are no significant differences.
Figure 68. Chi-squared results for contingency table 1.
Chi-Squared Tests
Value df p
Χ² 8.945 4 0.062
N 78
The Χ² statistic (Χ² (4) = 8.945, p = .06) suggests that there is not a significant
association between the students' answers to question one and the year of the degree
they are pursuing.
Chapter 4
The Wilcoxon signed-ranks test proves to be more powerful (i.e. less prone to Type II
error) than other tests (e.g. the Sign Test), since it considers both the magnitude of the
scores and their direction.
In JASP, the Wilcoxon signed-ranks test has to be selected from the 'T-Tests' > 'Paired
Samples T-Test' module, as described in Figures 69 and 70 below. As observed, this is
a paired samples test, which requires a combination of two variables, such as a pre-test
and a post-test.
In order to perform this test, as seen in Figure 70, a tick has to be placed next to
'Wilcoxon signed-rank'. For our example, and as a general recommendation, it is
advisable to tick 'Effect size' and 'Descriptives', given the rich information they
provide about the data to be analyzed. In our example, let us suppose that we want to
explore whether a listening intervention increases the scores in a vocabulary test. To
do so, a pre-test/post-test research design is devised.
Once everything is set, JASP gives the results of the analysis as in Figure 71. The p-
value indicates that the difference is highly significant (p < .001). The effect size,
calculated with the rank-biserial correlation (rB), is interpreted in the same way as
parametric correlations. Hence, rB = –0.966 is a large effect size.
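The statistics behind this output can be sketched by hand: rank the absolute pre/post differences from smallest to largest, then sum the ranks of the positive and negative differences separately. A stdlib-Python illustration with hypothetical data (chosen so that the differences have no ties, which keeps the ranking simple):

```python
# Hypothetical pre/post scores for 8 learners (differences have no ties)
pre = [6.1, 7.3, 5.2, 6.8, 7.9, 5.5, 6.4, 7.1]
post = [6.6, 8.4, 5.0, 7.7, 8.2, 6.7, 7.2, 7.5]

diffs = [b - a for a, b in zip(pre, post)]
# Rank the absolute differences from smallest (rank 1) to largest
ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = {i: rank + 1 for rank, i in enumerate(ranked)}

w_pos = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
w_neg = sum(ranks[i] for i, d in enumerate(diffs) if d < 0)

# Rank-biserial correlation: difference of rank sums over the total rank sum
total = len(diffs) * (len(diffs) + 1) / 2
r_b = (w_pos - w_neg) / total
print(f"W+ = {w_pos}, W- = {w_neg}, rB = {r_b:.3f}")
```

Since nearly all differences here are positive, rB comes out close to +1, the mirror image of the large negative effect in Figure 71.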
Figure 71. JASP output for Wilcoxon signed-ranks test.
Paired Samples T-Test
95% CI for Rank-Biserial Correlation
Measure 1 Measure 2 W p Rank-Biserial Correlation Lower Upper
Pre_Test - Post_Test 5.500 < .001 -0.966 -0.986 -0.919
Note. Wilcoxon signed-rank test.
Descriptives
N Mean SD SE
Pre_Test 26 8.177 0.847 0.166
Post_Test 26 9.231 0.620 0.122
The use of a Mann-Whitney U Test is justified when normality is violated (i.e. when
the data are not normally distributed) and the variances are not homogeneous. This
nonparametric test is also preferred when the sample size is smaller than 30
participants.
Similarly, there are some statistical considerations when using the Mann-Whitney U
Test (Tavakoli, 2013): if the sample has fewer than 20 participants, the smaller U value
is taken into account to calculate statistical significance. When this figure is higher,
that is, over 20 participants, the U value is converted into a Z value.
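The U statistic itself comes from the ranks of the pooled scores: rank all scores together, sum the ranks in one group, and apply U = R1 - n1(n1 + 1)/2. A stdlib-Python sketch with small hypothetical groups (chosen without ties, so no tie-correction is needed):

```python
# Hypothetical post-test scores for two independent groups (no ties)
group_a = [8.5, 9.0, 7.5, 9.5, 8.0]
group_b = [6.0, 6.5, 7.0, 5.5]

pooled = sorted(group_a + group_b)
ranks = {score: i + 1 for i, score in enumerate(pooled)}

n1, n2 = len(group_a), len(group_b)
r1 = sum(ranks[s] for s in group_a)        # rank sum of group A
u1 = r1 - n1 * (n1 + 1) / 2
u2 = n1 * n2 - u1                          # the two U values sum to n1 * n2
u = min(u1, u2)                            # smaller U is used for significance
print(f"U = {u}")
```

In this extreme example every score in group A exceeds every score in group B, so the smaller U is 0, the minimum possible value.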
Figure 72. JASP interface to select the tests and additional statistics.
In Figure 72 above, the JASP interface is presented. To select the Mann-Whitney U
Test, go to 'T-Tests' > 'Independent Samples T-Test'. Under 'Tests', select 'Mann-
Whitney'. It is also highly advisable to select 'Effect size'; 'Descriptives' and
'Descriptive plots' may also help in visually observing the extent of the differences
between both groups.
The example to illustrate the use of the Mann-Whitney U Test is as follows: let us
suppose that we are interested in discerning whether there are notable differences
between a CLIL group and a control group when implementing a listening program
centered on learning new vocabulary. To verify its effectiveness, a pre-test and a
post-test are conducted. In Figure 73, the results of the Mann-Whitney U Test are
presented.
Group Descriptives
Group N Mean SD SE
Post_Test CG 24 6.500 1.806 0.369
CLIL 26 9.231 0.620 0.122
As can be observed, the Mann-Whitney U test yielded a statistically significant result
(p < .001) with a large effect size (rB = –0.864). In this case, the descriptives reveal that
the CLIL group scored better than the control group.
Figure 74 represents a raincloud plot in which the data is visually represented. This is
a graphical manner to present the data and comment on the implications that it
reveals. In this case, the distribution of scores in the CLIL group (i.e. the standard
deviation) is more concentrated than in the control group.
A Mann-Whitney U test showed that the CLIL group scored higher in the vocabulary
post-test (M = 9.23) than the control group (M = 6.50), U = 42.50, p < .001,
rB = –0.864.
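The rank-biserial correlation reported above can likewise be reproduced from the U statistic. A minimal Python sketch (the function name is ours; the sign depends on the direction of the comparison, which is why JASP prints –0.864):

```python
def rank_biserial_from_u(u, n1, n2):
    """Rank-biserial correlation magnitude from the Mann-Whitney U statistic.

    rB = 1 - 2U / (n1 * n2); the sign depends on which group's U is
    reported and on the direction of the difference.
    """
    return 1 - 2 * u / (n1 * n2)

# Values from the example: U = 42.50, n1 = 24 (CG), n2 = 26 (CLIL)
print(rank_biserial_from_u(42.5, 24, 26))  # 0.864 in magnitude
```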
Friedman's Tests
Another very commonly used nonparametric test is Friedman's test (also known as
Friedman's two-way analysis of variance), a rank-based method for comparing three
or more related measurements taken from the same sample. As with one-way repeated
measures ANOVA, Friedman's test serves to compare three or more related samples
with ordinal scores. Likewise, Friedman's test should be used when the assumptions
of normality or homogeneity of variance are not satisfied, and thus the parametric
ANOVA alternative cannot be used. The calculation of Friedman's test depends on the
ranks within each case (Tavakoli, 2013).
In JASP, this test can be run by going to ANOVA > Repeated Measures ANOVA. The
data are introduced following the same procedure as for the parametric test.
Nevertheless, to run the nonparametric Friedman's test, it is necessary to expand the
Nonparametrics tab and move the factor – which may be named or labeled according
to your research needs – to the RM Factor box. Should we be interested in observing
the differences between the groups, that is, the post-hoc comparison tests, the
'Conover's post hoc test' box must be ticked (see Figure 75).
To illustrate this test, let us take an example similar to the one used for repeated
measures ANOVA. In our study, we want to observe the effect of three different L2
teaching methodologies (i.e. Montessori, CLT and App-based) on an intact Secondary
class. To do so, data are gathered at three different moments: first, students are taught
with the Montessori methodology and are then given a test to check the contents learnt.
This procedure is repeated for the CLT and App-based methodologies.
To verify these differences, a Friedman's test is performed (see Figure 76). As can be
observed, the test yielded a statistically significant result (p = 0.014). This indicates
that there are differences between the three methodologies.
In the case of repeated measures ANOVA, post-hoc comparisons were done with other
correction tests. For Friedman's test, Conover's test must be used. As indicated by
Tavakoli (2013), it is a nonparametric test of the equality of variances of two
populations with different medians. In Figure 77, we obtained Conover's post-hoc
comparisons for the three different methodologies. There are statistically significant
differences between Montessori and App (p = .031) and between CLT and App (p = .008).
The type of methodology has a significant effect on test scores χ2 (2) = 8.575, p = .014.
Pairwise comparisons showed that scores were significantly different between
Montessori and App (p = .031) and between CLT and App (p = .008).
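The rank-based logic of Friedman's test can be illustrated with a short Python sketch (ours, with hypothetical scores; JASP's implementation differs in details). It computes the statistic χ² = 12/(nk(k+1)) · ΣR_j² − 3n(k+1) from within-participant ranks:

```python
def friedman_statistic(scores):
    """Friedman chi-squared from a list of rows (one row per participant,
    one column per condition). Tied values receive average ranks."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # group tied values and give them their average (1-based) rank
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg_rank = (i + j + 2) / 2
            for t in range(i, j + 1):
                ranks[order[t]] = avg_rank
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))

# Hypothetical scores for 4 students under Montessori, CLT and App-based
print(friedman_statistic([[6, 7, 9], [5, 6, 8], [7, 7, 9], [6, 8, 9]]))
```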
Kruskal-Wallis Tests
The Kruskal-Wallis test is the nonparametric alternative to the one-way ANOVA
(that is, the independent samples ANOVA). It is used with ordinal data in hypothesis-
testing situations in research designs involving three or more independent groups of
participants. The Kruskal-Wallis test aims to determine whether the scores of three or
more unrelated groups differ significantly. Under the same premise as the rest of the
nonparametric tests, Kruskal-Wallis calculates the statistic through a rank-based
procedure. In sum, as Tavakoli (2013) states, the Kruskal-Wallis test is an extension of
the Mann-Whitney U test. Unlike parametric tests, Kruskal-Wallis does not require
any statistical assumptions beyond an ordinal scaling of the dependent variable.
To run a Kruskal-Wallis test in JASP, go to ANOVA > ANOVA. Then, in the analysis
window, the data are introduced as if it were a standard parametric ANOVA. To
activate the nonparametric test, the Nonparametrics tab has to be opened, and the
independent variable must be moved to the box on the right (see Figure 78):
To illustrate the use of the Kruskal-Wallis test, suppose that we are conducting a study
in which the Montessori methodology integrated into L2 learning is going to be
implemented in three different groups: CLIL, NO-CLIL, and a control group (CG). In
order to observe the potential differences between them, a Kruskal-Wallis test would
be the statistical response.
In Figure 79, the results of the test indicate that there are statistically significant
differences (p < .001) between the groups. However, the Kruskal-Wallis test does not
provide the post-hoc comparison tests by itself. They have to be requested, as was done
in the case of Friedman's test (in summary: open the 'Post Hoc Tests' tab, select Dunn's
post-hoc type, and also select the Bonferroni and Holm corrections).
Descriptives - Montessori
Group Mean SD N
CG 6.458 2.085 24
CLIL 8.177 0.847 26
NOCLIL 7.160 0.787 25
Thus, a series of post-hoc comparison tests were run, and as shown in Figure 80 below,
the results were statistically significant for CG vs CLIL (p < .001) and CLIL vs NO-
CLIL (p = .002).
Figure 80. Post hoc comparison tests for Kruskal-Wallis tests (Dunn Type).
Dunn's Post Hoc Comparisons - Group
Comparison z Wi Wj p pbonf pholm
CG - CLIL -3.548 29.833 51.462 < .001 *** < .001 *** < .001 ***
CG - NOCLIL -0.326 29.833 31.840 0.372 1.000 0.372
CLIL - NOCLIL 3.253 51.462 31.840 < .001 *** 0.002 ** 0.001 **
** p < .01, *** p < .001
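The Kruskal-Wallis H statistic follows the same rank-based recipe, but over ranks pooled across the independent groups. A minimal Python sketch with hypothetical, tie-free data (no tie correction applied):

```python
def kruskal_h(groups):
    """Kruskal-Wallis H from pooled ranks (no tie correction).

    groups: list of lists of scores, one list per independent group.
    """
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (value, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    return (12.0 / (n_total * (n_total + 1))
            * sum(rs * rs / len(g) for rs, g in zip(rank_sums, groups))
            - 3.0 * (n_total + 1))

# Hypothetical scores for CG, CLIL and NOCLIL (chosen without ties)
print(kruskal_h([[4, 5, 6], [8, 9, 10], [6.5, 7, 7.5]]))
```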
Nonparametric correlations
As has been shown throughout this chapter, correlations have their nonparametric
alternatives as well. When the data violate the assumptions – that is, normality or
homogeneity of variance – two different nonparametric correlation coefficients can be
used: Spearman's rho and Kendall's tau.
Kendall's tau is a test of rank correlation used with two ordinal variables. This
correlation coefficient is more appropriate when there are ties in the rankings. There
are three forms of this measure (Cramer & Howitt, 2004; Larson-Hall, 2010):
1) Kendall's rank correlation tau A, which is used when there are no ties or tied
ranks.
2) Kendall's rank correlation tau B, which is used when there are ties or tied ranks.
3) Kendall's rank correlation tau C (also Kendall-Stuart Tau-c) is used when "the
table of ranks is rectangular rather than square as the value of tau c can come
closer to –1 or 1" (Tavakoli, 2013, p. 311).
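For illustration, both coefficients can be computed by hand in the no-ties case (a Python sketch; the function names are ours, and with ties, average ranks and the tau-b correction would be needed):

```python
def spearman_rho(x, y):
    """Spearman's rho via the classic formula 1 - 6*sum(d^2) / (n(n^2 - 1)).
    Assumes no tied values."""
    n = len(x)
    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / all pairs; no-ties form."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Partially scrambled hypothetical ranks give a strong positive rho
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```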
In JASP, the calculation of these nonparametric correlations is done in the same manner
as for parametric correlations. The only difference lies in the selection of these
nonparametric alternatives, as seen in Figure 81 below.
As can be seen in Figure 82 below, the correlation between the anxiety level and the
Montessori methodology is moderate (rs = –0.488, p = .011). In terms of correlations
between methodologies, a moderate correlation exists between Montessori and CLT
(rs = 0.426, p = .03), and a high correlation between CLT and App (rs = 0.918,
p < .001).
Aside from the correlation table generated by JASP, the software also offers a heatmap
where these correlations may be visually observed (see Figure 83 below). In the case
of our example, it may be concluded that anxiety levels do not have a strong
association with the different teaching methodologies.
Chapter 5
Linear Regression
Regression is a statistical technique that allows researchers to "examine the nature and
strength of the relations between variables, the relative predictive power of several
independent variables on a dependent variable" (Urdan, 2017, p. 183). There are two
types of regressions: simple linear regression (or bivariate regression) and multiple
regression.
Simple linear regression is a statistical test that predicts the value of a dependent
variable (also called the criterion) from one independent variable (also called the
predictor). The independent variable must be measured on an interval or ratio scale. A
simple linear regression entails that the researcher only examines one predictor
variable and one criterion variable. The purpose of the regression analysis, as noted by
Urdan (2017), is to make predictions about the values of the dependent variable based
on certain values of the predictor variable. In L2 education research, an example of a
research design whose research questions may be answered with a simple linear
regression could be as follows. We may wish to see the effect of the years spent
studying an L2 (in this case, English) on the scores of a test in a classroom following
the Communicative Language Teaching approach, in order to check whether test
scores depend on the years studying English. Hence, 'years studying English' is the
predictor variable while 'CLT' (test score) is the criterion variable. In JASP, there is a
'Regression' module, as shown in Figure 84 below, in which 'Linear Regression'
should be selected.
Figure 84. 'Regression' module.
Under the premise of the sample research design that we have presented previously,
these variables must be introduced, as can be seen in Figure 85. In this case, CLT (test
score) is introduced in the dependent variable as it is the outcome variable (dependent
variable), and YearsL2 ('years studying the L2') as the predictor variable, that is, the
covariate. Since we aim to illustrate a simple linear regression, only one covariate is
introduced.
Figure 85. Introduction of variables in the 'Regression' > 'Linear Regression' module.
Once the variables are introduced into the corresponding areas, the following output
will be generated by JASP:
Figure 86. Model Summary of Simple Linear Regression.
Model Summary - CLT
                                              Durbin-Watson
Model   R       R²      Adjusted R²   RMSE    Autocorrelation   Statistic   p
H₀      0.000   0.000   0.000         1.721   0.661             0.620       < .001
H₁      0.397   0.157   0.146         1.591   0.594             0.739       < .001
The table in Figure 86 above shows that the correlation (R) between both variables is
not high (0.397). On the other hand, R² indicates that the years studying English
account for 15.7% of the variance in the Communicative Language Teaching test
score.
The ANOVA table above displays the sums of squares: 'Regression' corresponds to the
model while 'Residual' corresponds to the error. The F-statistic is significant at
p < .001. Following APA guidelines, this information should be reported as
F(1, 73) = 13.641, p < .001.
The information presented in the coefficients table (see Figure 88 above) provides the
unstandardized coefficients that are to be put into the linear equation:
y = c + b·x
where c is the intercept and b is the slope.
These coefficients stand for the following information:
Following the previous equation, for 0.5 years of study, the score that students in the
Communicative Language Teaching method are predicted to obtain is the following:
A potential interpretation is that, when students study the language for 0.5 years, they
might be expected to increase their CLT test score by 2.37 points.
Linear regression shows that years studying the L2 may significantly predict a CLT
test score F(1, 73) = 13.641, p < .001. The equation reveals that, after 6 months of
studying the L2, students may increase their CLT test scores by 2.37 points.
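To make the mechanics concrete, the slope b, intercept c and R² behind such an analysis can be computed with ordinary least squares. A minimal Python sketch with made-up data (not the book's dataset; the function name is ours):

```python
def simple_ols(x, y):
    """Least-squares fit of y = c + b*x; returns (c, b, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx                      # slope
    c = my - b * mx                    # intercept
    ss_res = sum((yi - (c + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return c, b, 1 - ss_res / ss_tot   # R^2 = explained share of variance

# On perfectly linear toy data the fit is exact: c = 0.0, b = 2.0, R^2 = 1.0
print(simple_ols([1, 2, 3], [2, 4, 6]))
```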
Multiple Regression
Multiple regression extends simple linear regression to two or more predictor
variables, and allows the researcher to examine how control variables may influence
the general model. The association between the multiple predictor variables and the
criterion is expressed by the multiple correlation coefficient, which is included in the
multiple linear regression analysis.
Tavakoli (2013) summarizes the extent to which multiple linear regression may be
useful for research purposes: a) the degree of relation between the predictor variables
and the criterion variable; b) how strong the relationship between each predictor
variable and the criterion variable is. Additionally, multiple linear regression allows
the researcher to control for other variables in the model; c) the relative strength of
each predictor variable, and finally, d) the interaction effects between each of the
predictor variables.
As Tavakoli (2013) points out, multiple regression analysis may have two central uses:
(1) to determine the strength of the association between the criterion and a predictor
variable while controlling for the internal association among the predictor variables.
This association is represented by the partial regression coefficient (β when
standardized, B when unstandardized); and (2) to determine how much particular
predictors can account for the variance in the criterion variable. In this case, this
association is represented by the multiple correlation coefficient (R) or its square (R²).
Subsequently, the following output will be generated (see Figure 90). As seen, the
adjusted R² informs the researcher that the predictors jointly account for 42.7% of the
outcome variance. The Durbin-Watson checks – which should fall between 1 and 3 –
are within the corresponding benchmarks.
Figure 90. JASP output - model summary - for multiple linear regression.
Model Summary - Anxiety_Levels
                                              Durbin-Watson
Model   R       R²      Adjusted R²   RMSE    Autocorrelation   Statistic   p
H₀      0.000   0.000   0.000         1.313   0.107             1.716       0.479
H₁      0.708   0.502   0.427         0.994   -0.151            2.214       0.822
The ANOVA table provides us with valuable information about the F-statistic which,
as observed, is statistically significant (p = .003), suggesting that the type of
methodology significantly predicts the level of anxiety.
The table below (Figure 91) shows that all predictors are forced into the model, and the
ANOVA model is significant. In the case of the predictor regression coefficients, the
CLT score is marginally significant (p = .059). However, the Tolerance and Variance
Inflation Factor (VIF), which are collinearity statistics, allow us to observe the degree
of multicollinearity that exists between the variables. As a rule of thumb,
multicollinearity can be discarded when no VIF is substantially above 10, the average
VIF is not substantially greater than 1, and tolerance is above 0.2. In Figure 92, the
VIF is quite large (M = 14.28). Hence, the model is biased, and no predictions should
be made.
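The link between tolerance and VIF can be made concrete for the two-predictor case, where R_j² is simply the squared correlation between the predictors. A Python sketch with hypothetical predictor values (function names ours):

```python
def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def collinearity_two_predictors(x1, x2):
    """Tolerance and VIF for the two-predictor case:
    tolerance = 1 - r^2, VIF = 1 / tolerance."""
    tolerance = 1 - pearson_r(x1, x2) ** 2
    return tolerance, 1 / tolerance

# Highly correlated hypothetical predictors -> low tolerance, large VIF
tol, vif = collinearity_two_predictors([1, 2, 3], [1, 2, 4])
print(tol, vif)
```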
This information may be observed visually through a series of plots, which allow for
the confirmation or rejection of assumptions.
Figure 93. Residuals vs Predicted plot.
As can be seen in Figure 93 above, the distribution of residuals around the baseline
suggests that the assumption of homoscedasticity may have been violated.
The Q-Q plot in Figure 94 shows that the standardized residuals fall along the
diagonal, which points to normality and linearity. Thus, these assumptions have not
been violated.
Reliability Tests
In statistics, the role of reliability tests is deemed essential in order to discern whether
the association between certain items or groups of items (e.g. in a survey, or the scores
graded by several raters) is consistent.
Cronbach's alpha - α
One of the most common reliability tests is Cronbach's alpha (α), in which the
associations between a set or group of items are used to specify how strongly these
items hold together (Urdan, 2017). Cronbach's alpha allows researchers to estimate the
internal consistency reliability of a certain measuring instrument (e.g. a test) by taking
into consideration certain information from the data: the number of items, the variance
of the scores in each item, and the variance of the total test scores (Tavakoli, 2013). As
mentioned previously, Cronbach's alpha is a measure of internal consistency that offers
factual data about the reliability of items within a group, for instance, a questionnaire
delivered to a classroom.
To interpret Cronbach's alpha, note that its maximum value is 1. Values approaching 1
reflect a stronger relationship between the test items, whereas a low alpha suggests
that the consistency among responses is very low.
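The formula behind Cronbach's alpha uses exactly the ingredients listed above: the number of items, each item's variance, and the variance of the total scores. A minimal Python sketch with made-up scores (function name ours):

```python
def cronbach_alpha(items):
    """Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / total variance).

    items: one list of scores per item, all over the same respondents.
    """
    k = len(items)
    n = len(items[0])
    def var(v):  # sample variance
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Two items that rise and fall together are perfectly consistent: alpha = 1.0
print(cronbach_alpha([[1, 2, 3], [2, 3, 4]]))
```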
Firstly, in JASP, we introduce the 23 items into the variables box, as in Figure 95.
Figure 95. JASP introduction of variables for unidimensional reliability.
The next step consists of selecting the appropriate scale statistics – in this case,
Cronbach's alpha α – and the confidence interval (see Figure 96).
This will generate the results of Cronbach's alpha, as in Figure 97 below. Following
the benchmarks mentioned previously, the reliability test yielded α = .736 (.639 – .811).
Hence, the degree of internal consistency is acceptable.
Figure 97. Reliability test- Cronbach's alpha results.
Frequentist Scale Reliability Statistics
Estimate Cronbach's α
Point estimate 0.736
95% CI lower bound 0.639
95% CI upper bound 0.811
Note. The following items correlated negatively with the scale: P_15, P_17, P_18, P_19, P_20,
P_22.
Another follow-up step that may be revealing is observing the individual item
reliability (in JASP, go to 'Individual Item Statistics' > Cronbach's α [if item dropped]).
Ticking this option allows us to observe Cronbach's alpha for each of the items in the
questionnaire.
Figure 98. Individual item reliability statistics for each question in the survey.
Frequentist Individual Item Reliability Statistics
If item dropped
Item Cronbach's α
P_1 0.718
P_2 0.718
P_3 0.713
P_4 0.725
P_5 0.731
P_6 0.720
P_7 0.713
P_8 0.723
P_9 0.724
P_10 0.716
P_11 0.708
P_12 0.731
P_13 0.728
P_14 0.708
P_15 0.734
P_16 0.725
P_17 0.739
P_18 0.730
P_19 0.755
P_20 0.763
P_21 0.722
P_22 0.736
P_23 0.727
Intraclass Correlation Coefficient (ICC)
In JASP, this option is available in the 'Reliability' module > 'Intraclass Correlation'.
The scores of each rater must be properly organized. In the example in Figure 99, the
data come from a re-coding check, which is why the variables are called 'Time_1' and
'Time_2'. Both have to be included in the 'Variables' box.
In Figure 100, the results of the ICC are presented. As observed, ICC = 0.903, which
is estimated as an excellent degree of association.
Intraclass Correlation
Type Point Estimate Lower 95% CI Upper 95% CI
Note. 661 subjects and 2 judges/measurements. ICC type as referenced by Shrout
and Fleiss (1979).
Cohen's Kappa
Parallel to ICC, Cohen's kappa is another measure of agreement that allows for the
calculation of interrater reliability. In essence, it represents the average rate of
agreement in a set of scores, revealing the degree of agreement and disagreement by
category. Cohen's kappa adopts a dichotomous coding scheme (Tavakoli, 2013).
Although simple percentages of agreement could be calculated, Cohen's kappa corrects
for chance agreement, making it a valuable statistical test to check for intra- and
interrater agreement. As with ICC, the closer Cohen's kappa is to +1, the greater the
agreement.
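The chance-correction logic can be sketched in a few lines of Python (the function name is ours, and the ratings are hypothetical):

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # chance agreement: product of each rater's marginal proportions
    p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical dichotomous codings from two raters
print(cohens_kappa(["a", "a", "b", "b"], ["a", "b", "b", "b"]))  # 0.5
```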
As can be observed in Figure 101, Cohen's kappa is .901, which points to a very high
degree of agreement between raters.
Factor Analysis
Taking as reference the example previously mentioned, in JASP, Factor Analyses are
found in the 'Factor' module (see Figure 102 below):
As observed, there are three types of analyses. This book provides an overview of all
three.
Figure 103. PCA module in JASP.
In Figure 103, all the variables have to be introduced into the 'Variables' box. Then,
under number of components, 'Eigenvalues' has to be selected. Note that only
components with eigenvalues above 1 will be retained.
The table shown in Figure 105 below is of paramount importance for understanding
the relevance of PCA in the framework of our research study. As can be observed, the
questions from the questionnaire (e.g. P_21, P_10, etc.) are grouped into different
components (i.e. RC1, RC2). The values shown in Figure 105 display the degree of
relationship between each item and its component.
Component Loadings
RC1 RC2 RC3 RC4 RC5 RC6 RC7 Uniqueness
P_11 0.567 0.356
P_14 0.458 0.438 0.368
P_18 0.999 0.201
P_19 0.972 0.156
P_16 0.817 0.367
P_15 0.603 0.514
P_5 0.574 0.390
P_9 0.873 0.302
P_6 0.838 0.239
P_17 0.858 0.333
P_3 0.540 0.363
P_13 0.763 0.330
P_12 -0.702 0.321
P_1 0.446 0.400
P_22 0.386
Note. Applied rotation method is promax.
The table in Figure 106 below is relevant for the researcher since it provides important
information about the proportion of variance as well as the eigenvalue of each
component. As can be observed, components 1 to 3 have the highest eigenvalues.
However, the proportion of variance is equally relevant: in this case, 0.227 is the
highest, which leads us to discard the remaining components when grouping the
variables.
Component Characteristics
            Unrotated solution                              Rotated solution
Component   Eigenvalue   Proportion var.   Cumulative   SumSq. Loadings   Proportion var.   Cumulative
7           1.084        0.047             0.672        1.488             0.065             0.672
Another visual manner to observe the results of the PCA is through a path plot, as can
be observed in Figure 108. In this case, all the components are shown, and a series of
red and green arrows, with differing widths, point to each of the variables.
In this case, the L2 researcher may gain more insight into which variables may be
merged into which component. Nevertheless, such decisions are to be taken on
theoretically or empirically motivated grounds.
Figure 108. Path plot for PCA.
Similar to PCA, Exploratory Factor Analysis (EFA) is another type of factor analysis
centered on describing and summarizing data by organizing variables that are assumed
to be linearly correlated. Although its objective is similar to that of PCA, EFA helps
the researcher decide which constructs or factors best represent the data. While useful
as a technique, it is generally employed in the early stages of research (Tavakoli, 2013)
in order to consolidate the variables, which eases the process of generating hypotheses.
Let us take the research study proposed in which a questionnaire is used to explore
students' perceptions about L2 digital writing, as in PCA. In JASP, the researcher must
go to 'Factor Analysis' > 'Exploratory Factor Analysis'. The box where the data has to
be introduced is the same as in PCA. The statistical aspects to be selected are,
depending on our research interests, 'Eigenvalues' and 'Oblique' rotation. An important
aspect of rotation is that, should there be a high degree of correlation among the
variables, 'oblique' is the most suitable option. Conversely, when the variables are
uncorrelated, 'orthogonal' should be selected. In our case, 'oblique' is the most
appropriate option since intercorrelations exist.
Under the 'Output options' tab, Assumption check must be selected. These are the
Kaiser-Meyer-Olkin (KMO) test (see Figure 109) and Bartlett's test (see Figure 110).
In the first case, the KMO test allows us to determine how suited our data is for Factor
Analysis. The result provided in the Overall MSA should be above .500. As can be
observed in Figure 109, the overall MSA is .665. Hence, this assumption check is met.
The subsequent assumption check is Bartlett's test, which determines whether or not
there is sphericity. In this case, the result should be statistically significant, as can be
seen in Figure 110.
In Figure 111, the model is tested through a chi-squared test. Although the value is not
statistically significant, the model will be explored in order to observe how the factors
are configured.
In Figure 112, the numerical entries in the table, which are factor loadings, indicate the
correlation between the original variables and the different factors. In essence, when a
factor loading is high, its variable contributes to that particular factor and helps define
it (Tavakoli, 2013). In our example, factor loadings are spread across seven different
factors.
Parallel to the previous factor loadings table, Figure 113 below displays the factor
characteristics. As can be observed, only factors one to four have summed squared
loadings above 1 (note that, in the rotated solution, these values are not eigenvalues).
This indicates that the remaining factors may not be as representative of the variables
included in the model as the others.
Figure 113. Factor characteristics.
Factor Characteristics
         Unrotated solution                               Rotated solution
Factor   SumSq. Loadings   Proportion var.   Cumulative   SumSq. Loadings   Proportion var.   Cumulative
1        4.835             0.210             0.210        2.757             0.120             0.120
2        2.326             0.101             0.311        2.543             0.111             0.230
3        1.931             0.084             0.395        1.977             0.086             0.316
4        1.119             0.049             0.444        1.608             0.070             0.386
5        0.830             0.036             0.480        1.490             0.065             0.451
6        0.702             0.031             0.511        1.059             0.046             0.497
7        0.590             0.026             0.536        0.898             0.039             0.536
Both Figures 114 and 115 provide a visual overview of the EFA model. In the case of
the path diagram, it allows us to observe in a much clearer manner each factor and the
associated variables.
The last of the factor analyses is Confirmatory Factor Analysis (CFA), which
allows the researcher to examine the relationship between different measured variables
and a set of factors. Unlike EFA, in which measured variables are related to every
factor by a factor loading, CFA presupposes advanced knowledge on the researcher's
part, and under this assumption, the factors are created by the researchers themselves.
Hence, CFA responds to a hypothesized factor structure and the associated
correlations between the variables. A major difference from EFA, thus, is related to the
researcher's role in CFA. As can be seen in Figure 116 below, the construct or factor
must be created and variables assigned to it on the basis of the theory being tested.
Preconceived theories, then, may be tested with CFA as well (Tavakoli, 2013).
Let us imagine that, as a result of the EFA, a further confirmatory check is made
through CFA. Hence, two factors – corresponding to the EFA factors with the highest
eigenvalues – are manually created, as can be observed in Figure 116. Once all the
data are introduced and classified into the corresponding researcher-created factors, a
chi-squared test for the model is calculated. In our study, it is statistically significant
(see Figure 117).
Figure 118 displays a set of tables with additional fit measures for assessing the
appropriateness of the model. Values of the Comparative Fit Index (CFI) should be
close to +1, indicating model fit. In our example, it is close to +1. The Tucker-Lewis
Index (TLI) is similar to the CFI, although a more conservative option. As can be
observed, the value of TLI is close to +1, hence indicating model fit.
In the case of other fit measures, such as the RMSEA, the value provided should be
below .10. A traditional benchmark is that values below .05 indicate a good model;
between .05 and .10, the model is acceptable, but close attention should be paid.
Information criteria
Value
Log-likelihood -1062.405
Number of free parameters 21.000
Akaike (AIC) 2166.810
Bayesian (BIC) 2216.301
Sample-size adjusted Bayesian (SSABIC) 2150.093
Other fit measures
Metric Value
Root mean square error of approximation (RMSEA) 0.080
RMSEA 90% CI lower bound 0.024
RMSEA 90% CI upper bound 0.123
RMSEA p-value 0.148
Standardized root mean square residual (SRMR) 0.071
Hoelter's critical N (α = .05) 75.662
Hoelter's critical N (α = .01) 87.119
Goodness of fit index (GFI) 0.891
McDonald fit index (MFI) 0.898
Expected cross validation index (ECVI) 1.189
After the checks regarding fit measures, the factor loadings corresponding to the
variables introduced in each manually created factor are presented. As can be observed
in Figure 119 below, all estimates seem to be over .40, indicating that these variables –
in our case, the questions in the questionnaire – fit well with the proposed factors.
Additionally, attention should be paid to variable P_20, since it points to a negative
correlation.
Much as it was shown in the previous factor analyses, CFA may be equally seen
through a model plot, allowing for a more visual perspective of the interrelationship
between the factors. Hence, in Figure 120, the model plot for CFA shows how factors
are correlated, and also, the correlation between each factor and the associated
variables.
Chapter 6
EFFECT SIZES
Cohen's d
One of the most used effect size indices is Cohen's d, which measures the difference
between means from two independent samples in terms of their standard deviation units
(Larson-Hall, 2010; Tavakoli, 2013). In essence, Cohen's d starts at zero and increases
with the size of the difference between the means. For effect sizes, there are
several benchmarks that are established to determine or estimate the magnitude of this
effect. However, Plonsky et al. (2021) indicated that "benchmarks are nothing more
than a starting point for gauging the magnitude of effects within the field" (p. 822). In
this respect, while traditional Cohen's d benchmarks have been established as follows:
small (0.2), medium (0.5), and large (0.8), Plonsky and Oswald (2014) proposed a series
of field-specific benchmarks for the interpretation of a series of effect sizes (namely, d
index, r index, and R2) in L2 research. In the case of the benchmarks proposed by these
authors, they distinguish between- and within-groups for the calculation of the
magnitude. For Cohen's d (between-groups): small (0.40), medium (0.70), and large
(1.00). Conversely, for Cohen's d (within-groups): small (0.60), medium (1.00), and
large (1.40).
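As an illustration of the underlying formula, d = (M₁ − M₂) / SD_pooled, where the pooled SD weights each group's variance by its degrees of freedom. A minimal Python sketch with hypothetical data (function name ours):

```python
def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

# Hypothetical groups whose means differ by two pooled-SD units
print(cohens_d([1, 2, 3], [3, 4, 5]))  # -2.0
```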
Figure 121. Selection of effect sizes for independent and dependent samples t-tests.
The effect size indices in Figure 122 are commonly used for independent and
dependent samples T-Tests (both parametric and non-parametric).
Figure 122. JASP outcome for an independent samples T-test with Cohen's d effect size.
Independent Samples T-Test
95% CI for Cohen's d
t df p Cohen's d Lower Upper
Post_Test -7.266 48 < .001 -2.057 -2.740 -1.360
Note. Student's t-test.
As can be seen in Figure 122 above, Cohen's d is large (d = –2.057), which indicates
that the difference between the groups is large. The inclusion of confidence intervals
allows us to observe the variability and the extent of this effect size.
Hedges' g
Another common statistical effect size index is Hedges' g, which is very similar to
Cohen's d. However, Hedges' g takes into account the sample size since the effect size
yielded by Cohen's d is multiplied by a correction factor for small sample sizes (Turner
& Bernard, 2006).
Figure 123. JASP outcome for independent sample t-test with Hedges' g.
Independent Samples T-Test
95% CI for Hedges' g
t df p Hedges' g Lower Upper
Post_Test -7.266 48 < .001 -2.024 -2.704 -1.331
Note. Student's t-test.
In Figure 123, the same example as in the previous effect size index was computed;
this time, Hedges' g was calculated to determine the magnitude of the effect of the
independent variable on the dependent variable. If we compare the result of Hedges' g
with that of Cohen's d, one may realize that the difference is minimal. Nevertheless,
the use of Hedges' g ensures that a correction is applied when using small sample
sizes (e.g. below 20 participants).
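The correction can be checked against the reported values. Using the common approximate factor J = 1 − 3/(4N − 9), where N is the total sample size, with d = −2.057 and N = 50 comes very close to Figure 123's g (a Python sketch; the function name is ours, and JASP works from the raw data rather than the rounded d):

```python
def hedges_g(d, n_total):
    """Apply the small-sample correction to Cohen's d
    (approximate form: J = 1 - 3 / (4N - 9))."""
    return d * (1 - 3 / (4 * n_total - 9))

# d = -2.057 and N = 50, as in the example above
print(hedges_g(-2.057, 50))  # about -2.02, in line with Figure 123
```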
Cramer's V
Another important effect size index is Cramer's V, which is used for chi-squared
analyses. Its main purpose is to provide information about how strongly two categorical
variables are associated. Cramer's V is generally used with contingency tables, and it
is an extension of the Phi correlation coefficient, whose derivation goes beyond the
purposes of this book (Tavakoli, 2013).
To illustrate the use of this effect size, let us take an example in which we are interested
in comparing the level of anxiety in two groups: a CLIL group and a control group. The
independent variable is the presence or absence of a bilingual program.
Figure 124. JASP output for the chi-squared test.
Chi-Squared Tests
        Value     df      p
Χ²     11.463      4    0.022
N          50
As can be observed in Figure 124 above, the contingency table and the chi-squared value
reveal that the difference between the two groups is statistically significant (p = .022).
Hence, the L2 researcher has to verify the strength of this effect, and to do so,
Cramer's V is computed.
As can be observed in Figure 125, the value is V = 0.479. A series of benchmarks have
been proposed for Cramer's V depending on the degrees of freedom (df) of the test
(based on Goss-Sampson, 2020). Comparing the result of Cramer's V in Figure 126 with
the benchmarks proposed, the effect size would be large (>0.25).
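Cramer's V can be recovered from the chi-squared statistic itself: V = √(χ² / (N·k)), where k is the smaller table dimension minus one. The sketch below plugs in the values from Figure 124; the 2 × 5 table shape is an assumption consistent with df = 4:

```python
def cramers_v(chi2, n, rows, cols):
    """Cramer's V from a chi-squared statistic and the table dimensions."""
    k = min(rows, cols) - 1  # smaller table dimension minus one
    return (chi2 / (n * k)) ** 0.5

# Values from Figure 124; the 2 x 5 table shape is an assumption (df = 4)
v = cramers_v(11.463, 50, 2, 5)  # about 0.479, matching Figure 125
```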
Rank-biserial (rb)
The rank-biserial correlation coefficient is not an effect size per se; rather, it is a
measure of association between a continuous variable and a dichotomous variable with
two categories (that is, an independent variable). Traditionally, this measure of
association has not been used very widely in research since its calculation is
problematic, especially when distributions are not normal.
In JASP, the rank-biserial correlation coefficient is used with non-parametric tests, and
it is interpreted as an effect size using the same benchmarks as for Pearson's correlation.
As our interest lies in L2 education research, Plonsky and Oswald's (2014) field-
specific benchmarks are taken as reference: small (0.25), medium (0.40), and large
(0.65).
Taking the same example as in the previous effect size indices, Figure 127 shows how
the rank-biserial correlation coefficient is displayed in JASP:
As can be observed, the effect size is very large (rb = –0.864), which suggests that the
magnitude of the effect of the independent variable is substantial.
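When the Mann-Whitney U statistic is available, the rank-biserial correlation follows from the simple identity rb = 1 − 2U/(n1·n2). The U value and group sizes below are hypothetical, chosen to reproduce the rb of the example:

```python
def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from a Mann-Whitney U statistic."""
    return 1 - (2 * u) / (n1 * n2)

# Hypothetical U and group sizes chosen to reproduce the rb of the example
rb = rank_biserial(582.5, 25, 25)  # -0.864
```

U = n1·n2/2 (maximal overlap between groups) gives rb = 0, while U = 0 (complete separation) gives rb = 1.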
Eta squared (η²)
As can be observed in Figure 128 below, eta squared is provided next to the p-value.
Following Goss-Sampson (2020), the benchmarks for eta squared are: trivial (<0.1),
small (0.1), medium (0.25), and large (0.37). Under these benchmarks, the eta squared
in the example in Figure 128 would be considered nearly medium (η² = 0.238).
Figure 128. ANOVA table with the eta squared effect size.
ANOVA - Montessori
Cases        Sum of Squares    df   Mean Square      F         p        η²
Group             36.860        1      36.860      15.006    < .001    0.238
Residuals        117.904       48       2.456
Note. Type III Sum of Squares
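Eta squared itself is simply the ratio of the effect's sum of squares to the total sum of squares. Using the values from the ANOVA table in Figure 128:

```python
# Values from the ANOVA table in Figure 128
ss_group = 36.860
ss_residual = 117.904

# Eta squared: proportion of total variance accounted for by the effect
eta_squared = ss_group / (ss_group + ss_residual)  # about 0.238
```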
Partial eta squared (η²p)
The benchmarks for partial eta squared are: trivial (<0.01), small (0.01), medium
(0.06), and large (0.14). As can be observed in Figure 129 below, using the same
example as in the previous effect size indices, the partial eta squared would be
considered very large (η²p = 0.57).
Omega squared (ω²)
The last of the effect size indices reviewed in this book is omega squared (ω²),
which is one of the most commonly employed measures of treatment effect (Tavakoli,
2013). In essence, omega squared measures the proportion of variability in the
dependent variable which is directly associated with the independent variable in the
population. Using omega squared as an effect size ensures that our estimate of this
proportion in the population is not biased (Pagano, 2009).
For omega squared, the benchmarks are the same as those for partial eta squared (see
the previous section). Thus, as can be seen in Figure 130, the effect size is very large
(ω² = 0.509).
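Omega squared can likewise be computed from ANOVA table entries. The sketch below applies the standard one-way formula, ω² = (SS_effect − df_effect·MS_error) / (SS_total + MS_error), to the values in Figure 128 (Figure 130's ω² = 0.509 comes from a different example). Note that the result is slightly smaller than the corresponding η², which is exactly the bias correction at work:

```python
def omega_squared(ss_effect, df_effect, ss_error, ms_error):
    """One-way ANOVA omega squared, a less biased alternative to eta squared."""
    ss_total = ss_effect + ss_error
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Values from the ANOVA table in Figure 128
w2 = omega_squared(36.860, 1, 117.904, 2.456)  # about 0.219, below eta squared's 0.238
```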
Chapter 7
In order to determine and verify the probability of an observed effect, frequentist
methods use the p-value as a reference. The p-value is the calculated probability of
obtaining results at least as extreme as those observed if the null hypothesis were true;
when this probability is sufficiently low, the null hypothesis may be rejected.
Traditionally, frequentist methods have relied on the p-value as a reference, setting the
alpha level at 0.05: below this benchmark, a result is considered statistically
significant. Nevertheless, one of the issues raised about frequentist statistics is that
p-values tend to be overinterpreted and, in turn, misused. This is the main reason why
frequentist statistics have to be supplemented, for instance, with the (sometimes
necessary) inclusion of effect size indices.
Aside from frequentist methods, research in general and L2 education research in
particular have recently shifted toward Bayesian methods (Norouzian et al., 2018).
Bayesian statistics holds that probability expresses a degree of belief in a specific
event. While frequentist statistics emphasizes the long-run chance of an event,
Bayesian statistics provides the researcher with information about how probable it is
that one hypothesis is better supported than another.
Equally relevant, Bayesian probability is conditional: it uses the concepts of prior and
posterior knowledge in an attempt to predict outcomes. This is called conditional
probability, and the premise behind it is that the probability of an event X given Y is
equal to the probability of X and Y happening together divided by the probability of Y.
The main usefulness of conditional probability is that it takes into account both false
positives and false negatives.
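The definition above can be turned into a worked example. The base rate, sensitivity, and false-positive rate below are hypothetical, but they show how conditional probability weighs false positives: even a fairly accurate test yields a modest P(condition | positive) when the condition is rare.

```python
# Hypothetical rates for a rare condition and a fairly accurate test
p_cond = 0.01               # prior probability (base rate)
p_pos_given_cond = 0.95     # sensitivity (true-positive rate)
p_pos_given_no_cond = 0.05  # false-positive rate

# Total probability of a positive result (law of total probability)
p_pos = p_pos_given_cond * p_cond + p_pos_given_no_cond * (1 - p_cond)

# Conditional probability: P(X | Y) = P(X and Y) / P(Y)
p_cond_given_pos = (p_pos_given_cond * p_cond) / p_pos  # about 0.16
```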
In what follows, Bayesian terminology will be explained along with some essential
concepts of Bayesian statistics.
Basic terminology
Credibility interval. In Bayesian statistics, traditional confidence intervals are not
used. Instead, credible intervals are used, and they are interpreted as the probability
– in JASP it can be set at 95% or adjusted to our research interests – that the
population parameter lies between the lower and upper bounds of the interval
(Goss-Sampson, 2020). In Figure 131 below, descriptive statistics are shown along
with the credible interval.
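If draws from a posterior distribution are available, a 95% credible interval can be read off directly as the central 95% of those draws. The posterior samples below are simulated (hypothetical), purely to illustrate the interpretation:

```python
import random

random.seed(1)
# Simulated draws from a posterior distribution of a mean difference
samples = sorted(random.gauss(-1.2, 0.3) for _ in range(10_000))

# Central 95% of the posterior draws = 95% credible interval
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]
# Interpretation: the parameter lies in [lower, upper]
# with 95% posterior probability
```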
Prior distribution. The prior distribution captures the degree of certainty or
uncertainty about a population parameter before the data are considered
(Goss-Sampson, 2020). The prior is then weighted by the data (the likelihood) to obtain
the posterior, from which inferences are made. In terms of research, the prior distribution is
based upon what previous research has determined to be the norm. In essence, in
Bayesian statistics, the researcher has to be quite knowledgeable about the tendency
existing in previous research. Nevertheless, as pointed out by Norouzian et al. (2018),
the use of Bayesian statistics is still limited in L2 research and Applied Linguistics.
Thus, the establishment of prior distribution turns out to be challenging. In JASP, the
prior distribution is set in the 'Prior' tab (see Figure 132). JASP has a default Cauchy
distribution of a zero effect size (Cohen's d) and width or scale of .707 (Goss-Sampson,
2020). Such a prior distribution allows us to carry out parameter estimation, whose
values may change depending on what previous research has stated to be the norm.
Likelihood functions are, in essence, based on the data generated and provide a crude
description of it. They are highly dependent on the type of data (Norouzian et al., 2018).
Equally relevant, the likelihood weights the prior distribution to obtain the posterior
distribution, which allows us to make inferences.
Posterior distribution is obtained when the prior and the likelihood are combined in
the Bayesian estimation process. Statistically, the posterior is obtained by multiplying
the prior distribution by the likelihood function.
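The prior-times-likelihood logic can be made concrete with a grid approximation for a binomial proportion: the flat prior is weighted by the likelihood at each candidate value, and normalizing gives the posterior. The data (15 successes in 20 trials) are hypothetical.

```python
from math import comb

# Hypothetical data: 15 successes in 20 trials
k, n = 15, 20

# Candidate values for the proportion theta
grid = [i / 100 for i in range(1, 100)]
prior = [1.0] * len(grid)  # flat (uninformative) prior
likelihood = [comb(n, k) * t**k * (1 - t) ** (n - k) for t in grid]

# Posterior is proportional to prior times likelihood; normalize to sum to 1
unnormalized = [p * l for p, l in zip(prior, likelihood)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

# With a flat prior, the posterior mode sits at the sample proportion 15/20 = 0.75
mode = grid[posterior.index(max(posterior))]
```

An informative prior would simply replace the list of ones, shifting the posterior toward what previous research has found.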
On the basis of the above, Norouzian et al. (2018) clearly state that choosing a prior is
a relevant decision in any research study wherein Bayesian methods are the statistical
choice. It is sometimes assumed that prior knowledge is absent or negligible, but
ignoring it may lead to a biased Bayesian result. As mentioned previously, prior
knowledge (and hence, prior distributions) may be hard to specify beyond the default
Cauchy prior in JASP, since the scarcity of studies using Bayesian statistics makes it
difficult to define. A useful manner of setting priors is identifying effect sizes in
previous research, especially Cohen's d, as units of reference for the Cauchy prior.
Prior odds are the odds of the hypotheses before the evidence is considered
(Goss-Sampson, 2020). These prior odds may be uninformative or informative,
depending on the degree of knowledge from previous work that may be applied to the
Bayesian statistical method. In turn, posterior odds are the Bayes factor – which will
be explained in the next section – multiplied by the prior odds. Through this formula,
Goss-Sampson (2020) indicates that the Bayes factor (BF10) informs us about the
degree of support for or against a hypothesis.
Figure 134 below shows the weight carried by the value of the Bayes factor, according
to which the strength of evidence may be classified. The higher the value of the Bayes
factor (BF10), the stronger the evidence against the null hypothesis.
Figure 134. Graphical representation of a Bayes factor classification table (Van Doorn et al., 2020).
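A minimal, fully analytic Bayes factor can be computed for binomial data: with a uniform prior on the proportion under H1, the marginal likelihood is 1/(n + 1), and BF10 is its ratio to the likelihood under H0: θ = 0.5. The data below are hypothetical; in this case BF10 comes out near 3.2, which classifications like the one in Figure 134 typically label moderate evidence for H1.

```python
from math import comb

def bf10_binomial(k, n):
    """BF10 for binomial data: H1 (theta ~ Uniform(0, 1)) vs H0 (theta = 0.5)."""
    # Integrating the binomial likelihood over a uniform prior gives 1 / (n + 1)
    marginal_h1 = 1 / (n + 1)
    likelihood_h0 = comb(n, k) * 0.5**n
    return marginal_h1 / likelihood_h0

bf = bf10_binomial(15, 20)  # about 3.2: moderate evidence for H1
```

For data that sit close to the null (e.g., 10 successes out of 20), the same function returns a BF10 below 1, i.e., evidence for H0.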
Although the output shown in the table is clear, Figure 136 shows a prior and posterior
distribution plot which allows us to observe the conditional distribution much more
clearly. The plot reveals that there is evidence for the alternative hypothesis, which is
equally supported by the median of the effect size (Mdn = –1.196).
Likewise, Figure 137 below displays the same information as the previous graph,
although in this case the prior likelihood is shown along with the value of the Bayes
factor. Once again, the evidence for the alternative hypothesis is compelling.
Apart from the usual statistical procedures such as T-tests, Bayesian correlations may
also be performed. JASP offers this option in the 'Regression' module, as can be seen
in Figure 138:
The procedure for introducing the data is the same as for frequentist statistics.
However, the output is different (see Figure 139 below). In this case, Pearson's r
statistics are presented, but p-values are replaced by the Bayes factor (BF10). Hence,
to determine whether the correlations are supported by the evidence, the Bayes factor
has to be inspected. To name one of these results, CLT is positively correlated with the
level of anxiety, with a high Bayes factor.
REFERENCES
Chen, S. Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for
multiple comparisons. Journal of Thoracic Disease, 9(6), 1725–1729.
https://doi.org/10.21037/jtd.2017.05.34
Cohen, L., Manion, L., & Morrison, K. (2011). Research methods in education (7th
ed.). London: Routledge.
Cramer, D., & Howitt, D. (2004). The SAGE dictionary of statistics: A practical
resource for students in the social sciences. Thousand Oaks, CA: Sage.
Norouzian, R., de Miranda, M., & Plonsky, L. (2018). The Bayesian revolution in
second language research: An applied approach. Language Learning, 68(4),
1032–1075.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2
research. Language Learning, 64(4), 878–912.
Plonsky, L., Sudina, E., & Hu, Y. (2021). Applying meta-analysis to research on
bilingualism: An introduction. Bilingualism: Language and Cognition, 1–6.
Porte, G. K. (2010). Appraising research in second language learning: A practical
approach to critical analysis of quantitative research (2nd ed.). Amsterdam:
John Benjamins.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater
reliability. Psychological Bulletin, 86(2), 420.
Turner, H. M. I., & Bernard, R. M. (2006). Calculating and synthesizing effect sizes.
Contemporary Issues in Communication Science and Disorders, 33(Spring),
42–55.
Urdan, T. C. (2017). Statistics in plain English (4th ed.). Mahwah, NJ: Lawrence
Erlbaum Associates.
Van Doorn, J., van den Bergh, D., Böhm, U., Dablander, F., Derks, K., Draws, T., Etz, A.,
Evans, N. J., Gronau, Q. F., Haaf, J. M., Hinne, M., Kucharský, Š., Ly, A., Marsman,
M., Matzke, D., Gupta, A. R. K. N., Sarafoglou, A., Stefan, A., Voelkel, J. G., &
Wagenmakers, E. J. (2020). The JASP guidelines for conducting and reporting a
Bayesian analysis. Psychonomic Bulletin & Review.
https://doi.org/10.3758/s13423-020-01798-5