Table of Contents

Preface ........................................................................................................................... 5

Chapter 1 ESSENTIAL CONCEPTS IN STATISTICS............................................... 7

Populations and Samples........................................................................................... 7

Types of variables ..................................................................................................... 8

Independent variables............................................................................................ 9

Dependent variables ............................................................................................ 10

Control variables ................................................................................................. 11

Moderator variables............................................................................................. 11

Intervening variables ........................................................................................... 12

Confounding variables ........................................................................................ 13

Levels of measurement............................................................................................ 14

Types of research design ......................................................................................... 14

Chapter 2 DESCRIPTIVE STATISTICS ................................................................... 17

Measures of Central Tendency................................................................................ 17

Measures of Dispersion ........................................................................................... 20

Distributions ............................................................................................................ 22

Graphical representations of the data ...................................................................... 26

Chapter 3 PARAMETRIC STATISTICS ................................................................... 31

One Sample T-Tests ................................................................................................ 31

Paired Samples T-Tests ........................................................................................... 34

Independent Samples T-Tests ................................................................................. 36


One-Way ANOVA .................................................................................................. 39

One-way ANOVA with three groups.................................................................. 41

Two-Way ANOVA ................................................................................................. 45

Repeated Measures ANOVA .................................................................................. 48

One-way Repeated Measures ANOVA (RMANOVA) ...................................... 48

ANCOVA ................................................................................................................ 51

MANOVA ............................................................................................................... 54

Parametric correlations............................................................................................ 57

Proportions, Chi-Squared Test, and Contingency Tables ....................................... 61

Chapter 4 NON PARAMETRIC STATISTICS ......................................................... 65

Wilcoxon Signed-Ranks Tests ................................................................................ 65

Mann Whitney U Tests............................................................................................ 67

Friedman's Tests ...................................................................................................... 70

Kruskal-Wallis Tests ............................................................................................... 71

Nonparametric correlations ..................................................................................... 74

Chapter 5 OTHER STATISTICAL ANALYSES....................................................... 77

Linear Regression.................................................................................................... 77

Simple Linear Regression ................................................................................... 77

Multiple Regression ............................................................................................ 80

Reliability Tests....................................................................................................... 85

Cronbach's alpha - α ............................................................................................ 85

Intraclass Correlation Coefficient (ICC) ............................................................. 88

Cohen's Kappa..................................................................................................... 90

Factor Analysis........................................................................................................ 91

Principal Component Analysis............................................................................ 91


Exploratory Factor Analysis ................................................................................ 95

Confirmatory Factor Analysis ........................................................................... 100

Chapter 6 EFFECT SIZES ........................................................................................ 105

What is an 'effect size'? ......................................................................................... 105

Cohen's d ............................................................................................................... 106

Hedges' g ............................................................................................................... 107

Cramer's V ............................................................................................................. 108

Rank-biserial (rb) ................................................................................................... 110

Eta squared (η²) ..................................................................................................... 110

Partial eta squared (η²p) ......................................................................................... 111

Omega squared (ω²) .............................................................................................. 112

Chapter 7 INTRODUCTION TO BAYESIAN STATISTICS ................................. 113

Basic terminology.................................................................................................. 114

Bayes Hypothesis Testing: T-Tests and Correlations ........................................... 116

REFERENCES ...................................................................................................... 120

Preface
Second Language (L2) education researchers generally design their studies on the basis
of qualitative and quantitative data. While the analysis of the data from a qualitative
perspective entails completely different procedures, quantitative data has to be
processed and analyzed using statistical methods. In this sense, L2 education
researchers usually work with data from multifaceted tests (e.g. language tests,
working memory tests, among many others) whose purpose is to quantify a specific
construct, such as the level of proficiency. The data obtained from these procedures
are, in fact, observed data that are counted and classified. Statistical approaches are
still regarded by most L2 education researchers as an unexplored, highly feared area
because of the lack of affinity with language-related fields of research. Yet it
cannot be denied that, without statistics, it would not be possible to determine whether
the results of a test after an intervention are due to a random effect or to the intervention
itself. The relevance of statistics lies in this previous statement, but L2 education
researchers are more inclined to avoid using quantitative approaches in their research
studies, mainly because of their lack of knowledge and training. In part, this book has
emerged out of the teaching notes that I was preparing as a supervisor of several
students who were writing their Bachelor's, Master's, and Ph.D. theses. In an
attempt to provide them with a comprehensible guide to statistics in L2 education
research, I opted for converting it into a tool that could be used not only by them but
by the rest of the researchers.

This book intends to provide L2 education researchers with an easy-to-consult guide
for the most basic statistical tests. Many statistical books attempt to describe in detail
and more technically each statistical test, including mathematical terms and formulae
which, in most cases, are difficult to understand. The purpose of this book is completely
the opposite since L2 education researchers are more interested in the practical
application of a specific statistical test rather than in its mathematical calculation.
Hence, explanations of the different statistical procedures and essential concepts are
provided in an understandable manner. The reader must not expect complex
explanations involving mathematical terms.

Likewise, the statistical software package of reference in this book is JASP (JASP Team,
2020), which is free open-source software. All references and figures illustrating how to
implement these statistical analyses are based on JASP.

Finally, my ultimate endeavor has been to accompany the statistical explanations and
procedures with sample L2 education research designs. These examples may serve as
guides for the L2 education researcher who struggles to understand how a specific
statistical procedure or test would be applied to their study.

Chapter 1

ESSENTIAL CONCEPTS IN STATISTICS

Doing research in any field requires a series of tools and skills whose main purpose
is to lead us further and confirm or reject our hypotheses. Among these tools and skills,
one may adopt a purely qualitative approach, which entails the use of varied elements
of data collection such as interviews or observation methods. Nevertheless, the
adoption of a quantitative approach requires that the researcher gather numerical data
– but also quantification of qualitative data – with the aim of answering the research
questions proposed. Most importantly, there are a number of essential aspects to bear
in mind where the research design converges with the research questions, directly
intersecting with quantitative data processing and analysis. Throughout this first chapter,
the most important concepts in statistics will be explained. This first contact with the
basics of statistics will be made through the description of concepts (e.g. types of
variables), providing numerous examples drawing on what is the norm in Applied
Linguistics and Second Language (L2) education.

Populations and Samples


When doing research in L2 education – and in any field of research – the most relevant
source of data stems from a specific population. By definition, a population may
be an individual or, more commonly, a group comprising all or most of the members of a
specific category. As a rule of thumb, a population may be people, but also categories of
interest. In the case of L2 education, a specific population would be, for instance, adult
L2 writers. In order to get access to such a population, you need to devise a sampling
plan. Given the broad spectrum of adult L2 writers around the world, you would have
to make a distinction to determine which population is accessible for your research. To
reach this specific portion of a population, you will have to use some sampling
procedures: probability sampling or non-probability sampling. Probability sampling
entails that the individuals are randomly obtained from the population. Non-probability
sampling, however, requires that the researcher selects the individuals on the basis of
some criteria (e.g. adult L2 writers who have been learning English for 12 years).
Hence, after these sampling procedures, we have to select a smaller number of
individuals that belong to this wider population. This reduced group of people would
be a sample, whose size will condition the extent to which our results may be
representative of the larger population. Building on the above, a sample obtained through
probability sampling could be a random sample. Within non-probability
sampling, representative sampling is another option for the sampling plan. The
researcher in this case selects individuals who are believed to be more representative in
view of the expected results. For instance, in L2 education, out of an intact classroom, the
researcher may select only those adult L2 writers whose proficiency level is A2 and
B1. In such a case, the sample would be representative of those specific levels within
the population. On the other hand, convenience sampling is one of the most common
sampling procedures in L2 education research since, as a rule of thumb, the researcher
may gather the research data from the nearest or most accessible place. To illustrate
this, let us take the example of a researcher who wants to explore how a specific
didactic proposal may affect L2 primary students' performance. To do so, the
researcher may resort to (1) the nearest elementary school, and/or (2) the education
center in which access is much easier for a number of logistic and social reasons (e.g.
the researcher may have a social connection with some teacher within the school).
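The two sampling procedures can be sketched in a few lines of Python. The participant records, the seed, and the 12-year criterion below are invented purely for illustration:

```python
import random

# Hypothetical accessible population of adult L2 writers
population = [
    {"id": 1, "years_english": 12}, {"id": 2, "years_english": 5},
    {"id": 3, "years_english": 12}, {"id": 4, "years_english": 14},
    {"id": 5, "years_english": 3},  {"id": 6, "years_english": 12},
]

# Probability sampling: every individual has an equal chance of selection
random.seed(42)  # fixed seed so the draw is reproducible
random_sample = random.sample(population, k=3)

# Non-probability sampling: selection based on a criterion
# (e.g. adult L2 writers who have been learning English for 12 years)
criterion_sample = [p for p in population if p["years_english"] == 12]

print(len(random_sample))     # 3
print(len(criterion_sample))  # 3
```

Note that only the first draw is random; the second is entirely determined by the researcher's criterion, which is precisely the difference between the two procedures.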
Types of variables
Once the sample is defined, that is, the subjects are accessible, the researcher has to
determine which aspects of the sample are going to be put under scrutiny. Such aspects
are materialized in the form of variables. In L2 education research, a variable would
refer to any information that can be codified. For instance, a variable could be L2
proficiency – which would have categories such as A1, A2, B1... –, gender, age, or
anxiety. In short, variables define the characteristics or attributes of an
individual or a specific category. Variables may be classified as quantitative or
qualitative. Quantitative variables are those which are continuous, in which the
numbers are indicative of some amount. In this case, a quantitative variable in L2
education research would be equated with a test score or level of anxiety. Conversely,
qualitative variables are categories to which values are assigned. In that case, L2
proficiency would be a qualitative variable (also, categorical variable) since, in order to
quantify it, values should be assigned: for instance, A1 = 1, A2 = 2, B1 = 3. Within the
category of qualitative variables, dichotomous variables are commonplace in social
sciences research, and hence, in L2 education research. These dichotomous variables
are useful for survey-based research or when observations are classified and quantified.
Aside from these characteristics, statistics are highly dependent on the role that these
variables play in the research design. Hence, within a research design, two important
variables play a significant role: independent variables and dependent variables.
Besides these variables, others are equally relevant: control variables, moderator
variables, intervening variables, confounding variables, and extraneous variables.

Independent variables

This type of variable implies that, when isolated – that is, independent –, it
may exert an influence on changes or variation in dependent variables. In
other words, independent variables are systematically manipulated in order to observe
whether their variation contributes to further changes in other variables (Heiman, 2011;
Larson-Hall, 2010). An L2 researcher may manipulate the independent variable in an
attempt to observe the extent to which this action influences the dependent variable.
For instance, one may look into how changing the teaching methodology (independent
variable) may contribute to improving vocabulary scores or not (dependent variable).

Nevertheless, research designs are not always composed of one independent variable,
but they may involve more than one. These designs include several independent
variables which correspond to a number of factors. In L2 education research, an
independent variable could refer to the methodology used (e.g. CLIL vs non-CLIL
classroom), and the L2 proficiency (LOW vs HIGH).

Finally, the aim of an experimental design could be different from observing the
influence of the independent variable on the dependent variable. Hence, the objective
may be directed towards predicting a certain outcome (one or more dependent
variables) on the basis of this independent variable. For instance, an L2 education researcher may
be interested in measuring test duration, and how it can predict students' scores on a
language proficiency test. In this case, the test duration would be the predictor or
explanatory variable, and the test scores would be considered the criterion (or outcome
variables).

Dependent variables

As anticipated in the previous section, dependent variables are
variables that the researcher observes in order to clarify or determine the effect of the
independent variable. In essence, dependent variables are the elements or
characteristics that change or are modified as a result of the manipulation of the
independent variable.

To illustrate how dependent variables may be characterized or operationalized, let us
imagine that, in an L2 education research study, you intend to study the construct
'fluency', which you operationalize as the words per minute. Hence, the variation in the
words per minute from pre-test to post-test after the influence of the independent
variable is thought to contribute to variation in the construct of fluency. In a study,
more than one dependent variable is usually the norm, especially in L2 education
research where a construct may be reflected in several dependent variables.

Control variables

Another important type of variable that needs to be included in some research designs
is the control variable. In essence, control variables are not considered of interest in our
study, but they should be controlled given the purported influence they may exert over
the outcomes or the dependent variables. It is advisable to control these variables
in an attempt to provide more internal validity to our study. In L2 education
research, a common control variable is related to the participants' age or, more
specifically, the years studying the L2 as well as the number of hours of L2 lessons
received.

A solution to control this type of variable lies in the research design per se. Thus, the
L2 researcher has two options: (1) randomize the groups, that is, despite
the mentioned control variables (e.g. years studying the L2), assign participants randomly
to either the control or the experimental group; or (2) standardize the procedures
by adapting, for instance, the research design. For L2 proficiency, subgroups could be
created referring to LOW and HIGH proficient L2 learners. Nevertheless, including
more subgroups may lead to the necessity of enlarging the sample size.

Figure 1. The connection between the control variables and the independent and dependent
variables.
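Option (1), random assignment, can be sketched in Python; the participant IDs, the seed, and the group sizes below are hypothetical:

```python
import random

# Hypothetical participant IDs from an intact classroom
participants = list(range(1, 21))  # 20 learners

random.seed(7)  # fixed seed so the assignment is reproducible
random.shuffle(participants)

# Split the shuffled list into two groups of equal size
half = len(participants) // 2
experimental_group = participants[:half]
control_group = participants[half:]

print(len(experimental_group), len(control_group))  # 10 10
```

Because the shuffle precedes the split, differences in control variables such as years of L2 study are expected to be distributed evenly across the two groups.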

Moderator variables

Moderator variables are special types of independent variables through which the
relationship between the dependent and independent variable is examined. In this
sense, a moderator variable may act upon such a relationship in such a way that
it leads to changes in its direction or strength.

To contextualize how a moderator variable may have an influence on an L2 research
design, let us imagine that we decide to study how belonging to a CLIL or non-CLIL
classroom (the independent variable) may exert an influence on anxiety levels
(dependent variable) by taking into account the proficiency level (moderator variable).

Figure 2. Connection with the moderator variable.

Intervening variables

Intervening variables are abstract theoretical variables which somehow mediate the
relationship between the independent variable and dependent variables. However,
intervening variables are not directly observable, and they may only be inferred from
what is observed. Another important characteristic of intervening variables is that they
tend to be unexpected in the research design, but their appearance on stage might be
due to further reasoning regarding the link between the independent variable and the
dependent variables.

In L2 education research, a research design in which two types of methodologies are
tested may include such an intervening variable. Hence, while the experimental group
received lessons using methodology A, and the control group received them with
methodology B, results in the post-test turned out to be much higher in methodology
A. Nevertheless, after careful consideration, the L2 researcher may observe that this
positive variation may be due to behavioral reasons, such as the teacher in methodology
A being more lively – that is, an unanticipated variable. Despite not being a confirmed
source of influence, it may explain the relationship observed in the study outcomes.

Figure 3. Intervening variable in the research design.

Confounding variables

When a variable affects the dependent and independent variables and may act as a
distractor or may confound the relationship between them, one may be dealing with a
confounding variable. This type of variable has to meet two principal conditions: (1)
there should exist a correlation with the independent variable, and (2) it should be
causally related to the dependent variable. The L2 researcher has to take into
consideration the pivotal role that confounding variables may play in research designs;
in this regard, results may not be robust and could be misleading if a confounding
variable is not considered. Equally important, the internal validity of the research could
be questioned.

To illustrate this in L2 research, let us imagine a research design in which we are
comparing students' performance in a communicative task and a reading-into-speaking
task. The first type of task was provided to students in the first year, and the second
task was delivered to students in their third year. After performing a series of statistical
analyses, a between-group difference was observed. However, the L2 researcher would
find it very complicated to determine whether such a difference is due to the use of a
different task, or because of the nature of both groups, that is, their L2 proficiency or
linguistic knowledge may be predictably higher. Controlling confounding variables is
determined by the manner in which the research design is devised, and they should be
taken into account beforehand.

Figure 4. A confounding variable in the research design.

Levels of measurement
To measure these variables, they are generally classified into four
scales: (1) nominal, (2) ordinal, (3) interval, and (4) ratio.
The first type of scale of measurement, nominal, implies that the variable has different
categories (e.g. Experimental Group; Primary 5; Feedback Group). In the case of
ordinal variables, these entail values that follow a meaningful order (e.g. strongly
disagree to strongly agree). In L2 education research, ordinal variables tend to be used
to classify individuals according to their L2 proficiency level, but also to organize L2
background (e.g. 1-3 years; 4-9 years). Interval variables are those whose distance
between scores is equal, and they measure continuous variables. In L2 education
research, interval variables are measured with interval scales, such as test duration, or
holistic ratings. Finally, ratio variables are like interval variables but have a true zero
point (e.g. the number of errors in a composition).
Types of research design
Prior to deciding on the type of statistical approach that L2 research has to take, the
research design has to be purposely devised. Firstly, in an experimental design, the
sample is divided into two equal or unequal groups. In both groups, a series of variables
will be explored, and one of them is going to be manipulated in order to observe
whether there is an effect on the other variables. In L2 education research, an example
could be two different types of methodologies to teach the same content. Hence, part
of the sample (the experimental group) would receive a novel methodology, while the
other part (the control group) would be taught using the traditional methodology. In
this case, the independent variable consists of two levels (1= novel methodology; 2=
traditional methodology), and the dependent variables would be determined by the
score in a test evaluating the content taught. However, experimental designs are usually
conducted in laboratory settings. When experimental designs are carried out in the
educational environment itself, that is, a naturally occurring setting, then it would be
called quasi-experimental design. In this case, the researcher would have access to
the intact classrooms. In this regard, the L2 education researcher may opt for selecting
a random class in which the experimental intervention (i.e. the novel methodology) is
implemented. Likewise, the traditional methodology would be applied in a normal
classroom but using the contents specified by the researcher.
Another type of research design is correlational research designs, in which the
researcher does not manipulate any independent variable, and the assignment of
participants does not follow any specific sampling procedure. The main objective
of this type of research design is to observe how two variables are correlated
with each other. In L2 education research, data about
participants' anxiety levels and scores in the Cambridge Speaking examination may be
collected. To that end, a series of statistical analyses may be performed. Suppose that,
when the values in anxiety levels and the test scores are statistically correlated, there is
a negative correlation between both. This would help us gain insight into the
relationship between anxiety and speaking skill in this specific examination.
Nevertheless, this type of research design cannot lead the researcher to conclude that
this occurs because there is causality (Urdan, 2017).

Another important research design that is framed within the wide array of quantitative
methods is the survey research design. As its name indicates, the data collection
procedure in this design involves delivering surveys to a population of people in order
to carefully describe trends. Survey research is commonplace in L2 education research,
and to illustrate this, let us imagine that we are interested in exploring how a group of
Primary students perceive a specific type of intervention in the L2 classroom. For this
purpose, a survey should be used to gather this perception data.

Finally, education research tends to rely on another type of combined research design
that includes both quantitative and qualitative approaches. Action research aims to
explore on-the-spot how a specific issue or problem may be addressed. Through
constant monitoring, and using a variety of methodological tools (e.g. questionnaire,
diary study, or observation), the data collected are thought to guide the researcher through
the process of adjusting or modifying the approach (Richards, 2003). In L2 education
research, action research designs are manifold and may involve several procedures,
among which not only one teacher-researcher is likely to take part in the process, but
also others. In a sense, action research is a collaboration between the researcher and
the teacher in the classroom. For instance, an L2 education researcher might be
interested in examining how the task-based language teaching approach differs from the
communicative language teaching approach. In a sort of pre-research phase, both
approaches could be implemented in the L2 classroom whilst the teacher keeps a
journal to observe whether there are modifications or adjustments to be made in these
approaches. Simultaneously, the researcher may collect purely quantitative data from
tests. Putting qualitative information (from the journals and teacher's observations) in
parallel with the quantitative data ensures that the statistical differences between both
approaches may be supplemented with subjective data.

Chapter 2

DESCRIPTIVE STATISTICS
One of the very first steps in statistics corresponds to the description of raw data. When
quantitative data is collected, descriptive statistics are used to organize, summarize,
and describe the characteristics of the data obtained. Although they are described as
statistical procedures (Porte, 2010) since a number of calculations are involved, the
inferences that may be drawn are not as reliable as in the case of more advanced
statistical tests (see Chapters 3 and 4).

Throughout this chapter, a series of measures of central tendency, dispersion, and
graphical representation of data will be thoroughly presented. Since the main objective
of this book is to offer a clear and comprehensive overview of statistical methods, we
will mainly rely on a free statistical software package, JASP, in order to perform our
statistical analyses. It is true that making sense of the equations and formulae
underlying each statistical analysis is certainly useful; in this book, however, attention
to these mathematical aspects will be limited to the strictly necessary occasions.

Measures of Central Tendency


Measures of central tendency indicate how scores on a variable are distributed. These
measures allow researchers to observe the characteristics of these scores (Urdan, 2017).

The most commonly used statistic in L2 education research is the mean. In essence, it
is the arithmetic average of the distribution of scores, and it allows researchers to
summarize the information obtained from this specific variable. Despite the valuable
piece of information that the mean constitutes, it does not inform about the particular
distribution of the scores, nor about which scores are closer to the mean. Let us imagine
that we want to observe how different the scores in the Primary 5 and Primary 6 classes
are. Using JASP, we click on 'Descriptives', then add the between-group variable (i.e.
the grouping variable) into 'Split', and add the variable from which we are interested in
obtaining the mean into 'Variables' (see Figure 5 below).
Figure 5. Inserting variables in the 'Descriptives' module in JASP.

The output will show the number of participants in each group ('Valid'), and the
'Mean'. Below, the values for 'Minimum' and 'Maximum' are also provided. These
calculations are equally relevant since they allow us to identify the tendency in the
scores.

Figure 6. JASP output for 'Descriptives' with the mean score.


Descriptive Statistics (Writing_Pre)

            5P       6P
Valid       19       21
Missing     0        0
Mean        6.526    6.643
Minimum     4.500    4.000
Maximum     8.500    10.000
One of the earliest conclusions an L2 education researcher may draw from the means
and the minimum/maximum is that, while the minimum scores were very close, higher
maximum scores were obtained in Primary 6. This, however, does not provide
us with further valuable information about how these scores are distributed.
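The same descriptives can be reproduced outside JASP. This Python sketch uses invented scores (not the actual Writing_Pre data) to show how the mean, minimum, and maximum are obtained per group:

```python
from statistics import mean

# Invented scores standing in for Writing_Pre, split by group
scores = {
    "5P": [6.5, 5.0, 7.0, 4.5, 8.5, 6.0, 7.5],
    "6P": [6.5, 10.0, 4.0, 7.0, 6.0, 8.0, 5.5],
}

# One row per group: mean, minimum, and maximum, as in the JASP output
for group, values in scores.items():
    print(group, round(mean(values), 3), min(values), max(values))
```

This mirrors the effect of the 'Split' box in JASP: the grouping variable partitions the data, and the descriptives are computed within each partition.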

Before continuing with the following measures of central tendency, the importance of confidence intervals must be highlighted. By definition, a confidence interval is a range of values, delimited by a lower and an upper limit, that is likely to contain the value of a specific population parameter. Confidence intervals are presented alongside the mean: M = 6.526, 95% CI [6.12, 6.93]. In this case, 95% indicates that we can be 95% confident that the mean of the population lies between 6.12 and 6.93.
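Although this book relies on JASP, the same mean and confidence interval can be checked programmatically. Below is a minimal Python sketch (the scores are hypothetical, and the SciPy library is assumed to be available):

```python
# Mean and 95% confidence interval for a set of hypothetical scores,
# mirroring what JASP's 'Descriptives' module reports.
import math
from statistics import mean, stdev
from scipy import stats

scores = [4.5, 5.0, 5.5, 6.0, 6.5, 6.5, 7.0, 7.5, 8.0, 8.5]  # hypothetical data

m = mean(scores)
se = stdev(scores) / math.sqrt(len(scores))       # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(scores) - 1)   # two-tailed 95% critical value
ci_lower, ci_upper = m - t_crit * se, m + t_crit * se
print(f"M = {m:.3f}, 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")
```

The interval is built from the t distribution, the usual choice when the population standard deviation is unknown and has to be estimated from the sample.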

Another measure of central tendency is the median. In mathematical terms, the median is the middle value, situated at the 50th percentile. To compute the median, scores are arranged from the lowest to the highest. As can be observed in Figure 7, the medians for both groups are the same, which indicates that there is little difference between Primary 5 and Primary 6.

Figure 7. JASP output for 'Descriptives' with the mean and median scores.
Descriptive Statistics
Writing_Pre
5P 6P
Valid 19 21
Missing 0 0
Median 6.500 6.500
Mean 6.526 6.643
Minimum 4.500 4.000
Maximum 8.500 10.000

Finally, the mode is another measure of central tendency, although its use is not as widespread as that of the previous measures. The mode is the most frequently occurring score in a distribution of scores (Tavakoli, 2013). Because of the reduced amount of information that it provides, the mode is not generally the preferred measure of central tendency in L2 education research. As can be observed in Figure 8, the mode points to the most frequently occurring scores in both Primary 5 and Primary 6. However, it is not as representative of the data as the mean and the median.

Figure 8. JASP output for 'Descriptives' with the mode scores.


Descriptive Statistics
Writing_Pre
5P 6P
Valid 19 21
Missing 0 0
Mode ᵃ 5.000 6.500
Minimum 4.500 4.000
Maximum 8.500 10.000
ᵃ More than one mode exists, only the first is reported

When dealing with categorical variables (nominal scales), the mode would be the most
appropriate measure of central tendency. In L2 education research, a common
categorical variable would be L2 proficiency as expressed with the Common European
Framework of Reference for Languages (CEFRL).

Measures of Dispersion
Measures of dispersion (or variability) supplement in a relevant manner the rich information that measures of central tendency such as the mean or the median provide. Measures of dispersion clarify the amount of variability among the scores on the variables of interest within the sample. If there is a wide spread of scores, there will thus be a large dispersion in the data.

One of the most common measures of variability is the standard deviation. In essence, this measure reflects how much the individual scores in a distribution differ, on average, from the mean of that distribution. The standard deviation is calculated as the square root of the variance (see later). Usually, the standard deviation accompanies the mean when reporting results.

Figure 9. JASP output for the mean and standard deviation (Std. Deviation).
Descriptive Statistics
Writing_Pre
5P 6P
Valid 19 21
Missing 0 0
Mean 6.526 6.643
Std. Deviation 1.307 1.468
As can be observed in Figure 9, a standard deviation that is small and close to 0 points to the absence of large variations within the data of the sample. The L2 researcher may inspect the standard deviation to gain more insight into how different, for instance, the scores in a specific class are. In this case, there seems to be little variability.

Another measure of dispersion is the variance, which is calculated "by summing the
squared deviations of the data values about the mean" (Tavakoli, 2013, p. 701), and
then dividing it by the number of participants minus one. Put another way, the variance is the squared value of the standard deviation. If the variance is large, the observations are more scattered on average (Urdan, 2017). As may be observed in Figure 10, the variance is larger in Primary 6.

Figure 10. JASP outcome with the mean and the variance.
Descriptive Statistics
Writing_Pre
5P 6P
Valid 19 21
Missing 0 0
Mean 6.526 6.643
Variance 1.708 2.154

Nevertheless, this measure of dispersion is not used as commonly as the standard deviation. In general terms, it forms part of the calculation of other statistical analyses, such as ANOVA (Urdan, 2017).
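The relationship between the two measures can be made concrete with a few lines of Python (the scores are hypothetical): the variance is the sum of squared deviations from the mean divided by n − 1, and the standard deviation is its square root.

```python
# Sample variance (n - 1 denominator) and standard deviation,
# matching the sample-based values JASP reports.
from statistics import mean

scores = [4.5, 6.0, 6.5, 7.0, 8.5]  # hypothetical data
m = mean(scores)
variance = sum((x - m) ** 2 for x in scores) / (len(scores) - 1)
sd = variance ** 0.5
print(variance, sd)
```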

Finally, another common measure of dispersion used together with the median is the
interquartile range (IQR), which is the difference between the score marking the 75th
percentile (Q3) and the score marking the 25th percentile (Q1). The formula for this
measure of dispersion is IQR = Q3 – Q1.
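The IQR = Q3 − Q1 formula can be sketched with NumPy on hypothetical scores (note that JASP may apply a slightly different quantile rule than NumPy's default linear interpolation):

```python
# Median and interquartile range for a hypothetical score distribution.
import numpy as np

scores = np.array([4.0, 5.0, 6.0, 6.5, 6.5, 7.0, 8.0, 9.0, 10.0])  # hypothetical
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # difference between the 75th and 25th percentiles
print(median, iqr)
```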

Figure 11. JASP outcome for the median and the IQR.
Descriptive Statistics
Writing_Pre
5P 6P
Valid 19 21
Missing 0 0
Median 6.500 6.500
IQR 2.250 1.500

The use of the IQR is most appropriate for ordinal-scaled test scores, and it is generally more adequate with non-normal distributions (Larson-Hall, 2010; Tavakoli, 2013). However, when the data are normally distributed, the IQR does not appear to be the most appropriate option. For instance, the data in both Primary 5 and Primary 6 for the variable 'Writing_Pre' (see Figure 11 above) are normally distributed. If we compare the information provided by the IQR with that provided by the SD, we will observe that it does not lead to the same interpretation. That is why one has to be particularly attentive to the dichotomy between normal and non-normal distributions.

Distributions
Before performing any statistical analyses on the data, one of the assumptions that has to be checked is whether or not the data are normally distributed. A normal distribution implies that the frequency of scores rises smoothly from a small number of scores in the tails (i.e. the extremes) to a higher number of scores in the middle of the distribution. In a normal distribution, the data fall symmetrically around the mean, at distances above and below it that are measured in standard deviations. There are a number of graphical representations – besides the standard deviation itself – that allow for observing how the data are distributed. These are called distribution plots, and they offer a more visual account of the data and the scores of a specific variable.

Figure 12. Normal distribution for score "Writing_Pre" in Primary 6.

As can be observed in Figure 12, the data are normally distributed since both tails contain fewer scores than the middle of the distribution. Conversely, in Figure 13 below, the distribution is not normal: the values accumulate on both sides and do not produce a bell-shaped curve.

Figure 13. Non-normal distribution for score "Writing_Post" in Primary 6.

However, the assumption of normality may also be checked with a statistical test such as the Shapiro-Wilk test. This test should only be used when the sample contains fewer than 50 subjects. In JASP, this test of normality may be performed under the 'Descriptives' module, in the 'Distribution' section.

Figure 14. 'Statistics' section in the 'Descriptives' module in JASP.

Once the checkbox of the Shapiro-Wilk test is marked, JASP gives the following result,
as in Figure 15:

Figure 15. JASP outcome for descriptives and Shapiro-Wilk test.


Descriptive Statistics
Writing_Pre
5P 6P
Valid 19 21
Missing 0 0
Mean 6.526 6.643
Std. Deviation 1.307 1.468
Variance 1.708 2.154
Shapiro-Wilk 0.924 0.962
P-value of Shapiro-Wilk 0.136 0.555
Minimum 4.500 4.000
Maximum 8.500 10.000

The results of the Shapiro-Wilk test may be interpreted as follows. Firstly, the value provided in the Shapiro-Wilk row is a W value, whose size indicates whether the data are normally distributed or not. When this value falls clearly below 1 (e.g. in the low 0.9s or lower), it is very likely that the sample corresponding to this variable is not normally distributed. The value in the 'P-value of Shapiro-Wilk' row corresponds to the p-value. Although this will be explained in depth in the forthcoming chapters, the p-value indicates the probability of obtaining a statistic (here, the W value) at least as extreme as the one observed if the data were in fact normally distributed. When the p-value is below 0.05, the test is statistically significant, and in the case of the Shapiro-Wilk test, the data would not be normally distributed. Should the p-value be over 0.05, the result would not be statistically significant. Relying on the results in Figure 15 above, the Shapiro-Wilk tests are not statistically significant (that is, the p-values are above 0.05), and thus the data are normally distributed.
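Outside JASP, the same test is available in SciPy. The sketch below runs Shapiro-Wilk on simulated (hypothetical) class scores and applies the decision rule described above:

```python
# Shapiro-Wilk normality test on simulated scores (n < 50).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(loc=6.5, scale=1.3, size=21)  # simulated class scores

w, p = stats.shapiro(scores)
if p > 0.05:
    print(f"W = {w:.3f}, p = {p:.3f}: normality not rejected")
else:
    print(f"W = {w:.3f}, p = {p:.3f}: data deviate from normality")
```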

When the sample size is over 50 participants, normality must be checked with the
Kolmogorov-Smirnov test. In JASP, this normality test is found in the 'Distribution' >
'Normal' module. Once there, go to section 'Assess Fit' and click on 'Kolmogorov-
Smirnov' in 'Statistics', as in Figure 16 below.

Figure 16. Normality checks in the 'Distribution' module.

Bear in mind that, before choosing the corresponding statistical test, you must first enter the descriptives – mean and variance – as in Figure 17.

Figure 17. JASP module 'Show Distribution' to introduce the mean and the variance.

When the data are introduced and the statistical tests are selected (see Figure 18), JASP will yield this result:

Figure 18. JASP outcome for both normality tests.


Fit Statistics - Writing_Pre
Test Statistic p
Kolmogorov-Smirnov 0.104 0.785
Shapiro-Wilk 0.975 0.515
As observed, the Kolmogorov-Smirnov test is not statistically significant, and thus the data are normally distributed. This indicates that the distribution of scores in the pre-test for Writing in Primary 6 follows a normal distribution. Nevertheless, the assumption of normality does not only serve to identify whether the data in a specific variable of the sample are normally distributed; it also leads the researcher to decide whether to perform parametric or non-parametric statistical tests (see Chapters 3 and 4). In general terms, when the data are normally distributed and the sample size is above 20, parametric tests should be used. Conversely, when the data are not normally distributed and the sample size is small (< 20), it is preferable to perform non-parametric tests.
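For larger samples, SciPy also provides the Kolmogorov-Smirnov test. The sketch below (simulated data) compares the scores against a normal distribution with the sample's own mean and SD; note that estimating the parameters from the data makes the standard KS p-value only approximate (Lilliefors' correction is the stricter alternative):

```python
# Kolmogorov-Smirnov check against a fitted normal distribution (n > 50).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
scores = rng.normal(loc=6.6, scale=1.5, size=60)  # simulated scores

d, p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))
print(f"D = {d:.3f}, p = {p:.3f}")
```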

Graphical representations of the data


Besides distribution plots, numerous basic plots are used to represent the data. In
JASP, plots can be selected in the 'Descriptives' module. In what follows, we will
explain some of these plots, and what their purpose is.

Figure 19. 'Plots' section in the 'Descriptives' module.

While distribution plots present a visual description of the distribution of the data (see
Figures 19 and 20), box plots provide us with valuable information about outliers.

Figure 20. Boxplots for the variable 'Writing_Pre' in Primary 5 and 6.

To interpret the information provided in these boxplots, Goss-Sampson (2020) offers a clear explanation (see Figure 21). As can be observed, the wider the box, the more variability exists within the data. The black line represents the median. Looking back at Figure 20 above, variability is higher in Primary 5. Additionally, there is an outlier in Primary 6 – the isolated dot above the maximum.

Figure 21. Explanation of what boxplots show (Goss-Sampson, 2020, p. 25).

Another graphical representation is the Q-Q plot (quantile-quantile plot), which plots the quantiles of the data against the quantiles of the normal distribution. The information is visually presented as a scatter plot (see Figure 22 below). When the distribution is normal, the points form a straight line at a 45-degree angle. However, even if Q-Q plots can be revealing about the distribution of the data, it is always advisable to test for normality using the appropriate statistical tests (i.e. Kolmogorov-Smirnov or Shapiro-Wilk).

Figure 22. Q-Q plot for 'Writing_Pre' of Primary 6.

Another way of presenting data visually is the interval plot, which allows us to observe and compare confidence intervals for the group means. The mean is represented by the dot (see Figure 23), while the lines represent the range of values likely to include the population mean. Elongated lines point to a wide spread within the groups.

Figure 23. Interval plot.

The dot plot is another type of visual representation of the data, used to display the distribution of continuous variables. As can be seen in Figure 24 below, the horizontal X-axis represents the values of the variable, and each dot represents a specific number of observations, so that stacked dots reflect frequencies. The usefulness of dot plots lies in the visualization of the shape and spread of the data, and they serve a purpose similar to that of histograms. In the case of Figure 24 below, the dot plot shows that values were concentrated around '5' and '7.5'.

Figure 24. Dot plots.

Chapter 3

PARAMETRIC STATISTICS
Throughout this chapter, the tests used in parametric statistics will be presented. One of the most common parametric tests is the t-test, which compares two means to observe whether they are significantly different from each other. The t stands for the t family of distributions, whose shape depends heavily on the size of the sample (through the degrees of freedom).

One Sample T-Tests


The first of the parametric tests is the one sample t-test, which is used to determine whether the sample mean is different from a specific value in the population. Generally, one sample t-tests are used with continuous data, and the sample is randomly drawn from a normal population.

Before performing this type of test, there are a number of assumptions that need to be complied with: (1) the data are continuous; (2) the observations are independent; (3) the data are normally distributed; and (4) no outliers are present.

Back to the initial definition, a one sample t-test is used when the researcher wants to compare the mean of a specific sample to a population mean. For instance, suppose we want to compare the scores of A2 level (CEFRL) students from all Primary classes on the writing module of the Cambridge: Key exam with the scores of A2 level students in one particular class. To do that, we select 30 students from different classes, all of whom have an A2 command of English and have sat mock tests for the writing part of the Cambridge: Key exam. We calculate the scores obtained by these students to obtain a population mean, and then calculate the test scores of the particular class.

To perform this statistical test in JASP, click on 'T-Tests' and select 'Classical' > 'One Sample T-Test'. Subsequently, the variable of interest has to be added to the 'Variables' box, and in the 'Test' section, the 'Test value' – the population mean – must be indicated (see Figure 25 below):

Figure 25. Test value.

Once this is selected, and the corresponding population mean is introduced, JASP
yields the following results:

Table 1. JASP outcome for one sample T-test along with the descriptives.
One Sample T-Test
95% CI for Cohen's d
t df p Cohen's d Lower Upper
Writing_Pre -5.325 27 < .001 -1.006 -1.457 -0.543
Note. For the Student t-test, effect size is given by Cohen's d .
Note. For the Student t-test, the alternative hypothesis specifies that the mean is different from 7.8.
Note. Student's t-test.

Descriptives
N Mean SD SE
Writing_Pre 28 6.875 0.919 0.174

There are a number of aspects to be highlighted here: (1) p-value. As anticipated in the previous chapter, the p-value is core to hypothesis testing. Through this one sample t-test, we tested the null hypothesis that the sample mean was equal to the population mean. As the p-value is below 0.05, the null hypothesis is rejected, indicating that both means are different (p < .001); (2) Cohen's d. Although the
the relevance of effect sizes is going to be put under scrutiny in Chapter 7, an effect

32
size measures the magnitude of the effect. In this case, the comparison between the
sample mean and the population mean indicates that the difference is large (d= –1.006);
(3) df stands for degrees of freedom, that is, the number of independent units of information in the sample whose values are free to vary when the statistic (e.g. t, F, or Z) is calculated. Degrees of freedom measure the amount of information available to estimate population parameters; (4) SE stands for standard error, a statistic that indicates the degree to which the computed sample statistic may differ from the population parameter. In essence, the standard error provides the researcher with useful information about the accuracy of the estimate: the smaller the standard error, the better the sample statistic estimates the population parameter.

Another important aspect to bear in mind for publication purposes is how to report these results following citation guidelines, such as those of the American Psychological Association (APA):

A one sample T-Test showed that participants having an A2 level in Primary 5 scored
significantly lower in the writing test than the overall A2 level students in the school (t
(27) = –5.325, p<.001).
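As an illustration outside JASP, SciPy's `ttest_1samp` performs the same test. The sketch below uses simulated (hypothetical) class scores and the test value 7.8 from the example above:

```python
# One sample t-test against a hypothesized population mean of 7.8.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
class_scores = rng.normal(loc=6.9, scale=0.9, size=28)  # hypothetical scores

t, p = stats.ttest_1samp(class_scores, popmean=7.8)
d = (class_scores.mean() - 7.8) / class_scores.std(ddof=1)  # Cohen's d
print(f"t({len(class_scores) - 1}) = {t:.3f}, p = {p:.4f}, d = {d:.3f}")
```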

Paired Samples T-Tests
The second of the parametric tests is called paired (or dependent) samples t-test. In
this type of test, the values in one sample are related to the values in the other sample.
In other words, the individuals in both samples are related or equal. In L2 education
research, this type of parametric test is commonly used to verify, for instance, the
efficiency of some educational interventions or the implementation of certain
methodologies. Paired samples t-tests are used, in general terms, when individuals in the sample are measured at two points in time, that is, in experimental research designs including a pre-test and a post-test.

To access this test in JASP (see Figure 26), we have to click on 'T-Tests' > 'Paired
Samples T-Test'.

Figure 26. Accessing Dependent Samples T-Tests.

Before selecting the variables, we have to make sure that we have filtered out the group
of interest in case two groups (e.g. Primary 5 or 6; Control or Experimental Group) are
present in the research design.

Let us imagine that we are interested in testing how effective writing instruction is in
a Primary classroom. To check the effect of this intervention, participants perform a
pre-test and a post-test. In Figure 27, you can see how these variables are introduced
in the corresponding areas of JASP.

Figure 27. JASP interface for Paired Samples T-Test.

As anticipated in the previous subsection, the effect size is also requested in order to observe the magnitude of the effect. Figure 28 displays the outcomes for the paired samples t-test.

Figure 28. JASP outcome for paired samples t-test, descriptives, and normality test (Shapiro-Wilk
test).
Paired Samples T-Test
95% CI for Cohen's d
Measure 1 Measure 2 t df p Cohen's d Lower Upper
Writing_Pre - Writing_Post 3.139 18 0.006 0.720 0.206 1.219
Note. Student's t-test.

Test of Normality (Shapiro-Wilk)


W p
Writing_Pre - Writing_Post 0.981 0.949
Note. Significant results suggest a deviation from normality.

Descriptives
N Mean SD SE
Writing_Pre 19 6.526 1.307 0.300
Writing_Post 19 4.421 2.567 0.589

As discussed in the previous chapter, the sample meets the assumptions: the difference scores are normally distributed (see the Shapiro-Wilk test with a p-value of 0.949). Regarding the statistical test per se, it compares how different the pre-test (Writing_Pre) is from the post-test (Writing_Post). The p-value (p = 0.006) is below the 0.05 benchmark, which allows us to confirm that there is a change from pre-test to post-test. Cohen's d suggests a medium effect (d = 0.72; see Chapter 7 for further information). If we take a look at the descriptives, we may conclude that writing instruction did not contribute to improving the mean score. It seems to have had a counter-effect, since the score decreased (from 6.526 to 4.421).

It is important to bear in mind that Paired Samples T-Tests are used when (1) the
assumptions are met, and (2) when we are interested in comparing data from the same
sample at two different times.

The APA reporting of the results of the previous statistical test:

On average, participants scored less after the writing instruction. A paired samples t-
test showed this decrease to be significant (t (18)= 3.13, p = 0.006). Cohen's d suggests
that there is a medium effect (d = 0.72).
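The same analysis can be sketched in Python with SciPy's `ttest_rel`, here on simulated (hypothetical) pre/post scores from the same 19 learners:

```python
# Paired samples t-test on simulated pre- and post-test scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
pre = rng.normal(6.5, 1.3, size=19)
post = pre - rng.normal(2.0, 1.5, size=19)  # simulated drop after instruction

t, p = stats.ttest_rel(pre, post)
diff = pre - post
d = diff.mean() / diff.std(ddof=1)  # Cohen's d for paired data
print(f"t({len(pre) - 1}) = {t:.3f}, p = {p:.4f}, d = {d:.3f}")
```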

Independent Samples T-Tests


The independent samples t-test is one of the most common statistical tests used in L2 education research. This type of test allows the researcher to determine whether the means of two independent samples, for instance, two different classes, are equal or different. In order to perform an independent samples t-test, there are two initial requirements: (a) a categorical or nominal independent variable (e.g. Primary 5 and Primary 6), and (b) one or several continuous dependent variables (e.g. listening test scores). In some sense, the dependent variable depends on the value of the independent variable, which may cause the dependent variable to change.

For example, suppose that we propose an educational intervention in which one of the
groups (experimental group) tries a series of listening activities using a newly
conceived app. The other group – the control group – follows the traditional teaching
approach to listening. To verify whether the listening activities have had an effect on
the listening scores in the experimental group, an independent t-test is performed. As
seen in Figure 29 below, the variables of interest have to be introduced in the
corresponding section. Similarly, this type of test requires that the independent variable
is equally included (i.e. the grouping variable).

Figure 29. JASP interface to introduce the variables of interest and the grouping variable.

Once all the data are introduced, it is necessary to test the assumptions for these
statistical tests: (1) group independence; (2) normality of the dependent variables; and
(3) homogeneity of variance. While assumption (1) should be taken into consideration
before the data are processed, (2) and (3) are to be verified using statistical tests. In
JASP, both assumptions may be tested as in Figure 30:

Figure 30. JASP options for Assumption Checks in Independent Samples T-Tests.

When the statistical tests for these assumptions are performed, JASP provides the
following output:

Figure 31. Assumption Checks results (Independent Samples T-Test).
Test of Normality (Shapiro-Wilk)
W p
Listening_Post EG 0.933 0.197
CG 0.925 0.110
Note. Significant results suggest a deviation from normality.

Test of Equality of Variances (Levene's)


F df p
Listening_Post 0.884 1 0.353

Neither test yielded a statistically significant result, as may be observed in the p-values. This indicates that the assumptions of normality and equality of variances are met, which allows us to proceed with the t-test.

In Figure 32 below, the results of the t-statistics are presented along with the
descriptives.

Figure 32. JASP outcome for independent samples t-test, descriptives, and descriptive plot.
Independent Samples T-Test
 t df p Cohen's d
Listening_Post 1.028 38 0.311 0.325

Note. Student's t-test.


Group Descriptives
Group N Mean SD SE
Listening_Post EG 19 6.368 2.499 0.573
CG 21 5.476 2.943 0.642

As can be observed, the p-value indicates that there is no statistically significant difference between the groups (p = 0.311), and Cohen's d suggests a small effect (below the 0.40 benchmark). The group descriptives also indicate that, despite the apparent difference between the EG and the CG, no statistical difference is found between them.

Figure 33. JASP descriptive plot for independent samples t-test.

The descriptive plot as shown in Figure 33 above clearly depicts the tendency that the
EG obtained a higher score than the CG. However, this difference is not deemed
significant.

The APA reporting of the results of the previous statistical test:

The descriptives show that the experimental group performed better than the control
group in the listening post-test. However, an independent t-test showed that this
difference was not significant (t(38) = 1.028, p = 0.31), and Cohen's d suggests this is
a small effect (d= 0.32).
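The assumption check and the test itself can be reproduced in Python with SciPy, here on simulated (hypothetical) group scores:

```python
# Levene's test for equality of variances, then Student's independent t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
eg = rng.normal(6.4, 2.5, size=19)  # experimental group (simulated)
cg = rng.normal(5.5, 2.9, size=21)  # control group (simulated)

lev_stat, lev_p = stats.levene(eg, cg)
t, p = stats.ttest_ind(eg, cg)  # equal variances assumed (Student's t)
print(f"Levene p = {lev_p:.3f}; t({len(eg) + len(cg) - 2}) = {t:.3f}, p = {p:.3f}")
```

If Levene's test were significant, `stats.ttest_ind(..., equal_var=False)` would switch to Welch's t-test instead.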

One-Way ANOVA
Another parametric test of interest is the one-way analysis of variance, or one-way ANOVA. The main objective of this test is to compare two or more groups (i.e. the levels of the independent variable) on one dependent variable to observe whether the existing differences are statistically significant. With two groups, the purposes of one-way ANOVA are identical to those of an independent samples t-test, and the results yielded by both tests are equivalent. One may wonder when it is more appropriate to use one or the other. In fact, one-way ANOVA allows for comparing two or more groups, whereas with t-tests one would be forced to perform several separate comparisons (e.g. three t-tests for three groups), thus inflating the probability of finding spurious differences (Type I error).

The assumptions for one-way ANOVA are as follows:

1) The independent variable must be categorical.
2) The dependent variable must be continuous or scale-based.
3) The grouping variable should be independent.
4) The dependent variable should be normally distributed.
5) No outliers should be present.
6) Homogeneity of variance.

JASP offers an entire module for ANOVA tests, as can be observed in Figure 34 below:

Figure 34. ANOVA module in JASP.

Once it is selected, the independent and dependent variables should be introduced in


the corresponding areas:

Figure 35. ANOVA module in JASP: introducing the variables.

The one-way ANOVA yields the following results:

Figure 36. JASP outcome for one-way ANOVA.
ANOVA - Listening_Post
Cases Sum of Squares df Mean Square F p η²
Group 7.941 1 7.941 1.056 0.311 0.027
Residuals 285.659 38 7.517
Note. Type III Sum of Squares
Descriptives - Listening_Post
Group Mean SD N
5P 6.368 2.499 19
6P 5.476 2.943 21

Unlike in t-tests, where the t statistic is taken as reference, in ANOVAs the F statistic is used. As can be observed in Figure 36, the p-value is above 0.05, which means that there are no significant differences among the groups. The eta-squared (η² = 0.027) points to a small effect size (see Chapter 7 for further information on effect sizes).
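Both F and η² can be recomputed by hand from the sums of squares in the ANOVA table above, which is a useful sanity check on the output:

```python
# Recomputing F and eta-squared from the values in the ANOVA table:
# F = MS_group / MS_residuals, eta^2 = SS_group / SS_total.
ss_group, df_group = 7.941, 1
ss_resid, df_resid = 285.659, 38

f = (ss_group / df_group) / (ss_resid / df_resid)
eta_sq = ss_group / (ss_group + ss_resid)
print(round(f, 3), round(eta_sq, 3))
```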

To report one-way ANOVA following the APA guidelines, let us take a look at the
following example:

On average, students in Primary 5 performed better on the listening post-test than Primary 6. However, this difference was not statistically significant (F(1, 38) = 1.06, p = 0.31, η² = .027).
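The equivalence between a two-group one-way ANOVA and an independent t-test mentioned above can be verified directly (F = t²); the scores below are simulated and hypothetical:

```python
# With two groups, one-way ANOVA and Student's t-test give the same answer.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
g1 = rng.normal(6.4, 2.5, size=19)
g2 = rng.normal(5.5, 2.9, size=21)

f, p_anova = stats.f_oneway(g1, g2)
t, p_ttest = stats.ttest_ind(g1, g2)
print(round(f, 4), round(t ** 2, 4))          # F equals t squared
print(round(p_anova, 6), round(p_ttest, 6))   # identical p-values
```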

One-way ANOVA with three groups

Aside from using one-way ANOVA with two groups, in which case the outcome is the same as if we performed an independent t-test, the situation is different when three groups are used. This is because, besides the general F statistic, the differences between the three groups have to be located with what is called a post-hoc comparison, using a series of statistical procedures, as will be further explained.

Let us take an example in L2 education. Suppose we are interested in exploring whether the use of the Montessori methodology affects a CLIL and a NON-CLIL group differently. A control group is equally introduced so as to observe the potential effect. The measure is a continuous variable in which the L2 contents taught
in the Montessori methodology were tested.

The steps to include the variables in the ANOVA module in JASP are the same; however, it is necessary to select the post-hoc tests beforehand, as in Figure 37 below.

Figure 37. Post Hoc Tests in the ANOVA module.

A post-hoc test is a comparison used after the data have been analyzed and examined
(Tavakoli, 2013). When this comparison is statistically significant, then the researcher
examines the combination of means of the different groups. In other words, a post-hoc
test is a follow-up statistical test performed after a comparison of three or more groups
has yielded a significant F statistics. In our previous example, our independent variable
had more than two levels: CLIL, NO-CLIL, and CONTROL GROUP. Thus, as the F
statistics is significant, the subsequent stage would be to compare all the groups, that
is: CLIL vs NO-CLIL, CLIL vs CONTROL GROUP, NO-CLIL vs CONTROL
GROUP.

To do so, the researcher may opt for a wide variety of post-hoc tests: Tukey's test,
Bonferroni's test, or Holm's test. JASP offers several other options, but our focus will
be placed on those three post-hoc tests.

(1) Tukey's test (also Tukey HSD test) is a post-hoc test used for pairwise comparisons, and it works best when the group sizes are equal. As indicated by Tavakoli (2013), Tukey's test is more conservative as it reduces the likelihood of a Type I error. Nevertheless, it has less statistical power, although it is robust to nonnormality (Cohen et al., 2011; Mackey & Gass, 2005).

(2) Bonferroni test (also Bonferroni-Dunn test or Bonferroni adjustment) is a procedure to guard against Type I error with multiple significance tests by adjusting the alpha level (i.e. the significance threshold). In essence, the Bonferroni test lowers the conventional level of probability (that is, the p-value benchmark of 0.05) to control for Type I error. To do so, the Bonferroni adjustment divides the alpha level (e.g. α = 0.05) by the number of tests performed. The Bonferroni test is very conservative, although it may be used with equal and unequal group sizes so long as the variances within all groups are equal. Should the variances be unequal, the most appropriate test would be the Games-Howell test (see 'Type' in Figure 37 above).

(3) Holm test (also Holm-Bonferroni test) is a sequential method, based upon
Bonferroni, which is less conservative. In a stepwise manner, the Holm test computes
the significance levels depending on the P-value-based rank of hypotheses (Chen et al.,
2017).
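The Bonferroni and Holm procedures are simple enough to implement by hand, which makes their logic transparent. The sketch below adjusts three hypothetical raw p-values (equivalently, one can divide alpha instead of multiplying the p-values):

```python
# Bonferroni and Holm adjustments for three hypothetical pairwise p-values.
raw = [0.0004, 0.0700, 0.0210]  # hypothetical raw p-values for three comparisons

m = len(raw)
bonferroni = [min(1.0, p * m) for p in raw]  # multiply every p by the number of tests

# Holm: sort p-values, multiply the i-th smallest by (m - i),
# then enforce monotonicity so adjusted p-values never decrease.
order = sorted(range(m), key=lambda i: raw[i])
holm = [0.0] * m
running_max = 0.0
for rank, i in enumerate(order):
    adj = min(1.0, raw[i] * (m - rank))
    running_max = max(running_max, adj)
    holm[i] = running_max

print(bonferroni)
print(holm)
```

Note how Holm is less conservative: the middle p-value stays at 0.07 under Holm but rises to 0.21 under Bonferroni.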

In Figure 38 below, the result of the ANOVA parametric test is presented:

Figure 38. JASP outcome for one-way ANOVA with three groups.
ANOVA - Montessori
Cases Sum of Squares df Mean Square F p η²
Group 37.475 2 18.737 10.162 < .001 0.220
Residuals 132.764 72 1.844
Note. Type III Sum of Squares

As can be observed, the ANOVA yielded a statistically significant result (p < .001) with a large effect size (η² = 0.22).

Figure 39. JASP outcome for the post-hoc comparison tests.

Post Hoc Comparisons - Group
  Mean Difference SE t ptukey pbonf pholm
CG CLIL -1.719 0.384 -4.471 < .001 *** < .001 *** < .001 ***
CG NOCLIL -0.702 0.388 -1.808 0.174 0.224 0.075
CLIL NOCLIL 1.017 0.380 2.674 0.025 * 0.028 * 0.019 *
* p < .05, *** p < .001
Note. P-value adjusted for comparing a family of 3

Descriptives - Montessori
Group Mean SD N
CG 6.458 2.085 24
CLIL 8.177 0.847 26
NOCLIL 7.160 0.787 25

Figure 39 above shows the post-hoc comparisons between the three groups that have
been tested in the ANOVA parametric test. In order to illustrate the variations between
the post-hoc tests explained previously, we have opted to include all three corrections.
As can be observed, there is a statistically significant difference between CG and CLIL,
and between CLIL and NO-CLIL. Likewise, it is important to examine the Mean
Difference column, since it gives us an idea of how different the group means are.

Reporting of these results following the APA guidelines:

Independent one-way ANOVA showed a significant effect of the Montessori method on
three different classes (F(2, 72) = 10.16, p < .001, η² = 0.22).

Post-hoc testing using Tukey's correction revealed that the CLIL group obtained
significantly greater scores than the CG (p < .001) and the NO-CLIL group (p = .025).
There were no significant differences between NO-CLIL and CG (p = .174).
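As a check on this output, the F-ratio and η² can be recomputed from the summary statistics alone. A minimal Python sketch, using the group means, SDs, and sizes from the descriptives in Figure 39 (small rounding differences from JASP are expected):

```python
# (mean, sd, n) for each group, copied from the JASP descriptives (Figure 39)
groups = [(6.458, 2.085, 24),   # CG
          (8.177, 0.847, 26),   # CLIL
          (7.160, 0.787, 25)]   # NOCLIL

n_total = sum(n for _, _, n in groups)
grand_mean = sum(m * n for m, _, n in groups) / n_total

# between-groups SS from the group means, within-groups SS from the group SDs
ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups)
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)

df_between = len(groups) - 1          # 2
df_within = n_total - len(groups)     # 72

f_ratio = (ss_between / df_between) / (ss_within / df_within)
eta_sq = ss_between / (ss_between + ss_within)
print(round(f_ratio, 2), round(eta_sq, 2))  # → 10.16 0.22 (JASP: F = 10.162, η² = 0.220)
```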

Two-Way ANOVA
Another type of ANOVA parametric test is the two-way ANOVA (or factorial ANOVA),
in which the effect of two categorical independent variables, each with two or more
levels, is estimated on a single, continuous dependent variable (Tavakoli, 2013).
Two-way ANOVA also tests the interaction between these variables.

In the example that we propose, there are two independent variables: (1) group: CLIL
or CONTROL GROUP, and (2) Type of Learning: Cooperative Learning (CL) and
Project-Based Learning (PBL). In other words, both Factor 1 (Group) and Factor 2
(Type of Learning) have two levels. The two-way ANOVA tests two different
hypotheses (Goss-Sampson, 2020):

1) There is no significant main effect for either independent variable (between-subjects
effects).

2) There is no significant interaction effect (i.e. no significant group differences
across the conditions).

As with one-way ANOVA, two-way ANOVA also requires that a series of assumptions
are met:

1. The independent variables (factors) should be categorical, with two or more levels
each.
2. A continuous and normally distributed dependent variable.
3. Homogeneity of variance.
4. No significant outliers.

In JASP, the same ANOVA module should be used to perform this two-way ANOVA.
In this case, both independent variables – categorical in nature – should be included in
'Fixed Factors', while a single dependent variable must be selected (see Figure 40 below).

Figure 40. ANOVA module in JASP with the two independent variables and the dependent
variable.

Following the example we proposed previously, the dependent variable is continuous
and is related to the use of an app meant to help in language learning. In this case, one
of the questions that we aim to answer with two-way ANOVA is whether the use of
this app bears any effect.

After introducing the data into JASP, the statistical analyses yield the results displayed
in Figure 41.

Figure 41. JASP outcome for two-way ANOVA.


ANOVA - App
Cases Sum of Squares df Mean Square F p η²
Group 158.319 1 158.319 68.476 < .001 0.577
TypeLearning 7.937 1 7.937 3.433 0.070 0.029
Group ✻ TypeLearning 1.757 1 1.757 0.760 0.388 0.006
Residuals 106.353 46 2.312
Note. Type III Sum of Squares

Descriptives - App
Group TypeLearning Mean SD N
CG CL 6.538 1.761 13
 PBL 5.364 2.501 11
CLIL CL 9.731 0.388 13
 PBL 9.308 0.630 13

As can be observed, the ANOVA table shows a significant effect for Group (p < .001)
with a large effect size. There was no significant effect for Type of Learning (p = 0.070)
nor for the interaction between Group and Type of Learning (p = 0.388). This suggests
that the differences are not attributable to the type of learning, but rather to the
difference in Factor 1, i.e. the group.
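For a balanced design, the partitioning of sums of squares that underlies such a table can be sketched with a small set of hypothetical scores (two factors, two levels each, two observations per cell; these are illustration data, not the data analysed above):

```python
# hypothetical scores: data[(a, b)] = observations in the cell for
# Factor A level a and Factor B level b (balanced: 2 per cell)
data = {("A1", "B1"): [1, 3], ("A1", "B2"): [2, 4],
        ("A2", "B1"): [5, 7], ("A2", "B2"): [10, 12]}

def mean(xs):
    return sum(xs) / len(xs)

all_scores = [x for cell in data.values() for x in cell]
grand = mean(all_scores)
n_cell = 2  # observations per cell

# marginal means for each level of each factor
a_means = {a: mean([x for (ca, _), cell in data.items() if ca == a for x in cell])
           for a in ("A1", "A2")}
b_means = {b: mean([x for (_, cb), cell in data.items() if cb == b for x in cell])
           for b in ("B1", "B2")}

# main effects, interaction, and residual sums of squares
ss_a = sum(2 * n_cell * (m - grand) ** 2 for m in a_means.values())
ss_b = sum(2 * n_cell * (m - grand) ** 2 for m in b_means.values())
ss_ab = sum(n_cell * (mean(cell) - a_means[a] - b_means[b] + grand) ** 2
            for (a, b), cell in data.items())
ss_within = sum((x - mean(cell)) ** 2 for cell in data.values() for x in cell)

df_within = len(all_scores) - len(data)   # 8 observations - 4 cells = 4
ms_within = ss_within / df_within
f_a, f_b, f_ab = ss_a / ms_within, ss_b / ms_within, ss_ab / ms_within  # each effect has df = 1
print(f_a, f_b, f_ab)  # → 36.0 9.0 4.0
```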

Even though the interaction (Group ✻ TypeLearning) is not significant according to the
ANOVA, post-hoc tests can still reveal which specific pairs of cells differ. In
Figure 42, these post-hoc comparisons are presented.

Figure 42. JASP outcome for Post Hoc tests in two-way ANOVA.
Post Hoc Comparisons - Group ✻ TypeLearning
Mean Difference SE t ptukey pbonf pholm
CG CL CLIL CL -3.192 0.596 -5.353 < .001 < .001 < .001
CG PBL 1.175 0.623 1.886 0.248 0.394 0.131
CLIL PBL -2.769 0.596 -4.643 < .001 < .001 < .001
CLIL CL CG PBL 4.367 0.623 7.011 < .001 < .001 < .001
CLIL PBL 0.423 0.596 0.709 0.893 1.000 0.482
CG PBL CLIL PBL -3.944 0.623 -6.332 < .001 < .001 < .001
Note. P-value adjusted for comparing a family of 4

As seen, there are differences between CG CL vs CLIL CL (p < .001), CG CL vs CLIL
PBL (p < .001), CLIL CL vs CG PBL (p < .001), and CG PBL vs CLIL PBL (p < .001).
As L2 researchers, these post-hoc comparisons would surely help us gain more insight
into how the difference in type of learning may be related to belonging to a specific
group following a different approach (e.g. the CLIL group) or a traditional one (the
control group).

Reporting of these results following the APA guidelines:

A two-way ANOVA was used to examine whether the effect of using an App varied
depending on the type of learning used in a CLIL group or a traditional group. There
was a significant main effect for group (F(1, 46) = 68.476, p < .001, η² = .577).

Tukey's post-hoc correction showed that scores for cooperative learning in the CLIL
group were higher than those for cooperative learning and project-based learning in
the control group (mean difference = 3.192, p < .001 and mean difference = 4.367,
p < .001, respectively).

Repeated Measures ANOVA

One-way Repeated Measures ANOVA (RMANOVA)

Another ANOVA parametric test is one-way repeated measures ANOVA, in which
individuals within the same group are tested on three or more occasions, or under three
or more conditions. In L2 education research, this is particularly relevant since the
researcher may be interested in exploring how a given class improves a particular skill
longitudinally. To do that, several tests may be conducted over time. Through
repeated-measures ANOVA, the L2 education researcher may confirm or reject whether
these changes are due to the education intervention implemented in the classroom
environment.

Much as the previous ANOVA parametric tests, repeated-measures ANOVA equally
follows a series of assumptions that have to be met:

1. One continuous dependent variable (interval or ratio scale).
2. One categorical independent variable (two or more levels).
3. Normally distributed data in the dependent variable.
4. Homogeneity of variance.
5. Sphericity.

Like the rest of the ANOVA tests, it uses the F-statistic: if it is large, this may be
interpreted as the independent variable having a significant effect on the dependent
variable.

To illustrate one-way repeated measures ANOVA, suppose that we are interested in
observing how a CLIL class performs in a language test when being taught under three
different conditions: Montessori methodology (Montessori), Communicative
Language Teaching (CLT), and a language App (App). To test how these conditions
differ from each other, and whether the scores obtained in the tests are statistically
significant, open the ANOVA module in JASP and select 'Repeated Measures
ANOVA' (see Figure 43).

Figure 43. Repeated Measures ANOVA selection.

Then, the following interface will appear:

Figure 44. Interface to introduce the data for Repeated Measures ANOVA.

The appropriate dependent variables have to be introduced in the 'Repeated Measures
Cells'. Once this is done, it is important that 'Assumption Checks' are marked, since
these help us discern whether the assumptions are met and thus whether we may
proceed with the analysis. For instance, should the assumption of sphericity be
violated, as in Figure 45, the F-statistic has to be corrected using the
Greenhouse-Geisser epsilon. When the epsilon is < 0.75, the Greenhouse-Geisser
correction is preferred (Goss-Sampson, 2020).

Figure 45. Test of Sphericity (one-way Repeated Measures ANOVA).


Test of Sphericity
 Mauchly's W Approx. Χ² df p-value Greenhouse-Geisser ε Huynh-Feldt ε Lower Bound ε
RM Factor 1 0.301 28.831 2 < .001 0.589 0.600 0.500

As can be observed in Figure 45, the test of sphericity is statistically significant (p <
.001). Hence, the Greenhouse-Geisser correction has to be applied to the ANOVA test.
In Figure 46 below, two options are provided: 'None' (no correction applied) and the
'Greenhouse-Geisser' correction. The F-statistic is large, and the p-value of the
repeated-measures ANOVA test is statistically significant (p < .001). Hence, we may
proceed to check which combinations of conditions are significant through a post-hoc
comparison test.
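The corrected degrees of freedom that JASP displays in Figure 46 are simply the uncorrected ones rescaled by ε. A quick sketch using the Greenhouse-Geisser ε from Figure 45 (k = 3 conditions, n = 26 participants):

```python
def gg_corrected_df(epsilon, k, n):
    """Greenhouse-Geisser: multiply both within-subjects df by the epsilon estimate."""
    df_effect = epsilon * (k - 1)            # effect df: k - 1 conditions
    df_residual = epsilon * (k - 1) * (n - 1)  # residual df: (k - 1)(n - 1)
    return df_effect, df_residual

df1, df2 = gg_corrected_df(0.589, k=3, n=26)
print(round(df1, 3), round(df2, 2))  # → 1.178 29.45 (JASP: 1.177 and 29.426, from the unrounded ε)
```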

Figure 46. Within-Subjects effects.


Within Subjects Effects
Cases Sphericity Correction Sum of Squares df Mean Square F p
RM Factor 1 None 25.962 ᵃ 2.000 ᵃ 12.981 ᵃ 51.601 ᵃ < .001 ᵃ
Greenhouse-Geisser 25.962 1.177 22.057 51.601 < .001
Residuals None 12.578 50.000 0.252
Greenhouse-Geisser 12.578 29.426 0.427
Note. Type III Sum of Squares
ᵃ Mauchly's test of sphericity indicates that the assumption of sphericity is violated (p < .05).

Descriptives
RM Factor 1 Mean SD N
Montessori 8.177 0.847 26
CLT 9.231 0.620 26
App 9.519 0.556 26

In Figure 47 below, the post-hoc comparison tests (using the Bonferroni adjustment and
the Holm-Bonferroni test) are presented. They clearly show a statistically significant
difference between Montessori and CLT (p < .001), in which CLT obtained higher
scores. Likewise, the difference between Montessori and the App condition is
statistically significant (p < .001), once again with this last condition holding higher
values. The difference between CLT and App, by contrast, is only marginally
significant under the Holm correction (p = .043) and not significant under the
Bonferroni adjustment (p = .130); accordingly, the means are not considerably
different (CLT = 9.23, App = 9.52).

Figure 47. Repeated-measures ANOVA test (post-hoc comparison test).


Post Hoc Comparisons - RM Factor 1
Mean Difference SE t Cohen's d pbonf pholm
Montessori CLT -1.054 0.139 -7.576 -1.486 < .001 *** < .001 ***
App -1.342 0.139 -9.649 -1.892 < .001 *** < .001 ***
CLT App -0.288 0.139 -2.074 -0.407 0.130 0.043 *
* p < .05, *** p < .001
Note. Cohen's d does not correct for multiple comparisons.
Note. P-value adjusted for comparing a family of 3
Reporting of these results following the APA guidelines (Goss-Sampson, 2020):

Since Mauchly's test of sphericity was significant, the Greenhouse-Geisser correction
was used. This showed that the educational conditions differed significantly from one
another: F(1.177, 29.426) = 51.601, p < .001.

Post-hoc testing using the Bonferroni correction revealed that in the Montessori
condition, participants scored lower than in the CLT condition (mean difference =
–1.054, p < .001) and the App condition (mean difference = –1.342, p < .001).

ANCOVA
The analysis of covariance or ANCOVA is a statistical procedure that allows us to
observe group differences on a continuous dependent variable, with one or more
continuous independent variables – covariates – which are controlled for. In summary,
ANCOVA allows for one or more categorical independent variables, one continuous
dependent variable, and one or more covariates. Therefore, the covariate is, in essence,
another independent variable (Tavakoli, 2013).

A series of conditions should be met before performing an ANCOVA test: (1) a
moderate correlation should exist between covariates and the dependent variable, (2)
covariates are reliably measured, (3) if there are several covariates, low correlations
should exist among them, and (4) no correlation should exist between covariates and
grouping variables.
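The adjustment idea behind ANCOVA can be sketched with hypothetical data: estimate a pooled within-group slope, then move each group mean to the grand mean of the covariate. (This illustrates the logic of 'controlling for' a covariate, not JASP's full Type III computation.)

```python
# hypothetical data: y = post-test score, x = years learning the L2 (the covariate)
group_a = {"x": [1, 2, 3], "y": [2, 4, 6]}
group_b = {"x": [4, 5, 6], "y": [5, 7, 9]}

def mean(xs):
    return sum(xs) / len(xs)

def within_sums(g):
    """Within-group sums of squares (x) and cross-products (x, y)."""
    xbar, ybar = mean(g["x"]), mean(g["y"])
    sxx = sum((x - xbar) ** 2 for x in g["x"])
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(g["x"], g["y"]))
    return sxx, sxy

# pooled within-group regression slope of y on x
sxx = sum(within_sums(g)[0] for g in (group_a, group_b))
sxy = sum(within_sums(g)[1] for g in (group_a, group_b))
b = sxy / sxx

# adjust each group mean to the grand mean of the covariate
grand_x = mean(group_a["x"] + group_b["x"])
adj_a = mean(group_a["y"]) - b * (mean(group_a["x"]) - grand_x)
adj_b = mean(group_b["y"]) - b * (mean(group_b["x"]) - grand_x)
print(b, adj_a, adj_b)  # → 2.0 7.0 4.0
```

Note how the ordering of the groups reverses once the covariate is taken into account: group B's raw advantage (mean 7 vs 4) was carried entirely by its higher covariate values.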

To illustrate how ANCOVA may be used, the following research design will be
presented as an example:

Figure 48. ANCOVA research design.

JASP, as demonstrated in the previous analyses, offers a very user-friendly interface to
introduce the data of the variables into the corresponding place (see Figure 49 below):

Figure 49. ANCOVA interface in JASP.

Once this information is introduced, several features must be marked in JASP. Firstly,
in 'Display', it is important to mark both eta-squared and omega-squared as estimates
of effect size. Marking 'Descriptive statistics' will equally allow us to observe the extent
of the differences. Subsequently, in 'Post Hoc Tests', under 'Type', 'Effect size' should be
marked to observe the magnitude of the effect of the post-hoc comparison tests. As for
the 'Correction', mark Tukey, Bonferroni, and Holm.

As can be observed in Figure 50, the output for ANCOVA indicates that there are
statistically significant differences (p < .001 and p = .002, respectively) for both years
of learning the L2 and group, that is, the covariate and the independent variable.

Figure 50. JASP output for ANCOVA.


ANCOVA - Post_Test
Cases Sum of Squares df Mean Square F p η² ω²
YearsL2 15.062 1 15.062 22.443 < .001 0.275 0.260
Group 7.428 1 7.428 11.069 0.002 0.136 0.122
Residuals 32.214 48 0.671
Note. Type III Sum of Squares

Descriptives - Post_Test
Group Mean SD N
CLIL 9.231 0.620 26
NOCLIL 7.440 1.253 25

Then, a post-hoc comparison test is performed to observe where these statistically
significant differences lie. In Figure 51, there is a great deal of information to be drawn:
the p-values are statistically significant, and Cohen's d, i.e. the effect size, indicates a
large effect (d = 0.978).

Figure 51. JASP output for Post Hoc Comparisons - ANCOVA.


Post Hoc Comparisons - Group
Mean Difference SE t Cohen's d ptukey pbonf pholm
CLIL NOCLIL 0.961 0.289 3.327 0.978 0.002 ** 0.002 ** 0.002 **
** p < .01
Note. Cohen's d does not correct for multiple comparisons.

Another way to observe these results is by creating a descriptive plot that includes
confidence intervals. In Figure 52, a JASP-generated descriptive plot based on the
sample study presented for ANCOVA is shown. As seen, the CLIL group obtained
higher marks than the NOCLIL group which, in turn, had been learning the L2 for
fewer years than the CLIL group.

Figure 52. Descriptive plot with confidence intervals.

Reporting of these results following the APA guidelines (Goss-Sampson, 2020):

The covariate, years learning the L2, was significantly related to the post-test scores,
F(1, 48) = 22.443, p < .001, ω² = 0.260. There was also a significant effect of group
after controlling for years of L2 learning, F(1, 48) = 11.069, p = .002, ω² = 0.122.

Post hoc testing using Tukey's correction revealed that the CLIL group obtained higher
scores in comparison to the NO-CLIL group (p = .002).

MANOVA
The MANOVA parametric test, which stands for multivariate analysis of variance,
refers to a situation where several continuous dependent variables are included.
Researchers using MANOVA are interested in examining whether the combinations of
continuous dependent variables are significantly different from one group to the other.
In other words, MANOVA serves to identify which groups differ from each other, to
what extent, and on which dependent variables the differences exist (Tavakoli, 2013).

For instance, a researcher might want to examine whether students in a CLIL group
and a traditional (control) group, that is, an independent variable with two levels,
differed on the scores obtained under two education methods: communicative
language teaching (dependent variable 1) and an app-based teaching approach
(dependent variable 2).

The assumptions that must be taken into consideration when deciding on using
MANOVA are as follows:

1) Independence of observations (each participant's scores are independent of the
others').

2) Multivariate normality (i.e. normal distribution across the dependent variables).
3) Homogeneity of variance-covariance matrices.

These tests of assumptions may be checked in JASP, after going to ANOVA >
MANOVA, as can be seen in Figure 53.

Figure 53. Assumption checks for MANOVA.

The output of these assumptions tests in JASP provides the following information:

Figure 54. JASP outcome for tests of assumption.


Box's M-test for Homogeneity of Covariance Matrices
χ² df p
3.244 3 0.355

Shapiro-Wilk Test for Multivariate Normality


Shapiro-Wilk p
0.929 0.005

As Figure 54 shows, Box's M test is non-significant (p = 0.355), supporting
homogeneity of the covariance matrices, whereas the Shapiro-Wilk test is significant
(p = .005), signalling some deviation from multivariate normality. Once these
assumptions have been checked, it is time to select the appropriate test for our
MANOVA. JASP provides four different options: Pillai, Wilks, Hotelling-Lawley,
and Roy (see Figure 55).

Figure 55. Selection of tests for MANOVA.

A brief account follows of what each test provides the researcher with, and which one
would be more appropriate, or at least more common, in L2 education research.

a) Wilks' test is the most commonly used test in MANOVA, since its statistic can be
converted into an approximate F-statistic.
b) Hotelling-Lawley is another test statistic, which is not very common.
c) Pillai's test is, in general terms, the most robust, and its results are very similar
to those of the two previous tests.
d) Roy's test is more appropriate when vectors are collinear, but it does not present
a satisfactory approximation to the F-statistic.

Pillai's test will be used to illustrate an example. In this case, suppose that, as an L2
education researcher, you are interested in observing how two different groups (CLIL
and a control group) react to two different methodologies: CLT and App (the dependent
variables). After performing the analysis in JASP, Figure 56 displays the results:

Figure 56. JASP outcome for MANOVA: Pillai test.


MANOVA: Pillai Test
Cases df Approx. F TracePillai Num df Den df p
(Intercept) 1 940.986 0.976 2 47.000 < .001
Group 1 31.693 0.574 2 47.000 < .001
Residuals 48

As can be observed, there is a statistically significant multivariate effect of Group
across the two dependent variables (CLT and App), F(2, 47) = 31.69, p < .001;
Pillai's Trace = 0.574.

Subsequently, if we decide to follow up with some stepwise comparisons, it is possible
to display the ANOVA tables (in JASP, check the 'ANOVA tables' box in 'Additional
Options'). This allows us to observe in which dependent variables differences are
found, and whether these are statistically significant or not.

Figure 57. JASP 'ANOVAtables' for the dependent variables of the MANOVA analysis.
ANOVA: CLT
Cases Sum of Squares df Mean Square F p
(Intercept) 3136.320 1 3136.320 1779.149 < .001
Group 93.065 1 93.065 52.793 < .001
Residuals 84.615 48 1.763

ANOVA: App
Cases Sum of Squares df Mean Square F p
(Intercept) 3065.445 1 3065.445 1271.305 < .001
Group 154.565 1 154.565 64.101 < .001
Residuals 115.740 48 2.411

As can be observed in Figure 57 above, the p-value is below 0.05 in both cases, and
thus statistically significant. This will lead us to review the descriptives in order to
observe what the differences are:

Figure 58. Descriptive statistics for both dependent variables (CLT and App).
Descriptive Statistics
CLT App
CG CLIL CG CLIL
Valid 24 26 24 26
Missing 0 0 0 0
Mean 6.500 9.231 6.000 9.519
Std. Deviation 1.806 0.620 2.167 0.556
Minimum 3.000 8.000 2.000 8.500
Maximum 9.500 10.000 9.000 10.000

Parametric correlations
One of the most basic tests used for association among variables is the correlation
coefficient (Urdan, 2017). Correlations are statistical indices that point to the strength
and direction of the relationship between two variables. This relationship determines
how strongly a pair of variables is associated, but also whether this association is
considered statistically significant or not. Correlation coefficients are calculated on
quantifiable variables, that is, continuous or ordinal data, and are based upon observed
data that summarize qualities of a certain variable. When a researcher is interested in
using correlation coefficients, the first characteristic to pay attention to is the direction
of the correlation, that is, whether it is positive (+) or negative (–). A positive
correlation occurs when variable Y increases as variable X increases; this implies a
direct linear relationship in which the two variables move in the same direction. A
negative correlation, on the other hand, occurs when variable Y decreases as variable X
increases, that is, a change in one variable is associated with a change in the other
variable in the opposite direction.

Another important characteristic of correlation coefficients is that they are represented
by a number between –1 and +1; the value of a correlation coefficient cannot be less
than –1 or more than +1. The interpretation of the coefficient does not imply a
cause-and-effect relationship between the two factors: if we are interested in observing
the correlation between working memory and a test score, and the coefficient is .85, it
does not mean that test scores increase because working memory increases. It is simply
a relationship or association.

There are several correlation coefficients, and one of the most commonly used
parametric tests for correlations is Pearson's correlation coefficient, also represented
as r. The closer the r-statistic is to +1 or –1, the more strongly the two variables are
related. As with many other parametric tests presented in this chapter, Pearson's r can
only be performed when some assumptions are met: (1) data must be normally
distributed, and (2) the relationship between the variables must be linear.
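Pearson's r is simply the covariance of the two variables scaled by the product of their standard deviations; a minimal sketch with hypothetical scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

years = [2, 4, 6, 8]                  # hypothetical: years learning the L2
score = [5.0, 6.0, 7.0, 8.0]          # hypothetical: test scores
print(round(pearson_r(years, score), 6))        # → 1.0 (perfectly linear, increasing)
print(round(pearson_r(years, score[::-1]), 6))  # → -1.0 (perfectly linear, decreasing)
```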

In JASP, correlation coefficients may be tested in 'Regression' > 'Correlation'. Insert in
'Variables' the two or more variables of interest. Then, do not forget to tick the
following options:

Figure 59. Options to select in Correlation in JASP.

Suppose that we are interested in observing whether the number of years spent
learning English may be correlated with Anxiety levels and the test scores in an L2
test during a two-week intervention with the Montessori methodology.

When the variables are introduced in the 'Variables' box, a series of tables of results
appear. In Figure 60, correlations are presented pairwise, while in Figure 61 they
appear in a correlation matrix. As can be observed, although the correlation coefficients
are statistically significant for Years L2 and Montessori as well as for Anxiety Levels
and Montessori (p = .007 and p = .025, respectively), the coefficients themselves are
weak (r = .310 and r = .259). This indicates that more years learning the L2 is
positively associated with higher scores under the Montessori methodology, and that
higher Anxiety Levels are likewise associated with higher Montessori scores.

Figure 60. Pairwise display of correlations.


Pearson's Correlations
n Pearson's r p
YearsL2 - Anxiety_Levels 75 -0.013 0.912
YearsL2 - Montessori 75 0.310 ** 0.007
Anxiety_Levels - Montessori 75 0.259 * 0.025
* p < .05, ** p < .01, *** p < .001

Figure 61. Correlation matrix.


Pearson's Correlations
Variable YearsL2 Anxiety_Levels Montessori
1. YearsL2 n —
 Pearson's r —
 p-value —
2. Anxiety_Levels n 75 —
 Pearson's r -0.013 —
 p-value 0.912 —
3. Montessori n 75 75 —
 Pearson's r 0.310 ** 0.259 * —
 p-value 0.007 0.025 —
* p < .05, ** p < .01, *** p < .001

Another way of observing the correlations in a visual manner is through a heatmap
which, as can be seen in Figure 62 below, shows the strength and direction of these
correlations at a glance.

Figure 62. Heatmap for Pearson's r coefficients.

Proportions, Chi-Squared Test, and Contingency Tables
In L2 education research, another way of gathering data and conducting research
studies is through surveys, which are generally used to obtain valuable information
about students' perceptions. The statistical analysis of these data tends to be subject to
our research interests, but the most common procedures include proportions and
contingency tables together with chi-squared tests.

In the case of proportions, these allow us to observe how the responses to each survey
item are distributed. JASP offers the possibility to observe these data in 'Descriptives',
by selecting the 'Frequency tables' option:

Figure 63. Selecting 'Frequency tables' in the Descriptive module.

It is important to remember that the variables (in this case, the survey items or
questions) have to be introduced into the boxes. Subsequently, JASP offers the
following output for proportions:

Figure 64. JASP output for frequency tables.


Frequencies for P_1
Group P_1 Frequency Percent Valid Percent Cumulative Percent
First Year Very Little 2 4.651 4.651 4.651
Little 6 13.953 13.953 18.605
Normal 22 51.163 51.163 69.767
A lot 9 20.930 20.930 90.698
Quite a lot 4 9.302 9.302 100.000
Missing 0 0.000
Total 43 100.000
Second Year Very Little 0 0.000 0.000 0.000
Little 4 11.429 11.429 11.429
Normal 10 28.571 28.571 40.000
A lot 11 31.429 31.429 71.429
Quite a lot 10 28.571 28.571 100.000
Missing 0 0.000
Total 35 100.000
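The Percent and Cumulative Percent columns above are straightforward to reproduce. A sketch using the First Year counts from Figure 64:

```python
# observed counts for survey item P_1, First Year group (Figure 64)
labels = ["Very Little", "Little", "Normal", "A lot", "Quite a lot"]
counts = [2, 6, 22, 9, 4]

total = sum(counts)
cumulative = 0.0
for label, count in zip(labels, counts):
    percent = 100 * count / total      # share of responses for this option
    cumulative += percent              # running total across options
    print(f"{label:12s} {count:3d} {percent:7.3f} {cumulative:8.3f}")
```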

Selecting frequency tables in the Descriptives module only provides us with a
descriptive view of the responses given by the students for each question. Nevertheless,
in order to observe whether there are differences in a survey item across a grouping
variable, the use of the chi-squared test (χ²) is necessary. The chi-squared test is a
nonparametric test of significance: it compares the observed frequencies with the
frequencies expected under the null hypothesis to discern whether they differ in
statistical terms. As a note of caution, the chi-squared test can only be used when there
is a between-group comparison, that is, when an independent or grouping variable is
present in the research design. To perform this statistical test in JASP, go to the
Frequencies module, and then select 'Contingency Tables'.

Figure 65. Frequencies module and options.

Figure 66. Cells selection of percentages.

Thus, JASP will generate the output shown in Figure 67. As seen, the percentage for
each option of this survey question is displayed, as well as the information per group.
This information is valuable for descriptive procedures in L2 education research,
especially when we aim to discern tendencies.

Figure 67. Contingency table for a survey question divided by the grouping variable.
Contingency Tables
Group
P_1 First Year Second Year Total
Count 2.000 0.000 2.000
Very Little
% within column 4.651 % 0.000 % 2.564 %
Count 6.000 4.000 10.000
Little
% within column 13.953 % 11.429 % 12.821 %
Count 22.000 10.000 32.000
Normal
% within column 51.163 % 28.571 % 41.026 %
Count 9.000 11.000 20.000
A lot
% within column 20.930 % 31.429 % 25.641 %
Count 4.000 10.000 14.000
Quite a lot
% within column 9.302 % 28.571 % 17.949 %
Count 43.000 35.000 78.000
Total
% within column 100.000 % 100.000 % 100.000 %

The chi-squared test performed on the results of survey question 1 is shown in Figure
68 below. As can be seen, the p-value is above 0.05 (p = 0.062), which indicates that
there are no significant differences between the groups.
Figure 68. Chi-squared results for contingency table 1.
Chi-Squared Tests
Value df p
Χ² 8.945 4 0.062
N 78

Let us report this result with APA:

The Χ² statistic (Χ²(4) = 8.945, p = .06) suggests that there is no significant
association between the students' answers to question one and the year of the degree
they are pursuing.
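That χ² value can be reproduced by hand from the observed counts in Figure 67: the expected counts come from the row and column totals, and the statistic sums the squared deviations. A sketch:

```python
# observed counts per response option: [First Year, Second Year] (Figure 67)
observed = [[2, 0], [6, 4], [22, 10], [9, 11], [4, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(row[j] for row in observed) for j in range(2)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # expected under independence
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (2 - 1)   # (rows - 1) * (columns - 1)
print(round(chi2, 3), df)  # → 8.945 4, matching Figure 68
```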

Chapter 4

NON-PARAMETRIC STATISTICS


Non-parametric statistics are used when parametric tests are not adequate because
certain assumptions are not met, for instance, when data are not normally distributed
or the sample has fewer than 30 participants. Throughout this chapter, a series of non-
parametric statistical tests will be presented along with directly related L2 education
research examples.

Wilcoxon Signed-Ranks Tests


The Wilcoxon signed-ranks test is the nonparametric alternative to the paired samples
T-test. It determines whether two sets of data or scores from the same group differ
from each other, but does so on the basis of ranks rather than means. In sum, the
Wilcoxon signed-ranks test requires a categorical independent variable with two
levels, and an ordinal dependent variable (Tavakoli, 2013). Unlike T-tests, this
non-parametric test ranks the pairs' differences by absolute size, and the ranks of the
differences with the same sign are then added together. Should there be no differences
between the scores of the two samples, the sum of positively ranked differences and
that of negatively ranked ones should be similar. If they are clearly different, it is very
likely that the two sets of scores differ significantly from each other.

The Wilcoxon signed-ranks test is more powerful against Type II error than other tests
(e.g. the sign test), since it considers both the magnitude and the direction of the
differences.
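The ranking logic can be sketched in plain Python (hypothetical pre/post scores; tied absolute differences share a mid-rank, and zero differences are dropped, as is conventional):

```python
def signed_rank_sums(pre, post):
    """Return (W+, W-): the sums of ranks of positive and negative differences."""
    diffs = [b - a for a, b in zip(pre, post) if b != a]  # drop zero differences
    abs_sorted = sorted(abs(d) for d in diffs)

    def midrank(v):
        # tied absolute differences share the average of their rank positions
        first = abs_sorted.index(v) + 1
        last = first + abs_sorted.count(v) - 1
        return (first + last) / 2

    w_pos = sum(midrank(abs(d)) for d in diffs if d > 0)
    w_neg = sum(midrank(abs(d)) for d in diffs if d < 0)
    return w_pos, w_neg

pre = [5, 6, 7, 8, 9]     # hypothetical pre-test scores
post = [6, 8, 10, 7, 14]  # hypothetical post-test scores
print(signed_rank_sums(pre, post))  # → (13.5, 1.5)
```

The lopsided rank sums (13.5 vs 1.5) are what signal a systematic increase from pre-test to post-test.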

In JASP, the Wilcoxon signed-ranks test has to be selected from the 'T-Tests' > 'Paired
Samples T-Test' module, as described in Figures 69 and 70 below. As observed, this is
a paired samples test, which requires a combination of two variables, such as pre-test
and post-test.

Figure 69. Pair of variables introduced.

In order to perform this test, as seen in Figure 70, a tick has to be placed next to
'Wilcoxon signed-rank'. For our example, and as a general recommendation, it is
advisable to tick 'Effect size' and 'Descriptives', given the rich information they provide
about the data to be analyzed. In our example, let us suppose that we want to explore
how a listening intervention helps increase the scores in a vocabulary test. To do so, a
pre-test/post-test research design is devised.

Figure 70. Wilcoxon signed-rank test selected.

Once everything is set, JASP gives the results of the analysis, as in Figure 71. The p-
value indicates that the difference is highly significant (p < .001). The effect size,
calculated with the rank-biserial correlation (rB), is interpreted in the same way as
parametric correlations. Hence, rB = –0.966 is a large effect size.
Figure 71. JASP output for Wilcoxon signed-ranks test.
Paired Samples T-Test
 95% CI for Rank-Biserial Correlation
Measure 1 Measure 2 W df p Rank-Biserial Correlation Lower Upper
Pre_Test - Post_Test 5.500 < .001 -0.966 -0.986 -0.919
Note. Wilcoxon signed-rank test.

Descriptives
N Mean SD SE
Pre_Test 26 8.177 0.847 0.166
Post_Test 26 9.231 0.620 0.122

The reporting of results following APA 7th would be as follows:

A Wilcoxon signed-rank test showed that learners significantly increased their scores
in the vocabulary post-test (M = 9.23) compared to the pre-test (M = 8.18) scores,
W = 5.50, p < .001.

Mann Whitney U Tests


The Mann-Whitney U Test is the non-parametric alternative to the Independent
Samples T-Test; it is a rank-based test used to compare two independent groups
according to their ranks with respect to the median. To perform this test, the researcher
needs one categorical independent variable with at least two levels, and one
ordinal/scale-based dependent variable.

The use of a Mann-Whitney U Test is justified when normality is violated (i.e. when
the data are not normally distributed) or when the homogeneity of variance assumption
does not hold. This nonparametric test is also preferred when the sample size is smaller
than 30 participants.

Similarly, there are some statistical considerations when using the Mann-Whitney U
Test (Tavakoli, 2013): if the sample has fewer than 20 participants, the smaller U value
is used to calculate statistical significance. When the sample is larger, that is, over 20
participants, the U value is converted into a Z value.
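The U statistic counts, over all between-group pairs of scores, how often the first group's score exceeds the second's (ties count one half), and the rank-biserial correlation follows directly from U. A sketch with hypothetical scores:

```python
def mann_whitney_u(group1, group2):
    """U for group1: pairwise wins over group2, with ties counted as 0.5."""
    u = 0.0
    for x in group1:
        for y in group2:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def rank_biserial(group1, group2):
    """rB = 2U/(n1*n2) - 1: +1 if group1 always wins, -1 if it always loses."""
    u = mann_whitney_u(group1, group2)
    return 2 * u / (len(group1) * len(group2)) - 1

print(mann_whitney_u([1, 4, 6], [2, 3, 5]))           # → 5.0
print(round(rank_biserial([1, 4, 6], [2, 3, 5]), 3))  # → 0.111
```

Sign conventions for rB vary across software; JASP's sign depends on which group is listed first, so only the magnitude should be interpreted as the effect size.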

Figure 72. JASP interface to select the tests and additional statistics.

In Figure 72 above, the JASP interface is presented. To select the Mann-Whitney U
Test, go to 'T-Tests' > 'Independent Samples T-Test'. Under 'Tests', select 'Mann-
Whitney'. It is also highly advisable to select 'Effect size'; 'Descriptives' and
'Descriptive plots' may also help in visually observing the extent of the differences
between both groups.

The example to illustrate the use of the Mann-Whitney U Test is as follows: let us
suppose that we are interested in discerning whether there are significant differences
between a CLIL group and a control group when implementing a listening program
centered on learning new vocabulary. To verify its effectiveness, a pre-test and a
post-test are conducted. In Figure 73, the results of the Mann-Whitney U Test are
presented.

Figure 73. JASP outcome for the Mann-Whitney U Test.


Independent Samples T-Test
W df p Rank-Biserial Correlation
Post_Test 42.500 < .001 -0.864
Note. For the Mann-Whitney test, effect size is given by the rank biserial
correlation.
Note. Mann-Whitney U test.

Group Descriptives
Group N Mean SD SE
Post_Test CG 24 6.500 1.806 0.369
CLIL 26 9.231 0.620 0.122

As can be observed, the Mann-Whitney U Test yielded a statistically significant result
(p < .001) with a large effect size (rB = –0.864). In this case, the descriptives reveal
that the CLIL group scored higher than the control group.

Figure 74. Raincloud plot for the pre-test in both groups.

Figure 74 represents a raincloud plot in which the data are visually represented. This
is a graphical way to present the data and comment on the implications they reveal.
In this case, the distribution of scores in the CLIL group is more concentrated than
in the control group (i.e. the CLIL group has a smaller standard deviation).

The reporting of results following APA 7th would be as follows:

A Mann-Whitney U test showed that the CLIL group scored higher in the vocabulary
post-test (M = 9.23) than the control group (M = 6.50), U = 42.50, p < .001,
rB = –0.864.
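Outside JASP, the same test can be reproduced with `scipy.stats.mannwhitneyu`. The sketch below uses made-up post-test scores (the group values are illustrative, not the book's data set), and derives the rank-biserial effect size from U; note that the sign convention for the rank-biserial correlation varies across software.

```python
import numpy as np
from scipy import stats

# Hypothetical post-test scores for two independent groups (illustrative values)
control = np.array([5, 6, 7, 6, 8, 5, 7, 6, 4, 8])
clil = np.array([9, 8, 10, 9, 9, 8, 10, 9, 9, 10])

# Two-sided Mann-Whitney U test; scipy returns the U of the first sample
u, p = stats.mannwhitneyu(control, clil, alternative="two-sided")

# Rank-biserial correlation derived from U (magnitude of the effect;
# the sign convention differs between packages)
n1, n2 = len(control), len(clil)
rank_biserial = 1 - (2 * u) / (n1 * n2)

print(f"U = {u:.1f}, p = {p:.4f}, r_B = {rank_biserial:.3f}")
```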

Friedman's Tests
Another very commonly used nonparametric test is Friedman's test (also Friedman's
two-way analysis of variance by ranks), which is a rank-based method for comparing
three or more related measurements taken from the same group of participants. In this
sense, the sample must be the same.
As with one-way repeated measures ANOVA, Friedman's test serves to compare three
or more related samples with ordinal scores. Likewise, Friedman's test should be used
when the assumptions of normality or homogeneity of variance are not satisfied, and
thus the parametric ANOVA alternative cannot be used. The calculation of Friedman's
test depends on the ranks within each case (Tavakoli, 2013).

In JASP, this test can be run by going to ANOVA > Repeated Measures ANOVA. The data
are introduced following the same procedure as for the parametric test.
Nevertheless, to run the nonparametric Friedman's test, it is necessary to expand the
Nonparametrics tab and move the Factor – which may be named or labeled according
to your research needs – to the RM Factor box. Should we be interested in observing
the differences between the groups, that is, the post-hoc comparison tests, the
'Conover's post hoc test' box must be ticked (see Figure 75).

Figure 75. Nonparametrics tab to enable the Friedman's test.

To illustrate this test, let us take a similar example to the one in repeated measures
ANOVA. In our study, we want to observe how three different L2 teaching
methodologies (i.e. Montessori, CLT and App-based) contribute to having an effect on
an intact Secondary class. To do so, data are gathered in three different moments:
firstly, students are taught with the Montessori methodology, and are then provided
with a test to check the contents learnt. This procedure is repeated for CLT and App-
based methodologies.

To verify these differences, a Friedman's test is performed (see Figure 76). As can be
observed, the test yielded a statistically significant result (p = 0.014). This indicates
that there are differences between the three methodologies.

Figure 76. JASP outcome for Friedman's tests.


Friedman Test
Factor Chi-Squared df p Kendall's W
RM Factor 1 8.575 2 0.014 0.179

In the case of repeated measures ANOVA, post-hoc comparisons were run with other
correction tests. For Friedman's test, Conover's test is used instead: a
nonparametric, rank-based procedure for pairwise comparisons following a significant
Friedman's test (Tavakoli, 2013). In Figure 77, we obtained Conover's post-hoc
comparisons for the three different methodologies. There are statistically significant
differences between Montessori and App (p = .031) and CLT and App (p = .008).

Figure 77. JASP outcome for Conover's post-hoc comparisons.


Conover's Post Hoc Comparisons - RM Factor 1
T-Stat df Wi Wj p pbonf pholm
Montessori CLT 0.556 46 51.500 55.000 0.581 1.000 0.581
App 2.223 46 51.500 37.500 0.031 0.093 0.062
CLT App 2.779 46 55.000 37.500 0.008 0.024 0.024
Note. Grouped by subject.
The reporting of results following APA 7th would be as follows:

The type of methodology has a significant effect on test scores, χ²(2) = 8.575, p = .014.
Pairwise comparisons showed that scores were significantly different between
Montessori and App (p = .031) and between CLT and App (p = .008).
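The omnibus test itself can be reproduced with `scipy.stats.friedmanchisquare`, which takes one array per condition measured on the same participants. The scores below are illustrative, not the book's data:

```python
from scipy import stats

# Hypothetical test scores for the same class under three methodologies
# (one value per student per condition; illustrative data)
montessori = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6]
clt = [7, 7, 6, 7, 8, 7, 6, 7, 7, 7]
app_based = [5, 6, 5, 5, 6, 5, 4, 6, 5, 5]

# Friedman's chi-squared statistic is computed from the within-subject ranks
chi2, p = stats.friedmanchisquare(montessori, clt, app_based)
print(f"chi2(2) = {chi2:.3f}, p = {p:.4f}")
```

scipy does not ship Conover's post-hoc test, so pairwise follow-ups would have to be computed separately (as JASP does internally).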

Kruskal-Wallis Tests
The Kruskal-Wallis test is a nonparametric alternative test to the one-way ANOVA
(that is, the independent samples ANOVA). It is used with ordinal data in a hypothesis
testing situation in a research design involving three or more independent groups of
participants. The Kruskal-Wallis test aims to determine whether the scores of three or
more unrelated groups differ significantly. Under the same premise as the rest of the
nonparametric tests, Kruskal-Wallis calculates the statistic through a rank-based
procedure. In sum, as Tavakoli (2013) states, the Kruskal-Wallis test is an extension
of the Mann-Whitney U Test. Unlike parametric tests, the Kruskal-Wallis test does not
require the normality and homogeneity assumptions; it only requires at least an
ordinal scaling of the dependent variable.

To run a Kruskal-Wallis test in JASP, go to ANOVA > ANOVA. Then, in the analysis
window, the data are introduced as if it were a standard parametric ANOVA. To
activate the nonparametric test, the Nonparametrics tab has to be opened, and the
independent variable must be moved to the box on the right (see
Figure 78):

Figure 78. Nonparametrics tab with Kruskal-Wallis Test.

To illustrate the use of the Kruskal-Wallis test, suppose that we are conducting a study
in which the Montessori methodology integrated into L2 learning is going to be
implemented in three different groups: CLIL, NO-CLIL, and a control group (CG). In
order to observe the potential differences between them, a Kruskal-Wallis test would
be the statistical response.

In Figure 79, the results of the test indicate that there are statistically significant
differences (p < .001) between the groups. However, the Kruskal-Wallis test does not
provide the post-hoc comparison tests by itself. These have to be requested as was done
in the case of Friedman's test (in summary, open the 'Post Hoc Tests' tab, select
Dunn's post-hoc type, and also select the Bonferroni and Holm corrections).

Figure 79. JASP outcome for Kruskal-Wallis test.


Kruskal-Wallis Test
Factor Statistic df p
Group 15.659 2 < .001

Descriptives - Montessori
Group Mean SD N
CG 6.458 2.085 24
CLIL 8.177 0.847 26
NOCLIL 7.160 0.787 25

Thus, a series of post-hoc comparison tests were run, and as shown in Figure 80 below,
the results were statistically significant for CG vs CLIL (p < .001) and CLIL vs NO-
CLIL (p = .002).

Figure 80. Post hoc comparison tests for Kruskal-Wallis tests (Dunn Type).
Dunn's Post Hoc Comparisons - Group
Comparison z Wi Wj p pbonf pholm
CG - CLIL -3.548 29.833 51.462 < .001 *** < .001 *** < .001 ***
CG - NOCLIL -0.326 29.833 31.840 0.372 1.000 0.372
CLIL - NOCLIL 3.253 51.462 31.840 < .001 *** 0.002 ** 0.001 **
** p < .01, *** p < .001

The reporting of results following APA 7th would be as follows:

The different groups were significantly affected by the implementation of the


Montessori methodology for L2 learning: H(2) = 15.659, p < .001. Pairwise
comparisons showed that the CLIL group obtained a higher score than the control
group (p <.001) and the NO-CLIL group (p = .002). There were no significant
differences between the control group and the NO-CLIL group (p= 1.00).
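For the omnibus H statistic, `scipy.stats.kruskal` accepts one array per independent group. A minimal sketch with illustrative scores (not the book's data):

```python
from scipy import stats

# Hypothetical post-test scores for three independent groups (illustrative values)
cg = [5, 6, 7, 6, 5, 6, 7, 5, 6, 4]
clil = [8, 9, 8, 9, 8, 7, 9, 8, 8, 9]
no_clil = [7, 7, 8, 6, 7, 7, 8, 7, 6, 7]

# H is computed from the ranks of the pooled scores, with a tie correction
h, p = stats.kruskal(cg, clil, no_clil)
print(f"H(2) = {h:.3f}, p = {p:.4f}")
```

As in JASP, a significant H only signals that at least one group differs; Dunn's pairwise comparisons would be run as a separate step.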

Nonparametric correlations
Much as it has been presented throughout this chapter, correlations have their
nonparametric alternatives as well. In this case, when the data has violated the
assumptions – that is, normality or variance – two different nonparametric correlation
alternatives can be used: Spearman's rho and Kendall's tau correlation coefficients.

Spearman's rho correlation coefficient (rho, ρ) is a nonparametric bivariate measure
of association that is used with two ordinal variables. In essence, it determines the
monotonic relationship between two variables: when both variables increase together,
the correlation is positive; when one increases as the other decreases, the
correlation is negative. The interpretation of Spearman's rho follows the same pattern
as Pearson's correlation. This nonparametric correlation should be used when
assumptions are violated, and also when sample sizes are small.

In the case of Kendall's tau, it is a test of rank correlation used with two ordinal
variables. This correlation coefficient is more adequate when there are ties in the
rankings. There are three forms of this measure (Cramer & Howitt, 2004; Larson-Hall,
2010):

1) Kendall's rank correlation tau A, which is used when there are no ties or tied
ranks.
2) Kendall's rank correlation tau B, which is used when there are ties or tied ranks.
3) Kendall's rank correlation tau C (also Kendall-Stuart Tau-c) is used when "the
table of ranks is rectangular rather than square as the value of tau c can come
closer to –1 or 1" (Tavakoli, 2013, p. 311).
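Both coefficients are available in scipy: `spearmanr` and `kendalltau` (the latter computes tau-b by default, which handles tied ranks). The paired scores below are illustrative, not taken from the book:

```python
from scipy import stats

# Hypothetical paired ordinal scores (illustrative values)
anxiety = [3, 4, 2, 5, 4, 3, 2, 5, 4, 3]
montessori = [8, 6, 9, 5, 6, 7, 9, 4, 6, 8]

rho, p_rho = stats.spearmanr(anxiety, montessori)
tau, p_tau = stats.kendalltau(anxiety, montessori)  # tau-b: corrects for ties

print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
print(f"Kendall tau-b = {tau:.3f} (p = {p_tau:.4f})")
```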

To illustrate how these correlations may be framed in L2 education research, let us
suppose that we are interested in observing how anxiety levels correlate with the
scores obtained under each methodology (Montessori, CLT, and App) in a CLIL group.
The strength of association will give us an idea of whether these dependent variables
might be related.

In JASP, the calculation of these nonparametric correlations is done in the same manner
as for parametric correlations. The only difference lies in the selection of these
nonparametric alternatives, as seen in Figure 81 below.

Figure 81. Selection of tests in the 'Correlation' module in JASP.

As can be seen in Figure 82 below, the correlation between the anxiety level and the
Montessori score is moderate and negative (rs = –0.488, p = .011). In terms of
correlations between methodologies, a moderate correlation exists between Montessori
and CLT (rs = 0.426, p = .030), and a high correlation between CLT and App
(rs = 0.918, p < .001).

Figure 82. JASP results of Spearman rho and Kendall Tau B.


Correlation Table
Spearman Kendall
n rho p tau B p
Anxiety_Levels - Montessori 26 -0.488 * 0.011 -0.428 ** 0.010
Anxiety_Levels - CLT 26 -0.289 0.152 -0.241 0.156
Anxiety_Levels - App 26 -0.148 0.472 -0.132 0.446
Montessori - CLT 26 0.426 * 0.030 0.357 * 0.028
Montessori - App 26 0.360 0.071 0.289 0.082
CLT - App 26 0.918 *** < .001 0.843 *** < .001
* p < .05, ** p < .01, *** p < .001

Aside from the correlation table generated by JASP, the statistics software equally
offers a heatmap where these correlations may be visually observed (see Figure 83
below). In the case of our example, it may be concluded that anxiety levels do not have
a strong association with the different teaching methodologies.

Figure 83. Heatmap for Spearman Rho and Kendall's Tau B.

Chapter 5

OTHER STATISTICAL ANALYSES

Linear Regression
Regression is a statistical technique that allows researchers to "examine the nature and
strength of the relations between variables, the relative predictive power of several
independent variables on a dependent variable" (Urdan, 2017, p. 183). There are two
types of regressions: simple linear regression (or bivariate regression) and multiple
regression.

Simple Linear Regression

Simple linear regression is a statistical test that predicts the value of a dependent
variable (the criterion) from one independent variable (also called the predictor).
This independent variable must be measured on an interval or ratio scale. A simple linear regression
entails that the researcher only examines one predictor variable and one criterion
variable. The purpose of the regression analysis, as noted by Urdan (2017), is to make
predictions about the values of the dependent variable which depend on certain values
of the predictor variable. In L2 education research, an example of a research design
whose research questions may be answered with a simple linear regression could be as
follows. We may wish to see the effect of the years spent studying an L2 (in this case,
English) on the scores of a test in a classroom following the Communicative Language
Teaching approach, in order to check whether test scores depend on the years studying
English. Hence, 'years studying English' is the predictor variable while 'CLT'
(test score) is the criterion variable. In JASP, there is a 'Regression' module, as shown
in Figure 84 below, in which 'Linear Regression' should be selected.
Figure 84. 'Regression' module.

Under the premise of the sample research design that we have presented previously,
these variables must be introduced, as can be seen in Figure 85. In this case, CLT (test
score) is introduced in the dependent variable as it is the outcome variable (dependent
variable), and YearsL2 ('years studying the L2') as the predictor variable, that is, the
covariate. Since we aim to illustrate a simple linear regression, only one covariate is
introduced.

Figure 85. Introduction of variables in the 'Regression' > 'Linear Regression' module.

Once the variables are introduced into the corresponding areas, the following output
will be generated by JASP:

Figure 86. Model Summary of Simple Linear Regression.
Model Summary - CLT
Durbin-Watson
Model R R² Adjusted R² RMSE Autocorrelation Statistic p
H₀ 0.000 0.000 0.000 1.721 0.661 0.620 < .001
H₁ 0.397 0.157 0.146 1.591 0.594 0.739 < .001

The table in Figure 86 above shows that the correlation (R) between both variables is
not high (0.397). In turn, the squared R (R²) indicates that years studying English
accounts for 15.7% of the variance in the Communicative Language Teaching test
score.

Figure 87. ANOVA table for the Simple Linear Regression.


ANOVA
Model Sum of Squares df Mean Square F p
H₁ Regression 34.508 1 34.508 13.641 < .001
Residual 184.672 73 2.530
Total 219.180 74
Note. The intercept model is omitted, as no meaningful information can be shown.

The ANOVA table above displays the sums of squares: regression corresponds to the
model, while residual corresponds to the error. The F-statistic is significant at
p < .001. Following the APA guidelines, this information should be reported as
F(1, 73) = 13.641, p < .001.

Figure 88. Coefficients table for Simple Linear Regression.


Coefficients
95% CI
Model Unstandardized Standard Error Standardized t p Lower Upper
H₀ (Intercept) 7.760 0.199 39.049 < .001 7.364 8.156
H₁ (Intercept) 2.025 1.564 1.295 0.199 -1.092 5.141
YearsL2 0.684 0.185 0.397 3.693 < .001 0.315 1.053

The information presented in the coefficients table (see Figure 88 above) provides the
coefficients (i.e. unstandardized) that are to be put into the linear equation:

y = c + b * x
These coefficients stand for the following information:

1) y = predicted score on the dependent (outcome) variable.
2) c = constant (in the table, the intercept).
3) b = regression coefficient (in the table, the unstandardized score for
YearsL2).
4) x = score on the independent (predictor) variable (of our choice).

Following the previous equation, the score that students in the Communicative
Language Teaching classroom are predicted to obtain after 0.5 years of study is the
following:

CLT (score) = 2.025 + (0.684 * 0.5) = 2.37 points

A potential interpretation of this is that students who have studied the language for
0.5 years might be expected to obtain a CLT test score of 2.37 points.

Reporting the results of this linear regression would be as follows:

Linear regression shows that years studying the L2 may significantly predict the CLT
test score, F(1, 73) = 13.641, p < .001. The equation reveals that, after 6 months of
studying the L2, students are predicted to obtain a CLT test score of 2.37 points.
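The same fitting and prediction steps can be sketched with `scipy.stats.linregress`. The data below are illustrative (so the fitted coefficients will differ from the book's c = 2.025 and b = 0.684, which come from its own data set):

```python
import numpy as np
from scipy import stats

# Hypothetical data: years studying the L2 (predictor) and CLT test score
# (criterion). Illustrative values only.
years = np.array([1, 2, 3, 4, 5, 6, 7, 8, 2, 5])
score = np.array([3.0, 3.5, 4.2, 5.0, 5.4, 6.1, 6.8, 7.5, 3.8, 5.6])

res = stats.linregress(years, score)
print(f"c = {res.intercept:.3f}, b = {res.slope:.3f}, "
      f"R^2 = {res.rvalue ** 2:.3f}, p = {res.pvalue:.4f}")

# Prediction for a chosen predictor value x, using y = c + b * x
predicted = res.intercept + res.slope * 0.5
print(f"predicted CLT score at 0.5 years: {predicted:.2f}")
```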

Multiple Regression

Multiple linear regression is an extension of simple linear regression since it is
employed with one continuous criterion variable (dependent variable) and several
predictor variables (independent continuous variables). In essence, the purpose of
multiple linear regression is the same as that of simple linear regression, that is,
to estimate a value for the dependent variable, but from more than one predictor
variable.

To the L2 education researcher, it allows examining associations among several
predictor variables (for instance, the type of teaching methodology, or individual
differences) and one dependent variable (such as years of study, or hours devoted to
studying the L2 per week). As indicated by Urdan (2017), the fact that multiple
regression analysis allows for the introduction of several variables into the model
makes it possible to examine how control variables may influence the general model.
The association between the multiple predictor variables and the criterion is
determined by the multiple correlation coefficient, which is included in the multiple
linear regression analysis.

Tavakoli (2013) summarizes the extent to which multiple linear regression may be
useful for research purposes: a) it determines the degree of relation between the set
of predictor variables and the criterion variable; b) it shows how strong the
relationship between each predictor variable and the criterion variable is, while
allowing the researcher to control for the other variables in the model; c) it
indicates the relative strength of each predictor variable; and finally, d) it
assesses the interaction effects between the predictor variables.

Multiple regression analysis makes certain assumptions about the data:

1) Absence of multicollinearity or singularity – otherwise, the regression model
will be compromised.
2) Outliers should not be present in the data.
3) Normality of the residuals.
4) Linearity.
5) Homoscedasticity.

As Tavakoli (2013) points out, multiple regression analysis has two central uses:
(1) to determine the strength of the association between the criterion and each
predictor variable while controlling for the associations among the predictor
variables; this association is represented by the partial regression coefficient (β);
and (2) to determine how much particular predictors can account for the variance in
the criterion variable. In this case, the association is represented by the multiple
correlation coefficient (R) or its square (R²).

Nevertheless, it is important to bear in mind that multiple linear regression does not
allow the researcher to draw cause-and-effect conclusions. This is due to the
potential presence of extraneous variables that may somewhat affect the causal
relationship between the different variables.
Let us then imagine a research study in which we are interested in exploring how the
scores obtained under three types of teaching methodology (Montessori, CLT, and
App-based) may predict the level of anxiety. In JASP, we introduce the teaching
methodologies as 'Covariates' and the level of anxiety as the dependent variable, as
can be observed in Figure 89 below:

Figure 89. JASP interface for multiple linear regression.

Subsequently, the following output will be generated (see Figure 90). As seen, the
adjusted R² informs the researcher that the predictors jointly account for 42.7% of
the outcome variance. The Durbin-Watson check – whose statistic should lie between
1 and 3 – falls within the acceptable range.

Figure 90. JASP output - model summary - for multiple linear regression.
Model Summary - Anxiety_Levels
Durbin-Watson
Model R R² Adjusted R² RMSE Autocorrelation Statistic p
H₀ 0.000 0.000 0.000 1.313 0.107 1.716 0.479
H₁ 0.708 0.502 0.427 0.994 -0.151 2.214 0.822

The ANOVA table provides us with valuable information about the F-statistic which,
as observed, is statistically significant (p = .003), suggesting that the model
predicts the level of anxiety from the methodology scores better than the
intercept-only model.

Figure 91. ANOVA table for multiple linear regression.


ANOVA
Model Sum of Squares df Mean Square F p
H₁ Regression 19.876 3 6.625 6.709 0.003
Residual 19.749 20 0.987
Total 39.625 23
Note. The intercept model is omitted, as no meaningful information can be shown.

The table below (Figure 92) shows the coefficients when all predictors are entered
into the model. In the case of the predictor regression coefficients, the CLT score is
marginally significant (p = .059). However, the Tolerance and Variance Inflation
Factor (VIF), which are collinearity statistics, allow us to observe the degree of
multicollinearity that exists between the variables. For multicollinearity to be
discarded, VIF values should be below 10 (with the average VIF not substantially
greater than 1), and tolerance should be above 0.2. In Figure 92, the tolerance values
fall below 0.2 and the average VIF is quite large (M = 14.28). Hence, the model is
biased, and no reliable predictions can be made.

Figure 92. Coefficients table for each variable.


Coefficients
Collinearity Statistics
Model Unstandardized Standard Error Standardized t p Tolerance VIF
H₀ (Intercept) 2.625 0.268 9.797 < .001
H₁ (Intercept) -1.244 0.930 -1.337 0.196
Montessori -0.084 0.272 -0.134 -0.309 0.761 0.133 7.512
CLT 1.115 0.558 1.535 1.999 0.059 0.042 23.646
App -0.473 0.327 -0.781 -1.447 0.163 0.086 11.683
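The VIF that JASP reports can also be computed directly from the predictor matrix: each predictor is regressed on the remaining ones, and VIF_j = 1 / (1 − R_j²). A minimal numpy sketch with synthetic, illustrative predictors (two deliberately collinear columns):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of a predictor matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining predictors (plus an intercept).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Illustrative predictors: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=50)  # strongly related to x1
x3 = rng.normal(size=50)                         # unrelated column
X = np.column_stack([x1, x2, x3])
print(vif(X))  # first two VIFs large, third close to 1
```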

This information may be observed visually through a series of plots, which allow for
the confirmation or rejection of assumptions.

Figure 93. Residuals vs Predicted plot.

As can be seen in Figure 93 above, the distribution of residuals around the baseline
suggests that the assumption of homoscedasticity may have been violated.

In the case of the Q-Q plot, as in Figure 94, it shows that the standardized residuals fit
along the diagonal, which points to normality and linearity. Thus, these assumptions
have not been violated.

Figure 94. Q-Q Plot Standardized Residuals.

Reliability Tests
In statistics, the role of reliability tests is deemed essential in order to discern whether
the association between certain items or groups of items (e.g. in a survey, or the scores
graded by several raters) is consistent.

Cronbach's alpha - α

One of the most common reliability tests is Cronbach's alpha (α), in which the
associations between a set or group of items are used to specify how strong these items
hold together (Urdan, 2017). Cronbach's alpha allows researchers to estimate the
internal consistency reliability of a certain measuring instrument (e.g. a test) by taking
into consideration certain information from the data: the number of items, the variance
of the scores in each item, and the variance of the total test scores (Tavakoli, 2013). As
mentioned previously, Cronbach's alpha is a measure of internal consistency that offers
factual data about the reliability of items within a group, for instance, a questionnaire
delivered to a classroom.

To interpret Cronbach's alpha, note that the maximum value of this measure is 1.
Values approaching 1 reflect a stronger relationship between the test items, whereas
a low alpha suggests that the items are not consistently measuring the same
underlying construct.

Let us imagine that we have designed a questionnaire consisting of 23 items whereby
we intend to observe L2 students' perceptions. In order to verify its internal
consistency, the questionnaire is delivered to a piloting sample of 15 people. Hence,
once the answers to this test are gathered, a reliability test must be carried out
before it is eventually delivered to the whole sample.

Firstly, in JASP, we introduce the 23 items into the variables box, as in Figure 95.

Figure 95. JASP introduction of variables for unidimensional reliability.

The next step consists of selecting the appropriate scale statistics – in this case,
Cronbach's alpha α – and the confidence interval (see Figure 96).

Figure 96. Selection of Cronbach's alpha as a reliability test.

This will generate the results of Cronbach's alpha, as in Figure 97 below. Following
common benchmarks (values of .70 and above are usually deemed acceptable), the
reliability test yielded α = .736, 95% CI [.639, .811]. Hence, the degree of internal
consistency is acceptable.

Figure 97. Reliability test- Cronbach's alpha results.
Frequentist Scale Reliability Statistics
Estimate Cronbach's α
Point estimate 0.736
95% CI lower bound 0.639
95% CI upper bound 0.811
Note. The following items correlated negatively with the scale: P_15, P_17, P_18, P_19, P_20,
P_22.
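Cronbach's alpha itself is straightforward to compute from the item variances and the variance of the total score. A minimal sketch with a made-up (participants × items) matrix of Likert answers (illustrative, far smaller than the 23-item example above):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from a (participants x items) score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point Likert answers from a small pilot sample (illustrative)
answers = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])
print(f"alpha = {cronbach_alpha(answers):.3f}")
```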

Another follow-up step that may be revealing is observing the individual item
reliability (in JASP, go to 'Individual Item Statistics' > Cronbach's α [if item dropped]).
Ticking this option allows us to observe Cronbach's alpha for each of the items in the
questionnaire.

Figure 98. Individual item reliability statistics for each question in the survey.
Frequentist Individual Item Reliability Statistics
If item dropped
Item Cronbach's α
P_1 0.718
P_2 0.718
P_3 0.713
P_4 0.725
P_5 0.731
P_6 0.720
P_7 0.713
P_8 0.723
P_9 0.724
P_10 0.716
P_11 0.708
P_12 0.731
P_13 0.728
P_14 0.708
P_15 0.734
P_16 0.725
P_17 0.739
P_18 0.730
P_19 0.755
P_20 0.763
P_21 0.722
P_22 0.736
P_23 0.727

Intraclass Correlation Coefficient (ICC)

Another measure of reliability is the intraclass correlation coefficient (ICC),
which is a descriptive statistic used with quantitative measurements, and more
specifically, with units organized into groups. In essence, its purpose is similar to
Cronbach's alpha, but it tests the degree of relatedness between measurements within
the same group from a quantitative perspective. Additionally, the ICC equally
assesses the consistency or reproducibility of quantitative measurements made by
different observers.

Figure 99. JASP introduction of variables for ICC.

In JASP, this option is available in the 'Reliability' module > 'Intraclass
Correlation'. The scores of each rater must be properly organized. In the case of the
example in Figure 99, the data come from a re-coding task, which is why the
variables are called 'Time_1' and 'Time_2'. Both have to be included in the
'Variables' box.

In Figure 100, the results of the ICC are presented. As observed, ICC = 0.903, which
is estimated as an excellent degree of association.

Figure 100. Results of ICC.


Intraclass Correlation
Type Point Estimate Lower 95% CI Upper 95% CI
ICC3,1 0.903 0.891 0.914

Note. 661 subjects and 2 judges/measurements. ICC type as referenced by Shrout
and Fleiss (1979).
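The ICC(3,1) reported above follows the two-way mixed, consistency formulation of Shrout and Fleiss (1979), which can be computed from the mean squares of a two-way layout. A sketch with made-up paired ratings (illustrative, not the book's 661-subject data):

```python
import numpy as np

def icc3_1(ratings):
    """ICC(3,1), two-way mixed effects, consistency (Shrout & Fleiss, 1979).

    ratings: (subjects x raters) matrix.
    ICC(3,1) = (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error)
    """
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)
    col_means = Y.mean(axis=0)
    ss_rows = k * ((row_means - grand) ** 2).sum()    # between-subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()    # between-raters
    ss_total = ((Y - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical scores assigned twice to the same items (illustrative)
time1 = [7, 5, 8, 6, 9, 4, 7, 6]
time2 = [7, 5, 9, 6, 8, 4, 7, 5]
print(f"ICC(3,1) = {icc3_1(np.column_stack([time1, time2])):.3f}")
```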

Cohen's Kappa

Parallel to the ICC, Cohen's kappa is another measure of agreement that allows for
the calculation of interrater reliability. In essence, it represents the average rate of
agreement in a set of scores, revealing the degree of agreement and disagreement by
category. Cohen's kappa adopts a dichotomous coding scheme (Tavakoli, 2013).
Although simple percentages of agreement could be calculated, Cohen's kappa
corrects for chance agreement, making it a valuable statistical test to check for intra-
and inter-rater agreement. As with the ICC, the closer Cohen's kappa is to +1, the
greater the agreement.

As can be observed in Figure 101, Cohen's kappa is .901, which points to a very high
degree of agreement between raters.

Figure 101. Cohen's Kappa results (JASP).


Cohen's Weighted kappa
95% CI
Ratings Weighted kappa SE Lower Upper
Average kappa 0.901
Time_1 - Time_2 0.901 0.013 0.875 0.928
Note. 661 subjects/items and 2 raters/measurements. Confidence intervals are asymptotic.
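The chance-correction logic behind kappa can be made explicit with a small sketch for the unweighted case (the dichotomous codes below are illustrative, not the book's ratings):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters assigning categorical codes.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from the marginal frequencies.
    """
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dichotomous codes from two raters (illustrative)
rater1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
print(f"kappa = {cohens_kappa(rater1, rater2):.3f}")
```

Note that the raters above agree on 9 of 10 codes (p_o = .90), yet kappa is lower because part of that agreement would be expected by chance alone.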

Factor Analysis

Principal Component Analysis

Some research studies which make use of questionnaires, or which operationalize
constructs with different variables, make use of factor analyses. One of the most
common ones is Principal Component Analysis (PCA), which is a multivariate
analysis used to explore how a number of variables are interrelated. Hence, this
allows us to discern the potential higher-order components that may account for this
pattern of intercorrelations. PCA is certainly useful if, for instance, a questionnaire
on the role of digital writing in students' everyday lives is delivered to an L2 Higher
Education classroom. The questionnaire may contain several sections with specific
questions (e.g. the use of digital writing with the L2, or how digital writing is present
in their studies, among many other aspects). Along this line, PCA makes it possible
to merge several measured variables into a shared component or latent variable
(Tavakoli, 2013). This is done through the variance of a set of observed variables,
which produces a set of components accounting for all the variance in this set of
observed variables.

Taking as reference the example previously mentioned, in JASP, Factor Analyses are
found in the 'Factor' module (see Figure 102 below):

Figure 102. Factor module in JASP.

As observed, there are three types of analyses. In this book, an overview of these
three will be provided.

Figure 103. PCA module in JASP.

In Figure 103, all the variables have to be introduced into the 'Variables' box. Then,
under 'Number of Components', 'Eigenvalues' has to be selected. It is important to
note that only components with eigenvalues above 1 will be retained.
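The eigenvalue rule that JASP applies here (the Kaiser criterion) can be illustrated with numpy: the eigenvalues of the correlation matrix give the variance captured by each component. The responses below are synthetic, generated from two hypothetical underlying constructs:

```python
import numpy as np

# Synthetic responses: 30 participants x 6 questionnaire items, built from
# two hypothetical latent constructs plus noise (illustrative data only)
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                     [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
items = latent @ loadings.T + rng.normal(scale=0.4, size=(30, 6))

# PCA on the correlation matrix: each eigenvalue is the variance of a component
corr = np.corrcoef(items, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]  # sorted in descending order
print("eigenvalues:", np.round(eigvals, 3))
print("retained by Kaiser criterion (> 1):", int((eigvals > 1).sum()))
print("proportion of variance:", np.round(eigvals / eigvals.sum(), 3))
```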

Figure 104. Chi-squared test for the PCA.


Chi-squared Test
Value df p
Model 204.452 113 < .001

The table shown in Figure 105 below is of paramount importance to understand the
relevance of PCA in the framework of our research study. As can be observed, the
questions from the questionnaire (e.g. P_21, P_10, etc.) are grouped into different
components (i.e. RC1, RC2). The values shown in Figure 105 are the component
loadings, which display the degree of relationship between each item and each
component.

Figure 105. Component loadings for PCA.


Component Loadings
RC1 RC2 RC3 RC4 RC5 RC6 RC7 Uniqueness
P_21 0.906 0.251
P_23 0.900 0.324
P_4 0.727 0.315
P_8 0.720 0.234
P_20 -0.408 0.525
P_7 0.883 0.161
P_2 0.835 0.251
P_10 0.638 0.453
P_11 0.567 0.356
P_14 0.458 0.438 0.368
P_18 0.999 0.201
P_19 0.972 0.156
P_16 0.817 0.367
P_15 0.603 0.514
P_5 0.574 0.390
P_9 0.873 0.302
P_6 0.838 0.239
P_17 0.858 0.333
P_3 0.540 0.363
P_13 0.763 0.330
P_12 -0.702 0.321
P_1 0.446 0.400
P_22 0.386
Note. Applied rotation method is promax.

The table below in Figure 106 is relevant for the researcher since it provides
important information about the proportion of variance explained as well as the
eigenvalues. As can be observed, components 1 to 3 have the highest eigenvalues.
The proportion of variance is equally relevant: component 1 accounts for the largest
share (0.227), and this information guides the decision on how many components to
retain when grouping the variables.

Figure 106. Component characteristics with Eigenvalues for PCA.


Component Characteristics
              Unrotated solution                        Rotated solution
              Eigenvalue  Proportion var.  Cumulative   SumSq. Loadings  Proportion var.  Cumulative
Component 1   5.215       0.227            0.227        3.114            0.135            0.135
Component 2   2.799       0.122            0.348        3.019            0.131            0.267
Component 3   2.308       0.100            0.449        2.289            0.100            0.366
Component 4   1.560       0.068            0.517        2.038            0.089            0.455
Component 5   1.349       0.059            0.575        1.988            0.086            0.541
Component 6   1.145       0.050            0.625        1.523            0.066            0.607
Component 7   1.084       0.047            0.672        1.488            0.065            0.672

A visual representation of these eigenvalues is shown in Figure 107 below, which is a
scree plot. As shown, only components with eigenvalues above 1 appear above the
benchmark line. This indicates which components may be representative of the
variables that they hold together.

Figure 107. Scree plot for PCA.

Another visual manner to observe the results of the PCA is through a path plot, as can
be observed in Figure 108. In this case, all the components are shown, and a series of
red and green arrows, with differing widths, point to each of the variables.

In this case, the L2 researcher may gain more insight into which variables may be
merged into which component. Nevertheless, decisions are to be taken based on
theoretically or empirically motivated grounds.

Figure 108. Path plot for PCA.

Exploratory Factor Analysis

Similar to the previous analysis, PCA, Exploratory Factor Analysis (EFA) is another
type of factor analysis centered on describing and summarizing data through the
organization of variables that are supposed to be correlated linearly. Although the
objective is similar to PCA, EFA helps the researcher to decide which constructs or
factors best represent the data. While useful as a technique, it is generally used in the
early stages of research (Tavakoli, 2013) in order to consolidate the variables. This
eases the process of generating hypotheses.

Let us take the research study proposed in which a questionnaire is used to explore
students' perceptions about L2 digital writing, as in PCA. In JASP, the researcher must
go to 'Factor Analysis' > 'Exploratory Factor Analysis'. The box where the data has to
be introduced is the same as in PCA. The statistical aspects to be selected are,
depending on our research interests, 'Eigenvalues' and 'Oblique' rotation. An important
aspect about rotation is that, should there be a high degree of correlation among the
variables, then 'oblique' is the fittest option. Conversely, when correlation does not
exist between the variables, then 'orthogonal' should be selected. In our case, 'oblique'
is the most appropriate option since intercorrelations exist.

Under the 'Output options' tab, the assumption checks must be selected: the
Kaiser-Meyer-Olkin (KMO) test (see Figure 109) and Bartlett's test (see Figure 110).
In the first case, the KMO test allows us to determine how suited our data is for Factor
Analysis. The result provided in the Overall MSA should be above .500. As can be
observed in Figure 109, the overall MSA is .665. Hence, this assumption check is met.

Figure 109. Kaiser-Meyer-Olkin test for EFA.


Kaiser-Meyer-Olkin test

              MSA
Overall MSA   0.665
P_1           0.749
P_2           0.594
P_3           0.804
P_4           0.729
P_5           0.543
P_6           0.716
P_7           0.800
P_8           0.789
P_9           0.474
P_10          0.741
P_11          0.729
P_12          0.411
P_13          0.572
P_14          0.779
P_15          0.741
P_16          0.466
P_17          0.562
P_18          0.520
P_19          0.554
P_20          0.690
P_21          0.668
P_22          0.609
P_23          0.633

The subsequent assumption check is Bartlett's test, which determines whether or not
there is sphericity. In this case, the result should be statistically significant, as can be
seen in Figure 110.

Figure 110. Bartlett's test result (EFA).


Bartlett's test
Χ² df p
704.354 253.000 < .001

In Figure 111, the model is tested through chi-squared. Although the value is not
statistically significant, the model will be explored in order to observe how these
factors are developed.

Figure 111. Chi-squared test for the module EFA.


Chi-squared Test
Value df p
Model 115.835 113 0.409

In Figure 112, the numerical entries in the table, which are factor loadings, indicate the
correlation between the original variables and the different factors. In essence, when a
factor loading is high, its variable contributes to this particular factor, and helps define

it (Tavakoli, 2013). In our example, factor loadings are disseminated throughout seven
different factors.

Figure 112. Factor loadings (EFA).


Factor Loadings

Variable  Loading (Factors 1-7)  Uniqueness
P_21       0.909                 0.333
P_23       0.833                 0.454
P_8        0.705                 0.272
P_4        0.693                 0.407
P_7        0.934                 0.101
P_2        0.787                 0.312
P_10       0.522                 0.657
P_11       0.453                 0.417
P_14       0.417                 0.506
P_19       1.029                 0.117
P_18       0.929                 0.334
P_6        0.925                 0.236
P_9        0.606                 0.585
P_16       0.740                 0.489
P_15       0.457                 0.731
P_17       0.625                 0.604
P_3        0.546                 0.421
P_12      -0.549                 0.563
P_13       0.510                 0.643
P_1                              0.535
P_5                              0.696
P_20                             0.643
P_22                             0.610
Note. Applied rotation method is promax. Each reported loading falls on one of the seven factors; loadings below the display cutoff are not shown.

Parallel to the previous factor loadings table, Figure 113 below displays the different
factor characteristics. As can be observed, only Factors 1 to 4 are above 1 in the
unrotated solution. Although the values in this table are sums of squared loadings
rather than eigenvalues, this suggests that the remaining factors may not be as
representative of the variables included in the model.

Figure 113. Factor characteristics.
Factor Characteristics

          Unrotated solution                               Rotated solution
          SumSq. Loadings  Proportion var.  Cumulative   SumSq. Loadings  Proportion var.  Cumulative
Factor 1  4.835            0.210            0.210        2.757            0.120            0.120
Factor 2  2.326            0.101            0.311        2.543            0.111            0.230
Factor 3  1.931            0.084            0.395        1.977            0.086            0.316
Factor 4  1.119            0.049            0.444        1.608            0.070            0.386
Factor 5  0.830            0.036            0.480        1.490            0.065            0.451
Factor 6  0.702            0.031            0.511        1.059            0.046            0.497
Factor 7  0.590            0.026            0.536        0.898            0.039            0.536

Both Figures 114 and 115 provide a visual overview of the EFA model. In the case of
the path diagram, it allows us to observe in a much clearer manner each factor and the
associated variables.

Figure 114. Scree plot for EFA.

Figure 115. Path diagram for EFA.

Confirmatory Factor Analysis

The last of the Factor Analyses is Confirmatory Factor Analysis (CFA), which
allows the researcher to examine the relationship between different measured variables
and a set of factors. Unlike EFA, in which measured variables are related to every
factor by a factor loading, in CFA prior knowledge is presupposed on the
researcher's part and, under this assumption, the factors are specified by the
researchers themselves. Hence, CFA responds to a hypothesized factor structure and the associated
correlations between the variables. A major difference with EFA, thus, is related to the
researcher's role in CFA. As can be seen in Figure 116 below, the construct or factor
must be created and variables assigned to it on the basis of the theory being tested.
Preconceived theories, then, may be tested with CFA as well (Tavakoli, 2013).

Let us hence imagine that, as a result of the EFA, a further confirmatory check is made
through CFA. Hence, two factors – corresponding to the EFA factors with
the highest eigenvalues – are manually created, as can be observed in Figure 116.

Figure 116. CFA module in JASP.

Hence, once all the data are introduced and classified into the corresponding
researcher-created factors, a chi-square test for the model is calculated. In our study, it
is statistically significant (see Figure 117).

Figure 117. Model fit - CFA.


Chi-square test
Model Χ² df p
Baseline model 310.313 45
Factor model 50.776 34 0.032

Figure 118 displays a set of tables with additional fit measures to observe the
appropriateness of the model. Values in Comparative Fit Index (CFI) should be closer
to +1, indicating the model fit. In our example, it is close to +1. Tucker-Lewis Index
(TLI) is similar to CFI, although a more conservative option. As can be observed, the
value of TLI is close to +1, hence indicating the model fit.

In the case of other fit measures, such as RMSEA, the value provided should be below
.10. A traditional benchmark is that, below .05, it is a good model. Between .05 and
.10, it is appropriate but close attention should be paid.

Figure 118. Additional fit measures.


Fit indices
Index Value
Comparative Fit Index (CFI) 0.937
Tucker-Lewis Index (TLI) 0.916
Bentler-Bonett Non-normed Fit Index (NNFI) 0.916
Bentler-Bonett Normed Fit Index (NFI) 0.836
Parsimony Normed Fit Index (PNFI) 0.632
Bollen's Relative Fit Index (RFI) 0.783
Bollen's Incremental Fit Index (IFI) 0.939
Relative Noncentrality Index (RNI) 0.937

Information criteria
Value
Log-likelihood -1062.405
Number of free parameters 21.000
Akaike (AIC) 2166.810
Bayesian (BIC) 2216.301
Sample-size adjusted Bayesian (SSABIC) 2150.093

Other fit measures
Metric Value
Root mean square error of approximation (RMSEA) 0.080
RMSEA 90% CI lower bound 0.024
RMSEA 90% CI upper bound 0.123
RMSEA p-value 0.148
Standardized root mean square residual (SRMR) 0.071
Hoelter's critical N (α = .05) 75.662
Hoelter's critical N (α = .01) 87.119
Goodness of fit index (GFI) 0.891
McDonald fit index (MFI) 0.898
Expected cross validation index (ECVI) 1.189

After the checks regarding fit measures, factor loadings corresponding to the variables
introduced in each manually created factor are presented. As can be observed in Figure
119 below, all estimates seem to be over .40, indicating that these variables – in our
case, the questions in the questionnaire – fit well with the proposed factors.
Additionally, attention should be paid to variable P_20, since it points to a negative
correlation.

Figure 119. Factor loadings CFA.


Factor loadings

Factor                    Indicator  Symbol  Estimate  Std. Error  z-value  p       95% CI Lower  95% CI Upper
Digital writing - use L1  P_4        λ11      0.672    0.101        6.683   < .001   0.475         0.869
                          P_8        λ12      0.839    0.103        8.114   < .001   0.636         1.041
                          P_20       λ13     -0.641    0.146       -4.404   < .001  -0.927        -0.356
                          P_21       λ14      0.739    0.111        6.676   < .001   0.522         0.956
                          P_23       λ15      0.668    0.104        6.420   < .001   0.464         0.872
Digital writing - use L2  P_2        λ21      0.861    0.110        7.802   < .001   0.645         1.078
                          P_7        λ22      1.160    0.118        9.857   < .001   0.930         1.391
                          P_10       λ23      0.529    0.131        4.051   < .001   0.273         0.785
                          P_11       λ24      0.804    0.138        5.841   < .001   0.534         1.073
                          P_14       λ25      0.540    0.133        4.049   < .001   0.279         0.802

Much as it was shown in the previous factor analyses, CFA may be equally seen
through a model plot, allowing for a more visual perspective of the interrelationship
between the factors. Hence, in Figure 120, the model plot for CFA shows how factors
are correlated, and also, the correlation between each factor and the associated
variables.

Figure 120. Model plot for CFA.

Chapter 6

EFFECT SIZES

What is an 'effect size'?


Throughout this book, the main reference taken to determine whether a certain
statistical test is relevant has been to observe the statistic, i.e. r, t, F, among others, and
the p-value in order to decide the acceptance or rejection of the null hypothesis.
Nevertheless, the presence of a p-value below the .05 alpha value does not provide the
researcher with any valuable information about the strength of the effect. Such
information about the magnitude of the effect is provided by effect sizes. In other
words, the results of a T-test may yield a statistically significant result (i.e. p-value
below 0.05) but the effect size may reveal that the magnitude of such an effect and
result is small.

An effect size is a value that points to the proportion or percentage of variability


existing on a dependent variable, whose variation may be attributed to the effect of the
independent variable (Tavakoli, 2013). As announced before, while a p-value indicates
whether a researcher can confirm or reject the null hypothesis, the effect size reveals
the importance of this statistical result. When rejecting the null hypothesis and
statistically signaling a difference between groups, the effect size gives valuable
information about the magnitude of the effect of the independent variable (for instance,
the presence or absence of a specific language teaching technique) on the dependent
variable (e.g. the scores on a vocabulary test). Hence, if the effect size is large, it will
indicate that there is a meaningful effect worth looking into.

Similarly, the frequentist hypothesis-testing with p-values is highly dependent on the


power of the test, and on the sample size. Conversely, effect sizes do not depend on
group size, and additionally, this allows researchers to establish comparisons across
groups with different sample sizes. A previous meta-analysis of L2 studies suggested
that the widespread use of effect sizes favors obtaining valuable information about a
given area of research (Plonsky & Oswald, 2014).
There are several types of effect size indices: g index, h index, w index, squared indices
of r, q index, and d family of effect sizes. In what follows, some of these effect size
indices are going to be explained and illustrated with examples.

Cohen's d
One of the most used effect size indices is Cohen's d, which measures the difference
between means from two independent samples in terms of their standard deviation units
(Larson-Hall, 2010; Tavakoli, 2013). In essence, Cohen's d starts at zero and
increases with the size of the difference between the means. For effect sizes, there are
several benchmarks that are established to determine or estimate the magnitude of this
effect. However, Plonsky et al. (2021) indicated that "benchmarks are nothing more
than a starting point for gauging the magnitude of effects within the field" (p. 822). In
this respect, while traditional Cohen's d benchmarks have been established as follows:
small (0.2), medium (0.5), and large (0.8), Plonsky & Oswald (2014) proposed a series
of field-specific benchmarks for the interpretation of a series of effect sizes (namely, d
index, r index, and R2) in L2 research. In the case of the benchmarks proposed by these
authors, they distinguish between- and within-groups for the calculation of the
magnitude. For Cohen's d (between-groups): small (0.40), medium (0.70), and large
(1.00). Conversely, for Cohen's d (within-groups): small (0.60), medium (1.00), and
large (1.40).
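As an illustrative sketch only (the group means, standard deviations, and sizes below are invented), Cohen's d for two independent groups is the mean difference divided by the pooled standard deviation:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical vocabulary post-test scores for two groups of 25 learners
d = cohens_d(14.2, 2.1, 25, 16.8, 2.4, 25)
print(round(d, 3))
```

The sign of d simply reflects which group mean is subtracted from which; its absolute value is what is compared against the benchmarks above.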

The calculation of effect sizes is possible in JASP, as observed in Figure 121:

Figure 121. Selection of effect sizes for independent and dependent samples t-tests.

The effect size indices in Figure 122 are commonly used for independent and
dependent samples T-Tests (both parametric and non-parametric).

Figure 122. JASP outcome for an independent samples T-test with Cohen's d effect size.
Independent Samples T-Test

              t      df  p        Cohen's d  95% CI Lower  95% CI Upper
Post_Test  -7.266   48  < .001   -2.057      -2.740        -1.360
Note. Student's t-test.

As can be seen in Figure 122 above, Cohen's d is large (d = –2.057), which indicates
that the difference between the groups is large. The inclusion of confidence intervals
allows us to observe the variability and extension of this effect size.

Hedges' g
Another common statistical effect size index is Hedges' g, which is very similar to
Cohen's d. However, Hedges' g takes into account the sample size since the effect size
yielded by Cohen's d is multiplied by a correction factor for small sample sizes (Turner
& Bernard, 2006).

Figure 123. JASP outcome for independent sample t-test with Hedges' g.
Independent Samples T-Test

              t      df  p        Hedges' g  95% CI Lower  95% CI Upper
Post_Test  -7.266   48  < .001   -2.024      -2.704        -1.331
Note. Student's t-test.
In Figure 123, the same example as in the previous effect size index was computed.
This time Hedges' g was calculated to determine the magnitude of the independent
variable on the dependent variable. If we compare the result of Hedges' g with that of
Cohen's d, one may realize that the difference between them is small. Nevertheless, the use of
Hedges' g ensures that a correction is applied when using small sample sizes (e.g. below
20 participants).
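The correction factor itself is simple to verify by hand. A minimal sketch, taking d = −2.057 and N = 50 from Figure 122; the small discrepancy with the −2.024 reported in Figure 123 comes from rounding of d:

```python
def hedges_g(d, n_total):
    """Apply the small-sample correction factor 1 - 3/(4N - 9) to Cohen's d."""
    return d * (1 - 3 / (4 * n_total - 9))

# d and total N taken from Figure 122 (d = -2.057, N = 50)
g = hedges_g(-2.057, 50)
print(round(g, 3))
```

As expected, the correction shrinks the effect size slightly toward zero, and the adjustment becomes negligible as N grows.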

Cramer's V
Another important effect size index is Cramer's V, which is used for Chi-square
analyses. Its main purpose is to provide information about how strongly two categorical
variables are associated. Cramer's V is generally used with contingency tables, and it
is an extension of Phi correlation coefficient – whose use goes beyond the purposes of
this book (Tavakoli, 2013).

To illustrate the use of this effect size, let us take an example in which we are interested
in observing the level of anxiety in both groups: CLIL and a control group. The
independent variable is the presence or absence of a bilingual program.

Figure 124. Contingency table from JASP.


Contingency Tables
Group
Anxiety_Levels CG CLIL Total
1 5 5 10
2 9 2 11
3 2 0 2
4 6 11 17
5 2 8 10
Total 24 26 50

Chi-Squared Tests

      Value   df  p
Χ²   11.463   4   0.022
N    50

As can be observed in Figure 124 above, Contingency tables and the chi-squared value
reveal that the differences between both groups are statistically significant (p = .022).
Hence, the L2 researcher has to verify the strength of the magnitude of the effect. To
do that, Cramer's V is computed.

Figure 125. Cramer's V effect size.


Nominal
Value
Cramer's V 0.479

As can be observed in Figure 125, the value is V = 0.479. A series of benchmarks are
proposed for Cramer's V depending on the df (degrees of freedom). Let us have a look
at the table below (based on Goss-Sampson, 2020):

Figure 126. Cramer's V benchmarks.


Effect size      df (contingency table)  SMALL  MEDIUM  LARGE
Phi (2x2 only)   1                       0.1    0.3     0.5
Cramer's V       2                       0.07   0.21    0.35
Cramer's V       3                       0.06   0.17    0.29
Cramer's V       4                       0.05   0.15    0.25
Cramer's V       5                       0.04   0.13    0.22

Comparing the result of Cramer's V (V = 0.479) with the benchmarks proposed for a
contingency table with df = 4, the effect size would be large (> 0.25).
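The value JASP reports can be reproduced from the chi-squared statistic in Figure 124, since V = sqrt(χ² / (N × k)), where k is the smaller of (rows − 1) and (columns − 1). A quick sketch:

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from a chi-squared statistic and the table dimensions."""
    k = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi2 / (n * k))

# Values from Figure 124: chi-squared = 11.463, N = 50, 5x2 table
v = cramers_v(11.463, 50, 5, 2)
print(round(v, 3))
```

With a 5x2 table, k = 1, so here V reduces to the Phi coefficient mentioned above.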

Rank-biserial (rb)
The rank-biserial correlation coefficient is not an effect size per se, but a
measure of association between a continuous variable and a dichotomous variable
(that is, an independent variable with two categories). Traditionally, this measure of
association has not been used very widely in research since it is problematic in its
calculation, especially when distributions are not normal.

In JASP, the rank-biserial correlation coefficient is used with non-parametric tests, and
it is interpreted as an effect size using the same benchmarks as for Pearson's correlation.
As our interest lies in L2 education research, Plonsky and Oswald's (2014) field-
specific benchmarks are taken as reference: small (0.25), medium (0.40), and large
(0.65).

Taking the same example as in the previous effect size indices, Figure 127 shows how
the rank-biserial correlation coefficient is shown in JASP:

Figure 127. Rank-Biserial correlation as effect size.


Independent Samples T-Test

              W       df  p        Rank-Biserial Correlation  95% CI Lower  95% CI Upper
Post_Test  42.500        < .001   -0.864                      -0.926        -0.755
Note. For the Mann-Whitney test, effect size is given by the rank-biserial correlation.
Note. Mann-Whitney U test.

As can be observed, the effect size is very large (rb = –0.864), which suggests that
the magnitude of the effect of the independent variable is relevant.
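For reference, the coefficient can be recovered from the Mann-Whitney U statistic as rb = 2U / (n1 × n2) − 1. A sketch using W = 42.500 from Figure 127; the group sizes of 24 and 26 are an assumption carried over from the earlier contingency table:

```python
def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from the Mann-Whitney U statistic."""
    return 2 * u / (n1 * n2) - 1

# W = 42.500 from Figure 127; group sizes of 24 and 26 are assumed
rb = rank_biserial(42.5, 24, 26)
print(round(rb, 3))
```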

Eta squared (ƞ²)


Another measure of association that is usually employed in parametric ANOVAs is eta
squared, represented as ƞ². It is employed to determine the effect size, and it is the
proportion of the total variability of the dependent variable, which is further explained
by the variation in the independent variable. Eta squared helps in observing how much
of a difference between two groups may be explained by the independent variable. In
JASP, eta squared is available in the ANOVA module.

As can be observed in Figure 128 below, the eta squared is provided next to the p-
value. Following Goss-Sampson (2020), the benchmarks for eta squared are: trivial
(<0.1), small (0.1), medium (0.25), and large (0.37). Under this assumption, the eta
squared in the example in Figure 128 would be considered nearly medium (η² = 0.23).

Figure 128. ANOVA table with the eta squared effect size.
ANOVA - Montessori
Cases Sum of Squares df Mean Square F p η²
Group 36.860 1 36.860 15.006 < .001 0.238
Residuals 117.904 48 2.456
Note. Type III Sum of Squares
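The reported η² can be reproduced directly from the ANOVA table: it is the effect sum of squares divided by the total sum of squares. A quick check in Python using the values in Figure 128:

```python
# Sums of squares from the ANOVA table in Figure 128
ss_effect = 36.860
ss_residual = 117.904

# Eta squared: proportion of total variability explained by the factor
eta_squared = ss_effect / (ss_effect + ss_residual)
print(round(eta_squared, 3))
```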

Partial eta squared (ƞ²p)


An extension of the previously explained eta squared is the partial eta squared (ƞ²p),
which indicates the proportion of variance of the dependent variable explained by the
independent variable. It is 'partial' because the variance attributable to other factors
in the design is removed.

The benchmarks for partial eta squared are: trivial (<0.01), small (0.01), medium
(0.06), and large (0.14).

As can be observed in Figure 129 below, using the same example as in the previous
effect size indices, the partial eta squared would be considered as very large (ƞ²p =
0.57).

Figure 129. ANOVA table for partial eta squared.


ANOVA - App
Cases Sum of Squares df Mean Square F p η²p
Group 154.565 1 154.565 64.101 < .001 0.572
Residuals 115.740 48 2.411
Note. Type III Sum of Squares

Omega squared (ω2)
The last of the effect size indices reviewed in this book is the omega squared (ω2),
which is one of the most commonly employed measures of treatment effect (Tavakoli,
2013). In essence, omega squared measures the proportion of variability on the
dependent variable which is directly associated with the independent variable in the
population. The use of omega squared as an effect size ensures that our estimate of
this proportion in the population is unbiased (Pagano, 2009).

For omega squared, the benchmarks are the same as partial eta squared ones (see in the
previous section). Thus, as can be seen in Figure 130, the effect size is very large (ω2
= 0.509).

Figure 130. ANOVA table with omega squared.


ANOVA - CLT
Cases Sum of Squares df Mean Square F p ω²
Group 93.065 1 93.065 52.793 < .001 0.509
Residuals 84.615 48 1.763
Note. Type III Sum of Squares
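As a sketch, ω² can likewise be recomputed by hand from the ANOVA table, using the usual one-way formula ω² = (SS_effect − df_effect × MS_residual) / (SS_total + MS_residual):

```python
# Values from the ANOVA table in Figure 130
ss_effect, df_effect = 93.065, 1
ss_residual, ms_residual = 84.615, 1.763

# One-way omega squared: corrects eta squared for sampling bias
omega_sq = (ss_effect - df_effect * ms_residual) / (
    ss_effect + ss_residual + ms_residual
)
print(round(omega_sq, 3))
```

Note that ω² comes out slightly smaller than the corresponding η² would, which is precisely the bias correction at work.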

Chapter 7

INTRODUCTION TO BAYESIAN STATISTICS


Throughout the previous chapters, the statistical tests and methods presented and
applied belong to the frequentist methods. Frequentist models set probability as the
limit of the relative frequency of a certain event or hypothesis after a number of trials.
Hence, frequentist statistics calculates the probability that the experiment carried out
would yield the same results were it replicated under the same conditions.

In order to determine and verify this probability, frequentist methods use the p-value
as a reference. This p-value is the probability of obtaining results at least as extreme
as those observed if the null hypothesis were true; when it is sufficiently small, the
null hypothesis may be rejected. Traditionally, frequentist methods have relied on the
p-value as a reference, setting the alpha level at
0.05. Below this benchmark, a result is statistically significant. Nevertheless, one of
the issues raised in frequentist statistics is that p-values tend to be over-relied upon and,
in turn, misinterpreted. This is the main reason why frequentist statistics have to be
supplemented, for instance, with the (sometimes necessary) inclusion of effect size
indices.

Aside from frequentist methods, research and L2 education research have most recently
shifted toward bayesian methods or statistics (Norouzian et al., 2018). Bayesian
statistics entails that probability expresses a degree of belief in a specific event. While
frequentist statistics emphasizes the chance, or probability, of an event, Bayesian
statistics provides the researcher with valuable information about how probable it is
that one hypothesis is better supported than another.

Equally relevant, Bayesian probability is more conditional since it uses the concepts of
prior and posterior knowledge in an attempt to predict outcomes. This is called
conditional probability, and the premise behind it is that the probability of an event
X given Y is equal to the probability of X and Y happening together divided by

the probability of Y. The main usefulness of conditional probability is that it takes into
account both false positives and false negatives.
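The premise above can be made concrete with a small numerical sketch. All numbers here are hypothetical, chosen only to illustrate how false positives enter the computation of a conditional probability:

```python
def posterior_prob(prior, sensitivity, specificity):
    """P(X | Y) via Bayes' rule, accounting for false positives."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers: 10% of learners are highly anxious; a screening
# item flags 90% of anxious learners but also 20% of non-anxious ones
p = posterior_prob(0.10, 0.90, 0.80)
print(round(p, 3))
```

Under these assumed rates, only about one in three flagged learners is actually anxious, which is why conditioning on both error rates matters.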

In what follows, Bayesian terminology will be explained along with some essential
concepts of Bayesian statistics.

Basic terminology
Credibility interval. In Bayesian statistics, traditional confidence intervals are not
used. Instead, credibility intervals are used, and they are interpreted as the probability
– in JASP it can be set at 95% or according to our research interests – that the
population parameter lies between the lower and upper bounds of the Bayesian
credibility interval (Goss-Sampson, 2020). In Figure 131 below, descriptive statistics
are shown along with the credible interval.

Figure 131. Descriptive statistics with credible intervals.


Descriptives
95% Credible Interval
N Mean SD SE Lower Upper
Pre_Test 26 8.177 0.847 0.166 7.835 8.519
Post_Test 26 9.231 0.620 0.122 8.980 9.481

Prior distribution. Prior distribution is the distribution that may capture the amount
of certainty or uncertainty in a population parameter (Goss-Sampson, 2020). Hence,
the prior is weighted by the data to obtain the posterior distribution, from which
inferences are made. In terms of research, the prior distribution is
based upon what previous research has determined to be the norm. In essence, in
Bayesian statistics, the researcher has to be quite knowledgeable about the tendency
existing in previous research. Nevertheless, as pointed out by Norouzian et al. (2018),
the use of Bayesian statistics is still limited in L2 research and Applied Linguistics.
Thus, the establishment of prior distribution turns out to be challenging. In JASP, the
prior distribution is set in the 'Prior' tab (see Figure 132). JASP has a default Cauchy
distribution of a zero effect size (Cohen's d) and width or scale of .707 (Goss-Sampson,

2020). Such a prior distribution allows us to establish parameter estimation, whose
values may change depending on what previous research has stated to be the norm.

Figure 132. Prior distribution setting in JASP.

Likelihood functions are, in essence, based on the data generated and they crudely
describe it. They are highly dependent on the type of data (Norouzian et al., 2018).
Equally relevant, likelihood weights the prior distribution to obtain the posterior
distribution, which will allow us to make inferences.

Posterior distribution is obtained when the prior and the likelihood are combined in
the Bayesian estimation process. Statistically, the posterior is obtained when the prior
distribution is multiplied by the likelihood function.

On the basis of the above, Norouzian et al. (2018) very clearly state that choosing a
prior is a relevant decision in any research study wherein Bayesian methods are the
statistical choice. Nevertheless, prior knowledge has sometimes been assumed to be
absent or diminished, and its absence may lead to a biased Bayesian result. As mentioned
previously, prior knowledge (and hence, prior distributions) may not be considered
beyond the default Cauchy factor in JASP since the absence of studies using Bayesian
statistics makes it difficult to define them. A useful manner to set priors is to identify
effect sizes in previous research, especially Cohen's d, as units of reference for the
Cauchy prior.

Prior odds are the outcome before the evidence is considered (Goss-Sampson, 2020).
Additionally, these prior odds may be uninformative or informative, depending on the
degree of knowledge in previous work that may be applied to the Bayesian statistical

method. Conversely, posterior odds are Bayes Factor – which will be explained in the
next section – multiplied by the prior odds. Through this formula, Goss-Sampson
(2020) indicates that Bayes Factor (BF10) informs us about the confirmation or
rejection of the hypothesis.

Bayes Hypothesis Testing: T-Tests and Correlations


Bayesian hypothesis testing is an alternative to frequentist methods. Some statistical
software packages have introduced Bayesian testing into their functions in an attempt
to facilitate performing T-tests and other statistical procedures (e.g. correlations or
linear regressions) using this method.

In JASP, Bayes Hypothesis testing is introduced with separate modules in each


statistical test that allows the application of such a method. For instance, as shown in
Figure 133, Bayesian hypothesis testing may be applied to these three types of T-tests.

Figure 133. Bayesian T-Test alternatives.

Unlike frequentist statistics, in which the rejection or confirmation of the null


hypothesis is demonstrated through p-values, Bayesian hypothesis testing relies on the
Bayes factor (BF10). It evaluates the conditional probability mentioned previously for
two competing hypotheses: the null hypothesis and the alternative hypothesis. The
Bayes factor quantifies the relative support for each hypothesis, and it is updated
when new information becomes available.

Figure 134 below shows how the value of the Bayes factor may be interpreted as
evidence. The higher the value of the Bayes factor, the stronger the evidence against
the null hypothesis.

Figure 134. Graphical representation of a Bayes factor classification table (Van Doorn et al., 2020).

In JASP, as mentioned previously, common T-Tests may be conducted under the


auspices of Bayesian statistics. The introduction of the data is identical to frequentist
methods, but the output, as revealed in Figure 135 below, is different. In this case, only
BF10 (i.e. Bayes factor) is provided, and the error %, which is the error of the Gaussian
quadrature integration. Following the benchmarks presented in Figure 134 above, the
null hypothesis is rejected and then, the alternative hypothesis is accepted. In light of
this, we can conclude that there is a high probability that changes occurred owing to
the intervention.

Figure 135. Bayesian paired samples T-test.


Bayesian Paired Samples T-Test
Measure 1 Measure 2 BF₁₀ error %
Pre_Test - Post_Test 20316.557 3.696e-10
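Following the terminology introduced earlier, posterior odds are the Bayes factor multiplied by the prior odds. A minimal sketch, assuming equal (1:1) prior odds for the two hypotheses:

```python
# BF10 from Figure 135; equal prior odds (1:1) are an assumption here
bf10 = 20316.557
prior_odds = 1.0

posterior_odds = bf10 * prior_odds
p_h1 = posterior_odds / (1 + posterior_odds)  # posterior probability of H1
print(round(p_h1, 5))
```

With a Bayes factor of this size, the posterior probability of the alternative hypothesis is effectively 1, consistent with the interpretation above.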

Although the output shown in the table is clear, Figure 136 shows a prior and posterior
distribution plot which allows us to observe the conditional distribution in a much
clearer way. The plot reveals that there is evidence for the alternative hypothesis, which
is equally supported by the median in the case of the effect size (Mdn = –1.196).

Figure 136. Prior and posterior distribution plot.

Likewise, Figure 137 below displays a robustness check, in which the Bayes factor is
computed across a range of prior widths. Once again, the evidence for the alternative
hypothesis is consistent.

Figure 137. Robustness check plot.

Apart from the usual statistical procedures such as T-tests, Bayesian correlations may
also be performed. JASP offers this option in the 'Regression' module, as can be seen
in Figure 138:

Figure 138. 'Regression' module for Bayesian statistics.

The procedure to introduce the data is the same as for frequentist statistics. Conversely,
the output is different (see Figure 139 below). In this case, Pearson's r statistics are
presented, but p-values are replaced by the Bayes factor (BF10). Hence, to determine
whether the correlations were statistically significant, the Bayes factor has to be
observed. To name one of these results, CLT is positively correlated with the level
of anxiety, with a high Bayes factor.

Figure 139. Bayesian Pearson Correlations.


Bayesian Pearson Correlations
Pearson's r BF₁₀ Lower 95% CI Upper 95% CI
Montessori - CLT 0.921 *** 4.662e+7 0.795 0.964
Montessori - App 0.833 *** 39028.507 0.607 0.920
Montessori - YearsL2 0.149 0.319 -0.255 0.495
Montessori - Anxiety_Levels 0.630 ** 42.426 0.274 0.808
CLT - App 0.950 *** 3.909e+9 0.865 0.978
CLT - YearsL2 0.028 0.255 -0.358 0.403
CLT - Anxiety_Levels 0.670 *** 105.270 0.331 0.831
App - YearsL2 0.109 0.286 -0.290 0.465
App - Anxiety_Levels 0.566 * 12.673 0.187 0.770
YearsL2 - Anxiety_Levels -0.132 0.303 -0.482 0.270
* BF₁₀ > 10, ** BF₁₀ > 30, *** BF₁₀ > 100

REFERENCES
Chen, S. Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for
multiple comparisons. Journal of Thoracic Disease, 9(6), 1725–1729.
https://doi.org/10.21037/jtd.2017.05.34

Cohen, L., Manion, L., & Morrison, K. (2011). Research methods in education (7th
ed.). London: Routledge.

Cramer, D., & Howitt, D. (2004). The SAGE dictionary of statistics: A practical
resource for students in the social sciences. Thousand Oaks, CA: Sage.

Goss-Sampson, M. A. (2020). Statistical analysis in JASP 0.14: A guide for
students. London: University of Greenwich.

Heiman, G. W. (2011). Basic statistics for the behavioral sciences. Belmont, CA:
Wadsworth.

Larson-Hall, J. (2010). A guide to doing statistics in second language research using
SPSS. New York: Routledge.

Mackey, A., & Gass, S. M. (2005). Second language research: Methodology and
design. Mahwah, NJ: Lawrence Erlbaum Associates.

Norouzian, R., de Miranda, M., & Plonsky, L. (2018). The Bayesian revolution in
second language research: An applied approach. Language Learning, 68(4),
1032–1075.

Pagano, R. R. (2009). Understanding statistics in the behavioral sciences (9th ed.).
Belmont, CA: Wadsworth.

Plonsky, L., & Oswald, F. L. (2014). How big is "big"? Interpreting effect sizes in L2
research. Language Learning, 64(4), 878–912.

Plonsky, L., Sudina, E., & Hu, Y. (2021). Applying meta-analysis to research on
bilingualism: An introduction. Bilingualism: Language and Cognition, 1–6.

Porte, G. K. (2010). Appraising research in second language learning: A practical
approach to critical analysis of quantitative research (2nd ed.). Amsterdam:
John Benjamins.

Richards, K. (2003). Qualitative inquiry in TESOL. Basingstoke: Palgrave Macmillan.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater
reliability. Psychological Bulletin, 86(2), 420.

Tavakoli, H. (2013). A dictionary of research methodology and statistics in applied
linguistics. Tehran: Rahnamā Publishing.

Turner, H. M. I., & Bernard, R. M. (2006). Calculating and synthesizing effect sizes.
Contemporary Issues in Communication Science and Disorders, 33(Spring),
42–55.

Urdan, T. C. (2017). Statistics in plain English (4th ed.). Mahwah, NJ: Lawrence
Erlbaum Associates.

Van Doorn, J., van den Bergh, D., Böhm, U., Dablander, F., Derks, K., Draws, T., Etz, A.,
Evans, N. J., Gronau, Q. F., Haaf, J. M., Hinne, M., Kucharský, Š., Ly, A., Marsman,
M., Matzke, D., Gupta, A. R. K. N., Sarafoglou, A., Stefan, A., Voelkel, J. G., &
Wagenmakers, E. J. (2020). The JASP guidelines for conducting and reporting a
Bayesian analysis. Psychonomic Bulletin and Review.
https://doi.org/10.3758/s13423-020-01798-5
