Contextualizing Performances Comparing Performance

Language Assessment Quarterly
ISSN: 1543-4303 (Print) 1543-4311 (Online) Journal homepage: http://www.tandfonline.com/loi/hlaq20
Contextualizing Performances: Comparing

TM
Performances During TOEFL iBT and Real-Life
Academic Speaking Activities
Lindsay Brooks & Merrill Swain
To cite this article: Lindsay Brooks & Merrill Swain (2014) Contextualizing Performances:
TM
Comparing Performances During TOEFL iBT and Real-Life Academic Speaking Activities,
Language Assessment Quarterly, 11:4, 353-373, DOI: 10.1080/15434303.2014.947532
To link to this article: http://dx.doi.org/10.1080/15434303.2014.947532
Published online: 14 Nov 2014.
Submit your article to this journal
Article views: 459
View related articles
View Crossmark data
Full Terms & Conditions of access and use can be found at

http://www.tandfonline.com/action/journalInformation?journalCode=hlaq20
Download by: [58.27.246.251] Date: 18 October 2015, At: 16:59

Language Assessment Quarterly, 11: 353–373, 2014
Copyright © Taylor & Francis Group, LLC
ISSN: 1543-4303 print / 1543-4311 online
DOI: 10.1080/15434303.2014.947532
ARTICLES
Contextualizing Performances: Comparing

Performances During TOEFL iBTTM and Real-Life
Downloaded by [58.27.246.251] at 16:59 18 October 2015
Academic Speaking Activities

Lindsay Brooks and Merrill Swain
University of Toronto, Toronto, Ontario, Canada
In this study we compare test takers’ performance on the Speaking section of the TOEFL iBTTM
and their performances during their real-life academic studies. Thirty international graduate stu-
dents from mixed language backgrounds in two different disciplines (Sciences and Social Sciences)
responded to two independent and four integrated speaking tasks of the TOEFL iBT and participated
in semistructured interviews. For the real-life academic contexts, we recorded the performances of
our participants in one in-class and one out-of-class speaking activity. On the basis of an analysis of
the participants’ speaking (examining grammatical, discourse, and lexical features), we demonstrate
that there are some overlapping and some distinct differences in their performances across contexts.
Our findings both support and raise questions about the extrapolation inference claim of the validity
argument of the Speaking section of the TOEFL iBT.
INTRODUCTION
In this article we compare students’ performances during the Speaking section of the Internet-
based Test of English as a Foreign LanguageTM (TOEFL iBTTM ) with their speaking per-
formances during their real-life graduate studies. Our study is framed within a Vygotskian
sociocultural theoretical (SCT) perspective (Lantolf & Thorne, 2006; Vygotsky, 1986), in which
speaking performances, whether during tests or other contexts, are conceptualized as mediated,
goal-driven activities. An SCT perspective recognizes the “inseparability of cognition and emo-
tion” (Swain, 2013a) and is therefore particularly suitable for examining speaking performance.
Also central to SCT and integral to our study is context. Context is not “that which surrounds” but
rather “that which weaves together” (Cole, 2005, p. 214). It is in this sense that we use the word
Correspondence should be sent to Lindsay Brooks, University of Toronto, Ontario Institute for Studies in Education,
252 Bloor St. West, 10th floor, Toronto, ON M5S 1V6, Canada. E-mail: lindsay.brooks@utoronto.ca
354 BROOKS AND SWAIN
“context” in this article. What this SCT perspective of context means for our study is that whether
students are engaged in taking a test or speaking during their real-life academic activities, their
goals, affect, and cognition are interwoven with, and inseparable from, the context from which
their performances emerge.
When prospective students take high-stakes language proficiency tests as part of the admission
process to colleges or universities, their successful test performance in one setting is followed by
daily performances in their real-life academic settings once they are admitted. The use of such
proficiency tests (e.g., TOEFL iBT) as evidence of language facility for admission decisions is
based on the claim that scores on the tests can predict test takers’ linguistic readiness (or not) to
cope with the language demands of academic studies. Such tests seek to simulate typical authen-
tic academic activities, and the scores are said to extrapolate to performance in real-life academic
settings (e.g., Bridgeman, Powers, Stone, & Mollaun, 2012). In the TOEFL iBT, the extrapola-
tion inference of its validity argument is based on the warrant that “the construct of academic
language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in
English-medium institutions of higher education” (Chapelle, Enright, & Jamieson, 2008, p. 21).
The assumption underlying the warrant is that “performance on the TOEFL is related to other
criteria of language proficiency in the academic context” (ibid., p. 21). To date, the evidence
for this assumption has been based on correlational studies examining the relationships between
scores on TOEFL tasks and scores on other tests, test takers’ self-assessments, instructor ratings
of student performance, and/or course placements. Correlations indicate an association between
performance scores in different settings but do not explain the cause of such relationships. In this
article we extend the evidence base beyond such correlational criterion-related evidence and
investigate whether the extrapolation inference holds true, not for the scores but for the actual
language use in the Speaking section of the TOEFL iBT.
Collecting direct evidence of speaking permits a direct comparison of the grammatical, dis-
coursal, and lexical characteristics of test taker performance during different activities, as well
as the strategies involved in performing them. However, the former line of research has yet to
be undertaken. Most previous studies focused on describing the strategies (e.g., Swain, Huang,
Barkaoui, Brooks, & Lapkin, 2009; Barkaoui, Brooks, Swain, & Lapkin, 2013) and performance
(e.g., Brown, 2003; Lazaraton, 1996; Lumley & Brown, 1996) elicited by test tasks and test set-
tings only. To the best of our knowledge, no studies have yet compared oral performance in a test
setting with oral performance of the same participants in real-life academic settings. It is the lack
of such research that we address in the present study.
In the following section, we first review the literature relating specifically to the Speaking
section of the TOEFL iBT. This is followed by a review of those studies of particular relevance
to our study that examined spoken performance during 1) tests and 2) real-life academic settings,
both in class and out of class.
LITERATURE REVIEW
The Speaking Section of the TOEFL iBT
The conceptualization of speaking in the TOEFL iBT is based on the working model of language
use in an academic setting, an expanded version of Canale and Swain’s (1980) model of commu-
nicative competence and Bachman’s (1990) model of communicative language ability (Chapelle,
CONTEXTUALIZING PERFORMANCES 355
Grabe, & Berns, 1997). It includes knowledge of language (e.g., grammatical, sociolinguistic, and
textual knowledge), strategic competence, and the context of language use. In the development
of the TOEFL iBT, the stated direction was that “the test should more accurately reflect commu-
nicative competence, which refers to the ability to put language knowledge into use in relevant
contexts” (Taylor & Angelis, 2008, p. 42) and that test tasks should reflect those that students
typically encounter in their academic studies. In addition to making the tasks authentic, a desire
to make the test itself more authentic spurred the inclusion of speaking in the TOEFL iBT.
Including a speaking component in the TOEFL iBT was considered a “highly important devel-
opment” (Butler, Eignor, Jones, McNamara, & Suomi, 2000, p. 23) in that it allowed the testing
of oral skills through tasks thought to simulate real-world communicative situations that students
might encounter in an academic setting. This goal of more closely mirroring the types of speaking
required in academic settings was in part the rationale for including both independent (speaking
only) and integrated (listening, reading, and speaking; and listening and speaking) tasks in the
TOEFL iBT. Another reason for having both independent and integrated tasks was based on the
assumption that the nature of the performance they elicited would differ and provide a broader
representation of the domain. Through the extrapolation inference of the TOEFL iBT validity
argument, the test scores could be extrapolated to reflect speaking performance in the domain of
academic discourse across different genres, functions, and situations (Jamieson, Eignor, Grabe,
& Kunnan, 2008).
Practical constraints imposed by using computers for the Speaking section of the TOEFL iBT
necessitated that the assessment of speaking performance be semidirect and confined to mono-
logic discourse (Enright et al., 2008). As Butler et al. (2000) pointed out, in a computer-delivered
test, constraints and conditions are different because the tests are not created interactively as
most speaking would be in an academic context; tests can limit the nature of interactions, and
this in turn might affect the validity of extrapolations from test performance to performance in
real-life interactions. Therefore, it is important to question whether performance on the TOEFL
iBT monologic speaking tasks can be extrapolated to performance in what is mostly face-to-
face oral communication in a university setting. Increasingly in the language testing literature, an
emphasis has been placed on the co-constructed nature of speaking (e.g., Brooks, 2009; Brown,
2003; Chalhoub-Deville, 2003; Chalhoub-Deville & Deville, 2006; McNamara, 1997; Swain,
2001; Swain, Kinnear, & Steinman, 2011) in which participants are jointly responsible for the
performance. The presence of other interlocutors and the interaction (either verbal or nonverbal)
inherent in face-to-face oral communication in real-life contexts alters the nature of the interac-
tion and hence the performance (Deville & Chalhoub-Deville, 2006; Luoma, 2004; Weir, 2005).
The question remains, however, to what extent the speaking performances during the TOEFL iBT
Speaking tasks and during those in real-life academic activities are comparable.
Studies of Speaking Test Performance
The importance of analyzing test performance in gathering validity evidence is well established
in the language testing literature (e.g., Lazaraton, 2002; O’Loughlin, 2001; Shohamy, 1994;
Swain, 2001). Swain (2001) stressed that the dialogue of test takers (and by extension monologic
discourse of solo performance) can be an important source of validity evidence.
Much of the research on test discourse analysis has focused on the oral proficiency inter-
view (OPI) and its variations, such as the telephonic OPI (Johnson, 2001), and the mismatch
between the type of speaking elicited and real-life conversation (e.g., He & Young, 1998; Lantolf
& Frawley, 1985; Van Lier, 1989). However, none of those studies directly compared test per-
formance and actual conversations with the same individuals. Other studies of the OPI have
examined the relationship between the discourse and the ratings. In analyzing performance on
an OPI conducted in French, Magnan (1988) found a relationship between grammatical errors
and ratings. However, the relationship was not always linear and was dependent on the type of
error. Her explanation for these findings was that those students at a higher level made more errors
because they attempted more complex grammar.
Another strand of research in the literature has examined the relationship between test scores
and/or the wording of rating scales and test performance. One of the most comprehensive studies
examining test performance was conducted by Brown, Iwashita, and McNamara (2005); see also
Iwashita, Brown, McNamara, and O’Hagan (2008). They analyzed 200 speech samples from a
set of prototype independent and integrated speaking tasks developed for the new TOEFL and
measured linguistic resources, phonology, fluency, and content across task types and proficiency
levels. Significant differences between scores based on rating scales and detailed measures were
found only on one or two measures within each of the four categories. Brown et al. concluded
that scores were as good a basis for inferences of proficiency as a detailed analysis of the test
takers’ spoken performance.
Another line of research on test taker performance relevant to the present study has examined
the mode of delivery and type of interaction. In one such study, O’Loughlin (2001) analyzed
the discourse on two versions of an oral proficiency test, a direct (face-to-face) and semidirect
(tape-mediated) version. To make the versions as comparable as possible, interaction in the direct
version of the test was minimized by scripting the interlocutors. O’Loughlin examined the dis-
course in four of the test tasks; three of the tasks were classified as “monologic” in both versions
and the fourth task, a role play, was dialogic in the direct version and monologic in the semidi-
rect version. While the first three tasks tended to show some similarities in discourse features,
the role-play task in the two versions differed in genre, speech moves, communicative properties,
prosodic features, speech functions, and register. Overall, the monologic tasks resulted in more
formal discourse. In his analysis of lexical density, O’Loughlin found that the more interactive
the task, the lower the lexical density. O’Loughlin’s conclusion that the two tests were not equiv-
alent shows that the presence of an interlocutor (even though with limited interactivity) influences
spoken performance.
Studies of Spoken Academic Discourse
While much research has focused on the discourse analysis of written texts in academic con-
texts (e.g., Flowerdew, 2002; Grabe & Kaplan, 1996), relatively few studies have been conducted
on the linguistic characteristics of spoken academic discourse (Biber, Conrad, Reppen, Byrd,
& Helt, 2002; Hyland, 2002). Some of the research has focused on in-class oral discourse.
Farr (2003) analyzed the discourse of instructor and student dyadic interaction and found that
the use of minimal response tokens, nonminimal response tokens, and simultaneous speech to
demonstrate engaged listenership were important features of interaction.
Corpus-based studies have contributed to the literature on classroom discourse. Zareva
(2009) analyzed the use of circumstance adverbials (e.g., adverbials of place, time, process,
contingency) in college-level academic presentations given by students with English as a first
language (L1) and English as a second language (L2). She found differences in the frequencies of
circumstance adverbials between these two groups, with the L2 group using a more limited range.
Zareva suggested that the L2 students perceived the presentation to be more formal than did their
L1 peers. The L2 group tended to focus on the informational content rather than on interacting
with their classmates.
Though there is not much research on spoken academic registers, there is even less on other
informal, everyday registers in university settings (Biber et al., 2002) that tend to be more preva-
lent outside of the classroom but still in the university environment. Results from Biber et al.
(2002) and Biber (2006) suggest that students need to be able to handle formal academic regis-
ters as well as more interactive conversational registers that are common in classroom teaching
and study groups.
Present Study and Research Questions

In this article we address three research questions, each of which relates to the three contexts in
which we collected speaking samples from our 30 student participants: (a) the Speaking section
of the TOEFL iBT (hereinafter SSTiBT); (b) participants’ classrooms (hereinafter in-class); and
(c) participants’ out-of-class settings (hereinafter out-of-class). The three questions are:
1. Are there differences in the grammatical features (complexity and inaccuracy) used by
students across the three contexts?
2. Are there differences in the discourse features (cohesion and register characteristics) used
by students across the three contexts?
3. Are there differences in the vocabulary used by students across the three contexts?
METHOD
Participants
We recruited 30 international graduate students enrolled at a university in Canada to partici-

pate in our study. The participants were from two disciplines, Sciences (Engineering, Dentistry,
Biochemistry, and Physics) and Social Sciences (Education and Psychology), with 15 students in
each of these areas of study. We decided to focus on graduate students because we felt that they
would likely be required to do more speaking in their (in-class) academic studies than undergrad-
uate students would. Criteria for inclusion in the study were that participants had to 1) have a first
language other than English; 2) have been in Canada for less than two years; and 3) have taken an
English proficiency test within the previous two years as part of the admission process into their
current programs of study.
Our participants came from 11 different language backgrounds, including Mandarin (17), Farsi
(4), and one participant from each of the following 9 language backgrounds: Arabic, German,
Hindi, Italian, Kurdish, Nepalese, Portuguese, Russian, and Spanish. We gave all of our partici-
pants pseudonyms appropriate to their first language backgrounds. Table 1 provides an overview
of additional information about the participants.
TABLE 1
Participants’ Backgrounds
Sciences Social Sciences

(n = 15) (n = 15)
Gender
Female 4 12
Male 11 3
Age in years
Median 23 25
Range 22–30 22–43
Time in Canada (months)
Median 2.5 6
Range 2–17 1–19
TOEFL iBT Speaking score

(maximum score of 30)
Median 24 23
Range 19–30 17–29
The TOEFL iBT scores1 in Table 1 are those that the participants obtained in the research
version of the Speaking section that was part of our data collection procedures (see Procedures
Section below).
Instruments
Background Questionnaire. Participants completed a questionnaire reporting their gender,

age, time in Canada, first language, educational experience, proficiency test scores presented for
university admission, and their speaking activities in their academic studies.
The Speaking Section of the TOEFL iBT (SSTiBT). Our participants responded to a
research version of the SSTiBT consisting of two independent (Tasks 1 and 2) and four integrated
(Tasks 3 to 6) speaking tasks. Table 2 provides an overview of the six tasks.
Semistructured Interview Questions. All participants responded to a semistructured inter-
view in which we asked them to reflect on their perceptions of the SSTiBT and their real-life
academic speaking, their reactions to the test tasks, their speaking in their academic studies, and
to make any other additional comments they wished.
Procedures
Our data collection procedures are outlined in the following five steps:
1. Participants met with a Research Assistant (RA) and completed a background question-
naire. The same RA collected all of the data.
1 The TOEFL iBT Speaking tasks were scored by six ETS raters, who each scored answers to two different prompts.
Each task was scored by two different raters. Of the total rating decisions, in only two instances was adjudication
necessary.
TABLE 2
Overview of the Six Tasks in the SSTiBT
Preparation Time Response Time

Task Language Skills Required Topic (in sec.) (in sec.)
1 Speaking Familiar topic 15 45

2 Speaking Familiar topic 15 45
3 Reading, listening, speaking Campus-life situations 20 60
4 Reading, listening, speaking Academic course content 30 60
5 Listening, speaking Campus-life situations 30 60
6 Listening, speaking Academic course content 20 60
2. At that same session, to familiarize them with the SSTiBT instructions and tasks and to
practice taking notes, participants did one practice integrated listening and speaking task.
3. Immediately following the practice task, participants did all six speaking tasks in our
research version of the SSTiBT.
4. In the next stage of the study, students recorded their speaking once in-class and once out-
of-class (in either order) usually within a month of doing the SSTiBT. The RA arranged
the delivery and pickup of the digital recorders with the participants but was not present
for any of the recordings to not change the nature of the activity. We asked the students
to record activities that were natural and reflective of those they normally engaged in as
part of their academic studies. Therefore, we did not specify time guidelines. An overview
of the activities recorded in each context is in Table 3. The students interacted with their
peers and/or professors in the in-class and out-of-class activities.
5. As the final step in data collection procedures, the RA met with each participant for a
semistructured interview on their perceptions of the SSTiBT tasks and their speaking in
their academic studies (see Brooks & Swain, 2015).
Data Coding
To analyze the participants’ speaking performances in the three contexts, we first transcribed the
participants’ recordings and verified the transcripts for accuracy. We then segmented the tran-
scripts into Analysis of Speech Units (AS-units), which we considered to be the most suitable
unit of analysis for oral performance data. The AS-unit is a “single speaker’s utterance consisting
of an independent clause or sub-clausal unit, together with any subordinate clause(s) associated
with either” (Foster, Tonkyn, & Wigglesworth, 2000, p. 365). A number of studies have proposed
TABLE 3
Overview of the Types of Activities Recorded In-class and Out-of-class
Activity In-class Out-of-class
Presentation 21 2a
Paired or small group discussion 9 28
a Informal presentations to peers.
that the AS-unit is particularly appropriate for analyzing oral data (Foster et al., 2000; Norris &
Ortega, 2009; Plough, Briggs, & Van Bonn, 2010). Foster et al. (2000) argue that because the AS-
unit includes subclausal units, which are common in oral discourse, it is a robust measure for oral
performance data and is “sensitive to genuine differences in performance” (p. 372) because the
parameters of AS-units allow them to be used to evaluate both unfinished statements and state-
ments that occur over stretches of spoken interaction (i.e., across turns). While coding, however,
we found that the AS-units emerging from our data varied widely in length, depending, to a large
extent, on the degree of interaction among interlocutors (i.e., along a monologic–dialogic dimen-
sion). Therefore, we decided to adopt the clause within the AS-unit as the smallest unit of analysis
and use it as a common denominator for our analyses.
We then coded the data for the features specified in each research question. False starts, self-
repairs, and repetitions were identified and not considered in any of the analyses. The coding
schemes that we developed emerged both from the actual data and from consideration of the lit-
erature (e.g., Brown et al., 2005). The following is a detailed description of the coding procedures
for addressing each research question.
Grammatical Features. The grammatical complexity measure was calculated by dividing

the total number of clauses by the total number of AS-units for each participant in each context.
The grammatical inaccuracy measure was calculated by dividing the total number of errors by the
total number of clauses for each participant in each context. To identify errors in the participants’
transcripts, we modified Ferris’s (2002) written error types for oral data. We used this modified
error typology to guide our identification of errors. Because it was impossible to achieve inter-
coder reliability with classifying error types (see also Brown et al., 2005), we only counted the
number of errors, and we made no judgment about severity or comprehensibility.
Three research assistants (RAs) segmented the data into AS-units and clauses and identified
the number of errors in each clause. To establish intercoder reliability, the three RAs coded the
data from one participant individually and then met to discuss and resolve any disagreements
among them about the number of errors. Then the three RAs independently coded 17% of the
data (5 of the 30 participants) by dividing up the data among themselves so that each transcript
was coded by two RAs. Then they met to discuss the coding decisions and resolve any discrepan-
cies. The three coders achieved high levels of intercoder reliability (Henning, 1987), which was
calculated by using the Spearman-Brown prophesy formula: .95 for AS-units, .95 for clauses, and
.89 for errors. Finally, each RA coded a portion of the remaining transcripts.
Discourse Features and Register. All the discourse measures were calculated by dividing
the total instances of each feature by the number of clauses for each participant in each context.
To identify the discourse features to examine, we reviewed categories in the literature (e.g., Biber,
2006; Halliday & Hasan, 1976). However, the features we decided to focus on were, in large part,
data-driven, meaning that we compiled discourse features and then revised and reconstructed the
codes as we progressed through the coding. We made alterations to this emerging list, depending
on what the data revealed. We then counted the number of instances of each feature.
To establish intercoder reliability, two RAs coded two participants’ transcripts individually by
going through the transcripts and highlighting all instances of the discourse features and then met
to discuss and resolve any discrepancies in the numbers of each feature identified. Subsequently,
the two RAs independently coded the data from four participants individually. Then they met to
discuss any disagreements and resolve any discrepancies. The intercoder reliability, calculated
by using the Spearman-Brown prophesy formula, was .97. Finally, each coder coded half of the
remaining transcripts.
Vocabulary Use. For the analysis of participants’ vocabulary use, we used the lexical fre-
quency profiling software VocabProfile (VP) (Cobb, 2006; Heatley & Nation, 1994). VP classified
vocabulary used by the participants into four predefined frequency bands based on written texts:
1) 1,000 most frequent words, including content and function words (K1); 2) 1,000 second most
frequent words (K2); 3) 550 academic words (AWL); 4) the remaining words that are on none
of the lists above (Off-list). We used only the AWL word count, together with the K1 content
words, K2, and Off-list words, in calculating the total number of content words. Type/token
ratios (TTR) are a widely used vocabulary measure. However, we did not use it for our vocabu-
lary analysis because TTR is highly sensitive to the length and topic of a text (O’Loughlin, 2001;
Vermeer, 2000). The participants’ oral performances in our three contexts vary considerably in
terms of length, register, degree of interaction, and content. Therefore, instead of using TTR, we
calculated the number of words per clause that belong to each frequency band.
Because the vocabulary included in VP frequency bands was compiled on the basis of written
texts, using VP for analyzing spoken data posed a number of problems. First, false starts, repairs,
and other utterances that are characteristic of spoken discourse inflate the frequencies of certain
words. Second, the K1 list does not include words that occur very frequently in spoken, espe-
cially dialogic, data (e.g., VP classifies okay as an Off-list word). Third, VP does not recognize
contractions of formulaic expressions, such as gonna, and wanna as their full forms. To address
these problems, a few adjustments were made to the frequency lists and participants’ data: (a)
as with other analyses in the present study, all the false starts, repairs, and other nonmeaning
carrying words (including “XX” and “XXX” that were used to mark unintelligible utterances)
were removed from the data; (b) because the original written text-based K1 list does not include
okay (a high-frequency spoken word) and yeah, and yep, which are spoken equivalents of yes2
(which is classified as a K1 content word), we added them to the K1 list; and (c) contractions in
the transcripts were replaced with their full forms (e.g., wanna was replaced with want to).
Data Analyses
As described earlier, the clause is the common denominator for all analyses in the present study
and therefore the measures examined are presented as per-clause ratios. The only exception is
grammatical complexity, which is presented as the number of clauses per AS-unit. The following
are the procedures we used for the statistical analyses. For all the statistical analyses, we used
SPSS Version 20.
First, we examined whether the coded data would meet the statistical assumptions (e.g., nor-
mality of distribution) for parametric tests, such as the t-test and ANOVA. The Shapiro-Wilk tests
of normality showed that most of the measures were not normally distributed. Accordingly, we
decided to use nonparametric statistical tests in all our statistical analyses. It is for this reason that
medians are used and reported rather than means.
2 Although yep, yeah, and yes can have a range of intended meanings and discourse functions, we did not make a
distinction between literal and intended meanings of these utterances; all were classified as the literal meaning of yes.
Second, to address possible disciplinary differences in the measures we examined across the
three contexts, we conducted Kolmogorov-Smirnov two-sample tests for each measure in each
analysis with the discipline as the independent variable (2 levels) and the measures as the depen-
dent variables. Results indicated no significant differences in any of the measures by participant
discipline. Therefore, the data from the two disciplines have been collapsed, and all analyses
reported in this article have an N of 30.
Finally, to determine if there were significant differences among the three contexts in terms
of the grammatical, discourse, and vocabulary measures, we conducted Friedman tests (nonpara-
metric equivalent of repeated-measures ANOVA). If the Friedman test indicated a significant
difference among the contexts, pairwise comparisons between the contexts (SSTiBT vs. in-class;
SSTiBT vs. out-of-class; and in-class vs. out-of-class) were made with Wilcoxon signed-rank
tests (nonparametric equivalent of the matched-pairs t-test). Because multiple comparisons are
made in the Wilcoxon signed-ranks tests, we needed to correct for the possibility of a Type I
error, so we used a Bonferroni correction (.05/3 = .0167) to adjust for making three pairwise
comparisons. Therefore, throughout our results in which we have used pairwise comparisons, our
level of significance is .0167 to reflect that a Bonferroni correction has been applied.
RESULTS
Grammatical Features
To answer the first research question, we calculated two grammatical measures: complexity
(clauses per AS-unit) and inaccuracy (errors per clause).3
Grammatical Complexity by Context. Table 4 shows the medians and ranges for grammat-
ical complexity across the three contexts. The trend shows a decrease in grammatical complexity
from SSTiBT to in-class to out-of-class.
TABLE 4
Descriptive Statistics for Grammar Measures by Context (N = 30)
Context Clauses per AS-unit Errors per Clause
SSTiBT
Median 2.49 .44
Range 1.67 .93
In-class
Median 1.68 .23
Range 1.02 .58
Out-of-class
Median 1.38 .14
Range .72 .39
3 We decided to use contexts instead of types of activity (presentations vs. group discussions) to report our results.
However, it should be noted that we ran each analysis by activity type, and our results revealed the same patterns.
TABLE 5
Follow-up Tests for Comparing Grammatical Complexity (Clauses per AS-unit) by Context
Clauses per AS-unit Za Sig.b rc
In-class vs. SSTiBT −4.78 .00∗ .62

Out-of-class vs. SSTiBT −4.78 .00∗ .62
Out-of-class vs. In-class −3.63 .00∗ .47
a Wilcoxon signed-rank test. b Asymp. sig. (2-tailed); ∗ p < .01. c Effect size.
Because of this trend, we examined if there were any significant differences in grammatical
complexity across the three contexts by conducting a Friedman test, which indicated there was a
significant difference (X 2 (2, N = 30) = 48.27, p < .01).
Follow-up pairwise comparisons showed that there was a significant difference in complexity
between each context. The effect size between SSTiBT and both in-class and out-of-class was
.62, indicating a large effect size, whereas the effect size between in-class and out-of-class was
.47, indicating a medium effect size (see Table 5).4
Grammatical Inaccuracy by Context. Table 4 also shows the medians and ranges for
grammatical inaccuracy across the three contexts. The trend shows a decrease in grammatical
inaccuracy from SSTiBT to in-class to out-of-class. We then conducted a Friedman test, which
indicated there was a significant difference in grammatical inaccuracy across the three contexts
(X 2 (2, N = 30) = 45.87, p < .01). Follow-up pairwise comparisons (see Table 6) showed that
there was a significant difference in inaccuracy between each context. The effect sizes between
SSTiBT and both in-class and out-of-class were large, .59 and .61, respectively, whereas the effect
size between in-class and out-of-class was .45, indicating a medium effect size.5
That in the context of the SSTiBT our participants were the most inaccurate is surprising,
given that in reflecting on their speaking in the SSTiBT and in their real-life academic studies,
they commonly reported that they paid more attention to language in the context of the test.
However, as our results indicated, this increased attention to language did not necessarily translate
into increased accuracy. In Excerpt 1, one of our participants, Ling, commented on this when
comparing her speaking on the test with her speaking in out-of-class contexts.
4 Following Field (2009), we used Pearson’s correlation coefficient r as a measure of effect size, with an r of 0 meaning
there is no effect and an r of 1 meaning there is a perfect effect. Following Cohen (1992), Field suggests that r = .10 is a
small effect; r = .30 is a medium effect; and r = .50 is a large effect (Field, 2009, p. 57).
5 As explained earlier, we decided to use the clause as the common denominator for our analyses. However, clauses
based on the AS-unit may still raise the issue of comparability, though to a lesser degree than AS-units, across the
three contexts. The out-of-class context in particular produced a great number of very short independent subclausal units
(counted as one-clause AS-units) consisting of only one or two words (e.g., short answers, such as yes, or sure in spoken
interaction). These clauses obviously have little room for errors. This type of clause never occurred in the SSTiBT context,
and this may have been a major factor contributing to significantly higher grammatical accuracy in the out-of-class context
as seen above. To examine if this trend would still hold if we compared the grammatical inaccuracy measure calculated
with longer AS-units from the three contexts, we selected AS-units from each context that are three, four, and five clauses
long and aggregated all the errors occurring in those clauses and calculated the average number of errors per clause.
Results indicated that the same trend holds. The participants made .443 errors per clause in the SSTiBT, .257 in the
in-class context, and .215 in the out-of-class context.
TABLE 6
Follow-up Tests for Comparing Grammatical Inaccuracy (Errors per Clause) by Context
Errors per Clause Za Sig.b rc
In-class vs. SSTiBT −4.60 .00∗ .59

Excerpt 1
Definitely the daily life conversations I will speak better because I’m not so think about the word, the
grammar, all the time, instead I was just like trying to express myself whichever way, just like you
can understand . . . when I did the TOEFL test I was nervous too, I don’t know why, just like talking
to a machine is like making me [nervous]. . . . When we did the conversation between my lab mate,
I was actually not thinking about the grammar at all. I was just thinking about the conversation itself.
That is the natural thing, right? (Ling, interview).
In Excerpt 2, in a representative comment from one of our participants, Yuming commented

on the limited time in the test and also said that in the test, unlike in interaction in real life, she
did not receive any feedback on her accuracy or comprehensibility.
Excerpt 2
[In the test] you just can be one “answer machine” [laughs] to offer answers, and you don’t know
what you said is accurate or makes sense or not. But in the real life, the teacher can say, “Okay, what
you said might be da da da” and . . . you may change your mind and you may say more. But in the
limited time of the TOEFL speaking, it cannot be interactive like this (Yuming, interview).
Discourse Features
To answer the second research question, we calculated four measures of cohesion: 1) connec-
tors (e.g., furthermore, however); 2) coordinating conjunctions (e.g., and, but); 3) subordinating
conjunctions (e.g., although, because); and 4) total connectives (i.e., the total connectors, coor-
dinating conjunctions, and subordinating conjunctions). We also calculated five measures of
register: 1) informal language6 (e.g., like, guys); 2) speech organizers (e.g., first, in conclu-
sion); 3) questions (e.g., What is it?); 4) nominalization (e.g., simulation, propulsion); and
5) passivization (e.g., be satisfied). We expressed all measures as number of instances per clause.
Cohesion by Context. Table 7 shows the medians and ranges of the measures of the dis-
course features across the three contexts. To compare the use of connectives per clause, we
conducted a Friedman test, which indicated that there was a significant difference in the use
of connectives across the three contexts (for coordinating conjunctions, X 2 (2, N = 30) = 22.20,
p < .01), subordinating conjunctions (X 2 (2, N = 30) = 18.20, p < .01), and total connectives (X 2
(2, N = 30) = 29.87, p < .01). For connectors, a Friedman test showed no significant difference
6 Our measure of informal language refers to colloquial use of language such as the use of like as a filler in
conversation.
TABLE 7
Descriptive Statistics for Discourse Measures by Context (N = 30)
SSTiBT In-class Out-of-class
Measure Median Range Median Range Median Range
Cohesion
Connectors .02 .09 .01 .08 .01 .10
Coordinating Conj. .29 .45 .28 .46 .18 .22
Subordinating Conj. .11 .13 .08 .23 .06 .16
Total connectives .43 .42 .41 .61 .26 .35
Register
Informal Language .02 .26 .06 .52 .10 .31
Speech Organizers .06 .13 .02 .13 .00 .03
Questions .00 .01 .02 .13 .07 .27

Nominalization .22 .13 .21 .39 .10 .21
Passivization .02 .06 .02 .12 .00 .02
TABLE 8
Follow-up Tests for Comparing Use of Connectives by Context
Measure Comparison Za Sig.b rc
Coordinating In-class vs. SSTiBT −0.175 .86 .02

Conjunctions Out-of-class vs. SSTiBT −4.52 .00∗ .58
Subordinating In-class vs. SSTiBT −2.03 .04 .26
Conjunctions Out-of-class vs. SSTiBT −4.21 .00∗ .54
Total Connectives In-class vs. SSTiBT −1.04 .30 .13
a Wilcoxon signed-rank test. b Asymp. sig. (2-tailed); ∗ p ≤ .01. c Effect size.
in the use of connectors across the three contexts (X 2 (2, N = 30) = 4.660, p > .05). Follow-
up pairwise comparisons showed that the measures for coordinating conjunctions, subordinating
conjunctions, and total connectives were significantly lower in the out-of-class context than they
were in the other two contexts, with large or medium effect sizes (see Table 8).
Register by Context. To compare the use of the discourse features concerning register, we
conducted Friedman tests, which showed that there were significant differences in the use of
informal language (X 2 (2, N = 30) = 24.87, p < .01), speech organizers (X 2 (2, N = 30) =
35.47, p < .01), nominalization (X 2 (2, N = 30) = 24.20, p < .01), questions (X 2 (2, N = 30) =
47.45, p < .01), and passivization (X 2 (2, N = 30) = 24.97, p < .01) across the three contexts.
Follow-up pairwise comparisons showed that measures of informal language and questions
were significantly higher in the out-of-class context than in the in-class context, which in turn
was higher than in the SSTiBT. Use of speech organizers followed the opposite pattern with
highest use in the SSTiBT, followed by in-class, which was significantly higher than out-of-class.
TABLE 9
Follow-up Tests for Comparing Use of Features of Register by Context
Informal Language In-class vs. SSTiBT −2.56 .01∗ .33

Speech Organizers In-class vs. SSTiBT −3.30 .00∗ .43
Questions In-class vs. SSTiBT −4.46 .00∗ .58
Nominalization In-class vs. SSTiBT −0.977 .33 .13

Passivization In-class vs. SSTiBT −1.07 .28 .14
a Wilcoxon signed-rank test. b Asymp. sig. (2-tailed); ∗ p ≤ .01. c Effect size.
There also was more use of nominalization and passivization in both the SSTiBT and in-class
contexts than there was in the out-of-class context. All effect sizes were medium or large (see
Table 9).
Our participants’ comments corroborate our quantitative findings. Students typically reported
that they consciously tried to use formal language, such as speech organizers in the test as
illustrated in Bo’s comments in Excerpt 3. Similar to Ling (see Excerpt 1), Suyin commented
that she paid more attention to language in the test context and felt that she spoke more formally
when responding to the test tasks (see Excerpt 4).
Excerpt 3
I can always found a strategy. It’s a very lazy strategy to cope with most of this questions, I can say.
If they ask your opinion, idea about one issue, then I say: “for one hand, you can blah blah blah.
However, on the other hand, you have to take something something into consideration.” It’s a rigid
pattern you form when you prepare for this . . . but in the real world I would not do this . . . I would
say something more creative (Bo, interview).
Excerpt 4
I’m thinking when I did the test, I will pay more attention to academic vocabulary, and the expressions
or phrases, the difference between daily settings and academic settings, I might talk more formal or
something, so that’s just the formal talking style (Suyin, interview).
Vocabulary Use
To answer the third research question, we used VP and calculated the number of Kl (1,000 most
frequent words, including content and function words), K2 (1,000 second most frequent words
(K2), and Off-list words per clause.
TABLE 10
Descriptive Statistics for Vocabulary Measures by Context (N = 30)
SSTiBT In-class Out-of-class
Measure Median Range Median Range Median Range
K1 (content + function) 5.33 5.05 5.24 5.51 4.42 2.02

K2 .28 .23 .24 .50 .17 .29
Off-list .31 .94 .38 .95 .25 .47
Content wordsa 2.88 3.40 2.98 4.16 2.27 1.04
Note. a Content words = K1 content words + K2 + AWL + Off-list.
TABLE 11
Follow-up Tests for Comparing Vocabulary Measures by Context (N = 30)
K1 In-class vs. SSTiBT −0.237 .81 .03

K2 In-class vs. SSTiBT −1.02 .31 .13
Off-list In-class vs. SSTiBT −1.51 .13 .19
Total Content (K1 content + In-class vs. SSTiBT −1.10 .27 .14
K2 + AWL + Off-list) Out-of-class vs. SSTiBT −4.76 .00∗ .61
Table 10 provides the medians and ranges for the number of words per clause from each
frequency band across the three contexts.
For the K1, K2, and Off-list words, Friedman tests indicated that the three contexts had signif-
icantly different numbers of words per clause (for K1, X 2 (2, N = 30) = 29.40, p < .01, K2, X 2
(2, N = 30) = 20.87, p < .01, and Off-list, X 2 (2, N = 30) = 12.80, p < .01). Follow-up pairwise
comparisons showed that the out-of-class context had a significantly lower number of words per
clause for each frequency band than the other two contexts. The effect sizes of the significant
differences were medium to large, ranging from .36 to .62 (see Table 11).
Finally, on a Friedman test, the three contexts also showed a significant difference in the num-
ber of content words per clause (X 2 (2, N = 30) = 26.47, p < .01). Specifically, follow-up
pairwise comparisons indicated that the out-of-class context had a significantly lower number of
content words per clause than the other two. While the effect sizes were both large, the effect size
between the out-of-class and the SSTiBT was larger (r = .61) than that between the out-of-class
and the in-class (r = .51) (see Table 11).
Again, the participants’ comments reflected these findings. In the SSTiBT, students reported
that they paid more attention to vocabulary (see also Excerpts 1, 3, and 4), but as demonstrated
in Tala’s comments in Excerpt 5, they also considered vocabulary (and register) in their in-class
presentations more so than they did in group discussions.
Excerpt 5
As far as you communicate that’s enough, they can understand. But of course you don’t want to use
too informal language in your presentation. And you worry about judgement of the people about your
vocabulary and for TOEFL you just of course worry about your mark, so you use, maybe a wide
web of vocabularies, but in group discussion, whatever comes to your mind, and easy to say (Tala,
interview).
DISCUSSION
As we discussed earlier, to our knowledge there have been no direct comparisons of oral test per-
formance and speaking in real-life academic contexts. In conducting this comparison of speaking
across these contexts, the overall purpose of our study has been to find evidence (or not) to support
the extrapolation inference of the TOEFL iBT validity argument.
In the initial part of our discussion, we summarize our findings by indicating one of three
patterns they fell into, and by interpreting the patterns. Table 12 provides a summary of these
three patterns.
Pattern 1 was found in our analyses of grammatical complexity. This pattern shows a signif-
icant decrease in complexity from the SSTiBT to in-class to out-of-class contexts (Table 12).
This finding suggests that complexity falls along a continuum of interaction: the less interactive
the activity (such as SSTiBT), the greater the complexity. This corroborates a study by Michel,
Kuiken, and Vedder (2007), who found that monologic tasks were more linguistically complex
(measured by number of clauses per AS-unit) than were dialogic tasks. The out-of-class context,
mostly involving informal discussions between peers, was the most interactive as evidenced by
the numerous independent subclausal units (see Foster et al., 2000; see also Footnote 5) in the
students’ speaking. Noteworthy is that all of the activities in real life involved interaction and
dialogue.
Pattern 1 was also reflected in our measure of grammatical inaccuracy, with performances in
the SSTiBT the most inaccurate. On the surface, this may seem like a surprising finding, given
that in a language testing situation, students might be expected to be more attentive and therefore,
more accurate in their grammatical use. However, it is possible that this increased attention to
language, compounded by the stress of being tested, negatively affected their accuracy (see, for
example, Excerpt 1; see also Swain et al., 2009).
TABLE 12
Patterns in the Results
Pattern Measures
1. SSTiBT > In-class > Out-of-class grammatical complexity, grammatical inaccuracy, speech organizers
2. SSTiBT & In-class > Out-of-class coordinating conjunctions, subordinating conjunctions, total connectives,
passivization, nominalization, K1, K2, off-list, total content words
3. Out-of-class > In-class > SSTiBT questions, informal language
In the present study, it may also have been that the participants, because they knew that their
language would be assessed, attempted more difficult grammar and therefore made more errors
(Magnan, 1988). Although for the participants in our study, responding to the SSTiBT was not
high stakes, their goals were still to perform well and display their facility with the language.
From the representative sample of student comments in Excerpts 1, 3, and 4, the participants
certainly reported that they paid more attention to the language they used in the test context
and as Tala mentioned (see Excerpt 5), the consideration of their scores spurred this focus on
language.7 Other contributing aspects to the lack of accuracy in the test context could also have
been due to the lack of interaction, the limited time in the test, and the differences between the
test and real-life academic contexts that Yuming noted (see Excerpt 2).
In our analyses of coordinating conjunctions, subordinating conjunctions, and total connec-
tives as measures of cohesion, we found Pattern 2 (Table 12) across the three contexts. This
pattern demonstrated similarity in use of each feature in the SSTiBT and in-class contexts. These
features of cohesion were used significantly more than in the out-of-class context. The nature of
the activities during the SSTiBT (monologues) and the in-class context (presentations of research
or focussed discussions on course content in the presence of others) resulted in more complex
utterances in number of clauses per AS-unit. More complex utterances require the use of connec-
tives to join the clauses. However, interactive dialogue, such as during the out-of-class context,
tends to be characterized by fragmentary, short turns, and therefore, does not have the same
interactive structure as “when the same speaker holds the floor” (Halliday, 1989, p. 87).
For our measures of register, we found Patterns 1, 2, and 3. Pattern 1 (Table 12) occurred for
the use of speech organizers. This may be due in part to the differences in formality (Zareva,
2009) in those contexts and/or due to the mode of response as in O’Loughlin’s (2001) study.
Reflecting the perceptions of many of our participants, Suyin referred to the SSTiBT as “formal”
(see Excerpt 4) and participants such as Bo, purposefully injected speech organizers into his
test responses (see Excerpt 3). Use of linking adverbials (some of which we have included in
our category “speech organizers”) is more characteristic of written registers, which also tend
to be more formal. According to Biber (2006), in classroom teaching in the university context,
linking adverbials, such as “for example” and “that is,” are occasionally used but overall occur
infrequently in spoken registers.
Pattern 2 (Table 12) was found for passivization and nominalization, again a pattern reflective
of the students’ use of different registers across the contexts. Biber (2006) found use of passives
to be “extremely rare” (p. 65) and nouns to be “relatively rare in the spoken university registers”
(p. 56), while both are more frequent in written registers. The results of our study support these
findings.
Pattern 3 (Table 12) concerns the use of questions and informal language, reflecting again
an interrelated dimension of informality and interactivity. In this pattern, the use of questions
and informal language was significantly greater during the out-of-class context than during the
7 Our findings from the grammar measures show a clear pattern of decreasing syntactic complexity and increasing
grammatical accuracy moving from the SSTiBT to the out-of-class context. To some, this pattern may imply cognitive
trade-offs between syntactic complexity and grammatical accuracy (Skehan, 1998). However, as with other measures in
our study, syntactic complexity and grammatical accuracy may have been affected by a complex interplay of different
aspects of the context (both cognitive and affective). Therefore, the grammatical findings should not be taken to suggest
a simple inverse relationship in which complexity and accuracy compete for attentional resources.
in-class context, which in turn was greater than in the SSTiBT. There was only one instance of
a question in the SSTiBT, and it was used rhetorically (i.e., What’s the point?). Although in the
in-class contexts most of the activities were presentations, there was still frequent interaction,
as well as acknowledgment of the audience through questions and answer sessions or through
the use of rhetorical questions. The greater use of informal language in the out-of-class context
is not unexpected. In that context, the participants were the most confident, relaxed, and were
interacting with peers outside the classroom with no time limitations.
For our analyses of vocabulary use, we found Pattern 2. This pattern, in which performance
was similar across the SSTiBT and in-class contexts, but greater than in the out-of-class con-
text, held for use of all our vocabulary measures: K1 words, K2 words, Off-list words, and the
total number of content words per clause. With the exception of the use of K1 words, which
includes both content and function words, these vocabulary measures all indicate content word
use. Our finding that the out-of-class context, in which there was the most interaction, content
word use was less than in the SSTiBT and in-class contexts supports other studies, which have
shown that increased interactivity tends to result in a decreased proportion of content words (e.g.,
O’Loughlin, 2001; Shohamy, 1994; Ure, 1971).
To summarize, in Patterns 1 and 3, the SSTiBT is significantly different from the other two
contexts. In Pattern 2, the SSTiBT aligns more closely with in-class language use. All effect sizes
of significant differences were medium to large. Dimensions that interactively appear to play a
role are monologic-dialogic (interactivity); formal-informal; nervousness-relaxed/confident; and
availability of time.
CONCLUSION
By directly comparing actual speaking in a test context to speaking in real-life academic contexts,
we have attempted to move beyond correlational criterion-related evidence that has to date been
the backing for the extrapolation inference argument for the Speaking section of the TOEFL
iBT. Although it is unrealistic to expect there to be an exact correspondence between speaking
in a test context and speaking in real-life academic contexts, there should be some degree of
overlap in the spoken performances if one is to extrapolate from the test context to real-life
contexts. And, indeed, we found that there was some overlap in spoken performances, between
the SSTiBT and in-class performances in the use of connectives, passivization, nominalization,
and vocabulary types. In all other measures (grammatical complexity, grammatical inaccuracy,
use of speech organizers, use of questions, and use of informal language), the three contexts were
distinct.
As one would expect within a SCT framework, where speaking performances are concep-
tualized as mediated, goal-driven activities, it is not surprising that performances would be
different during testing, in-class, and out-of-class contexts: the goals are different, the strate-
gies that mediate performance are different (Brooks & Swain, 2013, 2015), and emotions are
linked intimately to performances (Swain, 2013b). In addition, one activity is monologic, whereas
the other two are dialogic in nature. Perhaps what is surprising is the extent to which there is
overlap.
For a validity argument, as Kane (2012) stated, strong claims, such as predictions of future
performance in different contexts “would typically require strong empirical support” (p. 36) and
that if any “weak links are identified, it may be necessary to adjust the interpretation and/or
the assessment or to conduct additional research” (p. 38). Because our direct comparisons of
speaking in the test and real-life academic contexts show both overlap and non-overlap of per-
formances across the three contexts, our findings expose a potential weak link in the interpretive
argument chain. As Chapelle, Enright, and Jamieson (2010) stated, “Time will tell whether future
researchers will be able to pick up the validity narrative and add to it with additional backing or
challenge it with rebuttals” (p. 11). Although our measures do not represent the complete picture
of the construct of academic language proficiency, our findings are a starting point to question
the validity narrative with a rebuttal. More extensive studies with different populations need to be
conducted to support this rebuttal or add backing to the validity argument.
REFERENCES
Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
Barkaoui, K., Brooks, L., Swain, M., & Lapkin, S. (2013). Test-takers’ strategic behaviors in independent and integrated
speaking tasks. Applied Linguistics, 34(3), 304–324.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam, Netherlands:
John Benjamins.
Biber, D., Conrad, S., Reppen, R., Byrd, P., & Helt, M. (2002). Speaking and writing at the university: A multidimensional
comparison. TESOL Quarterly, 36(1), 9–48.
Bridgeman, B., Powers, D., Stone, E., & Mollaun, P. (2012). TOEFL iBT speaking test scores as indicators of oral
communicative language proficiency. Language Testing, 29(1), 91–108.
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better performance. Language
Testing, 26(3), 341–366.
Brooks, L., & Swain, M. (2015). Students’ voices: The challenge of measuring speaking for academic contexts. In B.
Spolsky, O. Inbar, & M. Tannenbaum (Eds.), Challenges for language education and policy: Making space for people
(pp. 65–80). New York, NY: Routledge.
Brooks, L., & Swain, M. (2013, March). Strategic speaking clusters in testing and real-life contexts. Paper presented at
the AAAL Conference, Dallas, TX.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance
on English-for-Academic-Purposes speaking tasks (TOEFL Monograph No. 29). Princeton, NJ: Educational Testing
Service.
Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working
paper (TOEFL Monograph No. 20). Princeton, NJ: Educational Testing Service.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing.
Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing,
20(4), 369–383.
Chalhoub-Deville, M., & Deville, C. (2006). Old, borrowed, and new thoughts in second language testing. In R. L.
Brennan (Ed.), Educational Measurement (4th ed.) (pp. 517–530). Westport, CT: American Council on Education
and Praeger Publishers.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Test score interpretation and use. In C. A. Chapelle, M. K.
Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign LanguageTM (pp.
1–25). New York, NY: Routledge.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference?
Educational Measurement: Issues and Practice, 29(1), 3–13.
Chapelle, C., Grabe, W., & Berns, M. (1997). Communicative language proficiency: Definition and implications for
TOEFL 2000 (TOEFL Monograph No. 10). Princeton, NJ: Educational Testing Service.
Cobb, T. (2006). The Web Vocabulary Profiler (Version 3.0). [Computer program]. University of Québec, Montréal.
Retrieved from http://www.lextutor.ca/vp/eng/
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.

Cole, M. (2005). Putting culture in the middle. In H. Daniels (Ed.), An introduction to Vygotsky (pp. 199–226). New York,
NY: Routledge.
Deville, C., & Chalhoub-Deville, M. (2006). Old and new thoughts on test score variability: Implications for reliability
and validity. In M. Chalhoub-Deville, C. A. Chapelle, & P. Duff (Eds.), Inference and generalizability in applied
linguistics: Multiple perspectives (pp. 9–25). Amsterdam, Netherlands: John Benjamins.
Enright, M. K., Bridgeman, B., Eignor, D., Kantor, R. N., Mollaun, P., Nissan, S., Powers, D. E., & Schedl, M. (2008).
Prototyping new assessment tasks. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity
argument for the Test of English as a Foreign LanguageTM (pp. 97–143). New York, NY: Routledge.
Farr, F. (2003). Engaged listenership in spoken academic discourse: The case of student-tutor meetings. Journal of English
for Academic Purposes, 2(1), 67–85.
Ferris, D. (2002). Treatment of error in second language student writing. Ann Arbor, MI: University of Michigan Press.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage.
Flowerdew, J. (Ed.). (2002). Academic discourse. London, UK: Pearson.
Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring spoken language: A unit for all reasons. Applied
Linguistics, 21(3), 354–375.

Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing. London: Longman.
Halliday, M. A. K. (1989). Spoken and written language (2nd ed.). Oxford, UK: Oxford University Press.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London, UK: Longman.
He, A. W., & Young, R. (1998). Language proficiency interviews: A discourse approach. In R. Young & A. W. He
(Eds.), Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 1–24). Amsterdam: John
Benjamins.
Heatley, A., & Nation, P. (1994). Range. [Computer program] Victoria University of Wellington, New Zealand. Retrieved
from http://www.victoria.ac.nz/lals/resources/range.aspx
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.
Hyland, K. (2002). Genre: Language, context, and literacy. Annual Review of Applied Linguistics, 22, 113–135.
Iwashita, N., Brown, A., McNamara, T., & O’Hagan, S. (2008). Assessed levels of second language speaking proficiency:
How distinct? Applied Linguistics, 29(1), 24–49.
Jamieson, J. M., Eignor, D., Grabe, W., & Kunnan, A. J. (2008). Frameworks for a new TOEFL. In C. A. Chapelle, M. K.
Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign LanguageTM (pp.
55–95). New York, NY: Routledge.
Johnson, M. (2001). The art of non-conversation: A re-examination of the validity of the oral proficiency interview. New
Haven, CT: Yale University Press.
Kane, M. (2012). Articulating a validity argument. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of
language testing (pp. 34–47). New York, NY: Routledge.
Lantolf, J. P., & Frawley, W. (1985). Oral-proficiency testing: A critical analysis. Modern Language Journal, 69(4),
337–345.
Lantolf, J. P., & Thorne, S. L. (2006). Sociocultural theory and the genesis of second language development. Oxford, UK:
Oxford University Press.
Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: the case of CASE. Language Testing, 13(2),
151–172.
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge, UK: Cambridge
University Press.
Lumley, T., & Brown, A. (1996). Specific purpose language performance tests: Task and interaction. In G. Wigglesworth
& C. Elder (Eds.), The language testing cycle: From inception to washback. Australian Review of Applied Linguistics,
Series S, 13, 105–136.
Luoma, S. (2004). Assessing speaking. Cambridge, UK: Cambridge University Press.
Magnan, S. S. (1988). Grammar and the ACTFL oral proficiency interview: Discussion and data. Modern Language
Journal, 72(3), 266–276.
McNamara, T. F. (1997). ‘Interaction’ in second language performance assessment: Whose performance? Applied
Linguistics, 18(4), 446–466.
Michel, M. C., Kuiken, F., & Vedder, I. (2007). The influence of complexity in monologic versus dialogic tasks in Dutch
L2. International Review of Applied Linguistics, 45(3), 241–259.
Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of
complexity. Applied Linguistics, 30(4), 555–578.
O’Loughlin, K. (2001). The equivalence of direct and semi-direct speaking tests. Cambridge, UK: Cambridge University
Press.
Plough, I. C., Briggs, S. L., & Van Bonn, S. (2010). A multi-method analysis of evaluation criteria used to assess the
speaking proficiency of graduate student instructors. Language Testing, 27(2), 235–260.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99–123.
Skehan. P. (1998). A cognitive approach to language learning. Oxford, UK: Oxford University Press.
Swain, M. (2001). Examining dialogue: Another approach to content specifications and to validating inferences drawn
from test scores. Language Testing, 18(3), 275–302.
Swain, M. (2013a). The inseparability of cognition and emotion in second language learning. Language Teaching, 46(2),
195–207.
Swain, M. (2013b, March). The intertwining of emotion and cognition: A Vygotskian sociocultural perspective. Paper
presented at the AAAL Conference, Dallas, TX.
Swain, M., Huang, L.-S., Barkaoui, K., Brooks, L., & Lapkin, S. (2009). The speaking section of the TOEFL iBTTM
(SSTiBT): Test-takers’ reported strategic behaviors (TOEFL iBTTM Report No. TOEFL iBT-10). Princeton, NJ:
Educational Testing Service.
Swain, M., Kinnear, P., & Steinman, L. (2011). Sociocultural theory in second language education: An introduction
through narratives. Bristol, UK: Multilingual Matters.
Taylor, C. A., & Angelis, P. (2008). The evolution of the TOEFL. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson
(Eds.), Building a validity argument for the Test of English as a Foreign LanguageTM (pp. 27–54). New York, NY:
Routledge.
Ure, J. (1971). Lexical density and register differentiation. In G. E. Perren & J. L. M. Trimm (Eds.), Applications of
linguistics (pp. 443–452). Cambridge, UK: Cambridge University Press.
Van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency interviews as
conversation. TESOL Quarterly, 23(3), 489–508.
Vermeer, A. (2000). Coming to grips with lexical richness in spontaneous speech data. Language Testing, 17(1), 65–83.
Vygotsky, L. S. (1986). Thought and language (A. Kozulin, Trans.). Cambridge, MA: MIT Press.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New York, NY: Palgrave Macmillan.
Zareva, A. (2009). Informational packaging, level of formality, and the use of circumstance adverbials in L1 and L2
student academic presentations. Journal of English for Academic Purposes, 8(1), 55–68.

Contextualizing Performances Comparing Performance

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Contextualizing Performances Comparing Performance

Uploaded by

Copyright:

Available Formats

Language Assessment Quarterly

ISSN: 1543-4303 (Print) 1543-4311 (Online) Journal homepage: http://www.tandfonline.com/loi/hlaq20

Contextualizing Performances: Comparing

Lindsay Brooks & Merrill Swain

To link to this article: http://dx.doi.org/10.1080/15434303.2014.947532

Published online: 14 Nov 2014.

Submit your article to this journal

Article views: 459

View related articles

View Crossmark data

Full Terms & Conditions of access and use can be found at

Download by: [58.27.246.251] Date: 18 October 2015, At: 16:59

Contextualizing Performances: Comparing

Academic Speaking Activities

The Speaking Section of the TOEFL iBT

Studies of Speaking Test Performance

Studies of Spoken Academic Discourse

Present Study and Research Questions

We recruited 30 international graduate students enrolled at a university in Canada to partici-

Sciences Social Sciences

TOEFL iBT Speaking score

Background Questionnaire. Participants completed a questionnaire reporting their gender,

Preparation Time Response Time

1 Speaking Familiar topic 15 45

Activity In-class Out-of-class

Grammatical Features. The grammatical complexity measure was calculated by dividing

Context Clauses per AS-unit Errors per Clause

Clauses per AS-unit Za Sig.b rc

In-class vs. SSTiBT −4.78 .00∗ .62

Errors per Clause Za Sig.b rc

In-class vs. SSTiBT −4.60 .00∗ .59

In Excerpt 2, in a representative comment from one of our participants, Yuming commented

SSTiBT In-class Out-of-class

Measure Median Range Median Range Median Range

Questions .00 .01 .02 .13 .07 .27

Measure Comparison Za Sig.b rc

Coordinating In-class vs. SSTiBT −0.175 .86 .02

Measure Comparison Za Sig.b rc

Informal Language In-class vs. SSTiBT −2.56 .01∗ .33

Out-of-class vs. SSTiBT −4.74 .00∗ .61

SSTiBT In-class Out-of-class

Measure Median Range Median Range Median Range

K1 (content + function) 5.33 5.05 5.24 5.51 4.42 2.02

Note. a Content words = K1 content words + K2 + AWL + Off-list.

Follow-up Tests for Comparing Vocabulary Measures by Context (N = 30)

Measure Comparison Za Sig.b rc

K1 In-class vs. SSTiBT −0.237 .81 .03

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.

Linguistics, 21(3), 354–375.

You might also like