TOPIC 1
LINGUISTIC EVALUATION: CONTEXT, HISTORY, ISSUES AND TRENDS
Table of Contents
1 INTRODUCTION
2 THE INTEREST IN EVALUATION
2.1 The nature and quality of evidence
2.2 The effects of evaluation on students
2.3 The fairness of testing with minorities
3 BRIEF HISTORY OF LINGUISTIC EVALUATION
3.1 The pre-scientific trend
3.2 The psychometric-structuralist trend
3.3 The integrative-sociolinguistic trend
3.4 The communicative trend
4 TECHNOLOGICAL ADVANCES IN LINGUISTIC EVALUATION
5 BIBLIOGRAPHIC REFERENCES
1 INTRODUCTION
The specific learning outcomes that the student must achieve at the end of this topic
are:
1. The student defines, uses and relates a series of general concepts of educational
evaluation.
2. The student defines, uses and relates a series of concepts that have appeared
throughout the development of linguistic evaluation.
3. The student defines, uses and relates a series of concepts related to the use of
new information and communication technologies in linguistic evaluation.
2 THE INTEREST IN EVALUATION

Assessment occupies a central place in language teaching systems. Language tests have been the center of intense debate for a multitude of reasons: accusations that the tests were biased against minorities, or that they influence teaching in an undesirable way by paying too much attention to certain types of content to the detriment of others, etc. Considering the importance of assessment in language teaching practice, and the associated issues and debates, it is essential that teachers understand the design, uses and abuses of language assessment instruments.
Decisions about the choice of an educational test, about a test administration, or about the use of a linguistic test, and about educational tests in general, are no longer of interest only to teachers. Currently, society demands effectiveness in foreign language teaching programs.
programs. This increased concern about issues related to language testing stems, in part,
from awareness of the social consequences of testing, especially the danger that certain
tests pose to the rights and opportunities of certain individuals and groups. This concern
has taken the form of attacks on testing, the testing industry and the new standards
governing testing, or requests for postponement of the implementation of new
assessment tools, or accusations that tests are biased and discriminatory. In reality, there
are many compelling reasons to be concerned about the social consequences of
evaluation. However, it is important to distinguish between, on the one hand, the
negative consequences for individuals or groups that originate from failures in the
evaluation instruments and, on the other, failures caused by misinterpretation or misuse
of test scores.
Linn and Gronlund (2000, p. 18) mention three areas that cause controversy in educational evaluation, and that are perfectly applicable to linguistic evaluation: (1) the nature and quality of tests, (2) the effects of evaluation on students, and (3) fairness to minorities.
2.1 The nature and quality of evidence

In the early 1960s, some authors, such as Hoffman (1962, p. 22), argued that multiple-choice items penalized the most intelligent, original or “exceptional” people. Hoffman (1962) supported his claims with a review of standardized test items which showed that some highly creative students, high in the ability being tested, were likely to make interpretations that had not been anticipated by the test designers.[1] Hoffman (1962, p.
17), for example, included the following letter, addressed to the editor of the Times:
Dear Sir:
Among the “mark the different element” questions my son had to answer in a school
entrance test was: “What is the different element in cricket, football, billiards and hockey?”
I said billiards because it is the only game that is played inside a building. A classmate said football because it is the only one in which the ball is not hit with an instrument. A neighbor said cricket because in the other games the objective is to get the ball into a net; and my son, with the confidence that comes with nine springs, decided on hockey because “it's the only game for girls.”

[1] Davies et al. (1999, p. 187) define a standardized test as follows:

A test that ideally has the following characteristics, although so-called standardized linguistic tests do not always have all of them:
A rigorous development, testing and review process, which determines the measurement properties of the test...
Standard procedures for the administration and scoring of the test.
The content of the test is standardized across all versions. This content is based on a set of test specifications that may reflect a theory of linguistic competence or a conception of the expected needs of candidates. Alternative forms of the test are examined to see whether they are equivalent in content.
Hoffman's (1962) criticisms were widely echoed, and they encouraged test authors to supplement the statistical analysis of items with careful logical analysis.
Frederiksen (1984, p. 199) observed that problems in standardized tests are usually well structured, that is, “they are clearly expressed, all the information necessary to solve the problem is available in the problem or - presumably - in the head of the student, and there is an algorithm that guarantees a correct solution if it is applied properly.” However, most of the important problems one faces in life are poorly structured, that is, they are

complex, without defined criteria to determine when the problem has been solved, without all the information necessary to solve the problem, and without a 'legal move generator' to find all the possibilities at each step during the resolution of the problem (ibid.).
These criticisms have led to greater emphasis on open-ended questions and on designing
tests that use computer simulations.
Much of the misinterpretation and misuse of test scores would be avoided if the test
user were aware of the limited nature of the information a test provides. A good test
user takes into account the error that may exist in the test scores and uses information
other than the test score when making their decision. Claiming that better decisions are
made without test scores is claiming that better decisions are made when there is less
information. Test scores are certainly fallible, but they are probably less fallible than
most other types of information used to make educational decisions.
2.2 The effects of evaluation on students

Critics of assessment claim that assessment has undesirable effects on students. Some of the most frequently mentioned criticisms of the use of tests appear below, followed by a few brief comments.
There is no doubt that anxiety increases during a test. For most students, assessment pushes them to try harder. For a few, the anxiety caused by the test may be so high that it interferes with their performance on it. These students usually have high anxiety, and the test simply raises their anxiety level further. Different procedures can be used to reduce test anxiety, such as thorough preparation before the test, practising with a rehearsal version, and providing enough time for the student to take the test with some peace of mind. Fortunately, in recent years the designers of many tests also provide practice versions, and there has been a shift from speed tests to power tests. This should help, but it is still necessary to observe students carefully during the test and to reflect on the scores obtained by students in whom the test produces a high level of anxiety.
There are teachers who attribute stereotypes to students based on test scores, which can
have an undesirable effect on the students' self-concept. It also happens that the student
develops a general feeling of failure from a low score. Teachers must explain to
students who receive low scores that tests are limited measures and that our
competencies (and, therefore, scores) change. Furthermore, the development of the
feeling of failure can be limited if the positive aspects that the student shows in the test
are mentioned. Testing can help students identify their strengths and weaknesses,
thereby contributing to better learning and a positive self-image.
Those who raise this criticism maintain that, when a teacher assigns a score to a test, expectations are created about what each student can achieve. Therefore, those who are expected to achieve more, achieve more, and those who are expected to achieve less, achieve less. This effect, called the Pygmalion effect, was studied by Rosenthal and Jacobsen (1968), although the study was later questioned by other researchers (Elashoff and Snow, 1971; West and Anderson, 1976). It is widely believed that teacher expectations enhance or hinder a student's achievement.
In short, there is some truth in the various criticisms about the undesirable effects of testing on students. But in most cases these criticisms should be directed at the users of the tests, rather than at the tests themselves. The same people who misuse test results are likely to misuse other information, which is probably less accurate and objective. Therefore, the solution is not to stop using tests, but to start using tests and other data more effectively. When tests are used in a positive way - that is, to help students improve their learning - the consequences are likely to be beneficial.
2.3 The fairness of testing with minorities

The issue of fairness to racial and ethnic minorities is critical in any assessment program. Fairness has received increasing attention in the language assessment literature over recent years. The term fairness is associated, according to Linn and Gronlund (2000, pp. 21-22), with several different concepts.
Different concepts can lead to quite different conclusions about the fairness of any
test or assessment instrument. The fourth concept, equality of results, is incompatible
with other principles of assessment, such as the goal of achieving a reliable and valid
measure of what students know, regardless of their origin or ethnic group. If different
groups of students differ in the instruction they have received, in their experiences in
and out of school, and in their interests and effort, a test or assessment instrument that
provides different mean scores for minority groups and for the majority group may
reflect the consequences of unfair treatment of minorities by society.
An absence of bias and procedural fairness are essential for an evaluation to have a
high degree of validity.
3 BRIEF HISTORY OF LINGUISTIC EVALUATION

3.1 The pre-scientific trend

For Spolsky (1978, p. v), the pre-scientific trend, which still prevails in many places in the world, can be characterized by an absence of concern for statistical issues or for notions such as objectivity and reliability:

In its simplest form, it presupposes that we can and should rely entirely on the judgment of an experienced teacher, who can tell what grade should be given after a conversation of several minutes, or after reading the response to an essay (Spolsky, 1978, p. v).
In the pre-scientific trend, oral exams are rare and the exams usually consist of open questions that must be answered in writing.
3.2 The psychometric-structuralist trend

Two groups of specialists shaped this trend:

1. The evaluators, that is, the psychologists responsible for the development of modern theories and techniques of measurement in education, whose main objective is to provide objective measurements through the use of different statistical techniques, which allow the scores to be reliable and the interpretations that we make from those scores to be valid:

The form of the tests... is determined primarily by the need to evaluate the reliability and validity of the tests. This is why, for example, the multiple-choice response technique is so common. In linguistic evaluation this means that we normally resort to the skills of reading and listening comprehension (Ingram, 1968, p. 74).
The evaluators had noticed the poor reliability of traditional exams (Pilliner, 1968, p. 27). Starch and Elliott (1912), for example, observed that the scores that 142 English teachers had assigned to one test ranged between 64 and 98, while on another test the scores ranged between 50 and 98 (Starch, 1913, p. 630). Starch (1913, ibid.) compiled Table 1 from the scores assigned by ten instructors to 10 final first-year English papers from the University of Wisconsin, in which we can appreciate the great disparity in the scores that the instructors assign to a paper written by the same student. Instructor 4, for example, assigns a score of 20 to the paper written by Student 4, while Instructor 8 assigns a score of 68 to this same paper:
Table 1. Scores assigned by 10 instructors to a sample of 10 final first-year English papers from the University of Wisconsin (Starch, 1913, p. 630).
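The disparity Starch describes can be made concrete with a short computation. This is an illustrative sketch: the list of scores below is invented for the example (only the extremes, 20 and 68, echo the two scores mentioned above), and the spread of scores for a single paper serves as a crude index of inter-rater disagreement.

```python
def score_spread(marks):
    """Difference between the most generous and the harshest score
    given to the same paper: a simple index of rater disagreement."""
    return max(marks) - min(marks)

# Hypothetical scores assigned by ten instructors to one paper.
# Only the extremes (20 and 68) echo the example in the text;
# the remaining values are invented for illustration.
paper_4 = [55, 40, 62, 20, 48, 58, 35, 68, 50, 44]

spread = score_spread(paper_4)  # 68 - 20 = 48 points
```

A spread of almost fifty points on a hundred-point scale, for the very same paper, is exactly the kind of unreliability that motivated the evaluators' turn to objective testing.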
1. New types of tasks (such as the task in which the examinee answers by choosing one option from among several possible options) require a written response, which limits linguistic assessment to reading and listening comprehension activities. Agard and Dunkel (1948), for example, stated that the only tests available were written tests of vocabulary, reading and grammar, and that none of these tests evaluated oral production and comprehension skills (cit. in Spolsky, 1978, p. vi; Fulcher, 1999, p. 391).
2. A test developed exclusively by evaluators does not take into account new concepts, procedures and discoveries in language teaching and learning.
2. Experts with training in educational evaluation and linguistics. Already in the 1950s
there were voices that recommended the combination of knowledge from
educational evaluation with linguistic knowledge for the construction of linguistic
tests. Robert Lado (1950), for example, applied this combination of knowledge to
the design of English achievement tests for Latin American students and concluded
the following in his doctoral thesis:
Several conclusions are obtained. These conclusions are (1) that there is a great delay
in the measurement of English as a foreign language, (2) that the delay is related to
unscientific conceptions of the language, (3) that the science of language should be
used in the definition of what to teach... The study provides procedures for the
application of linguistics to the development of foreign language tests (Lado, 1950,
cit. in Carroll, 1953, p. 195).
For Carroll (1953, p. 195), the delay existed, in reality, in “the entire measurement of foreign languages.” Throughout the 1950s and 1960s Lado refined his concepts of linguistic assessment and in 1961 published Language Testing, a book aimed at “teachers of foreign languages and English as a foreign language,” which is based on the assumption that “linguistic knowledge” is a “main contribution” to linguistic evaluation; that is, for Lado (1961, p. vii) linguistic tests had to take into account “the development of modern linguistics during the last thirty-five years”.
According to Spolsky (1978, p. vii), during the 1950s and 1960s the structuralist conception of language, psychological theories and the practical needs of evaluators were combined. On the one hand, the designers of linguistic tests needed extensive lists of items from which to select those to be included in objective tests, while, on the other, the structuralist linguists were describing language as a system composed of elements that combine with each other. In American structural linguistics of the 1950s, a series of hierarchical levels were postulated in the study of language, each composed of a series of units from whose combination the units of the higher level emerged. Lado (1961, p. 25), for example, stated that “language is constructed from sounds, intonation, stress, morphemes, words and combinations of words.” Through this combination of the structural vision of language and objective educational evaluation procedures, the path was clear towards the construction of an objective test with multiple-choice questions based on structural linguistics. The linguistic elements can be evaluated, according to Lado (1961, p. 204), in isolation or in combination in an “integrated skill”, such as listening comprehension (listening), reading comprehension (reading), oral production (speaking), writing (writing) or translation (translation). Below I present two items that appear in Lado (1961), which evaluate isolated elements and combined elements:
The sky highway above the top of the world has become the touchstone of the history of
intercontinental travel, ushering in a new age in commercial aviation (Map of Scandinavian Airlines
Routes)
3.3 The integrative-sociolinguistic trend

For Carroll, the discrete-point approach tests

...very specific items of linguistic knowledge and skill that have been sensibly selected from the generally enormous pool of possible items... It is the type of approach that is necessary and recommended... where knowledge of structure and lexicon, auditory discrimination and the oral production of sounds, and the reading and writing of symbols and individual words (Carroll, 1961[1965], p. 369).[2]
1. The items or tasks that constitute a test designed according to the integrative-sociolinguistic trend are selected from a set that is broader than the set from which the items or tasks of a psychometric-structuralist test are selected. According to Carroll, this is an advantage, since it facilitates the construction of a test that is independent of the curricula that the examinees who are going to take the test have followed.
2. It seems easier to relate the tasks of an integrative-sociolinguistic test to different levels of competence.
[2] Oller (1979, p. 37) defined a discrete-point test as a test “that attempts to concentrate attention on one point of grammar at a time”:

Each test item targets a single element of a given component of a grammar (or perhaps we should say a postulated grammar), such as phonology, syntax, or vocabulary. Furthermore, a discrete-point test is intended to assess only one skill at a time (e.g., listening comprehension, or oral production, or reading, or writing) and only one aspect of a skill (e.g., productive rather than receptive, or oral rather than visual). Within each skill, aspect, and component, discrete items supposedly target exactly one and only one phoneme, morpheme, lexical item, grammatical rule, or whatever the corresponding element is (Oller, 1979, p. 37).
Later, Canale and Swain (1980, pp. 28-31) and Canale (1983, pp. 338-342) developed their concept of communicative competence, which has been very influential in linguistic evaluation.
Other authors have divided the evolution of linguistic evaluation in a slightly different way than Spolsky (1978). James Dean Brown (2005, pp. 19-24), for example, distinguishes four movements in linguistic evaluation, which coexist today: (i) the pre-scientific movement, (ii) the psychometric-structuralist movement, (iii) the integrative-sociolinguistic movement, and (iv) the communicative movement, while Elana Shohamy (1997, p. 141) distinguishes three periods in the history of linguistic evaluation: the period of discrete points, the integrative period and the communicative period.
3.4 The communicative trend

The communicative trend, which began in the United Kingdom and later spread to the United States, is based on three principles.
4 TECHNOLOGICAL ADVANCES IN LINGUISTIC EVALUATION

With the increasing availability and power of microcomputers at a relatively low price, it is not surprising that the use of computer programs to assess the linguistic competence of individuals has become widespread. Some of you may even have already taken, for example, the DIALANG tests (www.dialang.org).
Using a computer to present the items of a linguistic test can have several advantages. For example, instead of having to take the test on an officially scheduled date, examinees can request to take it at a time that best suits their needs. Additionally, instead of having to wait several weeks to receive test results, scores can be obtained immediately. Pearson Driving Assessment (2007) cites the following advantages of computer-based assessment:
The ability to perform testing when the candidate requests it and when it is convenient for the
candidate.
The possibility of creating questions that can be stored in “question banks” and presenting these
questions randomly, reducing “serial” evaluation, that is, the need to evaluate all candidates on
the same day at the same time.
The disappearance of complex logistical problems, such as the distribution, storage and tracking
of exam forms.
Tests can be performed without an Internet connection, thus minimizing the risk of system
failures.
Reduction of effort and time when correcting and reporting results.
Instant results and immediate diagnostic feedback, indicating the candidate's strengths and areas
for improvement.
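The “question bank” idea in the list above can be sketched in a few lines. Everything here is hypothetical (the bank contents, the blueprint of five questions per area): the point is only that every candidate receives the same blueprint but a different random selection of questions, which is what reduces the need for “serial” evaluation.

```python
import random

# A hypothetical question bank, grouped by content area.
bank = {
    "vocabulary": [f"vocab_q{i}" for i in range(1, 31)],
    "grammar": [f"grammar_q{i}" for i in range(1, 31)],
    "reading": [f"reading_q{i}" for i in range(1, 31)],
}

def assemble_form(bank, per_area=5, seed=None):
    """Draw a random test form: the same blueprint (content areas and
    question counts) for every candidate, but a different concrete
    selection of questions."""
    rng = random.Random(seed)
    form = []
    for area in sorted(bank):
        # sample() draws without replacement, so no question repeats.
        form.extend(rng.sample(bank[area], per_area))
    return form

form_a = assemble_form(bank, seed=1)
form_b = assemble_form(bank, seed=2)
```

Operational systems add item metadata (difficulty, skill tested, exposure control) on top of this basic scheme, but the principle is the same.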
Although these advantages are important, the most significant changes have occurred because the computer can easily do things that are difficult in a pencil-and-paper test. The technology allows, for example, the introduction of video recordings, or posing problems that require students to use the Internet, which adds all the advantages that these technologies can provide during the teaching and evaluation processes.
The most widespread change in linguistic assessment has been the use of the computer to administer adaptive tests, that is, tests in which the choice of the next item is based on the examinee's previous responses, such as the DIALANG tests. Adaptive testing can increase the quality of the information available and, therefore, of the decisions made based on that information. An adaptive test typically begins with the presentation of an item believed to be of medium difficulty for the examinee. The second and subsequent items are determined by the examinee's previous responses. In general, if a test taker answers an item correctly, the program next selects a slightly more difficult item. Conversely, a slightly easier item is presented after an incorrect answer. The test ends when the estimate of the test taker's performance reaches a predetermined level of accuracy or when a specified number of items have been presented. It has been shown that adaptive assessment can increase the efficiency and accuracy of measures of certain types of concepts, skills, and abilities. In some cases, adaptive tests can achieve the same level of reliability as a conventional pencil-and-paper test, but in half the time.
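The loop described in this paragraph can be sketched as follows. This is a deliberately simplified illustration, not the DIALANG algorithm: real adaptive tests estimate ability with item response theory, whereas here the difficulty simply moves one step up or down after each response, and the only stopping rule is a fixed number of items.

```python
import random

# Hypothetical item bank: each difficulty level (1 = easiest,
# 9 = hardest) holds a pool of interchangeable items.
ITEM_BANK = {d: [f"item_{d}_{i}" for i in range(20)] for d in range(1, 10)}

def run_adaptive_test(answer_fn, start_difficulty=5, max_items=10):
    """Present items adaptively: a harder item after a correct answer,
    an easier one after an incorrect answer. Stops after max_items
    (a real test would also stop once the ability estimate is
    precise enough)."""
    difficulty = start_difficulty
    history = []
    for _ in range(max_items):
        item = random.choice(ITEM_BANK[difficulty])
        correct = answer_fn(item, difficulty)
        history.append((item, difficulty, correct))
        if correct:
            difficulty = min(difficulty + 1, max(ITEM_BANK))
        else:
            difficulty = max(difficulty - 1, min(ITEM_BANK))
    return history, difficulty

# Simulated examinee who answers correctly whenever the item's
# difficulty does not exceed a (hypothetical) ability of 7: the
# test quickly settles around difficulty levels 7-8.
history, final_level = run_adaptive_test(lambda item, d: d <= 7)
```

Starting at medium difficulty, the simulated test climbs while the examinee answers correctly and then oscillates around the examinee's ability level, which is exactly the behavior the paragraph describes.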
However, you will not understand the full potential of using computers during the assessment process if you only consider computers as tools to present items more easily: the computer can measure competencies that are not adequately measured in conventional pencil-and-paper tests. Video recordings allow the presentation of problems that are more realistic than those normally posed in paper-and-pencil tests. The simulation of problems presented through a computer has several advantages over pencil-and-paper tests in teaching Spanish as a second language: the simulation can force the examinee to concentrate their attention on the use of information to solve a problem, and can help evaluate not only the student's product but also the process the student uses to carry out the activity, including the way in which the activity is approached, the quality of the solution, and the number of hints that may be necessary to solve the activity.
5 BIBLIOGRAPHIC REFERENCES
CARROLL, John B. “Fundamental considerations in testing for English language proficiency of foreign students”. In: Testing the English proficiency of foreign students. Washington, DC: Center for Applied Linguistics, 1961, pp. 30-40. Reprinted in: ALLEN, Harold B. (ed.). Teaching English as a second language: A book of readings. New York: McGraw-Hill, 1965, pp. 364-372.
DAVIES, Alan; BROWN, Annie; ELDER, Cathie; HILL, Kathryn; LUMLEY, Tom; McNAMARA, Tim F. Dictionary of language testing. Cambridge: Cambridge University Press, 1999.
ELASHOFF, Janet D.; SNOW, Richard E. Pygmalion reconsidered; a case study in
statistical inference: reconsideration of the Rosenthal-Jacobson data on teacher
expectancy. Worthington, Ohio: Charles A. Jones, 1971.
SPAIN. Organic Law 2/2006, of May 3, on Education. Official State Gazette , May 4,
2006, no. 106, pp. 17158-17207.
FREDERIKSEN, Norman. “The real test bias: Influences of testing on teaching and
learning.” American Psychologist . 1984, vol. 39, no. 3, pp. 193-202.
FULCHER, Glenn. “Book Review: A history of foreign language testing in the United
States: from its beginnings to the present.” Language Testing . 1999, vol. 16, no. 3,
pp. 389-398.
HOFFMAN, Banesh. The tyranny of testing. New York: Crowell-Collier, 1962.
HYMES, D. H. “On communicative competence”. In: PRIDE, J. B.; HOLMES, Janet (eds.). Sociolinguistics: selected readings. Harmondsworth: Penguin, 1972, pp. 269-293.
INGRAM, Elisabeth. “Attainment and diagnostic testing”. In: DAVIES, Alan (ed.). Language testing symposium: a psycholinguistic approach. London: Oxford University Press, 1968, pp. 70-97.
LADO, Robert. Measurement in English as a foreign language with special reference to Spanish-speaking adults. Doctoral thesis. Ann Arbor, Michigan: University of Michigan, 1950.
LINN, Robert L.; GRONLUND, Norman E. Measurement and assessment in teaching .
Saddle River, NJ: Prentice-Hall, 2000.
OLLER, John W. Language tests at school. London: Longman, 1979.
ORGANIZATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT. OECD Program for International Student Assessment (PISA): PISA in Spanish [online]. Paris: Organization for Economic Co-operation and Development, n.d. [accessed January 14, 2007]. Available on the World Wide Web: <http://www.pisa.oecd.org/document/25/0,3343,en_32252351_32235731_39733465_1_1_1_1,00.html>.
PEARSON DRIVING ASSESSMENT. Computer-based testing: benefits. Pearson VUE [online]. London: Pearson VUE, 2007 [accessed October 27, 2007]. Available on the World Wide Web: <http://www.pearsonvue.co.uk/home/cbt/benefits/>.
PILLINER, Albert EG “Subjective and objective testing”. In: DAVIES, Alan (ed.).
Language testing symposium: a psycholinguistic approach . London: Oxford
University Press, 1968, pp. 19-35.
ROSENTHAL, Robert; JACOBSEN, Lenore. Pygmalion in the classroom: teacher expectation and pupils' intellectual development. New York: Holt, Rinehart and Winston, 1968.
SHOHAMY, Elana. “Second language assessment”. In: TUCKER, G. Richard;
CORSON, David (eds.). Encyclopedia of language and education, vol. 4: second
language education . Dordrecht: Kluwer, 1997, pp. 141-149.