
5
Assessment

Everything that has to do with the test has been given such a high priority, that there is no priority any more but that. The bottom line question comes down to, "Well, what's going to help them do better on the test?" And if it's not going to help them do better on the test, well, we don't have time for that right now.
-Nicole Soto, sheltered English immersion elementary school teacher

KEY TERMS

• testing
• assessment
• evaluation
• summative assessment
• formative assessment
• norm-referenced test
• criterion-referenced test
• reliability
• standard error of measurement (SEM)
• validity
• bias
• accommodations
• performance assessment
• self-assessment
• peer assessment
• portfolio assessment
• multiple measures

GUIDING QUESTIONS

1. What are the differences between testing, assessment, and evaluation?
2. Why should ELL educators be wary of overreliance on standardized tests of ELLs' academic achievement and English language proficiency?
3. Why is there a need for multiple measures?
4. How can ELL educators provide valid and reliable evidence of ELLs' growth and achievement?
5. How can ELL educators use evidence of ELL growth and achievement to inform their decision making?

ELLs are probably the most tested students in our educational system. In addition
to taking the same federal and state tests required for all students, ELLs take
language proficiency tests every year. ELLs also participate in district-level and
school-level tests, and they take classroom-based tests developed by classroom
teachers. In one Texas school district, between March and the end of a recent school
year, 5th grade bilingual students were required to take nine different state and
district tests.
Educators need to be aware of the heavy testing burden placed on ELL students
and the impact test results can have on students' and teachers' lives. Test results
are often used to determine whether students are placed in or exited from special
programs, allowed to progress to the next grade level, required to attend summer
school, or awarded a high school diploma. Test scores are also often used to judge
the adequacy and skills of teachers, to determine, for example, which teachers re-
ceive monetary rewards, which teachers must obtain additional training, and
which teachers may be required to find another job. Elena Shohamy has outlined
specific features of the power inherent in tests (see Box 5.1).

Box 5.1

Tests are administered by powerful institutions
  Taken by individuals with little power
  Institution has non-negotiable control over knowledge
Tests use the language of science
  Statistical, objective, empirical
  Grants authority, status, and power
Tests use the language of numbers
  Symbol of objectivity, scientism, rationality
Tests use written forms of communication
  The written exam is a one-way act of communication
  No consideration of test-taker's views, interpretations
Tests rely on documentation
  Test takers redefined as describable, analyzable objects
  Measurement of overall phenomenon and individual groups
Tests use objective formats
  Objective items (true/false; multiple-choice)
  Calls for 1 correct answer, one truth, one interpretation

Source: Shohamy, 2001, pp. 20-24.

This chapter provides a foundation for fair assessment of ELLs. First we explore basic principles of assessment and then look critically at current testing requirements for ELLs. Finally, we explore alternative and authentic forms of assessment and highlight the need for multiple measures of ELLs' learning.

Assessment Basics

Testing refers to the administration of tests. A test is a single instrument designed to systematically measure a sample of a student's ability at one particular time. A test is like a photographic snapshot. A snapshot can be misleading because it cannot show what took place before or after it was taken, or the context in which it was taken. Likewise, a single test does not provide any information about what a student could do before or after taking the test. Nor can a test possibly measure everything a student knows and can do. A test can collect only a "sample" of a student's knowledge and ability. In mathematics, for example, only a small sample of problems can be tested, and from this sample we attempt to make an inference about the test taker's math ability in general.

Assessment is much broader than testing. Margo Gottlieb (2006) defines assessment as "the systematic, iterative process of planning, collecting, analyzing, reporting, and using student data from a variety of sources over time" (p. 185). Assessment involves (1) planning what, how, and when to assess, with attention to the varied purposes of different constituents or stakeholders, including students, parents, teachers, and local and state administrators; (2) collecting appropriate data from a variety of sources, ranging from informal observations of students in class to more formal scoring of student performances on specific tasks using common rubrics; (3) analyzing student performance data, looking for strengths to build on, and identifying areas in need of further instruction and assistance; (4) reporting student strengths and needs in ways that are beneficial to students, teachers, and the school; and (5) using student performance data to identify what a student must learn in order to progress to the next level. Thus, assessment offers a much more comprehensive picture of student growth and achievement than a test allows.

Evaluation refers to the use of the evidence gathered through assessment to make a judgment about the effectiveness of students' learning, of the teacher's teaching, or of the educational programs provided to the students. Final grades are one example of student evaluation. Teachers use assessment data to evaluate the effectiveness of their teaching, and they make adjustments accordingly. Assessment data can help educational leaders evaluate programs to determine whether they are meeting the needs of their students. Assessment data can also help educational leaders identify ways that they can improve their programs and practices to better serve those students.

Summative and Formative Assessments

Summative assessment provides a summary of what a student knows and can do. Summative assessments typically are given at the end of a unit or perhaps at the end of a school year. A formative assessment, in contrast, provides information to a teacher about how a student is doing and what modifications may be needed in instruction. Think of formative assessments as helping inform teachers what to do next instructionally. A formative assessment may be a formal assessment, or it may be as informal as a teacher walking around and observing students' progress while they are working independently at their desks or in small groups. Effective teachers of ELL students make use of both formative and summative assessments.

Norm-Referenced and Criterion-Referenced Tests

A norm-referenced test is used to compare a student's score to those of other students. Test results are usually reported as percentile rankings. These represent how a student's score compares with the scores of students in the test's norming population, that is, a group of students-the "norming group"-who have already taken the same test. For example, a score at the 71st percentile means that the student scored higher than 71% of the students in the norming population. Norm-referenced tests are also used to compare the performances of classrooms, schools, districts, and even states. Commonly used norm-referenced tests include the Stanford Achievement Test, 10th edition (SAT-10), the TerraNova, the Iowa Test of Basic Skills (ITBS), the Metropolitan Achievement Test (MAT), and the Comprehensive Test of Basic Skills (CTBS). IQ tests, such as the Stanford-Binet and the Wechsler Intelligence Scale for Children, and college-entrance tests, such as the SAT, ACT, and the GRE, are also norm-referenced tests.

A criterion-referenced test, in contrast, determines how much a student has learned by tallying how many questions are answered correctly. These tests are designed to determine whether a student meets specific criteria, such as course objectives or state standards. Scores are usually reported as raw scores or converted to percentages. For example, a student who answered 45 out of 50 questions correctly on a math test would receive a score of 90%, indicating the percentage of correct answers. Most of the tests designed by teachers for classroom use are criterion-referenced tests. Many school districts also develop their own criterion-referenced
tests to monitor student and school progress. Creators of criterion-referenced tests establish minimum passing scores (called cut scores).1

1: See www.fairtest.org (home page for the National Center for Fair and Open Testing) for more information on different types of tests, as well as information about uses and abuses of high-stakes testing.

The standards-based tests required by NCLB are criterion-referenced tests. Cut scores are established by states and are often reported at three or four levels. For example, students who do not meet or exceed the passing score may be deemed "below the standard"; those who score at or just above the passing score may be deemed "basic"; and those who score well above the passing score may be deemed "proficient" or "excelling."

Note the tremendous differences between norm-referenced and criterion-referenced tests. Norm-referenced tests do not actually tell us how well students did in answering the questions or mastering the content; they tell only how a student performed in comparison with other students. Theoretically, if all students answered every question correctly, all students would be at the 50th percentile, that is, they would all be average. Likewise, if all students got every answer wrong, all students would be at the 50th percentile, because zero would be the average score. Of course, these extremes do not occur because norm-referenced tests are designed to sort and rank students along a bell-curve distribution, and test developers tinker with their tests until they achieve this distribution in the norming population. Thus, tests are also designed so that it is highly unlikely that all students at a given grade level will be able to answer all the questions correctly.
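The difference between the two kinds of score reports can be made concrete with a short Python sketch. This is purely illustrative and not from the original chapter: the norming-group scores and the student's raw score below are invented, and the 45-out-of-50 figure simply echoes the example above.

    # Hypothetical illustration: the same raw score reported two different ways.
    norming_scores = [12, 18, 22, 25, 27, 30, 31, 33, 35, 38,
                      40, 41, 43, 45, 47, 48, 49, 50, 50, 50]
    raw_score = 45          # items this student answered correctly
    total_items = 50        # items on the test

    # Norm-referenced report: percentile rank relative to the norming group.
    below = sum(1 for s in norming_scores if s < raw_score)
    percentile_rank = 100 * below / len(norming_scores)

    # Criterion-referenced report: percentage of items answered correctly.
    percent_correct = 100 * raw_score / total_items

    print(f"Percentile rank: {percentile_rank:.0f}th percentile")  # depends on who is in the norming group
    print(f"Percent correct: {percent_correct:.0f}%")              # depends only on the student's own answers

The same 45 out of 50 always yields 90% on a criterion-referenced report, but its percentile rank rises or falls as the norming group changes, which is precisely the contrast the chapter draws between the two kinds of tests.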
Both norm-referenced and criterion-referenced tests are problematic for ELLs. With norm-referenced tests, ELLs are unlikely ever to be deemed "above average," because ELLs typically are underrepresented, if included at all, in the norming population and the purpose of these tests is to rank students along a bell curve. Criterion-referenced tests may be considered fairer for ELLs, because their scores on these tests supposedly reflect their knowledge of the content, rather than how they compare with their native English-speaking peers. However, both norm-referenced and criterion-referenced tests in English are problematic for ELLs because they lack proficiency in the language of the test. Also, both types of test are subject to test score inflation, particularly when they are part of high-stakes accountability systems and teachers feel pressured to teach to them. The pressure on schools to have high average passing rates to meet the criteria of state and federal accountability programs may lead to the temptation to use questionable strategies that minimize or eliminate the impact of ELL student test scores. Thus, many ELLs are made invisible in the very system that claims to be making them visible and addressing their needs. (Norm-referenced vs. Criterion-referenced Tests, document)

Efforts to minimize the impact of ELL test scores on district and school averages could also be seen as acknowledgment that there are unresolved issues related to the reliability and validity of test scores for ELL students, and thus it is unfair to schools to include ELL scores for accountability purposes. Let's now look at these issues of reliability and validity.

The Lake Wobegon Effect

Fans of Garrison Keillor's National Public Radio program A Prairie Home Companion are familiar with his stories about his fictional hometown, Lake Wobegon, where "all the women are strong, all the men are good looking, and all the children are above average." The humor of this famous line, of course, is that it is mathematically impossible for all children to be above average.

Despite this impossibility, the results of norm-referenced test scores have revealed a national pattern that puts most children at a rating of above average (Sacks, 1999). Some refer to this phenomenon as the "Lake Wobegon Effect." How is this possible? Recall that these tests are first normed on a sample of students to establish what the "average" score is-the point at which half the students score above and half the students score below. The tests are then administered to students throughout the country, and typically several years go by before the test is revised and new norms are established.

In the meantime teachers become familiar with what is on the tests. With the pressure to raise test scores, they teach to the tests. The result is test score inflation: test scores go up "without an underlying gain in real achievement" (p. 136). The public thinks the students are getting smarter because of the rise in test scores, while in fact students are learning less because their instruction has been narrowed to only what is on the test.

Reliability

The concept of reliability recognizes that no test (or testing situation) is free from error, and that these errors result in misrepresenting students' knowledge and ability. Reliability, according to W. James Popham (2004), "refers to the consistency with which a test measures whatever it's measuring," or "the extent to which students' scores on tests are free from errors of measurement" (pp. 29-30). Think of reliability as consistency, which can be illustrated by the following three types of reliability evidence:

1. Internal reliability. How consistently do the test items function?
You have likely encountered poorly written or simply bad questions on a test, questions that even the most prepared students may not be able to answer correctly. These items lack consistency with other test items. If there are many poorly written or unfair questions, the test will lack internal reliability. It will not be able to properly distinguish between higher-level and lower-level students, or between students who are well prepared and those who are not.

2. Stability. How consistent are the results of a test across different testing occasions?
Theoretically, if a teacher gives a test on Monday and gives the same test on Tuesday and does not provide any additional teaching on the topic between the two tests, the results should be the same. However, many factors can affect test results, as any teacher can attest. Did the students eat breakfast? Is a student feeling ill or concerned about a problem at home? Did the students see something on television covering the same topics on the test after school on Monday? These and other factors can introduce errors of measurement.

3. Alternative forms. How consistent are two or more different forms of a test?
Teachers, or testing companies, may want more than one form of their test available to prevent cheating, or, if a test is reused in the future, to prevent teachers and students from remembering the questions or the answers. Of course, alternative forms are not identical, and thus the possibility of measurement error is introduced.

In instances of more holistic assessment, such as scoring a student essay, reliability refers to the consistency of those who are doing the scoring. If one teacher grades all the essays, how consistent is she in how she grades the papers at the top of the pile compared with how she grades those near the bottom? Or if two or more
people are grading the essays, how consistent are they in applying the same scoring criteria? If they read the same essay, would they give it the same score? This issue is referred to as inter-rater reliability. It is nearly impossible to have 100% inter-rater reliability, because people are different and the grading of essays is highly subjective.

Procedures for determining the reliability of testing instruments are often based on mathematical formulas, resulting in numbers proclaiming just how reliable an instrument is internally, or how consistent it is at different times or in different forms. For large-scale assessments, such as state tests, a standard error of measurement (SEM) should be reported. The SEM indicates a range of trustworthiness of an individual student's score. For example, if a test has an SEM of 3, and the student scores 49, then the student's actual score is 49 +/- 3, that is, somewhere between 46 and 52.

The implications of the SEM are enormous. Consider, for example, if the cut score for passing a test is 50 but an ELL student scored 49, he would be deemed as failing the test. But if the SEM for the test is 3, his score may have actually been 50, 51, or 52, which are passing scores.2 Imagine if the student were held back a grade or denied a high school diploma because of his "failing" score. It is possible the student was unfairly penalized, with major social consequences, all because of errors inherent in the test, not in the student's actual performance.

2: Likewise, a student who passes with a score of 50 may actually have obtained a failing score of 47 to 49.

Also problematic related to the SEM is the tendency for schools to focus all of their instructional attention on students who "just barely" missed passing the test by a few points. Linda McNeil (2000) observes that in many schools these students are referred to as the "bubble kids." Again, the SEM reveals that if these students are given the exact same test again, they may pass it without any instructional interventions, simply because their "failing" score was due to measurement error. Thus, these schools may be wasting tremendous amounts of instructional time that could be put to much better use, such as providing ESL instruction rather than test preparation. (Bubble Kids, document)
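The arithmetic behind the SEM discussion above can be sketched in a few lines of Python. This is an illustrative sketch only, using the hypothetical numbers from the passage (an observed score of 49, an SEM of 3, and a cut score of 50); it is not an actual scoring procedure from any testing program.

    # Hypothetical illustration of a standard error of measurement (SEM) band.
    observed_score = 49   # the score the ELL student actually received
    sem = 3               # standard error of measurement reported for the test
    cut_score = 50        # minimum passing score

    low, high = observed_score - sem, observed_score + sem
    print(f"Score band: {low} to {high}")   # prints: Score band: 46 to 52

    # The observed score falls below the cut score, but part of the band does not,
    # so the "failing" result may simply reflect measurement error.
    if observed_score < cut_score <= high:
        print("Observed score fails, yet passing scores lie within the error band.")

The same logic runs the other way as well: an observed passing score of 50 could sit on top of a true score as low as 47.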
Validity

Validity refers to the accuracy with which a test or assessment measures what it purports to measure. Although people talk about a test being valid, tests themselves do not possess (or lack) validity. Popham (2004) explains that it is the validity of a score-based inference that is at issue. Thomas Haladyna (2002) explains further that validity refers to how test scores are interpreted and used. So, for example, if a teacher wants to measure students' English spelling ability, scores from a subtraction test would not allow for any valid inferences about their spelling ability.

Essential to understanding validity is the notion of assessment domains-the broad range of concepts, knowledge, and skills that a teacher wants to assess in a given subject or area. For example, consider the assessment domain of whole number 2-digit multiplication. To make valid score-based inferences, the test obviously must ask students to solve 2-digit multiplication problems. Could the teacher give a test with every possible 2-digit multiplication problem? Technically it's possible, but a test with thousands of problems would be a bit too much. A more reasonable test would include just a sample of 2-digit multiplication problems. Students' performance on this sample would allow the teacher to make an inference about their ability to solve this type of problem, including similar problems not on the test.

Samples, however, must be carefully selected to ensure that they are representative of the entire domain. If, for example, the multiplication test had just three problems, or if it consisted solely of numbers multiplied by 10, or did not include problems that require regrouping, the teacher could not make valid score-based inferences about the students' ability to multiply 2-digit numbers. Determining an appropriate number of test items and identifying items that are truly representative of the entire domain is a serious challenge in developing tests from which valid score-based inferences may be drawn.

Unlike reliability, validity cannot be established through mathematical formulas. Rather, just as a lawyer gathers evidence to make a legal case, a variety of evidence must be gathered to make the case for validity. That evidence can be one of the following three types, as described by Popham (2004, p. 55):

1. Content-related evidence refers to "the extent to which an assessment procedure adequately represents the content of the assessment domain being sampled." The earlier examples of the subtraction test used to measure spelling and the multiplication test that includes only numbers multiplied by 10 are examples of tests that lack content validity.

2. Criterion-related evidence refers to "the degree to which performance on an assessment procedure accurately predicts a student's performance on an external criterion." Language proficiency tests are supposed to predict how well a student would perform in a mainstream classroom. Criterion-related evidence for these tests should be gathered with data showing that students who score low on the test struggle with the language demands of a mainstream classroom, while those who score high on the test are better able to cope. However, such data are rarely, if ever, collected.

3. Construct-related evidence refers to "the extent to which empirical evidence confirms that an inferred construct exists and that a given assessment procedure is measuring the inferred construct accurately." This is similar to content validity, but constructs are much more abstract, broad, and complex.

Let's consider reading comprehension as an example of an inferred construct. First, how do we define reading comprehension and how do we measure it? Would ability to follow directions on a medicine bottle be sufficient? Does a student lack reading comprehension skills if he is unable to explain the multiple layers of meaning of the John Keats poem "Ode on a Grecian Urn"? Since there are multiple skills involved in reading comprehension that cannot possibly all be measured, which skills are selected? Most reading comprehension tests involve reading passages that are followed by questions, but how are the reading passages selected? How representative are the selected passages of the entire domain of possible texts students may come in contact with? How representative are the types of questions (e.g., factual recall, main idea, supporting details, inferences, author's purpose) that are selected? Reading comprehension is a complex construct, and measuring it is not easy.

Haladyna (2002) describes tests as flawed measuring instruments and points out that test scores can be affected by factors that are unrelated to what the test is supposed to measure. These factors are sources of what testing experts call "construct-irrelevant variance." Haladyna offers a more practical term: test score pollution. Any factor that interferes with an accurate interpretation can be viewed as polluting the valid interpretation of the test score. Test score pollution that results in threats to validity includes "mismatched tests, cheating on the test, inappropriate test preparation, inappropriate or unfair test administration, and failing to consider the context for a student's or a group of students' performance" (p. 57). Haladyna notes that ELLs' limited English proficiency is a major source of "test-score pollution" that can seriously affect a student's score.

Consider the example of a math test in English that is full of word problems. An ELL student may have been a top math student in her home country but would have
great difficulty passing this test because she has trouble reading or understanding the questions in English. The math test becomes instead a test of English reading comprehension-not the construct the math test was designed to measure. Hence, lack of English proficiency is a source of construct irrelevance, the test score is polluted, and we cannot make any valid interpretations about the student's math ability. Haladyna outlines several threats to the valid interpretation of ELL test scores (see Box 5.2).

Box 5.2

1. If the test is timed, the student may not have enough time to finish the test, because the thought process for translation requires extra time.
2. The English reading comprehension level of the test may be too difficult, so the test becomes a measure of English reading comprehension instead of a measure of achievement.
3. The education program of a transplanted student is not likely to be very strong, resulting in lower achievement and lower test scores.
4. The teaching quality may be inferior, especially if a bilingual teacher is needed [but not available].
5. If teaching is done in English and the student's English language proficiency is not good, not only is the child failing to learn English but he or she also is not keeping up in other areas of study.
6. If these students attend schools in school districts with limited services (e.g., poverty-stricken areas), they are less likely to receive the special attention that more affluent school districts provide. Underfunded schools tend to have underfunded programs for the neediest students.
7. If these students live in social or cultural isolation, they are likely to underachieve.

Source: Haladyna, 2002, pp. 186-187.

Bias

Bias in testing and assessment refers to the qualities of an assessment instrument that penalize a group of students because of their gender, religion, ethnicity, socioeconomic status, or any other characteristic that defines them as a group. The issue of bias is really an issue of validity, because bias is one of the threats to valid test score use and interpretation; or as Haladyna would say, bias is a source of test-score pollution.

Concerned educators have long pointed out examples of questions on standardized tests that focus heavily on sports and thus may be biased against girls and in favor of boys or that use names common only among White Americans. Bias is harder to find these days because test developers have become more sensitive and many now strive to eliminate potential bias in their testing instruments. Most companies have developed bias elimination handbooks and guidelines for their internal use that go beyond the superficial sprinkling of ethnic names throughout test questions. Despite these efforts, problems still occur and educators continue to find examples of subtle forms of bias in test questions, for example, items that require familiarity with the cultural norms and practices of White middle-class Americans.

Reducing bias is more complex than using diverse names and avoiding certain topics. The biggest source of bias for ELLs is the language of the test. Simply put, a test written in English is biased in favor of students who are native or fluent speakers of English and biased against those who are not.

In an effort to understand the extent to which linguistic bias may be present on a state high-stakes test, my colleague Xiaoshi Li and I analyzed the 5th grade math Texas Assessment of Knowledge and Skills (TAKS) and compared it with math worksheets that had been completed over the school year by newcomer ELL students from Cambodia (Wright & Li, 2006, 2008). The students, Nitha and Bora, spoke no English when they arrived 2 months into the school year, yet they were expected to pass the math TAKS test-in English-less than 6 months later. Both students were very bright and made great progress in learning English, math, and other subjects during the school year. They had done well in math in Cambodia, but the academic standards in Texas were much higher than those of the Cambodian education system, where impoverished schools do not have enough textbooks and most teachers never attended or graduated from high school. Their school in Texas worked hard to provide as much math instruction as possible in an effort to help them catch up to their grade-level peers. With assistance-including assistance in their native language-the girls were able to complete dozens of worksheets as part of their math instruction throughout the year, most of which contained simple word problems written for students below the 5th grade level.

When comparing vocabulary, we found that the number of unique words on both the worksheets and the math TAKS were around 300 words. The math TAKS, however, contained nearly twice as many academic and math-specific vocabulary words as Nitha and Bora had encountered on their math worksheets. Many math-specific academic words, such as digit, diagram, parallel, congruent, rectangular, and transformation, to name just a few, did not appear a single time in their worksheets, yet understanding these key words was necessary to solve the problems on the TAKS. These newcomer ELL students simply did not have the English math-specific vocabulary necessary to read and understand the test questions. We then did a comparison of syntactic complexity and found that the TAKS had a greater variety of syntactic features, and over twice as many of the more complex syntactic features (see Figure 5.1). This means that when the girls attempted to take the TAKS test, they encountered long sentences that were much more complex than anything they had ever encountered during their regular math instruction.

Figure 5.1 Linguistic complexity of students' math worksheets (SWs) compared with that of the 5th grade math Texas Assessment of Knowledge and Skills (TAKS Qs). (Source: Wright & Li, 2008.)
The following concluding sentences from word problems illustrate the difference between the syntactic complexity of the worksheets and that of the TAKS:

Student worksheet: How many does he have left?
TAKS: What information is needed to find the approximate number of sheets of paper Juan had left after the school year was over?

Note how the student worksheet question is a single clause, whereas the TAKS question contains multiple clauses of different types. The TAKS question is much more linguistically complex and would pose a significant challenge to newly arrived ELLs-even for those who are really good at math. For native and fluent speakers of English, however, the linguistic complexity of this sentence would not pose a significant challenge. Thus, this test question exemplifies the type of linguistic bias that is inherent in tests administered to students who are not yet proficient in the language of the test.

A Critical Look at Testing Requirements for ELLs

Equipped with an understanding of assessment basics, we now review specific problems with testing requirements for ELLs. We begin with a look at professional standards for test development and use related to the testing of ELLs. Then we take a critical look at the content and language testing requirements for ELLs as mandated by NCLB. Finally, we explore alternative and more authentic assessments and procedures with the potential to provide more accurate measures of what ELLs know and can do.

Joint Standards

Three major national professional organizations-the American Education Research Association, the American Psychological Association, and the National Council on Measurement in Education-have developed a set of standards for educational and psychological testing. These are commonly referred to as the Joint Standards. The latest version, published in 1999, contains an entire chapter (chapter 9) that outlines 11 standards related to the issue of testing students with limited English proficiency. (Joint Standards, document)

The Joint Standards recognize that lack of English proficiency is a source of construct irrelevance. That is, any test given to ELLs in English becomes, in effect, a language test, with results that may not accurately reflect the students' competencies in the construct intended to be measured. The Joint Standards declare that test norms based on native speakers of English should not be used with students whose first language is not English. The standards acknowledge the problems of translation, noting that we cannot assume that a translation of a test is equivalent in content, difficulty level, reliability, and validity to the original English version. The standards also recognize that simply testing a student in his or her native language alone is problematic because a bilingual student may not test well in either language.

Key to proper test use, the Joint Standards declare, is an understanding of an individual student's bilingualism. Also addressed are issues related to the use of interpreters, modifications of tests (accommodations) for ELLs, the need for validity evidence, and guidance in interpreting test scores. When a test is used to make high-stakes decisions, for example, the standards suggest that the test user review the test itself and consider using additional tests, observational information, or modified forms of the chosen test to ensure that enough adequate information is obtained before making a high-stakes decision. The attention the Joint Standards brings to issues of testing ELLs is greatly needed. Unfortunately, there are no enforcement mechanisms for the Joint Standards. In other words, there is no accountability system in place by which to hold accountable those who develop, use, and interpret tests for accountability purposes!

The political push with federal mandates for high-stakes state tests that include ELLs has revealed many examples of how the Joint Standards are being ignored and violated, with unfair consequences for ELL students and their schools based on test score interpretations of questionable validity.

Content-Area Testing

The main purpose of content-area testing is to determine how well students are performing in a particular subject area. Teachers often create their own content-area tests throughout the year (e.g., spelling tests, math tests) or use the end-of-chapter or unit tests that come with their curricular programs, for both formative and summative assessment purposes. States also create their own large-scale content-area assessments that are aligned with their state content standards and use the standardized assessment data for summative purposes.

The most commonly tested subject areas on large-scale content-area assessments are reading/language arts, math, and science, because NCLB mandates that each state test students in these areas. However, many states also test students in writing, and some also test students in social studies/history and other content areas.

Large-scale standardized content-area tests are used for purposes of accountability for teachers and schools, as well as students. In addition to the federal school accountability system through NCLB, states have their own school accountability programs. State programs usually differ substantially from the federal system, so much so that it is not uncommon for a school to be labeled as "failing" under NCLB, but "excelling" under the state's system. Most states argue that their accountability system is more fair and accurate than NCLB's. To date, however, both state and federal accountability programs rely heavily on the results of high-stakes standardized content-area tests.

Accommodations for ELLs

In partial recognition of the problems involved in testing ELLs, NCLB simply mandates that ELLs be assessed in a "valid and reliable manner" through the use of "reasonable accommodations." These accommodations could include, "to the extent practicable, assessments in the language and form most likely to yield accurate data on what such students know and can do in academic content areas, until such students have achieved English language proficiency."3

3: NCLB §1111(b)(3)(C)(ix)(II).

Providing tests in the ELLs' native language or using other accommodations is not as easy as it might seem. Charlene Rivera and her colleagues conducted extensive reviews of the research on testing accommodations for ELLs and on state-level accommodation policies. They found only 15 relevant studies, most of which were inconclusive and unable to document whether the accommodations actually worked. So far, research suggests that there are only two kinds of accommodation with the potential to support ELLs' access to test content: testing students in their native language (L1) and simplifying the language used in the test (Rivera & Collum, 2006).
L1 tests appear to be the best accommodation. Indeed, if students are tested in the language they know best, they should have a better chance of demonstrating what they know and what they can do. It is not possible, however, to simply translate a test from English to another language. Creating L1 tests that are truly equivalent to the English versions is difficult, expensive, and time consuming. Only a small number of states have L1 versions of their state tests, and nearly all of them are in Spanish. Thus, the vast majority of ELLs, especially those who speak a language other than Spanish, have no access to this accommodation. Even states that have L1 tests may have them for only certain grade levels and typically place restrictions on how many years students can take them.

Furthermore, L1 tests are not beneficial to all ELLs. Many ELLs are illiterate in their native language because they have received literacy instruction only in English. Even if they are literate in the L1, they may not be familiar with L1 academic vocabulary if most of their instruction has been in English. Thus, L1 tests are appropriate only for ELLs who (1) are literate in their L1, and (2) have received L1 content instruction in the subjects being tested (e.g., students enrolled in bilingual education programs or newcomers with schooling from their home country).

The other accommodation that appears to have potential is to simplify the language so that ELLs have a better chance of understanding the questions. This accommodation, referred to as linguistic simplification, may not be appropriate for a reading test, since comprehending grade-level text is the targeted construct. Therefore most efforts are currently focused on math tests. A few states are beginning to experiment with linguistically simplified (or "plain English") tests. But it is difficult to create such tests that still cover the same construct as the original versions. More research is needed to determine whether linguistic simplification is beneficial to ELLs, and whether with its use validity can be maintained.

Given the demands of NCLB, Rivera and her colleagues have called on the federal government to provide funding for research on testing accommodations (Rivera & Collum, 2006). The fact that they have had to do so reveals, however, that NCLB requires something we really do not know how to do yet-test ELLs in a valid and reliable manner through the use of accommodations.

Despite the virtual lack of a research base to guide practice, states are piecing together accommodation policies for districts to follow when testing ELLs. These practices vary widely and most states are using accommodation frameworks that were developed for special education students, many of which are not appropriate for ELLs. More important, they do not address the unique linguistic needs of ELLs.

Rivera and colleagues also call on states to create an ELL-responsive framework to organize and categorize accommodations according to whether they provide direct or indirect linguistic support. Not all accommodations are appropriate for all students. For example, an accommodation allowing students to use a bilingual dictionary is of little benefit to Laotian American students who cannot read Lao. An ELL-responsive accommodation framework would require schools to make individualized accommodation decisions based on each student's specific background and needs. Having such a framework in place, however, would still not resolve the fact that we know little about how to provide accommodations that are actually beneficial to the ELL students while maintaining the validity of the test. There are problems associated with all of the commonly used testing accommodations for ELL students (see Table 5.1).

Flaws in Determining AYP for the LEP Subgroup

Two major flaws in NCLB impact schools with large numbers of ELL students. The first is related to the annual measurable achievement objectives (AMAOs) set by
the states. Recall from Chapter 3 that AMAOs are the percentage of students in each subgroup (including the LEP subgroup) that must pass the state test each school year in order for the school to be deemed as making adequate yearly progress (AYP) toward the goal in 2014 when NCLB mandates that all students (100%) will pass the test. Schools with large numbers of ELL students will likely be well below the AMAOs for any given school year and thus have farther to go to catch up than other schools. For example, among 3rd graders tested for reading in Arizona in 2006, at least 53.3% (or slightly over half) of the students in each subgroup were required to pass the reading test for the school to be deemed as making AYP, according to the state's AMAO requirements (see Figure 5.2). A school in Arizona where only 12% of 3rd grade ELL students had passed the reading test the year before would have to make much more improvement than would a school where 50% had passed the test the year before.

Figure 5.2 Arizona's annual measurable achievement objectives for grade 3 reading. (Source: Arizona Department of Education, 2004, p. 7.)

Many states (including Arizona) set their AMAOs in tiers so that they initially stayed constant for 3 years before moving to the next, intermediate goal, giving schools extra time to get more and more ELLs (and other low-scoring students) to pass the test before the AMAO increased. This strategy, however, does not account for the fact that newcomer ELLs can arrive at any time in any grade level. No matter how much a school improves its instruction for students in the prior grades, newly arrived ELLs never received that instruction. Under this system, schools are punished for the performance of students they may have just begun teaching.

To illustrate this problem, let's say in 2012 a school has 30 ELLs in 3rd grade. According to Figure 5.2, the expectation for that year is that over 80% of a school's 3rd grade ELLs will pass Arizona's reading test in English. To meet the 2012 AMAO, at least 24 of the school's 30 ELL students would have to pass the reading test. Even if all 30 had been in the school since kindergarten, this goal would be challenging enough, because most would not yet have attained the level of proficiency in English required to read and fully comprehend the test. But what if 10 of those 30 students had arrived in the United States only the year before? How reasonable would it be then to expect at least 24 of the students to pass the test? The following year, let's say there are 35 ELLs in 3rd grade, only half of whom have been in the United States since kindergarten. Now the expectation is that 95% of these
students-all but 1-will pass the 3rd grade reading test in order to meet the AMAO. You can see that as schools receive new ELL students, and as the AMAOs increase, it becomes increasingly impossible to meet the expectation.
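The arithmetic of this illustration can be laid out in a short Python sketch. The cohort sizes and AMAO percentages below are the chapter's hypothetical Arizona figures, not actual data, and the rounding rule (a whole number of students must meet the percentage target) is an assumption made for illustration.

    # Hypothetical illustration using the chapter's Arizona example (Figure 5.2):
    # each year's AMAO (as a percent) and the size of a school's 3rd grade ELL cohort.
    cohorts = [
        (2012, 30, 80),   # 30 ELLs; AMAO of roughly 80%
        (2013, 35, 95),   # 35 ELLs; AMAO of roughly 95%
    ]

    for year, num_ells, amao_pct in cohorts:
        # Smallest whole number of students that satisfies the percentage (ceiling division).
        must_pass = (num_ells * amao_pct + 99) // 100
        print(f"{year}: at least {must_pass} of {num_ells} ELLs must pass the reading test")

The sketch prints 24 of 30 for the first year and 34 of 35 (all but 1) for the second, and because newcomers can join the cohort at any time, the required number of passers keeps rising even though some of those students may have been in U.S. schools for only a year.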
The second flaw is more problematic and could eventually cause every school with ELL students to be deemed as failing and subject to state takeover. Title I treats the LEP subgroup as it does all other subgroups. But unlike the membership of other subgroups (e.g., ethnicity), that of the LEP subgroup does not stay the same. Few students who enter school LEP are still LEP by the end of 12th grade; they eventually learn English and are no longer classified as LEP. Thus, each year, the top students in the LEP subgroup are redesignated as fluent English proficient (FEP). They are no longer members of the subgroup and are replaced by newly arrived ELLs with the lowest levels of English proficiency. In other words, the ELLs with the highest levels of English proficiency, those most likely to pass the test, are replaced by those with the least amount of English proficiency, those least likely to pass the test (see Figure 5.3). This movement into and out of the group makes it difficult and ultimately impossible for the LEP subgroup to meet the AMAOs set by the state. As a result, the LEP subgroup will continually appear as though no progress is being made. The LEP subgroup will fail to make AYP, and thus the entire school will be labeled as failing and ultimately subject to state or private takeover.

Figure 5.3 NCLB's expectations for all students in the LEP subgroup to eventually pass state's tests will be impossible to meet due to the instability of the subgroup due to movement into and out of the group. (Figure labels: Newly arrived LEP; LEP subgroup; Redesignated FEP.)

In partial recognition of this flaw, the U.S. Department of Education (2004) announced two changes to the way the LEP subgroup is treated for AYP purposes. First, LEP students could be excluded from the reading test in English for the first year and instead take the English language proficiency test, though results would not be included in school AYP calculations. Second, scores of redesignated LEP students would be counted in the LEP subgroup for 2 years after their redesignation. Newly arrived ELL students, however, were still expected to take-and pass-the state's math test, and their scores did count in the school's AYP calculations. Two years later, in 2006, however, the Department of Education (2006a) made a slight change to this requirement as well. ELLs who had been in the country for less than 12 months would still be required to take the state math test, but their scores could be excluded from the school's AYP calculations.

Unfortunately, these changes are insufficient and, at best, delay the problem rather than solve it. What is the sense in requiring newly arrived ELLs to take the state's math test when their scores will just be excluded? It is not reasonable to expect newly arrived ELL students to learn enough English in 1 school year (around 180 school days) to pass the same reading and math tests as their English-fluent peers. While the allowance for counting ELLs for 2 years after redesignation may help a little, it does not solve the basic problem of the highest students' being replaced by the lowest students in terms of English language proficiency. Thus, this is mainly a cosmetic fix and is insufficient to ensure that the LEP subgroup will be able to make AYP.

By NCLB's own definition, an LEP student's difficulties with the English language "may be sufficient to deny the individual the ability to meet the State's proficient level of achievement on State assessments."4 Nonetheless, the law mandates that they do just that. To identify a group of students who, by definition, cannot meet the standards, treat that group as static, and then require that 100% of students in that group attain proficiency in those standards is not reasonable.

4: NCLB, Title IX, Part A, §901(25)(D)(i).

Language Proficiency Testing

As noted in Chapter 3, when parents enroll their children in school, a home language survey is given to determine whether a language other than English is used in the home, and if so, the extent to which this language is used with and by the children. Different states use different home language surveys, but the purpose is to identify students with a primary home language other than English (PHLOTE). Each PHLOTE student is then flagged for language proficiency testing to determine whether he or she is an ELL, and if so, at what level of English language proficiency. Many school districts also assess PHLOTE students' proficiency in the standard variety of their native language.5 This practice sometimes results in falsely labeling some young Latino students as both non-English and non-Spanish speaking (see Box 5.3).

5: See MacSwan, Rolstad, & Glass, 2002, for a critique of this practice.

Once students have been identified and placed in appropriate ELL programs (if available), their language proficiency should be assessed regularly to determine their progress in learning the language. Ideally, students should receive formative assessments that provide information that can drive effective ESL instruction. Language-testing experts call the influence language tests and assessments have on teaching the "washback effect." Washback can be positive, meaning it provides useful feedback and drives beneficial instruction. Or washback can be negative, perhaps better referred to as "backwash," such as in the cases where high-stakes tests force instructional attention to just a narrow set of discrete skills.

At the end of each school year, a summative language proficiency test or assessment is administered to determine the students' progress in learning English over the school year. The results are used to determine or help inform decisions about when each student has reached a high enough level of proficiency to be redesignated as FEP and thus no longer in need of specialized ELL programs and services. Results of English language proficiency assessments must be reported to the state and the federal government for accountability purposes. Thus, language tests serve several purposes: identification, program placement, monitoring of progress, reporting for accountability purposes, redesignation, and program evaluation.

Complexity of the Language Proficiency Construct

Mandates for language proficiency testing are pushing practices that are on shaky research ground. As noted in Chapter 2, language proficiency is an extremely complex construct for which there is no simple measure. How one uses language varies according to time, place, interlocutors, setting, purpose, topic, and so on, and thus demonstrating one's competence in a language is very different from demonstrating competence in other kinds of knowledge (Valdés & Figueroa, 1994, p. 66). No test can ever fully capture a student's true ability in English, or any other language.

The situation is still more complex for ELLs because they are emergent bilinguals with varying degrees of proficiency in their native language or languages and
English. Language tests, however, are based on the norms of monolingual speakers and therefore treat bilingual students as if they were two monolingual speakers of two different languages in one person. But bilingual individuals' brains are not simply divided in half with L1 on one side and L2 on the other. Rather, the two languages interact in ways that linguists and cognitive scientists are just beginning to explore. As Guadalupe Valdés and Richard A. Figueroa (1994) point out:

When a bilingual individual confronts a monolingual test, developed by monolingual individuals, and standardized and normed on a monolingual population, both the test taker and the test are asked to do something they cannot. The bilingual test taker cannot perform like a monolingual. The monolingual test cannot "measure" in the other language. (p. 87)

Box 5.3
Are There Students with No Language?

Non-nons is a term used in some schools to refer to Latino ELL students officially designated as non-English-speaking and non-Spanish-speaking. This designation results from language proficiency tests administered to students in both Spanish and English. Students who perform poorly on both tests are deemed as not having proficiency in any language. Teachers often buy into this construct, complaining about students who "don't know English or Spanish." Linguistically, this is an absurd notion. All normal children (i.e., those without cognitive or speech-related disabilities) naturally acquire and master the language or languages of their speech community, typically by the age of five.

Jeff MacSwan, Kellie Rolstad, and Gene Glass (2002) investigated the non-non issue and determined that the problem is not children with no language but rather invalid language tests. They found that the tests gave heavy weight to literacy skills and standard Spanish. Students who spoke vernacular varieties of Spanish daily at home were deemed as non-Spanish-speaking simply because they never had the opportunity to learn standard Spanish or to read and write in Spanish.

They concluded that language proficiency tests, and the resultant non-non labels, create a false and potentially harmful description of many ELL students. The non-non label draws attention away from the real issues, such as the need to help ELLs further develop their native language at school in order to build a foundation on which to build their new language, and the need to help ELLs learn a standard variety of Spanish without devaluing the variety they speak at home.

Lisa Pray (2003) documented another problem with language proficiency tests-when they are given to "native" speakers of English, the test results classify many as limited English proficient. Thus, on one hand, these tests appear to hold ELLs to a higher standard of proficiency than their native English-speaking peers. On the other hand, some language proficiency tests may falsely indicate that students have attained proficiency in English when in fact there are many gaps in their language skills. The danger here is that students may be exited from ELL programs before they are ready for mainstream instruction. With NCLB's pressure on school districts to demonstrate progress and redesignate ELLs as quickly as possible, some states may be tempted to make less demanding language proficiency tests that push students into mainstream classrooms before they are ready.

In the past, language proficiency tests attempted to measure discrete, isolated skills. Since the late 1960s, the notion of communicative competence developed by Dell Hymes has led to attempts to measure students' abilities to actually use language to accomplish authentic and specific communicative tasks. But as Bernard Spolsky argues, language testing does not assess authentic language behavior, and there is no such thing as a test of authentic or natural language use (Cohen, 1994). Thus, all language tests are indirect measures of the complex construct of language proficiency, and results of these tests can provide only weak inferences about students' language abilities.

Problems with NCLB Requirements for Language Proficiency

The mandates and expectations of Title III of NCLB that hold schools accountable for ensuring that ELLs make progress in learning English and ultimately attain proficiency in the language create many problems for schools. The first is that there has to be some reasonable limit on the percentage of ELLs expected to be redesignated as FEP each year. States may consider the length of time ELLs have been in school, but the law is unclear about how this information can be used in determining AYP. Also, although few would disagree with the expectation that students make progress every year toward the goal of proficiency, there is a disconnect between current research, which suggests that it takes between 4 and 7 years to obtain fluency in a second language (Crawford & Krashen, 2007), and the structure of state assessments, which typically define only 4 or 5 levels of English proficiency. Thus, if a state has only 4 levels of proficiency but it takes a student 6 years to become fluent and the student is expected to move up one proficiency level each year, he or she will run out of levels. In California, for example, state reports have documented that ELL students move quickly from one level to the next at the lower levels but get stuck in the higher levels for several years (Hill, 2006). The biggest problem with Title III's AYP expectations, however, is that making AYP under Title III requires that ELLs make AYP under Title I, which calls for their passing state content-area tests. As discussed earlier, it is highly unlikely that most ELLs will do well on a content-area test written in English, because they have not yet attained full proficiency in the language of the test.

In short, as expectations increase each year, it becomes increasingly difficult and ultimately impossible for school districts to make AYP under Title III. This failure to make AYP will have nothing to do with how well students are progressing in learning and attaining English fluency, but rather will be due to illogical reasoning and technical flaws within NCLB. It comes as no surprise that in a 2008 report from the U.S. Department of Education to Congress on Title III, not one single state met all of the grade-level performance targets for AYP.6

6: This biennial report used data for the 2005-2006 school year. Only one state reached all Title I ELL performance targets in math. No state met all performance targets in reading or math.

Another issue related to Title III's requirements is inconsistency in how states assess ELLs' English language proficiency (Abedi, 2004) and inconsistency in how states (and school districts) use the results of language proficiency assessments to make decisions about when to exit students from ELL programs. The inconsistency is due in large part to confusion over the AYP requirements of Title III. Many states have failed to comply with even some of the most basic requirements. Recognizing that the ambiguities and technical flaws in Title III were partially responsible for this failure, and in an effort to provide clarifications and guidance to help states and districts comply with it, in late 2008 the U.S. Department of Education issued 10 "Final Interpretations" of Title III (Spellings, 2008). It remains to be seen whether these clarifications will result in more reasonable expectations for ELLs in learning English, and greater state and district compliance with the federal law.

Consequential Validity

Samuel Messick (1995) has raised the issue that validity should also be concerned with the consequences associated with the interpretation and use of test scores. He argues, "Any negative impact on individuals or groups should not derive from any source of test invalidity such as construct underrepresentation or construct-irrelevant variance. . . . Moreover, low scores should not occur because the measurement contains something irrelevant that interferes with the affected students' demonstration of competence" (p. 7). Messick's concerns are often described as consequential validity. Chief among the many threats to the valid interpretation of test scores for ELL students is that the students' lack of proficiency in the language of the test (English) introduces a major source of construct irrelevance (e.g., a math test becomes an English proficiency test). Yet, despite the validity issues, test scores of ELLs are used to make high-stakes decisions.

Many bad decisions are based on invalid interpretations of test scores, and these decisions can have harmful effects on ELL students, teachers, classrooms, and schools. Research has documented the following:

• Narrowing of the curriculum to only those content areas covered on the test; little instruction in social studies, art, music, physical education, health, or other nontested subjects; the elimination of recess to make more time for instruction and test preparation
• Adoption of curricular programs designed to raise test scores; one-size-fits-all scripted phonics and reading programs; literacy interventions that are inappropriate for ELLs; frequent benchmark testing using tests similar in format and content to the state test that take important time away from instruction; funds used to purchase test-preparation materials rather than appropriate ESL materials
• Elimination of bilingual and heritage language programs so that the language of instruction is exclusively in the language of the test (English); replacement of much-needed ESL instruction with test preparation
• Reduction in the use of sheltered instruction strategies that make complex content-area instruction comprehensible to ELLs; reduction in the amount of time to engage students in conversations and other interactive activities that promote English language development; restrictions placed on certain activities or strategies such as drawing, use of video, or field trips that benefit ELLs' language and academic development but are deemed as wasting instructional time
• Purposeful underidentification of ELLs so the school can avoid being held accountable for an LEP subgroup; encouraging secondary ELL students to drop out; using high school exit exams that often deny ELLs a high school diploma upon completion of school and thus restrict their future opportunities
• Creation of a stressful school environment focused on testing, leading to test anxiety for students; elimination of recess; less fun; high turnover of teachers and administrators, with ELLs taught by the least experienced teachers
• Poor performance of ELLs on high-stakes tests, leading to lowered self-confidence; inappropriate placement or retention of students; frustrated and demoralized teachers who become reluctant to work with ELLs; parents who complain about the presence of ELLs at school⁷

7: For research documenting these and other harmful effects of bad decisions based on the invalid interpretations of ELL test scores, see Amrein & Berliner, 2002a, 2002b; Haney, 2000, 2002; Kohn, 2000; Kozol, 2005; McNeil, 2000; Nichols & Berliner, 2007; Sacks, 1999; M. L. Smith, 2004; Smith, Heinecke, & Noble, 1999; Valenzuela, 2004; Wright, 2002, 2003b, 2004b, 2005a, 2006a, 2007.

Consequential validity dictates that these harmful effects must be taken into consideration and changes made. Although to date these harmful effects have been discounted or ignored by proponents of NCLB and its mandates for high-stakes testing, mounting concerns over the harmful effects of high-stakes testing and NCLB have led a large number of professional organizations to issue statements or position papers related to these issues. The National Education Association (2008) issued a joint statement on NCLB with 144 national organizations representing a broad range of educational, ethnic, religious, and professional groups. TESOL, a signatory on the joint statement, has issued several of its own position papers and written testimonies to the federal government on different aspects of NCLB, several of which discuss the impact of NCLB on ELL students and call for changes to be made to the law. (NEA's Position on NCLB, related links)

Arne Duncan, appointed by President Barack Obama as the U.S. Secretary of Education in January 2009, should understand these problems. He was the superintendent of Chicago Public Schools from 2001 to 2008, and many of the schools in his district, and his district as a whole, were designated as failing to make AYP over the last 4 years of his administration. Contributing to this failing status were the low test scores of LEP students. At the time of this writing, Duncan has made a few public comments about the need to change the mandates of NCLB with regard to LEP students, but to date the Obama administration has only outlined a few generic proposals to change the law. The administration has, however, set the stage to make the situation worse by insisting that states receiving stimulus funding and states contending for Race to the Top grants tie student test scores (including ELL test scores) to teacher evaluations. Without changes to NCLB, teachers will be held accountable for ELL test scores of questionable validity. Opposition to this new requirement is growing, and it remains to be seen whether the administration will listen and provide the change educators have been hoping for.

Alternative Authentic Assessments for ELLs

Because of the difficulties involved in assessing the learning and progress of ELL students using standardized tests, particularly the high-stakes tests mandated by NCLB, effective teachers of ELL students use alternative authentic assessments throughout the school year. These assessments are alternative because they are unlike the standardized tests and traditional pencil-and-paper classroom-based tests created by teachers or pulled from textbooks. And they are authentic because they more closely match instructional practices in the classroom and they reflect the knowledge and skills ELLs need outside of the classroom.

According to Socorro G. Herrera, Kevin G. Murry, and Robin M. Cabral (2007), authentic assessments:

• Are generally developed directly from classroom instruction, group work, and related classroom activities and provide an alternative to traditional assessments
• Can be considered valid and reliable in that they genuinely and consistently assess a student's classroom performance
• Facilitate the student's participation in the evaluation process
• Include measurements and evaluations relevant to both teacher and student
• Emphasize real-world problems, tasks, or applications that are relevant to the student and his or her community (p. 23)

The authors also note that authentic assessments "identify and build on student strengths such as language, prior experiences, interest, and funds of knowledge to facilitate learning" (p. 24).

Some common alternative authentic assessments include observations, performance assessments, self-assessments and peer assessments, and portfolio assessment. We now look briefly at each of these.

Observations

The easiest way to assess ELL students is simply to watch them. Want to know how they are progressing in listening and speaking English? Talk with them. Observe their interactions in the classroom during group and pair work, during informal transition periods in class, and out on the playground or in the hallway. Want to know how well they are reading? Listen to them read. Want to know how well they are writing? Watch them while they are writing and read what they write. Want to know how well they understood the math lesson? Observe them completing independent math work at their desk or in cooperative groups. Informal observations can be facilitated through the use of anecdotal records. This means jotting down quick notes on things you observe the student doing. I used to carry around a clipboard on which I had an index card for each student. The index cards were taped onto the clipboard in a flip-file fashion to make it easy to quickly find and flip to a student's card (large sticky notes work just as well). I would make notes about the students' linguistic and academic performances during lessons as I observed them working collaboratively and independently, including the date of the observation. Once a student's card was full, I would review it and then drop it in his or her portfolio. The clipboard also enabled me to quickly figure out which students I had been ignoring and needed to spend more time observing. The anecdotal records helped me focus on the students and their progress and alerted me to students in need of extra assistance. I also shared the records with parents during parent conferences.

Observations can be conducted more systematically with a checklist. Teachers can create their own checklists or use those provided in a given curriculum program. A checklist to track students' progress during silent reading time, for example, might include some of the following items:

o Read at least 90% of the time
o Selected books appropriate to reading level
o Held the book properly
o Tracked text with finger while reading
o Read aloud
  o with great difficulty
  o with moderate difficulty
  o with good fluency

Checklists should be reviewed by the teacher, placed in the students' portfolio, and shared with parents and students themselves. When students know the teacher is watching (and, more important, what the teacher is watching for), they are more likely to live up to the teacher's high expectations.

Performance Assessments

A performance assessment is also a form of observation but more structured. Students are asked to perform a specific task or create a specific product, and the teacher evaluates the process, the final product, or both, according to a preestablished set of criteria. For example, suppose a teacher wanted to observe students' ability to retell a story they had read or that had been read to them. A checklist for this task could look something like this:

o Identified the main characters
o Identified the setting
o Told what happened at the beginning of the story
o Told what happened in the middle of the story
o Told what happened at the end of the story
o Identified the problem
o Identified the solution

Such a checklist could be customized for a specific book, listing characters and key plot sequences.

Most performance and other forms of authentic assessments are evaluated according to a rubric, which is a more sophisticated form of observation checklist. Rubrics contain several categories, with the performance for each category scored on a scale (typically between 1 and 4); the higher the number, the better the student did in meeting the teacher's expectation for the given category. Figure 5.4 is an example of a rubric a teacher could use for a performance assessment of students given the task to create a bar graph showing their favorite (or what they consider the least disgusting) cafeteria food.

Rubrics are an effective tool for assessing performance because they allow the teacher to establish up-front the expectations for a particular assignment or activity. Unlike a single score or grade, the rubric allows students to see precisely where they met the expectations and where more work was needed. The most effective use of rubrics entails going over the rubric with students before they start working on the task. If students are to meet high expectations, they need to know what those expectations are. Some teachers involve students in the creation of the rubrics.

Self-Assessments and Peer Assessments

Authentic assessments can be made even more effective if students are involved in assessing their own work and performance and that of their classmates. Self-assessment and peer assessment can be facilitated through the use of a checklist or a rubric. A student might use the following checklist to score a report about a field trip:

Field Trip Report Checklist

Content:
o My report has a title.
o I wrote at least 3 paragraphs.
o My report tells where we went.
o My report tells when we went.
o My report tells what we did.
o My report tells what I liked and didn't like.
o My report tells what I learned.

Format and Mechanics:
o I started each sentence with a capital letter.
o I ended each sentence with a ".", "?" or "!"
o I indented each paragraph.
o I proofread my story and corrected spelling and grammar errors.

A similar checklist could be used for students to assess a report written by another student.

Rubrics can be used in the same manner for self- or peer assessments. For younger students, however, rubrics may need to be simpler than those used by the teacher. For example, Figure 5.5 is a rubric that could be used by kindergarten or 1st

grade students to self-assess their work at learning centers. Self-assessments can also take the form of prompts students complete to reflect on their learning and work. These reflections can be done through discussions, teacher-student conferences, or daily or weekly reflection logs, with prompts such as these:

• Today I learned ...
• I am going to use what I learned to ...
• Today I did a good job in ...
• Today I had trouble with ...
• Tomorrow I am going to try harder to ...
• I need more help with ...
• I want to learn more about ...

Reflective self-assessments can be very effective in making students aware of what they have learned and how they are progressing and can prepare them for further learning that can advance them to the next level.

Peer assessment can be very effective, but teachers should keep in mind that students need to be well trained in how to assess their peers. Proper training in peer assessment can help students do higher quality work themselves because they know exactly what the evaluation criteria are. Training can also help avoid conflicts over fairness and whether one student is evaluating another's work too harshly or incorrectly. Care also must be taken in determining which students assess which students. A newly arrived ELL with little to no English proficiency, for example, should not be expected to assess the writing of a more advanced ELL student. Peer assessment, however, does not replace teacher assessment. The purpose of peer assessment is for students to internalize the high expectations and provide assistance to each other in meeting these expectations. It is not a strategy to save the teacher time by having students grade each other's work. In fact, this practice is forbidden in many schools because student privacy laws forbid teachers from making a student's grades known to other students in the class. But even in schools where such rules are strictly enforced, as long as the peer assessments do not result in final scores or grades that get entered into a teacher's grade book, the types described here should be allowed because of the valuable feedback peers can provide to each other.

Portfolio Assessment

Portfolio assessment is one of the most effective ways to measure a student's progress. The teacher collects samples of the student's work throughout the year and organizes it systematically in a portfolio. Much like an artist's portfolio, a student portfolio is a collection of work that demonstrates the student's progress and achievement in meeting grade-level standards. That collection, however, should not be just a hodgepodge resembling the overloaded backpack of a student whose parents never bother to check it and clear it out each day.

Effective student portfolios are well organized and provide evidence of student learning. Most include samples of student work along with some indication of how the student rated himself or herself on the work, and evidence of how it met grade-level curricular standards (Herrera et al., 2007). Some teachers use working portfolios, which hold samples of students' work in progress. Others use showcase portfolios, which hold samples of the students' best work. With showcase portfolios, many teachers prefer to decide what goes in, but others argue that the students should be allowed to pick their best work for inclusion. I prefer a mixed approach, with some pieces chosen by the teacher and others by the student and teacher together.

Figure 5.4 Example of a rubric for performance assessment (Note: Created with RubiStar at rubistar.4teachers.org)

Bar Graph Evaluation    Student Name: ______________

Category: Title
  4: The title is creative and clearly relates to the problem being graphed. It is printed at the top of the graph.
  3: The title clearly relates to the problem being graphed and is printed at the top of the graph.
  2: A title is present at the top of the graph.
  1: A title is not present.

Category: Labeling of X axis
  4: The X axis has a clear, neat label that describes the units used for the independent variable.
  3: The X axis has a clear label that describes the units used for the independent variable.
  2: The X axis has a label.
  1: The X axis is not labeled.

Category: Labeling of Y axis
  4: The Y axis has a clear, neat label that describes the units and the dependent variable.
  3: The Y axis has a clear label that describes the units and the dependent variable.
  2: The Y axis has a label.
  1: The Y axis is not labeled.

Category: Units
  4: All units are described in a key and are appropriately sized for the data set.
  3: Most units are described in a key and are appropriately sized for the data set.
  2: All units are described in a key but are not appropriately sized for the data set.
  1: Units are neither described nor appropriately sized for the data set.

Category: Accuracy of plot
  4: All values are depicted correctly and are easy to see. A ruler was used to make the bars neatly.
  3: All values are depicted correctly and are easy to see.
  2: Some values are depicted incorrectly.
  1: All values are depicted incorrectly.

Category: Neatness and attractiveness
  4: Exceptionally well designed, neat, and attractive. Colors that go well together are used to make the graph more readable.
  3: Neat and relatively attractive.
  2: Lines are neatly drawn but the graph appears quite plain.
  1: Appears messy and "thrown together" in a hurry. Lines are visibly crooked.

Figure 5.5 Example of a self-assessment rubric
Center Work Self-Assessment: * I finished all of my work | I did some of my work | X I did not do my work

A portfolio should be organized by content area (or separate portfolios should be set up for each content area) and should include representative work samples from the beginning, middle, and end of the school year. Students' progress in writing, for example, could be easily observed by comparing writing samples completed by students during the first few weeks of school, those completed just before winter break, and those completed near the end of the school year. If these are accompanied by completed rubrics and student self-assessments, the portfolio will present a clear record of the students' learning and mastery of the writing process and proper use of writing conventions and style.

Teachers will want to ensure that there are samples of different genres of writing in the portfolio. However, to allow for useful comparisons of each student's work throughout the year, there should be at least three samples from the same genre. For example, the writing portfolio could include a first-person narrative written at the beginning of the year, another in the middle of the year, and a third at the end of the year. A math portfolio might include a timed test of addition facts given at the beginning of the year and repeated at the middle and end to document the students' improvement in speed and accuracy.

In many schools, grade-level teams decide what to include in student portfolios to ensure consistency in portfolio assessment for each grade level. In my former school district, we had districtwide standards for math portfolios. Some teachers felt that this rigidity defeated the purpose of a portfolio as an alternative and authentic assessment; others were grateful for the guidance. All agreed, however, that the variety of samples in the portfolio provided a much better indication of their students' progress in learning math than did the state math test given at the end of the school year.

Some schools also use a pivotal portfolio, which follows the student year to year throughout the program. Teams of teachers decide what kinds of summative and formative student data to include in the pivotal portfolio, when to collect it, and what rubrics to use. The use of common assessments ensures that teachers share a language and a practice that is focused on improved student performance. Teachers use this balanced assessment and accountability system to counter the overreliance on high-stakes standardized tests that NCLB has created (Gottlieb & Nguyen, 2006).

The Need for Multiple Measures

Among the many other problems with state high-stakes tests is that they provide just one piece of evidence about what students know and can do. As we have seen, this evidence is relatively weak and of questionable validity, especially for ELL students. The reality is that despite the inadequacies of standardized high-stakes tests, they are not likely to go away any time soon. Therefore, it is imperative that teachers use several alternative and authentic assessments throughout the year. In other words, teachers need to use multiple measures of a student's progress. Multiple measures of ELL students are particularly important because, as the school year progresses, ELLs learn more and more English. At any point, they probably know more English than they did the last time they took a test or completed an assessment. Thus, along with providing strong evidence of a student's knowledge and skills, multiple measures enable teachers to track the student's progress and growth during the school year.

When I was an elementary school teacher, I was required to administer the state's high-stakes test (SAT-9) in English to my 2nd grade students, all of whom were ELLs. Their scores were not very high. At the end of the school year, when I was required to sit down with my principal to go over my students' test scores, I brought along samples from their portfolios. The portfolios provided evidence that when many of my students came to me at the beginning of the school year they were reading at the kindergarten or beginning of 1st grade level but most had made over a year's growth in their reading ability. I showed him writing samples from the beginning of year, which revealed that many students had struggled just to write a full sentence, and then I showed him how these same students by the end of the year were writing full paragraphs and stories. I showed evidence that students who began the year unable to add or subtract were now able to do grade-level math work. In short, the multiple measures of my students gathered in their portfolios throughout the year demonstrated that most of them had made tremendous progress. The state's test made my students, and me as their teacher, look like failures. The authentic assessments, in contrast, provided substantial evidence that my students worked hard and made more progress than most students make in a single school year. They were anything but failures. My principal agreed that these alternative assessments painted a more fair and accurate measure of the progress of our students than did the state test.

One of the major weaknesses of high-stakes tests is that success is defined as students' meeting or exceeding a predetermined bar: the cut score that students must meet or exceed to be deemed as passing. Such tests cannot show students' growth during the school year. Consider Figure 5.6. Student A, a typical native English-speaking student, began 4th grade at about grade level and learned enough during the school year to pass the state test for his grade level. Student B, a very bright ELL student from a poor country who had been in the United States for less than a year before entering the 4th grade, began the school year at just above the kindergarten level. She made over 2 years' worth of growth in the 4th grade. Nonetheless, student B is deemed a "failure" (along with her teacher) because she did not meet or exceed the passing standard for the 4th grade test, whereas her classmate, who had significantly less growth in the same period, is deemed a "success."

Figure 5.6 Illustration of how a state high-stakes test cannot account for students' growth. (The graph plots performance from kindergarten through 5th grade, with a dashed line marking the 4th Grade State Test Passing Standard; Student A's smaller growth ends above the line and is labeled "Success," while Student B's much larger growth still ends below the line and is labeled "Failure.")

What is needed are assessment systems that can measure growth. There is work under way in several states to create tests that can be used for accountability purposes that are based on growth models. The education stimulus funding from the Obama administration is also encouraging states to use growth models and better track student progress. This new emphasis on growth models is a positive and welcome development. But unfortunately, these models are still in the experimental stage and continue to rely on standardized tests. I suspect there will continue to be many problems with the standardized tests used with ELLs and that it will take several years to determine how we can get valid and reliable measures of

ELL students' academic growth. Multiple measures of ELL students using a wide variety of alternative and authentic assessments given throughout the school year currently provide the most accurate (valid) and fair assessment of students' knowledge, abilities, and growth.

SUMMARY

Providing ELLs with effective instruction requires the appropriate use of assessment. Assessment involves systemic planning and collection of student data from a variety of sources throughout the school year. Periodic summative assessments can provide a summary of what students have learned, but on-going formative assessments are essential to provide information that drives instruction appropriate to the students' strengths and needs. Norm-referenced tests may be inherently unfair to ELLs because their performance is compared to mainly (or exclusively) English-proficient students. Criterion-referenced tests may be more fair for ELLs, since these tests report the extent to which students learned the tested content. Both types of test, however, are problematic for ELLs because of unresolved issues about how to test ELLs in a valid and reliable manner. ELLs' lack of proficiency in English means that academic tests given in English cannot provide a fair and accurate measure of the students' true academic ability. Language tests, while important for tracking the progress of ELLs, can likewise never fully measure a student's proficiency because of the complexity of the construct of language proficiency and the multifaceted nature of bilingualism. Given these unresolved issues, the mandates of NCLB to use ELLs' high-stakes test results for school accountability purposes are problematic. Recent requirements to tie these scores to teacher evaluations are even more problematic. Furthermore, flaws in the logic and requirements of NCLB set up unreasonable expectations for ELL students. Thus, the frequent use of alternative authentic assessments, such as observations, performance assessments, self- and peer assessments, and portfolio assessment, is crucial for obtaining an accurate picture of ELL students' progress. When used consistently over the course of the school year, these assessments provide multiple measures of student performance that document students' growth and thus provide a clearer and more accurate picture of ELL students' language learning and academic progress.

Discussion Questions

1. Why are norm-referenced achievement tests in English inherently unfair for ELL students? What would be the likely reaction of testing companies and policymakers if most ELL students scored above the 50th percentile on a norm-referenced test in English?
2. Miguel, an advanced-level ELL, obtained a raw score of 56 on the state reading test but needed a score of 57 to pass. Because he "failed" the test, he and several other students who missed passing by a few points are pulled out of class over the next few months to be drilled by hired tutors on reading exercises similar to those on the test, in preparation for retesting. Is this a good use of instructional time? Why or why not? Did Miguel really "fail" the test? (Hint: standard error of measurement.)
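For question 2, it may help to recall the formula behind the hint. The sketch below uses standard classical test theory; the standard deviation and reliability figures are hypothetical, since the actual values would come from the state test's technical manual.

% Standard error of measurement (classical test theory)
% SD_x = standard deviation of observed scores, r_xx = reliability coefficient
SEM = SD_x \sqrt{1 - r_{xx}}
% Hypothetical example: SD_x = 10 and r_xx = 0.91 give SEM = 10 * sqrt(0.09) = 3.
% A one-SEM band around Miguel's observed score of 56 is then 56 +/- 3 (53 to 59),
% which includes the passing score of 57, so his "failure" falls within the
% test's own margin of error.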
3. Most people believe the higher the test scores, the better the school. However, Alfie Kohn (2000), author of the book The Case Against Standardized Testing: Raising the Scores, Ruining the Schools, argues that a rise in a school's test scores should not be a cause for celebration but rather a cause for mourning. How can rising test scores be an indication of declining quality of instruction and education in a school?
4. On-line Discussion Board: This chapter lists many of the potentially harmful effects high-stakes testing may have on ELLs. Have you observed any of these effects? Have you experienced any of them first-hand as a student or as a teacher? How can these harmful effects be minimized?
5. On-line Discussion Board: Early in its educational reform efforts, the Obama administration insisted that teachers' evaluations be tied to their students' test scores. Based on what you read in this chapter, why might this policy be unfair for teachers of ELL students? What changes in federal assessment requirements for ELLs do you feel should be made before ELLs' assessment performances are used to evaluate teacher effectiveness?
6. The current paradox of testing and ELLs is that if ELLs are not included in statewide high-stakes testing and accountability programs, schools will likely ignore their needs, but when ELLs are included, their test scores are invalid but used in harmful ways. Are there other ways that schools could be held accountable for addressing the needs of their ELL students without having to rely on problematic high-stakes tests?
7. Why are multiple measures crucial for valid assessment of ELLs? What are some examples of the measures that should be collected over time and how should these assessment data be evaluated?

Research Activities

1. Wiki This: What is your state's policy for the inclusion of ELLs in the state's testing and accountability program? What is the minimum group size established by your state for the LEP subgroup? Based on this number, how many of the schools you are familiar with would be required to have an LEP subgroup for NCLB accountability purposes?
2. Talk to teachers in a school you are familiar with. Ask them to describe any curricular changes that have taken place in response to state or federal testing and accountability mandates. Ask them how these changes have impacted their ELL students.
3. Wiki This: Interview a teacher of ELL students. Ask what alternative authentic assessments he or she uses with the students. Ask whether he or she feels these assessments provide more accurate information about the students' knowledge and skills than the state's test.
4. What testing accommodations are allowed for ELLs on your state's test? How many of these accommodations are responsive to the ELLs' linguistic needs? Do you feel these accommodations are sufficient to assess an ELL student in a valid and reliable manner?
5. Conduct an analysis of your state's English language proficiency test (or language proficiency assessment system). What types of questions are asked? What kinds of tasks are students asked to complete, and how are these scored? Overall, how effective do you feel this test is in measuring students' ability to listen, speak, read, and write English?

Recommended Reading

Gottlieb, M. H. (2006). Assessing English language learners: Bridges from language proficiency to academic achievement. Thousand Oaks, CA: Corwin Press.

Menken, K. (2008). English learners left behind: Standardized testing as language policy. Clevedon, UK: Multilingual Matters.
Nichols, S., & Berliner, D. C. (2007). Collateral damage: How high-stakes testing corrupts America's schools. Cambridge, MA: Harvard Education Press.
Popham, W. J. (2004). Classroom assessment: What teachers need to know (4th ed.). Boston: Allyn and Bacon.
Valenzuela, A. (2004). Leaving children behind: How "Texas-style" accountability fails Latino youth. New York: State University of New York Press.

6
Listening and Speaking

It's like our saying to the students, "Sorry, I can't listen to your story about your puppy because I have to read you this book about a dog."
-Linda Veal, sheltered English immersion kindergarten teacher

KEY TERMS
• silent period
• wait time
• total physical response (TPR)
• minimal pairs
• cooperative learning
• SOLOM (Student Oral Language Observation Matrix)

GUIDING QUESTIONS
1. What does the research tell us about the relationships between ELLs' oral language development, literacy development, and educational achievement?
2. How can an understanding of ELLs' listening and speaking strengths and needs inform a teacher's choices of instructional approaches, methods, and strategies?
3. How can the TESOL standards guide instruction and assessment for ELLs?
4. How can teachers promote oral language use in the classroom as a foundation for ELLs' literacy development and academic achievement in English?
5. How can teachers promote the development of higher levels of oral language proficiency for ELLs?
iiHJ\-,'iUc
higher levels of oral language proficiency for Ells?

Of the four traditional language skills, listening is by far the one we use most
frequently. Think about how much time you spent listening today. Then think
about how much time you spent speaking, which typically is second to listening.
Right now you are reading, and you might even write a few notes as you read. What
you read here could later become the topic for oral discussions. Oral language
dominates our day, and even reading and writing tasks provide the impetus for oral
conversation.
In the past, language educators considered listening and reading to be passive
skills, with students simply receiving oral or written input. Speaking and writing,
in contrast, were considered active skills, with students actively involved in pro-
ducing oral and written language. This view no longer holds. Today, as Elaine Hor-
witz (2008) explains:
Language scholars recognize that listening and reading are by no means passive pro-
cesses and that both listeners and readers must actively construct the meaning of any
language they encounter. Of course, whenever we listen, the words themselves are not
the only source of meaning. Gestures, facial expressions, tone of voice, and the setting
and context of the conversation all contribute to our understanding of what another
person is saying. (p. 69)
Evidence that listening is an active skill of constructing meaning can be found in
the fact that, as political debates and personal arguments attest, even native speak-
ers of English may hear the same thing but interpret it differently.
What do ELLs need to do to be able to actively construct meaning through oral
English? Consider some of the following challenges:

