changed, in fact, since the publication in 1990 of the Standards for Teacher Competence in Educational Assessment of Students (American Federation of Teachers, National Council on Measurement in Education [NCME], & National Education Association, 1990), which expressed concern about "the inadequacy with which teachers are prepared for assessing the educational progress of their students." The document recommended that "assessment training . . . be widely available to practicing teachers through staff development programs at the district and building levels" (p. 1).1

Further evidence regarding the need for training in this area comes from two additional efforts undertaken by professional organizations during the 1990s. The Joint Committee on Competency Standards in Student Assessment for Educational Administrators, sponsored by the American Association of School Administrators, National Association of Elementary School Principals, National Association of Secondary School Principals, and NCME, surveyed 1,700 administrators from the first three of these sponsoring organizations about assessment-related skills. The three skills rated as most needed by educational administrators (out of a list of 37) were knowing the terminology associated with standardized tests, knowing the purposes of different kinds of testing, and understanding the linkage between curriculum content and various kinds of tests (Impara, 1993, p. 20). In 1995, NCME published the Code of Professional Responsibilities in Educational Measurement. This document, which is intended to apply to professionals involved in all aspects of educational assessment, including those who administer assessments and use assessment results, includes the responsibility to "maintain and improve . . . professional competence in educational assessment" (NCME, 1995, p. 1).

According to Stiggins (2002), however, "only a few states explicitly require competence in assessment as a condition for being licensed to teach. No licensing examination now in place at the state or federal level verifies competence in assessment. Since teacher-preparation programs are designed to prepare candidates for certification under these terms, the vast majority of programs fail to provide the assessment literacy required to prepare teachers to face emerging classroom-assessment challenges . . . Furthermore, lest we believe that teachers can turn to their principals for help, almost no states require competence in assessment for licensure as a principal or school administrator at any level. As a result, assessment training is almost nonexistent in administrator-training programs."

The continuing need for professional development in this area was confirmed by two recent reports. A report on the licensing of school principals in the 50 states, sponsored by the Wallace Foundation (Adams & Copland, 2005), includes a discussion of the necessary knowledge and skills for principals, according to state licensing requirements. "Totally missing from state licensing frameworks," according to the authors, "was any attention to the meaning and use of learning assessments. . ." In another recent study, by the National Board on Educational Testing and Public Policy at the Lynch School of Education at Boston College, researchers surveyed a nationally representative sample of teachers to ascertain their attitudes about state-mandated testing programs (Pedulla et al., 2003). When asked about the adequacy of professional development in the area of standardized test interpretation, almost a third of the 4,200 responding teachers reported that professional development in this area was inadequate or very inadequate (Pedulla et al., 2003). Further documentation of the "widespread deficits in assessment skills evidenced by practicing teachers" is described by Lukin, Bandalos, Eckhout, and Mickelson (2004, pp. 26–27).

As part of the preliminary work conducted in preparation for the ITEMS project, a survey assessing respondents' understanding of educational measurement and statistics was developed and field-tested by two doctoral students under the first author's supervision (Brown & Daw, 2004). The survey, which consisted mainly of multiple-choice questions, was completed by 24 University of California, Santa Barbara (UCSB) graduate students who were pursuing an M.Ed. and a multiple-subjects teaching credential. A revised version of the survey was subsequently offered over the Internet to students enrolled in a graduate education course at California State University, Northridge (CSUN). The 10 CSUN students who responded to the Web-based version of the survey were experienced teachers or administrators in K-12 schools. Results of the two survey administrations to teachers and future teachers suggest the existence of substantial gaps in the respondents' knowledge of educational measurement and statistics. For example, only 10 of the 24 UCSB respondents were able to choose the correct definition of measurement error, and only 10 knew that a Z-score represents the distance from the mean in standard deviation units. Nearly half mistakenly thought the reliability of a test is "the correlation between student grades and student scores on the test." When told that "20 students averaged 90 on an exam, and 30 students averaged 40," only one-half of the CSUN group were able to calculate the combined average correctly, and only one in 10 chose the correct definition of measurement error.
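For reference, the two survey items just mentioned rest on elementary definitions; the formulas below are an editorial illustration and were not part of the survey instrument. A Z-score is a score's signed distance from the mean in standard deviation units, and the combined average in the two-class example is a weighted mean rather than the midpoint of 90 and 40:

\[
z = \frac{x - \bar{x}}{s},
\qquad
\bar{x}_{\text{combined}} = \frac{(20)(90) + (30)(40)}{20 + 30} = \frac{3000}{50} = 60.
\]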
As Popham (2006a, p. xiii) notes, "[t]oday's educational leaders need to understand the basics of assessment or they are likely to become yesterday's educational leaders . . ." How can the required level of understanding be attained? Ideally, teacher and principal certification programs will eventually be modified to increase course content in educational measurement and statistics. Substantial changes of this kind are likely to be slow, however, and may occur only if licensing requirements are first modified. To help K-12 schools respond rapidly to the training gap in these areas, the Instructional Tools in Educational Measurement and Statistics (ITEMS) for School Personnel project has created and evaluated a series of three instructional modules in educational measurement and statistics, to be available via the Web and also as CDs and DVDs.

The goal of the project was to provide immediate opportunities for teachers and administrators to increase their assessment literacy—more specifically, their understanding of the psychometric and statistical principles needed for the correct interpretation of standardized test scores. The ITEMS materials are intended to help prepare school personnel to use test results to optimize instructional decisions and to pinpoint schools, classes, or individuals that require additional instruction or resources, as well as to explain test results to students, parents, the school board, the press, and the general community. The provision of this training in a convenient and economical way is intended to assist schools with the successful implementation and interpretation of assessments. Alternative forms of professional development,
such as intensive workshops or university courses, are much more time-consuming and expensive, and are unlikely to be funded in an era of limited school budgets.

Previous Statistics Education and Assessment Literacy Research

The ITEMS work is related to previous research in statistics education and assessment literacy. The goal of statistics education research is to promote effective techniques, methods, and products for teaching statistics, as well as to understand how individuals reason about statistical concepts. By contrast, the work described here as assessment literacy research focuses on the ways teachers use assessments to make instructional decisions.

Statistics Education

A prominent statistics education research effort is that of Joan Garfield and Robert delMas of the University of Minnesota and Beth Chance of California Polytechnic State University. In the current ARTIST (Assessment Resource Tools for Improving Statistical Thinking) project (https://app.gen.umn.edu/artist/), these researchers are working to develop effective assessments of statistics knowledge, which are made available online for teachers of first courses in statistics (Garfield, delMas, & Chance, 2003; see https://app.gen.umn.edu/artist/publications.html for a publications list). In a previous project, delMas, Garfield, and Chance (1999) investigated the use of computer simulations to improve the statistical reasoning of students in college-level introductory statistics courses. These projects built on an earlier program of research, Project Chance, headed by J. Laurie Snell of Dartmouth, the goal of which was to help students think critically about media reports that use probability and statistics (Snell & Finn, 1992).

The Statistics Education Research Group (SERG) (www.umass.edu/srri/serg/) at the University of Massachusetts, Amherst, works on a variety of projects to improve instruction in statistics courses. Founded in the 1970s, the group originally investigated how individuals reason about statistical concepts before receiving any formal training, and then used results to improve K-12 and college-level statistics instruction. Their current focus is on younger students. Several data analysis software tools have emerged from projects conducted at SERG, including DataScope, Prob Sim (Konold & Miller, 1994), and Tinkerplots™ (Konold & Miller, 2004). These products are designed for teaching concepts in probability and sampling using simulations. They have been used extensively in the classroom to teach concepts to students, and have also been used to investigate teachers' approaches to analyzing data (e.g., see Hammerman & Rubin, 2002, 2004).

While much of the statistics education literature is focused ultimately on student learning, recent studies have also examined teachers' statistical thinking and reasoning. Hammerman and Rubin (2002, 2004), Makar and Confrey (2002), and Confrey, Makar, and Kazak (2004) have examined ways that teachers reason about statistical concepts (in particular, variation in data) and how using computer graphics tools can facilitate their understanding.

Hammerman and Rubin (2002, 2004) studied the strategies that teachers used to comprehend variability in data and examined how teachers explored and analyzed data using the Tinkerplots™ software. After learning to "bin" and visualize data using the graphical tools, teachers were less likely to use the sample average as a single overall representation of the data.

Makar and Confrey (2002) examined mathematics teachers' statistical thinking about high-stakes test results at a low-performing high school in Texas. Using the computer-based statistical learning tool Fathom™ (Finzer, 2001), teachers created graphs to compare distributions of test scores for male and female students to determine if the differences in the centers between the two groups of data were significant. However, the research showed that, after attending a workshop in data analysis, teachers still used intuition to determine whether differences in centers of the distributions of test scores were significant, ignoring concepts like variability and sampling distributions in their reasoning.

Confrey et al. (2004) conducted additional studies to explore the statistical understanding of high-stakes test data by teachers. The purpose was to determine if a professional development course in the area of testing and statistics could improve teacher reasoning about statistical concepts as it relates to test data. The results indicated that participants made significant gains in their understanding of statistical concepts; however, it was not determined if teachers made significant gains in their understanding of test results per se.

Assessment Literacy

Since the enactment of NCLB, an increasing number of schools and districts are collecting massive quantities of student data, including standardized test results, using software management products. (See Wayman, Stringfield, and Yakimowski (2004) for a discussion of various software products designed for storing, organizing, and analyzing student, school, and district level data.) Administrators and teachers are expected to use these data to make educational and instructional decisions. Boudett, City, and Murnane (2005) outline eight critical steps to effectively use assessments to improve instruction and raise student achievement. According to Boudett et al. (2005), the second step is to "build assessment literacy," i.e., develop a working knowledge of common concepts related to test score results, and acquire appropriate skills to interpret test score results. The text by Popham (2006a) is intended to improve the assessment literacy of educational leaders, and another recent contribution by Popham (2006b) consists of 15 booklets on assessment topics that are intended to help teachers increase their assessment knowledge.

Other resources for teachers and administrators are available, although in many cases their effectiveness has not been formally assessed. Test publishing companies offer products for professional development in assessment literacy. For example, Educational Testing Service (ETS) and CTB McGraw-Hill provide written guides and workshops for professional development in which teachers can learn to use data for instructional improvement. The Pathwise Evidence-Centered Teaching and Assessment Workshops, offered through ETS, are designed for language arts and mathematics teachers as training sessions in the principles of formative assessment and linking of assessment to instruction. The Assessment Training Institute, founded by Rick Stiggins in 1993 and acquired by ETS in 2006, provides classroom assessment training needed to support educators. Books, videos, DVDs, and seminars are available through the institute.
FIGURE 1. Illustration of measurement error from Module 2.
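As context for Figure 1 (the equations here are an editorial gloss; the modules themselves deliberately avoid formulas), the classical test theory model underlying such illustrations treats each observed score \(X\) as a true score \(T\) plus a random error \(E\), with the typical size of the error given by the standard error of measurement:

\[
X = T + E,
\qquad
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}},
\]

where \(\sigma_X\) is the standard deviation of observed scores and \(\rho_{XX'}\) is the test's reliability.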
pictures for cognitive resources in the visual channel" (Mayer, 2001, p. 140).

Coherence principle

Efforts have been made to remove extraneous words, pictures, and sounds from the instructional presentations. Irrelevant material, however interesting, has been shown to interfere with the learning process (Mayer, 2001, pp. 113–133).

Prior knowledge principle

The modules are designed to "use words and pictures that help users invoke and connect their prior knowledge" to the content of the materials (Narayanan & Hegarty, 2002, p. 310). For example, while participants may be unfamiliar with the term "sampling error," most will be familiar with statements like, "The results of this survey are accurate within plus or minus five percentage points." Analogies and metaphors have also been shown to enhance mathematical learning (English, 1997).
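To connect the familiar survey statement to the term "sampling error" (a gloss added in editing, not drawn from the module), note that under simple random sampling the 95% margin of error for an estimated proportion \(\hat{p}\) is approximately

\[
1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \;\le\; \frac{0.98}{\sqrt{n}} \;\approx\; \frac{1}{\sqrt{n}},
\]

so "plus or minus five percentage points" corresponds to a sample of roughly \(n \approx 400\) respondents.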
sixth authors) who were graduate stu- gram more effective for some partic-
dents in media arts and technology with ipants than for others?” In our case,
Conversational style experience in computer programming, the stated goal of the project was
A conversational rather than a formal graphics, and animation, and a profes- to present psychometric and statisti-
style is used in the modules; this has sional animator (fourth author). The cal information about the interpreta-
been shown to enhance learning, per- project staff also included a project ad- tion of standardized test scores that
Summer 2008 19
Table 3. Module 1: Results on the 20-Item Quiz

                                    Module-First Group    Quiz-First Group     Effect   t-test p-value
                                    Mean   SD    n        Mean   SD    n       Size     (1-sided)
All Participants                    13.2   3.7   52       12.0   3.4   61      .34      .042
Teacher Education (TEP) Students    13.1   4.0   33       11.7   3.5   35      .37      .059
School District Personnel           13.4   3.2   19       12.5   3.2   26      .28      .198
that those who viewed the module first were better able to answer the quiz questions.

Data analysis

Psychometric analysis of the assessment literacy quiz showed that, for the total group, the quiz had an average percent correct of 63, an average item discrimination (corrected item-total correlation) of .28, a reliability of .71, and a standard error of measurement of 1.9 points.

As indicated in Table 3, results for the 113 participants showed that viewing the module had a small but statistically significant positive effect on quiz scores: Those in the "module-first" group answered an average of 13.2 of 20 questions correctly, compared with an average score of 12.0 questions for the "quiz-first" group (an effect size of .34 standard deviation units).
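As an editorial consistency check (the article does not give these formulas, and the total-group standard deviation must be approximated from the group values in Table 3, roughly 3.5), the reported standard error of measurement and effect size follow from the standard definitions:

\[
\mathrm{SEM} = s\sqrt{1 - \rho} \approx 3.5\sqrt{1 - .71} \approx 1.9,
\qquad
d = \frac{13.2 - 12.0}{s_{\text{pooled}}} \approx \frac{1.2}{3.5} \approx .34.
\]

The assumption that the effect size is a pooled-standard-deviation Cohen's d is consistent with the reported value but is not stated explicitly in the article.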
Further analysis showed that the quiz item on which the two groups differed most addressed the topic of skewness of a test score distribution. On this question, 63% of the module-first group answered the question correctly, while only 34% of the quiz-first group answered the question correctly. On the remaining quiz items, differences in the percentages correct for the two groups were typically less than 10 points.

When the overall results in Table 3 are disaggregated, it becomes apparent that the effect was larger among the TEP students than among the school personnel (see Figure 2). There are several possible explanations for this disparity. As noted, the TEP students differed in several ways from the school personnel (Table 2). In addition, Module 1 and the associated quiz were presented to these students in a group administration, with project staff available to trouble-shoot. By contrast, school personnel participated on an individual basis, at a time and place of their own choice.

FIGURE 2. Module 1: Average scores on 20-item quiz for school personnel and teacher education program students in quiz-first and module-first groups. Note: In Figures 2, 3, and 4, the labels for the "Module-First" and "Quiz-First" groups were reversed. The accompanying data tables (Tables 2, 4, and 5) were correctly labeled.

In addition to comparing the quiz results by TEP status, we were also interested in comparing quiz results by math instructor status. Participants who were either full-time math teachers or TEP student teachers were classified as math instructors if they indicated that math was the sole subject taught, while all other participants were classified as nonmath instructors. Because we expected math instructors to be more familiar with measurement and statistics concepts than other participants, we anticipated that they would perform better on the quiz than other participants. In fact, math instructors (n = 19) scored only slightly higher than participants who were not math instructors (n = 94). The average score for math instructors was 13.2 points (SD = 2.9), compared to an average score of 12.4 for other participants (SD = 3.7).

Follow-up quiz analysis

A follow-up analysis was conducted to determine the extent to which participants retained an understanding of the module material, as measured by their follow-up performance on the quiz. Individuals who agreed to participate in the follow-up phase of the Module 1 research were contacted about a month after they had first viewed the module and taken the quiz. They were given an online quiz identical to the one that they took during their first module-viewing session.

Unfortunately, of the original 113 participants, only 11 participated in the follow-up phase and completed the quiz again. Four participants were from the module-first group, and seven were from the quiz-first group. The average score on the quiz for the four members of the module-first group was 15.5 the first time they took the quiz, and it remained 15.5 when they retook the quiz (note that there was variability within the scores at the two different times). The average quiz score for the seven quiz-first follow-up participants was 14.3 the first time they took the quiz, and increased to 15.9 when they took it during the follow-up phase. Although the follow-up samples are small
and cannot be assumed to be representative of the original participants, it is worth noting that on average, no loss of content knowledge was observed for either group.

Development and Evaluation of Module 2, "What Test Scores Do and Don't Tell Us"

Module 2 Development

Our second module, "What Test Scores Do and Don't Tell Us," shows Stan, the teacher introduced in Module 1, meeting with parents Maria and Tim to discuss the test results of their twins, Edgar and Mandy. The module focuses on the effect of measurement error on individual student test scores, the effect of sample size on the precision of average scores for groups of students, and the definition and effect of test bias. Table 1 provides some additional information on the content of the module.
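The second of these topics rests on a standard result, stated here for reference (the module conveys it without formulas): the precision of a group average improves with the square root of the group size,

\[
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}},
\]

so, for example, an average based on 100 students is twice as stable as one based on 25.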
As was the case in developing Module 1, a detailed outline was first developed. Text was then created, and scenes were conceptualized and incorporated in a storyboard. Following that, a Web-based version of the module was produced using Flash software and was revised over a period of several months to reflect the feedback received from our project advisory committee. Both low- and high-bandwidth versions were made available.

Module 2 includes two features not incorporated in Module 1. First, an optional closed captioning feature is available, which should facilitate the use of the module by participants who are hard of hearing. Also, the module includes four "embedded questions" (one for each main section) that allow participants to check their understanding of the material and view portions of the module a second time if they wish. The embedded questions are also intended to encourage participants to remain engaged in the process of watching the module.

Recruitment and Data Collection Procedures for Module 2

The 2006 research and program evaluation phases differed from their 2005 counterparts in three related respects. First, whereas we recruited only Central California teachers and educators in 2005, we recruited educators from around the nation in 2006. Our primary means of doing so was to place an advertisement in the educational magazine Phi Delta Kappan in January and February. We also adapted our Website to allow individuals to sign up for the project on the Website itself. In addition, we contacted professional organizations, focusing on those for minority professionals, to encourage them to sign up for project participation.

A second change that we implemented in 2006 was necessitated by the fact that we were working with participants who were in some cases thousands of miles away, making visits to schools impractical. We therefore developed an infrastructure that allowed all transactions that take place during the research and program evaluation phases to be conducted electronically. Provision of logon information to participants, collection of data, and even distribution of gift cards were electronic. (Participants received electronic certificates that are usable at Borders.com.)

A third change was that, rather than regarding a school or district as a target of recruitment, we directly enlisted the participation of individual teachers and administrators. This did not preclude the possibility that particular schools and districts would participate; rather, it allowed us to collect data from educators whether or not their schools or districts were participating on an institutional level.

Module 2 Research Phase

To test the effectiveness of Module 2, we used a new assessment literacy quiz, tailored to the topics of Module 2. Consistent with the recommendations of our advisory panel, we created an instrument that is more applications-oriented than the Module 1 quiz. Most questions include tables or graphics that resemble those that appear in standardized test results. Respondents are asked questions about the interpretation of these displays. Because the quiz was more directly linked to the material in the module than was the case for Module 1, we believe the Module 2 quiz provided a better test of the effectiveness of the module. This 16-item multiple-choice quiz was pilot-tested with UCSB TEP students.

One hundred four individuals participated in the research phase for Module 2, "What Test Scores Do and Don't Tell Us." Of those, 81 were TEP students from at least five universities, and 23 were school district personnel from 19 school districts across the country. Participants received $15 electronic gift certificates from Borders.com. Demographic information on the participants for this phase is given in Table 4. Sixty-eight percent of the participants were women, with an average age of 31 years and average teaching experience of 11 years. The TEP students were typically younger and had fewer years of teaching experience than the school personnel.

To develop a better picture of the school districts where participants worked, either full-time or as a student teacher, we collected additional information about the community surrounding the school district and the minority composition of the school district. In the background survey, participants were asked if they would describe the community surrounding the school district as "central city," "urban fringe/large town," or "rural/small town." Of the 85 participants who responded to this question, 69% described the area surrounding their district as urban fringe, while 18% indicated that their community was rural, and 13% described it as central city.

Participants were also asked to estimate a range for the percentage of students in their school district who were African-American, Hispanic/Latino, Native American, or members of other ethnic minorities. The choices were as follows: less than 20%, 21%–40%, 41%–60%, 61%–80%, and 81%–100%. Diverse districts were well represented in the study. Of the 84 participants who responded to the question, 45% indicated
Table 5. Module 2: Results on the 16-Item Quiz

                                    Module-First Group    Quiz-First Group     Effect   t-test p-value
                                    Mean   SD    n        Mean   SD    n       Size     (1-sided)
All Participants                    12.6   3.0   51       10.2   3.5   53      .74      .000
Teacher Education (TEP) Students    12.6   3.2   40        9.5   3.7   41      .90      .000
School District Personnel           12.7   1.9   11       12.5   1.4   12      .12      .375
that they worked in a district composed of 41% to 60% minority students, 19% worked in a district that was 61% to 80% minority students, 13% worked in a district that was less than 20% minority students, 12% worked in a district that was 81% to 100% minority students, and 11% worked in a district that consisted of 21% to 40% minority students.

Psychometric analysis of the 16-item multiple-choice assessment literacy quiz showed that the quiz had an average percent correct of 71, an average item discrimination of .40, a reliability of .79, and a standard error of measurement of 1.58 points.

As indicated in Table 5, results for the 104 participants showed a statistically significant effect: Those in the module-first group answered an average of 12.6 out of 16 questions correctly, compared with an average score of 10.2 questions for the quiz-first group (an effect size of .74).

Further analysis of the quiz results showed that the items that had the largest group differences in the percentages correct—between 20% and 27%—were primarily about measurement error or about the relation between the stability of sample means and sample size. For example, a question about the standard error of measurement was answered correctly by 88% of the module-first group, compared to 62% of the quiz-first group. For the remainder of the questions, the typical difference between the groups in the percentages correct was between 5% and 10%.

Table 5 shows that, as in Module 1, the effect of viewing the module was much larger for the TEP students (an effect size of .90) than for the school personnel (.12), a finding that is discussed further below. In Figure 3 we can observe that school personnel performed about the same (on average) on the quiz regardless of whether they viewed the module before or after taking the quiz; however, there was a difference in the average scores between TEP students who viewed the module before and after taking the quiz. This effect is more apparent than that obtained in the Module 1 evaluation phase.

FIGURE 3. Module 2: Average scores on 16-item quiz for school personnel and teacher education program students in quiz-first and module-first groups.

To parallel the analysis in the Module 1 evaluation phase, we compared quiz results between math instructors and nonmath instructors. Math instructors consisted of both full-time math teachers and TEP student teachers in math; all other participants were classified as nonmath instructors. As in the Module 1 evaluation phase, math instructors (n = 5) scored higher on average than participants who were nonmath instructors (n = 99). The average score for math instructors was 14.4 (SD = 1.5), compared to 11.2 for nonmath instructors (SD = 3.5). (Note that only five participants were designated as math instructors in the Module 2 analysis, as compared to 19 in the Module 1 analysis. This may be because information on teaching was collected slightly differently on the two occasions. In the Module 1 background survey, participants were asked to indicate the subject they taught most frequently, while in the Module 2 survey, they were given the option to indicate multiple subjects that they taught. Participants who indicated only "math" were then designated as math instructors.)

Follow-up quiz analysis

As in the first module research phase, a follow-up analysis was conducted to determine the extent to which participants retained an understanding of the material from the second module. Participants in the Module 2 research phase were contacted about a month after they had first viewed the module and taken the quiz. They were given the identical quiz that they had previously taken during their first module-viewing session.

Due to improvements in the online administration system, contact with participants was more easily achieved,
and a much higher follow-up rate was observed. Of the original 104 participants who took the quiz, 38 participated in the follow-up phase and completed the quiz again. Fifteen participants were from the module-first group, and 23 were from the quiz-first group. The average score on the quiz for the 15 members of the module-first group was 13.9 the first time they took the quiz, and it dropped slightly to 13.1 when they retook the quiz. However, this difference was not statistically significant (p = .20) at the .05 level, as determined by a matched-pairs t-test. Hence, there does not appear to be any significant loss of content knowledge for this group. The average quiz score for the 23 quiz-first follow-up participants was 10.6 the first time they took the quiz, and increased to 11.4 when they took it during the follow-up phase. This difference was marginally statistically significant (p = .05), which is consistent with an increase in content knowledge due to the effect of viewing the module. As in the Module 1 evaluation, interpretation is complicated by the fact that the follow-up samples are small and cannot be assumed to be representative of the original participants.
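For readers unfamiliar with the test referenced above, the matched-pairs t statistic is computed from each participant's change score; the formula is supplied editorially, as the article reports only the resulting p-values:

\[
t = \frac{\bar{d}}{s_d / \sqrt{n}},
\]

where \(\bar{d}\) is the mean of the \(n\) follow-up-minus-initial score differences and \(s_d\) is their standard deviation; the statistic is referred to a t distribution with \(n - 1\) degrees of freedom.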
Development and Evaluation of Module 3, "What's the Difference?"

Module 3 Development

Our third module, "What's the Difference?" shows a press conference in which a superintendent is explaining recently released test results. Norma and Stan, the teachers introduced in the previous modules, help to demystify the test results for the reporters. The main topics addressed are the importance of disaggregating test data for key student groups and the implications for score trend interpretation of shifts in the student population, the number of students assessed, and changes in tests and test forms. (See Table 1 for details.)

As was the case in developing Modules 1 and 2, an outline and script were first developed. An animated Web-based module was then produced, revised, and incorporated into our data collection system. The closed captioning and embedded questions features introduced in Module 2 were retained. Production values for Module 3 were substantially improved over the previous two modules. First, a professional animator with extensive cartooning experience joined the project. Second, students from the Dramatic Art department at UCSB were hired to voice the animated characters, rather than employing project staff for this purpose. Finally, the audio recording was conducted in a professional sound studio. These changes resulted in a module that was much more polished than its predecessors.

Recruitment and data collection procedures for Module 3 were similar to those in Module 2, except that we selected a different magazine—Learning and Leading with Technology—in which to place our advertisement, and we timed the ad to be concurrent with the data collection instead of preceding it. As an added incentive to participation, a feature was added that allowed participants the option of downloading a personalized certificate indicating that they had completed an ITEMS training module.

Module 3 Research Phase

The effectiveness of Module 3, "What's the Difference?" was tested using an assessment literacy quiz tailored to the topics of the module. We began with a 16-item instrument, which was reduced to 14 items after being pilot-tested with UCSB TEP students.

Primarily because of reduced participation by the UCSB TEP students during the research phase, we were initially able to recruit only 23 participants. Four were UCSB TEP students and 19 were school district personnel from seven states. We were subsequently able to collect data from 10 teacher education students at California State University, Fresno, bringing the number of TEP students to 14. As before, participants received $15 electronic gift certificates from Borders.com.

Demographic information on the participants for this phase is given in Table 6. About 82% of the participants were women, the average age was 37, and the average number of years of teaching experience was 6.6. The teacher education students were generally younger and had fewer years of teaching experience than the school personnel.

As in the Module 2 research phase, we collected information about the community surrounding the school district and the minority composition of the school district. Participants were asked if they would describe the community surrounding the school district as "central city," "urban fringe/large town," or "rural/small town." Of the 24 participants who responded to this question, 12 described the area surrounding their district as urban fringe, 3 indicated that their community was rural, and 9 described it as central city.

Participants were also asked to estimate a range for the percentage of students in their school district who were African-American, Hispanic/Latino, Native American, or members of other ethnic minorities. The choices were identical to those given in the Module 2 background survey: less than 20%, 21%–40%, 41%–60%, 61%–80%, and 81%–100%. Diverse districts were well represented: Of the 24 participants who responded to the question, 8 worked in a district that consisted of 21% to 40% minority students, 7 indicated that they worked in a district composed of 41% to 60% minority students, 4 worked in a district that was less than 20% minority, 3 worked in a district that was 81% to 100% minority, and 2 worked in a district that was 61% to 80% minority.

Psychometric analysis of the 14-item multiple-choice assessment literacy quiz revealed that the quiz had an average percent correct of 63, an average item discrimination of .53, a reliability of .87, and a standard error of measurement of 1.47 points.
Table 7. Module 3: Results on the 14-Item Quiz

                                    Module-First Group    Quiz-First Group     Effect   t-test p-value
                                    Mean   SD    n        Mean   SD    n       Size     (1-sided)
All Participants                     9.1   4.2   18        8.5   4.1   15      .14      .33
Teacher Education (TEP) Students     6.5   4.1    8        5.5   2.1    6      .32      --
School District Personnel           11.2   3.0   10       10.4   4.0    9      .23      --

Note: Because of small sample sizes, a single t-test was performed for all participants combined; separate tests were not conducted for TEP students and school district personnel.
As indicated in Table 7, the effect of the module was not statistically significant. (Because of small sample sizes, a single t-test was performed for all participants combined; separate tests were not conducted for TEP students and school district personnel.) Those in the module-first group did have a slightly higher average quiz score (9.1) than those in the quiz-first group (8.5), however (see Figure 4). In both the module-first and quiz-first groups, school personnel performed substantially better than teacher education students. There were not enough data to make a meaningful comparison between math and nonmath instructors as was done for the first two research phases. Only two participants were math instructors. One math instructor with three years of teaching experience received a perfect score on the quiz, while the other math instructor, who is currently a TEP student, answered 5 out of 14 items correctly.

Further analysis of the quiz results did not show much difference between the scores of the module-first and quiz-first groups on most items. However, two questions did reveal substantial group differences: An item on the effect of sample size on the interpretation of improvements in average test scores was answered correctly by 83% of the module-first group, compared to 60% of the quiz-first group. A question about test equating was answered correctly by 78% of the module-first group, compared to only 27% of the quiz-first group.

Follow-up quiz analysis

A follow-up analysis was conducted to determine the extent to which participants retained an understanding of the material from the third module. Participants in the Module 3 research phase were contacted about a month after they had first viewed the module and taken the quiz. They were given the identical quiz that they had previously taken during their first module-viewing session. Of the original 33 participants who took the quiz, 11 (33%) participated in the follow-up phase and completed the quiz again. Four participants were from the module-first group and seven were from the quiz-first group. The average score on the quiz for the four members of the module-first group was 12.3 the first time they took the quiz, and it remained the same when they retook the quiz. The average quiz score for the seven quiz-first follow-up participants was 10.7 the first time they took the quiz, and increased to 12.6 when they took it during the follow-up phase. Because of the very small number of participants in the follow-up phase, tests of significance were not conducted. As in the results from the previous follow-up studies, the slight increase in average score is consistent with an increase in content knowledge due to the effect of viewing the module.
instrument was almost entirely overhauled before the research phase, so that the final version bore little resemblance to the pilot version. Nevertheless, it is possible that pilot participation had some effect.

More generally, caution is warranted when interpreting the data analysis results. While the TEP students from UCSB were required by their own faculty to participate in the main data collection for Modules 1 and 2, the remainder of the individuals who participated in the research phase chose to do so, resulting in nonrandom samples of the corresponding populations. In addition, samples ranged from small to moderate in size and cannot be assumed to be representative of school personnel or teacher education program students nationwide. Nevertheless, the results are encouraging and do indicate that overall, the modules are having a positive impact on content knowledge in educational measurement and statistics. Supplemental analysis of quiz performance showed that math instructors tended to perform better than other participants (Modules 1 and 2) and that the topics participants were most likely to learn about from the modules were skewness of a test score distribution (Module 1), measurement error (Module 2), the stability of sample means (Module 2), the effect of sample size on the interpretation of changes in average scores (Module 3), and test equating (Module 3).

Most comments relayed to the project via email (not included in the formal program evaluation) have been positive. Some examples are as follows:

"Very helpful and right to the point. If I were a building principal or a department chair today all of the staff would go through this until everyone really understood it."

"I am inclined to recommend [Module 1] as required viewing for all new hires in our K-12 district, and it certainly will be recommended . . . for inclusion in professional development on assessment literacy."

"I will be sharing [Module 1] with my Assistant Superintendent with the hope of promoting it as a part of our new teacher induction process."

In addition, the teacher education programs at UCSB and at California State University, Fresno have now incorporated the ITEMS materials into their curricula.

By far the greatest challenge in the ITEMS project has been the recruitment of participants. Overworked teachers who are overwhelmed by the demands of NCLB are unlikely to undertake an optional activity like this one. Ironically, then, one reason that educators do not have time to learn about the technical aspects of testing is that they are occupied with the preparation and administration of tests.

As part of our effort to increase awareness of the project, we have issued press releases and made contacts with the Corporation for Educational Network Initiatives in California (CENIC), the University of California Office of the President, the California Teachers Association Institute for Teaching, the California Department of Education Beginning Teacher Support and Assessment program, and the California County Superintendents Educational Services Association. In addition, we have made conference presentations, posted project information on education-oriented Websites and blogs, and used listservs to contact professional organizations, focusing on those for minority professionals.

Following the completion of the supplementary data collection, we will focus for the remainder of the project on the dissemination of our materials, particularly to teacher education programs and new teachers. It is our hope that even those educators who were hesitant to participate in the research aspects of the project will find the ITEMS materials useful as a resource.

Acknowledgments

We appreciate the support of the National Science Foundation (#0352519). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We are grateful to Liz Alix, Pamela Yeagley, and Lois Phillips for their contributions. For more information on the project, refer to the project Website at http://items.education.ucsb.edu or contact the first author at rzwick@education.ucsb.edu.

Note

1. These three organizations subsequently collaborated on the development of a manual, Interpreting and Communicating Assessment Results (National Council on Measurement in Education, 1997). The project was led by Barbara S. Plake and James C. Impara and was funded by a grant to NCME from the W. K. Kellogg Foundation.

References

Adams, J. E., & Copland, M. A. (2005). When learning counts: Rethinking licenses for school leaders. Available at http://www.ncsl.org/print/educ/WhenLearningCounts_Adams.pdf.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

American Federation of Teachers, National Council on Measurement in Education, & National Education Association (1990). Standards for teacher competence in educational assessment of students. Available at http://ericsys.uncg.edu/AFT-NEA-AERA-Standards for Teacher Competence in Education.htm.

Begleiter, M. (2001). From word to image: Storyboarding and the filmmaking process. Studio City, CA: Michael Weise Productions.

Boudett, K. P., City, E. A., & Murnane, R. J. (Eds.) (2005). Data wise: A step-by-step guide to using assessment results to improve teaching and learning. Cambridge, MA: Harvard Education Press.

Brown, T., & Daw, R. (2004). How do K-12 teachers and administrators feel about standardized testing and how well can they interpret results of such tests? Unpublished report, Santa Barbara, CA: University of California.

Chance, B. (2002). Components of statistical thinking and implications for instruction and assessment. Journal of Statistics Education, 10(3). Available at www.amstat.org/publications/jse/v10n3/chance.html.

Confrey, J., Makar, K., & Kazak, S. (2004). Undertaking data analysis of student outcomes as professional development for teachers. International Reviews on Mathematical Education (Zentralblatt für Didaktik der Mathematik), 36(1), 32–40.

delMas, R. C., Garfield, J., & Chance, B. L. (1999). A model of classroom research in action: Developing simulation activities to improve students' statistical reasoning. Journal of Statistics Education, 7(3). Available at http://www.amstat.org/publications/jse/secure/v7n3/delmas.cfm.

English, L. D. (1997). Mathematical reasoning: Analogies, metaphors, and images. Mahwah, NJ: Erlbaum.

Finzer, W. (2001). Fathom! (Version 1.12) [Computer Software]. Emeryville, CA: Key Curriculum Press.