Different Teacher-Level Effectiveness Estimates, Different Results: Inter-Model Concordance Across Six Generalized Value-Added Models (Vams)

Educational Assessment, Evaluation and Accountability
https://doi.org/10.1007/s11092-018-9283-7
Different teacher-level effectiveness estimates, different

results: inter-model concordance across six generalized
value-added models (VAMs)
Edward Sloat 1 & Audrey Amrein-Beardsley 1 & Jessica Holloway 2
Received: 12 November 2017 / Accepted: 13 July 2018/

# Springer Nature B.V. 2018, corrected publication August/2018
Abstract
In this study, researchers compared the concordance of teacher-level effectiveness ratings
derived via six common generalized value-added model (VAM) approaches including a
(1) student growth percentile (SGP) model, (2) value-added linear regression model
(VALRM), (3) value-added hierarchical linear model (VAHLM), (4) simple difference
(gain) score model, (5) rubric-based performance level (growth) model, and (6) simple
criterion (percent passing) model. The study sample included fourth to sixth grade
teachers employed in a large, suburban school district who taught the same sets of
students, at the same time, and for whom a consistent set of achievement measures and
background variables were available. Findings indicate that ratings significantly and
substantively differed depending upon the methodological approach used. Findings,
accordingly, bring into question the validity of the inferences based on such estimates,
especially when high-stakes decisions are made about teachers as based on estimates
measured via different, albeit popular methods across different school districts and states.
Keywords Teacher accountability . Teacher effectiveness . Teacher evaluation . Teacher

quality . Validity . Value-added models
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11092-018-

9283-7) contains supplementary material, which is available to authorized users.
* Audrey Amrein-Beardsley
audrey.beardsley@asu.edu
Edward Sloat
esloat@asu.edu
Jessica Holloway
jessica.holloway@deakin.edu.au
1
Mary Lou Fulton Teachers College, Arizona State University, PO Box 871811, Tempe,
AZ 85287-1811, USA
2
Research for Educational Impact (REDI) Centre, Deakin University, Melbourne Burwood Campus,
221 Burwood Highways, Burwood, VIC 3125, Australia
1 Introduction
The past several decades have brought about a massive increase in test-based account-
ability, where a Btrust in numbers^ (Ozga 2016, p. 69; see also Porter 1996) has
globally dominated education policy and practice discourses (Rizvi and Lingard
2010; Verger and Parcerisa 2017). International large-scale achievement tests, such as
the Organization for Economic Co-operation and Development’s (OECD) Programme
for International Student Assessment (PISA) test, as well as national assessments (e.g.,
Australia’s National Assessment Program – Numeracy and Literacy [NAPLAN]), have
increasingly enabled leaders of education systems to use student test scores for
measuring school and teacher quality (Lingard et al. 2013).
For example, according to an analysis of the 2013 Teaching and Learning Interna-
tional Survey (TALIS) data, which were collected during a PISA administration at the
time, 97% of teachers reported that student test scores were, in some way, used in their
appraisals (Smith & Kubacka, 2017). Although it is increasingly common for schools to
consider student test scores when making curriculum and personnel decisions (Smith &
Kubacka, 2017), however, most countries have rejected the explicit use of these scores
for evaluating individual teachers (Sørensen 2016). One notable exception is in the
USA, where many teachers throughout the country continue to be subjected to evalu-
ations that explicitly incorporate some form of student growth measure, usually based on
standardized achievement tests (Collins and Amrein-Beardsley 2014; Close et al. 2018).
Consistent with global campaigns that espouse data-driven practice, higher stan-
dards, and competition as driving mechanisms for procuring educational improvement
(Close et al. 2018; Rizvi and Lingard 2010; Lingard et al. 2013), the USA has, in many
ways, led the charge with test-based accountability, especially as it pertains to teacher
accountability.
Although most countries have exercised some modicum of caution in this area
(Sørensen 2016), the USA has encouraged the use of objective measures to assess
teacher quality in comparatively extreme ways. This is typically done using students’
performance on large-scale standardized achievement tests over time using growth or
value-added models (VAMs1) (hereafter referred to as VAMs in their more generalized
forms2). Common arguments for doing so are that students will take their learning more
1
VAMs are designed to isolate and measure teachers’ alleged contributions to student achievement on large-
scale standardized achievement tests as groups of students move from one grade level to the next. VAMs are,
accordingly, used to help objectively compute the differences between students’ composite test scores from
year-to-year, with value-added being calculated as the deviations between predicted and actual growth
(including random and systematic error). Differences in growth are to be compared to Bsimilar^ coefficients
of Bsimilar^ teachers in Bsimilar^ districts at Bsimilar^ times, after which teachers are positioned into their
respective and descriptive categories of effectiveness (e.g., highly effective, effective, ineffective, highly
ineffective).
2
The main differences between VAMs and growth models are how precisely estimates are made and whether
control variables are included. Different than the typical VAM, for example, the SGP model is more simply
intended to measure the growth of similarly matched students to make relativistic comparisons about student
growth over time, without any additional statistical controls (e.g., for student background variables). Students
are, rather, directly and deliberately measured against or in reference to the growth levels of their peers, which
de facto controls for these other variables. Thereafter, determinations are made in terms of whether students
increase, maintain, or decrease in growth percentile rankings as compared to their academically similar peers.
Accordingly, researchers refer to both models as generalized VAMs throughout the rest of this manuscript
unless distinctions between growth models and VAMs are needed or required.
seriously, teachers will improve their quality of instruction, and teacher educators will
transform their BMickey Mouse^ (i.e., cartoonist or not of high quality and too easy)
teacher preparation courses, experiences, and requirements (Duncan 2009) as a result.
Thereafter, students will learn more, student achievement will improve, and the US’s
global preeminence will be reclaimed, so the logic goes (see, for example, Bill and
Melinda Gates Foundation 2013; Doherty and Jacobs 2015; Duncan 2011; Rhee 2011;
Weisberg et al. 2009).
This rationality, which undergirds much of the global education discourse, has been
extensively critiqued, however, philosophically, pragmatically, and empirically (Lingard
et al. 2013; Mathis 2011; Nichols and Berliner 2007; Grek and Ozga 2010; Timar and
Maxwell-Jolly 2012; see also Denby 2012, Mathews 2013; Pauken 2013). Yet, states
throughout the USA continue to incentivize the associated practices via related educa-
tional policies (Berliner 2018; Amrein-Beardsley and Holloway 2017; Close et al. 2018).
Accordingly, researchers in this paper focus specifically on the USA and its com-
plicated engagement with VAMs and VAM-based teacher evaluation systems. They use
the state of Arizona as a case to investigate recent changes to VAM-based policy that
shift control of teacher evaluation from the state to local education authorities. Like-
wise, they question the extent to which increased decentralized authorities pose chal-
lenges to the putative objectivity of VAM-based teacher evaluation systems. Via the
following section they provide an overview of the study’s context, including the VAM
policy in Arizona and of specific interest in this study.
2 Study setting
For approximately 35 years, educational policy in the USA can be defined by a testing
culture (Smith 2016) that relies extensively on testing and data as means for measuring,
evaluating, and comparing students, teachers, and schools (see also Hursh 2007; Close
et al. 2018). More specifically, a series of federal legislative and incentive programs
helped to initiate various forms of test-based accountability throughout the nation, such
as the passage of the No Child Left Behind (NCLB) Act of 2001 that focused on
school- and student-level accountability, and the Race to the Top (RttT) Act of 2011
and the NCLB waivers (which excused states for not meeting NCLB’s 100% student
proficiency goals by 2014) that shifted the accountability focus from students and
schools to the teachers. As part of RttT and NCLB waivers, most states subsequently
developed teacher evaluation systems that were, in large part, reliant on student test
scores for measuring teacher quality.
While the recent passage of the US’s Every Student Succeeds Act (ESSA 2016) has
helped to curb such accountability and reform efforts, particularly at the teacher
accountability level (Close et al. 2018), ESSA is still strongly exciting the same Btrust
in numbers^ (Ozga 2016, p. 69; see also Porter 1996) logic by continuing to encourage
states to hold teachers accountable for measurably and effectively teaching their
students. Consequently, since the passage of ESSA, some states have revised their
stronger teacher-level accountability policies (e.g., Alabama, Arizona, Georgia, Hawaii,
Louisiana, Oklahoma), although others (e.g., Colorado, Florida, Massachusetts, New
Mexico, New York, North Carolina, South Carolina, Tennessee) have not (see also
Felton 2016).
In Arizona, and in response to the RttT and related NCLB waivers, the state
legislature passed into law Senate Bill 1040 (SB1040) in the spring of 2010 (Arizona
Revised Statutes §15-203 (A) (38)). This legislation requires all public-school districts
to evaluate teacher effectiveness using objective quantitative measures, with 33 to 50%
of a teacher’s evaluation rating to be based on the value-added of students in his/her
classroom over time. SB1040 did not, however, mandate a specific methodology or set
of measures for measuring teachers’ value-added. Rather, SB1040 directed the state’s
board of education to develop a general framework under which districts would be free
to construct and implement their own value-added metrics, as long as the measures
used yielded evidence of reliability and validity and alignment with the state’s curric-
ular standards and objectives.
Because current estimates suggest that approximately 70% of all teachers in any
state (including Arizona) do not have the test-based measures required to measure
teacher-level value-added (Collins 2014; Gabriel and Lester 2013; Harris 2011), the
state’s evaluation framework also set forth two classifications of teachers based on data
availability. For Group A teachers, large-scale student achievement measures were
available, while this was not the case for Group B teachers; hence, only Group A
teachers (i.e., approximately 30% of the state’s teachers) were value-added eligible.
Again, however, a common method for calculating teacher value-added estimates for
Group A teachers was not (and has not been) specified by the state. While the state has
endorsed the student growth percentile (SGP) model (Betebenner 2009, 2011)—a
growth percentile quantile regression model oft-considered the most popular general-
ized VAM in the USA3—districts throughout Arizona are otherwise on their own.
Related, the state interpreted the passage of ESSA ( 2016) as freedom to relax the
state’s regulatory requirements more. In April of 2016, the Arizona State Board of
Education (ASBE) released its final revisions to SB1040 relaxing state-level re-
quirements further, for example, by also endorsing districts’ use of student learning
objectives (SLOs) for both Group A and Group B teachers and by allowing districts
to determine the relative weights to be assigned to all of the teacher evaluation
indicators adopted, implemented, and used. Indeed, in the USA and post the
passage of ESSA ( 2016), an SLO is being increasingly considered as a student
growth-in-achievement measure that includes a student-level learning goal that is
matched with a student-level achievement measure. SLOs, accordingly, are being
used to track students’ progress toward whatever goals might be set and measured,
as well as the primary student growth measure for teachers who are not value-added
eligible given their students do not have large-scale, reliable, and valid achievement
data to assess their value-added (i.e., using a VAM).
Notwithstanding, while the ASBE revisions still prioritize reliability and validity for
all measures used, they still come with little support or guidance otherwise. Arguably,
this raises equity questions regarding the uniformity and comparability of teacher
effectiveness estimates across districts, within Arizona and likely other states with
similar policy-level approaches (e.g., Alabama, Georgia, Hawaii, Louisiana, Oklaho-
ma, Texas), as well as parity questions about the validity of inferences to be made
across districts when high-stakes decisions and consequences are attached, also to be
3
The SGP model is also used or endorsed statewide in the states of Colorado, Hawaii, Indiana, Massachusetts,
Mississippi, Nevada, New Jersey, New York, Rhode Island, Virginia, and West Virginia (Collins and Amrein-
Beardsley 2014).
determined at the district-level (e.g., allocation of performance pay, teacher promotion,

tenure removal, teacher termination). Of most concern here, though, is when an
effective teacher in one district is classified at a distinctly different level of effectiveness
than an equally effective teacher in a neighboring, taking a (sometimes entirely)
different approach to measuring teacher value-added. A teacher who by all accounts
is an effective teacher, so the logic goes, should not be rewarded or penalized by the
simple choice of the district in which (s)he decides to teach or by the methods
employed to measure effectiveness as per the district in which (s)he teaches.
Related, it is reasonable to question whether, indeed, district personnel in certain
districts (e.g., Arizona’s Native-Indian reservations and other rural schools) have access
to the hard/software but also the technical training in statistical methods, data analysis,
and data management needed to support the required modeling of teachers’ effects.
Conceivably, a considerable proportion of Arizona (and other similar states) school
districts may be even more ill prepared to meet the requirements of such policies as
now imposed, which may be antithetically but literally an indictment of the state’s
dispensation of centralized control.
3 Purpose of the study
Given the profound influence that this faith in objectivity (Porter 1996, p. 8) logic has
had on education systems globally (see also Grek and Ozga 2010), researchers via this
study demonstrate that objectivity is contingent upon the contextual conditions that
shape any such policy’s development, implementation, and enactment (Ball 2012).
Hence, it was the purpose of the study to examine how newfound local liberties around
teacher evaluation might play out in practice, starting with the methods used to measure
any generalized form of teacher value-added.
Since Arizona passed the onus of responsibility for doing this down to the district-
level, as other states have also done, researchers more specifically explored what these
liberties might mean in terms of the results and inferences to be derived via said
estimates. Researchers studied this within one, large, suburban school district, with
implications for the other 230 school districts also situated within state borders, and with
implications for other districts and states beyond that. Of interest was not to resolve
whether centralized or decentralized control is better or to debate the strengths and
weaknesses of either legislative structure; rather, of interest was to investigate the extent
to which the state’s regard for local control might actually compromise the standards of
reliability and validity for which Arizona legislators also called (i.e., via SB1040).
With principal emphases on validity, researchers compared the concordance of
teacher-level effectiveness ratings derived via six generalized VAM approaches, all of
which are models of conditional achievement including a (1) growth percentile quantile
regression model (i.e., the SGP model; Betebenner 2009, 2011), (2) value-added
(residual) linear regression model (VALRM; McCaffrey et al. 2003), (3) value-added
(residual) hierarchical linear model (VAHLM; Raudenbush and Bryk 2002), (4) simple
difference (gain) score model, (5) rubric-based performance level (growth) model, and
(6) simple criterion (percent passing) model. Researchers included the latter three
simpler, albeit common models as likely used within and across districts with less
advanced data management and analytical tools.
4 Literature review
4.1 Validity
Validity is positioned here and elsewhere (e.g., Standards for Educational and Psy-
chological Testing advanced by the American Educational Research Association
(AERA), American Psychological Association (APA), and National Council on Mea-
surement in Education (NCME 2014) as the essential aspect of any measurement
system, as validity captures Bthe degree to which empirical evidence and theoretical
rationales support the adequacy and appropriateness of interpretations^ to be made
(Messick 1995, p. 741; see also Cronbach and Meehl 1955; Kane, 2013; Messick 1975,
1980). As per Messick (1989):
[A] unified view of [construct] validity is required that comprehends both the
scientific and the ethical underpinnings of test interpretation and use. This unified
concept of validity integrates considerations of content, criteria [emphases
added], and consequences into a construct framework for testing rational hypoth-
eses about theoretically relevant relationships, including those of an applied as
well as of a scientific nature. (p. 5; see also Cronbach and Meehl 1955; Kane,
2013; Messick 1995)
Hence, researchers present their interpretation of the current research as per this
definition of validity, with emphases on criteria/concurrent-related evidences of specific
interest in this study.
4.2 Criterion-related evidence of validity
As per Messick (1989), Bcriterion-related validity is based on the degree of relationship

between the test scores and [similar or other] criterion scores^ (p. 7). Ideally, to
establish criterion-related evidence of validity, VAM estimates should be highly corre-
lated with (1) similar criteria, derived via similar albeit in this case generalized VAMs
meant to serve the same purpose (i.e., to capture and represent teacher effectiveness),
and (2) other criteria, derived via other indicators of effective teaching (e.g., supervi-
sors’ observational ratings, student or parent survey-based evaluations of teachers), also
meant to serve the same purpose (i.e., to capture and represent teacher effectiveness).
All of these indicators, in short, should point in the same direction if to help yield valid
interpretations about teachers and their effectiveness. While criterion-related evidence
of validity is typically comprised of two types: concurrent-related evidence of validity
when data are collected at the same time and predictive-related evidence of validity
when criterion are collected one in advance of the other, the focus herein is on the
former—concurrent-related evidence of validity.
4.3 Inter-indicator correlations
First, in terms of concurrent-related evidence of validity, current research evidence

suggests that VAM estimates of teacher effectiveness do not strongly correlate with the
other measures typically used at the same time to measure the same teachers’ effects.
More specifically, the correlations being observed among mathematics and English/
language arts (ELA) teachers’ VAM-based estimates and either their observational
scores or student surveys of teacher quality are low to moderate in magnitude (e.g.,
0.2 ≤ r ≤ 0.5; see, for example, American Statistical Association (ASA) 2014; Bill and
Melinda Gates Foundation 2013; Curtis 2011; Graue et al. 2013; Harris 2011; Hill et al.
2011; Jacob and Lefgren 2005; Kimball et al. 2004; Kersting et al. 2013; Kyriakides
2005; Milanowski et al. 2004; Loeb et al. 2015; Nye et al. 2004; Polikoff and Porter
2014; Rothstein and Mathis 2013). While some argue that the Bmore subjective^
measures (e.g., supervisors’ observational scores, student or parent surveys) are at
fault, others argue that all of the measures, including VAM estimates, are at fault for
the low correlations observed, because all of the measures typically used to examine
teacher effects are limited and insufficient.
All of this also assumes that that there is such a thing as a general teaching
effectiveness construct that can be defined, observed, and captured (Berliner 2014;
Bill and Melinda Gates Foundation 2013; Braun et al. 2012; Chin and Goldhaber 2015;
Grossman et al. 2014; Harris 2011; Hill et al. 2011; Kennedy 2010; Newton et al. 2010;
Rothstein and Mathis 2013).
4.4 Inter-model correlations, holding the students and the timing of the tests
constant
Second, of concern, is the extent to which teachers’ value-added estimates derived from
similar yet different tests correlate when given to the same students at the same time.
Corcoran et al. (2011), for example, compared value-added estimates taken from both
high- and low-stakes tests given to the same students at the same time (i.e., holding the
students and the time of administration constant) and found low to modest correlations
between estimates used within mathematics (r = 0.59) and ELA (r = 0.50). Put simply,
they found that teachers of the same students who took similar tests at the same time
posted quite different scores given the fact that the tests used to measure teacher-level
value-added were different.
Within one of the initial Bill & Melinda Gates Foundation’s Measures of Effective
Teaching (MET) study reports ( 2010), researchers found even lower correlations
between similar yet different tests in mathematics (r = 0.22) and ELA (r = 0.37). In
perhaps one of the most well-known studies of this type, Papay (2010) also found that
such estimates, while all positive and statistically significant, widely ranged across tests
with all correlations small to moderate in size, or below r = 0.50 except in three
instances. His findings held true when the same students were tested at the same time
and when the same test developers developed the tests administered (see also
Lockwood et al. 2007).
4.5 Inter-model correlations, holding the students and the tests constant
Third, and as directly relevant to this particular study, is that when holding the students
and tests constant, the estimates derived via different VAMs and growth models (e.g.,
the SGP model) seem to be, for the most part, strongly associated with one another
across VAMs and subject areas (e.g., 0.50 ≤ r ≤ 0.90). This means that while the choice
of the test can seriously diminish validity, as discussed prior, the actual model that is
used to estimate value-added, as long as it is sophisticated and based on the same test
data, may not matter much at all. This makes arguments and declarations about which
VAM or growth model is Bthe best one^ more trivial than often assumed, as estimates
derived via all sophisticated VAMs (including the SGP model) seem to be correlating
well, more or less, with other estimates.
Newton et al. (2010), for example, evaluated the stability of teacher effectiveness
rankings across five VAM model specifications, holding the students and the mathe-
matics and ELA tests constant. They also let the choice of modeling frameworks to be
influenced by Bthe practical limitations [faced by] many district and state systems^ (p.
8), as researchers did here. They examined the teacher rankings for concordance using
Spearman (rho) rank and interclass correlation coefficients (ICC), with results indicat-
ing that rankings were also positively correlated but also varied substantially across
statistical models (0.76 ≤ ρ ≤ 0.95).
Johnson et al. (2013) found that different VAMs produced similar results, almost
regardless of specifications, noting that teacher estimates were highly correlated
across model specifications, ranging from r = 0.90 to 0.99. Goldhaber et al. (2014)
noted that their Bfindings [were] consistent with research that finds models includ-
ing student background and classroom characteristics are highly correlated with
simpler specifications that only include a single-subject lagged test score^ (p. 4);
hence, their Bfindings show[ed] that models…broadly speaking, agree[d] with one
another^ (see also Ballou et al. 2004; Briggs and Betebenner 2009; Glazerman and
Potamites 2011; Harris and Sass 2006; Hill et al. 2011; Lockwood et al. 2007; Schafer
et al. 2012; Tekwe et al. 2004).
However, if one is to criticize all such models as similar versions of the others, with
minor-to-major methodological deviations (e.g., whether statistical covariates are used
to counter the effects of student-level demographics), perhaps this is not surprising,
whereby using the same test input from the same students will more or less always
produce similar (albeit still perhaps invalid) output. In other words, the extent to which
model output correlate with and across one another might be expected if a Bgarbage in,
garbage out^ euphemism is invoked (i.e., input of incorrect or poor quality for certain
purposes will produce consistent though invalid output; Banchero and Kesmodel 2011;
Gabriel and Lester 2013; Harris 2011). Hence, while this is certainly one source of
validity evidence, that model estimates are similar across sophisticated VAMs, this is
not the only evidence needed to substantiate the validity of the inferences drawn.
In addition, and pertinent to this study, none of the researchers in these prior studies
(besides Newton et al. 2010) took into consideration what these model choices might
actually mean in actual practice. Earlier, researchers noted that such high correlations
were observed across and almost regardless of the model used as long as it was
sophisticated. What these same researchers did not take into consideration, however,
is what might happen to these correlations, in practice, when districts are free to choose
less sophisticated models, again, in the name of increased local control, and in terms of
policy compliance. Likewise, doing this is clearly no trivial undertaking; hence, to
understand how these correlations might vary beyond the estimates cited prior is
also critical to understand given this state’s (and likely other states’) reactions to ESSA
(e.g., in Alabama, Georgia, Hawaii, Louisiana, Oklahoma, Texas). Accordingly, re-
searchers extend these inquiries by examining alternative specifications within and
across more generalized and common VAMs. Researchers’ primary focus here, then,
was on the contexts and conditions that might yield relatively larger issues with
discordance than those discussed prior.
5 Methods
Again, researchers compared the concordance of teacher-level effectiveness ratings

derived via six common generalized VAM approaches including a (1) SGP model, (2)
VALRM, (3) VAHLM, (4) simple difference (gain) score model, (5) rubric-based
performance level (growth) model, and (6) simple criterion (percent passing) model.
For each approach, researchers used the distribution of aggregated teacher-level
estimates by subject area to rank teacher effects and then assign them effectiveness
ratings. Thereafter, researchers statistically evaluated the level of agreement be-
tween and among ratings to examine concordance, with concordance statistically
approximated by the extent to which similar results and conclusions were drawn,
via these independent methods with common purpose. The overall intent was to
examine what impact the choice of the methods implemented, as locally defined,
would have on the inferential and potentially consequential judgments of effective-
ness made.
The primary research question researchers investigated was to what extent teacher-
level ratings significantly or substantively differed depending upon the methodological
approaches used, with concordance yielding evidence of criterion-related evidence
of validity and a lack of concordance the inverse, while also bringing into question
the validity of the inferences based on such estimates especially when high-stakes
decisions are to be attached to such estimates. Researchers defined concurrent
concordance via statistical approximations of the extent to which similar results
for the same teachers at the same time were drawn via independent, common, and
more generalized VAMs.
5.1 Setting
5.1.1 District demographics
The data researchers accessed and used for this study came from a moderately large,
middle income, residential suburb northwest of the greater Phoenix metropolitan area.
The district enrolls approximately 26,000 K-12 students across 20 elementary and four
high schools and employs approximately 1230 classroom teachers. Roughly 4% of
students are classified as English language learners (ELLs), 12% receive some type of
special education services, and 45% participate in the National School Lunch Program
(NSLP). Eleven of the 20 elementary schools qualify for Title I funding based on local
community poverty rates.
5.1.2 District academic performance
District accountability measures are based primarily on the state’s former standard-
ized achievement test, the Arizona Instrument to Measure Standards (AIMS). Given
annually in April, the AIMS test measured students’ mastery of specific learning
objectives on the state’s adopted curriculum. In 2016, 34% of the district’s students
were proficient on the mathematics test and 34% were proficient on the AIMS
reading test.
5.2 Sample
Researchers obtained teacher-classroom assignments from the district’s student course

management system and combined this information by linking student identification
numbers, for Group A teachers teaching more than 10 students to minimize error and
variability inherent to small class sizes. Researchers also purposefully focused on self-
contained regular education teachers/classrooms to control for influences of special
instructional settings. All included teachers instructed state-prescribe curriculums in
language arts and mathematics which align directly to the required end-of-year assess-
ments. This excluded many special area teachers (e.g., art, music, physical education,
special education) which are also assessed under the state’s teacher evaluation policy
mandate. Researchers ultimately included and analyzed a final sample including 71
teachers’ grade 4 classrooms (maximum N = 1791 students4), 75 teachers’ grade 5
classrooms (maximum N = 1926 students), and 69 teachers’ grade 6 classrooms (max-
imum N = 1779 students).
5.3 Data
5.3.1 Achievement variables
The data researchers used in this study were derived via the aforementioned AIMS
tests, and they consisted of cross-sectional scale scores (SSs) covering springs 2010 and
2011 (i.e., AIMS Math SS 2010, Math SS 2011, and AIMS Read SS 2010, Read SS
2011). In general (e.g., for the linear regression and hierarchical linear models),
researchers expressed these achievement data in terms of SSs obtained from the Item
Response Theory (IRT) procedures used by the testing company to score the tests
(Arizona Department of Education (ADE) 2011; ADE 2009). AIMS scale scores are
also vertically equated within subjects across grade levels (ADE 2011). Researchers
also restricted achievement variables to these outcomes because they represented the
required minimum established by SB1040 and are uniformly available to all districts in
the state.
5.3.2 Covariates
Researchers used categorical data reflecting students’ special program membership

(e.g., special education (SPED), ELL, and gifted (Gifted)), students’ demographics
(e.g., NSLP status (Lunch)), and students’ primary home language (PHL)) as covariate
variables, when and as needed (e.g., for linear and hierarchical value-added model
4
The exact number of students covered by the classroom aggregations differs between the analytic methods.
For example, regression techniques use list wise deletion of cases if one or more of the explanatory variables
are missing, while non-regression techniques only require the presence of two achievement scores in the
calculations.
specifications), along with a continuous variable accounting for the number of days
each student was enrolled during the school year (Days Enrolled).
5.3.3 Student growth percentiles
As per Arizona’s accountability framework, the ADE calculated the SGPs researchers
used in this study using quantile regression methods (i.e., SGP Math 2011 and SGP
Read 2011). Researchers used the SGP derivatives to express the change in a student’s
achievement score relative to all other students starting at similar achievement levels on
the pretest occasion(s) (i.e., one prior score on AIMS mathematics and reading for all
students). Researchers specify all of the above descriptive statistics in Table 1.
5.4 Modeling approaches
For each modeling method, the process that researchers used for arriving at teachers’
effectiveness ratings was as follows: (1) researchers obtained outcome measures for
each student by subject by grade; (2) they aggregated those outcomes by classroom; (3)
and then transformed the classroom level distributions onto a common percentile scale;
after which (4) they assigned an overall effectiveness rating based on each classroom’s
generalized location within a range of percentile values. Researchers computed class-
room medians5 for the regression (linear and HLM), SGP, and difference score models
while the percent of students meeting criteria they derived from the rubric-based
performance level and passing rate approaches. Researchers assigned the final
location-based classroom effectiveness ratings based on the following rule: percentiles
1-to-29 = 1 (low effectiveness), percentiles 30-to-69 = 2 (moderate effectiveness), and
percentiles of 70-to-99 = 3 (high effectiveness). A summary of each modeling approach
is provided next (Table 2).
5.4.1 Student growth percentile model
Researchers obtained the ADE-generated SGPs, after which they computed median
values for students who were resident in each teacher’s classroom. They then trans-
formed the distribution of classroom median values into percentile ranks (i.e., median
growth percentiles (MGPs)), and they assigned rubric-based effectiveness ratings rang-
ing on the same one-to-three scale as based on their generalized locations. Table 3
reports the distributional statistics for MGPs across classrooms by grade and subject.
5.4.2 Value-added linear regression model
Researchers estimated identically specified regression models for each grade and subject
area. The dependent variable for each model was the 2011 subject scale score (Math or
Read SS 2011). Researchers also incorporated the same set of aforementioned covariates
within each grade-subject model as well as a constant term across models. Researcher
examination of residual structures indicated no significant problems with underlying
5
With small enrollments, averaging residual growth scores risk skewing the class aggregate measures.
Accordingly, researchers used medians as the class growth measure for this reason.
Table 1 Descriptive statistics for grades 4, 5, and 6
Variables Grade 4 (n = 1568) Grade 5 (n = 1641) Grade 6 (n = 1550)
Mean Median SD Mean Median SD Mean Median SD
SGP Math 2011 52.96 54.00 29.114 50.80 51.00 29.210 54.82 57.00 28.205
SGP Read 2011 48.37 48.50 28.667 48.99 49.00 29.378 50.98 52.00 28.736
Math SS 2011 396.68 400.00 50.655 407.43 408.00 48.829 426.36 424.00 53.093
Read SS 2011 487.72 488.00 48.123 506.39 510.00 46.755 520.52 521.00 40.308
Math SS 2010 374.42 376.00 48.424 392.91 394.00 45.861 406.12 408.00 47.852
Read SS 2010 466.95 469.00 49.026 486.98 488.00 49.643 503.17 506.00 43.776
Days Enrolled 278.68 283.00 24.298 278.39 283.00 24.949 277.66 283.00 27.051
Gifted 0.04 0.00 0.190 0.06 0.00 0.243 0.06 0.00 0.238
SPED 0.15 0.00 0.355 0.14 0.00 0.350 0.12 0.00 0.327
Lunch 0.47 0.00 0.499 0.51 1.00 0.500 0.49 0.00 0.500
ELL 0.21 0.00 0.408 0.20 0.00 0.402 0.19 0.00 0.394
PHL 0.17 0.00 0.373 0.16 0.00 0.364 0.15 0.00 0.352
model assumptions (Stevens 1996), and all models adhered to standard assumptions
including independent, identically distributed, error terms ~ N(0, σ2) where covariates
represent a fixed set of non-stochastic right hand side variables (Stevens 1996; Ferguson
and Takane 1989; Cohen and Cohen 1983).6 Thereafter, researchers computed class-
room level measures by finding the median residual value across all students within
teachers’ classrooms. Researchers transformed this vector of values into percentiles after
which they assigned the same relative effectiveness ratings. A summary of VALRM
parameter estimates and model results by grade and subject is provided in Table 4.
5.4.3 Value-added hierarchical linear model
Researchers also used HLM methods to generate student outcome residuals for math-
ematics and reading for the same grade levels. Researchers used an HLM approach
with a two-level structure (students nested within schools). Level 1 (students) reflected
the same fully specified theoretical covariate model used in the VALRM approach, and
level 2 (schools) incorporated adjustments on the constant term for the proportion of
NSLP students (students receiving free or reduced priced lunch) and the average
school-wide prior year achievement score. This two-level HLM structure reflects a
basic random intercepts model where the slope effects remained fixed and only the
intercept term was randomized (Raudenbush and Bryk 2002). The exact structure of the
VAHLM including covariates is provided below.
6
Researchers’ review of right hand side correlations and model diagnostics suggested multicollinearity among
the ELL, PHL, and Lunch variables, although researchers placed no burden of precision or interpretation on
the estimated parameters of the individual predictor variables, also noting that the use of collinear predictors
did not impact overall model performance (Johnston 1972). The outcome of the modeling approach, then, is
an estimate of residual achievement expressed in terms of the original scale scores. The model generates an
expected score for each student, and the difference between the actual and the expected outcome is the residual
value.
Table 2 Achievement and demographic variable correlations (Spearman’s rho)
Math SS 2011 Read SS 2011 Math SS 2010 Read SS 2010 SGP Math 2011 SGP Read 2011 Days enrolled Gifted SPED Lunch ELL PHL
Math SS 2011 1.00

Read SS 2011 0.79** 1.00
Math SS 2010 0.78** 0.73** 1.00
Read SS 2010 0.72** 0.81** 0.82** 1.00
SGP Math 2011 0.55** 0.31** − 0.04 0.11** 1.00

SGP Read 2011 0.28** 0.49** 0.06* − 0.06* 0.36** 1.00
Days Enrolled 0.10** 0.08** 0.05* 0.05* 0.06* 0.03 1.00
Gifted 0.25** 0.25** 0.27** 0.24** 0.07** 0.09** 0.05* 1.00
SPED − 0.35** − 0.35** − 0.36** − 0.39** − 0.13** − 0.04 − 0.04 − 0.05* 1.00
Lunch − 0.20** − 0.24** − 0.27** − 0.29** − 0.00 − 0.02 − 0.06* − 0.08** 0.12** 1.00
ELL − 0.11** − 0.17** − 0.18** − 0.22** − 0.00 − 0.01 0.05* − 0.05* − 0.04 0.24** 1.00
PHL − 0.10** − 0.17** − 0.17** − 0.20** − 0.00 − 0.01 0.05* − 0.06** − 0.04 0.23** 0.84** 1.00
**p < 0.01 level (two-tailed); *p < 0.05 level (two-tailed)

Table 3 Descriptive statistics for classroom median growth percentiles (MGPs)
Variables Grade 4 (n = 71 classrooms) Grade 5 (n = 75 classrooms) Grade 6 (n = 69 classrooms)
Mean Median SD Mean Median SD Mean Median SD
SGP Math 2011 55.75 56.00 18.07 51.78 53.00 16.36 56.80 56.00 15.85
(median)
SGP Read 2011 48.06 48.00 14.16 49.01 51.00 14.21 52.67 53.00 12.55
(median)
The Level 1 model is specified as follows:

Subject SS 2011ij ¼ β0 j þ β 1 j * DAYSENROLLEDij þ β2 j * GIFTEDij

þ β3 j * SPEDij þ β4 j * LUNCH ij þ β 5 j * ELLij

þ β6 j * PHLij þ β 7 j * Subject SS 2010ij þ rij
where Subject_SS_2011ij is the 2011 scale score for student i in school j for the subject
area tested, and Subject_SS_2010ij is the student’s scale score received in 2010 for the
same subject. The covariate terms in the model are the same as noted prior. The
intercept term β0j represents the expected 2011 scale score when all covariates are
assumed to be zero. The rij is the level 1 error term and is assumed to be well behaved
and normally distributed with mean of zero and variance σ2. That is, rij ~ N(0, σ2).
The Level 2 model is specified as follows:

β 0 j ¼ γ 00 þ γ 01 * LUNCHPCT j þ γ 02 * Subject SS 2010A j þ u0 j
β1 j ¼ γ 10
β2 j ¼ γ 20
β3 j ¼ γ 30
β4 j ¼ γ 40
β5 j ¼ γ 50
β6 j ¼ γ 60
β7 j ¼ γ 70
where LUNCHPCTj represents the percent of students in school j participating in NSLP

and Subject_SS_2010Aj represents school j’s average 2010 subject scale score. The u0j
term represents the random error associated with the school-level effects while γ00 is the
grand mean across all schools. By combining the two-level equations, the mixed model
then becomes:
MSS 2011ij ¼ γ 00 þ γ 01 *LUNCHPCT j þ γ 02 *MSS2010A j þ γ 10 *DAYSENROij
þ γ 20 *GIFTEDDU ij þ γ 30 *SPEDDUM ij þ γ 40 *LUNCHDUM ij

þ γ 50 *ELLDUM ij þ γ 60 *PHLDUM ij þ γ 70 *Subject SS 2010ij þ u0 j
þ rij
Table 4 Linear regression model for mathematics and reading
Mathematics Reading
(Dependent: Math SS 2011) (Dependent: Read SS 2011)
Variable B (t) B (t) B (t) B (t) B (t) B (t)

(Constant) 112.54 9.31 91.22 8.38 75.82 7.08 144.08 12.17 170.76 15.77 203.68 19.27
Gifted 15.10 3.33 15.49 5.02 11.90 3.57 24.30 6.12 5.72 1.99 9.03 3.30
SPED − 16.80 − 6.72 − 10.03 − 4.45 − 11.52 − 4.63 − 7.22 − 3.18 − 18.28 − 8.69 − 16.43 − 7.69
Lunch − 0.13 − 0.08 − 4.68 − 3.13 − 4.02 − 2.55 − 1.41 − 0.90 − 4.39 − 3.17 − 2.97 − 2.28
ELL − 2.52 − 0.68 − 0.03 − 0.01 .55 .17 − 0.51 − 0.15 − 1.52 − 0.54 3.26 1.20
PHL 2.49 .61 − 6.20 − 1.87 − 0.67 − 0.18 − 1.30 − 0.36 − 5.75 − 1.87 − 3.29 − 1.09
Days enrolled 0.05 1.37 0.05 1.86 0.01 0.39 0.02 0.70 0.04 1.53 − 0.03 − 1.20
Math_SS_2010 0.73 36.60 0.78 42.41 0.86 47.74 0.73 40.78 0.68 42.48 0.65 38.78
Adjusted R-square 0.59 0.65 0.69 0.64 0.67 0.63
SEM 32.26 28.73 29.41 28.83 26.71 24.43
F-Stat 329.12 443.24 500.99 404.17 484.86 381.96
F-Sig 0.00 0.00 0.00 0.00 0.00 0.00
No. of observations 1567 1641 1549 1570 1641 1549
The parameter estimates generated by the VAHLM model for grades 4, 5, and 6 in the
area of mathematics and reading are provided in the anonymous, online Appendix here
(see Tables 1X–12X). Both the conditional (fully specified) and unconditional model
estimates are provided so that appropriate variance reduction calculations may also be
reviewed. Thereafter, researchers computed classroom residual aggregates as medians
across students within a class and then transformed this vector of median classroom
residuals onto a percentile scale, after which they assigned relative effectiveness ratings
using the same defined effectiveness scale and rules.
5.4.4 Simple difference (gain) score model
The scales researchers used for the AIMS achievement tests were vertically scaled across
grades 3 through 8 in both mathematics and reading (ADE 2009, 2011). This scale is to
permit tracking of academic progress across continua as a student moves through different
grade level and content areas over time; hence, in many instances, district personnel use
the properties afforded to compute difference scores to estimate relative growth. To this
end, researchers also computed simple difference scores between 2010 and 2011 for all
students by subject and grade level. Thereafter, researchers derived classroom measures by
computing the median difference score across enrolled students. As with the other
measures, researchers also transformed the distribution of classroom medians to a percen-
tile scale to also assign effectiveness ratings. The descriptive statistics for the vector of
median classroom and teacher difference scores are illustrated in Table 5.
5.4.5 Performance level (growth) model
The AIMS test also applies performance standards to the scale scores attained by
students (ADE 2009, 2011). These performance standards apportion the range of scores
Table 5 Descriptive statistics:

Statistic Math SS Gain Read SS Gain
median difference scores
(classroom) by grade by subject
Grade 4
Count (classrooms) 71 71
Mean 24.69 20.97
Median 24.00 19.00
Std. deviation 15.80 10.14
Grade 5
Mean 13.81 20.91
Median 14.00 21.50
Grade 6
Mean 18.94 17.42
Median 19.00 16.00
into four descriptive performance categories: BFalls Far Below,^ BApproaches,^

BMeets,^ and BExceeds^ the state’s standards. One common method that district
personnel also utilize to monitor the relative growth of students over time is to chart
their performance level year-to-year. Students who maintain or improve their perfor-
mance ratings are commonly seen as making growth since they have been instructed on
curriculum associated with a higher grade level. As well, this approach is sometimes
used as a loose (non-technical) definition for Bone year’s growth^ signifying for a
teacher that a student has made progress from 1 year to the next but has not necessarily
improved his/her relative placement among his/her peers. This approach is also some-
times used to assign growth credit to teachers when district personnel analyze the
proportion of students making this type of growth to infer which teachers are eliciting
the strongest instructional impacts on performance over time. Hence, researchers
applied a similar set of growth rules to student-level AIMS performance levels for
spring 2010 and spring 2011 in mathematics and reading. The growth criterion matrix
is depicted in Table 6.
To assemble classroom level measures, researchers computed the percent of students
in each classroom receiving growth credit under this criterion. Researchers then
transformed this classroom vector of percentages to a percentile scale so that an overall
effectiveness rating could be assigned, again. The descriptive statistics for the percent
of students meeting the performance growth criterion within classrooms by grade and
subject is provided in Table 7.
5.4.6 Criterion (percent passing) measures of achievement
The sixth modeling approach allowed researchers to assess a student’s pass/fail status
on the most recent (2011) AIMS tests. Researchers included this approach to reflect the
most simplistic metric of student performance that a local district might (and might be
permitted to) use as a measure of teacher effectiveness. In this context, researchers
computed the proportion of students passing the 2011 AIMS tests per classroom.
Researchers then transformed the vector of classroom level proportions onto a percen-
tile scale to assign a teacher effectiveness rating following the same logic. Distribu-
tional statistics for measures across classrooms by grade and subject are illustrated in
Table 8.
Table 6 Growth criterion matrix: AIMS performance levels
AIMS 2011
AIMS 2010 Falls far below Approaches Meets Exceeds
FFB 0 1 1 1
Approach 0 1 1 1
Meets 0 0 1 1
Exceeds 0 0 0 1
The column and row headers indicate the four performance categories used with cell values reflecting the
intersection of performance in 2010 and 2011. A value of zero means a student did not get credit for making
growth while a value of 1 indicates the student made adequate growth

Statistic Math performance Reading performance
percent (performance level) growth growth
growth within classrooms
Grade 4 classrooms
Mean 0.36 0.18
Median 0.39 0.16
Grade 5 classrooms
Mean 0.28 0.18
Median 0.27 0.17
Grade 6 classrooms
Mean 0.17 0.17
Median 0.00 0.16
6 Findings
Again, the intent of this research was to examine the consistency of teacher effective-
ness ratings using alternative growth modeling approaches based on standardized
achievement measures in an applied setting of a local school district, with potential
generalization to other applied settings using similar approaches, also with varying

Statistics Passing rate— Passing rate—
percent passing 2011 AIMS by Math 2011 Reading 2011
grade and subject within
classrooms Grade 4
Mean 0.71 0.76
Median 0.74 0.81
Grade 5
Mean 0.70 0.79
Median 0.73 0.84
Grade 6
Mean 0.86 0.84
Median 1.00 0.87
levels of complexity (e.g., the other 230 school districts situated within state borders
and other states, also with local control provisions). To do this, researchers computed
Spearman (rho) rank correlation coefficients between all combinations of methods and
Kendall’s tau-c and Cohen’s kappa as measures of agreement between the SGP-based
ratings and each of the remaining five methods.
As also noted prior, researchers chose the SGP method as the comparative base
because it represented the state-generated measure, and it was also least arbitrary in
that, as also discussed prior, the estimates derived via the SGP model and different
VAMs (e.g., similar VALRMs and VAHLMs) seem to be, for the most part, strongly
associated with one another across subject areas (e.g., 0.50 ≤ r ≤ 0.90; Ballou et al.
2004; Briggs and Betebenner 2009; Glazerman and Potamites 2011; Goldhaber
et al. 2014; Harris and Sass 2006; Hill et al. 2011; Johnson et al. 2013; Lockwood
et al. 2007; Newton et al. 2010; Schafer et al. 2012; Tekwe et al. 2004). Hence, it
was against the SGP estimates that researchers assessed the other five approaches to
empirically represent that which local districts, with widely varying levels of expertise,
might adopt and implement in highly applied settings. Findings by examination are
presented next.
6.1 Rank correlations across classrooms
The Spearman (rho) rank correlation coefficient measures the degree of association
between two ordinal rank variables and is an appropriate statistic to use in this setting
because the measures being compared are teachers’ effectiveness ratings ranging from
1 (low effectiveness), 2 (moderate effectiveness), or 3 (high effectiveness). Rho values
range from − 1 to + 1 with the value itself reflecting the degree (magnitude) in the
association. Table 9 illustrates the rank correlations among each of the modeling
approaches by grade and subject area.
These data illustrate that the highest level of rating agreement existed between the
SGP and the VALRM with values ranging from r = 0.82 (grade 6 mathematics and
reading) to r = 0.92 (grade 4 mathematics and reading). This aligns with similar
research, as also mentioned prior (Ballou et al. 2004; Briggs and Betebenner 2009;
Glazerman and Potamites 2011; Goldhaber et al. 2014; Goldschmidt et al., 2012; Harris
and Sass 2006; Hill et al. 2011; Johnson et al. 2013; Lockwood et al. 2007; Newton
et al. 2010; Schafer et al. 2012; Tekwe et al. 2004).
The lowest correlations across the board were for the passing rate approach (i.e., a
simple criterion measure of the percent of teachers’ students passing the latest test) with
0.08 ≤ r ≤ 0.78 (as correlated with difference scores in grade 6 reading and with
performance level (growth) in grade 5 mathematics, respectively). The performance
level (growth) model generally showed low associations, as well. In three out of the six
cases, the difference score methods had larger associations to SGP than the VAHLM
method; however, it is clear from the range of values within and across grades and
subject areas that considerable (and statistically significant) variation exists in the
overall effectiveness ratings depending on the approach used.
Table 10 summarizes the Spearman (rho) rank correlation coefficients across
methods, again, using the SGP approach as a comparative base. This simplifies the
presentation in Table 9 by comparing the five locally derived methods to that of the
state’s (i.e., the SGP model).
Table 9 Spearman (rho) rank correlation coefficients by effectiveness rating method, grades 4–6 by mathematics and reading
SGP VALRM VAHLM Difference Performance Passing SGP VALRM VAHLM Difference Performance Passing
scores level rate scores level rate
Grade 4 Mathematics Grade 4 Reading

SGP 1.00 1.00
VALRM 0.92** 1.00 0.92** 1.00
VAHLM 0.66** 0.67** 1.00 0.78** 0.87** 1.00
Difference scores 0.91** 0.88** 0.55** 10.00 0.78** 0.77** 0.67** 1.00
Performance level 0.75** 0.82** 0.60** 0.70** 1.00 0.52** 0.57** 0.50** 0.45** 1.00
Passing rate 0.52** 0.59** 0.42** 0.43** 0.73** 10.00 0.61** 0.57** 0.39** 0.24* 0.47** 1.00
SGP 1.00 1.00
VALRM 0.86** 1.00 0.83** 1.00
VAHLM 0.71** 0.81** 1.00 0.78** 0.95** 1.00
Performance level 0.65** 0.65** 0.57** 0.64** 1.00 0.52** 0.41** 0.40** 0.28* 1.00
Passing rate 0.55** 0.49** 0.44** 0.45** 0.78** 1.00 0.44** 0.38** 0.33** 0.13 0.45** 1.00
SGP 1.00 1.00
VALRM 0.82** 1.00 0.82** 1.00
VAHLM 0.59** 0.83** 1.00 0.75** 0.97** 1.00
Performance level 0.71** 0.56** 0.48** 0.50** 1.00 0.53** 0.61** 0.59** 0.50** 1.00
Passing rate 0.52** 0.38** 0.30* 0.31** 0.70** 1.00 0.40** 0.43** 0.34** 0.08 0.19 1.00
Table 10 Spearman (rho) rank correlation coefficients with SGP by method as comparative base by grade and
subject area
Methodological Grade 4 Grade 4 Grade 5 Grade 5 Grade 6 Grade 6

approach Math Reading Math Reading Math Reading
VALRM 0.92a 0.92a 0.86a 0.83a 0.82a 0.82a

VAHLM 0.66a 0.78a 0.71a 0.78a 0.59a 0.75a
Difference scores 0.91a 0.78a 0.81a 0.73a 0.80a 0.67a
Performance level 0.75a 0.52a 0.65a 0.52a 0.71a 0.53a
Passing rate 0.52a 0.61a 0.55a 0.44a 0.52a 0.40a
a Correlation is significant at the 0.01 level (two-tailed)
As illustrated, the VALRM reports the strongest association to the SGP-based

classroom ratings, again, with all values above 0.80. Interestingly, the difference score
approach for grade 4 mathematics reports a correlation above 0.90 as well. All of the
statistics are significant (p < 0.01), although the magnitude of rater agreement across the
locally generated approaches to that of the state SGP-based data varies considerably
(from 0.40 for passing rate at grade 6 reading, to 0.92 for the VALRM approach in
grade 4 mathematics).
6.2 Rating agreement
Researchers also based the assignment of effectiveness ratings on the generalized rank
location of teacher-level ratings after conversion to a percentile scale. As per the
multiple applied and local settings in which this work is being considered, classification
into broad categories attempts to limit the effects of small variations in the ranks
between the different approaches. With this in mind, measuring the rating agreement
between methods is similar to measuring rater agreement in more general contexts such
as scoring reoccurring ratings of instructional practice.
Measures of rater agreement may be estimated using non-parametric methods such
as Kendall’s tau and Cohen’s kappa (Ferguson and Takane 1989; Freed et al. 1991;
Reynolds et al. 2009). For each, the hypothesis tested was whether ratings from two
matched methods were independent. The null hypothesis researchers posited was that
the ratings were independent and would vary randomly; that is, the two methods would
construct the same distribution of rating classifications and, consequently, be consid-
ered indifferent. Rejection of the null hypothesis would indicate that the matched
methods construct yielded different (non-random) classifications, suggesting that
choice of method indeed matters.
The Kendal tau-c measures the difference in concordance over discordant pairs of
ratings and is appropriate for contingency tables larger than 2 × 2 (Freed et al. 1991;
Ferguson and Takane 1989). It has a range of − 1 to + 1 with large positive values
denoting positive associations, large negative values denoting negative associations,
and values of zero denoting no association. For large counts (n > 10), Kendall’s tau is
approximately normal and may be utilized in z score form as a test statistic. A
statistically significant result rejects the null hypothesis, indicating that there is no
association (i.e., independence) between matched ratings.
Cohen’s kappa is a measure of interrater agreement between matched ordinal

rating scales (Reynolds et al. 2009). Its values generally range from 0 to + 1
(although it is possible to obtain negative values) and is calculated from the
observed and expected frequencies on the diagonal of a square contingency table.
Values of + 1 indicate perfect agreement, and lower values indicate less agreement.
With some subjectivity, values of kappa may be interpreted using the following
guidelines: negative values up to 0.20 as poor, 0.21 to 0.40 as fair, 0.41 to 0.60 as
moderate, 0.61 to 0.80 as substantial, and 0.81 to 1.00 as almost perfect agreement
(Landis and Koch 1977). Additionally, kappa values may be interpreted as a
proportion of expected agreement, and 1 − kappa is the proportion of expected
disagreement not due to chance.
Accordingly, researchers used six different approaches to compute effectiveness
ratings across classrooms in three grades by two subject areas. To examine all combi-
nations within one subject and one grade level required 15 matched comparisons.7
Because the process was replicated for two subject areas across three grade levels, a
total of 90 matched combinations were possible.8 However, researchers selected the
state-derived data used in the SGP model as a comparative base, again, against which to
examine the rating agreement of the remaining (local) approaches. This reduced the
number of comparisons to 30 and provided focus for the analysis.
Recall that for each approach, teachers were given a rating of 1 (low effectiveness), 2
(moderate effectiveness), or 3 (high effectiveness). Matching any two methods creates a
three-by-three contingency table where diagonal elements indicated agreement and off-
diagonal elements indicated non-agreement. As an example, the grade 4 mathematics
contingency table for crossing the SGP and VALRM approaches is provided in
Table 11.
Illustrated is that 71 teachers received an effectiveness rating on both the SGP and
VALRM approaches in grade 4 mathematics. In this case, assigned ratings matched for
61 (86%) of teachers and did not match for ten (14%) of teachers, and none of the
mismatched ratings differed by more than one level. The tau-c statistic for the data is
reported as 0.84 (n = 71, p < 0.00), rejecting the null hypothesis that the two methods
assign similar ratings. While the value reported for kappa (0.79, n = 71, p < 0.00)
indicates strong agreement (Landis and Koch 1977), the coefficient suggests that the
methods will not match approximately 21% of the time.
Researchers constructed similar contingency tables for each rating approach.
Table 12 reports the tau-c and kappa concordance statistics between SGP and the
remaining five rating approaches by grade and subject area. Included is the simple
percent disagreement (off-diagonal elements) occurring between the comparative rating
methods.
Illustrated is that some methods more closely corresponded with the ratings assigned
by the SGP model than others. However, the range of discordance across the methods is
substantial. Kappa values range from a low of 0.11 in grade 4 reading (performance
level) to 0.79 in grade 4 mathematics (VALRM). Measures of rating disagreement
7
The contingency table for one grade, one subject, contains 36 cells (6 × 6). Diagonal cells compare identical
methods and are therefore excluded. Off-diagonal cells are symmetric, leaving a total of 15 comparative
measures per grade per subject.
8
Fifteen per grade per subject by two subjects by three grades.
Table 11 Contingency table of effectiveness rating assignments for the SGP and VALRM methods (math-
ematics as an example)
VALRM Math
R1 R2 R3 Total
SGP Math R1 19 2 0 21
R2 2 24 3 29
R3 0 3 18 21
Total 21 29 21 71
Kendall’s tau-c = 0.84 (n = 71, p < 0.00); kappa = 0.79 (n = 71, p < 0.000)
R1 rating 1 (low effectiveness), R2 rating 2 (moderate effectiveness, R3 rating 3 (high effectiveness)
ranged from 18% in grade 4 mathematics (VALRM) to 59% in both grade 4 reading
(performance level) and grade 5 mathematics (passing rate). The tau-c measures were
highest in grade 4 mathematics (0.84, VALRM) and lowest for grade 6 reading (0.30,
performance level). Overall, these data indicate that teacher-level classifications greatly
deviate depending on the rating methodology used.
7 Conclusions
Analyses demonstrated that the teacher-level rating assignments may differ both
statistically and substantively depending on any approach or model adopted to
evaluate teachers’ impacts on their students’ growth in achievement over time. In
just this case, and using Arizona’s state-level SGP measures as a comparative based,
the proportion of discordant ratings ranged from 18 to 59% of teachers depending
on grade, subject area, and method used. This is critical in not only Arizona but also
in many other states (e.g., Alabama, Georgia, Hawaii, Louisiana, Oklahoma) in
which legislators are also embracing more local control post ESSA ( 2016).
Allowing districts to not only determine which model they might use to
evaluate teachers’ impacts on student achievement over time, should they decide
to continue to move forward in these regards, makes relative determinations of
teachers’ measurable effectiveness all the more arbitrary across varying contexts.
In other words, that where a teacher teaches might matter more than his/her
actual effectiveness is highly problematic, as also arbitrary. As per Ho (2009),
merely altering the assumptions and methods surrounding such policies and
policy implementations can fundamentally change accountability outcomes. This
in itself puts at risk the inferential validity of such teacher-level accountability
outcomes across districts and states that use different criteria for modeling and
measuring teacher effects today and post ESSA ( 2016). Models and model
specifications, indeed, matter; hence, attention must not only be paid to the
methodological approaches used but to the consumers of such information who
must also be critically aware of the inferential shortcomings any approach might
surmise. Related, nobody should be indifferent to the method chosen to infer
teacher quality.
Table 12 Grade 4 mathematics and reading Kendall tau-c and Cohen’s kappa measures of rating agreement
Model Kendall tau-c Kappa Rating disagreement

(no., percent)
Grade 4 Mathematics (n = 71)

VALRM 0.84*** 0.79*** 10 (14%)
VAHLM 0.62*** 0.44*** 26 (37%)
Difference 0.76*** 0.66*** 16 (23%)
Performance level 0.61*** 0.42*** 27 (38%)
Passing rate 0.42*** 0.25** 35 (49%)
Grade 4 Reading (n = 71)
VALRM 0.72*** 0.59*** 19 (27%)
VAHLM 0.58*** 0.47*** 25 (35%)
Difference 0.62*** 0.42*** 27 (38%)
Performance level 0.32*** 0.11* 42 (59%)
Passing rate 0.48*** 0.23** 36 (51%)
VALRM 0.63*** 0.50*** 25 (33%)
VAHLM 0.50*** 0.31*** 34 (45%)
Difference 0.63*** 0.50*** 25 (33%)
Passing rate 0.38*** 0.11* 44 (59%)
VALRM 0.71*** 0.56*** 22 (29%)
VAHLM 0.66*** 0.48*** 26 (35%)
Difference 0.60*** 0.46*** 27 (36%)
Performance level 0.41*** 0.24** 38 (51%)
Passing rate 0.31*** 0.16* 42 (56%)
VALRM 0.68*** 0.58*** 19 (28%)
VAHLM 0.42*** 0.30** 32 (46%)
Difference 0.71*** 0.61*** 18 (26%)
Passing rate 0.35*** 0.23** 35 (51%)
VALRM 0.55*** 0.36*** 29 (42%)
VAHLM 0.52*** 0.32*** 31 (45%)
Difference 0.59*** 0.54*** 21 (30%)
Performance level 0.30** 0.17** 38 (55%)
Passing rate 0.30** 0.21** 36 (52%)
***p < 0.001; **p < 0.05; *p > 0.05
On the flipside, it is not that one might or should use study results to suggest more
centralized control, either, but rather use results to demonstrate that just because District
A identifies Teacher B at a C level of effectiveness does not necessarily mean that the C
level of effectiveness is correct or that the same Teacher B would be rated similarly in
any other district. Any suppositions to the contrary are obtuse and misinformed as
based on evidence.
How to choose the best approach, however, is also unclear. Arguably, this choice
should be based in equal parts on the technical/inferential qualities of the analytic
method and the policy context under which such analytical activities are undertaken.
This might occur all the while acknowledging that public policy pressures may
motivate small districts to use data and methods that are readily available regardless
of the technical foundations of the approach, and this may have differential impacts on
teachers in such districts as compared to teachers’ colleagues elsewhere. Hence, issues
of inferential validity will certainly remain an issue of concern given the findings
reported herein, as also situated in the literature elsewhere.
Recall that Ho (2009) found that when examining growth criteria under a common
(federal) accountability environment, choice of performance levels (cut scores) and
time-horizons made a substantive difference in inferring instructional adequacy. In
contrast, Papay (2010) examined the variability of effectiveness ratings in the context
of a fixed modeling approach using differing achievement measures, indicating that the
choice of outcome measures directly impacts the rank ordering of teachers. Newton
et al. (2010) evidenced that when alternative models were applied to a fixed set of
achievement measures, dramatically different rankings also occurred. They extended
this analysis by also revealing instability over time using identical models and mea-
sures, showing variability of ratings across courses, and also documenting the influence
of student factors on outcomes.
The findings provided in this study align most closely with those of Newton et al.
(2010). However, researchers in this study also examined a broader range of ap-
proaches including complex growth models, linear regression and HLM approaches,
and more simplistic approaches including difference scores, movement across perfor-
mance levels, and status achievement measures. Overall, findings support the premise
that different teacher effectiveness ratings are constructed from different modeling
approaches in the context of a common fixed set of achievement measures. This finding
certainly has implications for others elsewhere.
Perhaps a more reflective analysis of these findings suggests that value-added,
growth models, and related achievement-based approaches for measuring teacher
quality are still incomplete. If longitudinal frameworks (same model, same mea-
sures, different students) are shown to be unstable, and if cross-section frameworks
(same students, same measures, different methodological approach) report discor-
dant ratings, and if convincing arguments can be made that linking quantitative
metrics to individual teachers fails to account for the rich learning environments
students are exposed to in public schools, then one might suggest that the class of
score-based accountability models are better conceptualized differently or consid-
ered at all. To do anything otherwise risks adopting policies perpetuating the
dangers associated with making teacher inferences that are systemically distorted
by local district’s evaluation system approaches.
This study also has implications for broader articulations of education that presume
that more data and increased accountability educational policies and initiatives will
offer much-desired solutions to help counter education inequity, as well as stagnation
(especially in the USA; see, for example, Bill and Melinda Gates Foundation 2013;
Doherty and Jacobs 2015; Duncan 2011; Rhee 2011; Weisberg et al. 2009). Accord-
ingly, as many nations continue to rely on data and accountability to steer educational
policy and practice (Lingard et al. 2013; Anagnostopoulos et al. 2013), there is an
imperative to offer specific counter-arguments that challenge the prevailing logic that
Bstrict quantification, through measurement, counting, and calculation, is among the
most credible strategies for rendering nature or society objective^ (Porter 1996, p. 74).
While quantification and statistics can provide valuable information about schools and
other social matters (Lingard 2011), it cannot be lost that objectivity is always context-
specific, as the present study illustrates.
While value-added modeling is often referred to as the most sophisticated means for
measuring the amount of influence a teacher has on his/her students’ achievement test
scores, and, perhaps that is true, as Lingard (2011) argued BThe knowledge we produce
is…partial, positioned and provisional with limitations when applied as an evidence
base to policy production^ (p. 358). This caveat is very important to keep in mind when
considering the implications of VAM-based policies and practices. While the data
produced by VAMs might be statistically sophisticated, contextual factors (e.g., type
of VAM, types of tests used, policy contexts, geographic locations) will always affect
how VAMs play out in practice. Likewise, this study serves as a cautionary tale for
educational systems and policymakers writ large, throughout the USA and abroad,
especially as they consider options for VAM-based policy or other test-based account-
ability policies and practices more generally.
8 Limitations
Again, via this study researchers examined the stability of multiple methods for
evaluating teachers’ instructional impacts on student achievement in the context of
assessing teachers’ instructional competency under loosely defined public policy
contexts. The six comparative analytic methods chosen ranged from the computation-
ally simplistic (i.e., percent of student passing the most recent end-of-year assessment)
to the technically sophisticated (i.e., estimating multi-level linear models). However,
the analytic methods that researchers selected herein were not intended to represent the
state of the art or best or worst approaches for estimating teachers’ instructional impacts
on their students’ achievement growth. Rather, researchers’ choices of models and
methods were driven by practical experience, knowledge, and recognition of what
many districts in the state of Arizona (and likely other states also at liberty to develop or
adopt state-endorsed or other teacher evaluation models) were utilizing when
attempting to comply with state-legislated policy mandates. Accordingly, the purpose
of this study was not to uncover which method was better or worse than the other, also
given this has been done before (see, for example, Braun 2005; Kupermintz 2003;
McCaffrey et al. 2004). The purpose was, however, to test the same dataset across
different (and common) model approaches to assess how results might change.
Related, researchers recognize that many alternative analytic models could have been
included such as latent growth models, two-level models with students nested into
classrooms, three-level models with students nested into classrooms and classrooms
nested within schools, and more complex linear models incorporating many different
student, teacher, and school covariate specifications. Again, however, the point of this
research was not to evaluate or compare the technical characteristics of alternative

modeling approaches, nor to advance particular approaches as being more technically
sound. Rather, it was to provide a generalized context to assess the stability of divergent
methods when holding students, teachers, and assessment measures constant. The models
chosen were merely representative of the types of choices local districts currently use or
may consider depending on their in-house level technical training, technology, human,
and other related resources. Researchers recognize that all of the six included modeling
approaches exhibit technical issues, some of which are more problematic than others.
In addition, researchers’ utilization of three stratified effectiveness classifications to
position teacher impacts is not intended to be definitive. Researchers also recognize that
different stratification schema could be employed to interpret the relative placements of
classroom growth estimates. However, researchers felt that segmenting the (percentile)
growth scale into three equal parts represented a very conservative approach to evalu-
ating the stability of the comparative outcomes. That is, substantive movement across a
limited number of boundaries represents larger incongruence between methods, sug-
gesting the influence of analytic method in determining teacher effectiveness outcomes.
That is, and as also noted prior, the validity problems left much to be desired when the
public policy requirements mandated a particular methodological and uniform approach
for all districts in a state (i.e., before the passage of ESSA 2016). But when the public
policy directive becomes even more obscure (i.e., post the passage of ESSA, 2016, yet
high-stakes consequential decisions are to still be attached to such metric-based out-
comes, the problems with validity (i.e., validity of inference) become substantively
amplified. At least in Arizona, a teacher’s performance label is functionally dependent
on the analytic approach used by the local district. This, as also evidenced herein,
presents a clear threat to validity, whereas a teacher at any given level of instructional
competency can get lucky or unlucky, simply by his or her place of employment.
References
American Educational Research Association (AERA), American Psychological Association (APA), &
National Council on Measurement in Education (NCME). (2014). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
American Statistical Association (ASA). (2014). ASA statement on using value-added models for educational
assessment. Alexandria, VA. Retrieved from http://www.amstat.org/policy/pdfs/asa_vam_statement.pdf.
Amrein-Beardsley, A., & Holloway, J. (2017). Value-added models for teacher evaluation and accountability:
Commonsense assumptions. Educational Policy, 1–27. https://doi.org/10.1177/0895904817719519.
Anagnostopoulos, D., Rutledge, S. A., & Jacobsen, R. (2013). The infrastructure of accountability: Data use
and the transformation of American education. Cambridge: Harvard Education Press.
Arizona Department of Education (ADE) (2009). AIMS math technical report 2009. Retrieved from
http://www.azed.gov/standards-development-assessment/files/2011/12/aimsmathfieldtesttechreport2009.
pdf.
Arizona Department of Education (ADE) (2011). AIMS 2011 technical report. Retrieved from http://www.
azed.gov/standards-development-assessment/files/2011/12/aims_tech_report_2011_final.pdf.
Ball, S. J. (2012). Politics and policy making in education: Explorations in sociology. London: Routledge.
Ballou, D., Sanders, W. L., & Wright, P. (2004). Controlling for student background in value-added
assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65. https://doi.
org/10.3102/10769986029001037.
Banchero, S. & Kesmodel, D. (2011). Teachers are put to the test: more states tie tenure, bonuses to new
formulas for measuring test scores. The Wall Street Journal. Retrieved from http://online.wsj.
com/article/SB10001424053111903895904576544523666669018.html.
Berliner, D. C. (2014). Exogenous variables and value-added assessments: a fatal flaw. Teachers College
Record, 116(1).
Berliner, D. (2018). Between Scylla and Charybdis: reflections on and problems associated with the evaluation
of teachers in an era of metrification. Education Policy Analysis Archives, 26(54), 1–29. https://doi.
org/10.14507/epaa.26.3820.
Betebenner, D. W. (2009). A primer on student growth percentiles. Dover: The Center for Assessment
Retrieved from http://www.ksde.org/LinkClick.aspx?fileticket=XmFRiNlYbyc%3d&tabid=1646
&mid=10217.
Betebenner, D.W. (2011). Package ‘SGP.’ Retrieved from https://cran.r-project.org/web/packages/SGP/SGP.
pdf.
Bill & Melinda Gates Foundation. (2010). Learning about teaching: Initial findings from the measures of
effective teaching project. Seattle: Retrieved from http://www.gatesfoundation.org/college-ready-
education/Documents/preliminary-findings-research-paper.pdf.
Bill & Melinda Gates Foundation (2013). Ensuring fair and reliable measures of effective teaching:
Culminating findings from the MET project’s three-year study. Seattle, WA. Retrieved from http://www.
gatesfoundation.org/press-releases/Pages/MET-Announcment.aspx.
Braun, H. I. (2005). Using student progress to evaluate teachers: a primer on value-added models. Princeton:
Educational Testing Service Retrieved from http://www.ets.org/Media/Research/pdf/PICVAS.pdf.
Braun, H., Goldschmidt, P., McCaffrey, D., & Lissitz, R. (2012). Graduate student council Division D fireside
chat: VA modeling in educational research and evaluation. Paper Presented at Annual Conference of the
American Educational Research Association (AERA), Vancouver, Canada.
Briggs, D. C., & Betebenner, D. (2009). Is growth in student achievement scale dependent? Paper presented at
the annual meeting of the National Council for Measurement in Education (NCME), San Diego, CA.
Chin, M., & Goldhaber, D. (2015). Exploring explanations for the Bweak^ relationship between value added
and observation-based measures of teacher performance. Cambridge, MA: Center for Education Policy
Research (CEPR), Harvard University. Retrieved from http://cepr.harvard.edu/files/cepr/files/sree2015_
simulation_working_paper.pdf?m=1436541369.
Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems
after the passage of the Every Student Succeeds Act: Some steps in the right direction. Boulder, CO:
Nation Education Policy Center (NEPC). Retrieved from http://nepc.colorado.
edu/publication/stateassessment.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences
(2nd ed.). Hillsdale: Lawrence Erlbaum Associates.
Collins, C. (2014). Houston, we have a problem: teachers find no value in the SAS Education Value-Added
Assessment System (EVAAS®). Education Policy Analysis Archives. Retrieved from http://epaa.asu.
edu/ojs/article/view/1594.
Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: A national
overview. Teachers College Record, 16(1). Retrieved from: http://www.tcrecord.org/Content.
asp?ContentId=17291
Corcoran, S. P., Jennings, J. L., & Beveridge, A. A. (2011). Teacher effectiveness on high- and low-stakes
tests. New York: New York University Retrieved from https://files.nyu.edu/sc129
/public/papers/corcoran_jennings_beveridge_2011_wkg_teacher_effects.pdf.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52,
281–302. https://doi.org/10.1037/h0040957.
Curtis, R. (2011). District of Columbia Public Schools: Defining instructional expectations and aligning
accountability and support. Washington, D.C.: The Aspen Institute Retrieved from: www.nctq.
org/docs/Impact_1_15579.pdf.
Denby, D. (2012). Public defender: Diane Ravitch takes on a movement. The New Yorker. Retrieved from
http://www.newyorker.com/reporting/2012/11/19/121119fa_fact_denby.
Doherty, K. M., & Jacobs, S. (2015). State of the states 2015: Evaluating teaching, leading and learning.
Washington DC: National Council on Teacher Quality (NCTQ). Retrieved from http://www.nctq.
org/dmsView/StateofStates2015.
Duncan, A. (2009). Teacher preparation: Reforming the uncertain profession. Retrieved from http://www.ed.
gov/news/speeches/2009/10/10222009.html.
Duncan, A. (2011). Winning the future with education: Responsibility, reform and results. Testimony given to
the U.S. Congress, Washington, DC: Retrieved from http://www.ed.gov/news/speeches/winning-future-
education-responsibility-reform-and-results.
Every Student Succeeds Act (ESSA) of 2015, Pub. L. No. 114-95, § 129 Stat. 1802. (2016). Retrieved from
https://www.gpo.gov/fdsys/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf.
Felton, E. (2016). Southern lawmakers reconsidering role of test scores in teacher evaluations. Education
Week. Retrieved from http://blogs.edweek.org/edweek/teacherbeat/2016/03/reconsidering_test_scores_
in_teacher_evaluations.html.
Ferguson, G. A., & Takane, Y. (1989). Statistical analysis in psychology and education (6th ed.). New York:
McGraw-Hill.
Freed, M. N., Ryan, J. M., & Hess, R. K. (1991). Handbook of statistical procedures and their computer
applications to education and the behavioral sciences. New York: Macmillan Publishing Company.
Gabriel, R., & Lester, J. N. (2013). Sentinels guarding the grail: value-added measurement and the quest for
education reform. Education Policy Analysis Archives, 21(9), 1–30 Retrieved from http://epaa.asu.
Glazerman, S. M., & Potamites, L. (2011). False performance gains: a critique of successive cohort
indicators. Washington, DC: Mathematica Policy Research. Retrieved from www.mathematica-mpr.
com/publications/pdfs/.../False_Perf.pdf.
Goldhaber, D., Walch, J., & Gabele, B. (2014). Does the model matter? Exploring the relationship between
different student achievement-based teacher assessments. Statistics and Public Policy, 1(1), 28–39.
https://doi.org/10.1080/2330443x.2013.856169.
Goldschmidt, P., Choi, K., & Beaudoin, J. B. (2012, February). Growth model comparison study: Practical
implications of alternative models for evaluating school performance. Technical Issues in Large-Scale
Assessment State Collaborative on Assessment and Student Standards. Council of Chief State School
Officers.
Graue, M. E., Delaney, K. K., & Karch, A. S. (2013). Ecologies of education quality. Education Policy
Analysis Archives, 21(8), 1–36 Retrieved from http://epaa.asu.edu/ojs/article/view/1163.
Grek, S., & Ozga, J. (2010). Re-inventing public education: the new role of knowledge in education policy
making. Public Policy and Administration, 25(3), 271–288. https://doi.org/10.1177/0952076709356870.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: the relationship between
classroom observation scores and teacher value added on multiple types of assessment. Educational
Researcher, 43(6), 293–303. https://doi.org/10.3102/0013189X14544542.
Harris, D. N. (2011). Value-added measures in education: What every educator needs to know. Cambridge:
Harvard Education Press.
Harris, D. N., & Sass, T. R. (2006). Value-added models and the measurement of teacher quality. Tallahassee:
Florida Department of Education Retrieved from http://itp.wceruw.org/vam/IES_Harris_Sass_EPF_
Value-added_14_Stanford.pdf.
Hill, H. C., Kapitula, L., & Umlan, K. (2011). A validity argument approach to evaluating teacher value-added
scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102
/0002831210387916.
Ho, A. D. (2009). The dependence of growth model results on proficiency cut scores. Educational
Measurement Issues and Practice, 28(4), 15–26. https://doi.org/10.1111/j.1745-3992.2009.00159.x.
Hursh, D. (2007). Assessing No Child Left Behind and the rise of neoliberal education policies. American
Educational Research Journal, 44(3), 493–518. https://doi.org/10.3102/0002831207306764.
Jacob, B. A., & Lefgren, L. (2005). Principals as agents: Subjective performance measurement in education.
Cambridge: National Bureau of Economic Research (NBER) Retrieved from www.nber.
org/papers/w11463.
Johnson, M., Lipscomb, S., & Gill, B. (2013). Sensitivity of teacher value-added estimates to student and peer
control variables. Journal of Research on Educational Effectiveness, 8(1), 60–83. https://doi.org/10.1080
/19345747.2014.967898.
Johnston, J. (1972). Econometric methods (2nd ed.). New York: McGraw-Hill.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational
Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kennedy, M. M. (2010). Attribution error and the quest for teacher quality. Educational Researcher, 39(8),
591–598. https://doi.org/10.3102/0013189X10390804.
Kersting, N. B., Chen, M., & Stigler, J. W. (2013). Value-added added teacher estimates as part of teacher
evaluations: exploring the effects of data and model specifications on the stability of teacher value-added
scores. Education Policy Analysis Archives, 21(7), 1–39 Retrieved from http://epaa.asu.
Kimball, S. M., White, B., Milanowski, A. T., & Borman, G. (2004). Examining the relationship between
teacher evaluation and student assessment results in Washoe County. Peabody Journal of Education,
79(4), 54–78. https://doi.org/10.1207/s15327930pje7904_4.
Kupermintz, H. (2003). Teacher effects and teacher effectiveness: a validity investigation of the Tennessee
Value-Added Assessment System. Educational Evaluation and Policy Analysis, 25, 287–298. https://doi.
org/10.3102/01623737025003287.
Kyriakides, L. (2005). Drawing from teacher effectiveness research and research into teacher interpersonal
behaviour to establish a teacher evaluation system: a study on the use of student ratings to evaluate teacher
behaviour. Journal of Classroom Instruction, 40(2), 44–66.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics,
33(1), 159–174. https://doi.org/10.2307/2529310.
Lingard, B. (2011). Policy as numbers: ac/counting for educational research. The Australian Educational
Researcher, 38(4), 355–382.
Lingard, B., Martino, W., & Rezai-Rashti, G. (2013). Testing regimes, accountabilities and education policy:
commensurate global and national developments. Journal of Education Policy, 28(5), 539–556.
https://doi.org/10.1080/02680939.2013.820042.
Lockwood, J., McCaffrey, D., Hamilton, L., Stetcher, B., Le, V. N., & Martinez, J. (2007). The sensitivity of
value-added teacher effect estimates to different mathematics achievement measures. Journal of
Educational Measurement, 44(1), 47–67. https://doi.org/10.1111/j.1745-3984.2007.00026.x.
Loeb, S., Soland, J., & Fox, J. (2015). Is a good teacher a good teacher for all? Comparing value-added of
teachers with English learners and non-English learners. Educational Evaluation and Policy Analysis,
36(4), 457–475. https://doi.org/10.3102/0162373714527788.
Mathews, J. (2013). Hidden power of teacher awards. The Washington Post. Retrieved from http://www.
washingtonpost.com/blogs/class-struggle/post/hidden-power-of-teacher-awards/2013/04/08/15b7afcc-
9e66-11e2-9a79-eb5280c81c63_blog.html.
Mathis, W. (2011). Review of BFlorida Formula for Student Achievement: Lessons for the Nation.^. Boulder:
National Education Policy Center Retrieved from http://nepc.colorado.edu/thinktank/review-florida-
formula.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models
for teacher accountability. Santa Monica: Rand Corporation.
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added
modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67–101 RAND
reprint available at http://www.rand.org/pubs/reprints/2005/RAND_RP1165.pdf.
Messick, S. (1975). The standard problem: meaning and values in measurement and evaluation. American
Psychologist, 30, 955–966. https://doi.org/10.1037//0003-066x.30.10.955.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027.
https://doi.org/10.1037//0003-066x.35.11.1012.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York:
American Council on Education and Macmillan.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons’ responses
and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Milanowski, A., Kimball, S. M., & White, B. (2004). The relationship between standards-based teacher
evaluation scores and student achievement: Replication and extensions at three sites. Madison:
University of Wisconsin-Madison, Center for Education Research.
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher
effectiveness: An exploration of stability across models and contexts. Educational Policy Analysis
Archives, 18(23), 1–27 Retrieved from http://epaa.asu.edu/ojs/article/view/810.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: How high-stakes testing corrupts America’s
schools. Cambridge: Harvard Education Press.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation
and Policy Analysis, 26(3), 237–257. https://doi.org/10.3102/01623737026003237.
Ozga, J. (2016). Trust in numbers? Digital education governance and the inspection process. European
Educational Research Journal, 15(1), 69–81. https://doi.org/10.1177/1474904115616629.
Papay, J. P. (2010). Different tests, different answers: The stability of teacher value-added estimates across
outcome measures. American Educational Research Journal, 48(1), 163–193. https://doi.org/10.3102
/0002831210362589.
Pauken, T. (2013). Texas vs. No Child Left Behind. The American Conservative. Retrieved from http://www.
theamericanconservative.com/articles/texas-vs-no-child-left-behind/
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Education
Evaluation and Policy Analysis, 36(4), 399–416. https://doi.org/10.3102/0162373714531851.
Porter, T. M. (1996). Trust in numbers: The pursuit of objectivity in science and public life. Princeton:
Princeton University Press.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Application and data analysis methods
(2nd ed.). Thousand Oaks: Sage Publications, Inc..
Reynolds, C. R., Livingston, R. B., & Wilson, V. (2009). Measurement and assessment in education (2nd ed.).
Upper Saddle River: Pearson Education, Inc..
Rhee, M. (2011). The evidence is clear: Test scores must accurately reflect students' learning. The Huffington
Post. Retrieved from http://www.huffingtonpost.com/michelle-rhee/michelle-rhee-dc-schools_b_845286.
html.
Rizvi, F., & Lingard, B. (2010). Globalizing education policy. London: Routledge.
Rothstein, J., & Mathis, W. J. (2013). Review of two culminating reports from the MET Project. Boulder:
National Education Policy Center (NEPC) Retrieved from http://nepc.colorado.edu/thinktank/review-
MET-final-2013.
Schafer, W. D., Lissitz, R. W., Zhu, X., Zhang, Y., Hou, X., & Li, Y. (2012). Evaluating teachers and schools
using student growth models. Practical Assessment, Research & Evaluation, 17(17). Retrieved from
pareonline.net/getvn.asp?v=17&n=17.
Smith, W. C. (2016). The global testing culture: Shaping education policy, perceptions, and practice. Oxford:
Symposium Books.
Smith, W. C., & Kubacka, K. (2017). The emphasis of student test scores in teacher appraisal systems.
Education Policy Analysis Archives, 25(86). https://doi.org/10.14507/epaa.25.2889.
Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Education International Discussion
Paper. Retrieved from: http://download.eiie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web. pdf.
Stevens, J. (1996). Applied multivariate statistics for the social sciences. Mahwah: Lawrence Erlbaum
Associates, Inc.
Tekwe, C. D., Carter, R. L., Ma, C., Algina, J., Lucas, M. E., Roth, J., Arite, M., Fisher, T., & Resnick, M. B. (2004).
An empirical comparison of statistical models for value-added assessment of school performance. Journal of
Educational and Behavioral Statistics, 29(1), 11–36. https://doi.org/10.3102/10769986029001011.
Timar, T. B., & Maxwell-Jolly, J. (Eds.). (2012). Narrowing the achievement gap: Perspectives and strategies
for challenging times. Cambridge: Harvard Education Press.
Verger, A., & Parcerisa, L. (2017). A difficult relationship. Accountability policies and teachers: International
evidence and key premises for future research. In M. Akiba & G. LeTendre (Eds.), International
handbook of teacher quality and policy (pp. 241–254). New York: Routledge.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our national failure to
acknowledge and act on differences in teacher effectiveness. New York: The New Teacher Project (TNTP)
Retrieved from http://tntp.org/assets/documents/TheWidgetEffect_2nd_ed.pdf.

Different Teacher-Level Effectiveness Estimates, Different Results: Inter-Model Concordance Across Six Generalized Value-Added Models (Vams)

Uploaded by

Copyright:

Available Formats

You might also like

Different Teacher-Level Effectiveness Estimates, Different Results: Inter-Model Concordance Across Six Generalized Value-Added Models (Vams)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Different Teacher-Level Effectiveness Estimates, Different Results: Inter-Model Concordance Across Six Generalized Value-Added Models (Vams)

Uploaded by

Copyright:

Available Formats

Educational Assessment, Evaluation and Accountability

Different teacher-level effectiveness estimates, different

Edward Sloat 1 & Audrey Amrein-Beardsley 1 & Jessica Holloway 2

Received: 12 November 2017 / Accepted: 13 July 2018/

Keywords Teacher accountability . Teacher effectiveness . Teacher evaluation . Teacher

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11092-018-

determined at the district-level (e.g., allocation of performance pay, teacher promotion,

3 Purpose of the study

4.2 Criterion-related evidence of validity

As per Messick (1989), Bcriterion-related validity is based on the degree of relationship

4.3 Inter-indicator correlations

First, in terms of concurrent-related evidence of validity, current research evidence

Again, researchers compared the concordance of teacher-level effectiveness ratings

5.1.1 District demographics

5.1.2 District academic performance

Researchers obtained teacher-classroom assignments from the district’s student course

5.3.1 Achievement variables

Researchers used categorical data reflecting students’ special program membership

5.3.3 Student growth percentiles

5.4 Modeling approaches

5.4.1 Student growth percentile model

5.4.2 Value-added linear regression model

Table 1 Descriptive statistics for grades 4, 5, and 6

Variables Grade 4 (n = 1568) Grade 5 (n = 1641) Grade 6 (n = 1550)

Mean Median SD Mean Median SD Mean Median SD

5.4.3 Value-added hierarchical linear model

Math SS 2011 1.00

SGP Math 2011 0.55** 0.31** − 0.04 0.11** 1.00

**p < 0.01 level (two-tailed); *p < 0.05 level (two-tailed)

Table 3 Descriptive statistics for classroom median growth percentiles (MGPs)

Variables Grade 4 (n = 71 classrooms) Grade 5 (n = 75 classrooms) Grade 6 (n = 69 classrooms)

Mean Median SD Mean Median SD Mean Median SD

The Level 1 model is specified as follows:

where LUNCHPCTj represents the percent of students in school j participating in NSLP

MSS 2011ij ¼ γ 00 þ γ 01 *LUNCHPCT j þ γ 02 *MSS2010A j þ γ 10 *DAYSENROij

þ γ 20 *GIFTEDDU ij þ γ 30 *SPEDDUM ij þ γ 40 *LUNCHDUM ij

Variable B (t) B (t) B (t) B (t) B (t) B (t)

5.4.4 Simple difference (gain) score model

5.4.5 Performance level (growth) model

Table 5 Descriptive statistics:

into four descriptive performance categories: BFalls Far Below,^ BApproaches,^

5.4.6 Criterion (percent passing) measures of achievement

Table 6 Growth criterion matrix: AIMS performance levels

AIMS 2010 Falls far below Approaches Meets Exceeds

Table 7 Descriptive statistics:

Table 8 Descriptive statistics:

6.1 Rank correlations across classrooms

Grade 4 Mathematics Grade 4 Reading

Methodological Grade 4 Grade 4 Grade 5 Grade 5 Grade 6 Grade 6

VALRM 0.92a 0.92a 0.86a 0.83a 0.82a 0.82a

a Correlation is significant at the 0.01 level (two-tailed)

As illustrated, the VALRM reports the strongest association to the SGP-based

6.2 Rating agreement

Cohen’s kappa is a measure of interrater agreement between matched ordinal

Model Kendall tau-c Kappa Rating disagreement

Grade 4 Mathematics (n = 71)

***p < 0.001; **p < 0.05; *p > 0.05

research was not to evaluate or compare the technical characteristics of alternative

SGP Math 2011 0.55 0.31 − 0.04 0.11** 1.00

MSS 2011ij ¼ γ 00 þ γ 01 LUNCHPCT j þ γ 02 MSS2010A j þ γ 10 *DAYSENROij

þ γ 20 GIFTEDDU ij þ γ 30 SPEDDUM ij þ γ 40 *LUNCHDUM ij

p < 0.001; p < 0.05; p > 0.05