
School Effectiveness and School Improvement

An International Journal of Research, Policy and Practice

Journal homepage: www.tandfonline.com/journals/nses20

Assessing the impact of collaborative inquiry on teacher performance and effectiveness

Xiu C. Cravens & Seth B. Hunter

To cite this article: Xiu C. Cravens & Seth B. Hunter (2021) Assessing the impact of
collaborative inquiry on teacher performance and effectiveness, School Effectiveness and
School Improvement, 32:4, 564-606, DOI: 10.1080/09243453.2021.1923532

To link to this article: https://doi.org/10.1080/09243453.2021.1923532

Published online: 17 May 2021.


Assessing the impact of collaborative inquiry on teacher performance and effectiveness

Xiu C. Cravens (a) and Seth B. Hunter (b)

(a) Peabody College of Education and Human Development, Vanderbilt University, Nashville, TN, USA; (b) College of Education and Human Development, George Mason University, Fairfax, VA, USA

ABSTRACT
This study tests the hypothesis that teacher-led collaborative inquiry cycles, guided by instructional standards, lead to improved teacher performance and effectiveness. We examine the impact of teachers’ self-selection into teacher peer excellence groups (TPEGs), which involves lesson co-planning, peer observation and feedback, and collaborative lesson-plan revision, on participating teachers from 14 pilot public schools in Tennessee. Using survey results and statewide administrative data, we apply a propensity score matching strategy, and find that TPEG teachers experience growth in their instruction ratings and value-added scores in the subsequent year, although the longer term impact is attenuated. We contribute to the literature by identifying deprivatized practice and instruction-focused collaboration as key features of teacher communities of practice, highlighting the importance of using standards-based instructional quality measures, linking participation in collaborative inquiry cycles to teacher-level outcomes, and estimating effects applicable to situations in which teachers exercise agency and collaborate voluntarily.

ARTICLE HISTORY
Received 10 December 2019; Accepted 23 April 2021

KEYWORDS
Collaborative inquiry; community of practice; professional development; teacher performance; teacher effectiveness

CONTACT Xiu C. Cravens, xiu.cravens@vanderbilt.edu

© 2021 Informa UK Limited, trading as Taylor & Francis Group

Introduction
In recent decades, educational reform efforts have increasingly focused on the role of teachers
as the key to improving student learning (Darling-Hammond & Youngs, 2002; Stronge
et al., 2007). While research shows that teachers impact student achievement more
than any other within-school factor, teacher effectiveness varies greatly across class-
rooms and is inequitably distributed across racial and socioeconomic groups (Akiba
et al., 2007; Kane et al., 2011; Nye et al., 2004). In light of these findings, scholars have
argued that professional development (PD) is an essential lever to improve teaching
and ensure equitable student access to effective teachers (Chetty et al., 2014; Clotfelter
et al., 2007; Palardy & Rumberger, 2008).
Research concludes that job-embedded, collaborative, and teacher-led PD is more
likely to strengthen instructional practices and improve student learning (Coburn et al.,
2012; Desimone, 2009; Goddard et al., 2007; National Staff Development Council, 2001).
The field of PD also recognizes that teacher collaboration is dynamic and context depen-
dent, and the very nature of such activities presents challenges to pinpointing the core
features that can be applied to different settings and taken to scale (Hiebert et al., 2002;
Hill et al., 2013). To that end, some scholars find promise in international models that
feature teacher-led collaborative inquiry cycles as a mechanism of PD that can be
applied across subjects or grade levels (Huang & Shimizu, 2016; Jensen et al., 2016;
Lewis, 2015). In these inquiry models, teachers work in teams and engage in continuous
efforts to identify specific student-learning goals and instructional improvement
strategies.
Several studies have explored the extent to which various forms of teacher-led colla-
borative inquiry cycles are associated with student-level and school-level achievement
scores or school-level value-added scores (e.g., Goddard et al., 2007; Ronfeldt et al.,
2015; Supovitz, 2002). The study most similar to ours compares the school-level
average student achievement scores of schools that have instruction-focused, administra-
tor-supported teacher teams to those without such teams (Saunders et al., 2009). These
studies tend to find positive associations between teacher-led collaborative groups and
student- and school-level performance outcomes, underscoring the promise of these
PD models. However, this emerging line of research has yet to fully provide a compelling
theory of change that identifies and explicates how design principles and operational pro-
tocols enable teachers to translate new ideas into their own systems of practice, and whether
changed practices lead to improvement in performance and effectiveness at the
teacher level. We extend this literature in three important ways.
First, we examine the effects of a specific model of teacher-led collaborative inquiry,
teacher peer excellence groups (TPEGs), co-designed by one of the authors through a
research–practice partnership. TPEGs are grounded in research on effective PD and, more
specifically, the principles of “communities of practice” – professional communities focused
on solving problems of practice by leveraging shared expertise. TPEGs organize teachers’
work and focus it on continuous instructional improvement. By using standards-based obser-
vation rubrics that describe specific instructional practices linked to higher student achieve-
ment scores, TPEGs aim to improve instruction, student achievement, and teacher
effectiveness (as measured by value-added scores). Critically, TPEGs are designed to accom-
plish these goals by setting norms that deprivatize teaching via collaborative lesson planning,
peer observation, post-observation reflection, and peer feedback.
Second, we examine two teacher-level outcomes: a measure of performance, as captured
by observation scores, and a measure of effectiveness, as captured by value-added scores.
Previous work within this literature tends to examine teacher PD effects on student-level
or school-level achievement scores (e.g., Saunders et al., 2009; Supovitz, 2002), school-level
value-added (e.g., Ronfeldt et al., 2015), or teacher self-reports (Cravens et al., 2017;
Goddard et al., 2007), which does not directly support inferences about effects on teachers.
We therefore argue that it is essential to take one step further and examine the extent to
which teacher-led collaborative inquiry groups lead to higher teacher performance and effec-
tiveness as students taught by higher performing and more effective teachers experience
better short- and long-term academic and non-academic outcomes (Chetty et al., 2014;
Doan, 2019).
Third, our data and methods support inferences about the changes in these two out-
comes if a policy assigns schools to offer TPEGs, but allows teachers to self-select into
these teams. Typical impact studies support inferences about the expected changes in
outcomes due to an intervention assigned to participants. Were our study a typical
impact study, it might support inferences about the expected changes in observation and
value-added scores when teachers are assigned to a TPEG by education policy or leaders.
Our study aims to estimate the changes in observation and value-added scores if policy
expects schools to offer and support TPEGs as a form of PD, then school administrators
allow teachers to self-select into a TPEG. Our findings provide evidence that school
leaders should include TPEGs in their repertoire of PD strategies that aim to improve
teacher performance and effectiveness.
We make these contributions by answering the following research questions: To what
extent do TPEG teachers engage in deprivatized practice and instruction-focused collab-
oration? What is the impact of teachers self-selecting into a TPEG on teacher observation
scores? What is the impact of teachers’ self-selection into a TPEG on value-added scores?

Literature and theory of change


Our study draws on previous research and explores the impact pathways that link
in-school teacher PD with improved teacher performance and effectiveness. We
develop a theory of change (see Figure 1) and posit that collaborative inquiry cycles
serve as the driving mechanism that builds effective communities of practice by providing
a team structure that organizes teachers’ continuous improvement efforts and a protocol
that centers activities around standards that define desired practice. We also suggest that
teacher communities of practice formed by such collaborative inquiry cycles are more
likely to display deprivatized practice and instruction-focused collaboration, two features
that lead to improved teacher performance and effectiveness.
We first describe the definition and core features of communities of practice, then
discuss the role of collaborative inquiry as a driving mechanism to build such commu-
nities. Next, we provide a brief review of recent research identifying effective instructional
practices and their association with student academic learning. Finally, we connect all
these pieces and discuss the design of our study.

Figure 1. From collaborative inquiry to teacher performance and effectiveness.

Teacher communities of practice

Grounded in situated learning theory, Lave and Wenger (1991) coined the term “community of practice” (CoP) with the assumptions that learners enter a community first as
novices whose knowledge is experience based, and learning occurs as learners collabora-
tively and critically reflect on shared experiences. The concept provides a useful perspec-
tive on knowing and learning and has been applied in various sectors as the key to
improving individual and group performance (Wenger, 2010; Wenger et al., 2002).
An extensive body of literature identifies the characteristics of communities of practice
that enhance teacher learning, for example, a strong content focus, inquiry-oriented
learning approaches, collaborative participation, and coherence with school curricula
and policies (Darling-Hammond, 2013; Desimone, 2009; Lee & Smith, 1996; Leithwood
et al., 2004; Louis et al., 1996; McLaughlin & Talbert, 2001; Youngs & King, 2002). Moreover,
studies indicate that the effectiveness of a community of practice depends on applying
newly learned knowledge by multiple community members, not simply the application
of it by one member (Grossman et al., 2001; Lave & Wenger, 1991).
Despite the theoretical benefits of communities of practice, scholars argue that they
are difficult to create. Within the culture of American schooling, researchers point to
two particularly salient barriers – the isolated and privatized nature of teaching and the
lack of consensus about what constitutes effective instructional practices (Buysse et al.,
2003; Grossman et al., 2001; Huang & Shimizu, 2016; Palincsar et al., 1998). To address
these barriers, more studies are focusing on identifying the distinct school-based activities
that broaden teacher access to peer expertise and promote effective research-based
instructional practices (Coburn & Russell, 2008; Gallimore et al., 2009; Grossman & McDo-
nald, 2008; Hiebert et al., 2002; Levine & Marcus, 2010; Lewis et al., 2006). Drawing from
these studies, we identify two key features of effective communities of practice: depriva-
tized practice and instruction-focused collaboration.
Deprivatized practice calls for teachers to see their practice as objects that can be
shared and examined publicly (Hiebert et al., 2002). In the United States, teachers are typi-
cally isolated in their classrooms where their practice remains private (Akiba & Wilkinson,
2016; Stigler & Hiebert, 2009). This isolation hinders teachers from addressing instruc-
tional concerns and sharing knowledge with their peers to replicate classroom successes.
Working in isolation also exacerbates variation in teaching quality due to prior experience,
training, grade level, subject area, student characteristics, and other school factors
(Goddard et al., 2007; Grossman et al., 2007). Forming communities of practice provides
an opportunity for teachers to overcome the isolated nature of teaching and access
their within-school peers’ expertise (Akiba & Wilkinson, 2016; Coburn et al., 2012). Depri-
vatized teaching allows a community of practice to address weak practices and, more
importantly, identify and institutionalize strong instructional practices. That is, when
depicted with clarity and concrete detail, accumulative practice-based knowledge can
be carried forward and scaled up into other teacher communities of practice (Cravens
& Wang, 2017; Huang & Shimizu, 2016; Jensen et al., 2016; Lewis et al., 2006; Little,
2002; Wang, 2013).
Instruction-focused collaboration begins by identifying improvement targets and articu-
lating how specific instructional practices will theoretically result in the desired outcome
(Gallimore et al., 2009; Hiebert et al., 2002; Kraft & Blazar, 2017; Papay et al., 2020). Through
peer observation and the collection of student and teacher work products, a community
of practice evaluates the extent to which instruction worked as expected and identifies
the potential sources of deviation from expected results (e.g., unexpected student beha-
viors; Bryk et al., 2015). These sources of deviation are important points for collaborative
reflection. Understanding the sources of deviation supports communities of practice in
replicating positive deviations and avoiding negative deviations in future lessons (Bryk
et al., 2015). Teachers with diverse content knowledge, pedagogical perspectives, and
past experiences enrich collaborative inquiry in communities of practice as these back-
grounds may lead to novel interpretations and conjectures about sources of deviation
(Hiebert et al., 2002; Huang & Shimizu, 2016), and the classroom becomes the primary
testing ground for continuous improvement of teaching (Jensen et al., 2016; Lewis,
2015; Lewis et al., 2006). Well-articulated descriptions of instructional practices that
lead to better student learning outcomes can also help guide teachers to stay focused
throughout their inquiry (Kane et al., 2011).

The role of collaborative inquiry cycles


Among the various mechanisms that aim to improve instructional practice and
student learning, teacher-led collaborative inquiry emerges as a promising form of
PD (Akiba & Wilkinson, 2016; Cravens et al., 2017; Goddard et al., 2007; Huang &
Shimizu, 2016; Lewis, 2015; Saunders et al., 2009). In these inquiry models, teacher
teams engage in iterative cycles that begin with instructional goal setting for
student learning, planning lessons that will achieve set goals, implementing the
lesson, and tracking instructional outcomes by observing peer teaching and monitor-
ing student learning results. Collaborative inquiry groups then consider if goals are
met, identify if the implemented lesson deviated from its design, and inquire into
the potential sources of positive and negative deviations (Bryk et al., 2015). Future
lessons replicate positive deviations, aim to mitigate negative deviations, and teachers
update their understanding of the conditions in which specific lessons and specific
instructional strategies work and for whom they work (Bryk et al., 2015). Studies
also find that teachers learn more about how specific lessons and instructional prac-
tices affect student learning when a community of practice focuses on a specific
instructional practice over a period of time and when teachers apply focal instruc-
tional strategies in similar and different settings (Gallimore et al., 2009; Grossman &
McDonald, 2008; Morris & Hiebert, 2009).
Most examples of widely practiced teacher collaborative inquiry models exist outside
the United States. The Japanese lesson-study model was among the earliest to be intro-
duced to the United States through the Trends in International Mathematics and Science
Study (TIMSS) in the early 1990s (Hiebert et al., 1999). In a typical lesson study, teachers
collectively examine curriculum and instructional materials and students’ thinking (i.e.,
set goals, collect data), and use multiple trials to improve classroom approaches (i.e.,
analyze data, identify deviations, improve lessons; Hiebert et al., 2002; Lewis, 2015;
Lewis et al., 2006).
Studies have also associated the “teaching-study groups” in Shanghai with higher
student achievement independent of student socioeconomic status and historical aca-
demic proficiency (Jensen et al., 2016; Organisation for Economic Co-operation and Devel-
opment [OECD], 2011; Tucker, 2014; Wang, 2013). In Shanghai and many parts of China,
teaching-study groups are typically organized by subject and grade level, led by teachers
with recognized content-pedagogical expertise, and engaged in weekly inquiry cycles
that examine teaching quality through collaborative lesson planning, peer observation
and feedback, and lesson revision (Cravens & Wang, 2017).
The Japanese lesson study and Shanghai teaching-study groups models illustrate the
pathways through which teacher-led collaborative inquiry cycles build communities of
practice. Scholars have argued that such collaborative inquiry cycles foster shared respon-
sibility within a community of practice (Huang & Shimizu, 2016; Lewis et al., 2006). And
unlike prescribed PD programs, the collaborative inquiry model is a strategy for continu-
ous improvement that applies to any subject, grade level, or local context, and could be
implemented at scale (Cravens & Wang, 2017; Jensen et al., 2016; Wang, 2013).

Linking instructional practices with student achievement gain and measuring the impact
Our literature review suggests that teacher-led collaborative inquiry cycles start by iden-
tifying effective teaching practices, then set goals to implement effective instruction.
Encouragingly, in the last 15 to 20 years, more states and districts have adopted stan-
dards-based teacher observation rubrics that describe effective teaching practices and
articulate levels of proficiency, allowing educators to identify instructional differences
within and between teachers that are meaningful to student learning (Daley & Kim,
2010; Goldring et al., 2009, 2015; Grossman et al., 2013; Kane et al., 2011). One of the
most widely used tools to identify instructional differences is Danielson’s Enhancing Pro-
fessional Practice: A Framework for Teaching (2007; see also Archibald et al., 2011; Daley &
Kim, 2010; Kane et al., 2011), which covers four domains – planning and preparation, class-
room environment, instruction, and professional responsibilities, with each domain
further delineated by elements and indicators.
Several comprehensive teacher evaluation systems have incorporated the Danielson
framework. Daley and Kim (2010) examined one such system, the System for Teacher
and Student Advancement (i.e., TAP) and the alignment between summative TAP
scores and student achievement growth. They found that TAP evaluations provided
differentiated feedback, that classroom observational scores positively and significantly
correlated with student achievement growth, and that TAP teachers increased
observed skill levels over time (Daley & Kim, 2010). Using data from the Cincinnati
Public Schools’ Teacher Evaluation System (TES), Kane and colleagues (2011) found
that observational measures of teaching were substantively related to student achieve-
ment growth and that there were differences in these relationships across different
instructional strategies, implying that some strategies were more effective at raising
achievement scores. Chaplin and colleagues (2014) examined the associations among
three teacher effectiveness measures used by the Research-based Inclusive System of
Evaluation (RISE) for the Pittsburgh Public Schools – teacher evaluation system (pro-
fessional practice based on classroom observation), student surveys, and value-added
measures – and found that the measures were positively (but weakly) correlated.
The design and validation work of observation rubrics in TAP, TES, and RISE suggests
that standards-based teacher observation rubrics capture instructional aspects that
may improve student achievement, lending evidence to the validity and utility of these
instruments. However, it has been difficult to find communities of practice that focus
on the implementation of instructional standards, and even more challenging to separate
the causal impact of such communities of practice from confounding variables. As such,
Lewis and colleagues (2006) argued for new research designs that can make a stronger
causal warrant for innovative instructional practices.
An “early” descriptive study links teacher collaboration to student achievement (Supo-
vitz, 2002) by examining the extent to which teacher teaming changed instructional prac-
tice as reported on teacher surveys, and whether such changes improved student
learning as measured by standardized test performance. The study focused on elementary
and middle grades teachers who formed collaborative teams that developed a shared
vision and curriculum to improve student learning. The study found positive associations
between elementary teacher-reported teaming time spent on instruction and student
scores in reading, writing, and mathematics. Similar, but weaker, relationships were
also found for middle grades teachers.
Goddard and colleagues (2007) explored the relationship between a survey measure of
teacher collaboration and student achievement. Analyses suggested that fourth-grade
students had higher achievement in mathematics and reading when they attended
schools where teachers reported engaging in higher levels of collaboration. The
authors suggested that their results provided preliminary support for teacher collabor-
ation focused on work related to curriculum and instruction. The authors also discussed
the need for more research on the effects of different types of collaborative practices
using more representative samples.
Ronfeldt and colleagues (2015) tested whether the quality of teachers’ collaboration
was associated with student achievement gains using survey and administrative data
from Miami-Dade County. The data included 9,000 teacher observations from 336
schools over 2 years, and the survey measures included the frequency and quality of
teacher collaboration. Regression analyses found associations between “general collabor-
ation” at the school level and math and reading value added of 0.42 and 0.18, respectively.
However, the estimates were not conclusively causal.
Saunders and colleagues (2009) implemented a noteworthy study using a quasi-exper-
imental design similar to our analysis. The study tested whether the introduction of grade-
level teacher teams improved the school-level achievement scores of historically low-per-
forming, urban, elementary schools over 5 years. Nine schools volunteered for treatment,
and the authors identified six comparison schools from the same district as best matches
based on demographics and achievement (Saunders et al., 2009).1 The authors presented
baseline grade-level achievement scores of treated and comparison schools and stated
that results from unprinted t tests showed that the two groups were not statistically
different.2 The study found no significant results from the first 2 study years when only
principals were trained to form and facilitate teacher teams. However, the authors
found positive associations in the 3rd year, when treatment schools received intensified
principal training, school leaders provided consistent meeting times, and teacher teams
received explicit protocols that focused meeting time on students’ academic needs and
how they might be instructionally addressed (Saunders et al., 2009, p. 1007). This study
offers suggestive evidence that teacher teams with time to collaborate, school adminis-
trator support, and protocols focused on instructional improvement may improve
school-level student achievement scores. However, the study does not provide any evi-
dence on the extent to which such teams may improve individual teacher practice as
measured by observation scores, or effectiveness as measured by value-added scores.

The teacher peer excellence group (TPEG) pilot


Design
In the early 2010s, the state of Tennessee received Federal Race to the Top funding and
introduced sweeping changes to its evaluation system, with the primary goal of improv-
ing teacher effectiveness (Tennessee Department of Education [TDoE], 2016a). The TDoE
stepped toward its goal by creating the Tennessee Educator Acceleration Model (TEAM)
standards-based observation rubric. The TEAM rubric is based on the Danielson frame-
work and national professional standards (Alexander, 2016). Similar to the Danielson fra-
mework for teaching, the TEAM rubric has three main dimensions – planning,
environment, and instruction – and each dimension includes multiple indicators and
proficiency descriptors (Appendix 1).
In late 2012, a research–practice partnership was formed between an international
team of university researchers (including one of the authors) and a non-partisan, non-gov-
ernmental organization in Tennessee. The partnership designed a collaborative inquiry
model, teacher peer excellence group (TPEG), which uses the TEAM rubric and key
design features of the Shanghai teaching-study groups and Japanese lesson study (see
Figure 2).
The TPEG model involves teams of teachers organized by subject matter or grade level
to conduct collaborative inquiry cycles through goal setting, lesson planning, peer obser-
vation, peer feedback conferences, and lesson-plan revision. Principals and teachers in the
pilot schools were provided with training on the TPEG theory of change, protocols for
conducting meetings and activities, and a modifiable template to plan and document
each cycle (see Appendix 2).

Figure 2. Assessing the impact of collaborative inquiry conducted by TPEGs.
As indicated in Figure 2, the TEAM rubric plays a central role in ensuring the focus of
the collaborative inquiry cycles. Teachers in a TPEG use the TEAM rubric to decide which
instructional practices are targeted for improvement and how they are aligned with
student achievement measures. TPEG teachers then plan their lessons as a team,
observe their peers to see how a planned lesson is delivered, identify positive and nega-
tive deviations from the lesson plan, and provide feedback specific to the focal objectives
– all as steps of deprivatized teaching. The last step of each TPEG cycle is to revise the
lesson plan using what was learned from performance feedback and the deviation
analysis so that change can be documented, stored, and improvement can be assessed
when the lesson is taught again. While the specific focus and frequency of TPEG cycles
could vary by subject, grade, or other contextual factors, TPEGs are expected to serve
as a mechanism that drives and sustains deprivatized practice and instruction-focused
collaboration – the immediate and observable change in teacher practice.
For example, a TPEG may start a cycle by identifying “Lesson Structure and Pacing”, a
TEAM rubric indicator, as the focal objective (see Figure 3). The descriptor for Level 5
(significantly above expectations) provides a set of exemplary instructional behaviors:
The lesson starts promptly; the lesson’s structure is coherent, with a beginning,
middle, and end; the lesson includes time for reflection; pacing is brisk and provides
many opportunities for individual students who progress at different learning rates;
routines for distributing materials are seamless; and no instructional time is lost
during transitions (Figure 3). Using the TPEG cycle template (Appendix 2), one
member might share the opening, central instruction, and closing segments of the
lesson plan, where specifics such as academic content area, student learning objectives,
assessment and evaluation benchmarks, and materials and resources will also be pro-
vided. Teachers in the group would then observe the lesson focusing on behaviors cap-
tured by the “Lesson Structure and Pacing” indicator. The peer observation is followed
by a peer-feedback session to identify areas of strength, areas for improvement, and
potential sources of positive and negative deviation. Teachers would revise lessons
accordingly and store updated lessons where teachers can access them for future
use. Depending on the subject/grade composition of the TPEG, revision may take
different forms. In a sizeable middle school with multiple teachers for the same
subject in each grade, for example, the revised lesson could be taught by other
TPEG members and reassessed further. In smaller schools, teachers often move to
the next content topic but remain focused on the same TEAM indicator until the
group is satisfied with their performance.

Implementation and school/teacher selection into TPEGs


In 2012–2013 the TPEG partnership team approached six school districts about study par-
ticipation. The TPEG team and district leaders purposively selected three pilot schools in
each district, 18 schools in total. Critically, one of the authors, the TDoE, and partnering
non-government organization purposefully selected TPEG schools for the study with
varying student race/ethnicity, free/reduced-price lunch status (FRPL), achievement
scores, and teacher value-added scores (TVAAS). No pilot schools had pre-existing colla-
borative inquiry teams.

Figure 3. Example TEAM rubric indicator.



The first round of TPEG implementation started in the fall of 2013. The TPEG partner-
ship and district leaders planned for each school to organize two TPEG teams: one for
math and one for Reading/Language Arts (RLA). The formation of the TPEG teams
varied by grade level and school size. As elementary schools are typically not departmen-
talized in the US, the principal and teachers in a school might form a math TPEG although
participating teachers also teach other subjects. A small and rural middle school might
form an English Language Arts TPEG vertically across sixth, seventh, and eighth grades.
Among eligible teachers who taught math and RLA, TPEG groups were formed primar-
ily based on two criteria: (a) shared instructional foci as defined by the TEAM rubric and (b)
varied levels of expertise in instruction practice. This consideration stemmed from the
theory of change that teacher-led collaborative inquiry groups with a shared focus on
instructional practice are most likely to leverage the expertise of their members for con-
tinuous improvement. Importantly, as a part of the initial TPEG training, pilot school prin-
cipals learned to use teacher prior-year observation scores and prior-year teacher value-
added scores to select the initial TPEG teams. The principals were also trained to conduct
TPEG orientations and explain the steps and expectations of the collaborative inquiry
cycles to the selected teachers. Moreover, the teachers had the option to accept or opt
out of the TPEG assignment (i.e., teachers self-selected into TPEGs).
The research team intentionally designed the implementation of TPEGs to be flexible
and adaptive to local conditions and needs. As long as the collaborations were ongoing,
using the TEAM rubric to identify the inquiry focus, and adhering to the planning-obser-
vation-feedback-revision sequence, principals and TPEG teachers were encouraged to
take the lead in deciding the best team formation and logistical arrangement for
inquiry cycles to take place throughout the implementation stage. Pilot school principals
were also responsible for adjusting the school schedule so that TPEGs would be able to
conduct peer observations and have common planning time.
During 2013–2014, the 1st year of the implementation, the research design team pro-
vided quarterly convenings, conducted site visits, facilitated peer-to-peer dialogues
among principals and participating teachers across TPEG pilot schools, collected
samples of the TPEG cycle documents, and provided regular feedback to the schools.
In the following 2014–2015 school year, active training and monitoring by the research
team were replaced by informal meetings and resource sharing with the participating
schools.

Data and methods


Statewide administrative data
TDoE has developed a comprehensive and statewide administrative data network collect-
ing teacher observation scores and value-added scores, student and teacher demographic
information, and educator survey responses from the statewide Tennessee Educator Survey
(TES). Our analysis primarily uses TDoE administrative data from 2012–2013, which we
characterize as the “baseline” year, and longer term data from 2014–2015. Baseline
measures include teacher observation and value-added scores (TVAAS).
Teacher observation scores are summative, representing a combination of the instruc-
tional, planning, and environmental management skills of the teacher on a scale ranging
from 1 (significantly below expectations) to 5 (significantly above expectations) (Tennessee
Department of Education, 2016a). Certified observers conduct teacher observations;
observers must annually pass an accuracy rating test and exam about teacher evaluation
policy (Hunter, 2020). Statewide, observers are school administrators, teacher peer obser-
vers, and central-office-based observers.3 Among TPEG teachers, school administrators
conducted 97% of observations; the remainder was equally divided between peer obser-
vers and central-office personnel. State policy assigns teachers some number of obser-
vations based on prior-year scores on a composite measure of teacher effectiveness
(i.e., level of overall effectiveness [LOE], see below) and experience (Hunter, 2020). The
least effective teachers are assigned a minimum of four observations, while the most
effective are assigned one. Early-career teachers of moderate effectiveness are assigned
a minimum of four observations, and policy assigns career teachers a minimum of two.
Districts may add to these minima. In 2013–2014, about 5% of TPEG teachers received
one observation, 12% received two, 33% received three, 10% received four, and the
remainder received five observations.
Tennessee teachers of tested subjects receive a value-added TVAAS score generated by
the Tennessee Value-Added Assessment System (Tennessee Department of Education,
2016b). TVAAS measures the growth in student achievement scores that is attributable to tea-
chers.4 In calculating a TVAAS score, a student’s achievement score is compared to the scores
of their peers who performed similarly on past assessments. Only math/RLA teachers of
fourth- through 12th-grade students received TVAAS scores during the study period.5
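To make the intuition concrete, the sketch below computes a much-simplified value-added measure in Python: it predicts each student's current score from prior achievement and averages the residuals by teacher. The actual TVAAS is a proprietary multivariate longitudinal model, so this is only an illustration of the growth idea, and the file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("student_scores.csv")  # hypothetical: one row per student

# Predict each student's current score from prior achievement, echoing the
# "compared to peers who performed similarly on past assessments" idea.
growth_model = smf.ols("score_2014 ~ score_2013", data=students).fit()
students["residual"] = students["score_2014"] - growth_model.predict(students)

# A teacher's (simplified) value-added is the mean residual of their students.
value_added = students.groupby("teacher_id")["residual"].mean()
print(value_added.sort_values(ascending=False).head())
```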
Other teacher baseline measures include education level, years of experience, gender,
age, race/ethnicity, and LOE scores, a composite of observation scores, value-added scores,
and student achievement (e.g., American College Testing [ACT] scores, accountability
exam scores). Student baseline measures include standardized math and reading scores,
office referrals, race/ethnicity, and English as a second language (ESL), FRPL, and special
education (SPED) status. Baseline measures also include responses to TES items concerning
peer observation and collaboration. In 2012–2013, teachers were asked whether they had
been observed by a department chair, instructional coach, or mentor teacher, and whether
they had engaged in one-on-one (1–1) work with a mentor, informally consulted with a
peer, or observed a peer. We treat teacher observation and TVAAS scores from 2013–2014 as outcomes, and
observation and TVAAS scores from 2014–2015 as longer term outcomes. TDoE adminis-
trative data allow researchers to link math and RLA teachers to the students they taught
and identify the schools in which these teachers worked. We also use data from a
Spring 2014 TPEG survey administered to teachers in the pilot schools.

Measures
TPEG participation
We identify TPEG teachers using teacher self-identification in the TPEG survey distributed
in the pilot schools during Spring 2014. All teachers were asked if they were members of a
TPEG over the course of 2013–2014 and if yes, their TPEG’s subject focus (math or RLA).
Although 18 schools were selected for the TPEG study, we only obtained data from 14
schools. Three schools missed the time window to fill out the survey due to a mailing
error. The fourth school was omitted because it was missing the information we use to
link TPEG survey responses to TDoE administrative datasets. As these four schools are
not missing data due to attrition or standard non-response, we treat these data as missing
at random.
Among 112 teachers from 14 pilot schools that participated in this survey, 68 teachers
identified as TPEG participants (61% teacher response rate). The number of TPEG teachers
within a school ranged from two to 10, and the majority came from intermediate
grades (see Table 1). One middle grades teacher taught Algebra 1, but partici-
pated in a cross-grade math TPEG. In each school, TPEG participation was suggested by
the principal but ultimately decided by the teacher. The self-selection option was inten-
tional as it allowed teachers to exercise some agency in their choice of professional learn-
ing opportunities. Consequently, our methods do not aim to remove variation arising
from teacher self-selection (i.e., volunteering) from our estimates.

Teacher communities of practice


The TPEG theory of change asserts that instruction and student achievement will improve
when teachers engage in instruction-focused collaboration and deprivatize their instructional practices.

Table 1. Descriptives for teacher peer excellence group (TPEG) teachers in analytical samples.
School   Number of TPEG Teachers in Pilot Schools Used in Propensity Score Matching (PSM)
A 2
B 8
C 4
D 3
E 3
F 2
G 5
H 4
I 5
J 5
K 6
L 4
M 7
N 10
Total of 14 TPEG Schools Total of 68 TPEG Teachers
Grade Taught by TPEG Teachers Used in PSM   Frequency
4th 26
5th 15
6th 17
7th 4
8th 5
High School Material 1
Total 68
Subject Taught by TPEG Teachers Used in PSM Frequency
Math 18
Reading/Language Arts 23
Science 11
Social Studies 15
Algebra 1 1
Total 68
Notes: A teacher is included in any of these tables if they were identified as TPEG participants, were not missing obser-
vation or teacher value-added scores (TVAAS) from 2013–2014 or 2014–2015, and were not missing any control vari-
ables. Stated differently, if a TPEG teacher was used in at least one of the regressions that generated the findings in
Table 5, then we included them in Table 1.
TPEG surveys administered in Spring 2014 asked teachers how
often they engaged in these activities. Survey responses are used to answer our first
research question: To what extent do TPEG teachers engage in instruction-focused collab-
oration and deprivatized practice?
Eleven items were used to measure instruction-focused collaboration and six
measured deprivatized practice. All items included four responses: Never, 1–2 times per
semester, 1–2 times per month, and 1–2 times per week. Instruction-focused collaboration
items included a fifth option of Not Applicable. Survey items asked teachers about the
extent to which instruction-focused collaboration focused on: key ideas particular to a
unit or lesson, possible ways students solve particular problems, difficulties with individ-
ual student learning, joint lesson planning, sharing teaching materials, matching curricu-
lum to state standards, developing teacher understanding of the content, modifying
instruction, developing class activities to meet instructional objectives, creating
common homework assignments, and creating common tests or quizzes. Four items con-
cerning the deprivatization of practice asked teachers about the following activities: the
teacher observed other teachers teaching, the teacher was observed by another teacher,
the teacher received post-observation feedback from peers, and the teacher provided
post-observation feedback to peers.

Teacher performance and effectiveness


The TPEG theory of change asserts TPEGs formed by principals and teachers aim to
improve instruction and teacher effectiveness. We answer our second and third research
questions by estimating the impact of choosing to join a TPEG on teacher observations
and TVAAS scores.

Descriptive differences among TPEG and non-TPEG schools and teachers


Table 2 displays descriptive statistics for school-level and teacher-level matching variables
among TPEG and non-TPEG schools, TPEG and non-TPEG teachers within TPEG schools, and non-
TPEG teachers in matched non-TPEG schools. These descriptives are shared for two
reasons: to examine the similarity of TPEG and non-TPEG schools/teachers with respect
to observable characteristics, and to provide descriptives of TPEG participants so
readers can decide if our findings might transfer to their own setting.
The descriptive statistics in Table 2 suggest there are some important differences
between the typical TPEG and non-TPEG school. The top panel of Table 2 shows the
school-level mean student reading achievement score is slightly higher in TPEG schools,
which could positively bias subsequent estimates if we do not account for this difference
in the estimation procedure. The average TPEG school also tends to include more students
from disadvantaged socioeconomic and racial/ethnic backgrounds. Furthermore, the
average TPEG school includes more students eligible for FRPL, more Black and Hispanic stu-
dents, and fewer White students. If differences in the composition of the student body are
not accounted for in the estimation procedure, these differences may negatively bias sub-
sequent estimates. Most importantly, we have strong reason to believe the aforementioned
school-level differences (e.g., student achievement scores) affected which schools were
chosen for TPEG participation, as described in the preceding section on school selection.

Table 2. Descriptive statistics.


School-Level Baseline Matching Variables
TPEG Schools   Non-TPEG Schools
Average Teacher Observation Score 3.78 3.89
[0.131] [0.124]
(18) (1272)
Average Teacher TVAAS Score 1.80 1.23
[1.448] [20.048]
(18) (1129)
Average Student Math Score 0.08 0.00
[0.136] [0.142]
(18) (1162)
Average Student Reading Score 0.13 0.01
[0.135] [0.126]
(18) (1161)
Average Student Office Referrals 0.42 0.23
[0.716] [0.098]
(18) (1171)
Average Teacher LOE Score 379.52 382.56
[2013.09] [2146.124]
(18) (1268)
Total Teachers with Master Degree+ 21.12 17.84
[78.735] [123.565]
(17) (1220)
Average Teacher Years of Experience 11.60 12.67
[3.886] [10.387]
(18) (1286)
Total ESL Students in School 81.50 61.01
[6137.441] [13698.71]
(18) (1171)
Total FRPL Students in School 616.78 581.47
[693105.0] [289683.1]
(18) (1171)
Total Black Students in School 195.67 136.55
[68539.06] [75349.34]
(18) (1171)
Total Hispanic Students in School 70.89 59.77
[7690.222] [11633.4]
(18) (1171)
Total White Students in School 874.44 902.98
[1035977.0] [725396.7]
(18) (1171)
Total SPED Students in School 189.83 155.95
[35354.03] [22827.48]
(18) (1171)
Total Admins in School 2.44 1.98
[1.203] [1.109]
(18) (1147)
SD Observation Score 0.52 0.47
[0.028] [0.022]
(18) (1259)
SD TVAAS Score 3.42 5.31
[1.048] [15.36]
(18) (1124)
SD Math Score 0.92 0.90
[0.013] [0.012]
(18) (1160)
SD Reading Score 0.88 0.89
[0.005] [0.011]
(18) (1161)
SD Student Office Referrals 0.94 0.66
[1.885] [0.392]
(18) (1169)
SD LOE Score 53.98 53.54
[213.952] [262.884]
(18) (1255)
SD Teacher Experience 9.38 9.66
[0.848] [3.354]
(18) (1275)
Proportion of Survey Respondents within School Selecting: Observed by a 0.01 0.02
“Department Head” in 2012–2013 [<0.001] [0.002]
(18) (1221)
Proportion of Survey Respondents within School Selecting: Observed by an 0.03 0.04
“Instructional Coach” in 2012–2013 [0.001] [0.004]
(18) (1221)
Proportion of Survey Respondents within School Selecting: Observed by a “Mentor 0.03 0.01
Teacher” in 2012–2013 [0.005] [0.001]
(18) (1221)
Proportion of Survey Respondents within School Selecting: Engaged in “1–1 Work 0.03 0.03
with a Mentor” in 2012–2013 [0.001] [0.001]
(18) (1221)
Proportion of Survey Respondents within School Selecting: “Informally Consulted 0.11 0.10
with a Peer” in 2012–2013 [0.003] [0.004]
(18) (1221)
Proportion of Survey Respondents within School Selecting: “Observed a Peer” in 0.05 0.04
2012–2013 [0.001] [0.002]
(18) (1221)

Teacher-Level Baseline Matching Variables


TPEG Teachers, TPEG Schools   Non-TPEG Teachers, TPEG Schools   Non-TPEG Teachers, Matched Non-TPEG Schools
Observation Score 3.76 3.79 3.74
[0.281] [0.294] [0.378]
(113) (529) (2045)
TVAAS Score 1.97 1.50 0.84
[9.879] [15.561] [35.544]
(75) (150) (761)
Average Student Math 0.04 −0.13 −0.01
Score [0.292] [0.542] [0.354]
(93) (227) (891)
Average Student Reading 0.11 −0.06 −0.02
Score [0.183] [0.348] [0.33]
(91) (226) (957)
Average Student Office 0.48 0.77 0.47
Referrals [3.097] [1.406] [0.644]
(93) (227) (984)
LOE Scale 379.26 373.71 371.19
[5403.208] [3998.678] [5114.846]
(113) (522) (2036)
Masters+ 0.51 0.53 0.51
[0.252] [0.25] [0.25]
(108) (487) (2018)
Years of Experience 10.50 12.56 12.58
[88.535] [82.846] [91.876]
(114) (538) (2076)
Female 0.95 0.89 0.78
[0.05] [0.098] [0.173]
(114) (534) (2072)
White 0.90 0.89 0.95
[0.088] [0.096] [0.052]
(114) (539) (2077)
Black 0.09 0.09 0.05
[0.081] [0.08] [0.049]
(114) (539) (2077)
Age 39.70 42.30 43.17
[99.468] [118.143] [128.691]
(114) (526) (2049)
Total ESL Students Taught 3.90 3.67 5.08
[28.784] [16.106] [52.328]
(93) (227) (984)
Total FRPL Students 24.19 31.48 33.87
Taught [432.201] [705.171] [639.363]
(93) (227) (984)
Total Black Students 6.98 9.66 10.49
Taught [39.] [92.057] [199.343]
(93) (227) (984)
Total Hispanic Students 3.05 3.44 5.09
Taught [27.051] [19.009] [45.036]
(93) (227) (984)
Total White Students 38.59 42.28 49.67
Taught [807.049] [1126.124] [1175.169]
(93) (227) (984)
Total SPED Students 8.02 8.55 11.32
Taught [63.586] [58.187] [98.226]
(93) (227) (984)
SD Math Score 0.84 0.87 0.86
[0.039] [0.052] [0.039]
(92) (224) (881)
SD Reading Score 0.84 0.81 0.83
[0.031] [0.033] [0.036]
(91) (223) (948)
SD Student Office 0.68 1.42 1.02
Referrals [1.259] [3.177] [0.967]
(92) (224) (972)
Notes: Schools are the unit of analysis in the top panel; teachers are the unit in the bottom panel. Achievement scores
standardized within grade/subject. Masters+ is a binary variable indicating if the teacher had a Master’s degree or
higher. Variance in brackets, number of teachers in parentheses. TPEG = teacher peer excellence group; TVAAS =
teacher value-added scores; LOE = level of overall effectiveness; ESL = English as a second language; FRPL = free/
reduced-price lunch status; SPED = special education.

Because some of these important school selection determinants are imbalanced across the typical TPEG and non-TPEG school, it is vitally important to use an estimation procedure
that will find non-TPEG schools resembling TPEG schools regarding these determinants.
The bottom panel of Table 2 also suggests there are observable differences between
the typical TPEG (first column) and non-TPEG teacher (second and third columns).
While the typical TPEG school tends to include more disadvantaged students, the
typical TPEG teacher teaches fewer such students, which could positively bias subsequent estimates if
we do not control for these differences. At the same time, TPEG teachers teach fewer
incoming White students than the typical non-TPEG teacher, which could negatively
bias subsequent estimates if left unchecked.
TPEG teachers also differ from non-TPEG teachers with respect to prior-year student
achievement scores and office referrals. The mean math and RLA achievement scores
of the typical TPEG teacher are higher than those of the typical non-TPEG teacher. The
difference is greater when comparing TPEG teachers to non-TPEG teachers in TPEG
schools. Additionally, the mean number of teacher-level student office referrals for
TPEG teachers is lower than office referrals for non-TPEG teachers within TPEG schools.
These differences show that more “effective” teachers selected into TPEGs, which will positively
bias estimates if estimation procedures do not control for these differences.
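A conventional way to quantify the imbalances just described is the standardized mean difference (SMD) between TPEG and non-TPEG units. The sketch below assumes a teacher-level pandas DataFrame with a binary tpeg indicator and numeric baseline covariates; all file and column names are hypothetical, and the |SMD| > 0.1 flag is a common rule of thumb rather than a threshold taken from this study.

```python
import numpy as np
import pandas as pd

teachers = pd.read_csv("teacher_baseline.csv")  # hypothetical teacher-level file

def smd(df: pd.DataFrame, var: str, treat: str = "tpeg") -> float:
    """Standardized mean difference between treated and comparison groups."""
    a = df.loc[df[treat] == 1, var]
    b = df.loc[df[treat] == 0, var]
    pooled_sd = np.sqrt((a.var() + b.var()) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Flag baseline covariates whose imbalance exceeds the conventional 0.1 cutoff.
covariates = ["obs_score_2013", "tvaas_2013", "mean_math", "mean_reading"]
imbalanced = {v: smd(teachers, v) for v in covariates
              if abs(smd(teachers, v)) > 0.1}
print(imbalanced)
```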

Analytic strategy: three-step doubly robust estimation


We use doubly robust estimation to identify the impact of teachers self-selecting into a
TPEG on TVAAS and TEAM scores. Doubly robust estimation combines outcome regression
with models of the treatment selection process, where the latter typically involves a match-
ing procedure (e.g., nearest neighbor matching; Funk et al., 2011; Guo & Fraser, 2015).
Doubly robust regression is well suited for our study design as we know the process by
which schools were assigned to treatment. One of the authors, TDoE, and the partnering
non-government organization purposefully selected schools using observable character-
istics, which we obtain, enabling us to meet the “ignorability” assumption of propensity
score estimators. Matching methods cannot generate unbiased treatment effects unless
analysts know what determined selection into treatment and obtain treatment determi-
nants. By modeling treatment selection using obtained treatment determinants, analysts
remove the confounding differences between treated and untreated units (Cook et al.,
2008; Guo & Fraser, 2015; Rosenbaum & Rubin, 1985). Our knowledge of the school selec-
tion process is well suited to propensity score matching, allowing us to identify non-TPEG
schools that are equivalent to TPEG schools. We apply doubly robust estimation instead of
propensity score matching because incorrectly specified regression or propensity score
models produce biased estimates. Doubly robust estimation combines regression and pro-
pensity score matching so that only one of the two models needs correct specification to
generate unbiased estimates (Funk et al., 2011).
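One standard formalization of this double robustness property is the augmented inverse-probability-weighted estimator described by Funk et al. (2011). With treatment indicator T_i, outcome Y_i, covariates X_i, estimated propensity score ê(X_i), and outcome-model predictions m̂_1 and m̂_0 under treatment and control (notation ours, not the authors'), the estimator of the average treatment effect is

$$
\hat{\Delta}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i Y_i - (T_i - \hat{e}(X_i))\,\hat{m}_1(X_i)}{\hat{e}(X_i)} - \frac{(1 - T_i)\,Y_i + (T_i - \hat{e}(X_i))\,\hat{m}_0(X_i)}{1 - \hat{e}(X_i)}\right],
$$

which is consistent if either the propensity model ê or the outcome models m̂_1 and m̂_0 are correctly specified.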
As our study involved the assignment of schools to TPEG exposure, then the self-selec-
tion of teachers in TPEG schools into a self-contained TPEG, we apply a three-step doubly
robust estimation procedure to find (a) equivalent TPEG and non-TPEG schools, then find
(b) observably equivalent TPEG and non-TPEG teachers, before applying (c)
outcome regressions, similar to Henry et al. (2013).

Step 1
Figure 4 represents a stylized version of the available data, which we use to illustrate
our three-step procedure. Step 1 models school selection into the study and identifies
a set of non-TPEG schools that are equivalent to TPEG schools. The pool of potential
school matches excludes all non-TPEG schools in districts where there is at least one
TPEG school (i.e., excludes non-TPEG Schools A and B in Figure 4); all other schools
remain in sample. We exclude schools like non-TPEG Schools A and B as practices
from TPEG Schools 1 or 2 might have spilled over into Schools A and B. For
example, leaders in TPEG districts might have encouraged Schools A and B to adopt
TPEG-like practices, which may attenuate our estimates if we left Schools A and B in
the pool of potential matches since Schools A and B might have received partial
TPEG treatment.

Figure 4. Stylized matching procedure.


Note: X represents teacher peer excellence group (TPEG) participants, O represents non-participants.

We identify equivalent non-TPEG schools using logit regression and select four
matched units for each treated unit (i.e., 4–1 matching) using nearest neighbor matching
with replacement.6 In Step 1, schools are the unit of analysis and we match on school-level
baseline total number of Black, White, Hispanic, and FRPL students enrolled in each
school; average student math and reading achievement scores; and average teacher
TVAAS scores, the determinants of school selection. We must attain balance on these vari-
ables after Step 1 to meet the ignorability assumption.
To increase the precision of outcome regression estimates, we also match on several
other school-level baseline measures that plausibly relate to the outcomes of ultimate
interest in Step 3 (i.e., TVAAS and observation scores). We use these additional variables
in Step 1 to ensure consistency in variable use across all three steps. The remainder of
our Step 1 matching variables include average teacher observation and LOE scores,
average teacher level of education and years of experience, average student number
of office referrals, the total number of enrolled SPED and ESL students, the total
number of administrators in the school, and the standard deviations of average
teacher observation, TVAAS, and LOE scores and years of experience, and standard
deviations of average student math and reading scores, and office referrals.7 We also
include baseline TES responses from teachers about whether a department head,
instructional coach, or mentor teacher observed them or not, and responses about
whether the teacher engaged in one-on-one work with a mentor, informally consulted
with a peer, or observed a peer or not. We are less concerned about attaining balance
on these variables after Step 1 because the purpose of these variables is to serve as
controls in Step 3.
We identify matched non-TPEG schools using a 4–1 nearest neighbor procedure with repla-
cement. After identifying matched non-TPEG schools, we discard all unmatched non-TPEG
schools from the sample. If Step 1 matches TPEG Schools 1 and 2 to non-TPEG Schools C
and D, we would then discard Non-TPEG Schools E and F from the sample, having already
discarded Schools A and B.
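A minimal sketch of this step in Python, assuming the school-level baselines sit in a pandas DataFrame with a binary tpeg indicator (all file, column, and variable names here are hypothetical): fit a logit propensity model on the selection determinants, then match each TPEG school to its four nearest non-TPEG neighbors on the propensity score, with replacement.

```python
import pandas as pd
import statsmodels.api as sm

schools = pd.read_csv("school_baseline.csv")  # hypothetical school-level file

# School-selection determinants named in the text (hypothetical column names).
match_vars = ["n_black", "n_white", "n_hispanic", "n_frpl",
              "mean_math", "mean_reading", "mean_tvaas"]

# Logit model of TPEG selection yields a propensity score for every school.
X = sm.add_constant(schools[match_vars])
pscore = sm.Logit(schools["tpeg"], X).fit(disp=0).predict(X)

treated = schools.index[schools["tpeg"] == 1]
controls = schools.index[schools["tpeg"] == 0]

# 4-1 nearest neighbor matching with replacement on the propensity score;
# a non-TPEG school may serve as a match for more than one TPEG school.
matched = set()
for t in treated:
    dist = (pscore[controls] - pscore[t]).abs()
    matched.update(dist.nsmallest(4).index)

# Discard all unmatched non-TPEG schools, as in the text.
analytic_schools = schools.loc[list(treated) + sorted(matched)]
```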

Step 2
After Step 1, we only retain TPEG schools and matched non-TPEG schools in the sample.
Before implementing Step 2, we also discard all non-TPEG teachers from TPEG schools. In
Figure 4, X’s represent a teacher who participated in a TPEG, while O’s represent teachers
who did not participate in TPEGs. In our stylized example, the matching procedure
matches each X in TPEG Schools 1 and 2 to O’s from non-TPEG Schools C and D. We
do not include non-TPEG teachers in TPEG Schools (i.e., O’s from Schools 1 and 2) in
the pool of potential matches as we are concerned that the effects of TPEG might have
spilled over into the practices of teachers who did not participate in formal TPEGs. For
example, a TPEG teacher might have shared what she learned from TPEG participation
with a grade-level colleague who did not participate in TPEGs formally (i.e., an O in
School 1 or 2). Consequently, the grade-level colleague’s teaching might have improved.
If we retained the grade-level colleague in the pool of potential teacher matches, our esti-
mates might have been attenuated because the grade-level colleague effectively received
a partial exposure to TPEGs.
Step 2 applies a teacher-level logit regression to the new dataset and implements a 4–1
nearest neighbor matching procedure with replacement. We match on teacher-level baseline
observation, TVAAS, and LOE scores, teacher education level, gender, race, and age, baseline
math and reading scores of each teacher’s average student, baseline number of office refer-
rals for each teacher’s average student, the total number of SPED, ESL, FRPL, Black, Hispanic,
and White students taught by each teacher, and the standard deviation of baseline math and
reading scores and office referrals. Although we assume that each of these teacher-level base-
line measures plausibly affected teacher selection into TPEGs, we also assume that teachers
selected into TPEGs because they believed doing so would improve their practice, which has
implications for our interpretation of results.
If we met the ignorability assumption in Step 2, the doubly robust estimation pro-
cedure would generate estimates as good as those generated by a randomized controlled
trial that assigned teachers to TPEG participation. As teachers self-selected (i.e., volun-
teered) into TPEGs for reasons that we may not observe, our estimates may not
support inferences concerning the impact of assigning teachers to TPEGs on observation
and TVAAS scores. Instead, our estimates support inferences about the impact of teacher
self-selection into TPEGs on observation and TVAAS scores.

Step 3
Before implementing Step 3 of our procedure, we discard all unmatched non-TPEG tea-
chers from Step 2; thus, the sample includes TPEG schools and equivalent non-TPEG
schools, TPEG teachers and matched non-TPEG teachers from matched non-TPEG
schools. Step 3 applies teacher-level inverse probability of treatment weighted (IPTW)
regressions to the remaining analytical sample. IPTW regressions use propensity scores
(i.e., those from Step 2) to down-weight (up-weight) information from the matched
non-TPEG teachers who were least (most) similar to TPEG teachers (Guo & Fraser,
2015). We set the IPTW of TPEG teachers to one, estimating the average effects of self-
selection into TPEGs on the outcomes of interest (i.e., an average treatment effect on
the treated [ATT]; Guo & Fraser, 2015). We regress TVAAS scores on (a) whether the
teacher was in a TPEG or not, (b) all teacher-level matching variables from Step 2, and
(c) all school-level matching variables from Step 1 using ordinary least squares (OLS)
regression. We then repeat Step 3 using observation scores as the outcome.
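A compact sketch of this estimator follows, under the assumption of a matched teacher-level DataFrame with hypothetical columns ps (the Step 2 propensity score), tpeg, and school_id; controls would list every teacher-level and school-level matching variable. The odds-of-treatment weighting for comparison teachers and the school-clustered standard errors follow the description above and the notes to Table 5.

```python
# A sketch of the Step 3 doubly robust estimator; names are hypothetical.
import statsmodels.api as sm

def att_iptw_ols(sample, outcome, controls, treat="tpeg", ps="ps", school="school_id"):
    # ATT weights: TPEG teachers get weight one; comparison teachers are
    # weighted by the odds of treatment, ps / (1 - ps), so the most TPEG-like
    # comparison teachers count the most (Guo & Fraser, 2015).
    w = sample[ps] / (1.0 - sample[ps])
    w[sample[treat] == 1] = 1.0
    # Weighted OLS with every matching variable as a control is the doubly
    # robust specification: consistent if either the propensity model or the
    # outcome model is correctly specified (Funk et al., 2011).
    X = sm.add_constant(sample[[treat] + controls])
    fit = sm.WLS(sample[outcome], X, weights=w).fit(
        cov_type="cluster", cov_kwds={"groups": sample[school]})
    return fit.params[treat], fit.bse[treat]

# The next-year models (next subsection) reuse the same call, swapping in a
# hypothetical 2014-2015 outcome column, e.g., att_iptw_ols(sample, "tvaas_2015", controls).
```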

Next-year effects
We estimate if the effects of self-selecting into a TPEG in 2012–2013 persist into the next
school year using the same samples and right-hand side variables from Steps 1 to 3. The
only modification to models estimating next-year effects concerns the outcomes. We
replace the original outcomes, measured in 2013–2014, with teacher observation and
TVAAS scores measured in 2014–2015. These models generate estimates representing the
impact of teacher self-selection into TPEGs on next-year performance and effectiveness.

Findings
Instruction-focused collaboration and deprivatized practice
We examine survey responses from TPEG teachers (n = 68) to explore the extent to which
teachers implemented the TPEG theory of change. We find that TPEG teachers frequently
engaged in instruction-focused collaboration. Figures 5 to 7 display bar graphs capturing
the frequencies of TPEG teacher self-reported collaborative activities described in the
Measures section. At least 40% of respondents reported engaging in each instruction-focused
collaborative activity 1–2 times per week, with just one exception (no TPEG teachers
reported engaging in collaborative activities to work on key ideas particular to a lesson
or unit 1–2 times per week). At least 70% of respondents reported engaging in each
collaborative activity 1–2 times per month. Finally, no more than 7% of TPEG teachers
reported never engaging in a collaborative activity, with one exception: about 16% of
TPEG teachers reported never collaborating to design common homework assignments. It
appears that TPEGs create opportunities for teachers to share available instructional
resources and continuously align their inquiries with improvement targets – what the
theory of change calls for in an active community of practice.

Figure 5. Frequency of engagement in instruction-focused collaborative activities I (bar graph; n = 68 teachers).
Figure 6. Frequency of engagement in instruction-focused collaborative activities II (bar graph; n = 68 teachers).
Figure 7. Frequency of engagement in instruction-focused collaborative activities III (bar graph; n = 68 teachers; HW = homework).
Figure 8. Frequency of deprivatized practice (bar graph; n = 68 teachers).
Teachers deprivatized their practice less frequently than they engaged in instructional
collaboration. Figure 8 displays bar graphs representing the frequencies with which tea-
chers deprivatized their practices as discussed in the previous Measures section. Less than
10% of TPEG teachers reported engaging in a deprivatized practice 1–2 times per week.
However, at least 60% of TPEG teachers reported engaging in each deprivatized practice
1–2 times per month. In the Tennessee context, this is a relatively intense dosage of
observation, whether the observer is a peer or a formal evaluator. The typical Tennessee teacher is formally evaluated
twice per academic year (Alexander, 2016). Thus, even if TPEG teachers only observed
one another once per month, this sextuples the number of observations received by
the typical Tennessee teacher.

Comparability of matched non-TPEG schools and teachers


In the Analytic strategy section, we discussed the ignorability assumption underlying models
of treatment selection, which asserts that analysts must know and obtain treatment determi-
nants to produce unbiased estimates. The second component of the ignorability assumption
asserts that the distribution of propensity scores for non-treated units must overlap substan-
tially with the distribution of treated-unit propensity scores (Cook et al., 2008; Guo & Fraser,
2015; Rosenbaum & Rubin, 1985). Effectively, each distribution’s range must overlap substan-
tially; the two distributions do not need to have the same peakedness (i.e., kurtosis) or skew-
ness. If the two distributions’ ranges do not overlap substantially, it is difficult to find suitable
matches in the matching procedure.
Figure 9 displays box and whisker plots of the estimated propensity score distributions
from Steps 1 and 2. The left panel of Figure 9 displays the school-level propensity scores esti-
mated in Step 1. There is substantial overlap in the two distributions, despite the relatively
large propensity scores of three TPEG schools. The right panel of Figure 9 shows the box
and whisker plots of propensity scores estimated in Step 2, which overlap substantially.
While the overlap assumption concerns the overlap of propensity scores, analysts typi-
cally compare each matching variable’s mean and variance across treated and matched-
untreated units. Ideally, the means and ratio of variances of each matching variable
should balance across TPEG teachers and the matched non-TPEG units. Variance ratios
compare the dispersion of each matching variable among the matched comparison units
with its dispersion among TPEG schools and teachers; ratios far from one indicate
imbalance. Substantial remaining imbalances would imply that the matching procedure
failed to identify a group of comparable non-TPEG teachers. In this section, we present
evidence about covariate balance at the school and teacher levels.6
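One way to compute these diagnostics is sketched below (names are hypothetical; the SMD denominator uses a pooled-variance convention, one of several in use, and 0.20 is Cohen's (1988) threshold for a "small" difference):

```python
# Balance and common-support diagnostics for one matching variable, given
# arrays x (the variable), t (treatment indicator), and w (matching weights).
import numpy as np

def weighted_mean_var(x, w):
    m = np.average(x, weights=w)
    return m, np.average((x - m) ** 2, weights=w)

def balance_stats(x, t, w):
    """Absolute standardized mean difference and ratio of weighted variances."""
    m1, v1 = weighted_mean_var(x[t == 1], w[t == 1])
    m0, v0 = weighted_mean_var(x[t == 0], w[t == 0])
    smd = abs(m1 - m0) / np.sqrt((v1 + v0) / 2.0)  # pooled-SD convention
    return smd, v1 / v0  # flag SMD > 0.20 and variance ratios far from one

def common_support(ps, t):
    """Overlapping range of treated and comparison propensity scores."""
    lo = max(ps[t == 1].min(), ps[t == 0].min())
    hi = min(ps[t == 1].max(), ps[t == 0].max())
    return lo, hi  # a wide interval relative to each range indicates overlap
```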

Figure 9. Regions of common support (box and whisker plots of school- and teacher-level propensity scores).
Notes: Upper (lower) whiskers extend 1.5 times the interquartile range above the 75th (below the 25th) percentile. School-level (teacher-level) propensity scores estimated using one record per school (teacher) and school-level (teacher-level) variables only. Schools: non-TPEG = 1105, TPEG = 14. Teachers: non-TPEG = 662, TPEG = 68. IQR = interquartile range; PS = propensity score; TPEG = teacher peer excellence groups.

After Step 1, we attain very good balance of the means of school-level matching variables.
t tests for equality of post-matching means show all school-level covariates are balanced
(rightmost column, Table 3). We also assess the balance of means using the absolute
value of standardized mean differences (SMD). Unlike t tests, SMD are not influenced by
the size of analytical samples. There are four “small” school-level differences in the SMDs of
covariates: standard deviation of prior-year observation and prior-year LOE scores, and the
proportion of TDoE survey respondents indicating they were observed by an (a) instructional
coach or (b) mentor teacher in the previous year (Cohen, 1988). Fortunately, none of these
four covariates were used in the selection of pilot schools and thus pose little threat to the
credibility of estimates. Although there is more imbalance with respect to the ratio of
weighted variances, we are less concerned about these imbalances because the dispersions
(i.e., variance or standard deviations) of these variables did not influence school selection for
TPEG participation. Moreover, we control for these differences directly in Step 3 by using all
matching variables as control variables in outcome regressions.
There is excellent balance between the means of teacher-level variables after matching
in Step 2, but weights produced in Step 2 induced some imbalance among the means of
school-level variables of lesser importance. Although only teacher-level variables are used
in Step 2 matching, we assess the balance of school-level means after Step 2 matching
because we include school-level covariates in Step 3.8 Neither t tests nor SMDs found
imbalances among the means of any teacher-level matching variables (see top panel,
Table 4). However, the variances of several teacher-level variables differed across TPEG
and non-TPEG teachers. Table 4 also shows that TPEG schools enrolled fewer disadvantaged
students (ESL, Hispanic, SPED), had slightly less variation in TVAAS and math
scores, had less variation in RLA scores, and had a greater proportion of teachers
who reported being observed by mentor teachers. The greatest threat to our esti-
mates is likely the imbalance of school-level student demographics because these vari-
ables influenced school selection and plausibly relate to the outcomes of interest.
To the extent imbalances in Tables 3 or 4 are problematic, we control for these imbal-
ances directly in Step 3 because we use all teacher-level and all school-level matching
variables as control variables. Indeed, the ability to directly control for imbalances in
outcome regressions underscores the benefits of doubly robust estimation.

Effects on teacher observation scores and value-added scores


Our analysis suggests the observation scores and TVAAS scores of TPEG participants are
higher than what they would have been had these teachers not self-selected into a TPEG.
The top panel of Table 5 shows that self-selecting into a TPEG increased observation
scores by 0.13 units (0.25 SD) at the 1% level of significance. Self-selecting into a TPEG
increased TVAAS scores by 0.68 units (0.09 SD), which was significant at the 10% level
but not at the conventional 5% level.9 Collectively, this
evidence lends support to the usefulness of TPEGs as communities of practice aiming to
improve teacher performance and effectiveness.
The effect of joining a TPEG may persist into the future. Like the main findings, the next-
year outcomes of teachers who self-selected into TPEGs are higher than those of non-
TPEG teachers, but the magnitudes of these differences are smaller and not statistically
significant. The change in next-year observation scores of self-selecting TPEG teachers
Table 3. Covariate balance at school level after Step 1 of propensity score matching.

2012–2013 School-Level Baseline Matching Variable | TPEG Weighted Mean | Matched Weighted Mean | Absolute Value of Standardized Difference of Means | Ratio of Weighted Variances: TPEG to Matches | t test for Equality of Means
Average Teacher Observation Score | 3.718 | 3.676 | 0.14 | 0.586 | 0.41
Average Teacher TVAAS Score | 1.691 | 1.749 | 0.03 | 0.221^ | −0.09
Average Student Math Score | 0.048 | 0.079 | 0.089 | 1.11 | −0.26
Average Student Reading Score | 0.088 | 0.076 | 0.033 | 0.861 | 0.09
Average Student Office Referrals | 0.448 | 0.333 | 0.169 | 4.317 | 0.49
Average Teacher LOE Score | 372.99 | 366.989 | 0.143 | 0.596 | 0.41
Total Teachers with Master Degree+ | 21.118 | 21.794 | 0.057 | 0.394 | −0.17
Average Teacher Teaching Experience | 11.611 | 11.764 | 0.057 | 0.412 | −0.17
Total ESL Students in School | 86 | 102.176 | 0.107 | 0.154^ | −0.31
Total FRPL Students in School | 651.588 | 637.853 | 0.017 | 1.226 | 0.05
Total Black Students in School | 203.412 | 223.309 | 0.064 | 0.594 | −0.19
Total Hispanic Students in School | 74.824 | 96.574 | 0.147 | 0.218^ | −0.42
Total White Students in School | 911.882 | 891.691 | 0.019 | 0.96 | 0.06
Total SPED Students in School | 189.118 | 213.471 | 0.108 | 0.585 | −0.31
Total Admins in School | 2.471 | 2.426 | 0.038 | 0.919 | 0.11
SD Observation Score | 0.544 | 0.604 | 0.367* | 0.596 | −1.06
SD TVAAS Estimates | 3.394 | 3.589 | 0.097 | 0.158^ | −0.28
SD Math Score | 0.929 | 0.921 | 0.08 | 1.25 | 0.23
SD Reading Score | 0.883 | 0.893 | 0.119 | 0.62 | −0.34
SD Student Office Referrals | 0.976 | 0.838 | 0.121 | 3.163^ | 0.35
SD LOE Score | 56.64 | 61.314 | 0.287* | 0.21^ | −0.83
SD Teaching Experience | 9.334 | 9.211 | 0.096 | 0.353 | 0.28
Proportion of Survey Respondents within School Selecting: Observed by a “Department Head” in 2012–2013 | 0.01 | 0.01 | 0.024 | 0.664 | −0.07
Proportion of Survey Respondents within School Selecting: Observed by an “Instructional Coach” in 2012–2013 | 0.027 | 0.034 | 0.224* | 0.609 | −0.65
Proportion of Survey Respondents within School Selecting: Observed by a “Mentor Teacher” in 2012–2013 | 0.029 | 0.008 | 0.426* | 27.201^ | 1.24
Proportion of Survey Respondents within School Selecting: Engaged in “1–1 Work with a Mentor” in 2012–2013 | 0.027 | 0.027 | 0.007 | 1.046 | 0.02
Proportion of Survey Respondents within School Selecting: “Informally Consulted with a Peer” in 2012–2013 | 0.113 | 0.115 | 0.023 | 0.443 | −0.07
Proportion of Survey Respondents within School Selecting: “Observed a Peer” in 2012–2013 | 0.047 | 0.056 | 0.189 | 0.408 | −0.54

Notes: Schools are the unit of analysis. Achievement scores standardized within grade/subject. Masters+ is a binary variable indicating if the teacher had a Master’s degree or higher. Weights are the propensity score weights used in Step 1. TPEG = teacher peer excellence group; TVAAS = teacher value-added scores; LOE = level of overall effectiveness; ESL = English as a second language; FRPL = free/reduced-price lunch status; SPED = special education.
* > 0.20, Cohen’s rule of thumb for small differences (Cohen, 1988). No absolute standardized difference is greater than 0.50, Cohen’s rule of thumb for medium-sized differences. ^ 5% level using 13 degrees of freedom.
Table 4. Covariate balance at teacher and school level after Step 2 of propensity score matching.

Variable | TPEG Weighted Mean | Matched Weighted Mean | Absolute Value of Standardized Difference of Means | Ratio of Weighted Variances: TPEG to Matches | t test for Equality of Means

2012–2013 Teacher-Level Baseline Matching Variables
Observation Score | 3.849 | 3.795 | 0.103 | 0.602^ | 0.61
TVAAS Score | 1.934 | 1.984 | 0.012 | 0.41^ | −0.07
Average Student Math Score | 0.115 | 0.077 | 0.088 | 0.774 | 0.52
Average Student Reading Score | 0.141 | 0.112 | 0.066 | 0.452^ | 0.39
Average Student Office Referrals | 0.293 | 0.312 | 0.037 | 1.504 | −0.22
LOE Score | 402.793 | 398.7 | 0.061 | 0.847 | 0.36
Masters+ | 0.514 | 0.534 | 0.04 | 1.011 | −0.24
Years of Experience | 10.6 | 9.343 | 0.143 | 1.635^ | 0.85
Female | 0.914 | 0.92 | 0.021 | 1.073 | −0.12
White | 0.9 | 0.937 | 0.135 | 1.539 | −0.80
Black | 0.086 | 0.063 | 0.087 | 1.34 | 0.51
Age | 39.929 | 38.16 | 0.177 | 1.039 | 1.05
Total ESL Students Taught | 4.786 | 5.031 | 0.037 | 0.649 | −0.22
Total FRPL Students Taught | 29.171 | 29.923 | 0.034 | 0.902 | −0.20
Total Black Students Taught | 8.386 | 8.72 | 0.032 | 0.221^ | −0.19
Total Hispanic Students Taught | 3.729 | 3.894 | 0.03 | 1.167 | −0.18
Total White Students Taught | 44.8 | 45.331 | 0.018 | 0.804 | −0.10
Total SPED Students Taught | 9.486 | 9.38 | 0.014 | 1.35 | 0.08
SD Math Score | 0.832 | 0.841 | 0.049 | 0.934 | −0.29
SD Reading Score | 0.854 | 0.853 | 0.001 | 0.903 | 0.01
SD Student Office Referrals | 0.702 | 0.741 | 0.042 | 1.799^ | −0.25

2012–2013 School-Level Baseline Matching Variables
Average Teacher Observation Score | 3.753 | 3.698 | 0.19 | 0.575^ | 1.12
Average Teacher TVAAS Score | 1.52 | 1.29 | 0.136 | 0.212^ | 0.80
Average Student Math Score | 0.035 | 0.081 | 0.154 | 1.015 | −0.91
Average Student Reading Score | 0.087 | 0.088 | 0.003 | 1.005 | −0.02
Average Student Office Referrals | 0.383 | 0.386 | 0.006 | 2.622^ | −0.03
Average Teacher LOE Score | 374.925 | 368.761 | 0.16 | 0.779 | 0.95
Total Teachers with Master Degree+ | 23.557 | 25.7 | 0.161 | 0.261^ | −0.95
Average Teacher Teaching Experience | 11.936 | 11.735 | 0.086 | 0.682 | 0.51
Total ESL Students in School | 95.286 | 162.363 | 0.34* | 0.069^ | −2.00†
Total FRPL Students in School | 790.457 | 944.5 | 0.172 | 0.971 | −1.02
Total Black Students in School | 240.543 | 292.331 | 0.152 | 0.439^ | −0.90
Total Hispanic Students in School | 80.457 | 154.134 | 0.41* | 0.112^ | −2.41†
Total White Students in School | 1202.486 | 1344.706 | 0.121 | 1.052 | −0.71
Total SPED Students in School | 243 | 317.906 | 0.306* | 0.487^ | −1.80†
Total Admins in School | 2.9 | 2.694 | 0.175 | 0.722 | 1.03
SD Observation Score | 0.54 | 0.551 | 0.08 | 0.917 | −0.47
SD TVAAS Estimates | 3.635 | 4.146 | 0.257* | 0.217^ | −1.52
SD Math Score | 0.906 | 0.927 | 0.209* | 1.888^ | −1.24
SD Reading Score | 0.874 | 0.906 | 0.416* | 1.304 | −2.46†
SD Student Office Referrals | 0.896 | 1.019 | 0.122 | 2.104^ | −0.72
SD LOE Score | 58.46 | 59.582 | 0.086 | 0.536^ | −0.51
SD Teaching Experience | 9.192 | 9.227 | 0.031 | 0.522^ | −0.18
Proportion of Survey Respondents within School Selecting: Observed by a “Department Head” in 2012–2013 | 0.012 | 0.009 | 0.141 | 0.995 | 0.83
Proportion of Survey Respondents within School Selecting: Observed by an “Instructional Coach” in 2012–2013 | 0.025 | 0.026 | 0.046 | 0.661 | −0.27
Proportion of Survey Respondents within School Selecting: Observed by a “Mentor Teacher” in 2012–2013 | 0.024 | 0.007 | 0.456* | 15.285^ | 2.69†
Proportion of Survey Respondents within School Selecting: Engaged in “1–1 Work with a Mentor” in 2012–2013 | 0.024 | 0.022 | 0.079 | 0.992 | 0.47
Proportion of Survey Respondents within School Selecting: “Informally Consulted with a Peer” in 2012–2013 | 0.114 | 0.103 | 0.191 | 0.300^ | 1.13
Proportion of Survey Respondents within School Selecting: “Observed a Peer” in 2012–2013 | 0.049 | 0.047 | 0.043 | 0.639 | 0.25

Notes: Teachers are the unit of analysis in both panels. Achievement scores standardized within grade/subject. Masters+ is a binary variable indicating if the teacher had a Master’s degree or higher. Weights are the propensity score weights used in Step 2. TPEG = teacher peer excellence group; TVAAS = teacher value-added scores; LOE = level of overall effectiveness; ESL = English as a second language; FRPL = free/reduced-price lunch status; SPED = special education.
* > 0.20, Cohen’s rule of thumb for small differences (Cohen, 1988). No absolute standardized difference is greater than 0.50, Cohen’s rule of thumb for medium-sized differences. ^ 5% level using 67 degrees of freedom. †p < 0.10.
Table 5. Results of three-step propensity score matching procedure.

Outcome | TPEG | SE | N (Teachers) | N (TPEG Teachers) | Adjusted R-squared
2013–2014: Observation Score | 0.13** | [0.049] | 294 | 68 | 0.623
2013–2014: TVAAS | 0.68+ | [0.415] | 279 | 68 | 0.326
2014–2015: Observation Score | 0.08 | [0.069] | 241 | 68 | 0.501
2014–2015: TVAAS | 0.15 | [0.632] | 209 | 68 | 0.233

Notes: Teachers are the unit of analysis. Standard errors clustered at school level. Weights used in teacher-level matching from Step 2 serve as inverse probability of treatment weights. To estimate ATT, the weights of all TPEG teachers are set to one. TPEG = teacher peer excellence group; TVAAS = teacher value-added scores.
+p < 0.10. **p < 0.01.

is 0.08 units higher (0.04 SD) than that of non-TPEG participants and the change in next-
year TVAAS scores is 0.15 higher (0.02 SD); however, these differences are statistically
insignificant at both the 10% and the conventional 5% levels. Although part of the loss in stat-
istical significance is explained by larger standard errors, each next-year coefficient is also
at least 35% smaller than its 2013–2014 counterpart.

Discussion
Our study builds on previous work that associates teacher communities of practice with
changes in student-level achievement scores, school-level average student achievement
scores, and teacher self-reported changes in practice. We contribute to the PD and com-
munities of practice literature by examining the impact of teacher self-selection into
TPEGs, a specific type of teacher-led collaborative inquiry group, on teacher-specific obser-
vation scores and value-added scores.
Teachers in TPEGs set instructional goals using a standards-based observation rubric
for formal teacher evaluation and deprivatize their teaching by planning lessons colla-
boratively and engaging in peer observation and performance feedback. They inquire
into potential sources of positive and negative deviations from lesson plans, and use
knowledge from their inquiry to update lesson plans and teaching for greater success
in the specific contexts of public schools in Tennessee. The data imply that this specific
type of community of practice (i.e., TPEGs) improves teacher-specific observation scores
(at the 1% level of significance) and value-added scores (at the 10% level).
Our study extends the work by Saunders and colleagues (2009) on grade-level teams and
sheds light on what remains in the “black box” about what teachers do during their collab-
orations and the resulting outcomes (p. 1028). In some ways, our implementation
analyses suggest that teachers enacted TPEGs as they were designed. TPEG teachers
reported engaging in instruction-focused collaboration, especially in modifying
instruction, joint lesson planning, and developing an understanding of content. Although
they also reported engaging in deprivatized practices, the degree of deprivatization fell
short of full enactment. Given the policy context in Tennessee that underscored the
importance of instructional standards for teacher evaluation, we suspect that it was rela-
tively easier for teachers to stay focused on instruction by using the TEAM rubric for the
collaborative inquiry cycles. However, the context did not encourage deprivatization in
the same way. As described in our literature review, teachers in the US are accustomed
to maintaining the privacy of their teaching culture (Akiba & Wilkinson, 2016; Saunders
et al., 2009). Despite the emphasis on deprivatization conveyed through TPEG trainings
and protocols, the ingrained norms of teaching in isolation are difficult to change, and
there remained logistical and resource barriers in US public schools. To be clear, TPEG tea-
chers reported engaging in deprivatized practices, but not to the fullest extent, implying
that TPEG teacher practices can be deprivatized further.
It is unsurprising that teachers’ self-selection into a TPEG substantially improved their
observation scores when teachers used the TEAM observation rubric for goal setting to
drive their collaborative inquiry cycles. Additionally, TPEGs exhibit several characteristics
of effective PD (e.g., job-embedded, ongoing, relevant) that improve teacher practice. We
only detected changes in teacher TVAAS (value-added) scores at the 10% level of signifi-
cance, however. This may be a function of the measurement error inherent in value-added
measures, which inflates standard errors when treating value-added scores as outcomes
(Guarino, Maxfield, et al., 2015; Guarino, Reckase, & Wooldridge, 2015). However, large
standard errors alone do not account for the weak statistical significance of the effects
on TVAAS scores because the magnitude of changes is relatively small. These small
changes in TVAAS may be due to the specific instructional foci chosen by TPEG partici-
pants. Prior work finds variation in the relationships between gains in student achieve-
ment scores and specific teaching strategies, meaning that some teaching strategies
associate more strongly with achievement gains than other strategies (Kane et al.,
2011). Although observation scores measuring the instructional strategies described by
the TEAM standards-based observation rubric all positively relate to student achievement
scores, TPEG teachers might have chosen to focus on the instructional strategies with
weaker relationships to gains in student achievement (i.e., value-added scores).
While the next-year outcomes of teachers who self-selected into TPEGs are higher than
those of teachers who did not participate in TPEGs, our evidence also shows that the
effects of participating in a TPEG on teacher performance do not persist. This fading
effect of TPEGs could be related to the voluntary nature of the pilot study and the
reduced level of external technical assistance (i.e., training, feedback, monitoring) pro-
vided by the research–practice partnership for the participating schools after the 1st
year. This result is consistent with what Saunders et al. found in their 2009 study, that
a positive effect of teacher grade-level teams only occurred after the treatment schools
intensified the training and provided protocols for participating schools in the 3rd year.

Limitations
Our study may suffer from four broad limitations concerning measurement error, bias in
estimated effects on observation scores, attrition, and generalizability. Asking teachers to
identify their participation in TPEGs may introduce measurement error. Teachers who par-
ticipated in a TPEG might have misidentified as non-participants. However, the total
number of self-identified TPEG teachers matched principal-supplied lists of TPEG partici-
pants, so we are not concerned about this misidentification. In a second scenario, teachers
who did not participate in a TPEG might have indicated that they did so. As there was one
question about TPEG participation and a second about TPEG subject-matter focus, there is
little reason to believe that this type of measurement error was prevalent. Nonetheless,
if this type of measurement error were prevalent, it would attenuate the coefficients
representing the impact of teacher self-selection into a TPEG on the outcomes,
understating the true effects of self-selection into TPEGs. Consequently, we interpret
our estimates as conservative.
Another limitation is that we cannot definitively conclude that the positive relationship
with observation scores was driven entirely by TPEG self-selection. School administrators
would have known which teachers were and were not in TPEGs and might have (un)consciously
rated TPEG-teacher performance higher, independent of their observed performance. Although
we cannot definitively rule out the observer-bias possibility, the positive relationships
between TPEG self-selection and TVAAS scores at the 10% level suggest that the higher
observation scores were not entirely a function of observer bias. If the positive relation-
ship with subjectively generated observation scores was due entirely to observer bias,
we would not expect to see positive differences in objectively generated TVAAS scores.
Nonetheless, future research might use researcher-generated scores of video-recorded
teacher performance. Studies could be designed in which researcher scorers would not
know whether a teacher participated in a TPEG or not, removing observer bias driven
by observer knowledge of teacher TPEG participation.
Third, attrition might have biased our estimates. Some teachers might have joined a TPEG
but left before taking the survey we used to identify teacher participants. If attrition was sub-
stantial and reasons for remaining in a TPEG (e.g., motivation to improve) systematically cor-
related positively with our outcomes, the estimates might have been upwardly biased.
Remaining limitations concern generalizability. First, data were collected as part of a
small pilot program in which teachers received researcher-provided supports that may
not exist elsewhere. Though the TPEG pilot was intentionally designed to be teacher
formed and led and adaptive to school conditions, the role of the initial and ongoing
training and guidance from the research team warrants further attention as we note
the fading longer term impact of TPEG participation. Second, the majority of teachers
in the pilot taught intermediate grades, and our results may not generalize to other
grade bands. Finally, we cannot generalize to cases where teachers are assigned to TPEGs.

Implications
Our study extends prior work in the field of PD and teacher communities of practice. First, our
findings support inferences about expected changes in teacher-specific measures of perform-
ance and effectiveness. Second, our methods support inferences about the expected changes
in these outcomes if schools were to offer TPEGs as a PD alternative while allowing teachers to
self-select into a TPEG. In a similar study about collaborative inquiry, Saunders and colleagues
(2009) capture the impact of assigning teachers to grade-level teams on school-level average
student achievement scores. In studying the impact of the TPEG model, we situate teachers in
settings more typical of voluntary take-up of collaboration (i.e., we do not control for teacher
self-selection into TPEGs), which we believe provides district and school leaders with more
policy- and practice-relevant findings.
Policymakers might expect school leaders to offer TPEGs as a teacher PD opportunity, then
allow teachers to choose whether they want to participate in a TPEG or not. Our findings
imply that observation scores and value-added scores will improve among teachers who
choose to participate in a TPEG, more so than if they do not have the opportunity to partici-
pate. However, we suspect that offering TPEGs in conventional settings would be difficult if
newly introduced practices are not rooted in school or district culture. Moreover, the rapid
and positive shock to the instruction of TPEG teachers may fade away without ongoing gui-
dance and support for their communities of practice. Additionally, our findings imply that
principal preparation programs might train aspiring principals in the design of collaborative
inquiry cycles so that future school administrators are well positioned to form and support
these teacher professional learning opportunities.
In summary, our study takes steps towards examining how growth in teacher obser-
vation scores and value-added scores is related to how teachers work together to
improve instructional quality. Our findings support the notion that such improvement
is more likely when teachers share common instructional objectives, follow inquiry-
focused protocols, and deprivatize their practice. To further our understanding of key
factors that lead to instructional improvement, future research should capture the contex-
tual and structural nuances in how participation in new initiatives takes place, and the level
of support and enabling conditions at the school and district levels.

Notes
1. Saunders et al. (2009) present demographic information on treated and comparison schools
and claim that the two groups of schools are statistically similar, but we cannot find results
from any tests supporting this claim (e.g., no regressions, t tests).
2. Using printed descriptive statistics, we confirmed this claim independently. Although equiv-
alence in baseline grade-level achievement scores is essential, it does not rule out plausible
confounding school-level variables. If Saunders and colleagues (2009) knew that schools vol-
unteered for treatment because of their baseline grade-level achievement scores only, then
equivalence in these scores would control for the only reason that schools volunteered for
treatment or not. However, we cannot find any evidence that the authors knew why
schools chose to volunteer for treatment. It is plausible that school leaders did not volunteer
for study participation unless they believed that their teachers would benefit, introducing
positive bias.
3. See Hunter (2020) for additional details on Tennessee teacher observation policy.
4. The TVAAS scale ranges from −52.7 to 42.6 with a mean of −0.09 and standard deviation of 6.23.
5. Non-departmentalized elementary grade teachers may receive two TVAAS estimates: one for
math and one for RLA. In such cases, we take the mean TVAAS estimate.
6. While matching with replacement increases standard errors, it also decreases the potential for
biased estimation.
7. We assume improving teacher and student performance is more difficult when professional
or student learners are more diverse (i.e., school-level standard deviations of these perform-
ance measures are greater).
8. Moreover, the IPTW used in Step 3 are based on weights developed during Step 2; thus, the
balance of school-level covariates needs to be assessed using these weights and with tea-
chers as the unit of analysis.
9. We discuss the less conventional 10% level of statistical significance because the conse-
quence of a Type I error (i.e., false rejection of the hypothesis that self-selecting into a
TPEG does not affect TVAAS scores) is not severe. Prior work concludes that TPEG teachers
report higher engagement levels of instructional collaboration (Cravens et al., 2017).
Additionally, the current study finds teacher observation scores were positively affected by
teacher self-selection into TPEGs. Thus, self-selecting into a TPEG seems to yield valuable
benefits, even if participation does not improve TVAAS scores.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Xiu C. Cravens is an associate professor of the practice in Education Policy at the Department of Lea-
dership, Policy, and Organizations. Her scholarly work involves qualitative and quantitative analyses
of reform policies that are particularly related to the organizational and cultural contexts of schools
in the United States and other countries, the role of instructional leaders in a changing policy
environment, promising practices in professional development, and the conceptual and methodo-
logical challenges of cross-cultural transfer and generalization of leadership theories and their
applications.
Seth B. Hunter is an assistant professor of Education Leadership at George Mason University. His
research interests include the intersection of educator (i.e., teacher, principal) professional devel-
opment and evaluation, educator observation systems and practices, and teacher leadership. To
explore these topics Dr Hunter primarily applies econometric techniques to large-scale non-
experimental data. Some of his work employs psychometric or qualitative methods.

ORCID
Xiu C. Cravens http://orcid.org/0000-0002-3077-9313
Seth B. Hunter http://orcid.org/0000-0002-3051-872X

References
Akiba, M., LeTendre, G. K., & Scribner, J. P. (2007). Teacher quality, opportunity gap, and national
achievement in 46 countries. Educational Researcher, 36(7), 369–387. https://doi.org/10.3102/
0013189X07308739
Akiba, M., & Wilkinson, B. (2016). Adopting an international innovation for teacher professional
development state and district approaches to lesson study in Florida. Journal of Teacher
Education, 67(1), 74–93. https://doi.org/10.1177/0022487115593603
Alexander, K. (2016). TEAM Evaluator Training 2016-17 [Certification Training]. https://team-tn.org/
wp-content/uploads/2013/08/TEAM-Teacher-Training-2016_FINAL_PDF.pdf
Archibald, S., Coggshall, J. G., Croft, A., & Goe, L. (2011). High-quality professional development for all
teachers: Effectively allocating resources [Research & Policy Brief]. National Comprehensive Center
for Teacher Quality.
Bryk, A. S., Gomez, L. M., Grunow, A., & LeMahieu, P. G. (2015). Learning to improve: How America’s
schools can get better at getting better. Harvard Education Press.
Buysse, V., Sparkman, K. L., & Wesley, P. W. (2003). Communities of practice: Connecting what we
know with what we do. Exceptional Children, 69(3), 263–277. https://doi.org/10.1177/
001440290306900301
Chaplin, D., Gill, B., Thompkins, A., & Miller, H. (2014). Professional practice, student surveys, and value-
added: Multiple measures of teacher effectiveness in the Pittsburgh public schools (REL 2014-024).
U.S. Department of Education, Institute of Education Sciences, National Center for Education
Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic.
Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014). Measuring the impacts of teachers II: Teacher value-
added and student outcomes in adulthood. American Economic Review, 104(9), 2633–2679.
https://doi.org/10.1257/aer.104.9.2633
Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2007). Teacher credentials and student achievement:
Longitudinal analysis with student fixed effects. Economics of Education Review, 26(6), 673–682.
https://doi.org/10.1016/j.econedurev.2007.10.002
Coburn, C. E., & Russell, J. L. (2008). District policy and teachers’ social networks. Educational
Evaluation and Policy Analysis, 30(3), 203–235. https://doi.org/10.3102/0162373708321829
Coburn, C. E., Russell, J. L., Kaufman, J. H., & Stein, M. K. (2012). Supporting sustainability: Teachers’
advice networks and ambitious instructional reform. American Journal of Education, 119(1), 137–
182. https://doi.org/10.1086/667699
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum
Associates.
Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observa-
tional studies produce comparable causal estimates: New findings from within-study comparisons.
Journal of Policy Analysis and Management, 27(4), 724–750. https://doi.org/10.1002/pam.20375
Cravens, X., Drake, T. A., Goldring, E., & Schuermann, P. (2017). Teacher peer excellence groups
(TPEGs): Building communities of practice for instructional improvement. Journal of
Educational Administration, 55(5), 526–551. https://doi.org/10.1108/JEA-08-2016-0095
Cravens, X., & Wang, J. (2017). Learning from the masters: Shanghai’s teacher-expertise infusion
system. International Journal for Lesson and Learning Studies, 6(4), 306–320. https://doi.org/10.
1108/IJLLS-12-2016-0061
Daley, G., & Kim, L. (2010). A teacher evaluation system that works. National Institute for Excellence in
Teaching.
Danielson, C. (2007). Enhancing professional practice: A framework for teaching (2nd ed.). ASCD.
Darling-Hammond, L. (2013). Getting teacher evaluation right: What really matters for effectiveness
and improvement. Teachers College Press.
Darling-Hammond, L., & Youngs, P. (2002). Defining “highly qualified teachers”: What does “scien-
tifically-based research” actually tell us? Educational Researcher, 31(9), 13–25. https://doi.org/10.
3102/0013189X031009013
Desimone, L. M. (2009). Improving impact studies of teachers’ professional development: Toward
better conceptualizations and measures. Educational Researcher, 38(3), 181–199. https://doi.
org/10.3102/0013189X08331140
Doan, S. (2019). What do classroom observation scores tell us about student success? Capturing the
impact of teachers using at-scale classroom observation scores [Doctoral dissertation, Vanderbilt
University]. https://ir.vanderbilt.edu/handle/1803/15448
Funk, M. J., Westreich, D., Wiesen, C., Stürmer, T., Brookhart, M. A., & Davidian, M. (2011). Doubly
robust estimation of causal effects. American Journal of Epidemiology, 173(7), 761–767. https://
doi.org/10.1093/aje/kwq439
Gallimore, R., Ermeling, B. A., Saunders, W. M., & Goldenberg, C. (2009). Moving the learning of teach-
ing closer to practice: Teacher education implications of school-based inquiry teams. The
Elementary School Journal, 109(5), 537–553. https://doi.org/10.1086/597001
Goddard, Y. L., Goddard, R. D., & Tschannen-Moran, M. (2007). A theoretical and empirical investi-
gation of teacher collaboration for school improvement and student achievement in public
elementary schools. Teachers College Record, 109(4), 877–896.
Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P.
(2015). Make room value added: Principals’ human capital decisions and the emergence of
teacher observation data. Educational Researcher, 44(2), 96–104. https://doi.org/10.3102/
0013189X15575031
Goldring, E., Porter, A., Murphy, J., Elliott, S. N., & Cravens, X. (2009). Assessing learning-centered lea-
dership: Connections to research, professional standards, and current practices. Leadership and
Policy in Schools, 8(1), 1–36. https://doi.org/10.1080/15700760802014951
Grossman, P., Compton, C., Shahan, E., Ronfeldt, M., Igra, D., & Shaing, J. (2007). Preparing prac-
titioners to respond to resistance: A cross-professional view. Teachers and Teaching: Theory and
Practice, 13(2), 109–123. https://doi.org/10.1080/13540600601152371
Grossman, P., Loeb, S., Cohen, J., & Wyckoff, J. (2013). Measure for measure: The relationship
between measures of instructional practice in middle school English language arts and teachers’
value-added scores. American Journal of Education, 119(3), 445–470. https://doi.org/10.1086/
669901
Grossman, P., & McDonald, M. (2008). Back to the future: Directions for research in teaching and
teacher education. American Educational Research Journal, 45(1), 184–205. https://doi.org/10.
3102/0002831207312906
Grossman, P., Wineburg, S., & Woolworth, S. (2001). Toward a theory of teacher community. Teachers
College Record, 103(6), 942–1012.
Guarino, C. M., Maxfield, M., Reckase, M. D., Thompson, P. N., & Wooldridge, J. M. (2015). An evalu-
ation of empirical Bayes’s estimation of value-added teacher performance measures. Journal of
Educational and Behavioral Statistics, 40(2), 190–222. https://doi.org/10.3102/1076998615574771
Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2015). Can value-added measures of teacher per-
formance be trusted? Education Finance and Policy, 10(1), 117–156. https://doi.org/10.1162/
EDFP_a_00153
Guo, S., & Fraser, M. W. (2015). Advanced quantitative techniques in the social sciences: Vol. 11.
Propensity score analysis: Statistical methods and applications (2nd ed.). SAGE Publications.
Henry, G. T., Smith, A. A., Kershaw, D. C., & Zulli, R. A. (2013). Formative evaluation: Estimating pre-
liminary outcomes and testing rival explanations. American Journal of Evaluation, 34(4), 465–485.
https://doi.org/10.1177/1098214013502577
Hiebert, J., Gallimore, R., & Stigler, J. W. (2002). A knowledge base for the teaching profession: What
would it look like and how can we get one? Educational Researcher, 31(5), 3–15. https://doi.org/
10.3102/0013189X031005003
Hiebert, J., Stigler, J. W., & Manaster, A. B. (1999). Mathematical features of lessons in the TIMSS Video
Study. ZDM Mathematics Education, 31(6), 196–201. https://doi.org/10.1007/BF02652695
Hill, H. C., Beisiegel, M., & Jacob, R. (2013). Professional development research: Consensus, cross-
roads, and challenges. Educational Researcher, 42(9), 476–487. https://doi.org/10.3102/
0013189X13512674
Huang, R., & Shimizu, Y. (2016). Improving teaching, developing teachers and teacher educators,
and linking theory and practice through lesson study in mathematics: An international perspec-
tive. ZDM Mathematics Education, 48(4), 393–409. https://doi.org/10.1007/s11858-016-0795-7
Hunter, S. B. (2020). The unintended effects of policy-assigned teacher observations: Examining the
validity of observation scores. AERA Open, 6(2). https://doi.org/10.1177/2332858420929276
Jensen, B., Sonnemann, J., Roberts-Hull, K., & Hunter, A. (2016). Beyond PD: Teacher professional learn-
ing in high-performing systems. National Center on Education and the Economy.
Kane, T. J., Taylor, E. S., Tyler, J. H., & Wooten, A. L. (2011). Identifying effective classroom practices
using student achievement data. Journal of Human Resources, 46(3), 587–613. https://doi.org/10.
3368/jhr.46.3.587
Kraft, M. A., & Blazar, D. (2017). Individualized coaching to improve teacher practice across grades
and subjects: New experimental evidence. Educational Policy, 31(7), 1033–1068. https://doi.org/
10.1177/0895904816631099
Lave, J., & Wenger, E. (1991). Situated learning: Legitimate peripheral participation. Cambridge
University Press.
Lee, V. E., & Smith, J. B. (1996). Collective responsibility for learning and its effects on gains in
achievement for early secondary school students. American Journal of Education, 104(2), 103–
147. https://doi.org/10.1086/444122
Leithwood, K., Louis, K. S., Anderson, S., & Wahlstrom, K. (2004). Review of research: How leadership
influences student learning. The Wallace Foundation.
Levine, T. H., & Marcus, A. S. (2010). How the structure and focus of teachers’ collaborative activities
facilitate and constrain teacher learning. Teaching and Teacher Education, 26(3), 389–398. https://
doi.org/10.1016/j.tate.2009.03.001
Lewis, C. (2015). What is improvement science? Do we need it in education? Educational Researcher,
44(1), 54–61. https://doi.org/10.3102/0013189X15570388
Lewis, C., Perry, R., & Murata, A. (2006). How should research contribute to instructional improve-
ment? The case of lesson study. Educational Researcher, 35(3), 3–14. https://doi.org/10.3102/
0013189X035003003
Little, J. W. (2002). Professional community and the problem of high school reform. International
Journal of Educational Research, 37(8), 693–714. https://doi.org/10.1016/S0883-0355(03)00066-1
Louis, K. S., Marks, H. M., & Kruse, S. (1996). Teachers’ professional community in restructuring
schools. American Educational Research Journal, 33(4), 757–798. https://doi.org/10.3102/
00028312033004757
McLaughlin, M. W., & Talbert, J. E. (2001). Professional communities and the work of high school teach-
ing. University of Chicago Press.
Morris, A. K., & Hiebert, J. (2009). Introduction: Building knowledge bases and improving systems of
practice. The Elementary School Journal, 109(5), 429–441. https://doi.org/10.1086/596994
National Staff Development Council. (2001). National Staff Development Council’s standards for staff
development (Rev. ed.). https://gtlcenter.org/sites/default/files/docs/pa/3_PDPartnershipsand
Standards/NSDCStandards_No.pdf
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational
Evaluation and Policy Analysis, 26(3), 237–257. https://doi.org/10.3102/01623737026003237
Organisation for Economic Co-operation and Development. (2011). Strong performers and successful
reformers in education: Lessons from PISA for the United States. https://doi.org/10.1787/
9789264096660-en
Palardy, G. J., & Rumberger, R. W. (2008). Teacher effectiveness in first grade: The importance of
background qualifications, attitudes, and instructional practices for student learning.
Educational Evaluation and Policy Analysis, 30(2), 111–140. https://doi.org/10.3102/
0162373708317680
Palincsar, A. S., Magnusson, S. J., Marano, N., Ford, D., & Brown, N. (1998). Designing a community of
practice: Principles and practices of the GIsML community. Teaching and Teacher Education, 14(1),
5–19. https://doi.org/10.1016/S0742-051X(97)00057-7
Papay, J. P., Taylor, E. S., Tyler, J. H., & Laski, M. E. (2020). Learning job skills from colleagues at work:
Evidence from a field experiment using teacher performance data. American Economic Journal:
Economic Policy, 12(1), 359–388. https://doi.org/10.1257/pol.20170709
Ronfeldt, M., Farmer, S. O., McQueen, K., & Grissom, J. A. (2015). Teacher collaboration in instructional
teams and student achievement. American Educational Research Journal, 52(3), 475–514. https://
doi.org/10.3102/0002831215585562
Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched
sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33–
38. https://doi.org/10.1080/00031305.1985.10479383
Saunders, W. M., Goldenberg, C. N., & Gallimore, R. (2009). Increasing achievement by focusing
grade-level teams on improving classroom learning: A prospective, quasi-experimental study
of Title I schools. American Educational Research Journal, 46(4), 1006–1033. https://doi.org/10.
3102/0002831209333185
Stigler, J. W., & Hiebert, J. (2009). The teaching gap: Best ideas from the world’s teachers for improving
education in the classroom. Free Press.
Stronge, J. H., Ward, T. J., Tucker, P. D., & Hindman, J. L. (2007). What is the relationship between
teacher quality and student achievement? An exploratory study. Journal of Personnel
Evaluation in Education, 20(3–4), 165–184. https://doi.org/10.1007/s11092-008-9053-z
Supovitz, J. A. (2002). Developing communities of instructional practice. Teachers College Record, 104
(8), 1591–1626.
Tennessee Department of Education. (2016a). Teacher evaluation. https://team-tn.org/teacher-
evaluation-2-2/
Tennessee Department of Education. (2016b). TVAAS. https://team-tn.org/data/tvaas/
Tucker, M. S. (Ed.). (2014). Chinese lessons: Shanghai’s rise to the top of the PISA league tables. National
Center on Education and the Economy.
Wang, J. (2013, March 10–12). Introduction of school-based teacher professional development in China
[Paper presentation]. Asia Leadership Roundtable, Shanghai, China.
Wenger, E. (2010). Communities of practice and social learning systems: The career of a concept. In
C. Blackmore (Ed.), Social learning systems and communities of practice (pp. 179–198). Springer.
Wenger, E., McDermott, R., & Snyder, W. M. (2002). Cultivating communities of practice: A guide to
managing knowledge. Harvard Business School Press.
Youngs, P., & King, M. B. (2002). Principal leadership for professional development to build school
capacity. Educational Administration Quarterly, 38(5), 643–670. https://doi.org/10.1177/
0013161X02239642

Appendix 1. General educator rubrics

General Educator Rubric: Planning

General Educator Rubric: Environment



General Educator Rubric: Instruction



Appendix 2. Teacher Peer Excellence Group (TPEG) inquiry cycle

