
Journal of Mathematics Teacher Education

https://doi.org/10.1007/s10857-019-09445-0

Classroom observation and mathematics education research

Jonathan Bostic1   · Kristin Lesseig2 · Milan Sherman3 · Melissa Boston4

© Springer Nature B.V. 2019

Abstract
Classroom observations have become an integral part of research related to mathematics
education. In this qualitative study, we describe the current state of the mathematics educa-
tion field with regard to the use of classroom observation. The research question was: How
is classroom observation being used to measure instructional quality in mathematics edu-
cation research? In all, 114 peer-reviewed manuscripts published between 2000 and 2015
that involved classroom observation as part of an empirical study were examined using a
cross-comparative methodology. Seventy (61%) did not use a formalized classroom obser-
vation protocol (COP), 21 (18%) developed their own COP, and 23 (20%) used a previ-
ously developed COP. Of the implemented COPs, 44% have published validity evidence
in a peer-reviewed journal. We perceive the great variety of research approaches for class-
room observation as necessary and potentially challenging in moving mathematics educa-
tion forward with respect to research on instructional contexts.

Keywords  Classroom observation · Instruction · Qualitative · Validity

Introduction

Classroom observations have become an integral part of mathematics education research


for several important reasons. First, the nature and quality of classroom instruction mat-
ters, as student learning is highly dependent on the mathematical opportunities made
available in the classroom (Donovan and Bransford 2005; National Council of Teach-
ers of Mathematics [NCTM] 2014). Teacher and student actions and interactions are
among the many factors that significantly impact student learning. As such, research-
ers need methods of capturing, analyzing, and understanding the events that unfold in mathematics classrooms.

Jonathan Bostic, Kristin Lesseig, Milan Sherman, and Melissa Boston have contributed equally to this
manuscript.

* Jonathan Bostic
bosticj@bgsu.edu
1 Bowling Green State University, Bowling Green, OH, USA
2 Washington State University Vancouver, Vancouver, WA, USA
3 Drake University, Des Moines, IA, USA
4 Duquesne University, Pittsburgh, PA, USA


Second, a high degree of variability exists in instructional quality within and across districts, schools, and classrooms (Hiebert et al. 2005). Iden-
tifying and describing this variability is important for ensuring that every student has
the opportunity to learn mathematics, as lower-quality mathematics instruction is often
associated with schools serving low-income rural and urban communities (Lubienski
2008). Continuing to monitor students’ opportunities to learn mathematics in all types
of schools and classrooms is important for allocating resources and determining the
focus of instructional improvements. Third, what is measured takes on value. As class-
room observations become a standard part of accountability measures at the district,
state, and national level [e.g., teacher evaluations based on Danielson’s (2013) “Frame-
work for Teaching”], classroom instruction (and the work of teachers) takes on greater
value. Classroom observations offer direct measures of instructional quality with the
potential to supplement or supplant indirect measures based on student achievement
data or growth models (Boston et al. 2015a, b; Schlesinger and Jentsch 2016). Hence,
the ways we operationalize “quality” teaching in observation tools can help reform
mathematics teaching and increase the professionalism of the field.
Different ways of operationalizing instructional quality have led to the creation of a
variety of classroom observation tools and processes (Boston et al. 2015a, b; Charalam-
bous and Praetorius 2018; Schlesinger and Jentsch 2016). As noted by Charalambous
and Praetorius (2018), different observation tools: (a) “illuminate certain instructional
aspects but leave others less visible” (p. 335); (b) name the same construct using dif-
ferent terms; and (c) name different constructs using the same terms. As such, select-
ing an observation tool for use in mathematics education research can be problematic,
especially in the absence of an overview of existing tools. A review of relevant tools has
power to galvanize mathematics teacher education scholarship through more frequent
use of a smaller set of tools. Such a review may also shed light on variables in instruc-
tional quality that are unexplored and require new instruments.
As demonstrated in recent publications, supporting mathematics education research-
ers in selecting and using observation tools has taken on prime importance in the field
(e.g., Boston et  al. 2015a, b; Charalambous and Praetorius 2018; Kane and Staiger
2012; Schlesinger and Jentsch 2016). For instance, a special issue of ZDM: Mathemat-
ics Education in March 2018 features a survey of 12 tools by editors Charalambous and
Praetorius (2018) and chapters by the developers of each of the tools. The purpose of
the special issue is to identify the nexus or synergy between tools. Similarly, Schles-
inger and Jentsch (2016) discuss 11 classroom observation tools and their methodologi-
cal and theoretical underpinnings. Schlesinger and Jentsch’s work provides readers with
an idea of what various tools measure, but as the authors indicate, their list is not comprehensive and does not indicate the research purposes of the studies using each
tool. Other publications have addressed issues of reliability (e.g., Hill et al. 2012) and
validity (e.g., Bostic 2017, 2018; Bostic et al. 2019) in relation to classroom observation
tools and research.
In this manuscript, we address the problem of selecting an observation tool for
mathematics education research by providing an overview of those used in published
research between 2000 and 2015 and hence widely accessible to mathematics educa-
tion researchers. The research question guiding this study is: How is classroom obser-
vation being used to measure instructional quality in mathematics education research?
Our purpose in this manuscript is to describe and analyze the mathematics education
field with regard to the use of classroom observations in recent mathematics education
research.


Literature review

Classroom observation

The use of classroom observations in mathematics education research has changed sub-
stantially over time, with an increase in the complexity of how classroom observations are
used as research data. In the discussion that follows, we describe how classroom observa-
tion research has evolved over time. While we are taking a somewhat historical perspec-
tive, this progression is not meant to represent an actual timeline, as one form of research
did not wholly replace others and each form of research continues to have relevant and
important uses. This progression also conveys a shift in the degree to which researchers
have captured the quality or level of implementation of particular instructional practices.
Take as a starting point research that associates the amount of some measurable attrib-
ute of classroom instruction with student learning outcomes. For example, process–prod-
uct research (that occurs inside classrooms) often identifies measurable teaching behaviors
(e.g., wait time, number of questions) and associates these behaviors with student learning
outcomes. Historically, Brophy (1986) provides an example of process–product research
that connected a number of teacher behaviors with greater student achievement. From pro-
cess–product research in general, researchers could make claims that connected teachers
exhibiting more/less of some behavior with improved student achievement.
Next, consider research that associates the presence of an initiative or intervention
with student learning outcomes. Studies in this research tradition investigate the impact of
instructional innovations (e.g., new curricula, cooperative learning, or technology) on stu-
dent learning when the innovation is “in place,” without deeply investigating how the inno-
vation is being implemented or how differences in the quality of implementation appear to
impact students’ learning. Specific examples include research on the effectiveness of cog-
nitive tutors and intelligent tutoring systems synthesized by Slavin, Lake, and Groff (2009)
and Steenbergen-Hu and Cooper (2013), as well as studies on the effectiveness of various
mathematics curricula featured in the “What Works Clearinghouse” (U.S. Department of
Education 2012, 2013).
Some studies assess changes in classroom practice based on survey or self-report.
Throughout the 1990s, research on the effectiveness of professional development initiatives
frequently used teachers’ self-reports, surveys, and written artifacts as evidence of changes
or developments in teachers’ instructional practices (e.g., Borasi et al. 1999; Farmer et al.
2003; Swafford et al. 1997). In many of these studies, teachers described changes in their
beliefs and practice closely aligned with the main goals and tenets of the professional
development, and teachers themselves attributed these changes and developments in prac-
tice to their participation in the professional development initiatives. Hence, claims regard-
ing teachers’ practice made in these studies, based on teachers’ self-reports, appear to be
reasonable and justifiable. Surveys have also been shown to reliably capture the presence
or quantity of aspects of classroom practice, such as the Survey of Enacted Curriculum
(Blank et al. 2001). However, surveys and self-reports are less effective for assessing the
quality of teachers’ practice, as teachers’ meanings and interpretations of the terms and
phrases used to describe instructional practices (e.g., what it means for students to engage
in “problem solving”) may be different from researchers’ meanings and may change over
time (Ball and Rowan 2004; Le et al. 2009). Studies attending to the amount or presence of
teacher behaviors or instructional innovations through observation or teachers’ self-report
offer important results, but provide limited insight into how the instructional behaviors or


innovations are being used. They do not provide information regarding the nature of inter-
actions between teachers and students in the process of teaching and learning mathematics.
We consider studies that report on the level of quality or implementation of instruc-
tion as the next step in the progression, with differences in the role that classroom obser-
vations play as a source of research data and differences in how various tools or studies
conceptualize “instructional quality.” Historically, several professional development stud-
ies that utilized teachers’ self-report or surveys as the main source of data also conducted
informal classroom observations, where classroom observations were conducted but not
systematically analyzed as a source of empirical data. Instead, classroom observations pro-
vided anecdotal examples to support claims or provide concrete descriptions of changes
or developments in teachers’ instructional practices (e.g., Borasi et al. 1999; Farmer et al.
2003). For example, SummerMath for Teachers (Schifter and Simon 1992) and the Edu-
cational Leaders in Mathematics Project (Simon and Schifter 1991) observed one lesson
per week in teachers’ classrooms and conducted post-observation interviews in the school
year following teachers’ participation in each project; however, evidence of enhanced
teacher knowledge and practice was comprised of teachers’ writings and post-observation
interviews.
In contrast, other studies utilize formal observations that are conducted and analyzed
empirically and systematically. Within this group, some observation protocols provide a
single or small set of overall holistic scores that encapsulate an entire lesson. Saxe et al.
(1999) used two main indicators of ambitious instruction: (1) the degree to which class-
room practices elicit and build on students’ thinking; and (2) the extent to which concep-
tual issues were addressed in problem solving. Schoen et  al. (2003) examined the extent
to which the observed teaching was aligned with a set of criteria representing ambitious
instruction (i.e., teachers used open-ended questioning; students monitored their own work)
and determined one overall holistic rating (excellent, good, fair, or poor) to characterize
the observed lesson. The use of a single rating provides a broad assessment of the quality
of instruction, as defined by the researchers. An overall rating is useful for correlations,
for example, associating levels of instructional quality with student achievement outcomes.
However, an overall score would not provide sufficiently detailed information to identify
the nuances of instruction that may impact students’ learning of mathematics or to suggest
pathways for instructional improvement.
Formal observations rated with sets of indicators have the potential to capture mul-
tiple aspects of instructional quality. Observations are collected and analyzed systemati-
cally, often following a process of rater training to ensure interrater agreement or reliabil-
ity. Some tools in this category are content general, such as Danielson’s Framework for
Teaching (FFT; 2013) and the Classroom Assessment Scoring System (CLASS; Hamre
et al. 2012). Other tools are specific to mathematics classrooms, but applicable to a wide
range of research purposes, including the Mathematical Quality of Instruction (MQI; Hill et al.
2012), Instructional Quality Assessment (IQA; Boston 2012a, b; Matsumura et al. 2008),
and Mathematics Scan (MSCAN; Berry et al. 2010). Projects also develop their own class-
room observation tools for specific purposes. For example, the Middle School Mathemat-
ics Study (Tarr et  al. 2008) developed an observation protocol that examined the align-
ment between the curriculum and teachers’ instructional practices, identifying whether the
teacher used a variety of components of ambitious instruction more or less frequently than
suggested by the curriculum.
Figure  1 outlines a progression in the use of classroom observations in mathematics
education research as described in this section. While this progression is not historically
linear, the categories in Fig. 1 appear to reflect differences in the nature of classroom observation research over time.


Fig. 1  A categorization of research on classroom instruction (Bostic et al. 2017)

For example, in studies that considered the amount or pres-
ence of an instructional practice or utilized surveys, researchers and research tools typically
identified practices that were easily observable (or reportable) and at a large grain size
[e.g., the teacher emphasized academic objectives (Brophy 1986); students made presenta-
tions (McCaffrey et al. 2001); students worked in small groups (Briars and Resnick 2000)].
In studies that examined the quality of some dimension(s) of instruction, not only was the
construct observable, but also researchers and tools needed to identify levels of quality of
the practice or construct, hence at a smaller grain size and level of specificity [e.g., the
nature of teacher’s questions or level of mathematical rigor in students’ responses (Boston
and Wilhelm 2015)].
The study described herein investigates the last phase of research in this progression,
where formal observations are conducted and analyzed along multiple dimensions to assess
some aspect(s) of instructional quality (as defined by the research tool). Questions about
mechanisms of change regarding teachers’ practice or interactions with students necessarily
require detailed qualitative and quantitative data from classroom observations. In addition,
what counts as evidence in mathematics education research is changing as well (Schles-
inger and Jentsch 2016). The need to substantiate and quantify claims about changes in
teachers’ practice and instructional quality requires the use of clearly identifiable criteria,
instruments that are supported by robust validity evidence, and data that are grounded
in sufficient reliability practices. Through these requirements, mathematics education
researchers are likely to be better able to analyze classroom observations empirically.
Recent publications have provided comparisons or syntheses of observation tools (e.g.,
Boston et al. 2015a, b; Charalambous and Praetorius 2018; Kane and Staiger 2012; Schles-
inger and Jentsch 2016). Boston et  al. (2015a, b) provided a snapshot of three tools for
classroom observation: Reformed Teaching Observation Protocol (RTOP; Sawada et  al.


2002), IQA, and MQI. The Measures of Effective Teaching (MET) Project (Kane and
Staiger 2012) conducted a large-scale national study to explore how classroom observa-
tions, student surveys, and student achievement data could be used together to produce a
robust measure of teaching effectiveness. MET researchers analyzed 1000 mathematics
lessons in grades 4–8 from public schools across the country using five observation tools:
RTOP, MQI, FFT, CLASS, and UTeach Observation Protocol (UTOP; Walkington et al.
2012). Results were compared across the five tools, with the finding that general and math-
specific tools provided a consistent picture of instructional quality in mathematics class-
rooms. Historically, the role of classroom observation has shifted in importance and level
of specificity. These shifts are reflective of consensus in the types of instructional practices
that lead to positive student outcomes, as well as more robust methodological data analysis
tools (e.g., video analysis software).

Validity and validation evidence

While the previous research studies (e.g., Boston et al. 2015a, b; Kane and Staiger 2012;
Schlesinger and Jentsch 2016) have provided comparisons and discussion of various class-
room observation tools, there is still a need to further categorize the plethora of classroom
observation tools currently being used with respect to validity. According to Kane (2016),
“Validity is a property of the proposed interpretations and uses of the test [instrument]
scores and is not simply a property of the test or the test score” (p. 64). Validity is cen-
tral to answering the question: How do you know your results, data, and instrument are
aligned? Behind every developed classroom observation protocol, there should be an evi-
dence-based (validity) argument justifying that the outcomes within a particular study are
reasonably drawn (Bostic 2017, 2018; Kane 2016). The Standards for Educational and
Psychological Testing delineate five sources of evidence for instrument development and
validation (American Educational Research Association, American Psychological Association,
& National Council on Measurement in Education [AERA, APA, & NCME] 2014): (1)
content, (2) response processes, (3) relations to other variables, (4) internal structure,
and (5) consequences for testing (or use of instrument). It is not necessary to provide evi-
dence for all five sources; however, several pieces of evidence for a source and/or bodies of
evidence that address multiple sources lead to stronger evidence-based validity arguments
(AERA et al. 2014; Kane 2016).
Given the expanding nature of classroom observation research and the plethora of pro-
tocols being developed, it is imperative that researchers critically attend to how observa-
tional data are collected and analyzed. However, mathematics researchers may not always
have enough information to make appropriate decisions when choosing an observation
instrument. Thus, we investigate and identify validity evidence associated with identified
classroom observation protocols.

Method

The present study was designed to address the question: How is classroom observation
being used to measure instructional quality in mathematics education research? Our goal
is to provide a current depiction of classroom observation studies within mathematics edu-
cation research. This depiction includes a thorough analysis of the scope of research in
which classroom observation protocols (COPs) are and are not employed. For those studies


utilizing COPs, we identified central constructs the tool was purported to measure. Moreo-
ver, we attend to issues of validity by describing and categorizing the published validity
evidence associated with the identified COPs.

Research design and question

The design for this qualitative study included a cross-comparative analysis method to con-
duct the literature search and eventually construct a sample for analysis. A cross-compar-
ative analysis is a means to examine numerous cases (i.e., manuscripts in this case) along
with multiple variables (Khan and VanWynsberghe 2008). A goal for using this design was
to identify all the cases between 2000 and 2015 in which classroom observations have been
used to measure instructional quality and identify those that are relevant to our research
question. The cross-comparative analysis allowed us to examine the various purposes for
observational research as well as explore additional variables such as (a) construct meas-
ured, (b) indicators, and (c) desired sample (or population, when available), with a particu-
lar COP. These variables are germane to designing and selecting instruments for classroom
observation research (Boston et  al. 2015a, b). This analysis process parallels work done
previously for three COPs that have been used extensively (see Boston et  al. 2015a, b).
More importantly, these variables provided a means to categorize and detail how classroom
observations are used to measure instructional quality, thus addressing the research ques-
tion. Additionally, the authors have been part of several COP development teams; hence,
they drew upon their past experiences when considering variables of interest.

Data collection and analyses

The following sections describe the data collection process, which ultimately created the
samples for analysis. Figure 2 is a flowchart of this process. The first two stages involved
(1) conducting a thorough literature search, and (2) analyzing peer-reviewed studies that
used classroom observation. The next two stages focused on validity evidence for the COPs
identified in the first two stages, including (3) conducting a second literature search focused

Fig. 2  Outline of data collection process



Stage one: Constructing the sample for analysis

The first stage of the cross-comparative analysis design was to systematically compile a
set of manuscripts for further analysis. There were three steps to this stage: determining
search terms, determining how to locate manuscripts, and culling the sample for our needs.
First, we determined search terms that might generate all empirical studies that used class-
room observation. Because the authors have developed COPs previously and published
reviews of COPs (see Boston et al. 2015a, b), they initially drew upon past experience of
language that has been used when referencing COPs. Next, the authors examined relevant
articles familiar to them and added keywords found in those manuscripts. Finally, the team
spoke with education research librarians at two different institutions to generate further
language and/or simplify language for search engines (e.g., mathematics and mathemati-
cal became math*). Ultimately, terms were grouped into three sets. One set contained the
phrase “math*.” The * symbol acts as a means to include any variation of a specific word
(e.g., math, mathematics, and mathematical). The second set contained seven synonyms for
tool: protocol, instrument, observation, measure*, tool, assess*, and eval*. The third set
of words described the population of interest: instruct*, teach*, practice, learn*, process,
and student. Population of interest describes the group, action, or process for which a COP
is meant to be administered. After selecting our groups and search terms, we selected one
term from each set and linked them with “and” operators. The “and” operator in a web
search requires that all three terms, not a subset of them, be found in results. Combinations
of one term from each set were explored in the title, abstract, and manuscript document. As
an example, one such combination was "math*" AND "protocol" AND "instruct*"; in total, the three sets yielded 42 unique combinations (1 × 7 × 6), each explored in the title, abstract, and full manuscript.
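To make the combination logic concrete, the short sketch below enumerates the 42 query strings implied by the three sets of terms. It is illustrative only: the actual searches were run through the JSTOR and EBSCO interfaces, so the Python enumeration is an assumed rendering of the procedure rather than the authors' own code.

```python
from itertools import product

# The three sets of search terms described above; "*" is the search engines'
# wildcard for word variants (e.g., math, mathematics, mathematical).
content_terms = ["math*"]
tool_terms = ["protocol", "instrument", "observation", "measure*", "tool", "assess*", "eval*"]
population_terms = ["instruct*", "teach*", "practice", "learn*", "process", "student"]

# One term from each set, linked with the "and" operator, yields
# 1 x 7 x 6 = 42 unique three-term queries.
queries = [f'"{a}" AND "{b}" AND "{c}"'
           for a, b, c in product(content_terms, tool_terms, population_terms)]

print(len(queries))   # 42
print(queries[0])     # "math*" AND "protocol" AND "instruct*"
```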
Next, we discussed our choices for academic search engines with educational research
librarians at two different universities before selecting JSTOR and EBSCO as meta-crawler
search engines. Justification for this selection included that many mathematics education
journals are found in either JSTOR or EBSCO. Specifically, all of the “A” quality journals
were part of the sample space and many of the “B” and “C” quality ones as well (Toerner
and Arzarello 2012). Additionally, this list overlapped with several journals included in
studies of mathematics education journal quality (Brigham Young University Department
of Mathematics Education 2008; Williams and Leatham 2017). Our sample for studies
included work found in peer-reviewed journals. White papers, research reports, and confer-
ence proceedings were excluded from our sample.
After selecting search terms and search engines, three individuals conducted the
searches. Results were grouped to create a list of 1144 possible manuscripts, and we
began the third step of this stage. Abstracts from each of these articles were compiled in
a spreadsheet for further review. Each author reviewed sets of 100 manuscripts as well as
an additional 15 manuscripts that were assigned to another reviewer. Thus, each author
functioned as a primary reviewer as well as a secondary reviewer. Each author reviewed
approximately 300 manuscripts and performed a secondary review on 45 manuscripts. The
review consisted of examining the title and abstract of each paper to confirm that the study
included observations in mathematics classrooms. Such a culling was necessary because
our search included manuscripts published across the world and from a variety of journals.


For instance, math may have been found in the title, abstract, or manuscript but the focus of
the study may have been on English Language Arts or other disciplines.
The population of 1144 manuscripts that met our search criteria ultimately resulted in
114 manuscripts of empirical research for further analysis. The justification for the large
population but small remaining set of manuscripts was that search engines included many
manuscripts that mention observation in some form but not as classroom observation, sug-
gesting that our search captured a very high percentage of applicable manuscripts.

Stage two: Analyzing the sample of classroom observation literature

The focus of this stage was to analyze the sample constructed of all classroom observa-
tion research (see Table 1). Table 1 provides a framing for coherently organizing results.
Results for this stage are based on the 114 manuscripts using classroom observation. The
first round of cross-comparative analysis used three variables: COP usage, COP type (i.e.,
researcher developed or developed by different group/author), and purpose of classroom
observations. In the second round of analysis, we aimed to categorize the manuscripts fur-
ther. If authors of these studies engaged in classroom observation but did not use a COP,
then we aimed to describe their chosen framework. Our hypothesis was that classroom
observation research might be situated within broad categories based on the purpose of the
classroom observations. Those categories were developed a posteriori. Manuscripts that
used a COP were reviewed to generate information alongside the variables (a) construct
measured, (b) indicators, and (c) desired sample (or population, when available) with a
particular COP.

Stage three: Validation evidence

Validation of an instrument or tool should be central to conducting generalizable and replicable research (Bostic 2017, 2018; Kane 2006, 2016). The Standards (AERA, APA, & NCME 2014) govern the use of instruments, assessments, and tools within many areas, including academic research.

Table 1  Variables associated with examination of manuscripts

Stage two (round 1)
   No COP used
   Yes, COP used
      Researcher-developed COP
      COP developed by different group/author
   Purpose of classroom observation
Stage two (round 2)
   Name of COP (if applicable)
   Construct measured with COP
   Indicators for COP
   Desired sample or population for using COP
Stage four
   No validation evidence
   Yes, validation evidence provided
      Validation evidence provided within the paper
      Validation evidence provided in separate peer-reviewed manuscript

The three groups of variables correspond to the three stages listed above.


Thus, validity evidence ought to be presented if conclusions from research are to be aligned with claims drawn from an instrument's use.
To accomplish stage three, we first reviewed the manuscripts that included COPs for
validity evidence. Next, we conducted a literature search in a similar fashion to stage one.
Search terms included valid*, reliab*, the COP’s name, and the COP’s name written as an
abbreviation. Search terms were explored within the title, abstract, and manuscript for the
same journals as before. Again, we used JSTOR and EBSCO. This returned 15 new manu-
scripts for consideration. Our sample for validation studies as well as the previous analysis
included only work found in peer-reviewed journals. Thus, white papers, research reports,
and conference proceedings were excluded from our sample. This decision to delimit our
sample was purposeful because it retained the same level of rigor across both cross-com-
parative analyses.

Stage four: Analyzing the sample of validation literature

To accomplish this stage, one member of the research team conducted a review of the orig-
inal 114 studies, followed by subsequent analysis of the 15 studies generated from stage
three. The variables for this part of the cross-comparative analysis were (a) whether validity
evidence for a COP was found and if so, (b) the type of the validity evidence (i.e., content,
response processes, relations to other variables, internal structure, and consequences for
testing [use of instrument]). A goal of this validity study analysis was to further contextual-
ize classroom observation research and link the use of COPs to their purported outcomes.

Interrater agreement

Interrater agreement allows a team of reviewers to feel confident that one coder's decision would match another coder's (James et al. 1993). Independence of coders
within a coding team is essential. During the first round of the second stage, each coder
independently reviewed 15% of the 1144 manuscripts that were marked by another indi-
vidual to include (or not) in the sample. Coders had complete agreement for this round,
providing evidence that the coding team was consistent in applying criteria for keeping
manuscripts for further review.
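As an illustration of the double-coding check described above, the sketch below computes simple percent agreement between two coders' include/exclude decisions. The agreement index actually used by the team is not reported beyond "complete agreement," so the function and the toy decisions here are hypothetical, not the study's analysis.

```python
def percent_agreement(coder_a, coder_b):
    """Proportion of items on which two coders made the same decision."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must review the same items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Toy example: include/exclude decisions on ten double-coded abstracts.
primary   = ["include", "exclude", "exclude", "include", "exclude",
             "exclude", "include", "exclude", "exclude", "exclude"]
secondary = ["include", "exclude", "exclude", "include", "exclude",
             "exclude", "include", "exclude", "exclude", "exclude"]

print(percent_agreement(primary, secondary))  # 1.0, i.e., complete agreement
```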
During the second round of the second stage, our approach started with three coders
reviewing approximately 40 manuscripts each and categorizing the manuscript alongside
three variables (see Table  1). A second individual from that three-person team indepen-
dently considered the first person’s categorization and confirmed it or discussed a possi-
ble discrepancy with the first coder. Discussions focused on the best categorization for the
manuscript. Ultimately, there was complete agreement between each pair of coders without
the need to bring in the third researcher for consultation. The results presented in the next
section thus reflect mutually agreed upon categories.
Because one individual from the team conducted the fourth stage, interrater agreement
is not applicable; however, the process was vetted with the entire research group to ensure
that the analytical approach could be replicated and results were logically drawn from the
analytical approach. First, the review team discussed and later agreed upon operationaliza-
tion of ideas (e.g., content validity evidence and internal structure evidence). Next, the cod-
ing team discussed the decision-making process for vetting across members. Ultimately,
there was complete agreement in the cross-comparative analysis process with respect to
validation evidence. During the analysis, the individual conducting this stage shared the


results at multiple points during the search and analysis with the team members for their
consideration. Through periods of discussion during this stage, it was evident that the other
team members agreed that (a) results could be replicated should they choose to carry out
this analysis and (b) results were logically drawn from an organized and coherent decision-
making process.

Results

Our research question was: How is classroom observation being used to measure instruc-
tional quality in mathematics education research? We respond to that question in three
ways. First, we discuss how classroom observation was utilized within the articles we
reviewed (round one of stage two cross-comparative analysis). Based on this sample, we
provide a categorization for the purpose of classroom observation in mathematics educa-
tion research. Second, we share results from the second round of cross-comparative analy-
sis during stage two. These results highlight the nature of the COPs found in our sample.
Third and finally, we offer results connecting validity evidence and COPs that emerged
from our cross-comparative analysis during stage four.

Stage two, round one: COP usage and purpose of classroom observation

The search returned a total of 114 articles in which data from classroom observations
were used. Table 2 depicts how many of these articles (a) had no mention of a COP but
conducted classroom observation research, (b) used a COP developed as part of the study
described in the article, or (c) used or adapted a previously existing protocol designed by
another group or author. Note that these categories are mutually exclusive. Nearly two-
thirds of the studies returned by the search did not mention or make use of any COP,
while the remaining articles were almost evenly split between self-developed and existing
protocols.
Next, we aimed to describe the purpose of the classroom observations in our sample.
Clearly, one’s chosen methodology is dependent on the particular research questions to be
investigated. It became evident in our review of the manuscripts that attending to research
purpose would provide a more complete picture of the appropriate use of classroom obser-
vation in mathematics education. For example, in exploratory or descriptive studies, we
reasoned that it might not be suitable for researchers to make use of an instrument to meas-
ure some aspect of instruction. Researchers are likely exploring what, if anything, to meas-
ure. These studies comprise the first category of studies appearing in Table 3. Studies that
aimed to determine the impact of teacher education or professional development on teach-
ers’ practice (category 2) or the impact of teachers’ practice on students’ learning (category
3) might be more inclined to measure some dimension of classroom practice.

Table 2  Search results

              No COP mentioned   Internally developed COP   COP designed by another group/author   Total
Number (%)    70 (61.4%)         21 (18.4%)                 23 (20.2%)                             114


Table 3  Purpose of classroom observations and the use of a classroom observation protocol

Purpose of observations                                                                      Number of articles (% of 114)   Subset using a COP (% of articles in this category)
Exploratory, descriptive, grounded theory                                                    41 (36%)                        8 of 41 (20%)
Provide evidence of some instructional change or improvement                                 17 (15%)                        8 of 17 (47%)
Make connections between classroom practice and student variables, e.g., student learning    32 (28%)                        13 of 32 (41%)
Correlation of teacher beliefs, knowledge, perceptions, etc. and teacher practice            12 (11%)                        3 of 12 (25%)
To develop or compare classroom observation protocols                                        12 (11%)                        12 of 12 (100%)
Total                                                                                        114                             44

One might consider classroom practice to be a dependent variable in the former case, and an inde-
pendent variable in the latter; in both cases, however, the need to substantiate claims about
the relationship between the two could be supported by measuring the particular dimension
of practice in question. A fourth category of studies sought to correlate a teacher charac-
teristic, such as knowledge or beliefs, with teacher’s practice. In this case, also, it might
seem reasonable to measure some aspect of instruction related to the knowledge or beliefs
in question. Finally, in the fifth category, we included a number of articles that discussed
the development of a particular instrument or compared/contrasted existing instruments.
The purpose of these articles differed from the others in that the focus of the article was on
the instrument(s) itself, not on using one of them to answer a research question related to
classroom practice.
Table 3 depicts the distribution of the 114 articles across these five categories, as well
as the frequencies for manuscripts using a COP within each identified purpose. We note
that for this phase of the analysis, no distinction was made between self-developed or
pre-existing COPs. The purpose for most studies was descriptive or exploratory in nature
(36%) or sought to determine the effect of some practice or intervention on students (28%).
As hypothesized, a relatively low percentage of exploratory studies attempted to measure
instruction using a COP (20%), while a higher proportion was used in studying the influ-
ence of instruction on student outcomes (41%) and those seeking to determine the effect
of some intervention on teachers’ practice (47%). Most COPs found in manuscripts may
be classified as informal observations, formal observations with an overall holistic score,
or formal observations rated with sets of indicators (see Fig. 1). Thus, recent studies are
attending to levels of quality and implementation.

Stage two, round two: Nature of COPs

A total of 27 COPs were identified, six of which were mentioned in more than one article (see
Table 4). Appendix 1 provides a description of the six COPs that came up more than once in
our search, with respect to our variables of interest: construct measured, indicators, and typical
study population (or desired population of interest) and citations for the referenced articles.
The remaining instruments, which came up only once, are listed in Appendix 2. Again, evi-
dence for each variable of interest is presented as well as the citation for the article in which
it appears.


Table 4  Frequencies for COPs mentioned in more than one published manuscript

    Classroom observation protocol                     Number of articles*
1   Instructional Quality Assessment (IQA)             7
2   Reformed Teaching Observation Protocol (RTOP)      6
3   Mathematical Quality of Instruction (MQI)          5
4   UTeach Observation Protocol (UTOP)                 2
5   TruMath                                            3
6   Oregon Teacher Observation Protocol (OTOP)         2

*Citations for articles that reference these COPs are provided in Appendix 1

While there might be broad connections between constructs being measured by these COPs, each COP had a unique focus. Manuscripts came from a wide array of journals.
More than two studies were found in each of the following journals: American Educational
Research Journal, Elementary School Journal, Journal of Mathematics Teacher Education,
and School Science and Mathematics Journal.

Stage four results: Validity evidence and COPs

Regarding the validity evidence for COPs, evidence was rarely found within the manuscripts
themselves. Appendices 1 and 2 note the connections between COPs that came up in our
search and the source of validity evidence. In sum, 44% of all COPs have some form of valid-
ity evidence, located in a manuscript found in the sample space. All six COPs listed in Appen-
dix 1 had some form of validity evidence associated with them, found in one or more of the
references. Only 29% of the manuscripts listed in Appendix 2 offer any validity evidence in
a way that is consistent with the Standards (AERA et al. 2014). If validity evidence was pro-
vided in a manuscript, then it usually was related to the validity source called Internal Struc-
ture in the form of reliability scores (e.g., Cronbach’s alpha or test–retest reliability). An analy-
sis was performed to examine whether validity evidence for other sources was offered (e.g.,
relations to other variables, response processes, and consequences from testing). Seven COPs
(i.e., 7 of 27 COPs; 26%) have validity evidence for more than one source found within our
sample of 114 published peer-reviewed manuscripts. Those COPs were the CLASS, EQUIP,
IQA, MISCOP, MQI, M-SCAN, and RTOP.
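Because the internal structure evidence reported for these COPs most often took the form of reliability coefficients such as Cronbach's alpha, a brief sketch of how such a coefficient is computed may help readers interpret that evidence. The formula is standard; the rubric labels and scores below are invented for illustration and are not drawn from any reviewed study.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for k items, each scored across the same set of lessons.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores)
    n = len(item_scores[0])

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))

# Invented example: four observation rubrics scored 1-4 across five lessons.
rubrics = [
    [3, 2, 4, 3, 2],  # e.g., task potential
    [3, 2, 4, 2, 2],  # e.g., task implementation
    [2, 2, 3, 3, 1],  # e.g., questioning
    [3, 1, 4, 3, 2],  # e.g., student explanations
]
print(round(cronbach_alpha(rubrics), 2))  # about 0.92 for this toy data
```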
The secondary search for manuscripts discussing validity evidence about specific COPs
provided further details about validity. The purpose for this secondary search was to retain
the same search process and rigor as performed in the earlier search. A total of five additional
peer-reviewed manuscripts were collected from this secondary search for validity-related man-
uscripts (see Table 5). Broadly speaking, the validity evidence aimed to address the Standards
(AERA et al. 1999, 2014) in some fashion. Evidence related to content and internal structure
was usually presented within these validation studies. These manuscripts addressed validity
evidence for the EQUIP, MQI, M-SCAN, and RTOP.


Table 5  Validity evidence articles and connection to sources of validity

Classroom observation protocol    Article                                              Validity evidence—sources addressed
EQUIP                             Marshall et al. (2010)                               Content validity and internal structure
MQI                               Learning Mathematics for Teaching Project (2011)     Content validity and internal structure
RTOP                              Sawada et al. (2002)                                 Content validity and internal structure
EQUIP and RTOP                    Marshall et al. (2011)                               Internal structure
M-SCAN                            Walkowiak et al. (2014)                              Content validity, response processes, relationship to other variables, and internal structure

Limitations and delimitations

We acknowledge some limitations of our qualitative cross-comparative study. First, the sampling frame (i.e., 2000–2015 and specific journals) may have delimited our search
inadvertently. Given the large number of initial manuscripts (1144), we are reluctant to
believe so but admit this is possible. This point also bears on the validity search
because we focused on journals whose scope is primarily aimed at mathematics education
researchers, not assessment and evaluation (methodology) scholars. Second, we adhered to
a strict sampling frame for validity studies. Conference proceedings (e.g., American Edu-
cational Research Association, Psychology of Mathematics Education—North America,
Research Council in Mathematics Learning, and Research in Undergraduate Mathematics
Education) yielded a few additional manuscripts focusing on validity evidence. However,
we chose to delimit our data collection to peer-reviewed journals for all cross-comparative
analysis to retain coherence and rigor across the work. Broadly speaking, authors of the
COPs were not authors of conference proceedings utilizing their COP. Similarly, our search
identified research reports and white papers containing validity evidence for some COPs
(e.g., LSC Horizon), but because these manuscripts were not peer-reviewed, they were not
retained for analysis.

Discussion

We identified 114 published studies using classroom observations, and only 44 of those
studies (39%) mention a formalized COP as part of their research. Our intent with this
finding is to provide a state of the field—not to make judgments about appropriate or inap-
propriate research lines, instruments, or findings.
That said, a few facts regarding the state of the field were surprising. First, a low percentage
of studies conducting classroom observation research used a COP. A number of these stud-
ies (see Table 3) were exploratory in nature and thus might not be expected to utilize a COP.
However, very few studies seeking to correlate a teacher variable (e.g., knowledge, beliefs,
etc.) with instructional practice used an instrument to measure instruction (see Table  3).
Although our analysis is at a relatively large grain size, this result raises questions regarding
how claims about the relationship between, for example, teacher beliefs and teacher practice


are substantiated if practice is not measured. The use of a COP for measuring classroom
observation has potential to provide strong evidence for claims and to augment generalizations
across studies examining a similar phenomenon.
Second, a small number of studies attended to validity. Hill and Shih (2009) point out that
only 17% of quantitative studies published in JRME employed measures with validity evi-
dence or reported their own validity evidence. The lack of published validation studies for
COPs in peer-reviewed journals within our study’s sample space is concerning. However, rea-
sons for this finding may be multifactorial in nature including: lack of opportunities to publish
validation studies, lack of interest in or understanding of validity, or lack of cross-disciplinary
research between mathematics educators and measurement experts. If COPs continue to be
used in mathematics education research, then we recommend researchers consider whether a
chosen COP is appropriate for an intended use, whether validity evidence exists for a chosen
COP, and how the new study can engage in gathering validity evidence for a COP within the
specific research. As more tools are developed for classroom observation, we also recommend
that researchers consider the intended design, purpose statement, and evidence-based validity
arguments for each new instrument (Bostic 2017, 2018).
Next, 27 different instruments for measuring instruction may be an unwieldy number for
researchers to navigate in order to identify a tool aligned with their chosen research questions.
Perhaps the need to identify aspects of mathematics instruction aligned with the specific goals
or outcomes of a particular study entices researchers to create their own tools. As it is highly
probable that instruments overlap in what they measure, this study makes salient the need in
the field to investigate and catalog what each instrument measures, the conditions under which
it can be used reliably (e.g., live, videotape, or types of artifacts needed), whether training is
necessary, the scalability of the instrument, and other information necessary to thoughtfully
choose a COP for measuring practice.
On one hand, many COPs with unique purposes allow access to a wider audience of
researchers examining classroom contexts. However, too many COPs may also limit the
strength of generalizations that can be made from multiple studies examining the same phe-
nomenon. Very few instruments were used in more than one study. Different instruments
measuring different aspects of instruction, or measuring the same aspect in different ways,
make meaningful comparisons between studies a tenuous task and inhibit progress in the field.
Similar tools and foci would facilitate the compilation of results across several small studies
into meta-analyses with rigorous conclusions and implications.
However, it may be that there is no practical solution to this issue if the real problem is
more fundamental. The plethora of instruments identified in this study may serve as evidence
that the field of mathematics education lacks a shared vision of what constitutes high-quality
instruction, at least at the level of classroom observations. With regard to classroom instruc-
tion, one can generally infer what teachers value by examining their assessments. Applying the
same logic to the results of the present study suggests that as a field, mathematics education
researchers do not agree on, or at least are still trying to determine, what factors support high-
quality mathematics instruction.

Conclusion

Classrooms are extremely complex environments (Hiebert et  al. 2005), and thus it may
be the case that as a field we are still at the nascent stages of instrument development.
As more instruments are developed for classroom observation, it is prudent to consider


the purpose statement and validity arguments for each one (Bostic 2017, 2018). There is
a reasonable rationale for both sides of an evidence-based argument related to numerous
COPs. Many COPs with unique purposes allow access to a wider audience of researchers
examining classroom contexts. However, too many COPs may also limit the strength of
generalizations that can be made from multiple studies examining the same phenomenon.
In short, development and use of COPs should be closely considered in light of their valid-
ity evidence, which is tied to clearly written purpose statements and validity arguments.
Cataloging existing instruments through cross-comparative analysis can improve the
quality of the research being conducted and allow for meaningful comparisons across stud-
ies (Khan and VanWynsberghe 2008). It has the potential to give researchers more information with which to make informed decisions about conducting research, which is vitally
important to building a foundation of mathematics education research (Smith 2014).
Based upon our results, we offer three final recommendations for mathematics education
research. First, the field of scholars working in mathematics education contexts needs a
way to document and catalog existing instruments in a way that is broadly accessible and
does not discriminate against a methodology. Second, scholars should attend to an instru-
ment’s purpose, design features, and ways results may be used during the development and/
or selection phase for a particular study. Considering these facets may result in better uni-
fying the field of mathematics teacher education scholarship while also fostering further
developments in areas that need COPs. Third, validity evidence and associated validity
arguments must be a primary consideration for both recommendations (1) and (2). It is dif-
ficult to judge the rigor of conclusions and their generalizability without having a sense of
the validity associated with the outcomes from a particular instrument. In the present study,
our team focused on classroom observation research in mathematics education and found
a diversity of purposes and means for engaging in this scholarly work. Ultimately, research
questions regarding the relationship between teachers’ practice and some other dimension,
such as student achievement, teacher education or professional development, or teachers’
knowledge, beliefs, planning, or decisions, require that teachers’ practice be observed and
analyzed. Given the central role of teaching in the enterprise of mathematics education and
the importance of how instruction is implemented in that process (Franke et al. 2007), the
need to measure various aspects of mathematics instruction through classroom observation
is great.

Acknowledgements  We would like to share our sincere appreciation to Timothy Folger, Maria Nielsen, and
Davis Gerber at Bowling Green State University, and Dan Chibnall at Drake University for their assistance
throughout this project.


Appendix 1

Classroom observation protocol: Instructional Quality Assessment (IQA)
Construct measured: Academic rigor and accountable talk
Indicators: Instructional tasks; task implementation; explanations of mathematical thinking and reasoning
Typical study population: K-12 mathematics instruction
Validity evidence: Content, response processes, internal structure
References: Boston (2012a, b), Boston et al. (2015a, b), Boston and Smith (2009), Jackson et al. (2013), Schlesinger and Jentsch (2016), Schoenfeld (2013), Wilhelm and Kim (2015)

Classroom observation protocol: Reformed Teaching Observation Protocol (RTOP)
Construct measured: Reform-oriented mathematics and science teaching (i.e., standards-based teaching, inquiry orientation, student-centered teaching practices)
Indicators: Lesson design; lesson implementation; content; classroom culture
Typical study population: K-12 mathematics instruction
Validity evidence: Content, response processes, internal structure
References: Boston et al. (2015a, b), Jong et al. (2010), Marshall et al. (2011), Peters Burton et al. (2014), Sawada et al. (2002), Schlesinger and Jentsch (2016)

Classroom observation protocol: Mathematical Quality of Instruction (MQI)
Construct measured: Rigor and richness of mathematics present
Indicators: Common core-aligned student practices; working with students and mathematics; richness of mathematics; errors and imprecision; classroom work is connected to mathematics
Typical study population: K-9 mathematics instruction
Validity evidence: Content, response processes, internal structure, relationship to other variables
References: Boston et al. (2015a, b), Hill et al. (2012), Kapitula and Umland (2011), Schlesinger and Jentsch (2016), Schoenfeld (2013)

Classroom observation protocol: UTeach Observation Protocol (UTOP)
Construct measured: Effective STEM teaching
Indicators: Designing lessons that are inquiry based, use real-world connections, and involve active participation; modifying instruction (using questioning, responding to student needs and classroom contexts); content knowledge in the work of teaching
Typical study population: K-12 mathematics instruction
Validity evidence: Internal structure
References: Schlesinger and Jentsch (2016), Schoenfeld (2013), Wasserman and Walkington (2014)

Classroom observation protocol: Teaching for Robust Understanding (TRU) Framework
Construct measured: Attributes of equitable and robust learning environments
Indicators: Content; cognitive demand; equitable access to content; agency, ownership and identity; formative assessment
Typical study population: K-12 mathematics instruction
Validity evidence: Content
References: Schlesinger and Jentsch (2016), Schoenfeld (2013)

Classroom observation protocol: Oregon Teacher Observation Protocol (OTOP)
Construct measured: Reform-oriented teaching
Indicators: Habits of mind; metacognition; student discourse; challenged ideas; student misconceptions; conceptual thinking; divergent thinking; interdisciplinary connections; pedagogical content knowledge; multiple representations
Typical study population: K-16 mathematics instruction
Validity evidence: Content
References: Morrell et al. (2004), Wainwright et al. (2004)

Appendix 2

1. Dyadic teacher–student contact observational system (Good and Brophy 1994)
   Construct measured: Student–teacher interactions
   Indicators: Interactions around academic work (negative and positive), classroom procedures, and behavior
   Sample in the cited study: 61 at-risk youth, grades 3–5
   Validity evidence: Internal structure
   Reference: Baker (1999)

2. Classroom Assessment Scoring System (CLASS)
   Construct measured: High-quality teacher–student interactions
   Indicators: Classroom organization, instructional and emotional support
   Sample in the cited study: 440 preschool teachers
   Validity evidence: Content, response processes, and internal structure
   Reference: Hamre et al. (2012)

3. Classroom Observation of Student–Teacher Interactions-Mathematics (COSTI-M)
   Construct measured: Explicit instructional interactions
   Indicators: Teacher demonstration, student independent practice, student errors, and teacher feedback
   Sample in the cited study: 129 kindergarten classrooms
   Validity evidence: Internal structure
   Reference: Doabler et al. (2015)

4. Levels of Engagement with Children's Mathematical Thinking from CGI
   Construct measured: Teachers' attention to student thinking
   Indicators: Extent to which student thinking is elicited and used in instructional decisions
   Sample in the cited study: 26 elementary teachers (grades 1–5) who had participated in CGI professional development
   Validity evidence: None provided
   Reference: Franke et al. (2001)

5. High Quality Teaching of Foundational Skills in Math and Reading (Valli and Croninger 2002)
   Construct measured: High-quality teaching in upper elementary schools
   Indicators: Small-group work and high-level questions
   Sample in the cited study: Three instructors of an elementary education course
   Validity evidence: None provided
   Reference: Newton (2009)

6. Robust Mathematical Discussion (RMD) protocol
   Construct measured: Quality of mathematical discussion
   Indicators: Mathematical and discursive strength of discourse
   Sample in the cited study: Two 8th-grade mathematics classes
   Validity evidence: None provided
   Reference: Mendez et al. (2007)

7. Comprehensive School Reform Classroom Observation System (CSRCOS)
   Construct measured: Instructional practice at scale
   Indicators: Instructional opportunity, student activities, and teacher–student relationships
   Sample in the cited study: 145 3rd- through 5th-grade classrooms
   Validity evidence: None provided
   Reference: McCaslin et al. (2006)

8. COS-1, 3, and 5 (Classroom Observation System for First, Third, and Fifth Grade)
   Construct measured: Quality of classroom supports
   Indicators: Quality of emotional and instructional interactions and amount of exposure to literacy and math activities
   Sample in the cited study: 791 children at grades 1, 3, and 5
   Validity evidence: None provided
   Reference: Pianta et al. (2008)

9. Observing Patterns of Adaptive Learning (OPAL)
   Construct measured: Promoting mastery goals in the classroom
   Indicators: Task, authority, recognition, grouping, evaluation, and time
   Sample in the cited study: 28 elementary education majors
   Validity evidence: None provided
   Reference: Morrone et al. (2004)

10. Classroom Implementation Framework
    Construct measured: Lesson quality
    Indicators: Tasks, role of teacher, social culture, mathematical tools, and equity
    Sample in the cited study: 26 in-service secondary mathematics teachers
    Validity evidence: None provided
    Reference: Arbaugh et al. (2006)

11. Growing Awareness Inventory (GAIn) protocol, derived from the Culturally Responsive Instruction Observation Protocol (CRIOP)
    Construct measured: Culturally responsive pedagogy
    Indicators: Classroom relationships, discourse, and sociopolitical consciousness
    Sample in the cited study: 19 secondary mathematics and science preservice teachers (PSTs used the GAIn to code cooperating teachers' lessons, i.e., it served as an instructional tool)
    Validity evidence: None provided
    Reference: Brown and Crippen (2016)

12. TIMSS 1995/1999 Video Study procedure
    Construct measured: Mathematics lesson structure and presentation
    Indicators: Organization of classroom interaction, instructional activities, and organization of mathematics content
    Sample in the cited study: 39 8th-grade classrooms in Italy
    Validity evidence: None provided
    Reference: Santagata and Barbieri (2005)

13. Science Learning through Engineering Design (SLED), derived from the Inquiring into Science Instruction Observation Protocol (ISIOP)
    Construct measured: Design-informed pedagogical methods for STEM instruction
    Indicators: Engineering design-informed pedagogical methods
    Sample in the cited study: 35 grades 5 and 6 STEM teachers
    Validity evidence: None provided
    Reference: Capobianco and Rupp (2014)

14. Mathematics Integrated into Science: Classroom Observation Protocol (MISCOP)
    Construct measured: Quality of science lessons when mathematics is integrated
    Indicators: The degree to which mathematics is integrated into student-centered learning of science
    Sample in the cited study: 54 secondary STEM teachers
    Validity evidence: Content, response processes, internal structure
    Reference: Judson (2013)

15. Classroom Video Analysis (CVA)
    Construct measured: Usable knowledge for teaching mathematics
    Indicators: Teachers' ability to analyze authentic teaching events
    Sample in the cited study: Nationally recruited sample of 676 elementary and middle school teachers
    Validity evidence: None provided
    Reference: Kersting et al. (2016)

16. Electronic Quality of Inquiry Protocol (EQUIP)
    Construct measured: Quality of inquiry-based instruction in mathematics and science
    Indicators: Categories include instruction, discourse, assessment, and curriculum
    Sample in the cited study: 52 classrooms (35 teachers) of middle school science teachers
    Validity evidence: Internal structure
    Reference: Marshall et al. (2011)

17. School Observation Method (SOM) and Rubric for Student-Centered Activities (RSCA) (Ross et al. 1998)
    Construct measured: Student-centered classroom instruction
    Indicators: Instructional orientation, classroom organization, instructional strategies, student activities, technology use, and assessment
    Sample in the cited study: 45 observations across 4 different STEM education programs
    Validity evidence: None provided
    Reference: Hall and Miro (2016)

18. Cases of Reasoning and Proving (CORP)
    Construct measured: How teachers use proof tasks during a lesson
    Indicators: Context and nature of the lesson, cognitive demand of the tasks, and proof schemes
    Sample in the cited study: Three geometry teachers
    Validity evidence: None provided
    Reference: Sears and Chavez (2014)

19. Classroom Observation Instrument (COI)
    Construct measured: Advancing student thinking
    Indicators: Teacher lesson planning, classroom practices, and "on-the-fly" decision making
    Sample in the cited study: 18 first-grade teachers using the Everyday Mathematics curriculum
    Validity evidence: None provided
    Reference: Fraivillig et al. (1999)

20. Classroom Observation Inventory (COI)
    Construct measured: Culturally relevant pedagogy
    Indicators: Dimensions of CureMap: teaching mathematics for understanding; centering instruction on students' experiences; developing students' critical consciousness about or with mathematics
    Sample in the cited study: 14 high school teachers
    Validity evidence: None provided
    Reference: Rubel and Chu (2012)

21. Mathematics Scan (M-SCAN)
    Construct measured: Standards-based mathematics teaching practices
    Indicators: Structure of lesson, multiple representations, students' use of tools, cognitive depth, discourse community, explanation and justification, problem solving, and connections and applications
    Sample in the cited study: 88 3rd-grade teachers, 43 of whom taught at schools receiving Responsive Classroom (RC) training
    Validity evidence: Content, response processes, relationship to other variables, and internal structure
    Reference: Ottmar et al. (2013)

References
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (2014). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
American Educational Research Association, American Psychological Association, National Council on
Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing
(US). (1999). Standards for educational and psychological testing. Washington, DC: American Educa-
tional Research Association.
Arbaugh, F., Lannin, J., Jones, D. L., & Park-Rogers, M. (2006). Examining instructional practices in Core-
Plus lessons: Implications for professional development. Journal of Mathematics Teacher Education,
9(6), 517–550.
Baker, J. A. (1999). Teacher-student interaction in urban at-risk classrooms: Differential behavior, relation-
ship quality, and student satisfaction with school. The Elementary School Journal, 100(1), 57–70.
Ball, D. L., & Rowan, B. (2004). Introduction: Measuring instruction. The Elementary School Journal, 105,
3–10.
Berry, R. Q., III, Rimm-Kaufman, S. E., Ottmar, E. M., Walkowiak, T. A., & Merritt, E. (2010). The Mathematics Scan (M-Scan): A measure of mathematics instructional quality. Unpublished measure, University of Virginia.
Blank, R. K., Porter, A., & Smithson, J. (2001). New tools for analyzing teaching, curriculum and standards
in Mathematics & Science. Results from Survey of Enacted Curriculum Project. Final Report. Council
of Chief State School Officers, Attn: Publications, One Massachusetts Avenue, NW, Suite 700, Wash-
ington, DC 20001-1431.
Borasi, R., Fonzi, J., Smith, C., & Rose, B. J. (1999). Beginning the process of rethinking mathematics
instruction: A professional development program. Journal of Mathematics Teacher Education, 2,
49–78.
Bostic, J. (2017). Moving forward: Instruments and opportunities for aligning current practices with testing
standards. Investigations in Mathematics Learning, 9(3), 109–110.
Bostic, J. (2018). Improving test development reporting practices. In L. Venenciano & A. Sanogo (Eds.),
Proceedings of the 45th Annual Meeting of the Research Council on Mathematics Learning (pp.
57–64). Baton Rouge, LA.
Bostic, J., Lesseig, K., Sherman, M., & Boston, M. (2017). Classroom observation protocols: Choose your
own tool. Research report presented at the National Council of Teachers of Mathematics Research
Conference, San Antonio, TX.
Bostic, J., Matney, G., & Sondergeld, T. (2019). A lens on teachers’ promotion of the Standards for Math-
ematical Practice. Investigations in Mathematics Learning, 11(1), 69–82.
Boston, M. D. (2012a). Assessing the quality of mathematics instruction. Elementary School Journal, 113,
76–104.
Boston, M. (2012b). Assessing instructional quality in mathematics. The Elementary School Journal,
113(1), 76–104.
Boston, M., Bostic, J., Lesseig, K., & Sherman, M. (2015a). A comparison of mathematics classroom obser-
vation protocols. Mathematics Teacher Educator, 3, 154–175.
Boston, M., Bostic, J., Lesseig, K., & Sherman, M. (2015b). Classroom Observation tools to support the
work of mathematics teacher educators. Invited Manuscript for Mathematics Teacher Educator, 3,
154–175.
Boston, M. D., & Smith, M. S. (2009). Transforming secondary mathematics teaching: Increasing the cog-
nitive demands of instructional tasks used in teachers’ classrooms. Journal for Research in Mathemat-
ics Education, 40, 119–156.
Boston, M. D., & Wilhelm, A. G. (2015). Middle school mathematics instruction in instructionally-focused
urban districts. Urban Education, 52(7), 829–861.
Briars, D. J., & Resnick, L. B. (2000). Standards, assessments… and what else? The essential elements of
standards-based school improvement. Center for the Study of Evaluation, National Center for Research
on Evaluation, Standards, and Student Testing, Graduate School of Education & Information Studies,
University of California, Los Angeles.
Brigham Young University Department of Mathematics Education. (2008). Report on venue study.
Retrieved from https://nctm.confex.com/nctm/…/BYU%20Study%20for%20Journal%20Rankings.pdf.
Brophy, J. (1986). Teacher influences on student achievement. American Psychologist, 41, 1069.
Brown, J. C., & Crippen, K. J. (2016). The growing awareness inventory: Building capacity for culturally
responsive science and mathematics with a structured observation protocol. School Science and Math-
ematics, 116(3), 127–138.
Capobianco, B. M., & Rupp, M. (2014). STEM teachers’ planned and enacted attempts at implementing
engineering design-based instruction. School Science and Mathematics, 114(6), 258–270.
Charalambous, C. Y., & Praetorius, A. K. (2018). Studying mathematics instruction through different
lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM Math-
ematics Education, 50, 355–366. https://doi.org/10.1007/s11858-018-0914-8.
Danielson, C. (2013). The framework for teaching: Evaluation instrument. Princeton, NJ: Danielson Group.
Doabler, C. T., Baker, S. K., Kosty, D. B., Smolkowski, K., Clarke, B., Miller, S. J., et al. (2015). Examining
the association between explicit mathematics instruction and student mathematics achievement. The
Elementary School Journal, 115(3), 303–333.
Donovan, M. S., & Bransford, J. D. (2005). How students learn: History, mathematics, and science in the
classroom. Committee on How People Learn: A Targeted Report for Teachers National Research
Council. Washington, DC: National Academies Press.
Farmer, J. D., Gerretston, H., & Lassak, M. (2003). What teachers take from professional development:
Cases and implications. Journal of Mathematics Teacher Education, 6, 331–360.
Fraivillig, J. L., Murphy, L. A., & Fuson, K. C. (1999). Advancing children’s mathematical thinking in eve-
ryday mathematics classrooms. Journal for Research in Mathematics Education, 30, 148–170.
Franke, M. L., Carpenter, T. P., Levi, L., & Fennema, E. (2001). Capturing teachers’ generative change: A
follow-up study of professional development in mathematics. American Educational Research Jour-
nal, 38, 653–689.
Franke, M., Kazemi, E., & Battey, D. (2007). Understanding teaching and classroom practice in mathemat-
ics. In F. Lester Jr. (Ed.), Second handbook of research on mathematics teaching and learning (pp.
225–256). Charlotte, NC: Information Age Publishing.
Good, T., & Brophy, J. (1994). Looking in classrooms (6th ed., pp. 209–262). New York: Harper Collins
College Publishers.
Hall, A., & Miro, D. (2016). A study of student engagement in project-based learning across multiple
approaches to STEM education programs. School Science and Mathematics, 116(6), 310–319.
Hamre, B., Pianta, R., Burchinal, M., Field, S., LoCasale-Crouch, J., Downer, J., et  al. (2012). A course
on effective teacher-child interactions: Effects on teacher beliefs, knowledge, and observed practice.
American Educational Research Journal, 49(1), 88–123.
Hiebert, J., Stigler, J. W., Jacobs, J. K., Givvin, K. B., Garnier, H., Smith, M., et al. (2005). Mathematics
teaching in the United States today (and tomorrow): Results from the TIMSS 1999 video study. Educa-
tional Evaluation and Policy Analysis, 27, 111–132.
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher
observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64.
Hill, H., & Shih, J. (2009). Examining the quality of statistical mathematics education research. Journal for Research in Mathematics Education, 40(3), 241–250.
Jackson, K., Garrison, A., Wilson, J., Gibbons, L., & Shahan, E. (2013). Exploring relationships
between setting up complex tasks and opportunities to learn in concluding whole-class discussions
in middle-grades mathematics instruction. Journal for Research in Mathematics Education, 44(4),
646–682.
James, L. R., Demaree, R. G., & Wolf, G. (1993). Rwg: An assessment of within-group interrater agree-
ment. Journal of Applied Psychology, 78, 306.
Jong, C., Pedulla, J. J., Reagan, E. M., Salomon-Fernandez, Y., & Cochran-Smith, M. (2010). Exploring
the link between reformed teaching practices and pupil learning in elementary school mathematics.
School Science and Mathematics, 110(6), 309–326.
Judson, E. (2013). Development of an instrument to assess and deliberate on the integration of math-
ematics into student-centered science learning. School Science and Mathematics, 113(2), 56–68.
Kane, M. T. (2006). Validation. In R. L. Brennan, National Council on Measurement in Education, &
American Council on Education (Eds.), Educational measurement. Westport, CT: Praeger Publishers.
Kane, M. T. (2016). Validation strategies: Delineating and validating proposed interpretations and uses
of test scores. In S. Lane, M. Raymond, & T. M. Haladyna (Eds.), Handbook of test development
(2nd ed.). New York, NY: Routledge.
Kane, T. J. & Staiger, D. O. (2012). Gathering Feedback for Teaching: Combining high-quality observa-
tions with student surveys and achievement gains. Research paper. MET Project. Bill and Melinda
Gates Foundation.
Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added
scores. American Educational Research Journal, 48(3), 794–831.
Kersting, N. B., Sutton, T., Kalinec-Craig, C., Stoehr, K. J., Heshmati, S., Lozano, G., et al. (2016). Fur-
ther exploration of the classroom video analysis (CVA) instrument as a measure of usable knowl-
edge for teaching mathematics: Taking a knowledge system perspective. ZDM Mathematics Educa-
tion, 48(1–2), 97–109.
Khan, S., & VanWynsberghe, R. (2008). Cultivating the under-mined: Cross-case analysis as knowledge
mobilization. Forum: Qualitative Social Research, 9, 1–21.
Le, V., Lockwood, J. R., Stecher, B. M., Hamilton, L. S., & Martinez, J. F. (2009). A longitudinal inves-
tigation of the relationship between teachers’ self-reports of reform-oriented instruction and math-
ematics and science achievement. Educational Evaluation and Policy Analysis, 31, 200–220.
Learning Mathematics for Teaching Project. (2011). Measuring the mathematical quality of instruction.
Journal of Mathematics Teacher Education, 14, 25–47.
Lubienski, S. T. (2008). On “gap gazing” in mathematics education: The need for gaps analyses. Journal
for Research in Mathematics Education, 39, 350–356.
Marshall, J., Smart, J., & Horton, R. (2010). The design and validation of EQUIP: An instrument to assess
inquiry-based instruction. International Journal of Science & Mathematics Education, 8(2), 299–321.
Marshall, J. C., Smart, J., Lotter, C., & Sirbu, C. (2011). Comparative analysis of two inquiry obser-
vational protocols: Striving to better understand the quality of teacher-facilitated inquiry-based
instruction. School Science and Mathematics, 111(6), 306–315.
Matsumura, L. C., Garnier, H., Slater, S. C., & Boston, M. (2008). Toward measuring instructional inter-
actions ‘at-scale’. Educational Assessment, 13, 267–300.
McCaffrey, D. F., Hamilton, L. S., Stecher, B. M., Klein, S. P., Bugliari, D., & Robyn, A. (2001). Inter-
actions among instructional practices, curriculum, and student achievement: The case of Stand-
ards-based high school mathematics. Journal for Research in Mathematics Education, 32, 493–517.
McCaslin, M., Good, T. L., Nichols, S., Zhang, J., Wiley, C. R., Bozack, A. R., et al. (2006). Compre-
hensive school reform: An observational study of teaching in grades 3 through 5. The Elementary
School Journal, 106(4), 313–331.
Mendez, E. P., Sherin, M. G., & Louis, D. A. (2007). Multiple perspectives on the development of an
eighth-grade mathematical discourse community. The Elementary School Journal, 108(1), 41–61.
Morrell, P. D., Wainwright, C., & Flick, L. (2004). Reform teaching strategies used by student teachers.
School Science and Mathematics, 104(5), 199–213.
Morrone, A. S., Harkness, S. S., D’Ambrosio, B., & Caulfield, R. (2004). Patterns of instructional dis-
course that promote the perception of mastery goals in a social constructivist mathematics course.
Educational Studies in Mathematics, 56(1), 19–38.
National Council of Teachers of Mathematics. (2014). Principles to actions: Ensuring mathematical
success for all. Reston, VA: Author.
Newton, K. J. (2009). Instructional practices related to prospective elementary school teachers’ motiva-
tion for fractions. Journal of Mathematics Teacher Education, 12(2), 89–109.
Ottmar, E. R., Rimm-Kaufman, S. E., Berry, R. Q., & Larsen, R. A. (2013). Does the responsive classroom
approach affect the use of standards-based mathematics teaching practices? Results from a randomized
controlled trial. The Elementary School Journal, 113(3), 434–457.
Peters Burton, E., Kaminsky, S. E., Lynch, S., Behrend, T., Han, E., Ross, K., et al. (2014). Wayne School of
Engineering: Case study of a rural inclusive STEM-Focused High School. School Science and Math-
ematics, 114(6), 280–290.
Pianta, R. C., Belsky, J., Vandergrift, N., Houts, R., & Morrison, F. J. (2008). Classroom effects on chil-
dren’s achievement trajectories in elementary school. American Educational Research Journal, 45(2),
365–397.
Ross, S. M., Smith, L. J., & Alberg, M. (1998). The school observation measure (SOM VC). Memphis:
Center for Research in Educational Policy, The University of Memphis.
Rubel, L. H., & Chu, H. (2012). Reinscribing urban: Teaching high school mathematics in low income,
urban communities of color. Journal of Mathematics Teacher Education, 15(1), 39–52.
Santagata, R., & Barbieri, A. (2005). Mathematics teaching in Italy: A cross-cultural video analysis. Math-
ematical Thinking and Learning, 7(4), 291–312.
Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., et al. (2002). Measuring reform
practices in science and mathematics classrooms: The reformed teaching observation protocol. School
Science and Mathematics, 102, 245–253.
Saxe, G. B., Gearhart, M., & Seltzer, M. (1999). Relations between classroom practices and student learning
in the domain of fractions. Cognition and Instruction, 17, 1–24.
Schifter, D. A., & Simon, M. (1992). Assessing teachers’ development of a constructivist view of mathemat-
ics learning. Teacher and Teacher Education, 8, 187–197.
Schlesinger, L., & Jentsch, A. (2016). Theoretical and methodological challenges in measuring instructional
quality in mathematics education using classroom observations. ZDM Mathematics Education, 48(1–
2), 29–40.
Schoen, H. L., Cebulla, K. J., Finn, K. F., & Fi, C. (2003). Teacher variables that relate to student achieve-
ment when using a standards-based curriculum. Journal for Research in Mathematics Education, 34,
228–259.
Schoenfeld, A. H. (2013). Classroom observations in theory and practice. ZDM Mathematics Education,
45(4), 607–621.
Sears, R., & Chavez, O. (2014). Opportunities to engage with proof: the nature of proof tasks in two geom-
etry textbooks and its influence on enacted lessons. ZDM Mathematics Education, 46, 767–780.
Simon, M. A., & Shifter, D. (1991). Towards a constructivist perspective: An intervention study of math-
ematics teacher development. Educational Studies in Mathematics, 22, 309–331.
Slavin, R. E., Lake, C., & Groff, C. (2009). Effective programs in middle and high school mathematics: A
best evidence synthesis. Review of Educational Research, 79, 839–911.
Smith, M. (2014). Tools as a catalyst for practitioners’ thinking. Mathematics Teacher Educator, 3, 3–7.
Steenbergen-Hu, S., & Cooper, H. (2013). A meta-analysis of the effectiveness of intelligent tutoring sys-
tems on K–12 students’ mathematical learning. Journal of Educational Psychology, 105, 970–987.
Swafford, J. O., Jones, G. A., & Thornton, C. A. (1997). Increased knowledge in geometry and instructional
practice. Journal for Research in Mathematics Education, 28, 467–483.
Tarr, J. E., Reys, R. E., Reys, B. J., Chavez, O., Shih, J., & Osterlind, S. (2008). The impact of middle grades
mathematics curricula on student achievement and the classroom learning environment. Journal for
Research in Mathematics Education, 39, 247–280.
Toerner, G., & Arzarello, F. (2012). Grading mathematics education research journals. Newsletter of the
European Mathematical Society, 86, 52–54.
U.S. Department of Education, Institute of Education Sciences, What Works Clearinghouse. (2012, Febru-
ary). High School Math intervention report: I CAN Learn®. Retrieved from http://whatworks.ed.gov.
U.S. Department of Education, Institute of Education Sciences, What Works Clearinghouse. (2013, Janu-
ary). High School Mathematics intervention report: Carnegie Learning Curricula and Cognitive
Tutor®. Retrieved from http://whatworks.ed.gov.
Valli, L., & Croninger, R. (2002). High quality teaching of foundational skills in mathematics and read-
ing (# 0115389). Washington: National Science Foundation Interdisciplinary Educational Research
Initiative.
Wainwright, C., Morrell, P. D., Flick, L., & Schepige, A. (2004). Observation of reform teaching in under-
graduate level mathematics and science courses. School Science and Mathematics, 104(7), 322–335.
Walkington, C., Arora, P., Ihorn, S., Gordon, J., Walker, M., Abraham, L., & Marder, M. (2012). Develop-
ment of the UTeach observation protocol: A classroom observation instrument to evaluate mathemat-
ics and science teachers from the UTeach preparation program. Unpublished paper. Southern Method-
ist University.
Walkowiak, T., Berry, R., Meyer, J., Rimm-Kaufman, S., & Ottmar, E. (2014). Introducing an observational
measure of standards-based mathematics teaching practices: Evidence of validity and score reliability.
Educational Studies in Mathematics, 85, 109–128.
Wasserman, N., & Walkington, C. (2014). Exploring links between beginning UTeachers’ beliefs and
observed classroom practices. Teacher Education & Practice, 27(2/3), 376–401.
Wilhelm, A. G., & Kim, S. (2015). Generalizing from observations of mathematics teachers’ instructional
practice using the instructional quality assessment. Journal for Research in Mathematics Education,
46(3), 270–279.
Williams, S., & Leatham, K. (2017). Journal quality in mathematics education. Journal for Research in
Mathematics Education, 48(4), 369–396.

Publisher’s Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
