
Zach Himmelsbach

Annotated Bibliography and Reflective Research Memo

Research Question: How can the tools of learning analytics, educational data mining, and data
science be used and/or expanded to measure and improve teaching practice and quality?

Petrilli, M. (2018). Big Data Transforms Education Research. Education Next, 18(1).

Petrilli proposes that machine learning, which underpins some education technologies
that show limited success in the classroom, could have a large impact on basic
education research. He writes that these tools “may enable us finally to unlock the black
box of the classroom” and “equip teachers, school and district leaders, and policy
makers with the sort of insights and analytics” that will lead to better outcomes for
students. He describes the traditional approach to classroom research as involving a
bunch of graduate students observing classrooms (or video) and spending many hours
coding it. This approach, although it has generated some insights, is “incredibly labor-
intensive, costs gobs of money, and thus may not be practical.”

Researchers have responded to this difficulty by concentrating on administrative
measures, which has not illuminated the thing that probably matters much more: what
is actually happening in classrooms. Recently though, some education researchers have
applied machine learning techniques to automate the production of richer measures.
Petrilli references one group that has attempted to automate measures of teaching
practice and another that has focused on student learning, noting that the second group
has made more progress so far. He closes by suggesting that national studies like NAEP
could collect audio data to which we could apply these tools and that “the power duo of
big data and machine learning will enable us to build a research enterprise that actually
improves classroom instruction.”

Petrilli makes a direct pitch for the value of this kind of work, but glosses over several
important details. He fails to mark the vast difference in the volume of research applying
these tools to learning and to teaching. Journals dedicated to work in this field publish
articles almost exclusively focused on learning. In fact, the Kelly article he mentions is
the only article I could find that applies machine learning techniques to teacher
observation in K-12. I get into important details of the Kelly paper in the next entry of
this bibliography, but I think it’s fair to say that Petrilli overstates researchers’ progress
on this challenge (and even their attention to it). This article captures the popular-
science view of machine learning’s potential to impact social policy, but it misleads us
regarding the state of this work in education, collapses learning research and teaching
research (not that they should be so separate, but they currently are), and does not
illuminate what the current challenges are in realizing the promise of these tools.
Kelly, S., Olney, A., Donnelly, P., Nystrand, M., & D'Mello, S. (2018). Automatically Measuring
Question Authenticity in Real-World Classrooms. Educational Researcher, 47(7), 451-464.

Kelly and his co-authors use machine learning methods to detect teachers’ use of
“authentic questions,” for which there are no pre-determined correct answers. They
motivate the paper with the observation that current methods to measure the presence
of these kinds of teacher practices are not scalable. They deploy a pipeline of automatic
speech recognition, natural language processing tools, and machine learning algorithms
to predict how humans would code 132 high-quality audio recordings as well as an
archival database of 451 text transcripts. They conclude that the correlations between
their model’s predictions and the human coding (r = 0.60 and r = 0.69, respectively)
demonstrate that their approach can provide a “valuable complement to human coding
in research efforts.”

Their approach differs for the two data sources in that the first must be automatically
transcribed (presenting an additional technical hurdle) while the second has already
been transcribed manually. Beyond this, their modeling approach in the two cases is
similar: they aggregate manually constructed features (e.g. number of each part-of-
speech) at the observation level (i.e. one row per recording) and then use a regression
tree model to predict the proportion of authentic questions in that recording (or text).
To avoid overfitting, they employ a standard cross-validation approach (LOO-CV at the
teacher and school level, in the two respective cases).
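
To make this setup concrete, the following is a minimal sketch of the general approach, assuming scikit-learn; the file name, feature names, and model settings are hypothetical stand-ins, and Kelly et al.'s actual feature engineering and speech-recognition front end are considerably more elaborate.

```python
# Sketch: regression tree predicting the proportion of authentic questions per
# recording, with leave-one-group-out cross-validation at the teacher level.
# The file, feature names, and settings are hypothetical.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("observations.csv")  # one row per recording (hypothetical data)
feature_cols = ["n_nouns", "n_verbs", "n_wh_words", "mean_utterance_length"]
X, y = df[feature_cols], df["prop_authentic_questions"]
groups = df["teacher_id"]             # hold out one teacher at a time

model = DecisionTreeRegressor(max_depth=5, random_state=0)
preds = cross_val_predict(model, X, y, cv=LeaveOneGroupOut(), groups=groups)

r, _ = pearsonr(y, preds)             # agreement with the human coding
print(f"Correlation with human codes: r = {r:.2f}")
```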

It excites me that Kelly and his co-authors have pioneered applying machine learning
tools to automating measures of teacher practice, and the paper accomplishes this
while integrating the work with literature on teaching quality and practice (there are a
few papers, from CS authors, that deploy similar approaches in post-secondary settings
but wholly ignore this kind of literature). However, their modeling approach does not
take advantage of recent advances in machine learning (and computing power) that
may produce results that are more useful in practice. As the Goldberg text in this
bibliography establishes, deep learning approaches have taken natural language
processing from basic research to a stage where it’s commercially useful. Education
research with machine learning should be employing the same state-of-the-art tools. In
summary, I’m not convinced their results are indicative of an approach that would be “a
valuable complement” to human coding, but I do agree with their statement that
“research that pairs established theories of teaching and learning with technological
innovations of the kind used in this study may soon lead to a new era in research and
school improvement.”
Kane, T., Kerr, K., Pianta, R. & Measures of Effective Teaching Project. (2014). Designing teacher
evaluation systems: New guidance from the measures of effective teaching project (First ed.).
San Francisco, CA: Jossey-Bass.

The editors claim that “real improvement requires quality measurement.” Teaching,
though, as a rich interaction between instructors, learners, and content, defies capture
by a single measure. The MET project combined student surveys, multiple observation
instruments scored by trained raters, and student achievement (on both state tests and
more challenging assessments). The editors argue that student surveys provided reliable
feedback on teaching that predicted student learning, and that the MET study’s use of
random assignment establishes that their combination of measures can identify
teachers who cause students to learn more.

Chapters of the book, written by the analysts and instrument developers who worked
on the MET project, explore the individual measures used, the use of data for feedback
and evaluation, and the interactions between multiple measures and their contexts. The
conclusion proposes that assessment of teacher performance is “the new normal,” but
that the operationalization of these assessments has not been effective in that it has not
succeeded in identifying high- and low-performing teachers or shed light on what
constitutes quality practice. Pianta and Kerr argue that the MET project has provided a
research infrastructure that is capable of making progress on this front, but we should
be cautious that this is an optimistic interpretation of the MET results (the book came
out under the auspices of the project’s funder). They also claim that the value-added
approach to teacher assessment is weak as a diagnostic and support tool.

The book presents evidence that the use of multiple measures of teachers (something
that the authors say was only seldom done in the past) offers an improvement over any
single measure (they substantiate this by showing that it explains more variation in
student outcomes). They demonstrate that having trained raters score teaching with
several observational instruments provides valuable information that measures
constructed from administrative data lack. To show this, they had to have teams of
trained raters score thousands of videos (and with multiple instruments). At scale, this
would be prohibitively expensive. So, the MET results provide further motivation for
researching low-cost, scalable methods for collecting these kinds of measures. The
resulting database also holds thousands of hours of video that could be used in this
research.
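
The multiple-measures claim can be illustrated with a simple comparison of nested regression models: adding survey and value-added measures to an observation score should raise the share of variance in student outcome gains explained. The sketch below assumes scikit-learn, and the column names and data file are hypothetical.

```python
# Sketch: does a combination of measures explain more variation in student
# outcome gains than a single measure? Column names and data are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("teacher_measures.csv")   # hypothetical file
y = df["student_gain"]

obs_only = ["observation_score"]
combined = ["observation_score", "student_survey", "value_added"]

r2_single = LinearRegression().fit(df[obs_only], y).score(df[obs_only], y)
r2_multi = LinearRegression().fit(df[combined], y).score(df[combined], y)
print(f"R^2 single measure: {r2_single:.2f}, combined measures: {r2_multi:.2f}")
```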

Blikstein, P., & Worsley, M. (2016). Multimodal Learning Analytics and Education Data Mining:
using computational technologies to measure complex learning tasks. Journal of Learning
Analytics, 3(2), 220-238. https://doi.org/10.18608/jla.2016.32.11

Blikstein and Worsley propose that new data-collection technologies can be paired with
machine learning tools to provide new insights into learning. Most work in learning
analytics and educational data mining prior to this paper focused on online courses
(particularly MOOCs) and cognitive tutors that provide highly structured data for the
researcher. This restricts the use of machine learning tools in education to studying
interactions with computers. Using these tools with richer data collected by sensors in
the learning environment would allow the examination of more complex, open-ended
learning experiences.

The authors, arguing from a constructivist perspective, claim that a lack of appropriate
measures has created an asymmetry in education debates around progressive practices.
Multimodal learning analytics could reduce or eliminate this asymmetry. They claim that
technologies using this approach could enable teachers to provide students with real-
time feedback and adjust their instruction in response to students’ difficulties as well as
aggregate measures of things like student attention over time. They present several
small-scale explorations of these technologies.

Blikstein and Worsley claim, in one of the primary educational data science journals,
that the majority of work prior to 2016 focused on computer interactions. They propose
moving beyond this to explore other learning environments, but their emphasis is on
providing a richer picture of what students are doing, not teachers (they do make one
passing reference to the fact that teachers could be given information about their use
of gestures). There’s an implicit notion here that teachers, if they only had the right
information about the state of their students, would be able to effectively intervene, a
tacit denial of variation in teacher skill. In work following this well-cited article, the focus
of educational data science research has been primarily, if not almost exclusively, on
examining students. But the same technologies and methods could be used to gather
rich data about teaching.

Gitomer, D. (2009). Measurement issues and assessment for teaching quality. Thousand Oaks:
Sage Publications.

This book’s chapters explore issues related to assessing teacher quality. It organizes work
by various authors into three sections: “Measuring Teaching Quality for Professional
Entry,” “Measuring Teaching Quality In Practice,” and “Measuring Teaching Quality In
Context.” Each section ends with a synthesis of the included articles.

The first section describes how measurement is used for entry into the teaching
profession. The authors describe how, traditionally, this involves standardized tests of
content and pedagogical knowledge. Proficiency benchmarks on these tests are legal
manifestations of what is considered sufficient or “good enough” to be a teacher, but
Linda Darling-Hammond argues that “the definition of sufficient in many cases is what
lots of education professionals would consider not good enough.” The chapters
demonstrate a consensus among the authors that, though traditionally considered a
discrete event, entry into the teaching profession should instead be conceptualized (and
enacted) as a process. So, they argue, assessment of teachers must go beyond
gatekeeping initial entry to supporting a longer-term process including “recruitment,
selection, hiring, professional growth, and tenure.” They go on to express skepticism
that student achievement outcomes alone can provide the necessary support for these
stages. One author proposes that the gap would be filled by “measures of professional
judgement in the appropriate use of teacher knowledge and skill.” A further challenge is
that these measures should not only predict success but also provide useful feedback
for improving practice. It’s mentioned that the proposed entry screenings are “very
labor intensive” as they require, among other things, the review of a teaching sample.

The second section, on measuring teacher quality in practice, explores expanding the
types of content knowledge we measure, the pros and cons of value-added measures,
and how a state education agency operationalizes teacher measurement. The authors in
this section present evidence that teachers matter (e.g. that teacher effects explain
substantial portions of variance in student outcomes) and that the most widely used
measures have a low correlation with student outcomes (i.e. they are intended as
proxies for teacher effectiveness, but they aren’t good proxies). Typically used measures
of “quality” are in fact attributes of the teachers and not “direct measures of their actual
teaching.” Ball and Hill make the case that we should switch from thinking about
teacher quality to teaching quality. They also conduct an observational study to
evaluate instruction quality, noting that “the need for this sort of study raises questions
of scalability.” Several chapters raise concerns about the scalability of the multiple-
measures regimes that seem adequate. The synthesis of this section calls such a system
“clearly necessary,” while lamenting that “gaining these measures for large numbers of
teachers often at several process points may well prove intractable from a cost point of
view.”

The third section discusses the limitations of achievement measures. A lack of consistent
findings linking teacher attributes or practices to student outcomes is discouraging. For
instance, most estimates of teacher effects are 1/20th of the black-white achievement
gap. Yet there is evidence that teachers matter, suggesting that the available metrics
(and/or the methods we apply to them) do not capture the important characteristics
and practices of teachers. Some of the arguments they present on this point may not be
compelling (e.g. we all remember a teacher who had an impact on us) to those
demanding rigorous evidence.

Concerns raised throughout this book about scalability and utility of measures motivate
the exploration of a data science-based approach to teacher measurement. First,
automated measures of teacher practices, based on video recordings of teachers, could
dramatically reduce the cost of scaling a multi-measure regime. If an algorithm could
accurately predict how an expert rater would mark a video, we could circumvent the need
to deploy an enormous workforce of trained observers. Second, these tools may be
useful in identifying elements of teaching practice that explain more of the variation in
student outcomes (though admittedly, this is a much harder problem).
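
Before any such automated measure could stand in for trained observers, its scores would need to be checked against expert ratings. A minimal sketch of that validation step, with invented scores and an assumed 1-7 rating scale, might look like this:

```python
# Sketch: checking automated scores against expert ratings before relying on
# them. The scores below are invented; a 1-7 scale is assumed.
import numpy as np
from scipy.stats import pearsonr

expert = np.array([3, 5, 4, 2, 6, 4, 3, 5])   # expert rater scores
auto = np.array([3, 4, 4, 3, 6, 5, 3, 5])     # algorithm's predicted scores

r, _ = pearsonr(expert, auto)
within_one = np.mean(np.abs(expert - auto) <= 1)   # adjacent agreement rate
print(f"r = {r:.2f}, within-one-point agreement = {within_one:.0%}")
```
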
Goldberg, Y., & Hirst, G. (2017). Neural Network Methods for Natural Language Processing (Vol.
10). San Rafael: Morgan & Claypool.

Goldberg and Hirst discuss the potential of deep learning for advancing work on natural
language processing problems. The book is by-and-large technical, covering how to
implement neural-network models and make decisions about which kinds of neural
networks (or combinations of them) to employ for particular types of problems. Natural
language has three properties that pose serious (perhaps insurmountable) challenges
for linear models: it is “discrete, compositional, and sparse.” In brief, language (unlike,
say, color) is not continuous, as the building blocks of language, words, are discrete;
“compositional” here refers to the fact that meaning (which is what we care about for
most applications) arises from a very complex set of rules; and sparse refers to the fact
that even in large datasets, you are unlikely to see the same phrase or sentence many
times (there may even be words that only appear in one observation of your data).
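
A toy example makes the sparsity point concrete: even in a tiny corpus, a bag-of-words matrix is mostly zeros and many words occur in only one document. The documents below are invented.

```python
# Toy illustration of sparsity: most entries of a bag-of-words matrix are zero,
# and many words appear in only one document. The documents are invented.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Why do you think the character made that choice?",
    "What is the capital of France?",
    "Explain your reasoning to your partner.",
]
X = CountVectorizer().fit_transform(docs).toarray()

print("Share of zero entries:", np.mean(X == 0))
print("Words appearing in only one document:", int(((X > 0).sum(axis=0) == 1).sum()))
```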

Neural networks employ an embedding layer that maps discrete language into
continuous vectors (of relatively low dimensionality), alleviating the discreteness and
sparsity problems. One type of neural network, multi-layer perceptrons (MLPs), can be
combined with pre-trained word embeddings (an embedding algorithm trained on an
outside, large body of text) and used where a linear model was previously used. Deep
learning techniques allow researchers to capture critical features of language like word
order without knowing a priori - and then hand-constructing - the correct features of
the data. The authors present evidence that MLPs combined with pre-trained
embedding layers often lead to superior classification accuracy.
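
As a rough illustration of this architecture, here is a minimal PyTorch sketch of an MLP over averaged word embeddings; the embedding matrix is random here, standing in for pre-trained vectors such as word2vec or GloVe, and the vocabulary size, dimensions, and class count are arbitrary.

```python
# Minimal PyTorch sketch of an MLP over averaged word embeddings. The
# "pretrained" matrix is random here, a stand-in for real pre-trained vectors.
import torch
import torch.nn as nn

vocab_size, embed_dim, n_classes = 10_000, 100, 2
pretrained = torch.randn(vocab_size, embed_dim)   # placeholder for word2vec/GloVe

class BagOfEmbeddingsMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        vectors = self.embed(token_ids).mean(dim=1)  # average over the sequence
        return self.mlp(vectors)

model = BagOfEmbeddingsMLP()
logits = model(torch.randint(0, vocab_size, (4, 20)))  # a dummy batch
print(logits.shape)  # torch.Size([4, 2])
```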

Goldberg and Hirst describe the inadequacy of linear models for many problems
involving natural language and provide examples of deep learning approaches (which
can fit arbitrary non-linear relationships, besides other benefits) outperforming linear
methods, in some cases achieving “stellar accuracies.” The ability to incorporate pre-
trained embedding layers also hints that we may be able to have success without
enormous amounts of data. These techniques could be applied in attempts to automate
measures of teaching based on audio and video recordings of classrooms. The book
does not address concerns about potential algorithmic bias that would be paramount in
porting these tools to educational research, and future work that adopts these tools
should examine this prospect carefully (perhaps the use of pre-trained embedding layers
presents a problem; it may matter what corpus they are trained on).

Gitomer, D., Bell, C., Qi, Y., McCaffrey, D., Hamre, B., & Pianta, R. (2014). The Instructional
Challenge in Improving Teaching Quality: Lessons From a Classroom Observation Protocol.
Teachers College Record, 116(6).

The researchers use the CLASS-S observation instrument to evaluate instructional
practice in algebra classrooms. They observed 82 algebra I teachers in middle and high
schools, 4-5 times each, with pairs of raters. “Master coders” also scored the videos and
the rater scores were compared to those as well as teachers’ self-assessments.
Teachers’ ratings of their own practice were positively correlated with observers’ ratings, but
aspects of practice that most need improvement are those that are rated most
differently by teachers and external observers. (There’s a tacit assumption here that the
coders are correct about these aspects of the teachers’ practice.)

Collecting these ratings was highly labor-intensive: observers received extensive training
and were calibrated on a weekly basis, and they conducted ratings on “an almost full-
time basis over the course of the project.” Despite this investment, on some items of
the instrument (notably the ones that most needed improvement), raters differed
markedly from the master coders.

Gitomer et al present evidence that not only is collecting these kinds of measures very
expensive, it’s also difficult to train raters to replicate the scores of master coders. They
also cite evidence that using observation protocols can result in improvements in
teaching. Both of these facts motivate the development of automated approaches to
collecting rich measures of teaching practice.

Vitiello, V., Bassok, D., Hamre, B., Player, D., & Williford, A. (2018). Measuring the quality of
teacher–child interactions at scale: Comparing research-based and state observation
approaches. Early Childhood Research Quarterly, 44, 161-169.

Vitiello et al examine how observations conducted as part of state accountability
programs compare to those conducted by researchers. State and researcher teams
rated 85 classrooms in Louisiana using the CLASS instrument for Pre-K (which is
currently used by 45% of QRIS systems). Correlations on domains of the instrument
between the teams ranged from .21 to .43.

The paper argues for the value of instruments like CLASS, reviewing evidence that
teachers who score higher on CLASS provide greater learning gains for students, though
the effects are modest. Still, it’s unclear how well these kinds of measures perform when used for accountability
purposes. At scale, “the same emphasis on reliability, calibration, double-coding, and
multiple visits that has been common in research may not be as feasible […] due to costs
or other constraints.”

This paper demonstrates that the challenge of effectively scaling these types of
measures is a problem right now, under existing accountability systems. Local raters’
scores differed from researchers’ scores on average, and also had more variability. That
said, the local raters’ scores were more positively correlated with some student
outcome measures. This suggests an additional application of machine learning tools: in
addition to replicating expert measures, we may be able to better predict outcome
measures based on classroom observation. (Whether predicting student outcomes
should be the primary goal of the observation instruments is an open question.)
Taylor, E. S., & Tyler, J. H. (2012). Can teacher evaluation improve teaching? Education Next,
12(4).

The authors present a causal estimate of the effect of a teacher evaluation program in
Cincinnati, based largely on classroom observation, on teachers’ effectiveness (as
measured by test score gains). They find that teachers are more effective in the year
that they are evaluated and even more effective in the years after (11% of a standard
deviation in math). The identification strategy here requires that other factors of
teachers’ experience do not change simultaneously with evaluation, which seems
reasonable but is difficult to assess without particular knowledge of the context.

They note that these improvements appear in spite of the low variance on the
evaluation scores across teachers and hypothesize that it’s the reflection on their
teaching prompted by the evaluation, rather than the scores, that leads to improvements
in student outcomes. It’s also of interest that in this program, teachers are evaluated by
their peers. The improvements are impressive, but the authors question whether the
benefits outweigh the costs: experienced teachers are used as peer evaluators, so they
need to be pulled out of the classroom. The cost per teacher evaluated is estimated at
$7,500 per year (not counting the lost-learning from taking experienced teachers out of
their classrooms), 90% of which is evaluator salaries.

This paper establishes that observation-based accountability can improve student
outcomes, but that it’s currently very expensive. Could some of these labor costs be
replaced by automation?
Reflective Research Memo

In recent years, educational data science has emerged as a distinct sub-field in education research. Two communities within this sub-field, the learning analytics community and the educational data-mining community, have recently broadened their focus from primarily examining student interactions with computers to exploring a wider range of learning environments (Blikstein & Worsley, 2016). Yet, even with this expansion, education researchers’ applications of machine learning – and other tools of data science – have remained almost exclusively focused on students’ behavior and processes, tacitly assuming that teachers are interchangeable, that if we could only provide them with the right information about their students, any teacher could effectively intervene.

At the same time, researchers focused on assessing teacher practice and quality provide evidence of the substantial variation in teachers’ practices and effectiveness, while noting the difficulties of crafting and scaling measures that can evaluate and guide practice (Gitomer, 2009). Several of the challenges inherent in this work may be addressed by educational data scientists who focus on teaching. Indeed, some have proposed that machine learning techniques could usher in a new era of research and illuminate the “black box” of the classroom (Petrilli, 2018). Petrilli submits that a main obstacle to collecting rich data about classrooms is cost, and other researchers have raised the same concern (Taylor & Tyler, 2012; Gitomer et al., 2014).

Currently, if we want to measure teacher practices, we need to train a team of raters to manually code while they observe a classroom. For research purposes, this is usually a team of graduate students or researchers, but when incorporated into policy, it has been peer teachers or state employees (Vitiello et al., 2018; Taylor & Tyler, 2012). Not only is it expensive to train a large team of raters, but non-researchers, even when trained, may rate teachers systematically differently than researchers (Gitomer et al., 2014). This is a concern given that the observation instruments are typically developed and validated by research teams. And it’s not just that the non-researchers’ ratings are different on average; there’s also more variation between raters scoring the same observation.

One potential contribution of educational data science research would be to automate the scoring of classroom video and audio recordings for a range of observation instruments. If an algorithm could be trained to accurately predict how an expert would rate a given recording, it could be used to initially code large bodies of recorded observations (we may then want to flag uncertain cases for human review), dramatically lowering costs and allowing researchers to get a rich picture of teaching practice at scale. In fact, this may be the only way to collect a broad set of rich measures at scale, which we are motivated to do by two research findings: first, multiple measures used in concert explain more variation in student outcomes than any one measure alone (Kane, Kerr, & Pianta, 2014), and second, evaluation schemes focused on observation can improve student outcomes (Taylor & Tyler, 2012). Beyond replicating existing measures, machine learning techniques may be used to produce new measures that align more directly with student outcomes, as one concern about the current instruments is their modest correlation with students’ performance on assessments (Gitomer, 2009).
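
One way such a triage workflow might look in practice is sketched below: an ensemble model scores new recordings, and the predictions the ensemble disagrees about most are routed to human raters. The data, features, and review budget are hypothetical, and ensemble spread is only one of several possible uncertainty signals.

```python
# Sketch: score recordings automatically and flag the least certain predictions
# for human review. Features, scores, and the review budget are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_coded, y_coded = rng.normal(size=(200, 10)), rng.uniform(size=200)  # stand-ins
X_new = rng.normal(size=(50, 10))                                     # unscored recordings

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_coded, y_coded)

# Spread across the ensemble's trees as a rough uncertainty signal.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
uncertainty = per_tree.std(axis=0)

flagged = np.argsort(uncertainty)[-10:]   # route the 10 least certain to raters
print(f"Auto-coded: {len(X_new) - len(flagged)}, flagged for human review: {len(flagged)}")
```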

To date there exists only one paper focused specifically on applying data science tools directly to questions about K-12 teaching (Kelly et al., 2018). Kelly and his co-authors set out to train a model that predicts the proportion of authentic questions asked in a lesson. They are moderately successful in this, but the model does not perform well enough to replace human coding. Admirably, they anchor their application of machine learning techniques in the extensive literature on teaching and advocate for “research that pairs established theories of teaching and learning with technological innovations of the kind used in this study.” (Emphasis mine.) While this pioneering effort is exciting, it neglects to take advantage of more recent developments in machine learning and natural language processing.

In their paper, Kelly and his co-authors employ a regression tree model. In recent years, substantial progress has been made on supervised learning problems, particularly those centered on natural language (which most observations of teachers heavily involve), through the use of neural network models (Goldberg & Hirst, 2017). Goldberg and Hirst present numerous examples of these kinds of models outperforming other approaches and achieving “stellar accuracies.” Recent advances in computing power have made training these models possible, and they underpin recent commercial voice technologies, like Alexa. Implementing these models, relative to older techniques like regression trees, is more computationally demanding and requires more specialized training on the part of the researcher, but may provide meaningful improvements in accuracy and offer other benefits (like ease of improving the model with additional data over time).
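
As a small illustration of that last benefit, the sketch below shows how a neural model’s training can simply continue as newly human-coded observations arrive, rather than refitting from scratch as a regression tree typically would; the architecture, dimensions, and update schedule are placeholders.

```python
# Sketch: continuing to train a neural model as newly human-coded observations
# arrive. The architecture, dimensions, and data here are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def update_with_new_codes(features, human_scores, epochs=3):
    """Fine-tune on a fresh batch of human-coded recordings."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features).squeeze(-1), human_scores)
        loss.backward()
        optimizer.step()

# e.g., after raters code another 32 recordings:
update_with_new_codes(torch.randn(32, 100), torch.rand(32))
```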

Consideration of the educational data science and teacher measurement literatures suggests opportunities for researchers with data science training to apply state-of-the-art machine learning methods to challenges in teacher measurement research. One focus of this work would be automating the scoring of classroom observation instruments like CLASS. Another would be going beyond existing measures to identify aspects of teaching practice that relate more strongly to student outcomes. The recently developed toolkit of neural network models appears to be a promising way to approach these challenges, especially those that are focused on teachers’ language.
