Educational Research and Evaluation: An International Journal On Theory and Practice
To cite this article: Daniel Muijs (2006) Measuring teacher effectiveness: Some methodological
reflections, Educational Research and Evaluation: An International Journal on Theory and Practice,
12:1, 53-74, DOI: 10.1080/13803610500392236
Educational Research and Evaluation
Vol. 12, No. 1, February 2006, pp. 53 – 74
Teacher effectiveness is an issue that has received increased attention in recent years, as researchers
have become aware of limitations in models that see the school as the key arena for improving pupil
learning outcomes. This renewed interest makes it timely to look again at the methods used in
teacher effectiveness research. This article presents an overview of some key issues in researching
teacher effectiveness from a process-product perspective. The choice of outcome measure is a first
key area. Traditionally most teacher effectiveness research has utilised externally published
standardised tests. However, it will be argued that this is too limited in the light of societal demands
on education. The actual measurement of teacher factors is an issue whose difficulty has often been
underestimated. Classroom observation, surveys of teachers and students, and qualitative methods
such as interviews have been most frequently employed. The advantages and disadvantages of each
are discussed. In the final section, the main analysis methods suitable for teacher effectiveness
research are outlined.
Introduction
Teacher effectiveness is an issue that has received increased attention in recent years,
as researchers have become aware of limitations in models that see the school as the
key arena for improving pupil learning outcomes. Multilevel models of school effects
show that the variance in pupil outcomes that can be explained at the classroom level
is around twice that at the school level, at around 10 – 20% of variance to be explained
on average. Furthermore, in some studies up to 75% of this variance has been
explained by teacher behaviours (Muijs & Reynolds, 2002).
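The variance partitioning behind these figures can be illustrated with a minimal sketch. The design sizes and variance shares below are invented for illustration; a naive decomposition from group means is shown only to convey the logic, since it ignores sampling noise that a real multilevel model (e.g., in MLwiN or lme4) would account for.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design: 20 schools x 4 classes x 25 pupils, with invented
# variance shares (school 5%, classroom 20%, pupil residual 75%).
school_eff = rng.normal(0.0, np.sqrt(0.05), size=20)
class_eff = rng.normal(0.0, np.sqrt(0.20), size=(20, 4))
pupil_eff = rng.normal(0.0, np.sqrt(0.75), size=(20, 4, 25))
scores = school_eff[:, None, None] + class_eff[:, :, None] + pupil_eff

# Naive decomposition from group means; estimates are inflated by sampling
# noise, which is why multilevel software is used in practice.
total_var = scores.var()
school_var = scores.mean(axis=(1, 2)).var()
class_var = scores.mean(axis=2).var() - school_var

school_share = school_var / total_var
class_share = class_var / total_var
print(f"school: {school_share:.2f}, classroom: {class_share:.2f}")
```

With these invented shares, the classroom level explains roughly twice the variance of the school level, mirroring the pattern reported in the literature.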
*School of Education, University of Manchester, Oxford Road, Manchester, M13 9PL, UK.
E-mail: Daniel.Muijs@manchester.ac.uk
This interest in teacher effectiveness, along with its potential power as a lever of
educational improvement, clearly begs the question of how we can efficiently and
reliably measure this factor. In view of this renewed interest in the field, it is
timely to present an overview of the relatively extensive body of research in this
area.
The basic format of the traditional ‘‘teacher effectiveness’’ study is a so-called
process-product design, similar to that used in most school effectiveness research
(Brophy & Good, 1986; Teddlie & Reynolds, 2000). Typically outcomes are
measured, and a classroom observation instrument or questionnaire is used to
measure teacher factors such as classroom behaviours or pedagogical content
knowledge, and how that might affect these outcomes. The aim is then to see which
behaviours, beliefs, or other teacher factors (if any) are associated with more positive
outcomes. This model, while empirically supported, does, of course, have its
limitations. In particular, it does not sufficiently take into account the fact that
teachers’ roles are broader than classroom practice alone, and include
management roles, pastoral roles, and relationships with parents and the
community (Campbell, Kyriakides, Muijs, & Robinson, 2004;
Kyriakides, Campbell, & Christofidou, 2002). However, for the purposes of this
article we will concentrate on three key questions that arise from the traditional
model: How do we measure outcomes? How do we measure teacher factors? And how
do we analyse the data from teacher effectiveness studies?
Outcome Measures
In most teacher effectiveness research to date, the outcome studied has been
academic achievement. This does not have to be the case, however. In theory, an
effectiveness research design (whether teacher or school effectiveness, or indeed in
other fields such as organisational effectiveness in business studies) is neutral with
respect to outcomes, being concerned with means rather than goals (Teddlie &
Reynolds, 2000). Therefore it is entirely possible to look at what behaviours can
positively influence students’ self-esteem, prosocial behaviours or moral values. The
choice of outcome is of crucial importance, as it is clear that it is the goal of the study
that must determine the outcome measures used. All effectiveness research is goal
oriented, and as such careful consideration of the exact aims of the study is essential.
Too often little care is taken in the development of good measures, existing
instruments being used for convenience’s sake rather than because they accurately reflect
the outcomes the researchers are seeking to measure. Careful attention to detail is an
essential part of this process.
Most studies, however, have dealt with achievement, which is a clear limitation of
extant research. Furthermore, achievement has usually been measured using a
standardised basic skills test. This type of test obviously has limitations. Basic skills
are by no means the be-all-and-end-all of educational achievement, thinking skills
and learning-to-learn skills being equally important. However, in contrast to what is
often stated, it is not impossible to measure higher order skills using standardised
tests, even multiple choice tests if well designed (Sanders & Horn, 1994, 1995). The
main advantages of standardised tests lie in the high quality of the items, written by
specialists in the subject and in item construction, the standardisation of
administration and scoring procedures, the fact that standardised tests allow
comparison with national norms and with other students, and the good psychometric qualities
(reliability and validity) of the tests (Muijs & Reynolds, 2001a). Standardised
multiple choice tests in particular cover a wide range of topics, thus giving a good
overview of students’ knowledge of the curriculum (Sanders & Horn, 1994, 1995). It
is these reasons, not least the comparability of tests taken by different pupils in
different schools, that make standardised tests the favoured outcome measure in
teacher effectiveness research.
Disadvantages lie in a possible mismatch between what students have learnt in class
and what is measured by the test, and in the lack of flexibility of these tests. They also
offer less insight into students’ thought processes than do a number of alternative
assessment methods. Most published standardised tests come with psychometric
information on the reliability of the test. However, while it is still common practice to
simply accept and reprint these published scores, the fact that reliability is a property of
items within a particular sample, rather than a test invariant, means that this practice is
problematic and should be replaced by scores calculated on the sample at hand
(Thompson, 1998). Likewise, validity cannot be accepted merely on the basis of prior
studies, but has to be demonstrated specifically with regards to the study at hand.
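Since reliability is a property of scores within a particular sample, it should be recomputed on the data at hand rather than reprinted from a test manual. A minimal sketch of Cronbach's alpha, the most widely used internal consistency coefficient, on a hypothetical score matrix (all figures invented):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical data: 6 pupils x 4 items on a 1-5 scale.
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
    [2, 3, 2, 3],
])
print(f"alpha on this sample: {cronbach_alpha(scores):.2f}")
```

The same routine applied to another sample will in general yield a different coefficient, which is precisely Thompson's (1998) point.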
More recently, a new paradigm in measurement has taken hold, known as Item
Response Theory. This theory posits that all items on a test are indicators of some latent
underlying construct, such as mathematical ability or mathematical achievement.
Using software programmes one can determine how well items fit with this model,
how difficult they are, and how well they discriminate between students of differing
ability, which represents a significant advance over Classical Test Theory-based
models (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991;
Heinen, 1996). Item Response Models do have their own limitations, however. Thus,
it is usually not entirely clear what the underlying construct is. For example, if,
when developing a literacy test, one found a single underlying construct, it would not be clear
whether this construct reflected achievement, ability, or another factor we may not
have theorised. Furthermore, Item Response Models are limited when it comes to
measuring multidimensional rather than unidimensional constructs (Glas & Verhelst,
1995). Item Response Models also cannot take account of the fact that data in
educational research are often hierarchically nested. While the method, like many
other statistical techniques, is predicated on the assumption of simple random
sampling, educational researchers, for both practical and goal-based reasons, often
sample schools or school districts and test all or a selection of pupils nested in these
organisations instead. Some research is ongoing on the integration of Item Response
Theory and multilevel models (Fox, 2003), but a lack of stability and restrictions on
the data mean that these experimental techniques are not yet suited for wide practical
application.
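The core of such a model can be sketched directly. Below is the two-parameter logistic (2PL) item response function; the item parameters are invented for illustration, and operational analyses would of course estimate them from response data with specialised software rather than assume them.

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a pupil of ability theta
    answers correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical items: an easy, weakly discriminating one and a harder,
# sharply discriminating one.
easy = dict(a=0.8, b=-1.0)
hard = dict(a=2.0, b=1.0)

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta, round(p_correct(theta, **easy), 2),
          round(p_correct(theta, **hard), 2))
```

At theta equal to the item difficulty b the probability is exactly 0.5, and a larger discrimination a produces a steeper rise around that point; this is what is meant by an item "discriminating between students of differing ability".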
However, notwithstanding their strengths, standardised tests are a limited way of
conceptualising the multiple desirable outcomes of education. The fact that they are
by far the most commonly used outcome measures in teacher effectiveness research
should not blind us to other possibilities that are becoming ever more important as
society makes new demands on teachers and schools, and elements such as learning
to learn, creativity, and thinking skills come to be seen as essential parts of the
development of students (Campbell, Kyriakides, Muijs, & Robinson, 2003). If we as
researchers want to respond to this demand, we will need to broaden our use of
outcome measures. It is entirely possible, for example, to look at the effectiveness of
teachers in encouraging pupils’ creative art work, but this would entail an
entirely different type of outcome measure from the traditional standardised test, such as
assessing the creativity of actual pupil work using portfolios. This would entail making
a judgement about the relative creativity of student output, which is an imprecise art,
making it less attractive as a research tool in a field that strives towards statistically
generalisable and replicable findings. In order to attenuate the problems of
unreliability and subjectivity involved in such procedures certain steps can be taken,
however. Firstly, a panel of experts should convene to decide on the criteria to
determine creativity. Delphi panel methods would be particularly suited to this
process. This should be followed by the preparation of a scoring rubric that outlines
what criteria make for good, bad, or average performance on each predefined goal.
Raters should then be able to blind mark the samples of student work, after which
reliability can be established both between raters and samples. Those samples that
raters disagree on will have to be re-examined and may have to be deleted from the
analyses (Borich, 1996). This method of sampling pupil work can obviously be used
to test other pupil outcomes as well. Sampling methods have a number of advantages,
such as the fact that they can provide a longitudinal picture of student work across the
year in contrast to the snapshot provided by testing, and can provide a more
‘‘naturalistic’’ picture of student performance compared to the forced testing
situation. However, this method is likely to cause more problems with regards to
reliability and validity than standardised testing.
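The blind-marking and re-examination steps above can be sketched as a simple screening routine. The rubric scale, scores, and disagreement threshold are all hypothetical; the point is only the mechanics of flagging samples on which raters disagree (Borich, 1996).

```python
import numpy as np

def screen_samples(rater_a, rater_b, max_gap=1):
    """Flag work samples whose two blind ratings differ by more than
    max_gap rubric points; flagged samples go back for re-examination."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    disagreement = np.abs(rater_a - rater_b)
    agree_rate = (disagreement <= max_gap).mean()
    flagged = np.flatnonzero(disagreement > max_gap)
    return agree_rate, flagged

# Hypothetical rubric scores (1-5) from two blind raters on 8 portfolios.
a = [4, 3, 5, 2, 4, 1, 3, 5]
b = [4, 4, 5, 4, 3, 1, 3, 2]
rate, flagged = screen_samples(a, b)
print(f"agreement within 1 point: {rate:.0%}; re-examine: {flagged.tolist()}")
```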
As mentioned above, outcomes can and should also be broadened to reflect
the affective and social goals of education. As a wide range of instruments in these
fields exist, there are certainly no methodological constraints here. Both affective and
social factors in children and adolescents have been widely researched, though they
have not often been linked to teacher factors (Muijs, 1997). Development of
measures, however, has progressed further in some areas than in others. In the
domain of self-concept research, for example, there exist a number of proven reliable
and valid measures (such as Harter’s Self-Perception Profile for Children (Harter,
1985) and Marsh’s Self-Description Questionnaire (SDQ) (Marsh, 1988)). In other
areas, such as attitudes to learning, the situation is less clear-cut. Widely accepted
measures do not exist and reliability and validity of measures is often problematic
(Hautamäki, 2001). Therefore, researchers need to be careful in selecting measures in
these areas, and may often need to consider developing measures that fit in with their
own definition of the concepts they are studying. Furthermore, researchers need to be
careful when adapting measures designed in a different culture and (often) language,
as structures may not hold cross-culturally. One study using a version of the
Australian SDQ measure among primary school pupils in Belgium, for example,
found that liking and belief in ability in a subject, which were strongly correlated in
the Australian sample, did not correlate in the Belgian one (Muijs, 1997).
As well as the need to base outcome measures on the goals of the study and
ensuring reliability and validity, a further issue that needs to be taken into account is
that of distinguishing between uncorrected outcome measures, outcome measures
corrected for prior scores, and measures of growth. Uncorrected outcomes are
measured at one point in time. Examples include achievement at the end of the year
and self-worth at the end of the intervention. Such uncorrected measures are
problematic as a measure of an intervention or of the effectiveness of the teacher, for
example, as they are in many cases most heavily influenced by factors outside of the
classroom, such as prior learning and ability in the case of achievement (Teddlie &
Reynolds, 2000). Therefore, outcome measures are often corrected by using earlier
scores on the same measure as a predictor. To take the examples above, one could
use achievement at the start of the school year as a predictor of final year achievement,
or use pre-intervention self-worth scores as a predictor of post-intervention scores.
This is usually a better measure of the effect of an intervention or a teacher factor, but
is not, contrary to what is often supposed, a measure of pupil growth. To be
able to measure growth, we need to ensure, firstly, that we are measuring the outcome
variable at more than two time points in order to be able to follow an actual growth
trajectory, and, secondly, that we use appropriate models to measure this. Growth in
learning or achievement is not necessarily linear, for example, and exclusive use of
linear models could therefore be misleading (Muijs & Reynolds, 2000a).
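The correction-for-prior-scores design can be illustrated with a small sketch: regressing end-of-year scores on start-of-year scores and treating the residuals as a simple value-added indicator. The data are simulated and the linear model is a deliberate simplification; as noted above, genuine growth modelling needs three or more time points and possibly nonlinear terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: prior attainment explains most of the final score; the
# residual is what teacher (and other classroom) factors might explain.
pre = rng.normal(50, 10, size=200)
post = 5 + 0.9 * pre + rng.normal(0, 4, size=200)

# Correct the outcome for prior attainment: regress post on pre and keep
# the residuals as a crude value-added measure.
slope, intercept = np.polyfit(pre, post, 1)
value_added = post - (intercept + slope * pre)

print(f"slope on prior score: {slope:.2f}")
```

By construction the residuals average zero across the sample; it is their association with teacher factors, not their level, that a process-product analysis would then examine.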
Teacher Factors
The second crucial element in teacher effectiveness research is measuring the
teacher factors that are hypothesised to be related to pupil outcomes. Obviously,
as was the case in choosing outcome measures, the first question that needs to be
answered is whether or not the measures used reflect the goals of the study and
the theoretical assumptions and hypotheses underlying it. Factors hypothesised to
affect outcomes can be varied, and include teacher beliefs, teacher behaviours, and
teacher subject knowledge. Each will require a different methodological approach.
Choice of Method
A choice that researchers have to make when deciding to study teacher effectiveness is
whether to use classroom observation, survey style research, qualitative methods such
as interviews, or a combination of methods. All have potential advantages and
disadvantages, which depend in part on the goal of the study.
When studying behaviours, classroom observations have the advantage of possibly
being more objective due to the outsider’s perspective. A further advantage is that as
outside observers are likely to have observed a range of classrooms and teachers and
should be well versed in the theories underlying teacher effectiveness research, they
should be better able to judge a teacher’s behaviour relative to that of other teachers.
However, it is not realistic to study teacher beliefs using this method, and data on
subject knowledge collected through observation would be highly partial unless a
great number of observations of individual teachers were performed.
When using questionnaires to ask teachers about their own teaching, one is
confronted with a number of problems with regards to reliability. The fact that many
teachers do not observe each other’s lessons even within their own school, and
certainly not outside it, means that they often have little to compare their own
teaching to. Teachers are also not always aware of exactly what they are doing during
lessons, and can sometimes be surprised when confronted with the findings of outside
observers. The detailed recording of actual behaviours possible during classroom
observation allows for fine-grained exploration of behaviours, which would be hard to
achieve in survey-style studies (Muijs & Reynolds, 2001b).
Disadvantages are firstly that all classroom observations are by definition
snapshots, and even successive observations of a teacher will only ever supply a
collection of snapshots rather than a full picture of that teacher’s behaviour over the
year. Also, the presence of an observer in the classroom will inevitably influence the
teacher’s behaviour, either consciously in the form of the teacher putting on a
‘‘performance’’ for the sake of the observer, or unconsciously, through increased
caution or nervousness. The strength of this ‘‘observer effect’’ on the teacher seems to
differ from person to person, making it difficult to take this bias into account
statistically (Ward, Clark, & Harrison, 1981). Explaining the purely scientific and
non-inspectory nature of the study, being as unobtrusive as possible and not
interfering in the lesson in any way will help attenuate these effects, but will not make
them disappear.
Questionnaires do not suffer from these disadvantages: No observer is present
during lessons and teachers can reflect on their teaching over the whole year rather
than just one lesson. Questionnaires are also cheaper to administer than classroom
observations. The major problem with using survey questionnaires to measure
teacher behaviour is the lack of correspondence often found with their behaviours as
observed in the classroom. Charlesworth et al. (1993), for example, found that what
teachers say and what they do during lessons often differ. Earlier, Hook and
Rosenshine (1979) reviewed 11 studies that employed (mainly low-inference)
classroom observation schedules, and found that in 9 studies in which this
comparison was possible, there was no relationship between teachers’ reports of their
use of specific behaviours and the actual occurrence of these behaviours. In those
studies in which student views were solicited, these proved to be more highly
correlated with the results of third party observations, and were likewise unrelated to
teacher self-reports. In three studies that looked at the relationship between scales and
dimensions of self-reported and externally observed teacher behaviours (e.g.,
‘‘teacher control’’), there was more evidence of correspondence between the two,
although a significant positive correlation was found in less than half of all cases.
While in itself it is possible that observers may not have picked up a particular teacher
behaviour during a lesson (although if it was consistently employed this is unlikely),
the fact that observer and student reports do correlate, and the fact that there was also
little correspondence on larger areas where the effect of not observing a single
behaviour would be limited, suggests, according to Hook and Rosenshine (1979),
that teacher self-reports are unreliable, largely due to teachers’ lack of practice in
this respect.
Questionnaires are also susceptible to social desirability response sets (Moorman &
Podsakoff, 1992). This means that respondents are likely to provide an answer that
they see as presenting their behaviour in the most positive light. This is clearly an issue
in teacher effectiveness research, as teachers are usually well aware of the discourse
regarding effective teaching, and would therefore know, if asked, that endorsing whole-
class lecturing as an effective pedagogy would go against current views in the educational
community. It is therefore important to take this into account when examining the data
collected. Some researchers have advocated the use of so-called social desirability
response scales, designed to measure individual respondents’ tendency to give a
socially desirable response. This would allow the researcher to adapt scores on other
variables accordingly. These instruments have come in for some criticism, however,
being described as too obvious and of unclear reliability (Hays & Ware, 1996).
When using observations to measure teacher behaviours, these should be carried
out as often as practically feasible in order to obtain a measure of teacher effects that
spans more than one snapshot. This will allow the researcher to look at the reliability
of her/his measures over time. However, when looking at other teacher factors, such
as beliefs, questionnaires, though susceptible to the limitations mentioned above
(such as social desirability bias), are more suitable than observations, due to the
impossibility of directly observing inner states, and the necessity for a longer term
perspective than can be achieved through observation. Subject knowledge is also only
partially observable, though here questionnaires are also limited due to strong social
desirability response sets and the difficulty for teachers in accurately judging their
own subject knowledge. In theory, use of tests could be appropriate, but practically
this is often hard (due to teacher sensitivity and time constraints). Many researchers
have therefore used proxy measures such as certification and degree classification,
both of which can be seen to require certain subject knowledge standards. These are,
however, relatively crude measures and do not necessarily incorporate pedagogical
knowledge (Darling-Hammond, 2001). A further problem can be that items tend to
emerge from theory and prior knowledge of the researcher, allowing less possibility
for the discovery of new or unknown traits and dispositions. Use of qualitative,
explorative methods in the development of survey instruments, such as focus
groups or open-ended interviews can help alleviate this problem. Such qualitative
methods can also in many cases be suited to discovering inner states, traits, and
beliefs, allowing in-depth probing and understanding to develop. Generalising these
from the individuals involved in the research to a larger population will require
quantitative follow-up, however.
An alternative to observation and questionnaires that has been suggested is the use
of teacher reports describing their last lesson rather than their teaching as a whole.
This is posited as being more likely to produce unbiased reports than traditional
overall feelings to influence their answers on individual items. In both cases it leads to a
lack of discrimination between the distinct concepts the researcher is trying to study.
Halo effects have been found to occur in a number of studies of classroom observation,
especially where observers were not trained in classroom observation (e.g., Phelps,
Schmitz, & Boatright, 1986), and have been found to be related to the extent to which
raters believe the behaviours are correlated (Suter & Roberts, 1987). Halo effects in
observer ratings can be measured using the SD measure, a measure of within-ratee
variability across the different dimensions, or by using the r-measure, a measure of
interdimension correlations (Fisicaro & Vance, 1994).
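The two halo indices just mentioned can be sketched as follows, on hypothetical ratings (teachers, dimensions, and scale are all invented). A suspiciously flat profile across dimensions gives a low SD measure, and strong interdimension correlations give a high r measure; both are warning signs.

```python
import numpy as np

# Hypothetical ratings: 5 teachers x 4 observation dimensions (1-7 scale).
ratings = np.array([
    [6, 6, 6, 6],   # completely flat profile: halo candidate
    [5, 3, 6, 2],
    [2, 5, 3, 6],
    [7, 7, 6, 7],   # nearly flat again
    [4, 2, 5, 3],
])

# SD measure: within-ratee variability across dimensions (low = possible halo).
within_ratee_sd = ratings.std(axis=1, ddof=1)

# r measure: mean correlation between dimensions (high = possible halo).
corr = np.corrcoef(ratings, rowvar=False)
off_diag = corr[np.triu_indices_from(corr, k=1)]

print(within_ratee_sd.round(2), off_diag.mean().round(2))
```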
Another form of bias that can occur in both observational and survey studies is the
social desirability response set. However, emphasising the need for accurate and honest
answers and the confidential nature of the research, as well as the fact that it is
general behaviours rather than individual responses the researcher is looking for, has
been found to lead to less biased responses (Gordon, 1987).
Nonresponse is an issue that has traditionally attracted a lot of attention in survey
research. In virtually any survey one will be confronted with this problem, which may
have serious methodological consequences. Depending on the study, nonresponse
can be anything from 10 to 97% (the latter figure applies especially to surveys sent to
commercial organisations), with 50% being about average (Hartman, Fuqua, &
Jenkins, 1998). There is also some evidence that nonresponse in survey studies has
increased over the past decades, presumably partly in response to the increasing
demands from government, educational research institutes, and private research
companies to complete various types of surveys. Increasing response rates is therefore a key
challenge in survey research, which is usually tackled through follow-up requests by
phone, mail, et cetera, and by providing inducements for respondents, such as
financial or material rewards. Some commentators have advocated the use of online
rather than postal surveys. However, while cheaper to administer than paper-based
surveys, the evidence in education at present points to lower rather than higher
response rates (Muijs, 2004). A related problem is nonresponse to particular items in
surveys, where respondents may answer one item but not the next. There are a
number of ways to remedy this type of nonresponse, by using substitution methods
for the missing items. This can take the form of simply substituting the mean for the
missing value, but this method has obvious disadvantages and can better be replaced
by algorithmic methods which take into account the information provided by the
respondents’ scores on other items as well as the response patterns in the sample as a
whole, such as the EM algorithm. However, this and all other substitution methods
assume that the pattern of nonresponse is unbiased.
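The contrast between mean substitution and methods that use the respondent's other answers can be sketched minimally. The single regression step below is far simpler than a full EM algorithm, but it illustrates the same underlying idea of borrowing information from observed responses; items and scores are hypothetical.

```python
import numpy as np

# Hypothetical item scores; respondent 3 skipped item2 (np.nan).
item1 = np.array([2.0, 4.0, 3.0, 5.0, 1.0, 4.0])
item2 = np.array([3.0, 5.0, np.nan, 5.0, 2.0, 4.0])

# Naive fix: substitute the item mean, ignoring what respondent 3 did answer.
mean_fill = np.where(np.isnan(item2), np.nanmean(item2), item2)

# Better: predict the missing score from the respondent's other item, in the
# spirit of (though much simpler than) EM-based imputation.
obs = ~np.isnan(item2)
slope, intercept = np.polyfit(item1[obs], item2[obs], 1)
reg_fill = np.where(np.isnan(item2), intercept + slope * item1, item2)

print(mean_fill[2], reg_fill[2].round(2))
```

Both approaches, like the EM algorithm proper, still assume that the pattern of item nonresponse is unbiased.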
This is also the problem with global nonresponse (respondents not returning
surveys). Nonrespondents are often simply ignored in analyses, but this is highly
problematic. Nonrespondents can, and often do, differ from respondents on crucial
aspects that may bias the research findings. One method that has been proposed to
correct for this is to remail a sample of nonrespondents, and then check whether the
answers of those who return the questionnaire at the second instance differ
significantly from those of the initial respondents (Hartman et al., 1998). If one
does not get a 100% response to this remailing, as is likely, this method in turn
assumes that the people who respond after prompting do not differ significantly from
those who still do not respond.
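The comparison between initial and remailed respondents can be sketched with Welch's t statistic. All scores below are hypothetical, and in practice one would consult the corresponding p value and compare several key variables, not just one scale mean.

```python
import numpy as np

def welch_t(x, y):
    """Welch's t statistic comparing two independent samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

# Hypothetical scale scores: initial respondents vs. the remailed sample.
initial = [4.1, 3.8, 4.4, 3.9, 4.2, 4.0, 3.7, 4.3]
remailed = [3.2, 3.5, 3.0, 3.6, 3.3]

t = welch_t(initial, remailed)
print(f"t = {t:.2f}")  # a large absolute t suggests nonresponse is not random
```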
Obviously, both rating scales and observation instruments need to be reliable and
valid (see above). In this respect it is important to bear in mind that a number of the
biases mentioned above can influence the computation of, in particular, internal
reliability. Halo effects and response biases can both enhance internal reliability as
they lead to raters/respondents giving similar answers across items (Alliger &
Williams, 1992). In observational research, the two main types of reliability are
interrater reliability and reliability of observations over time (stability of measured
behaviours). Interrater reliability refers to whether different raters will rate the
same behaviours of the same teacher similarly, as measured by statistics such as
Cohen’s kappa. This is a crucial issue in classroom observation research. If high
reliability is not achieved, what one might be measuring might be partially or largely
due to observer effects rather than to differences in observed behaviours. Clear prior
training and practice are crucial in this respect, as demonstrated by the study by
Veenman, Bakermans, Franzen, and Van Hoof (1996), in which significant
differences were found in the behaviours of trainee teachers trained in effective
teaching behaviours compared to control teachers when rated by trained observers,
but not when rated by supervisory teachers. The latter were found often to base their
ratings on general impressions of the trainee teachers. The situation with respect
to reliability over time is complex, as unreliability can be caused by both rater
unreliability over time and behaviour instability.
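Cohen's kappa, mentioned above, corrects raw interrater agreement for the agreement expected by chance. A minimal sketch, using hypothetical behaviour codes (here Q, L, and F stand for invented categories such as questioning, lecturing, and feedback):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical codes of the same events."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from the raters' marginal code frequencies.
    expected = sum(c1[code] * c2[code] for code in c1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes for 10 observed teacher behaviours.
rater1 = ["Q", "Q", "L", "F", "Q", "L", "F", "Q", "L", "Q"]
rater2 = ["Q", "Q", "L", "Q", "Q", "L", "F", "Q", "L", "Q"]
print(round(cohens_kappa(rater1, rater2), 2))
```

Because kappa discounts chance agreement, it is a more demanding criterion than simple percent agreement, which is why it is preferred in classroom observation research.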
In view of the issues discussed above, the most important thing when using
observation is to thoroughly train raters beforehand. This training needs to familiarise
them with the instrument used, take them carefully through all the steps they need to
follow, explain how the data will be interpreted, and explain the possible sources of
bias they need to be aware of in order not to succumb to them. A significant amount
of practice is needed, which should continue until a suitably high level of reliability
has been reached. Training of this nature has been found to significantly reduce bias
of all kinds (Rudner, 1992). It has been found that simply informing observers of
research hypotheses need not lead to bias, but raters should not be given feedback on
how well their observations are supporting these hypotheses, otherwise they will start
producing data designed to support them (O’Leary, 1973). Nevertheless, it may be
useful to statistically adjust for observer effects, which can to a certain extent be
modelled in multilevel or Structural Equation Modelling, by adding a (latent)
measurement factor to the model.
distinguish less from more effective ratees, while on a high-inference instrument all
respondents received highly positive ratings (Wiersma, 1988).
Interrater reliability is influenced by the level of inference necessary. Low-inference
measures, which require little observer judgement, will usually result in higher
interrater reliability, but even with relatively high-inference measures (requiring more subjective rater judgement) it is possible to achieve high levels of reliability with
sufficient effective training (Muijs & Reynolds, 2000b; Nelson & Ray, 1983).
One solution could be to use both high- and low-inference measures in a study.
However, there is some evidence of contamination, in that a halo effect seems to
occur whereby the ratings on the different instruments influence one another.
Data Analysis
Before going on to a discussion of analysis techniques in teacher effectiveness
research, it needs pointing out that it is impossible to make any firm inferences from
these studies with respect to the relationship between outcomes and behaviours
without controlling sufficiently for other variables that may affect outcomes, in
particular pupil socioeconomic status, gender, ethnicity, and learning disabilities.
Furthermore, as school effectiveness research has shown (see Teddlie & Reynolds,
2000), school social context, school effectiveness factors, and other facets of
classroom organisation (e.g., use of setting) may influence outcomes and should
therefore be controlled in all analyses conducted. Obviously, measurement problems
exist with these variables as well (such as the unreliability of special educational needs
measures and the inadequacy of free meal eligibility as a measure of social class, for
which it is often used as a proxy) and it is therefore here again necessary to ensure that
measures used are carefully chosen with the goals of the study in mind, and in such a
way that they validly measure the underlying construct (such as parental social
capital) that one wants to measure.
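As a minimal illustration of such control, the hypothetical Python sketch below regresses post-test scores on prior attainment and treats the residuals as the progress left to be explained by classroom factors; a real analysis would of course also include the background variables listed above (all figures invented).

```python
def residualise(prior, post):
    """Regress post-test on prior attainment; return residual progress scores."""
    n = len(prior)
    mx, my = sum(prior) / n, sum(post) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(prior, post))
             / sum((x - mx) ** 2 for x in prior))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(prior, post)]

# Invented pre- and post-test scores for six pupils
pre = [40, 55, 62, 48, 70, 35]
post = [50, 60, 75, 52, 80, 42]
progress = residualise(pre, post)
print([round(r, 1) for r in progress])  # residuals sum to zero by construction
```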
66 D. Muijs
Most of the best-known studies of teacher effectiveness were done using traditional
correlational or regression-based techniques (Brophy & Good, 1986), notable
exceptions being the Junior School Project by Mortimore, Sammons, Stoll, Lewis,
and Ecob (1988), and Muijs and Reynolds’ (2000b) MEPP Evaluation in the UK.
Using these traditional methods is problematic for a number of reasons. Firstly,
almost all teacher effectiveness data are by definition hierarchical, in that they
combine pupil-level (outcomes, background), classroom-level (teacher effectiveness),
and hopefully, school-level data in a structure in which pupils are nested in
classrooms, which are in turn nested in schools (which are often nested in school
boards or Local Education Authorities). This causes a number of problems for
traditional regression methods, not least the underestimation of standard errors, which can cause effects to be wrongly rated as statistically significant (Goldstein, 1995, 1997;
Kreft & De Leeuw, 1998; Plewis, 1997). Furthermore, multilevel modelling allows
the researcher to partition the variance to be explained among the different levels, so
s/he can ascertain how much variance exists at the classroom level as opposed to the
individual pupil level, for example, and thus can find out how much of the variance in
achievement and pupil progress between teachers is explained by teacher behaviours
(found to be 50% for progress over 2 years and 75% for progress over the year in a recent study; Muijs & Reynolds, 2000b). Multilevel analysis also allows one to free
the slopes in the models, so the researcher can study complex interactions between
variables and levels. Overall, this technique has become essential to all quantitative
educational research, and that certainly includes teacher effectiveness research.
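The variance partitioning described here can be made concrete with a toy calculation. The Python sketch below (invented scores) uses the one-way ANOVA estimator of the intraclass correlation to gauge how much outcome variance lies between classrooms; an actual study would estimate this with dedicated multilevel software.

```python
def icc_oneway(groups):
    """ANOVA estimate of the intraclass correlation for balanced groups:
    the share of variance lying between groups (e.g., classrooms)."""
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2
              for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

# Invented attainment scores: three classrooms of four pupils each
classes = [[60, 62, 58, 64], [50, 48, 52, 54], [70, 72, 66, 68]]
print(round(icc_oneway(classes), 2))  # prints 0.92
```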
When employing multilevel models, it is important to include all relevant levels in
the analysis, even if no variables have been collected at that level. In a study using
real-life data, Opdenakker and Van Damme (2000) found that if levels were not
included, both the partitioning of variance to the different levels and the actual
parameters could be faulty. More specifically, variance explained at the left out level
was assigned to the levels immediately adjoining it, while some parameters were
significant or not depending on exactly which levels were wrongly excluded from the
analyses. This means that researchers need to make sure they test the fit of all possible
null models to the data, to see which number of levels is most appropriate (Kyriakides
& Charlambous, 2004). A complication in this regard is the fact that variables at
different levels are often correlated to one another, and that, for example, the impact
of teacher factors may differ depending on contextual characteristics, such as social
background of the school intake (Levacic, Malmberg, Steele, & Smees, 2004;
Opdenakker & Van Damme, 2001). Multilevel modelling allows such covariances to
be modelled, however, and this should form a part of models employed.
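A cross-level interaction of this kind can be written as a classroom-level slope that is itself a function of a school-level characteristic. A deliberately simple Python sketch follows, with all coefficient values invented for illustration.

```python
def teacher_effect_slope(gamma10, gamma11, context):
    """Slope of a teacher factor on attainment in school j:
    beta_1j = gamma10 + gamma11 * context_j (a cross-level interaction)."""
    return gamma10 + gamma11 * context

# Invented coefficients: base effect 2.0, moderated by school-intake SES
for ses in (-1.0, 0.0, 1.0):  # low, average, high SES intake
    print(ses, teacher_effect_slope(2.0, 1.5, ses))
```

Under these invented figures the same teacher factor is worth 0.5 points in a low-SES school but 3.5 in a high-SES one, which is exactly the kind of dependence between levels the text argues should be modelled.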
A further statistical modelling technique that seems particularly useful for teacher
effectiveness research is Structural Equation Modelling (SEM). This technique
allows the researcher to explicitly test the fit of a prespecified model to the covariance
matrix. Furthermore, SEM allows the researcher to specify that her/his variables
are actually latents, measured by means of manifest variables (Hayduk, 1996;
Jöreskog & Sörbom, 1996). This means that one can model the actually hypothesised
relationship between all the variables in the model (e.g., teacher beliefs influence
children has improved compared to those taught by matched control teachers (for an
example, see Good, Grouws, & Ebmeier, 1983). This type of quasi-experimental
research is itself not without problems, however, as controlling all other variables that
may influence pupil outcomes is as good as impossible in real-life school settings,
leading to serious concerns regarding the validity of this type of research. More often
in teacher effectiveness research one will want to measure actual behaviours in a
variety of classrooms and see whether any of these are related to outcomes, where ''related to'' is, implicitly or (better) explicitly, intended by the researcher to mean ''affects outcomes''. The problem is that traditional statistical techniques do not allow this kind
of conclusion to be made, as for every (relatively) fitting model there may be a large
number of equivalent models which fit the data equally well. Newer methods, such as
SEM, are less susceptible to this due to the large number of tests of model fit that
exist, which include relative fit tests allowing us to compare the fit of different models
(Marcoulides & Schumacker, 2001). However, use of these tests is not without
controversy (see Hayduk, 1996), and it is not entirely clear which of the many model
fit tests is most appropriate. Therefore, most researchers in their publications
carefully use terminology such as ‘‘related to’’, ‘‘associated with’’, and so forth, when
discussing their statistical analyses. However, when one looks closer at relevant
publications, one will usually find them littered with terminology suggestive of the
authors’ causal assumptions (Muijs & Reynolds, 2001a, provides a good example of
this fallacy). Some researchers have recently attempted to overcome this problem by
developing new methods to determine causality from cross-sectional data. Chambers’
(1991) corresponding regressions method was not able to correctly identify causal
relationships in a simulation on data provided by this author. Pearl’s (2000) more
sophisticated Bayesian methods seem more promising, and deserve empirical testing.
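The equivalent-models problem can be made concrete with two variables: a model in which x causes y and one in which y causes x imply exactly the same covariance matrix, so no fit statistic can separate them. An illustrative Python sketch with invented figures:

```python
def implied_cov(path, var_exo, var_err):
    """Implied covariance matrix of a one-path model: exo -> endo."""
    var_endo = path ** 2 * var_exo + var_err
    return [[var_exo, path * var_exo], [path * var_exo, var_endo]]

observed = [[1.0, 0.5], [0.5, 1.0]]          # standardised 'data'
x_causes_y = implied_cov(0.5, 1.0, 0.75)     # x -> y
y_causes_x = implied_cov(0.5, 1.0, 0.75)     # y -> x, variables relabelled
print(x_causes_y == observed, y_causes_x == observed)  # True True
```

Both models reproduce the observed matrix exactly, so the data alone cannot tell us which causal direction is correct.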
As in most behavioural research, most of the variables we are interested in are, in
fact, latent. For example, when looking at student outcomes (in Mathematics, say)
what we want to find out is their Mathematics achievement or ability. A given test is
always only an indicator of this latent trait. Similarly, when we want to look at
effective classroom management, the variables we have measured during our
observations are indicators of this trait, rather than the actual trait themselves
(Jöreskog & Sörbom, 1996). A similar theory underlies Item Response Theory,
which may be used in developing measurement instruments both for measuring
teacher behaviours and student outcomes, as mentioned above.
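The latent-trait idea can be sketched with the simplest IRT model, the one-parameter (Rasch) model, in which the probability of answering an item correctly depends only on the gap between person ability and item difficulty (the values below are invented).

```python
import math

def rasch_p(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A pupil whose ability equals the item difficulty: a 50% chance
print(rasch_p(0.0, 0.0))               # prints 0.5
# One logit of extra ability raises the probability to about 73%
print(round(rasch_p(1.0, 0.0), 3))     # prints 0.731
```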
As multilevel modelling and SEM both address different problems with traditional
statistical methods, it is logical that attempts should be made to combine the two.
This once again is relevant to teacher effectiveness research, as most studies are in
effect multilevel latent trait structures. Therefore, a combination of these two
methods could be attempted in future teacher effectiveness research. Several
statisticians have developed methods to do this, usually by integrating the levels as
parameters in structural equation models, where, basically, in each group the group
mean is subtracted from the individual scores (see Muthén (1994), whose model can
be implemented in the Mplus software programme of which he is the main
developer). At present, these techniques are newly developed and still evolving.
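The group-mean subtraction at the heart of Muthén's approach can be sketched as follows: each score is split into a between part (its classroom mean) and a within part (the deviation from that mean), from which separate between- and within-level covariance matrices are then built. A minimal Python illustration with invented scores:

```python
def within_between(groups):
    """Split scores into between components (group means) and within
    components (deviations from the group mean)."""
    between, within = [], []
    for g in groups:
        m = sum(g) / len(g)
        between.append(m)
        within.append([x - m for x in g])
    return between, within

# Invented scores for two classrooms of three pupils
b, w = within_between([[10.0, 12.0, 14.0], [20.0, 24.0, 22.0]])
print(b)  # classroom means: [12.0, 22.0]
print(w)  # deviations; each inner list sums to zero
```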
Qualitative methods, for their part, cannot establish the exact effects of particular teacher actions on pupil outcomes, and will not allow statistical generalisation to the population, which is probably the reason that most teacher effectiveness research has been quantitative.
A useful development in teacher effectiveness research is the use of mixed
quantitative-qualitative methods (Tashakkori & Teddlie, 1998). A problem with the
quantitative methods used in teacher effectiveness research is that they do not allow
researchers to look more closely at context or at what is happening at a deeper
level. Therefore, a useful strategy could be to identify more and less effective teachers
using outcome data, and then combine quantitative classroom observation with in-
depth interviews of selected teachers and/or in-depth qualitative observations of their classroom practice.
Conclusion
In this article, I have tried to look at a number of methodological issues involved in
the (quantitative) study of teacher effectiveness. In three sections I have discussed
quality criteria for input/outcome measures, process measures, and statistical
analysis.
Outcome measures need to be reliable and valid, whatever outcome is studied.
Process measures need to guard against various sources of bias, mainly through
thorough training of observers or careful development of questionnaires if that
method is used. Statistical analysis needs to make use of modern sophisticated
methods such as multilevel and Structural Equation Modelling.
Beyond these specific recommendations, however, one of the main goals of this article is to encourage some reflection on methods among teacher effectiveness researchers.
Instruments are often not piloted before starting the study, and sometimes observers
are let into classrooms without any training at all, let alone any knowledge of the
theories underlying the research. Many educational researchers also still seem
unaware of the advances in statistical methods made over the past 20 years, and one
still too often encounters published papers using unsuitable statistical techniques. It
must also not be forgotten that use of sophisticated statistical methodology cannot
make up for mistakes made in the previous phases of the research. Bad data cannot
magically be transformed into good research through statistical modelling. This
article is therefore in a sense an appeal for more care in research design, hard though
this may sometimes be within the confines of externally funded short-term contracts
commissioned by agencies looking for a quick fix. However, if the field of teacher
effectiveness is to progress, methodology must remain a prime concern.
Finally, in the light of ''reflective research practice'', I would stress the need for researchers to reflect on the theories underlying their research and on
what these imply for their designs. All research, not least teacher effectiveness
• There is a need for researchers to ensure that instruments and methods follow
goals, and are not chosen purely for the sake of convenience or familiarity.
• Reflection on methods, in terms of reliability and validity, is crucial if the field is to
progress. This includes careful consideration of cross-cultural and comparative
issues.
• Researchers need to increase their efforts to remain abreast of the latest
developments in data analysis and modelling as well as emerging theories of
learning and cognition rather than relying on traditional methods.
• Researchers need to take into account the multiple and expanding roles of
teachers, and there is an urgent need for the development of measures and studies
that can more accurately reflect these different roles and factors that may lead to
differential teacher effectiveness in different subjects, areas, and domains.
References
Alliger, G. M., & Williams, K. J. (1992). Relating the internal consistency of scales to rater response
tendencies. Educational and Psychological Measurement, 52, 337 – 343.
Bennett, R. E., & Ward, W. C. (1993). Construction versus choice in cognitive measurement. Hillsdale,
NJ: Lawrence Erlbaum.
Billiet, J. B., & McClendon, M. J. (2000). Modeling acquiescence in measurement models for two
balanced sets of items. Structural Equation Modeling, 7, 608 – 628.
Borich, G. (1996). Effective teaching methods (3rd ed.). New York: Macmillan.
Brophy, J. E., & Good, T. L. (1986). Teacher behaviour and student achievement. In M. C.
Wittrock (Ed.), Handbook of research on teaching (pp. 328 – 375). New York: Macmillan.
Calkins, D., Borich, G. D., Pascone, M., Kugle, C. L., & Marston, P. T. (1997). Generalizability of
teacher behaviors across classroom observation systems. Journal of Classroom Interaction, 13(1),
9 – 22.
Campbell, R. J., Kyriakides, L., Muijs, D., & Robinson, W. (2003). Differential teacher
effectiveness: Towards a model for research and teacher appraisal. Oxford Review of Education,
29(3), 347 – 362.
Campbell, R. J., Kyriakides, L., Muijs, D., & Robinson, W. (2004). Effective teaching and
values: Some implications for research and teacher appraisal. Oxford Review of Education,
30(4), 451 – 465.
Chambers, W. V. (1991). Inferring formal causation from corresponding regressions. Journal of
Mind and Behavior, 12(1), 49 – 70.
Charlesworth, R., Hart, C., Burts, D., Thomasson, R., Mosleu, J., & Fleege, P. (1993). Measuring
the developmental appropriateness of kindergarten teachers’ belief and practices. Early
Childhood Research Quarterly, 8, 255 – 276.
Darling-Hammond, L. (2001). Does teacher certification matter? Educational Policy Analysis
Archives, 9(2), 57 – 77.
Dickson, G. E. (1983, July). The competency assessment of teachers using high and low inference
measurement procedures: A review of research. Paper presented at the World Assembly of the
International Council on Education for Teaching, Washington, DC.
Embretson, S. E., & Reise, S. P. (Eds.) (2000). Item Response Theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum.
Evertson, C. M., Brophy, J., & Anderson, L. (1978, March). Process-outcome relationships in the
Texas Junior High School Study: Compendium. Paper presented at the Annual Meeting of the
Johanson, G. A., Gips, C. J., & Rich, C. E. (1993). If you can’t say something nice. A variation on
the social desirability response set. Evaluation Review, 17, 116 – 122.
Jöreskog, K., & Sörbom, D. (1996). LISREL 8: User’s reference guide. Chicago: Scientific Software
International.
Kaplan, R. M. (1975, April). Is beauty talent? Sex interaction in the attractiveness halo effect. Paper
presented at the Annual Meeting of the Western Psychological Association, Los Angeles.
Kreft, I., & De Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Kyriakides, L., Campbell, R. J., & Christofidou, E. (2002). Generating criteria for measuring
teacher effectiveness through a self-evaluation approach: A complementary way of measuring
teacher effectiveness. School Effectiveness and School Improvement, 13, 291 – 325.
Kyriakides, L., & Charlambous, C. (2004). Extending the scope of analysing data of IEA studies:
Applying multilevel modelling techniques to analyse TIMSS data. Proceedings of the IRC 2004,
Muijs, D., & Reynolds, D. (2002). Teacher beliefs and behaviors: What matters. Journal of
Classroom Interaction, 37(2), 3 – 15.
Muthén, B. (1994). Multilevel covariance structure analysis. Sociological Methods and Research,
22(3), 376 – 398.
Nelson, E. A., & Ray, W. J. (1983, August). Observational ratings of teacher performance:
Dimensionality and stability. Paper presented at the Annual Meeting of the American
Psychological Association, Los Angeles.
O’Leary, K. D. (1973). The effects of observer bias in field experimental settings. Final report. (ERIC
Document Reproduction Service No. ED078086). Washington, DC: National Center for
Educational Research and Development.
Opdenakker, M.-C., & Van Damme, J. (2000). The importance of identifying levels in multilevel
analysis: An illustration of the effects of ignoring the top or intermediate levels in school
effectiveness research. School Effectiveness and School Improvement, 11, 103 – 130.
Opdenakker, M.-C., & Van Damme, J. (2001). Relationship between school composition and
characteristics of school process and their effect on mathematics achievement. British
Educational Research Journal, 27, 407 – 432.
Owen, S. A. (1976, April). The validity of student ratings: A critique. Paper presented at the Annual
Meeting of the American Educational Research Association, San Francisco.
Padron, Y. N., & Waxman, H. C. (1999). Classroom observations of the five standards of effective
teaching in urban classrooms with English language learners. Teaching and Change, 7(1),
79 – 100.
Pearl, J. (2000). Causality. Cambridge: Cambridge University Press.
Phelps, L., Schmitz, C. D., & Boatright, B. (1986). The effects of halo and leniency on cooperating
teacher reports using Likert-type rating scales. Journal of Educational Research, 79(3), 151 – 154.
Plewis, I. (1997). Statistics in education. London: Arnold.
Ritter, J. M., & Langlois, J. H. (1986, April). The role of physical attractiveness in the observation of
adult-child interactions: Eye of the beholder or behavioral reality? Paper presented at the Biennial
International Conference on Infant Studies, Los Angeles.
Rudner, L. M. (1992). Reducing errors due to the use of judges. (ERIC Document Reproduction
Service Digest No. ED355254). Washington, DC: ERIC Clearinghouse on Tests,
Measurement, and Evaluation.
Sanders, W. L., & Horn, S. P. (1994). The Tennessee Value-Added System: Mixed model
methodology in educational assessment. Journal of Personnel Evaluation in Education, 8,
299 – 311.
Sanders, W. L., & Horn, S. P. (1995). Educational assessment reassessed: The usefulness of
standardised and alternative measures of student achievement as indicators for the assessment
of educational outcomes. Educational Policy Analysis Archives, 3(6). Retrieved November 23,
2002, from http://epaa.asu.edu/epaa/v3n6.html
Solomon, D., Watson, M., & Deer, J. (1988). Measurement of aspects of classroom environments
considered likely to influence children’s prosocial development. Moral Education Forum,
13(4), 10 – 17.
Suter, W. N., & Roberts, W. L. (1987). An experimental investigation of the beliefs-of-relatedness
source of halo. Contemporary Educational Psychology, 12, 77 – 85.
Tamir, P. (1983). Teachers’ self-reports as an alternative strategy for the study of classroom
transactions. Journal of Research in Science Teaching, 20(9), 815 – 823.
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology. Thousand Oaks, CA: Sage.
Teddlie, C., & Reynolds, D. (2000). International handbook of school effectiveness research. London:
Falmer Press.
Thompson, B. (1998). Statistical significance testing and effect size reporting: Portrait of a possible
future. Research in the Schools, 5(2), 33 – 38.
Veenman, S., Bakermans, J., Franzen, Y., & Van Hoof, M. (1996). Implementation effects of a pre-
service course for secondary education teachers. Educational Studies, 22(2), 225 – 243.
Ward, M. D., Clark, C. C., & Harrison, G. V. (1981, April). The observer effect in classroom visitation.
Paper presented at the Annual Meeting of the American Educational Research Association,
Los Angeles.
Widmeyer, W. N., & Loy, J. W. (1988). When you’re hot, you’re hot! Warm-cold effects in first
impressions of persons and teaching effectiveness. Journal of Educational Psychology, 80(1),
118 – 121.
Wiersma, W. (1983, April). Assessment of teacher performance: Constructs of teacher competencies based
on factor analysis of observation data. Paper presented at the Annual Meeting of the American
Educational Research Association, Montreal, Quebec, Canada.
Wiersma, W. (1988). The Alabama Career Incentive Program: A statewide effort in teacher evaluation.
(ERIC Document Reproduction Service Digest No. ED298128). Auburn, AL: Auburn
University College of Education.