Educational Research and Evaluation: An International Journal On Theory and Practice
To cite this article: Daniel Muijs (2006) Measuring teacher effectiveness: Some methodological
reflections, Educational Research and Evaluation: An International Journal on Theory and Practice,
12:1, 53-74, DOI: 10.1080/13803610500392236
Educational Research and Evaluation
Vol. 12, No. 1, February 2006, pp. 53 – 74
Teacher effectiveness is an issue that has received increased attention in recent years, as researchers
have become aware of limitations in models that see the school as the key arena for improving pupil
learning outcomes. This renewed interest makes it timely to look again at the methods used in
teacher effectiveness research. This article presents an overview of some key issues in researching
teacher effectiveness from a process-product perspective. The choice of outcome measure is a first
key area. Traditionally most teacher effectiveness research has utilised externally published
standardised tests. However, it will be argued that this is too limited in the light of societal demands
on education. The actual measurement of teacher factors is an issue whose difficulty has often been
underestimated. Classroom observation, surveys of teachers and students, and qualitative methods
such as interviews have been most frequently employed. The advantages and disadvantages of each
are discussed. In the final section, the main analysis methods suitable for teacher effectiveness
research are outlined.
Introduction
Teacher effectiveness is an issue that has received increased attention in recent years,
as researchers have become aware of limitations in models that see the school as the
key arena for improving pupil learning outcomes. Multilevel models of school effects
show that the variance in pupil outcomes that can be explained at the classroom level
is around twice that at the school level, at around 10 – 20% of variance to be explained
on average. Furthermore, in some studies up to 75% of this variance has been
explained by teacher behaviours (Muijs & Reynolds, 2002).
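The variance partitioning behind these figures can be illustrated with a minimal sketch. The design sizes and variance shares below are invented for illustration; a naive decomposition from group means is shown only to convey the logic, since it ignores sampling noise that a real multilevel model (e.g., in MLwiN or lme4) would account for.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design: 20 schools x 4 classes x 25 pupils, with invented
# variance shares (school 5%, classroom 20%, pupil residual 75%).
school_eff = rng.normal(0.0, np.sqrt(0.05), size=20)
class_eff = rng.normal(0.0, np.sqrt(0.20), size=(20, 4))
pupil_eff = rng.normal(0.0, np.sqrt(0.75), size=(20, 4, 25))
scores = school_eff[:, None, None] + class_eff[:, :, None] + pupil_eff

# Naive decomposition from group means; estimates are inflated by sampling
# noise, which is why multilevel software is used in practice.
total_var = scores.var()
school_var = scores.mean(axis=(1, 2)).var()
class_var = scores.mean(axis=2).var() - school_var

school_share = school_var / total_var
class_share = class_var / total_var
print(f"school: {school_share:.2f}, classroom: {class_share:.2f}")
```

With these invented shares, the classroom level explains roughly twice the variance of the school level, mirroring the pattern reported in the literature.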
*School of Education, University of Manchester, Oxford Road, Manchester, M13 9PL, UK.
E-mail: Daniel.Muijs@manchester.ac.uk
This interest in teacher effectiveness, along with its potential power as a lever of
educational improvement, clearly begs the question of how we can efficiently and
reliably measure this factor. In view of this renewed interest in the field, it is
timely to present an overview of the relatively extensive body of research in this
area.
The basic format of the traditional ‘‘teacher effectiveness’’ study is a so-called
process-product design, similar to that used in most school effectiveness research
(Brophy & Good, 1986; Teddlie & Reynolds, 2000). Typically outcomes are
measured, and a classroom observation instrument or questionnaire is used to
measure teacher factors such as classroom behaviours or pedagogical content
knowledge, and how that might affect these outcomes. The aim is then to see which
behaviours, beliefs, or other teacher factors (if any) are associated with more positive
outcomes. This model, while empirically supported, does, of course, have its
limitations. In particular, it does not sufficiently take into account the fact that
teachers’ roles are broader than classroom practice alone, and include
management roles, pastoral roles, and relationships with parents and the
community (Campbell, Kyriakides, Muijs, & Robinson, 2004;
Kyriakides, Campbell, & Christofidou, 2002). However, for the purposes of this
article we will concentrate on three key questions that arise from the traditional
model: How do we measure outcomes? How do we measure teacher factors? And how
do we analyse the data from teacher effectiveness studies?
Outcome Measures
In most teacher effectiveness research to date, the outcome studied has been
academic achievement. This does not have to be the case, however. In theory, an
effectiveness research design (whether teacher or school effectiveness, or indeed in
other fields such as organisational effectiveness in business studies) is neutral with
respect to outcomes, being concerned with means rather than goals (Teddlie &
Reynolds, 2000). Therefore it is entirely possible to look at what behaviours can
positively influence students’ self-esteem, prosocial behaviours or moral values. The
choice of outcome is of crucial importance, as it is clear that it is the goal of the study
that must determine the outcome measures used. All effectiveness research is goal
oriented, and as such careful consideration of the exact aims of the study is essential.
Too often little care is taken in the development of good measures, existing
instruments being used for convenience’s sake rather than because they accurately reflect
the outcomes the researchers are seeking to measure. Careful attention to detail is an
essential part of this process.
Most studies, however, have dealt with achievement, which is a clear limitation of
extant research. Furthermore, achievement has usually been measured using a
standardised basic skills test. This type of test obviously has limitations. Basic skills
are by no means the be-all-and-end-all of educational achievement, thinking skills
and learning-to-learn skills being equally important. However, in contrast to what is
often stated, it is not impossible to measure higher order skills using standardised
tests, even multiple choice tests if well designed (Sanders & Horn, 1994, 1995). The
main advantages of standardised tests lie in the high quality of the items, written by
specialists in the subject and in item construction, the standardisation of
administration and scoring procedures, the fact that standardised tests allow
comparison with national norms and with other students, and the good psychometric qualities
(reliability and validity) of the tests (Muijs & Reynolds, 2001a). Standardised
multiple choice tests in particular cover a wide range of topics, thus giving a good
overview of students’ knowledge of the curriculum (Sanders & Horn, 1994, 1995). It
is these reasons, not least the comparability of tests taken by different pupils in
different schools, that make standardised tests the favoured outcome measure in
teacher effectiveness research.
Disadvantages lie in a possible mismatch between what students have learnt in class
and what is measured by the test, and in the lack of flexibility of these tests. They also
offer less insight into students’ thought processes than do a number of alternative
assessment methods. Most published standardised tests come with psychometric
information on the reliability of the test. However, while it is still common practice to
simply accept and reprint these published scores, the fact that reliability is a property of
items within a particular sample, rather than a test invariant, means that this practice is
problematic and should be replaced by scores calculated on the sample at hand
(Thompson, 1998). Likewise, validity cannot be accepted merely on the basis of prior
studies, but has to be demonstrated specifically with regards to the study at hand.
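Since reliability is a property of scores within a particular sample, it should be recomputed on the data at hand rather than reprinted from a test manual. A minimal sketch of Cronbach's alpha, the most widely used internal consistency coefficient, on a hypothetical score matrix (all figures invented):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical data: 6 pupils x 4 items on a 1-5 scale.
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
    [2, 3, 2, 3],
])
print(f"alpha on this sample: {cronbach_alpha(scores):.2f}")
```

The same routine applied to another sample will in general yield a different coefficient, which is precisely Thompson's (1998) point.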
More recently, a new paradigm in measurement has taken hold, known as Item
Response Theory. This theory posits that all items on a test are indicators of some latent
underlying construct, such as mathematical ability or mathematical achievement.
Using software programmes one can determine how well items fit with this model,
how difficult they are, and how well they discriminate between students of differing
ability, which represents a significant advance over Classical Test Theory-based
models (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991;
Heinen, 1996). Item Response Models do have their own limitations, however. Thus,
it is usually not entirely clear what the underlying construct is. For example, if,
when developing a literacy test, one found a single underlying construct, it would not be clear
whether this construct reflected achievement, ability, or another factor we may not
have theorised. Furthermore, Item Response Models are limited when it comes to
measuring multidimensional rather than unidimensional constructs (Glas & Verhelst,
1995). Item Response Models also cannot take account of the fact that data in
educational research are often hierarchically nested. While the method, like many
other statistical techniques, is predicated on the assumption of simple random
sampling, educational researchers, for both practical and goal-based reasons, often
sample schools or school districts and test all or a selection of pupils nested in these
organisations instead. Some research is ongoing on the integration of Item Response
Theory and multilevel models (Fox, 2003), but a lack of stability and restrictions on
the data mean that these experimental techniques are not yet suited for wide practical
application.
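The core of such a model can be sketched directly. Below is the two-parameter logistic (2PL) item response function; the item parameters are invented for illustration, and operational analyses would of course estimate them from response data with specialised software rather than assume them.

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a pupil of ability theta
    answers correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical items: an easy, weakly discriminating one and a harder,
# sharply discriminating one.
easy = dict(a=0.8, b=-1.0)
hard = dict(a=2.0, b=1.0)

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta, round(p_correct(theta, **easy), 2),
          round(p_correct(theta, **hard), 2))
```

At theta equal to the item difficulty b the probability is exactly 0.5, and a larger discrimination a produces a steeper rise around that point; this is what is meant by an item "discriminating between students of differing ability".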
However, notwithstanding their strengths, standardised tests are a limited way of
conceptualising the multiple desirable outcomes of education. The fact that they are
by far the most commonly used outcome measures in teacher effectiveness research
should not blind us to other possibilities that are becoming ever more important as
society makes new demands on teachers and schools, and elements such as learning
to learn, creativity, and thinking skills come to be seen as essential parts of the
development of students (Campbell, Kyriakides, Muijs, & Robinson, 2003). If we as
researchers want to respond to this demand, we will need to broaden our use of
outcome measures. It is entirely possible, for example, to look at the effectiveness of
teachers in encouraging pupils’ creative art work, but this would entail an
entirely different type of outcome measure from the traditional standardised test, such as
assessing the creativity of actual pupil work using portfolios. This would entail making
a judgement about the relative creativity of student output, which is an imprecise art,
making it less attractive as a research tool in a field that strives towards statistically
generalisable and replicable findings. In order to attenuate the problems of
unreliability and subjectivity involved in such procedures certain steps can be taken,
however. Firstly, a panel of experts should convene to decide on the criteria to
determine creativity. Delphi panel methods would be particularly suited to this
process. This should be followed by the preparation of a scoring rubric that outlines
what criteria make for good, bad, or average performance on each predefined goal.
Raters should then be able to blind mark the samples of student work, after which
reliability can be established both between raters and samples. Those samples that
raters disagree on will have to be re-examined and may have to be deleted from the
analyses (Borich, 1996). This method of sampling pupil work can obviously be used
to test other pupil outcomes as well. Sampling methods have a number of advantages,
such as the fact that they can provide a longitudinal picture of student work across the
year in contrast to the snapshot provided by testing, and can provide a more
‘‘naturalistic’’ picture of student performance compared to the forced testing
situation. However, this method is likely to cause more problems with regards to
reliability and validity than standardised testing.
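The blind-marking and re-examination steps above can be sketched as a simple screening routine. The rubric scale, scores, and disagreement threshold are all hypothetical; the point is only the mechanics of flagging samples on which raters disagree (Borich, 1996).

```python
import numpy as np

def screen_samples(rater_a, rater_b, max_gap=1):
    """Flag work samples whose two blind ratings differ by more than
    max_gap rubric points; flagged samples go back for re-examination."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    disagreement = np.abs(rater_a - rater_b)
    agree_rate = (disagreement <= max_gap).mean()
    flagged = np.flatnonzero(disagreement > max_gap)
    return agree_rate, flagged

# Hypothetical rubric scores (1-5) from two blind raters on 8 portfolios.
a = [4, 3, 5, 2, 4, 1, 3, 5]
b = [4, 4, 5, 4, 3, 1, 3, 2]
rate, flagged = screen_samples(a, b)
print(f"agreement within 1 point: {rate:.0%}; re-examine: {flagged.tolist()}")
```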
As mentioned above, outcomes can and should also be broadened to reflect
the affective and social goals of education. As a wide range of instruments in these
fields exist, there are certainly no methodological constraints here. Both affective and
social factors in children and adolescents have been widely researched, though they
have not often been linked to teacher factors (Muijs, 1997). Development of
measures, however, has progressed further in some areas than in others. In the
domain of self-concept research, for example, there exist a number of proven reliable
and valid measures (such as Harter’s Self-Perception Profile for Children (Harter,
1985) and Marsh’s Self-Description Questionnaire (SDQ) (Marsh, 1988)). In other
areas, such as attitudes to learning, the situation is less clear-cut. Widely accepted
measures do not exist and reliability and validity of measures is often problematic
(Hautamäki, 2001). Therefore, researchers need to be careful in selecting measures in
these areas, and may often need to consider developing measures that fit in with their
own definition of the concepts they are studying. Furthermore, researchers need to be
careful when adapting measures designed in a different culture and (often) language,
as structures may not hold cross-culturally. One study using a version of the
Australian SDQ measure among primary school pupils in Belgium, for example,
found that liking and belief in ability in a subject, which were strongly correlated in
the Australian sample, did not correlate in the Belgian one (Muijs, 1997).
As well as the need to base outcome measures on the goals of the study and
ensuring reliability and validity, a further issue that needs to be taken into account is
that of distinguishing between uncorrected outcome measures, outcome measures
corrected for prior scores, and measures of growth. Uncorrected outcomes are
measured at one point in time. Examples include achievement at the end of the year
and self-worth at the end of the intervention. Such uncorrected measures are
problematic as a measure of an intervention or of the effectiveness of the teacher, for
example, as they are in many cases most heavily influenced by factors outside of the
classroom, such as prior learning and ability in the case of achievement (Teddlie &
Reynolds, 2000). Therefore, outcome measures are often corrected by using earlier
scores on the same measure as a predictor. To take the examples above, one could
use achievement at the start of the school year as a predictor of final year achievement,
or use pre-intervention self-worth scores as a predictor of post-intervention scores.
This is usually a better measure of the effect of an intervention or a teacher factor, but
is not, contrary to what is often supposed, a measure of pupil growth. To be
able to measure growth, we need to ensure, firstly, that we are measuring the outcome
variable at more than two time points in order to be able to follow an actual growth
trajectory, and, secondly, that we use appropriate models to measure this. Growth in
learning or achievement is not necessarily linear, for example, and exclusive use of
linear models could therefore be misleading (Muijs & Reynolds, 2000a).
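The correction-for-prior-scores design can be illustrated with a small sketch: regressing end-of-year scores on start-of-year scores and treating the residuals as a simple value-added indicator. The data are simulated and the linear model is a deliberate simplification; as noted above, genuine growth modelling needs three or more time points and possibly nonlinear terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: prior attainment explains most of the final score; the
# residual is what teacher (and other classroom) factors might explain.
pre = rng.normal(50, 10, size=200)
post = 5 + 0.9 * pre + rng.normal(0, 4, size=200)

# Correct the outcome for prior attainment: regress post on pre and keep
# the residuals as a crude value-added measure.
slope, intercept = np.polyfit(pre, post, 1)
value_added = post - (intercept + slope * pre)

print(f"slope on prior score: {slope:.2f}")
```

By construction the residuals average zero across the sample; it is their association with teacher factors, not their level, that a process-product analysis would then examine.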
Teacher Factors
The second crucial element in teacher effectiveness research is measuring the
teacher factors that are hypothesised to be related to pupil outcomes. Obviously,
as was the case in choosing outcome measures, the first question that needs to be
answered is whether or not the measures used reflect the goals of the study and
the theoretical assumptions and hypotheses underlying it. Factors hypothesised to
affect outcomes can be varied, and include teacher beliefs, teacher behaviours, and
teacher subject knowledge. Each will require a different methodological approach.
Choice of Method
A choice that researchers have to make when deciding to study teacher effectiveness is
whether to use classroom observation, survey style research, qualitative methods such
as interviews, or a combination of methods. All have potential advantages and
disadvantages, which depend in part on the goal of the study.
When studying behaviours, classroom observations have the advantage of possibly
being more objective due to the outsider’s perspective. A further advantage is that as
outside observers are likely to have observed a range of classrooms and teachers and
should be well versed in the theories underlying teacher effectiveness research, they
should be better able to judge a teacher’s behaviour relative to that of other teachers.
However, it is not realistic to study teacher beliefs using this method, and data on
subject knowledge collected through observation would be highly partial unless a
great number of observations of individual teachers were performed.
When using questionnaires to ask teachers about their own teaching, one is
confronted with a number of problems with regards to reliability. The fact that many
teachers do not observe each other’s lessons even within their own school, and
certainly not outside it, means that they often have little to compare their own
teaching to. Teachers are also not always aware of exactly what they are doing during
lessons, and can sometimes be surprised when confronted with the findings of outside
observers. The detailed recording of actual behaviours possible during classroom
observation allows for fine-grained exploration of behaviours, which would be hard to
achieve in survey-style studies (Muijs & Reynolds, 2001b).
Disadvantages are firstly that all classroom observations are by definition
snapshots, and even successive observations of a teacher will only ever supply a
collection of snapshots rather than a full picture of that teacher’s behaviour over the
year. Also, the presence of an observer in the classroom will inevitably influence the
teacher’s behaviour, either consciously in the form of the teacher putting on a
‘‘performance’’ for the sake of the observer, or unconsciously, through increased
caution or nervousness. The strength of this ‘‘observer effect’’ on the teacher seems to
differ from person to person, making it difficult to take this bias into account
statistically (Ward, Clark, & Harrison, 1981). Explaining the purely scientific and
non-inspectory nature of the study, being as unobtrusive as possible and not
interfering in the lesson in any way will help attenuate these effects, but will not make
them disappear.
Questionnaires do not suffer from these disadvantages: No observer is present
during lessons and teachers can reflect on their teaching over the whole year rather
than just one lesson. Questionnaires are also cheaper to administer than classroom
observations. The major problem with using survey questionnaires to measure
teacher behaviour is the lack of correspondence often found with their behaviours as
observed in the classroom. Charlesworth et al. (1993), for example, found that what
teachers say and what they do during lessons often differ. Earlier, Hook and
Rosenshine (1979) reviewed 11 studies that employed (mainly low-inference)
classroom observation schedules, and found that in 9 studies in which this
comparison was possible, there was no relationship between teachers’ reports of their
use of specific behaviours and the actual occurrence of these behaviours. In those
studies in which student views were solicited, these proved to be more highly
correlated with the results of third party observations, and were likewise unrelated to
teacher self-reports. In three studies that looked at the relationship between scales and
dimensions of self-reported and externally observed teacher behaviours (e.g.,
‘‘teacher control’’), there was more evidence of correspondence between the two,
although a significant positive correlation was found in less than half of all cases.
While in itself it is possible that observers may not have picked up a particular teacher
behaviour during a lesson (although if it was consistently employed this is unlikely),
the fact that observer and student reports do correlate, and the fact that there was also
little correspondence on larger areas where the effect of not observing a single
behaviour would be limited, suggests, according to Hook and Rosenshine (1979),
that teacher self-reports are unreliable, largely due to teachers’ lack of practice in
this respect.
Questionnaires are also susceptible to social desirability response sets (Moorman &
Podsakoff, 1992). This means that respondents are likely to provide an answer that
they see as presenting their behaviour in the most positive light. This is clearly an issue
in teacher effectiveness research, as teachers are usually well aware of the discourse
regarding effective teaching, and would therefore know, if asked, that endorsing whole-
class lecturing as an effective pedagogy would go against current views in the educational
community. It is therefore important to take this into account when examining the data
collected. Some researchers have advocated the use of so-called social desirability
response scales, designed to measure individual respondents’ tendency to give a
socially desirable response. This would allow the researcher to adapt scores on other
variables accordingly. These instruments have come in for some criticism, however,
being described as too obvious and of unclear reliability (Hays & Ware, 1996).
When using observations to measure teacher behaviours, these should be carried
out as often as practically feasible in order to obtain a measure of teacher effects that
spans more than one snapshot. This will allow the researcher to look at the reliability
of her/his measures over time. However, when looking at other teacher factors, such
as beliefs, questionnaires, though susceptible to the limitations mentioned above
(such as social desirability bias), are more suitable than observations, due to the
impossibility of directly observing inner states, and the necessity for a longer term
perspective than can be achieved through observation. Subject knowledge is also only
partially observable, though here questionnaires are also limited due to strong social
desirability response sets and the difficulty for teachers in accurately judging their
own subject knowledge. In theory, use of tests could be appropriate, but practically
this is often hard (due to teacher sensitivity and time constraints). Many researchers
have therefore used proxy measures such as certification and degree classification,
both of which can be seen to require certain subject knowledge standards. These are,
however, relatively crude measures and do not necessarily incorporate pedagogical
knowledge (Darling-Hammond, 2001). A further problem can be that items tend to
emerge from theory and prior knowledge of the researcher, allowing less possibility
for the discovery of new or unknown traits and dispositions. Use of qualitative,
explorative methods in the development of survey instruments, such as focus
groups or open-ended interviews can help alleviate this problem. Such qualitative
methods can also in many cases be suited to discovering inner states, traits, and
beliefs, allowing in-depth probing and understanding to develop. Generalising these
from the individuals involved in the research to a larger population will require
quantitative follow-up, however.
An alternative to observation and questionnaires that has been suggested is the use
of teacher reports describing their last lesson rather than their teaching as a whole.
This is posited as being more likely to produce unbiased reports than traditional
overall feelings to influence their answers on individual items. In both cases it leads to a
lack of discrimination between the distinct concepts the researcher is trying to study.
Halo effects have been found to occur in a number of studies of classroom observation,
especially where observers were not trained in classroom observation (e.g., Phelps,
Schmitz, & Boatright, 1986), and have been found to be related to the extent to which
raters believe the behaviours are correlated (Suter & Roberts, 1987). Halo effects in
observer ratings can be measured using the SD measure, a measure of within-ratee
variability across the different dimensions, or by using the r-measure, a measure of
interdimension correlations (Fisicaro & Vance, 1994).
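The two halo indices just mentioned can be sketched as follows, on hypothetical ratings (teachers, dimensions, and scale are all invented). A suspiciously flat profile across dimensions gives a low SD measure, and strong interdimension correlations give a high r measure; both are warning signs.

```python
import numpy as np

# Hypothetical ratings: 5 teachers x 4 observation dimensions (1-7 scale).
ratings = np.array([
    [6, 6, 6, 6],   # completely flat profile: halo candidate
    [5, 3, 6, 2],
    [2, 5, 3, 6],
    [7, 7, 6, 7],   # nearly flat again
    [4, 2, 5, 3],
])

# SD measure: within-ratee variability across dimensions (low = possible halo).
within_ratee_sd = ratings.std(axis=1, ddof=1)

# r measure: mean correlation between dimensions (high = possible halo).
corr = np.corrcoef(ratings, rowvar=False)
off_diag = corr[np.triu_indices_from(corr, k=1)]

print(within_ratee_sd.round(2), off_diag.mean().round(2))
```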
Another form of bias that can occur in both observational and survey studies is the
social desirability response set. However, emphasising the need for accurate and honest
answers and the confidential nature of the research, as well as the fact that it is
general behaviours rather than individual responses the researcher is looking for, has
been found to lead to less biased responses (Gordon, 1987).
Nonresponse is an issue that has traditionally attracted a lot of attention in survey
research. In virtually any survey one will be confronted with this problem, which may
have serious methodological consequences. Depending on the study, nonresponse
can be anything from 10 to 97% (the latter figure applies especially to surveys sent to
commercial organisations), with 50% being about average (Hartman, Fuqua, &
Jenkins, 1998). There is also some evidence that nonresponse in survey studies has
increased over the past decades, presumably partly in response to the increasing
demands from government, educational research institutes, and private research
companies to complete various types of surveys. Increasing response rates is therefore a key
challenge in survey research, which is usually tackled through follow-up requests by
phone, mail, et cetera, and by providing inducements for respondents, such as
financial or material rewards. Some commentators have advocated the use of online
rather than postal surveys. However, while cheaper to administer than paper-based
surveys, the evidence in education at present points to lower rather than higher
response rates (Muijs, 2004). A related problem is nonresponse to particular items in
surveys, where respondents may answer one item but not the next. There are a
number of ways to remedy this type of nonresponse, by using substitution methods
for the missing items. This can take the form of simply substituting the mean for the
missing value, but this method has obvious disadvantages and can better be replaced
by algorithmic methods which take into account the information provided by the
respondents’ scores on other items as well as the response patterns in the sample as a
whole, such as the EM algorithm. However, this and all other substitution methods
assume that the pattern of nonresponse is unbiased.
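The contrast between mean substitution and methods that use the respondent's other answers can be sketched minimally. The single regression step below is far simpler than a full EM algorithm, but it illustrates the same underlying idea of borrowing information from observed responses; items and scores are hypothetical.

```python
import numpy as np

# Hypothetical item scores; respondent 3 skipped item2 (np.nan).
item1 = np.array([2.0, 4.0, 3.0, 5.0, 1.0, 4.0])
item2 = np.array([3.0, 5.0, np.nan, 5.0, 2.0, 4.0])

# Naive fix: substitute the item mean, ignoring what respondent 3 did answer.
mean_fill = np.where(np.isnan(item2), np.nanmean(item2), item2)

# Better: predict the missing score from the respondent's other item, in the
# spirit of (though much simpler than) EM-based imputation.
obs = ~np.isnan(item2)
slope, intercept = np.polyfit(item1[obs], item2[obs], 1)
reg_fill = np.where(np.isnan(item2), intercept + slope * item1, item2)

print(mean_fill[2], reg_fill[2].round(2))
```

Both approaches, like the EM algorithm proper, still assume that the pattern of item nonresponse is unbiased.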
This is also the problem with global nonresponse (respondents not returning
surveys). Nonrespondents are often simply ignored in analyses, but this is highly
problematic. Nonrespondents can, and often do, differ from respondents on crucial
aspects that may bias the research findings. One method that has been proposed to
correct for this is to remail a sample of nonrespondents, and then check whether the
answers of those who return the questionnaire at the second instance differ
significantly from those of the initial respondents (Hartman et al., 1998). If one
does not get a 100% response to this remailing, as is likely, this method in turn
assumes that the people who respond after prompting do not differ significantly from
those who still do not respond.
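The comparison between initial and remailed respondents can be sketched with Welch's t statistic. All scores below are hypothetical, and in practice one would consult the corresponding p value and compare several key variables, not just one scale mean.

```python
import numpy as np

def welch_t(x, y):
    """Welch's t statistic comparing two independent samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

# Hypothetical scale scores: initial respondents vs. the remailed sample.
initial = [4.1, 3.8, 4.4, 3.9, 4.2, 4.0, 3.7, 4.3]
remailed = [3.2, 3.5, 3.0, 3.6, 3.3]

t = welch_t(initial, remailed)
print(f"t = {t:.2f}")  # a large absolute t suggests nonresponse is not random
```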
Obviously, both rating scales and observation instruments need to be reliable and
valid (see above). In this respect it is important to bear in mind that a number of the
biases mentioned above can influence the computation of, in particular, internal
reliability. Halo effects and response biases can both enhance internal reliability as
they lead to raters/respondents giving similar answers across items (Alliger &
Williams, 1992). In observational research, the two main types of reliability are
interrater reliability and reliability of observations over time (stability of measured
behaviours). Interrater reliability refers to whether different raters will rate the
same behaviours of the same teacher similarly, as measured by statistics such as
Cohen’s kappa. This is a crucial issue in classroom observation research. If high
reliability is not achieved, what one might be measuring might be partially or largely
due to observer effects rather than to differences in observed behaviours. Clear prior
training and practice are crucial in this respect, as demonstrated by the study by
Veenman, Bakermans, Franzen, and Van Hoof (1996), in which significant
differences were found in the behaviours of trainee teachers trained in effective
teaching behaviours compared to control teachers when rated by trained observers,
but not when rated by supervisory teachers. The latter were found often to base their
ratings on general impressions of the trainee teachers. The situation with respect
to reliability over time is complex, as unreliability can be caused by both rater
unreliability over time and behaviour instability.
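Cohen's kappa, mentioned above, corrects raw interrater agreement for the agreement expected by chance. A minimal sketch, using hypothetical behaviour codes (here Q, L, and F stand for invented categories such as questioning, lecturing, and feedback):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical codes of the same events."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from the raters' marginal code frequencies.
    expected = sum(c1[code] * c2[code] for code in c1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes for 10 observed teacher behaviours.
rater1 = ["Q", "Q", "L", "F", "Q", "L", "F", "Q", "L", "Q"]
rater2 = ["Q", "Q", "L", "Q", "Q", "L", "F", "Q", "L", "Q"]
print(round(cohens_kappa(rater1, rater2), 2))
```

Because kappa discounts chance agreement, it is a more demanding criterion than simple percent agreement, which is why it is preferred in classroom observation research.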
In view of the issues discussed above, the most important thing when using
observation is to thoroughly train raters beforehand. This training needs to familiarise
them with the instrument used, take them carefully through all the steps they need to
follow, explain how the data will be interpreted, and explain the possible sources of
bias they need to be aware of in order not to succumb to them. A significant amount
of practice is needed, which should continue until a suitably high level of reliability
has been reached. Training of this nature has been found to significantly reduce bias
of all kinds (Rudner, 1992). It has been found that simply informing observers of
research hypotheses need not lead to bias, but raters should not be given feedback on
how well their observations are supporting these hypotheses, otherwise they will start
producing data designed to support them (O’Leary, 1973). Nevertheless, it may be
useful to statistically adjust for observer effects, which can to a certain extent be
modelled in multilevel or Structural Equation Modelling, by adding a (latent)
measurement factor to the model.
distinguish less from more effective ratees, while on a high-inference instrument all
respondents received highly positive ratings (Wiersma, 1988).
Interrater reliability is influenced by the level of inference necessary. Low-inference
measures, which require little observer judgement, will usually result in higher
interrater reliability, but even with relatively high-inference measures (requiring more subjective rater judgement) it is possible to achieve high levels of reliability with
sufficient effective training (Muijs & Reynolds, 2000b; Nelson & Ray, 1983).
One solution could be to use both high- and low-inference measures in a study.
However, there is some evidence of contamination, in that a halo effect seems to
occur whereby the ratings on the different instruments influence one another.
Data Analysis
Before going on to a discussion of analysis techniques in teacher effectiveness
research, it needs pointing out that it is impossible to make any firm inferences from
these studies with respect to the relationship between outcomes and behaviours
without controlling sufficiently for other variables that may affect outcomes, in
particular pupil socioeconomic status, gender, ethnicity, and learning disabilities.
Furthermore, as school effectiveness research has shown (see Teddlie & Reynolds,
2000), school social context, school effectiveness factors, and other facets of
classroom organisation (e.g., use of setting) may influence outcomes and should
therefore be controlled in all analyses conducted. Obviously, measurement problems
exist with these variables as well (such as the unreliability of special educational needs
measures and the inadequacy of free meal eligibility as a measure of social class, for
which it is often used as a proxy) and it is therefore here again necessary to ensure that
measures used are carefully chosen with the goals of the study in mind, and in such a
way that they validly measure the underlying construct (such as parental social
capital) that one wants to measure.
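As a minimal illustration of such control, the hypothetical Python sketch below regresses post-test scores on prior attainment and treats the residuals as the progress left to be explained by classroom factors; a real analysis would of course also include the background variables listed above (all figures invented).

```python
def residualise(prior, post):
    """Regress post-test on prior attainment; return residual progress scores."""
    n = len(prior)
    mx, my = sum(prior) / n, sum(post) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(prior, post))
             / sum((x - mx) ** 2 for x in prior))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(prior, post)]

# Invented pre- and post-test scores for six pupils
pre = [40, 55, 62, 48, 70, 35]
post = [50, 60, 75, 52, 80, 42]
progress = residualise(pre, post)
print([round(r, 1) for r in progress])  # residuals sum to zero by construction
```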
66 D. Muijs
Most of the best-known studies of teacher effectiveness were done using traditional
correlational or regression-based techniques (Brophy & Good, 1986), notable
exceptions being the Junior School Project by Mortimore, Sammons, Stoll, Lewis,
and Ecob (1988), and Muijs and Reynolds’ (2000b) MEPP Evaluation in the UK.
Using these traditional methods is problematic for a number of reasons. Firstly,
almost all teacher effectiveness data are by definition hierarchical, in that they
combine pupil-level (outcomes, background), classroom-level (teacher effectiveness),
and hopefully, school-level data in a structure in which pupils are nested in
classrooms, which are in turn nested in schools (which are often nested in school
boards or Local Education Authorities). This causes a number of problems for
traditional regression methods, not least the underestimation of standard errors, which can cause effects to be wrongly rated as statistically significant (Goldstein, 1995, 1997;
Kreft & De Leeuw, 1998; Plewis, 1997). Furthermore, multilevel modelling allows
the researcher to partition the variance to be explained among the different levels, so
s/he can ascertain how much variance exists at the classroom level as opposed to the
individual pupil level, for example, and thus can find out how much of the variance in
achievement and pupil progress between teachers is explained by teacher behaviours
(found to be 50% for progress over 2 years and 75% for progress over the year in a recent study; Muijs & Reynolds, 2000b). Multilevel analysis also allows one to free
the slopes in the models, so the researcher can study complex interactions between
variables and levels. Overall, this technique has become essential to all quantitative
educational research, and that certainly includes teacher effectiveness research.
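The variance partitioning described here can be made concrete with a toy calculation. The Python sketch below (invented scores) uses the one-way ANOVA estimator of the intraclass correlation to gauge how much outcome variance lies between classrooms; an actual study would estimate this with dedicated multilevel software.

```python
def icc_oneway(groups):
    """ANOVA estimate of the intraclass correlation for balanced groups:
    the share of variance lying between groups (e.g., classrooms)."""
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2
              for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

# Invented attainment scores: three classrooms of four pupils each
classes = [[60, 62, 58, 64], [50, 48, 52, 54], [70, 72, 66, 68]]
print(round(icc_oneway(classes), 2))  # prints 0.92
```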
When employing multilevel models, it is important to include all relevant levels in
the analysis, even if no variables have been collected at that level. In a study using
real-life data, Opdenakker and Van Damme (2000) found that if levels were not
included, both the partitioning of variance to the different levels and the actual
parameters could be faulty. More specifically, variance explained at the left out level
was assigned to the levels immediately adjoining it, while some parameters were
significant or not depending on exactly which levels were wrongly excluded from the
analyses. This means that researchers need to make sure they test the fit of all possible
null models to the data, to see which number of levels is most appropriate (Kyriakides
& Charlambous, 2004). A complication in this regard is the fact that variables at
different levels are often correlated to one another, and that, for example, the impact
of teacher factors may differ depending on contextual characteristics, such as social
background of the school intake (Levacic, Malmberg, Steele, & Smees, 2004;
Opdenakker & Van Damme, 2001). Multilevel modelling allows such covariances to
be modelled, however, and this should form a part of models employed.
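A cross-level interaction of this kind can be written as a classroom-level slope that is itself a function of a school-level characteristic. A deliberately simple Python sketch follows, with all coefficient values invented for illustration.

```python
def teacher_effect_slope(gamma10, gamma11, context):
    """Slope of a teacher factor on attainment in school j:
    beta_1j = gamma10 + gamma11 * context_j (a cross-level interaction)."""
    return gamma10 + gamma11 * context

# Invented coefficients: base effect 2.0, moderated by school-intake SES
for ses in (-1.0, 0.0, 1.0):  # low, average, high SES intake
    print(ses, teacher_effect_slope(2.0, 1.5, ses))
```

Under these invented figures the same teacher factor is worth 0.5 points in a low-SES school but 3.5 in a high-SES one, which is exactly the kind of dependence between levels the text argues should be modelled.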
A further statistical modelling technique that seems particularly useful for teacher
effectiveness research is Structural Equation Modelling (SEM). This technique
allows the researcher to explicitly test the fit of a prespecified model to the covariance
matrix. Furthermore, SEM allows the researcher to specify that her/his variables
are actually latents, measured by means of manifest variables (Hayduk, 1996;
Jöreskog & Sörbom, 1996). This means that one can model the actually hypothesised
relationship between all the variables in the model (e.g., teacher beliefs influence
children has improved compared to those taught by matched control teachers (for an
example, see Good, Grouws, & Ebmeier, 1983). This type of quasi-experimental
research is itself not without problems, however, as controlling all other variables that
may influence pupil outcomes is as good as impossible in real-life school settings,
leading to serious concerns regarding the validity of this type of research. More often
in teacher effectiveness research one will want to measure actual behaviours in a
variety of classrooms and see whether any of these are related to outcomes, where ''related to'' is, implicitly or (better) explicitly, intended by the researcher to mean ''affects outcomes''. The problem is that traditional statistical techniques do not allow this kind
of conclusion to be made, as for every (relatively) fitting model there may be a large
number of equivalent models which fit the data equally well. Newer methods, such as
SEM, are less susceptible to this due to the large number of tests of model fit that
exist, which include relative fit tests allowing us to compare the fit of different models
(Marcoulides & Schumacker, 2001). However, use of these tests is not without
controversy (see Hayduk, 1996), and it is not entirely clear which of the many model
fit tests is most appropriate. Therefore, most researchers in their publications
carefully use terminology such as ‘‘related to’’, ‘‘associated with’’, and so forth, when
discussing their statistical analyses. However, when one looks closer at relevant
publications, one will usually find them littered with terminology suggestive of the
authors’ causal assumptions (Muijs & Reynolds, 2001a, provides a good example of
this fallacy). Some researchers have recently attempted to overcome this problem by
developing new methods to determine causality from cross-sectional data. Chambers’
(1991) corresponding regressions method was not able to correctly identify causal
relationships in a simulation on data provided by this author. Pearl’s (2000) more
sophisticated Bayesian methods seem more promising, and deserve empirical testing.
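The equivalent-models problem can be made concrete with two variables: a model in which x causes y and one in which y causes x imply exactly the same covariance matrix, so no fit statistic can separate them. An illustrative Python sketch with invented figures:

```python
def implied_cov(path, var_exo, var_err):
    """Implied covariance matrix of a one-path model: exo -> endo."""
    var_endo = path ** 2 * var_exo + var_err
    return [[var_exo, path * var_exo], [path * var_exo, var_endo]]

observed = [[1.0, 0.5], [0.5, 1.0]]          # standardised 'data'
x_causes_y = implied_cov(0.5, 1.0, 0.75)     # x -> y
y_causes_x = implied_cov(0.5, 1.0, 0.75)     # y -> x, variables relabelled
print(x_causes_y == observed, y_causes_x == observed)  # True True
```

Both models reproduce the observed matrix exactly, so the data alone cannot tell us which causal direction is correct.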
As in most behavioural research, most of the variables we are interested in are, in
fact, latent. For example, when looking at student outcomes (in Mathematics, say)
what we want to find out is their Mathematics achievement or ability. A given test is
always only an indicator of this latent trait. Similarly, when we want to look at
effective classroom management, the variables we have measured during our
observations are indicators of this trait, rather than the actual trait themselves
(Jöreskog & Sörbom, 1996). A similar theory underlies Item Response Theory,
which may be used in developing measurement instruments both for measuring
teacher behaviours and student outcomes, as mentioned above.
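The latent-trait idea can be sketched with the simplest IRT model, the one-parameter (Rasch) model, in which the probability of answering an item correctly depends only on the gap between person ability and item difficulty (the values below are invented).

```python
import math

def rasch_p(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A pupil whose ability equals the item difficulty: a 50% chance
print(rasch_p(0.0, 0.0))               # prints 0.5
# One logit of extra ability raises the probability to about 73%
print(round(rasch_p(1.0, 0.0), 3))     # prints 0.731
```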
As multilevel modelling and SEM both address different problems with traditional
statistical methods, it is logical that attempts should be made to combine the two.
This once again is relevant to teacher effectiveness research, as most studies are in
effect multilevel latent trait structures. Therefore, a combination of these two
methods could be attempted in future teacher effectiveness research. Several
statisticians have developed methods to do this, usually by integrating the levels as
parameters in structural equation models, where, basically, in each group the group
mean is subtracted from the individual scores (see Muthén (1994), whose model can
be implemented in the Mplus software programme of which he is the main
developer). At present, these techniques are newly developed and still evolving.
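The group-mean subtraction at the heart of Muthén's approach can be sketched as follows: each score is split into a between part (its classroom mean) and a within part (the deviation from that mean), from which separate between- and within-level covariance matrices are then built. A minimal Python illustration with invented scores:

```python
def within_between(groups):
    """Split scores into between components (group means) and within
    components (deviations from the group mean)."""
    between, within = [], []
    for g in groups:
        m = sum(g) / len(g)
        between.append(m)
        within.append([x - m for x in g])
    return between, within

# Invented scores for two classrooms of three pupils
b, w = within_between([[10.0, 12.0, 14.0], [20.0, 24.0, 22.0]])
print(b)  # classroom means: [12.0, 22.0]
print(w)  # deviations; each inner list sums to zero
```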
Qualitative methods, for their part, cannot establish the exact effects of particular teacher actions on pupil outcomes, and will not allow statistical generalisation to the population, which is probably the reason that most teacher effectiveness research has been quantitative.
A useful development in teacher effectiveness research is the use of mixed
quantitative-qualitative methods (Tashakkori & Teddlie, 1998). A problem with the
quantitative methods used in teacher effectiveness research is that they do not allow
researchers to look more closely at context or at what is happening at a deeper
level. Therefore, a useful strategy could be to identify more and less effective teachers
using outcome data, and then combine quantitative classroom observation with in-
depth interviews of selected teachers and/or in-depth qualitative observations of their classroom practice.
Conclusion
In this article, I have tried to look at a number of methodological issues involved in
the (quantitative) study of teacher effectiveness. In three sections I have discussed
quality criteria for input/outcome measures, process measures, and statistical
analysis.
Outcome measures need to be reliable and valid, whatever outcome is studied.
Process measures need to guard against various sources of bias, mainly through
thorough training of observers or careful development of questionnaires if that
method is used. Statistical analysis needs to make use of modern sophisticated
methods such as multilevel and Structural Equation Modelling.
Beyond these specific recommendations, however, one of the main goals of this article is to encourage some reflection on methods among teacher effectiveness researchers.
Instruments are often not piloted before starting the study, and sometimes observers
are let into classrooms without any training at all, let alone any knowledge of the
theories underlying the research. Many educational researchers also still seem
unaware of the advances in statistical methods made over the past 20 years, and one
still too often encounters published papers using unsuitable statistical techniques. It
must also not be forgotten that use of sophisticated statistical methodology cannot
make up for mistakes made in the previous phases of the research. Bad data cannot
magically be transformed into good research through statistical modelling. This
article is therefore in a sense an appeal for more care in research design, hard though
this may sometimes be within the confines of externally funded short-term contracts
commissioned by agencies looking for a quick fix. However, if the field of teacher
effectiveness is to progress, methodology must remain a prime concern.
Finally, in the light of ''reflective research practice'', I would stress the need for researchers to reflect on the theories underlying their research and on
what these imply for their designs. All research, not least teacher effectiveness
• There is a need for researchers to ensure that instruments and methods follow
goals, and are not chosen purely for the sake of convenience or familiarity.
• Reflection on methods, in terms of reliability and validity, is crucial if the field is to
progress. This includes careful consideration of cross-cultural and comparative
issues.
• Researchers need to increase their efforts to remain abreast of the latest
developments in data analysis and modelling as well as emerging theories of
learning and cognition rather than relying on traditional methods.
• Researchers need to take into account the multiple and expanding roles of
teachers, and there is an urgent need for the development of measures and studies
that can more accurately reflect these different roles and factors that may lead to
differential teacher effectiveness in different subjects, areas, and domains.
References
Alliger, G. M., & Williams, K. J. (1992). Relating the internal consistency of scales to rater response
tendencies. Educational and Psychological Measurement, 52, 337 – 343.
Bennett, R. E., & Ward, W. C. (1993). Construction versus choice in cognitive measurement. Hillsdale,
NJ: Lawrence Erlbaum.
Billiet, J. B., & McClendon, M. J. (2000). Modeling acquiescence in measurement models for two
balanced sets of items. Structural Equation Modeling, 7, 608 – 628.
Borich, G. (1996). Effective teaching methods (3rd ed.). New York: Macmillan.
Brophy, J. E., & Good, T. L. (1986). Teacher behaviour and student achievement. In M. C.
Wittrock (Ed.), Handbook of research on teaching (pp. 328 – 375). New York: Macmillan.
Calkins, D., Borich, G. D., Pascone, M., Kugle, C. L., & Marston, P. T. (1997). Generalizability of
teacher behaviors across classroom observation systems. Journal of Classroom Interaction, 13(1),
9 – 22.
Campbell, R. J., Kyriakides, L., Muijs, D., & Robinson, W. (2003). Differential teacher
effectiveness: Towards a model for research and teacher appraisal. Oxford Review of Education,
29(3), 347 – 362.
Campbell, R. J., Kyriakides, L., Muijs, D., & Robinson, W. (2004). Effective teaching and
values: Some implications for research and teacher appraisal. Oxford Review of Education,
30(4), 451 – 465.
Chambers, W. V. (1991). Inferring formal causation from corresponding regressions. Journal of
Mind and Behavior, 12(1), 49 – 70.
Charlesworth, R., Hart, C., Burts, D., Thomasson, R., Mosleu, J., & Fleege, P. (1993). Measuring
the developmental appropriateness of kindergarten teachers’ belief and practices. Early
Childhood Research Quarterly, 8, 255 – 276.
Darling-Hammond, L. (2001). Does teacher certification matter? Educational Policy Analysis
Archives, 9(2), 57 – 77.
Dickson, G. E. (1983, July). The competency assessment of teachers using high and low inference
measurement procedures: A review of research. Paper presented at the World Assembly of the
International Council on Education for Teaching, Washington, DC.
Embretson, S. E., & Reise, S. P. (Eds.) (2000). Item Response Theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum.
Evertson, C. M., Brophy, J., & Anderson, L. (1978, March). Process-outcome relationships in the
Texas Junior High School Study: Compendium. Paper presented at the Annual Meeting of the
Johanson, G. A., Gips, C. J., & Rich, C. E. (1993). If you can’t say something nice. A variation on
the social desirability response set. Evaluation Review, 17, 116 – 122.
Jöreskog, K., & Sörbom, D. (1996). LISREL 8: User’s reference guide. Chicago: Scientific Software
International.
Kaplan, R. M. (1975, April). Is beauty talent? Sex interaction in the attractiveness halo effect. Paper
presented at the Annual Meeting of the Western Psychological Association, Los Angeles.
Kreft, I., & De Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Kyriakides, L., Campbell, R. J., & Christofidou, E. (2002). Generating criteria for measuring
teacher effectiveness through a self-evaluation approach: A complementary way of measuring
teacher effectiveness. School Effectiveness and School Improvement, 13, 291 – 325.
Kyriakides, L., & Charlambous, C. (2004). Extending the scope of analysing data of IEA studies:
Applying multilevel modelling techniques to analyse TIMSS data. Proceedings of the IRC 2004,
Muijs, D., & Reynolds, D. (2002). Teacher beliefs and behaviors: What matters. Journal of
Classroom Interaction, 37(2), 3 – 15.
Muthén, B. (1994). Multilevel covariance structure analysis. Sociological Methods and Research,
22(3), 376 – 398.
Nelson, E. A., & Ray, W. J. (1983, August). Observational ratings of teacher performance:
Dimensionality and stability. Paper presented at the Annual Meeting of the American
Psychological Association, Los Angeles.
O’Leary, K. D. (1973). The effects of observer bias in field experimental settings. Final report. (ERIC
Document Reproduction Service No. ED078086). Washington, DC: National Center for
Educational Research and Development.
Opdenakker, M.-C., & Van Damme, J. (2000). The importance of identifying levels in multilevel
analysis: An illustration of the effects of ignoring the top or intermediate levels in school
effectiveness research. School Effectiveness and School Improvement, 11, 103 – 130.
Opdenakker, M.-C., & Van Damme, J. (2001). Relationship between school composition and
characteristics of school process and their effect on mathematics achievement. British
Educational Research Journal, 27, 407 – 432.
Owen, S. A. (1976, April). The validity of student ratings: A critique. Paper presented at the Annual
Meeting of the American Educational Research Association, San Francisco.
Padron, Y. N., & Waxman, H. C. (1999). Classroom observations of the five standards of effective
teaching in urban classrooms with English language learners. Teaching and Change, 7(1),
79 – 100.
Pearl, J. (2000). Causality. Cambridge: Cambridge University Press.
Phelps, L., Schmitz, C. D., & Boatright, B. (1986). The effects of halo and leniency on cooperating
teacher reports using Likert-type rating scales. Journal of Educational Research, 79(3), 151 – 154.
Plewis, I. (1997). Statistics in education. London: Arnold.
Ritter, J. M., & Langlois, J. H. (1986, April). The role of physical attractiveness in the observation of
adult-child interactions: Eye of the beholder or behavioral reality? Paper presented at the Biennial
International Conference on Infant Studies, Los Angeles.
Rudner, L. M. (1992). Reducing errors due to the use of judges. (ERIC Document Reproduction
Service Digest No. ED355254). Washington, DC: ERIC Clearinghouse on Tests,
Measurement, and Evaluation.
Sanders, W. L., & Horn, S. P. (1994). The Tennessee Value-Added System: Mixed model
methodology in educational assessment. Journal of Personnel Evaluation in Education, 8,
299 – 311.
Sanders, W. L., & Horn, S. P. (1995). Educational assessment reassessed: The usefulness of
standardised and alternative measures of student achievement as indicators for the assessment
of educational outcomes. Educational Policy Analysis Archives, 3(6). Retrieved November 23,
2002, from http://epaa.asu.edu/epaa/v3n6.html
Solomon, D., Watson, M., & Deer, J. (1988). Measurement of aspects of classroom environments
considered likely to influence children’s prosocial development. Moral Education Forum,
13(4), 10 – 17.
Suter, W. N., & Roberts, W. L. (1987). An experimental investigation of the beliefs-of-relatedness
source of halo. Contemporary Educational Psychology, 12, 77 – 85.
Tamir, P. (1983). Teachers’ self-reports as an alternative strategy for the study of classroom
transactions. Journal of Research in Science Teaching, 20(9), 815 – 823.
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology. Thousand Oaks, CA: Sage.
Teddlie, C., & Reynolds, D. (2000). International handbook of school effectiveness research. London:
Falmer Press.
Thompson, B. (1998). Statistical significance testing and effect size reporting: Portrait of a possible
future. Research in the Schools, 5(2), 33 – 38.
Veenman, S., Bakermans, J., Franzen, Y., & Van Hoof, M. (1996). Implementation effects of a pre-
service course for secondary education teachers. Educational Studies, 22(2), 225 – 243.
Ward, M. D., Clark, C. C., & Harrison, G. V. (1981, April). The observer effect in classroom visitation.
Paper presented at the Annual Meeting of the American Educational Research Association,
Los Angeles.
Widmeyer, W. N., & Loy, J. W. (1988). When you’re hot, you’re hot! Warm-cold effects in first
impressions of persons and teaching effectiveness. Journal of Educational Psychology, 80(1),
118 – 121.
Wiersma, W. (1983, April). Assessment of teacher performance: Constructs of teacher competencies based
on factor analysis of observation data. Paper presented at the Annual Meeting of the American
Educational Research Association, Montreal, Quebec, Canada.
Wiersma, W. (1988). The Alabama Career Incentive Program: A statewide effort in teacher evaluation.
(ERIC Document Reproduction Service Digest No. ED298128). Auburn, AL: Auburn
University College of Education.