
Assessment & Evaluation in Higher Education
https://doi.org/10.1080/02602938.2021.1909702

The student evaluation of teaching and likability: what the evaluations actually measure
Dennis Clayson
Marketing, University of Northern Iowa, Cedar Falls, IA, USA

ABSTRACT
For several decades research into the student evaluation of teaching has periodically found an association between how well students like an instructor and the evaluations. The association has been largely ignored, being seen as an indicator of bias, or as a statistical or procedural artifact. However, these interpretations may be obscuring a more fundamental hypothesis. It is possible that the evaluations, instead of being a measure of 'good' or 'effective' teaching as commonly conceived, are actually a measure of a student-perceived construct similar to likability. This study looks directly at the influence of likability on the student evaluation of teaching. Knowing nothing about an instructor or how a class was taught, students' perception of likability accounted for two-thirds of the total variance of the evaluations. The student evaluation of teaching could be replaced with a single likability measure, with little loss of predictability.

KEYWORDS
Student evaluation of teaching; SET; likability; personality; evaluation errors; learning

Introduction
Most research involving the student evaluation of teaching (generically referred to as SET) assumes a priori that the instruments are measuring 'effective' instruction. High evaluations are seen as an indicator of good teaching. Findings that do not fit this paradigm are generally considered to be a result of bias. However, there are three lines of evidence (lack of multidimensionality, failure to improve performance, and a failure to measure student learning) that indicate that the instruments utilized to measure student evaluation of teaching or instruction are not evaluating the 'effectiveness' of instruction. Indeed, the instruments may not be measuring instruction at all.

Multidimensionality
Even though students might recognize the multidimensionality of instruction, they appear to be responding to the evaluations primarily in a unidimensional fashion. Early researchers were correctly emphatic in identifying teaching as a complex activity with multidimensional implications (Feldman 1986; Marsh and Roche 1997; Marsh 2007). However, there are several lines of evidence indicating students do not evaluate teaching in a multidimensional fashion.

i. Halo effects are commonplace. It has been found that students, when completing the
evaluation instruments, will ignore the content of any given question and answer in a
manner consistent with some overriding issue or concern (Orsini 1988; Shevlin et al.
2000; Darby 2007; Clayson and Haley 2011; Keeley et al. 2013).

ii. First impressions, which generally are unrelated to a more sophisticated evaluation of teaching, are common. It has been shown that reactions to video clips, some as short as two seconds, predicted the ratings of instructors given by others with more substantial interactions (Ambady and Rosenthal 1993; Tom, Tong, and Hesse 2010). The initial evaluations of the instructor, made before any instruction and even before a syllabus was distributed, have been found to be significantly related to the final evaluations (Clayson and Sheffet 2006; Buchert et al. 2008; Clayson 2013).
iii. The evaluations have a validity issue which suggests students are not evaluating their instructors multidimensionally. SET has been criticized for having convergent validity while appearing to lack divergent and discriminant validity (Greenwald and Gillmore 1997; Marks 2000; Sproule 2002; Langbein 2008; Onwuegbuzie, Daniel, and Collins 2009). This pattern of convergent validity combined with discriminant and divergent invalidity would be found if SET were measuring one, or a few, dimensions of instruction, but not all. In other words, this is what would be expected if students were applying a unidimensional evaluation to a multidimensional construct. Teaching has multidimensional factors, but that does not ensure the students' response to instruction is multidimensional.

Improvement of teaching
One of the primary purposes of evaluation has been the improvement of instruction (Gaillard, Mitchell, and Vahwere 2006; Palmer 2012; Benton and Ryalls 2016). Yet the use of student evaluations does not appear to meaningfully improve teaching as measured by the same instruments. That is to say, improvement of SET scores or rankings appears to be dissociated from previous utilization of the evaluations, both in terms of purpose (formative or summative) and in longitudinal application (Centra 1972; Cohen 1980; Kember, Leung, and Kwan 2002; Penny 2003; Davidovitch and Soen 2006; Campbell and Bozeman 2007; Carle 2009). Younger instructors generally have higher or equal evaluations compared to more experienced and older instructors (Miron 1988; Carrell and West 2010).
It was initially believed that SET would improve instruction (Overall and Marsh 1979; Cohen 1980), but it became evident early on that the evaluations, by themselves, did little to improve teaching without external feedback from experts or consultants (Marsh 1984; Wachtel 1998). However, this raised its own problems. First, why should the stated purpose of an evaluation (improvement of instruction) be undiscoverable to the target of the evaluation? Second, the information contained in the evaluations, even if discoverable, appears to be insufficient for improvement. Consultants must introduce additional concepts, evidence and intelligence, even after evaluation (Marsh 1984; Smith 2008). As Spooren, Brockx, and Mortelmans (2013) state, 'SET ultimately does not achieve the goal of providing useful information to an important stakeholder, with the ultimate goal of improvement' (p. 623). That is to say, the improvements, if they occur, must come from an outside source because the evaluations themselves do not contain the information necessary for instructional improvement (Wolfer and Johnson 2003). Valsan and Sproule (2008) concluded, 'We contend that the notion of teaching effectiveness has no verifiable empirical content and therefore the question of teaching score validity is misguided' (p. 939).

Learning
It is assumed that a valid measure of teaching should be associated with what students learn (Cohen 1981; Otto, Sanford, and Ross 2008; Boring, Ottoboni, and Stark 2016). There is compelling evidence that the evaluation instruments are not related to measurable learning.
Before reviewing the relationship further, it is useful to note what is meant by the term. Learning here refers to what students could do at some point in time, which they could not do at a
prior time, because of instruction. Learning ‘reflects a change over time, not a state at a particular
moment in time’ (Bacon 2016, 3). Learning, as used in this context, is a gain in performance related
to the purpose of a class. It does not refer to the students’ perception of learning. The two are only
marginally related (Clayson 2009; Bacon 2016), with the perception of learning accounting for two
to 10 percent of the variance of learning as defined (Sitzmann et al. 2010; Stroebe 2016).
While the evidence suggests a relationship between SET and learning might have existed in
the past (Marlin and Niss 1980; Dowell and Neal 1982; Baird 1987), more recent studies have
found no meaningful association between the evaluations and learning (O’Connell and Dickinson
1993; Mohanty et al. 2005; Clayson 2009; Boring, Ottoboni, and Stark 2016). The authors of a
recent meta-analysis (Uttl, White, and Gonzalez 2017) concluded, ‘Despite more than 75 years
of sustained effort, there is presently no evidence supporting the widespread belief that students
learn more from professors who receive higher SET ratings' (p. 31). Learning has also been found to be dissociated from other variables that are associated with SET (Delucchi and Pelowski 2000; Basow, Codos, and Martin 2013). In addition, there appears to be a negative relationship
between the student ratings of a class and the performance of students in subsequent classes
(Johnson 2003; Yunker and Yunker 2003; Weinberg, Hashimoto, and Fleisher 2009; Carrell and
West 2010; Braga, Paccagnella, and Pellizzari 2014; Kornell and Hausman 2016; Stroebe 2016).
In summary, as found by Curby et al. (2020), the quality of the instructor does not have the
largest influence on SET.

The question of likability


If students’ responses have little relationship to what they are learning, and, on average, the
information they give is not helping instructors obtain higher scores, what unidimensional factor
is the SET instrument actually measuring?

Likability
One intriguing hypothesis has lurked in the background for decades. At different points in time researchers have independently suggested that what the evaluation most effectively produces is a 'likability' scale (Uranowitz and Doyle 1978; Tang and Tang 1987; Clayson and Haley 1990; Marks 2000; Clayson 2021). For example, Delucchi (2000, 228) concluded, 'As a predictor of overall
ratings, the magnitude of the likability effect far exceeds that for effort and perceptions of
learning’. An intriguing study related to likability looked at what effect an instructor’s ‘charisma’
had on the teacher’s lecturing ability and the overall characteristics of the class (Shevlin et al.
2000). The charisma factor explained 69 percent of the variability of the instructor’s lecturing
ability and 37 percent of the variability of the class attributes. The study was replicated with a
group of Indian MBA students with similar results. The authors (Mittal and Gera 2013, 295)
concluded, ‘The results thus empirically established that student’s perception of charisma of the
teacher explains a significant percent of the variation of SET rather than the individual ratings
of dimensions of “lecturer ability” and “module attributes”’. Other studies, looking at intervening
variables such as physical attractiveness, have found a strong connection between the study
variables and likability (Gurung and Vespia 2007; Feistauer and Richter 2018). Gurung and Vespia
reported, ‘By far the strongest predictor of self-reported learning, and a significant predictor of
self-reported grades, however, was the likability of the professor. Likability, in turn, was predicted
by instructor attractiveness, approachability, and formality of dress, along with the student
attendance, participation, and self-reported class difficulty’ (p. 8). In other words, the students’
perception of likability could be seen as the unifying factor that combined all other related
variables.
Tangential evidence also came from a variety of sources which suggested instructors could raise their evaluations by increasing how well they are liked by students. Humour was encouraged because it makes the instructor more likable (Lei, Cohen, and Russler 2010). Then there
is the infamous chocolate study (Youmans and Jee 2007) in which students in three classes
taught by the same instructor were given a SET by a teaching assistant. Half of the students
received chocolates prior to filling out the evaluation and half did not. The SET administrator
emphasized that he was the source of the chocolate, not the instructor. Eight of nine scales about the instructor were higher in the chocolate group than in the control group: the chocolate group rated the class as more intellectually challenging and as better than other classes, reported being more encouraged to participate in class, and rated the instructor as friendlier toward students. A similar effect was
found in another report, this time with chocolate cookies (Hessler et al. 2018). Penny (2003)
noted how some faculty have labelled the questionnaires as ‘happy forms’.

Personality
Another connection between SET and likability can be seen in the high association between
students’ perception of their teacher’s personality and the evaluations. In the mid-1980s, Feldman
(1986) reviewed the literature and found that sixty to seventy-five percent of the variance in SET
was accounted for by the students’ perception of their instructor’s personality. The instructors’
perception of their own personality was not related to the evaluations. Feldman suggested that
both the students’ measures of personality and teaching effectiveness could be mutually ‘con-
taminated’ by other variables. One of the suggested contaminates was likability. An instructor
might be liked for any number of reasons, which results in the students rating the instructor as
a good teacher and as having a pleasing personality. In another study, Clayson and Sheffet (2006)
took measures of personality and evaluations at four different times during a semester. They
compared change in the students’ perception of their instructor’s personality with change in the
evaluations in the last half of the term. The personality/evaluation changes were highly related.
Even after the midterm, changes in the evaluations were consistent with changes in the students'
perception of the instructor’s personality. The methodology ruled out the possibility that the
association was due to insufficient control of secondary variables. As Patrick (2011, 239) found,
‘Personality explained variance in teacher and course evaluations over and above grades and
perceived learning’.
All of this shows that the students’ perception of their instructor’s likability, personality,
and their responses on a SET instrument are highly related, which as previously noted, led
some researchers to suggest the evaluations create what could be called a likability scale. This
hypothesis may appear radical only because the literature has assumed that likability is related
to the evaluations as a contaminant rather than as what the evaluations might actually be measuring. In other words, likability is an assumed independent variable predicting the evaluation rather than the evaluation being an independent variable predicting likability. A similar
point was made long ago by Uranowitz and Doyle (1978, 21) when they stated, ‘Considerably
more is known about influences on likability than about the effects of likability’. They contin-
ued, ‘It seems quite clearly a topic for future investigations’ (p. 33). However, no current studies
were found directly testing a likability hypothesis.

Purpose of study
This study investigates the proposal that a predominant source of the variability found in a scale produced by the student evaluation of teaching can be accounted for by the students' perception of likability.
Methods
The likability hypothesis was studied by mining an existing dataset which was created over a nine-year period.

Participants
At an accredited business school of an American university, a survey was made available to
students of an undergraduate business class, each semester, for nine consecutive years (from
2010 to 2019). During that period the class topic remained the same and was taught by the
same instructor. Almost 900 students contributed information. They were overwhelmingly juniors
and seniors (96%), and 56 percent were female. Their grade point average (GPA) ranged from
1.99 to 4.00 (average = 3.14 on a 4-point scale). Approximately two-thirds of respondents were
business students, and the rest were social science and humanities majors.

Materials
The variables utilized in the study were taken from a more extensive questionnaire containing
30 items related to class topics and discussions. Only the questions pertinent to the current
study were utilized. As part of this exercise, the students were asked to evaluate their own
personality, and as a comparison, the personality of an instructor. The students went online to
complete the 60-item International Personality Item Pool (IPIP) Five Factor Personality inventory
for both themselves and for a selected instructor (found at: http://www.outofservice.com/big/five/). The Big Five inventory has been used for years and has known instrument-related characteristics (Goldberg 1990; Schwarzer, Mueller, and Greenglass 1999; Birnbaum 2000).
Consistent with other studies (Clayson and Haley 1990; Clayson and Sheffet 2006), the students' perception of an instructor's personality traits can be summed (neuroticism reversed) to create a measure of the general likability of the instructor. This is not a direct measure of personality, but of the positiveness of personality. The students were also asked to directly rate the instructor and themselves on a seven-point scale with the anchors 'unlikable' to 'likable'. For the proxy of SET, students recorded their responses on a seven-point scale to two questions. The first rated their satisfaction with learning in the class, and the second rated the students' perception of the overall teaching effectiveness of the instructor. The first summarized the interest in learning found in over a third of the questions of the official university evaluation. The second is identical to the summative statement on the university's official SET instrument. These were averaged to produce the dependent variable Eval (α = 0.87). A summary of variables used in the study is shown in Table 1. The presentation of the questions on the survey was counterbalanced.
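To make the construction of these composite measures concrete, the following is a minimal sketch (not the author's code) of how Eval and PerTot could be computed from per-student scores; the file name, column names and the factor-score scale are hypothetical assumptions.

```python
import pandas as pd

# Hypothetical dataset: one row per student, holding the two SET items and the
# five IPIP factor scores the student assigned to the rated instructor.
df = pd.read_csv("set_likability.csv")  # assumed file and column names

# Eval: average of the two seven-point SET items (satisfaction with learning
# and overall teaching effectiveness); the paper reports alpha = 0.87.
df["Eval"] = df[["satisfaction", "effectiveness"]].mean(axis=1)

# PerTot: averaged sum of the Big Five factor scores with neuroticism reversed,
# so that a higher score always reflects a more positive trait.
SCALE_MAX = 100  # assumed upper bound of the IPIP factor scores
positive = ["extraversion", "agreeableness", "conscientiousness", "openness"]
df["PerTot"] = (df[positive].sum(axis=1) + (SCALE_MAX - df["neuroticism"])) / 5
```

Like itself is simply the single seven-point rating and needs no further computation.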

Procedure
Students could refuse to participate, but those who completed the assignment received class credit and had an opportunity to learn more about themselves; approximately 93 percent of the students did so. Since the survey method varied and not all students answered every
question, the data utilized came from 834 students who provided material useful for the study.
Students received the instrument after midterm either by picking up the survey in class or
downloading the survey from the class online site. They had ten days to complete the survey
and return it to class. As part of the information requested, the students evaluated their own
personality on the IPIP and, as a comparison, evaluated the personality of an instructor they currently had,
other than their current business class instructor. The purpose of the exercise was to gain
Table 1. Variables utilized in study.

Dependent variables
  Evaluation (Eval)
    How would you rate your overall satisfaction with your learning in this class?
    Low: 1 2 3 4 5 6 7: High
    How would you rate the overall teaching effectiveness of the instructor in the course?
    Ineffective: 1 2 3 4 5 6 7: Effective
  Likability (Like)
    Rate this instructor on the following dimension:
    Unlikable: 1 2 3 4 5 6 7: Likable
  Personality Total (PerTot)
    Averaged sum of the Big 5 factors (Neuroticism reversed), derived from the International
    Personality Item Pool (IPIP) Five Factor Personality Inventory
Control variables
  Self-Liking (SLike): Same scale as Like above; how likable students see themselves
  Gender (Female): Male = 0, Female = 1
  Grade (Grade): In instructor's class: Same or below normal = 0, Above normal = 1
  GPA (GPA): Self-reported GPA
  Control (Control): Next instructor = 0, Self-select = 1

insight into how a hypothetical construct such as personality could be measured by the students' evaluation of themselves, with an instructor as a comparison. An inspection of the data
found that selection of the evaluated instructor varied. One group (n = 271) was simply asked
to evaluate an instructor the student currently had. The second group (n = 563) was required
to select the next instructor after the present class, irrespective of where or when the class
was held. A preliminary analysis found no associational differences between the two groups
(see Table 2), except for a difference in the overall magnitude of the evaluation given. This
difference was controlled by adding a control variable to the evaluation (see Table 1).
The method of selecting teachers to be evaluated, in essence, created a random sample of
instructors. It is important to note that, other than the likability scores and the teaching evaluation, the data contained no information about the instructors themselves, no information about an instructor's class, how the class was taught, or in what academic area the class was conducted. That is to say, there was no data related to characteristics of the instructors, or anything about how the instructors taught, except for the perception of what the students liked.
Given the sample size and the lack of control over the origins of the data, the significance level was set at 0.001. The data did not contain information which would allow individual
students to be identified.

Results
Validity checks
As seen in Table 2, the summation of the Big Five traits was highly related to both the evaluation and the likability measure (PerTot/Eval: r = 0.518, t = 16.76; PerTot/Like: r = 0.519, t = 16.80).
Historically, grades have consistently been found to be related to the evaluations (Cohen 1981;
Gillmore and Greenwald 1999). The relationships here are all significant (Grade/Eval: r = 0.251,
t = 7.18; Grade/Like: r = 0.146, t = 4.08; Grade/PerTot: r = 0.149, t = 4.17). Women at the university
receive higher grades than men. The data shows a significant difference in GPA (t(865) = 7.14,
p < 0.001), and a strong association between gender and GPA (Female/GPA: r = 0.224, t = 6.36).
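As a point of reference for the t values reported with these correlations, a Pearson correlation can be tested against zero with t = r·sqrt(n − 2)/sqrt(1 − r²) on n − 2 degrees of freedom. The sketch below is not the author's code, and the exact pairwise n behind each reported value is not stated, so the result is only approximate.

```python
import math

def corr_t(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0 for a Pearson correlation (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# With the full usable sample of 834 students, r = 0.518 gives t of about 17.5;
# the paper reports t = 16.76, presumably because the pairwise n is somewhat smaller.
print(round(corr_t(0.518, 834), 2))
```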

Hypothesis related findings


The study investigates the proposal that the scale produced by a student evaluation of teaching
and likability are strongly related. The following analyses were performed.
Table 2. Correlations between variables.


Variable 1 2 3 4 5 6 7 8 Mean SD
1. Eval – 5.16 1.35
2. Like 0.558 – 5.52 1.41
3. PerTot 0.518 0.519 – 6.17 1.55
4. SLike 0.128 0.188 0.165 – 5.94 0.90
5. Female 0.022 0.014 0.083 0.013 – 0.55 0.49
6. Grade 0.251 0.146 0.149 0.056 0.010 – 0.31 0.46
7. GPA −.081 −.112 −.013 −.052 0.224 −.102 – 3.14 0.40
8. Control 0.178 −.018 −.005 −.043 0.019 0.030 −.019 – 0.34 0.47
Notes: Coefficient in bold, p < 0.001; PerTot (as utilized) = PerTot/10.

Table 3. Two-stage regression of Eval, with Like, PerTot and control variables.
Variable Beta t p Tol
Model 1
Like 0.395 11.90 .000 .730
PerTot 0.313 9.42 .000 .730
Model 2
Like 0.383 11.84 .000 .707
PerTot 0.298 9.26 .000 .714
  SLike 0.006 0.20 .838 .955
  Grade 0.143 5.61 .000 .961
  GPA −.015 −0.52 .602 .925
  Female −.003 −0.11 .913 .942
  Control 0.182 6.70 .000 .996
(Const) 3.27 .001
Model 1 R = 0.618 R2Adj = 0.381
Model 2 R = 0.662 R2Adj = 0.433
R2Adj2/R2Adj1 = 0.433/0.381 = 1.136
R2Adj1/R2Adj2 = 0.381/0.433 = 0.880
(88.0; Percent of all measured variance attributable to Like and PerTot)
Notes: Coefficient in bold, p < 0.001; PerTot (as utilized) = PerTot/10.
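The two-stage (hierarchical) regression summarized in Table 3 can be reproduced in outline as follows. This is a rough sketch under stated assumptions, not the author's code: the file and column names are hypothetical, and only the continuous predictors are standardized here, so the coefficients for the binary variables are not strictly betas.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("set_likability.csv")  # assumed file and column names

# Standardize the continuous variables so their coefficients read as betas.
for col in ["Eval", "Like", "PerTot", "SLike", "GPA"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

m1 = smf.ols("Eval ~ Like + PerTot", data=df).fit()                       # Model 1
m2 = smf.ols("Eval ~ Like + PerTot + SLike + Grade + GPA + Female + Control",
             data=df).fit()                                               # Model 2

# Share of the full model's explained variance carried by the likability
# variables alone; the paper reports 0.381 / 0.433 = 0.880.
print(m1.rsquared_adj, m2.rsquared_adj, m1.rsquared_adj / m2.rsquared_adj)
```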

1. Variables related to the evaluation should be related to likability. Inspection of Table 2 shows that the direct simple least-squares associations are substantial (Eval/Like: r = 0.558, t = 18.81; Eval/PerTot: r = 0.518, t = 16.76; Like/PerTot: r = 0.519, t = 16.80). Note that the correlation between the personality-derived variable and the evaluation (PerTot and Eval), and the correlation of the same variable with the direct measure of likability (PerTot and Like), are essentially identical (t = 0.033).
2. The multiple regression of the evaluation, with control variables, should be highly influenced by likability measures (see Table 3 and the sketch following it). The likability variables are significant predictors of Eval, accounting for 88 percent of the variance explained by the full model, which includes all other student and control variables.
3. Table 4 shows separate regressions between student variables and Eval, Like and PerTot.
The pattern, the magnitude of the associations, and the amount of variance accounted
for are similar. The largest R2 (Eval; 0.075) and the smallest R2 (PerTot; 0.048) can be compared utilizing a method suggested by Ferguson (1971); the difference is not significant (t = 1.21).
4. Table 5 presents three measures of shared variance among Eval, Like and PerTot (see the sketch after this list). The three variables entered into a principal component analysis produced a single factor accounting for 69 percent of the variance. Cronbach's alpha, which is the weighted ratio of covariance to variance, was 0.77 for the three variables. A structural model is given in Figure 1. Eval is shown predicting 'likability' only to reinforce the equivalence. The model indicates that Eval and 'likability' share 64 percent of variance. The data fit the model well (χ2 = 0.16, df = 1, p = 0.69), with root mean square error of approximation (RMSEA) equal to 0.000 and a comparative fit index (CFI) equal to 1.00.
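The shared-variance checks in Table 5 can be sketched as follows. Again this is not the author's code, and the file and column names are hypothetical assumptions: a principal component analysis of the standardized Eval, Like and PerTot scores, and Cronbach's alpha treating the three as items of a single scale.

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("set_likability.csv")  # assumed file and column names
items = df[["Eval", "Like", "PerTot"]].dropna()
z = (items - items.mean()) / items.std()

# First principal component: the paper reports a single factor carrying
# about 68.6 percent of the variance of the three measures.
pca = PCA().fit(z)
print(pca.explained_variance_ratio_[0])

def cronbach_alpha(x: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(ddof=1).sum() / x.sum(axis=1).var(ddof=1))

print(cronbach_alpha(z))  # the paper reports alpha = 0.771
```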
Table 4. Regression of Eval, Like and PerTot, with student variables.


Variable Eval Like PerTot
Beta coefficients
SLike 0.116 0.189 0.156
Grade 0.241 0.120 0.139
GPA −.050 −.087 −.009
Female 0.028 0.033 0.080
R2Adj 0.075 0.059 0.048
Note: Betas in bold, p < 0.001

Table 5. Shared variance of Eval, Like and PerTot.

Principal component analysis
  Variable   Factor loading
  Eval       0.837
  Like       0.837
  PerTot     0.810
  (Single factor accounting for 68.6 percent of variance)
Cronbach's Alpha
  Eval, Like, and PerTot as a single variable: α = 0.771
Structural Model (see Figure 1)
  Like and Eval account for 64.0 percent of shared variance

Discussion
This study found that the students’ perception of the likability of an instructor, for whatever
reason, accounts for a significant part of the variance of a teaching evaluation. Given the higher
variability expected in a student-as-case method, the similarity between the evaluation and measures of likability found here is remarkable. Without statistically including any objective measure of
an instructor, how the instructor taught, or any observation of what happened in the classroom,
all the while controlling for relevant student differences, the students’ measures of how well
they liked an instructor accounted for almost two-thirds of the total variance of SET. This proportion is almost identical to previous findings of the effects of 'charisma' on the evaluations
(Shevlin et al. 2000; Mittal and Gera 2013). Given these results, SET instruments could be replaced
with a simple unidimensional question measuring likability, or with a personality inventory, with
little loss of predictability.
If these findings prove valid in other venues, the evaluations could be seen as a
reflection of student perceptions of what they like, channeled through a particular assessment
instrument. Hence, we should expect to find evidence of what students like and dislike within
student evaluations of teaching. If they are more comfortable with certain gender stereotypes,
then gender bias should be found (Huebner and Magel 2015; Miles and House 2015; Mitchell
and Martin 2018; Clayson 2020). If students like good grades, then grades should be found to
influence SET (Brockx, Spooren, and Mortelmans 2011; Backer 2012; Chen, Wang, and Yang
2017). If students are ambivalent about the discipline needed for learning, then learning should
not be related to SET. Finding such factors should not be seen as errors in the process or as
contaminating variables, but as indicators of what the evaluation process is actually measuring,
i.e. likability.

Limitations
It is possible that the students at this university are different from others. However, research
with these students has shown consistent similarities with the findings of national and international studies. Nevertheless, the data was not created with this study in mind and lacks the rigor that might have been applied if its origins had been more specifically defined by the goals of the study.

Figure 1. Structural Model: Likability by SET.

Further, although instructions were explicit, students completed the survey
in private, and their data had to be accepted at face value. Unless there was some unknown
bias connecting almost a decade of students, it could reasonably be presumed that this problem
would increase the variability and hence the probability of Type II errors.
It could also be argued that the associations found here between the evaluation, likability
and personality resulted from procedural or questionnaire bias. While this is possible, it is
improbable. The likability questions and the personality scales were counterbalanced, as was
the SET measure. In addition, the personality scores were derived from a separate 60-item
personality inventory found online, physically and temporally removed from the immediate
survey questionnaire.
A major limitation is the dimensionality of responses. The measure of Eval was derived from
the average of two questions, and the major measures of likability (Like and PerTot) from two
variables. Even though, as argued in the introduction, many researchers feel that students respond to instruction on the evaluations in a unidimensional fashion, teaching is still considered to be a multidimensional endeavor. This does not preclude the possibility that students may
form their perception of likability from numerous events and from personal and instructional
attributes. Nevertheless, the validity of this study rests on the assumption that students will
typically answer questions about instruction utilizing some important construct that will override
the specific intent of differing questions, and that construct can best be defined as likability.

Implications
The findings in this study, although robust, are exploratory. This is important to keep in mind
because the likability hypothesis, if found to be valid, has profound implications for the current
evaluation system, some of which will be considered radical.
Understanding that the student evaluation of teaching is a measure of what students like
and dislike potentially creates a paradigm shift in the way the evaluations are seen and utilized. The evaluations become more similar to customer satisfaction surveys than to a measure
of teaching. Even that, however, is only partially correct because there are differences between
a service encounter and an educational interaction. A student isn’t a customer in the same
sense as someone buying a pair of shoes or staying at a hotel. Students do not buy a degree
in the same sense as buying a liter of soda (D’Eon and Harris 2000; Clayson and Haley 2005;
Xu, Lo, and Wu 2018).
The hypothesis also has research implications. The study implies the evaluations reflect
student perceptions. This suggests the evaluations are a measure of the students, not of the
instructor. It also removes the need to assume that findings which invalidate pedagogic theory
are incorrect, or to suggest that negative findings are a type of ‘witch hunt’ (Theall and Franklin
2002). Fundamentally, it proposes that the question of validity has been misplaced. The instruments are not necessarily a measure of teaching as seen by educators. They are a measure of
whatever students like and dislike, whatever that may be. These preferences should be found
in the evaluations irrespective of any theory or concept of instruction.
The hypothesis also has formative and summative implications. Unless a consumer satisfaction
model is adopted, high evaluation scores are not an indicator that an instructor is a ‘good’ or
‘effective’ teacher, nor are low scores an indicator of the opposite. This implies the current
evaluation system is inadequate, misleading and perhaps even detrimental. To put this into
perspective, consider an evaluation based not on ‘likability’ but on learning. First ‘learning’ would
need to be defined. Second, based on the definition, learning would need to be measured.
Third, unless ‘learning’ was defined as a subjective perception, the determination of ‘learning’
would need to be targeted at students, not at the instructor. The current evaluation paradigm
arguably does not perform these steps even if ‘learning’ is replaced with other instructionally
related variables. The hypothesis suggests the current evaluation system cannot validly measure
anything other than what the students like and dislike.

Disclosure statement
No potential conflict of interest was reported by the author.

References
Ambady, N., and R. Rosenthal. 1993. “Half a Minute: Predicting Teacher Evaluations from Thin Slices
of Nonverbal Behavior and Physical Attractiveness.” Journal of Personality and Social Psychology
64 (3): 431–441. doi:10.1037/0022-3514.64.3.431.
Backer, E. 2012. “Burnt at the Student Evaluation Stake: The Penalty for Failing Students.” e-Journal of
Business Education & Scholarship of Teaching 6 (1): 1–13. www.ejbest.org/.
Bacon, D. R. 2016. “Reporting Actual and Perceived Student Learning in Education Research.” Journal
of Marketing Education 38 (1): 3–6. doi:10.1177/0273475316636732.
Baird, J. S. 1987. “Perceived Learning in Relation to Student Evaluation of University Instruction.”
Journal of Educational Psychology 79 (1): 90–91. doi:10.1037/0022-0663.79.1.90.
Basow, S. A., S. Codos, and J. L. Martin. 2013. “The Effects of Professor’s Race and Gender on Student
Evaluations and Performance." College Student Journal 47 (2): 352–363. https://www.ingentaconnect.com/content/prin/csj/2013/00000047/00000002/art00011.
Benton, S. L., and K. R. Ryalls. 2016. IDEA PAPER No. 58: Challenging Misconceptions about Student Ratings
of Instruction. Manhattan, KS: The IDEA Center. https://files.eric.ed.gov/fulltext/ED573670.pdf.
Birnbaum, M. H. 2000. “A Survey of Faculty Opinions Concerning Student Evaluation of Teaching.”
https://psych.fullerton.edu/mbirnbaum/faculty3.htm
Boring, A., K. Ottoboni, and P. B. Stark. 2016. “Student Evaluation of Teaching (Mostly) Do Not Measure
Teaching Effectiveness." ScienceOpen. doi:10.14293/S2100-1006.1SOR-EDU.AETBZC.v1. Also at:
https://www.math.upenn.edu/~pemantle/active-papers/Evals/stark2016.pdf.
Braga, M., M. Paccagnella, and M. Pellizzari. 2014. “Evaluating Students’ Evaluations.” Economics of
Education Review 41: 71–88. doi:10.1016/j.econedurev.2014.04.002.
Brockx, B., P. Spooren, and D. Mortelmans. 2011. “Taking the Grading Leniency Story to the Edge.
The Influence of Student, Teacher, and Course Characteristics on Student Evaluations of Teaching
in Higher Education." Educational Assessment, Evaluation and Accountability 33: 289–306. doi:10.1007/
s11092-011-9126-2.
Buchert, S., E. L. Laws, J. M. Apperson, and N. J. Bregman. 2008. “First Impressions and Professor
Reputation: Influence on Student Evaluations of Instruction.” Social Psychology of Education 11 (4):
397–408. doi:10.1007/s11218-008-9055-1.
Campbell, J. P., and W. C. Bozeman. 2007. “The Value of Student Ratings: Perceptions of Students,
Teachers, and Administrators.” Community College Journal of Research and Practice 32 (1): 13–24.
doi:10.1080/10668920600864137.
Carle, A. C. 2009. “Evaluating College Students’ Evaluations of a Professor’s Teaching Effectiveness
across Time and Instruction Mode (Online vs. Face-to-Face) Using a Multilevel Growth Modeling
Approach.” Computers & Education 53 (2): 429–435. doi:10.1016/j.compedu.2009.03.001.
Carrell, S. E., and J. E. West. 2010. “Does Professor Quality Matter? Evidence from Random Assignment
of Students to Professors.” Journal of Political Economy 118 (3): 409–432. doi:10.1086/653808.
Centra, J. A. 1972. Two Studies on the Utility of Student Ratings for Instructional Improvement (SIR Report
No. 9). Princeton, NJ: Educational Testing Service.
Clayson, D. E. 2009. “Student Evaluations of Teaching: Are They Related to What Students Learn? A
Meta-Analysis and Review of the Literature." Journal of Marketing Education 31 (1): 16–30. doi:10.1177/0273475308324086.
Clayson, D. E. 2013. “Initial Impressions and the Student Evaluation of Teaching.” Journal of Education
for Business 88 (1): 26–35. doi:10.1080/08832323.2011.633580.
Clayson, D. E. 2020. “Student Perception of Instructors: The Effect of Age, Gender, and Political Leaning.”
Assessment & Evaluation in Higher Education 45 (4): 607–616. doi:10.1080/02602938.2019.1679715.
Clayson, D. E. 2021. A Comprehensive Critique of Student Evaluation of Teaching: Critical Perspectives on
Validity, Reliability, and Impartiality. New York: Routledge.
Clayson, D. E., and D. A. Haley. 1990. “Student Evaluations in Marketing: What Is Actually Being
Measured?” Journal of Marketing Education 12 (3): 9–17. doi:10.1177/027347539001200302.
Clayson, D. E., and D. A. Haley. 2005. “Marketing Models in Education: Students as Customers, Products,
or Partners.” Marketing Education Review 15 (1): 1–10. doi:10.1080/10528008.2005.11488884.
Clayson, D. E., and D. A. Haley. 2011. “Are Students Telling Us the Truth? A Critical Look at the
Student Evaluation of Teaching.” Marketing Education Review 21 (2): 101–114. doi:10.2753/
MER1052-8008210201.
Clayson, D. E., and M. J. Sheffet. 2006. “Personality and the Student Evaluation of Teaching.” Journal
of Marketing Education 28 (2): 149–160. doi:10.1177/0273475306288402.
Chen, C. Y., S. Wang, and Y. Yang. 2017. “A Study of the Correlation of the Improvement of Teaching
Evaluation Scores Based on Student Performance Grades.” International Journal of Higher Education
6 (2): 162–168. doi:10.5430/ijhe.v6n2p162.
Cohen, P. A. 1980. “Effectiveness of Student-Rating Feedback for Improving College Instruction:
A Meta-Analysis.” Research in Higher Education 13 (4): 321–341. doi:10.1007/BF00976252.
Cohen, P. A. 1981. “Student Ratings of Instruction and Student Achievement: A Meta-Analysis of
Multi-Section Validity Studies.” Review of Educational Research 51 (3): 281–309.
doi:10.3102/00346543051003281.
Curby, T., P. McKnight, L. Alexander, and S. Erchov. 2020. “Sources of Variance in End-Of-Course Student Evaluations.”
Assessment & Evaluation in Higher Education 45 (1): 44–53. doi:10.1080/02602938.2019.1607249.
Darby, J. A. 2007. "Are Course Evaluations Subject to a Halo Effect?" Research in Education 77
(1): 46–55. doi:10.7227/RIE.77.4.
Davidovitch, N., and D. Soen. 2006. “Using Students’ Assessments to Improve Instructors’ Quality of
Teaching.” Journal of Further and Higher Education 30 (4): 351–376. doi:10.1080/03098770600965375.
D’Eon, M. F., and C. Harris. 2000. “If Students Are Not Customers, What Are They?” Academic Medicine
75 (12): 1173–1177.
Delucchi, M. 2000. “Don’t Worry, Be Happy: Instructor Likability, Student Perceptions of Learning, and
Teacher Ratings in Upper-Level Sociology Courses.” Teaching Sociology 28 (3): 220–231.
doi:10.2307/1318991.
Delucchi, M., and S. Pelowski. 2000. “Liking or Learning: The Effect of Instructor Likeability and Student
Perceptions of Learning on Overall Ratings of Teaching Ability.” Radical Pedagogy 2: 1–5. Full paper
can be found at: http://radicalpedagogy.icaap.org/
Dowell, D. A., and J. A. Neal. 1982. “A Selective Review of the Validity of Student Ratings of Teaching.”
Journal of Higher Education 53 (1): 51–62. doi:10.1080/00221546.1982.11780424.
Feistauer, D., and T. Richter. 2018. “Validity of Students’ Evaluations of Teaching: Biasing Effects of
Likability and Prior Subject Interest.” Studies in Educational Evaluation 59: 168–178. doi:10.1016/j.
stueduc.2018.07.009.
Feldman, K. A. 1986. “The Perceived Instructional Effectiveness of College Teachers as Related to Their
Personality and Attitudinal Characteristics: A Review and Synthesis.” Research in Higher Education
24 (2): 139–213. doi:10.1007/BF00991885.
Ferguson, G. A. 1971. Statistical Analysis in Psychology & Education. New York: McGraw-Hill.
Gaillard, F. D., S. P. Mitchell, and K. Vahwere. 2006. “Students, Faculty, and Administrators Perception
of Students Evaluations of Faculty in Higher Education Business Schools.” Journal of College Teaching
& Learning 8: 77–90. doi:10.19030/tlc.v3i8.1695.
Gillmore, G. M., and A. G. Greenwald. 1999. “Using Statistical Adjustment to Reduce Biases in Student
Ratings." American Psychologist 54 (7): 518–519. doi:10.1037/0003-066X.54.7.518. (Original data published: Greenwald, A. G. 1997. American Psychologist 52: 1182–1186.)
Goldberg, L. R. 1990. “An Alternative "Description of Personality": The Big-Five Factor Structure.” Journal
of Personality and Social Psychology 59 (6): 1216–1229. doi:10.1037/0022-3514.59.6.1216.
Greenwald, A. G., and G. M. Gillmore. 1997. “Grading Leniency Is a Removable Contaminant of Student
Ratings.” American Psychologist 52 (11): 1209–1217. doi:10.1037/0003-066X.52.11.1209.
Gurung, R. A. R., and K. Vespia. 2007. “Looking Good, Teaching Well? Linking Liking, Looks, and
Learning." Teaching of Psychology 34 (1): 5–10. doi:10.1080/00986280709336641.
Hessler, M., D. M. Pöpping, H. Hollstein, H. Ohlenburg, P. H. Arnemann, C. Massoth, L. M. Seidel,
A. Zarbock, and M. Wenk. 2018. “Availability of Cookies during an Academic Course Session Affects
Evaluation of Teaching.” Medical Education 52 (10): 1064–1072. doi:10.1111/medu.13627.
Huebner, L., and R. C. Magel. 2015. “A Gendered Study of Student Ratings of Instruction.” Open Journal
of Statistics 05 (06): 552–567. doi:10.4236/ojs.2015.56058.
Johnson, V. E. 2003. Grade Inflation: A Crisis in College Education. New York: Springer.
Keeley, J. W., T. English, J. Irons, and A. M. Henslee. 2013. “Investigating Halo and Ceiling Effects in
Student Evaluations of Instruction.” Educational and Psychological Measurement 73 (3): 440–457.
doi:10.1177/0013164412475300.
Kember, D., D. Y. P. Leung, and K. P. Kwan. 2002. “Does the Use of Student Feedback Questionnaires
Improve the Overall Quality of Teaching?” Assessment & Evaluation in Higher Education 27 (5):
411–425. doi:10.1080/0260293022000009294.
Kornell, N., and H. Hausman. 2016. “Do the Best Teachers Get the Best Ratings?” Frontiers in Psychology
25. doi:10.3389/fpsyg.2016.00570.
Langbein, L. 2008. “Management by Results: Student Evaluation of Faculty Teaching and the
Mis-Measurement of Performance.” Economics of Education Review 27 (4): 417–428. doi:10.1016/j.
econedurev.2006.12.003.
Lei, S. A., J. L. Cohen, and K. M. Russler. 2010. “Humor on Learning in the College Classroom: Evaluating
Benefits and Drawbacks from Instructors’ Perspective.” Journal of Instructional Psychology 37 (4):
326–331.
Marks, R. B. 2000. “Determinants of Student Evaluations of Global Measures of Instructor and Course
Value.” Journal of Marketing Education 22 (2): 108–119. doi:10.1177/0273475300222005.
Marlin, J. W., and J. F. Niss. 1980. “End-of-Course Evaluations as Indicators of Student Learning and
Instructor Effectiveness.” The Journal of Economic Education 11 (2): 16–27. doi:10.1080/00220485.
1980.10844950.
Marsh, H. W. 1984. “Students’ Evaluations of University Teaching: Dimensionality, Reliability, Validity,
Potential Biases and Usefulness.” Journal of Educational Psychology 76 (5): 707–754.
doi:10.1037/0022-0663.76.5.707.
Marsh, H. W. 2007. “Students’ Evaluations of University Teaching: Dimensionality, Reliability, Validity,
Potential Biases and Usefulness." In The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective, edited by R. P. Perry and J. C. Smart, 319–383. Dordrecht: Springer.
doi:10.1007/1-4020-5742-3_9.
Marsh, H. W., and L. A. Roche. 1997. “Making Students’ Evaluations of Teaching Effectiveness Effective:
The Critical Issues of Validity, Bias, and Utility." American Psychologist 52 (11): 1187–1197. doi:10.1037/0003-066X.52.11.1187.
Miles, P., and D. House. 2015. "The Tail Wagging the Dog: An Overdue Examination of Student
Teaching Evaluations.” International Journal of Higher Education 4 (2): 116–126. doi:10.5430/ijhe.
v4n2p116.
Miron, M. 1988. “Students’ Evaluation and Instructors’ Self-Evaluation of University Instruction.” Higher
Education 17 (2): 175–181. doi:10.1007/BF00137970.
Mitchell, K. M. W., and J. Martin. 2018. “Gender Bias in Student Evaluations.” PS: Political Science &
Politics 51 (3): 648–652. doi:10.1017/S104909651800001X.
Mittal, S., and R. Gera. 2013. “Student Evaluation of Teaching Effectiveness (SET): An SEM Study in
Higher Education in India.” International Journal of Business and Social Science 4 (10): 289–298.
https://pdfs.semanticscholar.org/4c92/788203ef9d434eb37eab8b24482508ff6672.pdf.
Mohanty, G. J., C. Gretes, B. Flowers, B. Algozzine, and F. Spooner. 2005. "Multi-Method Evaluation
of Instruction in Engineering Classes.” Journal of Personnel Evaluation in Education 18 (2): 139–151.
doi:10.1007/s11092-006-9006-3.
O’Connell, D. Q., and D. J. Dickinson. 1993. “Student Ratings of Instruction as a Function of Testing
Conditions and Perceptions of Amount Learned.” Journal of Research and Development in Education
27 (1): 8–23. https://eric.ed.gov/?id=EJ478565.
Onwuegbuzie, A. J., L. G. Daniel, and K. M. T. Collins. 2009. “A Meta-Validation Model for Assessing
the Score-Validity of Student Teaching Evaluations.” Quality & Quantity 43 (2): 197–209. doi:10.1007/
s11135-007-9112-4.
Orsini, J. L. 1988. “Halo Effects in Student Evaluations of Faculty: A Case Application.” Journal of
Marketing Education 10 (2): 38–45. doi:10.1177/027347538801000208.
Otto, J., D. A. Sanford, and D. N. Ross. 2008. “Does Ratemyprofessor.com Really Rate My Professor?”
Assessment & Evaluation in Higher Education 33 (4): 355–368. doi:10.1080/02602930701293405.
Overall, J. U., and H. W. Marsh. 1979. “Midterm Feedback from Students: Its Relationship to Instructional
Improvement and Students’ Cognitive and Affective Outcomes.” Journal of Educational Psychology
71 (6): 856–865. doi:10.1037/0022-0663.71.6.856.
Palmer, S. 2012. “Student Evaluation of Teaching: Keeping in Touch with Reality.” Quality in Higher
Education 18 (3): 297–311. doi:10.1080/13538322.2012.730336.
Patrick, C. L. 2011. “Student Evaluations of Teaching: Effects of the Big Five Personality Traits, Grades
and the Validity Hypothesis.” Assessment & Evaluation in Higher Education 36 (2): 239–249.
doi:10.1080/02602930903308258.
Penny, A. R. 2003. “Changing the Agenda for Students’ Views about University Teaching: Four
Shortcomings of SRT Research." Teaching in Higher Education 8 (3): 399–441. doi:10.1080/13562510309396.
Schwarzer, R., J. Mueller, and E. Greenglass. 1999. “Assessment of Perceived Self-Efficacy on the Internet:
Data Collected in Cyberspace.” Anxiety, Stress and Coping 12 (2): 145–161.
doi:10.1080/10615809908248327.
Shevlin, M., P. Banyard, M. Davies, and M. Griffiths. 2000. “The Validity of Student Evaluation of
Teaching in Higher Education: Love Me, Love My Lectures?” Assessment & Evaluation in Higher
Education 25 (4): 397–405. doi:10.1080/713611436.
Sitzmann, T., K. Ely, K. G. Brown, and K. N. Bauer. 2010. “Self-Assessment of Knowledge: A Cognitive
Learning or Affective Measure?” Academy of Management Learning & Education 9 (2): 169–191.
doi:10.5465/amle.9.2.zqr169.
Smith, C. 2008. “Building Effectiveness in Teaching through Targeted Evaluation and Response:
Connecting Evaluation to Teaching Improvement in Higher Education.” Assessment & Evaluation in
Higher Education 33 (5): 517–533. doi:10.1080/02602930701698942.
Spooren, P., B. Brockx, and D. Mortelmans. 2013. “On the Validity of Student Evaluation of Teaching:
The State of the Art." Review of Educational Research 83 (4): 598–642. doi:10.3102/0034654313496870.
Sproule, R. 2002. “The Underdetermination of Instructor Performance by Data from the Student Evaluation
of Teaching.” Economics of Education Review 21 (3): 287–294. doi:10.1016/S0272-7757(01)00025-5.
Stroebe, W. 2016. “Why Good Teaching Evaluations May Reward Bad Teaching: On Grade Inflation
and Other Unintended Consequences of Student Evaluations.” Perspectives on Psychological Science
11 (6): 800–816. doi:10.1177/1745691616650284.
Tang, T. L., and T. L. N. Tang. 1987. “A Correlation Study of Students’ Evaluations of Faculty Performance
and Their Self-Ratings in an Instructional Setting." College Student Journal 21 (1): 90–97. https://www.researchgate.net/publication/234131491_A_correlation_study_of_students'_evaluations_of_faculty_performance_and_their_self-ratings_in_an_instructional_setting.
Theall, M., and J. Franklin. 2002. “Looking for Bias in All the Wrong Places: A Search for Truth or a
Witch Hunt in Student Ratings of Instruction?” New Directions for Institutional Research 27 (5):
45–56. doi:10.1002/ir.3.
Tom, G., S. T. Tong, and C. Hesse. 2010. “Thick Slice and Thin Slice Teaching Evaluations.” Social
Psychology of Education 13 (1): 129–136. doi:10.1007/s11218-009-9101-7.
Uranowitz, S. W., and K. O. Doyle. 1978. “Being Liked and Teaching: The Effects and Bases of Personal
Likability in College Instruction." Research in Higher Education 9 (1): 15–41. doi:10.1007/BF00979185.
Uttl, B., C. A. White, and D. W. Gonzalez. 2017. “Meta-Analysis of Faculty’s Teaching Effectiveness:
Student Evaluation of Teaching Ratings and Student Learning Are Not Related.” Studies in Educational
Evaluation 54: 22–42. doi:10.1016/j.stueduc.2016.08.007.
Valsan, C., and R. Sproule. 2008. “The Invisible Hands behind the Student Evaluation of Teaching: The
Rise of the New Managerial Elite in the Governance of Higher Education.” Journal of Economic Issues
42 (4): 939–958. doi:10.1080/00213624.2008.11507197.
Wachtel, H. K. 1998. “Student Evaluations of College Teaching Effectiveness: A Brief Review.” Assessment
and Evaluation in Higher Education 83 (4): 191–211. doi:10.1080/0260293980230207.
Weinberg, B. A., M. Hashimoto, and B. M. Fleisher. 2009. “Evaluating Teaching in Higher Education.”
The Journal of Economic Education 40 (3): 227–261. doi:10.3200/JECE.40.3.227-261.
Wolfer, T. A., and M. M. Johnson. 2003. “Re-Evaluating Student Evaluation of Teaching.” Journal of
Social Work Education 39 (1): 111–121. doi:10.1080/10437797.2003.10779122.
Xu, J. (B.), A. Lo, and J. Wu. 2018. “Are Students Customers? Tourism and Hospitality Students’ Evaluation
of Their Higher Education Experience.” Journal of Teaching in Travel & Tourism 18 (3): 236–258. doi:10.1080/
15313220.2018.1463587.
Youmans, R. J., and B. D. Jee. 2007. “Fudging the Numbers: Distributing Chocolate Influences Student
Evaluations of an Undergraduate Course.” Teaching of Psychology 34 (4): 245–247.
doi:10.1080/00986280701700318.
Yunker, P. J., and J. Yunker. 2003. “Are Student Evaluations of Teaching Valid? Evidence from an
Analytical Business Core Course." Journal of Education for Business 78 (6): 313–317. doi:10.1080/08832320309598619.
