RESEARCH IN HIGHER EDUCATION

Volume 7, pages 67-78


© 1977 APS Publications, Inc.

STUDENT RATINGS OF INSTRUCTION:
Validity and Normative Interpretations

Dale C. Brandenburg and Jeffrey A. Slinde, University of Illinois, Urbana
Enrique E. Batista, Universidad de Antioquia

Using 3,355 class section means, the relationship between six predictor variables and
student ratings of instruction (CEQ) was investigated by computing the intercorrelation
matrix among all variables and by performing several regression analyses. Results of the
study indicated that all linear interactions were negligible and that more than one-fourth
of the criterion variance was shared with all the predictor variables. Two predictor
variables, expected grade and required-elective, provided extremely large contributions
to the prediction of the criterion measure, however. Implications of the validity results
with respect to normative data and thus to the administrative use of the ratings were
illustrated.

Key Words: student ratings; validity; norming

In order to ascertain the quality of evaluative information provided
by student rating forms, several studies have examined the relationship
of various instructor, student, or course variables to such ratings. In
general, the results have been contradictory and inconclusive. For
example, Downie (1952) and Gage (1961) found a significant
relationship between instructor's rank and student ratings, whereas
Aleamoni and Graham (1974), Aleamoni and Yimer (1973), and Rayder
(1968) did not. Instructor's sex was reported not to be related to ratings
by Downie (1952) and Rayder (1968), but somewhat opposite results
were found by McKeachie and Lin (1971). While several studies
(Bausell and Magoon, 1972; Granzin and Painter, 1973; Kooker, 1968;
Pohlman, 1975; Weaver, 1960) found students' ratings to be predicted

Address reprint requests to Jeffrey A. Slinde, University of Illinois, Measurement and
Research Division, 307 Engineering Hall, Urbana, IL 61801.

by their expected grade, others (Garverick and Carter, 1962; Holmes,
1971; Kennedy, 1975) did not. The relationship of class size and
student ratings was found nonsignificant by Aleamoni and Graham
(1974), while Gage (1961) and Lovell and Haner (1955) found a
significant relationship. Some studies (Doyle and Whitely, 1974;
Hildebrand, Wilson and Dienst, Note 2) have reported that the
required-elective nature of the course was not related to ratings. Other
researchers, on the other hand (Gage, 1961; Lovell and Haner, 1955;
Magoon and Bausell, 1973), found that instructors who teach required
courses receive lower ratings than those teaching nonrequired courses.
Course level was found to be a statistically significant variable by
Aleamoni and Graham (1974) and Jiobu and Pollis (1971) but
nonsignificant by Grant (1971). Rayder (1968) has indicated that, in
general, instructor characteristics are more strongly related to
ratings than student characteristics are.
The methodology used when investigating the validity of student
rating data has varied considerably from one study to another. Aside
from studies using instruments of unknown or unreported technical
quality, a number of studies report results based on a small number of
classes or on different sections of a single course. Other studies report
results which used only teaching assistants as ratees, while still other
studies have used students from only one or two class levels.
Furthermore, almost all correlational studies have failed to report
cross-validation results or even to examine the appropriateness of
assuming linearity (e.g., by inspecting scatterplots). Also, several studies
have used ad hoc rating instruments. Generalizability from such studies
tends to be limited; the methodological differences between studies may
help explain a sizable portion of the mixed results that have been
reported.
In addition to methodological differences between studies, statistical
analyses have also varied from one study to another. Analyses have
ranged from nonparametric procedures, such as sign and chi-square tests, to
multivariate analysis of variance and canonical correlation. The most
frequently used methods include simple correlations, linear regressions,
and univariate analysis of variance. The wide range of both the
methodological and statistical procedures across studies may help
explain the current difficulty in clearly determining which variables are
related to student ratings, and to what extent.
Also, more useful information can be gathered regarding the variables
affecting student ratings if all the variables proposed to have an effect are
considered simultaneously.
Sheehan (1975) has reviewed a limited number of research findings
and concluded that using student ratings for administrative decisions
(e.g., pay, rank, and tenure) is questionable. In a reply to Sheehan,
Aleamoni (1976) pointed out that whether the criterion variation
accounted for by various predictor variables results in significantly
different ratings is still undetermined. The purposes of this study,
therefore, were to investigate the validity of student ratings of
instruction and to assess the effect of the validity results on the raw
and scaled rating scores. As with most previous research, it would be
more appropriate to say that this study is concerned with the invalidity
of student ratings rather than their validity; that is, with the extent to which
extraneous variables bias the underlying construct being measured in
such a way as to favor certain instructors over others.

METHOD

Student ratings were collected via a widely used standardized student
rating instrument, the Illinois Course Evaluation Questionnaire (CEQ)
(Aleamoni and Spencer, 1973). The CEQ contains 23 items (four-choice
Likert scale SA-A-D-SD) with six subscores, a total score, and
additional scores on three general items. Any of these measures might
be considered a potential criterion measure, but the CEQ total was
selected for this study because it is the measure most often used in
conjunction with administrative decision-making. Furthermore, the
intercorrelations among subscales range from 0.78 to 0.93, the lowest
correlation between a subscale and the total is 0.90, and the lowest
correlation between the three general items and the total is 0.87. Thus,
the CEQ total rank orders instructors about the same as any of the
subscores or general items.
The samples for the present study consisted of 1,794 class sections
from the fall semester of 1974 and 1,561 class sections from the spring
semester of 1975. Both samples represented every college and most
departments at the University of Illinois at Urbana-Champaign.
The mean CEQ total score for each class section was used as the
criterion and three continuous and three categorical variables were
used as predictor variables. The three nominal variables were dummy
variable coded (see, for example, Kerlinger and Pedhazur, 1973). That
is, for a given level of a nominal variable (e.g., the TA level of the
instructor rank variable), a corresponding column vector of 1's and 0's
was created to indicate whether the instructor was or was not a
member of that level. For each categorical variable, the number of
such column vectors is equal to one less than the number of levels of
the nominal variable (k - 1). The group corresponding to the missing
column is referred to as the reference group. The reference group is
not slighted because, if each of the k - 1 vectors of a categorical
variable equals zero, then this implies that the instructor was not a
member of each of these levels and so must be a member of the
reference group (omitted level).
The six predictor variables were chosen because they were
frequently used in previous student rating research. Choosing the
reference group was largely arbitrary since we were primarily
concerned with the categorical variable as a whole. That is, the
multiple correlation and also the regression weights for the other
variables are unaffected by the choice of a reference group. In
addition, since differences between unstandardized regression weights
corresponding to the levels of a given nominal variable reflect
differences between the group criterion means (holding constant any
remaining variables), there is essentially no difference in choosing one
level over another as the reference group. The predictor variables are
defined as follows:
1. Class size--the total number of students who completed the
questionnaire in each course.
2. Required-elective--the proportion of students in a class taking the
course as an elective.
3. Expected grade--the average expected grade for the students in a
given class, where A = 5, B = 4, C = 3, D = 2, E = 1.
4. Course level--100, 200, 300, and 400 (reference group).
5. Instructor's rank--teaching assistant, instructor, assistant
professor, associate professor, and full professor (reference
group).
6. Instructor's sex--male and female (reference group).
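To make the coding scheme concrete, the following is a minimal sketch (in Python, chosen purely for illustration; the column names and data are hypothetical) of how the k - 1 indicator vectors for the instructor rank variable can be constructed, with full professor as the omitted reference group:

```python
import pandas as pd

# Hypothetical class-section records; values and column names are illustrative.
sections = pd.DataFrame({
    "ceq_total": [3.1, 2.8, 3.4, 2.6, 3.0],
    "rank": ["TA", "assistant", "full", "associate", "instructor"],
})

# Build one 0/1 column per rank level, then drop the reference level.
# A row of all zeros in the remaining columns identifies a full professor.
rank_dummies = pd.get_dummies(sections["rank"], prefix="rank").astype(int)
rank_dummies = rank_dummies.drop(columns=["rank_full"])

print(rank_dummies)
```

The same construction applies to the course level and sex variables, with the 400 level and female serving as the omitted reference columns.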
Validity data were obtained by computing the intercorrelation matrix
among all variables and by performing regression analyses.
Cross-validations were also performed by applying the partial
regression weights derived from the fall, 1974 sample to the spring,
1975 sample and vice versa.
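A minimal sketch of the cross-validation step, assuming hypothetical section-level arrays in place of the actual fall and spring data (the variable names and the use of scikit-learn are illustrative, not the original computation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-ins for the real data: 11 predictor columns, one criterion (CEQ total).
X_fall, y_fall = rng.normal(size=(1794, 11)), rng.normal(size=1794)
X_spring, y_spring = rng.normal(size=(1561, 11)), rng.normal(size=1561)

# Derive partial regression weights from the fall sample ...
fall_model = LinearRegression().fit(X_fall, y_fall)

# ... apply them to the spring sample, and correlate predicted with observed
# criterion scores to obtain the cross-validation R (and vice versa).
cross_r = np.corrcoef(fall_model.predict(X_spring), y_spring)[0, 1]
print(round(cross_r, 3))
```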

RESULTS

As a preliminary step, an inspection of all bivariate scatterplots
between the continuous independent variables and the criterion
variable revealed no discernible departure from linearity. Furthermore,
all linear interactions among the six independent variables were found
to have only a negligible relationship with the criterion measure. That
is, all linear interactions accounted for only slightly more than 2% of
the criterion variance above that accounted for by the set of predictor
variables.
Due to the multiple levels of the instructor rank and course level
variables, a total of 11 individual predictor variables were defined. For
the combined sample (N = 3,355), Table I displays the intercorrelation
matrix among all 12 variables and the mean and standard deviation of
each variable. The point biserial correlations for each reference group
with the criterion were 0.073 and 0.080 for the full professor and 400
course level distinctions, respectively. The two nominal variables with
multiple levels are best summarized by considering their multiple
correlations (R) with the criterion. The two multiple R's for the
instructor rank and course level variables were 0.223 and 0.136,
respectively. The correlations between each of the six predictor
variables with the criterion were all significantly greater than zero at
p < 0.01; but, more important practically, some of these correlations
were quite large, especially for the expected grade and
required-elective variables.
Table II contains the results of the regression analyses performed on
the combined semesters' data. Due to the rather small differences
between the multiple correlations computed for each semester and the
high values for each cross-validation R, the standardized partial
regression weight and associated standard error for each variable are
reported only for the combined sample. The multiple correlations
computed on a given semester's data and the cross-validation R's are,
however, given in the last two columns of the table for both the fall
and spring semesters.
When all 11 predictor variables were included in the regression
analysis (Analysis I), the multiple correlation was 0.521, which means
that more than one-fourth of the criterion variance is shared with the
11 predictor variables. Eight of the 11 beta weights were significantly
different from zero at p < 0.01, but this is not very surprising due to
the extremely large sample size. Only five of the standardized partial
regression weights were, however, greater than 0.10 with only one
weight greater than 0.30 (expected grade) and one other weight greater
than 0.20 (required-elective). On a relative basis, therefore, the
expected grade and required-elective variables provide the largest
contribution to the prediction of the criterion measure. The class size,
instructor rank, and course level variables do provide small
contributions to the prediction, however. The TA level of the instructor
rank variable and the 100 level of the course level variable are the main
contributors for these two categorical variables.
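As a sketch of the computations underlying such results (with hypothetical data; the original analyses obviously predate these tools), the standardized partial regression weights are simply least-squares coefficients computed on z-scored variables, and the multiple correlation is the correlation between the fitted and observed criterion:

```python
import numpy as np

def standardized_regression(X, y):
    """Return beta weights and multiple R from an OLS fit on z-scored data."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    multiple_r = np.corrcoef(Xz @ betas, yz)[0, 1]
    return betas, multiple_r

# Hypothetical stand-in for the 3,355 sections and 11 predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(3355, 11))
y = 0.35 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(size=3355)

betas, multiple_r = standardized_regression(X, y)
print(np.round(betas, 2), round(multiple_r, 3))
```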
[Table I (the intercorrelation matrix, with the mean and standard deviation of each of the 12 variables) and Table II (the regression analyses, standardized weights, and cross-validation multiple correlations) appear here in the original; their values are not recoverable from the scanned copy.]

The impact of one or more predictor variables can also be assessed
by taking the difference between squared multiple correlations. This
difference, often called incremental validity, yields the proportion of
additional variance in the criterion that is predictable from the "new"
predictor variable(s) above that variance accounted for by the "initial"
variable(s). The incremental validity was obtained for the interaction
effects, but given the trivial increase in R they produced, their beta
weights and cross-validation results were not reported. Given the results
for the regression analysis involving all individual predictor variables,
the incremental validity was obtained both for expected grade together
with required-elective and for the remaining variables. Both values,
taken together, provide further
information about the importance of these two sets of predictor
variables. For example, the difference between squared multiple R's
for Analyses I and II given in Table II is 0.039. This indicates that a
rather small amount of predictable variance is added by variables other
than expected grade or required elective. The difference of 0.191
between squared multiple R's for Analyses I and III of Table II
indicates that a sizeable amount of predictable variance is added by the
addition of expected grade and required-elective. Thus, these two
variables are extremely important predictors of student ratings, and
they also include almost all the information given by the entire set of
predictor variables.
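The incremental-validity computation can be sketched as follows (again with hypothetical data; the column positions standing in for expected grade, required-elective, and the remaining predictors are assumptions of the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(X, y):
    """Squared multiple correlation from an ordinary least-squares fit."""
    return LinearRegression().fit(X, y).score(X, y)

rng = np.random.default_rng(2)
X = rng.normal(size=(3355, 11))                     # columns 0-1: grade, elective
y = 0.4 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(size=3355)

r2_all = r_squared(X, y)              # all 11 predictors (Analysis I analogue)
r2_two = r_squared(X[:, :2], y)       # expected grade + required-elective only
r2_rest = r_squared(X[:, 2:], y)      # the remaining nine predictors

print("added beyond grade/elective:", round(r2_all - r2_two, 3))
print("added by grade and elective:", round(r2_all - r2_rest, 3))
```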
The effect of the two most important variables on the raw and scaled
CEQ scores is illustrated with the results reported in Table III where a
combination of the expected grade and required-elective variables is
used to predict CEQ ratings. The predicted CEQ's were referred to the
overall norms table contained in Illinois Course Evaluation
Questionnaire (CEQ) Results Interpretation Manual Form 73
(Brandenburg and Aleamoni, Note 1) and both the predicted values and
associated deciles are recorded in Table III. It can be seen from Table
III that if a class has greater than average elective enrollment and
greater than average expected grade (upper left), decile ratings are
substantially different from those for a class with lower than average
elective enrollment and lower than average expected grade (lower right). For
the most extreme case illustrated in Table III, there is about a
1.2-standard-deviation (five-decile) difference between the two
predicted CEQ scores. Needless to say, differences of even half this
magnitude can have quite an impact on administrative decision-making.
Of course the decile differences are not as dramatic when the
required-elective norms are used, which is due to the common variance
between expected grade and required elective, but the deciles based on
the overall norms are the values most often consulted for
administrative use.
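The normative step can be sketched as a simple table lookup; the decile cut points below are invented for illustration and are not the Form 73 norms, and the assumed criterion standard deviation of roughly 0.30 is likewise hypothetical:

```python
import numpy as np

# Invented decile cut points for section-mean CEQ totals (ascending);
# NOT the published Form 73 norms.
decile_cuts = np.array([2.55, 2.67, 2.76, 2.84, 2.91, 2.98, 3.06, 3.15, 3.28])

def decile(score, cuts=decile_cuts):
    """Decile (1-10) in which a predicted section mean falls."""
    return int(np.searchsorted(cuts, score, side="right")) + 1

# Two predicted means about 1.2 assumed-SD units apart (0.36 raw-score points).
high = decile(3.16)   # mostly elective, above-average expected grade
low = decile(2.80)    # mostly required, below-average expected grade
print(high, low, high - low)
```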
CONCLUSIONS

The decile differences that were reported above are large, but on the
other hand, the raw score differences are, except for the moderate to
extreme cases, somewhat small. But even small raw score differences
can rank order instructors quite differently. In addition, the raw score
differences can be made to appear larger merely by applying a linear
transformation. This is why differences were reported in terms of the
criterion standard deviation; in this form, the differences remain the
same no matter what the linear scaling. Furthermore, under a linear
transformation, the deciles will also remain unchanged. Increasing the
number of scale points is, of course, not the same as a linear
transformation, but to the extent that it operates roughly in this
fashion, the results reported here will also generalize to scales
with additional points.
It has been demonstrated that significant shifts in ratings can occur
as the result of extraneous variables. In addition, the results also
illustrate the problem of using deciles and the problem of criterion
insensitivity. Deciles tend to exaggerate raw score differences in the
middle of the distribution but squeeze them together at the extremes.
Criterion insensitivity refers to raters using only a relatively small
portion of the scale and thus to a small criterion standard deviation
which, at least from our experience, is characteristic of omnibus forms
such as the CEQ.
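The decile compression and exaggeration can be illustrated with a rough normal approximation (an assumption of the example; section means need not be normally distributed): the interior decile bands near the median are only about a quarter of a standard deviation wide, while bands nearer the tails are much wider.

```python
from scipy.stats import norm

# Boundaries of the deciles of a standard normal distribution.
cuts = [norm.ppf(p / 10) for p in range(1, 10)]

# Widths of the interior decile bands (deciles 2 through 9), in SD units.
widths = [round(b - a, 2) for a, b in zip(cuts[:-1], cuts[1:])]
print(widths)   # roughly [0.44, 0.32, 0.27, 0.25, 0.25, 0.27, 0.32, 0.44]
```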
The problem of extraneous variables brings up a final concern about
using expected grade as a control variable, e.g., in a norms table or in
a regression equation which uses the difference between predicted and
actual criterion values to control for extraneous variables. We argue for
excluding expected grade as a control variable not only because it can
be influenced by the instructor, but also because it is correlated with
obtained grade and thus with final achievement. It appears quite
contradictory to argue for using expected grade as a control variable
and yet use final achievement as a criterion for the validation of
student ratings.
As previously indicated, it is often the case, unfortunately, that
administrative decisions will be influenced to a great extent by the
normative information provided by students' ratings. This being the
case, the strong relationship of expected grade and required-elective
with the ratings, and its consequences for the normative data, indicates
that certain instructors will be at a
disadvantage relative to their peers.

[Table III (predicted CEQ totals and associated overall-norm deciles for combinations of expected grade and required-elective) appears here in the original; its values are not recoverable from the scanned copy.]

Student rating information in conjunction with self-evaluation (see, for
example, Batista and Brandenburg, 1975), peer, and administrative
information, however, is likely
likely to provide a reasonable evaluation of one's instructional
performance.

REFERENCE NOTES
1. Brandenburg, D. C., and Aleamoni, L. M. (1976). Illinois Course Evaluation
Questionnaire (CEQ) Results Interpretation Manual Form 73. Urbana, Ill.:
University of Illinois at Urbana-Champaign, Measurement and Research
Division, Office of Instructional Resources.
2. Hildebrand, M., Wilson, R. C., and Dienst, E. R. (1971). Evaluating
University Teaching. Berkeley, Calif.: Center for Research and Development
in Higher Education.
3. Batista, E. E., and Brandenburg, D. C. (1976). The instructor self-evaluation
form: Development and validation of an ipsative forced-choice measure of
self-perceived faculty performance. Unpublished.

REFERENCES
Aleamoni, L. M. (1976). On the invalidity of student ratings for administrative
personnel decisions (comment). Journal of Higher Education 47:607-610.
Aleamoni, L. M., and Graham, M. H. (1974). The relationship between CEQ
ratings and instructor's rank, class size and course level. Journal of
Educational Measurement 11:189-202.
Aleamoni, L. M., and Spencer, R. E. (1973). The Illinois course evaluation
questionnaire: A description of its development and a report of some of its
results. Educational and Psychological Measurement 33:669-684.
Aleamoni, L. M., and Yimer, M. (1973). An investigation of the relationship
between colleague rating, student rating, research productivity, and academic
rank in rating instructional effectiveness. Journal of Educational
Psychology 64:274-277.
Bausell, R. B., and Magoon, J. (1972). Expected grade in a course, grade point
average, and student ratings of the course and the instructor. Educational
and Psychological Measurement 32:1013-1023.
Downie, N. M. (1952). Student evaluation of faculty. Journal of Higher
Education 23:495-496.
Doyle, K. O., and Whitely, S. E. (1974). Student ratings as criteria for
effective teaching. American Educational Research Journal 11:259-274.
Gage, N. L. (1961). The appraisal of college teaching: An analysis of ends and
means. Journal of Higher Education 32:17-22.
Garverick, C. M., and Carter, H. D. (1962). Instructor ratings and expected
grades. California Journal of Educational Research 13:218-221.
Grant, C. W. (1971). Faculty allocation of effort and student course
evaluations. Journal of Educational Research 64:405-411.
Granzin, K. I., and Painter, J. J. (1973). A new explanation for students'
course evaluation tendencies. American Educational Research Journal
10:115-124.
Holmes, D. S. (1971). The relationship between expected grades and students'
evaluation of their instructors. Educational and Psychological Measurement
31:951-957.
Jiobu, R. M., and Pollis, C. A. (1971). Student evaluations of courses and
instructors. The American Sociologist 6:317-321.
Kennedy, W. R. (1975). Grades expected and grades received--their
relationship to students' evaluation of faculty performance. Journal of
Educational Psychology 6:109-115.
Kerlinger, F. N., and Pedhazur, E. J. (1973). Multiple Regression in Behavioral
Research. New York: Holt, Rinehart and Winston.
Kooker, E. W. (1968). The relationship of known college grades to course
ratings on student selected items. The Journal of Psychology 69:209-215.
Lovell, G. D., and Haner, C. F. (1955). Forced-choice applied to college
faculty rating. Educational and Psychological Measurement 15:291-304.
Magoon, J., and Bausell, R. B. (1973). Required versus elective course ratings.
College Student Journal 7:29-33.
McKeachie, W. J., and Lin, Y. (1971). Sex differences in student response to
college teachers: Teacher warmth and teacher sex. American Educational
Research Journal 8:221-226.
Pohlman, J. T. (1975). A multivariate analysis of selected class characteristics
and student ratings of instruction. Multivariate Behavioral Research 10:81-91.
Rayder, N. F. (1968). College student ratings of instructors. Journal of
Experimental Education 37:76-81.
Sheehan, D. S. (1975). On the invalidity of student ratings for administrative
personnel decisions. Journal of Higher Education 46:687-700.
Weaver, C. H. (1960). Instructor rating by college students. Journal of
Educational Psychology 51:21-25.
