
Developmental Psychology

1977, Vol. 13, No. 5, 535-536

Kohlberg's Moral Judgment Scale: Some Methodological Considerations


KENNETH H. RUBIN AND KRISTIN T. TROTTER
University of Waterloo
Forty children in Grades 3 and 5 were administered the first three dilemmas of Kohlberg's
moral judgment scale. The children were divided into two groups. The first group received
the scale 2 weeks after first administration. The second group received a multiple-choice
variant of the scale. Data analyses revealed low test-retest reliability for scores attained on
the three dilemmas together as well as individually. Intercorrelations among scores on items within each
dilemma were low and generally nonsignificant. Reliability
coefficients of internal consistency were .77, .73, and .82 for the 3 dilemmas, respectively,
and .78 for the total scale. Children who received the multiple choice variant of the scale
scored at significantly higher moral levels than did those who received the typical verbal
production version of the scale.

Kohlberg (Note 1) has described the development of moral thought and knowledge as a
process in which individuals pass through six
qualitatively different stages in a universal and
invariant sequential fashion. The method by
which Kohlberg has assessed an individual's
stage of moral development, the Moral Judgment Scale, generally consists of presenting a
subject with a series of moral dilemmas and asking him/her to verbally resolve the dilemmas. On
the basis of these responses the subject is assigned a moral judgment level. Recently, Kurtines and Greif (1974) questioned the reliability
and validity of the Moral Judgment Scale. These
authors noted a relative lack of data concerning
both the test-retest reliability and the consistency of an individual's moral judgment stage
from one dilemma to the next. Moreover, there
is evidence that projective test scores are
influenced by the subject's verbal facility. For
example, according to Rest (1976), a person can
recognize and discriminate an idea before he can
spontaneously verbalize the idea in response to a
story dilemma. As a result, a child's level of
moral judgment may be underestimated when
assessed verbally. The purpose of the present
study was to consider each of the aforementioned methodological issues.

Requests for reprints should be sent to Kenneth H. Rubin, Department of Psychology, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1.

The subjects were 40 Caucasian, middle-class
children attending Grades 3 and 5 in southwestern Ontario. The Grade 3 children (M age = 9
years 1 month) included 15 males and 7 females.
The Grade 5 subjects (M age = 11 years 2
months) included 9 males and 9 females. To
avoid possible administration problems associated with the longer interview procedure (Kurtines & Greif, 1974), only the first three of
Kohlberg's (Note 1) original dilemmas were administered to each child by the second author.
These dilemmas were designed to investigate attitudes towards "life and punishment" (Heinz),
"contract and personal relationship" (Joe and
his father), and "property and conscience" (Bob
and Karl). The most frequently occurring stage
in response to the questions across all three dilemmas was considered to be the subject's dominant, overall stage of moral judgment (global
scoring). In addition the most frequently occurring stage within each dilemma was calculated.
Transcripts of 20 of the initial interviews were
given to the first author to assess interjudge
agreement. Agreement for the scoring of global
moral judgment levels was 85%. Following initial testing, the subjects were grouped into
matched pairs according to both their ages and
levels of moral judgment. This was done to assure the initial comparability of groups prior to
assessing subsequent performance on verbal
versus forced choice moral judgment tests.
Statistical t-tests were calculated to determine
the initial equivalence of the two samples on
each of the three dilemmas. As expected, the
results of the t-tests were nonsignificant. Children's moral levels on the three dilemmas
ranged from 1 to 3B.
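The modal scoring rule described above can be illustrated with a minimal sketch. The item scores below are hypothetical and are not data from this study; the function name `modal_stage` and the tie-breaking rule (first stage encountered wins) are assumptions, since the report does not say how ties were resolved.

```python
from collections import Counter


def modal_stage(stages):
    """Return the most frequently occurring stage label.

    Assumption: ties go to the stage encountered first, because
    Counter.most_common keeps first-seen order for equal counts.
    """
    return Counter(stages).most_common(1)[0][0]


# Hypothetical item-level stage scores for one child (not study data).
responses = {
    "Dilemma 1": ["2", "2", "1", "2"],
    "Dilemma 2": ["1", "1", "2", "3B"],
    "Dilemma 3": ["3B", "3B", "2", "3B"],
}

# Most frequent stage within each dilemma.
per_dilemma = {name: modal_stage(items) for name, items in responses.items()}

# Global scoring: most frequent stage across all items of all three dilemmas.
all_items = [stage for items in responses.values() for stage in items]
global_stage = modal_stage(all_items)

print(per_dilemma)
print(global_stage)
```

In this made-up case the child's modal stage differs on every dilemma, yet global scoring reports a single stage.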
To assess test-retest reliability one group
(Group 1) was given the same Moral Judgment
Scale 2 weeks after initial testing. The second
group (Group 2) was administered a multiple
choice variant of the scale. For Dilemma 1, only
Kohlberg (Note 1) Questions 1, 3, and 6 were
included in the multiple choice format, and comparative statistics between those who took this
test and the verbal production format were performed only for these questions. Similarly, Questions 1, 3, and 5 were chosen for Dilemma 2 and
Questions 1, 3, 5, and 8a were chosen for Dilemma 3. The scoring system for the verbal production versus forced choice comparison went
as follows. Each verbally produced answer was
assigned a moral judgment level. Moreover,
each forced choice question was followed by 5
answers, representative of the first 5 levels of
moral judgment but randomly arranged on the
answer sheet. These alternatives derived from
Kohlberg's (Note 1) scoring forms which were
based upon actual subject responses.
The stages at which Grade 3 and 5 children
fell for the second testings ranged from 1 to 3B
for the verbally produced answers and from 1 to
4B for the multiple choice answers. There were
no statistically significant differences between
the levels of the two age groups. All data were
pooled for further analysis. Pearson product-moment correlation coefficients were calculated
between the global moral levels from Week 1 to
Week 3 for Group 1. A statistically significant
relationship was found when the level of moral
judgment was calculated as the modal response
level across all three dilemmas, r(19) = .44, p <
.05. While this coefficient is statistically significant, by conventional standards it actually
represents a low test-retest reliability given the
small sample size and the relatively short period
between tests. Perhaps had all nine Kohlberg
dilemmas been utilized, the reliability may have
been greater. Such a study remains to be carried
out. Statistically significant test-retest coefficients were found for the levels calculated individually for Dilemmas 2, r(19) = .39, p < .09,
and 3, r(19) = .62, p < .003.
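As a rough check on the significance levels reported above, a correlation can be converted to a t statistic with t = r * sqrt(df / (1 - r^2)). The sketch below is not the authors' computation. It takes the reported degrees of freedom (19) at face value, and the function name `t_from_r` and the constant name are illustrative. The critical value 2.093 is the standard two-tailed .05 cutoff for t with 19 degrees of freedom.

```python
import math


def t_from_r(r, df):
    """Convert a Pearson r to the equivalent t statistic on df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


# Two-tailed critical t for alpha = .05 with df = 19 (standard table value).
T_CRIT_05_DF19 = 2.093

# Coefficients as reported in the text; df = 19 is taken as reported.
reported_r = {"Global": 0.44, "Dilemma 2": 0.39, "Dilemma 3": 0.62}

t_values = {name: t_from_r(r, 19) for name, r in reported_r.items()}

for name, t in t_values.items():
    shared_variance = reported_r[name] ** 2
    print(f"{name}: t = {t:.2f}, exceeds .05 cutoff: {t > T_CRIT_05_DF19}, "
          f"r squared = {shared_variance:.2f}")
```

Under these assumptions the global and Dilemma 3 coefficients clear the .05 cutoff and the Dilemma 2 coefficient does not, which matches the reported p < .09. The squared global coefficient is about .19, so the two testings share under a fifth of their variance.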
While the separate issues of "life and punishment," "contract and personal relationship,"
and "property and conscience" are considered
in Dilemmas 1, 2, and 3, respectively, the global
method of scoring may obscure the possibility of
the attainment of different levels of moral development given different questions across and
within dilemmas. To examine this possibility, an
average item intercorrelation was calculated for
the questions within each individual dilemma for
all 40 pretest subjects. These Pearson product-moment correlations were .25 (10 questions), .23
(9 questions), and .29 (11 questions) for the three
dilemmas, respectively, and .11 for the total
scale. Internal consistency was calculated by using Guilford's (1956, p. 463) modification of
the Spearman-Brown formula. The reliability
coefficients were .77, .73, and .82 for the three
dilemmas respectively and .78 for the total scale.
Since the number of items involved in each dilemma was small, the coefficients may actually
be overestimates of dilemma reliability (Guilford, 1956). In addition, correlation coefficients
calculated between the moral levels attained for
each of the three dilemmas were all nonsignificant. Thus, the global method of scoring,
which is based on the assumption that moral
knowledge is a unitary construct, may be inadequate as an index of moral development.
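The internal consistency coefficients above can be approximated from the average item intercorrelations with the general Spearman-Brown formula, r_kk = k * r / (1 + (k - 1) * r), where k is the number of items and r is the average intercorrelation. The sketch below assumes that Guilford's modification reduces to this standard form and that the total scale has 30 items (10 + 9 + 11). Neither assumption is stated in the report, and the function name `spearman_brown` is illustrative.

```python
def spearman_brown(mean_r, k):
    """Reliability of a k-item scale from the average inter-item correlation."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)


# (label, average item intercorrelation, number of questions) as reported;
# the 30-item total is an assumption (10 + 9 + 11).
reported = [
    ("Dilemma 1", 0.25, 10),
    ("Dilemma 2", 0.23, 9),
    ("Dilemma 3", 0.29, 11),
    ("Total scale", 0.11, 30),
]

reliabilities = {label: spearman_brown(r, k) for label, r, k in reported}

for label, value in reliabilities.items():
    print(f"{label}: {value:.2f}")
```

With the rounded inputs, this gives .77, .73, and .82 for the three dilemmas, matching the reported coefficients. The total scale comes out at about .79 rather than .78, which could simply reflect an unrounded average intercorrelation slightly below .11. The formula also shows why coefficients near .8 can coexist with item intercorrelations near .25: reliability rises with the number of items even when individual items are only weakly related.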
Finally, a series of t-tests was conducted between Group 2 means for the verbal production
versus the multiple choice tasks for each dilemma question as well as for each of the total
dilemma levels. All t-tests were significant at the
p < .001 level,¹ indicating that the subjects' multiple choice scores were consistently higher than
the verbal production scores. Similarly, when
the Group 2 multiple choice responses were
compared to the Group 1 verbally produced
second test responses, the former means were
all found to be significantly higher (p < .001)
than the latter means. In conclusion, the present
study describes a number of deficiencies with
the Moral Judgment Scale. Future research of a
large scale nature would do well to consider the
psychometric properties of this measure.
¹ All t-test and mean data are available from the first author.

REFERENCE NOTE

1. Kohlberg, L. Instructions for standard scoring, Form A. Unpublished manuscript, Harvard University, 1973.

REFERENCES

Guilford, J. P. Fundamental statistics in psychology and education. New York: McGraw-Hill, 1956.
Kurtines, W., & Greif, E. B. The development of moral thought: Review and evaluation of Kohlberg's approach. Psychological Bulletin, 1974, 81, 453-470.
Rest, J. R. New approaches in the assessment of moral judgment. In T. Lickona (Ed.), Moral development and behavior. New York: Holt, Rinehart, & Winston, 1976.
(Received January 14, 1977)
