International Journal of Research & Method in Education, Vol. 32, No. 1, April 2009, 39–51

Interpretation of standards with Bloom’s revised taxonomy: a comparison of teachers and assessment experts

Gunilla Näsström*

Department of Educational Measurement, Umeå University, Umeå, Sweden



(Received January 2008; final version received November 2008)



In education, standards have to be interpreted for planning of teaching, for development of assessments and for alignment analysis. In most cases, it is important that individuals and organizations agree about how to interpret standards. However, there is a lack of studies of how consistent different groups of judges are when interpreting standards. In this study, the usefulness of Bloom’s revised taxonomy for interpreting standards in mathematics is evaluated using several criteria. The results indicate that the taxonomy is an acceptable tool. The results also indicate that there are differences between the panel composed of teachers and the panel composed of assessment experts: the assessment experts were more consistent in their interpretation of standards. Limitations of the study and requirements for alignment analysis are discussed.

Keywords: standards; Bloom’s revised taxonomy; inter-judge consistency; intra-judge consistency

Introduction
Educational systems are today often standards-based. Standards are here defined as
descriptions of what students should know and/or be able to do, as well as descriptions
of how well students should attain this knowledge and these skills (Popham 2003).
These standards are often broad and vague (Luft, Brown, and Slutherin 2007) and
therefore need to be interpreted. Teachers need to interpret the standards to plan their
teaching (Bybee 2003) and to assess and grade their students (Popham 2003). Those
who construct and develop standardized assessments have to interpret the standards to
formulate a valid blueprint (Popham 2003). In alignment analysis, the judges have to
interpret the standards to be able to compare the standards with other standards, with
assessments, or with teaching (Bhola, Impara, and Buckendahl 2003).
It is important that individuals and organizations agree on their interpretations of
standards. To get equivalent grades in a country or a region, all teachers should have
the same interpretations. Teachers and assessment experts who develop standardized
assessments have to interpret the standards in the same way to give the students an
opportunity to perform well on the standardized assessments (Biggs 2003). In align-
ment analyses, all judges should have similar interpretations of the standards to derive
trustworthy comparisons (Bhola, Impara, and Buckendahl 2003).

*Email: gunilla.nasstrom@edmeas.umu.se

ISSN 1743-727X print / ISSN 1743-7288 online
© 2009 Taylor & Francis
DOI: 10.1080/17437270902749262
http://www.informaworld.com

For interpretation of standards, a taxonomy may be a useful tool for several
reasons. By placing standards into categories in a taxonomy, the structure of the
standards can be visualized. Therefore, comparisons can be made between
different sets of standards, between standards and teaching as well as between stan-
dards and assessments (Bhola, Impara, and Buckendahl 2003). The categorization is
also useful for studying changes in standards over time. One purpose of many taxon-
omies is to make standards clearly understandable by interpreting, categorizing and
communicating their content (e.g. Seddon 1978; Anderson and Krathwohl 2001). The
most used and well-known taxonomy in educational settings is Bloom’s taxonomy
from 1956, but more taxonomies have been developed and revised since then. Other
examples of taxonomies are Guilford’s taxonomy (1967), TIMSS (Mullis et al. 2001),
PISA (OECD 1999), Marzano’s new taxonomy (Marzano and Kendall 2007), Porter’s
taxonomy (Porter and Smithson 2001) and Bloom’s revised taxonomy (Anderson and
Krathwohl 2001). Bloom’s revised taxonomy (Anderson and Krathwohl 2001) is a
development and revision of Bloom’s original taxonomy from 1956.
In this study, Bloom’s revised taxonomy was chosen as a categorization tool for
standards for four reasons. Firstly, the taxonomy is designed for analysing and devel-
oping standards, teaching and assessment, as well as for emphasizing alignment
among these main components of an educational system. Secondly, this taxonomy
has been applied in nursing education (Su, Osisek, and Starnes 2004), music educa-
tion (Hanna 2007) as well as in schools in several states in the USA (Pickard 2007),
but none of these studies have evaluated the usefulness of this taxonomy. Therefore,
there is a lack of studies about the quality of Bloom’s revised taxonomy, especially
as a categorization tool for standards. Thirdly, this taxonomy has generally stated
content categories which allow comparisons of standards from different subjects.
Fourthly, in a study where standards in chemistry were categorized with two differ-
ent types of models, Bloom’s revised taxonomy was found to interpret the standards
more unambiguously than a model with topics-based categories (Näsström and
Henriksson 2008).
The focus of this article is on evaluating the usefulness of Bloom’s revised taxon-
omy for interpretation of standards. Interpretation of standards is based on human
judgements, and therefore inter- and intra-judge consistency is an important issue for
the trustworthiness of interpretation of standards. Another focus of this article is on
similarities and differences between teachers and assessment experts when interpret-
ing standards. The article is structured in the following way: Firstly, criteria for eval-
uating the usefulness of a taxonomy as a categorization tool are described. Secondly,
a short review of inter- and intra-judge consistency in interpretation of standards is
presented. Thirdly, results are presented describing the usefulness of Bloom’s revised
taxonomy as well as describing similarities and differences between the teachers and
the assessment experts. Fourthly, the usefulness of Bloom’s revised taxonomy,
similarities and differences between the teachers and the assessment experts, as well
as limitations of this study are discussed.
The criteria for evaluating the usefulness of the taxonomy are based on Hauenstein’s
(1998) five rules. A taxonomy should, according to Hauenstein, (1) be applicable; (2)
be totally inclusive, i.e. all standards can be categorized; (3) have mutually exclusive
categories, i.e. unambiguously categorize one standard into only one category; (4)
follow a consistent principle of order; and (5) use the terms in categories and sub-
categories that are representative of those used in the field. One aspect of applicability
is that judges can use the taxonomy. Another aspect of applicability is the number of
categories utilized in the taxonomy. In this article, the first three rules are used in the
evaluation of Bloom’s revised taxonomy.
Interpretation of standards is based on human judgements, and it is important to
obtain agreement on the categorization of standards. One important aspect of this
agreement is to obtain a high level of inter-judge consistency, indicating that the cate-
gorizations will be the same regardless of judges, as well as intra-judge consistency,
indicating stability in the judgements (Stephens et al. 2006).
In general, studies about inter- and intra-judge consistency for interpretation of
standards are conspicuous by their absence. There are at least two possible explana-
tions of this. One explanation stems from Bloom’s original taxonomy (Bloom 1956),
in which the author claimed that it is at least a bit more complicated to classify assess-
ment items than standards. This claim has been used as an argument for focusing only
on categorization of assessment items (Poole 1971). A second explanation is that
judges in alignment studies are supposed to be familiar with the specific standards,
and therefore the discussion about interpretations of standards is restricted to the train-
ing part (Bhola, Impara, and Buckendahl 2003). Even though it is important that the
judges agree on how to interpret standards, this assumption is very seldom verified.
In contrast, inter-judge consistency has been reported in several studies
dealing with categorization of assessment items with the same type of taxonomies that
can be used for interpretation of standards (e.g. Fairbrother 1975; Seddon 1978;
Herman, Webb, and Zuniga 2007; Webb, Herman, and Webb 2007). In such studies,
inter-judge consistency is commonly measured as the percentage of perfect agreement
among the judges and the kappa coefficient (Watkins and Pacheco 2000; Stemler
2004). Herman, Webb, and Zuniga (2007) reported the percentage of agreement for a
clear majority (at least two thirds of the judges) and a bare majority (more than half
of the judges) to nuance the computation of percentage of agreement.
The purpose of this study is to investigate the usefulness of Bloom’s revised taxon-
omy for interpretation of standards. Another purpose is to describe differences and
similarities between teachers and assessment experts when interpreting standards.

Method
Design
Two panels of judges categorized the same standards with Bloom’s revised taxonomy
under similar conditions. The judgements of these two panels were compared regard-
ing inter- and intra-judge agreement as well as the usefulness of Bloom’s revised
taxonomy as a tool for interpretation of standards.

The standards
The 35 interpreted standards in this study make up one syllabus in mathematics for
upper secondary schools in Sweden. The analysed syllabus contains 20 standards
named goals and 15 standards named grading criteria (see Skolverket 2007–08).

Bloom’s revised taxonomy


Bloom’s revised taxonomy (Anderson and Krathwohl 2001) has two dimensions, one
knowledge dimension and one cognitive process dimension. The knowledge dimension
focuses on content as types of knowledge. The categories in this dimension are factual
knowledge, conceptual knowledge, procedural knowledge and metacognitive knowl-
edge. The categories in the knowledge dimension are assumed, by the authors, to lie
along a continuum, from concrete in factual knowledge to abstract in metacognitive
knowledge. The continuum between conceptual and procedural knowledge overlaps
somewhat, according to the authors.
The dimension of cognitive processes focuses on how the knowledge is used. The
categories in this dimension are remember, understand, apply, analyse, evaluate and
create. The underlying continuum in this dimension is cognitive complexity, ranging
from low-cognitive complexity in remember to high-cognitive complexity in create.
Bloom’s revised taxonomy provides a two-dimensional taxonomy table with
24 cells (see Figure 1). The rows in the taxonomy table represent the four categories
of the knowledge dimension and the columns the six categories of the cognitive
process dimension. One standard will thereby be categorized according to the two
dimensions and placed in the corresponding cell in the taxonomy table.
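To make the structure concrete, the taxonomy table can be sketched as a simple data structure. The following Python fragment is only an illustration (it is not part of the study, and the standard label is hypothetical):

    # A minimal sketch of the two-dimensional taxonomy table: four knowledge
    # categories times six cognitive process categories give 24 cells, each of
    # which can hold the standards placed in it.
    KNOWLEDGE = ["factual", "conceptual", "procedural", "metacognitive"]
    PROCESSES = ["remember", "understand", "apply", "analyse", "evaluate", "create"]

    # One empty list per cell of the taxonomy table.
    table = {(k, p): [] for k in KNOWLEDGE for p in PROCESSES}

    # A hypothetical judgement: one standard interpreted as applying
    # procedural knowledge.
    table[("procedural", "apply")].append("standard 7")
    print(len(table))  # 24 cells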
The judges
One panel consisted of four teachers with relevant education for, and experience of,
teaching the specific course in mathematics. The four teachers teach in different
schools in different parts of Sweden. These teachers were also engaged as a reference
group for developing national tests in mathematics for the particular course. There-
fore, the conclusion is that all the teachers were very familiar with the syllabus.
The other panel consisted of four assessment experts with relevant education for,
and prior experience of, teaching the specific course in mathematics. They have also
developed and constructed national tests for at least five years, but for different
courses in mathematics in upper secondary schools. These judges are assumed to
have more detailed and deeper experience of analysing standards in mathematics
than the teachers in the other panel.

Procedure
The procedure for collecting data was the same for both panels, even though the data
collections took place on different days. For both of the panels, data was collected on two occasions,
so that intra-judge consistency as well as inter-judge consistency could be studied.
A week before the first occasion, the judges received an introduction letter. This
letter presented the study, gave an overview of Bloom’s revised taxonomy, and
presented classified examples of standards from a syllabus other than the one used in the study. On
the first occasion, Bloom’s revised taxonomy was presented and exemplified,
followed by a discussion about classification of examples. Directly afterwards the
judges categorized the standards individually. On the second occasion, the judges
again individually categorized the standards in the same syllabus, but without any
introduction. The time between the two occasions was between two and three months.
The cognitive process dimension is assumed to lie on a continuum from low to high
cognitive complexity (Anderson and Krathwohl 2001), and when categorizing each
standard regarding this dimension the judges were instructed to choose the category
with the highest cognitive complexity. The categories in the knowledge dimension are,
however, problematic to order along a continuum, because knowledge is commonly
assumed to consist of different types without any clear ordering (e.g. de Jong and
Ferguson-Hessler 1996). Therefore, the judges were allowed to place each standard
into more than one category in the knowledge dimension, i.e. multi-categorize.
However, factual and conceptual knowledge are ordered along a continuum. Factual
knowledge provides the bricks from which conceptual knowledge is built (e.g. Anderson and
Krathwohl 2001). Therefore, the judges were allowed only to choose either factual or
conceptual knowledge. If standards are multi-categorized, then the cells are not
mutually exclusive, according to Hauenstein’s third rule (1998).
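Read as explicit rules, these instructions can be summarized in a short validity check. The following Python sketch is an illustration only; the function name and the encoding of the rules are assumptions based on the description above:

    # Hypothetical encoding of the judges' instructions: one cognitive process
    # category per standard (the judge's choice of highest complexity), one to
    # three knowledge categories, and never factual and conceptual together.
    PROCESSES = {"remember", "understand", "apply", "analyse", "evaluate", "create"}
    KNOWLEDGE = {"factual", "conceptual", "procedural", "metacognitive"}

    def valid_categorization(process, knowledge):
        if process not in PROCESSES:
            return False
        if not (1 <= len(knowledge) <= 3) or not set(knowledge) <= KNOWLEDGE:
            return False
        # Factual and conceptual knowledge lie on one continuum, so the judges
        # could choose only one of them.
        return not {"factual", "conceptual"} <= set(knowledge)

    print(valid_categorization("apply", ["conceptual", "procedural"]))  # True
    print(valid_categorization("apply", ["factual", "conceptual"]))     # False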

Statistical methods
Three measures of both inter- and intra-judge consistency among individual judges are
reported, namely the percentage of perfect agreement (all judges agree), the percent-
age of a clear majority of the judges (at least three judges agree) and Fleiss’s kappa.
These measures are all useful when treating nominal variables (Stemler 2004), and at
least the categories in the knowledge dimension of the taxonomy can only be assumed
to be nominal variables.
agreement is that kappa takes into account chance agreement among the judges
(Watkins and Pacheco 2000).
Fleiss’s kappa (1971) is used in this study because there are multiple judges and the
data are on a nominal level. However, each judge was allowed to place a single standard in one
to three categories with the same weight. To be able to compute Fleiss’s kappa, only
one category per standard and judge can be used. The category chosen was the one
that the judges in each panel most strongly agreed on. According to Landis and Koch
(1977) kappa values between 0.01 and 0.20 represent slight agreement, those between
0.21 and 0.40 fair agreement, those between 0.41 and 0.60 moderate agreement, and
those greater than 0.60 substantial agreement. For measuring percentage of agreement,
a rule of thumb is that an agreement of at least 70% is acceptable (Stemler 2004).
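To make the three measures concrete, the following Python sketch computes them for one panel of four judges. It is not the author’s code, and the ratings shown are invented; Fleiss’s kappa is computed from the per-category counts per standard, as in Fleiss (1971):

    from collections import Counter

    # Invented example: each inner list holds the four judges' categories for
    # one standard (one category per judge, as required for Fleiss's kappa).
    ratings = [
        ["apply", "apply", "apply", "understand"],
        ["understand", "understand", "apply", "apply"],
        ["create", "create", "create", "create"],
    ]

    def perfect_agreement(ratings):
        """Share of standards on which all judges chose the same category."""
        return sum(len(set(r)) == 1 for r in ratings) / len(ratings)

    def clear_majority(ratings):
        """Share of standards on which at least three judges agree."""
        return sum(max(Counter(r).values()) >= 3 for r in ratings) / len(ratings)

    def fleiss_kappa(ratings):
        """Fleiss's (1971) kappa for multiple judges and nominal categories."""
        n = len(ratings[0])                      # judges per standard
        N = len(ratings)                         # number of standards
        cats = sorted({c for r in ratings for c in r})
        counts = [[r.count(c) for c in cats] for r in ratings]  # n_ij
        P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in counts]
        P_obs = sum(P_i) / N                     # mean observed agreement
        p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(cats))]
        P_chance = sum(p * p for p in p_j)       # chance agreement
        return (P_obs - P_chance) / (1 - P_chance)

The resulting kappa can then be read against the Landis and Koch bands quoted above.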
The judges were instructed to use either factual or conceptual knowledge in their
categorization of standards, but sometimes both these categories were used at the same
time. In such cases, only the cell with conceptual knowledge was counted.
The statistical analysis of inter- and intra-judge consistency for panels as wholes
is based on how the standards are distributed in the taxonomy table for all judges in
each panel. The percentage of the total number of categorizations of all standards is
presented in a taxonomy table for each panel on each occasion.
The distribution of standards in the different cells in the taxonomy table for one
panel and occasion is compared to the distribution for the other panel or occasion, and
the emphasis index is used as a measure of how similar the two distributions are. This
index has been used by Porter (2002) for comparing the distribution of standards with
the distribution of assessment items in alignment analyses, although he called it a
balance index. The emphasis index is:

E = 1 − (Σ|x − y|) / 2
where the sum is taken over all cells in the taxonomy table, x is the proportion of the
total number of categorized standards in each cell for Panel 1 or Occasion 1, and y is the corresponding proportion
for Panel 2 or Occasion 2. When E = 1, the distributions are the same and emphasize
the same cells in the taxonomy table. E = 0 means that the distributions are completely
different.
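A minimal Python sketch of the index, assuming the two distributions are given as parallel lists of per-cell proportions (an illustration, not the author’s code):

    # Emphasis index E = 1 - (sum of |x - y|) / 2, where x and y are the
    # proportions of all categorizations falling in each cell of the taxonomy
    # table for the two panels (or occasions) being compared.
    def emphasis_index(x, y):
        """x and y are per-cell proportions; each list should sum to 1."""
        return 1 - sum(abs(a - b) for a, b in zip(x, y)) / 2

    print(emphasis_index([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 1.0 (identical)
    print(emphasis_index([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (disjoint)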
Webb (2002) used a similar index for balance between standards and assessment
items, and according to him index values of at least 0.70 indicate an acceptable level.
Index values between 0.60 and 0.70, according to Webb, indicate an only weakly
acceptable level.

Results
Firstly, results about the usefulness of Bloom’s revised taxonomy for interpretation of
standards are presented. These results will first be reported for both panels, and then
the results for each panel are presented. Finally, results concerning the consistency of
the use of the taxonomy will be reported, both inter- and intra-judge consistency for
individual judges.

Usefulness
All standards were categorized by all judges in both panels, i.e. the taxonomy is totally
inclusive. Table 1 shows proportions of multi-categorized standards for both of the
panels.
Both the teachers and the assessment experts multi-categorized standards, but the
assessment experts multi-categorized standards to a larger extent than the teachers. On
the first occasion, 31 of the 35 standards were multi-categorized by at least one assess-
ment expert, while only 5 standards were multi-categorized by at least one teacher.
The number of multi-categorized standards increased from the first occasion to the
second occasion for both panels (see Table 1). For example, the teachers more than
doubled the number of multi-categorized standards from 5 to 13.
The utilization of the cells in the taxonomy table is visualized in Figure 1. All four
judges in a panel on one occasion were treated as a whole: all their placements of the
standards formed the total distribution for that whole, and the percentages in Figure 1
are based on these total distributions. Cells with a larger proportion of standards
contain more rectangles in Figure 1, and the cells with the most placements are
coloured darkest.
The teachers categorized the standards into more cells in the taxonomy table than
the assessment experts (see Figure 1). On each occasion the teachers used 21 cells in
the taxonomy table, and 19 of these cells were used on both occasions. The assessment experts
used 16 cells on the first occasion and 18 cells on the second occasion, with 15 cells
used on both occasions. None of the judges used the cell create factual knowledge,
while all the other 23 cells were used by at least one judge on at least one occasion.
Table 2 presents the results of the emphasis index, which indicates how similar
two distributions of categorizations are, when each panel at each occasion is treated
as one whole.
Higher values on the emphasis index indicate a larger correspondence between the
two compared distributions of categorizations.
Table 1. Proportions of multi-categorized standards (placed in more than one cell) for
teachers and assessment experts on both occasions.

              Teachers    Assessment experts
Occasion 1    14% (5)     89% (31)
Occasion 2    37% (13)    97% (34)
Figure 1. Distribution (in per cent) of each panel’s total categorizations of all standards on
each occasion. [Figure: the 4 × 6 taxonomy table, with rows for the knowledge dimension
(factual, conceptual, procedural and metacognitive knowledge) and columns for the cognitive
process dimension (remember, understand, apply, analyse, evaluate and create); each cell
shows the percentages of the teachers’ and the assessment experts’ categorizations on
Occasions 1 and 2.]

When the distributions of the two panels are compared, the emphasis index is higher
for the second occasion than for the first (see Table 2), indicating that the distributions
for the two panels correspond to a larger extent on the second occasion than on the first one. However, the
emphasis indices are high on both occasions. When the distributions for the two occa-
sions for each panel are compared, the emphasis indices are higher for the assessment
experts (0.84) than for the teachers (0.73). This indicates that the assessment experts’
distributions of standards agree to a larger extent than the teachers’ distributions.
However, the emphasis index is also high for the teachers. An index of at least 0.70
is, according to Webb (2002), an acceptable level, and all comparisons except that
between the panels on Occasion 1 reach this level.

Table 2. Emphasis index, indicating the degree of similarity between two distributions of
categorizations of standards, for comparisons between the two panels on each occasion as well
as between the two occasions for each panel.

Comparison                                  E
Between panels       Occasion 1             0.64
                     Occasion 2             0.75
Between occasions    Teachers               0.73
                     Assessment experts     0.84

Inter-judge consistency
Table 3 presents results from an analysis of inter-judge consistency among the indi-
vidual judges in each panel on each occasion.

Table 3. Consistency among judges in each panel on each occasion (inter-judge consistency),
reported both as percentage of agreement and kappa coefficients.

                      Teachers                   Assessment experts
                      Occasion 1   Occasion 2    Occasion 1   Occasion 2
Perfect agreement     3% (1)       11% (4)       26% (9)      14% (5)
Clear majority        29% (10)     29% (10)      46% (16)     46% (16)
Kappa coefficients    0.15         0.24          0.47         0.41

Note: (1) The percentage of agreement is reported both for all four judges in each panel (perfect agreement)
and for at least three judges in each panel (clear majority). (2) Number of standards in parentheses.

The assessment experts agreed to a
higher degree than the teachers about the categorizations of the standards on both
occasions. All four assessment experts agreed about the categorization of nine
standards (26%) on the first occasion and five standards (14%) on the second
occasion, while all the teachers agreed on one standard (3%) on the first occasion and
four standards (11%) on the second occasion. A clear majority of assessment experts
agreed on 16 standards (46%) on both occasions, compared to 10 standards (29%) for
the teachers. If the acceptable level of at least 70% agreement is applied to these
results, the inter-judge consistency is not acceptable.
The kappa coefficients (see Table 3) also show a higher degree of inter-judge
agreement for the assessment experts compared to the teachers on both occasions. For
the assessment experts, the kappa coefficients were 0.47 and 0.41 respectively, indi-
cating moderate agreement. For the teachers, the kappa coefficients were 0.15 and
0.24 respectively, indicating slight agreement on the first occasion and fair agreement
on the second occasion.

Intra-judge consistency
Table 4 presents results concerning intra-judge consistency for the individual judges.

Table 4. Consistency between occasions for individual judges (intra-judge consistency), with
averages and standard deviations (SD) for each panel.

                                Teachers     Assessment experts
Agreement            Average    25% (9)      51% (18)
                     SD         7% (2.50)    12% (4.19)
Kappa coefficients   Average    0.18         0.43
                     SD         0.09         0.12

Note: (1) Intra-judge consistency is reported as percentage of agreement and kappa coefficients. (2)
Number of standards in parentheses.

The assessment experts placed standards in the same category on both occasions to a
higher degree than the teachers. On average, 51% of the standards (18) were catego-
rized in the same way on both occasions by the assessment experts, compared to 25%
of the standards (9) for the teachers.
The kappa coefficients (see Table 4) also show a higher degree of intra-judge
consistency for the assessment experts compared to the teachers. For the assessment
experts, the average kappa coefficient is 0.43, indicating moderate agreement. For the
teachers, the average kappa coefficient is 0.18, indicating only slight agreement.

Discussion
The purpose of this study was to investigate the usefulness of Bloom’s revised taxon-
omy for interpretation of standards. The purpose was also to study differences and
similarities between teachers and assessment experts when they interpreted standards
by means of the taxonomy. The discussion is structured in the following way. Firstly,
the usefulness of Bloom’s revised taxonomy will be discussed. Secondly, differences
and similarities between teachers and assessment experts will be discussed. Finally,
limitations of this study are discussed.

The usefulness of Bloom’s revised taxonomy


Bloom’s revised taxonomy is a useful tool for interpretation of standards, according
to the first three rules of Hauenstein (1998). The taxonomy is applicable because all
judges were able to use the taxonomy and almost all cells in the taxonomy table
were used by at least one judge (see Figure 1). The only unused cell was create
factual knowledge, and in mathematics it is highly unlikely to find standards catego-
rized in this cell, because of the difficulty for students to produce new facts. The
taxonomy is also inclusive, because all judges categorized all the standards. Whether
the cells in the taxonomy table are mutually exclusive enough is, however, a matter
of discussion. According to Hauenstein (1998), the taxonomy is mutually exclusive
if one standard is placed in only one category. In this study, the judges were allowed
to multi-categorize, i.e. categorize one standard into more than one cell, and the
judges utilized this possibility (see Table 1). Therefore, it is possible to question
whether the categories in Bloom’s revised taxonomy are mutually exclusive.
However, the large proportion of multi-categorized standards may also have resulted
from standards that are too broad and vague, or from the instructions given to the judges.
There is a need to further study whether the categories in the taxonomy are mutually
exclusive. One such study might be to ask judges whether there are problems with
placing each standard in only one category. Another direction for further study is to investi-
gate whether the standards are broad and vague, which, according to Luft, Brown,
and Slutherin (2007), many standards are today. A division of the standards into
narrower and probably less vague sub-standards could decrease the proportion of
multi-categorization.
Another issue concerning the usefulness of a taxonomy is the level of inter- and
intra-judge consistency. The lack of comparable studies is problematic for evaluating
the measured levels. Kappa coefficients and percentages of agreement are quite low,
indicating at best moderate agreement according to the scale of Landis and Koch for
kappa coefficients.
When all the categorizations for each panel are considered, high levels of
consistency are found between panels as well as between occasions. The distributions
correspond to a large extent, at least at an acceptable level using Webb’s rule of thumb
(2002). The low levels of consistency for individuals and the high levels for panels as
wholes indicate that the standards are too broad and vague to be unambiguously
categorized by individuals. Not surprisingly, a panel of judges interprets the standards
in a more realistic and nuanced way than one single judge.
The conclusion is that Bloom’s revised taxonomy, on the whole, is a useful tool
for interpretation of standards in this study.

Similarities and differences between teachers and assessment experts


In this study, results from one panel composed of teachers and another panel
composed of assessment experts were compared. The similarities between the panels
were that both panels categorized all standards and that both teachers and assessment
experts multi-categorized to some extent. The differences are more noticeable. The
teachers utilized the categories in Bloom’s revised taxonomy to a larger extent,
multi-categorized to a lesser extent, and had lower levels of inter- and intra-judge
consistency than the assessment experts.


The assessment experts multi-categorized standards, i.e. one standard was inter-
preted to contain at least two forms of knowledge, to a larger extent than the teachers.
One possible explanation might be that the assessment experts had prior experience of
exploring all possible interpretations of a single standard in their work, while the
teachers were more one-sided when interpreting standards. Teachers are perhaps more
influenced by the interpretations found in textbooks.
The assessment experts were more consistent in their categorization of the stan-
dards, probably because of more detailed and deeper experience of interpreting
standards. One conclusion of this result is that it is important, especially for teachers,
to have more training in interpreting standards to increase inter- and intra-judge
consistency. However, training teachers as judges can probably increase inter- and
intra-judge consistency only up to the levels of the assessment experts, perhaps
because the standards are too broad and vague.
When the distribution of all categorizations for all judges in the panel of teachers
is compared to the distribution for the panel of assessment experts, the agreements are
at an acceptable level. This indicates that interpretations of standards are more
consistent for a whole group of judges, teachers included, than for individual judges.
Therefore, to increase the agreement in interpretation of standards it is important,
especially for teachers, to encourage interpretation of standards in groups. Bloom’s
revised taxonomy may be a useful tool for such discussions.

Limitations
In this study, the evaluation of the usefulness of the taxonomy was limited to
Hauenstein’s first three rules. To evaluate the fourth and fifth rules, i.e. whether the
categories are ordered by a consistent principle and whether the terms in the taxonomy
are representative of the field, other types of studies are needed.
The samples in this study are quite small, and this may have influenced
the reliability negatively. Alignment studies are methodologically comparable to this
study, and in such studies the number of judges per panel ranges from 2 (e.g.
Porter 2002) to 27 (e.g. D’Agostino et al. 2008). Webb (2007) recommends that a
panel should consist of five to eight judges and concludes that a larger number of
judges increases the reliability. However, a large number of judges also requires
substantial resources in terms of time, people and money, and therefore the level of acceptable reliability
has to be weighed against the costs.
The teachers in this study are not fully representative of teachers in general,
because of their participation in the development of national tests. These teachers
therefore probably have more experience of interpreting and discussing standards than
teachers in general. If a random sample of teachers had participated, the levels of
inter- and intra-judge consistency could be expected to have been even lower than for
the teachers in this study.
One limitation of the procedure in this study is the discussion session on the first
occasion. The judges were allowed to discuss examples within the panel and the goal
was to reach consensus. However, the panels discussed separately and therefore the
two panels reached partly different consensuses. The discussion was also limited in
time, which restricted the possibility of reaching full consensus among all the judges in
one panel for every example. If the discussions had been longer and had been required
to reach consensus on every example, the levels of inter- and intra-judge consistency might
have been higher.


This study is also limited to standards in mathematics, while Bloom’s revised
taxonomy was developed for all academic subjects. Therefore, it is possible that
studies with standards from academic subjects other than mathematics may give different
results concerning the degree of usefulness of Bloom’s revised taxonomy and the
levels of inter- and intra-judge consistency.

Conclusions and further research


There seems to be an implicit assumption, in, for example, alignment studies, that
judges can interpret standards consistently if they are familiar with the specific stan-
dards. All the judges in this study are assumed to be familiar with the standards,
because of their professions, but the low levels of inter- and intra-judge consistency,
especially for the teachers, indicate considerable differences in their interpretation of
standards. Therefore, it is important to study and report inter- and intra-judge consis-
tency when standards are interpreted for example in alignment studies.
Another conclusion concerning the low level of inter- and intra-judge consistency
for the teachers compared to the assessment experts is that more detailed and deeper
experience of interpreting standards seems to be an important qualification for judges.
To increase the judges’ qualification, especially teachers’ qualification, a more
extended training session might be one solution. This is important to investigate.
Further studies are required to find the optimal size of panels, one that gives
acceptable reliability at an acceptable cost in resources.
To be able to recommend Bloom’s revised taxonomy for interpretation of stan-
dards in general, further studies in other academic subjects than mathematics are
needed. There are many academic subjects, but a natural starting point might be
subjects that often have standardized assessments.
Interpretation of standards is needed for planning teaching, for assessing students
and for alignment analyses. Alignment analyses are usually comparisons between
standards and assessments (e.g. Webb, Herman, and Webb 2007) or between stan-
dards and teaching (Porter 2002). Such comparisons require that assessment items and
teaching also be categorized with the same tool as the standards. This study, however,
only evaluates the usefulness of Bloom’s revised taxonomy for interpreting standards.
Further studies are needed to investigate the usefulness of the taxonomy for categoriz-
ing assessment items and teaching.
However, what are the consequences of different interpretations of standards? The
low levels of inter- and intra-judge consistency for the teachers may have negative
effects on students’ learning, on fairness in assessments and on trustworthiness in
alignment analyses. One conclusion is that agreement in interpretation of standards
has to increase, especially for teachers, to allow all students a fair chance to attain the
same standards regardless of teachers and schools.

References
Anderson, L.W., and D.R. Krathwohl, eds. 2001. A taxonomy for learning, teaching, and
assessing: A revision of Bloom’s taxonomy of educational objectives. New York: Addison
Wesley Longman.
Bhola, D.S., J.C. Impara, and C.W. Buckendahl. 2003. Aligning tests with states’ content
standards: Methods and issues. Educational Measurement: Issues and Practice 22, no. 3:
21–9.
Biggs, J. 2003. Teaching for quality learning at university. Glasgow: Society for Research
into Higher Education and Open University Press.
Bloom, B.S., ed. 1956. Taxonomy of educational objectives: Handbook I: Cognitive domain.
New York: David McKay.
Bybee, R.W. 2003. Improving technology education: Understanding reform – Assuming
responsibility. Technology Teacher 62, no. 8: 22–5.
D’Agostino, J.V., M.E. Welsh, A.D. Cimetta, L.D. Falco, S. Smith, W. Hester VanWinkle,
and S.J. Powers. 2008. The rating and matching item-objective alignment methods.
Applied Measurement in Education 21, no. 1: 1–21.
de Jong, T., and M.G.M. Ferguson-Hessler. 1996. Types and qualities of knowledge. Educational
Psychologist 31, no. 2: 105–13.
Fairbrother, R.W. 1975. The reliability of teachers’ judgement of the abilities being tested by
multiple choice items. Educational Research 17, no. 3: 202–10.
Fleiss, J.L. 1971. Measuring nominal scale agreement among many raters. Psychological
Bulletin 76, no. 5: 378–82.
Guilford, J.P. 1967. The nature of human intelligence. New York: McGraw-Hill.
Hanna, W. 2007. The new Bloom’s taxonomy: Implications for music education. Arts
Education Policy Review 108, no. 4: 7–16.
Hauenstein, A.D. 1998. A conceptual framework for educational objectives: A holistic
approach to traditional taxonomies. Lanham, MD: University Press of America.
Herman, J.L., N.M. Webb, and S.A. Zuniga. 2007. Measurement issues in the alignment of
standards and assessments: A case study. Applied Measurement in Education 20, no. 1:
101–26.
Landis, J.R., and G.G. Koch. 1977. The measurement of observer agreement for categorical
data. Biometrics 33, no. 1: 159–74.
Luft, P., C.M. Brown, and L.J. Slutherin. 2007. Are you and your students bored with the
benchmarks? Sinking under the standards? Then transform your teaching through
transition. Teaching Exceptional Children 39, no. 6: 39–46.
Marzano, R.J., and J.S. Kendall. 2007. The new taxonomy of educational objectives.
Thousand Oaks, CA: Corwin Press.
Mullis, I.V.S., M.O. Martin, T.A. Smith, R.A. Garden, K.D. Gregory, E.J. Gonzales, S.J.
Chrostowski, and K.M. O’Connor. 2001. TIMSS assessment frameworks and specifica-
tions 2003. Chestnut Hill: International Association for the Evaluation of Educational
Achievement.
Näsström, G., and W. Henriksson. 2008. Alignment of standards and assessment: A theoreti-
cal and empirical study of methods for alignment. Electronic Journal of Research in
Educational Psychology 6, no. 3: 667–90.
OECD. 1999. Measuring student knowledge and skills: A new framework for assessment.
Paris: OECD.
Pickard, M.J. 2007. The new Bloom’s taxonomy: An overview for family and consumer
sciences. Journal of Family and Consumer Sciences Education 25, no. 1: 45–55.
Poole, R.L. 1971. Characteristics of the taxonomy of educational objectives: Cognitive
domain. Psychology in the Schools 8, no. 4: 379–85.
Popham, W.J. 2003. Test better, teach better: The instructional role of assessment.
Alexandria, VA: Association for Supervision and Curriculum Development.
Porter, A.C. 2002. Measuring the content of instruction: Uses in research and practice.
Educational Researcher 31, no. 7: 3–14.
Porter, A.C., and J.L. Smithson. 2001. Are content standards being implemented in the
classroom? A methodology and some tentative answers. In From the capitol to the classroom:
Standards-based reform in the States, ed. S.H. Fuhrman, 60–80. Chicago, IL: National Soci-
ety for the Study of Education, University of Chicago Press.
Seddon, G.M. 1978. The properties of Bloom’s taxonomy of educational objectives for the
cognitive domain. Review of Educational Research 48, no. 2: 303–23.
Skolverket. 2007–08. Upper secondary school. Mathematics. http://www3.skolverket.se/ki03/
front.aspx?sprak=EN&ar=0708&infotyp=8&skolform=21&id=MA&extrald= and http://
www3.skolverket.se/ki03/info.aspx?sprak=EN&id=MA&skolform=21&ar=0708&info-
typ=17 (accessed March 4, 2008).
Stemler, S.E. 2004. A comparison of consensus, consistency, and measurement approaches to
estimating interrater reliability. Practical Assessment, Research and Evaluation 9, no. 4.


Stephens, J.-P., G.A. Vos, E.M. Stevens, and J.S. Moore. 2006. Test-retest repeatability of the
strain index. Applied Ergonomics 37: 275–81.
Su, W.M., P.J. Osisek, and B. Starnes. 2004. Applying the revised Bloom’s taxonomy to a
medical-surgical nursing lesson. Nurse Educator 29, no. 3: 116–20.
Watkins, M.W., and M. Pacheco. 2000. Interobserver agreement in behavioural research:
Importance and calculation. Journal of Behavioral Education 10, no. 4: 205–12.
Webb, N.L. 2002. An analysis of the alignment between mathematics standards and
assessments for three states. Paper presented at the annual meeting of the American
Educational Research Association, April 1–5, in New Orleans, LA.
Webb, N.L. 2007. Issues related to judging the alignment of curriculum standards and
assessments. Applied Measurement in Education 20, no. 1: 7–25.
Webb, N.M., J.L. Herman, and N.L. Webb. 2007. Alignment of mathematics state-level
standards and assessments: The role of reviewer agreement. Educational Measurement:
Issues and Practice 26, no. 2: 17–29.
