The Disagreeable Behaviour of The Kappa Statistic: Laura Flight and Steven A. Julious
Published online 3 December 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pst.1659
1. INTRODUCTION

It is often of interest to measure the agreement between a number of raters when an outcome is nominal or ordinal. The kappa statistic, first proposed by Cohen [1], is a measure of agreement and is frequently used. This method, however, has a number of limitations.

This Teacher's Corner article gives a description of the statistic and important related concepts, followed by a motivational example of how the limitations affect the usefulness of kappa in practice.

Table I. Interpretations of the kappa statistic

    Kappa        Agreement
    < 0.20       Poor
    0.21–0.40    Fair
    0.41–0.60    Moderate
    0.61–0.80    Good
    0.81–1.00    Very Good
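The Table I thresholds translate directly into a lookup. A minimal Python sketch (the helper name `altman_interpretation` is ours, not from the article):

```python
def altman_interpretation(kappa):
    """Map a kappa value to Altman's descriptive scale (Table I)."""
    if kappa < 0.20:
        return "Poor"
    elif kappa <= 0.40:
        return "Fair"
    elif kappa <= 0.60:
        return "Moderate"
    elif kappa <= 0.80:
        return "Good"
    return "Very Good"
```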
2. THE KAPPA STATISTIC

The measurement of agreement often arises in the context of reliability, where the concern is less about whether there is an association or correlation between the classifications of two raters but more about whether the two raters agree [2]. Bloch and Kraemer [3] suggest that agreement is a distinct kind of association describing how well one rater's classification agrees with another's, or how well a single rater's classifications at one time point agree with their classifications at another time point.

Throughout this article, the main focus is on inter-rater agreement, where it is assumed the objects to be classified are independent, the raters make their classifications independently and the categories are independent [1]. The issues raised can also be generalised to intra-rater agreement.

2.1. Cohen's Kappa Statistic

The kappa statistic was proposed by Cohen in the context of two raters classifying objects into two categories. The proportion of concordance po is the number of times the raters classify objects into the same category divided by the total sample size (N). This is the observed agreement between the raters and can be used as a simple estimate of agreement.

Cohen (1960) suggests this estimate alone is not sufficient, as it is necessary to account for agreement expected by chance, denoted by pe [1]. Consequently, Cohen proposed the kappa statistic:

    κ = (po − pe) / (1 − pe).    (1)

This estimates the proportion of agreement between raters after removing any chance agreement. The kappa statistic can be interpreted using the scale taken from Altman [4] given in Table I. Shrout [5] gives more conservative interpretations of the statistic; for example, values in the range 0.41–0.60 represent only fair agreement. Throughout, the Altman scale of interpretation is used.

2.2. Weighted Kappa Statistic

When considering multiple categories to classify objects, ordering is important. Cohen's kappa (Equation 1), however, assumes the disagreement between different categories is equally weighted, and the ordering of categories is not important [6]. If disagreements are thought to have varying consequences, a weighted kappa can be calculated. This method applies a weight vij to disagreements in the ith row and jth column of the data table. Larger

Medical Statistics Group, University of Sheffield, Sheffield, England
*Correspondence to: Professor Steven Julious, Medical Statistics Group, ScHARR, University of Sheffield, 30 Regent Court, Regent Street, Sheffield, England, S1 4DA. E-mail: s.a.julious@sheffield.ac.uk

Pharmaceut. Statist. 2015, 14 74–78 Copyright © 2014 John Wiley & Sons, Ltd.
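Equation 1 can be sketched directly in Python. The function below (our own illustration, not code from the article) estimates po from the diagonal of an r × r table of joint classification counts, and pe from the products of the two raters' marginal proportions:

```python
def cohens_kappa(table):
    """Cohen's kappa (Equation 1) for an r x r table of joint counts.

    table[i][j] = number of objects placed in category i by rater 1
    and category j by rater 2.
    """
    r = len(table)
    n = sum(sum(row) for row in table)
    # Observed proportion of concordance po: diagonal counts over N.
    po = sum(table[i][i] for i in range(r)) / n
    # Chance agreement pe: product of the raters' marginal proportions,
    # summed over the categories.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(r)) / n for j in range(r)]
    pe = sum(row_marg[k] * col_marg[k] for k in range(r))
    return (po - pe) / (1 - pe)
```

Applied to the 2 × 2 motivating example of Section 5, this returns a value close to zero despite high raw concordance.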
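Section 2.2 is truncated in this extract, so the article's own weighting scheme is not reproduced. The sketch below (our own construction) uses linear disagreement weights vij = |i − j|/(r − 1), one common choice, and computes the weighted kappa as 1 minus the ratio of weighted observed to weighted chance-expected disagreement:

```python
def weighted_kappa(table, weights=None):
    """Weighted kappa for an r x r table.

    weights[i][j] is the penalty v_ij for a rater-1 category i /
    rater-2 category j disagreement. Defaults to linear weights
    |i - j| / (r - 1), an assumption of this sketch; the zero diagonal
    means exact agreement is never penalised.
    """
    r = len(table)
    n = sum(sum(row) for row in table)
    if weights is None:
        weights = [[abs(i - j) / (r - 1) for j in range(r)] for i in range(r)]
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(r)) / n for j in range(r)]
    # Weighted observed and chance-expected disagreement.
    obs = sum(weights[i][j] * table[i][j] / n
              for i in range(r) for j in range(r))
    exp = sum(weights[i][j] * row_marg[i] * col_marg[j]
              for i in range(r) for j in range(r))
    return 1 - obs / exp
```

For a 2 × 2 table this reduces to Cohen's kappa of Equation 1, since every disagreement then carries the same weight.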
The maximum attainable kappa, κmax, is calculated using

    κmax = (poM − pe) / (1 − pe),    (4)

where poM is estimated by

    poM = min(g1/N, f1/N) + min(g2/N, f2/N) + … + min(gr/N, fr/N)    (5)

for an r × r table.

Sim and Wright [2] interpret κmax as reflecting the extent to which the raters' ability to agree is constrained by pre-existing factors that result in unequal marginal totals. It is helpful to report this statistic alongside Cohen's kappa, as it can illustrate how a low value might be a consequence of the marginal totals rather than poor agreement.

3.2. Prevalence and Bias

Byrt et al. [8] describe the kappa statistic as being affected by both the prevalence and bias between raters. Prevalence in this context is the probability with which a rater classifies an object as 'one' in the study sample, say. This relates to the balance of the table, with low prevalences giving a balanced table.

Bias is concerned with the frequency at which raters choose a particular category and affects the symmetry of the table. A non-biased table will be symmetrical, as the raters do not differ in their frequency of choosing category one. The extent of bias and prevalence can be evaluated using the prevalence and bias indexes. Prevalence and bias are used along with symmetry and balance to explain when the paradoxes of kappa can occur.

4. PREVALENCE AND BIAS ADJUSTED KAPPA

If, after examining the symmetry and balance of a table and estimating the prevalence index (PI) and the bias index (BI), it is clear that the kappa is likely to be influenced by the distribution of the marginal totals, it is possible to calculate an adjusted kappa statistic. Byrt et al. (1993) define a statistic that adjusts for prevalence and bias (PABAK) [8]. This statistic is estimated by replacing the diagonal elements of the table (a, d) by their average n = (a + d)/2 and the off-diagonal elements (b, c) by their average m = (b + c)/2. Substituting these values into Equations 1 and 3 and using N = a + b + c + d, the formula for PABAK can be reduced to:

    PABAK = (2n/N − 0.5) / (1 − 0.5) = 2po − 1.    (6)

5. MOTIVATING EXAMPLE

In this section, a number of examples are used to illustrate how the kappa statistic should not be solely relied upon when assessing agreement.

In the example that is the motivation for this paper, there are N = 261 students who are categorised by two independent assessors as either 'one' or 'two'. It is important to establish whether there is agreement between the assessors. If agreement is poor, it would be necessary to use additional assessors or to have the grading scheme amended.

5.1. Unadjusted Kappa

Data for this example are given in Table III. Using Feinstein and Cicchetti's definitions, this is an example of a symmetrically imbalanced table, as f1 and g1 are large, and f2 and g2 are small. The proportion of the objects in class one for both assessors is clearly greater than 0.5. There is a high probability that an assessor will classify a student as one and hence, high prevalence (PI = 0.628). The frequencies with which the assessors choose a particular category are similar, and so there is low bias (BI = 0.234).

The proportion of concordance po = 0.682 indicates good agreement between the two assessors. The kappa statistic, on the other hand, is κ = 0.038, with pe = 0.670. This suggests little agreement beyond that expected by chance between the assessors. The interpretation of agreement varies substantially depending on the summary statistic chosen.
Table III. Data with two assessors categorising into two categories (N = 261)

            1     2   Total
    1     171    72     243
    2      11     7      18
    Total 182    79     261

Considering the imbalance in the table and the high PI value, it is likely that the kappa statistic is being influenced by the prevalences and hence the distribution of the marginal totals. The proportion of students in the marginal g1 is 0.931 and in f1 is 0.697. The maximum attainable kappa is

    κmax = [min(243/261, 182/261) + min(18/261, 79/261) − 0.670] / (1 − 0.670) = 0.292.    (7)

This low value indicates that there is unlikely to ever be strong agreement between the assessors when considering the interpretations in Table I, a consequence of factors influencing their marginal totals.

As an illustration of how the prevalences and the distribution of the marginals influence kappa, in Table III the number of students in the off-diagonal cells is fixed, and the remaining students are split evenly between the diagonal cells. One student is then moved from cell 'two:two' to 'one:one'; hence, the prevalence of each assessor categorising a student as one is increased. The overall concordance, however, remains constant, as the proportion of students awarded the same grade by each assessor does not change. The proportion of concordance is constant whereas the kappa statistic falls.

Although the proportion of concordance and kappa are not strictly comparable, as concordance does not account for agreement expected by chance, the large differences in their values indicate that the kappa may not be behaving as expected.

The proportion of students in the marginal g1 and kappa are calculated as one student is moved as described. These two statistics are plotted in Figure 1 to demonstrate how there is a change in kappa despite no change in the concordance and, arguably, the agreement. The vertical line on the plot marks the proportion of students in g1 for Table III, and the horizontal line marks the value of the kappa statistic for this table. The kappa statistic even falls below zero when the proportion in the marginal g1 approaches 0.95. This negative kappa is interpreted by Viera and Garrett (2005) as less than chance agreement [9].

[Figure 1. Kappa statistic plotted against the proportion of students rated as one:one (g1/N). Kappa (y-axis, 0.0–0.4) falls as the proportion in g1 (x-axis, 0.65–0.95) increases.]

Based on the motivational example, scenarios are imagined with a sample size of 261 where symmetry and imbalance impact on kappa. If a table has symmetry and is perfectly balanced, such as the hypothetical example in Table IV, this indicates some agreement between the assessors. The PI value is 0.157, a low value reflecting the distribution of students in the marginals for both assessors being balanced. The BI score of 0 is a consequence of the perfect symmetry of the table, g1 = f1 and g2 = f2. The proportion of concordance for this table is 0.609, which further indicates good agreement; however, the kappa is 0.189. Using the interpretations in Table I, this suggests poor agreement.

Table IV. Perfectly symmetrical (N = 261)

            1     2   Total
    1     100    51     151
    2      51    59     110
    Total 151   110     261

Table V. Imperfectly asymmetrical (N = 261)

            1     2   Total
    1      60   101     161
    2       1    99     100
    Total  61   200     261

Table V (PI = 0.149, BI = 0.383) also has po = 0.609; however, this table is asymmetrical. One assessor categorises more students in category one and fewer students in category two when compared with the second assessor. There is a difference in the distribution of the marginal totals of the assessors. This suggests less agreement than in the symmetric Table IV, as intuitively assessors who agree will have similar marginal distributions [7]. The kappa statistic for Table V, however, is larger, κ = 0.391. This value indicates greater agreement despite the lack of agreement in the marginal totals.

5.2. Adjusted Kappa

In the motivating example, PABAK = 0.364, much greater than the unadjusted kappa. The high prevalence and imbalance in the table are decreasing the kappa statistic, which can lead to incorrect inferences about the agreement between the two raters. Varying the prevalences in Table III by moving a student from two:two to one:one as before, the PABAK statistic remains constant as the proportion of objects in the marginal g1 changes. This illustrates how the adjusted kappa is less influenced by prevalence and the distribution of the marginals.
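The contrast between kappa and PABAK under shifting prevalence can be reproduced numerically. The sketch below (our own illustration) fixes the off-diagonal cells of Table III, splits the 178 agreeing students between the diagonal cells, and moves students from two:two into one:one; kappa falls, eventually below zero, while the concordance and PABAK never move:

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa (Equation 1) for a 2 x 2 table with counts a, b / c, d."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (po - pe) / (1 - pe)

# Off-diagonal cells fixed at the Table III values; the 178 students on
# the diagonal start evenly split and are moved from two:two to one:one.
b, c, diag = 72, 11, 178
for a in (89, 120, 150, 171, 178):
    d = diag - a
    po = (a + d) / 261        # concordance: always 178/261, about 0.682
    pabak = 2 * po - 1        # Equation 6: always about 0.364
    print(round((a + b) / 261, 3),                # proportion in g1
          round(kappa_2x2(a, b, c, d), 3),        # falls as g1 grows
          round(pabak, 3))
```

At a = 178 the marginal proportion g1/N reaches 250/261, approaching 0.96, and kappa is negative, matching the behaviour shown in Figure 1.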