
TEACHER’S CORNER

Published online 3 December 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pst.1659

The disagreeable behaviour of the kappa statistic

Laura Flight and Steven A. Julious*

Medical Statistics Group, University of Sheffield, Sheffield, England

*Correspondence to: Professor Steven Julious, Medical Statistics Group, ScHARR, University of Sheffield, 30 Regent Court, Regent Street, Sheffield, England, S1 4DA. E-mail: s.a.julious@sheffield.ac.uk
It is often of interest to measure the agreement between a number of raters when an outcome is nominal or ordinal. The kappa
statistic is used as a measure of agreement. The statistic is highly sensitive to the distribution of the marginal totals and can
produce unreliable results. Other statistics such as the proportion of concordance, maximum attainable kappa and prevalence
and bias adjusted kappa should be considered to indicate how well the kappa statistic represents agreement in the data. Each
kappa should be considered and interpreted based on the context of the data being analysed. Copyright © 2014 John Wiley &
Sons, Ltd.

Keywords: kappa statistic; concordance; agreement; PABAK

1. INTRODUCTION

It is often of interest to measure the agreement between a number of raters when an outcome is nominal or ordinal. The kappa statistic, first proposed by Cohen [1], is a measure of agreement and is frequently used. This method, however, has a number of limitations.

This Teacher's Corner article gives a description of the statistic and important related concepts, followed by a motivational example of how the limitations affect the usefulness of kappa in practice.

Table I. Interpretations of the kappa statistic

    Kappa         Agreement
    < 0.20        Poor
    0.21–0.40     Fair
    0.41–0.60     Moderate
    0.61–0.80     Good
    0.81–1.00     Very Good
2. THE KAPPA STATISTIC

The measurement of agreement often arises in the context of reliability, where the concern is less about whether there is an association or correlation between the classifications of two raters but more about whether the two raters agree [2]. Bloch and Kraemer [3] suggest that agreement is a distinct kind of association, describing how well one rater's classification agrees with another, or how well a single rater's classifications at one time point agree with their classifications at another time point.

Throughout this article, the main focus is on inter-rater agreement, where it is assumed the objects to be classified are independent, the raters make their classifications independently and the categories are independent [1]. The issues raised can also be generalised to intra-rater agreement.

2.1. Cohen's Kappa Statistic

The kappa statistic was proposed by Cohen in the context of two raters classifying objects into two categories. The proportion of concordance p_o is the number of times the different raters classify objects into the same category divided by the total sample size (N). This is the observed agreement between the raters and can be used as a simple estimate of agreement.

Cohen (1960) suggests this estimate alone is not sufficient, as it is necessary to account for agreement expected by chance, denoted by p_e [1]. Consequently, Cohen proposed the kappa statistic:

    κ = (p_o − p_e) / (1 − p_e).    (1)

This estimates the proportion of agreement between raters after removing any chance agreement. The kappa statistic can be interpreted using the scale taken from Altman [4] given in Table I. Shrout [5] gives more conservative interpretations of the statistic; for example, values in the range 0.41–0.60 represent only fair agreement. Throughout, the Altman scale of interpretation is used.
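To make Equation (1) concrete, a minimal Python sketch is given below. The code is not part of the original article; the function name and the example counts are illustrative.

```python
# A minimal sketch (not from the article) of Cohen's kappa for a 2x2 table.
# a and d are the agreement cells, b and c the disagreement cells.
def cohens_kappa(a, b, c, d):
    n = a + b + c + d
    p_o = (a + d) / n                     # observed proportion of concordance
    g1, g2 = a + b, c + d                 # row marginal totals
    f1, f2 = a + c, b + d                 # column marginal totals
    p_e = (f1 * g1 + f2 * g2) / n ** 2    # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts: two raters agree on 80 of 100 objects.
print(round(cohens_kappa(50, 10, 10, 30), 3))   # 0.583, 'moderate' on the Altman scale
```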

2.2. Weighted Kappa Statistic

When considering multiple categories to classify objects, ordering is important. Cohen's kappa (Equation 1), however, assumes the disagreement between different categories is equally weighted, and the ordering of categories is not important [6]. If disagreements are thought to have varying consequences, a weighted kappa can be calculated. This method applies a weight v_ij to disagreements in the ith row and jth column of the data table. Larger weights apply greater penalties to disagreements with larger consequences [2]. The determination of these weights should be made prior to the collection of data [6].

The formula for a weighted kappa statistic is given by

    κ_w = 1 − (Σ v_ij p_oij) / (Σ v_ij p_eij),    (2)

where p_oij is the proportion of observed classifications in the i, jth cell, and p_eij is the proportion of expected classifications. The p_eij are found by multiplying the ith row total by the jth column total and dividing by the total sample size N.
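A sketch of Equation (2) for a general r × r table is given below. This is an illustrative implementation rather than code from the paper; the quadratic-style weights in the example are one common choice and would in practice be fixed before data collection, as noted above.

```python
# Illustrative sketch of the weighted kappa in Equation (2).
# `counts` is an r x r table of observed classifications and `weights`
# holds the disagreement weights v_ij (conventionally 0 on the diagonal).
def weighted_kappa(counts, weights):
    r = len(counts)
    n = sum(sum(row) for row in counts)
    row_tot = [sum(counts[i]) for i in range(r)]
    col_tot = [sum(counts[i][j] for i in range(r)) for j in range(r)]
    num = den = 0.0
    for i in range(r):
        for j in range(r):
            p_obs = counts[i][j] / n                    # p_oij
            p_exp = row_tot[i] * col_tot[j] / n ** 2    # p_eij
            num += weights[i][j] * p_obs
            den += weights[i][j] * p_exp
    return 1 - num / den

# Quadratic-style weights penalise disagreements two categories apart
# more heavily than disagreements one category apart.
counts = [[20, 5, 1], [4, 15, 6], [2, 5, 22]]
weights = [[0, 1, 4], [1, 0, 1], [4, 1, 0]]
print(round(weighted_kappa(counts, weights), 3))   # roughly 0.71 for these counts
```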

3. THE PARADOXES

The following definitions (Box 2) are given to help explain some of the issues arising with the kappa statistic and are used in the context of Table II.

Table II. Data with two raters categorising into two categories

              1      2      Total
    1         a      b      g_1
    2         c      d      g_2
    Total     f_1    f_2    N

Throughout this article, the special dichotomous (2 × 2) case is considered where no weightings are applied. The principles discussed can be generalised to the (r × r) case.

Feinstein and Cicchetti [7] highlight the following issues they term 'paradoxes' of the kappa statistic:

(1) For high values of concordance, low values of kappa can be recorded.
(2) Asymmetric, imperfectly imbalanced tables have a higher kappa than perfectly imbalanced and symmetric tables.

By examining Equation 1 for the kappa statistic, it is evident that its value is dependent on the proportion of agreement expected by chance, p_e. The smaller the value of p_e, the larger the kappa statistic will be; the larger the p_e, the smaller the kappa. Feinstein and Cicchetti illustrate that the value of p_e is dependent on the distribution of the marginal totals. This is evident when rewriting the formula for p_e as:

    p_e = (f_1 g_1 + f_2 g_2) / N².    (3)

The values f_1, f_2 and g_1, g_2 are the marginal totals for rater 1 and rater 2, respectively.

The first paradox occurs when there is symmetrical imbalance in the vertical and horizontal marginal totals. The second paradox occurs if the imbalance is asymmetrical or imperfectly symmetrical.
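The dependence of κ on p_e, and hence on the marginal totals, is easy to demonstrate numerically. The two hypothetical 2 × 2 tables below (invented for illustration, not taken from the article) have exactly the same proportion of concordance, p_o = 0.85, yet the more imbalanced marginals inflate p_e and sharply reduce the kappa.

```python
# Two hypothetical 2x2 tables with identical observed agreement (p_o = 0.85)
# but different marginal distributions; cells are ordered a, b, c, d.
def kappa_and_pe(a, b, c, d):
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + c) * (a + b) + (b + d) * (c + d)) / n ** 2   # Equation (3)
    return (p_o - p_e) / (1 - p_e), p_e

balanced   = (45, 7, 8, 40)   # marginal totals close to 50:50
imbalanced = (80, 7, 8, 5)    # most objects classified as 'one' by both raters

for counts in (balanced, imbalanced):
    k, p_e = kappa_and_pe(*counts)
    print(f"p_e = {p_e:.3f}, kappa = {k:.3f}")
# p_e = 0.501, kappa = 0.699   (balanced)
# p_e = 0.781, kappa = 0.314   (imbalanced)
```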


3.1. The Maximum Attainable Kappa, κ_max

Feinstein and Cicchetti also highlight that it is not only the magnitude of kappa that is affected by the marginal totals but also the maximum possible value of the statistic [7]. Cohen notes that kappa can only reach the maximum value of one when the off-diagonal elements (b, c) in Table II are equal to zero [1]. For this to occur, the marginal values must be identical, hence g_1 = f_1 and g_2 = f_2, resulting in a perfectly symmetrical table.

The maximum attainable kappa, κ_max, is calculated using

    κ_max = (p_oM − p_e) / (1 − p_e),    (4)

where p_oM is estimated by

    p_oM = min(g_1/N, f_1/N) + min(g_2/N, f_2/N) + … + min(g_r/N, f_r/N)    (5)

for an r × r table.

Sim and Wright [2] interpret κ_max as reflecting the extent to which the raters' ability to agree is constrained by pre-existing factors that result in unequal marginal totals. It is helpful to report this statistic alongside Cohen's kappa, as it can illustrate how a low value might be a consequence of the marginal totals rather than poor agreement.
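Equations (4) and (5) translate directly into code. The sketch below is illustrative (the function name and example counts are not from the article) and shows a table whose unequal marginals cap κ_max well below one.

```python
# Illustrative computation of the maximum attainable kappa (Equations 4 and 5)
# for a 2x2 table with cells a, b, c, d laid out as in Table II.
def kappa_max(a, b, c, d):
    n = a + b + c + d
    g = [a + b, c + d]                                  # row marginal totals
    f = [a + c, b + d]                                  # column marginal totals
    p_e = (f[0] * g[0] + f[1] * g[1]) / n ** 2
    p_om = sum(min(g[i], f[i]) for i in range(2)) / n   # Equation (5)
    return (p_om - p_e) / (1 - p_e)

# With these unequal marginals kappa cannot exceed roughly 0.39,
# however well the two raters perform.
print(round(kappa_max(80, 15, 2, 3), 3))
```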
3.2. Prevalence and Bias
are N D 261 students who are categorised by two indepen-
Byrt et al. [8] describe the kappa statistic as being affected by both dent assessors as either ‘one’ or ‘two’. It is important to establish
the prevalence and bias between raters . Prevalence in this con- whether there is agreement between assessors. If agreement is
text is the probability with which a rater classifies an object as poor it would be necessary to use additional assessors or to have
‘one’ in the study sample, say. This relates to the balance of the the grading scheme amended.
table, with low prevalences giving a balanced table.
Bias is concerned with the frequency at which raters choose 5.1. Unadjusted Kappa
a particular category and affects the symmetry of the table. A
Data for this example are given in Table III. Using Feinstein and
non-biased table will be symmetrical as the raters do not differ
Ciccetti’s definitions, this is an example of a symmetrically imbal-
in their frequency of choosing category one. The extent of bias
anced table, as f1 and g1 are large, and f2 , g2 are small. The propor-
and prevalence can be evaluated using the Prevalence and Bias
tion of the objects in class one for both assessors is clearly greater
indexes. Prevalence and bias are used along with symmetry and
than 0.5. There is a high probability that an assessor will classify
balance to explain when the paradoxes of kappa can occur.
a student as one and hence, high prevalence (PI=0.628). The fre-
quencies for which the assessors choose a particular category are
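The two indexes are not written out above. The usual definitions from Byrt et al. [8], PI = |a − d|/N and BI = |b − c|/N, reproduce the values quoted for the tables in Section 5, so the illustrative sketch below uses them (the counts are those of Table III, introduced later).

```python
# Prevalence and bias indexes for a 2x2 table, using the definitions from
# Byrt et al. [8]; these reproduce the PI and BI values quoted for Table III.
def prevalence_index(a, b, c, d):
    return abs(a - d) / (a + b + c + d)

def bias_index(a, b, c, d):
    return abs(b - c) / (a + b + c + d)

a, b, c, d = 171, 72, 11, 7                     # Table III counts
print(round(prevalence_index(a, b, c, d), 3))   # 0.628
print(round(bias_index(a, b, c, d), 3))         # 0.234
```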
4. PREVALENCE AND BIAS ADJUSTED KAPPA

If, after examining the symmetry and balance of a table and estimating the prevalence index (PI) and the bias index (BI), it is clear that the kappa is likely to be influenced by the distribution of the marginal totals, it is possible to calculate an adjusted kappa statistic. Byrt et al. (1993) define a statistic that adjusts for prevalence and bias (PABAK) [8]. This statistic is estimated by replacing the diagonal elements of the table (a, d) by their average n = (a + d)/2 and the off-diagonal elements (b, c) by their average m = (b + c)/2. Substituting these values into Equations 1 and 3, and using N = a + b + c + d, the formula for PABAK can be reduced to:

    PABAK = (2n/N − 0.5) / (1 − 0.5) = 2 p_o − 1.    (6)
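Because Equation (6) collapses to 2p_o − 1, PABAK needs only one line of arithmetic. The illustrative sketch below checks it against the Table III counts used in the next section.

```python
# PABAK for a 2x2 table (Equation 6): replacing the diagonal and off-diagonal
# cells by their averages collapses the formula to 2 * p_o - 1.
def pabak(a, b, c, d):
    p_o = (a + d) / (a + b + c + d)
    return 2 * p_o - 1

print(round(pabak(171, 72, 11, 7), 3))   # 0.364 for the Table III counts
```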

5. MOTIVATING EXAMPLE

In this section, a number of examples are used to illustrate how the kappa statistic should not be solely relied upon when assessing agreement.

In the example that is the motivation for this paper, there are N = 261 students who are categorised by two independent assessors as either 'one' or 'two'. It is important to establish whether there is agreement between assessors. If agreement is poor, it would be necessary to use additional assessors or to have the grading scheme amended.

5.1. Unadjusted Kappa

Data for this example are given in Table III. Using Feinstein and Cicchetti's definitions, this is an example of a symmetrically imbalanced table, as f_1 and g_1 are large, and f_2, g_2 are small. The proportion of the objects in class one for both assessors is clearly greater than 0.5. There is a high probability that an assessor will classify a student as one and hence high prevalence (PI = 0.628). The frequencies with which the assessors choose a particular category are similar, and so there is low bias (BI = 0.234).

Table III. Data with two assessors categorising into two categories (N = 261)

              1      2      Total
    1         171    72     243
    2         11     7      18
    Total     182    79     261

The proportion of concordance p_o = 0.682 indicates good agreement between the two assessors. The kappa statistic, on the other hand, is κ = 0.038, with p_e = 0.670. This suggests little agreement beyond that expected by chance between the assessors. The interpretation of agreement varies substantially depending on the summary statistic chosen.

Considering the imbalance in the table and the high PI value, it is likely that the kappa statistic is being influenced by the prevalences and hence the distribution of the marginal totals. The proportion of students in the marginal g_1 is 0.931 and f_1 is 0.697. The maximum attainable kappa is

    κ_max = [min(243/261, 182/261) + min(18/261, 79/261) − 0.670] / (1 − 0.670) = 0.292.    (7)

This low value indicates that there is unlikely to ever be strong agreement between the assessors when considering the interpretations in Table I, a consequence of factors influencing their marginal totals.
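The statistics quoted for Table III can be verified with a few lines of code. The block below is an illustrative, self-contained check; small differences from the published figures are due to rounding.

```python
# Self-contained check of the Table III statistics (a=171, b=72, c=11, d=7).
a, b, c, d = 171, 72, 11, 7
n = a + b + c + d                                   # 261 students
g1, g2 = a + b, c + d                               # row marginal totals
f1, f2 = a + c, b + d                               # column marginal totals

p_o = (a + d) / n                                   # 0.682
p_e = (f1 * g1 + f2 * g2) / n ** 2                  # 0.670
kappa = (p_o - p_e) / (1 - p_e)                     # approx 0.04
p_om = (min(g1, f1) + min(g2, f2)) / n              # 0.766
kappa_max = (p_om - p_e) / (1 - p_e)                # approx 0.29
pi, bi = abs(a - d) / n, abs(b - c) / n             # 0.628, 0.234

print(f"p_o={p_o:.3f} p_e={p_e:.3f} kappa={kappa:.3f} "
      f"kappa_max={kappa_max:.3f} PI={pi:.3f} BI={bi:.3f}")
```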
As an illustration of how the prevalences and the distribution of the marginals influence kappa, in Table III the number of students in the off-diagonal cells is fixed, and the remaining students are split evenly between the diagonal cells. One student is then moved from cell 'two:two' to 'one:one'; hence, the prevalence of each assessor categorising a student as one is increased. The overall concordance, however, remains constant, as the proportion of students awarded the same grade by each assessor does not change. The proportion of concordance is constant whereas the kappa statistic falls.

Although the proportion of concordance and kappa are not strictly comparable, as concordance does not account for agreement expected by chance, the large differences in their values indicate that the kappa may not be behaving as expected.

The proportion of students in the marginal g_1 and kappa are calculated as one student is moved as described. These two statistics are plotted in Figure 1 to demonstrate how there is a change in kappa despite no change in the concordance and, arguably, the agreement. The vertical line on the plot marks the proportion of students in g_1 for Table III, and the horizontal line marks the value of the kappa statistic for this table. The kappa statistic even falls below zero when the proportion in the marginal g_1 approaches 0.95. This negative kappa is interpreted by Viera and Garrett (2005) as less than chance agreement [9].

[Figure 1. Kappa statistic plotted against the proportion of students rated as one:one (g_1/N).]
Based on the motivational example, scenarios are imagined with a sample size of 261 where symmetry and imbalance impact on kappa. If a table has symmetry and is perfectly imbalanced, such as the hypothetical example in Table IV, this indicates some agreement between the assessors. The PI value is 0.157, a low value reflecting the distribution of students in the marginals for both assessors being balanced. The BI score of 0 is a consequence of the perfect symmetry of the table, g_1 = f_1 and g_2 = f_2. The proportion of concordance for this table is 0.609, which further indicates good agreement; however, the kappa is 0.189. Using the interpretations in Table I, this suggests poor agreement.

Table IV. Perfectly symmetrical (N = 261)

              1      2      Total
    1         100    51     151
    2         51     59     110
    Total     151    110    261

Table V (PI = 0.149, BI = 0.383) also has p_o = 0.609; however, this table is asymmetrical. One assessor categorises more students in category one and fewer students in category two when compared with the second assessor. There is a difference in the distribution of the marginal totals of the assessors. This suggests less agreement than in the symmetric Table IV, as intuitively assessors who agree will have similar marginal distributions [7]. The kappa statistic for Table V, however, is larger, κ = 0.391. This value indicates greater agreement despite the lack of agreement in the marginal totals.

Table V. Imperfectly asymmetrical (N = 261)

              1      2      Total
    1         60     101    161
    2         1      99     100
    Total     61     200    261

5.2. Adjusted Kappa

In the motivating example, PABAK = 0.364, much greater than the unadjusted kappa. The high prevalence and imbalance in the table are decreasing the kappa statistic, which can lead to incorrect inferences about the agreement between the two raters.

Varying the prevalences in Table III by moving a student from two:two to one:one as before, the PABAK statistic remains constant as the proportion of objects in the marginal g_1 changes. This illustrates how the adjusted kappa is less influenced by prevalence and the distribution of the marginals.
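The experiment behind Figure 1 and the PABAK comparison can be reproduced with the illustrative sketch below: holding the off-diagonal cells of Table III fixed and moving students from cell two:two to one:one leaves p_o and PABAK unchanged while κ falls and eventually turns negative.

```python
# Sketch of the Figure 1 experiment: off-diagonal cells fixed at b=72, c=11,
# and the remaining 178 students redistributed between the diagonal cells.
# The step of 22 keeps the printout short; the article moves one student at a time.
b, c, n = 72, 11, 261
diag = n - b - c                                    # 178 students on the diagonal

for a in range(diag // 2, diag + 1, 22):            # move students from d to a
    d = diag - a
    g1, f1 = a + b, a + c
    p_o = (a + d) / n                               # constant at 0.682
    p_e = (f1 * g1 + (b + d) * (c + d)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = 2 * p_o - 1                             # constant at 0.364
    print(f"g1/N={g1 / n:.3f}  p_o={p_o:.3f}  kappa={kappa:+.3f}  PABAK={pabak:.3f}")
```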

6. DISCUSSION

The motivational and hypothetical examples presented demonstrate some of the ways a kappa statistic can behave in an unexpected and disagreeable way. Feinstein and Cicchetti attribute these problems to the distribution of the marginal totals in a table and the effect this has on the proportion of agreement expected by chance, p_e [7].

Byrt et al. describe the properties of the data in terms of prevalences and bias [8]. The Bias Index and the Prevalence Index are proposed as ways of measuring how bias and prevalence might influence the marginal totals and impact on the interpretation of kappa.

The motivational example demonstrates how the kappa statistic can mask high levels of agreement seen when considering only the observed proportion of agreement. The large discrepancies between these two values can result in misleading inferences being made if relying solely on the kappa statistic.

The hypothetical examples show how it is not feasible to use standardised interpretations of kappa such as those given in Table I. When comparing two tables, a higher kappa value might actually be associated with a table that has intuitively less agreement.

Interpretations of the kappa statistic should be made on a case-by-case basis after considering all the characteristics of the data. These include the proportion of concordance and the symmetry and balance of the marginal totals. The PI, BI and maximum attainable kappa κ_max should also be calculated so appropriate interpretations can be made.

Alternatives such as the prevalence and bias adjusted kappa (PABAK) can be calculated to facilitate comparisons with the unadjusted kappa to assess how sensitive the statistic is to the distribution of the marginal totals [8]. It is unlikely that comparing kappa statistics from different tables and data will be of use due to these sensitivities.

7. CONCLUSION

In this Teacher's Corner article an example is used to describe the kappa statistic. Although a well-known and frequently used method for estimating levels of agreement, the kappa statistic has serious limitations. It is highly sensitive to the distribution of the marginal totals and can produce unreliable results.

The kappa statistic should be used with caution, and inferences made should account for these limitations. Kappas from different data are unlikely to be directly comparable, and a generic scale of interpretation should be used with these limitations in mind. It is recommended, when reporting the kappa statistic, to also report the proportion of concordance, PABAK, PI, BI and κ_max to allow a fully informed judgement about the usefulness of kappa regarding the level of agreement seen [10]. Each kappa should be considered and interpreted based on the context in hand.

Acknowledgements

This report is independent research arising from an NIHR Research Methods Fellowship, RMFI-2013-04-011 Goodacre, supported by the National Institute for Health Research. The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the National Institute for Health Research, the Department of Health or the University of Sheffield.

We would like to thank two anonymous reviewers for their valuable comments, which greatly improved the paper.

REFERENCES

[1] Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20:37–46.
[2] Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy 2005; 85:257–268.
[3] Bloch DA, Kraemer HC. 2 × 2 Kappa coefficients: measures of agreement or association. Biometrics 1989; 45:269–287.
[4] Altman DG. Practical Statistics for Medical Research. Chapman and Hall: London, 1991, pp. 404.
[5] Shrout PE. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research 1998; 7:301–317.
[6] Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 1968; 70:213–220.
[7] Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 1990; 43:543–548.
[8] Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. Journal of Clinical Epidemiology 1993; 46(5):423–429.
[9] Viera A, Garrett JM. Understanding interobserver agreement: the kappa statistic. Family Medicine 2005; 37(5):360–363.
[10] Chen G, Faris P, Hemmelgarn B, Walker RL, Quan H. Measuring agreement of administrative data with chart data using prevalence unadjusted and adjusted kappa. BMC Medical Research Methodology 2009; 9:5. DOI: 10.1186/1471-2288-9-5.
