
Available online at www.sciencedirect.com

International Journal of Nursing Studies 46 (2009) 141–142

www.elsevier.com/ijns
doi:10.1016/j.ijnurstu.2008.04.001
Commentary
Interrater reliability and the kappa statistic:
A comment on Morris et al. (2008)
Jan Kottner *
Centre for Humanities and Health Sciences, Department of Nursing Science, Charité Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
* Tel.: +49 30 450 529 054; fax: +49 30 450 529 900. E-mail address: jan.kottner@charite.de.
Received 26 March 2008; received in revised form 1 April 2008; accepted 1 April 2008

Keywords: Interrater reliability; Kappa

Establishing interrater reliability of instruments is an important issue in nursing research and practice. Morris et al.'s (2008) paper highlights the problem of choosing the appropriate statistical approach for interrater reliability data analysis, and the authors raise the important and relevant question of how to interpret kappa-like statistics such as Cohen's kappa (k) or weighted kappa (kw).

It is true that the so-called 'chance-corrected' k has frequently been criticised because its value depends on the prevalence of the rated trait in the sample (the 'base rate problem'). Consequently, even if two raters nearly or exactly agree, k-coefficients are near or equal to 0 if the prevalence of the rated characteristic is very high or very low. This contradicts the natural expectation that interrater reliability must then be high as well. However, this is neither a limitation nor a "main drawback" (p. 646). In fact, it is a desired property, because k-coefficients are classical interrater reliability coefficients (Dunn, 2004; Kraemer et al., 2002; Landis and Koch, 1975). In classical test theory, reliability is defined as the ratio of the variability between subjects (or targets) to the total variability, where the total variability is the sum of the subject (target) variability and the measurement error (Dunn, 2004; Streiner and Norman, 2003). Consequently, if the variance between the subjects is very small or even zero, the reliability coefficient will be near zero as well. Reliability coefficients therefore reflect not only the degree of agreement between raters but also the degree to which a measurement instrument can differentiate among individuals. The only reason to apply a particular instrument in a measurement situation is to differentiate between individuals (Streiner and Norman, 2003). If this very instrument is not able to detect any differences, one should question the instrument rather than the statistic (Shrout, 1998).

Morris et al. (2008) were unable to calculate kw for 18 out of 63 variables because of the "large number of 'constants' in the data" (p. 646). In other words, both participating experienced staff nurses were unable to detect any variability among the 30 clients on nearly one-third of the I-NMDS variables. This is an important finding. Users must be careful when applying these 18 I-NMDS items, because their degree of interrater reliability is still unknown; that is, a certain degree of relative precision of these items could not be demonstrated. It is questionable to apply such a measurement when it was already expected that the variability of the I-NMDS variables would be very low (p. 647).

Every interrater reliability coefficient is unavoidably linked to the population to which the instrument is applied (Kraemer, 1979; Streiner and Norman, 2003). Low interrater reliability coefficients are caused either by a lack of agreement between raters or by non-identified differences between the rated subjects. Nevertheless, the strength of interrater reliability coefficients is that they are an indicator of the quality and the clinical value of observations characterising individuals or subjects (Kraemer, 1979; Shrout, 1998).
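
A small numerical sketch may make these points concrete; the figures are hypothetical and are not taken from Morris et al. (2008). Suppose two raters independently judge the presence of a characteristic in 100 subjects: both rate 90 subjects as 'present', each rates 5 further subjects as 'present' that the other rates as 'absent', and no subject is rated 'absent' by both. The observed agreement is then p_o = 0.90 and the chance-expected agreement is p_e = 0.95 x 0.95 + 0.05 x 0.05 = 0.905, so that

\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.90 - 0.905}{1 - 0.905} \approx -0.05 .

Despite 90% observed agreement, k is essentially zero, because nearly all subjects fall into one category. The same behaviour follows from the classical test theory definition of reliability,

R = \frac{\sigma^2_{\mathrm{subjects}}}{\sigma^2_{\mathrm{subjects}} + \sigma^2_{\mathrm{error}}} ,

in which a between-subject variance close to zero forces R towards zero, no matter how closely the raters agree.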


For further discussions on this topic it would be helpful to differentiate strictly between interrater reliability and interrater agreement. High proportions of interrater agreement are important, but kappa-like statistics provide information about the clinical value of the ratings. Finally, calculated proportions of overall agreement are also dependent on sample characteristics: if the prevalence of a category is very high or very low, the overall proportion of agreement is likely to be inflated. To overcome this 'limitation', certain indices of specific agreement seem to be very helpful (Fleiss et al., 2003; Szklo and Nieto, 2007).
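
Applied to the hypothetical table above (90 concordant 'present' ratings, 5 + 5 discordant ratings, no concordant 'absent' ratings; the numbers remain illustrative rather than drawn from the I-NMDS data), indices of specific agreement of this kind (cf. Fleiss et al., 2003) read

p_{\mathrm{pos}} = \frac{2 \times 90}{2 \times 90 + 5 + 5} \approx 0.95 , \qquad p_{\mathrm{neg}} = \frac{2 \times 0}{2 \times 0 + 5 + 5} = 0 ,

showing that the apparently high overall agreement is carried entirely by the 'present' category, while agreement on 'absent' ratings is not demonstrated at all.
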
Conflict of interest

None.

References

Dunn, G., 2004. Statistical Evaluation of Measurement Errors: Design and Analysis of Reliability Studies, second ed. Hodder Arnold, London.
Fleiss, J.L., Levin, B., Paik, M.C., 2003. Statistical Methods for Rates and Proportions, third ed. Wiley, New Jersey.
Kraemer, H.C., 1979. Ramifications of a population model for k as a coefficient of reliability. Psychometrika 44 (4), 461–472.
Kraemer, H.C., Periyakoil, V.S., Noda, A., 2002. Kappa coefficients in medical research. Stat. Med. 21 (14), 2109–2129.
Landis, J.R., Koch, G.G., 1975. A review of statistical methods in the analysis of data arising from observer reliability studies (part I). Statistica Neerlandica 29, 101–123.
Morris, R., MacNeela, P., Scott, A., Treacy, P., Hyde, A., O'Brien, J., Lehwaldt, D., Byrne, A., Drennan, J., 2008. Ambiguities and conflicting results: the limitations of the kappa statistic in establishing the interrater reliability of the Irish nursing minimum data set for mental health: a discussion paper. Int. J. Nurs. Stud. 45 (4), 645–647.
Shrout, P.E., 1998. Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res. 7 (3), 301–317.
Streiner, D.L., Norman, G.R., 2003. Health Measurement Scales: A Practical Guide to their Development and Use, third ed. Oxford University Press, Oxford.
Szklo, M., Nieto, F.J., 2007. Epidemiology Beyond the Basics, second ed. Jones and Bartlett Publishers, Sudbury.
