

Inter-rater Reliability of a Dyslexia Screening Test


Mithun Haridas
Center for Research in Analytics and Technologies for Education (CREATE)
Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, India
mithunh@am.amrita.edu

Nirmala Vasudevan
Department of Physics, Amrita School of Arts & Sciences
Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, India
nirmalav@am.amrita.edu

Lakshmi Sasikumar
Department of Mathematics, Amrita School of Arts & Sciences
Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, India
lakshmivgra@gmail.com

Georg Gutjahr
Center for Research in Analytics and Technologies for Education (CREATE)
Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, India
georgcg@am.amrita.edu

Raghu Raman
School of Business
Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore, India
raghu@amrita.edu

Prema Nedungadi
Center for Research in Analytics and Technologies for Education (CREATE)
Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, India
prema@amrita.edu

Abstract—A standardized dyslexia screening test can help in identifying the vast number of undiagnosed dyslexics in Indian schools; needless to say, such a test should produce consistent and reproducible results. This study investigated reliability and consistency among raters of a Malayalam-English dyslexia screening test in India. Paper-based tests were administered to groups of students, and four raters evaluated the answer sheets of 208 second-grade students (ages 6–7). Inter-rater agreement, intraclass agreement, and internal consistency were calculated. Our findings include good agreement among raters' appraisals for most error types and tasks. Internal consistency for a few tasks was low, possibly because these tasks evaluated more than one skill. A few error types need to be redefined, and a few tasks need to be made more skill-specific, to enable unambiguous and fruitful interpretation by different raters in the future.

Keywords—reading disability; dyslexia; screening test; bilingual assessment; rater agreement; internal test consistency

I. INTRODUCTION

Dyslexia is a learning disability that affects reading and spelling [1], [2]; in many cases, it can be overcome with early diagnosis and remediation [2], [3], [4].

Screening tests afford a means of scanning large populations of children to identify those at risk of dyslexia. Existing tests are usually administered in English and standardized for native English speakers; these tests may not be suitable for identifying dyslexics in India because the manifestations of dyslexia vary depending on the language [5]. A typical Indian school student is exposed to more than one language, and English is often the second language.

Currently, there is no standardized dyslexia assessment for Malayalam, the primary language of Kerala State in southwest India, spoken by approximately 38 million people worldwide [6]. Malayalam is an orthographically transparent language, unlike English.

As the first step towards developing a standardized dyslexia assessment for Kerala school children, we proposed and administered a battery of tests in English and Malayalam [7], [8], [9]. We identified and categorized all student errors; error analysis helped in understanding spelling and reading challenges [8], [9].

A standardized assessment should maintain a high degree of uniformity in how it is administered and how the results are interpreted. Different evaluators/raters should give similar scores for a student. The current study analyzes the inter-rater agreement, intraclass agreement, and consistency for the English tasks in our Malayalam-English dyslexia screening test.

II. METHODS

A. Assessment Method

Paper-based screening tests of 30-minute duration were administered to groups of students in their regular classrooms. The tests consisted of six identical tasks in both Malayalam and English [9]. The present study analyzes performance in the following five tasks in English:

(1) Visual discrimination: The children were shown a letter/word and asked to identify the same letter/word from a group of letters/words.

(2) Spelling test: The classroom instructor read out ten words and also used the words in sentences to ensure that the children correctly understood the words. The children were asked to write the words.

(3) Vocabulary test (free writing): The children were given two minutes to write all the words that they could think of.

(4) Copying a passage: The children were asked to copy a passage in 2.5 minutes.

(5) Reading comprehension: The children were shown a passage and asked to write the answers to four questions based on the passage.

For the first four tasks, 34 kinds of errors were identified. Examples include letter reversal in the spelling test and incorrect spacing between words in the copying task. For the reading comprehension task, the number of correctly answered questions was recorded.

B. Evaluation of the Screening Test

Four raters graded the screening tests independently and recorded the error types made by the students. All raters received initial training on how to grade and decide error types.
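For concreteness, the recorded gradings can be pictured as a long-format table with one row per student, rater, and error type. The following is a hypothetical R sketch; the object and column names are invented for illustration and are not taken from the study's actual records.

# Hypothetical layout of the rating records; each of the four raters codes,
# for every student and task, how often each error type occurred.
ratings <- data.frame(
  student    = c(1, 1, 1, 1),
  rater      = c("R1", "R2", "R3", "R4"),
  task       = "SPL",                  # SPL = spelling test
  error_type = "omission_of_letters",
  count      = c(2, 2, 1, 2)           # three of the four raters agree here
)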
For the English spelling and vocabulary tests, the raters counted the number of spelling mistakes made by the students. Different types of spelling mistakes, such as the omission, addition, and substitution of letters, were identified. Further, phonological errors were distinguished from orthographical errors; a dyslexic student might use his/her phonological knowledge to spell a word and write bak instead of back [10]. This error is phonologically correct but orthographically wrong. On the other hand, a student may write cr instead of car [11]. In this instance, the student has omitted a phoneme.

For the visual discrimination task, the raters counted the number of correct and incorrect responses.

For the copying task, the raters analyzed the spelling errors and additionally graded characteristics of the handwriting, such as size consistency and the ability to write in straight lines, on a Likert scale.

Overall scores for each task were calculated based on the total number of errors committed on that task.

C. Study Population

208 second-grade students (103 boys, 105 girls) from six Government schools and one private school participated in this study. The students were 6–7 years old, spoke Malayalam at home, and learnt English as a second language at school. In the Government schools, the medium of instruction was Malayalam, while in the private school, the medium of instruction was English.

D. Statistical Methods

Inter-rater reliability refers to the consistency or agreement among different raters who evaluate the same set of assessments. Fleiss' kappa (κ) is a statistical measure that quantifies inter-rater reliability and can be used when there are more than two raters [12]. Values of κ range from -1 to +1; the higher the value, the higher the agreement between the raters. For the present study, κ values were calculated for the 34 error types in the five English tasks of our assessment.
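As a minimal sketch of this computation on toy data: with one categorical judgment per rater for each student response, Fleiss' κ can be obtained in R with kappam.fleiss() from the irr package. The irr package is our assumption for illustration; the study's own analysis used the psych package, as noted below.

# Toy example: rows are student responses, columns are the four raters,
# entries are the error category each rater assigned.
library(irr)

judgments <- data.frame(
  rater1 = c("omission", "reversal", "none", "omission"),
  rater2 = c("omission", "reversal", "none", "addition"),
  rater3 = c("omission", "reversal", "none", "omission"),
  rater4 = c("omission", "substitution", "none", "omission")
)

kappam.fleiss(judgments)   # prints Fleiss' kappa with a z statistic and p-value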
The intraclass correlation coefficient (ICC) is an index that measures how closely units in a group resemble each other [13]. Values range from 0 to 1, with larger values indicating stronger resemblance. ICC values were calculated to quantify rater agreement for the tasks.
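A minimal sketch of such a calculation with the ICC() function from the psych package, which the study cites; the rater scores below are simulated purely for illustration.

# Simulated example: 208 students, each scored by four raters on one task.
library(psych)

set.seed(1)
true_score <- rnorm(208)                                   # latent task score per student
scores <- sapply(1:4, function(r) true_score + rnorm(208, sd = 0.3))

ICC(scores)   # reports ICC1, ICC2, ICC3 and their average-rater versions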
Cronbach's alpha (α) can be used to measure the internal consistency of scores with multiple components [14]. Values of α range from 0 to 1, with higher values indicating higher consistency. Values of α were calculated for the five tasks in our assessment. Scores were standardized for the calculation of Cronbach's alpha.

R software was used to perform the statistical analysis. Fleiss' κ and ICC were calculated using the package psych.
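A matching sketch for the internal-consistency step, using psych::alpha() on standardized scores; the data and the five task-column names are again simulated for illustration only.

# Simulated example: five task scores per student, standardized before
# computing Cronbach's alpha, as described above.
library(psych)

set.seed(2)
ability <- rnorm(208)
tasks <- sapply(1:5, function(t) ability + rnorm(208))
colnames(tasks) <- c("visual", "spelling", "vocabulary",
                     "copying", "comprehension")

tasks_std <- scale(tasks)            # standardize each task's scores
alpha(as.data.frame(tasks_std))      # reports raw and standardized alpha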
III. RESULTS

Table 1 shows the Fleiss' κ values for rater agreement on the 34 error types. Values greater than 0.41 indicate moderate agreement; values greater than 0.61 indicate substantial agreement; and values greater than 0.81 indicate almost perfect agreement. The error types are sorted in ascending order of κ. The task for each error type is shown in the first column.

TABLE I. RATER AGREEMENT FOR ERRORS ON ENGLISH TASKS IN DYSLEXIA SCREENING TEST

Task  Error                                                                                    κ
COP   Spacing between words                                                                    0.65
COP   Mixing up words belonging to different sentences                                         0.66
COP   Consistency in size of letters                                                           0.67
COP   Addition of letters, e.g. writing bigi instead of big                                    0.70
COP   Omission of letters, e.g. writing wak instead of walk                                    0.71
VOC   Letters in the wrong sequence, e.g. writing alppe instead of apple                       0.72
VOC   Addition of letters, e.g. writing bigi instead of big                                    0.73
VOC   Substitution of one consonant with another, e.g. writing hig instead of big             0.76
COP   Number of incorrect words                                                                0.77
VOC   Omission of letters, e.g. writing wak instead of walk                                    0.77
COP   Number of missing words                                                                  0.78
SPL   Substitution of one consonant with another, e.g. writing hig instead of big             0.79
VOC   Substitution of one vowel with another, e.g. writing bag instead of big                 0.80
VOC   Letter reversal of lower case letters, e.g. writing bog instead of dog                  0.81
SPL   Letter reversal of lower case letters, e.g. writing bog instead of dog                  0.82
COP   Omission of punctuation marks, such as the comma                                         0.82
SPL   Not all sounds represented, e.g. writing had instead of hand                             0.83
VOC   No evidence of strategy                                                                  0.84
SPL   Substitution of one vowel with another, e.g. writing bag instead of big                 0.86
SPL   All sounds represented but poor grapheme-phoneme correspondence, e.g. writing sed instead of said  0.86
COP   Letter reversal of lower case letters, e.g. writing bog instead of dog                  0.87
VOC   Number of misspelled words                                                               0.89
SPL   No evidence of strategy                                                                  0.90
COP   Ability to write in a straight line                                                      0.90
SPL   Omission of letters, e.g. writing wak instead of walk                                    0.91
SPL   Number of misspelled words                                                               0.92
VSD   Letter reversal for upper case letters, e.g. choosing the mirror image of G instead of G  0.94
VOC   All sounds represented but poor grapheme-phoneme correspondence, e.g. writing sed instead of said  0.95
SPL   Addition of letters, e.g. writing bigi instead of big                                    0.96
VSD   Incorrect position of letters within a word, e.g. choosing gril instead of girl         0.98
VSD   Letter reversal for lower case letters, e.g. choosing bog instead of dog                1.00
SPL   Letters in the wrong sequence, e.g. writing alppe instead of apple                       1.00
VOC   Not all sounds represented, e.g. writing had instead of hand                             1.00
COP   Crossing out wrong entries a number of times                                             1.00

COP: Copying a passage; VOC: Vocabulary test; SPL: Spelling test; VSD: Visual discrimination.
Table 2 presents the intraclass correlation coefficients for rater agreement on the different tasks. Values between 0.50 and 0.75 indicate moderate agreement; values between 0.75 and 0.90 indicate good agreement; and values greater than 0.90 indicate excellent agreement [13].

TABLE II. INTRACLASS CORRELATION COEFFICIENTS (ICC) FOR RATER AGREEMENT ON DIFFERENT TASKS

Task                   ICC   95% Confidence Interval
Visual discrimination  0.97  0.96–0.98
Spelling test          0.97  0.96–0.97
Vocabulary             0.69  0.64–0.74
Copying a passage      0.82  0.78–0.85
Reading comprehension  0.97  0.97–0.98

Table 3 presents the Cronbach's α values for internal consistency of the different tasks. Average scores from the four raters were used for the calculation. Values below 0.6 indicate low internal consistency, and values above 0.8 indicate good consistency.

TABLE III. INTERNAL CONSISTENCY OF THE DIFFERENT TASKS

Task                   α     95% Confidence Interval
Visual discrimination  0.43  0.30–0.56
Spelling test          0.50  0.40–0.60
Vocabulary             0.31  0.17–0.45
Copying a passage      0.74  0.69–0.79
Reading comprehension  0.80  0.76–0.85

IV. DISCUSSION

While the rater agreement was at least moderate for all error types, some of the error types should be defined more clearly in the future to make the grading even more consistent. For example, in the vocabulary task, students had to write all the words that they could think of. Raters were at times uncertain about how to decide error types, since the target word was hard to guess; depending on the target word, the error type would change. For example, if a student wrote boll, it might have been a vowel substitution for the target word ball or a letter reversal for the target word doll.

The intraclass correlation coefficients showed good to excellent rater agreement for all tasks except vocabulary. The moderate agreement for the vocabulary task can again be explained by slight disagreement about the target words and about what constitutes a proper English word. For example, if a student wrote bog, it was difficult to understand whether the student had tried to write dog or bag or something else. For this reason, different raters might have categorized errors differently. This can be avoided by looking at other errors made by the same students.

Internal consistency was relatively high for the comprehension and copying tasks, but relatively low for the visual and word tasks. This suggests that the visual and word tasks required a variety of skills and that different students struggled with different skills. For these tasks, it may be better to define subtasks that have a higher internal consistency, can be interpreted more clearly, and are more informative about dyslexia.

V. CONCLUSIONS

For an assessment or screening tool to be of practical use, its reliability and consistency must be calculated. For the current study, we calculated the inter-rater agreement, intraclass agreement, and internal consistency of the dyslexia screening test. Overall, we found adequate agreement among the four raters on error types and on tasks.

This study found that the agreement among the four raters was moderate to excellent for all error types. The intraclass correlation coefficients also showed excellent rater agreement for the tasks.

Additionally, the study revealed that internal consistency was relatively high for the comprehension and copying tasks, and relatively low for the visual and word tasks. The manifestation of dyslexia may vary from one child to another, so students may be deficient in some of the skills required for the visual and word tasks but proficient in others, leading to low internal consistency. In future work, we will investigate the various skills required for each of these tasks to obtain consistent and meaningful categories of errors.

ACKNOWLEDGMENT

Our work derives its direction and inspiration from the Chancellor of our University, Sri Mata Amritanandamayi Devi. The first author was supported by the Visvesvaraya Ph.D. scholarship; this work was partly funded by the Department of Science and Technology–Cognitive Sciences Research Initiative (DST-CSRI), Government of India (DST SR/CSI/120/2013, DST SR/CSI/121/2013).

REFERENCES

[1] R. L. Peterson and B. F. Pennington, “Developmental dyslexia,” The Lancet, vol. 379(9830), pp. 1997–2007, May 2012.
[2] M. J. Snowling, “Early identification and interventions for dyslexia: a contemporary view,” Journal of Research in Special Educational Needs, vol. 13(1), pp. 7–14, Jan. 2013.
[3] P. Sanjanaashree, M. A. Kumar, and K. P. Soman, “Language learning for visual and auditory learners using scratch toolkit,” Proc. IEEE Int. Conf. Computer Communication and Informatics (ICCCI), IEEE Press, pp. 1–5, Jan. 2014, doi: 10.1109/ICCCI.2014.6921765.
[4] S. P. Vadanan and N. K. Prakash, “FPGA based LED cube to assist child dyslexia,” Proc. IEEE Int. Conf. on Computational Intelligence
and Computing Research (ICCIC), IEEE Press, pp. 1–3, Dec. 2017, doi: 10.1109/ICCIC.2017.8524513.
[5] P. T. Daniels and D. L. Share, “Writing system variation and its
consequences for reading and dyslexia,” Scientific Studies of Reading,
vol. 22(1), pp. 101–116, Jan. 2018.
[6] V. K. Narne, P. Prabhu, P. Thuvassery, R. Ramachandran, A. Kumar, R.
Raveendran, and S. A. Gafoor, “Frequency importance function for
monosyllables in Malayalam,” Hearing, Balance and Communication,
vol. 14(4), pp. 201–206, Oct. 2016.
[7] M. Haridas, N. Vasudevan, A. Iyer, R. Menon, and P. Nedungadi,
“Analyzing the Responses of Primary School Children in Dyslexia
Screening Tests,” Proc. IEEE Int. Conf. MOOCs, Innovation and
Technology in Education (MITE), IEEE Press, Nov. 2017, pp. 89–94,
doi: 10.1109/MITE.2017.00022.
[8] M. Haridas, N. Vasudevan, G. J. Nair, G. Gutjahr, R. Raman and P.
Nedungadi, “Spelling errors by normal and poor readers in a bilingual
Malayalam-English dyslexia screening test,” Proc. IEEE 18th Int. Conf.
Advanced Learning Technologies (ICALT), IEEE Press, July 2018, pp.
340–344, doi: 10.1109/ICALT.2018.00085.
[9] M. Haridas, N. Vasudevan, G. Gutjahr, R. Raman, and P. Nedungadi,
“Comparing English and Malayalam spelling errors of children using a
bilingual screening tool,” Proc. 4th Int. Congress on Information and
Communication Technology (ICICT 2019), Springer, Feb. 2019, doi:
10.1007/978-981-32-9343-4_34.
[10] R. H. Bahr, E. R. Silliman, V. W. Berninger, and M. Dow, “Linguistic
pattern analysis of misspellings of typically developing writers in grades
1-9,” Journal of Speech, Language, and Hearing Research, vol. 55, pp.
1587–1599, April 2012.
[11] L. A. Winkler, “Analysis of Patterns in Handwritten Spelling Errors
among Students with Various Specific Learning Disabilities,” Graduate
Theses and Dissertations, University of South Florida, 2016.
[12] J. Sim and C. C. Wright, “The kappa statistic in reliability studies: use,
interpretation, and sample size requirements,” Physical Therapy, vol.
85(3), pp. 257–268, March 2005.
[13] T. K. Koo and M. Y. Li, “A guideline of selecting and reporting
intraclass correlation coefficients for reliability research,” Journal of
Chiropractic Medicine, vol. 15(2), pp. 155–163, June 2016.
[14] P. Kline, The Handbook of Psychological Testing, 2nd ed., London:
Routledge, 2000, p. 13.
