
DEBRE MARKOS UNIVERSITY

COLLEGE OF SOCIAL SCIENCE AND HUMANITIES

DEPARTMENT OF ENGLISH LANGUAGE AND LITERATURE

COURSE TITLE: RESEARCH METHODS IN ELT (TEFL 523)


TERM PAPER TITLE: RESEARCH RELIABILITY

BY: Getaneh Yekoye ID NO___________

Submitted to: Dr. Yoseph Zewdu

Date: August 9, 2019
Acknowledgements

First and foremost, I would like to express my heartfelt gratitude to my instructor, Dr. Yoseph Zewdu, for assigning this term paper: through it we practiced how to write a term paper, what contents it should include, and the format it should follow. For this reason, I want to say thank you.
I. INTRODUCTION

The reliability of test scores is the extent to which they are consistent across different occasions of testing, different editions of the test, or different raters scoring the test taker's responses. This paper explains the meaning of several terms associated with the concept of test reliability: true score, error of measurement, alternate-forms reliability, inter-rater reliability, internal consistency, reliability coefficient, standard error of measurement, classification consistency, and classification accuracy. It also explains the relationship between the number of questions, problems, or tasks in the test and the reliability of the scores.

Key words: reliability, true score, error of measurement, alternate-forms reliability, inter-rater reliability, internal consistency, reliability coefficient, standard error of measurement, classification consistency, classification accuracy.
1. What is Research Reliability?

It refers to consistency throughout a series of measurements. For example, if a respondent gives a particular response to an item, he is expected to give the same response to that item even if he is asked repeatedly. If he changes his response to the same item, consistency is lost. The researcher should therefore frame the items in a questionnaire in such a way that they yield consistent, reliable responses.

Reliability is defined in different ways by different researchers, but the common thread is that measurement under stable conditions should produce consistent results that do not vary. Reliability reflects consistency and replicability over time. Reliability describes a test that is dependable and consistent (Brown, 2004). It is not possible to trust test results without reliability (Davies, 1999). A reliable test, when administered to different students on different occasions, yields similar results. Like validity, reliability is a condition that can be represented numerically with a reliability coefficient. Its specific calculation procedure is not discussed here, for it goes beyond the scope of this paper. Reliability can also be measured by comparing the actual test with an external, highly dependable test (Davies, 1999). However, for teaching and classroom purposes, reliability can be analyzed by considering factors such as students, raters, test creation, and test administration (Brown, 2004).

Researchers have attempted to find ways of minimizing unreliability in tests. For instance, Uysal's (2010) paper critically reviews the scoring procedure of the writing component of the International English Language Testing System (IELTS). IELTS authorities implement strategies such as a multiple rating system, training sessions for raters, and pairing single examiners with a senior examiner to rate a paper. Uysal comments that, to avoid reliability issues in the grading of writing tasks, multiple raters should be used permanently and reliability coefficients calculated regularly.

From this we can generalize that the reliability of any research is the degree to which it gives an accurate score across a range of measurements; it can be viewed as repeatability or consistency. More precisely, reliability is the extent to which test scores are consistent with respect to one or more sources of inconsistency: the selection of specific questions, the selection of raters, and the day and time of testing. Every reliability statistic refers to a particular source of inconsistency (or a particular combination of sources), which it includes as measurement error. We cannot understand a reliability statistic unless we know which sources of inconsistency it includes as measurement error.

2. Types of Reliability
In quantitative research, the measurement procedure consists of the variables we measure, whether a single variable or a number of variables. In general, there are four common types of research reliability:

1. Test-retest reliability is a measure of reliability obtained by administering the same test twice, over a period of time, to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.

Example: A test designed to assess student learning in English could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the reliability of the scores.
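
To make this concrete, here is a minimal Python sketch of the correlation step; the score lists are hypothetical illustration data, and statistics.correlation assumes Python 3.10 or later.

```python
# A minimal sketch of test-retest reliability: correlate the scores
# from two administrations of the same test to the same students.
# All score values are hypothetical illustration data.
from statistics import correlation  # available in Python 3.10+

# Scores for the same ten students, one week apart (hypothetical).
time1 = [55, 62, 70, 48, 81, 66, 59, 74, 69, 52]
time2 = [58, 60, 72, 50, 79, 68, 57, 75, 66, 54]

r = correlation(time1, time2)  # Pearson correlation coefficient
print(f"Test-retest reliability: {r:.2f}")
```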

2. Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.

Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions into two sets, which would represent the parallel forms.
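
The splitting-and-correlating procedure can be sketched as follows; the response data are simulated under a simple ability model, and every parameter here is an illustrative assumption rather than part of any published method.

```python
# A minimal sketch of parallel-forms reliability: randomly split a
# hypothetical item pool into two forms, score each student on both
# forms, and correlate the two totals. All data are simulated.
import random
from statistics import correlation  # available in Python 3.10+

random.seed(42)
n_students, n_items = 30, 40

# Simulated ability model: each student answers any item correctly
# with probability equal to his or her ability (an assumption).
abilities = [random.uniform(0.3, 0.9) for _ in range(n_students)]
responses = [[random.random() < p for _ in range(n_items)] for p in abilities]

# Randomly split the item pool into two parallel forms.
item_order = list(range(n_items))
random.shuffle(item_order)
form_a, form_b = item_order[:n_items // 2], item_order[n_items // 2:]

scores_a = [sum(row[i] for i in form_a) for row in responses]
scores_b = [sum(row[i] for i in form_b) for row in responses]
print(f"Parallel-forms reliability: {correlation(scores_a, scores_b):.2f}")
```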

3. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.

Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.
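
Rater agreement can be quantified in several ways; the sketch below computes simple percent agreement and Cohen's kappa, a standard chance-corrected agreement statistic that the sources above do not themselves prescribe. The pass/fail ratings are hypothetical.

```python
# A minimal sketch of two inter-rater agreement statistics: percent
# agreement and Cohen's kappa (which corrects for chance agreement).
# The ratings below are hypothetical pass/fail judgments.
from collections import Counter

rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over categories.
c1, c2 = Counter(rater1), Counter(rater2)
expected = sum((c1[k] / n) * (c2[k] / n) for k in c1.keys() | c2.keys())

kappa = (observed - expected) / (1 - expected)
print(f"Percent agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```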

4. Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. It can be divided into two subtypes, both illustrated in the sketch after this list:
A. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.

B. Split-half reliability begins by "splitting in half" all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two "sets" of items. The entire test is administered to a group of individuals, the total score for each "set" is computed, and finally the split-half reliability is obtained by determining the correlation between the two total set scores.
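
Here is a minimal Python sketch of both subtypes; the item-response matrix is hypothetical, and the split-half coefficient is stepped up with the standard Spearman-Brown correction for halved test length, which the description above leaves implicit.

```python
# A minimal sketch of the two internal-consistency subtypes:
# A) average inter-item correlation, and B) split-half reliability
# with the Spearman-Brown step-up. All item scores are hypothetical.
from itertools import combinations
from statistics import correlation  # available in Python 3.10+

# Hypothetical item scores: rows are students, columns are six items
# probing the same construct.
items = [
    [4, 5, 4, 3, 4, 5],
    [2, 1, 2, 2, 1, 2],
    [5, 4, 5, 5, 4, 4],
    [3, 3, 2, 3, 3, 2],
    [1, 2, 1, 1, 2, 1],
    [4, 4, 5, 4, 5, 4],
]
cols = list(zip(*items))  # one score vector per item

# A. Average inter-item correlation over all pairs of items.
pair_rs = [correlation(a, b) for a, b in combinations(cols, 2)]
avg_r = sum(pair_rs) / len(pair_rs)

# B. Split-half: total the odd- and even-numbered items, correlate the
# two halves, then apply the Spearman-Brown correction.
odd = [sum(row[::2]) for row in items]
even = [sum(row[1::2]) for row in items]
r_half = correlation(odd, even)
split_half = 2 * r_half / (1 + r_half)

print(f"Average inter-item r: {avg_r:.2f}")
print(f"Split-half reliability: {split_half:.2f}")
```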

Finally, reliability is the extent to which test scores are not affected by chance factors, by the luck of the draw. It is the extent to which the test taker's score does not depend on the specific day and time of the test (as compared with other possible days and times of testing), the specific questions or problems that were on the edition of the test that the test taker took, or the specific raters who rated the test taker's responses (if the scoring process involved any judgment). In short, reliability tells us how consistently the test scores measure something.

2.1 Reliability Is Consistency


Test scores are reliable to the extent that they are consistent over different occasions of testing; different editions of the test, containing different questions or problems designed to measure the same general skills or types of knowledge; and different scorings of the test takers' responses by different raters. By now, you may realize that there is more than one kind of reliability. Different kinds of reliability refer to different kinds of consistency. Inter-rater reliability, for example, is the consistency of the scores awarded by different raters to the same responses.
3. Measurement error

Reliability is the extent to which test scores are consistent with respect to one or more sources of inconsistency: the selection of specific questions, the selection of raters, and the day and time of testing. Every reliability statistic refers to a particular source of inconsistency (or a particular combination of sources), which it includes as measurement error. We cannot understand a reliability statistic unless we know which sources of inconsistency it includes as measurement error. The term "measurement error" includes only those influences that can be assumed to operate randomly. Some kinds of knowledge and skills that affect the test taker's score are not part of what the test is intended to measure.
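
The true-score model behind this idea, defined in the conclusion below as an observed score being a true score plus an error of measurement, can be illustrated with a small simulation; all numbers here are simulated assumptions, not real test data.

```python
# A minimal sketch of the classical true-score model: each observed
# score is a true score plus random measurement error, and reliability
# is the share of observed-score variance that is true-score variance.
# All parameters below are arbitrary illustration values.
import random
from statistics import variance

random.seed(1)
true_scores = [random.gauss(70, 10) for _ in range(5000)]  # true-score SD = 10
observed = [t + random.gauss(0, 5) for t in true_scores]   # error SD = 5

reliability = variance(true_scores) / variance(observed)
print(f"Reliability ~ {reliability:.2f}")  # about 100 / (100 + 25) = 0.80
```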

3.1 Reliability and Sampling


When you take a test, the questions or problems you see are only a sample of the questions or problems that could have been included. If you took a different edition of the test, you would see a different sample of questions or problems.
4.Test Length and Reliability
A longer test is a more reliable test. A longer test provides a larger sample of the kinds of questions or problems the test consists of, and it gives us a better look at the test taker's responses to those kinds of questions or problems. If each question or problem is a separate task, then increasing the number of questions or problems on the test will increase the alternate-forms reliability of the scores.
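
This relationship is usually quantified with the Spearman-Brown prophecy formula, which the paper does not state explicitly; the sketch below applies it to hypothetical values.

```python
# A minimal sketch of the Spearman-Brown prophecy formula: predicted
# reliability when a test is lengthened by a factor k, assuming the
# added items behave like the existing ones.
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a test lengthened by factor k."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose scores have reliability .70 (hypothetical):
print(f"Predicted reliability: {spearman_brown(0.70, 2):.2f}")  # ~0.82
```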
5. The Reliability Coefficient
The reliability coefficient is a number that can range from .00 to 1.00. A value of 1.00 indicates perfect consistency, and a value of .00 indicates a complete lack of consistency. Scores assigned to the test takers by a completely random process would have a reliability coefficient very close to .00. The reliability coefficient is actually a correlation: it is the correlation between the test takers' scores on two applications of the testing process or, in the case of inter-rater reliability, two applications of the scoring process. Although correlations can vary from -1.00 to 1.00, we assume that the scores from two applications of the same testing process would not correlate negatively. The reliability coefficient always refers to a group of test takers. It answers the question, "How consistent are the test takers' relative positions in the group, as indicated by their scores?" The answer to this question can differ substantially from one group to another. If the test takers in a group are nearly equal in the abilities the test measures, small changes in their scores can easily change their relative positions in the group; consequently, the reliability coefficient will be low. The more the test takers differ in ability, the less likely it is that small changes in their scores will affect their relative positions in the group, and the higher the reliability coefficient will be. You cannot compare reliability coefficients for two different tests unless they refer to the same population of test takers.
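
This group dependence can be demonstrated with a small simulation: the same testing process, applied to a more homogeneous group, yields a lower coefficient. The ability and error parameters are arbitrary illustration values.

```python
# A minimal sketch of how group heterogeneity affects the reliability
# coefficient: correlate two simulated administrations for groups with
# wide versus narrow ability spreads. All parameters are illustrative.
import random
from statistics import correlation  # available in Python 3.10+

random.seed(2)

def simulate_reliability(ability_sd: float, n: int = 2000,
                         error_sd: float = 5.0) -> float:
    """Correlation between two administrations for one simulated group."""
    abilities = [random.gauss(70, ability_sd) for _ in range(n)]
    form1 = [a + random.gauss(0, error_sd) for a in abilities]
    form2 = [a + random.gauss(0, error_sd) for a in abilities]
    return correlation(form1, form2)

print(f"Heterogeneous group (ability SD 15): {simulate_reliability(15):.2f}")
print(f"Homogeneous group (ability SD 5):    {simulate_reliability(5):.2f}")
```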

6. Conclusion
Here is a summary of the most important things to know about test reliability. Test
takers' scores are affected to some degree by factors that are essentially random in the
way they affect the scores. Reliability is consistency: the extent to which test takers'
scores are consistent over different days of testing, different editions of the test,
containing different questions or tasks, and different raters scoring the test takers’
responses. The reliability of test scores is the extent to which they measure something
consistently. Test scores cannot be valid unless they are reliable. A test taker’s true
score is the average of the scores the test taker would get over some specified set of
conditions. Error of measurement is the difference between the score a test taker
actually got and that test taker’s true score. We cannot know an individual test taker’s
true score or error of measurement. However, we can estimate statistics for true scores
and errors of measurement in a large group of test takers. The questions, problems, or
tasks on a single edition of a test are a sample of those that could possibly have been
included. We can get a better sample of the test taker’s performance by enlarging the
test to include more questions, problems, or tasks.
REFERENCES

Allison, D. (1999). Language Testing and Evaluation: An Introductory Course. Singapore: Singapore University Press.

Brown, H.D. (1994). Principles of Language Learning and Teaching (3rd ed.). New Jersey: Prentice Hall.

Brown, H.D. (2004). Language Assessment: Principles and Classroom Practices. White Plains, NY: Pearson Education Longman.

Davies, A. (1999). Reliability. In Fulcher, G. and Thrasher, R., Language Testing Videos. In association with ILTA. Available: http://languagetesting.info

Educational Testing Service (2007). Developing an ESL Placement Test for Community College Students. Available: www.ets.org/Media/Research/pdf/conf_achgap_cc_young.pdf

Uysal, H.H. (2010). A Critical Review of the IELTS Writing Exam. ELT Journal, 64(3), 314-320.
