Reliability & Validity
http://www.creative-wisdom.com/teaching/assessment/reliability.html
Temporal stability: This type of reliability is estimated by administering the same form of a test to the same group of examinees on two or more separate occasions (test-retest). On many occasions this approach is not practical because repeated measurement could change the behavior of the examinees. For example, examinees might adapt to the test format and thus tend to score higher on later administrations. This consequence is known as the carry-over effect. Hence, careful implementation of the test-retest approach is strongly recommended (Yu, 2005).
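To make this concrete, here is a minimal Python sketch (using NumPy) that estimates temporal stability as the Pearson correlation between two administrations of the same test. The scores are invented for illustration only.

    # Minimal sketch: test-retest reliability as the Pearson correlation
    # between two administrations of the same test; scores are invented.
    import numpy as np

    time1 = np.array([12, 15, 9, 20, 14, 17, 11, 18])   # first administration
    time2 = np.array([13, 14, 10, 19, 15, 18, 12, 17])  # same examinees, retested

    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the test-retest reliability coefficient.
    r = np.corrcoef(time1, time2)[0, 1]
    print(f"Test-retest reliability: {r:.3f}")

A coefficient close to 1 suggests stable scores across occasions, although a strong carry-over effect can inflate everyone's second score without necessarily lowering the correlation.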
Form equivalence: This approach requires two different test forms based on the same content (alternate forms). After the alternate forms have been developed and validated by test equating, they can be used with different examinees. In high-stakes examinations it is very common to employ this method to preempt cheating: because the two forms have different items, an examinee who took Form A earlier could not "help" another student who takes Form B later.
Internal consistency: This type of reliability estimate is computed from the scores obtained on a single administration of a test or survey (Cronbach's alpha, KR-20, split-half). Consider this scenario: respondents are asked to rate the statements in an attitude survey about computer anxiety. One statement is: "I feel very negative about computers in general." Another statement is: "I enjoy using computers." People who strongly agree with the first statement should strongly disagree with the second statement, and vice versa. If several respondents rate both statements high or both low, the responses are said to be inconsistent and patternless. The same principle can be applied to a test. When no pattern is found in the students' responses, the test is probably too difficult, and as a result the examinees just guess the answers randomly.
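As an illustration, here is a minimal Python sketch that computes Cronbach's alpha from a small, invented rating matrix. Note that a negatively worded item such as "I feel very negative about computers in general" would first be reverse-scored (e.g., 6 minus the rating on a 5-point scale) so that all items point in the same direction.

    # Minimal sketch of Cronbach's alpha; rows are respondents, columns
    # are items, and all ratings are invented for illustration.
    import numpy as np

    ratings = np.array([  # 5-point scale, 6 respondents x 4 items
        [5, 4, 5, 4],
        [4, 4, 4, 5],
        [2, 1, 2, 2],
        [5, 5, 4, 4],
        [1, 2, 1, 1],
        [3, 3, 4, 3],
    ], dtype=float)

    def cronbach_alpha(items):
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    print(f"Cronbach's alpha: {cronbach_alpha(ratings):.3f}")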
For example, the reliability of a writing-skill test score is affected by the raters, the mode of discourse, and several other factors (Parkes, 2000).
Face validity: Face validity simply means that the validity is taken at face value. As a check on face validity, test/survey items are sent to teachers or other subject matter experts to obtain suggestions for modification. Because of its vagueness and subjectivity, psychometricians abandoned this concept long ago. However, outside the measurement arena, face validity has come back in another form. In discussing the validity of a theory, Lacity and Jansen (1994) define validity as making common sense, and being persuasive and seeming right to the reader. For Polkinghorne (1988), the validity of a theory refers to results that have the appearance of truth or reality.
The internal structure of things may not concur with their appearance, and professional knowledge often runs counter to common sense. The criteria of validity in research should therefore go beyond "face," "appearance," and "common sense."
Moreover, content experts very often fail to identify the learning objectives of a subject. Take the following question from a philosophy test as an example:
When was the founder and CEO of Microsoft, William Gates III, born?
a. 1949
b. 1953
c. 1957
d. None of the above
Construct validation as unification: The criterion and content models tend to be empirically oriented, while the construct model is inclined to be theoretical. Nevertheless, all models of validity require some form of interpretation: What is the test measuring? Can it measure what it intends to measure? In standard scientific inquiries, it is important to explicitly formulate an interpretative (theoretical) framework and then to subject it to empirical challenges. In this sense, theoretical construct validation can be considered a unified framework for validity (Kane, 2001).
It has been a tradition that multiple factors are introduced into a test to improve validity, even though doing so decreases internal-consistency reliability.
Inspired by Moss, who asked whether there can be validity without reliability, Mislevy went further to ask whether there can be reliability without reliability indices.
In many situations we do not present just one single argument; rather, problem solving involves a chain of arguments with multiple pieces of evidence.
Reliability is not a property of the test; rather, it is attached to the property of the data. In this sense, psychometrics is datametrics.
Tests themselves are not reliable; scores are. It is therefore important to examine score reliability in virtually all studies.
In a 2004 article, Lee Cronbach, the inventor of Cronbach's alpha as a way of measuring reliability, reviewed the historical development of the coefficient. He asserted, "I no longer regard the formula (of Cronbach's alpha) as the most appropriate way to examine most data. Over the years, my associates and I developed the complex generalizability (G) theory" (p. 403). Discussion of G theory is beyond the scope of this document. Nevertheless, Cronbach did not object to the use of Cronbach's alpha, but he recommended that researchers take the following into consideration when employing this approach:
Independence of sampling
Heterogeneity of content
How the measurement will be used: Decide whether future uses of the
instrument are likely to be exclusively for absolute decisions, for differential
decisions, or both.
The conventional view (content, criterion, construct) is fragmented because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of using test scores.
Validity is not a property of the test or assessment, but rather it is about the meaning of
the test scores.
Critics argued that consequences should not be a component of validity because test developers should not be held responsible for consequences of misuse that are out of their control; rather, accountability should be tied to the misuser. Messick (1998) counter-argued that the social consequences of score interpretation include the value implications of the construct, and that these implications must be addressed by evaluating the meaning of the test score. While test developers should not be held accountable for the misuse of tests, they should still be attentive to the unanticipated consequences of legitimate score interpretation.
Some scholars argue that the traditional view that "reliability is a necessary but not a sufficient condition of validity" is incorrect. This school of thought conceptualizes reliability as invariance and validity as unbiasedness. A sample statistic may have an expected value over samples equal to the population parameter (unbiasedness) yet have very high variance owing to a small sample size. Conversely, a sample statistic can have very low sampling variance but an expected value far from the population parameter (high bias). In this view, a measure can be unreliable (high variance) but still valid (unbiased).
[Figure: sampling distributions of a sample statistic, contrasting an unbiased statistic with high variance against a biased statistic with low variance.]
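A minimal Python simulation (with invented parameters) makes the distinction concrete: the mean of a tiny sample is unbiased but varies widely across samples, whereas an estimate shrunken toward an arbitrary value varies little but is biased.

    # Minimal sketch of unbiasedness versus variance. Two estimators of a
    # population mean whose true value is 100; all numbers are invented.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, n_reps = 100.0, 10_000

    # Estimator 1: mean of a sample of size 4 (unbiased, high variance).
    sample_means = rng.normal(true_mean, 15, size=(n_reps, 4)).mean(axis=1)

    # Estimator 2: shrink hard toward an arbitrary value of 90
    # (low variance, but biased).
    shrunken = 0.2 * sample_means + 0.8 * 90.0

    for name, est in [("unbiased, high variance", sample_means),
                      ("biased, low variance", shrunken)]:
        print(f"{name}: mean = {est.mean():.1f}, sd = {est.std(ddof=1):.2f}")

In this view the first estimator is valid but unreliable, and the second is reliable but invalid.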
Consider this analogy: I buy a drug that has been approved by the FDA, and when my friend asks me whether it heals me, I tell him, "I am taking a drug approved by the FDA, and therefore I don't need to know whether it works for me or not!" Such reasoning is clearly flawed. Likewise, a responsible evaluator should still check an instrument's reliability and validity with his/her own data and make any modifications if necessary.
Low reliability is less detrimental in a pretest. In a pretest, where subjects have not yet been exposed to the treatment and are thus unfamiliar with the subject matter, low reliability caused by random guessing is expected. One easy way to overcome this problem is to include "I don't know" among the multiple choices. In an experimental setting where students' responses would not affect their final grades, the experimenter should explicitly instruct students to choose "I don't know" instead of making a guess if they really don't know the answer. Low reliability is a signal of high measurement error, reflecting a gap between what students actually know and the scores they receive. The choice "I don't know" can help close this gap, as the simulation sketch below illustrates.
Last Updated: 2012
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1985). Standards for educational and
psychological testing. Washington, DC: Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun (Eds.),
Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the perspective
of replications. Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Parkes, J. (2000). The relationship between the reliability and cost of performance
assessments. Education Policy Analysis Archives, 8. Retrieved from
http://epaa.asu.edu/epaa/v8n16/
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Polkinghorne, D. E. (1988). Narrative knowing and the human sciences. Albany: State
University of New York Press.
Salvucci, S., Walter, E., Conley, V., Fink, S., & Saba, M. (1997). Measurement error studies at the National Center for Education Statistics. Washington, DC: U.S. Department of Education.
Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.
Yu, C. H. (2005). Test-retest reliability. In K. Kempf-Leonard (Ed.), Encyclopedia of social measurement (Vol. 3, pp. 777-784). San Diego, CA: Academic Press.