Statistics Refresher Validity Reliability PPT Content
Test scores are frequently expressed as numbers, and statistical tools are used to describe, make inferences from, and draw conclusions about numbers.

Why do we need statistics?
Statistical methods serve two important purposes in the quest for scientific understanding.
1. Statistics are used for purposes of description.
2. We can use statistics to make inferences, which are logical deductions about events that cannot be observed directly.

Descriptive Statistics are methods used to provide a concise description of a collection of quantitative information.
Inferential Statistics are methods used to make inferences from observations of a sample to a population.

Scales of Measurement
Measurement: the application of rules for assigning numbers to objects.
The rules are the specific procedures used to transform qualities of attributes into numbers.
1. Nominal Scale: has none of the three properties
2. Ordinal Scale: has magnitude only
3. Interval Scale: has magnitude and equal intervals, but no absolute zero
4. Ratio Scale: has all three properties (Kelvin)

Type of Scale   Magnitude   Equal Intervals   Absolute 0
Nominal         No          No                No
Ordinal         Yes         No                No
Interval        Yes         Yes               No
Ratio           Yes         Yes               Yes

Frequency Distributions
A distribution of scores summarizes the scores for a group of individuals.
The frequency distribution displays scores on a variable or a measure to reflect how frequently each value was obtained; it defines all the possible scores and determines how many people obtained each of those scores.
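A frequency distribution like the one described above can be tallied directly; here is a minimal sketch in Python (the scores are made-up illustration data):

```python
from collections import Counter

# Hypothetical quiz scores for a group of 12 examinees (illustration only)
scores = [3, 5, 4, 5, 2, 5, 4, 3, 5, 4, 3, 5]

# Counter tallies how frequently each value was obtained
freq = Counter(scores)

# List every obtained score and how many people obtained it
for score in sorted(freq):
    print(score, freq[score])
```

The same tally is what you would plot as a histogram or list in a frequency table.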
Probability Sampling
- Simple Random Sampling: everyone has an equal chance of being selected as part of the sample
- Systematic Sampling: every nth item or person is picked
- Stratified Sampling: random selection within predefined groups
- Cluster Sampling: groups, rather than individual units of the target population, are selected randomly

Non-Probability Sampling
- Convenience Sampling: participants are selected based on their availability
- Quota Sampling: specifies who should be recruited for a survey according to certain groups or criteria
- Purposive Sampling: participants are chosen consciously based on their knowledge and understanding of the research question
- Snowball or Referral Sampling: people recruited to be part of a sample are asked to invite those they know to take part

Standard Error of Measurement (SEM)
Estimates how repeated measures of a person on the same instrument tend to be distributed around his or her "true" score; the true score is always unknown because no measure can be constructed that provides a perfect reflection of the true score.

Domain Sampling Theory
Another central concept in CTT; assumes that the items that have been selected for any one test are just a sample of items from an infinite domain of potential items; considers the problems created by using a limited number of items to represent a larger and more complicated construct.
- As the sample gets larger, it represents the domain more and more accurately.
- The greater the number of items, the higher the reliability.
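The probability-sampling schemes above can be sketched with the standard library; a minimal illustration (the population of 100 members and the two strata are made-up assumptions):

```python
import random

random.seed(42)  # reproducible illustration

population = list(range(1, 101))  # 100 hypothetical members, labeled 1-100

# Simple random sampling: every member has an equal chance of selection
simple = random.sample(population, 10)

# Systematic sampling: pick every nth member after a random start
n = 10
start = random.randrange(n)
systematic = population[start::n]

# Stratified sampling: random selection within predefined groups (strata)
strata = {"A": population[:50], "B": population[50:]}
stratified = {name: random.sample(group, 5) for name, group in strata.items()}

print(len(simple), len(systematic), len(stratified["A"]))
```

Cluster sampling would instead apply `random.sample` to a list of whole groups and keep every member of the chosen groups.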
2.1. Reliability
- pertains to the consistency of test measurement
- can be estimated from the correlation of the observed test score with the true score
- a test may be reliable in one context and unreliable in another

In psychological testing, the word error does not imply that a mistake has been made. Error implies that there will always be some inaccuracy in our measurements. In other words, tests that are relatively free of measurement error are reliable, and tests that have too much measurement error are considered unreliable.

Item Response Theory (IRT)
A way to analyze responses to tests or questionnaires with the goal of improving measurement accuracy and reliability; used to focus on the range of item difficulty that helps assess an individual's ability level.
- item difficulty = item easiness

Sources of Error Variance
- Test Construction: item sampling or content sampling, terms that refer to variation among items within a test as well as variation among items between tests.
- Test Administration: may influence the test taker's attention or motivation.
- Test Scoring and Interpretation: individually administered tests still require scoring by trained personnel.

History & Theory of Reliability
Abraham De Moivre introduced the basic notion of sampling error.
Karl Pearson developed the product-moment correlation.
Charles Spearman worked out most of the basics of contemporary reliability theory in his 1904 article "The Proof and Measurement of Association between Two Things."
Testers use "rubber yardsticks" to estimate measurements.

Basics of Test Score Theory
The difference between the score we obtain and the score we are really interested in is the error of measurement: X - T = E.

Classical Test Score Theory: assumes that each person has a true score that would be obtained if there were no errors in measurement (X = T + E).
- The difference between the true score and the observed score results from measurement error.

Test-Retest Reliability
1 group, 1 test, 2 administrations
Interval: 2 weeks to 6 months
o used to evaluate the error associated with administering a test at two different times.
o appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, e.g., a personality trait.
o Carryover Effect: occurs when the first testing session influences the scores from the second session (short span of interval).
o Practice Effect: the test-retest correlation usually overestimates the true reliability (a test-taker concern).
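The classical model X = T + E can be simulated to show how repeated observed scores scatter around a fixed true score; a minimal sketch (the true score of 50 and error spread of 3 are arbitrary illustration values):

```python
import random
import statistics

random.seed(1)

TRUE_SCORE = 50  # T: fixed, but unknowable in practice
ERROR_SD = 3     # spread of the random measurement error E

# Simulate many repeated administrations: X = T + E
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(10_000)]

# Observed scores cluster around the true score; their standard
# deviation approximates the standard error of measurement
print(round(statistics.mean(observed), 1))
print(round(statistics.stdev(observed), 1))
```

The mean of the simulated observed scores lands near 50 and their spread near 3, which is the intuition behind the SEM described earlier.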
Test-Retest Procedure
Sample population = (Test A) [1] + (Test A) [2]
Administer the psychological test
Get the test result
Wait for the interval (time gap)
Re-administer the psychological test
Get the test result
Correlate the results

Disadvantages of Test-Retest
Possible better performance
Checking / knowing the answers
Practice effect

Coefficient of stability: the estimate of test-retest reliability when the interval between testing is greater than six months.

Split-Half Reliability
o to shorten the items, the odd-even system or the top-bottom method can be used.
  - top-bottom method: 1st half (25 items), 2nd half (25 items)
  - odd-even system: 1st half (odd items), 2nd half (even items)
o to adjust the half-test reliability, use the Spearman-Brown formula.
o allows a test developer or user to estimate internal consistency reliability from the correlation of two halves of a test.
o Spearman-Brown formula: r' = 2r / (1 + r), where r' is the estimated correlation between the two halves of the test (if each had had the total number of items) and r is the correlation between the two halves of the test.
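The Spearman-Brown adjustment above is a one-line function; a minimal sketch (the .60 input is an arbitrary illustration value):

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

# A half-test correlation of .60 adjusts upward for the full-length test
print(round(spearman_brown(0.60), 2))  # → 0.75
```

Note that the adjusted value is always at least as large as the half-test correlation, reflecting the domain-sampling idea that more items yield higher reliability.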
Example: As a psychometrician in a clinic, the psychologist instructed you to develop a test that measures the emotional stability of suicidal patients. You decided to use split-half to establish its reliability. The correlation between the two halves of the test is .78; according to the Spearman-Brown formula, the estimated reliability would be .876.
Computation: r' = 2(.78) / (1 + .78) = .876

Kuder-Richardson Formula (KR20)
Kuder and Richardson (1937) developed this formula to measure a split-half reliability estimate.
The formula is applicable to items that are DICHOTOMOUS, scored 0 or 1 (usually right or wrong).
o used when the items vary in level of difficulty.

Parallel-Forms & Alternate-Forms Reliability
1 group, 2 tests, 1 administration
o compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
o the two forms are administered to the same group of people on the same day. Pearson's product-moment correlation is used to estimate the reliability.
o alternate forms are designed to be equivalent with respect to variables such as content and level of difficulty.
o coefficient of equivalence: indicates that the two forms measure the same attributes.

Procedure
Administer the first test
Administer the alternate-form test
Score both tests
Correlate
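The final "Correlate" step is Pearson's product-moment correlation between the two sets of scores; a minimal sketch (the two score lists for five examinees are made-up illustration data):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores of five examinees on Form A and Form B
form_a = [12, 15, 11, 18, 14]
form_b = [13, 16, 10, 19, 15]

print(round(pearson_r(form_a, form_b), 2))
```

The same function serves for test-retest reliability (scores from the two administrations) and split-half reliability (scores on the two halves, before the Spearman-Brown adjustment).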
2.2. Validity
Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
Validity indicates what the test aims or purports to measure.
It answers the question, "Does the test measure what it is supposed to measure?"

Face Validity
Face validity is not really validity at all because it does not offer evidence to support conclusions drawn from test scores.
Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures.
Face validity only establishes the presentation and physical appearance of the psychological test (Is the test presentable to the test takers?).

Three Categories of Validity

Content Validity
Explores the appropriateness of the items of a psychological test.
It means that the test covers the content it is supposed to cover. This describes a judgment of how adequately a test samples behavior representative of what the test was designed to sample.
Test blueprint: the "structure" of the test evaluation; a plan regarding the types of information to be covered by the items.

Criterion-related Validity
Criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated.

Construct Validity
Judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
It is arrived at by executing a comprehensive analysis of:
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.

Convergent Validity
- when a measure correlates well with other tests believed to measure the same construct.
- ex: correlate scores from a math ability test with scores obtained from other math ability tests.

Discriminant Validity
- a construct measure diverges from other measures that should be measuring different things.
- ex: correlate scores from a math ability test with scores obtained from a verbal ability test.

Practicality of a Test
A test must be usable.
Selection of the test should also be based on:
- Effort
- Affordability
- Time frame
The test should require simple directions and be easy to administer and score.