Module-2 Psychological Assessment (Yessa)
MODULE I
INTRODUCTION
This module presents the foundations of psychological assessment. It is hoped
that you will learn to appreciate measurement as both a science and a practical
tool, from the statistics used to describe test scores, to the ways the
reliability and validity of a test are established, to how tests are developed,
analyzed item by item, and normed.
OBJECTIVES
There are six lessons in the module. Read each lesson carefully, then
answer the exercises/activities to find out how much you have benefited from
it. Work on these exercises carefully and submit your output to your tutor or
to the DOUS office.
In case you encounter difficulty, discuss it with your tutor during the
face-to-face meeting. If this is not possible, contact your tutor at the DOUS office.
Lesson 1
Statistics Refresher
SCALES OF MEASUREMENT
Properties of scales:
A. Category – naming/labeling.
B. Magnitude – “moreness”; suggests that one instance is more than another.
C. Equal interval – the difference between two points at any place on the scale
has the same meaning as the difference between two other points elsewhere on the scale.
D. Absolute zero – zero indicates the absence of the variable being measured.
1. Nominal – naming/labeling; one category does not suggest that another is
higher or lower. Ex. gender; religion
2. Ordinal – observations can be ranked in order, but the degree of difference
between them is unobtainable. Ex. position in the company
3. Interval – there is magnitude and equal intervals; no true zero
4. Ratio – there is magnitude, equal intervals, and a true zero
Scale      Magnitude   Equal Interval   Absolute Zero
Nominal    No          No               No
Ordinal    Yes         No               No
Interval   Yes         Yes              No
Ratio      Yes         Yes              Yes
◼ Most psychological data are ordinal by nature but are treated as interval.
▪ Ex. how a person responds to an item (ordinal) versus how the responses
are treated (summed, interval)
◼ IQ scores were initially intended for classification, not measurement (as noted
by Binet); thus, strictly speaking, IQ is nominal.
MEASURES OF VARIABILITY
◼ Indicate how scattered the scores in a distribution are; how far one score is
from another. Measures of the dispersion of scores.
◼ Range – the difference between the highest score (HS) and the lowest score (LS).
◼ Quartile – points that divide the distribution into 4 equal parts.
◼ Interquartile range – difference between Q3 and Q1; represents the
middle 50% of the distribution.
◼ Semi-interquartile range - (Q3 – Q1)/2
◼ STANDARD DEVIATION – an approximation of the average deviation around
the mean.
◼ Indicates how far above or below the mean a given score lies.
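These measures can be computed directly. A minimal sketch in Python (the
scores are hypothetical; only the standard library is used):

```python
import statistics

scores = [11, 12, 14, 8, 20, 18, 18, 13, 15, 11]   # hypothetical raw scores

hs, ls = max(scores), min(scores)
score_range = hs - ls                               # Range = HS - LS
q1, q2, q3 = statistics.quantiles(scores, n=4)      # quartile points Q1, Q2, Q3
iqr = q3 - q1                                       # interquartile range (middle 50%)
siqr = iqr / 2                                      # semi-interquartile range
sd = statistics.stdev(scores)                       # standard deviation

print(score_range, iqr, siqr, round(sd, 2))
```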
CORRELATIONAL STATISTICS
◼ Statistical tools for testing the relationship between variables.
◼ Covariance – How much two scores vary together
◼ Correlation coefficient – mathematical index that describes the direction and
magnitude of a relationship.
▪ Ranges from -1.00 to +1.00
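A minimal sketch of both quantities in Python (the paired scores are
hypothetical; statistics.covariance and statistics.correlation require
Python 3.10+):

```python
import statistics

x = [2, 4, 6, 8, 10]   # hypothetical scores on measure X
y = [1, 3, 5, 9, 12]   # hypothetical scores on measure Y

cov = statistics.covariance(x, y)   # how much the two measures vary together
r = statistics.correlation(x, y)    # Pearson r; always between -1.00 and +1.00

print(round(cov, 2), round(r, 3))
```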
THINK!
Lesson 2
Reliability
RELIABILITY DEFINED
◼ Reliability – refers to the consistency of scores obtained by the same person
when re-examined with the same test on different occasions, with different
sets of equivalent items, or under other variable examining conditions.
◼ This mainly refers to the attribute of consistency in measurement.
▪ Charles Spearman – Key individual in the theories of reliability.
X = T + E
(observed score = true score + error)
◼ Standard error of measurement – the standard deviation of the distribution of
errors for each repeated application of the same test on an individual.
◼ Inversely related to reliability
◼ SEM = SD√(1 – r)
CONFIDENCE INTERVAL
◼ Range or band of test scores that is likely to contain the true score.
◼ M – 1.96 (SEM) to M + 1.96 (SEM)
◼ M = the mean of the test taker’s test scores.
*Error (E) can be either positive or negative. If E is positive, the Obtained Score
(X) will be higher than the True Score (T); if E is negative, then X will be
lower than T.
*Although it is impossible to eliminate all measurement error, test developers
do strive to minimize psychometric nuisance through careful attention to the
sources of measurement error.
*It is important to stress that true score is never known.
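Putting the SEM and confidence-interval formulas above together, a minimal
sketch in Python (the scale values are hypothetical):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval_95(score: float, sd: float, reliability: float):
    """95% band likely to contain the true score: score +/- 1.96 * SEM."""
    e = sem(sd, reliability)
    return score - 1.96 * e, score + 1.96 * e

# Hypothetical scale: SD = 15, reliability r = .91, obtained score = 110
print(round(sem(15, 0.91), 2))                 # 4.5
print(confidence_interval_95(110, 15, 0.91))   # (101.18, 118.82)
```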
MEASUREMENT ERROR
◼ All factors associated with the process of measuring a variable, other than the
variable being measured.
▪ Random error – a source of error in measuring a targeted variable,
caused by unpredictable fluctuations and inconsistencies of other
variables in the measurement process.
▪ Systematic error – a source of error in measuring a variable that is
typically constant, or proportionate to what is presumed to be the true
value of the variable being measured.
◼ When two measures have a positive (+) correlation, high scores on Y are
associated with high scores on X, and low scores on Y with low scores on X.
◼ When two measures have a negative (-) correlation, high scores on Y are
associated with low scores on X, and vice versa.
◼ Correlations of +1.00 are extremely rare in psychological research and usually
signify a trivial finding (e.g., two measures that are essentially the same variable).
FORMS OF RELIABILITY
A. TEST-RETEST RELIABILITY
◼ Estimated by administering the same test to the same group on two
different occasions and correlating the two sets of scores.
◼ Example: You took an IQ test today and you will take it again after
exactly a year. If your scores are almost the same (e.g., 105 and 107),
then the measure has good test–retest reliability.
◼ Error variance – corresponds to the random fluctuations of
performance from one test session to the other.
◼ Clearly, this type of reliability is only applicable to stable traits.
B. PARALLEL-FORMS / ALTERNATE-FORMS RELIABILITY
◼ Estimated by correlating scores on two equivalent forms of the test taken by
the same group. The level of difficulty of the items should be equal across forms.
Instructions, time limits, illustrative examples, format, and all other aspects of the
test must likewise be checked for equivalence.
C. INTERNAL CONSISTENCY
◼ Used when tests are administered once.
◼ Suggests that there is consistency among items within the test.
◼ This model of reliability measures the internal consistency of the test, which is
the degree to which each test item measures the same construct. It is simply
the intercorrelation among the items.
◼ If all the items on a test measure the same construct, then the test has good
internal consistency.
THE SPLIT-HALF RELIABILITY
▪ It is obtained by splitting the items on a questionnaire or test in half,
computing a separate score for each half, and then calculating the
degree of consistency between the two scores for a group of
participants.
▪ The test can be divided according to the odd and even numbers of the
items (odd-even system).
▪ Since the test is divided into two halves that are correlated with each
other, the coefficient of correlation is based on a test only half as long,
and so underestimates the reliability; thus, the Spearman–Brown formula
should be used to correct the correlation of the test.
▪ Spearman–Brown formula
▪ A statistic that allows a test developer to estimate what the correlation
between the two halves would have been if each half had been the
length of the whole test and the halves had equal variances.
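The formula itself is not reproduced above; in its standard form, with r as the
correlation between the two halves:

r_whole = 2r / (1 + r)

For example, if the two halves correlate at r = .70, the corrected whole-test
reliability is 2(.70) / (1 + .70) = 1.40 / 1.70 ≈ .82.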
CRONBACH’S ALPHA
◼ Cronbach’s coefficient alpha
▪ Also called Cronbach’s alpha.
▪ Used when the two halves of the test have unequal variances.
▪ Provides the lowest estimate of reliability.
▪ The average of all possible split-half coefficients.
▪ Used when items are not in a right-or-wrong format (e.g., Likert scales).
KUDER RICHARDSON 20
◼ Kuder-Richardson 20 (KR20) Formula
▪ The statistic used for calculating the reliability of a test in which the
items are dichotomous, i.e., scored 0 or 1.
▪ Used for tests with a right-or-wrong format.
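A minimal sketch of the computation in Python (the data are hypothetical;
numpy is assumed). Applied to dichotomous 0/1 items, the same formula yields
KR-20:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = items.shape[1]                               # number of items
    item_vars = items.var(axis=0, ddof=1).sum()      # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical data: 5 respondents answering 4 Likert-type items
likert = np.array([[4, 5, 4, 4],
                   [2, 3, 2, 3],
                   [5, 5, 4, 5],
                   [3, 3, 3, 2],
                   [1, 2, 2, 1]])
print(round(cronbach_alpha(likert), 3))
```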
D. INTER-RATER RELIABILITY
▪ It is the degree of agreement between two observers who
simultaneously record measurements of the behaviors.
▪ Examples:
▪ Two psychologists observe the aggressive behavior of
elementary school children. If their individual records of the
construct are almost the same, then the measure has a good
inter-rater reliability.
▪ Two parents evaluate the ADHD symptoms of their child. If both
yield nearly identical ratings, then the measure has good
inter-rater reliability.
▪ Uses the kappa statistic to assess the level of agreement among
raters on a nominal scale.
▪ Cohen’s kappa – used to assess agreement between 2 raters.
▪ Fleiss’ kappa – used to assess agreement among 3 or more
raters.
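A minimal sketch of Cohen’s kappa for two raters in Python (the ratings are
hypothetical; Fleiss’ kappa generalizes the same chance-correction idea to
more raters):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b) -> float:
    """Chance-corrected agreement between two raters on nominal categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical: two observers classify 10 children as aggressive (1) or not (0)
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(a, b), 2))   # 0.8
```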
CORRECTION FOR ATTENUATION
◼ Used to estimate the correlation between two variables when the observed
correlation is deemed to be attenuated (weakened) by measurement error.
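The standard correction formula (not reproduced above) divides the observed
correlation by the square root of the product of the two tests’ reliabilities:

r_corrected = r_xy / √(r_xx × r_yy)

For example, an observed correlation of .40 between two tests that each have a
reliability of .80 corrects to .40 / √(.80 × .80) = .40 / .80 = .50.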
THINK!
Lesson 3
Validity
VALIDITY DEFINED
▪ It refers to the degree to which the measurement procedure measures
the variable that it claims to measure (strength and usefulness).
▪ Gives evidence for inferences made about a test score.
▪ Basically, it is the agreement between a test score or measure and the
characteristic it is believed to measure.
VALIDATION
◼ The process of gathering and evaluating evidence about validity.
▪ Local validation studies – required when a test is altered in some way,
such as its format, language, or content.
FACE VALIDITY
◼ Face validity – the simplest and least scientific form of validity; it is
demonstrated when the face value or superficial appearance of a measurement
suggests that it measures what it is supposed to measure.
◼ Items seem to be reasonably related to the perceived purpose of the test.
◼ Often used to motivate test takers, because they can see that the test is relevant.
TYPES OF VALIDITY
◼ Content
◼ Criterion
▪ Concurrent
▪ Predictive
◼ Construct
▪ Convergent
▪ Discriminant/Divergent
A. CONTENT VALIDITY
◼ The extent to which the test is representative of a defined body of content
consisting of topics and processes.
◼ Content validation is not done by statistical analysis but by the inspection of
items. A panel of experts can review the test items and rate them in terms of
how closely they match the objective or domain specification.
◼ This considers the adequacy of representation of the conceptual domain the
test is designed to cover.
◼ If the test items adequately represent the domain of possible items for a
variable, then the test has adequate content validity.
◼ Determination of content validity is often made by expert judgment.
◼ Educationally content-valid test – the syllabus is covered in the test; usually
follows the test’s table of specifications.
▪ Table of specifications – a blueprint of the test in terms of the number of
items per difficulty level, topic importance, or taxonomy.
◼ Employment content-valid test – appropriate job-related skills are included
in the test; reflects the job specification.
◼ Clinical content-valid test – the symptoms of the disorder are all covered in the
test; reflects the diagnostic criteria for the disorder.
ISSUES:
◼ Construct underrepresentation
▪ Failure to capture important components of a construct (e.g., an
English test that contains only vocabulary items and no grammar
items will have poor content validity).
◼ Construct-irrelevant variance
▪ Happens when scores are influenced by factors irrelevant to the
construct (e.g. test anxiety, reading speed, reading comprehension,
illness)
B. CRITERION VALIDITY
◼ The extent to which a test score corresponds with a criterion – a standard
against which the test is judged.
CHARACTERISTICS OF A CRITERION
1. Relevant
2. Valid and Reliable
3. Uncontaminated
▪ Criterion contamination – occurs when the criterion is based, in part, on
the predictor measures; the predictor then contaminates the very criterion
it is supposed to predict.
C. CONSTRUCT VALIDITY
Construct – an informed, scientific idea developed or hypothesized to describe or
explain a behavior; something built by mental synthesis.
– Unobservable, presupposed traits; something the researcher hypothesizes to have
either high or low correlations with other variables.
◼ Established through a series of activities in which a researcher simultaneously
defines some construct and develops instrumentation to measure it.
◼ A judgment about the appropriateness of inferences drawn from test scores
regarding individual standings on a variable called construct.
◼ Required when no criterion or universe of content is accepted as entirely
adequate to define the quality being measured.
◼ Assembling evidence about what a test means.
◼ A series of statistical analyses demonstrating that the variable measured is a
distinct variable.
Evidence of Homogeneity
◼ How uniform a test is in measuring a single concept.
▪ Subtest scores are correlated with the total test score.
▪ Coefficient alpha may be used as evidence of homogeneity.
▪ Spearman’s rho can be used to correlate an item with another item.
▪ Pearson or point-biserial correlations can be used to correlate an item with
the total test score (item–total correlation).
Evidence of change with age
▪ Some variables/constructs are expected to change with age.
Evidence from pretest–posttest changes
▪ A difference between pretest and posttest scores on a defined construct,
after careful manipulation, provides validity evidence.
Evidence from distinct groups
▪ Also called the method of contrasted groups.
▪ A t-test can be used to test the difference between the groups.
FACTOR ANALYSIS
◼ Can be used to obtain evidence for both convergent and discriminant validity.
◼ Exploratory Factor Analysis – estimating or extracting factors; deciding how
many factors to retain; and rotating factors to an interpretable orientation
▪ Looking for factors
◼ Confirmatory Factor Analysis – researchers test the degree to which a
hypothetical model fits the actual data.
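A minimal exploratory sketch in Python (the data are synthetic; scikit-learn’s
FactorAnalysis is assumed, which extracts factors but leaves retention and
rotation decisions to the analyst):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                  # two hypothetical latent traits
loadings = np.array([[0.9, 0.0],                    # items 1-2 load on factor 1,
                     [0.8, 0.1],                    # items 3-4 on factor 2
                     [0.1, 0.7],
                     [0.0, 0.9]])
observed = latent @ loadings.T + 0.3 * rng.normal(size=(200, 4))

fa = FactorAnalysis(n_components=2).fit(observed)   # "extract" two factors
print(np.round(fa.components_, 2))                  # estimated factor loadings
```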
CROSS VALIDATION
◼ Revalidation of a test against a criterion, based on a group different from
the original group on which the test was validated.
▪ Validity shrinkage – the decrease in validity after cross-validation.
▪ Co-validation – validation of more than one test on the same group.
▪ Co-norming – norming more than one test on the same group.
THINK!
Lesson 4
Test Development
TEST DEVELOPMENT - An umbrella term for everything that goes into the process of
creating a test.
1. Test conceptualization – an early stage of test development wherein the idea
for a particular test is conceived.
◼ The stage wherein the following are determined: construct, goal, user,
taker, administration, format, response, benefits, costs, interpretation.
◼ Determination of whether the test will be norm-referenced or
criterion-referenced.
*PILOT WORK
◼ Also called a pilot study or pilot research.
◼ May take the form of interviews to determine appropriate
items for the test.
◼ It may entail literature reviews, experimentation, or any effort the
researcher undertakes to determine which items might be included
in the test.
Lesson 5
Item Analysis
ITEM ANALYSIS - A general term for a set of methods used to evaluate test items,
one of the most important aspects of test construction.
▪ Item Difficulty
▪ Item Reliability
▪ Item Validity
▪ Item Discriminability
▪ Items for Criterion-Referenced Tests
*Distractor Analysis
ITEM DIFFICULTY (p)
◼ Item difficulty is defined by the proportion of people who get a particular
item correct on a test that measures achievement or ability.
◼ For example, if 79% of the test takers answered item number 1 correctly,
then the item has a difficulty index of .79.
◼ This definition, however, indicates the easiness of the item rather than its
difficulty.
◼ It is also suggested that achievement tests use multiple-choice questions,
because with four options there is only a .25 chance of guessing the
correct response.
◼ p should range from 0.30 to 0.70.
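A common rule of thumb from standard psychometrics texts (an assumption, as it
is not stated above) places the optimal difficulty halfway between chance
success and 1.00. For a four-option multiple-choice item, chance = .25, so the
optimal p = (1.00 + .25) / 2 ≈ .63, comfortably inside the 0.30–0.70 range.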
ITEM RELIABILITY
◼ Indicates the internal consistency of a test. The higher the index, the higher
the internal consistency.
◼ Item reliability = (SD of the item) × (item–total correlation)
◼ Factor analysis can also be used to determine which items load most strongly
on the test as a whole.
ITEM VALIDITY
◼ Provides an indication of the degree to which a test measures what it
purports to measure. The higher the item-validity index, the higher the
criterion-related validity of the test.
◼ Item validity = (item standard deviation) × (correlation between item and
criterion)
ITEM DISCRIMINATION (d)
◼ Indicates how adequately an item separates high scorers from low scorers
on the entire test.
◼ How well an item performs in relation to some criterion. In other words, it
tells us the degree of association between performance on an item and
performance on the whole test.
◼ A cutoff of 0.30 is commonly used for the discrimination index.
◼ The higher the d, the more the high scorers (relative to the low scorers)
answer the item correctly (see the worked computation after the sample
exercise below).
▪ Extreme Group Method
▪ Point Biserial Correlation
SAMPLE EXERCISE:
Item   Proportion correct,      Proportion correct,        Discrimination
       top third of class       bottom third of class      index
3      .95                      .45
4      .97                      .93
5      .56                      .69
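Applying the extreme group method listed above, d = (proportion correct, top
third) − (proportion correct, bottom third):

Item 3: .95 − .45 = .50 (discriminates well)
Item 4: .97 − .93 = .04 (nearly everyone passes; little discrimination)
Item 5: .56 − .69 = −.13 (a negative d: low scorers outperform high scorers,
so the item should be reviewed)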
POINT BISERIAL
◼ Used for correlating dichotomous and continuous data.
◼ Correlates whether those who got an item correct also tend to have high total
scores.
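A minimal sketch in Python (the data are hypothetical; scipy is assumed):

```python
from scipy.stats import pointbiserialr

# Hypothetical: 8 examinees; 0/1 score on one item vs. total test score
item = [1, 0, 1, 1, 0, 1, 0, 1]
total = [42, 25, 38, 45, 22, 40, 30, 44]

r_pb, p_value = pointbiserialr(item, total)
print(round(r_pb, 2))   # positive r_pb: examinees who pass the item score higher overall
```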
ITEMS FOR CRITERION-REFERENCED TESTS
◼ A frequency polygon is created after the test is given to two groups: one
group that has been exposed to the learning unit and another group that has
not.
◼ Antimode – the score with the lowest frequency.
▪ Used to determine the cut score (passing score) for a criterion-referenced
test.
Lesson 6
Norming
STANDARD SCORES
◼ A raw score that has been converted from one scale to another scale.
◼ Provides a context for comparing scores on different tests by converting the
scores from the two tests into z-scores.
◼ “z scores are golden”
Z SCORE
◼ Mean of 0 ; SD of 1
◼ Zero plus or minus one scale
◼ When determined, can be used to translate one scale to another.
LINEAR TRANSFORMATION
◼ A formula derived from the z-score, used to transform a score from one scale
to another.
◼ NS = SD(z) + M, where NS is the new score, and SD and M are the standard
deviation and mean of the new scale.
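A minimal sketch of the transformation in Python (the scale values are
hypothetical):

```python
def transform(raw: float, mean_old: float, sd_old: float,
              mean_new: float, sd_new: float) -> float:
    """NS = SD(z) + M: re-express a raw score on a new scale via its z-score."""
    z = (raw - mean_old) / sd_old
    return sd_new * z + mean_new

# Hypothetical: raw score 65 on a scale with M = 50, SD = 10
print(transform(65, 50, 10, 50, 10))    # 65.0  (T score: M = 50, SD = 10)
print(transform(65, 50, 10, 100, 15))   # 122.5 (deviation-IQ style: M = 100, SD = 15)
```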
PERCENTILE RANKS
◼ Tells the relative position of a test taker within a group of 100.
◼ Indicates what proportion of the sample falls below a specified score.
◼ For example, if a person’s score is at the 50th percentile, 50 percent of the
test takers fall below that score.
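As a worked illustration with hypothetical numbers: if 30 out of 40 examinees
score below a raw score of 15, that score has a percentile rank of
(30 / 40) × 100 = 75.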
EXERCISE:
Create a norm for the following test data using standard scores (z, T, etc.) and
percentiles, and determine the mean, median, mode, SD, and range.
11 12 14 8 20 18 18 13 15 11
5 19 15 14 9 10 17 15 16 6
9 12 13 15 16 20 18 12 17 10
12 11 10 15 14 9 10 17 13 11
TYPES OF NORMS
◼ Criterion-referenced testing – interpretation of the test is based on a certain
standard.
▪ A method of evaluation and a way of deriving meaning from test scores by
evaluating an individual’s score with reference to a set standard.
▪ Also called content-referenced or domain-referenced testing.
▪ Criterion – a standard on which a judgment or decision is based.
◼ Norm-referenced testing – the score is interpreted relative to the performance
of a standardization group.
Developmental norms
◼ Indicate how far along the normal developmental path an
individual has progressed.
Age norms
◼ A child’s score on a test corresponds to the highest year
level or age level that he or she can successfully complete.
Grade norms
◼ A child’s score corresponds to the average test performance of
children at a given grade level.
National norms
◼ Norms derived from large-scale, nationally representative samples.
◼ Subgroup norms – a normative sample segmented by any of the
criteria initially used in selecting the sample.
◼ Local norms – provide normative information with respect to
the local population’s performance on a test.
MODULE SUMMARY
SUMMATIVE TEST