
MODULE II

THE SCIENCE OF PSYCHOLOGICAL ASSESSMENT

Lesson 1. Statistics Refresher

Lesson 2. Reliability

Lesson 3. Validity

Lesson 4. Test Development

Lesson 5. Item Analysis

Lesson 6. Norming


MODULE I

ETHICS AND HUMAN ACTS

INTRODUCTION

This module presents ethics and human acts. It is hoped that you will
learn to appreciate ethics as both a science and a way of life, the purpose for
which man exists, and the factors that affect his human acts. You will also
learn the end of human acts and how happiness can be found.

OBJECTIVES

After studying the module, you should be able to:

1. discuss the importance of ethics to man.
2. explain the attributes and kinds of human acts.
3. give and explain the concept of morality.

DIRECTIONS/ MODULE ORGANIZER

There are three lessons in the module. Read each lesson carefully,
then answer the exercises/activities to find out how much you have
benefited from them. Work on these exercises carefully and submit your output
to your tutor or to the DOUS office.

In case you encounter difficulty, discuss this with your tutor during
the face-to-face meeting. If that is not possible, contact your tutor at the DOUS office.

Good luck and happy reading!!!


Lesson 1

Statistics Refresher

The use of statistics in psychological measurement:


1. Description – using descriptive statistics
- used for interpreting test results
- provides a concise description of quantitative information
2. Inference – using inferential statistics
- provides conclusions about a population based on observations made on a sample

SCALES OF MEASUREMENT
Properties of scales:
A. Category – naming/labeling
B. Magnitude – “moreness”; suggests that one instance is more than another.
C. Equal Interval – the difference between two points at any place on the scale has the
same meaning as the difference between two other points elsewhere on the scale.
D. Absolute zero – zero suggests the absence of the variable being measured.

1. Nominal – naming; labeling; one category does not suggest that the other is
higher or lower. Ex. Gender; religion
2. Ordinal – observations can be ranked into order but the degree of difference
is unobtainable. Ex. Position in the company
3. Interval – there is magnitude and equal interval; no true zero
4. Ratio – there is magnitude, equal intervals, and true zero

Scale of Measurement    Magnitude    Equal Interval    Absolute Zero
Nominal                 No           No                No
Ordinal                 Yes          No                No
Interval                Yes          Yes               No
Ratio                   Yes          Yes               Yes

◼ Most psychological data are ordinal by nature but are treated as interval.
▪ Ex. how a person responds to an item (ordinal) and how the responses
are treated (summed: interval)


◼ IQ scores were originally intended for classification and not for measurement (as cited by Binet);
thus, IQ is nominal.

FREQUENCY DISTRIBUTIONS
◼ Display scores on a variable or a measure to reflect how frequently each value
was obtained.
◼ Graphs – diagrams or charts illustrating data
▪ Histogram – graphs with vertical lines at the true limits of each test
score; connected bars; used for continuous data
▪ Bar Graph – used in describing frequencies; disconnected bars
▪ Frequency polygon – points are plotted at the class mark of each of the
intervals; Continuous lines

MEASURES OF CENTRAL TENDENCY


Statistics that indicate the average or midmost score between the extreme scores in a
distribution.
A. Mean – the most appropriate measure of central tendency for interval and ratio data when
the distribution is normal; simply the average of a data set
B. Median – the middle score of a distribution. Computed by arranging the data
from lowest to highest and taking the middlemost score. If N is even,
average the two middlemost scores.
C. Mode – the most frequently occurring score in a distribution

MEASURES OF VARIABILITY
◼ Indicate how scattered the scores in a distribution are; how far one score is from
another. Measures of the dispersion of the scores.
◼ Range – equal to the difference between the highest score (HS) and the lowest score (LS)
◼ Quartiles – points that divide the distribution into 4 equal parts.
◼ Interquartile range – difference between Q3 and Q1; represents the
middle 50% of the distribution.
◼ Semi-interquartile range - (Q3 – Q1)/2
◼ STANDARD DEVIATION - approximation of the average deviation around
the mean
◼ Indicates how far above or below the mean a score lies.
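A minimal sketch in Python (assuming NumPy is available; the scores are hypothetical) showing how these measures are computed:

```python
import numpy as np

scores = np.array([8, 10, 12, 12, 15, 17, 20])   # hypothetical test scores

rng = scores.max() - scores.min()                # range = HS - LS
q1, q2, q3 = np.percentile(scores, [25, 50, 75]) # quartiles
iqr = q3 - q1                                    # interquartile range (middle 50%)
semi_iqr = iqr / 2                               # semi-interquartile range
sd = scores.std(ddof=1)                          # sample standard deviation

print(rng, q1, q2, q3, iqr, semi_iqr, round(sd, 2))
```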

CORRELATIONS AND REGRESSIONS

CORRELATIONAL STATISTICS
◼ Statistical tools for testing the relationship between variables.
◼ Covariance – How much two scores vary together
◼ Correlation coefficient – mathematical index that describes the direction and
magnitude of a relationship.
▪ Ranges from -1.00 to +1.00


▪ The nearer to ±1.00, the stronger the relationship
▪ The nearer to 0, the weaker the relationship
▪ The sign indicates the type of relationship (negative = inverse
relationship; positive = direct relationship)

Pearson Product Moment Correlation


▪ Correlates 2 variables in interval/ratio scale format
▪ Devised by Karl Pearson
Spearman Rho
▪ Also called rank-order correlation or Spearman correlation
▪ Correlates 2 variables in ordinal scale
Biserial Correlation
▪ Correlates one continuous variable with one artificially dichotomous variable
▪ Ex. score on a test (continuous/interval) and being highly aggressive
(artificial dichotomy)
Point Biserial Correlation
▪ Correlates one continuous variable with one truly dichotomous variable.
▪ Ex. score on the test (continuous/interval) and correctness on an item within
the test (true dichotomy)
◼ True dichotomy – a dichotomy in which there are only two
possible categories.
◼ Ex. sex (male – female)
◼ Artificial dichotomy – a dichotomy imposed on a category that has other
underlying possibilities
◼ Ex. pass or fail (both pass and fail correspond to a certain
range of scores)
Phi Coefficient
▪ Correlates two dichotomous variables; at least one is a true dichotomy
▪ Ex. gender and passing or failing a test
Tetrachoric Correlation
▪ Correlates two dichotomous variables; both are artificial dichotomies
▪ Ex. passing or failing a test and being highly aggressive or not.

OTHER ESSENTIAL CORRELATIONAL CONCEPTS


Coefficient of Alienation – a measure of non-association between two variables
▪ √(1 – r²)
Coefficient of Determination (r²) – tells the proportion of the total variation in
scores on Y that we know as a function of information about X
▪ Suggests the percentage of variance shared by two variables; the effect of one
variable on the other.
▪ The square of r
▪ If r = 0.75, then r² = 0.5625, or 56.25%
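These relationships can be checked with a short Python sketch (hypothetical paired scores; NumPy assumed):

```python
import numpy as np

x = np.array([2.0, 4, 5, 7, 9])    # hypothetical scores on test X
y = np.array([1.0, 3, 6, 6, 10])   # hypothetical scores on criterion Y

r = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
determination = r ** 2             # proportion of variance shared by X and Y
alienation = (1 - r ** 2) ** 0.5   # coefficient of alienation (non-association)

print(f"r = {r:.2f}, r^2 = {determination:.2%}, alienation = {alienation:.2f}")
```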


THINK!

Discuss the role of ethics in your daily life.

Lesson 2

Reliability

RELIABILITY DEFINED
◼ Reliability – refers to the consistency of scores obtained by the same person
when re-examined with the same test on different occasions, with different
sets of equivalent items, or under other variable examining conditions.
◼ This mainly refers to the attribute of consistency in measurement.
▪ Charles Spearman – Key individual in the theories of reliability.

*Measurement error is common in all fields of science.


*Tests that are relatively free of measurement error are considered to be
reliable while tests that contain relatively large measurement error are
considered to be unreliable.

CLASSICAL TEST SCORE THEORY


◼ Classical Test Score Theory – this assumes that each person has a true score
that would be obtained if there were no errors in measurement.
◼ Measuring instruments are imperfect; therefore, the observed score for each
person almost always differs from the person’s true ability or characteristic.
◼ Measurement error – the difference between the observed score and the true
score.

X = T + E
where X = observed score, T = true score, and E = error
◼ Standard error of measurement – the standard deviation of the distribution of
errors for each repeated application of the same test on an individual.
◼ Inversely related to reliability


◼ SEM = SD√(1 – r)

CONFIDENCE INTERVAL
◼ Range or band of test scores that is likely to contain the true score.
◼ M – 1.96 (SEM) to M + 1.96 (SEM)
◼ M = the mean of the test scores obtained from the test taker.
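A worked sketch with hypothetical values (an IQ-style scale with SD = 15, reliability r = .91, and an obtained score of 105):

```python
import math

sd, r, obtained = 15.0, 0.91, 105.0

sem = sd * math.sqrt(1 - r)    # SEM = SD * sqrt(1 - r) = 15 * 0.30 = 4.5
lower = obtained - 1.96 * sem  # lower bound of the 95% confidence band
upper = obtained + 1.96 * sem  # upper bound of the 95% confidence band

print(f"SEM = {sem:.2f}; 95% CI = {lower:.1f} to {upper:.1f}")
# The band 96.2 to 113.8 is likely to contain the true score.
```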

*Error (E) can be either positive or negative. If E is positive, the obtained score
(X) will be higher than the true score (T); if E is negative, then X will be
lower than T.
*Although it is impossible to eliminate all measurement error, test developers
do strive to minimize psychometric nuisance through careful attention to the
sources of measurement error.
*It is important to stress that true score is never known.

MEASUREMENT ERROR
◼ All factors associated with the process of measuring a variable, other than the
variable being measured
▪ Random error – source of error in measuring a targeted variable
caused by unpredictable fluctuations and inconsistencies of other
variables in the measurement process.
▪ Systematic error – source of error in measuring a variable that is
typically constant or proportionate to what is presumed to be the true
value of the variable being measured.

Factors that contribute to consistency:


▪ These consist entirely of those stable attributes of the individual, which
the examiner is trying to measure.
Factors that contribute to inconsistency:
▪ These include characteristics of the individual, test, or situation, which
have nothing to do with the attribute being measured, but which
nonetheless affect test scores.

DOMAIN SAMPLING MODEL


▪ There is a problem in the use of a limited number of items to represent a
larger and more complicated construct.
▪ A sample of items is utilized instead of the infinite pool of items of the
construct.
▪ The greater the number of items, the higher the reliability.

SOURCES OF MEASUREMENT ERROR


A. Item selection


▪ One source of measurement error is the instrument itself. A test
developer must settle upon a finite number of items from a potentially
infinite pool of test questions.
▪ Which items should be included? How should they be worded?
▪ Although psychometricians strive to obtain representative test items,
the particular set of questions chosen for a test might not be equally
fair to all persons.
B. Test Administration
▪ General environmental conditions may exert an untoward influence on
the accuracy of measurement, such as uncomfortable room
temperature, dim lighting, and excessive noise.
▪ Momentary fluctuation in anxiety, motivation, attention, and fatigue
level of the test taker may also introduce sources of measurement error.
▪ The examiner may also contribute to the measurement error in the
process of test administration.
C. Test Scoring
▪ Whenever a psychological test uses a format other than machine-scored
multiple-choice items, some degree of judgment is required to assign
points to answers.
▪ Most tests have well-defined criteria for answers to each question.
These guidelines help minimize the impact of subjective judgment in
scoring.
▪ Consistent and reliable scores have reliability coefficient near 1.0;
Conversely, tests, which reflect large amount of measurement error,
produce inconsistent and unreliable score and their reliability
coefficients are close to 0.

ITEM RESPONSE THEORY (IRT)


◼ With the help of a computer, the item difficulty is calibrated to the mental
ability of the test taker.
◼ If you got several easy items correct, the computer will then move to more
difficult items.
◼ If you get several difficult items wrong, the computer moves back to average
items.
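A deliberately simplified sketch of that adaptive logic (the step size, starting point, and response pattern are hypothetical; a real IRT system estimates ability from a fitted item-response model rather than a fixed step rule):

```python
def next_difficulty(current: float, was_correct: bool, step: float = 0.5) -> float:
    """Move to a harder item after a correct answer, an easier one after an error."""
    return current + step if was_correct else current - step

difficulty = 0.0                        # start at average difficulty
for was_correct in [True, True, False]: # hypothetical response pattern
    difficulty = next_difficulty(difficulty, was_correct)
    print(f"{'correct' if was_correct else 'wrong'} -> next item at {difficulty:+.1f}")
```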
THE CORRELATION COEFFICIENT
◼ A correlation coefficient (r) expresses the degree and magnitude of a linear
relationship between two sets of scores obtained from the same person.
◼ It can take on values ranging from -1.00 to +1.00.


◼ When two measures have a positive (+) correlation, the high/low scores on Y
are associated with the high/low scores on X.
◼ When two measures have a negative (-) correlation, the high scores on Y are
associated with low scores on X and vice versa.
◼ Correlations of +1.00 are extremely rare in psychological research and usually
signify a trivial finding.

FORMS OF RELIABILITY

A. TEST-RETEST RELIABILITY - It is established by comparing the scores
obtained from two successive measurements of the same individuals and calculating a
correlation between the two sets of scores.
◼ It is also known as time-sampling reliability since it measures the error
associated with administering a test at two different times.
◼ This is used when we measure only traits or characteristics that do not change
over time (e.g., IQ).


◼ Example: You took an IQ test today and you will take it again after
exactly a year. If your scores are almost the same (e.g. 105 and 107),
then the measure has a good test-retest reliability.
◼ Error variance – corresponds to the random fluctuations of
performance from one test session to the other.
◼ Clearly, this type of reliability is only applicable to stable traits.

LIMITATIONS OF TEST-RETEST RELIABILITY


◼ Carryover effect – occurs when the first testing session influences the results
of the second session and this can affect the test-retest reliability of a
psychological measure.
◼ Practice effect – a type of carryover effect wherein the scores on the second
test administration are higher than they were on the first.
Remember…
◼ If the results of the first and second administrations have a low correlation, it
might mean that:
▪ The test has poor reliability
▪ A major change has occurred in the subjects between the first and
second administrations.
▪ A combination of low reliability and major change has occurred.
◼ Sometimes, a poor test-retest correlation does not mean that the test is
unreliable. It might mean that the variable under study has changed.
Choice of Time Intervals
◼ Too short = Carryover effects
◼ Too long = various factors might happen in between 2 testing conditions.

B. PARALLEL FORMS RELIABILITY


◼ It is established when at least two different versions of the test yield
almost the same scores.
◼ It is also known as item sampling reliability or alternate forms reliability since
it compares two equivalent forms of a test that measure the same attribute to
make sure that the items indeed assess a specific characteristic.
◼ The correlation between the scores obtained on the two forms represents the
reliability coefficient of the test.
◼ Examples:
◼ The Purdue Non-Language Test (PNLT) has Forms A and B,
and both yield nearly identical scores for the test taker.
◼ The SRA Verbal Form has parallel Forms A and B, and both
yield almost identical scores for the test taker.
◼ The error variance in this case represents fluctuations in
performance from one set of items to another, but not fluctuations over
time.
Tests should contain the same number of items and the items should be
expressed in the same form and should cover the same type of content. The range and


level of difficulty of the items should also be equal. Instructions, time limits,
illustrative examples, format and all other aspects of the test must likewise be checked
for equivalence.

LIMITATIONS OF PARALLEL FORMS RELIABILITY


◼ One of the most rigorous and burdensome assessments of reliability since test
developers have to create two forms of the same test.
◼ Practical constraints make it difficult to retest the same group of individuals.

C. INTERNAL CONSISTENCY
◼ Used when tests are administered once.
◼ Suggests that there is consistency among items within the test.
◼ This model of reliability measures the internal consistency of the test which is
the degree to which each test item measures the same construct. It is simply
the intercorrelations among the items.
◼ If all items on a test measure the same construct, then it has a good internal
consistency.
THE SPLIT HALF RELIABILITY
▪ It is obtained by splitting the items on a questionnaire or test in half,
computing a separate score for each half, and then calculating the
degree of consistency between the two scores for a group of
participants.
▪ The test can be divided according to the odd and even numbers of the
items (odd-even system).
▪ Since each half is only half the length of the full test, the correlation
between the two halves underestimates the reliability of the full-length
test; thus, the Spearman-Brown formula should be used to correct the
correlation of the test.
▪ Spearman-Brown Formula
▪ A statistic that allows a test developer to estimate what the correlation
between the two halves would have been if each half had been the
length of the whole test and the halves had equal variances.
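The standard Spearman-Brown correction for a half-length split is r_SB = 2r_hh / (1 + r_hh), where r_hh is the correlation between the two halves. A minimal sketch (hypothetical dichotomous responses; odd-even split; NumPy assumed):

```python
import numpy as np

# Hypothetical response matrix: 5 examinees x 6 dichotomous items (1 = correct).
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)          # scores on items 1, 3, 5
even_half = items[:, 1::2].sum(axis=1)         # scores on items 2, 4, 6

r_hh = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation
r_sb = 2 * r_hh / (1 + r_hh)                   # Spearman-Brown corrected estimate
print(f"split-half r = {r_hh:.2f}, corrected = {r_sb:.2f}")
```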

CRONBACH’S ALPHA
◼ Cronbach’s coefficient alpha
▪ Also called Cronbach alpha
▪ Used when the two halves of the test have unequal variances.
▪ Provides the lowest estimate of reliability.
▪ The average of all possible split halves.
▪ Used when items are not in a right-or-wrong format (e.g., Likert scales)
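The usual formula is α = [k / (k − 1)] × (1 − Σ item variances / variance of total scores), with k the number of items. A minimal sketch on hypothetical Likert data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 5-point Likert responses: 4 respondents x 3 items.
likert = np.array([[4, 5, 4], [2, 3, 2], [5, 4, 5], [1, 2, 1]])
print(f"alpha = {cronbach_alpha(likert):.2f}")
```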


KUDER RICHARDSON 20
◼ Kuder-Richardson 20 (KR20) Formula
▪ The statistic used for calculating the reliability of a test in which the
items are dichotomous, scored 0 or 1.
▪ For tests with a right-or-wrong format.
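KR-20 is the special case of alpha for dichotomous items: KR-20 = [k / (k − 1)] × (1 − Σpq / variance of total scores), where p is the proportion passing each item and q = 1 − p. A minimal sketch (hypothetical right/wrong data):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)."""
    k = items.shape[1]
    p = items.mean(axis=0)                    # proportion correct per item
    pq = (p * (1 - p)).sum()                  # sum of item variances p*q
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - pq / total_var)

# Hypothetical matrix: 5 examinees x 4 items scored 0/1.
answers = np.array([[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 1, 1, 1]])
print(f"KR-20 = {kr20(answers):.2f}")
```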

D. INTER-RATER RELIABILITY
▪ It is the degree of agreement between two observers who
simultaneously record measurements of the behaviors.
▪ Examples:
▪ Two psychologists observe the aggressive behavior of
elementary school children. If their individual records of the
construct are almost the same, then the measure has a good
inter-rater reliability.
▪ Two parents evaluated the ADHD symptoms of their child. If
they both yield identical ratings, then the measure has good
inter-rater reliability.

▪ This uses the kappa statistic in order to assess the level of agreement
among raters in nominal scale.
▪ Cohen’s Kappa – used to know the agreement among 2 raters
▪ Fleiss’ Kappa – used to know the agreement among 3 or more
raters.
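A minimal sketch of Cohen's kappa for two raters (κ = (p_o − p_e) / (1 − p_e), observed agreement corrected for chance; the ratings below are hypothetical):

```python
import numpy as np

def cohens_kappa(r1, r2) -> float:
    """kappa = (p_o - p_e) / (1 - p_e): agreement between two raters beyond chance."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_o = (r1 == r2).mean()                        # observed proportion of agreement
    p_e = sum((r1 == c).mean() * (r2 == c).mean()  # agreement expected by chance
              for c in np.union1d(r1, r2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical nominal ratings of the same 8 children by two observers.
rater1 = ["agg", "calm", "agg", "calm", "agg", "agg", "calm", "agg"]
rater2 = ["agg", "calm", "agg", "agg", "agg", "agg", "calm", "calm"]
print(f"kappa = {cohens_kappa(rater1, rater2):.2f}")
```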

BRIEF SUMMARY OF METHODS FOR ESTIMATING RELIABILITY

Method                                  No. of Forms   No. of Sessions   Sources of Error Variance
Test-Retest                             1              2                 Changes over time
Alternate-Forms (Immediate)             2              1                 Item sampling
Alternate-Forms (Delayed)               2              2                 Item sampling; changes over time
Split-Half (Spearman-Brown)             1              1                 Item sampling; nature of split
Coefficient Alpha & Kuder-Richardson    1              1                 Item sampling; test heterogeneity
Inter-Rater                             1              1                 Scorer differences

WHICH TYPE OF RELIABILITY IS APPROPRIATE?


◼ For tests that have two forms, use parallel forms reliability.
◼ For tests that are designed to be administered to an individual more than once,
use test-retest reliability.
◼ For tests with factorial purity, use Cronbach’s coefficient alpha.
◼ For tests with items carefully ordered according to difficulty, use split-half
reliability.
◼ For tests which involve some degree of subjective scoring, use inter-rater
reliability.
◼ For tests which involve dichotomous items or forced choice items, use KR20.

DYNAMIC VS STATIC CHARACTERISTICS


Dynamic Characteristics – ever-changing characteristics that vary
over time or across situations.
▪ Internal consistency is of best use.
Static Characteristics – characteristics that would not vary across time.
▪ Test-retest and parallel forms are of best use.

UNIDIMENSIONAL VS MULTIDIMENSIONAL VARIABLES


◼ Unidimensional tests are expected to have higher internal consistency than
multidimensional tests.

HOW RELIABLE IS RELIABLE?


◼ Test-retest, alternate forms – “the more the better”
◼ Internal consistency
▪ Basic Research = .70 to .80
▪ Clinical Setting = .95 or above

WHAT TO DO ABOUT LOW RELIABILITY?


◼ Increase the number of items.
◼ Use factor analysis and item analysis.
◼ Use the correction for attenuation formula.

CORRECTION FOR ATTENUATION


◼ Used to estimate what the correlation between two variables would be if the
measures were not affected by measurement error.
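A commonly cited form of the correction, with hypothetical numbers: r_corrected = r₁₂ / √(r₁₁ × r₂₂), where r₁₂ is the observed correlation between the two measures and r₁₁ and r₂₂ are their reliabilities. For example, if r₁₂ = .40 and both reliabilities are .80, then r_corrected = .40 / √(.80 × .80) = .40 / .80 = .50.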

THINK!

Distinguish human acts from acts of man.


Lesson 3

Validity

VALIDITY DEFINED
▪ It refers to the degree to which the measurement procedure measures
the variable that it claims to measure (strength and usefulness).
▪ Gives evidence for inferences made about a test score.
▪ Basically, it is the agreement between a test score or measure and the
characteristic it is believed to measure.

VALIDATION
◼ The process of gathering and evaluating evidence about validity.
▪ Local validation studies – needed when a test is altered in some way,
such as in format, language, or content.

FACE VALIDITY
◼ Face validity – the simplest and least scientific form of validity; it is
demonstrated when the superficial appearance of a measurement suggests that it
measures what it is supposed to measure.
◼ Items seem reasonably related to the perceived purpose of the test.
◼ Often used to motivate test takers because they can see that the test is relevant.

EXAMPLES OF FACE VALIDITY


◼ An IQ test containing items which measure memory, mathematical ability,
verbal reasoning and abstract reasoning has a good face validity.
◼ An IQ test containing items which measure depression and anxiety has a bad
face validity.
◼ A self-esteem rating scale which has items like “I know I can do what other
people can do.” and “I usually feel that I would fail on a task.” has a good face
validity.
◼ Inkblot tests have low face validity because test takers question whether the test
really measures personality.

EVIDENCES OF VALIDITY OF TESTS


◼ Content
◼ Criterion Related
▪ Concurrent


▪ Predictive
◼ Construct
▪ Convergent
▪ Discriminant/Divergent
A. CONTENT VALIDITY
◼ The extent to which the test is representative of a defined body of content
consisting of topics and processes.
◼ Content validation is not done by statistical analysis but by the inspection of
items. A panel of experts can review the test items and rate them in terms of
how closely they match the objective or domain specification.
◼ This considers the adequacy of representation of the conceptual domain the
test is designed to cover.
◼ If the test items adequately represent the domain of possible items for a
variable, then the test has adequate content validity.
◼ Determination of content validity is often made by expert judgment.
◼ Educational content-valid test – the syllabus is covered in the test; usually
follows the table of specifications of the test.
◼ Table of specifications – a blueprint of the test in terms of the number of
items per difficulty level, topic importance, or taxonomy level.
◼ Employment content-valid test – appropriate job-related skills are included
in the test; reflects the relevant job specification.
◼ Clinical content-valid test – the symptoms of the disorder are all covered in the
test; reflects the diagnostic criteria for the disorder.

ISSUES:
◼ Construct underrepresentation
▪ Failure to capture important components of a construct (e.g. An
English test which only contains vocabulary items but no grammar
items will have a poor content validity.)
◼ Construct-irrelevant variance
▪ Happens when scores are influenced by factors irrelevant to the
construct (e.g. test anxiety, reading speed, reading comprehension,
illness)

Quantification of Content Validity


▪ Lawshe (1975) proposed a structured and systematic way of
establishing the content validity of a test.
▪ He developed the content validity ratio (CVR) formula.
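Lawshe's ratio for a single item is CVR = (nₑ − N/2) / (N/2), where nₑ is the number of panelists rating the item “essential” and N is the total number of panelists; CVR ranges from −1 to +1, and positive values mean that more than half of the panel judged the item essential.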


B. CRITERION – RELATED VALIDITY


▪ Tells how well a test corresponds with a particular criterion.
▪ A judgment of how adequately a test score can be used to infer an individual’s
most probable standing on some measure of interest.
▪ Criterion – standard against which a test or a test score is evaluated.
▪ A criterion can be a test score, psychiatric diagnosis, training cost,
index of absenteeism, amount of time.

CHARACTERISTICS OF A CRITERION
1. Relevant
2. Valid and Reliable
3. Uncontaminated
▪ Criterion contamination – occurs when the criterion is itself based on
predictor measures, so the predictor and the criterion are not independent.

TYPES OF CRITERION-RELATED VALIDITY


Concurrent Validity
◼ Both the test scores and the criterion measures are obtained at
present.
Predictive Validity
◼ Test scores may be obtained at one time and the criterion measure
may be obtained in the future after an intervening event.

C. CONSTRUCT VALIDITY
Construct – An informed scientific idea developed or hypothesized to describe or
explain a behavior; something built by mental synthesis.
-Unobservable, presupposed traits; something that the researcher thought to have
either high or low correlation with other variables.
◼ Established through a series of activities in which a researcher simultaneously
defines some construct and develops instrumentation to measure it.
◼ A judgment about the appropriateness of inferences drawn from test scores
regarding individual standings on a variable called construct.
◼ Required when no criterion or universe of content is accepted as entirely
adequate to define the quality being measured.
◼ Assembling evidence about what a test means.
◼ A series of statistical analyses showing that one variable stands as a separate, distinct variable.


◼ A test has good construct validity if there is an existing psychological theory
which can support what the test items are measuring.
◼ Establishing construct validity involves both logical analysis and empirical
data.
◼ Example: In measuring aggression, you have to check all past research and
theories to see how the researchers measure that variable/construct.
◼ Construct validity is like proving a theory through evidences and statistical
analysis.

EVIDENCES OF CONSTRUCT VALIDITY


1. The test is homogeneous, measuring a single construct.
2. Test scores increase or decrease as a function of age, the passage of time, or
experimental manipulation.
3. Pretest-posttest differences
4. Test scores differ across groups.
5. Test scores correlate with scores on other tests in accordance with what is
predicted.

Evidence of Homogeneity
◼ How uniform a test is in measuring a single concept.
▪ Subtest scores are correlated to the total test score.
▪ Coefficient alpha may be used as homogeneity evidence.
▪ Spearman Rho can be used to correlate an item to another item.
▪ Pearson or point biserial can be used to correlate an item to the total
test score. (item-total correlation)
Evidence of change with age
▪ Some variable/construct are expected to change with age.
Evidence of pretest posttest
▪ Difference of scores from pretest and post test of a defined construct
after careful manipulation would provide validity
Evidence from distinct groups
▪ Also called the method of contrasted groups
▪ A t-test can be used to test the difference between groups.

TEST SCORES CORRELATE WITH SCORES ON OTHER TEST


◼ Convergent evidence
▪ Also called convergent validity
▪ The test correlates with another measure of the same or a related construct
▪ Ex. a depression test and a Negative Affect Scale
◼ Divergent/Discriminant evidence
▪ Also called divergent/discriminant validity
▪ A validity coefficient showing little or no relationship between the
newly created test and an existing test measuring a different construct.
▪ Ex. a Social Desirability test and a Marital Satisfaction test.


FACTOR ANALYSIS
◼ Can be used to obtain evidence for both convergent and discriminant validity.
◼ Exploratory Factor Analysis – estimating or extracting factors; deciding how
many factors to retain; and rotating factors to an interpretable orientation
▪ Looking for factors
◼ Confirmatory Factor Analysis – researchers test the degree to which a
hypothetical model fits the actual data.

CROSS VALIDATION
◼ Revalidation of the test to a criterion based on another group different from
the original group from which the test was validated.
▪ Validity Shrinkage – decrease in validity after cross validation.
▪ Co-validation – validation of more than one test from the same group.
▪ Co-norming – norming more than one test from the same group.

FACTORS THAT AFFECT VALIDITY OF A TEST


1. Test bias – A factor inherent in a test that systematically prevents accurate,
impartial measurement.
▪ Test fairness – the extent to which a test is used in an impartial, just,
and equitable way
▪ Adverse Impact – occurs when use of the test systematically rejects a higher
proportion of minority-group applicants than non-minority applicants
▪ Differential Validity – the extent to which a test has different meaning
for different people
▪ Ex. A test may be a valid predictor of college success for White students
but not for African-American students
2. Rating Error – judgment resulting from intentional and unintentional misuse
of a rating scale
▪ Rating – a numerical or verbal judgment that places a person or an
attribute along a continuum identified by a rating scale.
▪ Leniency Error/Generosity Error – an error in rating that
arises from the tendency on the part of the rater to be lenient in
scoring, marking, or grading.
▪ Severity Error – overly critical scoring
▪ Central Tendency Error – The rater has reluctance in giving
ratings at either positive or negative extreme.
◼ Rater’s ratings would tend to cluster in the middle of the
continuum.
▪ Halo Effect – The tendency to give ratee a higher rating than
he/she objectively deserves because of the rater’s failure to
discriminate among conceptually distinct and potentially
independent aspects of a ratee’s behavior.
◼ Tendency to ascribe positive attributes independently of
the observed behavior.


RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY


◼ Reliability and validity are partially related and partially independent.
◼ Reliability is a prerequisite for validity, meaning a measurement cannot be
valid unless it is reliable.
◼ It is not necessary for a measurement to be valid for it to be considered
reliable.

FORMULA FOR ESTIMATING THE MAXIMUM VALIDITY


◼ r₁₂max = √r₁₁r₂₂
◼ r₁₂max = maximum validity
◼ r₁₁ = reliability of the test
◼ r₂₂ = reliability of the criterion
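For example, with hypothetical reliabilities r₁₁ = .81 for the test and r₂₂ = .64 for the criterion, the validity coefficient can be at most r₁₂max = √(.81 × .64) = √.5184 = .72.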

THINK!

As a student, what is your philosophy in life? What is


your ultimate end?

Lesson 4

Test Development

STAGES OF TEST DEVELOPMENT

BASIC ASSUMPTIONS IN PSYCHOLOGICAL TESTING


1. Psychological traits and states exist
2. Psychological traits and states can be quantified and measured
3. Test-related behavior predicts non-test related behavior
4. Tests and other measurement techniques have strengths and
weaknesses
5. Various sources of error are part of the assessment process – True
Score Theory
6. Testing and Assessment can be conducted in an unbiased manner.
7. Testing benefits society


TEST DEVELOPMENT - An umbrella term for all that goes into the process of creating a
test.
1. Test conceptualization – an early stage of test development wherein idea for
a particular test is conceived.
◼ Stage wherein the following are determined: Construct, Goal, User,
Taker, Administration, Format, Response, Benefits, Costs,
Interpretation
◼ Determination whether the test would be Norms-Referenced or
Criterion-Referenced
*PILOT WORK
◼ Also called pilot study or pilot research
◼ May take the form of interviews to determine appropriate
items for the test.
◼ It may entail literature reviews, experimentation, or any effort
the researcher may undertake in order to determine the items that
might be included in the test.

2. Test construction – writing test items as well as formatting items, setting scoring
rules, and otherwise designing and building a test.
Scaling – process of setting rules for assigning numbers in measurement.
◼ Designing a measuring device for the trait/ability being measured.
◼ Manifested through its item format (dichotomous, polytomous, Likert,
category)
◼ Item pool – usually 2 times the number of items intended for the final
form (3 times is advised for inexperienced test developers)
◼ The reservoir from which items will or will not be drawn
for the final version of the test
◼ Final test items should contain all domains of the test.
◼ Determination of scoring model
◼ Cumulative, categorical, ipsative
◼ Creation of final format of the test

3. Test Tryout – administration of a test to a representative sample of test
takers under conditions that simulate the conditions under which the final
version of the test will be administered.
Issues:
◼ Determination of the target population
◼ Determination of the number of samples for the test tryout (number of items
multiplied by 10)
◼ Test tryout should be executed under conditions as identical as possible
to the conditions under which the standardized test will be
administered.


4. Item analysis – entails procedures, usually statistical, designed to explore
how individual test items work as compared with other items in the test and in
the context of the whole test.
◼ Determination of the following:
◼ Reliability

Lesson 5

Item Analysis

ITEM ANALYSIS - A general term for a set of methods used to evaluate test items,
one of the most important aspects of test construction.
▪ Item Difficulty
▪ Item Reliability
▪ Item Validity
▪ Item Discriminability
▪ Items for Criterion Reference Test
*Distractor Analysis
ITEM DIFFICULTY (p)
◼ Item difficulty is defined by the number of people who get a particular item
correct on a test that measures achievement or ability.
◼ For example, if 79% of the test takers answered item number 1 correctly,
then the item has a difficulty index of .79.
◼ This index, however, actually indicates the easiness of an item rather than its difficulty.
◼ It is also suggested that achievement tests use multiple-choice
questions because a four-option item leaves only a .25 chance of getting the correct
response by guessing.
◼ Should range from 0.30 to 0.70

OPTIMUM ITEM DIFFICULTY


◼ Suggests the best difficulty for an item based on the number of responses.
◼ OID = (chance performance + 1)/ 2
◼ Chance performance – performance based on guessing; computed by
dividing 1 by the number of response options.
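For example, for a four-option multiple-choice item, chance performance is 1/4 = .25, so OID = (.25 + 1)/2 = .625.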
Item difficulty Index
◼ The value that describes the item difficulty for an ability test.
Item endorsement Index


◼ The value that describes the percentage of individuals who
endorsed an item in a personality test.

OMNIBUS SPIRAL FORMAT


◼ Items in an ability test are arranged in increasing difficulty.
◼ Used to keep examinees motivated while taking the test.
▪ Give away items – presented near the beginning of the test to
spur motivation and lessen test anxiety.

ITEM RELIABILITY
◼ Indicates the internal consistency of a test: the higher the index, the higher
the internal consistency.
◼ (Item reliability) = (SD of the item) x (item-total correlation)
◼ Factor analysis can also be used to determine which items load more strongly
on the whole test.
ITEM VALIDITY
◼ Provides an indication of the degree to which a test is measuring what it
purports to measure: the higher the item-validity index, the higher the
criterion-related validity of the test.
◼ Item validity = (item standard deviation) x (correlation between item and
criterion)
ITEM DISCRIMINATION (d)
◼ Indicates how adequately an item separates high scorers from low scorers
on the entire test.
◼ How well an item performs in relation to some criterion. In other words, it
tells us the degree of association of the performance to an item and
performance on the whole test.
◼ A discrimination index of at least 0.30 is the usual limit.
◼ The higher the d, the more high scorers answer the item correctly.
▪ Extreme Group Method
▪ Point Biserial Correlation

EXTREME GROUP METHOD


◼ This method compares people who have done well with those who have
done poorly on a test.
◼ For example, you might take the students with test scores in the top third
and those in the bottom third of the class, then find the proportion of people
in each group who got each item correct. The proportion for the bottom group
is subtracted from the proportion for the top group to get the discriminability
index.
STEPS USING EXTREME GROUP METHOD:
◼ Choice of 25% – 33% of the population above and below. (upper and
lower group should be identical)


▪ Ex. the top 27% of the population as the upper group; the bottom 27% of the
population as the lower group
◼ Find the proportion of students who answered each item correctly in
both groups.
◼ Subtract the proportion of the lower group from the proportion of the upper
group to get the discrimination index.
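A minimal sketch of these steps (hypothetical 0/1 response data; NumPy assumed), followed by a sample exercise:

```python
import numpy as np

# Hypothetical responses on 3 items: 3 examinees from the upper group and
# 3 from the lower group (groups already formed from total scores).
upper = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 1]])
lower = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])

# d = proportion correct in the upper group minus proportion in the lower group.
d = upper.mean(axis=0) - lower.mean(axis=0)
for i, d_i in enumerate(d, start=1):
    print(f"item {i}: d = {d_i:+.2f}")
```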

SAMPLE EXERCISE:
Item    Proportion Correct,      Proportion Correct,        Discrimination
        top third of class       bottom third of class      Index

1       .89                      .34                        .55
2       .76                      .36                        .40
3       .95                      .45
4       .97                      .93
5       .56                      .69

POINT BISERIAL
◼ Used for correlating dichotomous and continuous data.
◼ Correlates whether those who got an item correct also tend to have high total
scores.

ITEM CHARACTERISTICS CURVE


◼ Graphic representation of item difficulty and discrimination.
◼ Usually plots total scores on the x-axis and p and d on the y-axis.

ITEMS FOR CRITERION REFERENCED TESTS


◼ A frequency polygon is created after the test is given to two groups: one group
that was exposed to the learning unit and another group that was not.
◼ Antimode - the score with the lowest frequency
▪ Used in determining the cut score (passing score) for a criterion-referenced
test.

ISSUES AMONG TEST ITEMS: HOW TO IMPROVE A FAULTY ITEM?


Items are fair
◼ Item fairness is the degree to which an item is free from bias.
◼ Biased test items – items that favor one particular group of examinees, e.g.,
items mentioning products or experiences far more familiar to girls than to
boys, which is why such terms are avoided.
▪ Bias can be tested using inferential statistics comparing groups.
Qualitative Item Analysis
◼ Involves exploration of the issues through verbal means such as interviews and
group discussions conducted with test takers and other relevant parties.
Think-Aloud Administration
▪ Allows test takers (during standardization) to speak their minds while
taking the test. Used to shed light on the test taker’s thought
process during the administration of the test.
Expert panels
◼ Guide researchers/test developers in doing sensitivity review (especially in
cultural issues)
▪ Sensitivity review – a study of test items typically to examine test bias,
presence of offensive language and stereotypes.

Lesson 6

Norming

THE NORMAL DISTRIBUTION


◼ Normal distribution – the majority of test takers score near the middle of the
distribution; very few test takers score at the extremes
▪ Mean = median = mode
▪ Q1 and Q3 are equally distant from Q2 (the median)
◼ Positively skewed distribution – more test takers got a low score
▪ Mean>median>mode
▪ (Q3-Q2)>(Q2-Q1)


◼ Negatively skewed distribution – more test takers got a high score


▪ Mode>median>mean
▪ (Q2-Q1)>(Q3-Q2)

KURTOSIS - The steepness of a distribution

A. Platykurtic – flat; the frequencies of high and low scores are not far from
the frequency of scores near the mean.
B. Leptokurtic – peaked; the frequencies of high and low scores are far from
the frequency of scores near the mean.
C. Mesokurtic – middle; the distribution is deemed normal.

DECILE - Points where the distribution is equally divided into 10 parts.


◼ D1 – D9

STANDARD SCORES
◼ A raw score that has been converted from one scale to another scale
◼ Provide a context for comparing scores on different tests by converting the scores
from the two tests into z-scores
◼ “z scores are golden”

Z SCORE
◼ Mean of 0 ; SD of 1
◼ Zero plus or minus one scale
◼ When determined, can be used to translate one scale to another.

OTHER STANDARD SCORES


T – Score
◼ Mean = 50; SD = 10
◼ Created by McCall in honor of his professor Thorndike
Stanine
◼ Mean = 5; SD = 2
◼ Used in US Air Force assessment
◼ Takes whole numbers 1 – 9; no decimals
Deviation IQ
◼ Mean = 100; SD = 15
◼ Used for interpreting IQ
Sten
◼ Standard ten
◼ Mean = 5.5; SD = 2


GRE or SAT (Graduate Record Exam/ Scholastic Aptitude Test)


◼ Mean = 500; SD = 100
◼ Used for admission to graduate school and college

RELATIONSHIP AMONG STANDARD SCORES

LINEAR TRANSFORMATION
◼ A formula derived from the z-score used to transform a score from one scale to
another.
◼ NS = SD(z) + M, where NS is the new standard score and SD and M are the
standard deviation and mean of the new scale.
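A minimal sketch of these transformations (the raw score, mean, and SD are hypothetical):

```python
raw, raw_mean, raw_sd = 26.0, 20.0, 4.0     # hypothetical raw score and scale stats

z = (raw - raw_mean) / raw_sd               # z score: mean 0, SD 1
t_score = 10 * z + 50                       # T score: mean 50, SD 10
deviation_iq = 15 * z + 100                 # deviation IQ: mean 100, SD 15
stanine = max(1, min(9, round(2 * z + 5)))  # stanine: whole numbers 1-9

print(z, t_score, deviation_iq, stanine)    # 1.5 65.0 122.5 8
```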

PERCENTILE RANKS
◼ Tells the relative position of a test taker within a group of 100.
◼ Indicates what percentage of the sample falls below a specified score.
◼ For example, if a person scores at the 50th percentile, 50 percent of the test
takers fall below that specific score.

PROCEDURES IN CONSTRUCTING PERCENTILE AS A NORM


◼ 1. Enumerate all possible scores and arrange them in increasing order (low scores at
the bottom of the frequency distribution, and vice versa)


◼ 2. Tally the frequency for each of the scores. Construct a cumulative
frequency for each.
◼ 3. Use the formula for percentiles.
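A minimal sketch of these steps using the midpoint convention PR = 100 × (cumulative frequency below the score + 0.5 × frequency at the score) / N, applied to the first ten scores of the exercise below:

```python
from collections import Counter

scores = [11, 12, 14, 8, 20, 18, 18, 13, 15, 11]  # first row of the exercise data
freq = Counter(scores)
n = len(scores)

below = 0
for score in sorted(freq):                       # step 1: scores in increasing order
    pr = 100 * (below + 0.5 * freq[score]) / n   # steps 2-3: cumulative f, then formula
    print(f"score {score}: PR = {pr:.1f}")
    below += freq[score]
```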

EXERCISE:
Create a norm for the following test data using standard scores (z, T, etc.) and
percentiles, and determine the mean, median, mode, SD, and range.
11 12 14 8 20 18 18 13 15 11

5 19 15 14 9 10 17 15 16 6

9 12 13 15 16 20 18 12 17 10

12 11 10 15 14 9 10 17 13 11

NORMS: RAW SCORES ARE MEANINGLESS


◼ Performance by defined groups on a particular test.
◼ Transformation of raw scores allows meaningful interpretations of scores
on a test
▪ Norming – process of creating norms
▪ Normative samples – the group of people whose performance on a
particular test is analyzed and used as a reference
▪ Race norming – norming based on race/ culture
▪ User norms – norms provided by the test manuals.
▪ Norman – the person who constructs a norm

TYPES OF NORMS
◼ Criterion-Referenced Testing – interpretation of the test is based on a certain
standard.
▪ A method of evaluation and a way of deriving meaning from test scores by
evaluating an individual’s score with reference to a set standard.
▪ Also called Content-referenced or Domain-referenced
▪ Criterion – a standard on which a judgment or decision is based.
◼ Norm-Referenced Testing – the score is interpreted based on the performance of
a standardization group.
Developmental norms
◼ indicates how far along the normal developmental path an
individual has progressed.
Age norms
▪ A child’s score on a test corresponds to the highest year
level or age level that he can successfully complete.
Grade norms


▪ Assigns achievement on a test or battery of tests according
to grade norms.
Ordinal Scale
▪ Designed to identify the stage reached by the child in
the development of specific behavior functions.
Within group norms
◼ The individual’s performance is evaluated in terms of the
performance of the most nearly comparable standardization group.
Percentiles
▪ Expressed in terms of the percentage of persons in the
standardization sample who fall below a given raw score (RS). A
percentile indicates the individual’s relative position in the
standardization sample.
Standard scores
▪ Derived scores that use as their unit the SD of the
population upon which the test was standardized.
Deviation IQ
▪ A standard score on an intelligence test with a mean of
100 and an SD that approximates the SD of the Stanford-Binet
IQ distribution.

National Norms
◼ norms on large scale samples
◼ National representativeness
◼ Subgroup norms – a normative sample segmented by any of the
criteria initially used in selecting samples
◼ Local norms – provide normative information with respect to
the local population’s performance on a test.

MODULE SUMMARY

SUMMATIVE TEST

1. Discuss the role of ethics in your life as a person.

2. Distinguish ethics and morality.


