Psych Assessment Notes
WHAT IS A TEST?
• Assessment: procedure to gather information about people
• Test: a type of assessment that uses specific procedures to obtain information and convert that information into numbers or scores.
– Use of specific or systematic procedures
• Selecting a set of items or tests questions
• Specifying conditions under which the test is administered
– Scoring of responses
– Sample of behavior
• TESTING
– term used to refer to the process that covers the administration of a test to the interpretation of a test score.
• PSYCHOLOGICAL TESTING
– The process of measuring psychology-related variables through the use of devices or procedures designed to obtain
a sample of behavior.
IRT RAMIFICATIONS
• The standard error of measurement (SEM) differs at different levels of Θ; shorter tests can therefore be more reliable than longer ones
• Different test forms are best suited for respondents at different Θ (ability) levels
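As a rough illustration (not from the original notes), the sketch below assumes a two-parameter logistic (2PL) IRT model with made-up item parameters and shows how the conditional SEM changes across levels of Θ:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL IRT model
    (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def test_sem(theta, items):
    """Test information is the sum of item informations a^2 * P * (1 - P);
    the conditional standard error of measurement is 1 / sqrt(information)."""
    info = sum(a**2 * p_2pl(theta, a, b) * (1 - p_2pl(theta, a, b)) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical item parameters (a, b): a short test targeted at average ability.
items = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.3), (0.8, 0.8)]
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+}:  SEM={test_sem(theta, items):.2f}")
```

The printed SEM is smallest near the ability levels the items target and grows toward the extremes, which is why a short, well-targeted form can outperform a longer, poorly targeted one.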
NORMS
• Measurement
– the act of assigning numbers or symbols to characteristics of things (people, events, whatever) according to rules
– The rules used in assigning numbers are guidelines for representing the magnitude (or some other characteristic) of
the object being measured
• Scale
– A scale is a set of numbers (or other symbols) whose properties model empirical properties of the objects to which
the numbers are assigned.
• Types of Variables
– Discrete Variables
• consist of indivisible categories
– Continuous Variable
• infinitely divisible into whatever units a researcher may choose
NORMS
• Transformation of raw scores into a more meaningful scale derived from the performance of a large sample of persons
representative of one or more specified groups.
• Norms of a test are based on the distribution of scores obtained by some defined sample of individuals.
• Developed through SAMPLING
– Deriving a representative group of the population – a sample
SAMPLING
• Probability Sampling
– Each element has a known probability of being sampled.
• Nonprobability Sampling
– Each element has an unknown probability of being sampled.
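A minimal sketch of the distinction, using made-up data: simple random sampling as one probability method versus convenience sampling as one nonprobability method.

```python
import random

population = list(range(1, 1001))   # hypothetical sampling frame of 1,000 people
n = 50

# Probability sampling: simple random sampling, each element has a known (equal)
# chance of being selected.
random_sample = random.sample(population, n)

# Nonprobability sampling: convenience sampling, e.g. taking whoever is easiest
# to reach; the probability of inclusion for any given element is unknown.
convenience_sample = population[:n]
```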
TYPES OF NORMS
• Developmental Norms
– Indicates how far along the normal developmental path an individual has progressed
• Age Norms
• Age-equivalent scores
• Indicate the average performance of different samples of testtakers who were at various ages at
the time the test was administered
• Grade Norms
• Designed to indicate the average test performance of testtakers in a given school grade
• Mental Age
• A child’s score on a test corresponds to the highest year level or age that he can complete
• Ordinal Scale
• Identify the stage reached by the child in the development of a specific behavior function
PSYCHOMETRIC PROPERTIES
RELIABILITY & VALIDITY
RELIABILITY
• Basic premise
– Presence of error
• Find the magnitude of error and develop ways to minimize it
– Tests that are relatively free from measurement error are deemed to be reliable.
• The extent to which measurements are consistent or repeatable; also, it is the extent to which measurements differ from
occasion to occasion as a function of measurement error
• It is concerned with that portion of measurement that is due to permanent effects that persist from sample to sample.
• Conceptualization of error
– Errors exist because we obtain only a sample of behavior
• Charles Spearman pioneered reliability assessment; other key contributors include:
– De Moivre
– Pearson
– Kuder and Richardson
– Cronbach
RELIABILITY COEFFICIENT
– is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total
variance
Sources of error
• Test Construction
– Item sampling; content sampling
• Test Administration
• Test Scoring and Interpretation
Reliability estimates
• Test-Retest Reliability Estimate
– Test-retest method
– an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations
of the same test
– When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to
as the coefficient of stability
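A minimal sketch of a test-retest estimate with hypothetical scores; the estimate is simply the Pearson r between the two administrations (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same 8 people on two administrations of the same test.
time1 = [12, 15, 11, 18, 14, 20, 9, 16]
time2 = [13, 14, 12, 17, 15, 19, 10, 15]

r_test_retest = correlation(time1, time2)   # Pearson r = test-retest reliability estimate
print(f"test-retest r = {r_test_retest:.2f}")
```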
• Parallel-Forms and Alternate-Forms Reliability Estimates
– coefficient of equivalence
– Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are
equal
– Alternate forms are simply different versions of a test that have been constructed so as to be parallel
• Split-Half Reliability Estimates
– obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once
– Step 1. Divide the test into equivalent halves.
– Step 2. Calculate a Pearson r between scores on the two halves of the test.
– Step 3. Adjust the half-test reliability using the Spearman-Brown formula
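A minimal sketch of these three steps with hypothetical item responses, assuming an odd-even split (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical 0/1 item responses (rows = testtakers, columns = items).
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
]

# Step 1: divide the test into equivalent halves (odd-even split assumed here).
odd_half  = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

# Step 2: Pearson r between scores on the two halves.
r_half = correlation(odd_half, even_half)

# Step 3: Spearman-Brown correction to estimate full-length reliability.
r_full = (2 * r_half) / (1 + r_half)
print(f"half-test r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```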
• Other methods of determining internal consistency
– Inter-item consistency refers to the degree of correlation among all the items on a scale
• A measure of inter-item consistency is calculated from a single administration of a single form of a test
• Homogeneous Test – measures a single trait
• The more homogeneous the test is, the better the internal consistency
– Kuder-Richardson formulas
• Where test items are highly homogeneous, KR20 and split-half reliability estimates will be similar.
• However, KR20 is the statistic of choice for determining the inter-item consistency of dichotomous items,
primarily those items that can be scored right or wrong (such as multiple-choice items).
• If test items are more heterogeneous, KR20 will yield lower reliability estimates than the split-half method.
• KR-20 = [k / (k − 1)] × [1 − (Σpq / σ²)], where KR-20 stands for the Kuder-Richardson formula 20 reliability coefficient, k is the number of test items, σ² is the variance of total test scores, p is the proportion of testtakers who pass the item, q is the proportion of people who fail the item, and Σpq is the sum of the pq products over all items
– KR21
• Reliability formula that does not require the calculation of p’s and q’s for every item
• It uses an approximation of the sum of the pq products: the mean test score
• Assumptions: all items are of equal difficulty, or the average difficulty is 50%
– Coefficient Alpha
• May be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown
formula
• Nondichotomous items
• α = [k / (k − 1)] × [1 − (Σσᵢ² / σ²)], where α is coefficient alpha, k is the number of items, σᵢ² is the variance of one item, Σσᵢ² is the sum of the variances of each item, and σ² is the variance of the total test scores
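A minimal sketch computing coefficient alpha, KR-20, and KR-21 on the same hypothetical right/wrong data; the calculations follow the formulas above:

```python
from statistics import mean, pvariance

# Hypothetical 0/1 (right/wrong) item responses: rows = testtakers, columns = items.
responses = [
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
]
k = len(responses[0])
totals = [sum(row) for row in responses]
var_total = pvariance(totals)                 # variance of total test scores

# Coefficient alpha: uses the variance of each item (works for nondichotomous items too).
item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / var_total)

# KR-20: for dichotomous items, p*q per item replaces the item variance
# (for 0/1 items the two are identical, so KR-20 equals alpha here).
pq_sum = sum(p * (1 - p) for p in (mean(row[i] for row in responses) for i in range(k)))
kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)

# KR-21: approximates the sum of pq from the mean total score
# (assumes all items are of equal difficulty).
x_bar = mean(totals)
kr21 = (k / (k - 1)) * (1 - (x_bar * (k - x_bar)) / (k * var_total))

print(f"alpha = {alpha:.2f}, KR-20 = {kr20:.2f}, KR-21 = {kr21:.2f}")
```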
RELIABILITY ESTIMATES
Alpha as an index
• Usually, an internal consistency value of .70 is deemed as appropriate.
• However, a newly developed test should not, as far as possible, have a very high internal consistency of .90 and above (very high values can signal redundant items)
• Some scales achieving a modest reliability of .60 to .69 are accepted (in values research)
• What happens if I do not achieve any alpha values within this range?
– the scale is likely poor and needs revision
• Considerations in the use and purpose of reliability coefficients
VALIDITY
• The agreement between a test score or measure and the quality it is believed to measure.
• As applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
– More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test
scores.
• Validation: the process of gathering and evaluating evidence about validity
– validation studies (e.g., local validation studies)
• FACE VALIDITY
– On its face value, the test appears to measure what it purports to measure
• Not a true measure of validity
• No supporting evidence
• Not the same as content validity
• CONTENT VALIDITY
– Extent to which a test assesses all the important aspects of a phenomenon that it purports to measure
• Two concepts:
– Construct under-representation
• Failure to capture important components of the construct
– Construct-irrelevant variance
• Scores are influenced by factors irrelevant to the construct
• CRITERION VALIDITY
– How well a test corresponds to a particular criterion
– CONCURRENT VALIDITY
• extent to which a test yields the same results as other, established measures of the same behavior, thoughts, or feelings
– PREDICTIVE VALIDITY
• good at predicting how a person will think, act, or feel in the future
• CONSTRUCT VALIDITY
– extent to which a test measures what it is supposed to measure and not something else altogether
– Predictive
• Predictor and criterion
– Concurrent
Validity coefficient
• Relationship between a test and a criterion
• Tells the extent to which the test is valid for making statements about the criterion
• Coefficients above .60 are rare; values of .30 to .40 are usually considered high
• Statistical significance: conventionally, less than a 5 in 100 chance that the relationship is due to chance (p < .05)
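A minimal sketch with hypothetical predictor and criterion scores; scipy.stats.pearsonr (assumed to be available) returns both the validity coefficient and its significance level:

```python
from scipy.stats import pearsonr

# Hypothetical data: selection test scores and a job-performance criterion
# (e.g., later supervisor ratings) for the same 10 people.
test_scores = [52, 61, 47, 70, 55, 66, 49, 73, 58, 64]
criterion   = [3.1, 3.8, 2.9, 4.2, 3.0, 3.9, 3.2, 4.5, 3.3, 3.7]

r, p = pearsonr(test_scores, criterion)   # r = validity coefficient, p = significance
print(f"validity coefficient r = {r:.2f}, p = {p:.3f}")
# By convention the coefficient is interpreted only if p < .05
# (less than a 5 in 100 chance the relationship is due to chance).
```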
Construct validity
• Construct: something built by mental synthesis
• Involves assembling evidence about what a test means
– Show relationship between test and other measures
– Construct validity
• Convergent Evidence
– Correlation between two tests believed to measure the same construct
• Discriminant Evidence
– Divergent validation
– The test measures something unique
– Low correlations with unrelated constructs
UTILITY
• The usefulness or practical value of testing to improve efficiency
• the usefulness or practical value of a training program or intervention
• Factors That Affect a Test’s Utility
• Psychometric Soundness
– Reliability and Validity
• Cost
– economic
– financial
– budget-related
• Benefits
– benefits of testing justify the costs of administering, scoring, and interpreting the test
• Utility Analysis
– a family of techniques that entail a cost–benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
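As one concrete illustration (the notes do not name a specific model), the sketch below uses the Brogden-Cronbach-Gleser formulation of utility analysis with entirely hypothetical figures:

```python
# A minimal cost-benefit sketch using one common utility-analysis model
# (Brogden-Cronbach-Gleser). All numbers are hypothetical.

n_tested      = 200      # applicants who take the test
n_selected    = 20       # applicants hired on the basis of test scores
tenure_years  = 2.0      # expected time selectees stay on the job
validity      = 0.35     # validity coefficient of the test for job performance
sd_y          = 10_000   # SD of job performance in dollars per year
mean_z_sel    = 1.0      # mean test score (in z units) of those selected
cost_per_test = 25       # cost to administer, score, and interpret one test

benefit = n_selected * tenure_years * validity * sd_y * mean_z_sel
cost    = n_tested * cost_per_test
utility_gain = benefit - cost
print(f"estimated utility gain = ${utility_gain:,.0f}")
```

The point of the exercise is simply that the benefits of testing must exceed the costs of administering, scoring, and interpreting the test.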
Norm-referenced testing
• Depicting an individual’s score relative to a normative sample
• Raw scores are meaningless on their own
• We present data in a way that can reflect the performance relative to the normative sample
– Standard Scores
– Scaled Scores
– T Scores
– Percentile Scores
• Transformation of scores
– Simplest form is percentage
– E.g. percentage of correct responses in a test
• Characteristics of transformation
– It does not change the score
– It takes into account information not contained in the raw score itself
– It presents the score in units that are more informative and interpretable than the raw score
– transformations
• Linear Transformation
– Adds and/or multiplies by constant values
– Transformed score = constant + (weight x raw score)
– E.g. converting Fahrenheit to Celsius
• Z-Scores
• T Scores
• Scaled Scores
• Area Transformations
– Uses areas under the normal curve to create scale score values
• Percentile Rank
• Stanine
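A minimal sketch of both kinds of transformation with hypothetical normative values: linear transformations to z, T, standard, and scaled scores, and an area transformation to a percentile rank via the normal curve:

```python
from statistics import NormalDist

# Hypothetical normative data: the raw-score mean and SD of the normative sample.
norm_mean, norm_sd = 34.0, 6.0
raw = 43

# Linear transformations: transformed = constant + (weight x z-score).
z = (raw - norm_mean) / norm_sd        # z-score: M = 0, SD = 1
t = 50 + 10 * z                        # T score: M = 50, SD = 10
standard = 100 + 15 * z                # deviation-IQ style standard score: M = 100, SD = 15
scaled = 10 + 3 * z                    # scaled score (commonly M = 10, SD = 3)

# Area transformation: percentile rank from the area under the normal curve below z.
percentile = NormalDist().cdf(z) * 100

print(f"z={z:.2f}  T={t:.1f}  standard={standard:.1f}  scaled={scaled:.1f}  percentile={percentile:.0f}")
```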
Standard scores
• Traditional method of descriptive reporting usually found in intellectual assessment
• How far the individual’s score deviates from the mean for the relevant normative sample
• A standard score of 100 represents precisely average performance: 50% of the normative sample scored higher than that individual and 50% scored lower
• Generally accepted normal range of scores would fall between 90 and 110 (normal limits)
– 80-89: low average
– 70-79: borderline deficiency
– <70: deficient performance
• only 2% fall below this standard score, hence it is generally accepted as the cutoff point for the identification
of deficient performance
– 110-120: high average
– >120: superior performance
• Only about 2% of the population achieve a score above 130
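A small helper (illustrative only) that maps a standard score to the descriptive ranges listed above:

```python
def classify_standard_score(score: int) -> str:
    """Map a standard score (M = 100, SD = 15) to the descriptive ranges in these notes."""
    if score < 70:
        return "deficient performance"
    if score < 80:
        return "borderline deficiency"
    if score < 90:
        return "low average"
    if score <= 110:
        return "normal limits (average)"
    if score <= 120:
        return "high average"
    return "superior performance"

for s in (65, 75, 85, 100, 115, 125):
    print(s, classify_standard_score(s))
```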
• Other standard scores
• T Scores*
• IQ Format Scores
– Deviation IQ
– M=100, SD=15 (Some have SD=16)
• Z-scores
– M=0, SD=1
• Stanine (“standard nine”)
– M=5, SD=2
– Scaled scores
• Scaled scores
– Presents a different descriptive framework
– A scaled score of 10 represents average performance
• Normal range is 8-12
• 6-7: low average
• 4-5: borderline deficient
• <4: deficient
• T scores
– M=50, SD=10
– 50 is average
– 40-60: normal range
– Below 30 (2nd %ile) or above 70(98th %ile) are unusually low or high performance
• Common among personality tests
• Converting z-scores to T scores: T = 50 + 10z (e.g., a z-score of 1.5 corresponds to a T score of 65)
• Percentile score
– Distribution is divided into 100 equal parts
– Provides an index on where one’s score stands relative to all others on a scale of 1 to 100
• 50th %ile is average performance
• 25th to 75th percentile represents normal range
• Below 2nd and above the 98th percentile are very unusual
– If an individual scores at the 15th percentile, that means only 15% of the standardization sample received scores at or below the score the subject attained
– Advantage: easy to understand, easy to calculate
• Ordinal Scale
– Differences between percentile units are not equivalent
• Reason for selecting a method
• No particular reason
• Scoring systems are interchangeable
– T of 50 = Standard Score of 100 = Scaled Score of 10
• All are equivalent to a percentile score of 50
• Considerations are for statistical analysis
– Percentiles, being ordinal, cannot be used in most parametric statistical analyses
Criterion-referenced testing
• Setting standards and cutoffs
• Test scores are compared to external standards (i.e. criterion)
– Performance is assessed on how it relates to standards
– E.g. Grading Systems (A+, A-, B+, …, F; 1, 2, 3, 5; 4, 3, 2, 1)
• Advantage: it is easy to compare scores to a standard
• Problem: technical problems in developing these tests
– Basis for setting the standards, cutoffs
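A minimal sketch of criterion-referenced scoring with hypothetical cutoffs: each score is compared to fixed standards rather than to other testtakers:

```python
# Criterion-referenced scoring: each score is compared to preset standards (cutoffs),
# not to a normative sample. The cutoffs below are hypothetical.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def grade(percent_correct: float) -> str:
    """Return the letter grade whose cutoff the score meets or exceeds."""
    for cutoff, letter in CUTOFFS:
        if percent_correct >= cutoff:
            return letter
    return "F"

for score in (95, 84, 72, 58):
    print(score, grade(score))
```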