

Review Center for Allied Professions


Psychometrician Review
PSYCHOLOGICAL ASSESSMENT
Professor Renz Christian Argao

INTRODUCTION TO PSYCHOLOGICAL ASSESSMENT

WHAT IS A TEST?
• Assessment: procedure to gather information about people
• Test: type of assessment that uses specific procedures to obtain information and convert that information to numbers or scores.
– Use of specific or systematic procedures
• Selecting a set of items or test questions
• Specifying conditions under which the test is administered
– Scoring of responses
– Sample of behavior
• TESTING
– term used to refer to the process that covers the administration of a test to the interpretation of a test score.
• PSYCHOLOGICAL TESTING
– The process of measuring psychology-related variables through the use of devices or procedures designed to obtain
a sample of behavior.

PSYCHOLOGICAL TESTING VS. PSYCHOLOGICAL ASSESSMENT


• PSYCHOLOGICAL ASSESSMENT
– The collection and integration of psychology-related data for use in a psychological evaluation, accomplished
through the use of tools such as tests, interviews, case studies, behavioral observations, and specially
designed apparatuses and measurement procedures.
• Approaches to assessment
– COLLABORATIVE PSYCHOLOGICAL ASSESSMENT
• Assessor and assessee work as partners from initial contact to final feedback
• May include therapy as part of the process
• Therapeutic Psychological Assessment: therapeutic self-discovery and new understandings are
encouraged throughout the assessment process
– DYNAMIC ASSESSMENT
– an interactive approach to psychological assessment that usually follows a model of
• evaluation
• intervention of some sort
• evaluation
• Used in:
• Educational
• Correctional
• Corporate
• Neuropsychological
• Clinical

TOOLS OF PSYCHOLOGICAL ASSESSMENT


THE TEST
• Test – measuring device or procedure

PSYCHOLOGICAL TEST: DIFFERENCES


• CONTENT
– Subject matter
– “focus”
– Tests may share the same purpose but differ in content
• e.g., personality tests
• Different theoretical orientations
• Different operational definitions
• FORMAT
– form, plan, structure, arrangement, and layout of test items
– Time limit
– Form on which the test is administered
• Pencil-and-paper, computerized
– Procedures in obtaining samples of behavior
• ADMINISTRATION PROCEDURES
– Individual
• Skills
• Tasks
• Knowledge
– Group Administration
• SCORING AND INTERPRETATION PROCEDURES
– Score
• Code or summary statement
• Reflects the evaluation
– Scoring
• process of assigning such evaluative codes or statements to performance on tests, tasks, interviews, or
other behavior samples
– Types of Scores
• Based on simple summation of item scores or on more elaborate scoring procedures
• TECHNICAL QUALITY
– Psychometric Soundness
• Psychometrics = science of psychological measurement
• Validity
• Reliability
• Utility
THE INTERVIEW
• Interview : “face-to-face talk”
• In Psychology
– More than talking
– “what is said and how it is said”
– VERBAL AND NON VERBAL BEHAVIOR
• Body Language
• Eye movement/contact
• Facial expression
• Gestures
• Dress/Attire, Hygiene
– Telephone Interview
– Panel Interview
THE PORTFOLIO
– Files containing one’s works
– Can be in film, canvas, paper, etc.
– Sample of one’s ability
CASE HISTORY DATA
– Refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information,
official and informal accounts, and other data and items relevant to an assessee
– Files or excerpt from files stored in institutions
– Letters, correspondences, news clippings, work samples, doodles, diary
BEHAVIORAL OBSERVATION
– Monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or
qualitative information regarding the actions
– Used as a diagnostic aid, for selection purposes
ROLE PLAY TESTS
– acting an improvised or partially improvised part in a simulated situation
– a tool of assessment wherein assessees are directed to act as if they were in a particular situation

WHO ARE THE PARTIES INVOLVED IN PSYCHOLOGICAL TESTING?


• Test Developers
• Test users
• THE TEST TAKER
• THE SOCIETY-AT-LARGE
– Society’s demand for “some way of organizing or systematizing the many-faceted complexity of individual
differences.”
– As society changes, new tests are developed
– Laws and legislations on testing
– Court decisions
• OTHERS
– Companies
– Organizations
– Governmental agencies
– Schools
– Clinics
How is psychological assessment conducted?
• CONDUCTING ASSESSMENTS
– Tests Standards
– Responsible test users have obligations before, during, and after a test or any measurement procedure is
administered
– Assessment of People with Disabilities
• Alternate Assessment
• an evaluative or diagnostic procedure or process that varies from the usual, customary, or
standardized way a measurement is derived either by virtue of some special accommodation made
to the assessee or by means of alternative methods designed to measure the same variable(s).
• Accommodation (“to adapt, adjust, or make suitable”) for the assessee
• Accommodation may be defined as the adaptation of a test, procedure, or situation, or the
substitution of one test for another, to make the assessment more suitable for an assessee with
exceptional needs
• e.g., large print, audio format, braille

MEASURING PSYCHOLOGICAL CHARACTERISTICS


• Psychological Measurement is Less Precise
– Psychological tests measure only a sample of the property under study; inferences must be drawn
– Psychological Measurement uses a more limited scale
– Psychological Measurement is affected by extraneous variables
• Psychological Measurement is Less Direct
– Psychological Tests are designed to draw inferences about underlying attributes or characteristics
– Psychological Tests are designed to measure constructs
• Hypothetical dimensions or characteristics
• Operational Definitions
• Pros and cons of psychological testing
• PROBLEMS
– Misunderstanding about or misuse of psychological tests
• People regard test scores as precise
– Imprecise measures = ineffective?
– Tests are biased against women and minority groups, dehumanizing, and invasion of personal privacy

CULTURAL, ETHICAL & LEGAL CONSIDERATIONS


Culture and Assessment
• Culture
– the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or
group of people
– Totality of the way of life
– Psychological Assessment: Sensitivity to culture
• Culture-Specific Tests
– tests designed for use with people from one culture but not from another
– Culture-Fair Tests
• Issues
– Verbal Communication
• Nonverbal Communication and Behavior
• Standards of Evaluation
• Tests and Group Membership

Legal and Ethical Issues


• Laws
– Rules that individuals must obey for the good of the society as a whole
• Ethics
– a body of principles of right, proper, or good conduct

Regulations concerning the use of tests


• Regulations on use of tests
– Test Copyright
– Test Materials Reproduction
• Use of Tests
– Safeguarding of Test Materials
– Promotion of Client Welfare

CODE OF PROFESSIONAL ETHICS


• Public Concerns
– Public misunderstanding concerning tests
• fear, anger, legislation, litigation, and administrative regulations
• Legislation
• Minimum competency testing programs
• Truth in testing legislation
• Litigation
• psychologist as expert witness
• Concerns of the Profession
– Investigation on malpractice
– Test-User Qualification
• Level A, B, C
• Testing for people with disabilities
– Adaptive testing
• Computerized test administration, scoring, and interpretation
– Access to test administration, scoring, and interpretation software
– Comparability of pencil-and-paper and computerized versions of tests
– The value of computerized test interpretations
– Unprofessional, unregulated “psychological testing” online
• Rights of Test Takers
RIGHT OF INFORMED CONSENT
– Why & How
– What information will be released?
– Disclosure of information in an understandable manner
THE RIGHT TO BE INFORMED OF TEST FINDINGS
– Right to be informed, in language they can understand, of the nature of the findings
– Recommendations
– If test results are voided
THE RIGHT TO PRIVACY AND CONFIDENTIALITY
– Recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the
extent to which he wishes to share or withhold from others his attitudes, beliefs, behavior, and opinions
THE RIGHT TO THE LEAST STIGMATIZING LABEL

CODE OF ETHICS FOR PHILIPPINE PSYCHOLOGISTS


Psychological Association of the Philippines
• III.J. Informed Consent
1. When conducting research or providing assessment, therapy, counseling, or consulting services in person or via
electronic transmission or other forms of communication, we shall obtain the informed consent of the individual or
individuals using language that is reasonably understandable to that person or persons, except when conducting such
activities without consent is mandated by law or governmental regulation or as otherwise provided in this Ethics Code.
2. For persons who are legally incapable of giving informed consent, we shall nevertheless (a) provide an appropriate
explanation, (b) seek the individual’s assent, (c) consider such persons’ preferences and best interests, and (d) obtain
appropriate permission from a legally authorized person, if such substitute consent is permitted or required by law. When
consent by a legally authorized person is not permitted or required by law, we shall take reasonable steps to protect the
individual’s rights and welfare.
3. When psychological services are court ordered or otherwise mandated, we shall inform the individual of the nature of the
anticipated services, including whether the services are court ordered or mandated and any limits of confidentiality, before
proceeding.
4. We shall appropriately document written or oral consent, permission, and assent.
• IV. CONFIDENTIALITY
A. Maintaining Confidentiality
It is our duty to safeguard any information divulged by our clients, regardless of the medium where it was stored. It is also
our duty to make sure that this information is secured and is not placed in areas, spaces or computers easily accessible to other
unqualified persons.
B. Limitations of Confidentiality
1. It is our duty to discuss the limitations of confidentiality to our clients, may it be due to regulated laws, institutional rules, or
professional or scientific relationship. In cases where the client is a minor or is legally incapable of giving informed
consent, the primary guardian or legal representative should be informed about the limitations of confidentiality.
2. Before the actual interview, session, or any other related psychological activities, we explain explicitly to the client all
anticipated uses of the information they will disclose.
3. We may release information to appropriate individuals or authorities only after careful deliberation or when there is
imminent danger to the individual and community. In court cases, information should be limited only to those pertinent to
the legitimate request of the court.
4. If the psychological services, products, or information is coursed through an electronic transmission, it is our duty to
inform the clients of risks to privacy.
ETHICAL STANDARDS AND PROCEDURES IN SPECIFIC FUNCTIONS
• VII. ASSESSMENT
A. Bases for Assessment
1. The expert opinions that we provide through our recommendations, reports, and diagnostic or evaluative statements are
based on substantial information and appropriate assessment techniques.
2. We provide expert opinions regarding the psychological characteristics of a person only after employing adequate
assessment procedures and examination to support our conclusions and recommendations.
3. In instances where we are asked to provide opinions about an individual without conducting an examination, on the basis of a
review of existing test results and reports, we discuss the limitations of our opinions.
• VII. ASSESSMENT
B. Informed Consent in Assessment
1. We gather informed consent prior to the assessment of our clients except for the following instances:
A. when it is mandated by the law
B. when it is implied such as in routine educational, institutional and organizational activity
C. when the purpose of the assessment is to determine the individual’s decisional capacity.
2. We educate our clients about the nature of our services, financial arrangements, potential risks, and limits of confidentiality. In
instances where our clients are not competent to provide informed consent on assessment, we discuss these matters with
immediate family members or legal guardians. (See also III-J, Informed Consent in Human Relations)
3. In instances where a third party interpreter is needed, the confidentiality of test results and the security of the tests must be
ensured. The limitations of the obtained data are discussed in our results, conclusions, and recommendations.
• VII. ASSESSMENT
C. Assessment Tools
1. We judiciously select and administer only those tests which are pertinent to the reasons for referral and purpose of the
assessment.
2. We use data collection, methods and procedures that are consistent with current scientific and professional developments.
3. We use tests that are standardized, valid, reliable, and have normative data directly referable to the population of our clients.
4. We administer assessment tools that are appropriate to the language, competence and other relevant characteristics of our
client.
• VII. ASSESSMENT
D. Obsolete and Outdated Test Results
1. We do not base our interpretations, conclusions, and recommendations on outdated test results.
2. We do not provide interpretations, conclusions, and recommendations on the basis of obsolete tests.
E. Interpreting Assessment Results
1. In fairness to our clients, under no circumstances should we report the test results without taking into consideration the
validity, reliability, and appropriateness of the test. We should therefore indicate our reservations regarding the interpretations.
2. We interpret assessment results while considering the purpose of the assessment and other factors such as the client’s test
taking abilities, characteristics, situational, personal, and cultural differences.
• VII. ASSESSMENT
F. Release of Test Data
1. It is our responsibility to ensure that test results and interpretations are not used by persons other than those explicitly agreed
upon by the referral sources prior to the assessment procedure.
2. We do not release test data in the forms of raw and scaled scores, client’s responses to test questions or stimuli, and notes
regarding the client’s statements and behaviors during the examination unless regulated by the court.
• VII. ASSESSMENT
G. Explaining Assessment Results
1. We release test results only to the sources of referral, and with written permission from the client if it is a self-referral.
2. Where test results have to be communicated to relatives, parents, or teachers, we explain them in non-technical
language.
3. We explain findings and test results to our clients or designated representatives except when the relationship precludes the
provision of an explanation of results and this is explained in advance to the client.
4. When test results need to be shared with schools, social agencies, the courts, or industry, we supervise such releases.
• VII. ASSESSMENT
H. Test Security
• All test materials (manuals, keys, answer sheets, reusable booklets, etc.) shall be administered and handled
only by qualified users or personnel.
I. Assessment by Unqualified Persons
1. We do not promote the use of assessment tools and methods by unqualified persons except for training purposes with
adequate supervision.
2. We ensure that test protocols, their interpretations and all other records are kept secured from unqualified persons.
• VII. ASSESSMENT
J. Test Construction
• We develop tests and other assessment tools using current scientific findings and knowledge, appropriate psychometric
properties, and validation and standardization procedures.
ASSUMPTIONS & THEORIES
BASIC ASSUMPTIONS ABOUT PSYCHOLOGICAL TESTING & ASSESSMENT
Assumption 1: Psychological Traits & States Exist
TRAIT – Distinguishable, relatively enduring way in which an individual differs from another
STATE – Distinguishes a person from another but is less enduring
 Trait and the magnitude of the trait
 Based on observing a sample of behavior
 Direct observation
 Analysis of self-report statements
 Pencil-or-paper tests
 Some people deny that traits exist, and there is controversy over how they exist
 Construct – informed, scientific concept developed or constructed to describe or explain behavior
Assumption 2: Psychological Traits & States Can Be Quantified and Measured
 Define the trait
 Develop items that provide insight into the trait
 Develop appropriate test items and appropriate way to score the items
 Cumulative scoring
Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
 tasks in some tests mimic the actual behaviors that the test user is attempting to understand
 tests yield only a sample of the behavior that can be expected to be emitted under nontest conditions
 Predict or postdict
Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
 how a test was developed
 the circumstances under which it is appropriate to administer the test
 how the test should be administered
 to whom
 how the test results should be interpreted
 limitations
Assumption 5: Various Sources of Error Are Part of the Assessment Process
 Error – mistake, miscalculation, etc.
 In assessment: more than what is expected
 Error Variance – the component of a test score attributable to sources other than the trait or ability measured
Assumption 6: Testing and Assessment Can Be Conducted in a Fair & Unbiased Manner
Assumption 7: Testing and Assessment Benefit Society

CLASSICAL TEST THEORY


Domain Sampling Theory
 Assumes that the items selected for any one test are just a sample of items from an infinite domain of potential items
 Domain sampling is the most common CTT model used for practical purposes
 A central concern of CTT is coping with random error in the raw score
 The less random error in the measure, the more the raw score reflects the true score
ASSUMPTIONS:
 The raw score is made up of the true score plus random error ; the average of raw scores is the best estimate
of the true score
 Random errors around the true score would be normally distributed; the expected value of error = 0
 Random errors are uncorrelated
 Random errors are uncorrelated to the T
 Theory of true & error scores
 The standard deviation of the distribution of random errors around the true score is called the standard error of
measurement
 The lower the SEM, the more tightly packed around the true score the random errors will be
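In symbols (a standard CTT formulation, not unique to this review), the observed raw score X decomposes as

    X = T + E

where T is the true score and E is random error, and the SEM is estimated from a test’s standard deviation and its reliability coefficient r:

    SEM = SD × √(1 − r)

For example, a test with SD = 15 and r = .89 has SEM ≈ 15 × √.11 ≈ 5 points.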
RAMIFICATIONS OF THE CTT
 SEM is consistent across an entire population
 As the test becomes longer, it becomes increasingly reliable
 The important statistics about test items (e.g., difficulty) depend on the sample of respondents being representative of the
population
 Scores in the population are assumed to be
A) measured at the interval level
B) normally distributed
When these assumptions are not met, test developers convert scores, combine scales, and so on to ensure they are met

MODERN TEST THEORY


Item Response Theory (IRT)
 Developed to address the shortcomings and limitations of the CTT
 Lord and Novick (1968)

CTT VS. IRT


CTT
 Focus is on the single score that one obtains on the test.
 All items are treated as though they are parallel
IRT
 Focus is on the pattern of responses the respondent makes to the set of items
 It does not assume that all items are parallel

ITEM RESPONSE THEORY


 Fundamental assumption: there is a linkage between a response to any item on a test and the characteristic being assessed
by the test.
 Characteristic is a latent trait Θ
 Linkage: the probability of a positive response to any single item on a test is a function of the individual’s Θ level
 Critical feature that is analyzed in IRT is the entire pattern of item responses to all test items by an individual.
 IRT: Pattern of Item Responses; CTT: Raw Scores of the Test
• IRT: Assessment of Measurement Error at any level of Θ; CTT: Measurement Error is the same at every level of the
test score
IRT MODELS
One Parameter Logistic Model (1PL)
 Most basic model in IRT; Rasch Model (1960)
 Item parameter that is estimated in 1PL is the difficulty parameter: b
 b is scaled on a normal distribution with a M=0.00 and σ=1.0
 NB: Θ is also on the same scale
 ∴ items with higher b values are more difficult insofar as the respondent must have a higher level of Θ to pass or to endorse
the item than for items with lower b values
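In the Rasch/1PL form (standard in the IRT literature, not specific to this review), the probability of passing or endorsing an item is

    P(X = 1 | Θ) = e^(Θ − b) / (1 + e^(Θ − b))

so when Θ = b the probability is exactly .50, which is one way to read the difficulty parameter.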
Two Parameter Logistic Model (2PL)
 Not all items are assumed to have equal slopes
 Slope is denoted as a (Usual values range from 0.5 to 2.5)
• Items with higher a values have steeper slopes and discriminate more in the middle of the Θ level range
• Items with lower a values have flatter slopes and discriminate more at the ends of the Θ level range
• If we know the Θ level, the a value, and b value parameters for an item, we can calculate the probability that the person
will pass the item.
Three Parameter Logistic Model (3PL)
 Most complex of the dichotomous response models and adds a c parameter
 c value is a guessing parameter
 Useful when items are constructed so that guessing the correct answer is possible, even at very low Θ levels
 Used for multiple-choice and true-false items; helpful in identifying response styles such as social desirability or faking good
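To make the dichotomous models concrete, here is a minimal Python sketch of the item characteristic curve under the 3PL model (the parameter values are hypothetical; setting a = 1 and c = 0 reduces it to the 1PL, and c = 0 alone gives the 2PL):

    import math

    def irt_probability(theta, b, a=1.0, c=0.0):
        """P(correct/keyed response) under the 3PL; a=1, c=0 reduces to the 1PL (Rasch)."""
        logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        return c + (1.0 - c) * logistic

    # An average-ability respondent (theta = 0) on a hard item (b = 1) that is
    # guessable (c = .25, as on a four-option multiple-choice item):
    print(round(irt_probability(0.0, b=1.0, a=1.2, c=0.25), 3))  # 0.424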

MULTIPLE RESPONSE IRT MODELS


Polytomous Models
Considerations:
a) nominal response categories
b) have graded responses
c) allow for partial credit to be given to a multipart question
Nominal Response Categories
 Traditional scoring assumes that all incorrect responses are equally incorrect
 IRT analysis answers the question “Are individuals with different ability levels more or less likely to select a particular incorrect
response?”
 Allows a fine-grained analysis of distractor alternatives
 Useful in determining group differences in response patterns
Graded Response Categories
 Likert-type responses
 Assumption: responses lie on an ordered, but categorical, level
 Item responses do not have to be the same across items; Muraki restriction: all items in a scale must be answered using the
same number of response categories
 Higher values should be associated with higher ability or attitude
Partial Credit Responses
 Partial Credit Model by Masters (1982)
 More generalized version of Masters’ model proposed by Muraki (1992)
 A more restrictive version of Masters’ model was proposed earlier by Andrich (1978)
 Used in achievement test situations where test takers complete items with multiple parts and are given partial credit for
answering some sections correctly even if the entire item is not answered correctly
 Similar to graded response models: It is assumed that as respondents move higher on the latent trait, the person will be more
likely to correctly respond to all parts of a question
 Similar to nominal models: probability of a particular score, or response alternative, is an exponential divided by the sum of
exponentials

IRT RAMIFICATIONS
 SEM differs at different levels of Θ; shorter tests can be more reliable than longer ones
 Different forms are best for respondents of different Θ levels

IRT (as opposed to CTT):


 Response formats that differ can be combined w/o difficulty
 When initial scores are different for different respondents, the change scores can be meaningfully interpreted
 Item’s stem or stimulus features can be assessed for their impact using IRT

Limitations of Modern Test Theory


 Large sample size needed
 Lack of user-friendly computer programs
 Complex models are not yet easily analyzed
 Our interest is not just in how each item functions but also in how the items function as a unit (the total score)

CTT and IRT


 Both necessary and neither, by itself, sufficient for complete psychometric assessment

NORMS
• Measurement
– the act of assigning numbers or symbols to characteristics of things (people, events, whatever) according to rules
– The rules used in assigning numbers are guidelines for representing the magnitude (or some other characteristic) of
the object being measured
• Scale
– A scale is a set of numbers (or other symbols) whose properties model empirical properties of the objects to which
the numbers are assigned.
• Types of Variables
– Discrete Variables
• consist of indivisible categories
– Continuous Variable
• infinitely divisible into whatever units a researcher may choose

LEVELS OR SCALES OF MEASUREMENT


Nominal Scale
• involve classification or categorization based on one or more distinguishing characteristics, where all things measured
must be placed into mutually exclusive and exhaustive categories
• Numbers are used as labels (no numerical prop)
• Nominal scales are data that record categories.
• Unordered set: nominal scales represent a rather low level of measurement
• e.g. Gender, Yes or No
• Mode, Chi Square
Ordinal Scale
• Classification and rank ordering on some characteristic is permissible with ordinal scales
• Ordinal scales record information about the rank order of scores.
• e.g. First, Second, 1, 2, Test Scores
• Median, Percentile
Interval Scale
• contain equal intervals between numbers
• Interval scales tell us about the order of data points, and the size of the intervals in between data points.
• mean, standard deviation, correlation, regression, analysis of variance
Ratio Scale
• A ratio scale is an interval scale with a true zero point.
• Scale of measurement of data which permits the comparison of differences of values
• All statistics permitted for interval scales plus the following: geometric mean, harmonic mean, coefficient of
variation, logarithms

Scales of measurement in psychology


• Ordinal level of measurement is most frequently used in psychology

NORMS
• Transformation of raw scores into a more meaningful scale derived from the performance of a large sample of persons
representative of one or more specified groups.
• Norms of a test are based on the distribution of scores obtained by some defined sample of individuals.
• Developed through SAMPLING
– Deriving a representative group of the population – a sample

SAMPLING
• Probability Sampling
– Each element has a known probability of being sampled.
• Nonprobability Sampling
– Each element has an unknown probability of being sampled.

PROBABILITY SAMPLING METHODS


• Simple Random Sampling
• Systematic Sampling
• Stratified Random Sampling
– Disproportionate Stratified Random Sampling
– Proportionate Stratified Random Sampling
• Cluster Sampling
• Combined-strategy Sampling

NONPROBABILITY SAMPLING METHODS


• QUOTA SAMPLING
• CONVENIENCE SAMPLING
• THEORETICAL OR PURPOSIVE SAMPLING

Sample Size vs. Sample Representativeness


• How large a sample do we need?
– Alpha level
– Effect size
– Power
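As a rough illustration of how alpha, effect size, and power jointly determine the required n, here is a sketch using statsmodels (assumed installed; the medium effect size d = 0.5 is an arbitrary example value):

    from statsmodels.stats.power import TTestIndPower

    # Per-group n needed to detect a medium effect (d = 0.5) in an
    # independent-samples t test at alpha = .05 with 80% power
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(round(n_per_group))  # ~64 participants per group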

TYPES OF NORMS
• Developmental Norms
– Indicates how far along the normal developmental path an individual has progressed
• Age Norms
• Age-equivalent scores
• Indicate the average performance of different samples of testtakers who were at various ages at
the time the test was administered
• Grade Norms
• Designed to indicate the average test performance of testtakers in a given school grade
• Mental Age
• A child’s score on a test corresponds to the highest year level or age that he can complete
• Ordinal Scale
• Identify the stage reached by the child in the development of a specific behavior function

• Within Group Norms


– The individual’s performance is evaluated in terms of the performance of the most nearly comparable standardization
group
• Percentiles
• A percentile is an expression of the percentage of people whose score on a test or measure falls
below a particular raw score
• A distribution is divided into 100 equal parts - %
• National Norms
• Derived from a normative sample that was nationally representative of the population at the time
the norming study was conducted
• Standard Scores
• Derived scores which use as their unit the SD of the population upon which the test was
standardized
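A minimal sketch of how within-group norms work in practice, assuming numpy and a fabricated normative sample (all values are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    norm_sample = rng.normal(loc=50, scale=10, size=1000)  # illustrative norm group
    raw = 62.0                                             # an examinee's raw score

    # Percentile: percentage of the normative sample scoring at or below the raw score
    percentile = (norm_sample <= raw).mean() * 100

    # Standard score: deviation from the norm-group mean in SD units,
    # rescaled to the familiar M = 100, SD = 15 metric
    z = (raw - norm_sample.mean()) / norm_sample.std()
    print(round(percentile, 1), round(100 + 15 * z))  # roughly the 88th percentile, SS ≈ 118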
NORM-REFERENCED VS. CRITERION-REFERENCED TESTING
• NORM-REFERENCED TESTING
– A score is interpreted by comparing it with the scores obtained by others on the same test
• CRITERION-REFERENCED
– Proposed by Glaser in 1963, this uses a specified content domain rather than a specified population of persons

PSYCHOMETRIC PROPERTIES
RELIABILITY & VALIDITY
RELIABILITY
• Basic premise
– Presence of error
• Find the magnitude of error and develop ways to minimize them
– Tests that are relatively free from measurement error are deemed to be reliable.
• The extent to which measurements are consistent or repeatable; also, it is the extent to which measurements differ from
occasion to occasion as a function of measurement error
• It is concerned with that portion of measurement that is due to permanent effects that persist from sample to sample.
• Conceptualization of error
– Errors exist because we only obtain a sample of behavior
• Charles Spearman pioneered reliability assessment
– De Moivre
– Pearson
– Kuder and Richardson
– Cronbach

Basics of test score theory


• CTT:
– Measuring instruments are imperfect, observed score is almost always different from the true ability/characteristic
– Difference is from measurement error
– Errors of measurement are random
– True score will not change w/ repeated application of the same test
• Because of random error, repeated application produces different results
• Domain sampling model
– Problem created by using a limited number of items to represent a larger, more complicated construct
– The task in reliability analysis is to estimate how much error is made by using a score from the shorter test as an
estimate of the true ability
– Reliability is the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score
– Reliability can be estimated by correlating the observed score with the true score
– T is not available, so we estimate what it would be
– By sampling many tests from the same domain, we can get a normal distribution of unbiased estimates of the T
– To estimate reliability, we create many randomly parallel tests
• Item response theory
– IRT focuses on an item difficulty to assess the ability

RELIABILITY COEFFICIENT
– is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total
variance
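Expressed as a formula (standard CTT notation, with σ²_T the true-score variance, σ²_E the error variance, and σ²_X the total observed variance):

    r_xx = σ²_T / σ²_X = 1 − (σ²_E / σ²_X)

so a reliability of .90 means that 90% of the observed-score variance reflects true-score differences.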
Sources of error
• Test Construction
– Item sampling; content sampling
• Test Administration
• Test Scoring and Interpretation

Reliability estimates
• Test-Retest Reliability Estimate
– Test-retest method
– an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations
of the same test
– When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to
as the coefficient of stability
• Parallel-Forms and Alternate-Forms Reliability Estimates
– coefficient of equivalence
– Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are
equal
– Alternate forms are simply different versions of a test that have been constructed so as to be parallel
• Split-Half Reliability Estimates
– obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once
– Step 1. Divide the test into equivalent halves.
– Step 2. Calculate a Pearson r between scores on the two halves of the test.
– Step 3. Adjust the half-test reliability using the Spearman-Brown formula
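The Spearman-Brown adjustment in Step 3 is conventionally written

    r_SB = (2 × r_hh) / (1 + r_hh)

where r_hh is the correlation between the two half-tests; a half-test correlation of .70, for example, adjusts to 2(.70)/1.70 ≈ .82 for the full-length test. The general form for a test lengthened by a factor of n is r_new = (n × r_old) / (1 + (n − 1) × r_old).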
• Other methods of determining internal consistency
– Inter-item consistency refers to the degree of correlation among all the items on a scale
• A measure of inter-item consistency is calculated from a single administration of a single form of a test
• Homogeneous Test – measures a single trait
• The more homogeneous the test is, the better the internal consistency
– Kuder-Richardson formulas
• Where test items are highly homogeneous, KR20 and split-half reliability estimates will be similar.
• However, KR20 is the statistic of choice for determining the inter-item consistency of dichotomous items,
primarily those items that can be scored right or wrong (such as multiple-choice items).
• If test items are more heterogeneous, KR20 will yield lower reliability estimates than the split-half method.
• KR20 = (k / (k − 1)) × (1 − Σpq / σ²)
• where KR20 stands for the Kuder-Richardson formula 20 reliability coefficient, k is the number of test items,
σ² is the variance of total test scores, p is the proportion of testtakers who pass the item, q is the proportion
of people who fail the item, and Σpq is the sum of the pq products over all items
– KR21
• Reliability formula that does not require the calculation of p’s and q’s for every item
• It uses an approximation of the sum of the pq products: the mean test score
• Assumptions: all items are of equal difficulty, or the average difficulty is 50%
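Under that equal-difficulty assumption, KR21 is commonly written as

    KR21 = (k / (k − 1)) × [1 − M(k − M) / (k × σ²)]

where M is the mean total test score, k the number of items, and σ² the variance of total test scores.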
– Coefficient Alpha
• May be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown
formula
• Nondichotomous items
• α = (k / (k − 1)) × (1 − Σσᵢ² / σ²)
• where α is coefficient alpha, k is the number of items, σᵢ² is the variance of one item, Σσᵢ² is the sum of the
variances of each item, and σ² is the variance of the total test scores
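A minimal computational sketch of coefficient alpha from a persons-by-items score matrix (numpy assumed; the 5 × 4 data matrix is a made-up example):

    import numpy as np

    # Rows = testtakers, columns = items (illustrative Likert-type responses)
    scores = np.array([
        [4, 5, 4, 4],
        [2, 3, 3, 2],
        [5, 5, 4, 5],
        [3, 3, 2, 3],
        [4, 4, 5, 4],
    ])

    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores

    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
    print(round(alpha, 3))  # 0.930 for this made-up matrix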

How to increase reliability?


• Increase the number of items or observations
• Eliminate items that are unclear
• Standardize the conditions under which the test is taken
• Moderate the degree of difficulty of the tests
• Minimize the effects of external events
• Standardize instructions
• Maintain consistent scoring procedures

RELIABILITY ESTIMATES

• Test-Retest
– Measure of: stability
– How: administer the same test/measure at two different times to the same group of participants
– Statistic: correlation (Pearson r or Spearman’s rho)
• Parallel or Alternate Forms
– Measure of: equivalence
– How: administer two different forms of the test to the same group of participants
– Statistic: correlation (Pearson r or Spearman’s rho)
• Inter-Rater
– Measure of: agreement
– How: have two (or more) raters rate behaviors, then determine the amount of agreement between them
– Statistic: percentage of agreement; kappa coefficient (κ)
• Internal Consistency
– Measure of: how consistently each item measures the same underlying construct
– How: correlate performance on each item with overall performance across participants
– Statistic: Cronbach’s alpha (α); Kuder-Richardson 20 (KR20); ordinal/composite coefficients

Alpha as an index
• Usually, an internal consistency value of .70 is deemed appropriate.
• However, a newly developed test should not, as much as possible, obtain a very high internal consistency of .90 and above
• Some scales achieving a modest reliability of .60 to .69 are accepted (in values research)
• What happens if I do not achieve any alpha values within this range?
– terrible scale
• Considerations in the use and purpose of reliability coefficients

• Nature of the Test


– Homogeneity versus heterogeneity of test items
– Dynamic versus static characteristics
– Speed Test versus Power Test

VALIDITY
• The agreement between a test score or measure and the quality it is believed to measure.
• As applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
– More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test
scores.
• Validation: the process of gathering and evaluating evidence about validity
– validation studies (e.g., local validation studies)

Logic of validity analysis


• A valid test is one that
– Predicts future performance on appropriate variables
– Measures an appropriate domain
– Measures appropriate characteristics of test takers
• Validity is determined by the relationship between test scores and some other variable referred to as the validation measure
– Theory behind validity analysis
CTT
• Validity analysis focuses on the variables producing the true score differences
• Analysis is used to determine the extent to which true score is determined by characteristics relevant to the
purpose of the test
• Theoretically, the True Score has 2 components:
– Stable characteristics of the individual relevant to the purpose of the test
– Stable characteristics of the individual irrelevant to the purpose of the test
• If tests were perfect instruments, the only stable characteristics measured would be those that the test was
designed to measure
• A valid test is where the test scores tell us more of the stable differences among people in relevant
characteristics than about stable differences in irrelevant characteristics
• The goal of validity analysis is to identify and measure these different sources of variance

VALIDITY: TRINITARIAN MODEL


• Three approaches to assessing validity
• Scrutinizing the test’s content
• Relating scores obtained on the test to other test scores or other measures
• Executing a comprehensive analysis of
– how scores on the test relate to other test scores and measures
– how scores on the test can be understood within some theoretical framework for understanding the
construct that the test was designed to measure

• FACE VALIDITY
– Based on face value: the test appears to measure what it purports to measure
• Not a true measure of validity
• No evidence
• CONTENT VALIDITY
– Extent to which a test assesses all the important aspects of a phenomenon that it purports to measure
• Two concepts:
– Construct under-representation
• Failure to capture important components of the construct
– Construct-irrelevant variance
• Scores are influenced by factors irrelevant to the construct
• CRITERION VALIDITY
• How a test corresponds to a particular criterion (the predictor is the test score; the criterion is the standard it is
checked against)
• CONCURRENT VALIDITY
• extent to which a test yields the same results as other, established measures of the same behavior, thoughts, or
feelings
– PREDICTIVE VALIDITY
• good at predicting how a person will think, act, or feel in the future
• CONSTRUCT VALIDITY
– extent to which a test measures what it is supposed to measure and not something else altogether

Validity coefficient
• Relationship between a test and a criterion
• Tells the extent to which the test is valid for making statements about the criterion
• >.60 : rare; .30 to .40 are usually considered high
• Statistical significance: less than 5 in 100 chances
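One worked reading of these conventions: squaring a validity coefficient gives the proportion of criterion variance the test accounts for, so even a “high” coefficient of .40 explains only .40² = .16, or 16%, of the variance in the criterion.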

Evaluating validity coefficients


• Look for changes in the cause of the relationship
• What does the criterion mean?
– The criterion should be valid and reliable
• Review the subject population in the validation study
• Is the sample size adequate?
• Do not confuse the criterion with the predictor
• Is there variability in the criterion and the predictor?
• Is there evidence for validity generalization?
• Consider differential prediction

Construct validity
• Construct: something built by mental synthesis
• Involves assembling evidence about what a test means
– Show relationship between test and other measures
• Convergent Evidence
– Correlation between two tests believed to measure the same construct
• Discriminant Evidence
– Divergent validation
– The test measures something unique
– Low correlations with unrelated constructs

RELATIONSHIP BETWEEN VALIDITY & RELIABILITY


• Reliability: ability to produce consistent scores that measure stable characteristics
• Validity: which stable characteristics the test scores measure
• It is theoretically possible to develop a reliable test that is not valid. If a test is not reliable, its potential validity is
limited.

UTILITY
• The usefulness or practical value of testing to improve efficiency
• the usefulness or practical value of a training program or intervention
• Factors That Affect a Test’s Utility
• Psychometric Soundness
– Reliability and Validity
• Cost
– economic
– financial
– budget-related
• Benefits
– benefits of testing justify the costs of administering, scoring, and interpreting the test
• Utility Analysis
– a family of techniques that entail a cost–benefit analysis designed to yield information relevant to a
decision about the usefulness and/or practical value of a tool of assessment

WORKING WITH SCORES


• Accurate measurement
– Norming
• allows the examiner to establish what “average” is for an individual of the examinee’s age
– Considerations on gender, race, etc.
– The goal: the best “apples to apples, oranges to oranges” comparisons
– Establishing a range of normalcy around which to compare the performance on a test

Norm-referenced testing
• Depicting an individual’s score relative to sample
• Raw scores are meaningless
• We present data in a way that can reflect the performance relative to the normative sample
– Standard Scores
– Scaled Scores
– T Scores
– Percentile Scores
• Transformation of scores
– Simplest form is percentage
– E.g. percentage of correct responses in a test
• Characteristics of transformation
– It does not change the score
– It takes into account information not contained in the raw score itself
– It presents the score in units that are more informative and interpretable than the raw score
– transformations
• Linear Transformation
– Constant value
– Transformed score = constant + (weight × raw score); e.g., converting Fahrenheit to Celsius (see the
sketch after this list)
• Z-Scores
• T Scores
• Scaled Scores
• Area Transformations
– Uses the area under the normal curve to create scale score values
• Percentile Rank
• Stanine
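A minimal sketch of the linear transformations above (pure Python; the norm-group mean and SD are illustrative values):

    def linear_transforms(raw, norm_mean, norm_sd):
        """Re-express a raw score as z, T, standard, and scaled scores."""
        z = (raw - norm_mean) / norm_sd
        return {
            "z": z,                    # M = 0,   SD = 1
            "T": 50 + 10 * z,          # M = 50,  SD = 10
            "standard": 100 + 15 * z,  # M = 100, SD = 15
            "scaled": 10 + 3 * z,      # M = 10,  SD = 3
        }

    # An examinee scoring 1 SD above an illustrative norm mean of 25 (SD = 5):
    print(linear_transforms(30, norm_mean=25, norm_sd=5))
    # {'z': 1.0, 'T': 60.0, 'standard': 115.0, 'scaled': 13.0}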

Standard scores
• Traditional method of descriptive reporting usually found in intellectual assessment
• How far the individual’s score deviates from the mean for the relevant normative sample
• A standard score of 100 represents precisely average performance: 50% of the normative sample scored higher and 50%
scored lower
• Generally accepted normal range of scores would fall between 90 and 110 (normal limits)
– 80-89: low average
– 70-79: borderline deficiency
– <70: deficient performance
• only 2% fall below this standard score, hence it is generally accepted as the cutoff point for the identification
of deficient performance
– 110-120: high average
– >120: superior performance
• only about 2% of the population achieves a score above 130
• Other standard scores
• T Scores
• IQ Format Scores
– Deviation IQ
– M=100, SD=15 (Some have SD=16)
• Z-scores
– M=0, SD=1
• Stanine (“standard nine”)
– M=5, SD=2
– Scaled scores
• Scaled scores
– Presents a different descriptive framework
– A scaled score of 10 represents average performance
• Normal range is 8-12
• 6-7: low average
• 4-5: borderline deficient
• <4: deficient
• T scores
– M=50, SD=10
– 50 is average
– 40-60: normal range
– Below 30 (2nd %ile) or above 70 (98th %ile) represent unusually low or high performance
• Common among personality tests
• Converting z-scores to T scores: T = 50 + (10 × z)
• Percentile score
– Distribution is divided into 100 equal parts
– Provides an index on where one’s score stands relative to all others on a scale of 1 to 100
• 50th %ile is average performance
• 25th to 75th percentile represents normal range
• Below 2nd and above the 98th percentile are very unusual
– If an individual scores at the 15th percentile, that means 15% of the standardization sample received scores at or below
the score the subject attained
– Advantage: easy to understand, easy to calculate
• Ordinal Scale
– Differences between percentile units are not equivalent
• Reason for selecting a method
• No particular reason
• Scoring systems are interchangeable
– T of 50 = Standard Score of 100 = Scaled Score of 10
• All are equivalent to a percentile score of 50
• Considerations are for statistical analysis
– Percentiles cannot be used in most statistical analyses because they are ordinal

Criterion-referenced testing
• Setting standards and cutoffs
• Test scores are compared to external standards (i.e. criterion)
– Performance is assessed on how it relates to standards
– E.g. Grading Systems (A+, A-, B+, …, F; 1, 2, 3, 5; 4, 3, 2, 1)
• Advantage: it is easy to compare scores to a standard
• Problem: technical problems in developing these tests
– Basis for setting the standards, cutoffs
