TOS Outline PsychAssess 2.0

Psychological Assessment

#BLEPP2023
Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Psych Pearls
 Psychometric Properties and Principles
Psychometric Properties essential in Constructing, Selecting, Interpreting tests
Psychological Testing - process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior
- numerical in nature
- individual or by group
- administrators can be interchangeable without affecting the evaluation
- requires technician-like skills in terms of administration and scoring
- yields a test score or series of test scores
- minutes to a few hours
Psychological Assessment - gathering and integration of psychology-related data for the purpose of making a psychological evaluation
- answers the referral question thru the use of different tools of evaluation
- individual
- assessor is the key to the process of selecting tests and/or other tools of evaluation
- requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data
- entails logical problem-solving that brings to bear many sources of data to answer the referral question
- Educational: evaluate abilities and skills relevant in a school context
- Retrospective: draw conclusions about psychological aspects of a person as they existed at some point in time prior to the assessment
- Remote: subject is not in physical proximity to the person conducting the evaluation
- Ecological Momentary: “in the moment” evaluation of specific problems and related cognitive and behavioral variables at the very time and place that they occur
- Collaborative: the assessor and assessee may work as “partners” from initial contact through final feedback
- Therapeutic: therapeutic self-discovery and new understandings are encouraged
- Dynamic: describes an interactive approach to psychological assessment that usually follows the model: evaluation > intervention > evaluation
o Psychological Test – device or procedure designed to measure variables related to psychology
 Content: subject matter
 Format: form, plan, structure, arrangement, layout
 Item: a specific stimulus to which a person responds overtly, and this response is being scored or evaluated
 Administration Procedures: one-to-one basis or group administration
 Score: code or summary statement, usually but not necessarily numerical in nature, that reflects an evaluation of performance on a test
 Scoring: the process of assigning scores to performances
 Cut-Score: reference point derived by judgment and used to divide a set of data into two or more classifications
 Psychometric Soundness: technical quality
 Psychometrics: science of psychological measurement
 Psychometrist or Psychometrician: refers to a professional who uses, analyzes, and interprets psychological test data
Ability or Maximal Performance Test – assess what a person can do
1. Achievement Test – measurement of previous learning
- used to measure general knowledge in a specific period of time
- used to assess mastery
- relies mostly on content validity
- fact-based or conceptual
2. Aptitude – refers to the potential for learning or acquiring a specific skill
- tends to focus on informal learning
- relies mostly on predictive validity
3. Intelligence – refers to a person’s general potential to solve problems, adapt to changing environments, think abstractly, and profit from experience
Human Ability – considerable overlap of achievement, aptitude, and intelligence tests
Typical Performance Test – measure usual or habitual thoughts, feelings, and behavior
- indicates how testtakers think and act on a daily basis
- uses interval scales
- no right and wrong answers
Personality Test – measures individual dispositions and preferences
- designed to identify characteristics
- measured idiographically or nomothetically
1. Structured Personality Tests – provide statements, usually self-report, and require the subject to choose between two or more alternative responses
2. Projective Personality Tests – unstructured, and the stimulus or response are ambiguous
Hi :) this reviewer is FREE! u can share it with others but never sell it okay? let’s help each other <3 -aly
3. Attitude Test – elicit personal beliefs and opinions
4. Interest Inventories – measure likes and dislikes as well as one’s personality orientation towards the world of work
Other Tests:
1. Speed Tests – the interest is in the number of items a test taker can answer correctly in a specific period
2. Power Tests – reflect the level of difficulty of the items the test takers answer correctly
3. Values Inventory
4. Trade Test
5. Neuropsychological Test
6. Norm-Referenced Test
7. Criterion-Referenced Tests
o Interview – method of gathering information through direct communication involving reciprocal exchange
Standardized/Structured – questions are prepared
Non-standardized/Unstructured – pursue relevant ideas in depth
Semi-Standardized/Focused – may probe further on a specific number of questions
Non-Directive – subject is allowed to express his feelings without fear of disapproval
 Mental Status Examination: determines the mental status of the patient
 Intake Interview: determine why the client came for assessment; chance to inform the client about the policies, fees, and process involved
 Social Case: biographical sketch of the client
 Employment Interview: determine whether the candidate is suitable for hiring
 Panel Interview (Board Interview): more than one interviewer participates in the assessment
 Motivational Interview: used by counselors and clinicians to gather information about some problematic behavior, while simultaneously attempting to address it therapeutically
o Portfolio – samples of one’s ability and accomplishment
o Case History Data – refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
 Case Study: a report or illustrative account concerning a person or an event that was compiled on the basis of case history data
 Groupthink: result of the varied forces that drive decision-makers to reach a consensus
o Behavioral Observation – monitoring of the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
 Naturalistic Observation: observe humans in a natural setting
 SORC Model: Stimulus, Organismic Variables, Actual Response, Consequence
o Role Play – defined as acting an improvised or partially improvised part in a simulated situation
 Role Play Test: assessees are directed to act as if they are in a particular situation
o Other tools include computers and physiological devices (biofeedback devices)
Psychological Assessment Process
1. Determining the Referral Question
2. Acquiring Knowledge relating to the content of the problem
3. Data Collection
4. Data Interpretation
o Hit Rate – accurately predicts success or failure
o Profile – narrative description, graph, table, or other representation of the extent to which a person has demonstrated certain targeted characteristics as a result of the administration or application of tools of assessment
o Actuarial Assessment – an approach to evaluation characterized by the application of empirically demonstrated statistical rules as the determining factor in assessors’ judgments and actions
o Mechanical Prediction – application of computer algorithms together with statistical rules and probabilities to generate findings and recommendations
o Extra-Test Behavior – observations made by an examiner regarding what the examinee does and how the examinee reacts during the course of testing that are indirectly related to the test’s specific content but of possible significance to interpretation
Parties in Psychological Assessment
1. Test Authors/Developers – create the tests or other methods of assessment
2. Test Publishers – publish, market, sell, and control the distribution of tests
3. Test Reviewers – prepare evaluative critiques based on the technical and practical aspects of the tests
4. Test Users – use the tests for assessment
5. Test Takers – those who take the tests
6. Test Sponsors – institutions or government bodies who contract test developers for various testing services
7. Society
o Test Battery – selection of tests and assessment procedures typically composed of tests designed to measure different variables but having a common objective
Assumptions about Psychological Testing and Assessment
Assumption 1: Psychological Traits and States Exist
o Trait – any distinguishable, relatively enduring way in which one individual varies from another
- permits people to predict the present from the past
- characteristic patterns of thinking, feeling, and behaving that generalize across similar situations, differ systematically between individuals, and remain rather stable across time
- Psychological Trait – intelligence, specific intellectual abilities, cognitive style, adjustment, interests, attitudes, sexual orientation and preferences, psychopathology, etc.
o States – distinguish one person from another but are relatively less enduring
- characteristic pattern of thinking, feeling, and behaving in a concrete situation at a specific moment in time
- identify those behaviors that can be controlled by manipulating the situation
o Psychological Traits exist as constructs
- Construct: an informed, scientific concept developed or constructed to explain a behavior, inferred from overt behavior
- Overt Behavior: an observable action or the product of an observable action
o A trait is not expected to be manifested in behavior 100% of the time
o Whether a trait manifests itself in observable behavior, and to what degree it manifests, is presumed to depend not only on the strength of the trait in the individual but also on the nature of the situation (situation-dependent)
o The context within which behavior occurs also plays a role in helping us select appropriate trait terms for observed behaviors
o Definitions of trait and state also refer to a way in which one individual varies from another
o Assessors may make comparisons among people who, because of their membership in some group or for any number of other reasons, are decidedly not average
Assumption 2: Psychological Traits and States can be Quantified and Measured
o Once the trait, state, or other construct has been defined to be measured, a test developer considers the types of item content that would provide insight into it, to gauge the strength of that trait
o Measuring traits and states by means of a test entails developing not only appropriate test items but also appropriate ways to score the test and interpret the results
o Cumulative Scoring – assumption that the more the testtaker responds in a particular direction keyed by the test manual as correct or consistent with a particular trait, the higher that testtaker is presumed to be on the targeted ability or trait
Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
o The tasks in some tests mimic the actual behaviors that the test user is attempting to understand
o Such tests only yield a sample of the behavior that can be expected to be emitted under nontest conditions
Assumption 4: Tests and Other Measurement Techniques have Strengths and Weaknesses
o Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources
Assumption 5: Various Sources of Error are part of the Assessment Process
o Error – refers to something that is more than expected; it is a component of the measurement process
 Refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
 Error Variance – the component of a test score attributable to sources other than the trait or ability measured
o Potential sources of error variance:
1. Assessors
2. Measuring Instruments
3. Random errors such as luck
o Classical Test Theory – each testtaker has a true score on a test that would be obtained but for the action of measurement error
Assumption 6: Testing and Assessment can be conducted in a Fair and Unbiased Manner
o Despite the best efforts of many professionals, fairness-related questions and problems do occasionally arise
o In all questions about tests with regard to fairness, it is important to keep in mind that tests are tools; they can be used properly or improperly
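The cumulative scoring assumption above lends itself to a tiny illustration. This is a minimal sketch, not any published scoring procedure; the key and the responses are invented:

```python
# Minimal sketch of cumulative scoring: every response that matches the
# keyed (trait-consistent) answer adds one point, so a higher total is
# presumed to reflect more of the targeted trait. Data are invented.

def cumulative_score(responses, key):
    """Count how many responses match the keyed direction."""
    return sum(1 for r, k in zip(responses, key) if r == k)

key = ["T", "F", "T", "T", "F"]          # keyed answers for five items
testtaker_a = ["T", "F", "T", "F", "F"]  # matches the key on 4 items
testtaker_b = ["F", "T", "T", "F", "T"]  # matches the key on 1 item

print(cumulative_score(testtaker_a, key))  # 4
print(cumulative_score(testtaker_b, key))  # 1
```

Under the cumulative scoring assumption, testtaker A is presumed to stand higher on the targeted trait than testtaker B.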
Assumption 7: Testing and Assessment Benefit Society
o Considering the many critical decisions that are based on testing and assessment procedures, we can readily appreciate the need for tests
Reliability
o Reliability – dependability or consistency of the instrument or of the scores obtained by the same person when re-examined with the same test on different occasions, or with different sets of equivalent items
 A test may be reliable in one context but unreliable in another
 Estimates the range of possible random fluctuations that can be expected in an individual’s score
 Free from errors
 More items = higher reliability
 Minimizing error
 Using only a representative sample to obtain an observed score
 True score cannot be found
 Reliability Coefficient: index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance
o Classical Test Theory (True Score Theory) – a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error
 Error: refers to the component of the observed test score that does not have to do with the testtaker’s ability
 Errors of measurement are random
Random Error – source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (e.g., noise, temperature, weather)
Systematic Error – source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
- has a consistent effect on the true score
- SD does not change, the mean does
 Reliability refers to the proportion of total variance attributed to true variance
 When you average all the observed scores obtained over a period of time, the result would be closest to the true score
 The greater the number of items, the higher the reliability
 Factors that contribute to consistency: stable attributes
 Factors that contribute to inconsistency: characteristics of the individual, test, or situation which have nothing to do with the attribute being measured but still affect the scores
o Goals of Reliability:
 Estimate errors
 Devise techniques to improve testing and reduce errors
o Variance – useful in describing sources of test score variability
 True Variance: variance from true differences
 Error Variance: variance from irrelevant random sources
Measurement Error – all of the factors associated with the process of measuring some variable, other than the variable being measured
- difference between the observed score and the true score
- Positive: can increase one’s score
- Negative: can decrease one’s score
- Sources of Error Variance:
a. Item Sampling/Content Sampling: refers to variation among items within a test as well as to variation among items between tests
- the extent to which a testtaker’s score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance
b. Test Administration – testtaker’s motivation or attention, environment, etc.
c. Test Scoring and Interpretation – may employ objective-type items amenable to computer scoring of well-documented reliability
 The greater the proportion of the total variance attributed to true variance, the more reliable the test
 Error variance may increase or decrease a test score by varying amounts, so the consistency of the test score, and thus the reliability, can be affected
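The classical test theory decomposition above (observed score = true score + error, with reliability as the ratio of true variance to total variance) can be checked with a small simulation; the distributions and numbers below are invented purely for illustration:

```python
# Sketch of the classical test theory decomposition X = T + E:
# observed-score variance is (approximately) true variance plus error
# variance, and reliability is the proportion that is true variance.
# The scores are simulated, not from any real test.
import random

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(10000)]  # T
observed = [t + random.gauss(0, 5) for t in true_scores]     # X = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

true_var, total_var = variance(true_scores), variance(observed)
reliability = true_var / total_var
print(round(reliability, 2))  # theoretically 15**2 / (15**2 + 5**2) = 0.90
```

Shrinking the error SD toward zero drives the ratio, and hence the reliability, toward 1.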
Test-Retest Reliability
Error: Time Sampling
- time sampling reliability
- an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
- appropriate when evaluating the reliability of a test that purports to measure an enduring and stable attribute such as a personality trait
- established by comparing the scores obtained from two successive measurements of the same individuals and calculating a correlation between the two sets of scores
- the longer the time that passes, the greater the likelihood that the reliability coefficient will be insignificant
- Carryover Effects: happen when the test-retest interval is short, wherein the second test is influenced by the first because testtakers remember or practiced the previous test = inflated correlation/overestimation of reliability
- Practice Effect: scores on the second session are higher due to the experience from the first session of testing
- test-retest with a longer interval might be affected by other extraneous factors, thus resulting in low correlation
- lower correlation = poor reliability
- Mortality: problems with absences in the second session (just remove the first tests of the absentees)
- Coefficient of Stability
- Statistical Tool: Pearson R or Spearman Rho
Parallel Forms/Alternate Forms Reliability
Error: Item Sampling (immediate), Item Sampling changes over time (delayed)
- established when at least two different versions of the test yield almost the same scores
- has the most universal applicability
- Parallel Forms: for each form of the test, the means and the variances are EQUAL; same items, different positionings/numberings
- Alternate Forms: simply different versions of a test that have been constructed so as to be parallel
- tests should contain the same number of items, and the items should be expressed in the same form and should cover the same type of content; range and difficulty must also be equal
- if there is a test leakage, use the form that is not mostly administered
- Counterbalancing: technique to avoid carryover effects for parallel forms, by using a different sequence for groups
- can be administered on the same day or at a different time
- most rigorous and burdensome, since test developers create two forms of the test
- main problem: difference between the two tests
- test scores may be affected by motivation, fatigue, or intervening events
- means and variances of the observed scores must be equal for the two forms
- Statistical Tool: Pearson R or Spearman Rho
Internal Consistency (Inter-Item Reliability)
Error: Item Sampling; Homogeneity
- used when tests are administered once
- consistency among items within the test
- measures the internal consistency of the test, which is the degree to which each item measures the same construct
- measurement for unstable traits
- if all items measure the same construct, then the test has good internal consistency
- useful in assessing homogeneity
- Homogeneity: if a test contains items that measure a single trait (unifactorial)
- Heterogeneity: degree to which a test measures different factors (more than one factor/trait)
- more homogeneous = higher inter-item consistency
- KR-20: used for inter-item consistency of dichotomous items (intelligence tests, personality tests with yes or no options, multiple choice); unequal variances, dichotomously scored
- KR-21: if all the items have the same degree of difficulty (speed tests); equal variances, dichotomously scored
- Cronbach’s Coefficient Alpha: used when the two halves of the test have unequal variances and on tests containing non-dichotomous items
- Average Proportional Distance: measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
Split-Half Reliability
Error: Item Sampling; Nature of the Split
- obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered ONCE
- useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
- cannot just divide the items in the middle because it might spuriously raise or lower the reliability coefficient, so just randomly assign items, or assign odd-numbered items to one half and even-numbered items to the other half
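A minimal sketch of the split-half procedure, using the odd-even split described above together with the Spearman-Brown step-up covered in these notes; the dichotomous response data are invented:

```python
# Sketch of split-half reliability: correlate odd-item and even-item
# half scores from ONE administration, then correct the half-length
# correlation up to full-test length with the Spearman-Brown formula.
# Item responses are made-up dichotomous (1/0) data.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_brown(r_half, factor=2):
    """Estimated reliability of a test `factor` times the half length."""
    return factor * r_half / (1 + (factor - 1) * r_half)

# rows = testtakers, columns = six dichotomously scored items
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
odd = [sum(row[0::2]) for row in data]   # items 1, 3, 5
even = [sum(row[1::2]) for row in data]  # items 2, 4, 6

r_half = pearson_r(odd, even)
print(spearman_brown(r_half))  # corrected full-length estimate
```

Because the correction assumes the halves are parallel, the stepped-up value is only an estimate; with real data, unequal half variances would point toward coefficient alpha instead.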
- Spearman-Brown Formula: allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test, as if each half had been the length of the whole test and had equal variances
- Spearman-Brown Prophecy Formula: estimates how many more items are needed in order to achieve the target reliability
- multiply the estimate by the original number of items
- Rulon’s Formula: counterpart of the Spearman-Brown formula, which is the ratio of the variance of the differences between the odd and even splits to the variance of the total, combined odd-even score
- if the reliability of the original test is relatively low, then the developer could create new items, clarify test instructions, or simplify the scoring rules
- equal variances, dichotomously scored
- Statistical Tool: Pearson R or Spearman Rho
Inter-Scorer Reliability
Error: Scorer Differences
- the degree of agreement or consistency between two or more scorers with regard to a particular measure
- used for coding nonbehavioral behavior
- observer differences
- Fleiss Kappa: determines the level of agreement between TWO or MORE raters when the method of assessment is measured on a CATEGORICAL SCALE
- Cohen’s Kappa: two raters only
- Krippendorff’s Alpha: two or more raters, based on observed disagreement corrected for disagreement expected by chance
o Tests designed to measure one factor (homogeneous) are expected to have a high degree of internal consistency, and vice versa
o Dynamic – trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
o Static – barely changing or relatively unchanging
o Restriction of Range or Restriction of Variance – if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
o Power Tests – when the time limit is long enough to allow testtakers to attempt all items
o Speed Tests – generally contain items of uniform level of difficulty with a time limit
 Reliability should be based on performance from two independent testing periods using test-retest, alternate-forms, or split-half reliability
o Criterion-Referenced Tests – designed to provide an indication of where a testtaker stands with respect to some variable or criterion
 As individual differences decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance
o Classical Test Theory – everyone has a “true score” on a test
 True Score: genuinely reflects an individual’s ability level as measured by a particular test
 Random Error
o Domain Sampling Theory – estimates the extent to which specific sources of variation under defined conditions are contributing to the test score
 Considers the problem created by using a limited number of items to represent a larger and more complicated construct
 Test reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
 Generalizability Theory: based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation
 Universe: test situation
 Facets: number of items in the test, amount of review, and the purpose of test administration
 According to Generalizability Theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained (Universe Score)
 Decision Study: developers examine the usefulness of test scores in helping the test user make decisions
 Systematic Error
o Item Response Theory – the probability that a person with X ability will be able to perform at a level of Y in a test
 Focus: item difficulty
 Latent-Trait Theory
 a system of assumptions about measurement and the extent to which each item measures the trait
 The computer is used to focus on the range of item difficulty that helps assess an individual’s ability level
 If you got several easy items correct, the computer will then move to more difficult items
 Difficulty: attribute of not being easily accomplished, solved, or comprehended
 Discrimination: degree to which an item differentiates among people with higher or lower levels of the trait, ability, etc.
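The item response theory ideas above (ability, difficulty, discrimination) are often written as a logistic item characteristic curve. A sketch under a two-parameter logistic (2PL) model, with invented parameter values:

```python
# Sketch of a 2PL item characteristic curve: the probability of a
# correct response depends on the person's ability (theta), the item's
# difficulty (b), and its discrimination (a). Values are illustrative.
import math

def p_correct(theta, a, b):
    """2PL model: P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1 / (1 + math.exp(-a * (theta - b)))

easy_item = dict(a=1.0, b=-1.0)   # low difficulty
hard_item = dict(a=1.0, b=2.0)    # high difficulty
sharp_item = dict(a=2.5, b=0.0)   # high discrimination

for theta in (-1.0, 0.0, 1.0):
    print(theta,
          round(p_correct(theta, **easy_item), 2),
          round(p_correct(theta, **hard_item), 2))
# higher ability -> higher probability; harder item -> lower probability
```

A highly discriminating item (large a) separates testtakers just below b from those just above it more sharply, which is what lets an adaptive test "move to more difficult items" efficiently.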
 Dichotomous: can be answered with only one of two alternative responses
 Polytomous: 3 or more alternative responses
o Standard Error of Measurement – provides a
measure of the precision of an observed test score
 Standard deviation of errors as the basic measure
of error
 Index of the amount of inconsistency, or the amount of expected error, in an individual’s score
 Allows us to quantify the extent to which a test provides accurate scores
 Provides an estimate of the amount of error
inherent in an observed score or measurement
 Higher reliability, lower SEM
 Used to estimate or infer the extent to which an
observed score deviates from a true score
 Standard Error of a Score
 Confidence Interval: a range or band of test
scores that is likely to contain true scores
o Standard Error of the Difference – can aid a test
user in determining how large a difference should
be before it is considered statistically significant
o Standard Error of Estimate – refers to the
standard error of the difference between the
predicted and observed values
o Confidence Interval – a range or band of test scores that is likely to contain the true score
 Tells us where the true score is likely to fall within the specified range at a given confidence level
 The larger the range, the higher the confidence
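The link between reliability, SEM, and the confidence band can be sketched with the standard formula SEM = SD * sqrt(1 - reliability); the scale values below are illustrative only:

```python
# Sketch of the standard error of measurement and a confidence band:
# SEM = SD * sqrt(1 - reliability), and a 95% interval around an
# observed score is roughly score +/- 1.96 * SEM. Values are invented.
import math

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

sd, reliability, observed = 15, 0.91, 106   # e.g., an IQ-type scale
error = sem(sd, reliability)                # 15 * sqrt(0.09) = 4.5
low, high = observed - 1.96 * error, observed + 1.96 * error
print(round(error, 1), round(low, 1), round(high, 1))
# higher reliability -> smaller SEM -> narrower band around the score
```

This is why a more reliable test lets us make a tighter statement about where the true score lies.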
o If the reliability is low, you can increase the number
of items or use factor analysis and item analysis to
increase internal consistency
o Reliability Estimates – nature of the test will often
determine the reliability metric
a) Homogenous (unifactor) or heterogeneous
(multifactor)
b) Dynamic (unstable) or static (stable)
c) Range of scores is restricted or not
d) Speed Test or Power Test
e) Criterion or non-Criterion
o Test Sensitivity – detects true positive
o Test Specificity – detects true negative
o Base Rate – proportion of the population that
actually possess the characteristic of interest
o Selection ratio – no. of available positions
compared to the no. of applicants
o Four Possible Hit and Miss Outcomes
1. True Positives (Sensitivity) – predicted success that does occur
2. True Negatives (Specificity) – predicted failure that does occur
3. False Positives (Type 1) – predicted success that does not occur
4. False Negatives (Type 2) – predicted failure, but the person succeeds
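The four outcomes can be tallied into the sensitivity, specificity, and base rate defined above; the counts here are invented:

```python
# Sketch of the four hit-and-miss outcomes from invented counts of 100
# predictions. tp = predicted success that occurs, tn = predicted
# failure that occurs, fp = predicted success that does not occur
# (Type 1), fn = predicted failure where the person succeeds (Type 2).
tp, tn, fp, fn = 40, 45, 10, 5

sensitivity = tp / (tp + fn)                 # hits among actual successes
specificity = tn / (tn + fp)                 # hits among actual failures
base_rate = (tp + fn) / (tp + tn + fp + fn)  # proportion who actually succeed

print(round(sensitivity, 2), round(specificity, 2), base_rate)  # 0.89 0.82 0.45
```

Note that sensitivity and specificity condition on the actual outcome, while the base rate describes the population itself; a test can look impressive on the first two yet mislead when the base rate is very low or very high.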
Validity
o Validity – a judgment or estimate of how well a
test measures what it is supposed to measure
 Evidence about the appropriateness of
inferences drawn from test scores
 Degree to which the measurement procedure
measures the variable it is designed to measure
 Inferences – logical result or deduction
 May diminish as the culture or times change
 Predicts future performance
 Measures appropriate domain
 Measures appropriate characteristics
o Validation – the process of gathering and
evaluating evidence about validity
o Validation Studies – yield insights regarding a
particular population of testtakers as compared to
the norming sample described in a test manual
o Internal Validity – degree of control among
variables in the study (increased through random
assignment)
o External Validity – generalizability of the
research results (increased through random
selection)
o Conceptual Validity – focuses on individual
with their unique histories and behaviors
 Means of evaluating and integrating test data
so that the clinician’s conclusions make
accurate statements about the examinee
o Face Validity – relates more to what a test appears
to measure to the person being tested than to what
the test actually measures
Content Validity
- describes a judgement of how adequately a test
samples behavior representative of the universe of
behavior that the test was designed to sample
- when the proportion of the material covered by the test approximates the proportion of material covered in the course
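Lawshe's Content Validity Ratio for a single item, noted in this section, can be sketched directly from its formula, CVR = (ne - N/2) / (N/2), where ne is the number of expert panelists rating the item essential and N is the panel size:

```python
# Sketch of Lawshe's Content Validity Ratio for one item:
# CVR = (ne - N/2) / (N/2). Panel counts below are invented.
def content_validity_ratio(n_essential, n_panelists):
    half = n_panelists / 2
    return (n_essential - half) / half

print(content_validity_ratio(10, 10))  # 1.0  (all experts: essential)
print(content_validity_ratio(5, 10))   # 0.0  (exactly half -> zero CVR)
print(content_validity_ratio(2, 10))   # -0.6 (most say not essential)
```

The middle case is the "zero CVR" situation described below: exactly half of the experts rate the item essential, so the ratio is 0.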
- Test Blueprint: a plan regarding the types of information to be covered by the items, the no. of items tapping each area of coverage, the organization of the items, and so forth
- more logical than statistical
- concerned with the extent to which the test is representative of a defined body of content consisting of the topics and processes
- a panel of experts can review the test items and rate them in terms of how closely they match the objective or domain specification
- examine if items are essential, useful, and necessary
- Construct Underrepresentation: failure to capture important components of a construct
- Construct-Irrelevant Variance: happens when scores are influenced by factors irrelevant to the construct
- Lawshe: developed the formula for the Content Validity Ratio
- Zero CVR: exactly half of the experts rate the item as essential
Criterion Validity
- more statistical than logical
- a judgement of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest, the measure of interest being the criterion
- Criterion: standard on which a judgement or decision may be made
- Characteristics: relevant, valid, uncontaminated
- Criterion Contamination: occurs when the criterion measure includes aspects of performance that are not part of the job or when the measure is affected by “construct-irrelevant” (Messick, 1989) factors that are not part of the criterion construct
1. Concurrent Validity: the test scores are obtained at about the same time as the criterion measures; economically efficient
2. Predictive Validity: measures the relationship between test scores and a criterion measure obtained at a future time
- Incremental Validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use; used to improve the domain; related to predictive validity
Construct Validity (Umbrella Validity)
- covers all types of validity
- logical and statistical
- a judgement about the appropriateness of inferences drawn from test scores regarding individual standing on a variable called a construct
- Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior; unobservable, presupposed traits that may be invoked to describe test behavior or criterion performance
- One way a test developer can improve the homogeneity of a test containing dichotomous items is by eliminating items that do not show significant correlation coefficients with total test scores
- If it is an academic test and high scorers on the entire test for some reason tended to get a particular item wrong while low scorers got it right, then the item is obviously not a good one
- Some constructs lend themselves more readily than others to predictions of change over time
- Method of Contrasted Groups: demonstrate that scores on the test vary in a predictable way as a function of membership in a group
- If a test is a valid measure of a particular construct, then the scores of people who do not have that construct should differ from the scores of those who really possess it
- Convergent Evidence: scores on the test undergoing construct validation tend to be highly correlated with scores on another established, validated test that measures the same construct
- Discriminant Evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not be correlated
Evidence of construct validity:
- the test is homogeneous
- test scores increase or decrease as a function of age, passage of time, or experimental manipulation
- pretest-posttest differences
- scores differ between groups
- scores correlate with scores on other tests in accordance with what is predicted
o Factor Analysis – designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
 Developed by Charles Spearman
 Employed as a data reduction method
 Used to study the interrelationships among a set of variables
 Identify the factor or factors in common between test scores on subscales within a particular test
 Exploratory FA: estimating or extracting factors; deciding how many factors must be retained
 Confirmatory FA: researchers test the degree to which a hypothetical model fits the actual data
 Factor Loading: conveys info about the extent to which the factor determines the test score or scores
 Can be used to obtain both convergent and discriminant validity
o Cross-Validation – revalidation of the test to a criterion based on another group different from the original group from which the test was validated
 Validity Shrinkage: decrease in validity after cross-validation
 Co-Validation: validation of more than one test from the same group
 Co-Norming: norming more than one test from the same group
o Bias – a factor inherent in a test that systematically prevents accurate, impartial measurement
 Prejudice, preferential treatment
 Prevented during test development through a procedure called Estimated True Score Transformation
o Rating – numerical or verbal judgement that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a Rating Scale
 Rating Error: intentional or unintentional misuse of the scale
 Leniency Error: rater is lenient in scoring (Generosity Error)
 Severity Error: rater is strict in scoring
 Central Tendency Error: rater’s ratings tend to cluster in the middle of the rating scale
 One way to overcome rating errors is to use rankings
 Halo Effect: tendency to give a high score due to failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior
o Fairness – the extent to which a test is used in an impartial, just, and equitable way
o Attempting to define the validity of the test will be futile if the test is NOT reliable
Utility
o Utility – usefulness or practical value of testing to improve efficiency
o Can tell us something about the practical value of the information derived from scores on the test
o Helps us make better decisions
o Higher criterion-related validity = higher utility
o One of the most basic elements in utility analysis is the financial cost of the selection device
o Cost – disadvantages, losses, or expenses in both economic and noneconomic terms
o Benefit – profits, gains, or advantages
o The cost of test administration can be well worth it if the results bring certain noneconomic benefits
o Utility Analysis – a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
o Expectancy Table – provides an indication that a testtaker will score within some interval of scores on a criterion measure – passing, acceptable, failing
o Might indicate future behaviors; if successful, the test is working as it should
o Taylor-Russell Tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection
o Selection Ratio – numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired
o Base Rate – percentage of people hired under the existing system for a particular position
o One limitation of the Taylor-Russell Tables is that the relationship between the predictor (test) and the criterion must be linear
o Naylor-Shine Tables – entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
o Brogden-Cronbach-Gleser Formula – used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument
o Utility Gain – estimate of the benefit of using a particular test
o Productivity Gains – an estimated increase in work output
o High-performing applicants may have been offered jobs by other companies as well
o The more complex the job, the more people differ on how well or poorly they do that job
o Cut Score – reference point derived as a result of a judgement and used to divide a set of data into two or more classifications
Relative Cut Score – reference point based on norm-related considerations (norm-referenced); e.g., NMAT
Fixed Cut Scores – set with reference to a judgement concerning the minimum level of proficiency required; e.g., Board Exams
Multiple Cut Scores – the use of two or more cut scores with reference to one predictor for the purpose of categorization
Multiple Hurdle – multi-stage selection process; a cut score is in place for each predictor
Compensatory Model of Selection – assumption that high scores on one attribute can compensate for lower scores
o Angoff Method – setting fixed cut scores
 low interrater reliability
o Known Groups Method – collection of data on the predictor of interest from groups known to possess and not to possess a trait of interest
 The determination of where to set the cutoff score is inherently affected by the composition of the contrasting groups
o IRT-Based Methods – cut scores are typically set based on testtakers’ performance across all the items on the test
 Item-Mapping Method: arrangement of items in a histogram, with each column containing items deemed to be of equivalent value
 Bookmark Method: an expert places a “bookmark” between the two pages that are deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not
o Method of Predictive Yield – takes into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores
o Discriminant Analysis – sheds light on the relationship between identified variables and two naturally occurring groups
Reasons for accepting or rejecting instruments and tools based on Psychometric Properties
Reliability
o Basic Research = 0.70 to 0.90
o Clinical Setting = 0.90 to 0.95
Validity
Item Difficulty
Item Discrimination
P-Value
o P-Value ≤ α: reject the null hypothesis
o P-Value > α: retain (fail to reject) the null hypothesis
Research Methods and Statistics (20)
Statistics Applied in Research Studies on Tests and Test Development
Measures of Central Tendency - statistics that indicate the average or midmost score between the extreme scores in a distribution
- Goal: identify the most typical or representative score of the entire group
- Measures of Central Location
Mean - the average of all the raw scores
- Equal to the sum of the observations divided by the number of observations
- Interval and ratio data (when normally distributed)
- Point of least squares
- Balance point for the distribution
- Susceptible to outliers
Median - the middle score of the distribution
- Ordinal, interval, ratio data
- For extreme scores, use the median
- Identical for sample and population
- Also used when there is an unknown or undetermined score
- Used in “open-ended” categories (e.g., 5 or more, more than 8, at least 10)
- For ordinal data
- If the distribution is skewed for ratio/interval data, use the median
Mode - most frequently occurring score in the distribution
- Bimodal Distribution: if there are two scores that occur with highest frequency
- Not commonly used
- Useful in analyses of qualitative or verbal nature
- For nominal scales, discrete variables
- Value of the mode gives an indication of the shape of the distribution as well as a measure of central tendency
Measures of Spread or Variability – statistics that describe the amount of variation in a distribution
- Gives an idea of how well the measure of central tendency represents the data
- A large spread of values means large differences between individual scores
Range - equal to the difference between the highest and the lowest score
- Provides a quick but gross description of the spread of scores
- When its value is based on extreme scores of the distribution, the resulting description of variation may be understated or overstated
Interquartile Range - difference between Q3 and Q1
Semi-Quartile Range - interquartile range divided by 2
Standard Deviation - approximation of the average deviation around the mean
- Gives detail of how much above or below the mean a score is
- Equal to the square root of the average squared deviations about the mean
- Equal to the square root of the variance
- Distance from the mean
Variance - equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean
- Average squared deviation around the mean
Measures of Location
Percentile or Percentile Rank - not linearly transformable; converged at the middle while the outer ends show large intervals
- Expressed in terms of the percentage of persons in the standardization sample who fall below a given score
- Indicates the individual’s relative position in the standardization sample
Quartile - dividing points between the four quarters in the distribution
- Specific point
- Quarter: refers to an interval
Decile/STEN - divides the distribution into 10 equal parts
Skewness - a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean
Correlation
Pearson R - interval/ratio + interval/ratio
Spearman Rho - ordinal + ordinal
Biserial - artificial dichotomous + interval/ratio
Point Biserial - true dichotomous + interval/ratio
Phi Coefficient - nominal (true dic.) + nominal (true/artificial dic.)
Tetrachoric - artificial dichotomous + artificial dichotomous
Kendall’s - 3 or more ordinal/rank
Rank Biserial - nominal + ordinal
Differences
T-Test Independent - two separate groups, random assignment
- e.g., blood pressure of male and female grad students
T-Test Dependent - one group, two scores
- e.g., blood pressure before and after the lecture of grad students
One-Way ANOVA - 3 or more groups, tested once
- e.g., people in different socio-economic status and the differences of their salaries
One-Way Repeated Measures - 1 group, measured at least 3 times
- e.g., measuring the focus level of board reviewers during morning, afternoon, and night sessions of review
Two-Way ANOVA - 3 or more groups, tested for 2 variables
- e.g., people in different socio-economic status and the differences of their salaries and their eating habits
ANCOVA - used when you need to control for an additional variable which may be influencing the relationship between your independent and dependent variable
ANOVA Mixed Design - 2 or more groups, measured more than 3 times
- e.g., Young Adults’, Middle Adults’, and Old Adults’ blood pressure is measured during breakfast, lunch, and dinner
Non-Parametric Tests
Mann-Whitney U Test - t-test independent
Wilcoxon Signed Rank Test - t-test dependent
Kruskal-Wallis H Test - one-way/two-way ANOVA
Friedman Test - ANOVA repeated measures
Lambda - for 2 groups of nominal data
Chi-Square
Goodness of Fit - used to measure differences; involves nominal data and only one variable with 2 or more categories
Test of Independence - used to measure correlation; involves nominal data and two variables with two or more categories
Regression – used when one wants to provide a framework of prediction on the basis of one factor in order to predict the probable value of another factor
Linear Regression of Y on X - Y = a + bX
- Used to predict the unknown value of variable Y when the value of variable X is known
Linear Regression of X on Y - X = c + dY
- Used to predict the unknown value of variable X using the known variable Y
II. Test Construction – stage in the process that entails writing test items, revisions, formatting, and setting scoring rules
- It is not good to create an item that contains numerous ideas
- Item Pool: reservoir or well from which the items will or will not be drawn for the final version of the test
- Item Banks: relatively large and easily accessible collection of test questions
- Computerized Adaptive Testing: refers to an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker’s performance on previous items
- The test administered may be different for each testtaker, depending on the test performance on the items presented
- Reduces floor and ceiling effects
- Floor Effects: occur when there is some lower limit on a survey or questionnaire and a large percentage of respondents score near this lower limit (testtakers have low scores)
- Ceiling Effects: occur when there is some upper limit on a survey or questionnaire and a large percentage of respondents score near this upper limit (testtakers have high scores)
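The adaptive logic described above can be sketched with a toy item-branching rule: after each response, pick the next item whose difficulty matches the testtaker's running performance. This is only an illustration under invented assumptions — the item bank, the 0.1 difficulty step, and the simple up/down rule are all hypothetical; operational CAT systems select items with IRT models rather than a fixed step:

```python
# Toy item bank: difficulty level -> item name (all invented for illustration).
item_bank = {round(d, 1): f"item_{i}" for i, d in
             enumerate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])}

def next_difficulty(current, was_correct, step=0.1):
    """Branch upward in difficulty after a correct answer, downward after an error."""
    target = current + step if was_correct else current - step
    return min(max(round(target, 1), 0.1), 0.9)  # stay inside the bank's range

# Simulated run starting at medium difficulty: correct, correct, then wrong.
difficulty = 0.5
for was_correct in [True, True, False]:
    difficulty = next_difficulty(difficulty, was_correct)

print(difficulty, item_bank[difficulty])  # 0.6 item_5
```

Because the sequence of items depends on the responses, two testtakers generally see different tests, which is the point the outline makes about CAT.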
o True Dichotomy – a dichotomy in which there are only two possible categories
o Artificial Dichotomy – a dichotomy in which there are other possibilities in a certain category
Methods and Statistics Used in Research Studies and Test Construction
Test Development
o Test Development – an umbrella term for all that goes into the process of creating a test
I. Test Conceptualization – brainstorming of ideas about what kind of test a developer wants to publish
- stage wherein the ff. are determined: construct, goal, user, taker, administration, format, response, benefits, costs, interpretation
- determines whether the test would be norm-referenced or criterion-referenced
- Pilot Work/Pilot Study/Pilot Research – preliminary research surrounding the creation of a prototype of the test
- Attempts to determine how best to measure a targeted construct
- Entails lit reviews, experimentation, and the creation, revision, and deletion of preliminary items
- Item Branching: ability of the computer to tailor the content and order of presentation of items on the basis of responses to previous items
- Item Format: form, plan, structure, arrangement, and layout of individual test items
- Dichotomous Format: offers two alternatives for each item
- Polychotomous Format: each item has more than two alternatives
- Category Format: a format where respondents are asked to rate a construct
1. Checklist – subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
2. Guttman Scale – items are arranged from weaker to stronger expressions of attitude, belief, or feeling
- Selected-Response Format: requires testtakers to select a response from a set of alternative responses
1. Multiple Choice – has three elements: a stem (question), a correct option, and several incorrect alternatives (distractors or foils); should have one correct answer and grammatically parallel alternatives of similar length that fit grammatically with the stem; avoid ridiculous distractors, excessive length, “all of the above”, and “none of the above” (25% chance of guessing correctly)
- Effective Distractors: a distractor that was chosen equally by both high- and low-performing groups; enhances the consistency of test results
- Ineffective Distractors: may hurt the reliability of the test because they are time-consuming to read and can limit the no. of good items
- Cute Distractors: less likely to be chosen; may affect the reliability of the test because the testtakers may guess from the remaining options
2. Matching Item – testtaker is presented with two columns: premises and responses
3. Binary Choice – usually takes the form of a sentence that requires the testtaker to indicate whether the statement is or is not a fact (50% chance of guessing correctly)
- Constructed-Response Format: requires testtakers to supply or to create the correct answer, not merely select it
1. Completion Item – requires the examinee to provide a word or phrase that completes a sentence
2. Short-Answer – should be written clearly enough that the testtaker can respond succinctly with a short answer
3. Essay – allows creative integration and expression of the material
- Scaling: process of setting rules for assigning numbers in measurement
Primary Scales of Measurement
1. Nominal - involves classification or categorization based on one or more distinguishing characteristics
- Labels and categorizes observations but does not make any quantitative distinctions between observations
- mode
2. Ordinal - rank ordering on some characteristic is also permissible
- median
3. Interval - contains equal intervals; has no absolute zero point (even negative values have an interpretation)
- A zero value does not mean it represents none
4. Ratio - has a true zero point (if the score is zero, it means none/null)
- Easiest to manipulate
Comparative Scales of Measurement
1. Paired Comparison – produces ordinal data by presenting pairs of stimuli which respondents are asked to compare
- The respondent is presented with two objects at a time and asked to select one object according to some criterion
2. Rank Order – respondents are presented with several items simultaneously and asked to rank them in order
3. Constant Sum – respondents are asked to allocate a constant sum of units, such as points, among a set of stimulus objects with respect to some criterion
4. Q-Sort Technique – sort objects based on similarity with respect to some criterion
Non-Comparative Scales of Measurement
1. Continuous Rating – rate the objects by placing a mark at the appropriate position on a continuous line that runs from one extreme of the criterion variable to the other
- e.g., rating Guardians of the Galaxy as the best Marvel movie of Phase 4
2. Itemized Rating – having numbers or brief descriptions associated with each category
- e.g., 1 if you like the item the most, 2 if so-so, 3 if you hate it
3. Likert Scale – respondents indicate their own attitudes by checking how strongly they agree or disagree with carefully worded statements that range from very positive to very negative towards the attitudinal object
- the principle of measuring attitudes by asking people to respond to a series of statements about a topic in terms of the extent to which they agree with them
4. Visual Analogue Scale – a 100-mm line that allows subjects to express the magnitude of an experience or belief
5. Semantic Differential Scale – derives the respondent’s attitude towards the given object by asking him to select an appropriate position on a scale between two bipolar opposites
6. Staple Scale – developed to measure the direction and intensity of an attitude simultaneously
7. Summative Scale – the final score is obtained by summing the ratings across all the items
8. Thurstone Scale – involves the collection of a variety of different statements about a phenomenon which are ranked by an expert panel in order to develop the questionnaire
- allows multiple answers
9. Ipsative Scale – the respondent must choose between two or more equally socially acceptable options
III. Test Tryout – the test should be tried out on people who are similar in critical respects to the people for whom the test was designed
- An informal rule of thumb: no fewer than 5 and preferably as many as 10 subjects for each item (the more, the better)
- Risk of using few subjects = phantom factors emerge
- Should be executed under conditions as identical as possible
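Summative (Likert-type) scoring as described above — item ratings summed into a total, with negatively worded items reverse-scored first — can be sketched as follows. The 5-point scale and the sample responses are invented for illustration:

```python
# Minimal sketch of summative scoring on a 1..5 Likert scale.
SCALE_MAX = 5  # 1 = strongly disagree ... 5 = strongly agree

def reverse(rating):
    """Reverse-score a rating on a 1..SCALE_MAX scale (5 -> 1, 4 -> 2, ...)."""
    return SCALE_MAX + 1 - rating

# (rating, is_negatively_worded) for each item of a hypothetical attitude scale
responses = [(4, False), (2, True), (5, False), (1, True)]

# Reverse-score the negatively worded items, then sum across all items
total = sum(reverse(r) if negative else r for r, negative in responses)
print(total)  # 4 + 4 + 5 + 5 = 18
```

Reverse-scoring keeps every item pointing in the same direction, so a high total consistently means a more favorable attitude toward the object.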
- A good test item is one that is answered correctly by high scorers as a whole
- Empirical Criterion Keying: administering a large pool of test items to a sample of individuals who are known to differ on the construct being measured
- Item Analysis: statistical procedure used to analyze and evaluate test items
- Discriminability Analysis: employed to examine the correlation between each item and the total score of the test
- Item: suggests a sample of behavior of an individual
- Table of Specification: a blueprint of the test in terms of the number of items per difficulty, topic importance, or taxonomy
- Guidelines for Item Writing: define clearly what you want to measure, generate an item pool, avoid long items, keep the level of reading difficulty appropriate for those who will complete the test, avoid double-barreled items, consider making positively and negatively worded items
- Double-Barreled Items: items that convey more than one idea at the same time
- Item Difficulty: defined by the number of people who get a particular item correct
- Item-Difficulty Index: the proportion of the total number of testtakers who answered the item correctly; the larger the index, the easier the item
- Item-Endorsement Index: for personality testing, the percentage of individuals who endorsed an item in a personality test
- The optimal average item difficulty is approx. 50%, with items on the test ranging in difficulty from about 30% to 80%
- Omnibus Spiral Format: items in an ability test are arranged in increasing difficulty
- Item-Reliability Index: provides an indication of the internal consistency of a test; the higher the Item-Reliability Index, the greater the test’s internal consistency
- Item-Validity Index: designed to provide an indication of the degree to which a test is measuring what it purports to measure; the higher the Item-Validity Index, the greater the test’s criterion-related validity
- Item-Discrimination Index: measure of item discrimination; the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly
- Extreme Group Method: compares people who have done well with those who have done poorly
- Discrimination Index: the difference between these proportions
- Point-Biserial Method: correlation between a dichotomous variable and a continuous variable
- Item-Characteristic Curve: graphic representation of item difficulty and discrimination
- Guessing: a problem that has eluded any universally accepted solution
- Item analyses taken under speed conditions yield misleading or uninterpretable results
- Restrict item analysis on a speed test only to the items completed by the testtaker
- The test developer ideally should administer the test to be item-analyzed with generous time limits to complete the test
Scoring Items/Scoring Models
1. Cumulative Model – testtaker obtains a measure of the level of the trait; thus, high scores may suggest a high level of the trait being measured
2. Class Scoring/Category Scoring – testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is similar in some way
3. Ipsative Scoring – compares a testtaker’s score on one scale within a test to another scale within that same test; two unrelated constructs
IV. Test Revision – characterize each item according to its strengths and weaknesses
- As revision proceeds, the advantage of writing a large item pool becomes more apparent, because some items will be removed and must be replaced by items from the item pool
- Administer the revised test under standardized conditions to a second appropriate sample of examinees
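The item-difficulty index and the extreme-group discrimination index described above reduce to two proportions. A minimal sketch with an invented response pattern (1 = correct, 0 = wrong on a single item, with testtakers already split into high- and low-scoring groups):

```python
# Invented responses of two extreme groups to one test item.
high_scorers = [1, 1, 1, 0, 1]  # top group's responses to the item
low_scorers = [0, 1, 0, 0, 0]   # bottom group's responses to the item

def proportion_correct(responses):
    return sum(responses) / len(responses)

# Item-difficulty index: proportion of all testtakers answering correctly
p_all = proportion_correct(high_scorers + low_scorers)

# Discrimination index: proportion correct in the high group minus the low group
d = proportion_correct(high_scorers) - proportion_correct(low_scorers)

print(p_all)  # 0.5 -> close to the optimal average difficulty of ~50%
print(d)      # ≈ 0.6 -> high scorers outperform low scorers; item discriminates well
```

A discrimination index near zero (or negative, as in the "high scorers got it wrong" case noted earlier) would flag the item for revision or removal.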
- Cross-Validation: revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion; often results in validity shrinkage
- Validity Shrinkage: decrease in item validities that inevitably occurs after cross-validation
- Co-Validation: conducted on two or more tests using the same sample of testtakers
- Co-Norming: creation of norms or the revision of existing norms
- Anchor Protocol: test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies
- Scoring Drift: discrepancy between the scoring in an anchor protocol and the scoring of another protocol
- Differential Item Functioning: an item functions differently in one group of testtakers known to have the same level of the underlying trait
- DIF Analysis: test developers scrutinize group-by-group item response curves looking for DIF items, alongside expert judgments regarding item effectiveness
- DIF Items: items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership
o Basal Level – the level at which the minimum criterion number of correct responses is obtained
o Computer Assisted Psychological Assessment – standardized test administration is assured for testtakers and variation is kept to a minimum
 Test content and length are tailored according to the taker’s ability
o Computerized Adaptive Testing – refers to an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker’s performance on previous items
 The test administered may be different for each testtaker, depending on the test performance on the items presented
 Reduces floor and ceiling effects
 Item Branching: ability of the computer to tailor the content and order of presentation of items on the basis of responses to previous items
 Routing Test: subtest used to direct or route the testtaker to a suitable level of items
 Item-Mapping Method: setting cut scores that entails a histographic representation of items
Statistics
o Measurement – the act of assigning numbers or symbols to characteristics of things according to rules
Descriptive Statistics – methods used to provide a concise description of a collection of quantitative information
Inferential Statistics – methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population
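The descriptive/inferential distinction above can be made concrete: descriptive statistics summarize the sample itself, while an inferential quantity such as the standard error of the mean uses the sample to estimate where the population mean lies. The sample scores below are invented for illustration:

```python
import statistics

sample = [10, 12, 9, 11, 13, 11, 12, 10]  # invented sample of test scores

# Descriptive: concise description of this collection of scores
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential: standard error of the mean and a rough 95% interval
# for the population mean, assuming an approximately normal sampling distribution
sem = sd / len(sample) ** 0.5
ci_95 = (mean - 1.96 * sem, mean + 1.96 * sem)

print(mean)  # 11
print(round(sem, 2))
```

The first two values describe only these eight scores; the interval is the inferential step, reaching from the sample out to the population.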
o Magnitude – the property of “moreness”
o Equal Intervals – the difference between two
points at any place on the scale has the same
meaning as the difference between two other
points that differ by the same number of scale units
o Absolute 0 – when nothing of the property being
measured exists
o Scale – a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned
Continuous Scale – takes on any value within the range, and the possible values within that range are infinite
- used to measure a variable which can theoretically be divided
Discrete Scale – can be counted; has distinct,
countable values
- used to measure a variable which cannot theoretically be divided
o Error – refers to the collective influence of all the
factors on a test score or measurement beyond
those specifically measured by the test or
measurement
 Degree to which the test score/measurement
may be wrong, considering other factors like
state of the testtaker, venue, test itself etc.
 Measurement with a continuous scale always involves error
Four Levels of Scales of Measurement
Nominal – involve classification or categorization
based on one or more distinguishing characteristics
- Label and categorize observations but do not make
any quantitative distinctions between observations
- mode
Ordinal – rank ordering on some characteristics is also permissible
- median
Interval – contains equal intervals, has no absolute zero point (even negative values have an interpretation)
Ratio – has a true zero point (if the score is zero, it means none/null)
- Easiest to manipulate
o Distribution – defined as a set of test scores arrayed for recording or study
o Raw Scores – straightforward, unmodified accounting of performance that is usually numerical
o Frequency Distribution – all scores are listed alongside the number of times each score occurred
o Independent Variable – being manipulated in the study
o Quasi-Independent Variable – nonmanipulated variable used to designate groups
 Factor: for ANOVA
Post-Hoc Tests – used in ANOVA to determine which mean differences are significantly different
Tukey’s HSD test – allows computing a single value that determines the minimum difference between treatment means that is necessary for significance
o Measures of Central Tendency – statistics that indicate the average or midmost score between the extreme scores in a distribution
 Goal: identify the most typical or representative of the entire group
Mean – the average of all the raw scores
- Equal to the sum of the observations divided by the number of observations
- Interval and ratio data (when normal distribution)
- Point of least squares
- Balance point for the distribution
Median – the middle score of the distribution
- Ordinal, Interval, Ratio
- Useful in cases where relatively few scores fall at the high end of the distribution or relatively few scores fall at the low end of the distribution
- In other words, for extreme scores, use the median (skewed)
- Identical for sample and population
- Also used when there is an unknown or undetermined score
- Used in “open-ended” categories (e.g., 5 or more, more than 8, at least 10)
- For ordinal data
Mode – most frequently occurring score in the distribution
- Bimodal Distribution: if there are two scores that occur with the highest frequency
- Not commonly used
- Useful in analyses of qualitative or verbal nature
- For nominal scales, discrete variables
- Value of the mode gives an indication of the shape of the distribution as well as a measure of central tendency
o Variability – an indication of how scores in a distribution are scattered or dispersed
o Measures of Variability – statistics that describe the amount of variation in a distribution
o Range – equal to the difference between the highest and the lowest score
 Provides a quick but gross description of the spread of scores
 When its value is based on extreme scores of the distribution, the resulting description of variation may be understated or overstated
o Quartile – dividing points between the four quarters in the distribution
 Specific point
 Quarter: refers to an interval
 Interquartile Range: measure of variability equal to the difference between Q3 and Q1
 Semi-interquartile Range: equal to the interquartile range divided by 2
o Standard Deviation – equal to the square root of the average squared deviations about the mean
 Equal to the square root of the variance
 Variance: equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean
 Distance from the mean
o Normal Curve – also known as Gaussian Curve
o Bell-shaped, smooth, mathematically defined curve that is highest at its center
o Asymptotically = approaches but never touches the axis
o Tail – 2–3 standard deviations above and below the mean
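The central tendency and variability measures above can be checked directly with Python's standard library (the score list is invented for illustration):

```python
import statistics

scores = [2, 3, 3, 5, 7, 7, 7, 9, 11]

print(statistics.mean(scores))       # sum / count -> 6
print(statistics.median(scores))     # middle score of the ordered list -> 7
print(statistics.mode(scores))       # most frequently occurring score -> 7
print(max(scores) - min(scores))     # range: highest minus lowest -> 9
print(statistics.pvariance(scores))  # mean of squared deviations from the mean -> 8
print(statistics.pstdev(scores))     # standard deviation = sqrt(variance)
```

Note that `pvariance`/`pstdev` treat the list as a whole population; `statistics.variance`/`statistics.stdev` are the sample versions.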
o Skewed is associated with abnormal, perhaps because the skewed distribution deviates from the symmetrical or so-called normal distribution
o Symmetrical Distribution – right side of the graph is a mirror image of the left side
 Has only one mode and it is in the center of the distribution
 Mean = median = mode
o Skewness – nature and extent to which symmetry is absent
o Positive Skewed – few scores fall at the high end of the distribution
 The exam is difficult
 More items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores
 Mean > Median > Mode
o Negative Skewed – when relatively few of the scores fall at the low end of the distribution
 The exam is easy
 More items of a higher level of difficulty would make it possible to better discriminate between scores at the upper end of the distribution
 Mean < Median < Mode
o Kurtosis – steepness of a distribution in its center
Platykurtic – relatively flat
Leptokurtic – relatively peaked
Mesokurtic – somewhere in the middle
 High Kurtosis = high peak and fatter tails
 Lower Kurtosis = rounded peak and thinner tails
o Standard Score – raw score that has been converted from one scale to another scale
o Z-Scores – result from the conversion of a raw score into a number indicating how many SD units the raw score is below or above the mean of the distribution
 Identify and describe the exact location of each score in a distribution
 Standardize an entire distribution
 Zero plus or minus one scale
 Have negative values
 Requires that we know the value of the variance to compute the standard error
o T-Scores – a scale with a mean set at 50 and a standard deviation set at 10
 Fifty plus or minus 10 scale
 5 standard deviations below the mean would be equal to a T-score of 0
 Raw score that falls at the mean has a T of 50
 Raw score 5 standard deviations above the mean would be equal to a T of 100
 No negative values
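The z- and T-score definitions above reduce to two one-line formulas: z = (raw - mean) / SD, and T = 50 + 10 * z. A small sketch (the raw-score list is invented for illustration):

```python
import statistics

raw_scores = [40, 50, 55, 60, 70, 75, 80, 90]
mu = statistics.mean(raw_scores)       # 65
sigma = statistics.pstdev(raw_scores)

def z_score(x):
    # how many SD units x lies above (+) or below (-) the mean
    return (x - mu) / sigma

def t_score(x):
    # re-center to mean 50, SD 10, so typical scores stay positive
    return 50 + 10 * z_score(x)

print(z_score(65))  # 0.0 -> a raw score exactly at the mean
print(t_score(65))  # 50.0
```

The same pattern yields the other standard scores listed in this reviewer, e.g. IQ = 100 + 15 * z (mean 100, SD 15) or stanine (mean 5, SD 2, rounded and clipped to 1-9).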
 Used when the population variance is unknown
o Stanine – a method of scaling test scores on a nine-point standard scale with a mean of five (5) and a standard deviation of two (2)
o Linear Transformation – one that retains a direct numerical relationship to the original raw score
o Nonlinear Transformation – required when the data under consideration are not normally distributed
o Normalizing the distribution involves stretching the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale that is technically referred to as a Normalized Standard Score Scale
o Generally preferable to fine-tune the test according to difficulty or other relevant variables so that the resulting distribution will approximate the normal curve
o STEN – standard to ten; divides a scale into 10 units

             Mean   SD
Z-Score      0      1
T-Score      50     10
Stanine      5      2
STEN         5.5    2
IQ           100    15
GRE or SAT   500    100

o Hypothesis Testing – statistical method that uses sample data to evaluate a hypothesis about a population
Alternative Hypothesis – states there is a change, difference, or relationship
Null Hypothesis – no change, no difference, or no relationship
o Alpha Level or Level of Significance – used to define the concept of “very unlikely” in a hypothesis test
o Critical Region – composed of extreme values that are very unlikely to be obtained if the null hypothesis is true
o If sample data fall in the critical region, the null hypothesis is rejected
o The alpha level for a hypothesis test is the probability that the test will lead to a Type I error
o Directional Hypothesis Test or One-Tailed Test – statistical hypotheses specify either an increase or a decrease in the population mean
o T-Test – used to test hypotheses about an unknown population mean and variance
 Can be used in “before and after” type of research
 Sample must consist of independent observations; that is, there is no consistent, predictable relationship between the first observation and the second
 The population that is sampled must be normal
 If the distribution is not normal, use a large sample
o Correlation Coefficient – number that provides us with an index of the strength of the relationship between two things
o Correlation – an expression of the degree and direction of correspondence between two things
 + & - = direction
 Number anywhere from -1 to 1 = magnitude
 Positive – same direction, either both going up or both going down
 Negative – inverse direction, either DV goes up and IV goes down, or IV goes up and DV goes down
 0 = no correlation
o Pearson r/Pearson Correlation Coefficient/Pearson Product-Moment Coefficient of Correlation – used when the two variables being correlated are continuous and linear
 Devised by Karl Pearson
 Coefficient of Determination – an indication of how much variance is shared by the X- and Y-variables
o Spearman Rho/Rank-Order Correlation Coefficient/Rank-Difference Correlation Coefficient – frequently used if the sample size is small and when both sets of measurements are ordinal
 Developed by Charles Spearman
o Outlier – extremely atypical point located at a relatively long distance from the rest of the coordinate points in a scatterplot
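Both coefficients above can be computed from the product-moment definition; Spearman's rho is simply Pearson's r applied to ranks. A minimal pure-Python sketch (the paired data are invented, and the rank helper ignores ties, which full implementations handle with averaged ranks):

```python
def pearson_r(x, y):
    # product-moment correlation: covariance / (SD of x * SD of y)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank(values):
    # 1 = smallest; assumes no tied values in this sketch
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    # Pearson r computed on the rank-ordered data
    return pearson_r(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(round(r, 3))                   # 0.8 -> strong positive correlation
print(round(r * r, 3))               # 0.64 -> coefficient of determination
print(round(spearman_rho(x, y), 3))  # 0.8
```

A positive sign means the two variables move in the same direction; squaring r gives the proportion of variance the X- and Y-variables share.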
o Regression Analysis – used for prediction
 Predict the values of a dependent or response variable based on values of at least one independent or explanatory variable
 Residual: the difference between an observed value of the response variable and the value of the response variable predicted from the regression line
 The Principle of Least Squares
 Standard Error of Estimate: standard deviation of the residuals in regression analysis
 Slope: determines how much the Y variable changes when X is increased by 1 point
o T-Test (Independent) – comparison or determining differences
 2 different groups/independent samples + interval/ratio scales (continuous variables)
Equal Variance – 2 groups are equal
Unequal Variance – groups are unequal
o T-test (Dependent)/Paired Test – one group, nominal (either matched or repeated measures) + 2 treatments
o One-Way ANOVA – 3 or more groups (levels of 1 IV), 1 DV; comparison of differences
o Two-Way ANOVA – 2 IV, 1 DV
o Critical Value – reject the null and accept the alternative if [ obtained value > critical value ]
o P-Value (Probability Value) – reject null and accept alternative if [ p-value < alpha level ]
o Norms – refer to the performances by defined groups on a particular test
o Age-Related Norms – certain tests have different normative groups for particular age groups
o Tracking – tendency to stay at about the same level relative to one’s peers
Norm-Referenced Tests – compare each person with the norm
Criterion-Referenced Tests – describe specific types of skills, tasks, or knowledge that the test taker can demonstrate
Selection of Assessment Methods and Tools and Uses, Benefits, and Limitations of Assessment Tools and Instruments (32)
Identify appropriate assessment methods, tools (2)
1. Test – measuring device or procedure
- Psychological Test: device or procedure designed to measure variables related to psychology
Ability or Maximal Performance Test – assesses what a person can do
a. Achievement Test – measurement of previous learning
b. Aptitude – refers to the potential for learning or acquiring a specific skill
c. Intelligence – refers to a person’s general potential to solve problems, adapt to changing environments, think abstractly, and profit from experience
Human Ability – considerable overlap of achievement, aptitude, and intelligence tests
Typical Performance Test – measures usual or habitual thoughts, feelings, and behavior
Personality Test – measures individual dispositions and preferences
a. Structured Personality Tests – provide statements, usually self-report, and require the subject to choose between two or more alternative responses
b. Projective Personality Tests – unstructured, and the stimulus or response are ambiguous
c. Attitude Test – elicits personal beliefs and opinions
d. Interest Inventories – measure likes and dislikes as well as one’s personality orientation towards the world of work
- Purpose: for evaluation, drawing conclusions about some aspects of the behavior of a person, therapy, decision-making
- Settings: Industrial, Clinical, Educational, Counseling, Business, Courts, Research
- Population: Test Developers, Test Publishers, Test Reviewers, Test Users, Test Sponsors, Test Takers, Society
Levels of Tests
1. Level A – anyone under the direction of a supervisor or consultant
2. Level B – psychometricians and psychologists only
3. Level C – psychologists only
2. Interview – method of gathering information through direct communication involving reciprocal exchange
- can be structured, unstructured, semi-structured, or non-directive
- Mental Status Examination: determines the mental status of the patient
- Intake Interview: determines why the client came for assessment; chance to inform the client about the policies, fees, and process involved
- Social Case: biographical sketch of the client
- Employment Interview: determines whether the candidate is suitable for hiring
- Panel Interview (Board Interview): more than one interviewer participates in the assessment
- Motivational Interview: used by counselors and clinicians to gather information about some problematic behavior, while simultaneously attempting to address it therapeutically
3. Portfolio – samples of one’s ability and accomplishment
- Purpose: usually in industrial settings for evaluation of future performance
4. Case History Data – refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
5. Behavioral Observation – monitoring of actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
- Naturalistic Observation: observe humans in a natural setting
6. Role Play – defined as acting an improvised or partially improvised part in a simulated situation
- Role Play Test: assessees are directed to act as if they are in a particular situation
- Purpose: Assessment and Evaluation
- Settings: Industrial, Clinical
- Population: Job Applicants, Children
7. Computers – using technology to assess a client; thus, computers can serve as test administrators and very efficient test scorers
8. Others: videos, biofeedback devices
Intelligence Tests
Stanford-Binet Intelligence Scale 5th Ed. (SB-5) [C]
- 2-85 years old
- individually administered
- norm-referenced
- Scales: Verbal, Nonverbal, and Full Scale (FSIQ)
- Nonverbal and Verbal Cognitive Factors: Fluid Reasoning, Knowledge, Quantitative Reasoning, Visual-Spatial Processing, Working Memory
- age scale and point-scale format
- originally created to identify mentally disabled children in Paris
- 1908 Scale introduced the Age Scale format and Mental Age
- 1916 scale significantly applied the IQ concept
- Standard Scores: 100 (mean), 15 (SD)
- Scaled Scores: 10 (mean), 3 (SD)
- co-normed with the Bender-Gestalt and Woodcock-Johnson Tests
- based on the Cattell-Horn-Carroll Model of General Intellectual Ability
- no accommodations for PWDs
- 2 routing tests
- w/ teaching items, floor level, and ceiling level
- provides behavioral observations during administration
Wechsler Intelligence Scales (WAIS-IV, WPPSI-IV, WISC-V) [C]
- WAIS (16-90 years old), WPPSI (2-6 years old), and WISC (6-11)
- individually administered
- norm-referenced
- Standard Scores: 100 (mean), 15 (SD)
- Scaled Scores: 10 (mean), 3 (SD)
- addresses the weaknesses in the Stanford-Binet
- could also assess functioning in people with brain injury
- evaluates patterns of brain dysfunction
- yields FSIQ, Index Scores (Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed), and subtest-level scaled scores
Raven’s Progressive Matrices (RPM) [B]
- 4 – 90 years old
- nonverbal test
- used to measure general intelligence & abstract reasoning
- multiple choice of abstract reasoning
- group test
- IRT-based
Culture Fair Intelligence Test (CFIT) [B]
- nonverbal instrument to measure analytical and reasoning ability in abstract and novel situations
- measures individual intelligence in a manner designed to reduce, as much as possible, the influence of culture
- individual or by group
- aids in the identification of learning problems and helps in making more reliable and informed decisions in relation to the special education needs of children
Purdue Non-Language Test [B]
- designed to measure mental ability, since it consists entirely of geometric forms
- culture-fair
- self-administering
Panukat ng Katalinuhang Pilipino
- basis for screening, classifying, and identifying needs that will enhance the learning process
- in business, it is utilized as a predictor of occupational achievement by gauging an applicant’s ability and fitness for a particular job
- essential for determining one’s capacity to handle the challenges associated with certain degree programs
- Subtests: Vocabulary, Analogy, Numerical Ability,
Wonderlic Personnel Test (WPT)
- assesses cognitive ability and problem-solving aptitude of prospective employees
- multiple choice, answered in 12 minutes
Armed Services Vocational Aptitude Battery
- most widely used aptitude test in the US
- multiple-aptitude battery that measures developed abilities and helps predict future academic and occupational success in the military
Kaufman Assessment Battery for Children-II (KABC-II)
- Alan & Nadeen Kaufman
- for assessing cognitive development in children
- 3 to 18 years old
Personality Tests
Minnesota Multiphasic Personality Inventory (MMPI-2) [C]
- multiphasic personality inventory intended for use with both clinical and normal populations to identify sources of maladjustment and personal strengths
- Starke Hathaway and J. Charnley McKinley
- helps in diagnosing mental health disorders, distinguishing normal from abnormal
- should be administered to someone with no guilt feelings for committing a crime
- individual or by groups
- Clinical Scales: Hypochondriasis, Depression, Hysteria, Psychopathic Deviate, Masculinity/Femininity, Paranoia, Psychasthenia (Anxiety, Depression, OCD), Schizophrenia, Hypomania, Social Introversion
- Lie Scale (L Scale): items that are somewhat negative but apply to most people; assesses the likelihood of the test taker to approach the instrument with a defensive mindset
- High L scale = faking good
- High F scale = faking bad, severe distress or psychopathology
- Superlative Self-Presentation Scale (S Scale): a measure of defensiveness; to see if you intentionally distort answers to look better
- Correction Scale (K Scale): reflection of the frankness of the testtaker’s self-report
- K Scale reveals a person’s defensiveness around certain questions and traits; also faking good
- K scale is sometimes used to correct scores on five clinical scales; the scores are statistically corrected for an individual’s overwillingness or unwillingness to admit deviance
- “Cannot Say” (CNS) Scale: measures how a person doesn’t answer a test item
- High ? Scale: client might have difficulties with reading, psychomotor retardation, or extreme defensiveness
- True Response Inconsistency (TRIN): five true, then five false answers
- Varied Response Inconsistency (VRIN): random true or false
- Infrequency-Psychopathology Scale (Fp): reveals intentional or unintentional over-reporting
- FBS Scale: “symptom validity scale” designed to detect intentional over-reporting of symptoms
- Back Page Infrequency (Fb): reflects significant change in the testtaker’s approach to the latter part of the test
Myers-Briggs Type Indicator (MBTI)
- Katherine Cook Briggs and Isabel Briggs Myers
- self-report inventory designed to identify a person’s personality type, strengths, and preferences
- Extraversion-Introversion Scale: where you prefer to focus your attention and energy, the outer world and external events or your inner world of ideas and experiences
- Sensing-Intuition Scale: how you take in information, as it is or by interpreting and adding meaning to it
- Thinking-Feeling Scale: how you make decisions, logically or by following what your heart says
- Judging-Perceiving Scale: how you orient to the outer world; what is your style in dealing with the outer world – getting things decided or staying open to new info and options?
Edwards Personal Preference Schedule (EPPS) [B]
- designed primarily as an instrument for research and counselling purposes to provide quick and convenient measures of a number of relatively normal personality variables
- based on Murray’s Need Theory
- objective, forced-choice inventory for assessing the relative importance that an individual places on 15 personality variables
- useful in personal counselling and with non-clinical adults
- individual
Guilford-Zimmerman Temperament Survey (GZTS)
- items are stated affirmatively rather than in question form, using the 2nd person pronoun
- measures 10 personality traits: General Activity, Restraint, Ascendance, Sociability, Emotional Stability, Objectivity, Friendliness, Thoughtfulness, Personal Relations, Masculinity
NEO Personality Inventory (NEO-PI-R)
- standard questionnaire measure of the Five Factor Model; provides systematic assessment of emotional, interpersonal, experiential, attitudinal, and motivational styles
- gold standard for personality assessment
- self-administered
- Neuroticism: identifies individuals who are prone to psychological distress
- Extraversion: quantity and intensity of energy directed outward
- Openness to Experience: active seeking and appreciation of experiences for their own sake
- Agreeableness: the kind of interactions an individual prefers, from compassion to tough-mindedness
- Conscientiousness: degree of organization, persistence, control, and motivation in goal-directed behavior
Panukat ng Ugali at Pagkatao/Panukat ng Pagkataong Pilipino
- indigenous personality test
- taps specific values, traits, and behavioral dimensions related or meaningful to the study of Filipinos
Sixteen Personality Factor Questionnaire
- Raymond Cattell
- constructed through factor analysis
- evaluates personality on two levels of traits
- Primary Scales: Warmth, Reasoning, Emotional Stability, Dominance, Liveliness, Rule-Consciousness, Social Boldness, Sensitivity, Vigilance, Abstractedness, Privateness, Apprehension, Openness to Change, Self-Reliance, Perfectionism, Tension
- Global Scales: Extraversion, Anxiety, Tough-Mindedness, Independence, Self-Control
Big Five Inventory-II (BFI-2)
- Soto & John
- assesses the Big 5 domains and 15 facets
- available for noncommercial purposes to researchers and students
Projective Tests
Rorschach Inkblot Test [C]
- Hermann Rorschach
- 5 years and older
- subjects look at 10 ambiguous inkblot images and describe what they see in each one
- once used to diagnose mental illnesses like schizophrenia
- Exner System: coding system used in this test
- Content: the name or class of objects used in the patient’s responses
Content:
1. Nature
2. Animal Feature
3. Whole Human
4. Human Feature
5. Fictional/Mythical Human Detail
6. Sex
Determinants:
1. Form
2. Movement
3. Color
4. Shading
5. Pairs and Reflections
Location:
1. W – the whole inkblot was used to depict an image
2. D – a commonly described part of the blot was used
3. Dd – an uncommonly described or unusual detail was used
4. S – the white space in the background was used
Thematic Apperception Test [C]
- Christiana Morgan and Henry Murray
- 5 and above
- 31 picture cards serve as stimuli for stories and descriptions about relationships or social situations
- popularly known as the picture interpretation technique because it uses a standard series of provocative yet ambiguous pictures about which the subject is asked to tell a story
- also modified for African American testtakers
Children’s Apperception Test
- Bellak & Bellak
- 3-10 years old
- based on the idea that animals engaged in various activities were useful in stimulating projective storytelling by children
Hand Test
- Edward Wagner
- 5 years old and above
- used to measure action tendencies, particularly acting out and aggressive behavior, in adults and children
- 10 cards (1 blank)
Apperceptive Personality Test (APT)
- Holmstrom et al.
- attempt to address the criticisms of the TAT
- introduced objectivity in the scoring system
- 8 cards include male and female figures of different ages and minority group members
- testtakers respond to a series of multiple choice questions after storytelling
Word Association Test (WAT)
- Rapaport et al.
- presentation of a list of stimulus words; the assessee responds verbally or in writing with the first thing that comes to mind
Rotter Incomplete Sentences Blank (RISB)
- Julian Rotter & Janet Rafferty
- Grade 9 to Adulthood
- most popular SCT
SACK’s Sentence Completion Test (SSCT)
- Joseph Sacks and Sidney Levy
- 12 years old and older
- asks respondents to complete 60 questions with the first thing that comes to mind across four areas: Family, Sex, Interpersonal Relationships, and Self-concept
Bender-Gestalt Visual Motor Test [C]
- Lauretta Bender
- 4 years and older
- consists of a series of durable template cards, each displaying a unique figure; examinees are asked to draw each figure as they observe it
- provides interpretative information about an individual’s development and neuropsychological functioning
- reveals the maturation level of visuomotor perceptions, which is associated with language ability and various functions of intelligence
House-Tree-Person Test (HTP)
- John Buck and Emmanuel Hammer
- 3 years and up
- measures aspects of a person’s personality through interpretation of drawings and responses to questions
- can also be used to assess brain damage and general mental functioning
- measures the person’s psychological and emotional functioning
- the house reflects the person’s experience of their immediate social world
- the tree is a more direct expression of the person’s emotional and psychological sense of self
- the person is a more direct reflection of the person’s sense of self
Draw-A-Person Test (DAP)
- Florence Goodenough
- 4 to 10 years old
- a projective drawing task that is often utilized in psychological assessments of children
- aspects such as the size of the head, placement of the arms, and even things such as whether teeth were drawn or not are thought to reveal a range of personality traits
- helps people who have anxieties taking tests (no strict format)
- can assess people with communication problems
- relatively culture free
- allows for self-administration
Kinetic Family Drawing
- Burns & Kaufman
- derived from Hulses’ FDT; “doing something”
Clinical & Counseling Tests
Millon Clinical Multiaxial Scale-IV (MCMI-IV)
- Theodore Millon
- 18 years old and above
- for diagnosing and treatment of personality disorders
- exaggeration of polarities results in maladaptive behavior
- Pleasure-Pain: the fundamental evolutionary task
- Active-Passive: one adapts to the environment or adapts the environment to one’s self
- Self-Others: invest in others versus invest in oneself
Beck Depression Inventory (BDI-II)
- Aaron Beck
- 13 to 80 years old
- 21-item self-report that taps Major Depressive symptoms according to the criteria in the DSM
MacAndrew Alcoholism Scale (MAC & MAC-R)
- from the MMPI-2
- personality & attitude variables thought to underlie alcoholism
California Psychological Inventory (CPI-III)
- attempts to evaluate personality in normally adjusted individuals
- has validity scales that determine faking bad and faking good
- interpersonal style and orientation, normative orientation and values, cognitive and intellectual function, and role and personal style
- has special purpose scales, such as managerial potential, work orientation, creative temperament, leadership potential, amicability, law enforcement orientation, tough-mindedness
Rosenberg Self-Esteem Scale
- measures global feelings of self-worth
- 10-item, 4-point Likert scale
- used with adolescents
Dispositional Resilience Scale (DRS)
- measures psychological hardiness, defined as the ability to view stressful situations as meaningful, changeable, and challenging
Ego Resiliency Scale-Revised
- measures ego resiliency or emotional intelligence
HOPE Scale
- developed by Snyder
- Agency: cognitive model with goal-driven energy
- Pathway: capacity to construct systems to meet goals
- good measure of hope for traumatized people
- positively correlated with healthy psychological adjustment, high achievement, good problem solving skills, and positive health-related outcomes
Satisfaction with Life Scale (SWLS)
- overall assessment of life satisfaction as a cognitive judgmental process
Positive and Negative Affect Schedule (PANAS)
- measures the level of positive and negative emotions a test taker has during the test administration
Strengths and weaknesses of assessment tools (2)
Test
Pros:
- can gather a sample of behavior objectively with lesser bias
- flexible, can be verbal or nonverbal
Cons:
- in crisis situations when relatively rapid decisions need to be made, it can be impractical to take the time required to administer and interpret tests
Interview
Pros:
- can take note of verbal and nonverbal cues
- flexible
- time and cost effective
- both structured and unstructured formats allow clinicians to place responses in a wider, more meaningful context
- can also be used to help predict future behaviors
- interviews allow clinicians to establish rapport and encourage client self-exploration
Cons:
- sometimes, due to negligence of the interviewer and interviewee, it can miss out on important information
- interviewer’s effect on the interviewee
- various errors such as halo effect, primacy effect, etc.
- interrater reliability
- interviewer bias
Portfolio
Pros:
- provides a comprehensive illustration of the client which highlights the strengths and weaknesses
Cons:
- can be very demanding
- time consuming
Observation
Pros:
- flexible
- suitable for subjects that cannot be studied in a lab setting
- more realistic
- affordable
- can detect patterns
Cons:
- for private practitioners, it is typically not practical or economically feasible to spend hours out of the consulting room observing clients as they go about their daily lives
- lack of scientific control, ethical considerations, and potential for bias from observers and subjects
- unable to draw cause-and-effect conclusions
- lack of control
- lack of validity
- observer bias
Case History
Pros:
- can fully show the experience of the observer in the program
- sheds light on an individual’s past and current adjustment as well as on the events and circumstances that may have contributed to any changes in adjustment
Cons:
- cannot be used to generalize a phenomenon
Role Play
Pros:
- encourages individuals to come together to find solutions and to get to know how their colleagues think
- the group can discuss ways to potentially resolve the situation, and participants leave with as much information as possible, resulting in more efficient handling of similar real-life scenarios
Cons:
- may not be as useful as the real thing in all situations
- time-consuming
- expensive
- inconvenient to assess in a real situation
- while some employees will be comfortable role playing, they’re less adept at getting into the required mood needed to actually replicate a situation
Test Administration, Scoring, Interpretation and Usage (20)
Detect Errors and Impacts in Tests
Issues in Intelligence Testing
1. Flynn Effect – progressive rise in intelligence scores that is expected to occur on a normed intelligence test from the date when the test was first normed
 Gradual increase in the general intelligence among newborns
 Frog Pond Effect: theory that individuals evaluate themselves as worse when in a group of high-performing individuals
2. Culture Bias of Testing
 Culture-Free: attempt to eliminate culture so nature can be isolated
 Impossible to develop because culture is evident in
 The greater the number of items, the higher the reliability
 Factors that contribute to inconsistency: characteristics of the individual, test, or situation, which have nothing to do with the attribute being measured, but still affect the scores
o Error Variance – variance from irrelevant random sources
Measurement Error – all of the factors associated with the process of measuring some variable, other than the variable being measured
- difference between the observed score and the true score
- Positive: can increase one’s score
- Negative: decrease one’s score
- Sources of Error Variance:
its influence since birth or an individual and the
a. Item Sampling/Content Sampling
interaction between nature and nurture is
b. Test Administration
cumulative and not relative
c. Test Scoring and Interpretation
 Culture Fair: minimize the influence of culture
with regard to various aspects of the evaluation Random Error – source of error in measuring a
procedures targeted variable caused by unpredictable fluctuations
 Fair to all, fair to some cultures, fair only to one and inconsistencies of other variables in measurement
process (e.g., noise, temperature, weather)
culture
 Culture Loading: the extent to which a test Systematic Error – source of error in a measuring a
incorporates the vocabulary concepts traditions, variable that is typically constant or proportionate to
knowledge etc. with particular culture what is presumed to be the true values of the variable
Errors: Reliability being measured
- has consistent effect on the true score
o Classical Test Theory (True Score Theory) –
- SD does not change, the mean does
score on ability tests is presumed to reflect not only
 Error variance may increase or decrease a test score
the testtaker’s true score on the ability being
measured but also the error by varying amounts, consistency of test score, and
 Error: refers to the component of the observed thus, the reliability can be affected
Test-Retest Reliability
test score that does not have to do with the
testtaker’s ability Error: Time Sampling
 Errors of measurement are random - the longer the time passes, the greater likelihood that
the reliability coefficient would be insignificant
- Carryover Effects: happened when the test-retest
interval is short, wherein the second test is influenced
by the first test because they remember or practiced
the previous test = inflated correlation/overestimation
of reliability
- Practice Effect: scores on the second session are
higher due to their experience of the first session of testing
- test-retest with a longer interval might be affected by other extrame factors, thus resulting in low correlation
- target time for the next administration: at least two weeks

Parallel Forms/Alternate Forms Reliability
Error: Item Sampling (immediate); Item Sampling changes over time (delayed)
- Counterbalancing: technique to avoid carryover effects for parallel forms, by using a different sequence for each group
- most rigorous and burdensome, since test developers create two forms of the test
- main problem: the difference between the two tests
- test scores may be affected by motivation, fatigue, or intervening events
- create a large set of questions that address the same construct, then randomly divide the questions into two sets

Internal Consistency (Inter-Item Reliability)
Error: Item Sampling, Homogeneity
Split-Half Reliability
Error: Item Sampling: Nature of Split
Inter-Scorer Reliability
Error: Scorer Differences

o Standard Error of Measurement – provides a measure of the precision of an observed test score
• standard deviation of errors as the basic measure of error
• index of the amount of inconsistency, or the amount of expected error, in an individual’s score
• allows us to quantify the extent to which a test provides accurate scores
• provides an estimate of the amount of error inherent in an observed score or measurement
• higher reliability, lower SEM
• used to estimate or infer the extent to which an observed score deviates from a true score
• also called the Standard Error of a Score
• Confidence Interval: a range or band of test scores that is likely to contain the true score
o Standard Error of the Difference – can aid a test user in determining how large a difference should be before it is considered statistically significant
o Standard Error of Estimate – refers to the standard error of the difference between the predicted and observed values
o Four Possible Hit and Miss Outcomes
1. True Positive (Sensitivity) – predicted success that does occur
2. True Negative (Specificity) – predicted failure that does occur
3. False Positive (Type 1) – predicted success that does not occur
4. False Negative (Type 2) – predicted failure, but the person succeeds

Errors due to Behavioral Assessment
1. Reactivity – when evaluated, the behavior increases
- Hawthorne Effect
2. Drift – moving away from what one has learned, going toward idiosyncratic definitions of behavior
- subjects should be retrained at some point in time
- Contrast Effect: cognitive bias that distorts our perception of something when we compare it to something else, by enhancing the differences between them
3. Expectancies – tendency for results to be influenced by what test administrators expect to find
- Rosenthal/Pygmalion Effect: the test administrator’s expected results influence the result of the test
- Golem Effect: negative expectations decrease one’s performance
4. Rating Errors – intentional or unintentional misuse of the scale
- Leniency Error (Generosity Error): the rater is lenient in scoring
- Severity Error: the rater is strict in scoring
- Central Tendency Error: the rater’s ratings tend to cluster in the middle of the rating scale
- Halo Effect: tendency to give a high score due to failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior
- snap judgement on the basis of a positive trait
- Horn Effect: opposite of the Halo Effect
- one way to overcome rating errors is to use rankings
5. Fundamental Attribution Error – tendency to explain someone’s behavior based on internal factors, such as personality or disposition, and to underestimate the influence that external factors have on another
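Several quantities in this reliability-and-error section reduce to one-line formulas: the Spearman-Brown prophecy formula (the reason “the greater the number of items, the higher the reliability”), the standard error of measurement with its confidence interval, and sensitivity/specificity from the four hit-and-miss outcomes. A minimal Python sketch; all numeric inputs (the IQ-style SD of 15, reliability of .91, and the outcome counts) are illustrative, not from the source:

```python
import math

def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability after lengthening a test n-fold:
    more items of the same quality -> higher reliability."""
    return n * r / (1 + (n - 1) * r)

def sem(sd: float, reliability: float) -> float:
    """Standard Error of Measurement: higher reliability -> lower SEM."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score: float, sd: float, reliability: float,
                        z: float = 1.96) -> tuple:
    """Band of scores likely (95% by default) to contain the true score."""
    e = z * sem(sd, reliability)
    return (score - e, score + e)

def sensitivity(tp: int, fn: int) -> float:
    """True positives / all actual successes (hit rate)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negatives / all actual failures."""
    return tn / (tn + fp)

# Doubling a test with reliability .60: 2(.60) / (1 + .60) = .75
print(spearman_brown(0.60, 2))             # ≈ 0.75
# IQ-style scale: SD = 15, reliability = .91 -> SEM = 15 * sqrt(.09) = 4.5
print(sem(15, 0.91))                       # ≈ 4.5
print(confidence_interval(110, 15, 0.91))  # ≈ (101.18, 118.82)
# Hypothetical selection outcomes: TP=40, FN=10, TN=45, FP=5
print(sensitivity(40, 10))                 # 0.8
print(specificity(45, 5))                  # 0.9
```

Note how the confidence interval makes the “higher reliability, lower SEM” point concrete: with reliability .91, an observed score of 110 only pins the true score down to roughly a 17-point band.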
- Barnum Effect: people tend to accept vague personality descriptions as accurate descriptions of themselves (Aunt Fanny Effect)
o Bias – factor inherent in a test that systematically prevents accurate, impartial measurement
• prejudice, preferential treatment
• prevented during test development through a procedure called Estimated True Score Transformation

Ethical Principles and Standards of Practice (19)
o If mistakes were made, psychologists should do something to correct or minimize them
o If an ethical violation by another psychologist was witnessed, they should resolve the issue through informal resolution, as long as it does not violate any confidentiality rights that may be involved
o If informal resolution is not enough or not appropriate, referral to state or national committees on professional ethics, state licensing boards, or the appropriate institutional authorities can be done. Still, the confidentiality rights of the professional in question must be kept.
o Failure to cooperate in an ethics investigation is itself an ethics violation, unless they request deferment of adjudication of the ethics complaint
o Psychologists must file complaints responsibly by checking the facts about the allegations
o Psychologists DO NOT deny persons employment, advancement, admissions, tenure, or promotion based solely upon their having made, or their being the subject of, an ethics complaint
• being questioned by an ethics committee or being involved in an ongoing ethics investigation is not grounds for being discriminated against or denied advancement
• unless the outcome of the proceedings is already considered
o Psychologists should provide services within the boundaries of their competence, which is based on the amount of training, education, experience, or consultation they have had
o When tasked to provide services to clients deprived of mental health services (e.g., communities far from urban cities) without yet having obtained the needed competence for the job, they may still provide services AS LONG AS they make a reasonable effort to obtain the required competence, to ensure that services are not denied to those communities
o During emergencies, psychologists provide services to individuals even though they have yet to complete the needed competency/training, to ensure that services are not denied. However, the services are discontinued once the appropriate services are available
o Psychologists should discuss the limits of confidentiality, and the uses of the information that would be generated from the services, with the persons and organizations with whom they establish scientific or professional relationships
o Before recording voices or images, they must first obtain permission from all persons involved or their legal representatives
o Only discuss confidential information with persons clearly concerned with/involved in the matters
o Disclosure is allowed with appropriate consent
• disclosure without consent is not allowed UNLESS mandated by law
o No disclosure of confidential information that could lead to the identification of a client, unless they have obtained prior consent or the disclosure cannot be avoided
• only disclose necessary information
o Exemptions to disclosure:
• the client is disguised/identity is protected
• has consent
• legally mandated
o Psychologists can make public statements as long as they are responsible for them
• they cannot compensate employees of the media in return for publicity in a news item
• paid advertisements must be clearly recognizable
• when commenting publicly via the internet, media, etc., they must ensure that their statements are based on their professional knowledge, in accord with appropriate psychological literature and practice, consistent with ethics, and do not indicate that a professional relationship has been established with the recipient
o Must provide accurate information and obtain approval prior to conducting research
o Informed consent is required, which includes:
• purpose of the research
• duration and procedures
• right to decline and withdraw
• consequences of declining or withdrawing
• potential risks, discomfort, or adverse effects
• benefits
• limits of confidentiality
• incentives for participation
• researcher’s contact information
o Permission for recording images or voices is needed unless the research consists solely of naturalistic
observations in public places, or the research design includes deception
• consent must be obtained during debriefing
o Dispensing with or omitting informed consent is allowed only when:
1. The research would not create distress or harm
• study of normal educational practices conducted in educational settings
• anonymous questionnaires, naturalistic observation, archival research
• confidentiality is protected
2. Permitted by law
o Avoid offering excessive incentives for research participation that could coerce participation
o DO NOT conduct a study that involves deception unless the use of deceptive techniques has been justified
• deception must be discussed with participants as early as possible, and no later than the conclusion of data collection
o They must give participants the opportunity to learn about the nature, results, and conclusions of the research, and make sure that there are no misconceptions about the research
o Must ensure the safety of animal subjects and minimize their discomfort, infection, illness, and pain
• where discomfort or pain is involved, procedures must be justified and kept as minimal as possible
• termination must be done rapidly, with pain minimized
o Must not present portions of another’s work or data as their own
• must take responsibility and credit, including authorship credit, only for work they have actually performed or to which they have substantially contributed
• faculty advisors discuss publication credit with students as early as possible
o After publishing, they should not withhold data from other competent professionals who intend to reanalyze it
• shared data must be used only for the declared purpose
o RA 9258 – Guidance and Counseling Act of 2004
o RA 9262 – Violence Against Women and Children
o RA 7610 – Child Abuse
o RA 9165 – Comprehensive Dangerous Drugs Act of 2002
o RA 11469 – Bayanihan to Heal as One Act
o RA 7277 – Magna Carta for Disabled Persons
o RA 11210 – Expanded Maternity Leave Law
o RA 11650 – Inclusive Education Law
o RA 10173 – Data Privacy Act
o House Bill 4982 – SOGIE Bill
o Art. 12 of Revised Penal Code – Insanity Plea
end

congratulations on reaching the end of this reviewer!!

i hope u learned something!! :D

one day, we will be remembered.

- aly <3