
NOT FOR SALE!

#RPMTWTCOMMUNITY @TYPICALLYDONG | DOROTHEA DAX

NOTES FROM COHEN


Summary

CHAPTER 4: OF TESTS AND TESTING


Assumption 1: Psychological Traits and States Exist
Traits are any distinguishable, relatively enduring way in which one individual varies from another.
States also distinguish one person from another but are relatively less enduring.
Androgynous – refers to an absence of primacy of male or female characteristics.
PSYCHOLOGICAL TRAITS EXIST ONLY AS A CONSTRUCT, and we can only infer their existence from overt behavior. Overt behaviors are observable behaviors.
Traits are not expected to be manifested 100% of the time. Yet the situations a person is in can help express these traits and make them more observable.
The context in which the behavior occurred should be considered to make better inferences from it. Kneeling and praying out loud may be interpreted wrongly if seen in a cinema or at a sports fest, but not in a prayer room or church.

Traits and States are defined to draw a line where people differ uniquely from one
another.

Assumption 2: Psychological Traits and States Can Be Quantified and Measured
It is the test developers who give meaning to a certain construct being measured. The results of a test should be interpreted according to the way the construct is defined by the test developer.
A well-defined construct deserves well-constructed items that give insight about it. A construct being measured in a test should have items that best measure and represent it.
The test score is presumed to represent the strength of the targeted ability, trait, or state and is frequently based on cumulative scoring.
The more the test taker responds in a particular direction as keyed by the test manual as correct or consistent with a particular trait, the higher that test taker is presumed to be on the targeted ability or trait.
Justifiably, proper scoring and interpretation should also be set.


Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior


When a test asks a test taker to shade a grid with a pencil or press a key on a keyboard to answer, the test itself has little to do with those motions themselves. The objective is to gain insight into other aspects of the examinee's behavior.
The obtained sample of behavior may be used to make predictions about future behavior.
Forensically, the aim may instead be to "postdict" behavior – to aid in the understanding of behavior that has already taken place.

Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
Competent test users know how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted.
Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources.

Assumption 5: Various Sources of Error Are Part of the Assessment Process


Error refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.
Error variance is the component of a test score attributable to sources other than the trait or ability measured.
Both the assessee and the assessor, as well as the measuring instruments, are sources of error.

Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
The most controversial assumption among all seven assessment assumptions.
In all questions about tests with regard to fairness, it is important to keep in mind that tests are tools. And just like other, more familiar tools (hammers, ice picks, wrenches, and so on), they can be used properly or improperly.


Assumption 7: Testing and Assessment Benefit Society


What is a Good Test?
Clear instructions for administration, scoring, and interpretation.
Utility (cost and time efficient)
Validity (measures what it purports to measure)
Psychometrically sound (valid and reliable)

Reliability
consistency of the measuring tool – the precision with which the test measures and the extent to which error is present in measurements. Unreliable measurement is to be avoided.
Psychological tests, like other tests and instruments, are reliable to varying degrees.
Reliability is a necessary but not sufficient element of a good test. In addition to being reliable, tests must be reasonably accurate. In the language of psychometrics, tests must be valid.

Validity

measures what it purports to measure. A test of reaction time is a valid test if it accurately measures reaction time.
Do the items adequately sample the range of areas that must be sampled to adequately measure the construct?

Norms
norm-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers.
Goal: to yield information on a test taker's standing or ranking relative to some comparison group of test takers.
norms – the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores.
normative sample – that group of people whose performance on a particular test is analyzed for reference in evaluating the test performance of individual test takers.
Norming – the process of deriving norms; the term may be modified to describe a particular type of norm derivation.

Sampling to Develop Norms

test standardization – the process of administering a test to a representative sample of test takers for the purpose of establishing norms.
A standardized test should have clearly specified procedures for administration and scoring, typically including normative data.

There is a need to draw a sample from the population because it is impractical, expensive, and not even necessary to test the whole or a wide range of the population.
Sample – a portion of the universe of people deemed to be representative of the whole population.
Sampling – the process of selecting the portion of the universe deemed to be representative of the whole population.

In order to best assist future users of the test, test developers are encouraged
to:
Provide information to support recommended interpretations of the results,
including the nature of the content, norms or comparison groups, and other
technical evidence.


CHAPTER 5: RELIABILITY
Reliability is synonymous with dependability or consistency.
Variance – the SD².
The greater the proportion of the total variance attributed to true variance, the more reliable the test.
Random error – a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.

Measurement error – refers to, collectively, all the factors associated with the process of measuring some variable, other than the variable being measured.
Systematic error – refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
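A minimal sketch of these ideas under the true score model (X = T + E): reliability is the proportion of total score variance that is true variance. All numbers here are hypothetical.

```python
import numpy as np

# Simulate hypothetical true scores and independent random error (X = T + E).
rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, size=500)   # hypothetical true trait levels
error = rng.normal(0, 5, size=500)            # random measurement error
observed = true_scores + error                # observed scores

# Reliability as the ratio of true variance to total (observed) variance.
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))                  # ~ .90 with these hypothetical SDs
```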

Sources of error variance:


Test Construction – from test items and/or content sampling.
Test Administration – room environment, test takers’ variables, examiner-related
variables
Test scoring and interpretation – scorers and scoring system.


Other sources of error:


Non-systematic error – forgetting, failing to notice abusive behavior, and
misunderstanding instructions regarding reporting.
Systematic error – overreporting and underreporting.

Note: the amount of test variance that is true relative to error may never be
known.

Tests of Reliability Estimates


1. Test-Retest Reliability Estimates
Correlating: 2 score sets from 1 population, 2 different administrations, 1
form of test.
measures something that is relatively stable over time, such as a personality trait.
As the time interval between administrations of the same test increases, the correlation between the scores obtained on each testing decreases (inverse relationship).
Time sampling can be a source of error variance.
When the interval between testings is greater than 6 months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
An evaluation of a test-retest reliability coefficient must extend beyond the magnitude of the obtained coefficient.
Most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (including discriminations of brightness, loudness, or taste).
Not so good to apply to children, as they are most likely growing in spurts – the change in their performance might be attributed to error when in fact it may be genuine progress in their skills. If used anyway, test developers should design the administrations to be as short-spanned as possible, as little as 4 days apart.

Obtaining an Estimate of Test-Retest Reliability:

Two test administrations with the same group are required.
Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy.
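A minimal sketch of this estimate: administer the same form twice to the same group and correlate the two sets of scores. The scores below are hypothetical.

```python
import numpy as np

time1 = np.array([12, 15, 9, 20, 17, 11, 14])   # hypothetical first administration
time2 = np.array([13, 14, 10, 19, 18, 10, 15])  # same people, second administration

# The Pearson r between the two administrations is the test-retest estimate
# (the coefficient of stability when the interval exceeds six months).
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))
```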


2. Parallel Forms/Alternate Forms Reliability Estimates


The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, often termed the coefficient of equivalence.
Parallel forms: the means and variances of observed test scores from the two versions of the test are equal.
Alternate forms: typically designed to be equivalent with respect to variables such as content and level of difficulty.

Alternate-forms reliability: an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.

3. Test of Internal Consistency Reliability Estimates

3.1 Split-Half Reliability Estimates

Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.
Steps to conduct split-half reliability:
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman–Brown formula.
Recommended ways of splitting the test in half:
Odd-even reliability
Dividing the test by content
Assigning random items to both halves
Not recommended:
Divide the test in the middle.


Spearman–Brown estimates are based on a test that is twice as long as the original half test. The new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
The Spearman–Brown formula can also be used when a test is shortened in the interest of administration time.
The formula could also be used to determine the number of items needed to attain a desired level of reliability.
Internal consistency estimates of reliability, such as that obtained by use of the Spearman–Brown formula, are inappropriate for measuring the reliability of heterogeneous tests (tests should be homogeneous) and speed tests.
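A minimal sketch of the Spearman–Brown formula, rSB = n·r / (1 + (n − 1)·r), where n is the ratio of the new test length to the old. For a split-half estimate, the half-test correlation is adjusted back to full length with n = 2. The input values are hypothetical.

```python
def spearman_brown(r, n):
    """Predicted reliability of a test n times as long as the original."""
    return (n * r) / (1 + (n - 1) * r)

r_half = 0.70                                # hypothetical half-test Pearson r
print(round(spearman_brown(r_half, 2), 2))   # full-length estimate ~ .82
print(round(spearman_brown(0.90, 0.5), 2))   # a test shortened by half ~ .82
```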

3.2 Inter-item Consistency

refers to the degree of correlation among all the items on a scale.
1 administration, 1 form of test.
An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test.
Tests are said to be homogeneous if they contain items that measure a single trait.

** Tests that measure different domains, such as the NEO-PI, are less homogeneous. Hence, to get the desired reliability, the test developer should conduct a separate reliability analysis for each domain of a single test.
The more homogeneous a test is, the higher inter-item consistency it can be
expected to have.
(Positive relationship)
Homogeneity = oneness


3.3 The Kuder-Richardson Formulas

KR-20

KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong.
Applies to tests that are highly homogeneous, dichotomous, and with varying degrees of difficulty among items.

If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method. Hence, split-half is a better choice.

KR-21
The KR-21 formula may be used if there is reason to assume that all the test
items have approximately the same degree of difficulty.
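A minimal sketch of KR-20 for dichotomous (0/1) items: rKR20 = (k/(k−1))·(1 − Σpq/σ²), where p is the proportion passing each item and σ² is the variance of total scores. The response matrix is hypothetical (rows = test takers, columns = items).

```python
import numpy as np

X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 1]])

k = X.shape[1]                    # number of items
p = X.mean(axis=0)                # proportion answering each item correctly
q = 1 - p                         # proportion answering incorrectly
total_var = X.sum(axis=1).var()   # variance of the total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))
```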

3.4 Cronbach's Alpha

Developed in 1951.
The mean of all possible split-half correlations, corrected by the Spearman–Brown formula; the formula yields an estimate of the mean of all possible test-retest, split-half coefficients.
Appropriate for use on tests containing non-dichotomous items.
The preferred statistic for obtaining an estimate of internal consistency reliability.
Requires only one administration of the test.
Ranges in value from 0 to 1.
Helps answer questions about how similar sets of data are.
A value of alpha above .90 may be "too high" and indicate redundancy in the items.
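A minimal sketch of coefficient alpha for non-dichotomous items: alpha = (k/(k−1))·(1 − Σ item variances / total score variance). The rating data are hypothetical (rows = test takers, columns = items).

```python
import numpy as np

X = np.array([[4, 5, 4],
              [2, 3, 3],
              [5, 5, 4],
              [3, 2, 3],
              [4, 4, 5]])

k = X.shape[1]
item_vars = X.var(axis=0).sum()   # sum of the individual item variances
total_var = X.sum(axis=1).var()   # variance of the summed total scores

alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(round(alpha, 2))            # ~ .86 here; above .90 may signal redundancy
```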

4. Average Proportional Distance Method

a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
>0.25 suggestive of problems with the internal consistency of the test.
0.25 to 0.20 acceptable range.
<0.20 excellent internal consistency.


Items measuring a single construct, such as extraversion, should ideally be correlated with one another in the .6 to .7 range.
Advantage: the APD index is not connected to the number of items on a measure.
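A hedged sketch of the APD idea: average the absolute differences between scores on all pairs of items, then divide by the number of response options minus 1 so the index is proportional to the scale. Exact published procedures may differ in detail; the data and the 7-point scale are hypothetical.

```python
import numpy as np
from itertools import combinations

X = np.array([[4, 5, 4],
              [2, 3, 3],
              [5, 5, 4]])          # rows = test takers, columns = items
n_options = 7                      # hypothetical 7-point rating scale

# Mean absolute difference for every pair of items, averaged over pairs.
pair_diffs = [np.abs(X[:, i] - X[:, j]).mean()
              for i, j in combinations(range(X.shape[1]), 2)]
apd = np.mean(pair_diffs) / (n_options - 1)
print(round(apd, 2))               # below ~.20 would suggest excellent consistency
```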

5. Inter-scorer (inter-rater) reliability


the degree of agreement or consistency between two or more scorers (or judges or raters) regarding a particular measure.
often used when coding nonverbal behavior.
may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy.
The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation.


POWER TEST VS. SPEED TEST


Power Test
When the time limit is long enough to allow test takers to attempt all items, and some items are so difficult that no test taker is able to obtain a perfect score.

Speed Test
Contains items of uniform level of difficulty so that, when given generous time limits, all test takers should be able to complete all the test items correctly.
Score differences on a speed test are therefore based on performance speed, because items attempted tend to be correct.
Split-half and KR-20 in this case would yield a high coefficient but would most likely tell us nothing about response consistency.

Criterion-referenced Test

designed to provide an indication of where a test taker stands with respect to some variable or criterion.
tends to contain material that has been mastered in hierarchical fashion.

The True Score Model of Measurement and Alternatives to It


1. Classical Test Theory

CTT – referred to as the true score (or classical) model of measurement.
The most widely used and accepted model in the psychometric literature today.
Everyone has a "true score" on a test: a value that genuinely reflects an individual's ability (or trait) level as measured by a particular test.
Its assumptions allow for its application in most situations. CTT assumptions are characterized as "weak" – this is precisely because its assumptions are so readily met, compared to IRT, whose assumptions are more difficult to meet.
Compatibility and ease of use with widely used statistical techniques. Factor analytic techniques, whether exploratory or confirmatory, are all "based on the CTT measurement foundation."
Disadvantage: all items are presumed to be contributing equally to the score total.


Factors to consider: length of tests that are developed using a CTT model
Note: CTT favors the development of longer rather than shorter tests

1.1 Domain Sampling Theory


seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.
the items in the domain are thought to have the same means and variances as those in the test that samples from the domain.
measures of internal consistency are perhaps the most compatible with domain sampling theory.
1.2 Generalizability Theory


Lee J. Cronbach (1970).
A person's test scores vary from testing to testing because of variables in the testing situation.
Cronbach: "describe the details of the particular test situation or universe leading to a specific test score."
A test's reliability does not reside within the test itself; it is very much a function of the circumstances under which the test is developed, administered, and interpreted.

2. Item-Response Theory/ Latent-trait Theory

encompasses item difficulty and item discriminability.
assumptions are made about the frequency distribution of test scores.
latent traits theoretically can take on values from −∞ to +∞ (negative infinity to positive infinity).

The Standard Error of Measurement


provides a measure of the precision of an observed test score.
provides an estimate of the amount of error inherent in an observed score or
measurement.
the higher the reliability of a test, the lower the SEM (inverse relationship).
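A minimal sketch of the usual SEM formula, SEM = SD·√(1 − r), using hypothetical values.

```python
import math

sd = 15             # standard deviation of the test's scores (hypothetical)
reliability = 0.91  # hypothetical reliability coefficient

sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))   # 4.5; an observed score +/- 1 SEM gives a ~68% band
```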


CHAPTER 6: VALIDITY
Validity
the meaningfulness of a test score — what the test score truly means.
a judgment or estimate of how well a test measures what it purports to measure in a particular context.
a judgment based on evidence about the appropriateness of inferences drawn from test scores. An inference is a logical result or deduction.
Characterized as "acceptable" or "weak."
"Reasonable boundaries": no test or measurement technique is "universally valid" for all time, for all uses, with all types of test taker populations.
Local validation studies are necessary when the test user plans to alter in some way the format, instructions, language, or content of the test.

Important notes when dealing with validity:


1. It cannot be higher than reliability. Reliability sets the ceiling for it.
2. A valid test is ALWAYS reliable.

Construct validity is the "umbrella validity," as all types of validity fall under it.
Ecological validity refers to a judgment regarding how well a test measures what it purports to measure at the time and place that the variable being measured is emitted.
The greater the ecological validity of a test or other measurement procedure, the greater the generalizability of the measurement results to real-life circumstances.


TYPES OF VALIDITY IN DETAIL…


Face Validity
relates more to what a test appears to measure to the person being tested than to what the test actually measures.
Rorschach: low face validity.
Judgments about face validity are frequently thought of from the perspective of the test taker, not the test user.
Lack of face validity = decreased confidence in the perceived effectiveness of the test and decreased test taker cooperation or motivation to do his or her best.
More a matter of public relations than psychometric soundness.

Content Validity
a judgment of how adequately a test samples behavior representative of the
universe of behavior that the test was designed to sample.

Criterion Validity
a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest – the criterion.
Criterion: the standard against which a test or a test score is evaluated.
Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
Predictive validity: an index of the degree to which a test score predicts some criterion measure.
Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.
An expectancy table can provide an indication of the likelihood that a test taker will score within some interval of scores on a criterion measure — an interval that may be categorized as "passing," "acceptable," or "failing."

Characteristics of a criterion:
1. Relevant - pertinent or applicable to the matter at hand.
2. Valid
3. Uncontaminated
Criterion Contamination - the term applied to a criterion measure that has been
based, at least in part, on predictor measures.
If a certain variable has been used both as a predictor and as a criterion, then
a criterion contamination has occurred.
Contamination can neither be gauged nor corrected.


Concurrent Validity
indicates the extent to which test scores may be used to estimate an individual’s
present standing on a criterion.
Validating criterion (criterion validator) – that against which a single test is being compared.

Predictive Validity
- how accurately scores on the test predict some criterion measure.
Base rate – the extent to which a particular trait, behavior, characteristic, or attribute exists in the population.
Hit rate – the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
Miss rate – the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute (inaccurate prediction).

Types of miss rate:
False positive – a miss wherein the test predicted that the test taker did possess the characteristic or attribute being measured when in fact the test taker did not (Type 1 error).
False negative – a miss wherein the test predicted that the test taker did not possess the characteristic or attribute being measured when the test taker actually did (Type 2 error).


Validity Coefficient
a correlation coefficient that provides a measure of the relationship between
test scores and scores on the criterion measure.

Incremental Validity
the degree to which an additional predictor explains something about the
criterion measure that is not explained by predictors already in use.

Construct Validity
a judgment about the appropriateness of inferences drawn from test scores
regarding individual standings on a variable.
A construct is an informed, scientific idea developed or hypothesized to
describe or explain behavior.
Constructs are unobservable, presupposed (underlying) traits that a test
developer may invoke to describe test behavior or criterion performance.

Steps:
1. Formulate hypotheses about the construct being defined.
2. Set high and low scorers.
3. When tested, both scorers should behave as the way they are expected to.
4. If not, test developer should need to reexamine the construct/ the hypotheses
made.

Reasons for not meeting the hypotheses:


The test does not measure the construct.
The hypotheses are wrongly assumed.
Erroneous statistical methods and procedures were used.


Restructuring Test Homogeneity:


1. Correlate average subtest scores with the average total test score using the
Pearson r (but not all the time).
2. Correlation coefficient should be more than .50.
3. Eliminate items that do not show significant correlation coefficients with total test
scores.
4. Ensure that the test can discriminate the high scorers and low scorers.
5. Eliminate items that do not show significant Spearman rank-order correlation
coefficients on a multipoint scale test (use Coefficient alpha in this case.)

Discriminant Validity
A validity coefficient showing little (a statistically insignificant) relationship
between test scores and/or other variables with which scores on the test being
construct-validated should not theoretically be correlated.
2 similar constructs that correlated → good convergent validity
2 similar constructs that did not correlate → poor convergent validity
2 different constructs that did not correlate → good discriminant validity
2 different constructs that correlated → poor discriminant validity

Multitrait-multimethod matrix
(Campbell & Fiske, 1959) the matrix or table that results from correlating variables (traits) within and between methods.
Example: values for any number of traits (such as aggressiveness or extraversion) as obtained by various methods (such as behavioral observation or a personality test) are inserted into the table, and the resulting matrix of correlations provides insight with respect to both the convergent and the discriminant validity of the methods used.


Factor Analysis
a shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
employed as a data reduction method in which several sets of scores and the correlations between them are analyzed.
used to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests.
Procedures are very technical and require the aid of sophisticated software.

2 Types:
1. Exploratory factor analysis (EFA)
estimating, or extracting factors; deciding how many factors to retain; and
rotating factors to an interpretable orientation.

2. Confirmatory factor analysis (CFA)


researchers test the degree to which a hypothetical model (which includes
factors) fits the actual data.
Factor Loading: “Each test is thought of as a vehicle carrying a certain amount of
one or more abilities.”
conveys information about the extent to which the factor determines
the test score or scores.

Validity, Bias, and Fairness…


Test Bias
bias is a factor inherent in a test that systematically prevents accurate, impartial
measurement.
Bias implies systematic variation.
Constructive and unconstructive worry: unconstructive worry is positively correlated with trait anxiety but negatively correlated with punctuality and other test-taking mitigating behaviors; the pattern is the inverse for constructive worry.
The test is deemed biased if some portion of its variance stems from some
factor(s) that are irrelevant to performance on the criterion measure.
Prevention during test development is the best cure for test bias.


Rating Error
A rating is a numerical or verbal judgment (or both) that places a person or an
attribute along a continuum identified by a scale of numerical or word
descriptors known as a rating scale.
Rating error – a judgment resulting from the intentional or unintentional misuse of a rating scale.

Types of Error:
1. Leniency/Generosity Error
the tendency on the part of the rater to be lenient in scoring, marking, and/or grading (scores obtained are too high).
to avoid this, include a list of specific competencies to be evaluated, as well as when and how such evaluations of competency should be conducted.
2. Severity/Strictness Error
an error attributed to a rater whose scores place subjects too low.
3. Central Tendency Error
the rater exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme.
To combat restriction-of-range rating errors (all three above), it is good to use rankings, a procedure that requires the rater to measure individuals against one another instead of against an absolute scale. By using rankings instead of ratings, the rater (now the "ranker") is forced to select first, second, third choices, and so forth.
4. Halo Effect
a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater's failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior.

Test Fairness
the extent to which a test is used in an impartial, just, and equitable way.


CHAPTER 7: UTILITY
Test Utility
refers to the practical value of using a test to aid in decision making.
Judgments concerning the utility of a test are made based on test reliability and validity data as well as on other data.
Note: A valid test is not automatically a useful test.

3 things to consider when dealing with test utility:


1. Psychometric soundness (reliability & validity)
2. Cost (economic, financial, or budget-related)
disadvantages, losses, or expenses in both economic and noneconomic
terms.
3. Benefits
Judgments whether the benefits of testing justify the costs of
administering, scoring, and interpreting the test.
refers to profits, gains, or advantages.
Utility Analysis: defined as a family of techniques that entail a cost–benefit analysis
designed to yield information relevant to a decision about the usefulness and/or practical
value of a tool of assessment.
Endpoint: an educated decision about which of many possible courses of action is optimal.

Taylor-Russell Tables
provides an estimate of the extent to which inclusion of a particular test in the selection system will improve selection.
the tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs.
Three variables: validity, the selection ratio used, and the base rate.
used for determining the increase over current procedures.
Selection ratio – number of applicants to hire over total number of applicants.
Base rate – percentage of people hired under the existing system for a particular
position.
If you are to evaluate Taylor-Russell tables, consider these:
Quadrants 2 and 4 You made the right decision of hiring (Q2) and not hiring (Q4) the
excellent (Q2) and not excellent (Q4) applicants.
Quadrant 1 Hiring low-scoring applicants but high-performing employees
- False negative; Type 2 error
Quadrant 3 Hiring high-scoring applicants but low-performing employees
- False positive; Type 1 error
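A minimal sketch of the two quantities (besides the validity coefficient) needed to enter the Taylor-Russell tables. All counts are hypothetical.

```python
n_applicants = 200
n_openings = 20
selection_ratio = n_openings / n_applicants   # .10 of applicants will be hired

n_current_hires = 50
n_successful = 30
base_rate = n_successful / n_current_hires    # .60 judged successful now

# With these two values and the test's validity coefficient, the tables give
# the expected proportion of successful hires once the test is added.
print(selection_ratio, base_rate)
```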


Naylor-Shine Tables
obtaining the difference between the means of the selected and unselected
groups to derive an index of what the test (or some other tool of assessment)
is adding to already established procedures.
determining the increase in average score on some criterion measure.
Both should be obtained using a concurrent validation procedure.


Decision Theory
provides guidelines for setting optimal cutoff scores.
The higher the selection ratio, the higher the false positives; the lower the
selection ratio, the higher the false negatives.
Practical Considerations:
Note: issues related to existing base rates can affect the accuracy of decisions made based
on tests.
1. The pool of job applicants
2. The complexity of the job
3. The cut score in use
A. Relative cut-off scores: set based on norm-related considerations rather than
on the relationship of test scores to a criterion, also “norm-referenced cut
score.”
B. Fixed cut-off score: set with reference to a judgment concerning a minimum
level of proficiency required to be included in a particular classification, “also
absolute cut score.”
Multiple cut scores - refers to the use of two or more cut scores with reference
to one predictor for the purpose of categorizing testtakers.
Multiple hurdles - may be thought of as one collective element of a multistage
decision-making process in which the achievement of a particular cut score on
one test is necessary in order to advance to the next stage of evaluation in the
selection process.
it assumes that an individual must possess a certain minimum amount of
knowledge, skill, or ability for each attribute measured by a predictor to
be successful in the desired position.
Compensatory Model of Selection - high scores on one attribute can, in fact,
“balance out” or compensate for low scores on another attribute.
Uses multiple regression.
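A hedged sketch of the compensatory idea: multiple regression assigns each predictor a weight, so a high score on one attribute can offset a low score on another in the composite. The predictors, ratings, and applicant scores are all hypothetical.

```python
import numpy as np

# Rows = past hires; columns = cognitive test score, conscientiousness score.
X = np.array([[85, 60], [70, 80], [90, 55], [60, 90], [75, 75]], dtype=float)
y = np.array([4.0, 4.2, 3.8, 4.1, 4.0])     # hypothetical performance ratings

A = np.column_stack([np.ones(len(X)), X])   # add an intercept column
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # regression weights

applicant = np.array([1.0, 65, 88])         # low cognitive, high conscientiousness
print(round(applicant @ b, 2))              # weighted composite prediction
```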

Methods for Setting Cut Scores


1. The Angoff Method
applied to personnel selection tasks as well as to questions regarding the
presence or absence of a particular trait, attribute, or ability.
the judgments of the experts are averaged to yield cut scores for the test.
This should have high inter-rater reliability or agreement among
observers/expert panels.

2. The Known-Group Models


“method of contrasting groups”
entails collection of data on the predictor of interest from groups known to
possess, and not to possess, a trait, attribute, or ability of interest.
a cut score is set on the test that best discriminates the two groups’ test
performance.


3. IRT-based Methods
cut scores are typically set based on test takers' performance across all the items on the test; some portion of the total number of items must be scored "correct" in order for the test taker to "pass" the test.
An item of a minimum level of difficulty that must be answered correctly, as determined by experts, serves as the cut score.
Licensing Exam: item-mapping method
 entails the arrangement of items in a histogram, with each
column in the histogram containing items deemed to be of
equivalent value.
Academic application: bookmark method
 The expert then places a “bookmark” between the two pages
(or, the two items) that are deemed to separate test takers who
have acquired the minimal knowledge, skills, and/or abilities from
those who have not.

4. The Method of Predictive Yield


a technique which takes into account the number of positions to be filled,
projections regarding the likelihood of offer acceptance, and the distribution of
applicant scores.

5. Discriminant (Function) Analysis


used to shed light on the relationship between identified variables (such as
scores on a battery of tests) and two (and in some cases more) naturally
occurring groups (such as persons judged to be successful at a job and persons
judged unsuccessful at a job).

CHAPTER 8: TEST DEVELOPMENT


All tests are not created equal.
Test development is an umbrella term for all that goes into the process of
creating a test.


The Test Development Five Stages: (CCTAR)


1. Test Conceptualization
- building the test idea
Norm-referenced – scores are compared to the scores of the whole group of test takers.
Criterion-referenced – scores are compared to a certain criterion; requires mastery of knowledge or skills.

Pilot Work / Pilot Study / Pilot Research – refers to the preliminary research
surrounding the creation of a prototype of the test.

2. Test Construction
- entails writing test items (or re-writing or revising existing items), as well as
formatting items, setting scoring rules, and otherwise designing and building a test.

Scaling
the process of setting rules for assigning numbers in measurement.

Thurstone: credited for being at the forefront of efforts to develop methodologically sound scaling methods.
"A Law of Comparative Judgment" – his proudest achievement; one of the primary architects of modern factor analysis.

Types of Scales:
Age-based
Grade-based
Stanine
Unidimensional vs. Multidimensional
Comparative vs. Categorical (& many more)
Continuation of test construction:

Scaling Methods:
MDBS-R: Morally Debatable Behavior Scale-Revised
A rating scale - “a practical means of assessing what people believe, the
strength of their convictions, as well as individual differences in moral
tolerance.”
contains 30 items.
It also uses a summative scale: the final test score is obtained by summing
the ratings across all the items.


Rating Scale
a grouping of words, statements, or symbols on which judgments of the
strength of a particular trait, attitude, or emotion are indicated by the test
taker.
can be used to record judgments of oneself, others, experiences, or objects,
and they can take several forms.
Likert Scale
used extensively in psychology, usually to scale attitudes.
Likert scales are usually reliable, which may account for their widespread
popularity.
The use of rating scales of any type results in ordinal-level data.

Method of Paired Comparisons


Test takers are presented with pairs of stimuli (two photographs, two objects,
two statements), which they are asked to compare.
Test takers receive a higher score for selecting the option deemed more
justifiable by most of a group of judges.
An advantage of the method of paired comparisons is that it forces test takers
to choose between items.

Methods of Sorting:
1. Comparative Scaling
entails judgments of a stimulus in comparison with every other
stimulus on the scale.
Test takers would be asked to sort the cards from most justifiable to
least justifiable.

2. Categorical Scaling
Stimuli are placed into one of two or more alternative categories that
differ quantitatively with respect to some continuum.

Guttman Scale
Items on it range sequentially from weaker to stronger expressions of the attitude,
belief, or feeling being measured.
Respondents who agree with ‘A’ should also agree with ‘B’ and the rest of the
options if option A has the strongest statement.
the test is analyzed using scalogram analysis: an item-analysis procedure and
approach to test development that involves a graphic mapping of a test taker’s
responses.
The method of equal-appearing intervals is an example of a scaling method of the direct
estimation variety. In contrast to other methods that involve indirect estimation, there is
no need to transform the test taker’s responses into some other scale.


In writing test items, these should be considered:


Range of content
Item formats
Number of items
(Alternate forms necessary? How many?)
 Double the number of items desired for final version from the first draft.
 Item pool is the reservoir or well from which items will or will not be drawn
for the final version of the test.
 Item bank is a relatively large and easily accessible collection of test
questions.
Notes:
 Computerized adaptive testing has been found to reduce the number of test items
that need to be administered by as much as 50% while simultaneously reducing
measurement error by 50%.
 CAT tends to reduce floor effects and ceiling effects.
EPPS – Edwards Personal Preference Schedule – designed to measure the relative strength of different psychological needs.
210 pairs of statements.
uses a forced-choice format: the test taker must choose one statement from each pair.
it is possible to draw only intra-individual conclusions about the test taker.
it would not be appropriate to draw inter-individual comparisons on the basis of an ipsatively scored test.


3. Test Tryout
test administration to a representative sample of test takers under conditions that simulate the conditions under which the final version of the test will be administered.
There should be no fewer than 5 subjects and preferably as many as 10 for each
item on the test.
the more subjects employed, the weaker the role of chance in subsequent data
analysis.
phantom factors—factors that actually are just artifacts of the small sample
size.

4. Item Analysis
analysis of test takers’ performance on the test using statistical procedures to
assist in making judgments about which items are good as they are, which items
need to be revised, and which items should be discarded.

A. The Item Difficulty Index


obtained by calculating the proportion of the total number of test takers
who answered the item correctly.
the larger the item-difficulty index, the easier the item.
may be called “item-endorsement index” in other contexts, such as
personality testing.
For maximum discrimination among the abilities of the test takers, the
optimal average item difficulty is approximately .5, with individual items
on the test ranging in difficulty from about .3 to .8.
In a true–false item, the optimal item difficulty is halfway between .50
and 1.00, or .75.
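A minimal sketch of the item-difficulty index: the proportion of test takers who answered the item correctly. The responses are hypothetical.

```python
import numpy as np

item_responses = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # 1 = correct, 0 = incorrect
p = item_responses.mean()
print(p)   # .875 -> a fairly easy item; ~.5 gives maximum discrimination

# For a true-false item, chance success is .50, so the optimal difficulty
# is halfway between .50 and 1.00: (.50 + 1.00) / 2 = .75
```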
B. The Item-Reliability Index
provides an indication of the internal consistency of a test.
the higher this index, the greater the test’s internal consistency.

C. Factor Analysis & Inter-item Consistency


factor analysis: useful in determining whether items on a test appear to
be measuring the same thing(s).


5. Test Revision
refers to action taken to modify a test’s content or format for the purpose of
improving the test’s effectiveness as a tool of measurement.
revised version of the test will then be tried out on a new sample of test takers.

CHAPTER 9: INTELLIGENCE AND ITS MEASUREMENT


“Intelligence is what the tests test.” – Edwin G. Boring (1923)

Galton (1883) believed that the most intelligent persons were those equipped with the
best sensory abilities.
Tests of visual acuity or hearing ability are, in a sense, tests of intelligence.
the first person to publish on the heritability of intelligence, thus anticipating
later nature-nurture debates.
Galton had viewed intelligence as several distinct processes or abilities that
could be assessed only by separate tests.

Binet argued that when one solves a particular problem, the abilities used cannot be
separated because they interact to produce the solution.

Wechsler: Intelligence, operationally defined, is the aggregate or global capacity of the


individual to act purposefully, to think rationally and to deal effectively with his
environment.
It is aggregate or global because it is composed of elements or abilities which,
though not entirely independent, are qualitatively differentiable.
the best way to measure this global ability was by measuring aspects of several
“qualitatively differentiable” abilities: verbal & performance based.

Piaget: intelligence may be conceived of as a kind of evolving biological adaptation to the


outside world.
As cognitive skills are gained, adaptation (at a symbolic level) increases, and
mental trial and error replaces physical trial and error.
the process of cognitive development occurs neither solely through maturation
nor solely through learning.
as a consequence of interaction with the environment, psychological structures
become reorganized.

L. Thurstone: conceived of intelligence as composed of what he termed primary mental


abilities (PMAs).
PMA: verbal meaning, perceptual speed, reasoning, number facility, rote
memory, word fluency, and spatial relations.


Gardner (1983, 1994): developed a theory of multiple (seven, actually) intelligences:


logical-mathematical, bodily-kinesthetic, linguistic, musical, spatial, interpersonal, and
intrapersonal.

Perspective on Intelligence
A major thread running through the theories of Binet, Wechsler, and Piaget is a focus on
interactionism.

A. Interactionism
refers to the complex concept by which heredity and environment are
presumed to interact and influence the development of one’s intelligence.
B. Factor-analytic theories
squarely on identifying the ability or groups of abilities deemed to constitute
intelligence.
Factor analysis is a group of statistical techniques designed to determine the
existence of underlying relationships between sets of variables, including test
scores.

Spearman (Correlation)
Two-Factor Theory of Intelligence: general mental abilities & specific mental
abilities or errors of general mental abilities.
The greater the magnitude of g in a test of intelligence, the better the test was
thought to predict overall intelligence.

Raymond B. Cattell: TWO MAJOR TYPES OF COGNITIVE ABILITIES


A. Crystallized
acquired skills and knowledge that are dependent on exposure to a
particular culture as well as on formal and informal education.
Retrieval of information and application of general knowledge.

B. Fluid
are nonverbal, relatively culture-free, and independent of specific
instruction (such as memory for digits).

Carroll: THREE-STRATUM THEORY OF COGNITIVE ABILITIES


The three strata: (Top to bottom)
1. General intelligence
2. eight abilities and processes: fluid intelligence (Gf), crystallized intelligence
(Gc), general memory and learning (Y), broad visual perception (V), broad
auditory perception (U), broad retrieval capacity (R), broad cognitive
speediness (S), and processing/decision speed (T).
3. “Level factors” and “speed factors.”
The three-stratum theory is a hierarchical model, meaning that all of the abilities listed in
a stratum are subsumed by or incorporated in the strata above.


E.L. Thorndike: MULTIFACTOR THEORY OF INTELLIGENCE


1. Social Intelligence - dealing with people.
2. Concrete Intelligence - dealing with objects.
3. Abstract Intelligence - dealing with verbal and mathematical symbols.

For Thorndike, one’s ability to learn is determined by the number and speed
of the bonds that can be marshaled.

Information-processing theories
Aleksandr Luria
identifying the specific mental processes that constitute intelligence.
focuses on the mechanisms by which information is processed—how information
is processed, rather than what is processed.
Two types: simultaneous and successive.
A. Simultaneous (parallel)
Information is integrated all at one time.
simultaneous processing may be described as “synthesized.”
tasks that involve the simultaneous mental representations of images.
Map reading
B. Successive
each bit of information is individually processed in sequence.
sequential processing is logical and analytic in nature.
piece by piece and one piece after the other, information is arranged and
rearranged so that it makes sense.

PASS model of intellectual functioning


planning, attention, simultaneous, and successive.
planning refers to strategy development for problem solving.
attention (also referred to as arousal) refers to receptivity to information.
and simultaneous and successive refer to the type of information processing
employed.

Chapter 10 Mini Notes


WRAT – Wide Range Achievement Test-4
Reading, spelling, arithmetic (math), and reading comprehension.

STEP – Sequential Tests of Educational Progress

Comprehensive; kindergarten through grade 12.
Can include behavior inventory, educational environment, and activity
inventories.
Can identify “gifted children.”


SRA California Achievement Test


Kindergarten through grade 12.

WIAT-III – Wechsler Individual Achievement Test-III


Age 4-50
School, research, and clinical setting.
16 subtests
potential to yield actionable data relating to student achievement in academic
areas such as reading, writing, and mathematics, as well as skill in listening and
speaking.

That's all for now, friends! This is only a supplement to our learnings. I hope this helps you reinforce your knowledge on Psychological Assessment!
Altogether, we can do this!

My reference is Cohen and Swerdlik's Psychological Testing and Assessment. This summary is intended for educational purposes only, and there is no intention to sell and/or steal its contents.
Please do not sell.

@typicallydong || dorothea dax

- Nothing worthwhile comes easy.
