Dorothea Dax Cohen
Traits and states are defined in order to capture the ways in which people differ uniquely from one another.
Reliability
the consistency of the measuring tool: the precision with which the test measures and the extent to which error is present in measurements. Unreliable measurement is to be avoided.
Psychological tests, like other tests and instruments, are reliable to varying degrees.
Reliability is a necessary but not sufficient element of a good test. In addition to being reliable, tests must be reasonably accurate. In the language of psychometrics, tests must be valid.
Validity
Norms
Norm-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker’s score and comparing it to the scores of a group of test takers.
Goal: to yield information on a test taker’s standing or ranking relative to some comparison group of test takers.
Norms – the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores.
Normative sample – the group of people whose performance on a particular test is analyzed for reference in evaluating the test performance of individual test takers.
Norming – the process of deriving norms; the term may be modified to describe a particular type of norm derivation.
There is a need to get sample out of the population because it is impractical, expensive,
and not even necessary to test the whole or wide range of population.
Sample – a portion of the universe of people deemed to be representative of the whole population.
Sampling – the process of selecting the portion of the universe deemed to be representative of the whole population.
In order to best assist future users of the test, test developers are encouraged
to:
Provide information to support recommended interpretations of the results,
including the nature of the content, norms or comparison groups, and other
technical evidence.
CHAPTER 5: RELIABILITY
Reliability is synonymous with dependability or consistency.
Variance – the square of the standard deviation (SD²).
Random error – a source of error in measuring a targeted variable, caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
The greater the proportion of the total variance attributed to true variance, the more reliable the test.
Note: the amount of test variance that is true relative to error may never be known.
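The true-variance idea above can be shown numerically. This is a minimal sketch under classical test theory, where observed-score variance is assumed to decompose into true variance plus error variance; the variance figures are invented for illustration.

```python
# Classical test theory: observed variance = true variance + error variance.
# The figures below are hypothetical, chosen only to illustrate the ratio.
true_variance = 80.0
error_variance = 20.0

total_variance = true_variance + error_variance   # var(X) = var(T) + var(E)
reliability = true_variance / total_variance      # proportion of variance that is "true"

print(reliability)  # 0.8 -> 80% of score variance reflects true differences
```

The closer the ratio is to 1, the more of the test’s variance is true rather than error, i.e., the more reliable the test.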
** Tests that measure several different domains (e.g., the NEO-PI) are less homogeneous. Hence, to get the desired reliability, the test developer should conduct a separate reliability analysis for each of the domains of a single test.
The more homogeneous a test is, the higher inter-item consistency it can be
expected to have.
(Positive relationship)
Homogeneity = oneness
KR-20
If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method. Hence, the split-half method is a better choice.
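For reference, a sketch of the split-half procedure on assumed data: the half-test totals for five test takers are invented for illustration, the two halves are correlated, and the half-test correlation is stepped up to a full-test estimate with the Spearman-Brown formula, r_sb = 2r / (1 + r).

```python
# Split-half reliability with the Spearman-Brown correction.
# The half-test totals below are hypothetical illustration data.
odd_half  = [3, 2, 2, 1, 0]   # each test taker's total on the odd-numbered items
even_half = [3, 2, 1, 1, 0]   # each test taker's total on the even-numbered items

n = len(odd_half)
mean_o = sum(odd_half) / n
mean_e = sum(even_half) / n

cov   = sum((o - mean_o) * (e - mean_e) for o, e in zip(odd_half, even_half)) / n
var_o = sum((o - mean_o) ** 2 for o in odd_half) / n
var_e = sum((e - mean_e) ** 2 for e in even_half) / n

r_half = cov / (var_o * var_e) ** 0.5   # correlation between the two halves
r_sb   = 2 * r_half / (1 + r_half)      # Spearman-Brown: estimated full-test reliability

print(round(r_half, 2), round(r_sb, 2))
```

The correction is needed because each half is only half as long as the full test, and shorter tests tend to be less reliable.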
KR-21
The KR-21 formula may be used if there is reason to assume that all the test
items have approximately the same degree of difficulty.
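Both Kuder-Richardson formulas can be sketched on a tiny, made-up set of dichotomous (0/1) item responses. KR-20 uses each item’s pass proportion, while KR-21 assumes all items are equally difficult, so when that assumption is false KR-21 is typically the lower estimate.

```python
# KR-20 and KR-21 on dichotomous (right/wrong) items.
# Rows = test takers, columns = items; the data are hypothetical.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
n = len(responses)        # number of test takers
k = len(responses[0])     # number of items

totals = [sum(row) for row in responses]
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n

# KR-20: uses each item's proportion passing (p) and failing (q = 1 - p).
p = [sum(row[j] for row in responses) / n for j in range(k)]
sum_pq = sum(pj * (1 - pj) for pj in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)

# KR-21: assumes all items have the same difficulty (uses only the mean total).
kr21 = (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))

print(round(kr20, 2), round(kr21, 2))
```

Here the items differ in difficulty (p = .75, .50, .25), so KR-21 (0.60) comes out below KR-20 (0.75).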
Cronbach’s Alpha
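Coefficient alpha generalizes KR-20 to items that are not scored simply right/wrong. A sketch on made-up 5-point ratings, using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance):

```python
# Cronbach's alpha: internal consistency for non-dichotomous items.
# Rows = test takers, columns = items rated on a 5-point scale (hypothetical data).
ratings = [
    [5, 4, 5],
    [4, 4, 4],
    [2, 3, 2],
    [1, 2, 1],
]
n = len(ratings)
k = len(ratings[0])

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
total_var = variance([sum(row) for row in ratings])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

For dichotomous items, alpha reduces to KR-20.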
Criterion-referenced Test
CTT – referred to as the true score (or classical) model of measurement.
the most widely used and accepted model in the psychometric literature today.
everyone has a “true score” on a test: a value that genuinely reflects an individual’s ability (or trait) level as measured by a particular test.
its assumptions allow for its application in most situations. CTT assumptions are characterized as “weak” precisely because they are so readily met, compared to IRT, whose assumptions are more difficult to meet.
compatibility and ease of use with widely used statistical techniques: factor-analytic techniques, whether exploratory or confirmatory, are all “based on the CTT measurement foundation.”
Disadvantage: all items are presumed to contribute equally to the score total.
Factor to consider: the length of tests developed using a CTT model.
Note: CTT favors the development of longer rather than shorter tests.
CHAPTER 6: VALIDITY
Validity
the meaningfulness of a test score: what the test score truly means.
a judgment or estimate of how well a test measures what it purports to measure in a particular context.
a judgment based on evidence about the appropriateness of inferences drawn from test scores. (An inference is a logical result or deduction.)
Validity may be characterized as “acceptable” or “weak.”
“Reasonable boundaries”: no test or measurement technique is “universally valid” for all time, for all uses, with all types of test-taker populations.
Local validation studies are necessary when the test user plans to alter in some way the format, instructions, language, or content of the test.
Content Validity
a judgment of how adequately a test samples behavior representative of the
universe of behavior that the test was designed to sample.
Criterion Validity
a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest.
Criterion – the standard against which a test or a test score is evaluated.
Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
Predictive validity: an index of the degree to which a test score predicts some criterion measure.
Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.
An expectancy table can provide an indication of the likelihood that a test taker will score within some interval of scores on a criterion measure, an interval that may be categorized as “passing,” “acceptable,” or “failing.”
Characteristics of a criterion:
1. Relevant - pertinent or applicable to the matter at hand.
2. Valid
3. Uncontaminated
Criterion Contamination - the term applied to a criterion measure that has been
based, at least in part, on predictor measures.
If a certain variable has been used both as a predictor and as a criterion, then criterion contamination has occurred.
Criterion contamination can neither be gauged nor corrected.
Concurrent Validity
indicates the extent to which test scores may be used to estimate an individual’s
present standing on a criterion.
Validating criterion (criterion validator) – the criterion against which a single test is being compared.
Predictive Validity
- how accurately scores on the test predict some criterion measure.
Base rate – the extent to which a particular trait, behavior, characteristic, or attribute exists in the population.
Hit rate – the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
Miss rate – the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute (inaccurate prediction).
Types of misses:
False positive – a miss wherein the test predicted that the test taker did possess the characteristic or attribute being measured when in fact the test taker did not (Type I error).
False negative – a miss wherein the test predicted that the test taker did not possess the characteristic or attribute being measured when the test taker actually did (Type II error).
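These rates can all be computed from a 2×2 table of decision outcomes. The counts below are hypothetical, chosen only to make the arithmetic visible.

```python
# Base rate, hit rate, and miss rate from hypothetical decision outcomes
# for 100 test takers (counts invented for illustration).
true_positive  = 40   # test said "has attribute" and the person does
false_positive = 10   # test said "has attribute" but the person does not (Type I)
false_negative = 5    # test said "lacks attribute" but the person has it (Type II)
true_negative  = 45   # test said "lacks attribute" and the person lacks it

n = true_positive + false_positive + false_negative + true_negative

base_rate = (true_positive + false_negative) / n   # how common the attribute really is
hit_rate  = (true_positive + true_negative) / n    # proportion classified correctly
miss_rate = (false_positive + false_negative) / n  # proportion classified incorrectly

print(base_rate, hit_rate, miss_rate)
```

Note that hit rate plus miss rate always equals 1: every test taker is either classified correctly or not.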
Validity Coefficient
a correlation coefficient that provides a measure of the relationship between
test scores and scores on the criterion measure.
Incremental Validity
the degree to which an additional predictor explains something about the
criterion measure that is not explained by predictors already in use.
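Incremental validity is commonly quantified as the gain in R² when the new predictor is added to a regression that already contains the old ones. A sketch with invented numbers, where the criterion is deliberately built so the second predictor carries extra information:

```python
# Incremental validity as the gain in R^2 from adding a second predictor.
# All data are hypothetical; numpy's least-squares fit does the regression.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor already in use
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])   # candidate additional predictor
y  = x1 + 0.5 * x2                          # criterion (constructed so x2 adds information)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_one  = r_squared(x1.reshape(-1, 1), y)
r2_both = r_squared(np.column_stack([x1, x2]), y)

incremental = r2_both - r2_one   # criterion variance explained only by x2
print(round(r2_one, 3), round(r2_both, 3))
```

A near-zero increment would suggest the new predictor is redundant with those already in use.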
Construct Validity
a judgment about the appropriateness of inferences drawn from test scores
regarding individual standings on a variable.
A construct is an informed, scientific idea developed or hypothesized to
describe or explain behavior.
Constructs are unobservable, presupposed (underlying) traits that a test
developer may invoke to describe test behavior or criterion performance.
Steps:
1. Formulate hypotheses about the construct being measured.
2. Identify high and low scorers on the test.
3. When tested, both groups of scorers should behave in the way they are expected to.
4. If not, the test developer needs to reexamine the construct and/or the hypotheses made.
Discriminant Validity
A validity coefficient showing little (a statistically insignificant) relationship
between test scores and/or other variables with which scores on the test being
construct-validated should not theoretically be correlated.
2 similar constructs that correlated → good convergent validity
2 similar constructs that did not correlate → poor convergent validity
2 different constructs that did not correlate → good discriminant validity
2 different constructs that correlated → poor discriminant validity
Multitrait-multimethod matrix
(Campbell & Fiske, 1959) – the matrix or table that results from correlating variables (traits) within and between methods.
Example: values for any number of traits (such as aggressiveness or extraversion), as obtained by various methods (such as behavioral observation or a personality test), are inserted into the table, and the resulting matrix of correlations provides insight with respect to both the convergent and the discriminant validity of the methods used.
Factor Analysis
a shorthand term for a class of mathematical procedures designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
employed as a data-reduction method in which several sets of scores and the correlations between them are analyzed.
used to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests.
The procedures are highly technical and generally require the aid of sophisticated software.
2 Types:
1. Exploratory factor analysis (EFA)
entails estimating, or extracting, factors; deciding how many factors to retain; and rotating factors to an interpretable orientation.
2. Confirmatory factor analysis (CFA)
entails testing the degree to which a hypothesized factor structure fits the observed data.
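The “how many factors to retain” step can be sketched in a simplified way with the eigenvalues of the correlation matrix and the common eigenvalue-greater-than-one rule of thumb (the Kaiser criterion). The data below are simulated so that two latent factors drive four observed variables; real factor extraction and rotation involve more than this.

```python
import numpy as np

# Simulate 200 test takers: v1 and v2 share one latent factor,
# v3 and v4 share another; noise is small so the structure is clear.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = rng.normal(size=200)
data = np.column_stack([
    f1 + 0.3 * rng.normal(size=200),
    f1 + 0.3 * rng.normal(size=200),
    f2 + 0.3 * rng.normal(size=200),
    f2 + 0.3 * rng.normal(size=200),
])

corr = np.corrcoef(data, rowvar=False)            # 4x4 correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion (a rule of thumb): retain factors with eigenvalue > 1.
n_factors = int((eigenvalues > 1).sum())
print(n_factors)
```

Two large eigenvalues and two tiny ones fall out of this data, matching the two factors that generated it.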
Rating Error
A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale.
A rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale.
Types of Error:
1. Leniency/Generosity Error
the tendency on the part of the rater to be lenient in scoring, marking, and/or grading (inflated scores).
to avoid this, include a list of the specific competencies to be evaluated, as well as when and how such evaluations of competency should be conducted.
2. Severity/Strictness Error
an error attributed to ratings that are systematically too low for the subjects.
3. Central Tendency Error
the rater exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme.
To combat restriction-of-range rating errors (all three of the above), it is good to:
use rankings, a procedure that requires the rater to measure individuals against one another instead of against an absolute scale.
by using rankings instead of ratings, the rater (now the “ranker”) is forced to select first, second, and third choices, and so forth.
4. Halo Effect
a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater’s failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior.
Test Fairness
the extent to which a test is used in an impartial, just, and equitable way.
CHAPTER 7: UTILITY
Test Utility
refers to the practical value of using a test to aid in decision making.
Judgments concerning the utility of a test are made based on test reliability and validity data as well as on other data.
Note: a valid test is not automatically a useful one.
Taylor-Russel Tables
provides an estimate of the extent to which inclusion of a particular test in the selection system will improve selection.
the tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs.
Three variables: the validity coefficient, the selection ratio used, and the base rate.
used in determining the increase over current procedures.
Selection ratio – the number of applicants to be hired divided by the total number of applicants.
Base rate – percentage of people hired under the existing system for a particular
position.
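These two inputs are simple proportions; a sketch with hypothetical hiring figures:

```python
# Taylor-Russell inputs computed from a hypothetical hiring scenario.
openings   = 20
applicants = 100
selection_ratio = openings / applicants   # 0.20: only 1 in 5 applicants can be hired

currently_hired      = 50
currently_successful = 30
base_rate = currently_successful / currently_hired  # 0.60: success rate under the old system

print(selection_ratio, base_rate)
```

Together with the test’s validity coefficient, these two values index into the Taylor-Russell tables to estimate how much the test improves on current procedures.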
If you are to evaluate the Taylor-Russell tables, consider these:
Quadrants 2 and 4: you made the right decision of hiring (Q2) the excellent applicants and not hiring (Q4) the poor ones.
Quadrant 1: low-scoring applicants who would have been high-performing employees (rejected in error)
- false negative; Type II error
Quadrant 3: high-scoring applicants who turn out to be low-performing employees (hired in error)
- false positive; Type I error
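The quadrant logic amounts to crossing the hiring decision with later job performance. A sketch with invented applicant data:

```python
# Classifying hiring outcomes into the four quadrants (hypothetical data).
# Each tuple: (test score, did the person succeed on the job?)
applicants = [(85, True), (90, False), (55, True), (75, True), (40, False)]
cut_score = 60

q1 = q2 = q3 = q4 = 0
for score, successful in applicants:
    hired = score >= cut_score
    if hired and successful:              # Q2: correct hire
        q2 += 1
    elif not hired and not successful:    # Q4: correct rejection
        q4 += 1
    elif not hired and successful:        # Q1: false negative (Type II error)
        q1 += 1
    else:                                 # Q3: false positive (Type I error)
        q3 += 1

print(q1, q2, q3, q4)
```

Only Q2 and Q4 are correct decisions; Q1 and Q3 are the two kinds of selection error.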
Naylor-Shine Tables
obtaining the difference between the means of the selected and unselected
groups to derive an index of what the test (or some other tool of assessment)
is adding to already established procedures.
determining the increase in average score on some criterion measure.
The validity coefficients used with both tables should be obtained using a concurrent validation procedure.
Decision Theory
provides guidelines for setting optimal cutoff scores.
The higher the selection ratio, the higher the false positives; the lower the
selection ratio, the higher the false negatives.
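That trade-off can be demonstrated by sweeping the cut score over the same hypothetical applicant pool: a lenient cut (high selection ratio) admits more eventual failures, while a strict cut (low selection ratio) rejects more would-be successes.

```python
# How the cut score (and hence the selection ratio) trades off error types.
# Hypothetical applicants: (test score, would the person have succeeded?)
pool = [(85, True), (90, False), (70, False), (55, True), (75, True), (40, False)]

def errors(cut):
    fp = sum(1 for s, ok in pool if s >= cut and not ok)  # hired, then failed
    fn = sum(1 for s, ok in pool if s < cut and ok)       # rejected, would have succeeded
    return fp, fn

fp_low, fn_low = errors(60)    # lenient cut -> high selection ratio
fp_high, fn_high = errors(80)  # strict cut  -> low selection ratio

print((fp_low, fn_low), (fp_high, fn_high))
```

Decision theory frames the choice of cut score as balancing the costs of these two error types.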
Practical Considerations:
Note: issues related to existing base rates can affect the accuracy of decisions made based
on tests.
1. The pool of job applicants
2. The complexity of the job
3. The cut score in use
A. Relative cut score: set based on norm-related considerations rather than on the relationship of test scores to a criterion; also called a “norm-referenced cut score.”
B. Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification; also called an “absolute cut score.”
Multiple cut scores – refers to the use of two or more cut scores with reference to one predictor for the purpose of categorizing test takers.
Multiple hurdles - may be thought of as one collective element of a multistage
decision-making process in which the achievement of a particular cut score on
one test is necessary in order to advance to the next stage of evaluation in the
selection process.
it assumes that an individual must possess a certain minimum amount of
knowledge, skill, or ability for each attribute measured by a predictor to
be successful in the desired position.
Compensatory Model of Selection - high scores on one attribute can, in fact,
“balance out” or compensate for low scores on another attribute.
Uses multiple regression.
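The contrast between the two models can be sketched directly. The attribute scores, hurdle minimum, and equal weights below are all hypothetical; in practice the compensatory weights would come from a multiple regression.

```python
# Multiple hurdles vs. the compensatory model (all numbers hypothetical).
# Two attributes measured for each candidate: (knowledge, skill).
candidates = {"A": (80, 40), "B": (40, 80), "C": (60, 60)}

# Multiple hurdles: a candidate must clear a minimum on EVERY attribute.
hurdle = 50
passes_hurdles = {name: all(score >= hurdle for score in scores)
                  for name, scores in candidates.items()}

# Compensatory model: a weighted sum, so strength on one attribute
# can offset weakness on another.
w_knowledge, w_skill = 0.5, 0.5
composite = {name: w_knowledge * k + w_skill * s
             for name, (k, s) in candidates.items()}

print(passes_hurdles)   # only C clears both hurdles
print(composite)        # A, B, and C all earn the same composite
```

Under hurdles only the balanced candidate C survives, while under the compensatory composite all three candidates are rated identically — the two models embody different assumptions about whether weaknesses can be offset.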
3. IRT-based Methods
cut scores are typically set based on test takers’ performance across all the items on the test; some portion of the total number of items on the test must be scored “correct” in order for the test taker to “pass” the test.
experts determine the minimum level of item difficulty that must be answered correctly, and this serves as the cut score.
Licensing Exam: item-mapping method
entails the arrangement of items in a histogram, with each
column in the histogram containing items deemed to be of
equivalent value.
Academic application: bookmark method
experts review a booklet of items arranged in ascending order of difficulty. The expert then places a “bookmark” between the two pages (i.e., the two items) that are deemed to separate test takers who have acquired the minimal knowledge, skills, and/or abilities from those who have not.
Pilot Work / Pilot Study / Pilot Research – refers to the preliminary research
surrounding the creation of a prototype of the test.
2. Test Construction
- entails writing test items (or re-writing or revising existing items), as well as
formatting items, setting scoring rules, and otherwise designing and building a test.
Scaling
the process of setting rules for assigning numbers in measurement.
Types of Scales:
Age-based
Grade-based
Stanine
Unidimensional vs. Multidimensional
Comparative vs. Categorical (& many more)
Continuation of test construction:
Scaling Methods:
MDBS-R: Morally Debatable Behavior Scale-Revised
A rating scale - “a practical means of assessing what people believe, the
strength of their convictions, as well as individual differences in moral
tolerance.”
contains 30 items.
It also uses a summative scale: the final test score is obtained by summing
the ratings across all the items.
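Summative scoring is simply adding the item ratings, with any oppositely worded items reverse-scored first. This is a generic sketch (the item set and which item is reverse-keyed are invented, not the actual MDBS-R key).

```python
# Summative scoring on a 5-point rating scale (hypothetical 4-item survey).
# Item 3 is worded in the opposite direction, so it is reverse-scored: 6 - rating.
ratings = [4, 5, 2, 3]           # one respondent's raw ratings on items 1-4
reverse_keyed = {2}              # zero-based index of the reverse-worded item

scored = [6 - r if i in reverse_keyed else r for i, r in enumerate(ratings)]
total = sum(scored)              # the summative (total) scale score

print(scored, total)
```

Forgetting to reverse-score opposite-worded items is a common scoring error that deflates the total.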
Rating Scale
a grouping of words, statements, or symbols on which judgments of the
strength of a particular trait, attitude, or emotion are indicated by the test
taker.
can be used to record judgments of oneself, others, experiences, or objects,
and they can take several forms.
Likert Scale
used extensively in psychology, usually to scale attitudes.
Likert scales are usually reliable, which may account for their widespread
popularity.
The use of rating scales of any type results in ordinal-level data.
Methods of Sorting:
1. Comparative Scaling
entails judgments of a stimulus in comparison with every other
stimulus on the scale.
Test takers would be asked to sort the cards from most justifiable to
least justifiable.
2. Categorical Scaling
Stimuli are placed into one of two or more alternative categories that
differ quantitatively with respect to some continuum.
Guttman Scale
Items on it range sequentially from weaker to stronger expressions of the attitude,
belief, or feeling being measured.
If option ‘A’ is the strongest statement, respondents who agree with ‘A’ should also agree with ‘B’ and all of the weaker statements below it.
the test is analyzed using scalogram analysis: an item-analysis procedure and
approach to test development that involves a graphic mapping of a test taker’s
responses.
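The cumulative property can be checked mechanically: with items ordered from weakest to strongest expression, a scalable response pattern is a run of endorsements followed by a run of non-endorsements. A minimal sketch (not full scalogram analysis, which also computes reproducibility statistics):

```python
# Checking response patterns for Guttman (cumulative) consistency.
# Items are ordered weakest -> strongest; endorsing a stronger item implies
# endorsing every weaker one, so valid patterns are 1s followed by 0s.
def is_scalable(pattern):
    return pattern == sorted(pattern, reverse=True)

print(is_scalable([1, 1, 1, 0]))  # endorsed only the weaker items: consistent
print(is_scalable([1, 0, 1, 0]))  # endorsed a stronger item but skipped a weaker one
```

Real scalogram analysis tallies how many such violations occur across all respondents to judge whether the item set forms a true Guttman scale.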
The method of equal-appearing intervals is an example of a scaling method of the direct
estimation variety. In contrast to other methods that involve indirect estimation, there is
no need to transform the test taker’s responses into some other scale.
3. Test Tryout
test administration to a representative sample of testtakers under conditions
that simulate the conditions that the final version of the test.
There should be no fewer than 5 subjects and preferably as many as 10 for each
item on the test.
the more subjects employed, the weaker the role of chance in subsequent data analysis.
with too few subjects, “phantom factors” (factors that are actually just artifacts of the small sample size) may emerge.
4. Item Analysis
analysis of test takers’ performance on the test using statistical procedures to
assist in making judgments about which items are good as they are, which items
need to be revised, and which items should be discarded.
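Two of the most common item-analysis statistics are item difficulty (the proportion passing the item) and a discrimination index contrasting high and low scorers. A sketch on invented data using the simple upper-lower-groups method:

```python
# Item difficulty (p) and a simple discrimination index (upper-lower method).
# Test takers are listed sorted by total score, highest first; data hypothetical.
item_correct = [1, 1, 1, 0, 1, 0]   # did each test taker get THIS item right?
totals       = [10, 9, 8, 4, 3, 2]  # each test taker's total test score

n = len(item_correct)
p = sum(item_correct) / n            # difficulty: proportion who passed the item

upper = item_correct[: n // 2]       # top half by total score
lower = item_correct[n // 2 :]       # bottom half
d = sum(upper) / len(upper) - sum(lower) / len(lower)   # discrimination index

print(round(p, 2), round(d, 2))
```

A positive d means high scorers pass the item more often than low scorers, which is what a good item should show; items with d near zero or negative are candidates for revision or removal.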
5. Test Revision
refers to action taken to modify a test’s content or format for the purpose of
improving the test’s effectiveness as a tool of measurement.
revised version of the test will then be tried out on a new sample of test takers.
Galton (1883) believed that the most intelligent persons were those equipped with the best sensory abilities.
For Galton, tests of visual acuity or hearing ability were, in a sense, tests of intelligence.
He was the first person to publish on the heritability of intelligence, thus anticipating later nature-nurture debates.
Galton viewed intelligence as a number of distinct processes or abilities that could be assessed only by separate tests.
Binet argued that when one solves a particular problem, the abilities used cannot be
separated because they interact to produce the solution.
Perspective on Intelligence
A major thread running through the theories of Binet, Wechsler, and Piaget is a focus on
interactionism.
A. Interactionism
refers to the complex concept by which heredity and environment are
presumed to interact and influence the development of one’s intelligence.
B. Factor-analytic theories
focus squarely on identifying the ability or groups of abilities deemed to constitute intelligence.
Factor analysis is a group of statistical techniques designed to determine the
existence of underlying relationships between sets of variables, including test
scores.
Spearman (Correlation)
Two-Factor Theory of Intelligence: general mental ability (g) and specific mental abilities (s), plus error components.
The greater the magnitude of g in a test of intelligence, the better the test was
thought to predict overall intelligence.
B. Fluid
are nonverbal, relatively culture-free, and independent of specific
instruction (such as memory for digits).
For Thorndike, one’s ability to learn is determined by the number and speed
of the bonds that can be marshaled.
Information-processing theories
Aleksandr Luria
focus on identifying the specific mental processes that constitute intelligence.
focus on the mechanisms by which information is processed: how information is processed, rather than what is processed.
Two types of processing: simultaneous and successive.
A. Simultaneous (parallel)
Information is integrated all at one time.
simultaneous processing may be described as “synthesized.”
tasks that involve the simultaneous mental representations of images.
Map reading
B. Successive
each bit of information is individually processed in sequence.
sequential processing is logical and analytic in nature.
piece by piece and one piece after the other, information is arranged and
rearranged so that it makes sense.
That’s all for now, besties. This is only a supplement to our learnings. I hope this helps you reinforce your knowledge of Psychological Assessment!
Altogether, we can do this!