Introduction To Psychological Tests
BY MISHIKA MALIK
Psychological testing in its modern form originated little more than 100 years ago in laboratory
studies of sensory discrimination, motor skills, and reaction time.
Francis Galton invented the first battery of tests.
Rudimentary forms of testing date back to at least 2200 B.C., when the Chinese emperor had his
officials examined every third year to determine their fitness for office.
Such testing was modified and refined over the centuries until written exams were introduced in
the Han dynasty.
The testing practices were unnecessarily gruelling, and the Chinese also failed to validate their
selection procedures.
But the examination program incorporated relevant selection criteria.
Physiognomy: It is based on the notion that we can judge the inner character of people from their
outward appearance, especially the face.
It represents an early form of psychological testing.
Interest in physiognomy can be dated back to the fourth century.
Physiognomy remained popular for centuries and laid the foundation for the more specialized
form of quackery, phrenology.
Phrenology: reading ‘bumps’ on the head.
The founding of phrenology is usually attributed to Franz Joseph Gall.
He argued that the brain is the organ of sentiments and faculties and that these capacities are
localized. To the extent that a faculty was well developed, the corresponding component of the
brain would be enlarged and would in turn form a bump, because the skull conforms to the shape
of the brain. (This premise is incorrect.)
Johann Spurzheim popularized phrenology.
Francis Galton (1822-1911) pioneered the new experimental psychology in Great Britain.
He was obsessed with measurements.
Galton demonstrated that individual differences not only exist but also are objectively
measurable.
Galton continued the tradition of brass instruments, but with an important difference: His
procedures were much more amenable to the timely collection of data from hundreds if not
thousands of subjects.
The tests and measures involved both the physical and behavioural domains.
He demonstrated that objective tests could be devised and that meaningful scores could be
obtained through standardized procedures.
James McKeen Cattell studied the new experimental psychology with both Wundt and Galton
before settling at Columbia University where, for 26 years, he was the undisputed Dean of
American psychology.
Cattell wanted to study individual differences and he did.
He invented the term ‘mental test’.
Wissler's finding that Cattell's mental tests correlated poorly with academic achievement turned
experimental psychology away from the brass instruments approach.
Rating scales are widely used in psychology as a means of quantifying subjective psychological
variables of many kinds.
A crude form of rating scale can be traced back to Galen, the second century Greco-Roman
physician.
The first person to devise and apply rating scales for psychological variables was Christian
Thomasius (1655-1728).
Changing conceptions of mental retardation in the 1800s
By the early 1800s medical practitioners realized that some of those with psychiatric impairment
had reversible illnesses that did not necessarily imply diminished intellect, whereas those with
mental retardation showed a greater developmental continuity and invariably had impaired
intellect.
A newfound humanism began to influence social practices toward individuals with psychological
and mental disabilities.
With this humanism there arose a greater interest in the diagnosis and remediation of mental
retardation.
Around the beginning of the nineteenth century, many physicians had begun to perceive the
difference between mental retardation and mental illness.
Esquirol (1772-1840) was the first to formalize the difference in writing.
He proposed the first classification system for mental retardation, with language skills as the
main diagnostic criterion.
Edouard Seguin helped establish a new humanism toward those with mental retardation in the
late 1800s.
Alfred Binet (1857-1911) invented the first modern intelligence test in 1905.
Binet was a prolific researcher and author long before he turned his attentions to intelligence
testing.
Binet argued that intelligence could be better measured by means of the higher psychological
processes rather than elementary sensory processes such as reaction time.
Binet and Simon were commissioned to develop a practical tool for identifying children who
needed special placement in schools.
Thus arose the first formal scale for assessing the intelligence of children.
The scale was appropriate for assessing the entire gamut of intelligence.
The purpose was classification, not measurement.
With the successful application of Binet’s mental test, psychologists realized that their inventions
could have pragmatic significance for many different segments of society.
The profusion of tests developed early in the twentieth century helped shape the character of
contemporary tests.
Goddard gained a reputation as one of the leading experts on the use of intelligence tests to
identify persons with impaired intellect.
He believed that the impaired children should be segregated so that they would be prevented
from ‘contaminating society’.
When Goddard visited Ellis Island, he became convinced that the rates of feeblemindedness were
much higher than estimated by the physicians who staffed the immigration service.
He became an apostle for the use of intelligence tests to identify feebleminded immigrants.
Goddard’s scholarly views were influenced by the social ideologies of his time.
It was Stanford professor Lewis M. Terman (1877-1956) who popularized IQ testing with his
revision of the Binet scales in 1916.
The Stanford-Binet was a substantial revision.
It was the standard of intelligence testing for decades.
The slow pace of developments in group testing picked up dramatically as the United States
entered WWI.
Robert M. Yerkes convinced the U.S. government and the army that all of its recruits should be
given intelligence tests for purposes of classification and assignment.
It gave rise to the Army Alpha and the Army Beta.
The Alpha consisted of eight verbally loaded tests for average and high-functioning recruits.
The Army Beta was a nonverbal group test designed for use with illiterates and recruits whose
first language was not English.
The army testing was intended to help segregate and eliminate the mentally incompetent, to
classify men according to their mental ability, and to assist in the placement of competent men in
responsible positions.
Yerkes’s grand scheme for testing army recruits helped to usher in the era of group tests.
Machine scoring was introduced in the 1930s, making objective group tests even more efficient
than before.
Aptitude tests measure more specific and delimited abilities than intelligence tests.
A single aptitude test will measure just one ability domain.
The development of aptitude tests lagged behind that of intelligence tests for two reasons:
Statistical
A new technique, factor analysis, was needed to discern which aptitudes were
primary and, therefore, distinct from each other.
The technique was not refined until the 1930s.
Social
There was no practical application for such refined instruments.
It was not until WWII that a pressing need arose to select candidates who were highly
qualified for very difficult and specialized tasks.
It was not until WWI that personality tests emerged in a form resembling their contemporary
appearance.
The first, Woodworth's Personal Data Sheet, was developed to detect which army recruits were
susceptible to psychoneurosis.
Later, the MMPI introduced the use of validity scales to detect fake-bad, fake-good, and
random response patterns.
The projective approach originated with the word association method pioneered by Galton; the
Rorschach inkblot test followed in the same tradition.
Psychologists were also devising measures for the guidance and counselling of the masses of
more normal persons, for example the interest inventory, which has roots going back to
Thorndike.
Beginning in the 1940s, personality tests began to flourish as useful tools for clinical evaluation
and also for assessment of the normal spectrum of functioning.
In the twenty-first century, the reach of testing continues to increase, both in one-to-one clinical
uses and in group testing for societal applications.
Evidence-based practice: the integration of best research evidence with clinical expertise and
patient values.
The advance of evidence-based practice is part of a worldwide trend to require proof that
treatments and interventions yield measurable positive outcomes.
2. Objectivity: The administration, scoring, and interpretation of scores are objective insofar
as they are independent of the subjective judgment of the individual examiner. Any one
individual should theoretically obtain the identical score on a test regardless of who
happens to be his examiner. This is not entirely so, of course, since perfect
standardization and objectivity have not been attained in practice. But at least such
objectivity is the goal of test construction and has been achieved to a reasonably high
degree in most tests. There are other major ways in which psychological tests can be
properly described as objective. The determination of the difficulty level of an item or of
a whole test is based on objective, empirical procedures. Not only the arrangement but
also the selection of items for inclusion in a test can be determined by the proportion of
subjects in the trial samples who pass each item. Thus, if there is a bunching of items at
the easy or difficult end of the scale, some items can be discarded. Similarly, if items are
sparse in certain portions of the difficulty range, new items can be added to fill the gaps.
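As a minimal sketch of the empirical procedure just described, item difficulty can be computed
as the proportion of a trial sample passing each item; the item names and response data below
are hypothetical:

# Item difficulty as the proportion of a trial sample passing each item
# (1 = pass, 0 = fail). The response data are hypothetical.
responses = {
    "item_1": [1, 1, 1, 1, 0],
    "item_2": [1, 1, 0, 0, 0],
    "item_3": [0, 0, 0, 1, 0],
}
difficulty = {item: sum(r) / len(r) for item, r in responses.items()}
print(difficulty)  # {'item_1': 0.8, 'item_2': 0.4, 'item_3': 0.2}

Items bunched at one end of the difficulty range could then be discarded or balanced with new
items, as described above.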
3. Norms:
Test norms consist of data that make it possible to determine the relative standing of an
individual who has taken a test. By itself, a subject’s raw score (e.g., the number of answers that
agree with the scoring key) has little meaning. Almost always, a test score must be interpreted as
indicating the subject’s position relative to others in some group. Norms provide a basis for
comparing the individual with a group.
Numerical values called centiles (or percentiles) serve as the basis for one widely applicable
system of norms. From a distribution of a group’s raw scores the percentage of subjects falling
below any given raw score can be found. Any raw score can then be interpreted relative to the
performance of the reference (or normative) group—eighth-graders, five-year-olds, institutional
inmates, job applicants. The centile rank corresponding to each raw score, therefore, shows the
percentage of subjects who scored below that point. Thus, 25 percent of the normative group
earn scores lower than the 25th centile; and an average called the median corresponds to the 50th
centile.
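A minimal sketch of the centile-rank computation, using hypothetical raw scores for the
normative group:

def centile_rank(raw_score, norm_scores):
    """Percentage of the normative group scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 30, 33]  # hypothetical raw scores
print(centile_rank(22, norm_group))  # 50.0: half the group scored below 22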
Another class of norm system (standard scores) is based on how far each raw score falls above or
below an average score, the arithmetic mean. One resulting type of standard score, symbolized
as z, is positive (e.g., +1.69 or +2.43) for a raw score above the mean and negative for a raw
score below the mean. Negative and fractional values can, however, be avoided in practice by
using other types of standard scores obtained by multiplying z scores by an arbitrarily selected
constant (say, 10) and by adding another constant (say, 50, which changes the z score mean of
zero to a new mean of 50). Such changes of constants do not alter the essential characteristics of
the underlying set of z scores.
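A minimal sketch of the transformation described above, assuming hypothetical normative
scores; multiplying z by 10 and adding 50 yields what is conventionally called a T score:

from statistics import mean, pstdev

raw_scores = [40, 45, 50, 55, 60]   # hypothetical normative data
m, sd = mean(raw_scores), pstdev(raw_scores)

def z_score(x):
    return (x - m) / sd             # how far x falls above or below the mean

def transformed_score(x):
    return 10 * z_score(x) + 50     # new mean 50, new standard deviation 10

print(round(z_score(60), 2), round(transformed_score(60), 1))  # 1.41 64.1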
The French psychologist Alfred Binet, in pioneering the development of tests of intelligence,
listed test items along a normative scale on the basis of the chronological age (actual age in years
and months) of groups of children that passed them. A mental-age score (e.g., seven) was
assigned to each subject, indicating the chronological age (e.g., seven years old) in the reference
sample for which his raw score was the mean. But mental age is not a direct index of brightness;
a mental age of seven in a 10-year-old is different from the same mental age in a four-year-old.
To correct for this, a later development was a form of IQ (intelligence quotient), computed as the
ratio of the subject’s mental age to his chronological age, multiplied by 100. (Thus, the IQ made
it easy to tell if a child was bright or dull for his age.)
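A minimal sketch of the ratio IQ computation, using the mental ages from the example above:

def ratio_iq(mental_age, chronological_age):
    # classical ratio IQ: mental age over chronological age, times 100
    return 100.0 * mental_age / chronological_age

print(ratio_iq(7, 10))  # 70.0  -> dull for a 10-year-old
print(ratio_iq(7, 4))   # 175.0 -> bright for a 4-year-old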
Ratio IQs for younger age groups exhibit means close to 100 and spreads of roughly 45 points
above and below 100. The classical ratio IQ has been largely supplanted by the deviation IQ,
mainly because the spread around the average has not been uniform due to different ranges of
item difficulty at different age levels. The deviation IQ, a type of standard score, has a mean of
100 and a standard deviation of 16 for each age level. Practice with the Stanford-Binet test
reflects the finding that average performance on the test does not increase beyond age 18.
Therefore, the chronological age of any individual older than 18 is taken as 18 for the purpose of
determining IQ.
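A minimal sketch of the deviation IQ, assuming hypothetical age-level norms (in practice the
mean and standard deviation come from the test's norm tables for each age level):

def deviation_iq(raw_score, age_mean, age_sd):
    # a standard score rescaled to mean 100 and standard deviation 16
    return 100 + 16 * (raw_score - age_mean) / age_sd

norming_age = min(25, 18)  # chronological ages above 18 are treated as 18
print(deviation_iq(76, age_mean=60, age_sd=10))  # 125.6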
The Stanford-Binet has been largely supplanted by several tests developed by the American
psychologist David Wechsler between the late 1930s and the early 1960s. These tests have
subtests for several capacities, some verbal and some performance-based, each subtest having its own
norms. After constructing tests for adults, Wechsler developed tests for older and for younger
children.
4. Reliability: It refers to the consistency of scores obtained by the same persons when
reexamined with the same test on different occasions, or with different sets of
equivalent items, or under other variable examining conditions. This concept of reliability
underlies the computation of the error of measurement of a single score, whereby we can
predict the range of fluctuation likely to occur in a single individual's score as a result of
irrelevant, chance factors. The concept of test reliability has been used to cover several
aspects of score consistency.
TYPES OF RELIABILITY
Test-retest reliability: The most obvious method for finding the reliability of test scores
is by repeating the identical test on a second occasion. The error variance corresponds to the
random fluctuations of performance from one test session to the other. These variations
may result in part from uncontrolled testing conditions, such as extreme changes in
weather, sudden noises and other distractions, or a broken pencil point. To some extent,
however, they arise from changes in the condition of the subject himself, as illustrated by
illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant
nature, and the like. Retest reliability shows the extent to which scores on a test can be
generalized over different occasions; the higher the reliability, the less susceptible the
scores are to the random daily changes in the condition of the subject or of the testing
environment. When retest reliability is reported in a test manual, the interval over which
it was measured should always be specified. Since retest correlations decrease
progressively as this interval lengthens, there is not one but an infinite number of retest
reliability coefficients for any test. It is also desirable to give some indication of relevant
intervening experiences of the subjects on whom reliability was measured, such as
educational or job experiences, counseling, psychotherapy, and so forth. The concept of
reliability is generally restricted to short-range, random changes that characterize the test
performance itself rather than the entire behavior domain that is being tested. It should be
noted that different behavior functions may themselves vary in the extent of daily
fluctuation they exhibit. Only tests that are not appreciably affected by repetition lend
themselves to the retest technique. A number of sensory discrimination and motor tests
would fall into this category. For the large majority of psychological tests, however, the
retest technique is inappropriate.
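As a minimal sketch, the retest reliability coefficient is simply the Pearson correlation between
the two administrations; the paired scores below are hypothetical:

from statistics import mean, pstdev

first_session  = [10, 12, 14, 16, 18]  # hypothetical scores, administration 1
second_session = [11, 12, 13, 17, 17]  # same subjects, administration 2

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

print(round(pearson_r(first_session, second_session), 2))  # 0.95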
The Spearman-Brown Formula: Notice that the split-half method gives us an estimate
of reliability for an instrument half as long as the full test. Although there are some
exceptions, a shorter test generally is less reliable than a longer test. This is especially
true if, in comparison to the shorter test, the longer test embodies equivalent content and
similar item difficulty. Thus, the Pearson r between two halves of a test will usually
underestimate the reliability of the full instrument. The Spearman-Brown formula
provides the appropriate adjustment.
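The formula itself is not stated above; in its standard form, the estimated reliability of the
full-length test is r_full = 2 * r_half / (1 + r_half), or more generally n * r / (1 + (n - 1) * r)
for a test n times as long. A minimal sketch:

def spearman_brown(r, n=2.0):
    # estimated reliability of a test n times as long as the one yielding r
    return n * r / (1 + (n - 1) * r)

print(round(spearman_brown(0.80), 3))  # 0.889: the full test is more reliable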
Furthermore, the standard deviation of the distribution of obtained scores would be the
standard error of measurement (SEM). While the true score on a test likely differs
from one person to the next, the SEM is regarded as constant, an inherent property of the
test. If we repeated this hypothetical experiment with another subject, the estimated true
score would probably differ, but the SEM should work out to be a similar value. SEM is
an index of measurement error that pertains to the test in question. If a test were perfectly
reliable, the SEM would be zero, and a subject's obtained score would then also be his or her
true score. However, this outcome is simply impossible in real-world testing. Every test exhibits
some degree of measurement error.
The larger the SEM, the greater the typical measurement error. However, the accuracy or
inaccuracy of any individual score is always a probabilistic matter and never a known
quantity. As noted, the SEM can be thought of as the standard deviation of an examinee’s
hypothetical obtained scores on a large number of equivalent tests, under the assumption
that practice and boredom effects are ruled out. Like any standard deviation of a normal
distribution, the SEM has well-known statistical uses.
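As a minimal sketch, the SEM is conventionally computed from the test's standard deviation and
reliability coefficient as SEM = SD * sqrt(1 - r), and can be used to place a confidence band
around an obtained score; the test statistics below are hypothetical:

import math

def sem(sd, reliability):
    # standard error of measurement: SD * sqrt(1 - r)
    return sd * math.sqrt(1 - reliability)

s = sem(sd=15, reliability=0.91)      # hypothetical test: SD 15, r = .91
print(round(s, 1))                    # 4.5
low, high = 100 - 1.96 * s, 100 + 1.96 * s
print(round(low, 1), round(high, 1))  # ~95% band around an obtained score of 100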
5. Validity: A test is valid to the extent that inferences made from it are appropriate,
meaningful, and useful. A test score per se is meaningless until the examiner draws
inferences from it based on the test manual or other research findings. Validity reflects an
evolutionary, research-based judgment of how adequately a test measures the attribute it
was designed to measure. Traditionally, the different ways of accumulating validity
evidence have been grouped into three categories: content validity, criterion-related
validity, and construct validity.
Individual & Group tests: There are a number of tests which are meant
to be performed individually. Such tests are called individual tests and
these tests are preferred for vocational guidance and counselling and for
clinical and diagnostic work with emotionally disturbed persons. As
individual tests are more costly, therefore they are less used in the industry
than the group tests. An example of an individual psychological test can
be the Stanford -Binet intelligence scale. On the contrary, some tests are
usually designed for a purpose so that they can be administered to a large
number of people in the industry. The examples of group tests can be
Purdue Vocational Achievement Tests, the Adaptability Test and the
Wonderlic Personnel Test.
Essay & Objective Tests: The essay tests are probably one of the oldest
methods of psychological tests that are created to check the candidate’s
ability to organise and articulate his or her thoughts clearly and of course
logically. It is Lord Macaulay who has been credited with introducing this
concept for the Indian Administrative Services or IAS. On the other hand,
the objective test has one correct answer and does not require or ask for
any sort of long extensive answers/explanation from the candidates. These
tests are generally used to check the mental power of the candidate and
reasoning and clarity of the concepts above all.
Psychological tests are used to measure and detect the abilities of a
person. Their main uses are the following:
• They measure individual differences, that is, differences between the
abilities of different people and in the performance of the same person at
different times.
• They assist in diagnosis, e.g., the Rorschach Test; the information
produced is scientifically reliable and tested.
• They assist in formulating the psychopathology of a disorder, e.g., the TAT.
• They assist in assessing the severity of a disorder, e.g., the Hamilton
scale for depression.
• They help in assessing the general characteristics of the individual,
e.g., assessment of intelligence and of personality.
• They are used for forensic evaluation regarding litigation, family
matters, criminal charges, and court issues.
• They assess the level of functioning or disability, help direct
treatment, and assess treatment outcome.
Competence:
Theoretical issues: Is your test reliable? • Is your test valid for a
particular purpose? • Are you measuring a stable characteristic of the
person being tested? • If so, do differences in scores over time reflect
measurement error or subject variables such as fatigue? • What is the
value of your test result: will it still be true next year?
Debriefing: Restate the purpose of the research. • Explain how the results
will be used (usually emphasizing that the interest is in the group
findings). • Reiterate that findings will be treated confidentially. •
Answer all of the respondents' questions fully.
Test Security: Test materials must be kept secure • Test items are not
revealed except in training programs and when mandated by law, to
protect test integrity • Test items are private property.
Divided loyalties: Who is the client? • The person being tested, or the
institution you work for? • What if these parties have conflicting interests?
Examples? • How do you maintain test security but also explain an
adverse decision?