Introduction To Psychological Tests
BY MISHIKA MALIK
Psychological testing in its modern form originated little more than 100 years ago in laboratory
studies of sensory discrimination, motor skills, and reaction time.
Francis Galton invented the first battery of tests.
Rudimentary forms of testing date back to at least 2200 B.C., when the Chinese emperor had his
officials examined every third year to determine their fitness for office.
Such testing was modified and refined over the centuries until written exams were introduced in
the Han dynasty.
The testing practices were unnecessarily gruelling, and the Chinese also failed to validate their
selection procedures.
But the examination program incorporated relevant selection criteria.
Physiognomy: It is based on the notion that we can judge the inner character of people from their
outward appearance, especially the face.
It represents an early form of psychological testing.
Interest in physiognomy can be dated back to the fourth century.
Physiognomy remained popular for centuries and laid the foundation for the more specialized
form of quackery, phrenology.
Phrenology: reading ‘bumps’ on the head.
The founding of phrenology is usually attributed to Franz Joseph Gall.
He argued that the brain is the organ of sentiments and faculties and that these capacities are
localized. To the extent that a faculty was well developed, the corresponding component of the
brain would be enlarged and would in turn form a bump, because the skull conforms to the shape
of the brain. (This premise is incorrect.)
Johann Spurzheim popularized phrenology.
Francis Galton (1822-1911) pioneered the new experimental psychology in Great Britain.
He was obsessed with measurements.
Galton demonstrated that individual differences not only exist but also are objectively
measurable.
Galton continued the tradition of brass instruments, but with an important difference: His
procedures were much more amenable to the timely collection of data from hundreds if not
thousands of subjects.
The tests and measures involved both the physical and behavioural domains.
He demonstrated that objective tests could be devised and that meaningful scores could be
obtained through standardized procedures.
James McKeen Cattell studied the new experimental psychology with both Wundt and Galton
before settling at Columbia University where, for 26 years, he was the undisputed Dean of
American psychology.
Cattell wanted to study individual differences and he did.
He invented the term ‘mental test’.
Wissler's finding that Cattell's mental tests correlated poorly with academic achievement turned
experimental psychology away from the brass instruments approach.
Rating scales are widely used in psychology as a means of quantifying subjective psychological
variables of many kinds.
A crude form of rating scale can be traced back to Galen, the second century Greco-Roman
physician.
The first person to devise and apply rating scales for psychological variables was Christian
Thomasius (1655-1728).
Changing conceptions of mental retardation in the 1800s
By the early 1800s medical practitioners realized that some of those with psychiatric impairment
had reversible illnesses that did not necessarily imply diminished intellect, whereas those with
mental retardation showed a greater developmental continuity and invariably had impaired
intellect.
A newfound humanism began to influence social practices toward individuals with psychological
and mental disabilities.
With this humanism there arose a greater interest in the diagnosis and remediation of mental
retardation.
Around the beginning of the nineteenth century, many physicians had begun to perceive the
difference between mental retardation and mental illness.
Esquirol (1772-1840) was the first to formalize the difference in writing.
He proposed the first classification system for mental retardation, with language skills as the
main diagnostic criterion.
Edouard Seguin helped establish a new humanism toward those with mental retardation in the
late 1800s.
Alfred Binet (1857-1911) invented the first modern intelligence test in 1905.
Binet was a prolific researcher and author long before he turned his attentions to intelligence
testing.
Binet argued that intelligence could be better measured by means of the higher psychological
processes rather than elementary sensory processes such as reaction time.
Binet and Simon were commissioned to develop a practical tool for identifying children who
needed special placement in schools.
Thus arose the first formal scale for assessing the intelligence of children.
The scale was appropriate for assessing the entire gamut of intelligence.
The purpose was classification, not measurement.
With the successful application of Binet’s mental test, psychologists realized that their inventions
could have pragmatic significance for many different segments of society.
The profusion of tests developed early in the twentieth century helped shape the character of
contemporary tests.
Goddard gained a reputation as one of the leading experts on the use of intelligence tests to
identify persons with impaired intellect.
He believed that the impaired children should be segregated so that they would be prevented
from ‘contaminating society’.
When Goddard visited Ellis Island, he became convinced that the rates of feeblemindedness were
much higher than estimated by the physicians who staffed the immigration service.
He became an apostle for the use of intelligence tests to identify feebleminded immigrants.
Goddard’s scholarly views were influenced by the social ideologies of his time.
It was Stanford professor Lewis M. Terman (1877-1956) who popularized IQ testing with his
revision of the Binet scales in 1916.
The Stanford-Binet was a substantial revision.
It was the standard of intelligence testing for decades.
The slow pace of developments in group testing picked up dramatically as the United States
entered WWI.
Robert M. Yerkes convinced the U.S. government and the army that all of its recruits should be
given intelligence tests for purposes of classification and assignment.
It gave rise to the Army Alpha and the Army Beta.
The Alpha consisted of eight verbally loaded tests for average and high-functioning recruits.
The Army Beta was a nonverbal group test designed for use with illiterates and recruits whose
first language was not English.
The army testing was intended to help segregate and eliminate the mentally incompetent, to
classify men according to their mental ability, and to assist in the placement of competent men in
responsible positions.
Yerkes’s grand scheme for testing army recruits helped to usher in the era of group tests.
Machine scoring was introduced in the 1930s, making objective group tests even more efficient
than before.
Aptitude tests measure more specific and delimited abilities than intelligence tests.
A single aptitude test will measure just one ability domain.
The development of aptitude tests lagged behind that of intelligence tests for two reasons:
Statistical
A new technique, factor analysis, was needed to discern which aptitudes were
primary and, therefore, distinct from each other.
The technique was not refined until the 1930s.
Social
There was no practical application for such refined instruments.
It was not until WWII that a pressing need arose to select candidates who were highly
qualified for very difficult and specialized tasks.
It was not until WWI that personality tests emerged in a form resembling their contemporary
appearance.
The first, Woodworth's Personal Data Sheet, was developed to detect which army recruits were
susceptible to psychoneurosis.
Later, the MMPI introduced the use of validity scales to detect fake-bad, fake-good, and
random response patterns.
The projective approach originated with the word association method pioneered by Galton; the
Rorschach inkblot test followed in the same tradition.
Psychologists were also devising measures for the guidance and counselling of the masses of
more normal persons, for example the interest inventory, which has roots going back to
Thorndike.
Beginning in the 1940s, personality tests began to flourish as useful tools for clinical evaluation
and also for assessment of the normal spectrum of functioning.
In the twenty-first century, the reach of testing continues to increase, both in one-to-one clinical
uses and in group testing for societal applications.
Evidence-based practice: the integration of best research evidence with clinical expertise and
patient values.
The advance of evidence-based practice is part of a worldwide trend to require proof that
treatments and interventions yield measurable positive outcomes.
2. Objectivity: The administration, scoring, and interpretation of scores are objective insofar
as they are independent of the subjective judgment of the individual examiner. Any one
individual should theoretically obtain the identical score on a test regardless of who
happens to be his examiner. This is not entirely so, of course, since perfect
standardization and objectivity have not been attained in practice. But at least such
objectivity is the goal of test construction and has been achieved to a reasonably high
degree in most tests. There are other major ways in which psychological tests can be
properly described as objective. The determination of the difficulty level of an item or of
a whole test is based on objective, empirical procedures. Not only the arrangement but
also the selection of items for inclusion in a test can be determined by the proportion of
subjects in the trial samples who pass each item. Thus, if there is a bunching of items at
the easy or difficult end of the scale, some items can be discarded. Similarly, if items are
sparse in certain portions of the difficulty range, new items can be added to fill the gaps.
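As a minimal sketch of the empirical procedure just described, item difficulty can be computed
as the proportion of a trial sample passing each item; the item names and response data below
are hypothetical:

# Item difficulty as the proportion of a trial sample passing each item
# (1 = pass, 0 = fail). The response data are hypothetical.
responses = {
    "item_1": [1, 1, 1, 1, 0],
    "item_2": [1, 1, 0, 0, 0],
    "item_3": [0, 0, 0, 1, 0],
}
difficulty = {item: sum(r) / len(r) for item, r in responses.items()}
print(difficulty)  # {'item_1': 0.8, 'item_2': 0.4, 'item_3': 0.2}

Items bunched at one end of the difficulty range could then be discarded or balanced with new
items, as described above.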
3. Norms:
Test norms consist of data that make it possible to determine the relative standing of an
individual who has taken a test. By itself, a subject’s raw score (e.g., the number of answers that
agree with the scoring key) has little meaning. Almost always, a test score must be interpreted as
indicating the subject’s position relative to others in some group. Norms provide a basis for
comparing the individual with a group.
Numerical values called centiles (or percentiles) serve as the basis for one widely applicable
system of norms. From a distribution of a group’s raw scores the percentage of subjects falling
below any given raw score can be found. Any raw score can then be interpreted relative to the
performance of the reference (or normative) group—eighth-graders, five-year-olds, institutional
inmates, job applicants. The centile rank corresponding to each raw score, therefore, shows the
percentage of subjects who scored below that point. Thus, 25 percent of the normative group
earn scores lower than the 25th centile; and an average called the median corresponds to the 50th
centile.
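A minimal sketch of the centile-rank computation, using hypothetical raw scores for the
normative group:

def centile_rank(raw_score, norm_scores):
    """Percentage of the normative group scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 30, 33]  # hypothetical raw scores
print(centile_rank(22, norm_group))  # 50.0: half the group scored below 22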
Another class of norm system (standard scores) is based on how far each raw score falls above or
below an average score, the arithmetic mean. One resulting type of standard score, symbolized
as z, is positive (e.g., +1.69 or +2.43) for a raw score above the mean and negative for a raw
score below the mean. Negative and fractional values can, however, be avoided in practice by
using other types of standard scores obtained by multiplying z scores by an arbitrarily selected
constant (say, 10) and by adding another constant (say, 50, which changes the z score mean of
zero to a new mean of 50). Such changes of constants do not alter the essential characteristics of
the underlying set of z scores.
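A minimal sketch of the transformation described above, assuming hypothetical normative
scores; multiplying z by 10 and adding 50 yields what is conventionally called a T score:

from statistics import mean, pstdev

raw_scores = [40, 45, 50, 55, 60]   # hypothetical normative data
m, sd = mean(raw_scores), pstdev(raw_scores)

def z_score(x):
    return (x - m) / sd             # how far x falls above or below the mean

def transformed_score(x):
    return 10 * z_score(x) + 50     # new mean 50, new standard deviation 10

print(round(z_score(60), 2), round(transformed_score(60), 1))  # 1.41 64.1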
The French psychologist Alfred Binet, in pioneering the development of tests of intelligence,
listed test items along a normative scale on the basis of the chronological age (actual age in years
and months) of groups of children that passed them. A mental-age score (e.g., seven) was
assigned to each subject, indicating the chronological age (e.g., seven years old) in the reference
sample for which his raw score was the mean. But mental age is not a direct index of brightness;
a mental age of seven in a 10-year-old is different from the same mental age in a four-year-old.
To correct for this, a later development was a form of IQ (intelligence quotient), computed as the
ratio of the subject’s mental age to his chronological age, multiplied by 100. (Thus, the IQ made
it easy to tell if a child was bright or dull for his age.)
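A minimal sketch of the ratio IQ computation, using the mental ages from the example above:

def ratio_iq(mental_age, chronological_age):
    # classical ratio IQ: mental age over chronological age, times 100
    return 100.0 * mental_age / chronological_age

print(ratio_iq(7, 10))  # 70.0  -> dull for a 10-year-old
print(ratio_iq(7, 4))   # 175.0 -> bright for a 4-year-old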
Ratio IQs for younger age groups exhibit means close to 100 and spreads of roughly 45 points
above and below 100. The classical ratio IQ has been largely supplanted by the deviation IQ,
mainly because the spread around the average has not been uniform due to different ranges of
item difficulty at different age levels. The deviation IQ, a type of standard score, has a mean of
100 and a standard deviation of 16 for each age level. Practice with the Stanford-Binet test
reflects the finding that average performance on the test does not increase beyond age 18.
Therefore, the chronological age of any individual older than 18 is taken as 18 for the purpose of
determining IQ.
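A minimal sketch of the deviation IQ, assuming hypothetical age-level norms (in practice the
mean and standard deviation come from the test's norm tables for each age level):

def deviation_iq(raw_score, age_mean, age_sd):
    # a standard score rescaled to mean 100 and standard deviation 16
    return 100 + 16 * (raw_score - age_mean) / age_sd

norming_age = min(25, 18)  # chronological ages above 18 are treated as 18
print(deviation_iq(76, age_mean=60, age_sd=10))  # 125.6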
The Stanford-Binet has been largely supplanted by several tests developed by the American
psychologist David Wechsler between the late 1930s and the early 1960s. These tests have
subtests for several capacities, some verbal and some performance-based, each subtest having its own
norms. After constructing tests for adults, Wechsler developed tests for older and for younger
children.
4. Reliability: It refers to the consistency of scores obtained by the same persons when
reexamined with the same test on different occasions, or with different sets of
equivalent items, or under other variable examining conditions. This concept of reliability
underlies the computation of the error of measurement of a single score, whereby we can
predict the range of fluctuation likely to occur in a single individual's score as a result of
irrelevant, chance factors. The concept of test reliability has been used to cover several
aspects of score consistency.
TYPES OF RELIABILITY
Test-retest reliability: The most obvious method for finding the reliability of test scores
is by repeating the identical test on a second occasion. The error variance corresponds to the
random fluctuations of performance from one test session to the other. These variations
may result in part from uncontrolled testing conditions, such as extreme changes in
weather, sudden noises and other distractions, or a broken pencil point. To some extent,
however, they arise from changes in the condition of the subject himself, as illustrated by
illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant
nature, and the like. Retest reliability shows the extent to which scores on a test can be
generalized over different occasions; the higher the reliability, the less susceptible the
scores are to the random daily changes in the condition of the subject or of the testing
environment. When retest reliability is reported in a test manual, the interval over which
it was measured should always be specified. Since retest correlations decrease
progressively as this interval lengthens, there is not one but an infinite number of retest
reliability coefficients for any test. It is also desirable to give some indication of relevant
intervening experiences of the subjects on whom reliability was measured, such as
educational or job experiences, counseling, psychotherapy, and so forth. The concept of
reliability is generally restricted to short-range, random changes that characterize the test
performance itself rather than the entire behavior domain that is being tested. It should be
noted that different behavior functions may themselves vary in the extent of daily
fluctuation they exhibit. Only tests that are not appreciably affected by repetition lend
themselves to the retest technique. A number of sensory discrimination and motor tests
would fall into this category. For the large majority of psychological tests, however, the
retest technique is inappropriate.
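As a minimal sketch, the retest reliability coefficient is simply the Pearson correlation between
the two administrations; the paired scores below are hypothetical:

from statistics import mean, pstdev

first_session  = [10, 12, 14, 16, 18]  # hypothetical scores, administration 1
second_session = [11, 12, 13, 17, 17]  # same subjects, administration 2

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

print(round(pearson_r(first_session, second_session), 2))  # 0.95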
The Spearman-Brown Formula: Notice that the split-half method gives us an estimate
of reliability for an instrument half as long as the full test. Although there are some
exceptions, a shorter test generally is less reliable than a longer test. This is especially
true if, in comparison to the shorter test, the longer test embodies equivalent content and
similar item difficulty. Thus, the Pearson r between two halves of a test will usually
underestimate the reliability of the full instrument. The Spearman-Brown formula
provides the appropriate adjustment.
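The formula itself is not stated above; in its standard form, the estimated reliability of the
full-length test is r_full = 2 * r_half / (1 + r_half), or more generally n * r / (1 + (n - 1) * r)
for a test n times as long. A minimal sketch:

def spearman_brown(r, n=2.0):
    # estimated reliability of a test n times as long as the one yielding r
    return n * r / (1 + (n - 1) * r)

print(round(spearman_brown(0.80), 3))  # 0.889: the full test is more reliable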
Furthermore, the standard deviation of the distribution of obtained scores would be the
standard error of measurement (SEM). While the true score on a test likely differs
from one person to the next, the SEM is regarded as constant, an inherent property of the
test. If we repeated this hypothetical experiment with another subject, the estimated true
score would probably differ, but the SEM should work out to be a similar value. SEM is
an index of measurement error that pertains to the test in question. If a test were perfectly
reliable, the SEM would be zero, and a subject's obtained score would then also be his or her
true score. However, this outcome is simply impossible in real-world testing. Every test exhibits
some degree of measurement error.
The larger the SEM, the greater the typical measurement error. However, the accuracy or
inaccuracy of any individual score is always a probabilistic matter and never a known
quantity. As noted, the SEM can be thought of as the standard deviation of an examinee’s
hypothetical obtained scores on a large number of equivalent tests, under the assumption
that practice and boredom effects are ruled out. Like any standard deviation of a normal
distribution, the SEM has well-known statistical uses.
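As a minimal sketch, the SEM is conventionally computed from the test's standard deviation and
reliability coefficient as SEM = SD * sqrt(1 - r), and can be used to place a confidence band
around an obtained score; the test statistics below are hypothetical:

import math

def sem(sd, reliability):
    # standard error of measurement: SD * sqrt(1 - r)
    return sd * math.sqrt(1 - reliability)

s = sem(sd=15, reliability=0.91)      # hypothetical test: SD 15, r = .91
print(round(s, 1))                    # 4.5
low, high = 100 - 1.96 * s, 100 + 1.96 * s
print(round(low, 1), round(high, 1))  # ~95% band around an obtained score of 100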
5. Validity: A test is valid to the extent that inferences made from it are appropriate,
meaningful, and useful. A test score per se is meaningless until the examiner draws
inferences from it based on the test manual or other research findings. Validity reflects an
evolutionary, research-based judgment of how adequately a test measures the attribute it
was designed to measure. Traditionally, the different ways of accumulating validity
evidence have been grouped into three categories: content validity, criterion-related
validity, and construct validity.
Individual & Group tests: There are a number of tests which are meant
to be performed individually. Such tests are called individual tests and
these tests are preferred for vocational guidance and counselling and for
clinical and diagnostic work with emotionally disturbed persons. As
individual tests are more costly, therefore they are less used in the industry
than the group tests. An example of an individual psychological test can
be the Stanford -Binet intelligence scale. On the contrary, some tests are
usually designed for a purpose so that they can be administered to a large
number of people in the industry. The examples of group tests can be
Purdue Vocational Achievement Tests, the Adaptability Test and the
Wonderlic Personnel Test.
Essay & Objective Tests: The essay tests are probably one of the oldest
methods of psychological tests that are created to check the candidate’s
ability to organise and articulate his or her thoughts clearly and of course
logically. It is Lord Macaulay who has been credited with introducing this
concept for the Indian Administrative Services or IAS. On the other hand,
the objective test has one correct answer and does not require or ask for
any sort of long extensive answers/explanation from the candidates. These
tests are generally used to check the mental power of the candidate and
reasoning and clarity of the concepts above all.
Psychological tests are used to measure and detect the abilities of a
person. Their main uses are the following:
• They measure individual differences, that is, differences between the
abilities of different people and in the performance of the same person at
different times.
• They assist in diagnosis, e.g., the Rorschach Test; the information
produced is scientifically reliable and tested.
• They assist in formulating the psychopathology of a disorder, e.g., the TAT.
• They assist in assessing the severity of a disorder, e.g., the Hamilton
scale for depression.
• They help in assessing the general characteristics of the individual,
e.g., assessment of intelligence and of personality.
• They are used for forensic evaluation regarding litigation, family
matters, criminal charges, and court issues.
• They assess the level of functioning or disability, help direct
treatment, and assess treatment outcome.
Competence:
Theoretical issues: Is your test reliable? • Is your test valid for a
particular purpose? • Are you measuring a stable characteristic of the
person being tested? • If so, do differences in scores over time reflect
measurement error or subject variables such as fatigue? • What is the
value of your test result: will it still be true next year?
Debriefing: Restate the purpose of the research. • Explain how the results
will be used (usually emphasizing that the interest is in the group
findings). • Reiterate that findings will be treated confidentially. •
Answer all of the respondents' questions fully.
Test Security: Test materials must be kept secure • Test items are not
revealed except in training programs and when mandated by law, to
protect test integrity • Test items are private property.
Divided loyalties: Who is the client? • The person being tested, or the
institution you work for? • What if these parties have conflicting interests?
Examples? • How do you maintain test security but also explain an
adverse decision?