
History of Psychological Assessment

• Chinese – civil service testing to determine suitable people to work in a government.
o Determined work evaluations and promotion decisions.

PEOPLE IN PSYCHOLOGY (PSYCHOLOGICAL ASSESSMENT)

• Charles Darwin – higher forms of life evolved partially because of differences among individual forms of life within a species. ("The Origin of Species," 1859)
• Francis Galton – some people possessed characteristics that made them more fit than others. ("Hereditary Genius," 1869)
• James McKeen Cattell – coined the term Mental Test.
• Wilhelm Wundt – set up a laboratory at the University of Leipzig in 1879.
o Founded the science of psychology.
o Followed in the tradition of Weber and Fechner.
• Alfred Binet (with Theodore Simon) – developed the 1st major general intelligence test.

TESTS

Binet-Simon Scale
• 1st version (Binet-Simon Scale, 1905).
o 30 items of increasing difficulty, designed to identify intellectually subnormal individuals.
o Standardization sample = 50 children who had been given the test under standard conditions – same instructions and format.
o Mental age concept – one of the most important contributions of the revised 1908 Binet-Simon Scale.

Stanford-Binet Intelligence Scale
• Lewis Madison Terman of Stanford University – revised the Binet test for use in the United States.
- The only American version of the Binet test that flourished.
- Standardization sample = 1,000 people.
- Original items were revised; many new items were added.

Army Alpha & Army Beta
• Developed during WWI under Robert Yerkes (head of a committee of distinguished psychologists).
- Two structured group tests of human abilities.
- Considered the most culturally fair tests of their time.
- Alpha = required reading ability.
- Beta = measured illiterate adults' intelligence.

Woodworth Personal Data Sheet
- 1st structured personality test.
- Developed during WWI.
- Published in final form just after the war.
- Assumed that a test response can be taken at face value.

Rorschach Inkblot Test (1932)
• 1st published by Hermann Rorschach of Switzerland, 1921.
• David Levy (US) introduced it.
- The test was the subject of the 1st doctoral dissertation on it in the US.
- Was not published until 1932, when Sam Beck (Levy's student) decided to investigate the properties of the Rorschach test scientifically.
• A highly controversial projective test that provides an ambiguous stimulus and asks the subject to explain what the inkblot might be.
Thematic Apperception Test (TAT)
• Henry Murray & Christiana Morgan, 1935.
- Requires the subject to make up a story about an ambiguous scene.
- Purported to measure human needs and thus to ascertain individual differences in motivation.

Minnesota Multiphasic Personality Inventory (MMPI)
- Opened a new era for structured personality tests.
- Test items were originally developed by selecting questions that had been endorsed by people diagnosed with different mental health conditions.
- A test that made no assumptions about the meaning of a test response; such meaning was to be determined by empirical research.

California Psychological Inventory (CPI)
- Structured personality test developed according to the same principles as the MMPI.

16 Personality Factor Questionnaire (16PF)
- A structured personality test based on the statistical procedure of factor analysis.

CONCEPTS
• Representative sample
- Comprises individuals similar to those for whom the test is to be used.
- Must reflect all segments of the population in proportion to their actual numbers.
• Mental age
- A measurement of a child's performance on the test relative to other children of that particular age group.
• Factor analysis
- A method of finding the minimum number of dimensions (characteristics, attributes), called factors, needed to account for a large number of variables.
• Face value (face validity)
- When an assessment or test appears to do what it claims to do.

NORMS AND BASIC STATISTICS FOR TESTING
• Descriptive statistics – methods used to provide a concise description of a collection of quantitative information.
- Mean, standard deviation, etc.
• Inferential statistics – methods used to make inferences from observations of a small group of people (a sample) to a larger group of individuals (a population).
- Practical, real-life applications of the results.

Properties of Scales
• Magnitude
- The property of "moreness."
- A scale has magnitude if, for example, one instance of the attribute represents more, less, or an equal amount of the given quantity than does another instance.
- Ex. ranges (1-10 none at all, 11-20 mild).
- Values can be rank ordered.
• Equal intervals
- The difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units.
- Scale units are equal.
- A scale has equal intervals when the relationship between the measured units and some outcome can be described by a straight line, or a linear equation of the form Y = a + bX.
• Absolute 0
- Obtained when nothing of the property being measured exists.
- For example, if you are measuring heart rate and observe that your patient has a rate of 0 and has died, then you would conclude that there is no heart rate at all.

Types of Scales
• Nominal scale
- Not really a scale; its only purpose is to name objects.
- Does not have any of the 3 properties.
• Ordinal scale
- Has magnitude.
- IQ tests are on an ordinal scale.
• Interval scale
- Has magnitude and equal intervals.
- Common example: temperature in degrees Fahrenheit.
- 35 degrees Fahrenheit is warmer than 32 degrees Fahrenheit.
- The difference between 90°F and 80°F is equal to a similar difference of 10 degrees at any point on the scale.
• Ratio scale
- Has all 3 properties.
- Ex. the Kelvin scale, which is based on absolute zero.

Describing Data
- A distribution may be defined as a set of test scores arrayed for recording or study.
- Raw scores – a straightforward, unmodified accounting of performance that is usually numerical.

Frequency Distributions
- Display scores on a variable or measure to reflect how frequently each value was obtained.
- Simple frequency distribution – indicates that individual scores have been used and the data have not been grouped.
- Grouped frequency distribution – test-score intervals (class intervals) replace the actual test scores.
➢ ex. 95-99; f = 5

Graphs
1. Histogram
2. Bar graph
3. Frequency polygon

Measures of Central Tendency
1. Arithmetic mean
- Equal to the sum of the observations (test scores) divided by the number of observations.
- Appropriate for: INTERVAL & RATIO data (when a normal distribution is assumed).
- For a frequency distribution: Σ(f·X) / N.
2. Median – arrange the scores in ascending or descending order; the middle score.
- Appropriate for: ORDINAL, INTERVAL, & RATIO data.
- Used when relatively few scores fall at the HIGH or LOW end of the distribution.
3. Mode – the most frequently occurring score.
- When adjacent scores occur equally often and more often than others, the mode is their AVERAGE.
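The three measures of central tendency above can be sketched with Python's standard library; this is an illustration with made-up scores, not material from the notes themselves.

```python
# Mean, median, and mode(s) of a small set of test scores.
from statistics import mean, median, multimode

scores = [85, 90, 90, 95, 99, 99, 100]

print(mean(scores))       # sum of the scores divided by their number -> 94
print(median(scores))     # middle score once sorted -> 95
print(multimode(scores))  # most frequently occurring score(s) -> [90, 99], a bimodal case
```

Note that `multimode` returns a list, which matches the notes' point that a distribution can be bimodal.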
- BIMODAL DISTRIBUTION – 2 modes.
- Appropriate for: NOMINAL data.
- Useful in analyses of a qualitative or verbal nature.

Measurement Scales in Psychology
- The ordinal level of measurement is the most frequently used in psychology.
- Kerlinger: "Intelligence, aptitude, and personality test scores are, basically and strictly speaking, ordinal."

Measure of Variability
• Variability – how scores in a distribution are scattered or dispersed.
• Two or more distributions of test scores can have the same mean even though the differences in the dispersion of scores around the mean are wide.

The range
- Equal to the difference between the highest and lowest scores.
- The simplest measure, but its potential use is limited.
➢ One extreme score can radically alter the value of the range.
➢ The resulting description of variation may be understated or overstated.
- Provides a quick but gross description of the spread of scores.

The average deviation
- Uses the absolute value of each deviation score: ignore the positive or negative sign and treat all deviation scores as positive.
- All deviation scores are then summed and divided by the total number of scores.

The interquartile and semi-interquartile ranges
- Distributions of test scores can be divided into four parts such that 25% of the test scores occur in each quarter.
- Q1, Q2 (the median), Q3.
- The interquartile range is equal to the difference between Q3 and Q1. Like the median, it is an ORDINAL statistic.
- SEMI-INTERQUARTILE RANGE: equal to the interquartile range divided by 2.
- Quartile: a specific point; quarters: intervals.
- An individual score may, for example, fall AT THE THIRD QUARTILE or IN THE THIRD QUARTER (but not "in" the third quartile or "at" the third quarter).
- Symmetrical distribution: Q1 and Q3 are at equal distances from the median.
➢ Skewness is the lack of symmetry.

The standard deviation
- A measure of variability equal to the square root of the average squared deviation about the mean.
- The square root of the variance.
➢ Variance – equal to the arithmetic mean of the squared differences between the scores in a distribution and their mean.
➢ Note: First, find the mean. 2nd, subtract the mean from each score (deviation). 3rd, square every deviation. Variance: add all the squared deviations, then divide by the total number of scores. SD: take the square root of the variance.
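The step-by-step recipe above translates directly into code; this is a minimal sketch with an illustrative score set, not code from the notes.

```python
# Population variance and SD, following the steps in the note above.
def population_variance(scores):
    m = sum(scores) / len(scores)           # 1st: find the mean
    deviations = [x - m for x in scores]    # 2nd: subtract the mean from each score
    squared = [d ** 2 for d in deviations]  # 3rd: square every deviation
    return sum(squared) / len(scores)       # variance = mean of the squared deviations

def population_sd(scores):
    return population_variance(scores) ** 0.5  # SD = square root of the variance

scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(scores))  # 4.0
print(population_sd(scores))        # 2.0
```

Dividing by N gives the population variance, matching the Σ(x − μ)²/N formula that follows; sample variance would divide by N − 1 instead.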
➢ Formula: σ² = Σ(x − μ)² / N
▪ Taking its square root gives the SD.

Skewness
- Symmetry is absent.
- Positive skew: most of the scores fall at the low end.
➢ May indicate the test was too difficult.
- Negative skew: most of the scores fall at the high end.
➢ May indicate the test was too easy.

Kurtosis
- The steepness of a distribution at its center.
➢ Platykurtic – relatively flat.
➢ Leptokurtic – relatively peaked.
➢ Mesokurtic – somewhere in the middle.

The Normal Curve
- Development:
➢ 18th century: Abraham DeMoivre
➢ Marquis de Laplace
➢ 19th century: Karl Friedrich Gauss
➢ "Laplace-Gaussian Curve"
➢ Karl Pearson adopted the name "Normal Curve" to be more inclusive of those who contributed.
- Theoretically, the normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its center. (An ideal curve.)
- From the center it tapers on both sides, approaching the X-axis asymptotically (it approaches, but never touches, the axis).
- Has two tails.
- In theory, the distribution of the normal curve ranges from negative infinity to positive infinity.

The area under the normal curve
- 50% of the scores occur above the mean and 50% of the scores occur below the mean.
- Approximately 34% of all scores occur between the mean and 1 SD above (or below) the mean.
- Approximately 68% of all scores occur within ±1 SD of the mean.
- Approximately 95% of all scores occur within ±2 SD of the mean.
- Approximately 99% of all scores occur within ±3 SD of the mean.
- General rule: the larger the sample size and the wider the range of abilities measured by a particular test, the more the graph of the test scores will approximate the normal curve.
- In terms of mental ability as operationalized by tests of intelligence, performance approximately 2 SD from the mean (i.e., IQ of 70-75 or lower, or IQ of 125-130 or higher) is one key element in identification.
➢ IQ tests: mean = 100, SD = 15.
➢ Success at life's tasks, or its absence, also plays a defining role, but the primary classifying feature of both gifted and retarded groups is intellectual deviance. These individuals are out of sync with more average people, simply by their difference from what is expected for their age and circumstances.

Describing Distributions
• Mean – average score.
• Standard deviation – an approximation of the average deviation around the mean.
- Square root of the variance.
- Square root of the average squared deviation around the mean.
- Though not the average deviation, it gives a useful approximation of how much a typical score is above or below the average score.
• Variance – the average squared deviation around the mean.

Standard Scores
• Z-score – transforms data into standardized units that are easier to interpret.
- Standard deviation = 1, Mean = 0.
- The difference between a score and the mean, divided by the SD.
- Equal to the difference between a particular raw score and the mean, divided by the standard deviation.
- A score equal to the mean has a Z-score of 0.
o Greater than the mean: the Z-score is positive.
o Less than the mean: the Z-score is negative.
• McCall's T (T-score)
- The same logic as standard scores (Z-scores).
- Standard deviation = 10, Mean = 50.
- Obtained by transforming Z-scores into T-scores.
- Advantage: none of the scores is negative.
• Stanine (standard nine) – converts any set of scores into a transformed scale that ranges from 1 to 9.
- Developed in the US Air Force during WWII.
- Mean = 5, SD = approximately 2.
- Usually used with achievement tests (elementary and secondary schools).
- Five steps to go from raw scores to stanines:
➢ Find the mean of the raw scores.
➢ Find the SD of the raw scores.
➢ Change the raw scores to Z-scores.
➢ Change the Z-scores to percentiles (use the "areas of the standard normal distribution").
➢ Use the table below to convert percentiles into stanines.
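The Z-score and T-score definitions above can be sketched as two small functions; the raw scores here are made-up illustration data.

```python
# Raw score -> Z-score -> McCall T-score, per the definitions above.
def z_score(x, mean, sd):
    return (x - mean) / sd                 # difference from the mean, divided by the SD

def t_score(x, mean, sd):
    return 50 + 10 * z_score(x, mean, sd)  # T-scale: mean 50, SD 10

print(z_score(115, 100, 15))  # 1.0  (one SD above the mean -> positive Z)
print(z_score(100, 100, 15))  # 0.0  (score equal to the mean -> Z of 0)
print(t_score(85, 100, 15))   # 40.0 (Z of -1 maps to T of 40, still positive)
```

The last line shows the T-score's stated advantage: a below-average score yields a negative Z but a positive T.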
- 5th stanine = performance in the average range, from ¼ SD below the mean to ¼ SD above the mean; it captures the middle 20% of the scores in a normal distribution.
- The 4th and 6th stanines are also ½ SD wide and capture the 17% of cases below and above (respectively) the 5th stanine.

Percentage of Cases | Percentiles | Stanine
 4 | 1-4    | 1 (bottom 4 percent)
 7 | 5-11   | 2
12 | 12-23  | 3
17 | 24-40  | 4
20 | 41-60  | 5
17 | 61-77  | 6
12 | 78-89  | 7
 7 | 90-96  | 8
 4 | 97-100 | 9 (top 4 percent)

Standardized measures of rank
• Quartiles – points that divide the frequency distribution into equal fourths.
- 1st quartile = 25th percentile.
- 2nd quartile = median / 50th percentile.
- 3rd quartile = 75th percentile.
• Deciles – use points that mark 10% rather than 25% intervals.
- Top decile (D9) = the point below which 90% of the cases fall.
- D8 = 80th percentile.
• Percentile ranks
- Replace simple ranks when we want to adjust for the number of scores in a group.
- To calculate:
➢ Determine how many cases fall below the score of interest.
➢ Determine how many cases are in the group.
➢ Divide the number of cases below the score of interest (step 1) by the total number of cases in the group (step 2).
➢ Multiply the result of step 3 by 100.
• Percentiles
- Are the specific scores or points within a distribution.
- The total frequency for a set of observations divided into hundredths.
- Indicate the particular score below which a defined percentage of scores falls.

Standard Scores for the SAT and GRE
- Scholastic Aptitude Test (SAT)
- Graduate Record Examination (GRE)
➢ Mean: 500, SD: 100

Deviation IQ
- Mean: 100, SD: 15
- The typical mean and SD for IQ tests result in approximately 95% (±2 SD) of deviation IQs ranging from 70 to 130.
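The four percentile-rank steps listed earlier in this section can be sketched as one small function; the score list is made-up illustration data.

```python
# Percentile rank: the four steps from the notes, in order.
def percentile_rank(scores, score_of_interest):
    below = sum(1 for s in scores if s < score_of_interest)  # step 1: cases below
    total = len(scores)                                      # step 2: cases in the group
    return below / total * 100                               # steps 3-4: divide, then x100

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(scores, 80))  # 50.0 -> five of the ten scores fall below 80
```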
Linear transformation
- A standard score obtained by linear transformation is one that retains a direct numerical relationship to the original raw score.
➢ Note: When you standardize normally distributed data, the raw scores' mean and deviations will most probably parallel the standard scores (a direct numerical relationship). This is called a linear transformation.

Nonlinear transformation
- May be required when the data under consideration are not normally distributed, yet comparisons with a normal distribution need to be made.
➢ Note: When you standardize data that are not normally distributed, you get a nonlinear transformation, wherein the raw scores' mean and deviation do not parallel the standardized scores. There is no direct numerical relationship.

Normalized standard scores
- The method used when test developers have a skewed distribution, despite having large samples, and aim to have a normal distribution.
- Involves "stretching" the skewed curve into the shape of the normal curve and creating a corresponding scale of standard scores.
- Formula: Y = aX + b.

Concepts (2)
• Norm-referenced test
- Compares each person with a norm.
• Criterion-referenced test
- These tests do not compare students with one another; they compare each student's performance with a criterion or expected level of performance.
- Describes the specific types of skills, tasks, or knowledge that the test taker can demonstrate, such as mathematical skills.
- The criterion-referenced testing movement emphasizes the diagnostic use of tests – that is, using them to identify problems that can be remedied.

CORRELATION AND REGRESSION

Correlation coefficient – a mathematical index that describes the direction and magnitude of a relationship.
• Positive correlation
- High scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X.
- When Y increases/decreases, X also increases/decreases.
• Negative correlation
- Higher scores on Y are associated with lower scores on X, and lower scores on Y are associated with higher scores on X.
- When Y increases, X decreases; when Y decreases, X increases.
• No correlation
- The variables are not related.

Regression – a technique related to correlation, used to make predictions about scores on one variable from knowledge of scores on another variable.
- Regression line – where the predictions are obtained from.
➢ Defined as the best-fitting straight line through a set of points in a scatter diagram.
➢ It is found by using the principle of least squares, which minimizes the squared deviations around the regression line.
▪ Mean – the point of least squares for any single variable. This means that the sum of the squared deviations around the mean will be less than it is around any value other than the mean.

Pearson product moment correlation coefficient
- A ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable.
- The correlation coefficient can take on any value from -1.0 to 1.0.
- Commonly used; most often we want to find the correlation between two continuous variables (height, weight, intelligence).
- A PARAMETRIC MEASURE (Pearson r).

• Spearman's rho
- Used for finding the association between two sets of ranks.
- Easy to calculate; often used when the individuals in a sample can be ranked on two variables but their actual scores are not known or do not have a normal distribution.
- A popular method for correlating unvalidated surveys, instruments, or Likert-type survey responses.
- A NON-PARAMETRIC MEASURE.
➢ Note – main difference between Spearman's rho and Pearson r: Spearman's rho is a non-parametric measure that evaluates a monotonic relationship (as one variable increases, the other consistently increases or consistently decreases) and gives its strength and direction, from -1 (perfect negative correlation) to 1 (perfect positive correlation). Pearson r is a parametric measure that assesses a linear correlation and its strength and direction: 1 (positive linear correlation), -1 (negative linear correlation), 0 (no correlation).

• Biserial correlation
- The relationship between a continuous variable and an artificial dichotomous variable.
- For example: passing or failing the bar examination (artificial dichotomous variable) and grade point average in law school (continuous variable).
➢ Note: Continuous variables are typically obtained by measuring and can assume any value within a certain range, e.g., ratio and interval variables (weight and temperature).
• Point biserial correlation
- A continuous variable and a true dichotomous variable (e.g., gender).
- For example: the relationship between gender and GPA.
• Phi coefficient
- Both variables are dichotomous, and at least one is "true."
- For example: the relationship between passing or failing the bar examination and gender.
• Tetrachoric correlation
- Both dichotomous variables are artificial.

Terms and Issues in the Use of Correlation
• Residual
- The difference between the predicted and observed values.
- Y − Y′ (residual = actual Y value − predicted Y value).
➢ The regression equation gives a predicted value of Y for each value of X.
➢ In addition to these predicted values, there are observed values of Y.
- Interpretation: a negative residual means that the predicted value is too high; similarly, a positive residual means that the predicted value was too low.
• Standard error of estimate
- A measure of the accuracy of prediction.
- Prediction is most accurate when the standard error of estimate is relatively small.
- As it becomes larger, the prediction becomes less accurate.
• Coefficient of determination
- The correlation coefficient squared.
- This value tells us the proportion of the total variation in scores on Y that we know as a function of information about X.
- For example, if the correlation between SAT score and 1st-year college performance is .40, then the coefficient of determination is .16.
➢ .40² = .16.
➢ This means that we can explain 16% of the variation in 1st-year college performance by knowing SAT scores.
• Coefficient of alienation
- A measure of non-association between two variables.
- Computation: the square root of 1 minus the coefficient of determination, √(1 − r²).
- This is the proportion of variance not shared between the variables – the unexplained variance between the variables.
- A high coefficient of alienation indicates that the two variables share very little variance in common (non-association).
• Shrinkage
- The amount of decrease observed when a regression equation is created for one population and then applied to another.
- One problem with regression analysis is that it takes advantage of chance relationships within a particular sample of subjects.
➢ There is a tendency to overestimate the relationship, particularly if the sample of subjects is small.
• Cross validation
- The best way to ensure that proper inferences are being made.
➢ Use the regression equation to predict performance in a group of subjects other than the one to which the equation was applied.
➢ Then a standard error of estimate can be obtained for the relationship between the values predicted by the equation and the values actually observed.
• Multivariate analysis
- A broad term that refers to the analysis of multiple variables simultaneously.
- The multiple variables can include both dependent and independent variables.
- Considers the relationships among combinations of three or more variables.
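The coefficient of determination and coefficient of alienation described above follow mechanically from r; this sketch mirrors the SAT example from the notes.

```python
# Determination (r^2) and alienation (sqrt(1 - r^2)) from a correlation r.
def coefficient_of_determination(r):
    return r ** 2               # proportion of variance in Y explained by X

def coefficient_of_alienation(r):
    return (1 - r ** 2) ** 0.5  # proportion of non-association

r = 0.40  # correlation between SAT score and 1st-year college performance
print(round(coefficient_of_determination(r), 2))  # 0.16 -> 16% of variation explained
print(round(coefficient_of_alienation(r), 3))     # 0.917 -> most variance is unshared
```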
- For example: the prediction of success in the 1st year of college from the linear combination of SAT verbal and math scores.
• Discriminant analysis (multivariate method)
- Used when the task is to find the linear combination of variables that provides maximum discrimination between categories.
- For example: attempting to determine whether a set of measures predicts success or failure on a particular performance evaluation.
➢ Note: For example, a test battery includes a personality test and an intelligence test (IVs; predictors), and the DV is successful vs. non-successful. If there is only a small difference between the intelligence-test scores of the successful and non-successful groups, intelligence lacks discriminant evidence: because the IQs of the two groups do not differ significantly, IQ does not predict whether an individual will become successful in the job.
• Multiple regression analysis (multivariate method)
- Specifically deals with multiple independent variables influencing a single dependent variable. It quantifies the extent to which each independent variable contributes to the variability in the dependent variable.

Discriminant analysis and multiple regression analysis find linear combinations of variables that maximize the prediction of some criterion. Factor analysis is used to study the interrelationships among a set of variables without reference to a criterion.
Dependent variable: also known as the "criterion" or "outcome" variable.
Independent variable: also known as the "predictor" or "explanatory" variable.

Standard error of measurement / standard error of score
- A measure of the precision of an observed test score.
- Estimates the extent to which an observed test score deviates from a true score.
- An INDEX of the extent to which one individual's scores vary over tests presumed to be parallel.
- Formula:
➢ The SD multiplied by the square root of 1 minus the reliability coefficient.
- The SEM functions like a SD: it predicts what would happen if an individual took additional equivalent tests.
➢ 68% = ±1 SEM
➢ 95% = ±2 SEM
➢ 99% = ±3 SEM
➢ If the SD is held constant: the smaller the SEM, the higher the reliability coefficient; the test is more reliable.
- Useful in establishing what is called a confidence interval: a range or band of test scores that is likely to contain the true score.

Standard error of the difference
- Aids a test user in determining how large a difference should be before it is considered statistically significant.
- Addresses these questions:
➢ How did this individual's performance on test 1 compare with his or her performance on test 2?
➢ How did this individual's performance on test 1 compare with someone else's performance on test 1?
➢ How did this individual's performance on test 1 compare with someone else's performance on test 2?
- Formula:
➢ The square root of the sum of the squared standard errors of measurement of the first and second tests.
➢ Or: the SD multiplied by the square root of (2 minus the reliability coefficient of the 1st test minus the reliability coefficient of the 2nd test).

Reliability
• Reliability coefficient
- The ratio of the variance of the true scores on a test to the variance of the observed scores.
• Test reliability
- Usually estimated in one of three ways.
1. Test-retest method
➢ Consistency when the test is administered on different occasions.
2. Parallel forms method
➢ We evaluate the test across different forms of the test.
3. Internal consistency method
➢ We examine how people perform on similar subsets of items selected from the same form of the measure.
• Test-retest reliability
- Estimates are used to evaluate the error associated with administering a test at two different times.
- This type of analysis is of value only when we measure traits or characteristics that do not change over time.
- Tests that measure constantly changing characteristics are not appropriate for test-retest evaluation (e.g., the Rorschach Inkblot Test).
• Carryover effect
- Occurs when the 1st session influences scores from the 2nd session.
- When there are carryover effects, the test-retest correlation usually overestimates the true reliability.
• Practice effects
- A type of carryover effect.
- Some skills improve with practice.
• Parallel forms reliability
- Compares two equivalent forms of a test that measure the same attribute.
- The forms use different items but have the same level of difficulty.
• Split-half reliability
- Estimates internal consistency.
- The test is given and divided into halves that are scored separately.
- The results of one half are compared with the results of the other.
- If the items get progressively more difficult, then you might be better advised to use the odd-even system.

Test scores gain reliability as the number of items increases.

An estimate of reliability based on two half-tests would be deflated, because each half would be less reliable than the whole test. The correlation between the two halves of the test would be a reasonable estimate of the reliability of a half-test only.
➢ Note: The reliability of the WHOLE TEST is higher than that of the two halves, since the whole test has more items.

• Spearman-Brown (Correction) Formula
- Corrects for the half-length test.
- Allows you to estimate what the correlation between the 2 halves would have been if each half had been the length of the whole test.
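The Spearman-Brown correction for a split-half estimate can be sketched as follows; the half-test correlation used here is an illustrative value, not one from the notes.

```python
# Spearman-Brown: full-test reliability from the half-test correlation,
# r_full = 2r / (1 + r) for a test doubled in length.
def spearman_brown(half_test_r):
    return 2 * half_test_r / (1 + half_test_r)

print(round(spearman_brown(0.6), 2))  # 0.75 -> the whole test is more reliable
```

The corrected value is always at least as large as the half-test correlation, which is exactly the deflation the correction is meant to undo.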
- Estimate: the corrected correlation between the 2 halves = the reliability of the whole test.
- Using it is not always advisable.

• Cronbach's coefficient alpha (α) [1951]
- Used when the two halves of a test have unequal variances, and on tests containing non-dichotomous items.
- This general reliability coefficient provides the lowest estimate of reliability that one can expect.
➢ Note: This lets researchers know how much an instrument is NOT reliable, so they can apply various methods to increase its reliability. It is much better to know how UNRELIABLE the instrument is than to overestimate its reliability.
- Estimates internal consistency.
➢ Items are not scored as 0 or 1 (right or wrong).
➢ A more general method of finding a reliability estimate through internal consistency.
➢ Formula components: N = number of items; S = the sum of the total scores for all items; Si = the sum of the item scores for each item.

Sources of measurement error and methods of reliability assessment

• Kuder-Richardson 20 (KR20 / KR-20)
- Estimates internal consistency.
- Used for calculating the reliability of a test when items are dichotomous.
- Items are scored 0 or 1 (usually right or wrong) – not a Likert scale.
➢ Note: KR-20 is used for the inter-item consistency of dichotomous items (intelligence tests, personality tests with yes/no options, multiple choice) with unequal variances. KR-21 is used if all the items have the same degree of difficulty (speed tests), with equal variances and dichotomous scoring.

• Kappa Statistic (introduced by J. Cohen, 1960)
- A measure of agreement between 2 judges who each rate a set of objects using NOMINAL SCALES.
- The best method for assessing the level of agreement among several observers.
- Values:
➢ 1 (perfect agreement)
➢ -1 (less agreement than can be expected on the basis of chance alone)
➢ A value greater than .75 generally indicates "excellent" agreement.
➢ A value between .40 and .75 indicates "fair to good" (satisfactory) agreement.
➢ A value less than .40 indicates "poor" agreement.

It has been suggested that reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research. For a test used to make a decision that affects some person's future, evaluators should attempt to find a test with a reliability greater than .95 (e.g., medical researchers).

• Spearman-Brown prophecy formula
- Can estimate how many items will have to be added in order to bring a test to an acceptable level of reliability.
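KR-20, described above for dichotomously scored items, can be sketched with a tiny made-up item matrix (rows = examinees, columns = items); the data are mine, purely for illustration.

```python
# KR-20 = (k / (k-1)) * (1 - sum(p*q) / var_total) for 0/1-scored items.
def kr20(item_matrix):
    k = len(item_matrix[0])                         # number of items
    totals = [sum(row) for row in item_matrix]      # each examinee's total score
    n = len(totals)
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # variance of total scores
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion passing item j
        pq += p * (1 - p)                           # q = 1 - p
    return (k / (k - 1)) * (1 - pq / var_t)

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3))  # 0.8
```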

Inter-item reliability (internal consistency)
- Error source: item sampling.
- Used when tests are administered once.
- Consistency among the items within the test.
- Measures the internal consistency of the test: the degree to which each item measures the same construct.
- If all items measure the same construct, then the test has good internal consistency.
- Not suited to measuring unstable traits.
- Useful in assessing homogeneity.
- Homogeneity: the test contains items that measure a single trait (unifactorial).
- Heterogeneity: the degree to which a test measures different factors (more than one factor/trait).
- More homogeneous = higher inter-item consistency.

To ensure that the items measure the same thing (methods for dealing with low reliability):
• Factor Analysis
- Tests are most reliable if they are unidimensional.
➢ One factor should account for considerably more of the variance than any other factor. Items that do not load on this factor might be best omitted.
➢ Note: A test is unidimensional when it measures a single underlying construct or factor; for example, each of the Big Five (OCEAN) factors is correlated with personality.
• Discriminability Analysis
- Examines the correlation between each item and the total score for the test.
- If the correlation between the score on a single item and the total score is low, the item is probably measuring something else.
➢ It may be so easy or so hard that people do not differ in their responses.
➢ It drags down the estimate of reliability and should be excluded.
• Correction for attenuation
- Estimates what the correlation between tests would have been if there had been no measurement error.

Validity
- The agreement between a test score or measure and the quality it is believed to measure.
Validity
- The agreement between a test score or measure and the quality it is believed to measure.
- Sometimes defined as the answer to the question, "does the test measure what it is supposed to measure?"
• Face validity
- Not recognized as a legitimate category because it is not technically a form of validity.
- The mere appearance that a measure has validity.
- If the items seem to be reasonably related to the perceived purpose of the test.
- Not a validity at all because it does not offer evidence to support conclusions drawn from test scores.
• Content-related evidence for validity
- Considers the adequacy of representation of the conceptual domain the test is designed to cover.
- Considers how the content is related, or to what extent the content represents the variable that the test is designed to measure.
- Validity evidence that is not something separate from other types; the boundaries between content-related and other types of evidence for validity are not clearly defined.
- Unique feature: logical rather than statistical (a feature otherwise found only in face validity).
• Construct underrepresentation
- Describes the failure to capture important components of a construct.
  Commented [AD12]: For example, if a test of mathematical knowledge included algebra but not geometry, the validity of the test would be threatened by construct underrepresentation.
• Construct-irrelevant variance
- Occurs when scores are influenced by factors irrelevant to the construct.
  Commented [AD13]: For example, a test of intelligence might be influenced by reading comprehension, test anxiety, or illness.
• Criterion validity evidence
- Tells us just how well a test corresponds with a particular criterion.
- The criterion is the standard against which the test is compared.
• Predictive validity evidence
- A form of criterion validity evidence that reflects the forecasting function of tests.
  Commented [AD14]: For example, the SAT serves as predictive validity evidence for a college admissions test if it accurately forecasts how well high-school students will do in their college studies. The SAT, including its quantitative and verbal subtests, is the predictor variable, and the college grade point average (GPA) is the criterion. The purpose of the test is to predict the likelihood of succeeding on the criterion, that is, achieving a high GPA in college.
• Concurrent related evidence for validity
- Another type of evidence for validity.
- Comes from assessments of the simultaneous relationship between the test and the criterion.
  Commented [AD15]: Example: between a learning disability test and school performance. Test measures and criterion measures are taken at the same time because the test is designed to explain why the person is now having difficulty in school.

The Strong-Campbell Interest Inventory (SCII) uses as criteria patterns of interest among people who are satisfied with their careers (Campbell, 1977). Then the patterns of interest for people taking the tests before they have chosen an occupation are matched to the patterns of interest among people who are happy in various occupations.
- It has concurrent validity.
- If the pattern of interest of those people who have not yet chosen an occupation did not match the pattern of interest of the people who are satisfied with their career, then we can assume why they have not yet chosen an occupation: it could be a problem with their pattern of interest that is why they have not yet decided on what career to pursue.

Validity coefficient
- The relationship between a test and a criterion.
- This coefficient tells the extent to which the test is valid for making statements about the criterion.
- In practice, one rarely sees a validity coefficient larger than .60, and validity coefficients in the range of .30 to .40 are commonly considered high.
• Cross validation (study)
- A good validity study assesses how well the test actually forecasts performance for an independent group of subjects.
- The initial validity study assesses the relationship between the test and the criterion, whereas the cross validation study checks how well this relationship holds for an independent group of subjects.
- The larger the sample size in the initial study, the better the likelihood that the relationship will cross validate.
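The cross validation idea above can be sketched numerically. Everything below (data and names) is invented for illustration: a validity coefficient is estimated as a plain Pearson correlation in an initial sample and then checked against an independent holdout sample.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Initial validity study: test scores vs. a criterion (e.g., GPA). Invented data.
initial_test      = [10, 12, 14, 15, 18, 20, 22, 25]
initial_criterion = [1.9, 2.1, 2.6, 2.4, 3.0, 3.1, 3.4, 3.6]

# Independent cross-validation sample.
holdout_test      = [11, 13, 16, 19, 21, 24]
holdout_criterion = [2.0, 2.3, 2.5, 2.9, 3.2, 3.3]

validity_coefficient = pearson(initial_test, initial_criterion)
cross_validity       = pearson(holdout_test, holdout_criterion)
# The relationship "cross validates" if cross_validity stays close to the
# coefficient found in the initial study.
```

With real data, shrinkage is expected: the holdout coefficient is usually somewhat lower than the one from the initial sample, which is why the check matters.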
• Generalizability
- Evidence that the findings obtained in one situation can be generalized or applied to other situations.
• Construct
- Defined as something built by mental synthesis.
  Commented [AD16]: Synthesis is the process of combining ideas into a congruous object of thought.
- As a construct, intelligence does not exist as a separate thing we can touch or feel, so it cannot be used as an objective criterion.
• Construct validity evidence
- Established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it.
- This process is required when "no criterion or universe of content is accepted as entirely adequate to define the quality to be measured."
  Commented [AD17]: Here you define the construct, since there is no existing universal definition for it, and then develop an instrument to measure the validity of the definition you have given to the construct.
• Convergent evidence for validity
- When a measure correlates well with other tests believed to measure the same construct.
- This sort of evidence shows that measures of the same construct converge, or narrow in, on the same thing.
• Discriminant evidence (Divergent validation)
- Uniqueness from other tests/studies.
- If a health index measures the same thing that self-ratings of health, symptoms, and chronic medical conditions measure, then why do we need it in addition to all these other measures?
- The answer is that the index taps something other than the tests used in the convergent evidence studies.
- To demonstrate discriminant evidence for validity, a test should have low correlations with measures of unrelated constructs, or evidence for what the test does not measure.

Writing and Evaluating Test Items

Item Formats

• Dichotomous format
- Offers two alternatives for each item.
- Ex. true-false examination.
• Polytomous format (polychotomous)
- More than two alternatives.
- Ex. multiple choice examination.
- Incorrect choices are called distractors.
- Correction for guessing:
• Likert format
- Respondents are required to indicate the degree of agreement with a particular attitudinal question.
  ➢ Popular format for attitude and personality scales.
• Category format
- Similar to the Likert format, but uses an even greater number of choices.
- e.g., a 1-10 scale.
• Visual analogue scale
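The notes mention "Correction for guessing" under the polytomous format but leave the formula itself out. A sketch of the conventional formula (not taken from these notes, so treat it as an assumption) subtracts a penalty for wrong answers based on the number of alternatives:

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Conventional correction-for-guessing formula: R - W / (k - 1).

    num_right   : items answered correctly (R)
    num_wrong   : items answered incorrectly (W); omitted items are not counted
    num_choices : alternatives per item (k)
    """
    return num_right - num_wrong / (num_choices - 1)

# Hypothetical case: 40 right, 12 wrong, 8 omitted on a four-choice test.
score = corrected_score(40, 12, 4)   # 40 - 12 / 3 = 36.0
```

The penalty W / (k - 1) equals the number of items an examinee would be expected to get right by blind guessing, so pure guessing nets an expected corrected score of zero on those items.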
Wechsler adult intelligence scale-IV (WAIS-IV)

- Widely used intelligence test designed to assess cognitive abilities in adults aged 16 to 90+. It measures various cognitive domains, including verbal comprehension, perceptual reasoning, working memory, and processing speed.

Use: The WAIS-IV is used for a variety of purposes, including clinical assessment and diagnosis of intellectual disabilities, neuropsychological evaluation, educational planning, vocational assessment, and research. (ADAPTIVE TEST)

Wechsler intelligence scale for children-IV (WISC-IV)

- The WISC-IV is a widely used intelligence test designed to assess cognitive abilities in children aged 6 to 16 years old. It measures various cognitive domains, including verbal comprehension, perceptual reasoning, working memory, and processing speed.

Use: The WISC-IV is used for a variety of purposes, including educational assessment, clinical evaluation of intellectual abilities, identification of learning disabilities or developmental disorders, and research.

SB-5 vs WISC-IV

1. Both tests were published in 2003, are individually administered in about 1 hour, and yield a full-scale IQ composite score based on 10 subtests.
2. The WISC-IV has 5 supplemental tests (30 mins administration); the SB-5 has none.
3. The SB-5 has short forms (with 2 subtests); the WISC-IV has none.
4. Both tests contain child-friendly materials.
5. Both have optional software available for scoring and report writing.
6. The norming sample for test takers ages 6-16 was 2,200 for both tests.
7. The WISC-IV included parent education; the SB-5 did not.
8. The SB-5 included socioeconomic status; the WISC-IV did not.
9. Both are aligned with the CHC model of intelligence.

Wechsler preschool and primary scale of intelligence-III (WPPSI-III)

- Age Range: The WPPSI-III is designed to assess the intelligence and cognitive abilities of children aged 2 years 6 months to 7 years 3 months.

Purpose: The primary purpose of the WPPSI-III is to assess cognitive functioning in young children, including their intellectual abilities, strengths, and weaknesses.

Use: The WPPSI-III is used for a variety of purposes, including educational assessment, clinical evaluation of intellectual abilities, identification of learning disabilities or developmental disorders, and research.

Parent and Teacher Rating Scales: In addition to the direct assessment of the child's cognitive abilities, the WPPSI-III includes parent and teacher rating scales to gather information about the child's behavior, social skills, and adaptive functioning in everyday settings.

Adaptive Behavior Assessment: The WPPSI-III includes an optional measure of adaptive behavior, which assesses the child's ability to function independently in daily activities such as self-care, communication, and social interaction. This provides a more comprehensive understanding of the child's overall functioning.

Wechsler abbreviated scale of intelligence (WASI)

- Short-form intelligence test designed for individuals aged 6 to 89 years old.

Purpose: The WASI is designed to provide a quick and reliable estimate of an individual's intellectual abilities. It is often used in clinical and research settings when time constraints or other factors make administration of a full-scale intelligence test impractical.

Administration: The WASI can be administered individually or in a group setting by trained professionals, such as psychologists or educational specialists. Administration typically takes 30 to 45 minutes to complete.
