Tests and Measurements Presentation

Introduction to Testing
and Measurement
Testing:
Basic Definitions
Assessment - process of documenting
knowledge, skills, attitudes, and/or
beliefs
Evaluation - the making of a
judgment about the amount,
number, or value
Measurement - quantitative (involves
assigning numbers)
Testing - form of measurement
Basic Definitions
(Continued)
Reliability - the consistency of a
measure
Validity - a test is valid to the degree
that it accomplishes its purpose
Objectivity - the degree to which two or
more reasonable persons, given a scoring
key, will agree on a score
Basic Statistics
Mean, Median, and
Standard Deviation
Mean
(Arithmetic average - the sum of the scores
divided by the count)
Advantages
Calculation includes all scores
Indicates typical score for
group
Disadvantages
Easily distorted by extreme
scores
Median
(Midpoint - place the numbers in value
order and find the middle number)
Advantages
Not easily distorted by
extremely high or low scores
Disadvantages
Does not take into account the
value of all the scores in the
group
Mean or median?
Rule of Thumb
Use the median when extremely
high or low scores (outliers)
are present;
use the mean for most other
situations
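The rule of thumb can be illustrated with a small sketch. The scores below are hypothetical and chosen only to show how one extreme score pulls the mean but not the median.

```python
from statistics import mean, median

# Hypothetical quiz scores for a small group; 5 is an extremely low outlier.
scores = [5, 78, 80, 82, 85]

group_mean = mean(scores)      # sum of the scores divided by the count
group_median = median(scores)  # middle value once the scores are sorted

print(f"mean = {group_mean}, median = {group_median}")
```

Here the single outlier drags the mean well below what most of the group scored, while the median stays at the middle score, so the median is the better summary for this group.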
Standard Deviation
Indicates by how much the
scores in a distribution typically
deviate from the mean
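In a normal distribution, the mean is the
50th percentile of the norm group (half of
the scores fall below it),
about 68% of scores fall within 1 SD above
or below the mean,
about 95% within 2 SD above or below
the mean,
about 99.7% within 3 SD above or below
the mean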
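The percentages above can be checked against the normal curve itself. This is a minimal sketch using Python's standard library; the function name share_within is an illustrative choice, not something from the presentation.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")

# Half of the distribution lies below the mean.
print(f"below the mean: {standard_normal.cdf(0):.0%}")
```

The figures hold only to the degree that the scores are normally distributed; real score distributions may depart from them.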
Normal Curve - Properties

Symmetrical, bell-shaped
Total area under the curve represents total
number of scores in the distribution
Vertical lines mark sub-areas and represent
proportions of scores falling in a particular
range
Points along baseline correspond to
standard deviations away from the mean
Testing and Measurement

Validity & Reliability
Validity of Test Scores

The extent to which the scores
on the test are representative of
what you are trying to measure
Example - Does the science test
measure only the knowledge of
science, or is it dependent on
reading ability and therefore
measuring science and reading
ability?
Types of Validity
Content Validity
Determined by the degree to
which the questions or items are
representative of the universe of
behavior the test was designed to
sample (does the test assess what
it claims to assess?)
Criterion-Related Validity
Determined by whether there is a
relationship between a test and an
immediate criterion measure
example - a driving test,
employment
Factors That Can Reduce Validity
Factors in the Test
Vague Directions
Irrelevant Items
Poorly Constructed Items
Items that Contain Clues to
the Correct Answer
Too Few or Improperly
Sequenced Items
What Affects Validity

(Continued)
Factors in Test Administration and
Scoring
Insufficient Time to Complete
the Test
Testing Environment
Undetected Cheating
Inappropriate Help or Coaching
Poorly Motivated Students
Unreliable Item Scoring
What Affects Validity

(Continued)
Factors Affecting Pupil
Responses
High Level of Fear or
Anxiety About Taking the
Test
A Tendency to Rush
Through the Test
Guessing
Reliability of Test Scores

Consistency
Measure of confidence that if
same individuals were retested
under similar conditions that
the results could be replicated
Types of Reliability
Test-Retest: Coefficient of Stability
Alternate Form: Coefficient of
Equivalence
Internal Consistency: Consistency of
examinee across test items
Interrater Reliability: Consistency of
judges or scorers
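A test-retest coefficient of stability is commonly computed as the correlation between scores from two administrations of the same test. The sketch below assumes a Pearson correlation and uses hypothetical scores for five students; the presentation does not specify a formula.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(variation_x * variation_y)

# Hypothetical scores for the same five students on two testing occasions.
first_testing = [70, 82, 90, 65, 88]
second_testing = [72, 80, 93, 68, 85]

stability = pearson_r(first_testing, second_testing)
print(f"coefficient of stability = {stability:.2f}")
```

The same function could be applied to scores on two alternate forms (coefficient of equivalence) or to the ratings of two scorers (interrater reliability).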
Reliability
General Guidelines
Test scores used for decisions
about individuals require a
much higher degree of
reliability than those for making
decisions about groups.
Higher reliability coefficients
are essential if decisions based
on test scores have long term
consequences.
Reliability
General Guidelines
(Continued)
Lower reliability coefficients are
tolerable if decisions are
reversible or have only a
temporary impact.
Reliability coefficients for
standardized tests should be .90
or higher
Reliability coefficients are
influenced by many factors.
How to Increase
Reliability
Use objective tests
Use a more heterogeneous
group
Make sure the difficulty level is
appropriate for the individuals
being tested
Increase the number of items
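The effect of adding items can be estimated with the Spearman-Brown prophecy formula. The presentation does not name this formula; it is introduced here as a standard approximation, and it assumes the added items are comparable in quality to the existing ones. The starting reliability of .70 is a hypothetical value.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability after changing test length.

    length_factor is new length / old length (2 means twice as many items).
    Assumes the added items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current_reliability = 0.70  # hypothetical coefficient for the existing test
doubled = spearman_brown(current_reliability, 2)
print(f"estimated reliability with twice as many items: {doubled:.2f}")
```

The gain shrinks as the test gets longer, so adding items helps most when the test is short to begin with.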
Reliability vs. Validity

Reliability means that test-takers
will get the same score on
multiple takes (within reason, of
course).
Validity means measuring what it
is supposed to measure
Reliability doesn't necessarily
equate to validity:
A test can be reliable without being
valid.
However, a test cannot be valid
unless it is reliable.
Types of Tests
Standardized Tests:
Norm-Referenced and
Criterion-Referenced Tests
Standardized Test
Administered and scored in a
consistent, or "standard", manner
Designed in such a way that the
questions, conditions for
administering, scoring procedures,
and interpretations are consistent
Not necessarily high-stakes,
time-limited, or multiple-choice
Standardized Testing
Benefits
Objectivity
Evidence of validity or reliability of
results
Ability to compare across students,
schools, states, etc.
Ease of administration and scoring
Efficiency (group testing)
Developed over time and
supported with data and research
Standardized Testing
Possible issues
Can only sample a portion of the
domain
May not match school curriculum
May not answer relevant questions
Interpretations may not be relevant
for all populations
Extraneous factors may prevent a
good measure of the student's
ability
May not be available for some
constructs/concepts
Base test type on the
decision to be made
Norm-Referenced: Level of
achievement compared to other
students
Criterion-Referenced: Level of
achievement compared to external
criterion
Norm-Referenced Scores
Based on the normal curve
Reflects student performance
compared to other similar students
Shows relative strengths and
weaknesses
Are not standards of what should
be - only indicators of what is
Examples: CogAT, Iowa, NNAT, WISC,
Stanford, Terra Nova
Norms
A set standard of development
or achievement usually derived
from the average or median
achievement of a large group
Used to compare one student's
results to those of a large sample
of students:
National norms - based on a large
sample from across the nation
Local norms - based on a large
sample from local schools within a
city, district, state, etc.
Norms
(Continued)
Indicate what the current reality
is
Are not standards, or indicators of
what should be
Derived by assessing students
thought to be typical
For mental ability scores, use
student age norms
For achievement scores, use
student grade norms
Good Norms are

Recent
When outdated norms are used, results can be
misleading. Norms change every 5-7 years. (Tests
with norms over 10 years old are not used for
gifted evaluation in Cobb County.)
Representative
Because participation in the norm group is
voluntary, norm groups might not be
representative.
Relevant
The normal students used to establish the
norms may not have been provided a normal
instructional program.
Norm Referenced Tests (NRT)

Appropriate Uses
Used to compare student
performance with a large, usually
national or international,
sample of similar students
Used to make relative
comparisons among schools or
school systems to a national
sample
Criterion-Referenced Tests
Allow inferences about:
a curricular domain of skills and
knowledge (e.g. the CCGPS, state
standards)
a cognitive domain of skill (e.g.
reading comprehension,
math computation)
standing with respect to a judgmental
criterion
Examples: CRCT (Criterion Referenced Competency Test),
EOCT (End of Course Test),
Georgia Milestones
Criterion Referenced Tests (CRT)
Appropriate Uses
To make instructional decisions
about individual students
To make placement decisions about
students, along with other
information
To make evaluative (formative and
summative) decisions about
programs
To make decisions about the
curriculum
Types of Scores
NRTs & CRTs
Raw Scores
Actual number of points
received on test
For example, 25 correct answers
out of 30 questions equals a raw
score of 25
Have not been "cooked" in the
cauldron of statistics
Standard Scores
Raw scores converted to new
scale
Can be used to make direct
comparisons among classes,
schools, or districts
Can be misinterpreted because the
scale values are somewhat arbitrary
and differ from test to test
Commonly Reported Standard Scores
SAT, GRE, NCEs, Stanines, SAS
Normal Curve Equivalent (NCE)
Normalized standard scores
used for reporting some
standardized achievement tests
Converted to a scale with a
mean of 50 and a standard
deviation of 21.06
Reported in a range between
values of 1 and 99
Are not particularly useful in
reporting test results to parents
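The NCE scale described above can be sketched by locating a percentile rank on the normal curve and rescaling it to a mean of 50 and a standard deviation of 21.06. The function name nce_from_percentile is illustrative, and actual test publishers may use lookup tables rather than this direct calculation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def nce_from_percentile(percentile_rank):
    """Convert a percentile rank (1 to 99) to a Normal Curve Equivalent."""
    z = standard_normal.inv_cdf(percentile_rank / 100)  # SD units from the mean
    return 50 + 21.06 * z

for pr in (1, 25, 50, 75, 99):
    print(f"percentile rank {pr:>2} -> NCE {nce_from_percentile(pr):.1f}")
```

The standard deviation of 21.06 is what makes NCEs of 1, 50, and 99 line up with percentile ranks of 1, 50, and 99; in between, the two scales differ.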
Standard Age Scores (SAS)

Used to report the results of
ability tests
Sometimes reported as
deviation IQ scores
Converted to a scale with a
mean of 100 and a standard
deviation of 15
The average range is considered to be
15 points above or below 100, that is,
85-115 on the normal curve
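As a rough sketch, a score can be placed on the SAS scale by expressing it in standard deviation units within the student's age group and rescaling to a mean of 100 and a standard deviation of 15. The age-group mean and SD below are hypothetical, and published tests derive SAS values from their own norm tables rather than from this simple linear formula.

```python
def standard_age_score(raw, age_group_mean, age_group_sd):
    """Rescale a raw score to a mean of 100 and an SD of 15."""
    z = (raw - age_group_mean) / age_group_sd  # distance from the mean in SDs
    return round(100 + 15 * z)

def in_average_range(sas):
    """Average range: within one SD (15 points) of 100."""
    return 85 <= sas <= 115

# Hypothetical age-group norms: mean raw score 50, SD 10.
sas = standard_age_score(60, age_group_mean=50, age_group_sd=10)
print(sas, in_average_range(sas))
```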
Stanines
Standard Scores with whole
number values ranging from 1
to 9
Relate to percentile bands
Useful as a simple
approximation of performance;
May lead to a loss of precision
in reporting
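The link between stanines and percentile bands can be sketched with a lookup. The cut points below follow the conventional stanine distribution (4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, 4% of the norm group); they are not taken from the presentation, and a given test's manual may round the boundaries differently.

```python
from bisect import bisect_right

# Lowest percentile rank belonging to stanines 2 through 9 (conventional table).
STANINE_LOWER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine_from_percentile(percentile_rank):
    """Map a percentile rank (1 to 99) to a stanine (1 to 9)."""
    return bisect_right(STANINE_LOWER_BOUNDS, percentile_rank) + 1

for pr in (3, 40, 59, 60, 99):
    print(f"percentile rank {pr:>2} -> stanine {stanine_from_percentile(pr)}")
```

Percentile ranks 40 and 59 land in the same stanine, which shows the loss of precision: two students nearly 20 percentile points apart receive the same reported score.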
Percentile Scores
Commonly used in expressing results of
standardized tests
Probably the best single derived score
for general use in relaying test results
Indicate the percentage of students in
the norm group scoring lower than the
examinee
Range between values of 1 and 99
Used to interpret a student's
performance in comparison to other
students
Can result in misinterpretation because
percentile ranks are not equally
spaced along the score scale
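A minimal sketch of both points: a percentile rank computed as the percentage of a norm group scoring lower than the examinee, and the unequal spacing of percentile ranks on a normal curve. The norm group here is hypothetical (scores 1 through 100), and the 1 to 99 clamp mirrors the reporting range stated above.

```python
from statistics import NormalDist

def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring lower, reported from 1 to 99."""
    lower = sum(1 for s in norm_scores if s < score)
    rank = round(100 * lower / len(norm_scores))
    return min(99, max(1, rank))

norm_group = list(range(1, 101))  # hypothetical norm group of 100 scores
print(percentile_rank(76, norm_group))

# Unequal spacing: the same 5-point percentile gap covers a different
# distance in standard deviation units near the middle than near the tail.
curve = NormalDist()
gap_near_middle = curve.inv_cdf(0.55) - curve.inv_cdf(0.50)
gap_near_tail = curve.inv_cdf(0.95) - curve.inv_cdf(0.90)
print(f"{gap_near_middle:.2f} SD vs {gap_near_tail:.2f} SD")
```

A 5-point difference in percentile rank therefore reflects a much larger difference in performance near the extremes than near the middle of the distribution.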
Percentile Bands
Range of values thought to contain the
student's true percentile rank
Narrower bands reflect higher reliability
Example: Susan might have a percentile
band ranging between 76 and 86 for
math computation on the ITBS, and a
percentile band ranging between 82 and
92 for reading.
The bands suggest that Susan probably
performs better at reading than
at math computation
However, because the bands overlap
(82 to 86), her exact percentile score for
math could be higher than for reading
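Susan's example can be checked with a short sketch that tests whether two percentile bands overlap. The band values come from the example; the function names are illustrative.

```python
def bands_overlap(band_a, band_b):
    """True if two (low, high) percentile bands share any values."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

def shared_range(band_a, band_b):
    """The (low, high) range common to both bands, or None if they are separate."""
    if not bands_overlap(band_a, band_b):
        return None
    return (max(band_a[0], band_b[0]), min(band_a[1], band_b[1]))

math_band = (76, 86)     # Susan's math computation band
reading_band = (82, 92)  # Susan's reading band

print(bands_overlap(math_band, reading_band))
print(shared_range(math_band, reading_band))
```

When two bands overlap, the difference between the two scores should not be treated as a dependable difference in performance.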
Grade Equivalents
Identifies grade level at which
typical student obtains same
raw score
Expressed by grade and month
Are useful in measuring growth
Can be easily misinterpreted
Grade Equivalent Interpretation

Compares student performance on grade-level
material against the average performance of
students at other grade levels on the same
material
Reported in terms of grade level and months
Does not mean a 5th grade student with an 8.5
GE score in reading can do 8th grade reading
work
Does not mean the 5th grade student needs to
be in 8th grade
Does mean the 5th grade student is performing
better than peers at same level
Does mean that 5th grade student reads 5th
grade material as well as the average 8th
grader
Grade Equivalents - Common Misinterpretations
Cannot be interpreted as an estimate of the
grade where a student should be placed
Are not equal across the range of the
scale
Are not necessarily equal across tests
Extremely high or low GE scores are not
dependable estimates of student
achievement
Things to Know
Know the Test - study the manual and
understand the content and purpose
Know the Norms - you cannot interpret
scores well if you don't understand the
norming population
Know the Score - is it a standard score,
raw score, percentile rank, or
something else?
Know the Background - test results
don't tell the whole story, so consider
multiple sources of data and
information on the student
More to know
Research on your own - the more you
know, the more you can explain test
results with accuracy and confidence
Communicate effectively - provide
pertinent information in a clear,
understandable manner to approved
individuals
Use the test - understanding
increases with multiple uses
Use caution - test scores can reflect
ability but they do not determine
ability
Reference: Test Scores and What They Mean, 6th edition, by H. Lyman
