Statistics Refresher Validity Reliability PPT Content


Chapter 2: Statistics Refresher

Test scores are frequently expressed as numbers, and statistical tools are used to describe, make inferences from, and draw conclusions about numbers.

Why do we need statistics?
Statistical methods serve two important purposes in the quest for scientific understanding.
1. Statistics are used for purposes of description.
2. We can use statistics to make inferences, which are logical deductions about events that cannot be observed directly.

Descriptive Statistics are methods used to provide a concise description of a collection of quantitative information.
Inferential Statistics are methods used to make inferences from observations of a sample to a population.

Scales of Measurement
Measurement: the application of rules for assigning numbers to objects. The rules are the specific procedures used to transform qualities of attributes into numbers.

- Continuous Scale: measuring
- Discrete Scale: counting or categorizing; countable

Properties of Scales
Three important properties make scales of measurement different from one another:
1. Magnitude: the property of "moreness"; whether a particular instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance.
2. Equal Intervals: whether the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units.
3. Absolute 0: when nothing of the property being measured exists; for many psychological qualities it is extremely difficult, if not impossible, to define an absolute 0 point.

Types of Scales [NOIR]
1. Nominal Scales: classification or categorization based on one or more distinguishing characteristics. (ex: DSM diagnoses, yes/no scale, sex, hair color, marital status, etc.)
2. Ordinal Scales: rank-ordering of characteristics; instances are compared with each other and assigned a rank. (ex: board exam topnotchers, Likert scale, level of agreement)
3. Interval Scale: has the properties of magnitude and equal intervals but no absolute 0. (ex: Celsius, Fahrenheit)
4. Ratio Scale: has all three properties. (ex: Kelvin)

Type of Scale | Magnitude | Equal Intervals | Absolute 0
Nominal       | No        | No              | No
Ordinal       | Yes       | No              | No
Interval      | Yes       | Yes             | No
Ratio         | Yes       | Yes             | Yes

Frequency Distributions
A distribution of scores summarizes the scores for a group of individuals. The frequency distribution displays scores on a variable or a measure to reflect how frequently each value was obtained; it defines all the possible scores and determines how many people obtained each of those scores.

(Figure: skewed distributions, with panels labeled "TEST IS DIFFICULT" and "TEST IS EASY")

Skewness: the extent to which symmetry is absent.
- Positively skewed: frequent scores are clustered at the lower end and the tail points toward higher or more positive scores (the test is difficult).
- Negatively skewed: frequent scores are clustered at the higher end and the tail points toward lower or more negative scores (the test is easy).

Kurtosis: the degree to which scores cluster at the ends (tails) of a distribution; steepness.
- Leptokurtic: pointy; positive kurtosis
- Mesokurtic: normal distribution
- Platykurtic: flat; negative kurtosis

Percentiles
Specific scores or points within a distribution; a percentile divides the total frequency for a set of observations into hundredths and indicates the particular score below which a defined percentage of scores falls.
Percentile Ranks replace simple ranks to adjust for the number of scores in a group.
What percent of the scores fall below a particular score?
Percentile rank = (number of examinees beaten / total number of scores) x 100

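A quick sketch in Python (the scores are hypothetical; the function mirrors the formula above):

```python
# Percentile rank: the percentage of scores that fall below a given score.
scores = [40, 55, 60, 62, 70, 75, 80, 85, 90, 95]  # hypothetical data

def percentile_rank(score, scores):
    beaten = sum(1 for s in scores if s < score)  # examinees "beaten"
    return beaten / len(scores) * 100

print(percentile_rank(80, scores))  # 60.0 -> 80 beats 6 of the 10 scores
```
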
Measures of Central Tendency
Mean: the average of a data set; the arithmetic average score in a distribution
Median: the middle of the set of numbers (after organizing the scores in ascending/descending order)
Mode: the most common number/score in a data set; a set with 2 modes is bimodal

Standard Deviation: an approximation of the average deviation around the mean; a measure of how spread out the numbers are
Variance: the average of the squared differences from the mean

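All five statistics can be checked with Python's standard statistics module (hypothetical data set):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical distribution of scores

print(statistics.mean(data))       # 5.0 -> arithmetic average
print(statistics.median(data))     # 4.5 -> middle of the ordered scores
print(statistics.mode(data))       # 4   -> most frequent score
print(statistics.pvariance(data))  # 4.0 -> average squared deviation
print(statistics.pstdev(data))     # 2.0 -> square root of the variance
```
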
Z score
Transforms data into standardized units that are easier to interpret; the difference between a score and the mean, divided by the standard deviation.

T score (McCall's T)
A standardized score system that can be obtained from a simple linear transformation of Z scores.
Mean = 50
Standard Deviation = 10

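A minimal sketch of both transformations (hypothetical raw scores):

```python
import statistics

raw = [50, 60, 70, 80, 90]      # hypothetical raw scores
mean = statistics.mean(raw)      # 70
sd = statistics.pstdev(raw)      # ~14.14

def z_score(x):
    return (x - mean) / sd       # distance from the mean in SD units

def t_score(x):
    return 50 + 10 * z_score(x)  # McCall's T: mean 50, SD 10

print(round(z_score(90), 2))     # 1.41
print(round(t_score(90), 2))     # 64.14
```
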
Standard Scores

Score type     | Mean | SD
Z scores       | 0    | 1
T scores       | 50   | 10
Stanine scores | 5    | 2
Sten scores    | 5.5  | 2
IQ             | 100  | 15
CEEB           | 500  | 100

Variability
- Range: high score - low score
- Quartiles: points that divide the frequency (normal) distribution into 4 parts
- Interquartile range: Q3 - Q1
- Semi-interquartile range: the interquartile range divided by 2

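The same module gives the variability measures (statistics.quantiles() requires Python 3.8+; data hypothetical):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical distribution of scores

data_range = max(data) - min(data)            # high score - low score
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile points
iqr = q3 - q1                                 # interquartile range
semi_iqr = iqr / 2                            # semi-interquartile range

print(data_range, iqr, semi_iqr)
```
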
Comparative statistics (difference, effect)

Test                      | No. of times DV is measured | No. of groups
T-test independent        | 1                           | 2
T-test dependent          | 2                           | 1
ANOVA one-way             | 1                           | >2
ANOVA repeated measures   | >2                          | 1
Mann-Whitney U test       | 1                           | 2
Wilcoxon Signed Rank test | 2                           | 1
Kruskal-Wallis test       | 1                           | >2
Friedman test             | >2                          | 1
ANOVA two-way             | 2 IVs, 2 levels             |
MANOVA                    | many measurements of the DV |

The t-tests and ANOVAs compare means and assume interval/ratio data; the Mann-Whitney, Wilcoxon, Kruskal-Wallis, and Friedman tests are their ordinal-data counterparts.

Parametric → Non-Parametric
- Dependent T-test → Wilcoxon Signed Rank Test
- Independent T-test → Mann-Whitney U Test
- Repeated Measures ANOVA → Friedman Test
- One-way ANOVA → Kruskal-Wallis Test

Norms And A Good Test
Norms: performances by defined groups on particular tests; the standard against which results will be compared; used to give information about performance relative to what has been observed in a standardization sample.

Age-related Norms
- Certain tests have different normative groups for particular age groups.
- ex: Stanford-Binet IQ test norms were obtained from various age groups.

Norm-referenced Tests
- Determine how a test taker compares with others, or compare each person with a norm.

Criterion-referenced Tests
- Describe the specific types of skills, tasks, or knowledge a test taker can demonstrate.

Norming
Standardization: the process of administering a test to a representative sample of test takers for the purpose of establishing norms.
Sample: a portion of the universe of people that represents the whole population.
Sampling: the process of selecting a sample.

Probability Sampling
- Simple Random Sampling: everyone has an equal
chance of being selected as part of the sample
- Systematic Sampling: every nth item or person is picked
- Stratified Sampling: random selection within
predefined groups
- Cluster Sampling: groups rather than individual units of the target population are selected randomly

Non-Probability Sampling
- Convenience Sampling: participants are selected based on their availability
- Quota Sampling: specifying who should be recruited for a survey according to certain groups or criteria
- Purposive Sampling: participants are chosen consciously based on their knowledge and understanding of the research question
- Snowball or Referral Sampling: people recruited to be part of a sample are asked to invite those they know to take part

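A small sketch of the two simplest probability designs (hypothetical population of 100 IDs):

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 IDs

# Simple random sampling: every member has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: every nth member after a random starting point.
n = 10
start = random.randrange(n)
systematic = population[start::n]

print(sorted(simple))
print(systematic)
```
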
2.1. Reliability
- pertains to the consistency of test measurement
- can be estimated from the correlation of the observed test score with the true score
- a test may be reliable in one context and unreliable in another

In psychological testing, the word error does not imply that a mistake has been made. Error implies that there will always be some inaccuracy in our measurements. In other words, tests that are relatively free of measurement error are reliable, and tests that have too much measurement error are considered unreliable.

History & Theory of Reliability
- Abraham De Moivre introduced the basic notion of sampling error.
- Karl Pearson developed the product-moment correlation.
- Charles Spearman worked out most of the basics of contemporary reliability theory in his 1904 article "The Proof and Measurement of Association between Two Things."
- Testers use "rubber yardsticks" to estimate measurements.

Basics of Test Score Theory
The difference between the score we obtain and the score we are really interested in is the error of measurement: X - T = E.

Classical Test Score Theory: assumes that each person has a true score that would be obtained if there were no errors in measurement (X = T + E).
- The difference between the true score and the observed score results from measurement error.

Standard Error of Measurement (SEM)
Estimates how repeated measures of a person on the same instrument tend to be distributed around his "true" score; the true score is always unknown, because no measure can be constructed that provides a perfect reflection of the true score.

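These notes give only the definition; the standard classical-test-theory estimate computes the SEM from the test's standard deviation and its reliability coefficient. A sketch with hypothetical values:

```python
import math

sd = 15            # hypothetical test SD (e.g., an IQ-style scale)
reliability = .91  # hypothetical reliability coefficient

# Standard CTT estimate: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # 4.5 -> scores spread about 4.5 points around T
```
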
Domain Sampling Theory
Another central concept in CTT; assumes that the items that have been selected for any one test are just a sample of items from an infinite domain of potential items; considers the problems created by using a limited number of items to represent a larger and more complicated construct.
- As the sample gets larger, it represents the domain more and more accurately.
- The greater the number of items, the higher the reliability.

Item Response Theory (IRT)
A way to analyze responses to tests or questionnaires with the goal of improving measurement accuracy and reliability; used to focus on the range of item difficulty that helps assess an individual's ability level.
- Note: the "item difficulty" index is computed as the proportion answering correctly, so it actually reflects item easiness.

Sources of Error Variance
- Test Construction: item sampling or content sampling, terms that refer to variation among items within a test as well as variation among items between tests.
- Test Administration: may influence the test taker's attention or motivation.
- Test Scoring and Interpretation: individually administered tests still require scoring by trained personnel.

Test-Retest Reliability
1 group, 1 test, 2 administrations
Interval: 2 weeks - 6 months
- used to evaluate the error associated with administering a test at two different times.
- appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, ex: a personality trait.
- Carryover Effect: occurs when the first testing session influences the scores from the second session (a concern with short intervals).
- Practice Effect: the test-retest correlation usually overestimates the true reliability (a test-taker concern).

Test-Retest Procedure
Sample population = (Test A) [1] + (Test A) [2]
1. Administer the psychological test
2. Get the test results
3. Wait for the interval (time gap)
4. Re-administer the psychological test
5. Get the test results
6. Correlate the results

Disadvantages of Test-Retest
- Possibly better performance on the retest
- Checking / knowing the answers
- Practice effect

Coefficient of stability: the estimate of test-retest reliability when the interval between testings is greater than six months.

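The final step, correlating the two administrations, can be sketched with statistics.correlation() (Pearson's r, Python 3.10+; hypothetical scores):

```python
import statistics

first = [10, 12, 14, 16, 18]   # hypothetical first administration
retest = [11, 12, 15, 15, 19]  # hypothetical second administration

r = statistics.correlation(first, retest)  # Pearson's r = the estimate
print(round(r, 3))                         # of test-retest reliability
```
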
Parallel-Forms & Alternate-Forms Reliability
1 group, 2 tests, 1 administration (session)
- compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
- the two forms are administered to the same group of people on the same day. Pearson's product-moment correlation is used to estimate the reliability.
- alternate forms are designed to be equivalent with respect to variables such as content and level of difficulty.
- coefficient of equivalence: the reliability estimate obtained when the two forms measure the same attributes.

Procedure
1. Administer the first test
2. Administer the alternate-form test
3. Score both tests
4. Correlate the results

Disadvantages of the Alternate-Form Method
- Hard to develop or construct
- Time consuming

Split-Half Reliability
1 group, 1 test divided into 2 halves, 1 administration
- obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- the test is given and divided into halves that are scored separately; the results of one half of the test are then compared with the results of the other.
- if the test is long, the best method is to divide the items randomly into two halves.
- to split the items, the odd-even system or the top-bottom method can be used.
  - top-bottom method: 1st half (e.g., items 1-25), 2nd half (e.g., items 26-50)
  - odd-even system: 1st half (odd-numbered items), 2nd half (even-numbered items)
- to adjust the half-test reliability, use the Spearman-Brown formula, which allows a test developer or user to estimate internal-consistency reliability from the correlation of two halves of a test.
- Spearman-Brown formula: rSB = 2r / (1 + r), where rSB is the estimated correlation between the two halves of the test (if each half had the total number of items) and r is the correlation between the two halves of the test.

Example: As a psychometrician in a clinic, the psychologist instructed you to develop a test that measures the emotional stability of suicidal patients. You decided to use the split-half method to establish its reliability. The correlation between the two halves of the test is .78; according to the Spearman-Brown formula, the estimated reliability would be .876.
Computation: rSB = 2(.78) / (1 + .78) = 1.56 / 1.78 ≈ .876

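The worked example above as a one-line function:

```python
def spearman_brown(r_half):
    # Estimate full-test reliability from the correlation of two halves.
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(.78), 3))  # 0.876, as in the example above
```
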
Kuder-Richardson Formula (KR20)
Kuder and Richardson (1937) developed this formula as a generalization of the split-half reliability estimate. The formula is applicable to items that are DICHOTOMOUS, scored 0 or 1 (usually right or wrong).
- used when the items vary in their level of difficulty.

KR21: used when the items have the same level of difficulty.

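A sketch of the computation on a tiny hypothetical item matrix, using the standard formula KR20 = k/(k-1) * (1 - sum(pq) / variance of total scores):

```python
import statistics

# Rows = examinees, columns = dichotomous items (1 = right, 0 = wrong).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
k = len(responses[0])                     # number of items
totals = [sum(row) for row in responses]  # each examinee's total score
var_total = statistics.pvariance(totals)  # variance of the total scores

# For each item: p = proportion who got it right, q = 1 - p.
sum_pq = 0.0
for i in range(k):
    p = sum(row[i] for row in responses) / len(responses)
    sum_pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
print(round(kr20, 3))
```
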
Coefficient Alpha
- Cronbach's alpha (Cronbach, 1951), sometimes called coefficient alpha (α), is the most common measure of internal consistency (reliability).
- most commonly used when you have multiple Likert questions in a survey/questionnaire that form a scale and you wish to determine whether the scale is reliable.
- items are not scored as 0 or 1; applicable to personality and attitude scales.

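A sketch of the standard computation, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), on hypothetical Likert ratings:

```python
import statistics

# Rows = respondents, columns = Likert items on the same scale.
ratings = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 3, 2],
    [4, 4, 5, 4],
]
k = len(ratings[0])
totals = [sum(row) for row in ratings]
var_total = statistics.pvariance(totals)

# Sum of the variances of the individual items.
sum_item_var = sum(
    statistics.pvariance([row[i] for row in ratings]) for i in range(k)
)

alpha = (k / (k - 1)) * (1 - sum_item_var / var_total)
print(round(alpha, 3))
```
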
How reliable is reliability?
- In research settings, it has been suggested that reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research (Kaplan & Saccuzzo, 2005).
- In clinical settings, high reliability is extremely important. When tests are used to make important decisions about someone's future, a reliability of .90 to .95 might be good enough.
- If a test is unreliable, information obtained with it is of little or no value.
- Psychometricians commonly cite .70 as the benchmark.

What to do with low reliability?
1. Increase the number of items
2. Factor and item analysis
3. Correction for attenuation

2.2. Validity
Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
Validity indicates what the test aims or purports to measure.
It answers the question, "Does the test measure what it is supposed to measure?"

Face Validity
Face validity is not really validity at all because it does not offer evidence to support conclusions drawn from test scores.
Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures.
Face validity only establishes the presentation or physical appearance of the psychological test (Is the test presentable to the test takers?).

Three Categories of Validity

Content Validity
- explores the appropriateness of the items of a psychological test.
- means that the test covers the content it is supposed to cover. This describes a judgment of how adequately a test samples behavior representative of what the test was designed to sample.
- a test blueprint gives the "structure" of the test evaluation: a plan regarding the types of information to be covered by the items.

Criterion-related Validity
- criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated.
- refers to the ability to draw accurate inferences from test scores to a related behavioral criterion of interest.
- the extent to which a measure is related to an outcome.
- two forms: Predictive Validity and Concurrent Validity

Predictive Validity
- how well a certain measure can predict future behavior.

Concurrent Validity
- the extent to which test scores may be used to estimate an individual's present standing on a criterion.

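Both forms reduce to correlating test scores with a criterion measure; a sketch with hypothetical data (statistics.correlation() needs Python 3.10+):

```python
import statistics

# Predictive validity sketch: aptitude scores vs. a later criterion
# (hypothetical job-performance ratings collected months afterward).
test_scores = [55, 60, 65, 70, 75, 80]
criterion = [2.1, 2.8, 2.6, 3.4, 3.9, 4.2]

validity = statistics.correlation(test_scores, criterion)  # Pearson's r
print(round(validity, 3))
```
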
Construct Validity
- a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
- arrived at by executing a comprehensive analysis of:
  a. how scores on the test relate to other test scores and measures, and
  b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.

Convergent Validity
- when a measure correlates well with other tests believed to measure the same construct.
- ex: correlate scores on a math-ability test with scores obtained from other math-ability tests.

Discriminant Validity
- a construct measure diverges from other measures that should be measuring different things.
- ex: correlate scores on a math-ability test with scores obtained from a verbal-ability test.

Practicality of a Test
- A test must be usable.
- Selection of the test should also be based on:
  - Effort
  - Affordability
  - Time frame
- A test requires simple directions.
- Easy administration and scoring.