Item Analysis & Reliability

Item analysis, reliability and
validity; further multivariate stats

• Using statistics covered in the main sessions
(including correlation and factor analysis) to
explore reliability and validity
• An introduction to scale reliability stats
(Cronbach’s alpha)
• Mixed methods as an approach to validity
• Some pointers to additional material
– the Rasch model
– further multivariate methods
Item analysis
• Statistical checks on the items that
make up a test
Item analysis
• Used with items which can be marked
as right or wrong
• And when the set of item scores is
added to give a single score
• Assumption: items measure a single
trait
Useful stats
• Facility
• Discrimination
Item difficulty
Facility = the fraction of persons tested
who answered the item correctly (p).
It is usually recommended that p values
should fall within the range 0.2 to 0.8.
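A minimal sketch of the facility calculation, assuming items are scored 1 (right) or 0 (wrong). The response matrix and the helper names `facility` and `within_recommended_range` are hypothetical, chosen only for illustration.

```python
def facility(item_scores):
    """Fraction of persons who answered the item correctly (p)."""
    return sum(item_scores) / len(item_scores)


def within_recommended_range(p, low=0.2, high=0.8):
    """True if p falls inside the usually recommended 0.2 to 0.8 range."""
    return low <= p <= high


# Hypothetical data: one row per person, one column per item (1 = right, 0 = wrong).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

# Transpose so that each entry holds all persons' scores on one item.
items = list(zip(*responses))
p_values = [facility(scores) for scores in items]
flags = [within_recommended_range(p) for p in p_values]

for number, (p, ok) in enumerate(zip(p_values, flags), start=1):
    print(f"item {number}: p = {p:.2f}, within range: {ok}")
```

In this made-up data the first item is answered correctly by everyone, so it falls outside the recommended range.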
Discrimination
• Does the item discriminate between high
scorers and low scorers on the whole test?
• Can be tested via the item-total correlation

Some revision
Spearman
Pearson
Correlation
[Figure: scatterplot "IQ and attitude to school"; y-axis: IQ score (40 to 180); x-axis: attitude to school (0 to 20)]
Correlation
Correlations
                                           IQ score   attitude to school
IQ score             Pearson Correlation   1.000      .564**
                     Sig. (2-tailed)       .          .000
                     N                     40         40
attitude to school   Pearson Correlation   .564**     1.000
                     Sig. (2-tailed)       .000       .
                     N                     40         40
**. Correlation is significant at the 0.01 level (2-tailed).
[Figure: scatterplot "IQ and attitude to school"; y-axis: IQ score (40 to 180); x-axis: attitude to school (0 to 20)]
No Correlation
Correlations
                                     random nos 1   random nos 2
random nos 1   Pearson Correlation   1.000          .087
               N                     40             40
random nos 2   Pearson Correlation   .087           1.000
               N                     40             41
[Figure: scatterplot of random nos 2 (y-axis) against random nos 1 (x-axis, 0 to 12)]
Strong correlation
Correlations
                                 IQ score   NEWIQ
IQ score   Pearson Correlation   1.000      .995**
           N                     40         40
NEWIQ      Pearson Correlation   .995**     1.000
           Sig. (2-tailed)       .000       .
           N                     40         40
**. Correlation is significant at the 0.01 level (2-tailed).
[Figure: scatterplot of IQ score (y-axis, 40 to 180) against NEWIQ (x-axis, 40 to 160)]
Discrimination
• Does the item discriminate between high
scorers and low scorers on the whole test?
• Can be tested via the item-total correlation

Part of Alpha output from SPSS
Item-Total Statistics
          Scale Mean if   Scale Variance    Corrected Item-     Cronbach's Alpha
          Item Deleted    if Item Deleted   Total Correlation   if Item Deleted
ite0001   3.1071          8.766             .722                .861
ite0003   2.9643          7.888             .897                .844
ite0005   2.8929          8.099             .749                .855
ite0007   2.9643          8.332             .705                .859
ite0009   2.9643          8.406             .674                .861
ite0002   2.9643          9.295             .324                .886
ite0004   2.8214          8.671             .503                .875
ite0006   2.8571          9.238             .308                .889
ite0008   2.7500          8.935             .402                .883
ite0010   2.9643          7.888             .897                .844
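A rough sketch of how a corrected item-total correlation can be computed by hand: each item is correlated with the total of the remaining items, so the item does not contribute to the score it is compared against. The data and function names are hypothetical, and this is not a reproduction of the SPSS routine.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(responses, item):
    """Correlate one item with the total score of all the other items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Hypothetical data: one row per person, one column per item (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

correlations = [corrected_item_total(responses, i) for i in range(4)]
for i, r in enumerate(correlations, start=1):
    print(f"item {i}: corrected item-total r = {r:.3f}")
```

In this made-up data the last item has a much weaker correlation with the rest of the test than the first item.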
Interpreting the scores
• A point biserial correlation:
above 0.3 is considered ‘good’;
0.2 to 0.3 is considered ‘workable’
below 0.2 is considered unacceptable
Discrimination Index
$$D = \frac{(\text{N in top 27\% getting it right}) - (\text{N in bottom 27\% getting it right})}{\text{N in a 27\% group}}$$
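The index above can be sketched as follows, assuming persons are ranked by total score and the group size is 27% of the sample rounded to the nearest whole person. The data and the function name are hypothetical; persons tied at a group boundary are placed by sort order here, which a real analysis might handle differently.

```python
def discrimination_index(responses, item, fraction=0.27):
    """(N right in top group - N right in bottom group) / group size."""
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    group_size = max(1, round(len(ranked) * fraction))
    top = ranked[:group_size]
    bottom = ranked[-group_size:]
    top_right = sum(row[item] for row in top)
    bottom_right = sum(row[item] for row in bottom)
    return (top_right - bottom_right) / group_size


# Hypothetical data: 10 persons, 4 items (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

indices = [discrimination_index(responses, i) for i in range(4)]
for i, d in enumerate(indices, start=1):
    print(f"item {i}: D = {d:.2f}")
```

With 10 persons the 27% groups contain 3 persons each.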
What makes a good question?
                 FACILITY
Discrimination   Below 40%   40%-60%      Above 60%
> 0.40           Difficult   ACCEPTABLE   Easy
0.30-0.39        Marginal    Improvable   Marginal
0.20-0.29        Reject      Marginal     Reject
< 0.20           REJECT (all facility levels)
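The grid above can be read as a simple lookup. The sketch below is one possible encoding, with assumptions about the band edges that the slide leaves open: a discrimination of exactly 0.40 is placed in the top band, and facilities of exactly 40% and 60% are placed in the middle band. The function name `rate_question` is hypothetical.

```python
def rate_question(facility, discrimination):
    """Look up a verdict from facility (0 to 1) and discrimination."""
    if discrimination < 0.20:
        return "REJECT"  # rejected whatever the facility

    # Rows of the grid: (below 40%, 40%-60%, above 60%).
    if discrimination < 0.30:
        row = ("Reject", "Marginal", "Reject")
    elif discrimination < 0.40:
        row = ("Marginal", "Improvable", "Marginal")
    else:  # assumption: exactly 0.40 counts as the top band
        row = ("Difficult", "ACCEPTABLE", "Easy")

    # Assumption: exactly 40% and exactly 60% count as the middle band.
    if facility < 0.40:
        return row[0]
    if facility <= 0.60:
        return row[1]
    return row[2]


print(rate_question(0.50, 0.45))
print(rate_question(0.70, 0.25))
```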
Scale qualities
Reliability
• the extent to which the scores on the test
are measured consistently
– Parallel-form reliability
– Split-half reliability
– Internal consistency reliability
– Test-retest reliability
– Inter-rater reliability
• Parallel-form reliability
Correlations
A1 A2
A1 Pearson Correlation 1 .721**
N 28 25
A2 Pearson Correlation .721** 1
N 25 25
**. Correlation is significant at the 0.01 level
(2-tailed).
• Split-half reliability
• Spearman-Brown formula
Correlations
ODD EVEN
ODD Pearson Correlation 1 .807**
N 28 28
EVEN Pearson Correlation .807** 1
N 28 28
**. Correlation is significant at the 0.01 level
(2-tailed).
****** Method 1 (space saver) will be used for this analysis ******
R E L I A B I L I T Y A N A L Y S I S - S C A L E (S P L I T)
Reliability Coefficients
N of Cases = 28.0 N of Items = 10
Correlation between forms =.8068 Equal-length Spearman-Brown = .8931
Guttman Split-half = .8828 Unequal-length Spearman-Brown =.8931
5 Items in part 1 5 Items in part 2
Alpha for part 1 = .8926 Alpha for part 2 = .6125
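As a check on the output above, the equal-length Spearman-Brown coefficient can be recovered from the correlation between the two halves. This is a short sketch using the general step-up formula k·r / (1 + (k − 1)·r) with k = 2; the function name is hypothetical.

```python
def spearman_brown(r_half, length_factor=2):
    """Estimated reliability of a test length_factor times as long as each half."""
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)


# Correlation between forms, taken from the SPSS output above.
stepped_up = spearman_brown(0.8068)
print(round(stepped_up, 4))
```

The result agrees with the .8931 reported by SPSS to four decimal places.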

• Alpha
R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)
N of Cases = 28.0
Statistics for   Mean     Variance   Std Dev   N of Variables
Scale            3.2500   10.4167    3.2275    10
Item-total Statistics
          Scale Mean if   Scale Variance    Corrected Item-     Squared Multiple   Alpha if
          Item Deleted    if Item Deleted   Total Correlation   Correlation        Item Deleted
ITE0001   3.1071          8.7659            .7222               .                  .8610
ITE0003   2.9643          7.8876            .8968               .                  .8437
ITE0005   2.8929          8.0992            .7487               .                  .8547
ITE0007   2.9643          8.3320            .7052               .                  .8587
ITE0009   2.9643          8.4061            .6744               .                  .8611
ITE0002   2.9643          9.2950            .3244               .                  .8863
ITE0004   2.8214          8.6706            .5027               .                  .8746
ITE0006   2.8571          9.2381            .3080               .                  .8892
ITE0008   2.7500          8.9352            .4015               .                  .8827
ITE0010   2.9643          7.8876            .8968               .                  .8437
Reliability Coefficients 10 items
Alpha = .8782 Standardized item alpha = .8839
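A minimal sketch of the alpha calculation, using the usual formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), together with the "alpha if item deleted" column. The data are hypothetical (not the 28 cases behind the output above) and the function names are illustrative.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of persons, each a list of item scores."""
    k = len(responses[0])
    item_variances = [variance(scores) for scores in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(responses, item):
    """Alpha recomputed with one item (column) left out."""
    reduced = [row[:item] + row[item + 1:] for row in responses]
    return cronbach_alpha(reduced)


# Hypothetical data: one row per person, one column per item (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
deleted = [alpha_if_item_deleted(responses, i) for i in range(4)]
print(f"alpha = {alpha:.4f}")
for i, a in enumerate(deleted, start=1):
    print(f"alpha if item {i} deleted = {a:.4f}")
```

In this made-up data, dropping the last item raises alpha, which is the pattern to look for in the "Alpha if Item Deleted" column.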

Internal consistency and
dimensionality
• Surprisingly, internally consistent scales
may not be uni-dimensional
High alpha – but unidimensional?
• High alpha can be achieved if there are
sets of questions that correlate with each
other within a set but not necessarily
across the sets
• Follow alpha by factor analysis (then
perhaps alpha for each factor separately)
• Report factors
• Consider whether the overall scale still has
meaning
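A small illustration of the first point, under stated assumptions: a hypothetical 10-item scale made of two sets of 5 items, with items correlating 0.7 within a set and 0.0 across sets. The sketch uses the standardized alpha formula k·r̄ / (1 + (k − 1)·r̄), where r̄ is the mean inter-item correlation; the numbers are invented for the example.

```python
def standardized_alpha(corr):
    """Standardized alpha from a square inter-item correlation matrix."""
    k = len(corr)
    off_diagonal = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    mean_r = sum(off_diagonal) / len(off_diagonal)
    return k * mean_r / (1 + (k - 1) * mean_r)


def two_cluster_matrix(set_size=5, within=0.7, across=0.0):
    """Correlation matrix for two item sets that are unrelated to each other."""
    k = 2 * set_size
    matrix = []
    for i in range(k):
        row = []
        for j in range(k):
            if i == j:
                row.append(1.0)
            elif (i < set_size) == (j < set_size):
                row.append(within)  # both items in the same set
            else:
                row.append(across)  # items in different sets
        matrix.append(row)
    return matrix


alpha = standardized_alpha(two_cluster_matrix())
print(round(alpha, 4))
```

Alpha comes out above 0.8 even though the two sets of items are completely uncorrelated, which is why alpha alone does not establish unidimensionality.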
Factor analysis
    A     B     C     D     E
A   1     0.8   0.9   0.1   0.2
B         1     0.6   0.3   0.1
C               1     0.3   0.1
D                     1     0.9
E                           1
Factor structure of the scale
investigated above - Alpha = 0.88
Rotated Component Matrix (a)
Component
1 2 3
ITE0005 .897
ITE0010 .873
ITE0003 .873
ITE0006 .724
ITE0009 .661
ITE0008 .929
ITE0007 .750
ITE0004
ITE0002 .895
ITE0001 .731
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 4 iterations.
Links between question difficulty
and scale reliability
[Figure: axes labelled qu1 and qu2]
Negative alphas
Are unusual but can be the result of
• The problem of questions that are too
easy
• Errors in coding
• Sampling error
OR
• That the questions really don’t measure
the same thing (and right answers to some
go with wrong answers to others)
• Classical test Theory
• Item Response Theory

Rasch Model
• http://www.rasch.org/memo42.htm
• http://www.rasch.org/memo62.htm
Validity
• Does the test measure what it sets out to
measure?
• Concurrent validity
• Discriminant validity
• Predictive validity
• Reliability and validity

Wider issues of validity in
quantitative studies
• Threats to validity in experimental designs
• A simple design:
• O X O
Issues of validity at the design
stage – experimental designs
Internal threats to validity
• History
• Maturation
• Testing
• Instrumentation
• Selection
• Statistical regression
• Mortality
External threats to validity
• Interaction of selection bias and
treatment
• Interaction between testing and
treatment
• Reaction to being in an experiment
Hawthorne Effect
• Treatment group do better because they
are in a privileged group
Hawthorne Effect
• Treatment group do better because they
are in a privileged group – or do they?
The Hawthorne defect: Persistence of a flawed theory
“Like other hallowed but unproven concepts in psychology, the so-called Hawthorne
effect has a life of its own.”
By Berkeley Rice
http://www.cs.unc.edu/~stotts/204/nohawth.html
Compensatory effect
John Henry Effect
• Control group do better because they
are not going to let the ‘smarties’ get the
better of them
True experiment
R   X   Oe
R   -   Oc
Quasi experiments – no R
O X O
O O
Mixed methodology
• Use quasi experimental designs to reveal
possible consequences of actions
• Use interpretative designs to check the
causal relationship between the actions
and the consequences – from several
perspectives
– (including enquiring about the known threats
to validity – history, maturation etc)
Another view of mixed
methodologies
Linking qualitative and quantitative data
• qualitative work gives rich exemplification of generalisable
relationships established by statistical methods – (Sci
Paradigm)
• quantitative work establishes the generalisability of
hypotheses which emerge from a qualitative enquiry (Sci
Paradigm)
• qualitative and quantitative work are used together
(iteratively) to deepen the understanding of the particular
cases on which we have been working. (Interp. Paradigm)
It’s …
• NOT the purpose of qualitative work simply to give rich
exemplification of generalisable relationships
established by statistical methods – to give a human
face to a statistical study.
• NOT that quantitative work should be used to establish
the generalisability of hypotheses which emerge from a
qualitative enquiry - as if this is in some way a
necessary step in order that the qualitative findings can
be taken seriously.
• BUT qualitative and quantitative work are used together
(iteratively) to deepen the understanding of the
particular cases on which we have been working.
An example
[Figure: "C1 site differences". Scatterplot of Shared Control (y-axis, 1.5 to 3.5) against Student Negotiation (x-axis, 2.0 to 3.4), with regions marked "Shared Control > Student Negotiation" and "Student Negotiation > Shared Control". Sites plotted: SpprtMS, WrkbdAst, ELDrama, ESOL LS, ITSkills, Connect2, ADMPA, GNVQBus, Pth4 Prn, AVCET&T, CACHE, Voc Path, BTECHlth, Engneer, ASPsych]
Explaining the High SC/Low SN
grouping
Support for Mature Students
• Self assessment, negotiation based on
assignments, individual learning plans
agreed and reviewed by tutor and student
Workbased assessment
• Individual support from tutor (underground
working)
Explaining the High SC/Low SN
grouping
Workbased assessment
• Different geographical placements
Support for Mature Students and IT skills
• Same room but different times
ESOL
• Same room, same times but different languages
ISOLATION
Does isolation feature
elsewhere?
[Figure: the "C1 site differences" scatterplot (Shared Control against Student Negotiation), annotated with the two points below]
• Isolation was a feature of the ‘top left’ sites
• In 4 of the ‘bottom right’ sites isolation was ‘not at all evident in this site’
Isolation (broadly defined) appeared to be
a factor related to a site culture in which
there was low student negotiation.
Some further multivariate stats
EFA & CFA
• EFA (exploratory): factor analysis used to explore
relationships amongst variables
• CFA (confirmatory): factor analysis used to test expected
relationships between variables
Path analysis
[Figure: path diagram linking IQ, mot and ach, labelled with: exogenous variable, variance, endogenous variable, mediator variable, direct effects (path coefficients), disturbances; values shown: .5, .6, .4, .3, .9]
Structural equation modelling
http://www2.chass.ncsu.edu/garson/PA765/structur.htm
One indicator per latent: SEM=PA
No dependent: SEM=CFA
Multilevel Modelling
• Bennett – 1976, Teaching Styles and Pupil
Progress
– Children taught in a formal style did better
• Aitkin et al – 1981, Teaching Styles and
Pupil Progress: A Re-Analysis. British
Journal of Educational Psychology, v51 n2
p170-86
– When pupils’ grouping into classes was taken
into account, this difference disappeared
What kind of problem can this
explore?
• Pupils taught by formal methods do better than those
taught by informal methods
– Formal teaching methods are best
BUT
• All formal teachers are in mixed schools
• All informal teachers in single sex schools
– So mixed schools are best
BUT
• All mixed schools are in one LA that spends a lot on
education
• All single sex schools are in another LA that does not spend
a lot on education
– So really it’s resourcing that matters
And also…
• So MLM groups people appropriately
– eg class, school, local authority
But it also
• Allows the use of covariates at any level of
grouping
– eg initial test scores at the class level, teacher
experience at the school level, funding to schools at
the LA level
• Allows the exploration of interactions
– eg is the variation between schools greater for
children with low initial test scores
A really helpful website:
• http://www.cmm.bristol.ac.uk/learning-training/multilevel-models/what-why.shtml
Including a useful video on the principles
training/videos/jr-clioday_files/Default.htm
and on an application
training/videos/kj-clioday_files/Default.htm
