
Item Analysis

After you create your objective assessment items and give your test, how can you be sure
that the items are appropriate -- not too difficult and not too easy? How will you know if
the test effectively differentiates between students who do well on the overall test and
those who do not? An item analysis is a valuable, yet relatively easy, procedure that
teachers can use to answer both of these questions.

To determine the difficulty level of test items, a measure called the Difficulty Index is
used. This measure asks teachers to calculate the proportion of students who answered the
test item correctly. By looking at each alternative (for multiple choice), we can also find
out if there are answer choices that should be replaced. For example, let's say you gave a
multiple-choice quiz with four answer choices (A, B, C, and D). The following table shows
how many students selected each answer choice for Questions #1 and #2.

Question    A     B     C     D
#1          0     3     24*   3
#2          12*   13    3     2

* Denotes correct answer.

For Question #1, we can see that A was not a very good distractor -- no one selected that
answer. We can also compute the difficulty of the item by dividing the number of students
who chose the correct answer (24) by the total number of students (30). Using this formula,
the difficulty of Question #1 (referred to as p) is equal to 24/30, or .80. A rough rule of
thumb is that if the item difficulty is more than .75, it is an easy item; if the difficulty
is below .25, it is a difficult item. Given these parameters, Question #1 would be considered
easy -- most (80%) of the students got it correct. In contrast, Question #2 is much more
difficult (12/30 = .40). In fact, on Question #2, more students selected an incorrect answer
(B) than selected the correct answer (A). This item should be carefully analyzed to ensure
that B is an appropriate distractor.
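
The same arithmetic is easy to automate. Below is a minimal Python sketch (the function
names and the distractor check are illustrative, not part of the original text) that
computes p for the two questions above from the choice counts and lists any incorrect
option that no student selected:

def item_difficulty(choice_counts, correct_choice):
    """p = proportion of students who chose the correct answer."""
    return choice_counts[correct_choice] / sum(choice_counts.values())

def unused_distractors(choice_counts, correct_choice):
    """Incorrect options that no student selected."""
    return [opt for opt, n in choice_counts.items()
            if opt != correct_choice and n == 0]

q1 = {"A": 0, "B": 3, "C": 24, "D": 3}    # correct answer: C
q2 = {"A": 12, "B": 13, "C": 3, "D": 2}   # correct answer: A

print(item_difficulty(q1, "C"))     # 0.8 -> easy item
print(item_difficulty(q2, "A"))     # 0.4 -> harder item
print(unused_distractors(q1, "C"))  # ['A'] -> no one picked this distractor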

Another measure, the Discrimination Index, refers to how well an assessment differentiates
between high and low scorers. In other words, you should expect the high-performing students
to select the correct answer for each question more often than the low-performing students.
If this is true, then the assessment is said to have
a positive discrimination index (between 0 and 1) -- indicating that students who received
a high total score chose the correct answer for a specific item more often than the
students who had a lower overall score. If, however, you find that more of the low-
performing students got a specific item correct, then the item has a negative
discrimination index (between -1 and 0). Let's look at an example.
The table below displays the results of ten students on three quiz questions. Note that the
students are arranged with the top overall scorers at the top of the table.

Student     Total Score (%)   Question 1   Question 2   Question 3
Asif              90               1            0            1
Sam               90               1            0            1
Jill              80               0            0            1
Charlie           80               1            0            1
Sonya             70               1            0            1
Ruben             60               1            0            0
Clay              60               1            0            1
Kelley            50               1            1            0
Justin            50               1            1            0
Tonya             40               0            1            0

"1" indicates the answer was correct; "0" indicates it was incorrect.

Follow these steps to determine the Difficulty Index and the Discrimination Index.

1. After the students are arranged with the highest overall scores at the top, count the
number of students in the upper and lower group who got each item correct. For
Question #1, there were 4 students in the top half who got it correct, and 4
students in the bottom half.
2. Determine the Difficulty Index by dividing the number who got it correct by the
total number of students. For Question #1, this would be 8/10 or p=.80.
3. Determine the Discrimination Index by subtracting the number of students in the
lower group who got the item correct from the number of students in the upper
group who got the item correct.  Then, divide by the number of students in each
group (in this case, there are five in each group). For Question #1, that means you
would subtract 4 from 4, and divide by 5, which results in a Discrimination Index
of  0.
4. The answers for Questions 1-3 are provided in the table below.
Item          # Correct (Upper group)   # Correct (Lower group)   Difficulty (p)   Discrimination (D)
Question 1               4                          4                  .80                  0
Question 2               0                          3                  .30                 -0.6
Question 3               5                          1                  .60                  0.8

Now that we have the table filled in, what does it mean? We can see that Question #2 had
a difficulty index of .30 (meaning it was quite difficult), and it also had a negative
discrimination index of -0.6 (meaning that the low-performing students were more likely
to get this item correct).  This question should be carefully analyzed, and probably
deleted or changed. Our "best" overall question is Question 3, which had a moderate
difficulty level (.60), and discriminated extremely well (0.8).
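
The four steps above can be expressed as a short Python sketch. The function and variable
names are illustrative, not from the original text; the data come from the student table
above, and the printed values reproduce the p and D columns of the results table:

def item_analysis(scores, responses, item):
    """Return (difficulty p, discrimination D) for one item.
    scores: list of (student, total score); responses: student -> list of 0/1 results."""
    # Step 1: rank students from highest to lowest total score and split in half
    ranked = [s for s, _ in sorted(scores, key=lambda x: x[1], reverse=True)]
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[half:]
    upper_correct = sum(responses[s][item] for s in upper)
    lower_correct = sum(responses[s][item] for s in lower)
    # Step 2: difficulty = proportion of all students answering correctly
    p = (upper_correct + lower_correct) / len(ranked)
    # Step 3: discrimination = (upper correct - lower correct) / group size
    d = (upper_correct - lower_correct) / half
    return p, d

scores = [("Asif", 90), ("Sam", 90), ("Jill", 80), ("Charlie", 80), ("Sonya", 70),
          ("Ruben", 60), ("Clay", 60), ("Kelley", 50), ("Justin", 50), ("Tonya", 40)]
responses = {"Asif": [1, 0, 1], "Sam": [1, 0, 1], "Jill": [0, 0, 1],
             "Charlie": [1, 0, 1], "Sonya": [1, 0, 1], "Ruben": [1, 0, 0],
             "Clay": [1, 0, 1], "Kelley": [1, 1, 0], "Justin": [1, 1, 0],
             "Tonya": [0, 1, 0]}

for q in range(3):
    print("Question", q + 1, item_analysis(scores, responses, q))
# Question 1 (0.8, 0.0)   Question 2 (0.3, -0.6)   Question 3 (0.6, 0.8)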

Another consideration for an item analysis is the cognitive level that is being assessed. 
For example, you might categorize the questions based on Bloom's taxonomy (perhaps
grouping questions that address Level I and those that address Level II). In this manner,
you would be able to determine if the difficulty index and discrimination index of those
groups of questions are appropriate. For example, you might note that the majority of the
questions that demand higher levels of thinking skills are too difficult or do not
discriminate well.  You could then concentrate on improving those questions and focus
your instructional strategies on higher-level skills.
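
As a rough sketch of that idea, the following Python fragment (the level labels and the
item statistics are made-up illustrations, not from the original text) groups items by
cognitive level and averages their difficulty and discrimination so that weak groups of
questions stand out:

from collections import defaultdict

# (cognitive level, difficulty p, discrimination D) for each item -- example values only
items = [("Level I",  0.80,  0.00),
         ("Level I",  0.30, -0.60),
         ("Level II", 0.60,  0.80),
         ("Level II", 0.20,  0.10)]

groups = defaultdict(list)
for level, p, d in items:
    groups[level].append((p, d))

for level, stats in groups.items():
    avg_p = sum(p for p, _ in stats) / len(stats)
    avg_d = sum(d for _, d in stats) / len(stats)
    print(level, "average p =", round(avg_p, 2), "average D =", round(avg_d, 2))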

Stanine (STAndard NINE) is a method of scaling test scores on a nine-point standard scale
with a mean of five (5) and a standard deviation of two (2).

Some sources attribute stanines to the U.S. Army Air Forces during World War II; the
earliest known use was in 1943.[1]

Test scores are scaled to stanine scores using the following algorithm:

1. Rank results from lowest to highest
2. Give the lowest 4% a stanine of 1, the next 7% a stanine of 2, etc., according to the
following table:

Calculating Stanines
Result ranking   4%   7%   12%   17%   20%   17%   12%   7%   4%
Stanine           1    2     3     4     5     6     7    8    9
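
A minimal Python sketch of this algorithm follows. The cumulative cut-offs (.04, .11, .23,
.40, .60, .77, .89, .96, 1.00) come from accumulating the percentages in the table; the
function name and sample scores are illustrative. Note that with tied raw scores this
simple rank-based version can place tied scores in different stanines:

def stanines(scores):
    """Map each raw score to a stanine (1-9) based on its rank percentile."""
    cutoffs = [0.04, 0.11, 0.23, 0.40, 0.60, 0.77, 0.89, 0.96, 1.00]
    n = len(scores)
    # Step 1: rank results from lowest to highest
    order = sorted(range(n), key=lambda i: scores[i])
    result = [0] * n
    for rank, i in enumerate(order):
        percentile = (rank + 1) / n
        # Step 2: assign the stanine whose cumulative band contains this rank
        result[i] = next(s for s, cut in enumerate(cutoffs, start=1)
                         if percentile <= cut)
    return result

print(stanines([55, 72, 80, 91, 63, 49, 85, 77, 68, 95]))
# [3, 5, 6, 8, 4, 2, 7, 5, 4, 9]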

An achievement test is a test of developed skill or knowledge. The most common type of
achievement test is a standardized test developed to measure skills and knowledge
learned in a given grade level, usually through planned instruction, such as training or
classroom instruction.[1][2] Achievement tests are often contrasted with tests that measure
aptitude, a more general and stable cognitive trait.

Achievement test scores are often used in an educational system to determine the level of
instruction for which a student is prepared. High achievement scores usually indicate
mastery of grade-level material and readiness for advanced instruction. Low achievement
scores can indicate the need for remediation or repeating a course grade.

Under No Child Left Behind, achievement tests have taken on an additional role of
assessing proficiency of students. Proficiency is defined as the amount of grade-
appropriate knowledge and skills a student has acquired up to the point of testing. Better
teaching practices are expected to increase the amount learned in a school year, and
therefore to increase achievement scores, and yield more "proficient" students than
before.

When writing achievement test items, writers usually begin with a list of content
standards (either written by content specialists or based on state-created content
standards) which specify exactly what students are expected to learn in a given school
year. The goal of item writers is to create test items that measure the most important
skills and knowledge attained in a given grade-level. The number and type of test items
written is determined by the grade-level content standards. Content validity is determined
by the representativeness of the items included on the final test.


Measurement Error
The true score theory is a good simple model for measurement, but it may not always be
an accurate reflection of reality. In particular, it assumes that any observation is
composed of the true value plus some random error value. But is that reasonable? What if
all error is not random? Isn't it possible that some errors are systematic, that they hold
across most or all of the members of a group? One way to deal with this notion is to
revise the simple true score model by dividing the error component into two
subcomponents, random error and systematic error. Here, we'll look at the differences
between these two types of errors and try to diagnose their effects on our research.
What is Random Error?

Random error is caused by any factors that randomly affect measurement of the variable
across the sample. For instance, each person's mood can inflate or deflate their
performance on any occasion. In a particular testing, some children may be feeling in a
good mood and others may be depressed. If mood affects their performance on the
measure, it may artificially inflate the observed scores for some children and artificially
deflate them for others. The important thing about random error is that it does not have
any consistent effects across the entire sample. Instead, it pushes observed scores up or
down randomly. This means that if we could see all of the random errors in a distribution
they would have to sum to 0 -- there would be as many negative errors as positive ones.
The important property of random error is that it adds variability to the data but does not
affect average performance for the group. Because of this, random error is sometimes
considered noise.
What is Systematic Error?

Systematic error is caused by any factors that systematically affect measurement of the
variable across the sample. For instance, if there is loud traffic going by just outside of a
classroom where students are taking a test, this noise is liable to affect all of the children's
scores -- in this case, systematically lowering them. Unlike random error, systematic
errors tend to be consistently either positive or negative -- because of this, systematic
error is sometimes considered to be bias in measurement.
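
A small simulation helps make the contrast concrete. The following Python sketch (the
numbers are arbitrary illustrations, not from the original text) gives every examinee the
same true score, then adds purely random error to one copy of the scores and a constant
systematic error to another; the random version keeps roughly the same mean but gains
spread, while the systematic version shifts the whole group downward:

import random
from statistics import mean, stdev

random.seed(0)
true_scores = [70.0] * 1000                                   # everyone's "true" score is 70

with_random = [t + random.gauss(0, 5) for t in true_scores]   # random noise only
with_systematic = [t - 4 for t in true_scores]                # constant bias (e.g. traffic noise)

print(round(mean(with_random), 1), round(stdev(with_random), 1))          # ~70.0, ~5.0
print(round(mean(with_systematic), 1), round(stdev(with_systematic), 1))  # 66.0, 0.0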
Reducing Measurement Error

So, how can we reduce measurement errors, random or systematic? One thing you can do
is to pilot test your instruments, getting feedback from your respondents regarding how
easy or hard the measure was and information about how the testing environment
affected their performance. Second, if you are gathering measures using people to collect
the data (as interviewers or observers) you should make sure you train them thoroughly
so that they aren't inadvertently introducing error. Third, when you collect the data for
your study you should double-check the data thoroughly. All data entry for computer
analysis should be "double-punched" and verified. This means that you enter the data
twice, the second time having your data entry machine check that you are typing the
exact same data you did the first time. Fourth, you can use statistical procedures to adjust
for measurement error. These range from rather simple formulas you can apply directly to
your data to very complex modeling procedures for modeling the error and its effects.
Finally, one of the best things you can do to deal with measurement errors, especially
systematic errors, is to use multiple measures of the same construct. Especially if the
different measures don't share the same systematic errors, you will be able to triangulate
across the multiple measures and get a more accurate sense of what's going on.

Skewness

Consider a distribution plotted as a histogram. The bars on one side of the distribution
taper differently than the bars on the other side. These tapering sides are called tails,
and they provide a visual means for determining which of the two kinds of skewness a
distribution has:
1. negative skew: The left tail is longer; the mass of the distribution is
concentrated on the right of the figure. It has relatively few low values. The
distribution is said to be left-skewed. Example (observations):
1,1000,1001,1002,1003
2. positive skew: The right tail is longer; the mass of the distribution is
concentrated on the left of the figure. It has relatively few high values. The
distribution is said to be right-skewed. Example (observations):
1,2,3,4,100.

In a skewed (unbalanced, lopsided) distribution, the mean is farther out in the long tail
than is the median. If there is no skewness and the distribution is symmetric, like the
bell-shaped normal curve, then the mean = median = mode.
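
The two example data sets above illustrate this directly; the short Python check below
(an illustrative sketch) shows the mean pulled toward the long tail relative to the median
in each case:

from statistics import mean, median

left_skewed = [1, 1000, 1001, 1002, 1003]   # long left tail
right_skewed = [1, 2, 3, 4, 100]            # long right tail

print(mean(left_skewed), median(left_skewed))    # 801.4 < 1001 (mean pulled left)
print(mean(right_skewed), median(right_skewed))  # 22.0  > 3    (mean pulled right)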

Many textbooks teach a rule of thumb stating that the mean is right of the median under
right skew, and left of the median under left skew. This rule fails with surprising
frequency. It can fail in multimodal distributions, or in distributions where one tail is long
but the other is heavy. Most commonly, though, the rule fails in discrete distributions
where the areas to the left and right of the median are not equal. Such distributions not
only contradict the textbook relationship between mean, median, and skew, they also
contradict the textbook interpretation of the median.[1]

The epidemiology of autism is the study of factors affecting autism spectrum disorders
(ASD). Most recent reviews of epidemiology estimate a prevalence of one to two cases
per 1,000 people for autism, and about six per 1,000 for ASD;[1] because of inadequate
data, these numbers may underestimate ASD's true prevalence.[2] ASD averages a 4.3:1
male-to-female ratio. The number of children known to have autism has increased
dramatically since the 1980s, at least partly due to changes in diagnostic practice; the
question of whether actual prevalence has increased is unresolved,[1] and as-yet-
unidentified environmental risk factors cannot be ruled out.[3] The risk of autism is
associated with several prenatal factors, including advanced parental age and diabetes in
the mother during pregnancy.[4] ASD is associated with several genetic disorders[5] and
with epilepsy,[6] and autism is associated with mental retardation.[7]
