Language Testing
Introduction
The test described and analyzed in this paper was created for a Level 2 Academic preparation course from INTO at the University of South Florida. This course deals with reading and writing strategies such as scanning for specific information and understanding comparative and contrastive formats. Students are also expected to learn new vocabulary throughout the course that relates to content from the textbook. This method of teaching knowledge through content is required at INTO in order to give students information to attach their meanings to, rather than teaching abstractly. There are 11 students in the class: 6 male and 5 female (2 Chinese speakers and 9 Arabic speakers) of different ages and educational and professional backgrounds. The test was created with these factors in mind and with the intention of assessing students' progress toward the course objectives.
The purpose of this test was to ensure students were grasping knowledge from the class, including vocabulary items, information from a reading passage, and the ability to reiterate factual information. The test was slightly longer than two pages and was expected to take roughly thirty minutes to complete. It was given on a Friday in the classroom where the class was normally held, after a short fifteen-minute review of items on the quiz that was driven by student questions. Once students felt comfortable and no longer had questions, they were asked to remove all things from their desks before the test was distributed, to which they complied. Students were told not to write on their tests until the instructions in all sections were explained. After each section was briefly explained by the instructor, the students were asked if they had any questions, which they did not. Students were then allowed to begin their quiz once the instructor stated that they could start.
With regard to scoring procedures, an answer key was created, and additional acceptable answers were derived from the course textbook. Once all quizzes were graded, the mean score of the class on the preliminary quiz was 19.5 out of 22, or roughly 89%. The mean score of the second quiz was 35.09 out of 40, or roughly 88%. Although there is a slight decrease in the class averages, the difference is only about one percentage point, which can be taken as a sign of consistency between the two assessments. This consistency is beneficial in that students can depend on a similar approach to assessment. No scorer training was applicable for this test, as only one scorer was involved (see Table 1. Number of Tests and Test Scores and the bar graph of scores per student for more information).
Test #   1   2   3   4   5   6   7   8   9   10  11
Score    37  29  38  35  26  35  40  36  30  40  40
Table 1. Number of Tests and Test Scores.
As mentioned above, the overall design of the test is to establish how successful the learners were in achieving the objectives based on two language skills, one receptive (reading) and one productive (writing). Since it is intended to measure the learners' ongoing progress, the test is based on course content. To provide a succinct analysis of the test, it is relevant to include the course objectives, which address, among other items, scanning for specific information, identifying main ideas, and guessing meaning from context, as well as the following:
3. How to summarize and compare/contrast ideas by locating the main idea using clues from the context, locating supporting details that give specific information (e.g., dates, background information), stating a main idea in your own words (paraphrasing), and using compare/contrast formats
4. Strategies to improve written fluency and complexity, including forming simple and complex sentences and using appropriate word order, basic transitions, sentence capitalization, and punctuation
5. New vocabulary related to textbook content, including key terminology
6. How to collect information from an outside source by using internet search engines to find images, conducting basic searches, and navigating information on a provided website
7. How to use technology, including checking email on an electronic device (desktop, laptop, tablet, or phone)
At 8:50 a.m. the instructor handed out the tests by placing them on students' desks, beginning by stating, "Do not write on the quiz yet." Once all quizzes were handed out, the instructor began to explain each section briefly, but one student disregarded the instruction to wait and began writing their name. The instructor firmly asked the student to stop so that all students had the same amount of time available to write, and continued with instructions once the student stopped writing.
Most students were able to finish the test within the time frame; students who needed more time were told they could work through the break to finish, which two students decided to do. Students were confused about the three questions near the end of the test, since three images that corresponded to the last question sat alongside them, intended to remind students which reading they were meant to write a fact about. Students began writing either the names of these inventions or descriptions of their features. Students also repeated answers from the previous section because they were puzzled by the three questions. The instructor realized that many students were having this problem and decided to reiterate the instructions as well as write them on the board for all students to see. The instructions read "What are the three purposes of supporting details?", which students observed, and one student asked for clarification of which item the instructor was referring to. Despite this explanation and the revised instructions written on the board, some students were still unsure and continued with their original answers.
The range is 14 points, an indication of the spread among the students' scores. This statistic is useful because it tells us the distance between the highest student score (40 points) and the lowest student score (26 points) on the test; the difference is 40 − 26 = 14 points.
Another important measurement was the statistical mean, which also helps in interpreting the data. The mean, or average, of the students' scores is 35.09. This tells us which students scored at or near the mean or above it (students 1, 3, 4, 6, 7, 8, 10, and 11, with students 4 and 6 scoring 35, essentially at the mean) and which students scored clearly below it (three students: 2, 5, and 9). One can say that eight out of eleven students, roughly 72.7%, scored at or above the average, with only 27.3% scoring below it. From this finding, one can say that most of the students (72.7%) performed well on the test (see Chart of Students' Scores).
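The range and mean reported above can be reproduced with a short script. The following is a minimal sketch in Python, using the scores from Table 1; it yields the range of 14 points and the mean of 35.09.

# Descriptive statistics for the eleven quiz scores reported in Table 1.
scores = [37, 29, 38, 35, 26, 35, 40, 36, 30, 40, 40]

score_range = max(scores) - min(scores)   # 40 - 26 = 14
mean = sum(scores) / len(scores)          # 386 / 11 = 35.09
print(f"range = {score_range}, mean = {mean:.2f}")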
In addition to this, the standard deviation (SD) was calculated. Using the population formula (dividing the summed squared deviations by n = 11), the SD for these scores is 4.58. The SD describes the normal or typical spread: it tells us which students deviate largely or slightly from the mean, and it can be read as an indicator of either the reliability of the test or the natural variance in students' comprehension. Table 2 (Values of Number of Students, Scores, Mean (μ), Variance, and Standard Deviation (σ)) illustrates some examples; the per-student "variance" values are the squared deviations of each score from the mean. One example is student 1, with a squared deviation of 3.64, a small value because the student's score (37 points) has only a small gap from the average (35.09 points). Another example is students 4 and 6, both with the same squared deviation (0.0081), essentially zero; this is justifiable because both students scored almost exactly the average (35 points). However, students 7, 10, and 11 have squared deviations of 24.10, far from the mean, because all three scored the highest points (40), so the gap is larger. In the case of students 2 and 5, whose squared deviations are 37.09 and 82.62 respectively, the distance from the mean is also large; this is understandable because they scored the lowest points (29 and 26 respectively).
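The squared deviations in Table 2 and the standard deviation can likewise be recomputed. The sketch below, again in Python and assuming the Table 1 scores, applies the population formula (dividing by n = 11) and yields σ ≈ 4.58 for these data.

# Squared deviation of each score from the mean, plus the population SD.
scores = [37, 29, 38, 35, 26, 35, 40, 36, 30, 40, 40]
mean = sum(scores) / len(scores)  # 35.09

squared_devs = [(s - mean) ** 2 for s in scores]
for student, dev in enumerate(squared_devs, start=1):
    print(f"student {student:2d}: squared deviation = {dev:.4f}")

variance = sum(squared_devs) / len(scores)  # divide by n (population formula)
sd = variance ** 0.5                        # about 4.58 for these scores
print(f"variance = {variance:.2f}, SD = {sd:.2f}")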
Similarly, the last portion of the test (see Section D) was designed to measure content, and the mean and standard deviation for this section were calculated separately. The mean is 16.4 with a standard deviation of 3.55. The lower mean and proportionally higher standard deviation suggest this section was more difficult for students and produced a wider spread of scores.
Part 3 Validity
Since the textbook Pathways 2: Reading, Writing and Critical Thinking was used, content validity was determined according to the goals and objectives of the course. The test included representative samples of the language skills taught (e.g., main idea) and of the relevant structures that appeared in the textbook. The test specification was also based on the goals and objectives of the course regarding the reading/writing skills and strategies students need to advance their level in the program. For example, some strategies taught are skimming the text for information, scanning for main ideas, and understanding the gist of a passage. These strategies are taught and learned through content found in the textbook assigned to the course. The information tested related to whether students were able to comprehend the article read in class and to ascertain the main idea and supporting details of an article based on strategies such as skimming and scanning.
With regard to face validity, the test is considered to measure what it needed to measure. For example, the strategies mentioned above and taught throughout the course, such as skimming the text for information (Pathways 2: Reading, Writing and Critical Thinking), scanning for main ideas, and understanding the gist of a passage, are reflected in the format of the quiz in areas such as vocabulary meanings, fill-in-the-blank sentences, and multiple choice questions. Hence, the test can be considered to have face validity.
Part 4 Reliability
To create two approximately equal halves, the table of specifications created for the test was used (see Appendix A, Table 3. Table of Specifications). Each half contained 2 gap-fill and 2 multiple choice vocabulary questions; one additional gap-fill or multiple choice vocabulary question; 1 multiple choice power of creativity question; one multiple choice solar cooking question; and 4 short answer questions on reading skills/Big Ideas: Little Packages. The differences between the halves were caused by an odd number of vocabulary and reading skills questions, as well as the odd number of Big Ideas: Little Packages items.
After splitting the test into equal halves, the correlation coefficient between the halves was calculated as 0.6918, with an SD of 4.765. This means there is a moderate tendency for one half to represent the other; for comparison, Lado (1991) states that good vocabulary, structure, and reading tests are usually in the .90 to .99 range (see Table 5. Split Equal Halves Tests Scores, Standard Deviation (SD) and Correlation Coefficient).
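The split-half coefficient can be reproduced from the half scores in Appendix B (Table 4). The sketch below, in Python, computes the Pearson correlation between the two halves (≈ 0.6918); it also applies the Spearman-Brown correction, a standard extra step that is not part of the original analysis, to estimate what the reliability of the full-length test would be.

import statistics

# Half scores per student, taken from Appendix B (Table 4).
half_a = [19, 14, 18, 18, 15, 19, 20, 19, 18, 20, 20]
half_b = [18, 15, 20, 17, 11, 16, 20, 17, 12, 20, 20]

# Pearson correlation between the two halves (about 0.6918).
r = statistics.correlation(half_a, half_b)

# Spearman-Brown correction: estimates full-test reliability from the
# half-test correlation; about 0.82 for these data.
full_length_reliability = 2 * r / (1 + r)

print(f"split-half r = {r:.4f}")
print(f"Spearman-Brown estimate = {full_length_reliability:.4f}")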
Knowing the mean and the SD, we now turn to rater reliability. For our test analysis, only one observer/judge computed the scores; hence, inter-rater reliability is not applicable, since it applies only when two or more observers/judges are involved and subjectivity is at stake. Nevertheless, intra-rater reliability is applicable to our test because the same observer/judge scored the participants on more than one occasion. Therefore, we could expect high agreement (the rater agreeing with herself), since it is the same person doing the rating over time and the observer/judge was the only rater throughout.
By the same token, the rater/observer rescored the last question of the test after multiple weeks and had 100% agreement with the scores given during the first scoring. The correlation coefficient between the initial scores and the rescores is 1. This shows high intra-rater reliability for the last question of the test, whose criterion was that the answer be a comprehensible fact derived from textbook content.
Overall, the reliability of the full test calculated for Score A is 1.382 and for Score B is 1.692. These values are implausibly high, since reliability coefficients should not exceed 1; this suggests a calculation problem, and one might also think of removing a few items from the test. One would need to look closely at all items of the test, including the way it was administered, and consider how it might be made more reliable. This is further discussed in Part 6, Test Revision.
Part 5 Item Analysis
The purpose of the item analysis is to examine the contribution that each item makes to the test. Items identified as faulty or inefficient can be modified or rejected. For our test analysis, a facility value was calculated for each item, along with an analysis of the distractors in the case of multiple choice questions.
The facility value of an item is the proportion of test takers/students who answered it correctly. For example, the facility value of multiple choice question number 1 is 1: 11 out of 11 students answered this question correctly. One instance of this is the item whose correct answer is b. invent and think of new ideas. With regard to the distractors for this multiple choice (MC) item, c. exercise everyday and d. travel frequently do not appear to be good distractors. The answer a. accomplish many goals may not be effective because the word "accomplish" falls outside the students' lexical knowledge. Therefore, these distractors do not contribute to the test's reliability. This is discussed further in Part 6, Test Revision.
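Computing facility values is mechanical once item responses are recorded. The sketch below, in Python, assumes a hypothetical 0/1 response matrix (one row per student, one column per item); an item answered correctly by everyone, like question 1 above, receives a facility value of 1.0.

# Facility value = proportion of test takers answering an item correctly.
# Hypothetical 0/1 response matrix: one row per student, one column per item.
responses = [
    [1, 1, 1],
    [1, 0, 1],
]

def facility_values(matrix):
    n_students = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_students for j in range(n_items)]

print(facility_values(responses))  # [1.0, 0.5, 1.0] for the two rows above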
The facility value for multiple choice question number two is 0.90, meaning 10 out of 11 students answered it correctly; the item was easy but still produced some variation.
The facility value for number 3 is also 1, meaning every student answered this question correctly. The sample question can be seen below. The distractors used vocabulary that the students are familiar with; consequently, the students were likely aware of the meaning of each distractor. One can assume this result is possible because the students knew which distractors were not correct and therefore chose the correct answer c. through a process of elimination.
The facility value for multiple choice question 4 is also 1, thus every student answered it correctly. The sample question can be seen below. Answers b. and c. are designed to distract the student with the words "pollution" and "population", which are both complex words that start with the letter 'P'. However, these may not be good distractors, given the students' potential lack of world knowledge about pollution or population. Another reason the distractors may fail in this example is the differing grammatical structures of the options. In the sample below, the student could identify the answer from grammatical knowledge alone, based on the formulaic sentence: subject + be + verb-ing. On this basis, the student can narrow the choice to options a. or d., leaving a 50% chance of getting the answer right.
4. Prevention is …
Multiple choice question number 5 is illustrated below. The facility value is .72: only 8 out of 11 students got it right, so the distractors did an adequate job of distracting students. The fact that the students may not have seen the picture before may also have affected their choices; perhaps students were confused by answers that are very similar in meaning.
(image on the right)
What is this an example of?
a. a design
b. a diagram
c. an illustration
d. all of the above
The facility value for section B question 1 is .90, suggesting it is a reasonable indicator of content knowledge. The facility value for section B question 2 is .81, indicating it is neither too easy nor too difficult for the students. The facility value for section C question 1 is .63, and the question is shown below. The answer options follow the same grammatical format, but to find the correct answer the student must know the text well and understand what the question is asking. The question is also a long sentence, which may be above the students' level of English.
William could not read the book about windmills because he did not know much English,
so what did he do?
The facility value of question 2 in section C is 1. This suggests the question might have been fairly easy for the students, or that the distractors were not adequate. The sample question is shown below. Answer c. is incorrect, and it stands out because its subject is "he" instead of "they", unlike the question, which asks "what did they do?". Answer b. also lacks "they" and, moreover, does not include an action verb.
The village William lived in needed more water, what did they do?
Part 6 Test Revision
For our test revision we have considered two components of test reliability: one is the students' performance, and the other is the reliability of the scoring. We believe that the items marked for exclusion may not have discriminated well between weaker and stronger students; perhaps they were either too easy or too difficult for the candidates. For example, it would be interesting to analyze an item that discriminates in favour of the weaker students, where weaker students perform better than stronger ones. Such an analysis is best left to the best informant in the classroom: the teacher, who is capable of identifying the weaker and stronger students.
We believe that performing an item analysis and calculating discrimination indices might contribute to more reliable test scoring. As with the sample discussed in Part 5, the facility value for multiple choice question number 1 is 1, meaning that 11 out of 11 students answered this question correctly; hence this item does not discriminate at all (weak and strong students performed equally well on it). The discrimination index would likewise have been zero if all the students had gotten the answer wrong. We consider discrimination important for our test revision because the more discriminating the items are, the more reliable the test will be.
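A simple way to calculate such a discrimination index, sketched below with hypothetical item responses attached to the real total scores from Table 1, is to compare the proportion correct in the top-scoring and bottom-scoring thirds of the class; when both groups perform identically, the index is 0.

# Discrimination index via upper/lower groups, using hypothetical data.
# Each tuple is (total test score, 0/1 response on the item of interest).
students = [(40, 1), (40, 1), (40, 1), (38, 1), (37, 1), (36, 1),
            (35, 1), (35, 1), (30, 1), (29, 0), (26, 0)]

students.sort(key=lambda t: t[0], reverse=True)
k = len(students) // 3            # size of the upper and lower groups
upper, lower = students[:k], students[-k:]

p_upper = sum(item for _, item in upper) / k
p_lower = sum(item for _, item in lower) / k
d_index = p_upper - p_lower       # 0 when both groups perform the same
print(f"discrimination index = {d_index:.2f}")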
In addition to calculating discrimination indices, another important aspect of our test revision is a succinct analysis of the distractors. An example is illustrated below: for question 2 in section C, the facility value is 1. This suggests the question might have been fairly easy for the students, or that the distractors were not adequate. Answer c. is incorrect, and it stands out because its subject is "he" instead of "they", unlike the question, which asks "what did they do?". Answer b. also lacks "they" and does not include an action verb.
The village William lived in needed more water, what did they do?
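Distractor analysis can be supported the same way, by tallying how often each option is chosen. The sketch below uses hypothetical answer choices; a distractor chosen by very few students, as in the case discussed above, is immediately visible in the counts.

from collections import Counter

# Hypothetical answer choices for one multiple choice item ('a' is the key).
choices = ["a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "c"]

tally = Counter(choices)
for option in "abcd":
    print(f"option {option}: chosen by {tally.get(option, 0)} student(s)")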
Section D proved to be confusing for students due to the irregularity of the instructions and should therefore be modified for future use. It was also placed next to the images for Section E, which further confused students as to whether the items were related. A revision would include more explicit instructions that trigger students' memory of what answer the question is seeking, such as "What are the three functions or purposes of Supporting Details?". It would also be prudent to place it as the third section, so that students complete it before writing the supporting details from the passage and so that the three images from the last section cannot cause confusion.
In sum, distractors that do not work well are normally chosen by very few candidates (for this test, only one person chose the distractor in question) and make no contribution to test reliability. This type of distractor should be replaced with better ones or modified (e.g., using the pronoun "they" in all of the choices). Indeed, involving more judges/observers/raters in the making of the test might have contributed to its reliability. The instructions could have been clearer in certain sections, and better formatting could have remedied other confusions as well. With a revised version of this test, it is believed that students will be able to perform with more accuracy and confidence.
Appendices
Appendix A
Table 3. Table of Specifications (columns: Vocabulary — Gap Fill, Multiple Choice; Content — Short Answer, Multiple Choice; Reading Skills — Short Answer; Total Number of Items; Percentage of Items)

"Solar Cooking": 2 items (9% of the test)
"Big Ideas: Little Packages": 1 item (4% of the test)
Appendix B
Student #   Half A   Half B
1           19       18
2           14       15
3           18       20
4           18       17
5           15       11
6           19       16
7           20       20
8           19       17
9           18       12
10          20       20
11          20       20
Table 4. Split Test Analysis.
Appendix C
Student #   Score   Student #   Score
1           37      6           35
2           29      7           40
3           38      8           36
4           35      9           30
5           26      10          40
                    11          40
SD = 4.765; correlation coefficient between halves = 0.6918
Table 5. Split Equal Halves Tests Scores, Standard Deviation (SD) and Correlation Coefficient.