
Running Head: TEST ANALYSIS

Language Testing

Mariandreína Kostantinov and colleagues (2017)

University of South Florida



Introduction

The test described and analyzed in this paper was created for a Level 2 Academic

preparation course from INTO at the University of South Florida. This course deals with reading

and writing strategies such as scanning for specific information and understanding comparative

and contrastive formats. Students are also expected to learn new vocabulary throughout the

course that relates to content from the textbook. This method of teaching through content is required at INTO so that students have concrete information to attach meaning to, rather than being taught abstractly. There are 11 students in the class, 6 male and 5 female (2 Chinese speakers and 9 Arabic speakers), of different ages and educational and professional

backgrounds. The test was created with these factors in mind and with the intention of assessing

students’ understanding of information delivered in class.

Part 1 Test Administration, Scoring, and Observed Problems:

The purpose of this test was to ensure students were grasping knowledge from the class, which included vocabulary items, information from a reading passage, and the ability to restate factual information. The test was slightly longer than two pages and was expected to take roughly thirty minutes to complete. It was given on a Friday in the classroom where the class normally met, after a short, fifteen-minute review of quiz items driven by student questions. Once students felt comfortable and no longer had questions, they were asked to clear their desks before the test was distributed, which they did. Students were told not to write on their test until the instructions for all sections had been explained. After the instructor briefly explained each section, the students were asked whether they had any questions, which they did not. The instructor then stated that students could start, and they were finally allowed to write on their quizzes.



With regard to scoring procedures, an answer key was created, and additional acceptable answers were derived from the course textbook. Once all quizzes were graded, the class mean on the preliminary quiz was 19.5 out of 22, which converts to roughly 89%. The mean score on the second quiz was 35.09 out of 40, or roughly 88%. Although there is a slight decrease in the class average, it can be taken as a sign of reliability, especially since the difference is only about 1 percentage point. The reliability of the test can be considered beneficial in that students can depend on a similar approach to assessment. No scorer training was applicable for this test, since only one scorer was involved (see Table 1, Number of Tests and Test Scores, and the bar graph of scores per student for more information).

# of Tests 1 2 3 4 5 6 7 8 9 10 11

Scores 37 29 38 35 26 35 40 36 30 40 40
Table 1. Number of Tests and Test Scores.
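
As a quick check on the figures above, the class mean for the second quiz and its percentage conversion can be reproduced from the scores in Table 1; the short Python sketch below illustrates the arithmetic (the score list is copied from Table 1).

# Minimal sketch: reproducing the reported class mean and its percentage
# conversion from the second-quiz scores listed in Table 1.
scores = [37, 29, 38, 35, 26, 35, 40, 36, 30, 40, 40]  # Table 1, out of 40

mean_score = sum(scores) / len(scores)   # 35.09
percent = mean_score / 40 * 100          # roughly 88%

print(f"mean = {mean_score:.2f} / 40 ({percent:.0f}%)")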

As mentioned above, the overall design of the test is to establish how successful the learners were in achieving the objectives for two language skills: one receptive (reading) and one productive (writing). Since it is intended to measure the learners' ongoing progress, the test is based on the course objectives and contributes to a formative assessment of the class. In order to make a succinct analysis of the test, it is relevant to include the course objectives. Therefore, in order to achieve the course goals, students will learn:

1. Key concepts and vocabulary related to academic content

2. Active reading strategies to include predicting, previewing, skimming, scanning for

specific information, identifying main ideas, and guessing meaning from context.

3. How to summarize and compare/contrast ideas by locating the main idea using clues from the

context, locating supporting details that give specific information (e.g. dates, background

information), stating a main idea in your own words (paraphrase) and using compare/contrast

terminology and graphic organizers (e.g. Venn diagram).

4. Strategies to improve written fluency and complexity to include forming simple and complex

sentences; and using the appropriate word order, basic transitions, sentence capitalization and

punctuation

5. How to write paragraphs using comparative and contrastive formats

a. writing process: prewriting, drafting, and checking spelling/editing

b. paragraph format: writing a topic sentence, writing supporting details, creating a

concluding idea, using transitions to connect ideas

c. comparative and contrastive elements: using adjectives and compare/contrast

terminology

6. How to collect information from an outside source by using internet search engines to find

images and conduct basic searches and navigating information on a provided website

7. How to use technology by checking email on an electronic device (desktop, laptop, tablet, or phone) outside class.

Perceived Issues in Test Administration

At 8:50am the instructor handed the tests out by placing them on students’ desks and began by

stating “Do not write on the quiz yet”. Once all quizzes were handed out, the instructor began to

explain each section briefly, but one student disregarded the instruction to wait and began writing their name. The instructor firmly asked the student to stop so that all students would have the same amount of time to write, and continued with the instructions once the student stopped writing.

Most students were able to finish the test within the time frame, and students who needed more time were told they could work through the break to finish if necessary, an option two students chose to use. Students were confused by the three questions near the end of the test, since three images were placed nearby; these images corresponded to the last question and were meant to remind students which reading they should write a fact about. Students began writing either the names of these inventions or descriptions of their features, and some wrote answers from the previous section because they were puzzled by the three questions. The instructor realized that many students were having this problem and decided to restate the instructions as well as write them on the board for all students to see. The rewritten instructions read “What are the three purposes of supporting details?”; even so, one student asked for clarification about which item the instructor was referring to. Despite this explanation and the revised instructions on the board, some students were still unsure and continued with their original answers to these questions.

Part 2 Test Results and Statistics:



• Summarize test results; provide relevant test statistics including range, mean, and standard

deviation of your test results; interpret test results.

The range of the scores is 14 points, and this figure indicates the spread of the students' scores. It is useful because it tells us the distance between the highest score on the test (40 points) and the lowest score (26 points); therefore, the difference is 14 points.

Another important measurement was the statistical mean, which also gave us a good basis for interpreting the data. The mean, or average, of the students' scores was 35.09. This tells us which students scored at or above the average (students 1, 3, 4, 6, 7, 8, 10, and 11) and which students scored below it (only three students: 2, 5, and 9). One can say that eight out of eleven students scored at or above the average, roughly 72.7%, with only 27.3% scoring below it. From this finding, one can say that most of the students (72.7%) performed well on the test. (The chart of students' scores and standard deviation illustrates this example.)



In addition to this, the standard deviation (SD) was calculated. The SD from the mean was 3.92. The SD indicates what can be considered normal or typical variation; for example, it shows which students deviate greatly or only slightly from the mean. One can also take this as an indicator of the reliability of the test or of the natural variance in students' comprehension. Table 2, Values of Number of Students, Scores, Mean (μ), Variance, and Standard Deviation (σ), illustrates some examples. One example is student 1, with a variance (squared deviation from the mean) of 3.64, fairly close to the SD (3.92); this is understandable because the student's score (37 points) has only a small gap from the average (35.09 points). Other examples are students 4 and 6, both with the same variance (0.0081), far below the SD; this is justifiable because both students scored almost exactly the average (35 points). However, students 7, 10, and 11 have variances of 24.10, a large distance from the SD (3.92); this is because these three students scored the highest possible points (40), so the gap from the mean is larger. In the case of students 2 and 5, whose variances are 37.09 and 82.62 respectively, the distance from the SD is also large; this is likewise understandable because these two students scored the lowest points, 29 and 26 respectively.



n     Score   Mean (μ)   Variance   SD (σ)
1     37      35.09      3.64       3.92
2     29                 37.09
3     38                 8.46
4     35                 0.0081
5     26                 82.62
6     35                 0.0081
7     40                 24.10
8     36                 0.82
9     30                 25.90
10    40                 24.10
11    40                 24.10
Table 2. Values of Number of Students (n), Scores, Mean (μ), Variance, and Standard Deviation (SD (σ)).
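
For readers who want to retrace Table 2, the sketch below recomputes each student's squared deviation from the class mean (the Variance column) and the range of the scores, using Python's statistics module for the standard deviation. The paper does not state which SD formula produced the reported 3.92, so the SD values printed here are indicative only and may differ from it.

import statistics

# Minimal sketch: per-student squared deviations (the "Variance" column of
# Table 2) and the range of the scores. The SD formula behind the reported
# 3.92 is not stated, so the SD values printed here are indicative only.
scores = [37, 29, 38, 35, 26, 35, 40, 36, 30, 40, 40]   # Tables 1 and 2
mean = sum(scores) / len(scores)                         # 35.09

for n, score in enumerate(scores, start=1):
    squared_dev = (score - mean) ** 2                    # e.g. 3.64 for student 1
    print(f"student {n:2d}: score {score}, variance {squared_dev:.2f}")

print(f"range = {max(scores) - min(scores)}")            # 40 - 26 = 14
print(f"population SD = {statistics.pstdev(scores):.2f}")
print(f"sample SD = {statistics.stdev(scores):.2f}")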

Similarly, the last portion of the test (see Section D) was designed to measure content, and the mean and standard deviation for this section were calculated separately. The mean is 16.4 with a standard deviation of 3.55. The lower mean and higher standard deviation suggest this section may be negatively affecting the overall test's reliability.

Part 3 Validity

Since the textbook Pathways 2: Reading, Writing and Critical Thinking was used, the

content validity was determined according to the goals and objectives of the course. It included representative samples of the language skills taught (e.g., identifying the main idea) and of the relevant structures that

appeared in the textbook. Also, the test specification was based on the goals and objectives of the

course regarding reading/writing skills and strategies necessary for students to advance their level

in the program. For example, some strategies taught are skimming the text for information,

scanning for main ideas or understanding the gist of a passage. These strategies are taught and

learned through content that is found in the textbook assigned to the course. The information that

was tested related to whether students were able to comprehend the article read in class and is able
TEST ANALYSIS 9

to ascertain what the main idea and supporting details of an article are based on strategies such as

identifying supporting details.

With regard to face validity, the test is considered to measure what it needed to measure. For example, the strategies mentioned above that were taught throughout the course, such as skimming the text for information (Pathways 2: Reading, Writing and Critical Thinking), scanning for main ideas, or understanding the gist of a passage, are reflected in the format of the quiz in areas such as vocabulary meanings, fill-in-the-blank sentences, and multiple-choice questions. Hence, the test is

said to have face validity. An example is illustrated in the sample below:

A. Vocabulary (2 points each)


prevention electricity powered solar power eventually afford
Fill in the blank for the following sentences:
1. His family couldn’t _________________________ to pay for him to go to school.
2. The windmill _________________________ the whole town.
3. _________________________ is what some people in poor areas use to cook their food.
4. In some poor areas, people do not have _________________________ or running water.
5. _________________________, I will finish this test.

Part 4 Reliability

• Provide evidence demonstrating reliability of your test; present the type (s) of evidence

most appropriate for your test; discuss results of item analysis.

To create two approximately equal halves, the table of specifications created for the test was used (see Appendix A, Table 3, Table of Specification). Each half contained two gap-fill and two multiple-choice vocabulary questions; one additional gap-fill or multiple-choice vocabulary question; one multiple-choice question on “The Power of Creativity”; one multiple-choice question on “Solar Cooking”; and four short-answer questions on reading skills or “Big Ideas: Little Packages”. The differences between the halves were caused by the odd numbers of vocabulary and reading-skills questions as well as the single “Big Ideas: Little Packages” question.



After splitting the test into equal halves, the correlation coefficient between the halves was calculated (0.6918), with an SD of 4.765. This means there is only a moderate tendency for one half to represent the other, whereas Lado (1991) notes that good vocabulary, structure, and reading tests are usually in the .90 to .99 range. (See Table 5, Split Equal Halves Test Scores, Standard Deviation (SD) and Correlation Coefficient, for more information.)
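
The correlation above can be retraced from the half scores listed in Appendix B. The sketch below does so and also applies the Spearman-Brown correction that is commonly used to estimate full-test reliability from a split-half correlation; the paper does not state whether that correction was part of its own calculation, so it is shown here only as a standard follow-up step.

import statistics

# Minimal sketch: split-half correlation from the Appendix B half scores,
# plus the usual Spearman-Brown step-up (not necessarily the method used
# in this paper). Requires Python 3.10+ for statistics.correlation.
half_a = [19, 14, 18, 18, 15, 19, 20, 19, 18, 20, 20]
half_b = [18, 15, 20, 17, 11, 16, 20, 17, 12, 20, 20]

r = statistics.correlation(half_a, half_b)   # Pearson r, about 0.69
full_test = 2 * r / (1 + r)                  # Spearman-Brown estimate, about 0.82

print(f"split-half r = {r:.4f}")
print(f"Spearman-Brown full-test estimate = {full_test:.2f}")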

Having looked at the mean and the SD, we now turn to rater reliability. For our test, only one observer/judge computed the scores; hence, inter-rater (or inter-observer) reliability is not applicable, since it applies only when two or more observers/judges are involved and subjectivity is at stake. Nevertheless, intra-rater reliability is applicable to our test because the same observer/judge rated the participants on more than one occasion. Therefore, we could expect high agreement (the rater agreeing with herself), since it is the same person doing the observing/rating over time, especially considering that the observer/judge was the teacher who set the task.

By the same token, the rater/observer rescored the last question of the test after several weeks and had 100% agreement with the scores given during the first scoring. The correlation coefficient between the initial scores and the rescored values is 1. This shows high intra-rater reliability for the last question of the test, whose scoring criterion was that the answer be a comprehensible fact derived from textbook content.
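
A check of this kind can be expressed as an exact-agreement rate between the two scoring passes. The sketch below illustrates the idea with hypothetical placeholder scores for the last question, not the actual classroom data.

# Minimal sketch of the intra-rater check described above: the same rater
# scores the last question twice and the exact agreement rate is computed.
# Both score lists are hypothetical placeholders.
first_pass  = [2, 1, 2, 2, 0, 2, 2, 2, 1, 2, 2]
second_pass = [2, 1, 2, 2, 0, 2, 2, 2, 1, 2, 2]

agreement = sum(a == b for a, b in zip(first_pass, second_pass)) / len(first_pass)
print(f"exact agreement = {agreement:.0%}")   # 100% when the two passes match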

Overall, the reliability of the full test calculated for Score A is 1.382 and for Score B is 1.692. These two reliability values are too high, and one might think of removing a few items from the test. One would need to look closely at all items of the test, including the way it was administered, and consider how it might be made more reliable. This is further discussed in Part 6, Test Revision.

Part 5 Item Analysis - (where appropriate) calculate facility value for the multiple choice questions

• Perform item analysis; present and discuss results of item analysis

The purpose of the item analysis is to examine the contribution that each item makes to the test. Items identified as faulty or inefficient can be modified or rejected. For our test analysis, facility values were calculated, along with an analysis of the distractors for the multiple-choice items.

The facility value of an item is the proportion of test takers/students who answered it correctly. For example, the facility value of multiple-choice question 1 is 1: 11 out of 11 students answered this question correctly. The item is illustrated below, where the correct answer is b. invent and think of new ideas. With regard to the distractors for this multiple-choice (MC) item, c. exercise every day and d. travel frequently do not appear to be good distractors. The answer a. accomplish many goals may not be effective because the word “accomplish” lies outside the students' lexical knowledge. Therefore, these distractors do not contribute to the test's reliability. This is further discussed in Part 6, Test Revision.

1. Creative people are more likely to…

a. accomplish many goals


b. invent and think of new ideas
c. exercise every day
d. travel frequently
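
As a concrete illustration of the facility-value calculation used throughout this part, the sketch below divides the number of correct answers to an item by the number of students. The 0/1 response rows are hypothetical placeholders chosen only to reproduce facility values like those reported for questions 1, 2, and 5; they are not the actual item-level data.

# Minimal sketch: facility value = proportion of students answering an item
# correctly. The 0/1 rows below are hypothetical, not the real responses.
responses = {
    "MC 1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # 11/11 correct -> 1.00
    "MC 2": [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],   # 10/11 correct -> about .90
    "MC 5": [1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1],   # 8/11 correct  -> about .72
}

for item, row in responses.items():
    facility = sum(row) / len(row)
    print(f"{item}: facility value = {facility:.2f}")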

The facility value for multiple-choice question 2 is 0.90. This value suggests that the question has good distractors and is effectively measuring content.

The facility value for question 3 is also 1, meaning every student answered this question correctly. The sample question can be seen below. The distractors used vocabulary that the students are familiar with; consequently, the students should be aware of the meaning of each distractor. One can assume that this is possible because students may know which distractors are not correct and therefore choose the correct answer, c., through a process of elimination.

3. An example of equipment could be…

a. electricity, solar power, fuel


b. diagrams, designs, illustrations
c. shock absorber, tractor fan, PVC pipes
d. invention, creation, product

The facility value for multiple-choice question 4 is also 1; thus every student answered it correctly. The sample question can be seen below. Answers b. and c. are designed to distract the student by evoking the words “pollution” and “population”, which are both complex words that start with the letter ‘p’. However, these may not be good distractors because of the students' potential lack of world knowledge about pollution or population. Another reason the distractors may fail in this example is the differing grammatical structures of the options: in the sample below, a student could identify the answer through grammatical knowledge alone, based on the formulaic pattern subject + be + -ing. On that basis, the student can narrow the choices to a. or d. and therefore has a 50% chance of getting the answer right.

4. Prevention is …

a. eating something quickly


b. when the environment is dirty
c. the amount of people in a country
d. stopping something from happening

Multiple-choice question 5 is illustrated below. The facility value is .72; only 8 out of 11 students got it right, so the distractors did an adequate job of distracting students. Also, the fact that the students may not have seen the picture before may have affected their choices; perhaps some students were confused by the answers being very similar in meaning.

5. Look to the box on the right → What is this an example of?

a. a design
b. a diagram
c. an illustration
d. all of the above

The facility value for Section B question 1 is .90, suggesting it is a good indicator of the tested content. The facility value for Section B question 2 is .81, indicating it is neither too easy nor too difficult for the students. The facility value for Section C question 1 is .63, and the question is shown below. The answers follow the same grammatical format, but to find the correct answer a student must have a good knowledge of the text and understand what the question is asking. The question is also a long sentence, which may be above the students' level of English.

William could not read the book about windmills because he did not know much English, so what did he do?

a. he went to school to learn English
b. he built a windmill
c. he went to the library
d. he used the pictures

The facility value of question 2 in Section C is 1. This suggests the question may have been fairly easy for the students or that the distractors were not adequate. The sample question is shown below. Answer c. stands out as incorrect because the subject is “he” instead of “they”, unlike the question, which asks “what did they do?”. Answer b. likewise lacks “they” and does not include an action verb.

The village William lived in needed more water, what did they do?

a. they used the windmill to get water


b. there was a drought
c. he went to the junkyard
d. they powered their cell phones

Part 6 Test Revision

For our test revision we have considered two components of test reliability: one is the students' performance and the other is the reliability of the scoring. We believe that the items that would be excluded may not have discriminated well between weaker and stronger students; perhaps the items were either too easy or too difficult for the candidates. For example, it would be interesting to analyze an item that discriminates in favour of the weaker students, where the weaker students perform better than the stronger ones. However, this kind of item analysis is best left to the best informant in the classroom: the teacher, who is able to identify the weaker and stronger students.

We believe that doing an item analysis and calculating discrimination indices would contribute to more reliable test scoring. The sample below shows that the facility value for multiple-choice question 1 is 1; this means that 11 out of 11 students answered the question correctly, hence the item does not discriminate at all (weak and strong students performed equally well on it). The discrimination index here is zero, and it would likewise have been zero if all the students had gotten the answer wrong. We consider discrimination to be important for our test revision because the more discriminating the items are, the more reliable the test will be; a brief sketch of this calculation follows the sample item below.

1. Creative people are more likely to…

a. accomplish many goals
b. invent and think of new ideas
c. exercise every day
d. travel frequently
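
As noted above, here is a minimal sketch of one common discrimination index: the proportion of correct answers in the top-scoring half of the class minus the proportion in the bottom-scoring half. The grouping rule and the all-correct item row are illustrative assumptions; only the total scores come from Table 1.

# Minimal sketch of a discrimination index: proportion correct among the
# top-scoring students minus proportion correct among the bottom-scoring ones.
def discrimination_index(item_correct, total_scores):
    """item_correct: 0/1 per student; total_scores: each student's test total."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    half = len(order) // 2
    low, high = order[:half], order[-half:]
    p_high = sum(item_correct[i] for i in high) / len(high)
    p_low = sum(item_correct[i] for i in low) / len(low)
    return p_high - p_low

totals = [37, 29, 38, 35, 26, 35, 40, 36, 30, 40, 40]   # Table 1 scores
item_1 = [1] * 11                                       # MC question 1: everyone correct

print(discrimination_index(item_1, totals))             # 0.0 -> no discrimination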

In addition to calculating discrimination indices, another important aspect of our test revision is a succinct analysis of the distractors. An example is illustrated below for question 2 of Section C, whose facility value is 1. This suggests the question may have been fairly easy for the students or that the distractors were not adequate. Answer c. stands out as incorrect because the subject is “he” instead of “they”, unlike the question, which asks “what did they do?”. Answer b. likewise lacks “they” and does not include an action verb.

The village William lived in needed more water, what did they do?

a. they used the windmill to get water
b. there was a drought
c. he went to the junkyard
d. they powered their cell phones

Section D proved to be confusing for students due to the irregularity of its instructions and therefore should be modified for future use. It was also placed next to the images for Section E, which further confused students as to whether the items were related. A revision would include more explicit instructions that trigger students' memory of what answer the question is seeking, such as “What are the three functions or purposes of supporting details?”. It would also be prudent to place it as the third section, so that students complete it before writing the supporting details from the passage and so that the three images from the last section do not interfere.

In sum, distractors that do not work well are normally chosen by very few candidates (for this test, only one person chose such a distractor), and they make no contribution to test reliability. This type of distractor should be replaced with better ones or modified (e.g., by using the pronoun “they” in all the choices). Indeed, involving more judges/observers/raters in the construction of the test might have contributed to greater reliability. The instructions could have been clearer in certain sections, and better formatting could have helped to remedy other confusions as well. With a revised version of this test, it is believed that students will be able to perform with more accuracy and provide more insight for future test construction.

References

Hughes, A. (2012). Validity; Reliability. In Testing for Language Teachers (Second ed.). Cambridge: Cambridge University Press.

Perry, F., Jr. (2011). Research in Applied Linguistics: Becoming a Discerning Consumer (Second ed.). New York, New York: Routledge.



Appendices
Appendix A
Text                            Vocabulary          Content             Reading Skills   Total number   Percentage
                                Gap Fill   MC       Short Ans.  MC      Short Answer     of Items       of Items
“The Power of Creativity”       5          5                    2                        12             54%
“Solar Cooking”                                                 2                        2              9%
“Big Ideas: Little Packages”                        1                                    1              4%
Reading Skills                                                          7                7              31%
Total number of Items           5          5        1           4       7                22
Percentage of Items             22%        22%      4%          18%     31%

Table 3. Table of Specification

Appendix B

Student #   Score A   Score B
1           19        18
2           14        15
3           18        20
4           18        17
5           15        11
6           19        16
7           20        20
8           19        17
9           18        12
10          20        20
11          20        20
Table 4. Split Test Analysis

Appendix C

# of Students   Score A   Score B   SD      Correlation Coefficient
1               37        35        4.765   0.6910
2               29        40
3               38        36
4               35        30
5               26        40
Table 5. Split Equal Halves Tests Scores, Standard Deviation (SD) and Correlation Coefficient.
