
The Libyan Academy, Misurata

School of Languages
English Language Department
Language Testing

Scoring Language Tests

Presenter: Ameera F. Alargat


Inspector: Dr. Abdulhameed Onaiba
Contents
- What is Scoring?
- Scorability
  - Types of Scorable Responses
  - Factors Affecting Scorability
  - Strategies to Enhance Scorability
- Rules for Scoring Items
- Types of Scores
  - Band Score
  - Raw Score
  - True Score
  - Zero Score
- Scores Interpretation Methods
  - Norm-referenced Interpretation
  - Criterion-referenced Interpretation
- Rating Scales
  - Holistic Scales
  - Analytical Scales
  - Primary-trait Scales
  - Multi-trait Scales
- The Impact of Scoring on Reliability and Validity
  - Scoring Reliability
  - Types of Scoring Reliability
  - Factors That Impact Scoring Reliability
  - Scoring Validity
  - Types of Scoring Validity
  - Enhancing Scoring Validity
- Automated Scoring
Scoring
What is scoring?
- Scoring is the evaluation of performance by assigning a grade or score.
- It is concerned with the "how much" or "how good" of language testing.
- "Scoring refers to the process of evaluating test takers' performance and assigning numerical or categorical values to represent the level of language proficiency demonstrated. Scoring may be holistic, where a single score is assigned, or analytic, where separate scores are given for different aspects of language use." (Fulcher & Davidson, 2007)
Scorability in Language Testing
What is Scorability?
- Scorability refers to the degree to which test responses can be scored objectively and reliably
- It is a crucial component of test quality and validity
- Highly scorable tests allow for consistent scoring across raters and time
Types of Scorable Responses
1. Objectively Scorable Responses
- Multiple-choice, true/false, matching, fill-in-the-blank
- Can be scored accurately and consistently by multiple raters
2. Subjectively Scorable Responses
- Open-ended, constructed responses (e.g. essays, oral responses)
- Require human judgment and may be prone to scoring inconsistencies

Closed-response items (e.g. MCQs) are more scorable than open-ended response questions (Fulcher, 2010, p. 201).
Factors Affecting Scorability

- Response format, i.e. the type of question (objective vs. subjective)
- Clarity of scoring criteria and rubrics (i.e., clear rules and guidelines)
- Rater training and monitoring
- Contextual factors (e.g. test setting, task complexity)

Strategies to Enhance Scorability

- Use a mix of objectively and subjectively scorable items

- Develop clear, comprehensive scoring rubrics

- Train raters extensively

- Double-check the scoring or have multiple people score the same answers

- Use technology (e.g. automated scoring) where appropriate


Rules for Scoring Items
1- Identifying the Scoring Method
- Match Correct:
This refers to the processing option where either the question is answered right or wrong. There is no
partial credit. (Right or Wrong scoring approach)
- Map Response:
This refers to the processing option where a response can receive partial credit. It is useful for questions where the answer is given in multiple parts. A single part can be weighted more heavily than the other parts in this option.
- Choose based on question type and desired level of detail
2- Selecting the Appropriate Scoring Method
- Match Correct: Best for simple yes/no or single-component questions
- Map Response: Useful when the answer has multiple parts and you want to give partial credit
Example:
- Multiple Choice (Match Correct)
- Short Answer (Map Response)
3- Assigning Partial Credit (Map Response)
- Review possible answers, assign point values to each
- Weight components based on relative importance
- Allows credit for partially correct responses
Example:
- Essay question with defined scoring criteria (e.g. 1 point per key factor, 2 points for analysis)
4- Setting Score Ranges
- This refers to defining the minimum and maximum possible scores for the assessment item.
- It sets the range of scores that the student can achieve.
- For example, if the maximum possible score is 10 points, then the score range would be
from 0 to 10 points.
- This ensures that the scoring is fair and accurately reflects the student's performance, without
scores falling outside a reasonable range.
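The scoring rules above can be sketched in code. The functions and weights below are illustrative assumptions, not part of the source: a Match Correct item is scored all-or-nothing, a Map Response item sums weighted credit for each correct part, and the final score is clamped to the defined score range.

```python
# Illustrative sketch of Match Correct vs. Map Response scoring.
# Function names, weights, and the 0-10 range are hypothetical examples.

def match_correct(response, key, points=1):
    """Right-or-wrong scoring: full points or zero, no partial credit."""
    return points if response == key else 0

def map_response(response_parts, key_parts, weights):
    """Partial-credit scoring: each correct part earns its own weight."""
    return sum(w for r, k, w in zip(response_parts, key_parts, weights) if r == k)

def clamp(score, minimum=0, maximum=10):
    """Keep the score inside the defined score range (here 0 to 10)."""
    return max(minimum, min(maximum, score))

# A multiple-choice item scored Match Correct:
print(match_correct("b", "b"))   # full credit

# A two-part short answer scored Map Response,
# with the first part weighted more heavily than the second:
print(clamp(map_response(["past", "walked"], ["past", "walk"], [2, 1])))
```

Clamping at the end enforces the score range from rule 4, so no response can earn a score outside the defined minimum and maximum.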
Types of Scores

1. Band Score:
- This is a level or grade, like A, B, C, etc.
- It's based on meeting certain criteria or standards.
- For example, an essay might get a "Band 5" score based on things like organization,
vocabulary, grammar.

2. Raw Score:
- This is just the total number of correct answers on a test.
- For example, if you got 25 out of 50 questions right, your raw score is 25.
- It doesn't have any special meaning; it's just the basic number you got right.
3. True Score:
- This is a hypothetical "ideal" score that represents the test-taker's actual ability.
- It's what your score would be if the test were perfect and had no errors.
- Your true score reflects your real skill or knowledge, not just what you happened to get on one test.

4. Zero Score (Standard Score):
- This converts your raw score into a standardized number.
- It shows how your score relates to the average or "normal" score.
- For example, on a scale whose mean is 100, a standard score of 100 would mean you scored at the average level.
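A common way to standardize a raw score is the z-score, computed from the group mean and standard deviation; a scaled score (e.g. on a mean-100 metric) can then be derived from it. The helpers and the norm-group data below are an illustrative sketch, not from the source.

```python
from statistics import mean, pstdev

def z_score(raw, scores):
    """Standardize a raw score against the group mean and standard deviation."""
    return (raw - mean(scores)) / pstdev(scores)

def scaled_score(raw, scores, scale_mean=100, scale_sd=15):
    """Convert the z-score to a familiar scale, e.g. mean 100, SD 15."""
    return scale_mean + scale_sd * z_score(raw, scores)

group = [20, 25, 30, 35, 40]           # invented raw scores of a norm group
print(round(z_score(30, group), 2))    # 0.0 -- exactly average
print(scaled_score(30, group))         # 100.0 -- average on the scaled metric
```

A raw score equal to the group mean standardizes to zero, which the scaled metric then maps to its own average (here 100).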
Scores interpretation methods
1- Norm-referenced
- Test takers are ranked based on their performance compared to the norm group (a group with similar characteristics, such as age or grade level, who has taken the same test).
- The interpretation focuses on how well an individual performed compared to others in the group.

Examples:
- If a student receives a percentile rank score of 34, this means that he or she performed better than 34% of the students in the norm group (Hussain, Tadesse, & Sajid, 2015).
- College admissions
- IQs
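The percentile-rank example above can be sketched as a small function (the norm-group scores are invented for illustration): the percentile rank is the share of norm-group scores that fall below the test taker's score.

```python
def percentile_rank(score, norm_group):
    """Percentage of norm-group scores that the given score beats."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

# Invented norm group of ten raw scores:
norm_group = [40, 45, 50, 55, 60, 65, 70, 75, 80, 90]

# A score of 58 beats four of the ten norm-group scores:
print(percentile_rank(58, norm_group))   # 40.0
```

Note that the same raw score yields a different percentile rank against a different norm group, which is why the choice of norm group matters for interpretation.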
Advantages:
- Provides information about an individual's performance relative to a larger group, allowing for ranking and comparison.
- Helps identify high-performing and low-performing individuals within the norm group.
- Allows for tracking an individual's progress over time by comparing their scores to the same norm group.
- Easy to use

Disadvantages:
- Does not provide information about an individual's mastery of specific skills or knowledge.
- Focuses on comparing individuals rather than measuring their absolute performance against a standard.
2- Criterion-referenced
- It focuses on evaluating an individual's performance against predetermined criteria or standards. The purpose is to determine whether the individual has achieved a specific level of proficiency or mastery in a particular domain.
- Brown (2004) describes criterion-referenced interpretation as "the process of interpreting test scores by comparing them to a pre-determined standard of performance, rather than to the performance of a norming group."
- If an instructor decides an exam score of 90% out of 100% is the criterion or standard for a letter grade of A, all students scoring 90% or better get an A.
- If the highest class exam score is 80%, no one gets an A (Aviles, 2001).
- Hence, a test taker's score is interpreted with reference to the criterion score, rather than to the scores of other test-takers (Richards & Schmidt, 2010).

Examples:
- IELTS
- TOEFL
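The instructor's cut-score example above can be sketched as follows. Only the 90% criterion for an A comes from the source; the lower grade boundaries are illustrative assumptions.

```python
def letter_grade(percent,
                 cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Assign a grade by comparing the score to fixed criteria,
    not to the performance of other test takers.
    Boundaries below 90% are hypothetical."""
    for cutoff, grade in cutoffs:
        if percent >= cutoff:
            return grade
    return "F"

# Every student at or above the 90% criterion gets an A,
# regardless of how classmates performed:
print(letter_grade(93))          # A
# If the highest score in the class is 80%, no one gets an A:
print(letter_grade(80) == "A")   # False
```

This is the essential contrast with norm-referenced interpretation: the cutoffs are fixed in advance, so in principle every test taker (or none) can reach the top grade.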
Advantages:
- Provides information about an individual's mastery of specific skills or knowledge,
rather than just their relative standing.
- Allows for more meaningful interpretation of performance, as the focus is on what the
individual can do rather than how they compare to others.
- Can be more closely aligned with instructional objectives and curriculum standards.
- Facilitates the identification of strengths and weaknesses in specific areas.
Disadvantages:
- Does not provide information about an individual's performance relative to a larger
group.
- Can be more challenging to develop and validate the criteria or standards being used.
- May not be as useful for ranking or selecting individuals, as the focus is on meeting a
specific standard rather than comparative performance.
- Requires careful alignment between the test content and the criteria being used to
ensure validity
Rating scales
These are the ‘examiner-oriented’ scales
1- Holistic scale
A holistic scale awards a single score to represent the overall quality of performance
Example: A teacher gives a student a grade of "B" on his/her essay, based on the
overall quality, without looking at specific elements like grammar, organization, etc.
Advantages of Holistic Scales
- Simple and quick to use
- Provide an overall score that is easy to interpret
Disadvantages of Holistic Scales
- Focus on reliability over validity
- Score may not provide much diagnostic information
- Link between descriptor and performance is not clear
2- Analytical Scoring
- Scores are awarded based on different aspects of performance (e.g. number of errors, use of cohesive devices, etc.) instead of just assigning an overall score.
- Requires clear, explicit criteria linking evidence to claims about constructs
- Developed through analysis of language samples and expert judgment

Advantages of Analytical Scoring
- Transparency: criteria and scoring process are clearly defined
- Stronger validity claims, as evidence is systematically linked to constructs
- Diagnostic feedback for test-takers on specific aspects of language ability
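An analytic scale can be sketched as a set of weighted criterion ratings combined into a total, while the per-criterion ratings remain available as diagnostic feedback. The criteria, weights, and 0-5 rating range below are illustrative assumptions, not taken from the source.

```python
# Illustrative analytic scale: each criterion is rated 0-5, then weighted
# and summed. The criteria and weights are hypothetical examples.
WEIGHTS = {"grammar": 0.3, "vocabulary": 0.3, "organization": 0.2, "cohesion": 0.2}

def analytic_score(ratings):
    """Combine per-criterion ratings (0-5) into a weighted total on the same scale."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("a rating is required for every criterion")
    return sum(WEIGHTS[c] * r for c, r in ratings.items())

# One essay's ratings; the dict itself is the diagnostic profile:
essay = {"grammar": 4, "vocabulary": 3, "organization": 5, "cohesion": 4}
print(round(analytic_score(essay), 1))   # 3.9
```

Unlike a holistic "B", the profile shows the test taker exactly where the weaknesses lie (here, vocabulary), which is the diagnostic advantage listed above.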


3- Primary Trait Scoring
- Designed for evaluating specific traits or skills in a test-taker's performance
- Includes a description of the task, the primary trait to be measured, a rating scale, sample performances, and explanations
- Stronger link between evidence and claims about student ability

Advantages of Primary Trait Scoring
- Stronger validity claims through explicit connections to the construct
- Provides more detailed feedback on specific aspects of performance

Disadvantages of Primary Trait Scoring
- More complex and time-consuming to develop
- Less generalizability: scores only apply to the specific task
4- Multi-Trait Scoring
- Multiple scores are awarded for a single performance
- Each score represents a separate claim about the relationship between the evidence and multiple underlying constructs (e.g. grammar, vocabulary, organization, etc.)
- Can be general (like holistic) or task-specific (like primary trait)
- Developed through either expert committee judgment or analysis of language samples

Advantages of Multi-Trait Scoring
- Provides more detailed and diagnostic feedback for test-takers
- Allows for evaluation of multiple aspects of language ability
- Can be tailored to specific tasks or task types

Challenges with Multi-Trait and Analytical Scales
- More complex and time-consuming to develop and use compared to holistic scales
- Need to ensure rater training and consistency in application of criteria
- Potential trade-off between specificity and generalizability of scores
The Impact of Scoring on Reliability and Validity
Reliability
It refers to the consistency and dependability of the scoring process. For a language test
to be reliable, the scores assigned should be stable and reproducible, regardless of who is
scoring the test or when it is scored.
Aspects of reliability in language test scoring include:
- Inter-rater reliability: Different raters/scorers should assign similar scores for the same
test performance.
- Intra-rater reliability: The same rater should assign consistent scores when evaluating the same performance on different occasions.
- Test-retest reliability: Test takers should receive similar scores if they take the test again
after a period of time.
Factors that impact scoring reliability
- Clear scoring criteria and rubrics

- Rater training and monitoring

- Use of multiple raters

High reliability ensures the scores are dependable and that any differences in scores
reflect true differences in language ability, not inconsistencies in the scoring process.
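Inter-rater reliability can be estimated from two raters' scores of the same set of performances. The sketch below computes simple percent agreement and Cohen's kappa, a standard statistic that corrects agreement for chance; the two raters' scores are invented for illustration.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of performances to which both raters gave the same score."""
    same = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return same / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters scoring the same ten essays on a 1-5 band scale (invented data):
a = [3, 4, 4, 2, 5, 3, 3, 4, 2, 5]
b = [3, 4, 3, 2, 5, 3, 4, 4, 2, 5]
print(percent_agreement(a, b))           # 0.8
print(round(cohens_kappa(a, b), 2))      # 0.73
```

Kappa is lower than raw agreement because two raters using a five-band scale would agree on some essays purely by chance; monitoring such statistics is one way to act on the rater-training factor listed above.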
Validity
Validity is about whether the test is actually measuring what it claims to measure - the
intended language skills and proficiency. Validity is critical, as the test scores should
provide meaningful and accurate information about the test taker's language abilities.
Enhancing Scoring Validity
- Clear operational definitions of constructs
- Detailed scoring rubrics and guidelines
- Rater training on construct-relevant scoring
- Monitoring and adjusting scoring as needed
Automated Scoring
What is Automated Scoring?
- Automated scoring uses computer programs to grade and score student responses in language tests.
- This can help make language assessments more efficient and consistent.
Benefits of Automated Scoring
- Saves time and effort compared to having humans grade all the responses.
- Applies the same scoring rules to every response, ensuring fair and reliable results.
- Allows testing of more students without needing more human raters.
- Can provide immediate feedback to students on their performance.
Types of Automated Scoring
1. Selected-Response Items
- Multiple-choice, true/false, matching, etc.
- Easy for computers to quickly and accurately grade these objective questions.

2. Constructed-Response Items
- Open-ended responses like short answers or essays.
- More complex for computers to evaluate, but advances in technology are making this possible.
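Automated scoring of selected-response items amounts to comparing each response against an answer key, while even a minimal constructed-response scorer needs some normalization before matching. The sketch below illustrates both; the key, responses, and accepted-answer list are invented examples, and real constructed-response systems are far more sophisticated.

```python
def score_selected_response(responses, key):
    """Objective items: exact match against the answer key, one point each."""
    return sum(1 for r, k in zip(responses, key) if r == k)

def score_short_answer(response, accepted):
    """Naive constructed-response scoring: normalize the response, then
    match it against accepted answers. A deliberately simplified sketch."""
    normalized = response.strip().lower()
    return 1 if normalized in accepted else 0

# Invented answer key for four multiple-choice items:
key = ["a", "c", "b", "d"]
print(score_selected_response(["a", "c", "d", "d"], key))   # 3

# Invented accepted answers for one gap-fill item:
accepted = {"went", "had gone"}
print(score_short_answer("  Went ", accepted))              # 1
```

The gap between the two functions mirrors the challenge noted below: exact matching handles objective items perfectly, but a list of accepted strings cannot anticipate every unusual yet valid student response.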
Challenges and Considerations
- Making sure automated scoring accurately measures the language skills it's intended to test.
- Ensuring the computer programs can handle all types of student responses, even unusual
ones.
- Explaining clearly how the automated scoring works and any limitations it may have.
- Combining automated scoring with human review for certain responses.
References
Brown, H. D., & Abeywickrama, P. (2004). Language assessment: Principles and classroom practices. White Plains, NY: Pearson Education.

Coombe, C. (2018). An A to Z of second language assessment: How language teachers understand assessment concepts. London, UK: British Council.

Douglas, D. (2014). Understanding language testing. Routledge.

Fulcher, G. (2010). Practical language testing. Routledge.

Fulcher, G., & Davidson, F. (Eds.). (2012). The Routledge handbook of language testing. New York, NY: Routledge.

Hussain, S., Tadesse, T., & Sajid, S. (2015). Norm-referenced and criterion-referenced test in EFL classroom. International Journal of Humanities and Social Science Invention, 4(10), 24-30.