Application of the BLEU Method for Evaluating Free-Text Answers in an E-Learning Environment
Diana Pérez, Enrique Alfonseca and Pilar Rodríguez
Departamento de Ingeniería Informática, Universidad Autónoma de Madrid
28049 Madrid, Spain
{Diana.Perez,Enrique.Alfonseca,Pilar.Rodriguez}@ii.uam.es
Abstract
We have applied the BLEU method (Papineni et al., 2001), originally designed to evaluate automatic Machine Translation systems, to assess short essays written by students and give them a score. We study how much BLEU scores correlate with human scorings and with other keyword-based evaluation metrics. We conclude that, although it is only applicable to a restricted category of questions, BLEU attains better results than the other keyword-based procedures. Its simplicity and language-independence make it a good candidate to be combined with other well-studied computer assessment scoring procedures.
Table 1: Answer sets used in the evaluation. Columns indicate: set number; number of candidate texts (NC), mean length of the candidate texts (MC), number of reference texts (NR), mean length of the reference texts (MR), language (En, English; Sp, Spanish), question type (Def., definitions and descriptions; A/D, advantages and disadvantages; Y/N, Yes/No and justification of the answer), and a short description of the question.
procedure is very sensitive to the choice of the reference translations.

3. Experiment: BLEU in e-learning

3.1. Overview

Computer-Assisted Assessment can be considered very related to MT assessment, as:

• The student answer can be treated as the candidate translation whose accuracy we want to score.

• The reference translation made by humans is the reference answer written by the instructors.

3.2. Training and test material

We have built ten different benchmark data sets, described in Table 1. Beforehand, all of them were marked by hand by two different human judges, who also wrote several reference texts for each of the questions.

We can classify the questions into three distinct categories:

• Definitions and descriptions, e.g. What is an operating system?, Describe how to encapsulate a class in C++.

• Advantages and disadvantages, e.g. Enumerate the advantages and disadvantages of the token ring algorithm.

• Yes/No questions, e.g. Is RPC appropriate for a chat server? (Justify your answer).

3.3. Experiment performed

The algorithm has been evaluated, for each of the data sets, by comparing the N-grams from the student answers against the references, and obtaining the BLEU score for each candidate. The correlation value between the automatic scores and the human judges' scores has been taken as the indicator of the goodness of this procedure.

We have varied the following parameters of BLEU:

• The number of reference texts used in the evaluation process.

• The length (N) of the maximum n-gram for which to look for coincidences in the reference texts.

• The brevity penalty factor, to compare the correlation values between the original factor and our modified one.

3.4. Results

Number of reference texts. As said before, BLEU is very sensitive to the number and the quality of the reference texts available (Papineni et al., 2001). In our experiment, we have varied the number of references from 1 up to the maximum number available for each question (see Table 1). Table 2 shows the results for each of the data sets. As can be seen, in general, the results improved with the number of references, although in some cases the addition of a new reference having words in common with wrong answers decreased the accuracy of the marks.

Results for different N-grams. We have varied the maximum size of the N-grams taken into consideration from 1 to 4. As Table 3 shows, the best results have been obtained with a combination of unigrams, bigrams and trigrams. If we only compare unigrams, or unigrams and bigrams, the system loses information about the collocations and performance decays. On the other hand, given that the students' answers are completely unrestricted, in most of the cases they do not have four consecutive words in common with a reference text, and therefore the BLEU score drops to zero.

Brevity penalty. The BLEU procedure is basically a precision score, as it calculates the percentage of N-grams from the candidate answer which appear in any reference, but it does not check the recall of the answer. Therefore, it applies a brevity penalty so as to penalise very short answers which do not convey the complete information. In its original form (Papineni et al., 2001), it compares the length of the candidate answer against the length of the reference with the most similar length. We have compared it against the following brevity penalty:

1. For each reference text, calculate the percentage of its words covered in the candidate.
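Under the mapping above (student answer as candidate translation, instructor answers as reference translations), the core of the scoring procedure can be sketched in a few lines. The following is a minimal, illustrative implementation, not the authors' exact code: it assumes whitespace tokenisation, clipped n-gram precision for n = 1 to 3 (the best-performing setting in the experiments), and the original length-based brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=3):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times the original brevity penalty."""
    cand = candidate.lower().split()              # tokenisation only
    refs = [r.lower().split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:                       # answer shorter than n words
            return 0.0
        # Clip each n-gram's count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:                          # no n-gram in common: score is zero
            return 0.0
        precisions.append(clipped / sum(cand_counts.values()))
    # Original brevity penalty: compare the candidate's length against the
    # reference whose length is the most similar
    closest = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A verbatim answer scores 1.0; an answer sharing no words scores 0.0
print(bleu("a stack is a lifo data structure",
           ["a stack is a lifo data structure"]))
```

The function names and the whitespace tokenisation are assumptions made for this sketch; Papineni et al. (2001) remains the authority for details such as the corpus-level formulation of the metric.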
Table 2: Correlation between the automatic and the human scores for each answer set, as the number of reference texts grows (blank cells: no further references available for that set).

Set      No. of reference texts
            1       2       3       4       5       6       7       8
 1      0.3866  0.5738  0.5843  0.5886
 2      0.2996  0.3459  0.3609
 3      0.3777  0.1667  0.1750  0.3693
 4      0.3914  0.3685  0.5731  0.8220
 5      0.3430  0.3634  0.3383  0.3909  0.3986  0.4030  0.4159
 6      0.0427  0.0245  0.0257  0.0685  0.0834  0.0205  0.0014  0.0209
 7      0.1256  0.1512  0.1876  0.1982  0.2102
 8      0.3615  0.41536 0.4172
 9      0.6909  0.7949  0.7358
10      0.7174  0.8006  0.7508
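The figures in Table 2 are correlation values between the automatic scores and the human marks for one answer set. Given the two paired score lists, a plain Pearson correlation suffices to produce this kind of value; the scores below are hypothetical, made up for illustration, and are not taken from the paper's data sets.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical BLEU scores and human marks for five student answers
bleu_scores = [0.10, 0.35, 0.50, 0.70, 0.90]
human_marks = [0.20, 0.40, 0.45, 0.80, 0.95]
print(round(pearson(bleu_scores, human_marks), 4))  # close to 1: strong agreement
```

A value near 1 would indicate that the automatic marker ranks answers much as the human judges do; the values in Table 2 show this agreement varies widely across question types.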
Table 3: Scores of BLEU for different choices of the N-gram lengths used in the comparison, from only unigrams (first column) to N-grams with lengths from 1 to 4 (last column).

Table 5: Comparison of BLEU with two other keyword-based methods. Because of the kind of evaluation performed, the sets with very few answers could not be evaluated with VSM.

Comparison with other methods. We have implemented two other scoring algorithms as baseline:

• The yes/no questions have the same problem, because the explanation might be the same regardless of the boolean answer. Thus, the correlation for set 7 was 0.275. In the case of question 9, the correlation is very high because there are just a few answers and most of them are correct.

• Finally, the definitions and descriptions, which are usually very narrow questions, have produced better results, with correlations between 0.36 and 0.76.

Comparison with other algorithms. As can be seen, BLEU clearly outperforms the other keyword-based algorithms. Although it is not directly comparable to VSM, given that the evaluation procedure is different, the results hint that it has given better results. In the case of the exam answers, the benchmark data used are complicated for VSM, as most of the answers are short (30-50 words), and most of the keywords valid in the correct answers were also used in the wrong ones.

4. Conclusions and future work

We have described here an application of the BLEU algorithm for evaluating student answers with a shallow procedure. The main advantages of this approach are that:

• It is very simple to program (just a few hours).

• It is language-independent, as the only processing done to the text is tokenisation.

• It can be integrated with other techniques and resources, such as thesauri, deep parsers, etc., or it could be used as a substitute for other keyword-based procedures in more complex systems.

On the other hand, as has sometimes been noted, BLEU is very dependent on the choice of the reference texts, which leaves a high responsibility with the professors, who have to write them. Secondly, it is not suitable for all kinds of questions, such as those where the order of the sentences is important or, as we have seen, for an enumeration of advantages and disadvantages. Further processing would be necessary for scoring these questions.

Therefore, we conclude that the current version of the system could be effectively used in an e-learning environment as a help to teachers who want to double-check the scores they are giving, and to students who want or need more practice than they receive in the classroom. Nevertheless, we do not recommend the use of this version as a stand-alone application.

The following are some ideas for future work:

• Automatise the production of the reference texts.

• Perform the evaluation against yet more existing systems.

• Explore how to extend the procedure with other linguistic processing modules, such as parsing or treatment of synonyms.

5. Acknowledgements

This work has been partially sponsored by CICYT, project number TIC2001-0685-C02-01. We gratefully acknowledge A. Ortigosa, P. Rodríguez and M. Alfonseca for letting us use their exams and scores to perform the evaluations. We also wish to thank R. Carro for the time she has dedicated to discussing with us the validity of this method for computer-based assessment evaluation.

6. References

J. Burstein, C. Leacock, and R. Swartz. 2001. Automated evaluation of essay and short answers. In M. Danson (Ed.), Proceedings of the Sixth International Computer Assisted Assessment Conference, Loughborough, UK.

D. Callear, J. Jerrams-Smith, and V. Soh. 2001. CAA of short non-MCQ answers. In Proceedings of the 5th International CAA Conference, Loughborough, UK.

P. W. Foltz, D. Laham, and T. K. Landauer. 1999. Automated essay scoring: Applications to educational technology. In Proceedings of EdMedia'99.

D. Kanejiya and S. Prasad. 2003. Automatic evaluation of students' answers using syntactically enhanced LSA. In Proceedings of the HLT-NAACL 2003 Workshop on Building Educational Applications Using Natural Language Processing, pages 53-60.

C. Y. Lin and E. H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada.

Y. Ming, A. Mikhailov, and T. Lay Kuan. 1999. Intelligent essay marking system. In Educational Technology Conference.

T. Mitchell, T. Russell, P. Broomhead, and N. Aldridge. 2002. Towards robust computerised marking of free-text responses. In Proceedings of the 6th International Computer Assisted Assessment (CAA) Conference, Loughborough, UK.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation.

C. P. Rosé, A. Roque, D. Bhembe, and K. VanLehn. 2003. A hybrid text classification approach for analysis of student essays. In Building Educational Applications Using Natural Language Processing, pages 68-75.

S. Valenti, F. Neri, and A. Cucchiarelli. 2003. An overview of current research on automated essay grading. Journal of Information Technology Education, 2:319-330.

D. Whittington and H. Hunt. 1999. Approaches to the computerized assessment of free text responses. In M. Danson (Ed.), Proceedings of the Sixth International Computer Assisted Assessment Conference, Loughborough, UK.

P. Wiemer-Hastings and I. Zipitria. 2001. Rules for syntax, vectors for semantics. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, Mahwah, NJ: Erlbaum.