The Remaining Items From
Seminar 1
BLEU (BiLingual Evaluation Understudy) score: This algorithm aims to evaluate the
quality of text that has been machine translated. The central idea behind BLEU is that "the closer
a machine translation is to a professional human translation, the better it is." To assess this,
scores are calculated for individual translated segments (generally sentences) by comparing them
with a set of good-quality reference translations. Those scores are then averaged over the whole
corpus to reach an estimate of the translation's overall quality. Even though BLEU has become a
standard in the industry, it has its limitations. Intelligibility and grammatical correctness, for
instance, are not taken into account explicitly, as they are assumed to be reflected in the
reference translations.
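The clipped n-gram precision and brevity penalty behind BLEU can be sketched in Python. This is a minimal single-reference version without the smoothing that real implementations apply; the function names are illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU sketch: geometric mean of clipped
    n-gram precisions, multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clipped counts: an n-gram may only match as often as it
        # appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean
```

A candidate identical to the reference scores 1.0; a candidate sharing no words scores 0.0.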
NIST: The name of this metric comes from the US National Institute of Standards and
Technology. This measure is based on the BLEU score, but it differs from the BLEU algorithm in
several ways.
While BLEU simply counts how many n-grams match between the reference translation and
the MT output and gives all n-grams the same weight, NIST also calculates how
"informative" a particular n-gram is. When a correct n-gram is found, the algorithm measures
whether that combination is a common sequence in the corpus material or whether that
fragment is relatively rare in the data. Depending on the result, an n-gram is given more
or less weight. To give an example, if the bigram "on the" is correctly matched, it will receive a
lower weight than a correct match of the bigram "interesting calculations," as the latter is less
likely to occur.
NIST also differs from BLEU in terms of how some penalties are calculated. For example, small
variations in translation length do not impact the overall NIST score as much as in BLEU.
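The information weighting described above can be sketched as follows. This is a simplified version of NIST's weight formula, Info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)), where rarer n-grams receive higher weight; the corpus and function names are illustrative:

```python
import math
from collections import Counter

def info_weights(corpus_tokens, max_n=2):
    """Return an info(ngram) function in the style of NIST:
    rare n-grams get higher weight than common ones."""
    counts = {n: Counter(tuple(corpus_tokens[i:i + n])
                         for i in range(len(corpus_tokens) - n + 1))
              for n in range(1, max_n + 1)}
    total = len(corpus_tokens)

    def info(ngram):
        n = len(ngram)
        # For unigrams the "parent" count is the total token count;
        # otherwise it is the count of the (n-1)-gram prefix.
        parent = total if n == 1 else counts[n - 1][ngram[:-1]]
        return math.log2(parent / counts[n][ngram])

    return info
```

In a corpus where "a b" occurs twice but "a c" only once, info(("a", "c")) comes out higher than info(("a", "b")), mirroring the "on the" versus "interesting calculations" example above.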
METEOR (Metric for Evaluation of Translation with Explicit ORdering): This metric was
designed to address some of the problems found in the more popular BLEU metric, and also
produces a good correlation with human judgment at the sentence or segment level (this differs
from the BLEU metric in that BLEU seeks correlation at the corpus level). With this system,
several features that had not been part of any other metric at the time were introduced.
Matches in METEOR are made by following the parameters below, among others:
Exact words: As with other metrics, a match is made if two words are identical in the
machine translation output and the reference translation.
Stem: Words are reduced to their stem form. If two words have the same stem, a match
is also made.
Synonymy: Words are matched if they are synonyms of one another. Words are
considered synonymous if they share any synonym sets according to an external
database.
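The staged matching above can be sketched as follows. This is a simplified version: real METEOR uses a Porter stemmer and WordNet synonym sets, which are passed in here as arbitrary stand-ins:

```python
def meteor_matches(hyp, ref, stem, synonyms):
    """Staged METEOR-style matching sketch. `stem` is any stemming
    function and `synonyms` maps a word to a set of synset ids
    (both stand-ins for the Porter stemmer and WordNet)."""
    used_h, used_r, matches = set(), set(), []
    stages = [
        lambda a, b: a == b,                        # exact words
        lambda a, b: stem(a) == stem(b),            # same stem
        lambda a, b: bool(synonyms.get(a, set())
                          & synonyms.get(b, set())),  # shared synset
    ]
    for same in stages:  # earlier stages claim words first
        for i, h in enumerate(hyp):
            if i in used_h:
                continue
            for j, r in enumerate(ref):
                if j not in used_r and same(h, r):
                    matches.append((i, j))
                    used_h.add(i)
                    used_r.add(j)
                    break
    return matches
```

For example, "the dogs ran quickly" against "the dog sprinted fast" matches all four positions: "the" exactly, "dogs"/"dog" by stem, and "ran"/"sprinted" and "quickly"/"fast" by synonymy.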
Levenshtein distance: This metric counts the minimum number of single-character edits
(insertions, deletions, or substitutions) required to turn one string into another. The
Levenshtein distance between "sport" and "short" is 1, because one edit is required to
convert one word into the other (replace "p" with "h"). The Levenshtein distance between "dog"
and "frog" is 2, as it is not possible to convert the first word into the second with fewer edits
(replace "d" with "f" and add "r").
The Levenshtein distance is bounded above by the length of the longer input string: even if two
words have nothing in common, the minimum number of edits will never exceed the number of
characters in the longer word.
Example: For "computer" and "alibi", the Levenshtein distance is 8, the maximum possible:
Delete "t"
Delete "e"
Delete "r"
Replace "c" with "a"
Replace "o" with "l"
Replace "m" with "i"
Replace "p" with "b"
Replace "u" with "i"
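The distance itself is computed with a standard dynamic-programming algorithm, sketched here with a rolling one-row table:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]
```

This reproduces the examples above: levenshtein("sport", "short") is 1, levenshtein("dog", "frog") is 2, and levenshtein("computer", "alibi") is 8.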
As with other automated measures, the results of the Levenshtein distance are not set in stone.
As mentioned before, there can be many correct translations for a single source. The
Levenshtein distance will not be able to measure quality on its own. Results will vary, for
example, if clauses are positioned differently in the MT output and in the human reference
translation.
Example
MT: "If I go home after 10 pm, I will let you know."
Reference human translation: "I will let you know if I go home after 10 pm."
In this case, the MT output is correct and no changes would be necessary during a post-editing
stage. However, the Levenshtein distance will be quite high, as many changes would be required
to turn the first sentence into the second one.
TER: This is a word-based metric that calculates the minimum number of edits required to
match an MT output to a correct reference translation, normalized by the length of the reference.
TER = # of edits / # of reference words
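Setting aside TER's block-shift edits, the formula above can be sketched as a word-level edit distance normalized by the reference length:

```python
def ter(hyp_tokens, ref_tokens):
    """Word-level TER sketch: edit distance over tokens, divided by
    the reference length. Real TER also allows block "shift" edits,
    which this simplified version omits."""
    n = len(ref_tokens)
    prev = list(range(n + 1))
    for i in range(1, len(hyp_tokens) + 1):
        curr = [i]
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[n] / n  # normalize by reference length
```

An output identical to the reference scores 0.0; one substituted word in a five-word reference scores 0.2. Note that because shifts are not modeled here, reordered but correct clauses are penalized, just as in the Levenshtein example above.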
TERP (TER-Plus) is an extension of Translation Edit Rate (TER) and builds on the success of
TER as an evaluation metric and alignment tool. At the same time, it addresses several of TER's
weaknesses through the use of paraphrases, morphological stemming and synonyms, as well as
edit costs that are optimized to correlate more closely with various types of human judgments.
Put simply, TERP measures the number of edits that are necessary to go from the raw MT
output to a final edited version. As such, it is a helpful metric to measure typing and editing
effort. The TERP score ranges from 0 to 100; the higher the number, the more editing was
required.