Machine Translation Evaluation: A Survey

Aaron Li-Feng Han
ILLC, Faculty of Science
University of Amsterdam
l.han@uva.nl

Derek F. Wong, Lidia S. Chao
NLP2CT Lab, Faculty of Science and Technology
University of Macau
{derekfw, lidiasc}@umac.mo

Abstract

This paper presents a survey of the state of the art in machine translation (MT) evaluation, covering both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria, etc. We classify the automatic evaluation methods into two categories: lexical similarity and linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features. The syntactic features include part-of-speech tags, phrase types and sentence structures, and the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Subsequently, we also introduce the methods for evaluating the MT evaluation itself, including different correlation scores, and the recent quality estimation (QE) tasks for MT. This paper differs from existing works (Dorr et al., 2009; EuroMatrix, 2007) in several aspects: it introduces some recent developments of MT evaluation measures, the different classifications from manual to automatic evaluation measures, the recent QE tasks of MT, and a concise construction of the content.1

1 Part of this work was done when working in the NLP2CT lab in Macau.

1 Introduction

Machine translation (MT) began as early as the 1950s (Weaver, 1955), and has developed rapidly since the 1990s (Mariño et al., 2006) due to the growth in storage and computing power of computers and the widely available multilingual and bilingual corpora. There are many important works in the MT area; to mention some in chronological order: the IBM Watson research group (Brown et al., 1993) designed five statistical MT models and ways to estimate the model parameters given bilingual translation corpora; (Koehn et al., 2003) proposed the statistical phrase-based MT model; Och (Och, 2003) presented Minimum Error Rate Training (MERT) for log-linear statistical machine translation models; (Koehn and Monz, 2005) introduced a shared task of building statistical machine translation (SMT) systems for four European language pairs; (Chiang, 2005) proposed a hierarchical phrase-based SMT model that is learned from a bitext without syntactic information; (Menezes et al., 2006) introduced a syntactically informed phrasal SMT system for English-to-Spanish translation using a phrase translation model based on global reordering and dependency trees; (Koehn et al., 2007b) developed the open-source SMT toolkit Moses; (Hwang et al., 2007) utilized shallow linguistic knowledge to improve word alignment and language model quality between linguistically different languages; (Fraser and Marcu, 2007) discussed the relationship between word alignment and the quality of machine translation; (Sánchez-Martínez and Forcada, 2009) described an unsupervised method for the automatic inference of structural transfer rules for a shallow-transfer machine translation system; (Khalilov and Fonollosa, 2011) designed an effective syntax-based reordering approach to address
the word ordering problem. Neural MT (NMT) is a recently active topic that conducts the automatic translation workflow very differently from the traditional phrase-based SMT methods. Instead of training the different MT components separately, an NMT model utilizes an artificial neural network (ANN) to learn the model jointly and maximize the translation performance through a two-step recurrent neural network (RNN) encoder and decoder (Cho et al., 2014; Bahdanau et al., 2014; Wolk and Marasek, 2015). There are far more representative MT works than we have listed here.

Due to the wide-spread development of MT systems, MT evaluation has become more and more important to tell us how well the MT systems perform and whether they make progress. However, MT evaluation is difficult because natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003).

There are several events that have promoted the development of MT and MT evaluation research. One of them was the NIST Open Machine Translation Evaluation series (OpenMT), a prestigious set of evaluation campaigns that ran from 2001 to 2009 (Group, 2010).

The innovation of MT and of its evaluation methods is also promoted by the annual Workshop on Statistical Machine Translation (WMT) (Koehn and Monz, 2006; Callison-Burch et al., 2007; Callison-Burch et al., 2008; Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011; Callison-Burch et al., 2012; Bojar et al., 2013; Bojar et al., 2014; Bojar et al., 2015), organized by the special interest group in machine translation (SIGMT) since 2006. The evaluation campaigns focus on European languages. There are roughly two tracks in the annual WMT workshop: the translation task and the evaluation task. The tested language pairs are clearly divided into two directions, i.e., English-to-other and other-to-English, covering French, German, Spanish, Czech, Hungarian, Haitian Creole and Russian.

Another promotion is the International Workshop on Spoken Language Translation (IWSLT), which has been organized annually since 2004 (Eck and Hori, 2005; Paul, 2009; Paul et al., 2010; Federico et al., 2011). This campaign has a stronger focus on speech translation, including English and Asian languages, e.g. Chinese, Japanese and Korean.

Better evaluation metrics will surely be helpful to the development of better MT systems (Liu et al., 2011). Due to all the above efforts, MT evaluation research has developed rapidly.

This paper is constructed as follows: Sections 2 and 3 discuss the human assessment methods and automatic evaluation methods respectively, Section 4 introduces methods for evaluating the MT evaluation itself, Section 5 covers advanced quality estimation, Section 6 presents the discussion and related works, and the perspective is given in Section 7.

2 Human Evaluation Methods

2.1 Traditional Human Assessment

2.1.1 Intelligibility and Fidelity

The earliest human assessment methods for MT can be traced back to around 1966. They include intelligibility and fidelity, used by the Automatic Language Processing Advisory Committee (ALPAC) (Carroll, 1966). The requirement that a translation be intelligible means that, as far as possible, the translation should read like normal, well-edited prose and be readily understandable in the same way that such a sentence would be understandable if originally composed in the translation language. The requirement that a translation be of high fidelity or accuracy means that the translation should, as little as possible, twist, distort, or controvert the meaning intended by the original.

2.1.2 Fluency, Adequacy and Comprehension

In the 1990s, the Advanced Research Projects Agency (ARPA) created a methodology to evaluate machine translation systems using adequacy, fluency and comprehension (Church and Hovy, 1991) in MT evaluation campaigns (White et al., 1994).

Comprehension = #Correct / 6,    (1)

Fluency = ((Judgment point − 1) / (S − 1)) / #Sentences in passage,    (2)

Adequacy = ((Judgment point − 1) / (S − 1)) / #Fragments in passage.    (3)
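The formulas above normalize raw judgment points on an S-point scale into [0, 1] and average them over a passage (reading the numerators as summed over the judged units). A minimal illustrative sketch in Python follows; the function names and the toy judgment values are our own and are not part of the original ARPA tooling.

def normalized_judgment(judgment_point, scale_size):
    """Map a 1..S judgment point onto [0, 1] as (point - 1) / (S - 1)."""
    return (judgment_point - 1) / (scale_size - 1)

def fluency_score(sentence_judgments, scale_size=5):
    """Average normalized per-sentence fluency judgments over a passage (Eq. 2)."""
    return sum(normalized_judgment(j, scale_size)
               for j in sentence_judgments) / len(sentence_judgments)

def adequacy_score(fragment_judgments, scale_size=5):
    """Average normalized per-fragment adequacy judgments over a passage (Eq. 3)."""
    return sum(normalized_judgment(j, scale_size)
               for j in fragment_judgments) / len(fragment_judgments)

def comprehension_score(num_correct_answers, num_questions=6):
    """Fraction of comprehension questions answered correctly (Eq. 1)."""
    return num_correct_answers / num_questions

if __name__ == "__main__":
    print(fluency_score([4, 5, 3, 4]))      # passage judged sentence by sentence
    print(adequacy_score([5, 4, 4, 3, 5]))  # passage judged fragment by fragment
    print(comprehension_score(4))           # 4 of 6 questions answered correctly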
The evaluator is asked to look at each fragment, delimited by syntactic constituents and containing sufficient information, and to judge its adequacy on a scale of 1 to 5. The results are computed by averaging the judgments over all of the decisions in the translation set.

The fluency evaluation is conducted in the same manner as the adequacy evaluation, except that the evaluator makes intuitive judgments on a sentence-by-sentence basis for each translation. The evaluators are asked to determine whether the translation is good English without reference to the correct translation. The fluency evaluation determines whether the sentence is well-formed and fluent in context.

The modified comprehension developed into "informativeness", whose objective is to measure a system's ability to produce a translation that conveys sufficient information, such that people can gain the necessary information from it. Developed from the reference set of expert translations, six questions each have six possible answers, including "none of the above" and "cannot be determined".

2.1.3 Further Development

(Bangalore et al., 2000) conduct research that develops accuracy into several kinds, including simple string accuracy, generation string accuracy, and two corresponding tree-based accuracies. Reeder (2004) shows the correlation between fluency and the number of words it takes to distinguish between human translation and machine translation.

The Linguistic Data Consortium (LDC) developed two five-point scales representing fluency and adequacy for the annual NIST machine translation evaluation workshop. The developed scales became the widely used methodology when manually evaluating MT by assigning values. The five-point scale for adequacy indicates how much of the meaning expressed in the reference translation is also expressed in a hypothesis translation; the second five-point scale indicates how fluent the translation is, involving both grammatical correctness and idiomatic word choices.

(Specia et al., 2011) conduct a study of MT adequacy and divide it into four levels, from score 4 to score 1: highly adequate, the translation faithfully conveys the content of the input sentence; fairly adequate, the translation generally conveys the meaning of the input sentence but there are some problems with word order or tense/voice/number, or there are repeated, added or untranslated words; poorly adequate, the content of the input sentence is not adequately conveyed by the translation; and completely inadequate, the content of the input sentence is not conveyed at all by the translation.

2.2 Advanced Human Assessment

2.2.1 Task-oriented

(White and Taylor, 1998) develop a task-oriented evaluation methodology for Japanese-to-English translation to measure MT systems in light of the tasks for which their output might be used. They seek to associate the diagnostic scores assigned to the output used in the DARPA evaluation with a scale of language-dependent tasks, such as scanning, sorting, and topic identification. They develop the MT proficiency metric with a corpus of multiple variants which are usable as a set of controlled samples for user judgments. The principal steps include identifying the user-performed text-handling tasks, discovering the order of text-handling task tolerance, analyzing the linguistic and non-linguistic translation problems in the corpus used in determining task tolerance, and developing a set of source language patterns which correspond to diagnostic target phenomena. A brief introduction to this task-based MT evaluation work appears in their later work (Doyon et al., 1999).

Voss and Tate (Voss and Tate, 2006) introduced task-based MT output evaluation through the extraction of who, when, and where type elements. They later extended the work to event understanding in (Laoudi et al., 2006).

2.2.2 Extended Criteria

(King et al., 2003) extend a large range of manual evaluation methods for MT systems which, in addition to the accuracy discussed above, include suitability, whether even accurate results are suitable in the particular context in which the system is to be used; interoperability, whether the system works with other software or hardware platforms; reliability, i.e., the system does not break down all the time or take a long time to get running again after breaking down; usability, the interfaces are easy to access, easy to learn and operate, and look pretty; efficiency, when needed, the system keeps up with the flow of documents to be handled; maintainability, being able to modify the system in order to adapt it to particular users; and portability, one version of a system can be replaced by a new
version, because MT systems are rarely static and they tend to be improved over time as resources grow and bugs are fixed.

2.2.3 Utilizing Post-editing

One measure of quality is to compare a translation produced from scratch with the post-edited result of an automatic translation. This type of evaluation is, however, time consuming and depends on the skills of the translator and post-editor. One example of a metric that is designed in such a manner is the human translation error rate (HTER) (Snover et al., 2006), based on the number of editing steps computed between an automatic translation and a reference translation. Here, a human annotator has to find the minimum number of insertions, deletions, substitutions, and shifts needed to convert the system output into an acceptable translation. HTER is defined as the number of editing steps divided by the number of words in the acceptable translation.

2.2.4 Segment Ranking

In the WMT metrics task, human assessment based on segment ranking is usually employed. Judges are frequently asked to provide a complete ranking over all the candidate translations of the same source segment (Callison-Burch et al., 2011; Callison-Burch et al., 2012). In the recent WMT tasks (Bojar et al., 2013), five systems are randomly selected for the judges to rank. Each time, the source segment and the reference translation are presented to the judges together with the candidate translations of the five systems. The judges rank the systems from 1 to 5, allowing tied scores. Each ranking can provide as many as 10 pairwise results if there are no ties. The collected pairwise rankings can be used to assign a score to each participating system that reflects the quality of its automatic translations. The assigned score can also be used to reflect how frequently a system is judged to be better or worse than other systems when they are compared on the same source segment, according to the following formula:

score = #better pairwise rankings / (#total pairwise comparisons − #tie comparisons).    (4)

3 Automatic Evaluation Metrics

Manual evaluation suffers from some disadvantages: it is time-consuming, expensive, not tunable, and not reproducible. Due to these weaknesses of human judgments, automatic evaluation metrics have been widely used for machine translation. Typically, they compare the output of machine translation systems against human translations, but there are also some metrics that do not use a reference translation. There are usually two ways to offer the human reference translation: either offering one single reference or offering multiple references for a single source sentence (Lin and Och, 2004; Han et al., 2012). Common metrics measure the overlap in words and word sequences, as well as word order and edit distance. We classify this kind of metrics into the "Lexical Similarity" category. Advanced metrics also take linguistic features into account, such as syntax and semantics, e.g. POS, sentence structure, textual entailment, paraphrase, synonyms, named entities, semantic roles and language models. We classify the metrics that utilize such linguistic features into the "Linguistic Features" category. As the following sub-sections show, some of the metrics introduced in Section 3.1 also use certain linguistic features.

3.1 Lexical Similarity

3.1.1 Edit Distance

By calculating the minimum number of editing steps needed to transform the output into the reference, (Su et al., 1992) introduce the word error rate (WER) metric into MT evaluation. This metric takes word order into account, and the operations include insertion (adding a word), deletion (dropping a word) and replacement (or substitution, replacing one word with another); WER is based on the minimum number of editing steps needed to match the two sequences:

WER = (substitution + insertion + deletion) / reference length.    (5)

One of the weak points of WER is the fact that word ordering is not taken into account appropriately. WER scores very low when the word order of the system output is "wrong" according to the reference. In the Levenshtein distance, mismatches in word order require the deletion and re-insertion of the misplaced words. However, due to the diversity of language expression, some sentences with a so-called "wrong" order according to WER also prove to be good translations. To address this problem, the position-independent word error rate (PER) (Tillmann et al., 1997) is designed to ignore word order when matching output and reference. Without taking word order into account, PER counts the number of times that identical words appear in both sentences. Depending on whether the translated sentence is longer or shorter than the reference translation, the remaining words are counted either as insertions or as deletions:

PER = 1 − (correct − max(0, output length − reference length)) / reference length.    (6)
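To make the two edit-distance measures concrete, here is a small, self-contained Python sketch. The Levenshtein routine and the bag-of-words counting follow Equations (5) and (6) directly; the function names are our own and the snippet is illustrative rather than a reference implementation of any particular toolkit.

from collections import Counter

def levenshtein_edits(hyp, ref):
    """Minimum number of substitutions + insertions + deletions turning hyp into ref."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def wer(hyp, ref):
    """Word error rate, Eq. (5): edit operations over reference length."""
    return levenshtein_edits(hyp, ref) / len(ref)

def per(hyp, ref):
    """Position-independent error rate, Eq. (6): ignores word order."""
    correct = sum((Counter(hyp) & Counter(ref)).values())  # bag-of-words matches
    return 1 - (correct - max(0, len(hyp) - len(ref))) / len(ref)

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(wer(hyp, ref), per(hyp, ref))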
Another way to overcome the unreasonable penalty on word order in the Levenshtein distance is to add a novel editing step that allows the movement of word sequences from one part of the output to another. This is something a human post-editor would do with the cut-and-paste function of a word processor. In this light, (Snover et al., 2006) design the translation edit rate (TER) metric that adds block movement (jumping action) as an editing step. The shift operation applies to a contiguous sequence of words within the output sentence. The TER score is calculated as:

TER = #edits / average #reference words.    (7)

For the edits, the cost of a block movement, for any number of contiguous words and any distance, is equal to that of a single-word operation, such as insertion, deletion and substitution.

3.1.2 Precision and Recall

The widely used evaluation metric BLEU (Papineni et al., 2002) is based on the degree of n-gram overlap between the strings of words produced by the machine and the human translation references at the corpus level. BLEU computes the precision for n-grams of size 1 to 4, combined with a brevity penalty (BP) coefficient:

BLEU = BP × exp(Σ_{n=1}^{N} λ_n log Precision_n),    (8)

BP = 1 if c > r;  e^{1 − r/c} if c ≤ r,    (9)

where c is the total length of the candidate translation corpus, and r refers to the sum of the effective reference sentence lengths in the corpus. If there are multiple references for each candidate sentence, then the reference whose length is nearest to the candidate sentence is selected as the effective one. In the BLEU metric, the n-gram precision weight λ_n is usually set to a uniform weight. However, the 4-gram precision value is often very low or even zero when the test corpus is small. To weight more heavily those n-grams that are more informative, (Doddington, 2002) proposes the NIST metric with an added information weight:

Info = log_2(#occurrences of w_1, ..., w_{n−1} / #occurrences of w_1, ..., w_n).    (10)

Furthermore, NIST replaces the geometric mean of co-occurrences with the arithmetic average of n-gram counts, extends the n-grams up to 5-grams (N = 5), and selects the average length of the reference translations instead of the nearest length.

ROUGE (Lin and Hovy, 2003) is a recall-oriented automated evaluation metric initially developed for summaries. Following the adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, Lin conducted a study of a similar idea for evaluating summaries. ROUGE was later also applied to automatic machine translation evaluation (Lin and Och, 2004).

(Turian et al., 2006) conducted experiments to examine how standard measures such as precision, recall and F-measure can be applied to the evaluation of MT, and compared these standard measures with some existing alternative evaluation measures. F-measure is the combination of precision (P) and recall (R), first employed in information retrieval and later adopted by information extraction, MT evaluation and other tasks:

F_β = (1 + β²) PR / (R + β²P).    (11)
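The following sketch illustrates the quantities behind Equations (8), (9) and (11): clipped n-gram precision, the brevity penalty, and the F_β combination of precision and recall. It is a simplified single-reference toy version with uniform weights (as assumed here), not the official BLEU implementation, which also handles multiple references and smoothing.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of the hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / max(1, sum(hyp_counts.values()))

def bleu(hyp, ref, max_n=4):
    """Toy sentence-level BLEU with uniform weights and brevity penalty, Eqs. (8)-(9)."""
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # log(0) undefined; real BLEU smooths instead
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def f_beta(precision, recall, beta=1.0):
    """F-measure, Eq. (11)."""
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

hyp = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(bleu(hyp, ref), f_beta(0.8, 0.6))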
(Banerjee and Lavie, 2005) design the evaluation metric METEOR. METEOR is based on the general concept of flexible unigram matching, unigram precision and unigram recall, including matches between words that are simple morphological variants of each other (sharing the same stem) and between words that are synonyms of each other. To measure how well ordered the matched words in the candidate translation are in relation to the human reference, METEOR introduces a penalty coefficient based on the number of matched chunks:

Penalty = 0.5 × (#chunks / #matched unigrams)³,    (12)

METEOR = (10PR / (R + 9P)) × (1 − Penalty).    (13)
3.1.3 Word Order

The right word order plays an important role in ensuring a high-quality translation output. However, language diversity also allows different appearances or structures of a sentence. How to successfully penalize genuinely wrong word order (a wrongly structured sentence) rather than a "correctly" different order, where the candidate sentence has a different word order from the reference but is well structured, attracts a lot of interest from researchers in the NLP literature. In fact, the Levenshtein distance and the n-gram based measures also contain word order information.

Featuring the explicit assessment of word order and word choice, (Wong and Kit, 2009) develop the evaluation metric ATEC, assessment of text essential characteristics. It is also based on precision and recall criteria, but with a designed position-difference penalty coefficient attached. Word choice is assessed by matching word forms at various linguistic levels, including surface form, stem, sound and sense, and further by weighting the informativeness of each word. Combining precision, order, and recall information together, (Chen et al., 2012) develop the automatic evaluation metric PORT, initially intended for tuning MT systems to output higher-quality translations. Another evaluation metric, LEPOR (Han et al., 2012; Han et al., 2014), is proposed as a combination of many evaluation factors, including an n-gram based word order penalty in addition to precision, recall, and a sentence-length penalty. The LEPOR metric yields excellent performance on the English-to-other (Spanish, German, French, Czech and Russian) language pairs in the ACL-WMT13 metrics shared task at system-level evaluation (Han et al., 2013b).

3.2 Linguistic Features

Although some of the previously mentioned metrics take linguistic information into consideration, e.g. the semantic information of synonyms and stemming in METEOR, the lexical similarity methods mainly focus on exact matches of the surface words in the output translation. The advantages of metrics based on lexical similarity are that they perform well in capturing translation fluency (Lo et al., 2012), and they are very fast and low cost. On the other hand, there are also some weaknesses: for instance, syntactic information is rarely considered, and the underlying assumption that a good translation is one that shares the same lexical choices as the reference translations is not justified semantically. Lexical similarity does not adequately reflect similarity in meaning. A translation evaluation metric that reflects meaning similarity needs to be based on similarity of semantic structure, not merely flat lexical similarity.

3.2.1 Syntactic Similarity

Syntactic similarity methods usually employ features such as morphological part-of-speech information, phrase categories or sentence structure generated by linguistic tools such as a language parser or chunker.

In grammar, a part of speech (POS) is a linguistic category of words or lexical items, which is generally defined by the syntactic or morphological behaviour of the lexical item. Common linguistic categories of lexical items include noun, verb, adjective, adverb, and preposition, etc. To reflect the syntactic quality of automatically translated sentences, some researchers employ POS information in their evaluation. Using IBM model one, (Popović et al., 2011) evaluate translation quality by calculating the similarity scores of the source and target (translated) sentences without using a reference translation, based on morphemes, 4-gram POS and lexicon probabilities. (Dahlmeier et al., 2011) develop the evaluation metric TESLA, combining the synonyms of bilingual phrase tables and POS information in the matching task. Other similar works using POS
information include (Giménez and Márquez, 2007; Popovic and Ney, 2007; Han et al., 2014).

In linguistics, a phrase may refer to any group of words that forms a constituent and so functions as a single unit in the syntax of a sentence. To measure an MT system's performance in translating new text types, such as in what ways the system itself could be extended to deal with new text types, (Povlsen et al., 1998) perform a research work focusing on an English-to-Danish machine translation system. The syntactic constructions are explored with more complex linguistic knowledge, such as the identification of fronted adverbial subordinate clauses and prepositional phrases. Assuming that similar grammatical structures should occur in both the source and the translation, (Avramidis et al., 2011) perform evaluation on source (German) and target (English) sentences employing features such as the sentence length ratio, unknown words, and phrase numbers including noun phrases, verb phrases and prepositional phrases. Other similar works using phrase similarity include (Li et al., 2012), which uses noun phrases and verb phrases from chunking, (Echizen-ya and Araki, 2010), which only uses noun phrase chunking in automatic evaluation, and (Han et al., 2013a), which designs a universal phrase tagset for French-to-English MT evaluation.

Syntax is the study of the principles and processes by which sentences are constructed in particular languages. To address the overall goodness of the translated sentence's structure, (Liu and Gildea, 2005) employ constituent labels and head-modifier dependencies from a language parser as syntactic features for MT evaluation. They compute the similarity of dependency trees. Their experiments show that adding syntactic information can improve evaluation performance, especially for predicting the fluency of hypothesis translations. Other works that use syntactic information in the evaluation include (Lo and Wu, 2011a) and (Lo et al., 2012), which use an automatic shallow parser.

3.2.2 Semantic Similarity

In contrast to syntactic information, which captures overall grammaticality or sentence structure similarity, the semantic similarity between the automatic translations and the source sentences (or references) can be measured by employing semantic features.

To capture the semantic equivalence of sentences or text fragments, named entity knowledge is brought in from the literature on named-entity recognition, which aims to identify and classify atomic elements in a text into different entity categories (Marsh and Perzanowski, 1998; Guo et al., 2009). The most commonly used entity categories include the names of persons, locations, organizations and times. In the MEDAR 2011 evaluation campaign, one baseline system based on Moses (Koehn et al., 2007a) utilized the Open NLP toolkit to perform named entity detection, in addition to other packages. Low performance on named entities caused a drop in fluency and adequacy. In the quality estimation task of the WMT 2012 workshop, (Buck, 2012) introduced features including named entities, in addition to a discriminative word lexicon, neural networks, back-off behaviour (Raybaud et al., 2011) and edit distance, etc. Experiments on individual features showed that, from the perspective of increasing the correlation with human judgments, the named entity feature contributed nearly the most compared with the other features.

Synonyms are words with the same or close meanings. One of the most widely used synonym databases in the NLP literature is WordNet (Miller et al., 1990), an English lexical database grouping English words into sets of synonyms. WordNet classifies words mainly into four part-of-speech (POS) categories, Noun, Verb, Adjective, and Adverb, excluding prepositions, determiners, etc. Synonymous words or phrases are organized using the unit of a synset. Each synset is a hierarchical structure with words at different levels according to their semantic relations.

Textual entailment is usually used as a directional relation between text fragments. If the truth of one text fragment TA follows from another text fragment TB, then there is a directional relation between TA and TB (TB ⇒ TA). Instead of pure logical or mathematical entailment, textual entailment in natural language processing (NLP) is usually performed with a relaxed or loose definition (Dagan et al., 2006). For instance, according to text fragment TB, if it can be inferred that text fragment TA is most likely to be true, then the relationship TB ⇒ TA also holds. That the relation is directional also means that the inverse
inference (TA ⇒ TB) is not guaranteed to be true (Dagan and Glickman, 2004). Recently, Castillo and Estrella (2012) presented a new approach for MT evaluation based on the task of "Semantic Textual Similarity". This problem is addressed using a textual entailment engine based on WordNet semantic features.

Paraphrase is the restatement of the meaning of a passage or text using other words, which can be seen as bidirectional textual entailment (Androutsopoulos and Malakasiotis, 2010). Instead of the literal, word-by-word and line-by-line translation used by metaphrase, paraphrase represents a dynamic equivalent. Further knowledge of paraphrase from the linguistic perspective is introduced in the works of (McKeown, 1979; Meteer and Shaked, 1988; Barzilay and Lee, 2003). (Snover et al., 2006) describe a new evaluation metric TER-Plus (TERp). Sequences of words in the reference are considered to be paraphrases of a sequence of words in the hypothesis if that phrase pair occurs in the TERp phrase table.

Semantic roles are employed by some researchers as linguistic features in MT evaluation. To utilize the semantic roles, the sentences are usually first shallow parsed and entity tagged. The semantic roles are then used to specify the arguments and adjuncts that occur in both the candidate translation and the reference translation. For instance, the semantic roles introduced by (Giménez and Márquez, 2007; Giménez and Márquez, 2008) include causative agent, adverbial adjunct, directional adjunct, negation marker, and predication adjunct, etc. In further developments, (Lo and Wu, 2011a; Lo and Wu, 2011b) design the metric MEANT to capture the predicate-argument relations as structural relations in semantic frames, which are not reflected by the flat semantic role label features in the work of (Giménez and Márquez, 2007). Furthermore, instead of using uniform weights, (Lo et al., 2012) weight the different types of semantic roles according to their relative importance for the adequate preservation of meaning, which is determined empirically. Generally, the semantic roles account for the semantic structure of a segment and have proved effective for assessing adequacy in the above papers.

Language models are also utilized by MT and MT evaluation researchers. A statistical language model usually assigns a probability to a sequence of words by means of a probability distribution. (Gamon et al., 2005) propose LM-SVM, a language-model and support vector machine method investigating the possibility of evaluating MT quality and fluency in the absence of reference translations. They evaluate the performance of the system when used as a classifier for identifying highly dysfluent and ill-formed sentences.

(Stanojević and Sima'an, 2014a) designed a novel sentence-level MT evaluation metric, BEER, which has the advantage of incorporating a large number of features in a linear model to maximize the correlation with human judgments. To produce smoother sentence-level scores, they explored two kinds of less sparse features: "character n-grams" (e.g. stem checking) and "abstract ordering patterns" (permutation trees). They further investigated the model with more dense features such as adequacy features, fluency features and features based on permutation trees (Stanojević and Sima'an, 2014c). In the latest version, they extended the permutation trees (Gildea et al., 2006) into a permutation-forest model (Stanojević and Sima'an, 2014b), and showed stable, good performance on different language pairs in the WMT sentence-level evaluation task.

Generally, the linguistic features mentioned above, including both syntactic and semantic features, are usually combined in two ways: either by following a machine learning approach (Albrecht and Hwa, 2007; Leusch and Ney, 2009), or by trying to combine a wide variety of metrics in a more simple and straightforward way, such as (Giménez and Márquez, 2008; Specia and Giménez, 2010; Comelles et al., 2012), etc.

4 Evaluating the MT Evaluation

4.1 Statistical Significance

If different MT systems produce translations with different qualities on a data set, how can we ensure that they indeed have different system quality? To explore this problem, (Koehn, 2004) presents a research work on statistical significance tests for machine translation evaluation. The bootstrap resampling method is used to compute statistical significance intervals for evaluation metrics on small test sets. Statistical significance usually
refers to two separate notions. One is the p-value, the probability that the observed data will occur by chance under a given single null hypothesis. The other is the "Type I" error rate of a statistical hypothesis test, also called a "false positive", measured as the probability of incorrectly rejecting a given null hypothesis in favour of a second alternative hypothesis (Hald, 1998).

4.2 Evaluating Human Judgment

Since human judgments are usually trusted as the gold standard that automatic evaluation metrics should try to approach, the reliability and coherence of human judgments is very important. Cohen's kappa agreement coefficient is one of the most commonly used evaluation methods (Cohen, 1960). For the problem of nominal-scale agreement between two judges, there are two relevant quantities, p0 and pc. The factor p0 is the proportion of units on which the judges agreed, and pc is the proportion of units for which agreement is expected by chance. The coefficient k is simply the proportion of chance-expected disagreements which do not occur, or alternatively, it is the proportion of agreement after chance agreement is removed from consideration:

k = (p0 − pc) / (1 − pc),    (14)

where p0 − pc represents the proportion of cases in which beyond-chance agreement occurs and is the numerator of the coefficient (Landis and Koch, 1977).
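A small sketch of Equation (14) for two judges labelling the same items; the observed agreement p0 and the chance agreement pc are estimated from the judges' label distributions. The variable names and the toy labels are our own.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items, Eq. (14)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p0 = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed agreement
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability both judges pick the same label independently
    pc = sum((counts_a[l] / n) * (counts_b[l] / n)
             for l in set(labels_a) | set(labels_b))
    return (p0 - pc) / (1 - pc)

# Two judges labelling the same six segments
judge1 = ["better", "worse", "tie", "better", "worse", "better"]
judge2 = ["better", "worse", "better", "better", "tie", "better"]
print(cohens_kappa(judge1, judge2))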
4.3 Correlating Manual and Automatic Scores

In this section, we introduce three correlation coefficients that are commonly used by the recent WMT workshops to measure the closeness of automatic evaluations and manual judgments. The choice of correlation algorithm depends on whether scores or ranks are used.

4.3.1 Pearson Correlation

Pearson's correlation coefficient (Pearson, 1900) is commonly represented by the Greek letter ρ. The correlation between random variables X and Y, denoted ρ_XY, is measured as follows (Montgomery and Runger, 2003):

ρ_XY = cov(X, Y) / √(V(X)V(Y)) = σ_XY / (σ_X σ_Y).    (15)

Because the standard deviations of the variables X and Y are greater than 0 (σ_X > 0 and σ_Y > 0), the correlation between X and Y is positive, negative or zero exactly when the covariance σ_XY between X and Y is positive, negative or zero, respectively. Based on a sample of paired data (X, Y) as (x_i, y_i), i = 1 to n, the Pearson correlation coefficient is calculated by:

ρ_XY = Σ_{i=1}^{n} (x_i − µ_x)(y_i − µ_y) / (√(Σ_{i=1}^{n} (x_i − µ_x)²) √(Σ_{i=1}^{n} (y_i − µ_y)²)),    (16)

where µ_x and µ_y specify the means of the discrete random variables X and Y respectively.
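Equation (16) can be computed directly from paired metric/human scores; a short sketch follows (scipy.stats.pearsonr would give the same value, along with a p-value). The example score lists are invented.

import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, Eq. (16)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical system-level metric scores vs. human adequacy scores
metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29]
human_scores  = [3.8, 3.1, 4.2, 2.7, 3.4]
print(pearson_r(metric_scores, human_scores))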
4.3.2 Spearman Rank Correlation

The Spearman rank correlation coefficient, a simplified version of the Pearson correlation coefficient, is another algorithm used to measure the correlation between automatic evaluation and manual judges, especially in recent years (Callison-Burch et al., 2008; Callison-Burch et al., 2009; Callison-Burch et al., 2010; Callison-Burch et al., 2011). When there are no ties, the Spearman rank correlation coefficient, sometimes written rs, is calculated as:

rs_φ(XY) = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1)),    (17)

where d_i is the difference (x_i − y_i) between the two corresponding rank variables in X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n} describing the system φ.

4.3.3 Kendall's τ

Kendall's τ (Kendall, 1938) has been used in recent years for the correlation between the automatic ranking and the reference ranking (Callison-Burch et al., 2010; Callison-Burch et al., 2011; Callison-Burch et al., 2012). It is defined as:

τ = (num concordant pairs − num discordant pairs) / total pairs.    (18)

The latest version of Kendall's τ is introduced in (Kendall and Gibbons, 1990). (Lebanon and Lafferty, 2002) give an overview of Kendall's τ, showing its application in calculating how much system orders differ from the reference order. More concretely, (Lapata, 2003) proposes the use of Kendall's τ, a measure of rank correlation, to estimate the distance between a system-generated and a human-generated gold-standard order.
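The rank correlations of Equations (17) and (18) can be sketched as follows for tie-free rankings; for real evaluation data with ties, the implementations in scipy.stats (spearmanr, kendalltau) handle the corrected variants. The example rankings are hypothetical.

from itertools import combinations

def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for tie-free rank vectors, Eq. (17)."""
    n = len(rank_x)
    d_squared = sum((xi - yi) ** 2 for xi, yi in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def kendall_tau(rank_x, rank_y):
    """Kendall's tau: (concordant - discordant) / total pairs, Eq. (18)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_x)), 2):
        agree = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    total = len(rank_x) * (len(rank_x) - 1) / 2
    return (concordant - discordant) / total

# Ranks assigned to five systems by the metric and by human judges
metric_rank = [1, 2, 3, 4, 5]
human_rank  = [2, 1, 3, 5, 4]
print(spearman_rho(metric_rank, human_rank), kendall_tau(metric_rank, human_rank))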
5 Advanced Quality Estimation

In recent years, some MT evaluation methods that do not use manually created gold reference translations have been proposed. They are usually called "Quality Estimation (QE)". Some of the related works have already been mentioned in previous sections. The latest quality estimation tasks of MT can be found in WMT12 to WMT15 (Callison-Burch et al., 2012; Bojar et al., 2013; Bojar et al., 2014; Bojar et al., 2015). They defined a novel evaluation metric that provides some advantages over the traditional ranking metrics. The designed criterion, DeltaAvg, assumes that the reference test set has a number associated with each entry that represents its extrinsic value. Given these values, their metric does not need an explicit reference ranking, the way the Spearman ranking correlation does. The goal of the DeltaAvg metric is to measure how valuable a proposed ranking is according to the extrinsic values associated with the test entries:

DeltaAvg_V[n] = (Σ_{k=1}^{n−1} V(S_{1,k})) / (n − 1) − V(S).    (19)

For the scoring task, they use two evaluation metrics that have traditionally been used for measuring performance in regression tasks: Mean Absolute Error (MAE) as the primary metric, and Root Mean Squared Error (RMSE) as the secondary metric. For a given test set S with entries s_i, 1 ≤ i ≤ |S|, they denote by H(s_i) the proposed score for entry s_i (hypothesis), and by V(s_i) the reference value for entry s_i (gold-standard value):

MAE = (Σ_{i=1}^{N} |H(s_i) − V(s_i)|) / N,    (20)

RMSE = √((Σ_{i=1}^{N} (H(s_i) − V(s_i))²) / N),    (21)

where N = |S|. Both of these metrics are nonparametric, automatic and deterministic (and therefore consistent), and extrinsically interpretable.
6 Discussion and Related Works

So far, human judgment scores of MT results are usually considered as the gold standard that automatic evaluation metrics should try to approach. However, some improper handling in the process also causes problems. For instance, in the ACL WMT 2011 English-Czech task, the multi-annotator agreement kappa value k is very low, and even the exact same string produced by two systems is ranked differently each time by the same annotator. The evaluation results are also highly affected by the manual reference translations. How to ensure the quality of the reference translations and the agreement level of the human judgments are two important problems.

Automatic evaluation metrics are indirect measures of translation quality, because they usually use various string distance algorithms to measure the closeness between the machine translation system outputs and the manually offered reference translations, and they are validated by calculating correlation scores with manual MT evaluation (Moran and Lewis, 2012). Furthermore, automatic evaluation metrics tend to ignore the relevance of words (Koehn, 2010); for instance, named entities and core concepts are more important than punctuation and determiners, but most automatic evaluation metrics put the same weight on each word of the sentence. Third, automatic evaluation metrics usually yield meaningless scores, which are very test-set specific and whose absolute values are not informative. For instance, what is the meaning of a score of -16094 from the MTeRater metric (Parton et al., 2011), or of 1.98 from ROSE (Song and Cohn, 2011)? Automatic evaluation metrics should try to achieve the following goals, as mentioned in (Koehn, 2010): low cost, reducing the time and money spent on carrying out evaluation; tunable, automatically optimizing system performance towards the metric; meaningful, the score should give an intuitive interpretation of translation quality; consistent, repeated use of the metric should give the same results; and correct, the metric must rank better systems higher. The low cost, tunable and consistent characteristics are easily achieved by metric developers, but the remaining two goals (meaningful and correct) are usually the challenges in front of NLP researchers.

There are some earlier related works on MT evaluation surveys or literature reviews. For instance, in the DARPA GALE report (Dorr et al., 2009), researchers first introduced the automatic and semi-automatic MT evaluation measures, and the task and human-in-the-loop measures; then they gave a description of the MT metrology in the GALE program, which focuses on the HTER metric as the standard method used in GALE; finally, they compared some automatic metrics and explored some other usages of the metrics, such as optimization in MT parameter training.

In another research project report, EuroMatrix (EuroMatrix, 2007), researchers first gave an introduction to MT history; then they introduced human evaluation of MT and objective evaluation of MT as the two main sections of the work; finally, they presented a list of the popular evaluation measures at that time, including WER, SER, CDER, X-Score, D-score, NIST, RED, IER and TER, etc.

Márquez (Márquez, 2013) introduced the Asiya online interface developed by his institute for MT output error analysis, where they also briefly mentioned the MT evaluation developments of lexical measures and linguistically motivated measures, and pointed out the challenges in the quality estimation task.

Our work differs from the previous ones by introducing some recent developments of MT evaluation models, the different classifications from manual to automatic evaluation measures, the introduction of recent QE tasks, and the concise construction of the content.

7 Perspective

In this section, we mention several aspects that are useful and will attract much attention in the further development of the MT evaluation field.

Firstly, there is the issue of lexical similarity versus linguistic features. Because natural languages are expressive and ambiguous at different levels (Giménez and Márquez, 2007), lexical similarity based metrics limit their scope to the lexical dimension and are not sufficient to ensure that two sentences convey the same meaning. For instance, the research of (Callison-Burch et al., 2006) and (Koehn and Monz, 2006) reports that lexical similarity metrics tend to favour automatic statistical machine translation systems. If the evaluated systems belong to different types, including rule-based, human-aided, and statistical systems, then lexical similarity metrics, such as BLEU, produce rankings that strongly disagree with those of the human evaluators. So the linguistic features are very important in the MT evaluation procedure. However, inappropriate, excessive or abusive utilization of such features makes them difficult to promote. In the future, how to utilize the linguistic features more accurately, flexibly, and simply will be one tendency in MT evaluation. Furthermore, MT evaluation from the perspective of semantic similarity is more reasonable and comes closer to human judgments, so it should receive more attention.

Secondly, the Quality Estimation tasks differ in several ways from the traditional evaluation, such as extracting reference-independent features from the input sentences and their translations, obtaining quality scores based on models produced from training data, predicting the quality of an unseen translated text at system run-time, filtering out sentences which are not good enough for post-processing, and selecting the best translation among multiple systems, etc., so they will continue to attract many researchers.

Thirdly, some advanced or challenging technologies that can be tried for MT evaluation include deep learning (Gupta et al., 2015; Zhang and Zong, 2015), semantic logic forms, and decipherment models, etc.

8 Acknowledgement

This work was supported by NWO VICI under Grant No. 277-89-002. The second author was supported by the Research Committee of the University of Macau (Grant No. MYRG2015-00175-FST and MYRG2015-00188-FST) and the Science and Technology Development Fund of Macau (Grant No. 057/2014/A). The authors thank Prof. Khalil Sima'an and the colleagues in the SLPL lab at ILLC for valuable comments on this paper, and thank Dr. Ying Shi for editing the first version of the draft.

References

[Albrecht and Hwa2007] J. Albrecht and R. Hwa. 2007. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the ACL, Prague, Czech Republic.
[Androutsopoulos and Malakasiotis2010] Jon Androut- [Brown et al.1993] Peter F Brown, Vincent J Della
sopoulos and Prodromos Malakasiotis. 2010. A sur- Pietra, Stephen A Della Pietra, and Robert L Mer-
vey of paraphrasing and textual entailment methods. cer. 1993. The mathematics of statistical machine
Journal of Artificial Intelligence Research, 38:135– translation: Parameter estimation. Computational
187. linguistics, 19(2):263–311.

[Arnold2003] D. Arnold. 2003. Computers and Trans- [Buck2012] Christian Buck. 2012. Black box features
lation: A translator’s guide-Chap8 Why translation for the wmt 2012 quality estimation shared task. In
is difficult for computers. Benjamins Translation Li- Proceedings of WMT.
brary.
[Callison-Burch et al.2006] Chris Callison-Burch,
[Avramidis et al.2011] Eleftherios Avramidis, Maja Philipp Koehn, and Miles Osborne. 2006. Improved
Popovic, David Vilar, and Aljoscha Burchardt. statistical machine translation using paraphrases. In
2011. Evaluate with confidence estimation: Proceedings of HLT-NAACL.
Machine ranking of translation outputs using
grammatical features. In Proceedings of WMT. [Callison-Burch et al.2007] Chris Callison-Burch,
Cameron Fordyce, Philipp Koehn, Christof Monz,
[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun and Josh Schroeder. 2007. (meta-) evaluation of
Cho, and Yoshua Bengio. 2014. Neural machine machine translation. In Proceedings of WMT.
translation by jointly learning to align and translate.
CoRR, abs/1409.0473. [Callison-Burch et al.2008] Chris Callison-Burch,
Cameron Fordyce, Philipp Koehn, Christof Monz,
[Banerjee and Lavie2005] Satanjeev Banerjee and Alon and Josh Schroeder. 2008. Further meta-evaluation
Lavie. 2005. Meteor: An automatic metric for of machine translation. In Proceedings of WMT.
mt evaluation with improved correlation with human
judgments. In Proceedings of the ACL. [Callison-Burch et al.2009] Chris Callison-Burch,
Philipp Koehn, Christof Monz, and Josh Schroeder.
[Bangalore et al.2000] Srinivas Bangalore, Owen Ram- 2009. Findings of the 2009 workshop on statistical
bow, and Steven Whittaker. 2000. Evaluation met- machine translation. In Proceedings of the 4th
rics for generation. In Proceedings of INLG. WMT.

[Callison-Burch et al.2010] Chris Callison-Burch,


[Barzilayand and Lee2003] Regina Barzilayand and
Philipp Koehn, Christof Monz, Kay Peterson, Mark
Lillian Lee. 2003. Learning to paraphrase: an
Przybocki, and Omar F. Zaridan. 2010. Findings
unsupervised approach using multiple-sequence
of the 2010 joint workshop on statistical machine
alignment. In Proceedings NAACL.
translation and metrics for machine translation. In
Proceedings of the WMT.
[Bojar et al.2013] Ondřej Bojar, Christian Buck, Chris
Callison-Burch, Christian Federmann, Barry Had- [Callison-Burch et al.2011] Chris Callison-Burch,
dow, Philipp Koehn, Christof Monz, Matt Post, Philipp Koehn, Christof Monz, and Omar F. Zari-
Radu Soricut, and Lucia Specia. 2013. Findings dan. 2011. Findings of the 2011 workshop on
of the 2013 workshop on statistical machine transla- statistical machine translation. In Proceedings of
tion. In Proceedings of WMT. WMT.
[Bojar et al.2014] Ondrej Bojar, Christian Buck, Chris- [Callison-Burch et al.2012] Chris Callison-Burch,
tian Federmann, Barry Haddow, Philipp Koehn, Jo- Philipp Koehn, Christof Monz, Matt Post, Radu
hannes Leveling, Christof Monz, Pavel Pecina, Matt Soricut, and Lucia Specia. 2012. Findings of the
Post, Herve Saint-Amand, Radu Soricut, Lucia Spe- 2012 workshop on statistical machine translation.
cia, and Aleš Tamchyna. 2014. Findings of the In Proceedings of WMT.
2014 workshop on statistical machine translation.
In Proceedings of the Ninth Workshop on Statisti- [Carroll1966] John B. Carroll. 1966. An experiment
cal Machine Translation, pages 12–58, Baltimore, in evaluating the quality of translation. Mechani-
Maryland, USA, June. Association for Computa- cal Translation and Computational Linguistics, 9(3-
tional Linguistics. 4):67–75.
[Bojar et al.2015] Ondřej Bojar, Rajen Chatterjee, [Chen et al.2012] Boxing Chen, Roland Kuhn, and
Christian Federmann, Barry Haddow, Matthias Samuel Larkin. 2012. Port: a precision-order-recall
Huck, Chris Hokamp, Philipp Koehn, Varvara Lo- mt evaluation metric for tuning. In Proceedings of
gacheva, Christof Monz, Matteo Negri, Matt Post, the ACL.
Carolina Scarton, Lucia Specia, and Marco Turchi.
2015. Findings of the 2015 workshop on statistical [Chiang2005] David Chiang. 2005. A hierarchical
machine translation. In Proceedings of the Tenth phrase-based model for statistical machine transla-
Workshop on Statistical Machine Translation, pages tion. In Proceedings of the 43rd Annual Meet-
1–46, Lisbon, Portugal, September. Association for ing of the Association for Computational Linguistics
Computational Linguistics. (ACL), pages 263–270.
[Cho et al.2014] KyungHyun Cho, Bart van Merrien- [Federico et al.2011] Marcello Federico, Luisa Ben-
boer, Dzmitry Bahdanau, and Yoshua Bengio. tivogli, Michael Paul, and Sebastian Stüker. 2011.
2014. On the properties of neural machine Overview of the iwslt 2011 evaluation campaign. In
translation: Encoder-decoder approaches. CoRR, In proceeding of International Workshop on Spoken
abs/1409.1259. Language Translation (IWSLT).
[Church and Hovy1991] Kenneth Church and Eduard [Fraser and Marcu2007] Alexander Fraser and Daniel
Hovy. 1991. Good applications for crummy ma- Marcu. 2007. Measuring word alignment quality
chine translation. In Proceedings of the Natu- for statistical machine translation. Computational
ral Language Processing Systems Evaluation Work- Linguistics.
shop.
[Gamon et al.2005] Michael Gamon, Anthony Aue, and
[Cohen1960] Jasob Cohen. 1960. A coefficient of Martine Smets. 2005. Sentence-level mt evalua-
agreement for nominal scales. Educational and Psy- tion without reference translations beyond language
chological Measurement, 20(1):37–46.

modelling. In Proceedings of EAMT, pages 103–112.

[Comelles et al.2012] Elisabet Comelles, Jordi Atserias, Victoria Arranz, and Irene Castellón. 2012. Verta: Linguistic features in mt evaluation. In LREC, pages 3944–3950.

[Dagan and Glickman2004] Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. In Learning Methods for Text Understanding and Mining workshop.

[Dagan et al.2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. Machine Learning Challenges: LNCS, 3944:177–190.

[Dahlmeier et al.2011] Daniel Dahlmeier, Chang Liu, and Hwee Tou Ng. 2011. Tesla at wmt2011: Translation evaluation and tunable metric. In Proceedings of WMT.

[Doddington2002] George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In HLT Proceedings.

[Dorr et al.2009] Bonnie Dorr, Matt Snover, Nitin Madnani, et al. 2009. Part 5: Machine translation evaluation. In DARPA GALE program report, edited by Bonnie Dorr.

[Doyon et al.1999] Jennifer B. Doyon, John S. White, and Kathryn B. Taylor. 1999. Task-based evaluation for machine translation. In Proceedings of MT Summit 7.

[Echizen-ya and Araki2010] H. Echizen-ya and K. Araki. 2010. Automatic evaluation method for machine translation using noun-phrase chunking. In Proceedings of the ACL.

[Eck and Hori2005] Matthias Eck and Chiori Hori. 2005. Overview of the iwslt 2005 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

[EuroMatrix2007] Project EuroMatrix. 2007. 1.3: Survey of machine translation evaluation. In EuroMatrix Project Report, Statistical and Hybrid MT between All European Languages, co-ordinator: Prof. Hans Uszkoreit.

[Gildea et al.2006] Daniel Gildea, Giorgio Satta, and Hao Zhang. 2006. Factoring synchronous grammars by sorting. In Proceedings of ACL.

[Giménez and Márquez2007] Jesús Giménez and Lluís Márquez. 2007. Linguistic features for automatic evaluation of heterogenous mt systems. In Proceedings of WMT.

[Giménez and Márquez2008] Jesús Giménez and Lluís Márquez. 2008. A smorgasbord of features for automatic mt evaluation. In Proceedings of WMT, pages 195–198.

[Group2010] NIST Multimodal Information Group. 2010. Nist 2005 open machine translation (openmt) evaluation. Philadelphia: Linguistic Data Consortium. Report.

[Guo et al.2009] Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of SIGIR.

[Gupta et al.2015] Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. Machine translation evaluation using recurrent neural networks. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 380–384, Lisbon, Portugal, September. Association for Computational Linguistics.

[Hald1998] Anders Hald. 1998. A History of Mathematical Statistics from 1750 to 1930. Wiley-Interscience, 1st edition. ISBN-10: 0471179124.

[Han et al.2012] Aaron Li Feng Han, Derek Fai Wong, and Lidia Sam Chao. 2012. A robust evaluation metric for machine translation with augmented factors. In Proceedings of COLING.

[Han et al.2013a] Aaron Li Feng Han, Derek Fai Wong, Lidia Sam Chao, Liangeye He, Shuo Li, and Ling Zhu. 2013a. Phrase tagset mapping for french and english treebanks and its application in machine translation evaluation. In International Conference of the German Society for Computational Linguistics and Language Technology, LNAI Vol. 8105, pages 119–131.
[Han et al.2013b] Aaron Li Feng Han, Derek Fai Wong, Lidia Sam Chao, Yi Lu, Liangye He, Yiming Wang, and Jiaji Zhou. 2013b. A description of tunable machine translation evaluation systems in wmt13 metrics task. In Proceedings of WMT, pages 414–421.

[Han et al.2014] Aaron Li Feng Han, Derek Fai Wong, Lidia Sam Chao, Liangeye He, and Yi Lu. 2014. Unsupervised quality estimation model for english to german translation and its application in extensive supervised evaluation. The Scientific World Journal, Issue: Recent Advances in Information Technology, pages 1–12.

[Hwang et al.2007] Young Sook Hwang, Andrew Finch, and Yutaka Sasaki. 2007. Improving statistical machine translation using shallow linguistic knowledge. Computer Speech and Language, 21(2):350–372.

[Kendall and Gibbons1990] Maurice G. Kendall and Jean Dickinson Gibbons. 1990. Rank Correlation Methods. Oxford University Press, New York.

[Kendall1938] Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30:81–93.

[Khalilov and Fonollosa2011] Maxim Khalilov and José A. R. Fonollosa. 2011. Syntax-based reordering for statistical machine translation. Computer Speech and Language, 25(4):761–788.

[King et al.2003] Margaret King, Andrei Popescu-Belis, and Eduard Hovy. 2003. Femti: Creating and using a framework for mt evaluation. In Proceedings of the Machine Translation Summit IX.

[Koehn and Monz2005] Philipp Koehn and Christof Monz. 2005. Shared task: Statistical machine translation between european languages. In Proceedings of the ACL Workshop on Building and Using Parallel Texts.

[Koehn and Monz2006] Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between european languages. In Proceedings of the Workshop on Statistical Machine Translation, pages 102–121, New York City, June. Association for Computational Linguistics.

[Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

[Koehn et al.2007a] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007a. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.

[Koehn et al.2007b] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007b. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.

[Koehn2004] Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

[Koehn2010] Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.

[Landis and Koch1977] J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

[Laoudi et al.2006] Jamal Laoudi, Ra R. Tate, and Clare R. Voss. 2006. Task-based mt evaluation: From who/when/where extraction to event understanding. In Proceedings of LREC06, pages 2048–2053.

[Lapata2003] Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of ACL.

[Lebanon and Lafferty2002] Guy Lebanon and John Lafferty. 2002. Combining rankings using conditional probability models on permutations. In Proceedings of ICML.

[Leusch and Ney2009] Gregor Leusch and Hermann Ney. 2009. Edit distances with block movements and error rate confidence estimates. Machine Translation, 23(2-3).

[Li et al.2012] Liang You Li, Zheng Xian Gong, and Guo Dong Zhou. 2012. Phrase-based evaluation for machine translation. In Proceedings of COLING, pages 663–672.

[Lin and Hovy2003] Chin-Yew Lin and E. H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of NAACL.

[Lin and Och2004] Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL.

[Liu and Gildea2005] Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

[Liu et al.2011] Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2011. Better evaluation metrics lead to better machine translation. In Proceedings of EMNLP.
[Lo and Wu2011a] Chi Kiu Lo and Dekai Wu. 2011a. Meant: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of ACL.

[Lo and Wu2011b] Chi Kiu Lo and Dekai Wu. 2011b. Structured vs. flat semantic role representations for machine translation evaluation. In Proceedings of the 5th Workshop on Syntax and Structure in Statistical Translation (SSST-5).

[Lo et al.2012] Chi Kiu Lo, Anand Karthik Turmuluru, and Dekai Wu. 2012. Fully automatic semantic mt evaluation. In Proceedings of WMT.

[Mariño et al.2006] José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram based machine translation. Computational Linguistics, 32(4):527–549.

[Marsh and Perzanowski1998] Elaine Marsh and Dennis Perzanowski. 1998. Muc-7 evaluation of ie technology: Overview of results. In Proceedings of the Message Understanding Conference (MUC-7).

[McKeown1979] Kathleen R. McKeown. 1979. Paraphrasing using given and new information in a question-answer system. In Proceedings of ACL.

[Menezes et al.2006] Arul Menezes, Kristina Toutanova, and Chris Quirk. 2006. Microsoft research treelet translation system: Naacl 2006 europarl evaluation. In Proceedings of WMT.

[Meteer and Shaked1988] Marie Meteer and Varda Shaked. 1988. Strategies for effective paraphrasing. In Proceedings of COLING.

[Miller et al.1990] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Wordnet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244.

[Montgomery and Runger2003] Douglas C. Montgomery and George C. Runger. 2003. Applied statistics and probability for engineers. John Wiley and Sons, New York, third edition.

[Moran and Lewis2012] John Moran and David Lewis. 2012. Unobtrusive methods for low-cost manual assessment of machine translation. Tralogy I [Online], Session 5.

[Márquez2013] L. Márquez. 2013. Automatic evaluation of machine translation quality. Dialogue 2013 invited talk, extended.

[Och2003] Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of ACL.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.

[Parton et al.2011] Kristen Parton, Joel Tetreault, Nitin Madnani, and Martin Chodorow. 2011. E-rating machine translation. In Proceedings of WMT.

[Paul et al.2010] Michael Paul, Marcello Federico, and Sebastian Stüker. 2010. Overview of the iwslt 2010 evaluation campaign. In Proceedings of IWSLT.

[Paul2009] M. Paul. 2009. Overview of the iwslt 2009 evaluation campaign. In Proceedings of IWSLT.

[Pearson1900] Karl Pearson. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50(5):157–175.

[Popović et al.2011] Maja Popović, David Vilar, Eleftherios Avramidis, and Aljoscha Burchardt. 2011. Evaluation without references: Ibm1 scores as evaluation metrics. In Proceedings of WMT.

[Popovic and Ney2007] M. Popovic and Hermann Ney. 2007. Word error rates: Decomposition over pos classes and applications for error analysis. In Proceedings of WMT.

[Povlsen et al.1998] Claus Povlsen, Nancy Underwood, Bradley Music, and Anne Neville. 1998. Evaluating text-type suitability for machine translation: a case study on an english-danish system. In Proceedings of LREC.

[Raybaud et al.2011] Sylvain Raybaud, David Langlois, and Kamel Smaïli. 2011. "This sentence is wrong." Detecting errors in machine-translated sentences. Machine Translation, 25(1):1–34.

[Sánchez-Martínez and Forcada2009] F. Sánchez-Martínez and M. L. Forcada. 2009. Inferring shallow-transfer machine translation rules from small parallel corpora. Journal of Artificial Intelligence Research, 34:605–635.

[Snover et al.2006] Matthew Snover, Bonnie J. Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.

[Song and Cohn2011] Xingyi Song and Trevor Cohn. 2011. Regression and ranking based optimisation for sentence level mt evaluation. In Proceedings of WMT.

[Specia and Giménez2010] L. Specia and J. Giménez. 2010. Combining confidence estimation and reference-based metrics for segment-level mt evaluation. In The Ninth Conference of the Association for Machine Translation in the Americas (AMTA).

[Specia et al.2011] Lucia Specia, Naheh Hajlaoui, Catalina Hallett, and Wilker Aziz. 2011. Predicting machine translation adequacy. In Machine Translation Summit XIII.
[Stanojević and Sima’an2014a] Miloš Stanojević and Khalil Sima’an. 2014a. Beer: Better evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

[Stanojević and Sima’an2014b] Miloš Stanojević and Khalil Sima’an. 2014b. Evaluating word order recursively over permutation-forests. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

[Stanojević and Sima’an2014c] Miloš Stanojević and Khalil Sima’an. 2014c. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

[Su et al.1992] Keh-Yih Su, Wu Ming-Wen, and Chang Jing-Shin. 1992. A new quantitative quality measure for machine translation systems. In Proceedings of COLING.

[Tillmann et al.1997] Christoph Tillmann, Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf. 1997. Accelerated dp based search for statistical translation. In Proceedings of EUROSPEECH.

[Turian et al.2006] Joseph P. Turian, Luke Shea, and I. Dan Melamed. 2006. Evaluation of machine translation and its evaluation. Technical report, DTIC Document.

[Voss and Tate2006] Clare R. Voss and Ra R. Tate. 2006. Task-based evaluation of machine translation (mt) engines: Measuring how well people extract who, when, where-type elements in mt output. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT-2006), pages 203–212.

[Weaver1955] Warren Weaver. 1955. Translation. Machine Translation of Languages: Fourteen Essays.

[White and Taylor1998] John S. White and Kathryn B. Taylor. 1998. A task-oriented evaluation metric for machine translation. In Proceedings of LREC.

[White et al.1994] John S. White, Theresa O’Connell, and Francis O’Mara. 1994. The arpa mt evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of AMTA.

[Wolk and Marasek2015] Krzysztof Wolk and Krzysztof Marasek. 2015. Neural-based machine translation for medical text domain. Based on european medicines agency leaflet texts. Procedia Computer Science, 64:2–9. Conference on ENTERprise Information Systems / International Conference on Project MANagement / Conference on Health and Social Care Information Systems and Technologies, CENTERIS/ProjMAN/HCist 2015, October 7-9, 2015.

[Wong and yu Kit2009] Billy Wong and Chunyu Kit. 2009. Atec: automatic evaluation of machine translation via word choice and word order. Machine Translation, 23(2-3):141–155.

[Zhang and Zong2015] Jiajun Zhang and Chengqing Zong. 2015. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, (5):16–25.
