
#GEN-AI RESEARCH PAPERS

Exploring GEMBA: A New LLM-Based Metric for Translation Quality Assessment
Using LLMs for evaluating translation quality

Dr. Varshita Sher

Published in Towards Data Science
9 min read · Sep 30

Image generated by Author using DALL.E 2

Introduction
I recently read an intriguing paper from the Microsoft¹ team (published in May 2023)
that caught my attention. The paper delves into the world of translation evaluation,
shedding light on an innovative metric called GEMBA (GPT Estimation Metric Based
Assessment). In this blog post, we’ll dissect the paper and provide insights into this
exciting development.

1. Premise: Exploring the motivation behind the paper

2. Research Questions and Hypotheses: Exploring the paper’s primary research question and hypotheses

3. Metrics for Assessing Translation Quality: Shallow-dive into existing metrics, including BLEU, COMET, and METEOR

4. Introducing GEMBA: Deep dive into the novel GEMBA metric

5. Experiment Details: Insights into the experiments conducted

6. Key Findings: Highlighting the main results from the paper

7. Limitations: Discussing things to look out for before implementing GEMBA in production

1. Premise
LLMs, although not initially designed for translation tasks, have demonstrated impressive performance in this domain. This realization led the authors to explore using LLMs as evaluation tools for translations. The paper’s central idea is quite straightforward: positioning LLMs (Large Language Models) as tools for evaluating translations, not just for performing them. The authors propose a new metric called GEMBA, which outperforms existing state-of-the-art metrics for translation quality assessment.

2. Research Question

Can LLMs be used for quality assessment of translations?

3. Translation Quality Assessment Metrics
Before delving into GEMBA, let’s take a quick look at the existing metrics used to
evaluate the quality of machine-generated translations, such as BLEU, COMET,
METEOR, etc. These metrics have their own strengths and are suited to different use
cases, depending on the specific aspects of translation quality that are most important
for the evaluation.

For instance, BLEU (Bilingual Evaluation Understudy) primarily focuses on n-gram precision, which means it measures how many n-word sequences in the machine translation overlap with n-word sequences in one or more reference translations. It rewards the presence of specific word sequences and does not explicitly consider stemming, synonymy, or global word order. METEOR (Metric for Evaluation of Translation with Explicit ORdering), on the other hand, goes beyond basic n-gram matching and takes a more holistic approach to translation evaluation. It considers multiple aspects of translation quality, including word stemming, synonymy, word order, exact word matches, and even penalties for untranslated words. Similarly, COMET (Crosslingual Optimized Metric for Evaluation of Translation) takes a slightly different approach by focusing on semantic similarity: it compares the machine translation output against a reference translation using pretrained embeddings, with a model trained on human quality judgments. In simpler words, it evaluates how well the content and meaning of the machine translation match the reference, regardless of the specific linguistic variations or word choices.
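
If you want to get a feel for these scores in practice, here is a minimal sketch (not from the paper) that computes BLEU with the sacrebleu Python package; the hypothesis and reference sentences are made up for illustration:

```python
import sacrebleu  # pip install sacrebleu

# Made-up machine translation output and its human reference
hypotheses = ["The fast brown fox jumps over the lazy dog."]
references = [["The quick brown fox jumps over the lazy dog."]]  # one reference stream

# corpus_bleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # BLEU on a 0-100 scale
```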

You can learn about other evaluation metrics such as YiSi, chrF, BERTScore, etc., here.

Given the plethora of metrics we discussed just now, you may ask — Why do we need a
new metric like GEMBA?
The answer lies in its unique approach — prompting LLMs to assess translations based
on their own judgment. Unlike traditional metrics, GEMBA seeks to align with human
assessment of translations by scoring them on a scale (say 0 to 100), focusing on both
meaning and grammar.

4. Introducing GEMBA
As mentioned previously, GEMBA is a GPT-based metric for the assessment of
translation quality, which works both with a reference translation and without.

In essence, GEMBA is a really well-engineered prompt for the evaluation task and
consists of:

prompt variant (from a pre-defined set of four variants)

source language name, e.g., “Chinese”

target language name, e.g., “English”

source segments i.e. the sentence that needs to be translated

candidate translations i.e. the translated sentence

[optional] reference translations i.e. the baseline translation that can be used as
ground truth

Here’s an example of one of the prompt variants: GEMBA-DA (Direct Assessment)

GEMBA-DA prompt. Image taken from original paper
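
To make the recipe above concrete, here is a minimal sketch of how a GEMBA-DA-style request could be assembled and sent to a chat model using the OpenAI Python client. The prompt wording is paraphrased from the paper, the model name is a placeholder, and the helper function gemba_da_score is my own naming, so treat this as an illustration rather than the authors’ exact implementation:

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gemba_da_score(source_lang, target_lang, source_seg, candidate, reference=None):
    """Build a GEMBA-DA-style prompt and return the model's raw answer."""
    ref_clause = "with respect to the human reference " if reference else ""
    ref_line = f'{target_lang} human reference: "{reference}"\n' if reference else ""
    # Paraphrased direct-assessment prompt: score 0-100 for meaning and grammar
    prompt = (
        f"Score the following translation from {source_lang} to {target_lang} {ref_clause}"
        "on a continuous scale from 0 to 100, where a score of zero means "
        '"no meaning preserved" and a score of one hundred means '
        '"perfect meaning and grammar".\n\n'
        f'{source_lang} source: "{source_seg}"\n'
        f"{ref_line}"
        f'{target_lang} translation: "{candidate}"\n'
        "Score:"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper compares several GPT-family models
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

In the paper’s setup, the model’s raw answer would then be post-processed into a numeric score (more on that in the Limitations section).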


P.S. In case you are interested in the other three variants, here is detailed information on all
prompt variants introduced by the paper:

GEMBA prompt variants. Image taken from original paper

5. Experimentation and Evaluation


The authors tested the efficiency of the GEMBA metric using the widely popular MQM 2022 (Multidimensional Quality Metrics) dataset as an evaluation set. This dataset includes a diverse range of (100K+) sentences from various domains such as news, social, e-commerce, etc., and covers three translation directions: English to Russian, English to German, and Chinese to English.

Additionally, to find the best GPT model for implementing GEMBA, the authors tested each of the prompt variants with seven models from the GPT family, ranging from GPT-2 up to the latest GPT-4 model.
7 GPT models used for evaluating GEMBA. Image taken from original paper

Having set the stage for experimenting with the GEMBA metric, the next obvious question is:

Q: How do we tell if GEMBA is performing better than conventional metrics such as BLEU and COMET?
A: If GEMBA scores correspond closely with what a human thinks of the translation, we have a winner!

To operationalize that answer, there are two metrics that need to be calculated, Kendall’s Tau and accuracy (Kocmi et al., 2021), depending on whether we are undertaking segment-level evaluation or system-level evaluation. But first, what are they?

System level evaluation assesses the overall performance of a machine translation system as
a whole. It looks at the quality of translations generated by the system across a wide range of
text or content.
Segment level evaluation focuses on assessing the quality of translations on a per-segment basis (typically a sentence or a smaller unit of text).

Generally speaking:

Kendall’s Tau is used for segment-level evaluations

Accuracy is used for system-level evaluations

Let’s take a deep dive into their formulas for clarity using simple examples:

A. Kendall’s Tau

(Kendall’s Tau tells if there is a correlation between 2 rankings)

Suppose you have three translations (A, B, and C) of a given sentence, and you want to assess the correlation between rankings produced by a metric (e.g., LLM, BLEU, or METEOR scores) and human judgments of translation quality.

Reference (Human): “The quick brown fox jumps over the lazy dog.”
Translation A: “The fast brown fox jumps over the lazy dog.”
Translation B: “The quick red fox jumps over the lazy dog.”
Translation C: “The lazy dog is jumped over by the quick brown fox.”
Human Ranking: A > B > C (i.e., they prefer Translation A the most, then B, and lastly
C)
Metric scores for these translations:
LLM(A) = 0.85
LLM(B) = 0.75
LLM(C) = 0.60

With all this information, we can calculate Kendall’s Tau as follows:

Kendall’s Tau = (number of concordant pairs − number of discordant pairs) / (number of concordant pairs + number of discordant pairs)

Next, let’s calculate concordant and discordant pairs:

Pair 1: (A, B)
Human Ranking: A > B
LLM Scores: LLM(A) = 0.85 > LLM(B) = 0.75
Result: Concordant pair (both human and metric prefer A over B).

Pair 2: (A, C)
Human Ranking: A > C
LLM Scores: LLM(A) = 0.85 > LLM(C) = 0.60
Result: Concordant pair (both human and metric prefer A over C).
Pair 3: (B, C)
Human Ranking: B > C
LLM Scores: LLM(B) = 0.75 > LLM(C) = 0.60
Result: Concordant pair (both human and metric prefer B over C).

Plugging these into the formula we get:

τ = (3 − 0) / (3 + 0) = 1

In other words, τ = 1 suggests perfect agreement between the metric and human judgment, which makes it a high-quality metric that can be used for automating the evaluation of translation quality.

B. Accuracy

Kendall’s Tau assesses the similarity or agreement between rankings, whereas accuracy measures the correctness of rankings.

To illustrate the calculation of accuracy, take the same setup as above, i.e., Reference (Human), Translation A, Translation B, Translation C, and Human Ranking, but let’s update the metric scores a bit so that C is marked as a better translation than B according to BLEU (creating a disagreement with the human ranking):

Metric Scores (BLEU):


BLEU(A) = 0.80
BLEU(B) = 0.70
BLEU(C) = 0.75

With all the information, here’s how to calculate accuracy:

Accuracy = (number of pairs where sign(metric Δ) = sign(human Δ)) / (total number of pairs)

Let’s calculate the metric Δ (which is nothing but the difference in metric values for a pair of translations) and the human Δ (which is the difference in the human scores for a pair of translations). If you look at the formula closely, you will notice we are not interested in the actual value of the Δ but in its sign. Put simply, a high accuracy can only be achieved if the signs of both Δs are the same, i.e., the human and the metric are in alignment about a translation.

Pair 1: (A, B)
Metric Δ = BLEU(A) − BLEU(B) = 0.80 − 0.70 = 0.10
Human Δ = 1 (A is ranked higher than B)
Result: Metric Δ and Human Δ have the same sign (both positive). This is a rank agreement.

Pair 2: (A, C)
Metric Δ = BLEU(A) − BLEU(C) = 0.80 − 0.75 = 0.05
Human Δ = 2 (A is ranked higher than C)
Result: Metric Δ and Human Δ have the same sign (both positive). This is a rank agreement.

Pair 3: (B, C)
Metric Δ = BLEU(B) − BLEU(C) = 0.70 − 0.75 = −0.05
Human Δ = 1 (B is ranked higher than C)
Result: Metric Δ and Human Δ have different signs (metric is negative, human is positive). This is a rank disagreement.

Plugging these into the formula we get:

Accuracy(BLEU) = (2/3) * 100 = 67%, meaning the BLEU metric correctly ranks 2 out of 3 pairs of translations as per human judgment. Whether or not this percentage is good enough to automate evaluation with BLEU, I will leave to the reader’s discretion!
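
If you prefer code to hand calculations, here is a small, self-contained Python sketch (my own, not from the paper) that reproduces both toy computations above:

```python
from itertools import combinations

# Human ranking expressed as scores (higher = better): A > B > C
human = {"A": 3, "B": 2, "C": 1}

# Metric scores from the two worked examples above
llm  = {"A": 0.85, "B": 0.75, "C": 0.60}   # Kendall's Tau example
bleu = {"A": 0.80, "B": 0.70, "C": 0.75}   # Accuracy example

def kendalls_tau(metric, human):
    """(concordant - discordant) / (concordant + discordant), ignoring ties."""
    concordant = discordant = 0
    for x, y in combinations(metric, 2):
        product = (metric[x] - metric[y]) * (human[x] - human[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def pairwise_accuracy(metric, human):
    """Share of pairs where the metric delta has the same sign as the human delta."""
    pairs = list(combinations(metric, 2))
    agreements = sum(
        1 for x, y in pairs
        if (metric[x] - metric[y]) * (human[x] - human[y]) > 0
    )
    return agreements / len(pairs)

print(kendalls_tau(llm, human))        # 1.0 -> perfect agreement
print(pairwise_accuracy(bleu, human))  # 0.666... -> 2 out of 3 pairs agree
```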

Note: The examples used to demonstrate the calculation of Kendall’s Tau and Accuracy were
simplified for demonstration purposes. In reality, the formulas get more complicated if ties
need to be handled i.e. if a human/metric gives the same ranking to two or more translations.
You can read more about them here.
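
In practice you would rarely implement the tie-aware version by hand; scipy.stats.kendalltau, for instance, computes the tau-b variant by default, which adjusts for ties. A tiny illustration with a hypothetical tie:

```python
from scipy.stats import kendalltau

human_scores  = [3, 2, 1]        # human ranking A > B > C
metric_scores = [0.8, 0.7, 0.7]  # hypothetical metric that ties B and C

tau, p_value = kendalltau(human_scores, metric_scores)  # tau-b adjusts for the tie
print(round(tau, 3))  # ~0.816, lower than 1.0 because of the tied pair
```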

6. Key Results: System vs. Segment Level Evaluation


The paper reports that GEMBA excels in system-level evaluations, surpassing existing
metrics.

Results from system-level evaluations. Image taken from original paper


However, in segment-level evaluations, there is room for improvement. Ties in rankings between LLMs and humans at this level may account for this discrepancy, as Kendall’s Tau penalizes ties. Since the GEMBA-DA metric returns a discrete value between 0 and 100, there is a high probability that two translations will obtain an equal score.

Results from segment-level evaluations (P.S. the first column Accuracy is the same one from the previous table).
Image taken from original paper

The results also stress the importance of choosing the right LLM for implementing GEMBA. Among the seven models from the GPT family tested, only models from GPT-3.5 onwards showed promising results. GPT-4 stood out as the top performer, but models like Davinci, ChatGPT Turbo, and GPT-3.5 also performed well.

GEMBA implementation with various models from the GPT family. Image taken from original paper

7. Limitations and Considerations

The paper highlights certain limitations that need to be addressed before GEMBA can be applied more broadly.

There is a need for evaluation of GEMBA with low-resource languages since the
paper only considers English, Chinese, Russian, and German.

There could be potential data leakage concerns, as it is uncertain whether the test data was included in OpenAI’s training (the secret sauce hasn’t been released by OpenAI at the time of writing). Having said that, the likelihood is very low as GPT models claim to have a knowledge cutoff date of Sep 2021, and the MQM dataset was released in Dec 2022.

There could be occasional invalid responses from LLMs:

► textual answer instead of a score → handled by increasing the temperature until a numeric score is produced.
► “2”, “two”, “**”, “★★”, “two stars”, or “2 stars” → handled in post-processing to maintain uniformity (a sketch of this follows below).
► outputs from the LLM in the non-English target language (such as 星 or 五) were excluded from the analysis.

Number of invalid responses. Image taken from original paper.
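
For illustration, here is a rough sketch (my own, not the authors’ code) of the kind of post-processing the second bullet above refers to, mapping assorted answer formats back to an integer score:

```python
import re

# Hypothetical helper: normalise a raw answer ("2", "two", "★★", "2 stars", ...) to an integer
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def parse_score(raw: str):
    """Return an integer score if one can be recovered, otherwise None (invalid response)."""
    text = raw.strip().lower()

    match = re.search(r"\d+", text)          # "2", "2 stars", "score: 85"
    if match:
        return int(match.group())

    for word, value in WORDS.items():        # "two", "two stars"
        if word in text:
            return value

    stars = text.count("★") or text.count("*")  # "★★", "**"
    if stars:
        return stars

    return None  # e.g. a free-text answer or a non-English numeral such as 星

print([parse_score(s) for s in ["85", "two stars", "★★", "looks fluent to me"]])
# [85, 2, 2, None]
```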

Conclusion
Keeping in mind the ease of implementing GEMBA with a simple prompt, it definitely stands out as a groundbreaking metric for translation quality assessment. Its alignment with human judgment and its adaptability to various LLMs make it a compelling addition to the field of NLP and translation evaluation. As we continue to explore and refine GEMBA (perhaps with few-shot prompting), it holds promise as a valuable tool for ensuring high-quality translations in diverse language contexts.

[1] Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators
of translation quality. arXiv preprint arXiv:2302.14520.
