Exploring GEMBA: A New LLM-Based Metric For Translation Quality Assessment
by Dr. Varshita Sher, Towards Data Science, Sep 2023
Introduction
I recently read an intriguing paper from the Microsoft¹ team (published in May 2023)
that caught my attention. The paper delves into the world of translation evaluation,
shedding light on an innovative metric called GEMBA (GPT Estimation Metric Based
Assessment). In this blog post, we’ll dissect the paper and provide insights into this
exciting development.
Research Question
There are already many automatic evaluation metrics for machine translation, such as YiSi, chrF, and BERTScore. Given this plethora of metrics, you may ask: why do we need a
new metric like GEMBA?
The answer lies in its unique approach — prompting LLMs to assess translations based
on their own judgment. Unlike traditional metrics, GEMBA seeks to align with human
assessment of translations by scoring them on a scale (say 0 to 100), focusing on both
meaning and grammar.
Introducing GEMBA
As mentioned previously, GEMBA is a GPT-based metric for the assessment of
translation quality, which works both with a reference translation and without.
In essence, GEMBA is a really well-engineered prompt for the evaluation task (see the sketch after this list) and consists of:
the source and target language names
the source segment, i.e. the text that was translated
the candidate translation to be scored
[optional] a reference translation, i.e. the baseline human translation that can be used as ground truth
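To make this concrete, here is a minimal Python sketch of how such a scoring prompt could be assembled. The wording below paraphrases the reference-based direct-assessment (GEMBA-DA) template; the exact templates for all four prompt variants are in the paper, and the helper name build_gemba_prompt is purely illustrative.

```python
# A minimal sketch of a GEMBA-style scoring prompt. The wording paraphrases
# the paper's direct-assessment template; exact phrasing may differ.

def build_gemba_prompt(src_lang: str, tgt_lang: str, source: str,
                       translation: str, reference: str | None = None) -> str:
    """Assemble a direct-assessment prompt asking for a 0-100 quality score."""
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang} "
        "on a continuous scale from 0 to 100, where a score of zero means "
        '"no meaning preserved" and a score of one hundred means '
        '"perfect meaning and grammar".',
        f'{src_lang} source: "{source}"',
    ]
    if reference is not None:  # reference-based variant; omit for reference-free
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{translation}"')
    lines.append("Score:")
    return "\n".join(lines)


prompt = build_gemba_prompt(
    "English", "German",
    source="The quick brown fox jumps over the lazy dog.",
    translation="Der schnelle braune Fuchs springt über den faulen Hund.",
)
print(prompt)
```

The numeric score is then parsed from the model's completion; the paper reports how often each model failed to return a usable answer.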
Additionally, to find the best GPT model for implementing GEMBA, the authors tested
each of the prompt variants with seven models from the GPT family, ranging from
GPT-2 up to the latest GPT-4 model.
7 GPT models used for evaluating GEMBA. Image taken from original paper
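For readers who want to reproduce this kind of experiment, here is a minimal, hypothetical sketch of scoring one translation with a couple of current OpenAI models. The paper predates the current openai Python client, so this is an assumption about tooling rather than the authors' setup; the model identifiers and the score handling are also illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Prompt built as in the previous sketch (inlined here for self-containment).
prompt = (
    "Score the following translation from English to German on a continuous "
    'scale from 0 to 100, where a score of zero means "no meaning preserved" '
    'and a score of one hundred means "perfect meaning and grammar".\n'
    'English source: "The quick brown fox jumps over the lazy dog."\n'
    'German translation: "Der schnelle braune Fuchs springt über den faulen Hund."\n'
    "Score:"
)

# Hypothetical model identifiers; the paper's exact model set is in its table.
for model in ["gpt-3.5-turbo", "gpt-4"]:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding for reproducible scores
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content.strip()
    # Models sometimes wrap the number in extra text, so real code should
    # parse more defensively than float() alone.
    print(f"{model}: {raw}")
```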
Having set the stage for experimenting with the GEMBA metric, the next obvious
question is:
Q: How do we tell if GEMBA is performing better than conventional metrics such as BLEU
and COMET?
A: If GEMBA scores correspond closely with what a human thinks of the translations, we have
a winner!
To operationalize that answer, there are two metrics, Kendall's Tau and accuracy
(Kocmi et al., 2021), that need to be calculated, depending on whether we are
undertaking segment-level evaluation or system-level evaluation. But first, what are they?
System-level evaluation assesses the overall performance of a machine translation system as
a whole. It looks at the quality of translations generated by the system across a wide range of
texts or content.
Segment-level evaluation focuses on assessing the quality of translations on a per-segment
basis (typically a sentence or a smaller unit of text).
Generally speaking, accuracy is used for system-level evaluation, while Kendall's Tau is used for segment-level evaluation.
Let's take a deep dive into their formulas, using simple examples for clarity:
A. Kendall’s Tau
Reference (Human): “The quick brown fox jumps over the lazy dog.”
Translation A: “The fast brown fox jumps over the lazy dog.”
Translation B: “The quick red fox jumps over the lazy dog.”
Translation C: “The lazy dog is jumped over by the quick brown fox.”
Human Ranking: A > B > C (i.e., they prefer Translation A the most, then B, and lastly
C)
Metric scores for these translations:
LLM(A) = 0.85
LLM(B) = 0.75
LLM(C) = 0.60
Pair 1: (A, B)
Human Ranking: A > B
LLM Scores: LLM(A) = 0.85 > LLM(B) = 0.75
Result: Concordant pair (both human and metric prefer A over B).
Pair 2: (A, C)
Human Ranking: A > C
LLM Scores: LLM(A) = 0.85 > LLM(C) = 0.60
Result: Concordant pair (both human and metric prefer A over C).
Pair 3: (B, C)
Human Ranking: B > C
LLM Scores: LLM(B) = 0.75 > LLM(C) = 0.60
Result: Concordant pair (both human and metric prefer B over C).
Kendall's Tau is the number of concordant pairs minus the number of discordant pairs,
divided by the total number of pairs:

τ = (concordant − discordant) / (concordant + discordant) = (3 − 0) / (3 + 0) = 1

In other words, τ = 1 indicates perfect agreement between the metric and human
judgment, which is exactly what we want from a metric used to automate the
assessment of translation quality.
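To make the computation concrete, here is a minimal Python sketch that counts concordant and discordant pairs for the example above. It ignores ties, matching the simplified formula; scipy.stats.kendalltau gives the same result on this data.

```python
from itertools import combinations

# Worked example from above: human ranks (3 = best) and metric scores.
human = {"A": 3, "B": 2, "C": 1}   # human ranking: A > B > C
metric = {"A": 0.85, "B": 0.75, "C": 0.60}

concordant = discordant = 0
for x, y in combinations(human, 2):
    human_delta = human[x] - human[y]
    metric_delta = metric[x] - metric[y]
    if human_delta * metric_delta > 0:     # same sign: pair agrees
        concordant += 1
    elif human_delta * metric_delta < 0:   # opposite sign: pair disagrees
        discordant += 1
    # a product of zero would indicate a tie; ignored in this simple version

tau = (concordant - discordant) / (concordant + discordant)
print(tau)  # 1.0 -> perfect agreement, as computed by hand above
```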
B. Accuracy
Let's calculate the metric Δ (which is nothing but the difference in metric values for a pair
of translations) and the human Δ (the difference in the human scores for that pair).
Accuracy (Kocmi et al., 2021) is then the fraction of pairs for which the two Δs have the same sign:

Accuracy = |sign(metric Δ) = sign(human Δ)| / |all pairs|

If you look at the formula closely, you will notice we are not interested in
the actual value of each Δ but only in its sign. Put simply, a high accuracy can only be
achieved if the signs of both Δs are the same, i.e. the human and the metric are in alignment
about which translation in a pair is better.

Suppose BLEU assigns BLEU(A) = 0.80, BLEU(B) = 0.70, and BLEU(C) = 0.75 to the three
translations above, while the human ranking remains A > B > C:
Pair 1: (A, B)
MetricΔ = BLEU(A) − BLEU(B) = 0.80 − 0.70 = 0.10
HumanΔ = 1 (A is ranked higher than B)
Result: MetricΔ and HumanΔ have the same sign (both positive). This is a rank
agreement.
Pair 2: (A, C)
MetricΔ = BLEU(A) − BLEU(C) = 0.80 − 0.75 = 0.05
HumanΔ = 2 (A is ranked higher than C)
Result: MetricΔ and HumanΔ have the same sign (both positive). This is a rank
agreement.
Pair 3: (B, C)
MetricΔ = BLEU(B) − BLEU(C) = 0.70 − 0.75 = −0.05
HumanΔ = 1 (B is ranked higher than C)
Result: MetricΔ and HumanΔ have different signs (metric is negative, human is
positive). This is a rank disagreement.
With 2 rank agreements out of 3 pairs, the accuracy of BLEU on this toy example is 2/3 ≈ 0.67.
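Here is a matching Python sketch of this sign-agreement accuracy for the example above. The variable names are illustrative, and ties are ignored as in the simplified formula.

```python
from itertools import combinations

# Same translations as above: BLEU scores and human ranks (3 = best).
bleu = {"A": 0.80, "B": 0.70, "C": 0.75}
human = {"A": 3, "B": 2, "C": 1}

def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

pairs = list(combinations(bleu, 2))
agreements = sum(
    sign(bleu[x] - bleu[y]) == sign(human[x] - human[y]) for x, y in pairs
)
accuracy = agreements / len(pairs)
print(f"{agreements}/{len(pairs)} rank agreements -> accuracy = {accuracy:.2f}")
# 2/3 rank agreements -> accuracy = 0.67
```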
Note: The examples used to demonstrate the calculation of Kendall's Tau and Accuracy were
simplified for demonstration purposes. In reality, the formulas get more complicated if ties
need to be handled, i.e. if a human or metric gives the same ranking to two or more translations.
You can read more about these tie-aware variants in the paper and the references therein.
Results from segment-level evaluations (P.S. the first column Accuracy is the same one from the previous table).
Image taken from original paper
The results also stress the importance of choosing the right LLM for implementing
GEMBA. Among the seven GPT-family models tested, only the models from GPT-3.5 onward
showed promising results. GPT-4 stood out as the top performer, but models like
Davinci, ChatGPT Turbo, and GPT-3.5 also performed well.
GEMBA implementation with various models from the GPT family. Image taken from original paper
Limitations
There is a need to evaluate GEMBA on low-resource languages, since the
paper only considers English, Chinese, Russian, and German.
There could be potential data-leakage concerns, as it is uncertain whether the test
data was included in OpenAI's training data (the secret sauce hadn't been released by
OpenAI at the time of writing). Having said that, the likelihood is very low: GPT
models claim a knowledge cutoff of September 2021, and the MQM dataset
was released in December 2022.
Conclusion
Keeping in mind the ease of implementing GEMBA with a simple prompt, it definitely
stands out as a groundbreaking metric for translation quality assessment. Its alignment
with human judgment and adaptability across various LLMs make it a compelling
addition to the field of NLP and translation evaluation. As we continue to explore and
refine GEMBA (perhaps with few-shot prompting), it holds promise as a valuable tool
for ensuring high-quality translations in diverse language contexts.
[1] Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators
of translation quality. arXiv preprint arXiv:2302.14520.