
SDL LEARNING AND DEVELOPMENT
Post-Editing Certification
SDL iMT

SDL Proprietary and Confidential


Topics
• History of Machine Translation
• Why use Machine Translation
• Machine Translation Technologies
• Creating Statistical Machine Translation Output
• Training and Testing Process
• Understanding Post-Editing
• Effective Post-Editing
• Use of SDL BeGlobal Baselines in SDL Studio
• Sources of Reference
History of Machine Translation
Understanding Machine Translation

Machine Translation (MT) is automated translation that uses software to translate text from one natural language to another

MT is one of the oldest applications of Artificial Intelligence. In essence, it is an attempt to have a computer replicate tasks that humans are already very good at

MT is not replacing the need for human translation, but it is an effective tool when understood and used in the right context
4
MT History
• 1940s: War-time cryptography
• 1950s: Georgetown Experiment heralded investment in automated translation
• 1960s: ALPAC report pointed to lack of success > funding reduced dramatically
• 1970s/1980s: Attempts to commercialise MT based on larger number of transformation rules
• 1980s: Increased computing power opens up new opportunities
• 1990s: First statistical learning approaches (original patents by IBM) & surge in research
• 2000: World events (9/11) trigger large-scale translation needs; Google moves to SMT
• 2010: Early adopters of MT, Microsoft and Google charge for API
• Present Day: Statistical MT science flourishes with support from industry and defence projects

All the above underpinned by advances in other localisation technology developments – Translation Memories, workflow tools, terminology systems, CAT tools and SDL Trados
5
Brief History of MT in SDL
Timeline markers: 2000, 2004, 2009, 2010, 2011, 2012
• MT Group set up to use RBMT in a high quality translation process
• Acquisition of Rules Based Machine Translation (RBMT) engine
• Partnership with Language Weaver and training on how to customise SMT engines
• First post-editing projects using SMT go into production
• Due diligence and acquisition of LW
• Team is re-branded iMT (intelligent Machine Translation)
• Over 20 customers, more than 120 language pair customisations
• Dramatic increase in the number of words post-edited in SDL by trained post-editors

6
Definition of Post-Editing
• The “term used for the correction of
machine translation output by human
linguists/editors” (Veale and Way 1997)
• “Checking, proof-reading and revising
translations carried out by any kind of
translating automaton” (Gouadec 2007)
• “In basic terms, the task of the post-editor
is to edit, modify and/or correct pre-translated
text that has been processed by
a machine translation system from a
source language into (a) target
language(s).” (Allen 2003)
• This […] usually consists in automatically
translating the source file and providing the
result to the translator as a first version of
the text to be post-edited. This practice is
usually referred as MT post-editing and is
considerably promising in enhancing the
translation process both in terms of time
and quality (Vieira, L.; Specia, L. 2011)
• Following guidelines, the translators
correct the output from translation
memories and machine translation to
produce different levels of quality.
Gradually this activity, post-editing, is
becoming a more frequent activity in
localisation, as opposed to the full
translation of new texts. (Guerberof, Ana.
2009)
Why Use Machine Translation
Today’s Reality

• Multi-directional communication
• Proliferation of channels and
devices
• Channel relevance critical
• The ‘empowered’ consumer
• Cyclical and multi-directional
engagement process

9
Need for a MT Solution

10
Language growth on internet 2000 – 2010

Arabic +20x

Chinese +20x

Portuguese +9x

Spanish +7x

French +6x

EN +3x

Source: TAUS: The New Lingua Franca

11
Why the Need for Machine Translation?

12
Machine Translation Technologies
Understanding the Challenges of MT

One word can have many meanings:
• I’ll get a cup of coffee
• I didn’t get that joke
• I get up at 8am
• I get nervous
• Yeah, I get around 100 e-mails a day

Word order differs according to language

One word can have an infinity of contexts

14
Challenges of MT
Same sentence can be translated completely differently depending on context.

• Context 1: Remove the windows. Clean the lining. Apply the glue.
• Context 2: Uninstall Windows. Reformat the PC.
Both contexts end with: Windows can be reinstalled now.

• Context 1: The vehicle battery is empty.
• Context 2: The TV remote battery is dead.
Both contexts end with: Replace the battery.

15
Different MT technologies

RBMT: Rules-Based Machine Translation: the engine consists of a set of rules, each written by a linguist

SMT: Statistical Machine Translation: on the basis of a large set of examples, the engine learns translation rules for itself. No human intervention

Hybrid engine: any mixture of SMT and RBMT technology

16
Rules-Based Machine Translation - RBMT

Encode translation rules manually:
• “building” → “edificio”
• “the” → “el” or “la”
• “edificio” is a masculine noun
• “el” is a masculine determiner
• “la” is a feminine determiner
• Spanish determiners and nouns must agree in gender

Employ those rules automatically:
• “the building” → “el edificio”
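The rules above can be sketched in a few lines of Python. This is an illustrative toy, not how a production RBMT engine is built: the dictionaries `NOUNS` and `DETERMINERS` and the function `translate_noun_phrase` are hypothetical names, and the “house” → “casa” entry is an added example that is not on the slide.

```python
# Toy rule-based translation of an English "the + noun" phrase into Spanish.
# The lexicon and rule names are illustrative assumptions, not SDL's rules.

# Dictionary rule: English noun -> (Spanish noun, grammatical gender)
NOUNS = {
    "building": ("edificio", "m"),
    "house": ("casa", "f"),  # extra entry, not taken from the slide
}

# Determiner rule: "the" -> "el" (masculine) or "la" (feminine)
DETERMINERS = {"the": {"m": "el", "f": "la"}}


def translate_noun_phrase(phrase: str) -> str:
    """Translate a two-word 'determiner noun' phrase with gender agreement."""
    determiner, noun = phrase.lower().split()
    target_noun, gender = NOUNS[noun]
    # Agreement rule: the determiner takes the gender of the noun.
    target_determiner = DETERMINERS[determiner][gender]
    return f"{target_determiner} {target_noun}"


print(translate_noun_phrase("the building"))  # el edificio
```

Every new word, gender or exception needs another hand-written entry or rule, which is where the rule count starts to grow.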

17
Challenges of RBMT

• Words are ambiguous
– “bank” has many meanings
• Grammar is ambiguous
– I saw a man with a telescope
• Rule X should only apply in context Y … requires more rules to govern circumstances…
• Too many rules (millions)
• Rules conflict with each other and vary by language
18
Statistical Machine Translation (SMT)

SMT requires less development for a new language pair than rules-based machine translation technology (RBMT)

The system “learns” how to translate by analysing a corpus of material

A new baseline is built from a very large corpus

Compared to RBMT, raw SMT output is more fluent, and customised SMT engines better reflect the style of the training material

Customisation optimises the MT engine for specific types of content
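To make “learning from a corpus” concrete, here is a minimal sketch that estimates word translations from aligned sentence pairs. It uses IBM Model 1 expectation-maximisation (without a NULL word) as a stand-in; the slides do not say which models SDL's engines use, the four-sentence corpus is invented, and all function names are hypothetical.

```python
from collections import defaultdict

# Invented aligned corpus (English -> Spanish), for illustration only.
CORPUS = [
    ("the house", "la casa"),
    ("the building", "el edificio"),
    ("a house", "una casa"),
    ("a building", "un edificio"),
]


def train_word_translation_model(sentence_pairs, iterations=20):
    """Estimate t(target_word | source_word) with IBM Model 1 EM (no NULL word)."""
    pairs = [(src.split(), tgt.split()) for src, tgt in sentence_pairs]
    # Uniform starting point: every co-occurring word pair is equally likely.
    t_prob = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        for src_words, tgt_words in pairs:
            for t in tgt_words:
                # E-step: share one unit of evidence for t among the source words.
                norm = sum(t_prob[(t, s)] for s in src_words)
                for s in src_words:
                    share = t_prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # M-step: renormalise the expected counts into probabilities.
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]
    return t_prob


def most_likely_translation(t_prob, source_word):
    """Return the target word with the highest learned probability, or None."""
    candidates = {t: p for (t, s), p in t_prob.items() if s == source_word}
    return max(candidates, key=candidates.get) if candidates else None


model = train_word_translation_model(CORPUS)
print(most_likely_translation(model, "house"))     # casa
print(most_likely_translation(model, "building"))  # edificio
```

No rule says that “house” means “casa”; the pairing emerges because the two words keep appearing in the same sentence pairs. A word the corpus never contained has no candidates at all, which is one reason corpus size and domain matter.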

19
Statistical Machine Translation
Better known as Data-Driven Machine Translation

20
Statistical versus Rules-based

Statistical
• Starting point for a new language direction is an aligned corpus of 200 to 300 million words (uses existing translation databases to build up language pairs)
• Language pairs will differ in quality depending on the quality and extent of the databases
• System learns how to translate by analysing statistical relationships between large volumes of aligned source and target data
• New language directions can be created quickly
• Client customisations possible (recommendation is an aligned corpus of 1 to 5 million words of relevant customer data)
• Source content should be of acceptable quality
• More fluent output and some context-sensitivity
• Little control on terminology apart from TM

Rules-based
• Each language pair is built by looking at construction of both source and target language (taking into account source and target grammar and vocabulary)
• Rules and information encoded in the dictionaries are used to analyse the source and generate the translation
• It takes a number of man years to develop a new language pair
• Source content needs to be well written to generate good output
• End result will not flow like statistical output, no context-sensitivity
• Easy to set terminology per client (terminology for the dictionaries is translated to ensure consistency)

21
Creating Statistical Machine Translation Output
How do you get high quality from MT?

Out of the box, no MT engine is capable of producing high quality publishable translation
• Especially true for terminology-rich technical documentation
• Terminology is domain specific and often company specific

What is SDL iMT’s process to get the required results?
• Customise the MT system for the project
• Post-edit the MT output to publishable quality
• For the overall solution, integrate the MT with a translation environment

23
SMT Engine Training

Why engine training is important
• The MT engine is trained so that the terminology and style of its output are appropriate for the content, which makes the output easier to post-edit

How SMT engines are trained
• Parallel corpus of domain specific content, e.g. translation memory (TM)
• TM is cleaned and prepared, iterative process with parameter tuning

Types of trained engine
• Baseline for broad spectrum content
• Verticals for specific domains e.g. Automotive, Travel, IT
• Customised for customer specific application

24
SDL iMT Baselines

Core MT engines for each language pair

Contain hundreds of millions of words of bilingual data

Data mined from different sources in the public domain, covering various subjects

Customisations and verticals use the baseline engines as backup

Baselines improve constantly as more high-quality data is added
25
SDL iMT Verticals

Trained statistical engine exclusive for a domain

Solution used when client-specific data not available

MT output more likely to follow technical terminology

Data selected from principal sources within a domain or industry

26
SDL iMT Customised Engines

Training based on client-specific bilingual data

More data usually has a positive effect on the quality of the MT output

Advantages of customised engines: adherence to client-specific terminology and style

27
Training and Testing Process
SDL iMT System – Process Summary

[Process diagram]
Stages: Content Evaluation, MT Customisation, Production, QA
Steps shown:
• Source Content
• Prep of testing material
• Training data prep and engine customisation
• Evaluate MT output
• Refine training or deploy for production
• Integrate MT on Translation process
• Apply Translation Memory
• Machine Translation
• Post-Edit
• Update Translation Memory
• Quality Assessment
• Translation Delivery
Resources shown: SDL MT Server, Translation Memory
29
How to measure the quality of an engine?

Human evaluation
• Relatively expensive
• Time consuming
• Prone to subjectivity
• Difficult to measure productivity

Automatic evaluation
• Many automatic metrics (BLEU, NIST, TER, METEOR, Levenshtein)
• Most assess MT quality compared to a reference translation

• New developments focus on the collection of industry benchmark data to find best evaluation practices (TAUS - Dynamic Quality Framework Tools.)

This remains a major challenge within the machine translation industry
30
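As an illustration of a reference-based metric, the sketch below computes a word-level Levenshtein distance between MT output and a reference translation, then divides it by the reference length. This is a simplified stand-in and not TER itself (TER also counts block shifts as single edits); the function names and example sentences are assumptions made for the example.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions and substitutions."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] = distance between the hypothesis words seen so far and ref[:j]
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,                      # delete hyp_word
                current[j - 1] + 1,                   # insert ref_word
                previous[j - 1] + substitution_cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def edit_rate(hypothesis: str, reference: str) -> float:
    """Edits per reference word; lower means closer to the reference."""
    reference_length = len(reference.split())
    if reference_length == 0:
        return 0.0
    return word_edit_distance(hypothesis, reference) / reference_length


mt_output = "the battery vehicle is empty"
reference = "the vehicle battery is flat"
print(word_edit_distance(mt_output, reference))  # 3
print(edit_rate(mt_output, reference))           # 0.6
```

The same calculation can be run against the post-edited version of a segment instead of an independent reference, which gives a rough measure of how much the post-editor had to change.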
Continuous Improvement

• SDL MT developers are constantly researching ways to improve Generic, Vertical, and Customised MT Engines
• SDL Data Engineers are continuously mining large amounts of good data used by the statistical algorithms
• SDL Research Scientists are continuously improving the Statistical Machine Translation algorithms: Language Models, Translation Models, Reordering Models, Syntax, Transliteration, Rule-Based Components etc.
• New Language Pairs are created according to market demand

31
Improvement through Feedback

A large part of our work relates to receiving, analysing and responding to specific linguistic feedback from the post-editors who work on MT projects

Post-Editors > Feedback > SDL iMT > Better MT

This allows us to not only fix technical issues but also gather experience and improve MT in the long run
32
What Type of Feedback Helps

Feedback from post-editors is a crucial factor in the improvement of MT engines

Improvements in statistical MT are not as immediate as in rules-based MT

All constructive feedback is useful:
• Quality for a specific job is below expectations
• Non-standard statistical MT behaviour
• Provide examples and as much detail as possible
33
Understanding Post-Editing
Post-Editing as an Opportunity for Translators

• For many translators post-editing is a new skill, therefore familiarisation is important
• There is a learning curve with post-editing, it gets faster and easier with practice
• Industry research has shown that experience is the single most important factor in translation productivity and becomes even more influential in post-editing
• Technology changes are not a threat to the role of the translator, but can help make translators more efficient and competitive in the market
• Post-editing requires an open mind and flexible approach as well as a willingness to see MT as another tool in the translator’s kitbag
• While average conventional translation output is 2000-2500 words/day, with post-editing productivity can grow to 6000 words/day or more
• Post-editing brings MT and human skills together
• MT will not replace human translation!

35
The translation landscape is changing

Source: Common Sense Advisory

36
Post-editing Integration into Production Environment

• Apply Translation Memory
• Apply Machine Translation (System for Machine Translation)
• Post-Edit and Edit Fuzzy Matches (Translation Memory)
• Review and Translation Delivery
• Update Translation Memory
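The first steps of this workflow, applying the translation memory and then sending the remaining segments to MT, can be sketched as follows. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for a CAT tool's fuzzy-match scoring, the 0.75 threshold and the "TM exact" and "TM fuzzy" labels are invented, `fake_mt` is a placeholder for a real MT call, and only the "AT" label for raw machine translation comes from the Studio slide in this deck.

```python
from difflib import SequenceMatcher


def best_tm_match(source, translation_memory):
    """Return (similarity, target) for the closest TM entry, or (0.0, None)."""
    best_score, best_target = 0.0, None
    for tm_source, tm_target in translation_memory.items():
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    return best_score, best_target


def pretranslate(segments, translation_memory, machine_translate, fuzzy_threshold=0.75):
    """Apply the TM first; send segments without a good enough match to MT."""
    results = []
    for source in segments:
        score, target = best_tm_match(source, translation_memory)
        if target is not None and score >= fuzzy_threshold:
            origin = "TM exact" if score == 1.0 else "TM fuzzy"
        else:
            # No usable TM match: fall back to raw machine translation.
            target, origin = machine_translate(source), "AT"
        results.append({"source": source, "target": target, "origin": origin})
    return results


def fake_mt(text):
    """Placeholder for a real MT engine call (hypothetical)."""
    return f"[MT] {text}"


tm = {"Replace the battery.": "Reemplace la batería."}
segments = ["Replace the battery.", "Replace the batteries.", "Uninstall Windows"]
for row in pretranslate(segments, tm, fake_mt):
    print(row["origin"], "|", row["target"])
```

Fuzzy matches and "AT" segments both still need a human pass: the former are edited against the TM suggestion, the latter are post-edited.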

37
Effective Post-Editing
How to be a Good Post-Editor

What makes a good post-editor?
• Excellent linguistic skills
• Knowledge of domain and subject matter
• Proficiency with CAT tools and automated text-checking functions
• Knowledge of expected MT behaviour
• Positive attitude to MT
• Post-editing practice to achieve proficiency!

39
Post-Editing vs. Review

• Post-editing replaces the translation stage
• Post-editing is not just a light review
• Correcting machine translated output is very different from reviewing human translations
• Knowledge of typical SMT behaviour is key to post-editing effectively
• Review stage, if required for a project, follows post-editing stage
• Reviewing post-edited output might differ, but the purpose of review does not change

40
Degrees of Post-Editing

Post-editing to publishable quality
• Most frequent form of post-editing
• Generally used for higher visibility texts

Post-editing to understandable quality, or light post-editing
• Less frequent form of post-editing
• Generally used for lower visibility texts

41
Post-Editing to Understandable Quality

Understandable quality standards
• Lower quality expectations
• Focus is on meaning not on style and grammar
• Expectations based on client requirements
• Clear requirements needed

Typical purposes of understandable quality texts
• Offering users a quick answer on how to fix an issue
• Providing a translation solution for lower visibility content (FAQs, Blogs, Knowledge Bases)

42
Typical Guidelines in Post-Editing to Understandable Quality

• Make sure that all information is transferred
• Check that numbers are correct
• Use correct terminology
• Make necessary changes if meaning is wrong or not complete
• Do not change the style if inconsistent
• Do not change the structure if meaning is clear
• Do not correct grammar mistakes if meaning is clear
• Always remember that client guidelines take precedence

43
Post-Editing and Conventional Translation

The following basic principles still apply as with conventional translation:


• The translation must be a correct and true reflection of the source
• Grammar, spelling and punctuation must be correct according to the rules of the
target language
• The correct terminology must be used consistently
• Style must be appropriate for the document
• Cultural references (date and time formats, units of measurement, number formats,
currency information, etc.) must be adapted to the target language
• The original formatting must be reproduced
• The translation must read well and be suitable for its intended purpose
• The same references must be used as for conventional translation (project-specific
Guidelines, TMs, Glossaries, Termbases, etc.)
• Post-Editors need to follow standard project procedures (Q&A, etc.)
44
In Order to Post-Edit Effectively...

• Make use of MT output as much as possible
• Do not over-edit or under-edit the machine translation output
• Find parts of the MT output which can be used to help speed up work
• Always re-read your translation after editing

45
Post-editing Process

Ready
• Read source text, then MT output
• Determine usable elements

Set
• Build round MT output
• Don’t under or over edit
• Focus on accuracy

Go
• Ensure all elements present
• Correct grammar
• Check terminology

46
Patterns to watch out for in SMT

• Word order in target
• Negation issues
• Additional or missing words
• Compound formation and hyphenation
• Proper nouns translated
• Context-dependent terminology
• Capitalisation
• Wrong preposition, gender, agreement or verb inflection
• Antonyms

47
Quality Expectations in Post-Editing
• Is post-edited MT capable of producing high quality documents for publication?
– Yes
• Would a client reviewer notice a difference in style between conventional and post-edited content?
– Possibly
• Would the quality of the post-edited content be any less acceptable?
– Same acceptable quality as conventional translation
• Is the question of consistency exclusive to post-editing?
– Conventional translation may present consistency issues too
• Conclusion: there could be end product differences of a stylistic nature between data translated conventionally and post-edited data.

48
Use of SDL BeGlobal Baselines in SDL Studio
SMT at your fingertips

As translators, we need to make use of all available tools

Every Studio user is a potential post-editor
Example of Studio project with MT applied

Segments marked as AT are raw machine translation to be post-edited

51
Sources of Reference
REFERENCES
• Slide 10
– MT: The New Lingua Franca, TAUS https://www.taus.net/articles/mt-the-new-lingua-franca
• Slide 36
– Trends in Translation Pricing, Copyright © 2012 by Common Sense Advisory, Inc., September 2012

53
Copyright © 2008-2013 SDL plc. All rights reserved. All company names, brand names, trademarks, service marks, images and logos are the property of their respective owners.

This presentation and its content are SDL confidential unless otherwise specified, and may not be copied, used or distributed except as authorised by SDL.
