
Natural Language Processing
Pushpak Bhattacharyya
Aditya Joshi

Chapter 7
Machine Translation

Copyright © 2023 by Wiley India Pvt. Ltd.


Chapter 7 Machine Translation
• 7.1 Introduction
• 7.2 Rule-Based Machine Translation
• 7.3 Indian Language Statistical Machine Translation
• 7.4 Phrase-Based Statistical Machine Translation
• 7.5 Factor-Based Statistical Machine Translation
• 7.6 Cooperative NLP: Pivot-Based Machine Translation
• 7.7 Neural Machine Translation



Learning Objectives

• Explain paradigms of Machine Translation (MT)

• Understand the challenges of MT for resource-scarce languages

• Appreciate the mathematics behind MT

• Describe encoder–decoder models for MT



Introduction
“Translation is the process of converting text from one language to another, retaining the meaning in the source text and ensuring grammaticality, idiomaticity, and register-conformity in the target text”

• Thousands of languages are spoken around the world in various countries

• Translation has served as the vehicle for making ideas expressed in one language
accessible in other languages

• Machine translation (MT) refers to the process of translation by a machine (i.e., a computer)



Ambiguity Resolution in Machine Translation
• In order to appreciate the complexity of MT, we revisit the NLP stack and see how MT engages with this stack

Fig 7.1 The NLP stack



• The NLP stack is important for MT

• Every layer in the NLP stack sends signals into the translation process

• This ensures increasingly accurate and high-quality translation

• The NLP stack helps reduce the amount of data needed for training an MT system, compared with training on raw data alone

• A paradigm of MT called statistical machine translation (SMT) trains an MT model with parallel sentences

• Machine learning (ML) based MT, meaning SMT and neural machine translation (NMT), relegates the responsibility of ambiguity resolution to data and ML



RBMT-EBMT-SMT-NMT

• Knowledge-based machine translation (KBMT) is also called rule-based machine translation (RBMT)

• RBMT is of two kinds—interlingua based and transfer based

• Data-driven MT paradigms include example-based machine translation (EBMT), SMT, and NMT

• All these MT paradigms have an ‘A’ word as the essence of the paradigm



Fig 7.2 Paradigms of machine translation



• The four ‘A’ words—Analysis, Alignment, Analogy, and Attention—are each crucial to one of the four MT paradigms:

i. Analysis in RBMT

ii. Alignment in SMT

iii. Analogy in EBMT

iv. Attention in NMT



Today’s Ruling Paradigm: Neural Machine Translation

• Neural machine translation (NMT) performs translation using neural networks

• Like its predecessor paradigms, NMT mirrors the Vauquois triangle

• The famed ‘Encoder-Decoder’ architecture, as well as the more modern ‘Transformer’ architecture, implements the A-T-G pipeline through layers of neurons

• NMT is extremely data-intensive, owing to its requirement of fixing millions and sometimes billions of weight values



Fig 7.3 Variation of accuracy of machine translation with corpus size
Note how the SMT line (grey) is no match for the NMT line (black) as the corpus size increases



Ambiguity in Machine Translation: Language Divergence

“Languages have different ways of expressing meanings, the so-called phenomenon of language divergence”

• One of the ideals of MT has always been the extraction of meaning, completely and correctly, from the source text

• Then comes the production of the target language text from the extracted meaning

• Meaning extraction is an exercise in disambiguation at every layer of the NLP stack: morphology, POS tagging, chunking, parsing, and semantics



• When the same meaning is expressed by two different languages, two kinds of
divergence arise:

i. Lexico-semantic divergence:

 It is essentially vocabulary difference (i.e., the difference of words and phrases)

ii. Structural divergence:

 Here, languages differ in the manner in which they arrange words and phrases in a sentence



Fig 7.4 Language divergence illustrated with the English sentence, ‘This blanket is very soft’



Vauquois Triangle

• Language divergence phenomena have been unified in a famous framework called the Vauquois triangle

• The top of the triangle represents the completely disambiguated meaning of the source sentence

• On the way down from the top, we begin to generate the target language sentence

• We descend the right side of the triangle through different stages of natural language generation (NLG)

• The broad stages of NLG are root word determination, target root substitution, and morphology generation on target roots



Fig 7.5 Vauquois triangle illustrating the analysis-transfer-generation (A-T-G) process
The left side of the triangle is effectively the NLP stack. Complete disambiguation lands the source sentence representation at the top of the triangle
The bottom of the triangle is the opposite extreme, where no analysis is needed as in the case of very close languages such as Spanish-Catalan



• Vauquois triangle in its original form is very elaborate with many sub-activities of
intricate nature

• A simplified Vauquois triangle is depicted next, with source and target languages at the bottom of the pyramid

• A transfer happens at some point between the top and bottom of the triangle

• The left side of the triangle is the analysis side and the right side is the generation side

• At the top of the triangle is the interlingua-based MT

• Any transition into the generation side below the top gives rise to transfer-based machine translation (TBMT)



Fig 7.6 Abridged Vauquois triangle



Rule-Based Machine Translation

• In RBMT, all rules—whether for analysis, transfer, or generation—are written by human experts

• So, the responsibility of correctly and completely capturing language and translation phenomena, and formulating rules therefrom, lies with a human system designer

• The pipeline shown next is the typical architecture for Indian language to Indian language machine translation (ILILMT)

• It was executed as a consortium activity in the period 2000-2006, funded by India’s Ministry of Electronics and Information Technology (MeitY)



Fig 7.7 RBMT pipeline, illustrating analysis-transfer-generation (A-T-G)



• Like computing for all languages of the world, Indian language computing has its challenges

• These challenges are:

i. Scale and diversity: There are 22 scheduled languages in India, written in 13 different
scripts, with over 720 dialects

ii. Code mixing: Owing to India’s multilingual culture, people in India routinely and
seamlessly use at least two languages in their day-to-day communication resulting in
code-mixing

iii. Absence of basic NLP tools and resources: Most Indian languages lack these tools and resources, so MT may have to rely on low-quality tools or manage without them



iv. Absence of linguistic documentation and treatise for many languages: For many
languages of India, no linguistic tradition exists

v. Script complexity and non-standard input mechanism: The QWERTY keyboard for
Roman scripts is non-optimal for Indian languages

vi. Non-standard transliteration: Due to the ubiquity of English language keyboards, there may be non-standard transliterations representing the same Indian language word

vii. Non-standard storage: Many organizations in India have their proprietary fonts that
do not follow the Unicode format

viii. Challenging language phenomena: Compound verbs in Indian languages are one
such phenomenon



Indian Language Statistical Machine Translation

• In 2014, MeitY funded the creation of parallel corpora in many Indian languages

• This project is called Indian Language Corpora Initiative (ILCI)

• About 100,000 parallel sentences were created for languages from the Indo-Aryan
and Dravidian families

• Leveraging the created parallel corpora, SMT systems were built for pairs of Indian languages, as well as between English and Indian languages



• One such comprehensive work was SMT systems for 110 pairs of languages

• The BLEU scores, which measure translation performance to and from different pairs of languages, are shown in Fig 7.8 (a sketch of how BLEU is computed appears below)

• In general, scores are high within the Indo-Aryan family

• The Dravidian family, which is characterized by heavy agglutination, shows a much lower range of BLEU scores

• Translation involving Dravidian languages requires looking inside the words and mapping morphemes to obtain proper translation
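To make the metric concrete, here is a minimal sketch of sentence-level BLEU using NLTK's implementation; the tokenized Hindi reference and hypothesis are invented for illustration, not outputs of the systems reported above.

```python
# Minimal BLEU illustration; the sentences are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['mumbai', 'ke', 'log', 'acche', 'hain']]  # gold translation(s)
hypothesis = ['mumbai', 'ke', 'log', 'acche', 'the']    # system output

# Smoothing avoids zero scores when some higher-order n-grams do not match.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f'BLEU = {score:.3f}')
```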



Fig 7.8 BLEU score values within Indo-Aryan and Dravidian families and across families



Mitigating the Resource Problem

There are only a handful of methods for mitigating the resource problem in MT:

i. Subwords:

• Subword-based MT involves breaking the word into its parts, making use of
characters, syllables, orthographic syllables, and byte pair encodings (BPE)

ii. Cooperative NLP:

• This aims to take help from another language, which can happen in two ways:

a) The first way is to use a pivot language


b) The second way of cooperative NLP is to use transfer learning



iii. The third way of resource-scarcity mitigation is to use higher-level language properties such as POS and sense ID

• This provides additional clues for disambiguation



Methods of Subwording:

• Subwording of, for example, the Hindi word ‘jaauMgaa’ (‘I will go’) may be performed in terms of:

a) Characters: ‘j’ + ‘aa’ + ‘u’ + ‘M’ + ‘g’ + ‘aa’
b) Morphemes: ‘jaa’ + ‘uMgaa’
c) Syllables: ‘jaa’ + ‘uM’ + ‘gaa’
d) Orthographic syllables: strings ending in vowels: ‘jaau’ + ‘Mgaa’
e) BPE: depends on corpora and statistically frequent patterns; on that count, both ‘jaa’ and ‘uMgaa’ are likely (see the merge-learning sketch below)
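As a concrete illustration of option (e), below is a minimal sketch of BPE merge learning on a toy corpus; the word list, frequencies, and number of merges are assumed for illustration, not taken from the book.

```python
# A toy sketch of BPE merge learning: repeatedly merge the most frequent
# adjacent symbol pair. Words are stored as space-separated symbols.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Concatenate the chosen pair wherever it occurs (a simple string
    replace, adequate for this toy vocabulary)."""
    merged, replacement = ' '.join(pair), ''.join(pair)
    return {w.replace(merged, replacement): f for w, f in vocab.items()}

vocab = {'j aa uM gaa': 5, 'kh aa uM gaa': 4, 'j aa': 3}
for step in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best, '->', vocab)
```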



Actual Evidence of Benefits of Subwording
Here, we show the quantitative evidence of benefits of subwording in this figure:

Fig 7.9 BLEU scores for word/token-based machine translation



• Figure below shows morpheme-based SMT BLEU scores:

Fig 7.10a Morpheme-based SMT and corresponding BLEU scores



• There is consistent BLEU score improvement as shown in the figure below:

Fig 7.10b Per cent improvement over word level scores



• BPE-based SMT showed a still larger improvement in scores:

Fig 7.11a BPE-based SMT and corresponding BLEU scores



• Punjabi-Tamil BLEU score improvement is 28.26% as shown here:

Fig 7.11b Per cent improvement over word level scores



Phrase-Based Statistical Machine Translation
(PBSMT)
Need for Phrase Alignment

When translating a sentence from one language to another, a simple approach may be:

• To translate the sentence word by word, accompanied by morphological and syntactic adjustments

• Such an approach will require a dictionary that maps words of the source
language to those of the target language



• However, there are many compelling reasons why translation should be based on units of text longer than words

• Instead of words, if we allow word groups to align, the modelling becomes much simpler

• Note that the process of creating phrase alignments is essentially one of merging neighbours

• In a tabular representation of alignments, this amounts to growing strings of words by expanding along diagonals and aligning these strings



Example:

Alignments are marked with ‘X’. For English-Hindi, the alignment sets are

A1 (English→Hindi): {<Mumbai, mumbai>, <of, ke>, <people, log>}
A2 (Hindi→English): {<mumbai, Mumbai>, <ke, of>, <log, people>}

Now the grow-diag process will create the phrase alignments

‘people of’ ↔ ‘ke log’ (black square) and ‘of Mumbai’ ↔ ‘mumbai ke’ (light grey square)

‘people of Mumbai’ ↔ ‘mumbai ke log’ (dark grey square)

• This is visually explained in Fig 7.12; a sketch of the corresponding phrase-pair extraction follows the figure



Fig 7.12 Creation of phrase alignments from word alignments
through grow-diag algorithm
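To complement the figure, here is a simplified sketch of the consistent phrase-pair extraction that such alignments license; it is a toy version under assumed 0-based indices, not the full grow-diag-final algorithm of standard SMT toolkits.

```python
# Extract phrase pairs consistent with the word alignment: a source span and
# the target span it links to form a pair only if no link leaves the box.
def extract_phrases(alignment, src_len, max_len=3):
    phrases = set()
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            tgt = [j for (i, j) in alignment if s1 <= i <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            if all(s1 <= i <= s2 for (i, j) in alignment if t1 <= j <= t2):
                phrases.add(((s1, s2), (t1, t2)))
    return phrases

eng = ['people', 'of', 'Mumbai']
hin = ['mumbai', 'ke', 'log']
links = {(0, 2), (1, 1), (2, 0)}   # people-log, of-ke, Mumbai-mumbai
for (s1, s2), (t1, t2) in sorted(extract_phrases(links, len(eng))):
    print(' '.join(eng[s1:s2 + 1]), '<->', ' '.join(hin[t1:t2 + 1]))
```

Running this prints exactly the single-word links plus the grown pairs listed above, including ‘people of’ ↔ ‘ke log’ and ‘people of Mumbai’ ↔ ‘mumbai ke log’.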



Case of Promotional/Demotional Divergence

• Promotional and demotional divergences are particular cases of language


divergence

• Consider the translation pair:

‘The play is on’ → ‘khel chal rahaa hai’; gloss: play continue <progressive auxiliary> <auxiliary>

• The translation of ‘on’ is ‘chal rahaa hai’



• It is apparent that setting up a correspondence between ‘rahaa’ and ‘on’ is artificial and non-intuitive

• A portion of the probability mass for the mapping of ‘on’ is spent on ‘rahaa’, thus depriving more deserving candidates like ‘par’ and ‘upar’

• This may lead to the strange translation of ‘The book is on the table’ as:

‘mej ke rahaa kitaab hai’ instead of the correct ‘mej ke upar kitaab hai’!



Case of Multiword (Includes Idioms)

• Non-compositional multiwords are not amenable to word-based alignment, unless the source and target languages are extremely close linguistically and culturally

• Example:

• Not a single word in the Bengali sentence above has an equivalent translation in
the parallel English sentence



• Another Example:

• In this case, there is almost one-to-one correspondence between Hindi and Bengali



Phrases Are Not Necessarily Linguistic Phrases

• Phrases in PBSMT are not necessarily linguistic phrases but are sequences of words

• Some of these word sequences can be linguistic phrases, but that is not necessarily so

• It is possible to have aligned phrases that are non-equivalent in meaning

• Even when the two languages are close to each other, the phrases aligned can be non-linguistic



Use of the Phrase Table

• We work with the aligned phrases in the phrase table along with their probability values

• The probability value of a phrase translation indicates how good a translation pair is, as formed by the phrase and its translation

• When a new sentence needs to be translated, we match parts of the input sentence against the phrase table, pick up the translations, combine them, and finally score the resulting sentences


• The scoring uses phrase translation probabilities and language model probabilities

• Everything starts with finding and matching parts of the input sentence in the phrase table

• The size of the phrase table is, thus, an important factor in the translation process (a toy decoding sketch follows)
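Here is a toy sketch of that lookup-combine-score loop; the phrase table entries, language model scores, and the single reordering choice are all invented for illustration.

```python
# Toy phrase-based decoding: match spans, combine translations, score with
# translation-model (phi) and language-model probabilities. Numbers invented.
phrase_table = {                                 # Hindi phrase -> options
    'mumbai ke': [('of Mumbai', 0.6)],
    'log': [('people', 0.7), ('logs', 0.1)],
}
lm = {'people of Mumbai': 0.02, 'logs of Mumbai': 0.001}  # toy LM probabilities

src = 'mumbai ke log'
best, best_score = None, 0.0
for eng2, phi2 in phrase_table['log']:
    eng1, phi1 = phrase_table['mumbai ke'][0]
    candidate = f'{eng2} {eng1}'                 # reorder the two phrases
    score = phi1 * phi2 * lm.get(candidate, 1e-9)
    if score > best_score:
        best, best_score = candidate, score
print(f'{src!r} -> {best!r} (score {best_score})')
```

The fluent candidate ‘people of Mumbai’ wins because the language model strongly prefers it, even though both phrase choices are in the table.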



Mathematics of Phrase-Based Statistical Machine Translation

• The basic equation of SMT is:

$\hat{e} = \operatorname*{argmax}_{e} P(e \mid f) = \operatorname*{argmax}_{e} P(f \mid e)\, P_{LM}(e)$

• Here, e and f have their usual meaning of output and input, respectively

• The translation with the highest score is $\hat{e}$

• $P(f \mid e)$ and $P_{LM}(e)$ are the translation model and language model, respectively



• The translation probability $P(f \mid e)$ is modelled over phrase pairs as:

$P(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)$

• $\phi(\bar{f}_i \mid \bar{e}_i)$ is called the phrase translation probability and $d(\cdot)$ is the distortion probability
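A toy numeric evaluation of this product, with invented phrase translation probabilities and an assumed exponential distortion model $d(x) = \alpha^{|x|}$:

```python
# Evaluate P(f|e) = prod_i phi_i * d(start_i - end_{i-1} - 1) on made-up numbers.
ALPHA = 0.5                               # assumed distortion penalty base

# (phi, start_i, end_i) for each phrase pair, in target order
phrases = [(0.5, 0, 1),    # covers source words 0-1, no jump
           (0.4, 4, 5),    # jumps ahead: distortion cost applies
           (0.3, 2, 3)]    # jumps back

p, prev_end = 1.0, -1
for phi, start, end in phrases:
    p *= phi * ALPHA ** abs(start - prev_end - 1)
    prev_end = end
print(p)   # 0.5*1 * 0.4*0.25 * 0.3*0.0625 = 0.0009375
```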



Factor-Based Statistical Machine Translation
• Consider translating the English sentence ‘I ate a mango’; in Hindi, the target sentence is ‘mei_ne aam khaa_yaa’

• Where does ‘ne’ come from?

• Hindi grammar rules say that the agent should get the ergative marker ‘ne’ if the
verb is transitive (‘sakarmak kriya’) and in the past tense

• This rule holds even for ellipsis wherein lexemes are implicit

• If the word ‘aam’ is dropped, the translation will still be ‘mei_ne khaa_yaa’



Fig 7.14 Translation of an English sentence to Hindi using a factor-based SMT paradigm



Fig 7.15 Mapping of factors in factor-based SMT. Suffix + semantic relation gives
rise to case-marking suffix or case-marking post-position, lemma maps to
lemma, and finally a word in correct form appears in the target language



Cooperative NLP: Pivot-Based Machine Translation
• An intermediate language is introduced to supply missing data when the parallel
corpus is in short supply

• The intermediate language is called the bridge language

• The theory of translation through a pivot language is based on the concept of marginalization in probability theory

• The equation below shows this:

$\hat{e} = \operatorname*{argmax}_{e} p(e \mid f) = \operatorname*{argmax}_{e} p(f \mid e)\, p(e)$



• Here, $p(f \mid e)$ is given by marginalizing over sentences $pv$ of the pivot language:

$p(f \mid e) = \sum_{pv} p(f \mid pv)\, p(pv \mid e)$

• $\hat{e}$ is the highest-probability output sentence, as per argmax over e, given the input sentence f
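A toy numeric sketch of this marginalization, with invented entries for the source-pivot and pivot-target tables:

```python
# p(f|e) by summing over pivot phrases pv: sum_pv p(f|pv) * p(pv|e).
p_f_given_pv = {'pv1': 0.6, 'pv2': 0.2}   # source-pivot table (invented)
p_pv_given_e = {'pv1': 0.5, 'pv2': 0.3}   # pivot-target table (invented)

p_f_given_e = sum(p_f_given_pv[pv] * p_pv_given_e[pv] for pv in p_f_given_pv)
print(p_f_given_e)   # 0.6*0.5 + 0.2*0.3 = 0.36
```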



• Figure below explains the utilization of phrase tables in pivot-based SMT:

Fig 7.16 Processing of phrases in pivot-based SMT



• Table 7.3 shows that Mauritian Creole is very close to French, with a large vocabulary overlap:



• Table 7.4 shows the training data: English to French, 2 million sentences:



• In the graph shown below, the term BACK-OFF means backing off to the source-pivot and pivot-target phrase tables when the source-target phrase table does not yield a match:

Fig 7.17 BLEU scores for MC→EN translation with and without pivot; Grey bars are with FR as pivot



Neural Machine Translation
• In neural NLP, word vectors are designed to place words in a vector space

• This enables us to apply the mathematical operations of ‘distance’, ‘similarity’, ‘addition’, ‘averaging’, etc.

• In a neural framework, it is easier to see that ‘dog’ as a concept has more similarity with ‘cat’ than with ‘door’ (see the sketch below)

• Potentially, language objects are parts of a continuum, and the whole power of geometry, algebra, and calculus can be harnessed for doing NLP
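A small sketch of the ‘similarity’ operation, using cosine similarity over toy 3-dimensional vectors; the vectors are hand-picked for illustration, not trained embeddings.

```python
# Cosine similarity: 'dog' is close to 'cat' and far from 'door' by design.
import numpy as np

vec = {
    'dog':  np.array([0.9, 0.8, 0.1]),
    'cat':  np.array([0.85, 0.75, 0.2]),
    'door': np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec['dog'], vec['cat']))    # high similarity
print(cosine(vec['dog'], vec['door']))   # low similarity
```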



Encoder-Decoder

• Let us enumerate the essential steps through the now ‘classic-in-NMT’ encoder-decoder (a minimal sketch follows the list):

i. The input sentence passes through what is called the encoder as a sequence of word vectors

ii. At the end of encoding, out comes a vector that is supposed to be a representation of the whole input sentence

iii. This encoder output vector is processed by the decoder to output the target language sentence
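Here is a minimal PyTorch sketch of these three steps, with assumed toy vocabulary sizes and an untrained model, so the greedy output is meaningless but the data flow mirrors steps i-iii.

```python
# Minimal GRU encoder-decoder: encode to one vector, then decode greedily.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID, BOS = 100, 120, 32, 64, 0  # assumed sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):              # src: (batch, src_len) of token ids
        _, h = self.rnn(self.emb(src))   # h: the whole-sentence representation
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tok, h):           # one decoding step
        o, h = self.rnn(self.emb(tok), h)
        return self.out(o), h

enc, dec = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 5))    # a random 5-token "sentence"
h = enc(src)                                 # steps i-ii: encode
tok, out = torch.tensor([[BOS]]), []
for _ in range(6):                           # step iii: greedy decoding
    logits, h = dec(tok, h)
    tok = logits.argmax(-1)
    out.append(tok.item())
print(out)
```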



Problem of Long-Distance Dependency

• A caveat is long-distance dependency

• Generating word forms conforming to agreement rules has to grapple with the
challenge of long-distance dependency

• This problem of attenuation of memory brings on stage two key ideas: ‘context
vector’ and ‘attention’

• The input sentence is processed token by token

• After every token, the encoder output is tapped and sent to the decoder. That output is called the context vector



Attention

• Context vectors from the encoder at every token are combined, along with autoregression, in the decoder

• This has proven quite effective in dealing with long-distance dependency

• But even this combination is not adequate for correct translation

• Processing a sentence requires paying different amounts of attention to different parts of the sentence at different stages of processing (see the sketch below)
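A NumPy sketch of one attention step under these ideas, assuming 64-dimensional states and dot-product scoring (one of several scoring functions used in practice):

```python
# The decoder state scores each per-token encoder state; softmax turns the
# scores into attention weights; the weighted sum is this step's context.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

enc_states = np.random.randn(5, 64)   # one context vector per source token
dec_state = np.random.randn(64)       # current decoder state

scores = enc_states @ dec_state / np.sqrt(64)   # scaled dot-product scores
weights = softmax(scores)                       # where to "pay attention"
context = weights @ enc_states                  # weighted summary for this step
print(weights.round(3), context.shape)
```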



NMT Using Transformers

• Transformers raised the bar in MT by establishing new benchmarks

• Notably, this happened on the WMT 14 English-to-German and English-to-French tasks in 2017

• The new performance figures forced the community to take these techniques seriously

• Subsequent sustained interest and very good performance figures in diverse applications cemented the position of Transformers



Positional Encoding

• One of the main contributions of the Transformer is the introduction of positional encoding

• In Transformers, positions are encoded as embeddings

• Positional embeddings are supplied along with input word embeddings

• The training phase teaches the Transformer to condition the output by paying
attention to not only input words, but also their positions



• Let us assume the dimension of the position vector is d

• It is kept the same as the dimension of the word vector

• The position vector is added component-wise to the word vector

• Let POS denote the position vector of dimension d. Each position t in the input sentence has a position vector associated with it

• Let us call this POS_t. Let its components be denoted pos(t, 2i) and pos(t, 2i + 1), with i varying from 0 to (d/2) – 1



• Then,

$pos(t, 2i) = \sin\left(\frac{t}{10000^{2i/d}}\right), \qquad pos(t, 2i + 1) = \cos\left(\frac{t}{10000^{2i/d}}\right)$
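A direct NumPy transcription of these formulas (assuming d is even):

```python
import numpy as np

def positional_encoding(max_len, d):
    """POS[t] is the d-dimensional position vector for position t."""
    pos = np.zeros((max_len, d))
    for t in range(max_len):
        for i in range(d // 2):
            angle = t / (10000 ** (2 * i / d))
            pos[t, 2 * i] = np.sin(angle)       # pos(t, 2i)
            pos[t, 2 * i + 1] = np.cos(angle)   # pos(t, 2i + 1)
    return pos

POS = positional_encoding(max_len=50, d=16)
print(POS.shape)   # (50, 16); POS[t] is added component-wise to word vector t
```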



Why sine and cosine functions?

Foundational Observation 1:

• Let S be a set of symbols. Let P be the set of patterns the symbols create

• If |P| > |S|, then there must exist patterns in P that have repeated symbols



Foundational Observation 2:

• If the patterns can be arranged in a series with an equal difference of values between every consecutive pair

• Then, at any given position of the pattern strings, the symbols must REPEAT down the series

• The frequency of repetition depends on the position of the symbol (compare counting in decimal: the units digit repeats every 10 numbers, the tens digit every 100)



Translation by Transformer

• The power of the Transformer comes from positional embeddings and self- and cross-attention

• We describe the main point of self-attention with an example

• Consider two phrases:

• The translations to Hindi are:



Fig 7.22 Through self-attention, contextual word vectors are obtained from initial word vectors corresponding to the words ‘bank’, ‘of’, ‘the’, and ‘river’



Fig 7.23 The four contextual vectors of the words in the phrase ‘bank of the river’ are obtained by multiplying the original word vectors by weights, called self-attention weights, which are learned from the mutual pairwise similarities of the words through the so-called query, key, and value triples
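A minimal sketch of the query-key-value computation the caption describes, with random matrices standing in for the learned projection weights:

```python
# Self-attention over 4 word vectors ('bank', 'of', 'the', 'river'):
# query-key similarities give the weights; values are mixed by them.
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((4, d))                 # initial word vectors
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                   # pairwise word similarities
w = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = w / w.sum(axis=1, keepdims=True)      # self-attention weights
contextual = weights @ V                        # one contextual vector per word
print(contextual.shape)                         # (4, 8)
```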



Fig 7.24 The lines from English to Hindi words depict cross-attention. The weight of the connection from English ‘Peter’ to Hindi ‘piitar’ should be larger than the weights to ‘jaldii’ and ‘soyaa’; this is accomplished by learning from parallel training data



Fig 7.25 Sub-set of results reported for machine translation in the landmark paper by
Vaswani et al. (2017)



Thank you

