
Natural Language Processing
Pushpak Bhattacharyya
Aditya Joshi

Chapter 7
Machine Translation

Copyright © 2023 by Wiley India Pvt. Ltd.


Chapter 7 Machine Translation
• 7.1 Introduction
• 7.2 Rule-Based Machine Translation
• 7.3 Indian Language Statistical Machine Translation
• 7.4 Phrase-Based Statistical Machine Translation
• 7.5 Factor-Based Statistical Machine Translation
• 7.6 Cooperative NLP: Pivot-Based Machine Translation
• 7.7 Neural Machine Translation



Learning Objectives

• Explain paradigms of Machine Translation (MT)

• Understand the challenges of MT for resource-scarce languages

• Appreciate the mathematics behind MT

• Describe encoder–decoder models for MT



Introduction
“Translation is the process of converting text from one language to another, retaining the meaning in the source text and ensuring grammaticality, idiomaticity, and register-conformity in the target text”

• Thousands of languages are spoken around the world in various countries

• Translation has served as the vehicle for making ideas expressed in one language
accessible in other languages

• Machine translation (MT) refers to the process of translation by a machine (i.e., a computer)



Ambiguity Resolution in Machine Translation
• In order to appreciate the complexity of MT, we revisit the NLP stack and see how MT engages with this stack

Fig 7.1 The NLP stack



• The NLP stack is important for MT

• Every layer in the NLP stack sends signals into the translation process

• This ensures increasingly accurate and high-quality translation

• The NLP stack helps reduce the amount of data needed for training an MT system, compared with training on raw data alone

• A paradigm of MT called statistical machine translation (SMT) trains an MT model with parallel sentences

• Machine learning (ML) based MT, meaning SMT and neural machine translation (NMT), relegates the responsibility of ambiguity resolution to data and ML



RBMT-EBMT-SMT-NMT

• Knowledge-based machine translation (KBMT) is also called rule-based machine translation (RBMT)

• RBMT is of two kinds—interlingua based and transfer based

• Data-driven MT paradigms include example-based machine translation (EBMT), SMT, and NMT

• All these MT paradigms have an ‘A’ word as the essence of the paradigm



Fig 7.2 Paradigms of machine translation



• The four ‘A’ words—Analysis, Alignment, Analogy, and Attention—are each crucial to one of the four MT paradigms:

i. Analysis in RBMT

ii. Alignment in SMT

iii. Analogy in EBMT

iv. Attention in NMT



Today’s Ruling Paradigm: Neural Machine Translation

• Neural machine translation (NMT) performs translation using neural networks

• Like its predecessor paradigms, NMT mirrors the Vauquois triangle

• The famed ‘Encoder-Decoder’ architecture, as well as the more modern ‘Transformer’ architecture, implements the A-T-G pipeline through layers of neurons

• NMT is extremely data-intensive, owing to its requirement of fixing millions and sometimes billions of weight values



Fig 7.3 Variation of accuracy of machine translation with corpus size
Note how the SMT line (grey) is no match for the NMT line (black) as the corpus size increases



Ambiguity in Machine Translation: Language Divergence

“Languages have different ways of expressing meanings, the so-called phenomenon of language divergence”

• One of the ideals of MT has always been the extraction of meaning, completely and correctly, from the source text

• Then comes the production of the target language text from the extracted meaning

• Meaning extraction is an exercise in disambiguation at every layer of the NLP stack: morphology, POS tagging, chunking, parsing, and semantics



• When the same meaning is expressed by two different languages, two kinds of
divergence arise:

i. Lexico-semantic divergence:

 It is essentially vocabulary difference (i.e., the difference of words and phrases)

ii. Structural divergence:

 Here, languages differ in the manner in which they arrange words and phrases in a sentence



Fig 7.4 Language divergence illustrated with the English sentence, ‘This blanket is very soft’



Vauquois Triangle

• Language divergence phenomena have been unified in a famous framework called the Vauquois triangle

• The top of the triangle represents the completely disambiguated meaning of the source sentence

• On the way down from the top, we begin to generate the target language sentence

• We descend the right side of the triangle through different stages of natural language generation (NLG)

• The broad stages of NLG are root word determination, target root substitution, and morphology generation on target roots



Fig 7.5 Vauquois triangle illustrating the analysis-transfer-generation (A-T-G) process
The left side of the triangle is effectively the NLP stack. Complete disambiguation lands the source sentence representation at the top of the triangle
The bottom of the triangle is the opposite extreme, where no analysis is needed as in the case of very close languages such as Spanish-Catalan



• Vauquois triangle in its original form is very elaborate with many sub-activities of
intricate nature

• A simplified Vauquois triangle is depicted next, with source and target languages at the bottom of the pyramid

• A transfer happens at some point between the top and bottom of the triangle

• The left side of the triangle is the analysis side and the right side is the generation side

• At the top of the triangle is the interlingua-based MT

• Any transition into the generation side below the top gives rise to transfer-based machine translation (TBMT)



Fig 7.6 Abridged Vauquois triangle



Rule-Based Machine Translation

• In RBMT, all rules—whether for analysis, transfer, or generation—are written by human experts

• So, the responsibility of correctly and completely capturing language and translation phenomena, and formulating rules therefrom, lies with a human system designer

• The pipeline shown next is the typical architecture for Indian language to Indian language machine translation (ILILMT)

• It was executed as a consortium activity in the period 2000-2006, funded by India’s Ministry of Electronics and Information Technology (MeitY)



Fig 7.7 RBMT pipeline, illustrating analysis-transfer-generation (A-T-G)



• Like computing for all languages of the world, Indian language computing has its challenges

• These challenges are:

i. Scale and diversity: There are 22 scheduled languages in India, written in 13 different
scripts, with over 720 dialects

ii. Code mixing: Owing to India’s multilingual culture, people in India routinely and
seamlessly use at least two languages in their day-to-day communication resulting in
code-mixing

iii. Absence of basic NLP tools and resources: Most Indian languages lack these tools and resources, so MT may have to rely on low-quality tools or manage without them



iv. Absence of linguistic documentation and treatise for many languages: For many
languages of India, no linguistic tradition exists

v. Script complexity and non-standard input mechanism: The QWERTY keyboard for
Roman scripts is non-optimal for Indian languages

vi. Non-standard transliteration: Due to the ubiquity of English language keyboards, there may be non-standard transliterations representing the same Indian language word

vii. Non-standard storage: Many organizations in India have their proprietary fonts that
do not follow the Unicode format

viii. Challenging language phenomena: Compound verbs in Indian languages are one
such phenomenon



Indian Language Statistical Machine Translation

• In 2014, MeitY funded the creation of parallel corpora in many Indian languages

• This project is called Indian Language Corpora Initiative (ILCI)

• About 100,000 parallel sentences were created for languages from the Indo-Aryan
and Dravidian families

• Leveraging the created parallel corpora, SMT systems were built for pairs of Indian languages, as well as between English and Indian languages



• One such comprehensive work was SMT systems for 110 pairs of languages

• The BLEU scores, which measure translation performance to and from different pairs of languages, are shown in Fig 7.8 (a sketch of how BLEU is computed appears below)

• In general, scores are high within the Indo-Aryan family

• The Dravidian family, which is characterized by heavy agglutination, shows a much lower range of BLEU scores

• Translation involving Dravidian languages requires looking inside the words and mapping morphemes to obtain proper translation
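To make the metric concrete, here is a minimal sketch of sentence-level BLEU using NLTK's implementation; the tokenized Hindi reference and hypothesis are invented for illustration, not outputs of the systems reported above.

```python
# Minimal BLEU illustration; the sentences are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['mumbai', 'ke', 'log', 'acche', 'hain']]  # gold translation(s)
hypothesis = ['mumbai', 'ke', 'log', 'acche', 'the']    # system output

# Smoothing avoids zero scores when some higher-order n-grams do not match.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f'BLEU = {score:.3f}')
```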



Fig 7.8 BLEU score values within Indo-Aryan and Dravidian families and across families



Mitigating the Resource Problem

There are only a handful of methods for mitigating the resource problem in MT:

i. Subwords:

• Subword-based MT involves breaking the word into its parts, making use of
characters, syllables, orthographic syllables, and byte pair encodings (BPE)

ii. Cooperative NLP:

• This aims to take help from another language, which can happen in two ways:

a) The first way is to use a pivot language


b) The second way of cooperative NLP is to use transfer learning



iii. The third way of resource-scarcity mitigation is to use higher-level language properties such as POS and sense ID

• This provides additional clues for disambiguation



Methods of Subwording:

• Subwording of, for example, the Hindi word ‘jaauMgaa’ (‘I will go’) may be performed in terms of:

a) Characters: ‘j’ + ‘aa’ + ‘u’ + ‘M’ + ‘g’ + ‘aa’
b) Morphemes: ‘jaa’ + ‘uMgaa’
c) Syllables: ‘jaa’ + ‘uM’ + ‘gaa’
d) Orthographic syllables: strings ending in vowels: ‘jaau’ + ‘Mgaa’
e) BPE: depends on corpora and statistically frequent patterns; on that count, both ‘jaa’ and ‘uMgaa’ are likely (see the merge-learning sketch below)
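As a concrete illustration of option (e), below is a minimal sketch of BPE merge learning on a toy corpus; the word list, frequencies, and number of merges are assumed for illustration, not taken from the book.

```python
# A toy sketch of BPE merge learning: repeatedly merge the most frequent
# adjacent symbol pair. Words are stored as space-separated symbols.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Concatenate the chosen pair wherever it occurs (a simple string
    replace, adequate for this toy vocabulary)."""
    merged, replacement = ' '.join(pair), ''.join(pair)
    return {w.replace(merged, replacement): f for w, f in vocab.items()}

vocab = {'j aa uM gaa': 5, 'kh aa uM gaa': 4, 'j aa': 3}
for step in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best, '->', vocab)
```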



Actual Evidence of Benefits of Subwording
Here, we show the quantitative evidence of benefits of subwording in this figure:

Fig 7.9 BLEU scores for word/token-based machine translation



• Figure below shows morpheme-based SMT BLEU scores:

Fig 7.10a Morpheme-based SMT and corresponding BLEU scores



• There is consistent BLEU score improvement as shown in the figure below:

Fig 7.10b Per cent improvement over word level scores



• BPE-based SMT showed a still larger improvement in scores:

Fig 7.11a BPE-based SMT and corresponding BLEU scores



• Punjabi-Tamil BLEU score improvement is 28.26% as shown here:

Fig 7.11b Per cent improvement over word level scores



Phrase-Based Statistical Machine Translation
(PBSMT)
Need for Phrase Alignment

When translating a sentence from one language to another, a simple approach may be:

• To translate the sentence word by word, accompanied by morphological and syntactic adjustments

• Such an approach will require a dictionary that maps words of the source
language to those of the target language



• However, there are many compelling reasons why translation should be based on units of text longer than words

• Instead of words, if we allow word groups to align, the modelling becomes much simpler

• Note that the process of creating phrase alignments is essentially one of merging neighbours

• In a tabular representation of alignments, this amounts to growing strings of words by expanding along diagonals and aligning these strings



Example:

Alignments are marked with ‘X’. For English-Hindi, the alignment sets are

A1 (English→Hindi): {<Mumbai, mumbai>, <of, ke>, <people, log>}
A2 (Hindi→English): {<mumbai, Mumbai>, <ke, of>, <log, people>}

Now the grow-diag process will create the phrase alignments

‘people of’ ↔ ‘ke log’ (black square) and ‘of Mumbai’ ↔ ‘mumbai ke’ (light grey square)

‘people of Mumbai’ ↔ ‘mumbai ke log’ (dark grey square)

• This is visually explained in Fig 7.12; a sketch of the corresponding phrase-pair extraction follows the figure



Fig 7.12 Creation of phrase alignments from word alignments
through grow-diag algorithm
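To complement the figure, here is a simplified sketch of the consistent phrase-pair extraction that such alignments license; it is a toy version under assumed 0-based indices, not the full grow-diag-final algorithm of standard SMT toolkits.

```python
# Extract phrase pairs consistent with the word alignment: a source span and
# the target span it links to form a pair only if no link leaves the box.
def extract_phrases(alignment, src_len, max_len=3):
    phrases = set()
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            tgt = [j for (i, j) in alignment if s1 <= i <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            if all(s1 <= i <= s2 for (i, j) in alignment if t1 <= j <= t2):
                phrases.add(((s1, s2), (t1, t2)))
    return phrases

eng = ['people', 'of', 'Mumbai']
hin = ['mumbai', 'ke', 'log']
links = {(0, 2), (1, 1), (2, 0)}   # people-log, of-ke, Mumbai-mumbai
for (s1, s2), (t1, t2) in sorted(extract_phrases(links, len(eng))):
    print(' '.join(eng[s1:s2 + 1]), '<->', ' '.join(hin[t1:t2 + 1]))
```

Running this prints exactly the single-word links plus the grown pairs listed above, including ‘people of’ ↔ ‘ke log’ and ‘people of Mumbai’ ↔ ‘mumbai ke log’.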



Case of Promotional/Demotional Divergence

• Promotional and demotional divergences are particular cases of language


divergence

• Consider the translation pair:

‘The play is on’ → ‘khel chal rahaa hai’; gloss: play continue <progressive auxiliary> <auxiliary>

• The translation of ‘on’ is ‘chal rahaa hai’



• It is apparent that setting up a correspondence between ‘rahaa’ and ‘on’ is artificial and non-intuitive

• A portion of the probability mass for the mapping of ‘on’ is spent on ‘rahaa’, thus depriving more deserving candidates like ‘par’ and ‘upar’

• This may lead to the strange translation of ‘The book is on the table’ as:

‘mej ke rahaa kitaab hai’ instead of the correct ‘mej ke upar kitaab hai’!



Case of Multiword (Includes Idioms)

• Non-compositional multiwords are not amenable to word-based alignment, unless the source and target languages are extremely close linguistically and culturally

• Example:

• Not a single word in the Bengali sentence above has an equivalent translation in
the parallel English sentence



• Another Example:

• In this case, there is almost one-to-one correspondence between Hindi and Bengali



Phrases Are Not Necessarily Linguistic Phrases

• Phrases in PBSMT are not necessarily linguistic phrases but are sequences of words

• Some of these word sequences can be linguistic phrases, but that is not necessarily so

• It is possible to have aligned phrases that are non-equivalent in meaning

• Even when the two languages are close to each other, the phrases aligned can be non-linguistic



Use of the Phrase Table

• We work with the aligned phrases in the phrase table along with their probability values

• The probability value of a phrase translation indicates how good a translation pair is, as formed by the phrase and its translation

• When a new sentence needs to be translated, we match parts of the input sentence against the phrase table, pick up the translations, combine them, and finally score the resulting sentences


• The scoring uses phrase translation probabilities and language model probabilities

• Everything starts with finding and matching parts of the input sentence in the phrase table

• The size of the phrase table is, thus, an important factor in the translation process (a toy decoding sketch follows)
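Here is a toy sketch of that lookup-combine-score loop; the phrase table entries, language model scores, and the single reordering choice are all invented for illustration.

```python
# Toy phrase-based decoding: match spans, combine translations, score with
# translation-model (phi) and language-model probabilities. Numbers invented.
phrase_table = {                                 # Hindi phrase -> options
    'mumbai ke': [('of Mumbai', 0.6)],
    'log': [('people', 0.7), ('logs', 0.1)],
}
lm = {'people of Mumbai': 0.02, 'logs of Mumbai': 0.001}  # toy LM probabilities

src = 'mumbai ke log'
best, best_score = None, 0.0
for eng2, phi2 in phrase_table['log']:
    eng1, phi1 = phrase_table['mumbai ke'][0]
    candidate = f'{eng2} {eng1}'                 # reorder the two phrases
    score = phi1 * phi2 * lm.get(candidate, 1e-9)
    if score > best_score:
        best, best_score = candidate, score
print(f'{src!r} -> {best!r} (score {best_score})')
```

The fluent candidate ‘people of Mumbai’ wins because the language model strongly prefers it, even though both phrase choices are in the table.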



Mathematics of Phrase-Based Statistical Machine Translation

• The basic equation of SMT is:

$\hat{e} = \operatorname*{argmax}_{e} P(e \mid f) = \operatorname*{argmax}_{e} P(f \mid e)\, P_{LM}(e)$

• Here, e and f have their usual meaning of output and input, respectively

• The translation with the highest score is $\hat{e}$

• $P(f \mid e)$ and $P_{LM}(e)$ are the translation model and language model, respectively



• The translation probability $P(f \mid e)$ is modelled over phrase pairs as:

$P(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)$

• $\phi(\bar{f}_i \mid \bar{e}_i)$ is called the phrase translation probability and $d(\cdot)$ is the distortion probability
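A toy numeric evaluation of this product, with invented phrase translation probabilities and an assumed exponential distortion model $d(x) = \alpha^{|x|}$:

```python
# Evaluate P(f|e) = prod_i phi_i * d(start_i - end_{i-1} - 1) on made-up numbers.
ALPHA = 0.5                               # assumed distortion penalty base

# (phi, start_i, end_i) for each phrase pair, in target order
phrases = [(0.5, 0, 1),    # covers source words 0-1, no jump
           (0.4, 4, 5),    # jumps ahead: distortion cost applies
           (0.3, 2, 3)]    # jumps back

p, prev_end = 1.0, -1
for phi, start, end in phrases:
    p *= phi * ALPHA ** abs(start - prev_end - 1)
    prev_end = end
print(p)   # 0.5*1 * 0.4*0.25 * 0.3*0.0625 = 0.0009375
```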



Factor-Based Statistical Machine Translation
• Consider translating the English sentence ‘I ate a mango’; in Hindi, the target sentence is ‘mei_ne aam khaa_yaa’

• Where does ‘ne’ come from?

• Hindi grammar rules say that the agent should get the ergative marker ‘ne’ if the
verb is transitive (‘sakarmak kriya’) and in the past tense

• This rule holds even for ellipsis wherein lexemes are implicit

• If the word ‘aam’ is dropped, the translation will still be ‘mei_ne khaa_yaa’



Fig 7.14 Translation of an English sentence to Hindi using a factor-based SMT paradigm



Fig 7.15 Mapping of factors in factor-based SMT. Suffix + semantic relation gives
rise to case-marking suffix or case-marking post-position, lemma maps to
lemma, and finally a word in correct form appears in the target language



Cooperative NLP: Pivot-Based Machine Translation
• An intermediate language is introduced to supply missing data when the parallel
corpus is in short supply

• The intermediate language is called the bridge language

• The theory of translation through a pivot language is based on the concept of marginalization in probability theory

• The equation below shows this:

$\hat{e} = \operatorname*{argmax}_{e} p(e \mid f) = \operatorname*{argmax}_{e} p(f \mid e)\, p(e)$



• Here, $p(f \mid e)$ is given by marginalizing over sentences $pv$ of the pivot language:

$p(f \mid e) = \sum_{pv} p(f \mid pv)\, p(pv \mid e)$

• $\hat{e}$ is the highest-probability output sentence, as per argmax over e, given the input sentence f
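A toy numeric sketch of this marginalization, with invented entries for the source-pivot and pivot-target tables:

```python
# p(f|e) by summing over pivot phrases pv: sum_pv p(f|pv) * p(pv|e).
p_f_given_pv = {'pv1': 0.6, 'pv2': 0.2}   # source-pivot table (invented)
p_pv_given_e = {'pv1': 0.5, 'pv2': 0.3}   # pivot-target table (invented)

p_f_given_e = sum(p_f_given_pv[pv] * p_pv_given_e[pv] for pv in p_f_given_pv)
print(p_f_given_e)   # 0.6*0.5 + 0.2*0.3 = 0.36
```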



• Figure below explains the utilization of phrase tables in pivot-based SMT:

Fig 7.16 Processing of phrases in pivot-based SMT



• Table 7.3 shows that Mauritian Creole is very close to French, with a large vocabulary overlap:



• Table 7.4 shows the training data: English to French, 2 million sentences:



• In the graph shown below, the term BACK-OFF means backing off to the source-pivot and pivot-target phrase tables when the source-target phrase table does not yield a match:

Fig 7.17 BLEU scores for MC→EN translation with and without pivot; Grey bars are with FR as pivot



Neural Machine Translation
• In neural NLP, word vectors are designed to place words in a vector space

• This enables us to apply the mathematical operations of ‘distance’, ‘similarity’, ‘addition’, ‘averaging’, etc.

• In a neural framework, it is easier to see that ‘dog’ as a concept has more similarity with ‘cat’ than with ‘door’ (see the sketch below)

• Potentially, language objects are parts of a continuum, and the whole power of geometry, algebra, and calculus can be harnessed for doing NLP
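A small sketch of the ‘similarity’ operation, using cosine similarity over toy 3-dimensional vectors; the vectors are hand-picked for illustration, not trained embeddings.

```python
# Cosine similarity: 'dog' is close to 'cat' and far from 'door' by design.
import numpy as np

vec = {
    'dog':  np.array([0.9, 0.8, 0.1]),
    'cat':  np.array([0.85, 0.75, 0.2]),
    'door': np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec['dog'], vec['cat']))    # high similarity
print(cosine(vec['dog'], vec['door']))   # low similarity
```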



Encoder-Decoder

• Let us enumerate the essential steps through the now ‘classic-in-NMT’ encoder-decoder (a minimal sketch follows the list):

i. The input sentence passes through what is called the encoder as a sequence of word vectors

ii. At the end of encoding, out comes a vector that is supposed to be a representation of the whole input sentence

iii. This encoder output vector is processed by the decoder to output the target language sentence
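Here is a minimal PyTorch sketch of these three steps, with assumed toy vocabulary sizes and an untrained model, so the greedy output is meaningless but the data flow mirrors steps i-iii.

```python
# Minimal GRU encoder-decoder: encode to one vector, then decode greedily.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID, BOS = 100, 120, 32, 64, 0  # assumed sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):              # src: (batch, src_len) of token ids
        _, h = self.rnn(self.emb(src))   # h: the whole-sentence representation
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tok, h):           # one decoding step
        o, h = self.rnn(self.emb(tok), h)
        return self.out(o), h

enc, dec = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 5))    # a random 5-token "sentence"
h = enc(src)                                 # steps i-ii: encode
tok, out = torch.tensor([[BOS]]), []
for _ in range(6):                           # step iii: greedy decoding
    logits, h = dec(tok, h)
    tok = logits.argmax(-1)
    out.append(tok.item())
print(out)
```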



Problem of Long-Distance Dependency

• A caveat is long-distance dependency

• Generating word forms conforming to agreement rules has to grapple with the
challenge of long-distance dependency

• This problem of attenuation of memory brings on stage two key ideas: ‘context
vector’ and ‘attention’

• The input sentence is processed token by token

• After every token, the encoder output is tapped and sent to the decoder. That output is called the context vector



Attention

• Context vectors from the encoder at every token are combined, along with autoregression, in the decoder

• This has proven quite effective in dealing with long-distance dependency

• But even this combination is not adequate for correct translation

• Processing a sentence requires paying different amounts of attention to different parts of the sentence at different stages of processing (see the sketch below)
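A NumPy sketch of one attention step under these ideas, assuming 64-dimensional states and dot-product scoring (one of several scoring functions used in practice):

```python
# The decoder state scores each per-token encoder state; softmax turns the
# scores into attention weights; the weighted sum is this step's context.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

enc_states = np.random.randn(5, 64)   # one context vector per source token
dec_state = np.random.randn(64)       # current decoder state

scores = enc_states @ dec_state / np.sqrt(64)   # scaled dot-product scores
weights = softmax(scores)                       # where to "pay attention"
context = weights @ enc_states                  # weighted summary for this step
print(weights.round(3), context.shape)
```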



NMT Using Transformers

• Transformers raised the bar in MT by establishing new benchmarks

• Notably, this happened on the WMT 14 English-to-German and English-to-French tasks in 2017

• The new performance figures forced the community to take these techniques seriously

• Subsequent sustained interest and very good performance figures in diverse applications cemented the position of Transformers



Positional Encoding

• One of the main contributions of the Transformer is the introduction of positional encoding

• In Transformers, positions are encoded as embeddings

• Positional embeddings are supplied along with input word embeddings

• The training phase teaches the Transformer to condition the output by paying
attention to not only input words, but also their positions



• Let us assume the dimension of the position vector is d

• It is kept the same as the dimension of the word vector

• The position vector is added component-wise to the word vector

• Let POS denote the position vector of dimension d. Each position t in the input sentence has a position vector associated with it

• Let us call this POS_t. Let its components be denoted pos(t, 2i) and pos(t, 2i + 1), with i varying from 0 to (d/2) – 1



• Then,

$pos(t, 2i) = \sin\left(\frac{t}{10000^{2i/d}}\right), \qquad pos(t, 2i + 1) = \cos\left(\frac{t}{10000^{2i/d}}\right)$
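A direct NumPy transcription of these formulas (assuming d is even):

```python
import numpy as np

def positional_encoding(max_len, d):
    """POS[t] is the d-dimensional position vector for position t."""
    pos = np.zeros((max_len, d))
    for t in range(max_len):
        for i in range(d // 2):
            angle = t / (10000 ** (2 * i / d))
            pos[t, 2 * i] = np.sin(angle)       # pos(t, 2i)
            pos[t, 2 * i + 1] = np.cos(angle)   # pos(t, 2i + 1)
    return pos

POS = positional_encoding(max_len=50, d=16)
print(POS.shape)   # (50, 16); POS[t] is added component-wise to word vector t
```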



Why sine and cosine functions?

Foundational Observation 1:

• Let S be a set of symbols. Let P be the set of patterns the symbols create

• If |P| > |S|, then there must exist patterns in P that have repeated symbols



Foundational Observation 2:

• If the patterns can be arranged in a series with an equal difference of values between every consecutive pair

• Then, at any given position of the pattern strings, the symbols must REPEAT down the series

• The frequency of repetition depends on the position of the symbol (compare counting in decimal: the units digit repeats every 10 numbers, the tens digit every 100)



Translation by Transformer

• The power of the Transformer comes from positional embeddings and self- and cross-attention

• We describe the main point of self-attention with an example

• Consider two phrases:

• The translations to Hindi are:



Fig 7.22 Through self-attention, contextual word vectors are obtained from initial word vectors corresponding to the words ‘bank’, ‘of’, ‘the’, and ‘river’



Fig 7.23 The four contextual vectors of the words in the phrase ‘bank of the river’ are obtained by multiplying the original word vectors by weights, called self-attention weights, which are learned from the mutual pairwise similarities of the words through the so-called query, key, and value triples
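A minimal sketch of the query-key-value computation the caption describes, with random matrices standing in for the learned projection weights:

```python
# Self-attention over 4 word vectors ('bank', 'of', 'the', 'river'):
# query-key similarities give the weights; values are mixed by them.
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((4, d))                 # initial word vectors
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                   # pairwise word similarities
w = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = w / w.sum(axis=1, keepdims=True)      # self-attention weights
contextual = weights @ V                        # one contextual vector per word
print(contextual.shape)                         # (4, 8)
```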



Fig 7.24 The lines from English to Hindi words depict cross-attention. The weight of the connection from English ‘Peter’ to Hindi ‘piitar’ should be larger than the weights to ‘jaldii’ and ‘soyaa’; this is accomplished by learning from parallel training data



Fig 7.25 Sub-set of results reported for machine translation in the landmark paper by
Vaswani et al. (2017)



Thank you

