
Machine Translation

Machine Translation (MT)
 Translation of text units from one language into another
using computers.
 One of the earliest NLP applications.
 The user need not know multiple languages: the text is fed to an MT
system, which returns the translation.
Need for MT
 It can help overcome technological barriers.
 A great deal of information is available today, but only in a
very small subset of languages, which puts it beyond the reach of
a significant portion of society. This has led to a digital divide.
 MT can be of great help in removing this divide.
 Multilingual countries such as India, where very few people
can understand English, particularly need MT systems to translate
information from English into local languages.
Problems in MT
There are many structural and stylistic differences among
languages, which make automatic translation a difficult
task:
 Word Order: The arrangement of words in a sentence varies
across languages. E.g. English arranges words in the order
subject-verb-object, whereas in Indian languages the object
usually precedes the verb.
 Word Sense: A word in one language may correspond to several
words in another language, each carrying a different sense. This
creates a problem in target-language word selection.
Problems in MT (contd.)
 Anaphora resolution (AR), which most commonly appears as pronoun
resolution, is the problem of resolving references to earlier or later items in
the discourse.
 These items are usually noun phrases representing objects in the real world,
called referents, but can also be verb phrases, whole sentences or paragraphs.
 There are primarily three types of anaphora:

- Pronominal: This is the most common type, where a referent is referred to by a
pronoun. Example: "John found the love of his life", where 'his' refers to
'John'.
- Definite noun phrase: The antecedent is referred to by a phrase of the form
"<the> <noun phrase>". Continued example: "The relationship did not last
long", where 'The relationship' refers to 'the love' in the preceding sentence.
- Quantifier/Ordinal: The anaphor is a quantifier such as 'one' or an ordinal
such as 'first'. Continued example: "He started a new one", where 'one' refers
to 'The relationship' (effectively meaning 'a relationship').
Problems in MT (contd.)
 Idioms: A sentence involving idiomatic expressions is
difficult to translate, as the meaning of an idiom cannot be
derived from the meanings of its individual words. E.g.
a literal translator may render “The old man finally kicked
the bucket” into Hindi as “Boodhe aadmi ne antatah balti
mein laat maari”, which loses the idiomatic meaning (“died”).
 Ambiguity: Some ambiguities cannot be carried over into
certain languages. E.g. consider this sentence with PP-attachment
ambiguity: “The man saw the girl with a telescope”. In
order to translate this sentence into Hindi, the
ambiguity must first be resolved.
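The two readings of the telescope sentence can be made concrete as two bracketings of the same word string. The sketch below is only an illustration: the nested-tuple format, the node labels and the helper names `leaves` and `pp_parent` are assumptions, not part of any particular parser.

```python
# Trees are nested tuples of the form (label, child, child, ...).
# Labels and structure are illustrative assumptions.
PP = ("PP", "with", ("NP", "a", "telescope"))

# Reading 1: the PP attaches to the verb phrase (the man used the telescope).
verb_attach = ("S", ("NP", "The", "man"),
               ("VP", "saw", ("NP", "the", "girl"), PP))

# Reading 2: the PP attaches to the noun phrase (the girl has the telescope).
noun_attach = ("S", ("NP", "The", "man"),
               ("VP", "saw", ("NP", "the", "girl", PP)))

def leaves(tree):
    """Return the words of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def pp_parent(tree):
    """Return the label of the node that directly contains the PP."""
    if isinstance(tree, str):
        return None
    for child in tree[1:]:
        if child == PP:
            return tree[0]
        found = pp_parent(child)
        if found:
            return found
    return None
```

Both trees yield the same surface sentence, but `pp_parent` returns a different attachment site for each. A translator into a language that cannot preserve the ambiguity has to commit to one of them.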
MT Approaches
 MT approaches can be broadly classified into the following
four categories:
[Figure: Machine Translation Approaches]
1. Direct Translation
2. Rule-based Translation: Transfer, Interlingua
3. Corpus-based Translation: Statistical, Example-based
4. Knowledge-based Translation
Direct Machine Translation (DMT)
 DMT systems translate directly, i.e. no
intermediate representation is used.
 They carry out word-by-word translation with the help of a
bilingual dictionary, usually followed by some syntactic
rearrangement.
 They take a monolithic approach to development,
i.e. they consider all the details of one language pair.
 Anusaaraka (IIIT Hyderabad) is an MT system based on the direct
approach.
Overview
 Three main methodologies for Machine Translation
 Direct
 Transfer
 Interlingual
Contd..
The general procedure for direct translation systems can be
summarized in the following three steps:
1. Remove morphological inflections from the words to get
the root forms of the source-language words.
2. Look up a bilingual dictionary to get the target-language
words corresponding to the source-language words.
3. Change the word order to best match the word
order of the target language, e.g. in an English-Hindi
translation system, this would involve changing
prepositions to postpositions and changing the
subject-verb-object structure to subject-object-verb.
DMT System
[Figure: DMT pipeline]
Source language text -> Morphological analysis -> SL words -> Bilingual lookup (SL-TL dictionary) -> TL words -> Syntactic rearrangement -> Target language text
Example
 Consider this English sentence:
Khushbu slept in the garden.
To translate this sentence into Hindi, a direct translation system
first looks up a dictionary to get the target word for each word appearing
in the source-language sentence. The words are then reordered to
match the default sentence structure of Hindi. The output of these
steps is:
Word translation:
खुशबु सोयी में बाग
Khushbu soyi mein baag
Syntactic rearrangement:
खुशबु बाग में सोयी
Khushbu baag mein soyi
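The three direct-translation steps can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real DMT system is built: the lexicon holds only the romanized words from the example above, `ROOTS` is a hypothetical stand-in for morphological analysis, the article "the" is simply dropped as in the example, and every preposition is assumed to be followed by a one-word noun phrase.

```python
# Toy English-to-Hindi lexicon (romanized), taken from the example above.
LEXICON = {"khushbu": "Khushbu", "sleep": "soyi", "in": "mein", "garden": "baag"}

# Hypothetical stand-in for morphological analysis: surface form -> root.
ROOTS = {"slept": "sleep"}
DROPPED = {"the"}            # the example leaves "the" untranslated
PREPOSITIONS = {"in"}
VERB_ROOTS = {"sleep"}

def analyse(sentence):
    """Step 1: lowercase, strip the full stop, drop 'the', reduce to roots."""
    words = sentence.rstrip(".").lower().split()
    return [ROOTS.get(w, w) for w in words if w not in DROPPED]

def lookup(roots):
    """Step 2: word-by-word dictionary lookup (root kept for reordering)."""
    return [(r, LEXICON[r]) for r in roots]

def rearrange(pairs):
    """Step 3: preposition -> postposition, then move the verb to the end."""
    pairs = list(pairs)
    i = 0
    while i < len(pairs) - 1:
        if pairs[i][0] in PREPOSITIONS:
            # Assumes the preposition is followed by a one-word noun phrase.
            pairs[i], pairs[i + 1] = pairs[i + 1], pairs[i]
            i += 2
        else:
            i += 1
    verbs = [p for p in pairs if p[0] in VERB_ROOTS]
    others = [p for p in pairs if p[0] not in VERB_ROOTS]
    return others + verbs

def words(pairs):
    """Join the target-language words into a string."""
    return " ".join(target for _, target in pairs)

translated = lookup(analyse("Khushbu slept in the garden."))
word_translation = words(translated)
final_output = words(rearrange(translated))
```

Here `word_translation` corresponds to the "Word translation" line and `final_output` to the "Syntactic rearrangement" line of the example.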
Contd..
 Besides word ordering and preposition handling, suffix handling is also needed
to make the translation acceptable. E.g. in the following sentence the
Hindi word ladka must be changed to ladke. This is termed idiomatization.
 English sentence:
The boy gave the girl a book.
Word translation:
Ladka dee ladki ek kitaab
लड़का दी लड़की एक किताब
Syntactic rearrangement:
Ladka ladki ek kitaab dee
लड़का लड़की एक किताब दी
Karaka handling and idiomatization:
Ladke ne ladki ko ek kitaab dee
लड़के ने लड़की को एक किताब दी
Contd..
 Other changes include modifying verbs and adjectives according to the
gender of the subject. E.g.
She saw stars in the sky
Word translation:
वो देखा तारे में आसमान
Wo dekha taare mein aasman
Syntactic rearrangement:
वो आसमान में तारे देखी
Wo aasman mein taare dekhi
Karaka handling and idiomatization:
उसने आसमान में तारे देखे
Usne aasman mein taare dekhe
Some Points
 Selection of the correct target-language word is another
problem in direct translation systems. E.g.
 Book a ticket for me.
 A word-by-word translation does not make it clear whether
‘book’ is used as a noun or a verb.
 A DMT system involves only lexical analysis.
 It does not consider the structure of the sentence or the relationships between words.
 It does not attempt to disambiguate words. Hence the quality
of the output is often not very good.
 A DMT system is developed for a specific language pair and cannot
be adapted to a different pair.
Rule-based Machine Translation (RBMT)
 RBMT systems parse the source text and produce an intermediate
representation, which may be a parse tree or some abstract
representation.
 The target-language text is generated from the intermediate
representation.
 RBMT systems rely on rules specified for morphology,
syntax, lexical selection and transfer, semantic analysis
and generation, and hence are called rule-based MT.
 Examples: the Ariane and SUSY systems.
RBMT Classification
Depending on the intermediate representation used,
RBMT systems are further classified as follows:
1. Transfer-based Machine Translation
2. Interlingua Machine Translation
Transfer-based Translation
 These models transform the structure of the input to produce a
representation that matches the rules of the target language.
 This transformation requires an understanding of the
differences between the source and target languages.
 A transfer-based MT system has the following three components:
1. Analysis: produces the source-language structure
2. Transfer: maps the source-language representation
to a target-language representation
3. Generation: generates the target-language text from the
target-language structure
Transfer-based Translation
[Figure: transfer pipeline]
Source language text -> Analysis (SL grammar) -> SL representation -> Transfer (SL-TL dictionary and grammar) -> TL representation -> Synthesis (TL grammar) -> Target language text
Analysis
 The first stage analyses the source text and produces a
structure conforming to the rules of the source language.
 It may involve morphological, syntactic and semantic
analyses.
 As this stage involves parsing the source text, syntactic
and lexical ambiguities are resolved better in
this approach than in the direct translation approach.
Transfer and Generation
 The second stage transfers the source-language representation into
a target-language representation.
 All language-pair-specific characteristics are handled
by the transfer component.
 The third stage is responsible for generating the actual
target-language text.
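The analysis, transfer and generation stages can be sketched on a single sentence. Everything below is an assumption made for illustration: the tree format, the node labels, the two reordering rules and the tiny dictionary. The analysis stage is stubbed with a hand-written tree (with the article already dropped), since a real system would run a parser here.

```python
# Hypothetical bilingual dictionary (romanized Hindi from the earlier example).
BILINGUAL = {"Khushbu": "Khushbu", "slept": "soyi", "in": "mein", "garden": "baag"}

def analyse(sentence):
    """Analysis: stub that returns a hand-written source-language tree."""
    if sentence != "Khushbu slept in the garden":
        raise ValueError("this sketch only knows one sentence")
    return ("S", ("NP", "Khushbu"),
            ("VP", ("V", "slept"), ("PP", ("P", "in"), ("NP", "garden"))))

def is_leaf(children):
    """A leaf node holds exactly one word."""
    return len(children) == 1 and isinstance(children[0], str)

def transfer(node):
    """Transfer: map the English tree to a Hindi-ordered tree."""
    label, *children = node
    if is_leaf(children):
        return (label, BILINGUAL[children[0]])   # lexical transfer
    children = [transfer(c) for c in children]
    if label == "PP":                            # P NP -> NP P (postposition)
        children.reverse()
    elif label == "VP":                          # V X -> X V (verb-final)
        children = children[1:] + children[:1]
    return (label, *children)

def generate(node):
    """Generation: read the target words off the tree, left to right."""
    label, *children = node
    if is_leaf(children):
        return [children[0]]
    return [w for c in children for w in generate(c)]

target = " ".join(generate(transfer(analyse("Khushbu slept in the garden"))))
```

Only `transfer` knows about both languages. `analyse` depends on the source language alone and `generate` on the target language alone, which is the modularity that the transfer approach relies on.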
Transfer-based Translation advantages
1. Modular structure:
 The analysis of SL text (i.e. the parser) is independent of the
target-language generator.
 To provide translation capability among a set of
languages, we need an analyzer and a generator
component for each language and a transfer component
for each ordered pair of languages.
 E.g. for translation among 6 languages we need 6 analyzers, 6
generators and 6 × 5 = 30 transfer components, as opposed to 30
complete translation systems in the direct translation
approach.
Transfer-based Translation advantages (contd.)
2. Handles ambiguity:
 It can easily handle ambiguities that carry over from one
language to another.
E.g. we need not resolve the PP-attachment
ambiguity in the sentence “The girl plucked a flower
with a stick” when translating it into French.
 It can also handle lexical ambiguity.
E.g. consider the sentence “Book a ticket for me”.
A parse tree generated by a transfer system makes it clear that
‘book’ is used here as a verb, not as a noun (which would
be kitaab in Hindi).
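One way to picture this lexical selection is a bilingual lexicon keyed by word and category, so that the category supplied by the analysis picks the entry. In this sketch the `guess_category` heuristic is a crude, hypothetical stand-in for a parser, and the verb entry is a placeholder because the text gives only the noun translation.

```python
# Hypothetical lexicon keyed by (word, category).
# "kitaab" comes from the text; the verb entry is a placeholder.
LEXICON = {
    ("book", "NOUN"): "kitaab",
    ("book", "VERB"): "<verb: reserve>",
}

DETERMINERS = {"a", "an", "the", "this", "that"}

def guess_category(tokens, i):
    """Crude stand-in for a parser: guess the category of tokens[i]."""
    previous = tokens[i - 1].lower() if i > 0 else None
    if previous in DETERMINERS:
        return "NOUN"     # e.g. "the book", "a book"
    return "VERB"         # e.g. a sentence-initial imperative

def select(tokens, i):
    """Pick the target-language entry for tokens[i] using its category."""
    word = tokens[i].lower()
    return LEXICON[(word, guess_category(tokens, i))]

verb_sense = select("Book a ticket for me".split(), 0)
noun_sense = select("I read the book".split(), 3)
```

A direct system with a plain word-to-word dictionary has no category to key on, which is why it cannot make this choice.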
Interlingua-based Machine Translation
 Here the source-language text is converted into a
language-independent meaning representation called an ‘interlingua’.
 An interlingua represents all sentences that mean the same
thing in the same way, regardless of the language
they happen to be in (Jurafsky).
 Texts in other languages are generated from the interlingual
representation.
 Translation is a two-stage process: analysis and synthesis.
Overview
 Interlingua
 Single underlying representation for both SL and TL
which ideally
 Abstracts away from language-specific characteristics
 Creates a “language-neutral” representation
 Can be used as a “pivot” representation in the translation
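As a rough picture of a pivot representation, the sketch below stores the meaning of "Khushbu slept in the garden" as a language-neutral frame and generates from it with one generator per target language. The slot names, concept labels and generator functions are all assumptions made for this illustration; real interlinguas are far richer.

```python
# Hypothetical language-neutral frame; slot names and concepts are assumptions.
FRAME = {"event": "SLEEP", "tense": "PAST", "agent": "Khushbu", "location": "GARDEN"}

def generate_english(frame):
    """English generator: subject, verb, then a prepositional phrase."""
    verb = {("SLEEP", "PAST"): "slept"}[(frame["event"], frame["tense"])]
    place = {"GARDEN": "garden"}[frame["location"]]
    return f"{frame['agent']} {verb} in the {place}"

def generate_hindi(frame):
    """Hindi generator (romanized): verb-final order with a postposition."""
    verb = {("SLEEP", "PAST"): "soyi"}[(frame["event"], frame["tense"])]
    place = {"GARDEN": "baag"}[frame["location"]]
    return f"{frame['agent']} {place} mein {verb}"

# One generator per target language; none of them knows the source language.
GENERATORS = {"en": generate_english, "hi": generate_hindi}

def translate(frame, target):
    """Generate target-language text from the pivot frame."""
    return GENERATORS[target](frame)

english = translate(FRAME, "en")
hindi = translate(FRAME, "hi")
```

Adding a further target language means adding one entry to `GENERATORS`, with no change to the frame or to the other generators.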
Contd…
[Figure: interlingua architecture]
SL1 text ... SLn text -> Analysis (SL1 ... SLn grammars) -> Interlingua representation -> Synthesis (TL1 ... TLn grammars) -> TL1 text ... TLn text
Contd..
 In the first stage, the SL text is represented in the interlingua.
 In the second stage, the TL text is generated.
 The analysis phase is specific to the source language and the synthesis phase is
specific to the target language, so the approach is convenient in a
multilingual environment.
 E.g. to provide translation capability among n
languages, we need only n analysis and n generation
components, as opposed to n(n-1) complete MT systems
in the direct translation approach.
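The component counts for the three architectures can be checked with a small helper. The function name and the dictionary layout are assumptions for this sketch; the formulas are the ones stated in the slides, with n(n-1) counting ordered (source, target) pairs.

```python
def components(n):
    """Number of components needed to translate among n languages."""
    pairs = n * (n - 1)   # ordered (source, target) language pairs
    return {
        "direct": {"complete systems": pairs},
        "transfer": {"analyzers": n, "generators": n, "transfer modules": pairs},
        "interlingua": {"analyzers": n, "generators": n},
    }

six = components(6)
```

For n = 6 this gives 30 complete systems for the direct approach, 6 + 6 + 30 components for the transfer approach, and 6 + 6 components for the interlingua approach. The interlingua count grows linearly with n, while the other two grow quadratically.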
Overview
 Cost/Benefit analysis of moving up the triangle
 Benefit
 Reduces the amount of work required to traverse
the gap between languages
 Cost
 Increases amount of analysis
 Convert the source input into a suitable
pre-transfer representation
 Increases amount of synthesis
 Convert the post-transfer representation
into the final target surface form
Overview
 Two major advantages of the Interlingua method
1. The more target languages there are, the more valuable an
Interlingua becomes
[Figure: Source language -> Interlingua -> TL1, TL2, TL3, TL4, TL5, TL6]
2. Interlingual representations can also be used by NLP
systems for other multilingual applications
Overview
 Sounds great, but… there are many complexities
 Only one interlingua MT system has ever been made
operational in a commercial setting: the
KANT (Knowledge-based, Accurate Natural-language
Translation) system
 Only a few have been taken beyond the research prototype stage
Statistical Machine Translation (SMT)
 Deals with automatically mapping sentences in one human
language (for example French) into another human language (such
as English).
 The first language is called the source and the second language is
called the target.
 There are many SMT variants, depending on how translation is
modelled.
 Some approaches model translation as a string-to-string mapping, some
use tree-to-string models, and some use tree-to-tree models.
 All share the central idea that translation is automatic,
with models estimated from parallel corpora (source-target pairs)
and also from monolingual corpora (examples of target sentences).
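To make "models estimated from parallel corpora" concrete, here is a minimal sketch of one classic word-based model, IBM Model 1, trained with EM. The slides do not name a specific model, so this choice is an assumption. The three-sentence French-English corpus is invented for the example, and the NULL source word of the full model is left out for brevity.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t(e | f) by EM from (source_tokens, target_tokens) pairs."""
    target_vocab = {e for _, tgt in pairs for e in tgt}
    t = defaultdict(lambda: 1.0 / len(target_vocab))   # uniform start
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (e, f) links
        total = defaultdict(float)   # expected count of f being linked
        for src, tgt in pairs:
            for e in tgt:
                # E-step: share one unit of count for e among the words of src.
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize so that the values for each f sum to 1.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t

# Toy French-English parallel corpus, invented for this sketch.
CORPUS = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une fleur".split(), "a flower".split()),
]

t = train_ibm1(CORPUS)
best_for_fleur = max(["the", "a", "flower"], key=lambda e: t[(e, "fleur")])
```

No dictionary is supplied. "fleur" ends up most strongly associated with "flower" only because the two co-occur in two sentence pairs, while "the" and "a" are better explained by "la" and "une".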
