
CHAPTER 1

INTRODUCTION
1.1 GENERAL
Machine Translation is the automatic translation of text in one natural language into another using a computer. Initial attempts at Machine Translation in the 1950s did not meet with success. Today, internet users need fast automatic translation between languages. Several approaches, such as linguistics-based and Interlingua-based systems, have been used to develop machine translation systems, but currently statistical methods dominate the field. The Statistical Machine Translation (SMT) approach draws knowledge from automata theory, artificial intelligence, data structures and statistics. An SMT system treats translation as a machine learning problem: a learning algorithm is applied to a large amount of parallel corpora, that is, sentences in one language along with their translations. Learning algorithms create a model from the parallel sentences, and using this model, unseen sentences are translated. If parallel corpora are available for a language pair, then it is easy to build a bilingual SMT system. The accuracy of the system is highly dependent on the quality and quantity of the parallel corpus and on the domain. Parallel corpora are the fundamental resource for an SMT system and are constantly growing; they are available from the government's bilingual textbooks, newspapers, websites and novels.

SMT models give good accuracy for language pairs that are structurally similar, that are restricted to specific domains, or for which large bilingual corpora are available. If the sentences of a language pair are not structurally similar, the translation patterns are difficult to learn, and huge amounts of parallel corpora are required to learn them; statistical methods are therefore difficult to use for "less resourced" languages. To enhance the translation performance of dissimilar language pairs and less resourced languages, external preprocessing is required. This preprocessing is performed using linguistic tools.

In an SMT system, statistical methods are used to map source language phrases to target language phrases. Statistical model parameters are estimated from bilingual and monolingual corpora. There are two models in an SMT system: the translation model and the language model. The translation model takes parallel sentences and finds the translation hypotheses between phrases. The language model is based on the statistical properties of n-grams and uses monolingual corpora.
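As a concrete illustration of the n-gram statistics a language model is built on, the sketch below estimates bigram probabilities from a toy monolingual corpus. It is a minimal, unsmoothed illustration (real toolkits such as SRILM add smoothing and backoff); the corpus and sentences are invented.

```python
from collections import defaultdict

# Toy monolingual corpus; a real language model would use millions of sentences.
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "sleeps", "</s>"]]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        unigram_counts[w1] += 1

def bigram_prob(w1, w2):
    """Maximum-likelihood P(w2 | w1); real models add smoothing for unseen pairs."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sentence_prob(sent):
    """Probability of a sentence as the product of its bigram probabilities."""
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= bigram_prob(w1, w2)
    return p

print(sentence_prob(["<s>", "the", "dog", "barks", "</s>"]))  # 0.5, since "the" is followed by "dog" half the time
```

The language model scores candidate translations this way, rewarding fluent target-language word sequences.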

Several translation models are available for SMT. Some important models are the phrase based model, the syntax based model and the factored model. Phrase Based Statistical Machine Translation (PBSMT) is limited to the mapping of small text chunks. The factored translation model is an extension of the phrase based model that integrates linguistic information at the word level. This thesis proposes a pre-processing method that uses linguistic tools for the development of an English to Tamil machine translation system. In this translation system, external linguistic tools are used to augment the parallel corpora with linguistic information. The pre- and post-processing methodology proposed in this thesis is applicable to other language pairs too.

1.2 OVERVIEW OF MACHINE TRANSLATION

Machine translation is one of the oldest and most active areas in natural language processing. The word 'translation' refers to the transformation of text or speech from one language into another. Machine translation can be defined as the application of computers to the task of translating texts from one natural language to another. It is a focussed field of research drawing on the linguistic concepts of syntax, semantics, pragmatics and discourse.

Today a number of systems are available for producing translations, though they are not perfect. In the process of translation, whether carried out manually or automated by machines, the context of the text in the source language must be conveyed exactly in the target language. Translation is not just word-level replacement. A translator, whether machine or human, must interpret and analyse all the elements in the text, must be familiar with all the issues that arise during the translation process, and must know how to handle them. This requires in-depth knowledge of grammar, sentence structure and meaning, as well as an understanding of each language's culture in order to handle idioms and phrases originating from different cultures. Cross-cultural understanding is an important issue that affects the accuracy of the translation.

Designing an automatic machine translation system is a great challenge. It is difficult to translate sentences while taking into consideration all the required information; even human translators need several revisions to produce a perfect translation, and no two human translators generate identical translations of the same text in the same language pair. Hence it is an even greater challenge to design a fully automated machine translation system that produces high quality translations.

1.3 ROLE OF MACHINE TRANSLATION IN NLP


Natural Language Processing (NLP) is the field of computer science devoted to the
development of models and technologies enabling computers to use human languages
both as input and output [1]. The ultimate goal of NLP is to build computational models that equal human performance in the tasks of reading, writing, learning, speaking and understanding. Computational models are useful for exploring the nature of linguistic communication as well as for enabling effective human-machine interaction. Jurafsky and Martin (2005) [2] describe Natural Language Processing as "computational techniques that process spoken and written human language as language". According to Microsoft researchers, the goal of Natural Language Processing (NLP) is "to design and build software that will analyze, understand and generate languages that humans use naturally, so that eventually one will be able to address their computer like addressing another person".

Machine Translation is used for translating texts for assimilation purposes, which aids bilingual or cross-lingual communication, and also for searching, accessing and understanding foreign language information from databases and web pages [3]. In the field of information retrieval, a lot of research is going on in Cross-Language Information Retrieval (CLIR), i.e. information retrieval systems capable of searching databases in many different languages [4].

Construction of robust systems for speech-to-speech translation to facilitate "cross-lingual" oral communication has been the dream of speech and natural language researchers for decades. Machine translation is an important module in speech translation systems. Currently, computer assisted learning plays a major role in the academic environment. The use of Machine Translation in language learning has not yet received enough attention because of the poor quality of automatic translation output. With a good automatic translation system, students can improve their translation and writing skills. Such a system can break the language barriers of students and language learners.

1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM

Traditionally, rule based approaches were used to develop machine translation systems. A rule based approach feeds rules into the machine using appropriate representations, but feeding all linguistic knowledge into a machine would be very hard. In this context, the statistical approach to Machine Translation has some attractive qualities that have made it the preferred approach in machine translation research over the past two decades. Statistical translation models learn translation patterns directly from data and generalize them to translate new text. The SMT approach is largely language-independent, i.e. the models can be applied to any language pair.

Systems based on statistical methods are much better than traditional rule-based systems. In SMT, implementation and development times are much shorter, and the system can be improved by coupling in new models for reordering and decoding. It only needs parallel corpora to learn a translation system. In contrast, a rule based system needs transfer rules which only linguistic experts can generate. These rules are entirely dependent on the language pair involved, and defining general "transfer rules" is not an easy task, especially for languages with different structures [5].

An SMT system can be developed rapidly if an appropriate corpus is available. A Rule Based Machine Translation (RBMT) system requires a lot of development and customization cost before it reaches the desired quality threshold. Packaged RBMT systems have already been developed, and it is extremely difficult to reprogram their models and equivalences. Above all, RBMT involves a much longer development process requiring more human resources; an RBMT system is retrained by adding new rules and vocabulary, among other things [5].

Statistical Machine Translation works well for translations in a specific domain when the engine is trained with a bilingual corpus in that domain. An SMT system requires more computing resources in terms of hardware to train the models: billions of calculations take place during the training of the engine, and the computing knowledge required for it is highly specialized. However, training time can nowadays be reduced thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts, so that, in principle, building costs are also higher. SMT learns statistical patterns automatically, including a good handling of exceptions to rules. The rules governing the transfer in RBMT systems can certainly be seen as special cases of statistical patterns; nevertheless, they generalize too much and cannot handle exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic information, like RBMT. An SMT engine can generate improved translations when retrained or adapted; in contrast, an RBMT system generates very similar translations after retraining [5].

SMT systems, in general, have trouble handling the morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology can have severe consequences for the meaning of a sentence: they change the grammatical function of words or, through a wrong verb tense, the interpretation of the sentence. Factored translation models try to solve this issue by explicitly handling morphology on the generation side.

Another advantage of a Statistical Machine Translation system is that it generates a more natural, or closer to literal, translation of the input sentence. Symbolic approaches to machine translation take great human effort in language engineering. In knowledge based machine translation, for example, designers must first find out what kinds of linguistic, general common-sense and domain-specific knowledge are important for a task. Then they have to design an Interlingua representation for that knowledge and write grammars to parse input sentences; output sentences are generated from the Interlingua representation. All of this requires expertise in language technologies and involves tedious, laborious work.

The major advantage of a Statistical Machine Translation system is its learnability. Once a model is set up, it can learn automatically with well-studied algorithms for parameter estimation; the parallel corpus thus replaces human expertise for the task. The coverage of the grammar is also one of the serious problems in rule based systems, and Statistical Machine Translation is a good candidate to meet this criterion: it can learn to have good coverage as long as the training data is representative enough. It can also statistically model the noise in spoken language, so it does not have to make a binary keep/abandon decision and is therefore more robust to noisy data [5].

1.5 MOTIVATION OF THE THESIS
Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Even though machine translation was envisioned as a computer application in the 1950s, it is still considered an open problem [3].

The demand for machine translation is growing rapidly. As multilingualism is considered to be a part of democracy, the European Union funds EuroMatrixPlus [6], a project to build machine translation systems for all European language pairs, to automatically translate documents into its 23 official languages, which were previously translated manually. The United Nations (UN), which translates a large number of documents into several languages, has created bilingual corpora for language pairs such as Chinese-English and Arabic-English, which are among the largest bilingual corpora distributed through the Linguistic Data Consortium (LDC). On the World Wide Web, around 20% of web pages and other resources are available in national languages. Machine Translation can be used to translate these web pages and resources into the required language in order to understand their content, thereby reducing the effect of language as a barrier to communication [7].

In a linguistically diverse country like India, machine translation is a very essential technology. Human translation has been widely prevalent in India since ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science that have been translated among ancient to modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. At present, human translation in India finds application mainly in administration, media and education, and to a lesser extent in business, the arts, and science and technology [8].

India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India. English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is done manually; the use of automation is largely restricted to word processing. Two specific examples of high volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. Many resources such as news, weather reports and books in English are being manually translated into Indian languages; news and weather reports from around the world, in particular, are frequently translated from English into Indian languages by human translators. Since machine translation is faster and cheaper than human translation, there is clearly a large market for machine translation from English into Indian languages.

Tamil, a Dravidian language, is spoken by around 72 million people and has official status in the state of Tamilnadu and the Indian union territory of Puducherry. Tamil is also an official language of Sri Lanka and Singapore, and is spoken by significant minorities in Malaysia and Mauritius as well as by emigrant communities around the world. It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004 [9].

In this thesis, a methodology for English to Tamil Statistical Machine Translation is proposed, along with a pre-processing technique. This pre-processing method is used to handle the morphological variance between English and Tamil. Linguistic tools are developed to generate linguistically motivated data for the factored translation model for English-Tamil.

1.6 OBJECTIVE OF THE THESIS


The main aim of this research is to develop a morphology based prototype Statistical Machine Translation system for English to Tamil by integrating different linguistic tools. This research also addresses the issue of how a morphologically correct sentence is generated when translating from a morphologically simple language into a morphologically rich language. The objectives of the research are detailed as follows:

• Develop a pre-processing module (Reordering, Compounding and Factorization) for the English language sentence to transform its structure to be more similar to that of Tamil.

The pre-processing module for the source language includes three stages: reordering, factorization and compounding. In the reordering stage, the source language sentence is syntactically reordered according to Tamil language syntax. After reordering, the English words are factored into lemma and other morphological features. This is followed by the compounding process, in which the various function words are removed from the reordered sentence and attached as a morphological factor to the corresponding content word.

• Develop a Tamil Part-of-Speech (POS) tagger to label the Tamil words in a sentence.

The Tamil POS tagger is to be developed using a Support Vector Machine (SVM) based machine learning tool. A POS annotated corpus will be created for training the automatic tagger system.

• Develop a Morphological Analyser to segment the Tamil surface word into linguistic factors.

The morphological analyzer system is to be developed using a machine learning approach. The POS tagger and morphological analyser tools are to be used for pre-processing the Tamil language sentence. Linguistic information from these tools is to be incorporated into the surface words before SMT training.

• Build a morphology based prototype Factored Statistical Machine Translation (F-SMT) system for English to Tamil.

After pre-processing, the bilingual sentences are to be created and transformed into factored bilingual sentences. Monolingual corpora for Tamil are collected and factored using the Tamil POS tagger and morphological analyser. These sentences will be used for training the factored Statistical Machine Translation model.

• Develop a Tamil Morphological Generator system to generate the Tamil surface word form.

The morphological generator transforms the translation output into a grammatically correct target language sentence. It is used in the post-processing module of the English to Tamil machine translation system.

1.7 RESEARCH METHODOLOGY

1.7.1 Overall System Architecture

Tamil is a morphologically rich language with a relatively free word order following the Subject-Object-Verb (SOV) pattern. English is morphologically simple with a fixed word order following the Subject-Verb-Object (SVO) pattern. A baseline SMT system would not perform well for languages with different word order and disparate morphological structure. To resolve this, factored models were introduced in SMT. The factored model, a subtype of the SMT system, allows multiple levels of representation of a word, from the most specific level to more general levels of analysis such as lemma, part-of-speech and morphological features [10]. Figure 1.1 shows the overall architecture of the proposed English to Tamil SMT system. The preprocessing module is externally attached to the factored SMT system. This module converts bilingual corpora into factored bilingual corpora using morphology based linguistic tools and reordering rules. After preprocessing, the representation of the source language sentence syntax closely follows the sentence structure of the target language. This transformation decreases the complexity of alignment, which is one of the key problems in a baseline SMT system.

Parallel corpora are used to train the statistical translation models. Parallel corpora are created and converted into factored parallel corpora during preprocessing. English sentences are factored using the Stanford Parser tool, and Tamil sentences are factored using the Tamil POS Tagger and morphological analyzer. A monolingual corpus is collected from various newspapers and factored using the Tamil linguistic tools; this monolingual corpus is used in the language model. Finally, in post-processing, the Tamil morphological generator is used to generate a surface word from the output factors.

Figure 1.1 Morphology based Factored SMT for English to Tamil language

1.7.2 Details of Pre-processing English Language Sentence

A Machine Translation system for a language pair with disparate morphological structure needs appropriate pre-processing or modeling before translation. The preprocessing can be performed on the raw source language sentence to make it more appropriate for translating into the target language sentence. The pre-processing module for the English language sentence consists of reordering, factorization and compounding.

1.7.2.1 Reordering English Language Sentence

Reordering means rearranging the word order of the source language sentence into a word order that is closer to that of the target language sentence. It is an important process for languages which differ in their syntactic structure. The English and Tamil language pair has disparate syntactic structures: English word order is Subject-Verb-Object (SVO) whereas Tamil word order is Subject-Object-Verb (SOV). For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object [11]. English syntactic relations are retrieved from the Stanford Parser tool, and based on the reordering rules the source language sentence is reordered.

Reordering rules are handcrafted using the syntactic word order differences between English and Tamil; 180 reordering rules were created based on the sentence structures of the two languages. Reordering significantly improves the performance of the Machine Translation system. A lexicalized distortion reordering model is implemented in the Moses toolkit [180], but this automatic reordering is good only for short-range movements, so an external tool or component is needed to deal with long-distance reordering. Reordering is also one way of indirectly integrating syntactic information into the source language. 80% of English sentences are reordered correctly according to the rules developed. An example of English reordering is given in Figure 1.2.

English Sentence: I bought vegetables to my home.

Reordered English: I my to home vegetables bought

Tamil Sentence : நான் என் ைடய ட் ற்கு காய்கறிகள் வாங்கிேனன் .

Figure 1.2 Reordering of English language
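The rule-based reordering step can be illustrated with a deliberately simplified sketch. This is not one of the thesis's 180 handcrafted rules; it applies a single hypothetical rule (move the main verb to clause-final position) to a tagged English clause to mimic Tamil's SOV order, with invented simplified tags standing in for Stanford Parser output.

```python
# Illustrative sketch of SVO-to-SOV reordering: move the main verb of a
# tagged English clause to the end. The 'V' tag marking the main verb is
# a hypothetical simplification of real parser output.
def reorder_svo_to_sov(tagged):
    """tagged: list of (word, tag) pairs; returns the reordered word list."""
    verbs = [(i, wt) for i, wt in enumerate(tagged) if wt[1] == "V"]
    if not verbs:
        return [w for w, _ in tagged]          # no verb: leave order unchanged
    i, verb = verbs[0]
    rest = tagged[:i] + tagged[i + 1:]         # clause without the main verb
    return [w for w, _ in rest] + [verb[0]]    # verb moved to clause-final position

sentence = [("I", "PN"), ("bought", "V"), ("vegetables", "N")]
print(reorder_svo_to_sov(sentence))  # ['I', 'vegetables', 'bought']
```

A real rule set must also handle prepositional phrases, clause boundaries and auxiliaries, which is why 180 rules were needed.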

1.7.2.2 Factorization of English Language Sentence

Factored models can be used for morphologically rich languages in order to reduce the amount of bilingual data required. Factorization refers to splitting a word into linguistic factors and integrating them as a vector. The Stanford Parser is used to parse the English sentences. From the parse tree, linguistic information such as the lemma, part-of-speech tags, syntactic information and dependency information is retrieved. This linguistic information is integrated as factors on the original word.
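The factored representation used later in Table 1.1 (word | lemma | coarse POS | fine POS) can be sketched as follows. The parsed tokens here are hand-written stand-ins for what the Stanford Parser would supply.

```python
# Sketch of building a factored token representation. Each dictionary is
# a hypothetical stand-in for one token of parser output.
parsed = [
    {"word": "bought", "lemma": "buy", "coarse": "V", "fine": "VBD"},
    {"word": "vegetables", "lemma": "vegetable", "coarse": "N", "fine": "NNS"},
]

def to_factored(tokens, sep="|"):
    """Join each token's factors into the word|lemma|POS|tag vector form."""
    return [sep.join([t["word"], t["lemma"], t["coarse"], t["fine"]])
            for t in tokens]

print(to_factored(parsed))
# ['bought|buy|V|VBD', 'vegetables|vegetable|N|NNS']
```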

1.7.2.3 Compounding for English Language Sentence

Compounding is defined as adding additional morphological information to the morphological factor of source (English) language words [188]. The additional morphological information includes function words, subject information, dependency relations, auxiliary verbs and modal verbs, and is based on the morphological structure of the Tamil language sentence. In the compounding phase, the function words are identified from the English factored corpora using dependency information. Once found, the function words are removed from the factored sentence and attached as a morphological factor to the corresponding content word. The compounding process thus reduces the length of the English sentence. Like function words, auxiliary verbs and modal verbs are also removed and attached as a morphological factor of the source language word. The morphological representation of the English language sentence is then similar to that of the Tamil language sentence. This compounding step indirectly integrates dependency information into the source language factors. Table 1.1 and Table 1.2 show the factored and compounded sentences respectively.

Table 1.1 Factored English Sentences

I | i | PN | prn
my | my | PN | PRP$
home | home | N | NN
to | to | TO | TO
vegetables | vegetable | N | NNS
bought | buy | V | VBD .

Table 1.2 Compounded English Sentences

I | i | PN | prn_i
my | my | PN | PRP$
home | home | N |NN_to
vegetables | vegetable | N | NNS
bought | buy | V | VBD_1S.
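The Table 1.1 to Table 1.2 transformation can be sketched as below: a function word is deleted from the factored sentence and appended to the final factor of its content-word head. The head index used here is a hypothetical stand-in for real dependency-parse information.

```python
# Sketch of the compounding step: remove a function word and attach it
# as a suffix on the tag factor of its content-word head. Head/function
# indices are hypothetical stand-ins for dependency information.
def compound(factored, function_idx, head_idx):
    """factored: list of [word, lemma, pos, tag] lists.
    Moves the token at function_idx onto the tag factor at head_idx."""
    tokens = [list(t) for t in factored]        # copy; do not mutate input
    func_word = tokens[function_idx][0]
    tokens[head_idx][3] += "_" + func_word       # e.g. NN -> NN_to
    del tokens[function_idx]                     # drop the function word
    return ["|".join(t) for t in tokens]

sent = [["home", "home", "N", "NN"],
        ["to", "to", "TO", "TO"],
        ["vegetables", "vegetable", "N", "NNS"]]
print(compound(sent, function_idx=1, head_idx=0))
# ['home|home|N|NN_to', 'vegetables|vegetable|N|NNS']
```

Note how the output shortens the sentence by one token while preserving the function word inside the head's factor, mirroring Table 1.2.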

1.7.3 Details of Pre-processing for Tamil Language Sentence

Like the preprocessing of English sentences, Tamil sentences are also pre-processed using linguistic tools, namely a Part-of-Speech (POS) Tagger and a morphological analyzer. Tamil surface words are segmented into linguistic information, and this information is integrated as factors in the SMT training corpora. A Tamil sentence is given to the Part-of-Speech Tagger tool; using this part-of-speech information, a simplified part-of-speech tag is identified. Based on this simplified tag, the word is given to the Tamil morphological analyzer, which splits the word into its lemma and morphological information. Both the parallel corpora and the monolingual corpora are preprocessed in this stage.

1.7.3.1 Tamil Part-of-Speech Tagger

POS tagging means labeling grammatical classes, i.e. assigning a part-of-speech tag to each and every word of a given sentence. Tamil sentences are POS tagged using the Tamil POS Tagger tool. This tagger was developed using a Support Vector Machine (SVM) based machine learning tool, SVMTool [12], which makes the task simple and efficient. In this method, a POS tagged corpus is created and used to generate a trained model. The SVMTool creates models from tagged sentences, and untagged sentences are then tagged using those models. 42k sentences (approx. 5 lakh words) were tagged for this Part-of-Speech tagger with the help of an eminent Tamil linguist. The experiments were conducted with our tagged corpus, and an overall accuracy of 94.6% was obtained on a test set containing 6k sentences (approx. 35 thousand words).
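The train-then-tag workflow described above can be illustrated with a far simpler stand-in for the SVM-based tagger: a most-frequent-tag baseline learned from a tiny, invented tagged corpus (romanized words and hypothetical tags). It shows the workflow only, not SVMTool's feature-based learning.

```python
from collections import Counter, defaultdict

# Tiny, hypothetical tagged corpus of romanized Tamil words; a real
# system trains on 42k manually tagged sentences.
tagged_corpus = [[("naan", "PRP"), ("vanthen", "VF")],
                 [("naan", "PRP"), ("poonen", "VF")]]

def train(corpus):
    """Learn each word's most frequent tag from the tagged corpus."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, unknown="NN"):
    """Tag a sentence, falling back to a default tag for unseen words."""
    return [(w, model.get(w, unknown)) for w in words]

model = train(tagged_corpus)
print(tag(model, ["naan", "vanthen", "pazham"]))
# [('naan', 'PRP'), ('vanthen', 'VF'), ('pazham', 'NN')]
```

An SVM tagger improves on this baseline chiefly for ambiguous and unseen words, by using contextual window features rather than the word identity alone.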

1.7.3.2 Tamil Morphological Analyzer

After POS tagging, the sentences in the corpora are morphologically analyzed to find the lemma and morphological information. A morphological analyzer is a software tool used to segment a word into meaningful units. Morphological analysis of Tamil is a complex process because of the language's morphologically rich nature. Generally, rule based approaches are used to develop morphological analyzer systems; for a morphologically rich language like Tamil, the creation of rules is a challenging task. Here, a novel machine learning based approach is proposed and implemented for a Tamil verb and noun morphological analyzer. Additionally, this approach is tested on languages such as Malayalam, Telugu and Kannada.

This approach is based on sequence labeling and training by kernel methods. It captures the non-linear relationships and various morphological features of natural language words in a better and simpler way. In this machine learning approach, two training models are created for the morphological analyzer. The first model is trained using the sequence of input characters and their corresponding output labels; this trained Model-I is used for finding the morpheme boundaries. The second model is trained using the sequence of morphemes and their grammatical categories; this trained Model-II is used for assigning grammatical classes to each morpheme. An SVM based tool was used for training the data. This tool segments each word into its lemma and morphological information.
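The two-model pipeline above can be sketched as a pair of stages, with small hand-written lookups standing in for the trained Model-I (morpheme boundaries) and Model-II (grammatical categories). The romanized segmentation and labels are illustrative assumptions, not output of the real analyzer.

```python
# Toy two-stage sketch mirroring the Model-I / Model-II split. In the
# real system both stages are SVM sequence labelers; here they are
# hand-written lookups over a hypothetical romanized Tamil form.

# Stage 1 stand-in for Model-I: morpheme boundary segmentation.
BOUNDARIES = {"vanthaan": ["va", "nth", "aan"]}

# Stage 2 stand-in for Model-II: grammatical category per morpheme.
CATEGORIES = {"va": "lemma", "nth": "past", "aan": "3sm"}

def analyze(word):
    """Segment a surface word, then label each morpheme."""
    morphemes = BOUNDARIES.get(word, [word])                  # Model-I
    return [(m, CATEGORIES.get(m, "?")) for m in morphemes]   # Model-II

print(analyze("vanthaan"))
# [('va', 'lemma'), ('nth', 'past'), ('aan', '3sm')]
```

The key design point carried over from the thesis is the separation of concerns: boundary detection and category assignment are trained and applied independently.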

1.7.4 Factored SMT System for English to Tamil Language

Factored translation is an extension of Phrase Based Statistical Machine Translation (PBSMT) that allows the integration of additional morphological and lexical information, such as lemma, word class, gender, number, etc., at the word level on the source and the target languages. In the SMT system, three different toolkits are used, for translation modeling, language modeling and decoding: GIZA++, SRILM and Moses respectively. GIZA++ is a Statistical Machine Translation toolkit that is used to train IBM models 1-5 and an HMM word alignment model; it is an extension of GIZA, which was designed as part of the SMT toolkit. SRILM is a toolkit for language modeling that can be used in speech recognition, statistical tagging and segmentation, and Statistical Machine Translation. Moses is an open source SMT toolkit that allows translation models to be trained automatically for any language pair; all that is needed is a collection of translated texts (a parallel corpus). An efficient search algorithm quickly finds the highest probability translation among the exponential number of choices. Figure 1.3 explains the mapping of English factors to Tamil factors in the factored SMT system.

Figure 1.3 Mapping English Word Factors to Tamil Word Factors

Morphological, syntactic and semantic information can be integrated as factors in the factored translation model during training. Initially, the English factors "Lemma" and "Minimized-POS" are aligned to the Tamil factors "Lemma" and "M-POS"; then the "Minimized-POS" and "Compound-Tag" factors of the English word are aligned to the "Morphological information" factor of the Tamil word. Importantly, new Tamil surface words are not generated in the SMT decoder: only factors are generated by the SMT system, and the surface word is generated in the post-processing stage, where the Tamil morphological generator produces a Tamil surface word from the output factors. The system is evaluated with different sentence patterns; for sentences with simple, continuous and modal auxiliary patterns, 85% of the sentences are translated correctly, while for other sentence types the performance is 60%. The developed prototype machine translation system properly handles noun-verb agreement, which is an essential requirement for translating into morphologically rich languages like Tamil. BLEU and NIST evaluation scores clearly show that the factored model with an integration of linguistic knowledge gives better results for the English to Tamil Statistical Machine Translation system.
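The BLEU score mentioned above is built on modified (clipped) n-gram precision, which can be computed directly. The sketch below covers a single reference and omits the brevity penalty and geometric averaging of full BLEU; the example sentences are invented.

```python
from collections import Counter

# Modified n-gram precision, the core statistic of the BLEU score:
# candidate n-gram counts are clipped by their counts in the reference.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip repeats
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(modified_precision(cand, ref, 1))  # 5/6: all unigrams match except "sat"
print(modified_precision(cand, ref, 2))  # 3/5 of the bigrams match
```

Full BLEU combines these precisions for n = 1..4 with a brevity penalty; NIST additionally weights rare, informative n-grams more heavily.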

1.7.5 Post-processing for English to Tamil SMT

Post-processing is employed to generate a Tamil surface word from the output factors. In the factored SMT system, the aim is to translate factors only, not to generate a surface word. Due to the morphologically rich nature of the Tamil language, word generation is handled separately: a morphological generator is applied in the post-processing stage of the English to Tamil Machine Translation system. Post-processing transforms the translated factors into a grammatically correct target language sentence.

1.7.5.1 Tamil Morphological Generator

The Tamil morphological generator receives factors in the form “lemma + word_class +
morpho-lexical information”, where the lemma specifies the lemma of the word form to be
generated, word_class denotes the grammatical category, and the morpho-lexical
information states the type of inflection. These factors are the output of the proposed
machine translation system. A novel suffix-based approach is developed for the Tamil
morphological generator. Tamil noun and verb paradigms are classified based on their
case and tense markers respectively. In Tamil, verbs are classified into 32 paradigms
and nouns into 25 [13]. These noun and verb paradigms are used to create a suffix
table. The morphological generator is divided into three modules. The first module
takes the lemma and word class as input and gives the lemma's paradigm number and the
word's stem as output. This paradigm number is referred to as the column index; it
provides information about all the possible inflected forms of a lemma in a particular
word class. The second module takes the morpho-lexical information as input and gives
its index number as output: the index of the corresponding morpho-lexical information
factor is identified from the complete morpho-lexical information list, and this is
referred to as the row index. In the third module, a two-dimensional suffix table is
used to generate the word from the row index and column index. Finally, the identified
suffix is attached to the stem to create the word form. For pronouns, a pattern
matching approach is followed to generate the pronoun word form.
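The three-module lookup described above can be sketched in a few lines. The paradigm numbers, stems and suffixes below are illustrative placeholders invented for the demo; the actual tables are built from the 32 verb and 25 noun paradigms [13].

```python
# Module 1 data: (lemma, word class) -> (paradigm number, stem)
LEXICON = {
    ("padi", "verb"): (3, "padi"),   # hypothetical verb entry
}

# Module 2 data: morpho-lexical information -> row index
MORPH_INDEX = {
    "past+3sg": 0,
    "present+3sg": 1,
}

# Module 3 data: suffix table indexed by (row, column); column = paradigm number.
# Only the cells needed for this demo are filled in.
SUFFIX_TABLE = {
    (0, 3): "ththaan",   # illustrative past 3sg suffix for paradigm 3
    (1, 3): "kkiraan",   # illustrative present 3sg suffix for paradigm 3
}

def generate(lemma, word_class, morph_info):
    """Generate a surface word from the output factors (toy version)."""
    paradigm, stem = LEXICON[(lemma, word_class)]   # module 1: column index
    row = MORPH_INDEX[morph_info]                   # module 2: row index
    suffix = SUFFIX_TABLE[(row, paradigm)]          # module 3: table lookup
    return stem + suffix                            # attach suffix to stem

print(generate("padi", "verb", "past+3sg"))  # padiththaan
```

The same (row, column) addressing scheme extends to the full suffix table; only the size of the three lookup structures changes.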

1.8 RESEARCH CONTRIBUTIONS

This thesis shows how pre-processing and post-processing can be used to improve
statistical machine translation from English to Tamil. The main focus of this research
is on translation from English into Tamil, together with the development of linguistic
tools for the Tamil language. The contributions are:
• Introduced a novel pre-processing method for English sentences based on
reordering and compounding. Reordering rearranges the English sentence
structure according to the Tamil sentence structure; compounding removes
function words and auxiliaries and merges them into the morphological factor
of the content word. Together, these steps reorganize the English sentence
according to the structure of the corresponding Tamil sentence.
• Created a Tamil POS tagger and a tagged corpus of 5 lakh (500,000) words
as part of the pre-processing of Tamil sentences.
• Introduced a novel method for developing a Tamil morphological analyser
based on a machine learning approach. The corpora developed for this
approach contain 4 lakh (400,000) morphologically segmented Tamil verbs
and 2 lakh (200,000) Tamil nouns.
• Introduced a novel algorithm for developing a Tamil morphological generator
using paradigms and suffixes. With this generator, it is possible to generate
10,000 distinct word forms of a single Tamil verb.
• Successfully integrated these pre-processing and post-processing modules and
developed the English to Tamil factored SMT system.

1.9 ORGANIZATION OF THE THESIS

This thesis is divided into ten chapters. Figure 1.4 shows the organization of the thesis.

Chapter 1: Introduction
Chapter 2: Literature Survey
Chapter 3: Background
Chapter 4: Preprocessing English Language / Preprocessing Tamil Language
Chapter 5: POS Tagger for Tamil
Chapter 6: Morphological Analyzer for Tamil
Chapter 7: Factored SMT
Chapter 8: Morphological Generator for Tamil
Chapter 9: Experiments and Results
Chapter 10: Conclusion

Figure 1.4 Thesis Organization

This thesis is organized as follows. A general introduction is presented in Chapter 1.
Chapter 2 presents the literature survey of linguistic tools and available machine
translation systems for Indian languages. In Chapter 3, the theoretical background and
language processing for Tamil are described. Chapter 4 covers the different stages of
preprocessing English sentences: reordering, factorization and compounding. Chapters 5
and 6 present the preprocessing of Tamil sentences using linguistic tools. In Chapter
5, the development of the Tamil POS tagger is explained, and Chapter 6 illustrates the
morphological analyzer for Tamil, which is developed using the new machine learning
based approach; detailed descriptions of the method and data resources are also given.
Chapter 7 presents the factored SMT system for English to Tamil and explains how the
factored corpora are trained and decoded using the SMT toolkit. Post-processing for
Tamil is discussed in Chapter 8, where the morphological generator is used as the
post-processing tool; this chapter also gives a detailed description of the new
algorithm developed for the Tamil morphological generator. Chapter 9 explains the
experiments and results of the English to Tamil Statistical Machine Translation system,
and describes the training and testing details of the SMT toolkit; the output of the
developed system is evaluated using the BLEU and NIST metrics. Finally, Chapter 10
concludes the thesis and outlines future directions for this research.
