
SENTENCE TRANSLATION FOR KANNADA USING MORPHOLOGICAL ANALYSER AND GENERATOR

Mallamma V Reddy (1), Dr. M. Hanumanthappa (2)
(1, 2) Department of Computer Science and Applications, Bangalore University, Bangalore, INDIA
(1) mallamma_vreddy@yahoo.co.in
(2) hanu6572@hotmail.com

ABSTRACT

Kannada is one of the major Dravidian languages of India and ranks 27th among the most spoken languages in the world. It still lacks computerized grammar checking methods for a given Kannada sentence. As far as computational linguistics is concerned, Kannada lags far behind languages such as Telugu. Writing grammar productions for any South Indian language is somewhat difficult because these languages are highly inflected, with three gender forms and two number forms. In most Indian languages, including Kannada, a verb ends with a token that indicates the gender of the person (noun/pronoun). A morphological analyser and generator is an essential component of any NLP application. The morphological analyzer can simultaneously serve as a stemmer, part-of-speech tagger and spell checker, and hence becomes a very efficient tool for content management. This paper proposes a way of using a morphological analyzer and generator for Kannada sentence translation, with simple illustrations.

Keywords: Cross Language Information Retrieval (CLIR), Morphological Analyzer and Generator, Part of Speech Tagger (POS)

1. INTRODUCTION

The main steps in Natural Language Processing are:

Morphological Analysis: Individual words are analyzed into their components, and non-word tokens such as punctuation are separated from the words.

Syntactic Analysis: Linear sequences of words are transformed into structures that show how the words relate to each other.

Semantic Analysis: The structures created by the syntactic analyzer are assigned meanings.

Discourse Integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meanings of the sentences that follow it.

Pragmatic Analysis: The structure representing what was said is reinterpreted to determine what was actually meant.

Morphological analysis and generation form the first and most important phase of Natural Language Processing in Cross Language Information Retrieval (CLIR) [1]. In this paper we present the use of the morphological analyzer and generator specifically for Kannada sentence translation.

Machine translation throws up many challenges and opens up many opportunities for work. Some of the problems relate to grammars; others pertain to word analysis, bilingual dictionaries, language generation, etc. In machine translation [2], morphological analysis is the main process: it is the first step in analyzing the source language.

The language faculty in human beings has the ability to analyze a given language. There are various methods by which a morphological analyzer can be built; we propose the Suffix Stripping Method, which is found to be very economical. An analyzer can analyze the inflected form of a word into suffixes and stem even if the stem is not entered in the dictionary. The general format of the morphological analyzer of Kannada is

Word → stem/root + suffixes

The basic principle of morphological generation is to obtain word forms from a root and a set of properties (lexical category and morphological properties). A morphological generator needs to handle the different syntactic categories, such as nouns, verbs, adjectives and adverbs, separately, since the addition of morphological constituents to each of these categories depends on different types of information. The Suffix Joining Method is used for building morphological generators. The identified suffixes are used along with the morphophonemic rules and morphotactics to develop the morphological generator. The general format of the morphological generator is

Stem/root + suffixes → Word

2. MORPHOLOGICAL ANALYSIS

The morphological analyzer and the morphological generator are two essential and basic tools for building any language processing application. Morphological analysis is the process of providing grammatical information about a word given its suffix. A morphological analyzer is a computer program that takes a word as input and produces its grammatical structure as output. It returns the root/stem of the word along with grammatical information that depends on the word category: for nouns it provides gender, number and case information, and for verbs it provides tense, aspect and modality.

e.g. children → child + n + s (pl) (English)

Grammatical structure: morpheme order, feature values, suffixes.
Feature value: gender, number, person, etc.

Morphology deals with all combinations that form words or parts of words. There are two broad classes of morphemes, stems and affixes. The stem is the "main morpheme" of the word, supplying the main meaning.

e.g., eat in eat + ing.
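As a minimal sketch of the two formats, Word → stem/root + suffixes for analysis and stem/root + suffixes → Word for generation, the following Python snippet splits and rejoins a few English words. The root set, the suffix list and the function names are illustrative assumptions and are not part of the system described in this paper.

```python
# Illustrative English data only; both collections are assumptions.
ROOTS = {"eat", "tell", "boy"}
SUFFIXES = ["ing", "ed", "s"]


def split_word(word):
    """Analysis direction: Word -> (stem, suffix).

    Returns the first listed suffix whose removal leaves a known root,
    or (word, "") when no suffix can be separated.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in ROOTS:
                return stem, suffix
    return word, ""


def join_word(stem, suffix):
    """Generation direction: stem/root + suffix -> Word."""
    return stem + suffix


if __name__ == "__main__":
    for w in ["eating", "telling", "boys", "eat"]:
        stem, suffix = split_word(w)
        print(w, "->", stem, "+", suffix or "(none)")
```

Real analysis also needs morphophonemic rules (for example, English "spies" is not simply "spy" + "s"), which this sketch deliberately leaves out.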
Affixes: An affix is a bound morph that is realized as a sequence of phonemes. Concatenative morphology (so called because a word is composed of a number of morphemes concatenated together) uses the following types of affixes:

Prefixes: A prefix is an affix that is attached in front of a stem. E.g. re- in readmission (re + admission).

Suffixes: A suffix is an affix that is attached after the stem. Suffixes are used derivationally and inflectionally. E.g. -ing in telling.

Circumfixes: A circumfix is the combination of a prefix and a suffix which together express some feature. Circumfixes can be viewed as two affixes applied one after the other. E.g. German ge- ... -t in ge + sag + t ([have] said).

In non-concatenative morphology (where morphemes are combined in more complex ways) the stem morpheme is split up. The following types of affixes are used:

Infixes: Infixes are attached in between some phonemes of a stem.

Transfixes: Transfixes are a special kind of infix that involves not only discontinuous affixes but also discontinuous bases.

2.1 THE ROLE OF MORPHOLOGY IN DIFFERENT LANGUAGES

Morphology is not equally prominent in all spoken languages. What one language expresses morphologically [3] may be expressed by a separate word or left implicit in another language. For example, English expresses plural nouns by means of morphology (forms like boys, spies, vehicles, where the morpheme with its variant forms expresses plurality), but Yoruba (a language of south-western Nigeria) uses a separate word to express the same meaning. Thus ‘ookunrin’ means ‘the man’, and ‘a won’ can be used to express the plural: ‘the men’. Quite generally, we can say that English makes more use of morphology than Yoruba. But there are many languages that make more use of morphology than English. For instance, Sumerian uses morphology to distinguish between ‘he went’ and ‘I went’, and between ‘he went’ and ‘he went to him’, where English must use separate words. The terms analytic and synthetic are used to describe the degree to which morphology is used in a language. Languages like Yoruba, Vietnamese or English, where morphology plays a relatively modest role, are called analytic. Traditionally, linguists distinguish between the following types of languages with regard to morphology:

· Isolating languages (e.g. Mandarin Chinese): there are no bound forms, i.e., no affixes that can be attached to a word. The only morphological operation is composition.

· Agglutinative languages (e.g. Ugro-Finnic and Turkic languages): all bound forms are either prefixes or suffixes, i.e., they are added to a stem like beads on a string. Every affix represents a distinct morphological feature, and every feature is expressed by exactly one affix.

· Inflectional languages (e.g. Indo-European languages): distinct features are merged into a single bound form (a so-called portmanteau morph). The same underlying feature may be expressed differently, depending on the paradigm.

· Polysynthetic languages (e.g. Inuit languages): these languages express more of syntax in morphology than other languages, e.g., verb arguments are incorporated into the verb.

This classification is quite artificial. Real languages rarely fall cleanly into one of the above classes; even Mandarin has a few suffixes. Moreover, the classification mixes the aspect of what is expressed morphologically with the means of expressing it.

3. RELATED WORK

The most competent approach to morphological generation is the use of Finite State Transducers (Alicia Garrido et al., 1999). A morphological analyzer and generator based on letter transducers was developed by Alicia Garrido. Perez Aguiar used an intuitive pattern-matching approach to develop a morphological generator for Spanish. Guido Minnen and his team developed a morphological generator based on finite state techniques, implemented using the widely available Unix Flex utility (Guido Minnen et al., 2000). For Indian languages, many attempts have been made to build morphological generators. A Hindi morphological generator has been developed based on a database-driven approach (Vishal Goyal et al., 2008). TelMore, a morphological generator for Telugu, is based on linguistic rules and a Perl program (Madhavi G et al., 2006). A morphological generator has been designed for the syntactic categories of Kannada using a paradigm-based approach and sandhi rules (P. Anandan et al., 2001). Finite state machines have been used to develop a morphological generator for Kannada (A. G. Menon et al., 2009).

4. KANNADA MORPHOLOGY

Kannada is a morphologically rich language in which morphemes combine with root words in the form of suffixes. Kannada grammarians divide the words of the language into three categories, namely:

i) Declinable words (namapada): The morphology of declinable words, as in many Dravidian languages, is fairly simple compared to that of verbs. Kannada words have three genders: masculine, feminine and neuter. Declinable and conjugable words have two numbers: singular and plural.

ii) Verbs (kriyapada) or conjugable words: The verb is much more complex than the noun. There are three persons, namely first, second and third person. The tense of a verb is past, present or future. Aspect may be simple, continuous or
perfect. Verbs occur as the last constituent of the sentence. They can be broadly divided into finite and non-finite forms. Finite verbs have nothing added to them and are found in the last position of a sentence. They are marked for tense with Person-Number-Gender (PNG) [6] markers. Non-finite verbs, on the other hand, cannot stand alone. They are always marked for tense without a PNG marker.

iii) Uninflected words (avyaya): Uninflected words may be classified as adverbs, postpositions, conjunctions and interjections. Some example words of this class are haage, mele, tanaka, alli, bagge, anthu, etc.

Fig 1: A formal Grammar for Kannada Nouns

Fig 2: A formal Grammar for Kannada Verbs

4.1 MORPHOPHONEMICS

In Kannada, adjacent words are often joined and pronounced as one word. Such word combinations occur in two ways: Sandhi and Samasa. Sandhi (morphophonemics) deals with the changes that occur when two words or separate morphemes come together to form a new word. A few sandhi types are native to Kannada [5] and a few are borrowed from Sanskrit. Our tool handles only Kannada sandhi; it does not handle Samasa.

Kannada sandhi is of three types: lopa, agama and adesha sandhi. While lopa and agama take place both in compound words and at the junction of the crude forms of words and suffixes, adesha sandhi occurs only in compound words. A detailed description of the sandhi types can be found in [10].

5. CONSTRUCTION OF MORPHOLOGICAL ANALYZER FOR KANNADA

Morph Analyzer: The analyzer, based on the suffix stripping approach, is modeled so that it analyzes the inflected form of a word into suffixes and stem. It does so by making use of a root/stem dictionary (for identifying legitimate roots/stems), a list of suffixes comprising all possible suffixes that the various categories can take (in order to identify a valid suffix), and the morpheme sequencing rules. The Root Dictionary contains a list of roots, each with its lexical category and features.

The suffix stripping algorithm is thus a method of morphological analysis that relies on these three resources: the root/stem dictionary, the suffix list and the morpheme sequencing rules. This method is economical. Once the suffixes are identified, the stem is obtained by removing the suffixes and applying the proper morpheme sequencing rules.

In order to identify the legitimate roots/stems, the root/stem dictionary needs to be as exhaustive as possible. Considering this fact, the analyzer is designed to provide three types of output:

Correct analysis: This is obtained on the basis of a complete match of suffixes and rules and the existence of the analyzed stem/root in the root dictionary.

Probable analysis: This is obtained on the basis of either a match of the suffixes and rules, even if the root/stem is not found in the dictionary, or a match of the suffixes without any supporting rule or existing root in the dictionary.

Unprocessed words: These are the words which have remained unanalyzed, due either to the absence of the suffix in the suffix list or to the absence of the rule in the rule list.

Morph Generator: The aim of morphological generation is to produce the inflected form of a word according to the features and values in the Feature Structure. It is also necessary to reuse the linguistic resources created for analysis. From a practical point of view, morphological generation is the inverse process of analysis, namely the process of converting the internal representation of a word to its surface form. The same rule definitions that are used for analysis can be used to generate the desired word form.

Morphological generation mainly deals with the concatenation of the corresponding suffixes with the root word to form a word of a specific grammatical category. The suffix joining method has been used to perform this task. The input to the morphological generator is the root word; the generator inflects this word according to the morphology of the respective language and outputs the target forms of the word. The morphological structure of the Kannada verb is quite complex, since it carries person, gender and number markings and also combines with auxiliaries that indicate aspect, mood, etc. While generating a verb, the gender, number and person of the subject are needed in order to select the appropriate suffix for the selected tense. So while going from English to Kannada, there are about eleven different forms for a single stem in Kannada.

Fig 3: Block Diagram of English-Kannada Translation
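The following Python sketch illustrates, under simplifying assumptions, how a suffix stripping analyzer of the kind described in this section could report a correct analysis, a probable analysis or an unprocessed word, and how suffix joining could select a verb suffix from person, number and gender. The tiny root dictionary, suffix list, rule table, ASCII transliterations and all function names are hypothetical; sandhi is not handled, and "probable" here simply means that a suffix matched while the root or a licensing rule was missing.

```python
# Hypothetical toy resources in ASCII transliteration (assumed for illustration).
ROOT_DICT = {"mane": "noun"}                      # root -> lexical category
SUFFIXES = {"galu": {"number": "pl"},             # suffix -> feature values
            "ge": {"case": "dative"}}
RULES = {"noun": {"galu", "ge"}}                  # category -> suffixes it may take

# A few present-tense Person-Number-Gender endings (a partial, assumed table).
PRESENT_PNG = {("3", "sg", "m"): "aane",
               ("3", "sg", "f"): "aaLe",
               ("1", "sg", None): "eene"}


def analyze(word):
    """Suffix stripping with the three output types of Section 5."""
    if word in ROOT_DICT:                         # bare root, nothing to strip
        return {"status": "correct", "root": word, "suffix": None}
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            root = word[: -len(suffix)]
            category = ROOT_DICT.get(root)
            licensed = category is not None and suffix in RULES.get(category, set())
            status = "correct" if licensed else "probable"
            return {"status": status, "root": root, "suffix": suffix}
    return {"status": "unprocessed", "root": None, "suffix": None}


def generate_verb(stem, person, number, gender=None):
    """Suffix joining: stem + present-tense marker + PNG ending.

    The stem is given without its final vowel so that no sandhi rule is needed.
    """
    return stem + "utt" + PRESENT_PNG[(person, number, gender)]


if __name__ == "__main__":
    for w in ["manegalu", "shaalegalu", "mele"]:
        print(w, analyze(w))
    print(generate_verb("maaD", "3", "sg", "m"))
```

With these toy resources, "manegalu" (houses) is a correct analysis because "mane" is in the dictionary, "shaalegalu" (schools) is only probable because its root is missing, and the uninflected word "mele" stays unprocessed.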
5.1 BILINGUAL DICTIONARY

A bilingual dictionary is a crucial part not only of machine translation but also of other natural language processing applications such as cross-language information retrieval. Creating a bilingual dictionary in the form of lexemes or words is a difficult task, because a word often covers more than one area of meaning and these multiple meanings do not correspond to a single word in the target language. Machine translation systems are basically linked to electronic dictionaries. The content of the dictionaries must be adequate in both quantity and quality: the vocabulary coverage must be extensive and appropriately selected, and the translation equivalents carefully chosen, if the target language output is to be satisfactory or indeed even possible. The size and quality of the dictionary limit the scope and coverage of a system and the quality of translation that can be expected [4]. The dictionary entries are based on lexical stems of a specified category, strictly monolingual analysis and generation dictionaries, and transfer dictionaries based on language-pair-specific information.

5.2 ALGORITHM FOR MORPHOLOGICAL ANALYZER AND GENERATOR

In this section we describe the new algorithm developed for the morphological analyzer and generator. The main advantage of this algorithm [6] is that it is simple and accurate.

Algorithm

Step 1: Get the word to be analyzed.
Step 2: Check whether the entered word is found in the Root Dictionary.
Step 3: If the word is found in the dictionary, stop stripping (the root has been found); else continue with Step 4.
Step 4: Separate any suffix from the right-hand side of the word.
Step 5: If a suffix is present in the word, check the availability of the suffix in the dictionary.
Step 6: If it is available, remove the suffix, re-initialize the word without the identified suffix, and go to Step 2.
Step 7: Repeat this process until the root/stem word is found in the dictionary.
Step 8: Store the English root/stem word in a variable and get the corresponding Kannada word from the bilingual dictionary.
Step 9: Check which grammatical features the given English word has and generate the corresponding features for the Kannada word.
Step 10: Exit.

6. EXPERIMENTAL SETUP

Several corpora were collected to estimate the parameters of the proposed models and to evaluate the performance of the proposed approach. The training corpus, BUBShabdaSagara-2011, consisted of 9000 dictionary words for Kannada and was composed of a bilingual word list. In the experiment, the performance of word translation extraction was evaluated on the basis of precision and recall rates at the word level, since we considered exactly one word in the source language and one translation in the target language at a time.

The word-level recall and precision rates were defined as follows:

(1)

(2)

7. RESULTS

The Cross Language Information Retrieval Tool [7] is built using ASP.NET as the front end; for the database, the Kannada text is encoded using the encoding system. English and Kannada are the source language and the target language, respectively, in our query translation. All the experiments carried out here involve the same set of English queries and the same query expansion, translation and retrieval method. The only difference between the experimental conditions is in which dictionaries are used in the query translation. The experimental results are shown in Fig 4.

Fig 4: Sample Result
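To make the procedure of Section 5.2 and the word-level evaluation of Section 6 concrete, the Python sketch below strips suffixes from an English word until a root is found, looks the root up in a bilingual dictionary and joins the matching Kannada suffix. It then scores the outputs with the standard definitions of precision (correct outputs over produced outputs) and recall (correct outputs over reference entries); using these standard definitions is an assumption. All dictionaries, transliterations and function names are hypothetical toy data, and the stripping loop is greedy with no sandhi handling.

```python
# Hypothetical toy resources (assumed for illustration only).
ROOT_DICT = {"house", "tree", "school"}            # English roots (Steps 2-3)
SUFFIXES = ["ing", "ed", "s"]                      # English suffix list (Steps 4-5)
BILINGUAL = {"house": "mane", "tree": "mara", "school": "shaale"}  # Step 8
ENGLISH_SUFFIX_FEATURE = {"s": "pl"}               # English suffix -> feature
KANNADA_FEATURE_SUFFIX = {"pl": "galu"}            # feature -> Kannada suffix (Step 9)


def find_root(word):
    """Steps 2-7: strip listed suffixes from the right until a root is found."""
    stripped = []
    while word not in ROOT_DICT:
        suffix = next((s for s in SUFFIXES
                       if word.endswith(s) and len(word) > len(s)), None)
        if suffix is None:                         # nothing left to strip
            return None, []
        word = word[: -len(suffix)]
        stripped.insert(0, suffix)
    return word, stripped


def translate_word(word):
    """Steps 8-9: map the root and regenerate its features in Kannada."""
    root, suffixes = find_root(word)
    if root is None or root not in BILINGUAL:
        return None
    target = BILINGUAL[root]
    for suffix in suffixes:
        feature = ENGLISH_SUFFIX_FEATURE.get(suffix)
        if feature in KANNADA_FEATURE_SUFFIX:
            target += KANNADA_FEATURE_SUFFIX[feature]
    return target


def precision_recall(outputs, references):
    """Word-level precision and recall; None in outputs means no translation."""
    produced = {w: t for w, t in outputs.items() if t is not None}
    correct = sum(1 for w, t in produced.items() if references.get(w) == t)
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(references) if references else 0.0
    return precision, recall


if __name__ == "__main__":
    references = {"houses": "manegalu", "trees": "maragalu",
                  "school": "shaale", "cats": "bekkugalu"}
    outputs = {w: translate_word(w) for w in references}
    print(outputs)
    print(precision_recall(outputs, references))
```

In this toy run "cats" receives no translation because "cat" is missing from the root dictionary, which lowers recall but not precision. This mirrors the observation that coverage depends directly on dictionary size.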
8. CONCLUSION

This paper describes the design and development of a morphological analyzer and generator for the Kannada language. Dictionaries are of critical importance in machine translation. A bilingual dictionary, or translation dictionary, is a specialized dictionary used to translate words or phrases from one language to another. Dictionaries are the largest components of an MT system in terms of the amount of information they hold. The proper functioning of a morphological generator requires efficient generation of a word once its root or stem and the corresponding feature values are provided. Suffix Stripping for morphological analysis and Suffix Joining for morphological generation of words proved to be an efficient method. Since words are formed by adding suffixes to a root, most words can take their POS tag from the root or stem. The results of the first phase of the suffix stripping approach have been fairly encouraging. It was observed that, with an average of 8000 to 9000 root entries, the affix stripping approach gives around 70% coverage. The coverage of the system is directly related to the size of the dictionary. We hope to expand the system and make it more robust by increasing the dictionary size.

Acknowledgement

This work is part of the major research project entitled Cross-Language Information Retrieval, sanctioned to Dr. M. Hanumanthappa, PI-UGC-MH, Department of Computer Science and Applications, by the University Grants Commission (UGC). We thank the UGC for financial assistance. This paper is a continuation of the project carried out at Bangalore University, Bangalore, India.

REFERENCES

[1]. F. Och and H. Ney. "A Systematic Comparison of Various Statistical Alignment Models". Computational Linguistics, 29(1), 2003, pp. 19-51.

[2]. G. Grefenstette. "The Problem of Cross-language Information Retrieval". In Cross-language Information Retrieval. Kluwer Academic Publishers, 1998, pp. 1-9.

[3]. Ritchie, Graeme. "The Lexicon". In Whitelock, eds. 1985, pp. 225-256.

[4]. Knowles, Francis. "The Pivotal Role of the Dictionaries in a Machine Translation System". In Lawson, Veronica, ed. "Practical Experience of Machine Translation". North-Holland, 1982.

[5]. http://ccat.sas.upenn.edu/plc/kannada/

[6]. Jisha P. Jayan, Rajeev R R, Dr. S Rajendran, "Morphological Analyser and Morphological Generator for Malayalam - Tamil Machine Translation", International Journal of Computer Applications (0975-8887), Volume 13, No. 8, January 2011.

[7]. Mallamma V. Reddy, Dr. M. Hanumanthappa, "CLIR Project (English to Kannada and Telugu)", http://bangaloreuniversitydictionary//menu.asp
