
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 9, December 2010

Structural Analysis of Bangla Sentences of Different Tenses for Automatic Bangla Machine Translator
Md. Musfique Anwar, Nasrin Sultana Shume and Md. Al-Amin Bhuiyan
Dept. of Computer Science & Engineering, Jahangirnagar University, Dhaka, Bangladesh
Email: musfique.anwar@gmail.com, shume_sultana@yahoo.com, alamin_bhuiyan@yahoo.com

Abstract
This paper addresses structural mappings of Bangla sentences of different tenses for machine translation (MT). Machine translation requires analysis, transfer and generation steps to produce target language output from a source language input. The structural representation of a Bangla sentence encodes the information of that sentence, and a transfer module has been designed that generates English sentences using Context Free Grammar (CFG). The MT system generates a parse tree according to the parse rules, and a lexicon provides the properties of each word and its meaning in the target language. The MT system can be extended to paragraph translation.

Keywords:
Machine Translation, Structural representation, Context Free Grammar, Parse tree, Lexicon.

1. Introduction
Machine translation (MT) means translation using computers. A machine translator is a computerized system responsible for producing translations from one natural language to another, with or without human assistance. The term excludes computer-based translation tools that support translators by providing access to on-line dictionaries, remote terminology databanks, transmission and reception of texts, etc. The core of MT itself is the automation of the full translation process.

To interpret any language, we first need to determine the sentence structure using grammatical rules. Parsing or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given formal grammar. Parsing a sentence produces the structural representation (SR), or parse tree, of the sentence [1].

Analysis and generation are the two major phases of machine translation. The analysis phase relies on two main techniques: morphological analysis and syntactic analysis. A morphological parsing strategy decomposes a word into morphemes, given a lexicon list, proper lexicon order and different spelling change rules [2]. That is, it incorporates the rules by which words are analyzed. For example, in the sentence "The young girl's behavior was unladylike", the word "unladylike" can be divided into the morphemes un ("not", prefix), lady ("well behaved female", root word) and like ("having the characteristics of", suffix). Morphological information about words is stored together with their syntactic and semantic information.

The purpose of syntactic analysis is to determine the structure of the input text. This structure consists of a hierarchy of phrases, the smallest of which are the basic symbols and the largest of which is the sentence. It can be described by a tree, known as a parse tree or syntax tree, with one node for each phrase. Basic symbols are represented by leaf nodes and other phrases by interior nodes. The root of the tree represents the sentence.

Syntactic analysis aims to identify the sequence of grammatical elements (e.g. article, verb, preposition) or of functional elements (e.g. subject, predicate), the grouping of grammatical elements (e.g. nominal phrases consisting of nouns, articles, adjectives and other modifiers) and the dependency relations, i.e. hierarchical relations, between them. If we can identify the syntactic constituents of a sentence, it is easier to obtain its structural representation [3].

Most grammar rule formalisms are based on the idea of phrase structure: strings are composed of substrings called phrases, which come in different categories. There are three types of phrases in Bangla: Noun Phrase, Adjective Phrase and Verb Phrase. Simple sentences are composed of these phrases. Complex and compound sentences are composed of simple sentences [4].

Within the early standard transformational models it is assumed that basic phrase markers are generated by phrase structure rules (PS rules) of the following sort [5]:

S → NP AUX VP
NP → ART N
VP → V NP

The PS rules given above tell us that an S (sentence) can consist of, or can be expanded as, the sequence NP (noun phrase) AUX (auxiliary verb) VP (verb phrase). The rules also indicate that NP can be expanded as ART N and that VP can be expanded as V NP.


http://sites.google.com/site/ijcsis/ ISSN 1947-5500


This paper implements a technique to perform structural analysis of Bangla sentences of different tenses using Context Free Grammar rules.
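The phrase structure rules S → NP AUX VP, NP → ART N and VP → V NP quoted above can be read as a small recognizer. The following is a minimal sketch, not the authors' implementation: the toy English lexicon and the names `PS_RULES`, `TOY_LEXICON`, `expand` and `parse` are assumptions introduced only for illustration. Because each non-terminal here has exactly one expansion, no backtracking is needed.

```python
# Phrase structure rules from the text: S -> NP AUX VP, NP -> ART N, VP -> V NP.
PS_RULES = {
    "S": ["NP", "AUX", "VP"],
    "NP": ["ART", "N"],
    "VP": ["V", "NP"],
}

# Hypothetical toy lexicon (an assumption, not from the paper): word -> category.
TOY_LEXICON = {
    "the": "ART", "a": "ART",
    "boy": "N", "book": "N",
    "is": "AUX",
    "reading": "V",
}


def expand(symbol, tags, pos):
    """Match `symbol` starting at tags[pos]; return (tree, next_pos) or None."""
    if symbol not in PS_RULES:  # terminal category such as ART, N, AUX, V
        if pos < len(tags) and tags[pos][1] == symbol:
            return (symbol, tags[pos][0]), pos + 1
        return None
    children = []
    for child in PS_RULES[symbol]:
        result = expand(child, tags, pos)
        if result is None:
            return None
        subtree, pos = result
        children.append(subtree)
    return (symbol, children), pos


def parse(sentence):
    """Return a parse tree, or None if the sentence does not fit the rules."""
    words = sentence.lower().split()
    if any(word not in TOY_LEXICON for word in words):
        return None
    tags = [(word, TOY_LEXICON[word]) for word in words]
    result = expand("S", tags, 0)
    if result is None or result[1] != len(tags):
        return None
    return result[0]
```

Calling `parse("The boy is reading a book")` returns a nested tree whose root is S with the children NP, AUX and VP, while a word order that violates the rules returns `None`.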

2. Bangla Sentences Structure


In Bangla, a simple sentence is formed by an independent clause or principal clause. Example:

A complex sentence consists of one or more subordinate clauses within a principal clause [2]. As for example, A Bangla compound sentence is formed by two or more principal clauses joined by an indeclinable/conjunctive. Example:

Types of Bangla tense are given below in Fig. 1:

Fig. 1 Types of Bangla tense

2.1 Basic Structural Difference between Bangla and English Language
The structural differences between Bangla and English are as follows:

The basic sentence pattern in English is subject + verb + object (SVO), whereas in Bangla it is subject + object + verb (SOV). Example:
English: I (S) eat (V) rice (O)
Bangla: (S) (O) (V)

The auxiliary verb is absent in Bangla. Example:
English: I (Pronoun) am (Auxiliary verb) reading (Main verb) a (Article) book (Noun)
Bangla: (Pronoun) (Article) (Noun) (Main verb)

A preposition is a word placed before a noun, pronoun or noun-equivalent to show its relation to another word of the sentence [6]. In Bangla, the bivakti is placed after the noun, pronoun or noun-equivalent. Example:
English: The man sat on the chair
Bangla: , here is bivakti

2.2 Structural Transfer from Bangla to English
Parsing is the process of building a parse tree for an input string. We can extract the syntactic structure of a Bangla sentence using either of two approaches: i) top-down parsing, ii) bottom-up parsing.

2.2.1 Top-Down Parsing
Top-down parsing starts at the most abstract level (the level of sentences) and works down to the most concrete level (the level of words). An input sentence is derived using the context-free grammar rules by matching the terminals of the sentence. So, given an input string, we start out by assuming that it is a sentence, and then try to prove that it really is one by applying the grammar rules left to right. This works as follows: if we want to prove that the input is of category S and we have the rule S → NP VP, then we next try to prove that the input string consists of a noun phrase followed by a verb phrase.

2.2.2 Bottom-Up Parsing
The basic idea of bottom-up parsing is to begin with the concrete data provided by the input string, that is, the words we have to parse or recognize, and to try to build more abstract, higher-level information from them.
Example: Consider the Bangla sentence . To perform bottom-up parsing of the sentence, we use the following rules of the context-free grammar:

<SENTENCE> → <NOUN-PHRASE> <VERB-PHRASE*>
<NOUN-PHRASE> → <CMPLX-NOUN> | <CMPLX-NOUN> <PREP-PHRASE*> | <ART> <ADJ> <NOUN> <PREP-PHRASE*>
<VERB-PHRASE> → <CMPLX-VERB> | <CMPLX-VERB> <PREP-PHRASE*> | <CMPLX-VERB> <PREP-PHRASE*>
<PREP-PHRASE> → <PREP> <CMPLX-NOUN*> | <PREP> <PRONOUN>
<CMPLX-NOUN> → <ARTICLE> <NOUN> | <NOUN> | <PRONOUN> | <NOUN> <PRONOUN> <NOUN>
<CMPLX-VERB> → <MAIN-VERB> | <MAIN-VERB> <NOUN-PHRASE*>

During the bottom-up parsing of the Bangla sentence , we obtain the grammatical structure NOUN ARTICLE NOUN MAIN-VERB. The syntactic categories in the resulting grammatical structure are then replaced by the constituents of the same or smaller unit until a SENTENCE is obtained, as shown below:

Input Sentence
NOUN ARTICLE NOUN MAIN-VERB
NOUN-PHRASE NOUN-PHRASE MAIN-VERB
NOUN-PHRASE CMPLX-VERB
NOUN-PHRASE VERB-PHRASE
SENTENCE
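The reduction sequence above, from NOUN ARTICLE NOUN MAIN-VERB up to SENTENCE, can be reproduced with a small level-wise reducer. This is a hedged sketch only: the reduction patterns in `LEVELS` are read off the derivation steps shown (in Bangla word order, with the article after the noun and the object before the verb), and grouping them into levels is an assumption made here to get the same intermediate forms, not the authors' parser.

```python
# Reduction patterns inferred from the derivation in the text (an assumption):
# each level is a list of (right-hand side, replacement) pairs.
LEVELS = [
    [(("NOUN", "ARTICLE"), "NOUN-PHRASE"), (("NOUN",), "NOUN-PHRASE")],
    [(("NOUN-PHRASE", "MAIN-VERB"), "CMPLX-VERB")],
    [(("CMPLX-VERB",), "VERB-PHRASE")],
    [(("NOUN-PHRASE", "VERB-PHRASE"), "SENTENCE")],
]


def reduce_once(symbols, rules):
    """Scan left to right and replace every match of the given rules."""
    out, i, changed = [], 0, False
    while i < len(symbols):
        for rhs, lhs in rules:
            if tuple(symbols[i:i + len(rhs)]) == rhs:
                out.append(lhs)
                i += len(rhs)
                changed = True
                break
        else:
            # No rule matched at this position: keep the symbol unchanged.
            out.append(symbols[i])
            i += 1
    return out, changed


def bottom_up(tags):
    """Return the list of sentential forms ending in ['SENTENCE'], or None."""
    current = list(tags)
    forms = [current]
    while current != ["SENTENCE"]:
        for rules in LEVELS:
            reduced, changed = reduce_once(current, rules)
            if changed:
                current = reduced
                forms.append(current)
                break
        else:
            return None  # no reduction applies: the input is rejected
    return forms
```

For the tag sequence `["NOUN", "ARTICLE", "NOUN", "MAIN-VERB"]`, `bottom_up` returns the five forms listed in the derivation above, one per line; a sequence such as `["MAIN-VERB", "NOUN"]` gets stuck and returns `None`.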

3. Proposed MT Model
The proposed model for structural analysis of Bangla sentences is shown in Fig. 2.

Fig. 2 Block diagram of the proposed MT model. Blocks, in order: Source Language Sentence (Bangla), Parser, Parse Tree of Bangla Sentence, Conversion Unit, Parse Tree of English Sentence, Output Target Language Sentence (English); a Lexicon block is also shown.

3.1 Description of the Proposed Model
The proposed MT system takes a Bangla natural-language sentence as input for parsing. The stream of characters is scanned sequentially and grouped into tokens according to the lexicon. Words having a collective meaning are grouped together in the lexicon. The output of the Tokenizer for the input sentence Cheleti Boi Porche is as follows [1] [4]: TOKEN = (Chele, Ti, Boi, Por, Che).

The parser groups the tokens into grammatical phrases that are used to synthesize the output. Usually, the phrases are represented by a parse tree that depicts the syntactic structure of the input.

A lexicon can be defined as a dictionary of words in which each word carries some syntactic, semantic and possibly some pragmatic information. The entries in a lexicon can be grouped by word category (nouns, verbs, prepositions and so on), with every word listed under the categories to which it belongs [1] [4] [5] [7]. In our project, the lexicon contains the English meaning and the part of speech of each Bangla word.

A context-free grammar (CFG) is a set of recursive rewriting rules (or productions) used to generate patterns of strings. It provides a simple and precise mechanism for describing how phrases in a natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Categories such as noun, verb and preposition, together with their respective phrases, lead to natural recursion, because a noun phrase may appear inside a verb phrase and vice versa. The most common way to represent a grammar is as a set of production rules that say how the parts of speech can be put together to make grammatical, or well-formed, sentences [8].

In the conversion unit, an input sentence is analyzed and a source language (SL) parse tree is produced using the bottom-up parsing methodology. Then the corresponding target language (TL) parse tree is produced. Each Bangla word of the input sentence is replaced with the corresponding English word from the lexicon in the target (English) parse tree to produce the target (English) language sentence. Structural Representation (SR) is the process of finding a parse tree for a given input string. For example, the parse tree of the input sentence and the corresponding parse tree of the English sentence "The boy drinks tea" are shown in Fig. 3.
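The tokenizer step described above, in which Cheleti Boi Porche becomes (Chele, Ti, Boi, Por, Che), can be sketched as a greedy longest-match segmentation against the lexicon. This is an illustrative assumption about how such a tokenizer might work, not the authors' code: the lexicon layout, the category names, the English glosses and the function names are all hypothetical, and the tokens are lower-cased for lookup.

```python
# Hypothetical lexicon (an assumption): romanized morpheme -> (category, gloss).
LEXICON = {
    "chele": ("NOUN", "boy"),
    "ti": ("ARTICLE", "the"),
    "boi": ("NOUN", "book"),
    "por": ("MAIN-VERB", "read"),
    "che": ("TENSE-SUFFIX", "present continuous"),
}


def tokenize_word(word, lexicon):
    """Split one word into lexicon entries, longest match first at each position."""
    word = word.lower()
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in lexicon:
                tokens.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no lexicon entry matches {word[start:]!r}")
    return tokens


def tokenize(sentence, lexicon=LEXICON):
    """Tokenize a whitespace-separated sentence into lexicon morphemes."""
    tokens = []
    for word in sentence.split():
        tokens.extend(tokenize_word(word, lexicon))
    return tokens


def categories(tokens, lexicon=LEXICON):
    """Look up the part of speech stored in the lexicon for each token."""
    return [lexicon[token][0] for token in tokens]
```

With this lexicon, `tokenize("Cheleti Boi Porche")` yields `['chele', 'ti', 'boi', 'por', 'che']`. Greedy matching never backtracks, so a real lexicon with overlapping entries would need a more careful search.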


Bangla parse tree: S(NP(N, ART), VP(CV(NP(N), MV)))
English parse tree: S(NP(ART "The", N "boy"), VP(CV(MV "drinks", NP(N "tea"))))

Fig. 3 Bangla and English parse trees of the sentence
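The conversion from the Bangla parse tree to the English parse tree can be sketched as a recursive tree transfer: children are reordered where the two languages differ, and each leaf is replaced by its lexicon entry. This is a minimal sketch under stated assumptions, not the authors' conversion unit. The romanized Bangla words (`chele`, `ti`, `cha`, `khay`), the `REORDER` table and all function names are hypothetical; only the node labels and the English sentence "The boy drinks tea" come from the text.

```python
# Hypothetical lexicon (an assumption): romanized Bangla word -> English word.
LEXICON = {"chele": "boy", "ti": "the", "cha": "tea", "khay": "drinks"}

# Reordering table (an assumption): (parent label, child labels) -> new order.
REORDER = {
    ("NP", ("N", "ART")): (1, 0),   # noun + article  -> article + noun
    ("CV", ("NP", "MV")): (1, 0),   # object + verb   -> verb + object
}

# Source tree as nested tuples: (label, child, ...) or (label, word) for a leaf.
BANGLA_TREE = (
    "S",
    ("NP", ("N", "chele"), ("ART", "ti")),
    ("VP", ("CV", ("NP", ("N", "cha")), ("MV", "khay"))),
)


def is_leaf(node):
    """A leaf is a (label, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def transfer(node, lexicon):
    """Build the target-language tree from a source-language tree."""
    label = node[0]
    if is_leaf(node):
        return (label, lexicon[node[1]])
    children = [transfer(child, lexicon) for child in node[1:]]
    order = REORDER.get((label, tuple(child[0] for child in children)))
    if order is not None:
        children = [children[i] for i in order]
    return (label, *children)


def leaves(node):
    """Collect the words of a tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def translate(tree, lexicon=LEXICON):
    """Transfer the tree and read the target sentence off its leaves."""
    return " ".join(leaves(transfer(tree, lexicon))).capitalize()
```

Here `translate(BANGLA_TREE)` gives `'The boy drinks tea'`. Keeping the reordering in a table separate from the lexicon mirrors the split between structural transfer and word substitution described in Section 3.1.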

4. Implementation of the Proposed Model


The flow-chart of the proposed MT model is given in Fig. 4. After executing the procedure in the flow-chart, it is possible to translate a Bangla sentence into the corresponding English sentence.

Fig. 4 Flow-chart of the proposed MT model

5. Experimental Results

Several experiments were conducted to justify the effectiveness of the proposed MT model. The success rate for different types of sentences is shown in Table 1 and Fig. 5. Fig. 6 shows a snapshot of the implemented method.

Table 1: Success rate for different types of sentences

Type of sentences | Total no. of sentences | Correctly performed machine translation | Success rate (%)
Simple | 770 | 745 | 96.75
Complex | 540 | 517 | 95.74
Compound | 360 | 338 | 93.89

Fig. 5 Success rate for different types of sentences
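The success rates in Table 1 follow directly from the counts, as correct translations divided by total sentences. The short check below recomputes them; the pooled rate over all 1670 sentences is derived here from the table and is not a figure reported in the paper. The names `RESULTS` and `success_rate` are introduced only for this illustration.

```python
# Counts from Table 1: sentence type -> (total sentences, correct translations).
RESULTS = {
    "Simple": (770, 745),
    "Complex": (540, 517),
    "Compound": (360, 338),
}


def success_rate(total, correct):
    """Success rate in percent, rounded to two decimals as in Table 1."""
    return round(100.0 * correct / total, 2)


rates = {kind: success_rate(total, correct)
         for kind, (total, correct) in RESULTS.items()}

# Pooled rate over all sentence types (derived here, not stated in the paper).
all_total = sum(total for total, _ in RESULTS.values())
all_correct = sum(correct for _, correct in RESULTS.values())
overall = success_rate(all_total, all_correct)
```

The per-type values reproduce the table (96.75, 95.74 and 93.89), and the pooled value is 95.81% (1600 of 1670 sentences).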


Fig. 6: Sample output of the program for the complex sentence

6. Conclusion
This paper mainly focuses on the structural analysis phase, that is, on how to build the parse tree of a given Bangla sentence according to a CFG. The translation process is then applied to the source language (SL) tree to obtain a tree with target language words (TL tree). Finally, the output sentence in the target language is extracted from the TL tree, and the system also indicates the tense of the sentence. Sentences composed of idioms and phrases are beyond the scope of this project.

References
[1] M. M. Hoque and M. M. Ali, A Parsing Methodology for Bangla Natural Language Sentences, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 277-282 (2003).
[2] S. Dasgupta and M. Khan, Feature Unification for Morphological Parsing in Bangla, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 642-647 (2004).
[3] K. D. Islam, M. Billah, R. Hasan and M. M. Asaduzzaman, Syntactic Transfer and Generation of Complex-Compound Sentences for Bangla-English Machine Translation, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 321-326 (2003).
[4] L. Mehedy, N. Arifin and M. Kaykobad, Bangla Syntax Analysis: A Comprehensive Approach, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 287-293 (2003).
[5] S. K. Chakravarty, K. Hasan and A. Alim, A Machine Translation (MT) Approach to Translate Bangla Complex Sentences into English, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 342-346 (2003).
[6] S. A. Rahman, K. S. Mahmud, B. Roy and K. M. A. Hasan, English to Bengali Translation Using A New Natural Language Processing Algorithm, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 294-298 (2003).
[7] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd Edition, Pearson Education publisher, New York, 2003.
[8] M. M. Anwar, M. Z. Anwar and M. A. Bhuiyan, Syntax Analysis and Machine Translation of Bangla Sentences, IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 8, August 2009, pp. 317-326 (2009).

Md. Musfique Anwar completed his B.Sc (Engg.) in Computer Science and Engineering from the Dept. of CSE, Jahangirnagar University, Bangladesh in 2006. He is now a Lecturer in the Dept. of CSE, Jahangirnagar University, Savar, Dhaka, Bangladesh. His research interests include Natural Language Processing, Artificial Intelligence, Image Processing, Pattern Recognition, Software Engineering and so on.

Nasrin Sultana Shume completed her B.Sc (Engg.) in Computer Science and Engineering from the Dept. of CSE, Jahangirnagar University, Bangladesh in 2006. She is now a Lecturer in the Dept. of CSE, Green University of Bangladesh, Mirpur, Dhaka, Bangladesh. Her research interests include Artificial Intelligence, Neural Networks, Image Processing, Pattern Recognition, Database and so on.

Md. Al-Amin Bhuiyan received his B.Sc (Hons) and M.Sc. in Applied Physics and Electronics from the University of Dhaka, Dhaka, Bangladesh in 1987 and 1988, respectively. He received the Dr. Eng. degree in Electrical Engineering from Osaka City University, Japan, in 2001. He completed his postdoctoral research in Intelligent Systems at the National Informatics Institute, Japan. He is now a Professor in the Dept. of CSE, Jahangirnagar University, Savar, Dhaka, Bangladesh. His main research interests include Image Face Recognition, Cognitive Science, Image Processing, Computer Graphics, Pattern Recognition, Neural Networks, Human-machine Interface, Artificial Intelligence, Robotics and so on.
