Parsing of Part-Of-Speech Tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814
Department of Computer Science & Engineering, Tezpur University
Tezpur, Assam-784028, India
Abstract

A natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the "languages" used in the study of formal logic). The computational activities required for enabling a computer to carry out information processing using natural language are called natural language processing. We have taken the Assamese language to check the grammar of input sentences. Our aim is to produce a technique to check the grammatical structures of sentences in Assamese text. We have made grammar rules by analyzing the structures of Assamese sentences. Our parsing program finds the grammatical errors, if any, in an Assamese sentence. If there is no error, the program generates the parse tree for the Assamese sentence.

Keywords: Context-free Grammar, Earley's Algorithm, Natural Language Processing, Parsing, Assamese Text.

1. Introduction

Natural language processing, a branch of artificial intelligence, deals with analyzing, understanding and generating the languages that humans use naturally, in order to interface with computers in both written and spoken contexts using natural human languages instead of computer languages. It studies the problems of automated generation and understanding of natural human languages. We have taken the Assamese language for information processing, i.e. to check the grammar of input sentences. The parsing process makes use of two components: a parser, which is a procedural component, and a grammar, which is declarative. The grammar changes depending on the language to be parsed, while the parser remains unchanged. Thus, by simply changing the grammar, the system can parse a different language. We have used Earley's parsing algorithm for parsing Assamese sentences according to a grammar defined for the Assamese language.

2. Related Works

2.1 Natural Language Processing

The term "natural language" refers to the languages that people speak, like English, Assamese and Hindi. The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally. The applications of natural language can be divided into two classes [2]:
• Text based applications: These involve the processing of written text, such as books, newspapers, reports, manuals, e-mail messages etc. These are all reading based tasks.
• Dialogue based applications: These involve human–machine communication, such as spoken language, and also include interaction using keyboards.
From an end-user's perspective, an application may require NLP for either processing natural language input or producing natural language output, or both. Also, for a particular application, only some of the tasks of NLP may be required, and the depth of analysis at the various levels may vary. Achieving human-like language processing capability is a difficult goal for a machine. The difficulties are:
• Ambiguity
• Interpreting partial information
• Many inputs can mean the same thing

2.2 Knowledge Required for Natural Language

A natural language system uses knowledge about the structure of the language itself, which includes words and
how words combine to form sentences, about word meanings and how word meanings contribute to sentence meanings, and so on. The different forms of knowledge relevant for natural language are [2]:
• Phonetic and phonological knowledge: It concerns how words are related to the sounds that realize them.
• Morphological knowledge: It concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language (for example, the meaning of the word friendly is derivable from the meaning of the noun friend and the suffix -ly, which transforms a noun into an adjective).
• Syntactic knowledge: It concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence.
• Semantic knowledge: It concerns what words mean and how these meanings combine in sentences to form sentence meanings.
• Pragmatic knowledge: It concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
• Discourse knowledge: It concerns how the immediately preceding sentences affect the interpretation of the next sentence.
• World knowledge: It includes what each language user must know about the other user's beliefs and goals.

2.3 Earley's Parsing Algorithm

Earley's parsing algorithm [8, 9] is basically a top-down parsing algorithm where all the possible parses are carried out simultaneously. The algorithm uses dotted context-free grammar (CFG) rules, called items, which have a dot in their right-hand side.
Let the input sentence be "0 I 1 saw 2 a 3 man 4 in 5 the 6 park 7". Here the numbers appearing between words are called position numbers.
For the CFG rule S → NP VP we will have three types of dotted items:
• [S → .NP VP, 0, 0]
• [S → NP.VP, 0, 1]
• [S → NP VP., 0, 4]

1. The first item indicates that the input sentence is going to be parsed applying the rule S → NP VP from position 0.
2. The second item indicates that the portion of the input sentence from position 0 to 1 has been parsed as NP, with the remainder left to be satisfied as VP.
3. The third item indicates that the portion of the input sentence from position 0 to 4 has been parsed as NP VP and thus S is accomplished.

Earley's algorithm uses three phases:
• Predictor
• Scanner
• Completer
Let α, β, γ be sequences of terminal or nonterminal symbols, and let S, A, B be nonterminal symbols.

Predictor operation
For an item of the form [A → α.Bβ, i, j], create [B → .γ, j, j] for each production B → γ. It is called the predictor operation because it predicts the next item.

Completer operation
For an item of the form [B → γ., j, k], create [A → αB.β, i, k] (i ≤ j ≤ k) for each item of the form [A → α.Bβ, i, j], if any exists. It is called the completer because it completes an operation.

Scanner operation
For an item of the form [A → α.wβ, i, j], create [A → αw.β, i, j+1], if w is a terminal symbol appearing in the input sentence between positions j and j+1.

Earley's parsing algorithm

1. For each production S → α, create [S → .α, 0, 0].
2. For j = 0 to n do (n is the length of the input sentence):
3. For each item of the form [A → α.Bβ, i, j], apply the Predictor operation while new items are created.
4. For each item of the form [B → γ., i, j], apply the Completer operation while new items are created.
5. For each item of the form [A → α.wβ, i, j], apply the Scanner operation.

If we find an item of the form [S → α., 0, n], then we accept the sentence.
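The predictor, completer and scanner operations can be sketched as a small chart recognizer. This is an illustrative sketch, not the authors' implementation: the grammar, item representation (nonterminal, right-hand side, dot position, start position), and function names are our own assumptions, applied to the example sentence above.

```python
# Toy CFG for "I saw a man in the park"; keys are nonterminals,
# anything that is not a key is treated as a terminal word.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["ART", "N"], ["PRO"], ["NP", "PP"]],
    "VP":  [["V", "NP"], ["VP", "PP"]],
    "PP":  [["P", "NP"]],
    "ART": [["a"], ["the"]],
    "N":   [["man"], ["park"]],
    "PRO": [["I"]],
    "V":   [["saw"]],
    "P":   [["in"]],
}

def earley_recognize(words, grammar, start="S"):
    n = len(words)
    # chart[j] holds items [A -> alpha . beta, i, j] as tuples (A, rhs, dot, i)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for j in range(n + 1):
        added = True
        while added:                     # run Predictor/Completer to a fixpoint
            added = False
            for (A, rhs, dot, i) in list(chart[j]):
                if dot < len(rhs) and rhs[dot] in grammar:      # Predictor
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(prod), 0, j)
                        if item not in chart[j]:
                            chart[j].add(item); added = True
                elif dot == len(rhs):                           # Completer
                    for (A2, rhs2, dot2, i2) in list(chart[i]):
                        if dot2 < len(rhs2) and rhs2[dot2] == A:
                            item = (A2, rhs2, dot2 + 1, i2)
                            if item not in chart[j]:
                                chart[j].add(item); added = True
        if j < n:                                               # Scanner
            for (A, rhs, dot, i) in chart[j]:
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == words[j]:
                    chart[j + 1].add((A, rhs, dot + 1, i))
    # Accept if some item [S -> alpha., 0, n] exists.
    return any(A == start and dot == len(rhs) and i == 0
               for (A, rhs, dot, i) in chart[n])
```

Because the predictor and completer are run to a fixpoint within each position, left-recursive rules such as NP → NP PP are handled without looping, which is the property the algorithm relies on below.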
Our Algorithm
Input: Tagged Assamese Sentence
Output: Parse Tree or Error message
Step 1: If a verb is present in the sentence, then create
[S → .PP VP, 0, 0]
else create
[S → .PP, 0, 0]
Step 2: Do the following steps in a loop until there is a
success or an error.
Step 3: For each item of the form [S → α.Bβ, i, j], apply
Phase 1.
Step 4: For each item of the form [S → .αwβ, i, j], apply
Phase 2.
Step 5: If we find an item of the form [S → α., 0, n], where
n is the length of the input sentence, then we accept the
sentence as a success; otherwise report an error message.
Then come out of the loop.
Step 6: Generate the parse trees for the successful
sentences.

Our proposed grammars for Assamese sentences
1. S → PP VP | PP
2. PP → PN NP | NP PN | ADJ NP | NP ADJ | NP | Some other modifications of Earley’s algorithm:
ADJ | IND NP | PN | ADV NP | ADV
3. NP → NP PP | PP NP | ADV NP | PP | ART NP | 1. Earley’s algorithm blocks left recursive rules
NP ART | IND PN | PN IND [NP→ .NP PP ,0,0], when applying Predictor
operation. Since Assamese Language is a Free-
Here:
NP → Noun type of rules.
PN → Pronoun 2. Earley’s algorithm creates new items for all
VP → Verb possible productions, if there is a non terminal in
ADV → Adverb the left hand side rule. But we reduce these
ADJ → Adjective productions by removing such type of
ART → Article productions, which create the number of total
IND → Indeclinable
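The proposed grammar and the verb test of Step 1 can be written down as data. The following is a minimal sketch, not the authors' code: the dictionary encoding, the tag sequence as input, and the function name are our own assumptions; the tag names follow the abbreviation list above.

```python
# Proposed CFG for Assamese sentences, keyed by nonterminal.
# Tags as above: NP=Noun, PN=Pronoun, VP=Verb, ADV=Adverb,
# ADJ=Adjective, ART=Article, IND=Indeclinable.
ASSAMESE_GRAMMAR = {
    "S":  [["PP", "VP"], ["PP"]],
    "PP": [["PN", "NP"], ["NP", "PN"], ["ADJ", "NP"], ["NP", "ADJ"], ["NP"],
           ["ADJ"], ["IND", "NP"], ["PN"], ["ADV", "NP"], ["ADV"]],
    "NP": [["NP", "PP"], ["PP", "NP"], ["ADV", "NP"], ["PP"], ["ART", "NP"],
           ["NP", "ART"], ["IND", "PN"], ["PN", "IND"]],
}

def initial_items(tags):
    """Step 1: pick the start rule by checking whether a verb tag (VP)
    occurs in the POS-tagged input, then seed the chart at position 0."""
    if "VP" in tags:
        return [("S", ("PP", "VP"), 0, 0)]   # [S -> .PP VP, 0, 0]
    return [("S", ("PP",), 0, 0)]            # [S -> .PP, 0, 0]
```

For example, a tagged sentence consisting of a pronoun followed by a verb, `initial_items(["PN", "VP"])`, seeds the item [S → .PP VP, 0, 0], while a verbless phrase seeds [S → .PP, 0, 0].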
productions in the stack exceed the total tag
length of the input sentence.
3. Another restriction we used in our algorithm for
creating new items is that, if the algorithm is
currently analyzing the last word of the sentence,
then it selects only the rules with a single symbol
on the right-hand side (for example, [PP → NP]).
The other rules, which have more than one symbol
on the right-hand side (for example, [PP → PN NP]),
are ignored by the algorithm.

4.2 Parsing Assamese text using the proposed grammar
and algorithm

Let us take an Assamese sentence.

12 [S → “mai” “Aru” “si” .ADV NP VP, 0, 3]        Apply Phase 1
13 [S → “mai” “Aru” “si” .“ekelge” NP VP, 0, 3]   Apply Phase 2
14 [S → “mai” “Aru” “si” “ekelge” .NP VP, 0, 4]   Apply Phase 1
15 [S → “mai” “Aru” “si” “ekelge” .“gharalE” VP, 0, 4]   Apply Phase 2
16 [S → “mai” “Aru” “si” “ekelge” “gharalE” .VP, 0, 5]   Apply Phase 1
17 [S → “mai” “Aru” “si” “ekelge” “gharalE” .“jAm”, 0, 5]   Apply Phase 2
18 [S → “mai” “Aru” “si” “ekelge” “gharalE” “jAm”., 0, 6]   Complete
In the above example, we have shown only the steps that
proceed to the goal; the other steps are ignored. The
position numbers for the words are placed according to
which word will be parsed first.

5. Implementation and Result Analysis
1. S [Handle]
2. >> PP [S → PP]
3. >> NP [PP → NP]
4. >> NP PP [NP → NP PP]
5. >> NP ART PP [NP → NP ART]
6. >> gru ART PP [NP → gru]
7. >> gru ebidh PP [ART → ebidh]
8. >> gru ebidh ADJ NP [PP → ADJ NP]
9. >> gru ebidh upakArI NP [ADJ → upakArI]
10. >> gru ebidh upakArI za\ntu [NP → za\ntu]
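The handle derivation above corresponds to a parse tree, which a short helper can render. This is a sketch of ours, not the paper's implementation: the nested-tuple tree representation and the function name are our own, and the raw string r"za\ntu" simply preserves the paper's transliteration.

```python
def render(tree, depth=0):
    """Return the tree as a list of indented lines; leaves are (TAG, word) pairs."""
    label, rest = tree[0], tree[1:]
    if len(rest) == 1 and isinstance(rest[0], str):
        return ["  " * depth + label + " -> " + rest[0]]   # leaf: TAG -> word
    lines = ["  " * depth + label]                         # interior node
    for child in rest:
        lines += render(child, depth + 1)
    return lines

# Tree implied by derivation steps 1-10 above:
# S -> PP, PP -> NP, NP -> NP PP, NP -> NP ART, PP -> ADJ NP.
parse = ("S",
         ("PP",
          ("NP",
           ("NP", ("NP", "gru"), ("ART", "ebidh")),
           ("PP", ("ADJ", "upakArI"), ("NP", r"za\ntu")))))
print("\n".join(render(parse)))
```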
References
[1] Alfred V. Aho and Jeffrey D. Ullman. Principles of
Compiler Design. Narosa publishing House, 1995.
[2] James Allen. Natural Language Understanding.
Pearson Education, Singapore, second edition, 2004.
[3] Hem Chandra Baruah. Assamiya Vyakaran. Hemkush
Prakashan, Guwahati, 2003.
[4] D. Deka and B. Kalita. Adhunik Rasana Bisitra. Assam
Book Dipo, Guwahati, 7th edition, 2007.
[5] H. Numazaki and H. Tanaka. A New Parallel
Algorithm for Generalized LR Parsing, 1990.
[6] Stephen G. Pulman. Basic parsing techniques: an
introductory survey, 1991.
[7] Utpal Sharma. Natural Language Processing.
Department of Computer Science and Information
Technology, Tezpur University, Tezpur-784028,
Assam, India.
[8] Hozumi Tanaka. Current trends on parsing - a survey,
1993.
[9] Jay Earley. An Efficient Context-Free Parsing
Algorithm. Communications of the ACM, Vol. 13, No. 2,
February 1970.
[10] Navanath Saharia, Dhrubajyoti Das, Utpal Sharma,
Jugal Kalita. Part of Speech Tagger for Assamese Text,
ACL-IJCNLP 2009, 2-7 August 2009, Singapore.