
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

Parsing of part-of-speech tagged Assamese Texts

Mirzanur Rahman 1, Sufal Das 1 and Utpal Sharma 2

1 Department of Information Technology, Sikkim Manipal Institute of Technology, Rangpo, Sikkim-737136, India

2 Department of Computer Science & Engineering, Tezpur University, Tezpur, Assam-784028, India

Abstract
A natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the "languages" used in the study of formal logic). The computational activity required to enable a computer to carry out information processing using natural language is called natural language processing. We have taken the Assamese language as our target for checking the grammar of input sentences. Our aim is to produce a technique for checking the grammatical structure of sentences in Assamese text. We have made grammar rules by analyzing the structures of Assamese sentences. Our parsing program finds the grammatical errors, if any, in an Assamese sentence. If there is no error, the program generates the parse tree for the sentence.
Keywords: Context-free Grammar, Earley's Algorithm, Natural Language Processing, Parsing, Assamese Text.

1. Introduction

Natural language processing is a branch of artificial intelligence that deals with analyzing, understanding and generating the languages that humans use naturally, so that people can interface with computers in both written and spoken contexts using natural human languages instead of computer languages. It studies the problems of automated generation and understanding of natural human languages. We have taken the Assamese language for information processing, i.e. for checking the grammar of input sentences. The parsing process makes use of two components: a parser, which is procedural, and a grammar, which is declarative. The grammar changes depending on the language to be parsed, while the parser remains unchanged; thus, by simply changing the grammar, the system can parse a different language. We use Earley's parsing algorithm to parse Assamese sentences according to a grammar defined for the Assamese language.

2. Related Works

2.1 Natural Language Processing

The term "natural language" refers to the languages that people speak, such as English, Assamese and Hindi. The goal of the Natural Language Processing (NLP) community is to design and build software that will analyze, understand, and generate the languages that humans use naturally. The applications of natural language processing can be divided into two classes [2]:
• Text based applications: these involve the processing of written text, such as books, newspapers, reports, manuals and e-mail messages. They are all reading-based tasks.
• Dialogue based applications: these involve human-machine communication, most naturally spoken language, but they also include interaction using keyboards.
From an end-user's perspective, an application may require NLP for processing natural language input, for producing natural language output, or for both. Also, for a particular application, only some of the tasks of NLP may be required, and the depth of analysis at the various levels may vary. Achieving human-like language processing capability is a difficult goal for a machine. The difficulties are:
• Ambiguity
• Interpreting partial information
• Many inputs can mean the same thing

2.2 Knowledge Required for Natural Language

A natural language system uses knowledge about the structure of the language itself, which includes words and
how words combine to form sentences, about word meanings and how word meanings contribute to sentence meanings, and so on. The different forms of knowledge relevant for natural language are [2]:
• Phonetic and phonological knowledge: how words are related to the sounds that realize them.
• Morphological knowledge: how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language (for example, the meaning of the word friendly is derivable from the meaning of the noun friend and the suffix -ly, which transforms a noun into an adjective).
• Syntactic knowledge: how words can be put together to form correct sentences, and what structural role each word plays in the sentence.
• Semantic knowledge: what words mean and how these meanings combine in sentences to form sentence meanings.
• Pragmatic knowledge: how sentences are used in different situations and how use affects the interpretation of the sentence.
• Discourse knowledge: how the immediately preceding sentences affect the interpretation of the next sentence.
• World knowledge: what each language user must know about the other user's beliefs and goals.

2.3 Earley's Parsing Algorithm

Earley's parsing algorithm [8, 9] is basically a top-down parsing algorithm in which all the possible parses are carried along simultaneously. It works on dotted context-free grammar (CFG) rules, called items, which have a dot somewhere in the right-hand side.
Let the input sentence be "0 I 1 saw 2 a 3 man 4 in 5 the 6 park 7". The numbers appearing between the words are called position numbers.
For the CFG rule S → NP VP (where S is the starting symbol, NP a noun phrase and VP a verb phrase) we may have dotted items such as:
• [S → .NP VP, 0, 0]
• [S → NP.VP, 0, 1]
• [S → NP VP., 0, 4]
1. The first item indicates that the input sentence is going to be parsed by applying the rule S → NP VP from position 0.
2. The second item indicates that the portion of the input sentence from position 0 to 1 has been parsed as NP, and that the remainder is left to be satisfied as VP.
3. The third item indicates that the portion of the input sentence from position 0 to 4 has been parsed as NP VP, and thus S is accomplished.

Earley's algorithm uses three operations:
• Predictor
• Scanner
• Completer
Let α, β, γ be sequences of terminal or nonterminal symbols, and let S, A, B be nonterminal symbols.

Predictor operation: for an item of the form [A → α.Bβ, i, j], create [B → .γ, j, j] for each production [B → γ]. It is called the predictor operation because it predicts the items that may come next.

Completer operation: for an item of the form [B → γ., j, k], create [A → αB.β, i, k] (i ≤ j ≤ k) for each existing item of the form [A → α.Bβ, i, j]. It is called the completer because it completes a constituent.

Scanner operation: for an item of the form [A → α.wβ, i, j], create [A → αw.β, i, j+1] if w is a terminal symbol that appears in the input sentence between positions j and j+1.

Earley's parsing algorithm:
1. For each production S → α, create [S → .α, 0, 0].
2. For j = 0 to n do (n is the length of the input sentence):
3. For each item of the form [A → α.Bβ, i, j], apply the Predictor operation as long as new items are created.
4. For each item of the form [B → γ., i, j], apply the Completer operation as long as new items are created.
5. For each item of the form [A → α.wβ, i, j], apply the Scanner operation.
If we find an item of the form [S → α., 0, n], then we accept the input sentence.
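The three operations can be made concrete with a small Python sketch of an Earley recognizer for the worked example "I saw a man". The chart layout, the function name and the word-to-category table are our own illustrative assumptions, not the paper's implementation; the lexicon is extended with entries for the remaining words of the longer sentence "I saw a man in the park".

```python
# A compact Earley recognizer: chart[j] holds items (lhs, rhs, dot, i),
# i.e. [lhs -> rhs with a dot at position `dot`, i, j].
GRAMMAR = {
    "S":  [["NP", "VP"], ["S", "PP"]],
    "NP": [["n"], ["art", "n"], ["NP", "PP"]],
    "PP": [["p", "NP"]],
    "VP": [["v", "NP"]],
}
# Lexical rules n -> I, n -> man, v -> saw, art -> a, plus assumed
# entries for "in", "the" and "park".
LEXICON = {"I": "n", "man": "n", "park": "n",
           "saw": "v", "a": "art", "the": "art", "in": "p"}

def earley_recognize(words):
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR["S"]:                      # initial items for S
        chart[0].add(("S", tuple(rhs), 0, 0))
    for j in range(n + 1):
        agenda = list(chart[j])
        while agenda:
            lhs, rhs, dot, i = agenda.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:     # Predictor
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], tuple(prod), 0, j)
                    if item not in chart[j]:
                        chart[j].add(item)
                        agenda.append(item)
            elif dot == len(rhs):                          # Completer
                for l2, r2, d2, i2 in list(chart[i]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, i2)
                        if item not in chart[j]:
                            chart[j].add(item)
                            agenda.append(item)
            # Scanner: the next word's category matches the symbol after the dot
            if dot < len(rhs) and j < n and LEXICON.get(words[j]) == rhs[dot]:
                chart[j + 1].add((lhs, rhs, dot + 1, i))
    return any(("S", tuple(rhs), len(rhs), 0) in chart[n]
               for rhs in GRAMMAR["S"])

print(earley_recognize("I saw a man".split()))              # True
print(earley_recognize("I saw a man in the park".split()))  # True
print(earley_recognize("saw man a".split()))                # False
```

A finished S item spanning positions 0 to n in the last chart cell corresponds exactly to the acceptance condition [S → α., 0, n] described above.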

Let us take an example. Let the input be "0 I 1 saw 2 a 3 man 4", and consider the following grammar:
1. S → NP VP
2. S → S PP
3. NP → n
4. NP → art n
5. NP → NP PP
6. PP → p NP
7. VP → v NP
8. n → I
9. n → man
10. v → saw
11. art → a

Now parse the sentence using Earley's parsing technique:
1. [S → .NP VP, 0, 0] Initialization
2. [S → .S PP, 0, 0] Initialization
3. [NP → .n, 0, 0] Apply Predictor to step 1
4. [NP → .art n, 0, 0] Apply Predictor to step 1
5. [NP → .NP PP, 0, 0] Apply Predictor to step 1
6. [n → ."I", 0, 0] Apply Predictor to step 3
7. [n → "I"., 0, 1] Apply Scanner to step 6
8. [NP → n., 0, 1] Apply Completer to step 7 with step 3
9. [S → NP.VP, 0, 1] Apply Completer to step 8 with step 1
10. [NP → NP.PP, 0, 1] Apply Completer to step 8 with step 5
11. [VP → .v NP, 1, 1] Apply Predictor to step 9
12. [v → ."saw", 1, 1] Apply Predictor to step 11
13. [PP → .p NP, 1, 1] Apply Predictor to step 10
14. [v → "saw"., 1, 2] Apply Scanner to step 12
15. [VP → v.NP, 1, 2] Apply Completer to step 14 with step 11
16. [NP → .n, 2, 2] Apply Predictor to step 15
17. [NP → .art n, 2, 2] Apply Predictor to step 15
18. [NP → .NP PP, 2, 2] Apply Predictor to step 15
19. [art → ."a", 2, 2] Apply Predictor to step 17
20. [art → "a"., 2, 3] Apply Scanner to step 19
21. [NP → art.n, 2, 3] Apply Completer to step 20 with step 17
22. [n → ."man", 3, 3] Apply Predictor to step 21
23. [n → "man"., 3, 4] Apply Scanner to step 22
24. [NP → art n., 2, 4] Apply Completer to step 23 with step 21
25. [VP → v NP., 1, 4] Apply Completer to step 24 with step 15
26. [S → NP VP., 0, 4] Apply Completer to step 25 with step 9; the sentence is accepted

When applying the Predictor operation, Earley's algorithm often creates sets of similar items, such as steps 3, 4, 5 and steps 16, 17, 18, all expecting an NP.

3. Properties and problems of parsing algorithms

Parsing algorithms are usually designed for classes of grammars rather than for individual grammars. There are some important properties [6] that make a parsing algorithm practically useful:
• It should be sound with respect to a given grammar and lexicon.
• It should be complete, so that it assigns to an input sentence all the analyses the sentence can have with respect to the current grammar and lexicon.
• It should be efficient, so that it takes a minimum of computational work.
An algorithm should also be robust, behaving in a reasonably sensible way when presented with a sentence that it is unable to analyze fully.
The main problem of natural language is ambiguity: a sentence of a natural language can have several different meanings. Because of this, not all parsing algorithms can be used for natural language processing. Many parsing techniques are used for programming languages (such as the C language); these techniques are comparatively easy to apply because, in a programming language, the meaning of each word is fixed. In NLP we cannot use these techniques directly, because of ambiguity.

For example: "I saw a man in the park with a telescope."

This sentence has at least three meanings:
• Using a telescope, I saw the man in the park.
• I saw the man in the park that has a telescope.
• I saw the man in the park standing behind the telescope which is placed in the park.

So this sentence is ambiguous, and no algorithm can resolve the ambiguity; the best algorithm is one that produces all the possible analyses. To begin with, we look for algorithms that can take care of the ambiguity of smaller components, such as words and phrases.

4. Proposed grammar and algorithm for Assamese Texts

Since it is impossible to cover all types of sentences in the Assamese language, we have taken some portion of the sentences and tried to make a grammar for them. Assamese is a free-word-order language [10]. As an example we can take the following Assamese sentence.
This sentence can be written as:

Here we see that one sentence can be written in different forms with the same meaning, i.e. the positions of the tags are not fixed. So we cannot restrict the grammar rules to a single sentence form. The grammar rules may become very long, but we have to accept that. The grammar rules we have tried to make may not work for all sentences in the Assamese language, because we have not considered all types of sentences. Some of the sentences which were used to make the grammar rules are shown below [3, 4].

Our proposed grammar for Assamese sentences:
1. S → PP VP | PP
2. PP → PN NP | NP PN | ADJ NP | NP ADJ | NP | ADJ | IND NP | PN | ADV NP | ADV
3. NP → NP PP | PP NP | ADV NP | PP | ART NP | NP ART | IND PN | PN IND

Here
NP → Noun
PN → Pronoun
VP → Verb
ADV → Adverb
ADJ → Adjective
ART → Article
IND → Indeclinable

4.1 Modification of Earley's Algorithm for Assamese Text Parsing

We know that Earley's algorithm uses three operations: Predictor, Scanner and Completer. We combine the Predictor and Completer operations into one phase, and keep the Scanner operation as another phase.
Let α, β, γ be sequences of terminal or nonterminal symbols, and S, B be nonterminal symbols.

Phase 1 (Predictor + Completer): for an item of the form [S → α.Bβ, i, j], create [S → α.γβ, i, j] for each production [B → γ].

Phase 2 (Scanner): for an item of the form [S → α.wβ, i, j], create [S → αw.β, i, j+1] if w is a terminal symbol that appears in the input sentence between positions j and j+1.

Our Algorithm
Input: a tagged Assamese sentence.
Output: a parse tree, or an error message.
Step 1: If a verb is present in the sentence, create [S → .PP VP, 0, 0]; otherwise create [S → .PP, 0, 0].
Step 2: Repeat Steps 3 and 4 in a loop until there is a success or an error.
Step 3: For each item of the form [S → α.Bβ, i, j], apply Phase 1.
Step 4: For each item of the form [S → α.wβ, i, j], apply Phase 2.
Step 5: If we find an item of the form [S → α., 0, n], where n is the length of the input sentence, accept the sentence as a success; otherwise report an error message. Then come out of the loop.
Step 6: Generate the parse trees for the successful sentences.

Some other modifications of Earley's algorithm:
1. Earley's algorithm blocks left-recursive rules such as [NP → .NP PP, 0, 0] when applying the Predictor operation. Since Assamese is a free-word-order language, we do not block this type of rule.
2. Earley's algorithm creates new items for all possible productions whenever there is a nonterminal after the dot in the right-hand side of a rule. We reduce these productions by removing those productions which make the number of total
productions in the stack greater than the total tag length of the input sentence.
3. Another restriction used in our algorithm when creating new items is that, if the algorithm is currently analyzing the last word of the sentence, it selects only rules with a single symbol on the right-hand side (for example [PP → NP]); the other rules, which have more than one symbol on the right-hand side (for example [PP → PN NP]), are ignored.

4.2 Parsing Assamese text using the proposed grammar and algorithm

Let us take an Assamese sentence.

Now the position numbers for the words are placed according to which word will be parsed first.

We consider the following grammar rules:
1. S → PP VP | PP
2. PP → PN NP | NP PN | ADJ NP | NP ADJ | NP | ADJ | IND NP | PN | ADV NP | ADV
3. NP → NP PP | PP NP | ADV NP | PP | ART NP | NP ART | IND PN | PN IND
4. PN → "mai"
5. PN → "si"
6. IND → "Aru"
7. ADV → "ekelge"
8. NP → "gharalE"
9. VP → "jAm"

The parsing process proceeds as follows:
1. [S → .PP VP, 0, 0] Initialization (a verb is present)
2. [S → .NP VP, 0, 0] Apply Phase 1 to step 1 (PP → NP)
3. [S → .PP NP VP, 0, 0] Apply Phase 1 to step 2 (NP → PP NP)
4. [S → .PN NP NP VP, 0, 0] Apply Phase 1 to step 3 (PP → PN NP)
5. [S → ."mai" NP NP VP, 0, 0] Apply Phase 1 to step 4 (PN → "mai")
6. [S → "mai" .NP NP VP, 0, 1] Apply Phase 2 to step 5
7. [S → "mai" .IND PN NP VP, 0, 1] Apply Phase 1 to step 6 (NP → IND PN)
8. [S → "mai" ."Aru" PN NP VP, 0, 1] Apply Phase 1 to step 7 (IND → "Aru")
9. [S → "mai" "Aru" .PN NP VP, 0, 2] Apply Phase 2 to step 8
10. [S → "mai" "Aru" ."si" NP VP, 0, 2] Apply Phase 1 to step 9 (PN → "si")
11. [S → "mai" "Aru" "si" .NP VP, 0, 3] Apply Phase 2 to step 10
12. [S → "mai" "Aru" "si" .ADV NP VP, 0, 3] Apply Phase 1 to step 11 (NP → ADV NP)
13. [S → "mai" "Aru" "si" ."ekelge" NP VP, 0, 3] Apply Phase 1 to step 12 (ADV → "ekelge")
14. [S → "mai" "Aru" "si" "ekelge" .NP VP, 0, 4] Apply Phase 2 to step 13
15. [S → "mai" "Aru" "si" "ekelge" ."gharalE" VP, 0, 4] Apply Phase 1 to step 14 (NP → "gharalE")
16. [S → "mai" "Aru" "si" "ekelge" "gharalE" .VP, 0, 5] Apply Phase 2 to step 15
17. [S → "mai" "Aru" "si" "ekelge" "gharalE" ."jAm", 0, 5] Apply Phase 1 to step 16 (VP → "jAm")
18. [S → "mai" "Aru" "si" "ekelge" "gharalE" "jAm"., 0, 6] Apply Phase 2 to step 17; the sentence is accepted

In the above example, we have shown only the steps which proceed to the goal; the other steps are omitted.

5. Implementation and Result Analysis

5.1 Different Stages of the Program

The program has three stages:
• Lexical Analysis
• Syntax Analysis
• Tree Generation

In the Lexical Analysis stage, the program finds the correct tag for each word in the sentence by searching a database. There are seven databases (NP, PN, VP, ADJ, ADV, ART, IND) for tagging the words.
In the Syntax Analysis stage, the program tries to analyze whether the given sentence is grammatically correct or not.
In the Tree Generation stage, the program finds all the production rules which lead to success and generates a parse tree for those rules. If there is more than one path to success, this stage can generate more than one parse tree. It also displays the words of the sentence with their proper tags. The following shows a parse tree generated by the program.
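How the stages fit together can be sketched in Python. The word lists below stand in for the seven tag databases, and the breadth-first search is a simplified stand-in for the two-phase parser of Section 4.1: Phase 1 rewrites the leftmost unmatched symbol by a production, Phase 2 matches it against the next tag, and the pruning rule mirrors the stack-size restriction described above. All names and the search strategy are our own assumptions, not the actual program.

```python
from collections import deque

LEXICON = {  # stage 1: one small word list per tag database (toy data)
    "mai": "PN", "si": "PN", "Aru": "IND",
    "ekelge": "ADV", "gharalE": "NP", "jAm": "VP",
}
RULES = {  # the proposed grammar; the tags themselves act as terminals
    "PP": [["PN", "NP"], ["NP", "PN"], ["ADJ", "NP"], ["NP", "ADJ"], ["NP"],
           ["ADJ"], ["IND", "NP"], ["PN"], ["ADV", "NP"], ["ADV"]],
    "NP": [["NP", "PP"], ["PP", "NP"], ["ADV", "NP"], ["PP"], ["ART", "NP"],
           ["NP", "ART"], ["IND", "PN"], ["PN", "IND"]],
}

def parse_tagged(tags):
    """Stage 2 (syntax analysis): accept iff the tag string derives from S."""
    n = len(tags)
    start = ("PP", "VP") if "VP" in tags else ("PP",)  # Step 1 of the algorithm
    queue, seen = deque([(start, 0)]), {(start, 0)}
    while queue:
        symbols, j = queue.popleft()   # symbols not yet matched, position j
        if not symbols:
            if j == n:
                return True
            continue
        if len(symbols) > n - j:       # prune items longer than the remaining tags
            continue
        head, rest = symbols[0], symbols[1:]
        if j < n and tags[j] == head:             # Phase 2: scan one tag
            state = (rest, j + 1)
            if state not in seen:
                seen.add(state)
                queue.append(state)
        for prod in RULES.get(head, []):          # Phase 1: rewrite a nonterminal
            state = (tuple(prod) + rest, j)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False

words = "mai Aru si ekelge gharalE jAm".split()
tags = [LEXICON[w] for w in words]                # stage 1: tagging
print(tags)                # ['PN', 'IND', 'PN', 'ADV', 'NP', 'VP']
print(parse_tagged(tags))  # True
```

The seen set plays the role of item deduplication: without it, unit cycles such as PP → NP and NP → PP in the proposed grammar would loop forever, and the length-based pruning bounds the item size just as the paper's stack-size restriction does.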
5.2 Result Analysis

After implementing Earley's algorithm with our proposed grammar, it has been seen that the algorithm can easily generate a parse tree for a sentence if the sentence structure satisfies the grammar rules. For example, we take the following Assamese sentence.

The structure of the above sentence is NP-ART-ADJ-NP. This is a correct sentence according to Assamese literature. According to our proposed grammar, a possible top-down derivation for the above sentence is:

1. S [Handle]
2. >> PP [S → PP]
3. >> NP [PP → NP]
4. >> NP PP [NP → NP PP]
5. >> NP ART PP [NP → NP ART]
6. >> gru ART PP [NP → gru]
7. >> gru ebidh PP [ART → ebidh]
8. >> gru ebidh ADJ NP [PP → ADJ NP]
9. >> gru ebidh upakArI NP [ADJ → upakArI]
10. >> gru ebidh upakArI za\ntu [NP → za\ntu]

From the above derivation it can be seen that the Assamese sentence is correct according to the proposed grammar, so our parsing program generates a parse tree successfully, as follows.

The original parse tree for the above sentence is:

Our program tests only the sentence structure according to the proposed grammar rules. So if the sentence structure satisfies the grammar rules, the program recognizes the sentence as a correct sentence and generates a parse tree; otherwise it gives an error as output.

6. Conclusion and Future Work

We have developed a context-free grammar for simple Assamese sentences. Different natural languages present different challenges in computational processing. We have studied the issues that arise in parsing Assamese sentences and produced an algorithm that addresses those issues. This algorithm is a modification of Earley's algorithm. We found that Earley's parsing algorithm is simple and effective.
In this work we have considered a limited number of Assamese sentences to construct the grammar rules, and we have considered only seven main tags. In future work we will have to consider as many sentences as we can, and some more tags, when constructing the grammar rules, because Assamese is a free-word-order language: the word positions in one sentence may not be the same in other sentences, so we cannot restrict the grammar rules to some limited number of sentences.

References
[1] Alfred V. Aho and Jeffrey D. Ullman, Principles of Compiler Design, Narosa Publishing House, 1995.
[2] James Allen, Natural Language Understanding, second edition, Pearson Education, Singapore, 2004.
[3] Hem Chandra Baruah, Assamiya Vyakaran, Hemkush Prakashan, Guwahati, 2003.
[4] D. Deka and B. Kalita, Adhunik Rasana Bisitra, 7th edition, Assam Book Dipo, Guwahati, 2007.
[5] H. Numazaki and H. Tanaka, "A New Parallel Algorithm for Generalized LR Parsing", 1990.
[6] Stephen G. Pulman, "Basic Parsing Techniques: An Introductory Survey", 1991.
[7] Utpal Sharma, Natural Language Processing, Department of Computer Science and Information Technology, Tezpur University, Tezpur-784028, Assam, India.
[8] Hozumi Tanaka, "Current Trends on Parsing: A Survey", 1993.
[9] Jay Earley, "An Efficient Context-Free Parsing Algorithm", Communications of the ACM, Vol. 13, No. 2, February 1970.
[10] Navanath Saharia, Dhrubajyoti Das, Utpal Sharma and Jugal Kalita, "Part of Speech Tagger for Assamese Text", ACL-IJCNLP 2009, 2-7 August 2009, Singapore.
