
2011 International Conference on Asian Language Processing

Natural Language Grammar Induction of Indonesian Language Corpora Using Genetic Algorithm

Arya Tandy Hermawan*), Gunawan*,**), Joan Santoso*)


*) Department of Computer Science
Sekolah Tinggi Teknik Surabaya
Surabaya, East Java, Indonesia
**) Department of Electrical Engineering
Faculty of Industrial Technology
Institut Teknologi Sepuluh Nopember
Surabaya, East Java, Indonesia
arya@stts.edu, gunawan@stts.edu, joan.santoso@gmail.com

Abstract—Grammar induction is a machine learning process for learning
a grammar from corpora. This paper discusses the process of grammar
induction for Indonesian language corpora using a genetic algorithm.
The grammar production rules are modeled in the form of chromosomes.
The fitness function counts how many sentences can be parsed. The data
used are Indonesian fairy tales such as “Bawang Merah Bawang Putih”
and “Malin Kundang”. This paper explains in detail each step of the
process as carried out for natural language grammar problems.

Keywords-Natural Language Processing; Genetic Algorithm; Indonesian
Language; Grammar Induction

I. INTRODUCTION
Grammar induction, also known as grammatical inference, is a machine
learning process that aims to produce a grammar from a corpus or
corpora. The method used in this induction process is the genetic
algorithm, which was developed by John Holland in the 1970s. Much
research has been done on grammar induction using genetic algorithms
[1], [2], [3], [4]. The work in [1] discusses the development of a CFG
induction library, [3] discusses structuring a chromosome in CFG
induction using a genetic algorithm, and [4] covers reproduction
operators in CFG induction using a genetic algorithm. In previous
research, induction was generally done for context-free grammar
problems such as balanced parentheses and two-symbol palindromes, but
it has rarely been analyzed for natural language grammar problems.
Therefore, this paper explains how the grammar induction process is
carried out for natural language grammar problems.
Section 2 describes the flow of the grammar induction process.
Section 3 discusses the data used in the process, namely corpora
selection and preparation. Section 4 discusses the methods used in
this paper: chromosome structure, crossover method, mutation method,
parser, fitness function, and selection method. Section 5 discusses
the results of the grammar induction process, Section 6 presents the
conclusions, and Section 7 gives suggestions for further research.

II. GRAMMAR INDUCTION PROCESS
The sequence of the grammar induction process is shown in Figure 1.
The input data are corpora, which are processed in the preprocessing
stage. The data are then divided into training and testing corpora.
Next, the training data are used in the grammar induction process,
which produces a grammar as its result. This grammar is then evaluated
in the testing phase.

Figure 1. Grammar Induction Process

III. CORPORA SELECTION AND PREPARATION
The corpora used in this experiment are original Indonesian stories
such as “Bawang Merah Bawang Putih” and “Malin Kundang”, and
collections of short stories from Kompas of May 2011 and July 2008.
The corpora are split so that each line holds one sentence for the POS
tagging process. Every word in the corpora is tagged with the
Indonesian POS tagset [5] using the Stanford POS Tagger, which has
been trained beforehand.
The corpora are separated into training and testing data with a
distribution of 70 percent training data and 30 percent testing data.
The Indonesian POS tagset is considered less effective in the genetic
process, so it needs to be reduced. The grouping of tags is based on
Table 1. The reduced tagset is used for the terminals in the grammar,
while the number of nonterminals is an input from the user. For
example, suppose a corpus contains the following sentence:
Malin/NNP termasuk/VB anak/NN yang/SC cerdas/JJ
tetapi/SC sedikit/RB nakal/JJ ./.

978-0-7695-4554-7/11 $26.00 © 2011 IEEE


DOI 10.1109/IALP.2011.58
After reduction, the sentence becomes:
Malin/KB termasuk/KK anak/KB yang/KH cerdas/KS
tetapi/KH sedikit/KKT nakal/KS ./.
In this example, the original tags are converted into reduced tags
according to Table 1: NNP is converted into KB, VB into KK, and so on.

TABLE I. REDUCTION TAGSET
No.  Reduction Tagset                            Real Tagset
1.   KB, Kata Benda (Noun)                       NN, NNP
2.   KS, Kata Sifat (Adjective)                  JJ
3.   KK, Kata Kerja (Verb)                       VB
4.   KN, Kata Numeralia (Number)                 CD
5.   KKT, Kata Keterangan (Adverb)               RB
6.   KH, Kata Hubung (Conjunction)               CC, SC
7.   KDP, Kata Depan (Preposition)               WH, IN, NEG, MD, WDT
8.   KL, Kata Lain (Other Class Word Criteria)   FW, SYM
9.   KG, Kata Ganti (Pronoun)                    PRP
10.  TB, Tanda Baca (Punctuation)                “—”, “:”, “(”, “)”
11.  . (End of Sentence)                         .
12.  , (Comma)                                   ,
13.  “ (Quotation Mark)                          ”, ‘

IV. METHODS
This section explains the grammar induction process using a genetic
algorithm: how to create a grammar chromosome, the crossover and
mutation operators that are used, the parser, the fitness function,
and the selection method. The genetic algorithm in this induction
process runs until the specified number of generations is completed.

A. Chromosome Structure
The chromosome representation used in this induction process is an
integer representation, whereas the chromosome representation in
reference [3] is binary. Each nonterminal and terminal symbol is
denoted by an integer. Nonterminal symbols are numbered from 0
downward to a certain negative value, while terminal symbols are
numbered from 1 upward to a certain positive value. A valid chromosome
must satisfy the following rules:
1. The number of nonterminal symbols is determined by a percentage
   parameter applied to the number of terminals. For example, if the
   number of terminals is 13 and the parameter is 50%, the number of
   nonterminals will be 6.
2. The terminal symbols that are used must exist in the corpora;
   otherwise the terminal cannot be included in the formation of the
   chromosome. For example, if the terminal symbols in the corpus are
   KB, KS, KK, and “.”, then the input tagset is KB, KS, KK, and “.”.
3. The left-hand side of each rule holds exactly 1 symbol, and it is a
   nonterminal symbol.
4. The maximum number of symbols on the right-hand side of a rule
   must be defined by a parameter.
5. The number of production rules must be defined, so that the length
   of the chromosome can be obtained by multiplying the number of
   production rules by the length of one rule. The length of a
   production rule is the number of symbols on the left-hand side plus
   the maximum number of symbols on the right-hand side.
6. The comparison ratio between the number of nonterminal and terminal
   symbols must be set so that nonterminal and terminal symbols appear
   in the chromosome in proportion.
7. The probability that the first symbol on the right-hand side (RHS)
   is a nonterminal must be set. This is because, after some
   observation, the first symbol on the RHS is in most cases a
   terminal symbol.
8. The number of symbols on the RHS of each production rule varies
   from 1 to the maximum number of symbols and is distributed
   uniformly. To give every rule the same length, a filler symbol is
   added, represented by U (Unused) or 99 in the integer
   representation. It fills the empty genes in a block so that all
   rules in the chromosome have the same length.
9. All nonterminal and terminal symbols are distributed uniformly in
   the chromosome according to the comparison ratio between the number
   of nonterminal and terminal symbols.
With these constraints, the result is expected to be valid and able to
produce an optimal grammar. Figure 2 shows an example of the
chromosome structure.

The number of production rules in one chromosome : 10
The number of maximum symbols on right hand side : 3
Percentage of the number of nonterminal obtained from the number of terminal : 0.5
Probability that first symbol on right hand side is nonterminal : 0.2
Comparison ratio between nonterminal and terminal : 0.3
Tagset used in chromosome :
No.  Tagset       Chromosome Symbol
1.   S            0
2.   B            -1
3.   KB           1
4.   KS           2
5.   KK           3
6.   .            4
7.   U (Unused)   99
Chromosome :
0 -1 99 99 0 -1 99 99 -1 3 99 99 -1 1 4 0 -1 1 0 99 -1 2 3 3 0 2 2 0 -1 4 99 99 0 3 4 99 0 1 1 99
Chromosome divided into blocks :
0 -1 99 99 | 0 -1 99 99 | -1 3 99 99 | -1 1 4 0 | -1 1 0 99 | -1 2 3 3 | 0 2 2 0 | -1 4 99 99 | 0 3 4 99 | 0 1 1 99
Chromosome converted from integer symbols to string symbols :
S B U U | S B U U | B KK U U | B KB . S | B KB S U | B KS KK KK | S KS KS S | B . U U | S KK . U | S KB KB U
Grammar :
S ::= B
S ::= B
B ::= KK
B ::= KB . S
B ::= KB S
B ::= KS KK KK
S ::= KS KS S
B ::= .
S ::= KK .
S ::= KB KB

Figure 2. Chromosome Structure Example
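
The decoding step illustrated in Figure 2 can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation. The symbol table and the example chromosome are copied from Figure 2, while the function names decode and random_rule are hypothetical. In random_rule, treating the comparison ratio as the probability of drawing a nonterminal at every RHS position after the first is an assumption, since the paper does not spell out how that ratio is applied.

```python
import random

UNUSED = 99  # the "U" filler gene from Figure 2

# Integer-to-symbol mapping copied from the tagset table in Figure 2.
SYMBOLS = {0: "S", -1: "B", 1: "KB", 2: "KS", 3: "KK", 4: "."}


def decode(chromosome, max_rhs=3):
    """Split a flat integer chromosome into (lhs, rhs) production rules."""
    block = 1 + max_rhs  # one LHS gene plus max_rhs RHS genes per rule
    rules = []
    for start in range(0, len(chromosome), block):
        lhs, *rhs = chromosome[start:start + block]
        rhs = [SYMBOLS[g] for g in rhs if g != UNUSED]  # drop filler genes
        rules.append((SYMBOLS[lhs], rhs))
    return rules


def random_rule(nonterminals, terminals, max_rhs, p_first_nt, p_other_nt, rng):
    """Build one rule block following constraints 3, 4, 7 and 8.

    ASSUMPTION: p_other_nt stands in for the paper's comparison ratio and
    is used as the nonterminal probability for non-first RHS positions.
    """
    lhs = rng.choice(nonterminals)        # constraint 3: one nonterminal LHS
    length = rng.randint(1, max_rhs)      # constraint 8: uniform RHS length
    rhs = []
    for pos in range(length):
        p_nt = p_first_nt if pos == 0 else p_other_nt  # constraint 7
        pool = nonterminals if rng.random() < p_nt else terminals
        rhs.append(rng.choice(pool))
    return [lhs] + rhs + [UNUSED] * (max_rhs - length)  # pad to fixed length


# The example chromosome from Figure 2 (10 rules, 4 genes each).
FIGURE_2 = [0, -1, 99, 99, 0, -1, 99, 99, -1, 3, 99, 99, -1, 1, 4, 0,
            -1, 1, 0, 99, -1, 2, 3, 3, 0, 2, 2, 0, -1, 4, 99, 99,
            0, 3, 4, 99, 0, 1, 1, 99]

for lhs, rhs in decode(FIGURE_2):
    print(lhs, "::=", " ".join(rhs))
```

Keeping every rule at a fixed block length, as constraint 8 requires, is what lets decode slice the chromosome with a constant stride instead of searching for rule boundaries.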

In Figure 2, the nonterminal symbols are S and B, and the terminal
symbols are KB (Kata Benda) for noun, KS (Kata Sifat) for adjective,
KK (Kata Kerja) for verb, and “.” for end of sentence. From the
chromosome structure explained above, the parameters used in building
a chromosome are:
1. The number of production rules in one chromosome.
2. The maximum number of symbols on the right-hand side.
3. The percentage that gives the number of nonterminals from the
   number of terminals.
4. The probability that the first symbol on the right-hand side is a
   nonterminal symbol.
5. The comparison ratio between nonterminals and terminals in the
   chromosome.

B. Crossover
The crossover operator used in the grammar induction process is
single-point crossover. Crossover is applied when a randomly generated
value is smaller than the given crossover probability. It is done by
choosing a crossing point and swapping the symbols that follow it. An
example of the crossover operation can be seen in Figure 3.

Figure 3. Crossover Example

C. Mutation
The mutation operator used is uniform mutation. Mutation is applied
when a randomly generated value is smaller than the given mutation
probability. It is done by swapping a nonterminal symbol with a
terminal symbol, and vice versa. A restriction applies when the
mutation point is on the left-hand side: in that case the replacement
symbol must be a nonterminal. If the mutation point is not on the
left-hand side, the replacement can be a nonterminal or a terminal
symbol. An example of the mutation process can be seen in Figure 4.

Figure 4. Mutation Example

D. Parser
The parser plays an important role in the fitness function. The parser
used in the grammar induction process is the Earley parser [6], which
is a type of chart parser. It was selected because it can handle left
recursion and ambiguity.
The parser is used to determine whether the chromosome, that is, the
resulting grammar, can parse the sentences in the training corpora. It
returns success or failure for each sentence parsed. In the fitness
function, the parser is used to determine how many sentences from the
training corpora can be parsed.

E. Fitness Function
The fitness function is used to assess how good the grammar produced
by the genetic process is. It is based on how many sentences can be
parsed by the chromosome, that is, the grammar of an individual. The
fitness function used is as follows:

F(X) = C(X) / N(X) × 100

where C(X) is the number of sentences that can be parsed by the parser
and N(X) is the number of sentences in the corpora. (The equation
itself is missing from this copy; the form above is inferred from
these definitions and from the percentage-scale fitness values in
Table 3.)

F. Selection Method
Various selection methods are used in genetic algorithms, such as
roulette wheel and tournament selection. The method used for selection
in the grammar induction process is roulette wheel, which is often
used in genetic algorithms. The chromosome (grammar) that is selected
is the one with the highest fitness.

V. TESTING AND RESULT
This section discusses the results of the experiments that have been
performed. The experiments were carried out on 4 corpora: “Bawang
Merah Bawang Putih”, the story of “Malin Kundang”, and two collections
of three short stories each from Kompas (May 2011 and July 2008). The
first experiment uses “Bawang Merah Bawang Putih”, the second uses
“Malin Kundang”, the third uses the collection of three short stories
from Kompas May 2011, and the fourth uses the collection of three
short stories from Kompas July 2008. The parser used in the testing
phase is the Earley parser [6]. The results can be seen in Table 3.
An example of a grammar produced by the induction process can be seen
in Table 4. It is the grammar induced from the “Bawang Merah Bawang
Putih” story. The nonterminal symbols in Table 4 are S, B, C, E, D,
and F, which are generated automatically.
Training to obtain a grammar takes from 3 to 8 hours for a total of
10 to 20 iterations with 5 individuals in each experiment. The
parameters of the training process must be set for each experiment;
they are listed per corpus in Table 2. The number of production rules
is 120. The maximum number of right-hand-side symbols ranges from 3
to 4. The

percentage of nonterminals relative to the number of terminals is 50%,
and the probability that the first right-hand-side symbol is a
nonterminal ranges from 0.2 to 0.3. The comparison ratio between
nonterminals and terminals in the chromosome is 0.3. The crossover
probability ranges from 0.6 to 0.9 and the mutation probability from
0.5 to 0.9.
For the testing process, the grammar from the training process is used
to parse the sentences in the testing corpora, and the number of
sentences that can be parsed is counted. The more sentences of the
testing corpora can be parsed, the better the result of the
experiment.
The parameters shown in Table 2 are the ones that gave the best result
in each experiment, and the results are shown in Table 3. The results
are quite accurate from the first experiment to the fourth. Table 4
shows one example of an induced grammar, which is relatively different
from Indonesian language grammar but can be used to parse the document
accurately.

TABLE II. PARAMETER FOR EACH EXPERIMENT
Parameter                                        Experiments
                                                 1st    2nd    3rd    4th
The number of production rules in one
chromosome                                       120    120    120    120
The number of maximum symbols on right
hand side                                        3      3      3      4
Percentage of the number of nonterminal
obtained from the number of terminal             50%    50%    50%    50%
Probability that first symbol on right hand
side is a nonterminal symbol                     0.2    0.3    0.2    0.3
Comparison ratio between nonterminal and
terminal in chromosome                           0.3    0.3    0.3    0.3
Crossover probability                            0.8    0.6    0.8    0.7
Mutation probability                             0.9    0.5    0.9    0.5

TABLE III. TESTING RESULT FOR EACH EXPERIMENT
Result                                           Experiments
                                                 1st    2nd    3rd    4th
The number of terminal symbols                   13     11     13     13
The number of nonterminal symbols                6      5      6      6
The number of sentences from the training
corpora                                          60     35     167    272
Fitness value                                    83.37  77.23  97.01  94.85
Testing (how many sentences can be parsed
from testing corpora)                            22/25  16/17  71/72  115/116

TABLE IV. GRAMMAR EXAMPLE
No.  Grammar
1.   S::=B | S KDP | C KDP C | S D | TB C | . KK KN | , | KB KK B |
     KDP KKT | KKT D E | TB E S " | KN | KN KL S D | TB S KN F |
     KS KS | KN F | . KKT KB | KKT | TB KH | KS KG
2.   B::=S E E | C KH " | KDP | " F KL C | " C E D | KG KL |
     . KB KDP | KKT S | KN KK KB S | KB F D | KB TB | KDP |
     KG KK KL , | , KDP | KB | " KB | TB . " KG | KS TB KH . |
     . " KG KS | KS KN KN KN
3.   C::=E B TB B | E D C | B | B KDP B | . C E | KH C | KN B KDP |
     F | KH D | , KB , TB | " C F | KL E , | , KH | KKT D | . F KG |
     KDP KG | . | KH | KS | KG " | " KG KKT KS
4.   E::=D KKT C | C | B B KG C | E | " KL B F | TB KS KL |
     KK KKT , | , D , | KKT | . KN B | KG KL | TB D S B | KB | KB |
     , KKT KS KL | KG , | " | KK | . KS KDP " | KB KDP
5.   D::=C KH | F B " | E | F | S | F B KB | S KKT KL | E KL E KH |
     KDP KK D | " | KG KL | KK . KS | KL E | KN S | , S F S | TB F |
     KK KB F D | KK . KB KH | KG KH KK . | " KB
6.   F::=C | B | KL | KK KK E | " D | KKT KK D | KDP KH F | . |
     KL KK | TB , S KKT | KG F , | KK | KK KH | KH KN KDP KKT |
     TB KKT KDP KB | . KH TB TB | KH KG KS KS | KN KN KS KN

VI. CONCLUSION
Several things can be concluded from this experiment. Tagset reduction
plays an important role in the level of grammatical complexity. The
tagset needs to be reduced because the real tagset makes the genetic
process ineffective.
The total number of production rules in a chromosome is proportional
to the number of terminal symbols used in the grammar induction
process. In forming a chromosome, the rules for structuring it play a
vital role in creating a good chromosome. The comparison ratio between
terminal and nonterminal symbols is also important. The parser plays a
very important role in the fitness function and the testing process.
The resulting grammar is not a common Indonesian language grammar, but
it matches the grammar of the corpora, and its structure differs from
a grammar designed manually by a human. This is due to the reduction
of the tagset, which makes the grammar less detailed.

VII. FURTHER RESEARCH
Based on the research that has been done, some things can be improved
to increase accuracy. In particular, the tagset reduction rule should
be examined again so that the result can be better.

ACKNOWLEDGMENT
We would like to thank all participants who helped and supported us in
doing this experiment. We also hope that our experiment can be useful
for other NLP experiments. With the method offered here, it is
expected that this grammar induction process can be used for other
languages. The tagset can be customized to the tagset of the language
to be induced.

REFERENCES
[1] N.S. Choubey, M.U. Kharat and Hari Mohan Pandey, “Developing
    Genetic Algorithm Library Using Java for CFG Induction”,
    International Journal of Advancement in Technology, 2011.
[2] Bill Keller and Rudi Lutz, “Evolving Stochastic Context-Free
    Grammars from Examples Using a Minimum Description Length
    Principle”, Workshop on Automata Induction, Grammatical Inference
    and Language Acquisition, ICML-97, 1997.
[3] N.S. Choubey and M.U. Kharat, “Sequential Structuring Element for
    CFG Induction Using Genetic Algorithm”, International Journal of
    Computer Applications, 2010.
[4] N.S. Choubey and M.U. Kharat, “Reproduction Operator Evaluation
    for CFG Induction Using Genetic Algorithm”, Journal of Computing,
    2010.
[5] Femphy Pisceldo, Mirna Adriani and Ruli Manurung, “Statistical
    Based Part Of Speech Tagger for Bahasa Indonesia”, Proceedings of
    the 3rd International MALINDO Workshop, co-located event
    ACL-IJCNLP 2009, 2009.
[6] Daniel Jurafsky and James H. Martin, “Speech and Language
    Processing: An Introduction to Natural Language Processing,
    Computational Linguistics, and Speech Recognition”, New Jersey,
    Prentice Hall, 2000.
