Natural Language Grammar Induction of Indonesian Language Corpora Using Genetic Algorithm
I. INTRODUCTION
Grammar induction, also known as grammatical inference, is a machine learning process that aims to produce a grammar from a corpus or corpora. The method used in this induction process is the genetic algorithm, developed by John Holland in the 1970s. Grammar induction using genetic algorithms has been studied in [1], [2], [3], and [4]: [1] develops a CFG induction library, [3] discusses structuring a chromosome for CFG induction using a genetic algorithm, and [4] evaluates reproduction operators for CFG induction using a genetic algorithm. In previous research, the induction process was generally applied to the balanced-parentheses and two-symbol-palindrome problems in context-free grammars, but it has rarely been analyzed for natural language grammar problems. Therefore, this paper explains how the grammar induction process can be carried out for natural language grammars.
Section 2 describes the flow of the grammar induction process. Section 3 discusses the data used in the process, namely corpora selection and preparation. Section 4 discusses the methods used in this paper: chromosome structure, crossover method, mutation method, parser, fitness function, and selection method. Section 5 discusses the results of the grammar induction process, Section 6 presents the conclusions, and Section 7 gives suggestions for further research.

Figure 1. Grammar Induction Process

III. CORPORA SELECTION AND PREPARATION

The corpora used in this experiment are original Indonesian stories such as "Bawang Merah Bawang Putih" and "Malin Kundang", and collections of short stories from Kompas of May 2011 and July 2008. Each corpus is split into one sentence per line for the POS tagging process; every word is then labeled with the Indonesian POS tagset [5] using a previously trained Stanford POS Tagger.
The corpora are separated into training and testing data, with a distribution of 70 percent training data and 30 percent testing data. The full Indonesian POS tagset is considered less effective in the genetic process, so it needs to be reduced; the tags are grouped according to Table 1. The reduced tagset serves as the terminals of the grammar, while the number of nonterminals is an input from the user. For example, a corpus may contain a sentence tagged as follows:
Malin/NNP termasuk/VB anak/NN yang/SC cerdas/JJ
tetapi/SC sedikit/RB nakal/JJ ./.
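As an illustration, the preparation steps above (tagging output in word/TAG form, tagset reduction, and the 70/30 split) could be sketched as follows. The reduced tag names and the reduction mapping below are assumptions for illustration only; the actual grouping is defined by Table 1, and the paper uses a pre-trained Stanford POS Tagger rather than a hand-written mapping.

```python
import random

# Illustrative reduction from the full Indonesian tagset to a coarser one.
# The target labels here are assumptions; the paper's grouping is in Table 1.
TAG_REDUCTION = {
    "NNP": "KB", "NN": "KB",   # proper/common nouns -> noun
    "VB": "KK",                # verbs
    "JJ": "KS",                # adjectives
    "SC": "KH",                # conjunctions (assumed label)
    "RB": "KT",                # adverbs (assumed label)
    ".": ".",                  # sentence terminator
}

def reduce_tags(tagged_sentence):
    """Map a 'word/TAG word/TAG ...' line onto the reduced tagset."""
    reduced = []
    for token in tagged_sentence.split():
        _, _, tag = token.rpartition("/")
        reduced.append(TAG_REDUCTION.get(tag, tag))
    return reduced

def split_corpus(sentences, train_ratio=0.7, seed=42):
    """Shuffle and split sentences 70/30 into training and testing data."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

For the example sentence above, `reduce_tags` yields the terminal sequence `['KB', 'KK', 'KB', 'KH', 'KS', 'KH', 'KT', 'KS', '.']` under this assumed mapping.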
In Figure 2, the nonterminal symbols are S and B; the terminal symbols are KB (Kata Benda) for noun, KS (Kata Sifat) for adjective, KK (Kata Kerja) for verb, and "." for the end of a sentence. Based on the chromosome structure explained before, the parameters of the chromosome structuring process are:
1. The number of production rules in one chromosome.
2. The maximum number of symbols on the right-hand side.
3. The percentage of nonterminals obtained from the number of terminals.
4. The probability that the first symbol on the right-hand side is a nonterminal symbol.
5. The ratio between nonterminals and terminals in the chromosome.

B. Crossover

The crossover operator used in the grammar induction process is single-point crossover. Crossover is applied when a randomly generated value is smaller than the given crossover probability; it is performed by choosing a crossing point and swapping the remaining symbols of the two parents. An example of the crossover operation can be seen in Figure 3.

Figure 3. Crossover Example

C. Mutation

The mutation operator used is uniform mutation. Mutation is applied when a randomly generated value is smaller than the given mutation probability, and is performed by swapping a nonterminal with a terminal symbol, or vice versa. A restriction applies when the mutation point is on the left-hand side: there, the exchanged symbol can only be a nonterminal, whereas elsewhere it can be either a nonterminal or a terminal. An example of the mutation process can be seen in Figure 4.

D. Parser

The parser plays an important role in the fitness function. The parser used in the grammar induction process is the Earley parser [6], a type of chart parser, chosen because it can handle left recursion and ambiguity.
The parser determines whether the chromosome, that is, the resulting grammar, can parse the sentences in the training corpora. It returns a success or fail result for each sentence parsed; in the fitness function, it is used to count how many sentences of the training corpora can be parsed.

E. Fitness Function

The fitness function assesses how good a grammar produced by the genetic process is. It is based on how many sentences can be parsed by the chromosome, i.e. the grammar of an individual:

Fitness(X) = C(X) / N(X) × 100

where C(X) is the number of sentences that can be parsed by the parser and N(X) is the number of sentences in the corpora.

F. Selection Method

Various methods can be used for selection in a genetic algorithm, such as roulette wheel, tournament, etc. The method used for selection in the grammar induction process is the roulette wheel, a method often used in genetic algorithms: a chromosome (grammar) is selected with a probability proportional to its fitness, so grammars with higher fitness are more likely to be chosen.

V. TESTING AND RESULT

This section discusses the results of the experiments that have been performed. Experiments were carried out on four corpora: "Bawang Merah Bawang Putih", the story of "Malin Kundang", and two collections of three short stories each from Kompas May 2011 and Kompas July 2008. The first experiment uses "Bawang Merah Bawang Putih", the second uses "Malin Kundang", the third uses the collection of three short stories from Kompas May 2011, and the fourth uses the collection of three short stories from Kompas July 2008. The parser used in this testing phase is the Earley parser [6]. The results of this process can be seen in Table 3.
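The parsing counts behind Table 3 come from running the Earley parser over each sentence. As a rough sketch of how such a recognizer decides whether a tag sequence is derivable from a grammar, including left-recursive rules as noted in Section IV.D, consider the following minimal implementation; the grammar encoding is an assumption, and epsilon rules are not supported.

```python
def earley_parse(grammar, tokens, start="S"):
    """Earley recognizer: True if `tokens` derive from `start`.
    `grammar` maps each nonterminal to a list of right-hand-side tuples;
    any symbol not in `grammar` is treated as a terminal."""
    nonterminals = set(grammar)
    # A chart state is (lhs, rhs, dot, origin).
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0].update((start, rhs, 0, 0) for rhs in grammar[start])
    for i in range(len(tokens) + 1):
        queue = list(chart[i])
        while queue:
            lhs, rhs, dot, origin = queue.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in nonterminals:                     # predictor
                    for prod in grammar[sym]:
                        state = (sym, prod, 0, i)
                        if state not in chart[i]:
                            chart[i].add(state)
                            queue.append(state)
                elif i < len(tokens) and tokens[i] == sym:  # scanner
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                           # completer
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        state = (l2, r2, d2 + 1, o2)
                        if state not in chart[i]:
                            chart[i].add(state)
                            queue.append(state)
    return any(lhs == start and origin == 0 and dot == len(rhs)
               for lhs, rhs, dot, origin in chart[len(tokens)])
```

For example, with the left-recursive grammar S ::= S KS | KB, the sequence `['KB', 'KS', 'KS']` is accepted.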
An example of a grammar resulting from the grammar induction process can be seen in Table 4. The grammar in Table 4 was induced from the "Bawang Merah Bawang Putih" story. Its nonterminal symbols, S, B, C, D, E, and F, are generated automatically.
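The genetic cycle that produces such a grammar, as described in Section IV, can be sketched as follows; the chromosome encoding, symbol sets, and default parameter values are illustrative assumptions rather than the paper's exact implementation.

```python
import random

# Illustrative symbol sets (the reduced tags from the running example).
NONTERMINALS = ["S", "B", "C", "D", "E", "F"]
TERMINALS = ["KB", "KS", "KK", "."]

def random_chromosome(n_rules=10, max_rhs=3):
    """A chromosome is a list of production rules (lhs, rhs)."""
    rules = []
    for _ in range(n_rules):
        lhs = random.choice(NONTERMINALS)
        rhs = tuple(random.choice(NONTERMINALS + TERMINALS)
                    for _ in range(random.randint(1, max_rhs)))
        rules.append((lhs, rhs))
    return rules

def crossover(a, b, p_cross=0.8):
    """Single-point crossover: swap the tails of two rule lists."""
    if random.random() >= p_cross or min(len(a), len(b)) < 2:
        return a[:], b[:]
    point = random.randint(1, min(len(a), len(b)) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, p_mut=0.5):
    """Uniform mutation: a mutated nonterminal becomes a terminal and vice
    versa; a left-hand side may only mutate to another nonterminal."""
    result = []
    for lhs, rhs in chromosome:
        if random.random() < p_mut:
            lhs = random.choice(NONTERMINALS)   # left-hand-side restriction
        new_rhs = []
        for sym in rhs:
            if random.random() < p_mut:
                pool = TERMINALS if sym in NONTERMINALS else NONTERMINALS
                sym = random.choice(pool)
            new_rhs.append(sym)
        result.append((lhs, tuple(new_rhs)))
    return result

def fitness(chromosome, sentences, parse_fn):
    """Percentage of training sentences the grammar can parse."""
    parsed = sum(1 for s in sentences if parse_fn(chromosome, s))
    return 100.0 * parsed / len(sentences)

def roulette_select(population, fitnesses):
    """Fitness-proportionate (roulette-wheel) selection."""
    pick = random.uniform(0, sum(fitnesses))
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]
```

Here `parse_fn` stands in for the Earley parser of Section IV.D, returning True when the grammar derives a sentence.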
Figure 4. Mutation Example

The training process to obtain the grammar takes from 3 to 8 hours for a total of 10-20 iterations, with 5 individuals in each experiment. Before training, the parameters of each experiment must be determined; they are listed in Table 2 for each corpus. The number of production rules is about 120. The maximum number of symbols on the right-hand side is 3 to 4. The
percentage of nonterminals obtained from the number of terminals is 50%, and the probability that the first symbol is a nonterminal ranges from 0.2 to 0.3. The ratio between nonterminals and terminals in a chromosome is 0.3. The crossover probability ranges from 0.6 to 0.9 and the mutation probability from 0.5 to 0.9.
For the testing process, we use the grammar from the training process, parse the sentences in the testing corpora, and count how many sentences that grammar can parse. The more sentences that can be parsed in the testing corpora, the better the result of the experiment.

TABLE II. PARAMETER FOR EACH EXPERIMENT

Parameter | 1st | 2nd | 3rd | 4th
The number of production rules in one chromosome | 120 | 120 | 120 | 120
The maximum number of symbols on the right-hand side | 3 | 3 | 3 | 4
Percentage of nonterminals obtained from the number of terminals | 50% | 50% | 50% | 50%
Probability that the first symbol on the right-hand side is a nonterminal | 0.2 | 0.3 | 0.2 | 0.3
Ratio between nonterminals and terminals in the chromosome | 0.3 | 0.3 | 0.3 | 0.3
Crossover probability | 0.8 | 0.6 | 0.8 | 0.7
Mutation probability | 0.9 | 0.5 | 0.9 | 0.5

The parameters shown in Table 2 are used to obtain the best experiment results, which are shown in Table 3. The results are quite accurate from the first experiment through the fourth. Table 4 shows one example of an induced grammar; it is rather different from standard Indonesian grammar, but it can parse the documents accurately.

TABLE III. TESTING RESULT FOR EACH EXPERIMENT

Measure | 1st | 2nd | 3rd | 4th
The number of terminal symbols | 13 | 11 | 13 | 13
The number of nonterminal symbols | 6 | 5 | 6 | 6
The number of sentences in the training corpus | 60 | 35 | 167 | 272
Fitness value | 83.37 | 77.23 | 97.01 | 94.85
Testing (sentences parsed from the testing corpus) | 22/25 | 16/17 | 71/72 | 115/116

TABLE IV. GRAMMAR EXAMPLE

1. S ::= B | S KDP | C KDP C | S D | TB C | . KK KN | , | KB KK B | KDP KKT | KKT D E | TB E S " | KN | KN KL S D | TB S KN F | KS KS | KN F | . KKT KB | KKT | TB KH | KS KG
2. B ::= S E E | C KH " | KDP | " F KL C | " C E D | KG KL | . KB KDP | KKT S | KN KK KB S | KB F D | KB TB | KDP | KG KK KL , | , KDP | KB | " KB | TB . " KG | KS TB KH . | . " KG KS | KS KN KN KN
3. C ::= E B TB B | E D C | B | B KDP B | . C E | KH C | KN B KDP | F | KH D | , KB , TB | " C F | KL E , | , KH | KKT D | . F KG | KDP KG | . | KH | KS | KG " | " KG KKT KS
4. E ::= D KKT C | C | B B KG C | E | " KL B F | TB KS KL | KK KKT , | , D , | KKT | . KN B | KG KL | TB D S B | KB | KB | , KKT KS KL | KG , | " | KK | . KS KDP " | KB KDP
5. D ::= C KH | F B " | E | F | S | F B KB | S KKT KL | E KL E KH | KDP KK D | " | KG KL | KK . KS | KL E | KN S | , S F S | TB F | KK KB F D | KK . KB KH | KG KH KK . | " KB
6. F ::= C | B | KL | KK KK E | " D | KKT KK D | KDP KH F | . | KL KK | TB , S KKT | KG F , | KK | KK KH | KH KN KDP KKT | TB KKT KDP KB | . KH TB TB | KH KG KS KS | KN KN KS KN

VI. CONCLUSION

From this experiment, several things can be concluded. Tagset reduction plays an important role in the level of grammatical complexity; the tagset needs to be reduced because the full tagset makes the genetic process ineffective.
The number of production rules in a chromosome is proportional to the number of terminal symbols used in the grammar induction process. In the formation of a chromosome, the rules for structuring a chromosome play a vital role in creating a good chromosome, and the ratio between terminal and nonterminal symbols is also important. The parser plays a very important role in the fitness function and the testing process.
The resulting grammar is not a common Indonesian grammar, but it matches the grammar of the corpora, and its structure differs from grammars designed manually by humans. This is due to the reduction of the tagset, which makes the grammar less detailed.

VII. FURTHER RESEARCH

Based on the research that has been done, some things can still be improved to increase accuracy; in particular, the tagset reduction rules should be examined again so that the results can be better.

ACKNOWLEDGMENT

We would like to thank all participants who helped and supported us in this experiment. We hope that our experiment can be useful for other NLP experiments. With the method offered here, it is expected that this grammar induction process can be used for other languages; the tagset can be customized to the tagset of the language to be induced.

REFERENCES

[1] N.S. Choubey, M.U. Kharat, and Hari Mohan Pandey, "Developing Genetic Algorithm Library Using Java for CFG Induction", International Journal of Advancement in Technology, 2011.
[2] Bill Keller and Rudi Lutz, "Evolving Stochastic Context-Free Grammars from Examples Using a Minimum Description Length Principle", Workshop on Automata Induction, Grammatical Inference and Language Acquisition, ICML-97, 1997.
[3] N.S. Choubey and M.U. Kharat, "Sequential Structuring Element for CFG Induction Using Genetic Algorithm", International Journal of Computer Applications, 2010.
[4] N.S. Choubey and M.U. Kharat, "Reproduction Operator Evaluation for CFG Induction Using Genetic Algorithm", Journal of Computing, 2010.
[5] Femphy Pisceldo, Mirna Adriani, and Ruli Manurung, "Statistical Based Part of Speech Tagger for Bahasa Indonesia", Proceedings of the 3rd International MALINDO Workshop, co-located with ACL-IJCNLP 2009, 2009.
[6] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, New Jersey: Prentice Hall, 2000.