Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

L. Araujo
Abstract—This paper presents some applications of evolutionary programming to different tasks of natural language processing (NLP). First, the work defines a general scheme for the application of evolutionary techniques to NLP, which provides the guidelines for the design of the elements of the algorithm. This scheme largely relies on the success of probabilistic approaches to NLP. Second, the scheme is illustrated with two fundamental applications in NLP: tagging, i.e., the assignment of lexical categories to words, and parsing, i.e., the determination of the syntactic structure of sentences. In both cases, the elements of the evolutionary algorithm are described in detail, as well as the results of different experiments carried out to show the viability of this evolutionary approach for tasks as complex as those of NLP.

Index Terms—Evolutionary programming, lexical tagging, natural language processing (NLP), parsing, probabilistic methods.

Manuscript received October 10, 2002; revised February 3, 2003.
The author is with the Departamento de Sistemas Informáticos y Programación, Universidad Complutense, Madrid 28040, Spain (e-mail: lurdes@sip.ucm.es).
Digital Object Identifier 10.1109/TEVC.2003.818189

I. INTRODUCTION

NATURAL language processing (NLP) includes a large number of complex tasks, such as the determination of the lexical tag which corresponds to each word in a text (tagging), the parsing of sentences, the determination of the antecedent of pronouns and relative clauses (anaphora resolution), the determination of the words belonging to an expression, etc. Features such as ambiguity make these tasks complex search processes, which require specifically designed algorithms to be solved efficiently. The use of heuristic techniques can help to design a general framework to deal with different NLP processes. The price to pay is that we are no longer certain of the correctness of the solution found, although good heuristic methods allow us to systematically decrease the probability of error (for instance, by increasing the number of points explored). Evolutionary algorithms (EAs) are among such heuristic techniques; they are stochastic algorithms based on a search model that emulates evolutionary phenomena.

An important aspect concomitant with the introduction of statistical techniques in text processing is that it transforms NLP tasks into optimization processes, for which evolutionary techniques (already applied to areas such as planning or machine learning) have proven to be very useful [1]. Other optimization techniques have also been applied to NLP, e.g., neural networks [2]–[4] or pattern matching [5]–[7]. In the last decades, statistical methods have become a fundamental approach to computational linguistics, bringing significant advances in issues such as disambiguation, error correction, or grammar induction. Applied to NLP, these methods rest on the assumption that we can extract knowledge about a language (that is, about its structure and semantics) by defining a general model with free parameters, whose values are then determined by statistical techniques. Statistical measurements are extracted from an annotated corpus [8], i.e., a set of texts linguistically marked up. Probabilistic approaches have already been used successfully to deal with a large number of NLP problems. Research on these methods has grown considerably in recent years, probably due to the increasing availability of large tagged corpora.

One NLP task to which statistical methods have been applied very successfully is lexical tagging, i.e., the disambiguation in the assignment of an appropriate lexical tag to each word in a text [9]. There are different proposals to perform this disambiguation automatically, all of which use a large amount of data to establish the probabilities of each situation. Most of these systems are based on hidden Markov models (HMMs) and variants [9]–[15], and neither require knowledge of the rules of the language nor try to deduce them. There are also rule-based approaches [16]–[18], which apply language rules, also extracted from an annotated corpus, to improve the accuracy of the tagging. The accuracy of both approaches is so far comparable.

Another task carried out by statistical techniques [19], [20] is word sense disambiguation. It consists in determining which of the senses of an ambiguous word is referred to in a particular occurrence of the word in a text. It differs from tagging in that it uses semantic tags instead of lexical tags. These two issues have to be differentiated because nearby structures are more influential in lexical tagging, while relatively distant words are more useful for word sense disambiguation.

Parsing is a preceding step for many NLP tasks. In most cases, there are many different parses for the same sequence of words; that is, syntactic ambiguity occurs very often. Stochastic grammars [9], [21]–[24] represent another important part of the statistical methods in computational linguistics. These grammars give us an evaluation of the tree representing a possible parse of a sentence. The goal is then to find one of the best parses according to this measurement.

Machine translation, i.e., automatic translation from one language to another, is one of the most important applications of NLP. A fundamental preliminary step in machine translation is text alignment, that is, establishing the correspondence between paragraphs, sentences, or words of parallel texts in different languages. A number of publications deal with statistical alignment [25]–[27]. There are also some other approaches on
ARAUJO: SYMBIOSIS OF EVOLUTIONARY TECHNIQUES AND STATISTICAL NATURAL LANGUAGE PROCESSING 15
statistical machine translation [26], [28]. And there still remains a large number of NLP topics to which statistical techniques have been applied, such as different aspects of information retrieval or word clustering.

The kind of nonexhaustive search performed by EAs makes them suitable to be combined with probabilistic techniques in NLP. The purpose of most EAs is to find a good solution, not necessarily the best one, and this is enough for most statistical natural language processes. At the same time, EAs provide reasonable accuracy, as well as a single algorithmic scheme across different problems. The symbiosis between EAs and statistical NLP comes from the simplicity of developing systems for statistical NLP using EAs, because EAs provide a unified treatment of different applications, and from the fact that the trickiest elements of the EA, such as the fitness function or the adjustment of the parameters, are highly simplified by resorting to the statistical NLP models. EAs have already been applied to some issues of natural language processing [29], such as query translation [30], inference of context-free grammars [31]–[34], tagging [15], and parsing [24]. The study of these approaches enables us to observe patterns of similarity in the application of these techniques.

This paper aims to show the natural symbiosis of EAs and statistical NLP. The statistical measurements extracted by the stochastic approaches to NLP provide in a natural way an appropriate fitness function for EAs. Furthermore, the annotated corpus or training texts required for the application of statistical methods also provide a convenient benchmark for fine-tuning the algorithm parameters (population size, crossover and mutation rates, etc.; see below), a fundamental issue for the behavior of the algorithm. Accordingly, this paper presents a general framework for the application of evolutionary techniques to statistical NLP. The guidelines for the definition of the different elements of an EA to solve some statistical NLP tasks are presented. These ideas are exemplified by means of particular algorithms for two central problems in NLP: tagging and parsing.

The rest of the paper proceeds as follows. Section II describes the generic structure of an EA for NLP. Section III applies this scheme to tagging and includes experimental results. Section IV presents an evolutionary parser for natural language, including a parallel version and experimental results for each version. Finally, Section V summarizes the main conclusions of this work.

II. EVOLUTIONARY ALGORITHMS AND NATURAL LANGUAGE PROCESSING

An EA is basically a sequence of generations (iterations), each of which produces a new population of individuals out of the previous one. Each individual represents a potential solution to the problem, whose "fitness" (a quantification of "closeness" to an exact solution) can be measured. The population of a new generation is made of the survivors from the previous generation, as well as of individuals resulting from the application of "genetic" operators to members of the previous population, randomly selected with a probability proportional to their fitness. After a number of generations, the population is expected to contain some individuals representing a near-optimum solution. Thus, EAs perform a random search, potentially able to reach any region of the search space because of the random selection. However, at the same time they show a marked tendency to keep the most promising points for possible improvement, because the fittest individuals have a higher probability of surviving. EAs can also be designed as learning systems, in which the evaluation of the individuals is defined in such a way that the EA is trained to solve a particular problem. In this case, the system gives the best fitness value to a number of solved situations that are provided, and then uses this information to evaluate new inputs.

Different EAs can be formulated for a particular problem. Such programs may differ in many ways: the representation of individuals, the genetic operators, the methods for creating the initial population, the parameters, etc. The approach adopted in this paper is based on the evolutionary programming method, which allows any set of data structures for the representation of individuals [1]. Thus, it differs from classical genetic algorithms, which use fixed-length binary strings as the representation of individuals.

An evolutionary program for a particular problem requires the following elements:
• a representation of the potential solutions to the problem, i.e., of the individuals of the population;
• a method for producing an initial set of individuals, which will represent the initial population;
• an evaluation function to compute the fitness of an individual. This function indicates "how suitable" an individual is as a solution to the problem at hand, according to a training set;
• genetic operators to modify the individuals of a population and to produce new ones for the next generation;
• values for the different parameters which characterize the algorithm: population size, survival probability to the next generation, rates of application of the genetic operators, etc.

Now, let us consider the similarities and differences between EAs applied to different tasks of statistical NLP. It is clear that the tasks of NLP are of a very different nature, and they require different structures to represent the data to which the computational process is applied (although some of the structures, such as parse trees, are common to many of them). The most relevant similarity is the use of the measurements provided by the statistical model in the definition of a fitness function. This function guides the search of the algorithm and strongly determines its performance. At the same time, it logically represents one of the greatest difficulties in the design of an EA. Hence the great advantage of combining these techniques: EAs bring their generality to the complex search processes of NLP, and the development of the system is highly simplified by using the statistical models of NLP to define the most delicate elements of the algorithm. Furthermore, the training text used to provide the statistical measurements required by the statistical model of NLP also provides a convenient benchmark to tune the parameters of the algorithm, another critical point in producing efficient EAs.
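To make the generic scheme concrete, the generational loop above can be sketched as follows. This is an illustrative skeleton, not code from the paper: the toy integer genes and sum-based fitness stand in for the tags, parses, and statistically derived fitness functions discussed in the following sections, and all names are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// A minimal generational EA skeleton. Individuals are integer vectors;
// the toy fitness just sums the genes. In the paper's applications the
// genes would be lexical tags or parse subtrees instead.
struct EA {
    std::mt19937 rng{42};
    int gene_count = 0, gene_max = 0;
    std::vector<int> best;

    double fitness(const std::vector<int>& ind) const {
        return std::accumulate(ind.begin(), ind.end(), 0.0);
    }

    // Roulette-wheel selection: probability proportional to fitness.
    const std::vector<int>& select(const std::vector<std::vector<int>>& pop,
                                   const std::vector<double>& fit) {
        std::discrete_distribution<size_t> d(fit.begin(), fit.end());
        return pop[d(rng)];
    }

    void run(int pop_size, int generations, double mut_rate) {
        std::uniform_int_distribution<int> gene(0, gene_max);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::uniform_int_distribution<int> cut(1, gene_count - 1);
        std::vector<std::vector<int>> pop(pop_size, std::vector<int>(gene_count));
        for (auto& ind : pop) for (auto& g : ind) g = gene(rng);

        for (int gen = 0; gen < generations; ++gen) {
            std::vector<double> fit(pop_size);
            for (int i = 0; i < pop_size; ++i) fit[i] = fitness(pop[i]) + 1e-9;

            std::vector<std::vector<int>> next;
            while ((int)next.size() < pop_size) {
                // One-point crossover of two fitness-selected parents.
                auto a = select(pop, fit), b = select(pop, fit);
                int c = cut(rng);
                std::vector<int> child(a.begin(), a.begin() + c);
                child.insert(child.end(), b.begin() + c, b.end());
                // Per-gene mutation at the given rate.
                for (auto& g : child) if (u(rng) < mut_rate) g = gene(rng);
                next.push_back(std::move(child));
            }
            pop = std::move(next);
        }
        best = *std::max_element(pop.begin(), pop.end(),
            [&](const auto& x, const auto& y) { return fitness(x) < fitness(y); });
    }
};
```

The problem-specific pieces (representation, initial population, fitness, operators, parameters) are exactly the five elements listed above; only the loop itself is generic.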
16 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 8, NO. 1, FEBRUARY 2004
Fig. 3. Examples of individuals for the sentence Rice flies like sand.

For example, the entry in the table for a tag T could have the form of a list of the contexts in which T occurs, each with its number of occurrences.

A. Evolutionary Tagging

The evolution process is run for each sentence in the text to be tagged. Evolution aims to maximize the probability of the tagging of the sentences in the test corpus. The process finishes either when the fitness deviation lies below a threshold value (convergence) or when the evolutionary process has been running for a maximum number of generations, whichever occurs first. Let us analyze each of the elements of the algorithm in detail.

1) Individual Representation: Individuals are sequences of genes, each consisting of the tag of a word in the sentence to be tagged plus some additional information useful in the evaluation (such as counts of different contexts for this tag according to the training table).

2) Initial Population: For a given sentence of a test corpus, the initial population is composed of a number of individuals constructed by taking from a dictionary one of the valid tags for each word, selected with a probability proportional to its frequency.

Words not listed in the dictionary are assigned the tag which appears most often with the given context in the training text, provided there are no other unknown words in the context. Since the number of unknown words is very small, this is what usually happens. Otherwise, the word is assigned a randomly chosen tag.

3) Fitness (Individual Evaluation): In this case, this function is related to the total probability of the sequence of tags of an individual. The raw data to obtain this probability are extracted from the training table. The fitness of an individual is defined as the sum of the fitness f(g) of its genes g. The fitness of a gene is defined as

f(g) = log P(T | LC, RC)

where P(T | LC, RC) is the probability that the tag of gene g is T, given that its context is formed by the sequence of tags LC to the left and the sequence RC to the right (the logarithm is taken to make fitness additive). This probability is estimated from the training table as

P(T | LC, RC) ≈ occ(LC, T, RC) / Σ_T' occ(LC, T', RC)

where occ(·) denotes the number of occurrences recorded in the table. The remaining genes are evaluated in the same manner.

A special situation needs to be handled separately. It may happen that a particular sequence LC, T, RC is not listed in the training table. This may be so because its probability is strictly zero (if the sequence of tags is forbidden for some reason) or, most likely, because there are insufficient statistics (it is easy to realize that even short contexts exhibit such a huge number of possibilities that gigantic corpora would be needed in order to have reliable statistics for every possibility). When this occurs, we proceed as follows. If RC is not empty, we ignore the rightmost tag of the context and consider the new (shorter) sequence. We seek in the training table all sequences matching this shorter context and take as the number of occurrences the sum of the numbers of occurrences of the new context. If RC is empty, or there are no sequences matching the shorter one, we carry on by ignoring the leftmost tag of the context (provided LC is not empty), and we repeat the search with this even shorter sequence. The process continues alternately ignoring the rightmost and then the leftmost tag of the remaining sequence (skipping the corresponding step whenever either RC or LC is empty) until one of these shorter sequences matches at least one of the training table entries, or until we are left simply with T. In this latter case, we take for f(g) the logarithm of the frequency with which T appears in the corpus (also listed in the training table).

The whole population is evaluated according to this procedure every generation, and the average fitness of the population is also computed.

4) Genetic Operators: Because of the representation of the individuals, genetic operators have in this case a straightforward design. To apply crossover, two individuals are selected with a probability proportional to their fitness. Then, a crossover point is randomly selected, thus dividing both individuals in two parts. In the selection of this partition point, positions corresponding to genes with low fitness in both parents are preferred. The first
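The back-off evaluation just described can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the table layout, the names (`TrainingTable`, `gene_fitness`), and the simplification that shortened contexts are looked up directly as table keys are all our assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Sketch of the gene-fitness back-off: contexts are stored as LC + T + RC
// tag sequences with occurrence counts, and a sequence missing from the
// table is shortened alternately from the right and from the left until
// some entry matches, or only the tag T itself remains.
using Tags = std::vector<std::string>;

struct TrainingTable {
    // Occurrence counts of tag sequences LC + T + RC seen in training
    // (here we assume counts for shortened contexts are stored directly).
    std::map<Tags, double> occ;
    // Raw frequency of each tag in the corpus, used as a last resort.
    std::map<std::string, double> tag_freq;

    double count(const Tags& lc, const std::string& t, const Tags& rc) const {
        Tags seq(lc);
        seq.push_back(t);
        seq.insert(seq.end(), rc.begin(), rc.end());
        auto it = occ.find(seq);
        return it == occ.end() ? 0.0 : it->second;
    }

    // f(g) = log P(T | LC, RC), with the alternating back-off described above.
    double gene_fitness(Tags lc, const std::string& t, Tags rc) const {
        bool drop_right = true;
        while (!lc.empty() || !rc.empty()) {
            double c = count(lc, t, rc);
            if (c > 0.0) {
                // Normalize over all tags observed with this same context.
                double total = 0.0;
                for (const auto& tf : tag_freq) total += count(lc, tf.first, rc);
                return std::log(c / total);
            }
            // Shorten the context: rightmost tag, then leftmost, alternately.
            if (drop_right && !rc.empty()) rc.pop_back();
            else if (!lc.empty()) lc.erase(lc.begin());
            else rc.pop_back();
            drop_right = !drop_right;
        }
        return std::log(tag_freq.at(t)); // only T left: use its corpus frequency
    }
};
```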
part of one parent is combined with the second part of the other parent, thus producing two offspring.

Mutation is then applied to every gene of the individuals resulting from the crossover operation, with a probability given by the mutation rate. The tag of the mutation point is replaced by another of the valid tags of the corresponding word. The new tag is randomly chosen according to its probability (the frequency with which it appears in the corpus).

Individuals resulting from the application of genetic operators replace an equal number of individuals, selected with a probability inversely proportional to their fitness. The number of individuals that remain unchanged in one generation depends on the crossover and mutation rates.
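A minimal sketch of these two operators on tag sequences (illustrative names, not the paper's code):

```cpp
#include <cassert>
#include <random>
#include <string>
#include <utility>
#include <vector>

// An individual for the tagging EA is just the tag sequence of one sentence.
using Tagging = std::vector<std::string>;

// One-point crossover: the first part of one parent is combined with the
// second part of the other, producing two offspring.
std::pair<Tagging, Tagging> crossover(const Tagging& a, const Tagging& b,
                                      size_t point) {
    Tagging c1(a.begin(), a.begin() + point), c2(b.begin(), b.begin() + point);
    c1.insert(c1.end(), b.begin() + point, b.end());
    c2.insert(c2.end(), a.begin() + point, a.end());
    return {c1, c2};
}

// Mutation: each gene is replaced, with probability `rate`, by another of
// the valid tags of its word, drawn according to the tags' frequencies.
void mutate(Tagging& ind,
            const std::vector<std::vector<std::string>>& valid_tags,
            const std::vector<std::vector<double>>& tag_freqs,
            double rate, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (size_t i = 0; i < ind.size(); ++i) {
        if (u(rng) < rate) {
            std::discrete_distribution<size_t> d(tag_freqs[i].begin(),
                                                 tag_freqs[i].end());
            ind[i] = valid_tags[i][d(rng)];
        }
    }
}
```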
the probabilities for all the rules that expand the same nonter-
minal add up to one. The capabilities of the parser depend on the
probabilistic grammar it uses. The probabilistic grammar is ex-
tracted from a tree-bank, i.e., a text with syntactic annotations.
The test set is formed by the sentences to be parsed, which can also be extracted from a corpus.
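As a sketch of the probabilistic machinery involved (our own minimal data structures, not the paper's implementation), a PCFG can be checked for per-nonterminal normalization, and a parse scored as the product of the probabilities of its rules:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// A PCFG rule: lhs -> rhs with probability p. For each nonterminal, the
// probabilities of all rules expanding it must add up to one.
struct Rule {
    std::string lhs;
    std::vector<std::string> rhs;
    double p;
};

bool normalized(const std::vector<Rule>& g) {
    std::map<std::string, double> sum;
    for (const auto& r : g) sum[r.lhs] += r.p;
    for (const auto& s : sum)
        if (std::fabs(s.second - 1.0) > 1e-9) return false;
    return true;
}

// A parse-tree node: the index of the rule expanding its nonterminal,
// or a leaf word (rule == -1).
struct Node {
    int rule;
    std::vector<Node> children;
};

// The probability of a parse is the product of its rules' probabilities.
double parse_prob(const Node& n, const std::vector<Rule>& g) {
    if (n.rule < 0) return 1.0; // leaf word
    double p = g[n.rule].p;
    for (const auto& c : n.children) p *= parse_prob(c, g);
    return p;
}
```

A best-first (or evolutionary) parser then searches for a tree maximizing this product over all parses the grammar allows for the sentence.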
Traditional parsers [35] try to find grammar rules to complete
a parse tree for the sentence. For instance, the main basic opera-
tion in a bottom-up parser is to take a sequence of symbols and
match them to the right-hand side of the grammar rules. Accord-
ingly, the parser can be simply formulated as a search process
of the matching operation. These parsers can be easily modified
to take into account the probabilities of the rules. The main idea
of a best-first parser is to consider the most likely constituents first.

Fig. 7. PCFG rules. Pro stands for pronoun, Det for determiner, Adj for adjective, VP for verb phrase, NP for nominal phrase, PP for prepositional phrase, AP for adjective phrase, DP for adverbial phrase, and WP for wh-phrase.

According to our scheme of construction of the EA, the population consists of potential parses for the sentence and grammar
considered. These parses are represented in the classical way,
i.e., as trees. Genetic operators are then operations over these
trees, which exchange or modify some of their subtrees.
We define a fitness function based on a measurement of the
“distance” between the individual to evaluate and a “feasible”
and probable parse for the sentence. The feasibility of the sen-
tence is measured by the number of grammar rules that have
been correctly applied in the parsing. The probability of the
parse is given by the product of the probabilities of its grammar
rules.
The complexity of this process makes a parallel approach [36] particularly suitable to speed up the convergence of the algorithm. This parallelization can be incorporated into the NLP-EA scheme for all problems whose complexity requires large populations and numbers of generations to obtain quality results. Next, the elements of the algorithm are described, along with some experiments and the parallel implementation.
A. Individual Representation

Fig. 8. Possible individuals for the sentence The man sings a song.

Fig. 10. Crossover operation algorithm. C1 and C2 are the parent individuals. C is the offspring individual. S is the input sentence.

Fig. 12. Mutation operation algorithm.
C. Genetic Operators

The genetic operators considered herein are crossover, which combines two parses to generate a new one, and mutation, which creates a new parse by replacing a randomly selected gene in a previous individual.

1) Crossover: The crossover operator generates a new individual, which is added to the population in the new generation. The part under a randomly selected point of one parent's tree is exchanged with the corresponding part of the other parent to produce two offspring, subject to the constraint that the genes exchanged correspond to the same type of syntactic category (NP, VP, etc.). This avoids wrong references of previous genes in the individual. Of course, those exchanges which produce parses inconsistent with the number of words in the sentence must be avoided. Therefore, the crossover operation (Fig. 10) performs the following steps.
• Select two parents, C1 and C2.
• Randomly select a word (selected_word) from the input sentence (line 1).
• Identify the innermost gene (identify_gen) to which the selected word corresponds in each parent (lines 2 and 3).
• If the genes correspond to different sets of words, the next gene in the innermost order is selected. This process continues until the sequences of words whose parses are to be exchanged are the same, or until the main NP or VP are reached (lines 4–7).
• If the two selected genes parse the same sequence of words, the exchange is performed (lines 8–11).
• If the process to select genes leads to the main NP or VP, and the sequences of words do not match yet (lines 12–22), the exchange cannot be performed. In this case, a new procedure is followed: in each parent, one of the two halves is maintained, while the other one is randomly generated (generate_parse) to produce a parse consistent with the number of words in the sentence. This produces four offspring.
• Finally, the best offspring (select_best) is added to the population (line 23).

As an example, let us consider the individuals in Fig. 8 and let us assume that the word selected in the input sentence is the determiner a. Fig. 11 shows the new individuals resulting from the crossover operation. The selected word corresponds to genes of different kinds in each individual (a nominal and a prepositional phrase) and, therefore, these cannot be exchanged. The next gene containing the selected word in its sequence of words is then considered. This corresponds to the main VP in both individuals, which now can be exchanged. This exchange produces two new individuals (Fig. 11). The best of the two offspring is number 1, which is added to the population.

2) Mutation: Selection for mutation is done in inverse proportion to the fitness of an individual. The mutation operation changes the parse of some randomly chosen sequence of words. The mutation operation (Fig. 12) performs the following steps.
• A gene is randomly chosen from the individual (lines 1 to 2).
• The parse of the selected gene, as well as every gene corresponding to its decomposition, is erased (line 3).
• A new parse is generated for the selected gene (line 4).
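The compatibility constraint at the heart of the crossover (same syntactic category, same covered words) can be sketched as a simple predicate; the `Gene` layout here is an illustrative assumption, not the paper's data structure:

```cpp
#include <cassert>
#include <string>

// Two subtrees (genes) may be exchanged only if they have the same
// syntactic category and cover exactly the same word positions, so the
// offspring remains consistent with the input sentence.
struct Gene {
    std::string category; // e.g., "NP", "VP"
    int first_word;       // index of the first word covered
    int last_word;        // index of the last word covered
};

bool exchangeable(const Gene& a, const Gene& b) {
    return a.category == b.category &&
           a.first_word == b.first_word &&
           a.last_word == b.last_word;
}
```

When the test fails for the innermost genes, the operator walks outward to ever larger constituents containing the selected word until it succeeds or the topmost constituents are reached, exactly as in the steps above.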
TABLE I
SENTENCES USED IN THE PARSING EXPERIMENTS

Fig. 14. Number of generations required to reach the correct parse for different input sentences, when using a crossover rate of 50% and a mutation rate of 20%.
D. Experimental Results

The algorithm has been implemented in the C++ language and run on a Pentium III processor. In general, the probabilistic grammar is extracted from a syntactically annotated text. However, the sentences of real texts present additional problems, such as punctuation symbols, multiword expressions, anaphora, etc., which have not been considered in this implementation. Because of this, the grammar and the sentences used in the experiments presented herein are artificial, constructed to test the system. In order to evaluate the performance, we have considered the parsings of the sentences appearing in Table I. The average length of the sentences is around ten words. However, they present different complexities for the parsing, mainly arising from the length and the number of subordinate phrases.

1) Study of the EA Parameters: The experiments include the study of the performance with respect to the population size. Fig. 14 shows the number of generations required to reach a correct parse for each sentence versus the population size. The behavior is different for each sentence. In general, the higher the "sentence complexity," i.e., the length and the number of subordinate phrases, the larger the population size required to reach the correct parse in a reasonable number of steps. Accordingly, we can consider that the sentences are approximately sorted by increasing complexity. We can observe in the graph that the simplest sentences, 1 and 2, reach the correct parse in a few generations even for small populations, while sentence 3 requires a large number of generations to reach the solution for small populations. On the other hand, sentences 4 and 5 only reach the correct parse for large populations. But large populations lead to replication of individuals and slow evolution, so high percentages of genetic operators are required to speed up the process.

Fig. 15. Performance for sentence 2 under different parameter settings, with a population size of 50 individuals.

In order to illustrate the performance of the evolution process, Fig. 15 represents the fitness of the best individual in the population versus the number of generations for sentence 2, with a population of 50 individuals and different crossover and mutation rates. The first observation is that the fitness grows abruptly at a certain number of generations. We can also observe that a minimum percentage of application of the genetic operators is required
in order to reach the correct parse. Thus, for the population size considered, with a crossover percentage of 10% and a mutation percentage of 5%, the solution is not reached in 500 generations, while with percentages of 30% and 10%, respectively, the solution is reached after 450 generations, and with percentages of 50% and 20%, respectively, the solution is reached after 100 generations.

E. Parallel Parsing

Despite the ability of EAs to find a "good," though perhaps approximate, solution to optimization problems, if such problems are too hard they require a very long time to give an answer. This has led to different efforts to parallelize these programs. Basically, the approaches to this parallelization can be classified in two groups [37], [38]: global parallelization, and island or coarse-grained parallelization. In the first method, there is only one population, as in the sequential model, and it is the evaluation of individuals and the application of genetic operators that is parallelized. Thus, the behavior of the sequential algorithm does not change. This method can obtain significant speedup if the communication overhead is low, as happens in shared-memory machines.

The island or coarse-grain model is probably the most popular approach for the parallelization of EAs because it does not require a high cost in communications, and it therefore exhibits high portability. In this model, the population is divided into subpopulations or demes, which usually evolve in isolation except for the exchange of some individuals, or migrations, after a number of generations. In this case, we can expect a different behavior with respect to the global parallel model, both because this model employs different parameters of the algorithm (such as the population size, which is smaller in each deme) and because the dynamics is completely changed by the migrations. This can result in a faster convergence. Though such a faster convergence may, in principle, reduce the quality of the solutions, results show [39] that the parallel model with smaller populations but with migration among demes can improve the quality of the sequential solutions, and that there is an optimal number of demes which maximizes the performance.

The complexity of the parsing problem makes it appropriate for implementing a parallel version of the EA. Both the fitness function and the genetic operators are expensive enough to make the evolutionary program appropriate for a coarse-grain parallel model. Accordingly, the parallel implementation follows an island model in which, after a number of generations, demes exchange some individuals. The system is composed of a number of processes, called cooperative parsers, each of which performs, by evolutionary programming, a parse of the same input sentence. The seed in each deme (random number generation) is different in order to obtain different individuals in different populations. There is a special process, known as the main selector, which selects the best individual among the best ones of each deme. In order to reduce the communication, migrations are not performed from any deme to any other, but only to the next deme in a round-robin manner (Fig. 16). The cooperative parsers are assigned an arbitrary number between 0 and N−1, and parser N−1 considers that its next parser is numbered 0. Nevertheless, this policy has been compared with an all-to-all policy in order to guarantee that the adopted option does not imply a significant reduction of the solution quality. The model is asynchronous, and after a fixed number of generations a parser sends a fixed number of individuals to the next parser, and then continues the evolution, checking in each generation for the arrival of the same number of individuals from the previous parser. The selection of the individuals to be sent in a migration is done with a probability proportional to their fitness.

Different experiments have been carried out in order to tune this basic model. These experiments are described in the next section.

Fig. 16. Migration policy. CP stands for cooperative parser, MS for main selector.

F. Experiments

The algorithm has been implemented in the C++ language on an SGI-CRAY ORIGIN 2000 computer² using the parallel virtual machine (PVM) [40], a software package which allows a heterogeneous network of parallel and serial computers to appear as a single concurrent computational resource. In order to evaluate the performance, we have considered the parsing of the most complex sentences appearing in Table I.

TABLE II
TIME IN SECONDS REQUIRED TO REACH A CORRECT PARSE WHEN PROCESSING SEQUENTIALLY AND IN PARALLEL

Table II shows the improvement in performance obtained when increasing the number of processors. This experiment has been carried out with a population size of 200 individuals, the minimum required for the sequential version to reach the correct parse, a crossover rate of 50%, a mutation rate of 20%, a migrated population of size 40, and a migration interval of 15 generations. We can observe that the parallel execution achieves a significant improvement even with only two processors. We can also observe that the performance reaches saturation for a certain number of processors (around eight). However, these figures are expected to increase with the complexity of the analyzed sentences.

²40 MIPS R10000 (250 MHz) processors, 16 MIPS R12000 (400 MHz) processors, and 12 GB RAM.

V. CONCLUSION

This work develops a general framework, the NLP-EA scheme, to design EAs devoted to tasks concerning statistical NLP. The statistical measurements extracted by the stochastic approaches to NLP provide an appropriate fitness function for the EA in a natural way. Furthermore, the corpus or training texts required for the application of statistical methods also
V. CONCLUSION

This work develops a general framework, the NLP-EA scheme, to design EAs devoted to tasks concerning statistical NLP. The statistical measurements extracted by the stochastic approaches to NLP provide an appropriate fitness function for the EA in a natural way. Furthermore, the corpus or training texts required for the application of statistical methods also provide a perfect benchmark for adjusting the algorithm parameters (population size, crossover and mutation rates, etc.), a fundamental issue for the behavior of the algorithm. On the other hand, EAs provide their characteristic way of performing a nonexhaustive search for solving NLP tasks. Accordingly, a natural symbiosis between statistical NLP techniques and EAs arises.

The NLP-EA scheme has been illustrated with an EA for tagging that works with a population of potential taggings for each input sentence in the text. The evaluation of individuals is based on a training table composed of contexts extracted from a set of training texts. Results indicate that the evolutionary approach is robust enough for tagging natural language texts, obtaining accuracies comparable to those of other statistical systems. The tests indicate that the length of the contexts extracted for training is a determining factor in the results.

Another example of application of the NLP-EA scheme has been an EA for parsing. In this case, the algorithm works with a population of potential parses for a given PCFG and an input sentence. An appropriate data structure to represent potential parses has been designed in order to enable easy evaluation of individuals and application of the genetic operators. Again, the results indicate the validity of the EA approach for parsing positive examples of natural language, and thus the system developed here can provide a promising starting point for parsing sentences of realistic length and complexity. The tests show that the EA parameters need to be suited to the complexity of the input sentence: the more complex the sentence (in length and degree of subordination), the larger the population size required to quickly reach a correct parse. Furthermore, as the population size increases, higher rates of crossover and mutation are required to maintain the efficiency of the algorithm. This algorithm has been parallelized in the form of an island model, in which processors exchange migrating populations asynchronously and in a round-robin sequence. The results obtained in these experiments exhibit a clear improvement in performance, thus showing that the problem has enough granularity for parallelization.

These experiments indicate the effectiveness of an evolutionary approach to tagging and parsing. The accuracy obtained in tagging with these methods is comparable with that of other probabilistic approaches, despite the fact that other systems use algorithms specifically designed for this problem. The evolutionary approach does not guarantee the right answer, but it always produces a close approximation to it in a reasonable amount of time. On the other hand, it can be a suitable tool for NLP because natural languages pose some features, such as certain kinds of ambiguity, that not even human beings are able to resolve. Furthermore, the result obtained from the EA can be the input to a classic method, such as logic programming or constraint programming, which in this way will perform a smaller search.

A possible line of future work is the unsupervised processing of different NLP tasks. For instance, we could be interested in performing unsupervised tagging of texts because we do not have an appropriate tagged corpus available for training. In this case, there would be no training texts, and all statistical information would have to be extracted from the text to be tagged. EAs can be applied to perform this kind of task, using generalizations for the tags until they get enough information to decide the most appropriate tag.

EAs might also provide new insights and new directions for investigating the statistical nature of language. In particular, they can give a new approach to natural language generation, for which the space of results is very much open.

ACKNOWLEDGMENT

The author would like to express her gratitude to the referees for their detailed revisions and helpful comments. The author is also very grateful to J. Cuesta for his careful reading of the manuscript and encouragement.

REFERENCES

[1] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 2nd ed. New York: Springer-Verlag, 1994.
[2] G. Berg, "A connectionist parser with recursive sentence structure and lexical disambiguation," in Proc. 10th Nat. Conf. Artificial Intelligence, 1992, pp. 32–37.
[3] R. Miikkulainen, Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. Cambridge, MA: MIT Press, 1993.
[4] H. Schmid, "Part-of-speech tagging with neural networks," in Proc. Int. Conf. Computational Linguistics, 1994, pp. 172–176.
[5] C. Cardie and D. Pierce, "Error-driven pruning of treebank grammars for base noun phrase identification," in Proc. ACL-Coling, 1998, pp. 218–224.
[6] C. Cardie, S. Mardis, and D. Pierce, "Combining error-driven pruning and classification for partial parsing," in Proc. 16th Conf. Machine Learning, 1999, pp. 87–96.
[7] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[8] T. McEnery and A. Wilson, Corpus Linguistics. Edinburgh, U.K.: Edinburgh Univ. Press, 1996.
[9] E. Charniak, Statistical Language Learning. Cambridge, MA: MIT Press, 1993.
[10] F. Jelinek, "Self-organized language modeling for speech recognition," in Readings in Speech Recognition, J. Skwirzinski, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 450–506.
[11] C. DeMarcken, "Parsing the LOB corpus," in Proc. 1990 Association for Computational Linguistics Conf., 1990, pp. 243–251.
[12] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, "A practical part-of-speech tagger," in Proc. 3rd Conf. Applied Natural Language Processing, 1992, pp. 133–140.
[13] H. Schutze and Y. Singer, "Part of speech tagging using a variable memory Markov model," in Proc. 1994 Association for Computational Linguistics Conf., 1994, pp. 181–187.
[14] B. Merialdo, "Tagging English text with a probabilistic model," Comput. Linguistics, vol. 20, no. 2, pp. 155–172, 1994.
[15] L. Araujo, "Part-of-speech tagging with evolutionary algorithms," in Proc. Int. Conf. Intelligent Text Processing and Computational Linguistics (CICLing-2002), Lecture Notes in Computer Science, vol. 2276, 2002, pp. 230–239.
[16] J. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[17] E. Brill, "Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging," Comput. Linguistics, vol. 21, no. 4, 1995.
[18] ——, "Unsupervised learning of disambiguation rules for part of speech tagging," in Natural Language Processing Using Very Large Corpora, S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. Tzoukermann, and D. Yarowsky, Eds. Norwell, MA: Kluwer, 1997.
[19] P. Brown, J. Cocke, S. D. Pietra, F. Jelinek, R. Mercer, and P. Roossin, "Word sense disambiguation using statistical methods," in Proc. Annu. Conf. Association for Computational Linguistics, 1991, pp. 264–270.
[20] H. Schutze, "Automatic word sense discrimination," Comput. Linguistics, vol. 24, no. 1, pp. 7–124, 1998.
[21] T. Booth and R. Thomson, "Applying probability measures to abstract languages," IEEE Trans. Comput., vol. 22, pp. 442–450, 1973.
[22] C. Brew, "Stochastic HPSG," in Proc. 7th Conf. Eur. Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995, pp. 83–89.
[23] E. Charniak, "Statistical techniques for natural language parsing," AI Mag., vol. 18, no. 4, pp. 33–44, 1997.
[24] L. Araujo, "Evolutionary parsing for a probabilistic context free grammar," in Proc. Int. Conf. Rough Sets and Current Trends in Computing (RSCTC-2000), Lecture Notes in Computer Science, vol. 2005, 2000, pp. 590–597.
[25] W. Gale and K. Church, "A program for aligning sentences in bilingual corpora," Comput. Linguistics, vol. 19, no. 1, pp. 75–102, 1994.
[26] P. Brown, J. Cocke, S. D. Pietra, V. D. Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roosin, "A statistical approach to machine translation," Comput. Linguistics, vol. 16, no. 2, pp. 79–85, 1990.
[27] D. Wu, "Aligning a parallel English-Chinese corpus statistically with lexical criteria," in Proc. Annu. Conf. Association for Computational Linguistics, 1994, pp. 80–87.
[28] A. Berger, P. Brown, S. D. Pietra, V. D. Pietra, J. Lafferty, H. Printz, and L. Ures, "The Candide system for machine translation," in Proc. ARPA Conf. Human Language Technology, 1994, pp. 152–157.
[29] A. Kool. (2000) Literature Survey. [Online]. Available: citeseer.nj.nec.com/kool99literature.html
[30] M. W. Davis and T. Dunning, "Query translation using evolutionary programming for multilingual information retrieval II," in Proc. 5th Annu. Conf. Evolutionary Programming (Evolutionary Programming Society), San Diego, CA, 1996, pp. 103–112.
[31] P. Wyard, "Context free grammar induction using genetic algorithms," in Proc. 4th Int. Conf. Genetic Algorithms, 1991, pp. 514–518.
[32] T. C. Smith and I. H. Witten, "Probability-driven lexical classification: A corpus-based approach," in Proc. Pacific Association for Computational Linguistics Conf., 1995, pp. 271–283.
[33] R. M. Losee, "Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules," Inform. Process. Manage., vol. 32, no. 2, pp. 185–197, 1996.
[34] B. Keller and R. Lutz, "Evolving stochastic context-free grammars from examples using a minimum description length principle," presented at the Int. Workshop Automata Induction, Grammatical Inference and Language Acquisition, ICML'97, Nashville, TN, 1997.
[35] J. Allen, Natural Language Understanding. Redwood City, CA: Benjamin Cummings, 1994.
[36] L. Araujo, "A parallel evolutionary algorithm for stochastic natural language parsing," in Proc. Int. Conf. Parallel Problem Solving from Nature (PPSN VII), Granada, Spain, 2002, pp. 200–209.
[37] E. Cantú-Paz, "A survey of parallel genetic algorithms," Illinois Genetic Algorithms Lab., Chicago, IL, IlliGAL Rep. 97003, 1997.
[38] R. P. M. Nowostawski, "Review and taxonomy of parallel genetic algorithms," School Comput. Sci., Univ. Birmingham, Birmingham, U.K., Tech. Rep. CSRP-99-11, 1999.
[39] E. Cantu-Paz and D. E. Goldberg, "Predicting speedups of idealized bounding cases of parallel genetic algorithms," in Proc. 7th Int. Conf. Genetic Algorithms, T. Back, Ed., 1997, pp. 113–120.
[40] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, "PVM: Parallel Virtual Machine," Oak Ridge Nat. Lab., Oak Ridge, TN, 1994.

Lourdes Araujo received the B.S. degree in physics and the Ph.D. degree in computer science from the Universidad Complutense de Madrid, Madrid, Spain, in 1987 and 1994, respectively. Her thesis was on a parallel implementation of Prolog.

After spending three years at Alcatel, Madrid, Spain, working on the design of communication networks, she joined the Universidad Complutense in 1990, where she is currently an Associate Professor of computer science. She was later involved in constraint programming on parallel computer architectures, where she encountered genetic programming for the first time. Her current research interests include natural language processing, as well as evolutionary algorithms.