Inducing Deterministic Prolog Parsers From Treebanks: A Machine Learning Approach
Corpus-Based 749
representing the category "preposition."[2] This new predicate has been incorporated to form the rule which may be roughly interpreted as stating: "If the stack contains an NP followed by a preposition, then reduce this pair to a PP." The actual control rule learned for this operator is more complex, but this simple example illustrates the basic process.

[2] Invented predicates actually have system-generated names. They are renamed here for clarity.

Parsing the Treebank

Training a program to do accurate parsing requires large corpora of parsed text for training. Fortunately, such treebanks are being compiled and becoming available. For the current experiments, we have used parsed text from a preliminary version of the Penn Treebank (Marcus et al., 1993). One complication in using this data is that sentences are parsed only to the "phrase level", leaving the internal structure of NPs unanalyzed and allowing arbitrary-arity constituents. Rather than forcing the parser to learn reductions for arbitrary-length constituents, CHILL was restricted to learning binary-branching structures. This simplifies the parser and allows for a more direct comparison to previous bracketing experiments (e.g. (Brill, 1993; Pereira and Schabes, 1992)) which use binary bracketings.

Making the treebank analyses compatible with the binary parser required "completion" of the parses into binary-branching structures. This "binarization" was accomplished automatically by introducing special internal nodes in a right-linear fashion. For example, the noun phrase np:[the,big,orange,cat] would be binarized to create: np:[the,int(np):[big,int(np):[orange,cat]]]. The special labeling (int(np) for noun phrases, int(s) for sentences, etc.) permits restoration of the original structure by merging internal nodes. Using this technique, the resulting parses can be compared directly with treebank parses. All of the experiments reported below were done with automatically binarized training examples; control rules for the artificial internal nodes were learned in exactly the same way as for the original constituents.

Experimental Results

The Data

The purpose of our experiments was to investigate whether the mechanisms in CHILL are sufficiently robust for application to real-world parsing problems. There are two facets to this question: the first is whether the parsers learned by CHILL generalize well to new text. An additional issue is whether the induction mechanism can handle the large numbers of examples that would be necessary to achieve adequate performance on relatively large corpora.

We selected as our test corpus a portion of the ATIS dataset from a preliminary version of the Penn Treebank (specifically, the sentences in the file ti-tb). We chose this particular data because it represents realistic input from human-computer interaction, and because it has been used in a number of other studies on automated grammar acquisition (Brill, 1993; Pereira and Schabes, 1992) that can serve as a basis for comparison to CHILL.

Experiments were actually carried out on four different variations of the corpus. A subset of the corpus comprising sentences of length less than 13 words was used to form a more tractable corpus for systematic evaluation and to test the effect of sentence length on performance. The entire corpus contained 729 sentences with an average length of 10.3 words. The restricted set contains 536 sentences averaging 7.9 words in length. A second dimension of variation is the form of the input sentences and analyses. Since CHILL has the ability to create its own categories, it can use untagged parse trees. In order to test the advantage gained by tagging, we also ran experiments using lexical tags instead of words on both the full and restricted corpus.

Experimental Method

Training and testing followed the standard paradigm of first choosing a random set of test examples and then creating parsers using increasingly larger subsets of the remaining examples. The performance of these parsers was then determined by parsing the test examples. Obviously, the most stringent measure of accuracy is the proportion of test sentences for which the produced parse tree exactly matches the treebanked parse for the sentence. Sometimes, however, a parse can be useful even if it is not perfectly accurate; the treebank itself is not entirely consistent in the handling of various structures.

To better gauge the partial accuracy of the parser, we adopted a procedure for returning and scoring partial parses. If the parser runs into a "dead-end" while parsing a test sentence, the contents of the stack at the time of impasse is returned as a single, flat constituent labeled S. Since the parsing operators are ordered and the shift operator is invariably the most frequently used operator in the training set, shift serves as a sort of default when no reduction action applies. Therefore, at the time of impasse, all of the words of the sentence will be on the stack, and partial constituents will have been built. The contents of the stack reflect the partial progress of the parser in finding constituents.

Partial scoring of trees is computed by determining the extent of overlap between the computed parse and the correct parse as recorded in the treebank. Two constituents are said to match if they span exactly the same words in the sentence. If constituents match and have the same label, then they are identical. The overlap between the computed parse and the correct parse is computed by trying to match each constituent of the computed parse with some
constituent in the correct parse. If an identical constituent is found, the score is 1.0; a matching constituent with an incorrect label scores 0.5. The sum of the scores for all constituents is the overlap score for the parse. The accuracy of the parse is then computed as Accuracy = (O/Found + O/Correct)/2, where O is the overlap score, Found is the number of constituents in the computed parse, and Correct is the number of constituents in the correct tree. The result is an average of the proportion of the computed parse that is correct and the proportion of the correct parse that was actually found.

Another accuracy measure, which has been used in evaluating systems that bracket the input sentence into unlabeled constituents, is the proportion of constituents in the parse that do not cross any constituent boundaries in the correct tree (Black, 1991). Of course, this measure only allows for direct comparison of systems that generate binary-branching parse trees.[3] By binarizing the output of the parser in a manner analogous to that described above, we can compute the number of sentences with parses containing no crossing constituents, as well as the proportion of constituents which are non-crossing over all test sentences. This gives a basis of comparison with previous bracketing results, although it should be emphasized that CHILL is designed for the harder task of actually producing labeled parses, and is not directly optimized for the bracketing task.

[3] A tree containing a single, flat constituent covering the entire sentence always produces a perfect (non)crossing score.

Results

The results of these experiments are summarized in Tables 1 and 2. The figures for the restricted-length corpus in Table 1 reflect averages of three trials, while the results on the full corpus are averaged over two trials. The first column shows the size of the training set from which the parsers were derived, while the remaining columns present results for each of the four metrics outlined above. Correct is the percentage of test sentences with parses that matched the treebanked parse exactly. Partial is partial correctness using the overlap metric. The remaining columns reflect measures based on re-binarizing the parser output. O-Cross is the proportion of test sentences having no constituents that cross constituents in the correct parsing. The remaining column reports the percentage of (binarized) constituents that are consistent with the treebank (i.e. cross no constituents in the correct parse).

The results for the restricted corpus in Table 1 are encouraging. While we know of no other results for parsing accuracy of automatically constructed parsers on this corpus, the figures of 33% completely correct using the tagged input and 17% on the raw text seem quite good for a relatively modest training set of 300 sentences. The figures for O-Cross and Crossing % are about the same as those reported in studies of automated bracketing for the unrestricted ATIS corpus (Brill (1993) reports 60% and 91.12%, respectively). However, our bracketing results for the unrestricted corpus are not as good.

A comparison of Tables 1a and 1b shows that considerable advantage is gained by using word-class tags, rather than the actual words. This is to be expected, as tagging significantly reduces the variety in the input. The results for raw text use no special mechanism for handling previously unseen words occurring in the testing examples. Achieving 70% (partial) accuracy under these conditions seems quite good. Statistical approaches relying on n-grams or probabilistic context-free grammars would have difficulty due to the large number of terminal symbols (around 400) appearing in the modest-sized training corpus. The data for lexical selection would be too sparse to adequately train the pre-defined models. Likewise, the transformational approach of (Brill, 1993) is limited to bracketing strings of lexical classes, not words. A major advantage of our approach is the ability of the learning mechanism to automatically construct and attend to just those features of the input that are most useful in guiding parsing.

It should also be noted that the system created new categories in both situations. In the raw-text experiments, CHILL regularly created categories for preposition, verb, form-of-to-be, etc. With tagged input, various tags were grouped into classes such as the verb and noun forms. In both cases, the system also formed numerous categories and relations that seemed to defy any simple linguistic explanation. Nevertheless, these categories were helpful in parsing new text. These results support our view that any
Size   Correct   Partial   O-Cross   Crossing %
 50      6.2      54.0      33.1      68.9
100      4.7      54.9      41.2      72.0
150      8.5      59.6      39.7      71.5
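The Partial column above is the overlap-based accuracy defined in the Experimental Method section. As a minimal sketch of that computation (CHILL itself is implemented in Prolog; the span-tuple representation and the function names here are illustrative assumptions, not the paper's code):

```python
def overlap_score(computed, correct):
    """Overlap score: each computed constituent identical to a correct
    one (same span, same label) scores 1.0; a constituent matching a
    correct span but with a different label scores 0.5.
    Constituents are (label, start, end) tuples over word positions."""
    spans = {(start, end): label for label, start, end in correct}
    score = 0.0
    for label, start, end in computed:
        if (start, end) in spans:
            score += 1.0 if spans[(start, end)] == label else 0.5
    return score

def accuracy(computed, correct):
    """Accuracy = (O/Found + O/Correct)/2: the average of the proportion
    of the computed parse that is correct and the proportion of the
    correct parse that was actually found."""
    o = overlap_score(computed, correct)
    return (o / len(computed) + o / len(correct)) / 2
```

For instance, a computed parse whose spans all match the correct parse but which mislabels one constituent receives 0.5 credit for that constituent, giving the partial credit described above.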
practical acquisition system should be able to create its own categories, as it is unlikely that independently-crafted feature systems will capture all of the nuances necessary to do accurate parsing in a reasonably complex domain.

Table 2 shows results for the full corpus. As one might expect, the results are not as good as for the restricted set. There are a number of factors that could lead to diminishing performance as a function of increasing sentence length. One explanation might be that the longer sentences are simply more complicated and, thus, harder to parse accurately. If the difficulty is inherent in the sentences, the only solution is larger training sets.

Another possible problem is the compounding of errors. If an operator is chosen incorrectly early in the parse, it might lead the parser into states that have not been encountered in training, leading to subsequent errors in the application of other operators. This factor might be mitigated by developing more robust training procedures. By providing control examples from erroneous states as well as correct ones, the parser might be trained to be somewhat self-correcting, choosing correct operators later on even in the face of previous errors.

A third possibility is that the additional sentence length is "swamping" the induction algorithm. Increasing the average sentence length significantly increases the number of control examples that must be handled by the induction mechanism. Training sizes of several hundred sentences give rise to induction over thousands of control examples. Additionally, longer sentences lend themselves to more conflicting analyses and may increase the amount of noise in the control data, making it more difficult to spot useful generalizations. Additional progress here would require further improvement in the efficiency and noise-handling capabilities of the induction algorithm.

Clearly, further experimentation is needed to pin down where the most improvement can be made. The results so far do indicate that the approach has potential. The current Prolog implementation running on a SPARC 2 was able to induce parsers from several hundred sentences in a few hours, producing over 700 lines of Prolog code. One trial was run using 400 tagged sentences from the full corpus; the resulting parser achieved 29.4% absolute accuracy and a partial scoring of 82%. Further improvements in efficiency may make it feasible to produce parsers from thousands of training sentences.

Related Work

As mentioned above, most recent work on automatically constructing parsers from corpora has focused on acquiring stochastic grammars rather than symbolic parsers. When learning in this framework, "one simply gathers statistics" to set the parameters of a pre-defined model (Charniak, 1993). However, there is a long tradition of research in AI and Machine Learning suggesting the utility of techniques that extract underlying structural models from the data. Earlier work in learning symbolic parsers (Anderson, 1977; Berwick, 1985) used fairly weak learning methods specific to language acquisition and was not tested on real corpora. CHILL represents the first serious application of modern machine-learning methods to acquiring parsers from corpora.

Brill (1993) presents a technique for acquiring parsers that produce binary-branching syntax trees with unlabeled nonterminals. The technique, utilizing structural transformation rules based on lexical category information, has proven quite successful on real corpora. CHILL, which creates fully labeled parses, has a more general learning mechanism allowing it to make distinctions based on more subtle structural and lexical cues (e.g. creating semantic word classes for resolving attachment).

Our framework for learning deterministic, context-dependent parsers is very similar to that of (Simmons and Yu, 1992); however, there are two advantages of our ILP method compared to their exemplar-matching method. First, ILP methods can handle unbounded, structured data, so the context does not need to be fixed to a limited window of the stack and the remaining sentence. The entire stack and remaining sentence are available as potential context for deciding which parsing operator to apply at each step. Second, the system is capable of creating its own syntactic and semantic word and phrase classes instead of relying on the user to provide part-of-speech tagging.

CHILL's ability to invent new classes of words and phrases specifically for resolving ambiguities such as prepositional-phrase attachment makes it particularly interesting. There is some existing work on learning lexical classes from corpora (Schütze, 1992); however, the classes are based on word co-occurrence rather than
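As a closing illustration, the right-linear binarization and the inverse merging of int(...) nodes described under "Parsing the Treebank" can be sketched as follows (the paper's implementation is in Prolog; this Python rendering of the tree representation and the function names are illustrative assumptions):

```python
def binarize(label, children, base=None):
    """Right-linear binarization: a constituent with more than two
    children is rebuilt as nested binary nodes labeled int(<base>),
    mirroring np:[the,big,orange,cat] ->
    np:[the,int(np):[big,int(np):[orange,cat]]]."""
    base = base or label
    if len(children) <= 2:
        return (label, children)
    head, *rest = children
    return (label, [head, binarize("int(%s)" % base, rest, base)])

def unbinarize(tree):
    """Restore the original flat constituent by merging the special
    int(...) internal nodes back into their parent."""
    if not isinstance(tree, tuple):
        return tree  # a bare word
    label, children = tree
    flat = []
    for child in map(unbinarize, children):
        if isinstance(child, tuple) and child[0].startswith("int("):
            flat.extend(child[1])
        else:
            flat.append(child)
    return (label, flat)
```

Because unbinarize exactly inverts binarize, parses produced over the binarized training examples can be compared directly with the original treebank analyses, as the text describes.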