Stochastic Definite Clause Grammars

Christian Theil Have


Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University, P.O.Box 260, DK-4000 Roskilde, Denmark
cth@ruc.dk

Abstract

This paper introduces Stochastic Definite Clause Grammars, a stochastic variant of the well-known Definite Clause Grammars. The grammar formalism supports parameter learning from annotated or unannotated corpora and provides a mechanism for parse selection by means of statistical inference. Unlike probabilistic context-free grammars, it is a context-sensitive grammar formalism and it has the ability to model cross-serial dependencies in natural language. SDCG also provides some syntax extensions which make it possible to write more compact grammars and make it straightforward to add lexicalization schemes to a grammar.

1 Introduction and background

We describe a stochastic variant of the well-known Definite Clause Grammars [12], which we call Stochastic Definite Clause Grammars (SDCG).

Definite Clause Grammars (DCG) is a grammar formalism built on top of Prolog, which was developed by Pereira and Warren [12] and was based on the principles of Colmerauer's metamorphosis grammars [6]. The grammars are expressed as rewrite rules which may include logic variables, like normal Prolog rules. DCG exploits Prolog's unification semantics, which assures equality between different instances of the same logic variable. DCG also allows modeling of cross-serial dependencies, which is known to be beyond the capability of context-free grammars [4].

In stochastic grammar formalisms such as probabilistic context-free grammars (PCFG), every rewrite rule has an associated probability.

For a particular sentence, a grammar can produce an exponential number of derivations. In parsing, we are usually only interested in the one derivation which best reflects the intended sentence structure. In stochastic grammars, a statistical inference algorithm can be used to find the most probable derivation, and this is a very successful method for parse disambiguation. This is especially true for variants of PCFGs which condition rule expansions on lexical features. Charniak [3] reports that "a vanilla PCFG will get around 75% precision/recall whereas lexicalized models achieve 87-88% precision/recall". The reason for the impressive precision/recall of stochastic grammars is that the probabilities governing the likelihood of rule expansions are normally derived from corpora using parameter estimation algorithms. Estimation with complete data, where corpus annotations dictate the derivations, can be done by counting the expansions used in the annotations. Estimation with incomplete data can be accomplished using the Expectation-Maximization (EM) algorithm [8].

In stochastic unification grammars, the choice of rules to expand is stochastic, and the values assigned to unification variables are determined implicitly by rule selection. This means that in some derivations, instances of the same logic variable may get different values, and unification will fail as a result.

Some of the first attempts to define stochastic unification grammars did not address the issue of how they should be trained. Brew [2] and Eisele [9] try to address this problem using EM, but their methods have problems handling cases where variables fail to unify. The resulting probability distributions are missing some probability mass, and normalization results in non-optimal distributions. Abney [1] defines a sound theory of unification grammars based on Markov fields and shows how to estimate the parameters of these models using Improved Iterative Scaling (IIS). Abney's proposed solution to the parameter estimation problem depends on sampling and only considers complete data. Riezler [13] describes the Iterative Maximization algorithm, which also works for incomplete data. Finally, Cussens [7] provides an EM algorithm for stochastic logic programs which handles incomplete data and is not dependent on sampling.

SDCG is implemented as a compiler that translates a grammar into a program in the PRISM language. PRISM [16, 19, 15] is an extension of Prolog that allows expression of complex statistical models as logic programs. A PRISM program is a usual Prolog program augmented with random variables. PRISM defines a probability distribution over the possible Herbrand models of a program. It includes efficient implementations of algorithms for parameter learning and probabilistic inference. The execution, or sampling, of a PRISM program is a simulation where values for the random variables are selected stochastically, according to the underlying probability distribution. PRISM programs can have constraints, usually in the form of equality between unified logic variables. Stochastic selection of values for such variables may lead to unification failure, and the resulting failed derivations must be taken into account in parameter estimation. PRISM achieves this using the fgEM algorithm [17, 20, 18], which is an adaptation of Cussens' Failure-Adjusted Maximization algorithm [7]. A central part of Cussens' algorithm is the estimation of the number of times rules are used in failed derivations. PRISM estimates failed derivations using a failure program, derived through a program transformation called First Order Compilation (FOC) [14].
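To give a flavor of the PRISM primitives involved, the following is a minimal sketch, not taken from the paper, using PRISM's values/2 switch declaration and msw/2 sampling primitive; the coin model itself is invented for illustration:

% Declare a random switch 'coin' with two possible outcomes.
values(coin, [heads, tails]).

% Sampling executes msw/2, which binds Side to an outcome drawn
% according to the switch's current probability distribution.
flip(Side) :- msw(coin, Side).

% A goal that fails whenever the sampled outcome is not 'heads';
% such failed derivations are what the fgEM algorithm must
% account for during parameter estimation.
flip_heads :- msw(coin, Side), Side = heads.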

2 Stochastic Definite Clause Grammars

Stochastic Definite Clause Grammars is a stochastic unification-based grammar formalism. The grammar syntax is modeled after, and is compatible with, Definite Clause Grammars. To facilitate writing stochastic grammars in DCG notation, a custom DCG compiler has been implemented. The compiler converts a DCG to a PRISM program, which is a stochastic model of the grammar.

Utilizing the functionality of PRISM, the grammar formalism supports parameter learning from annotated or unannotated corpora and provides a mechanism for parse selection through statistical inference. Parameter learning and inference are performed using PRISM's built-in functionality.

SDCG includes some extensions to the DCG syntax. It includes a compact way of expressing recursion, inspired by regular expressions. It has expansion macros used for writing template rules, which allow compact expression of multiple similar rules. The grammar syntax also adds a new conditioning operator which makes it possible to condition rule expansions on previous expansions.

2.1 Grammar syntax

A grammar consists of grammar rules and possibly some helper Prolog rules and facts. A grammar rule takes the form,

H ==> C1,C2,..,Cn.

H is called the head or left-hand side of the rule and C1,C2,...,Cn is called the body or right-hand side of the rule. The head is composed of a name, followed by an optional parameter list and an optional conditioning clause. It has the form,

name(F1,F2,...,Fn) | V1,V2,...,Vn

The name of the rule is a Prolog atom. The parameter list is a non-empty parenthesized, comma-separated list of features, which may be Prolog variables or atoms. The number of features in a rule is referred to as its arity. The optional conditioning clause starts with the pipe (included) and is a non-empty, comma-separated list of Prolog variables or atoms, or a combination of the two. The conditioning clause may also contain expansion macros in the case of unexpanded rules.

The body of a rule is a comma-separated list of constituents, of which there are four basic types: rule constituents, embedded Prolog code, symbol lists and expansion macros.

Rule constituents are references to other SDCG grammar rules. They have the same format as Prolog goals, but may not be variables. A rule constituent consists of a name, which is a Prolog atom, followed by an optional parenthesized, comma-separated list of features, (F1..Fn). Features are either Prolog atoms or variables. Rule constituents may additionally have prefix regular expression modifiers. The allowed modifiers are * (Kleene star) meaning zero or more occurrences, + meaning one or more occurrences and ? meaning zero or one occurrence.

Embedded code takes the form, { P }, where P is a block of Prolog goals and control structures. The allowed subset of Prolog corresponds to what is allowed in the body of a Prolog rule, but with the restriction that every goal must return a ground answer and may not be a variable. Also, while admitted by the syntax, meta-programming goals like call are not allowed. The goals unify with facts and rules defined outside the embedded Prolog code, but not in other embedded code blocks.

Symbol lists are Prolog lists of either atoms or variables or a combination of the two. The list usually takes the form, [ S1,S2,..,SN ], but the list operator | may also be used. However, it is required that every variable in the list is ground. A symbol list may not be empty.

Expansion macros have the form,

@name(V1,V2,...,Vn)

where name is an atom followed by a non-empty parenthesized, comma-separated list, V1...Vn, consisting of atoms or variables or a combination. A macro has a corresponding goal, name/n, which must be defined.
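To make the four constituent types concrete, here is a hypothetical rule in the style of the examples later in the paper (the adjective rule and the number/2 goal are assumed to be defined elsewhere; the rule is invented for illustration):

term(@number(Word,N)) ==>    % head feature filled in by an expansion macro
    { atom(Word) },          % embedded Prolog code
    ?(adjective),            % rule constituent with a regex modifier
    [ Word ].                % symbol list (a terminal)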
2.2 Procedural semantics

The grammar rules govern the rewriting of the head of a rule into the constituents in the body of the rule. A rule is rewritten when all its constituents have been expanded. The order of the constituents in the body is significant, and they are expanded in a left-to-right manner. The rewriting process always begins with the start rule and progresses in a depth-first manner. A rule constituent in the body of a rule is thus a reference to one or more other rules of the grammar. A grammar rule is said to be matched by a constituent rule if the names and arities are the same and their features unify. A constituent rule is expanded by replacing it with the body of some matching rule. Symbol lists are terminals and are not expanded. Embedded Prolog code is expanded to nothing and executed as a side-effect. The expansion terminates when the body only contains symbols or some constituent cannot be expanded (the derivation fails).

When a constituent matches more than one rule, there might be more than one derivation. The choice of the rule to expand given such a constituent should be seen in the light of the probabilistic inference being performed. In general, we can assume that only the derivations relevant to the probabilistic query being used are expanded.
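The following small trace illustrates the procedural semantics on a hypothetical two-rule grammar (invented for illustration):

s ==> np(Num), vp(Num).
np(sg) ==> [john].
vp(sg) ==> [sleeps].

% Rewriting starts at the start rule s and proceeds left-to-right,
% depth-first:
%   s  =>  np(Num), vp(Num)    expand s
%      =>  [john], vp(sg)      np(Num) matches np(sg); Num unifies with sg
%      =>  [john], [sleeps]    vp(sg) matches; only symbol lists remain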
2.3 Statistical semantics

A rule r ∈ R^{n,a} with the distinct name n and arity a has a probability P(r) ∈ [0, 1] of being expanded in place of a matching rule constituent.

A rule r_i may have a condition (conditioning clause), in which case the probability of its expansion depends on the probability of the condition c_i ∈ C^{n,a} being true, where C^{n,a} is the set of possible values of conditioning clauses for rules in R^{n,a}. Each distinct condition (clause value) has a separate probability, such that

    \sum_{i=1}^{|C^{n,a}|} P(c_i) = 1

We denote by |n, a, c| the number of rules in R^{n,a} satisfying a particular condition c. For the sum of the probabilities of such rules r_i^{n,a} ∈ R^{n,a} it holds that

    \sum_{i=1}^{|n,a,c|} P(r_i^{n,a} \mid c) = 1

where the probability of a rule r given a combination of conditions c is their product, P(r|c) = P(r)P(c). If rules with the same head (R^{n,a}) occur without conditioning (C^{n,a} = ∅), then the condition true is assumed and P(true) = 1.

The probability of a derivation is the product of the probabilities of all rules used in that derivation. The probability of a given sentence is the sum of the probabilities of each possible derivation of the sentence. A derivation may be unsuccessful due to failure of variable unification. The probabilities of all possible derivations, successful and unsuccessful, sum to unity, as expressed by the relation P_{success} = 1 - P_{failure}.
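As a hypothetical worked example (the probabilities are invented for illustration), consider the two unconditioned np rules used as an example in the next subsection, so that C^{np,1} = ∅ and P(true) = 1, with

    P(np_{1,1}) = 0.7, \qquad P(np_{1,2}) = 0.3

If the matching det and noun rules have probabilities 0.9 and 0.5, a derivation d expanding np via the first rule contributes

    P(d) = 0.7 \cdot 0.9 \cdot 0.5 = 0.315

and the probability of the derived sentence is the sum of such products over all of its derivations.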
2.4 The translated SDCG

The compiler behaves similarly to a usual DCG compiler, transforming rules in DCG syntax into Prolog rules with difference lists. In addition to these normal Prolog rules, which we call implementation rules, special selection rules are used to control the stochastic derivation process. All rule heads with the same name and arity in the original DCG grammar are grouped together and managed by one selection rule. The selection rule has the same name and number of features as the original rule, but any ground atoms in the original rule are replaced by variables in the selection rule. Consider the two rules in the example below,

np(Number) ==> det(Number), noun(Number).
np(Number) ==> noun(Number).

The generated selection rule for the two rules is shown below:

np(Number, In, Out) :-
    msw(np(1), RuleIdentifier),
    np_impl(RuleIdentifier, Number, In, Out).

The msw goal is a special PRISM primitive which implements simulation of a random variable; here it stochastically unifies RuleIdentifier with a value, given the name of the random variable. The name of the random variable is assigned according to the name of the nonterminal and its arity. For instance, since np has an arity of 1, the corresponding random variable is named np(1). The possible outcomes of this particular random variable are np_1_1 and np_1_2.

The first parameter of the implementation rules uniquely identifies them, and this name corresponds to an outcome of the random variable used by the selection rule. The implementation rules for the above grammar are shown below:

np_impl(np_1_1, Number, In, Out) :-
    det(Number, In, InOut1),
    noun(Number, InOut1, Out).
np_impl(np_1_2, Number, In, Out) :-
    noun(Number, In, Out).
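Presumably the compiler must also declare the outcome space of the np(1) random variable to PRISM. A minimal sketch of such a declaration, using PRISM's values/2 primitive (the exact generated form is an assumption, not shown in the paper):

% Declares that the switch np(1) ranges over the two rule identifiers.
values(np(1), [np_1_1, np_1_2]).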

2.5 Grammar extensions

Regular expression operators, expansion macros and conditioning clauses, which are extensions of the usual DCG syntax, make it possible to express aspects of the grammar more compactly. These operators are implemented in a preprocessing step which expands the compacted grammar.

2.5.1 Regular expression modifiers

Regular expression operators are a way of expressing recursion in a more convenient manner. An example grammar rule containing all the allowed regular expression operators is shown below:

name ==> ?(title), *(firstname), +(lastname).

The regular expression operators are implemented by generating some additional rules and replacing the original constituent (orig_const), which the operator is applied to, with another constituent (new_const). All regular expression operators can be implemented by generating a subset of the following rules:

1) new_const ==> []
2) new_const ==> orig_const
3) new_const ==> new_const, new_const

The ? operator is implemented by adding rules 1-2, the + operator is implemented by adding rules 2-3 and the * operator is implemented by adding all three rules. The name new_const is symbolic. The compiler uses a naming scheme which avoids conflicting names: the name of the regular expression modifier is prefixed to the constituent name. For instance, *(firstname) becomes sdcg_regex_star_firstname/0. The compiler only adds the implementation rules for the same regular expression once, even if it is used in multiple rules.
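Following the scheme just described, the expansion of *(firstname) should look roughly as follows (a sketch assuming the naming scheme above; the generated rules themselves are not listed in the paper):

% Rules 1-3 instantiated for *(firstname):
sdcg_regex_star_firstname ==> [].
sdcg_regex_star_firstname ==> firstname.
sdcg_regex_star_firstname ==> sdcg_regex_star_firstname, sdcg_regex_star_firstname.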
2.5.2 Expansion macros

Macros are special Prolog goals embedded in grammar rules. They may occur in both the head and the body of rules. Grammar rules with macros are meta grammar rules; they act as templates for the generation of similar rules. The result of macro expansion of a rule is a set of rules, equal in structure to the original rule, but where each macro is replaced with selected parameters from an answer for the goal. The ground input to the goal is omitted by default. It is possible to explicitly configure which parameters of a goal should be inserted using an expand_mode directive. If the goal contains more than one non-ground/answer parameter, the answer parameters are inserted comma-separated. If a rule contains more than one macro, then the set of expanded rules corresponds to the Cartesian product of the answers for all the macros. When several macros in the same rule use the same name for a variable, this works as a constraint on the answers for the macros. This is exactly as if the goals of the macros were constituents in the body of a Prolog rule.

The original motivation for expansion macros was integration of lexical resources. Suppose that we wish to integrate the lexicon defined by the following simple Prolog program,

word(he,sg,masc). word(she,sg,fem).
number(Word,Number) :- word(Word,Number,_).
gender(Word,Gender) :- word(Word,_,Gender).

expand_mode(number(-,+)).
expand_mode(gender(-,+)).

term(@number(Word,N),@gender(Word,G)) ==>
    [ Word ].

The expand_mode directives select the variables which should be inserted in the resulting rules. A minus (-) indicates that the parameter is an input parameter and will not appear in place of the substituted macro, and a plus (+) indicates an output parameter which will appear in place of the macro.

Since the macros in the example share the Word variable, it must unify to the same value for all macros. The result of performing macro substitution on the grammar above is another, macro-free, grammar:

term(sg,fem)==>[she]. term(sg,masc)==>[he].
2.5.3 Conditioning

Conditioning makes it possible to condition an expansion on previous expansions, which is useful for adding lexicalization schemes to the grammar. An example of a rule with a conditioning clause is shown below:

n1(A,B,C) | a,b ==> n2, n3.

The values of the conditioning clause, (a,b), correspond to values of parameters in the head of the rule. This relation is defined by adding a fact, conditioning_mode(n1(+,+,-)), to the grammar. The parameter is a compound term with the same functor as a corresponding nonterminal. The parameters of this term indicate which parameters of the grammar rules named by the functor are subject to conditioning. For instance, the conditioning mode in the above example states that the first two parameters of n1 should be conditioned on (indicated with +), but the last one should not (indicated with -).

In the simple grammar fragment below, we illustrate a simple conditioning scheme, inspired by [5], where we condition on a single headword:

sentence ==>
    np(nohead,NPHead),vp(NPHead,VPHead).
np(ParentHead,Head) | @headword(W) ==>
    det(ParentHead,DetHead),noun(DetHead,Head).
vp(ParentHead,Head) | @headword(W) ==>
    verb(ParentHead,Head).

We have not specified conditioning modes for the rules, but in each case the condition corresponds to the first parameter in the head. Assume that the macro @headword expands to each of the words (terminals) in the grammar. The headword is propagated up from the terminals, so for instance in the sentence rule, the choice of which vp rule to expand depends on the headword propagated from the preceding np. Conditioning a rule on every word implies that the rule, given that word, will have a distinct probability distribution.

More advanced lexicalization schemes can easily be created using the conditioning mechanism. The limitation lies in the order in which the variables conditioned on are unified (and thus in the derivation order). It is not possible to condition on a variable which is not yet ground.
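In terms of the translation of Section 2.4, conditioning presumably means that each condition value selects its own instance of the selection random variable. One plausible compiled form for the conditioned vp rule, parameterizing the msw switch on the condition (this is an assumption about the generated code, not shown in the paper):

vp(ParentHead, Head, In, Out) :-
    msw(vp(2, ParentHead), RuleIdentifier),    % one distribution per headword
    vp_impl(RuleIdentifier, ParentHead, Head, In, Out).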

2.5.4 Syntax extensions example

As an illustrative example which applies all the syntax extensions, we demonstrate a part-of-speech tagger expressed with SDCG. A part-of-speech tagger can be implemented as a stochastic regular grammar/Hidden Markov Model (HMM). An HMM-based POS tagger can be created in SDCG with a single rule,

tag_word(Prev, @tag(Cur), [Cur|Rest])
    | @tag(_) ==>
    @consume_word(W),
    ?(tag_word(Cur,_,Rest)).

This assumes definitions of the words and tags, and a conditioning mode declaration. The grammar rule consumes one word each time it is expanded. Note that there will be separate rules for each word, because of the @consume_word macro, which expands the rule for all the words in the lexicon (enclosing them in square brackets). The next constituent in the body is a recursive reference to the rule itself. It is governed by the regular expression operator ?, which indicates that the constituent may or may not be matched. If it is not matched, the recursion terminates. The model defined by the rule is a fully connected second order HMM model, where the expanded grammar has a rule for each possible transition.
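To make the expansion concrete, one rule produced from this template might be the following hypothetical instance, for current tag noun, condition tag det and the word can (the actual expanded grammar is not listed in the paper):

tag_word(Prev, noun, [noun|Rest]) | det ==>
    [can],
    ?(tag_word(noun, _, Rest)).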
To illustrate the use of the tagger, we consider an example from [3], defined here as a simple Prolog lexicon,

tag(none). tag(det). tag(noun).
tag(verb). tag(modalverb).
word(the). word(can). word(will). word(rust).

We introduce a helper rule to interact with the lexicon and also a start rule,

consume_word([Word]) :- word(Word).
start(TagList) ==> tag_word(none,_,TagList).

To train the grammar we feed it with tagged sentences,

learn([ start([det,noun,modalverb,verb],
        [the,can,will,rust],[]),
    start([det,noun,modalverb,verb],
        [the,can,can,rust],[]),
    start([det,noun,noun], [the,can,rust],[]),
    start([det,noun],[the,rust],[]),
    start([modalverb,noun,verb],
        [will,rust,rust],[]),
    start([noun,modalverb,verb],
        [will,can,rust],[]),
    start([noun,noun],[the,the],[]) ]).

When the grammar/tagger has been trained, we can pose a Viterbi query to find the most likely tag sequence for a sentence,

| ?- viterbig(start(T,[the,can,will,rust],[])).

T = [det,noun,modalverb,verb|_4794] ?
yes
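Given the trained model, the probability of the sentence itself could presumably be computed with PRISM's prob/2 built-in, summing over all derivations (shown here as an assumed usage, not from the paper):

| ?- prob(start(_,[the,can,will,rust],[]), P).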

3 Evaluation

To test the grammar formalism with regard to more realistic grammars, a grammar for a subset of the English language was developed. The grammar consists of about 90 rules, not counting pre-terminal rules, and models various different sentence types. It was originally modeled after the descriptions of context-free grammars for English in [11] and extended with some common agreement features, chosen with the tagset of the Brown corpus in mind.

In a small-scale experiment, the grammar was used to parse 4000 selected sentences from the Brown corpus [10], between 2 and 60 words in length. Parsing was relatively fast, usually less than 100 milliseconds per sentence, excluding the time used to load the grammar and sentences. Training the grammar on the same sentences takes considerably longer, approximately 4 minutes.

Introducing a lexicalization scheme similar to [5] increases the resulting number of random variables and affects both training time and inference time drastically. Some optimizations are needed to work with such lexicalized grammars in more realistic settings. A limitation seems to be the first order compilation process in PRISM, which takes a lot of time and consumes a lot of memory as the grammar grows larger. With recursion, the process may not complete, which has motivated the addition of an option to limit the depth of the derivation tree.

Precision/recall was not measured, as the intention was only to measure the performance of the formalism, not the usefulness of the grammar.
4 Conclusion and future work

We introduced Stochastic Definite Clause Grammars, a new stochastic unification-based grammar formalism syntactically compatible with Definite Clause Grammars. The grammar formalism borrows the expressivity and the ability to model natural language phenomena from DCG, but also enjoys the benefits of statistical models. SDCG extends the DCG syntax with extensions which allow probabilistic grammars to be expressed very compactly. This naturally includes probabilistic regular grammars (such as the demonstrated POS tagger) and probabilistic context-free grammars, but also includes context-sensitive grammars. It was demonstrated that lexicalization schemes can be compactly expressed in the formalism through conditioning and macros.

Some optimizations are needed in order to utilize large grammars (and training sets) for natural languages. Alternative methods for parameter learning may be explored.

Finally, the success of the grammar formalism depends on the applications that use it. SDCG will evolve with the development of applications using it.

References

[1] S. P. Abney. Stochastic attribute-value grammars. Computational Linguistics, 23(4):597–618, 1997.

[2] C. Brew. Stochastic HPSG. In Proceedings of EACL-95, February 1995.

[3] E. Charniak. Statistical techniques for natural language parsing. AI Magazine, 18(4):33–44, 1997.

[4] N. Chomsky. Syntactic Structures. Mouton, The Hague, 1957.

[5] M. J. Collins. Head-driven statistical models for natural language parsing. PhD thesis, University of Pennsylvania, 1999.

[6] A. Colmerauer. Metamorphosis grammars. In L. Bolc, editor, Natural Language Communication with Computers. Springer-Verlag, 1978.

[7] J. Cussens. Parameter estimation in stochastic logic programs. Machine Learning, 44(3):245, 2001.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.

[9] A. Eisele. Towards probabilistic extensions of constraint-based grammars. DYANA-2 deliverable R 1.2 B, 1994.

[10] W. N. Francis and H. Kučera. Brown corpus manual. 1979.

[11] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice Hall, 2000.

[12] F. C. N. Pereira and D. H. D. Warren. Definite clause grammars for language analysis – a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13, 1980.

[13] S. Riezler. Probabilistic constraint logic programming. CoRR, cmp-lg/9711001, 1997.

[14] T. Sato. First order compiler: A deterministic logic program synthesis algorithm. Journal of Symbolic Computation, pages 605–627, 1989.

[15] T. Sato. A glimpse of symbolic-statistical modeling by PRISM. Journal of Intelligent Information Systems, 2008.

[16] T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research (JAIR), 15:391–454, 2001.

[17] T. Sato and Y. Kameya. A dynamic programming approach to parameter learning of generative models with failure. In Proceedings of the ICML Workshop on Statistical Relational Learning and its Connection to the Other Fields (SRL2004), 2004.

[18] T. Sato and Y. Kameya. Learning through failure. In L. D. Raedt, T. Dietterich, L. Getoor, and S. H. Muggleton, editors, Probabilistic, Logical and Relational Learning – Towards a Synthesis, number 05051 in Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2006. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.

[19] T. Sato and Y. Kameya. New advances in logic-based probabilistic modeling by PRISM. In Probabilistic Inductive Logic Programming, LNCS 4911, pages 118–155. Springer, 2008.

[20] T. Sato, Y. Kameya, and N.-F. Zhou. Generative modeling with failure in PRISM. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI2005), pages 847–852, 2005.
