SV TT&TV
Syntax trees:
Syntax: describes the structural properties of the language. Natural language is much more
complicated than the formal languages used for artificial languages such as logics and computer
programs.
Formal Languages:
A formal language is a set of strings
Each string is composed of symbols from a set called an alphabet (or a vocabulary)
o Examples of alphabets:
English letters: Σ = {a, b, c . . . z}
Decimal digits: Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Programming language ‘tokens’: Σ = {if, while, x, ==}
‘Primitive’ actions performed by a machine or system, e.g. a vending machine
Σ = {insert50c, pressButton1, ...}
Words in (some fragment of) a natural language: Σ = {the, an, a, dog, cat,
sleeps}
o Examples of strings over alphabets:
Let Σ1 = {0, 1} be an alphabet. Then 01101, 000001, 1101 are strings over Σ1.
— in fact, all binary numbers are strings over Σ1
Let Σ2 = {a, b, c, d, e, f, g} be an alphabet. Then bee, dad, cabbage, and face
are strings over Σ2, as are fffff and agagag.
Let Σ3 = {ba, ca, fa, ce, fe, ge} be an alphabet. Then face is a string over Σ3 but
bee, dad or cabbage are not.
Let Σ4 = {♠, ©, ♣} be an alphabet. Then ♠♠ and ♣©♣ are strings over Σ4.
The length of a string is the number of token symbols from the alphabet it contains
o Examples:
The length of ‘face’ over Σ2 = {a, b, c, d, e, f, g} is 4
The length of ‘face’ over Σ3 = {ba, ca, fa, ce, fe, ge} is 2
The string of length 0 is called the empty string, denoted by ε (epsilon).
Given a string s, a substring of s is a string formed by taking contiguous symbols of s in the
order in which they occur in s.
An initial substring is called a prefix, and a final substring a suffix.
Regular expressions:
Regular expressions are a formal notation for characterizing sets of strings that follow a fairly
simple regular pattern.
We can construct regular expressions over an alphabet Σ as follows (written in mathematical notation):
o ∅, ε, and each symbol a ∈ Σ are regular expressions
o If R1 and R2 are regular expressions, then so are the concatenation R1R2, the union R1|R2, and the Kleene star R1*
Regular expressions and FSAs capture the same class of languages: any language that is the
denotation of some regular expression can be computed by an FSA and vice-versa
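As an illustration of this equivalence (a minimal sketch, not from the lecture; Python's re module is used here as a stand-in for the compiled FSA):

    import re

    # (ab)* denotes the set of strings {ε, "ab", "abab", ...} over Σ = {a, b};
    # re.fullmatch effectively runs the corresponding recognizer over the input
    for s in ["", "ab", "abab", "aba", "ba"]:
        accepted = re.fullmatch(r"(ab)*", s) is not None
        print(repr(s), "accepted" if accepted else "rejected")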
Lecture 1b
Finite State Automata: some remarks:
There is only 1 start state
But there can be several final states
The start state and final state may be the same
Every time we traverse the FSA from the start state to one of the final states, we have a string
of the formal language (a legal string)
If a final state has an outgoing arrow, we may continue traversing, as long as we are able to
reach a final state again later on.
Recognizer: if at some point we can transition to more than one state without being in conflict
with the input string, the FSA is non-deterministic.
Sources of non-determinism:
o States with more than one outgoing arrow with the same label (leading to different
states)
o States with more than one outgoing arrow where some arrow is an ε-transition
For every NFSA, there is an equivalent DFSA, in the sense that both accept exactly the same
set of strings (the same language).
Simplifying somewhat, to build a DFSA from an NFSA we need to create a deterministic path for
each possible path in the NFSA.
The recognition behavior of a DFSA is fully determined by the state it is in and the symbol it is
processing.
NFSAs are not more powerful than DFSAs, but they may be simpler.
An NFSA accepts a string if there is at least one path (one way of consuming the input string)
that leads to an accepting state
o We don’t stop at the first ‘fail’
String recognition with an NFSA can thus be seen as a search problem: the problem of finding
an accepting path
The search space may be explored with two different strategies: depth-first and breadth-first.
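A minimal sketch of this search (the toy automaton below is invented for illustration: it accepts strings over {a, b} ending in ‘ab’):

    from collections import deque

    # Transitions: (state, symbol) -> set of next states; "" marks an ε-transition
    TRANS = {
        (0, "a"): {0, 1},   # non-determinism: two arrows from state 0 labelled a
        (0, "b"): {0},
        (1, "b"): {2},
    }
    START, FINAL = 0, {2}

    def accepts(string):
        """Breadth-first search for at least one accepting path."""
        frontier = deque([(START, 0)])           # (current state, input position)
        seen = set()
        while frontier:
            state, pos = frontier.popleft()      # a stack here would give depth-first
            if (state, pos) in seen:
                continue
            seen.add((state, pos))
            if pos == len(string) and state in FINAL:
                return True                      # one accepting path suffices
            for nxt in TRANS.get((state, ""), ()):
                frontier.append((nxt, pos))      # ε-transition: consume no input
            if pos < len(string):
                for nxt in TRANS.get((state, string[pos]), ()):
                    frontier.append((nxt, pos + 1))
        return False                             # we don't stop at the first 'fail'

    for s in ["ab", "aab", "ba"]:
        print(s, accepts(s))                     # True, True, False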
The lexicon:
A lexicon is a repository of words (= a dictionary)
o What sort of information about words should a lexicon contain?
o How can it be represented efficiently?
Words are made up of morphemes, the smallest meaningful units in a language
An efficient way to define morphological lexicons consists of specifying FSAs that apply to sub-
categories of words.
This accounts for the productivity of morphological processes (if we learn a new verb, we can
guess its inflectional paradigm)
A morphological lexicon lists the lemmas and other morphemes in a language (and their
categories) and specifies the allowable morpheme sequences that make up the words.
Irregularities:
o Not all morphological processes consist of concatenating morphemes:
Foot → feet
Make → makes, making, made
o To account for irregular forms, we require transformations, e.g. for foot → feet:
f:f
o:e
o:e
t:t
o We can define such transformations with a variant of FSAs that maps between two
sets of symbols: finite state transducers (FSTs)
Lemmatization:
Lemmatization is the process of reducing the forms in an inflectional paradigm of a word to
their underlying common lemma:
o Sing, sings, singing, sang, sung → sing
o Walk, walks, walking, walked → walk
Why is it important?
o In information retrieval applications like web search, when we search for keywords we
want to find documents that contain any inflectional variants
We can use FSTs for both regular and irregular forms
Types of transformations:
o x:x identity (no change)
o x:y substitution
o x:ε deletion
o ε:x insertion
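A minimal sketch (representation and state numbering invented for illustration): an FST as a list of transitions labelled with input:output pairs, using the foot → feet transformation from above:

    # Each transition reads in_sym and writes out_sym; "" stands for ε
    TRANS = [
        (0, "f", "f", 1),   # f:f identity
        (1, "o", "e", 2),   # o:e substitution
        (2, "o", "e", 3),   # o:e substitution
        (3, "t", "t", 4),   # t:t identity
    ]
    FINAL = {4}

    def transduce(string, state=0, pos=0, out=""):
        """Depth-first search over paths; collect the output of accepting ones."""
        results = []
        if pos == len(string) and state in FINAL:
            results.append(out)
        for src, in_sym, out_sym, dst in TRANS:
            if src != state:
                continue
            if in_sym == "":                       # ε:x insertion
                results += transduce(string, dst, pos, out + out_sym)
            elif pos < len(string) and string[pos] == in_sym:
                results += transduce(string, dst, pos + 1, out + out_sym)
        return results

    print(transduce("foot"))   # ['feet']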
Stemming:
Stemming has the same goal as lemmatization, but does the reduction with less knowledge,
using heuristic rules:
o Does not rely on morphological lexicon (lemmas are not known)
o Tries to leave only the ‘stem’ (an approximation of a lemma) by stripping off the
endings of the words
-ational → -ate (e.g. relational → relate)
-ing → ε (e.g. motoring → motor)
The stemming rules can also be implemented as FSTs
Stemming can be useful, but can easily lead to mistakes
o National → nate
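A minimal sketch of such heuristic suffix-stripping rules (rule list abridged and partly invented), which also reproduces the national → nate mistake:

    # Ordered rewrite rules for word endings: the first matching rule wins
    RULES = [
        ("ational", "ate"),   # relational -> relate
        ("ing", ""),          # motoring -> motor
    ]

    def stem(word):
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word           # no rule applies: the word is its own stem

    for w in ["relational", "motoring", "national"]:
        print(w, "->", stem(w))   # national -> nate: an over-eager rule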
Morphological Parsing:
To construct well-formed sentences, we need to pay attention to morphological features (e.g.
agreement between subject and verb)
o His foot is broken
o His feet are broken
We would like to map the surface form (the words as they appear in a text) to more
informative representations
Parsing: producing some sort of linguistic structure for an input expression
Lecture 2a
Syntax: from words to sentences:
Syntax deals with the structural properties of sentences
o Not all sequences of words count as sentences of a language
o Speakers have intuitions about what well-formed sentences of a language are, even if
they don’t know what a sentence means
Word classes:
Nouns, verbs, pronouns, prepositions, adverbs etc.
Three criteria for classifying words:
o Distributional criteria: Where can the word occur?
o Morphological criteria: What form does the word have? What affixes can it take?
o Notional (or semantic) criteria: What sort of concept does the word refer to?
Open classes: are typically large, have fluid membership
o Four major word classes are widely found in languages worldwide: nouns, verbs,
adjectives, adverbs
Closed classes: are typically small, have relatively fixed membership
o E.g. determiners (a, the), prepositions (English, Dutch), postpositions (Korean, Hindi)
o Closed-class words (e.g. of, which, could) often play a structural role in the grammar
as function words.
Nouns (zelfstandig naamwoord):
o Notionally: nouns refer to things; living things (cat), places (Amsterdam), nonliving
things (ship), or concepts (marriage)
o Formally: -ness, -tion, -ity, -ance tend to indicate nouns (happiness, preparation,
activity, significance)
o Distributionally: we can examine the contexts in which nouns occur. For example,
nouns can appear with possession: ‘his car’, ‘her idea’ etc.
Verbs (werkwoord):
o Notionally: verbs refer to actions (write, think, observe)
o Formally: words that end in -ate or -ize tend to be verbs; words ending in -ing are
often the present participle of a verb (automate, modernize, sleeping)
o Distributionally: we can examine the contexts where a verb appears. Different types of
verbs have different distributional properties. For example, base form verbs can
appear as infinitives: ‘to jump’, ‘to learn’.
Adjectives (bijvoegelijk naamwoord):
o Notionally: adjectives convey properties of, or opinions about, the things denoted by nouns
(small, sensible, excellent)
o Formally: words that end in -al, -ble and -ous tend to be adjectives (formal, sensible,
generous)
o Distributionally: adjectives appear before a noun or after a form of be. E.g. ‘the big
building’, ‘John is tall’
Adverbs (bijwoord):
o Notionally: adverbs convey properties of actions or events (quickly, often, possibly) or
adjectives (really)
o Formally: words that end in -ly tend to be adverbs
o Distributionally: adverbs can appear next to a verb, or an adjective, or at the start of a
sentence
Importance of criteria:
o Often in reading, we come across unknown words
o Even if we don’t know its meaning, formal and distributional criteria help people (and
machines) recognize which (open) class an unknown word belongs to
Tree structures:
We can represent the phrase structure of a sentence by means of a syntactic tree, e.g.
[S [NP [Det the] [N dog]] [VP [V sleeps]]] for the sentence ‘the dog sleeps’.
Formal Grammars:
The tree structures we have seen can be modelled with phrase structure rules such as
S → NP VP, NP → Det N, and VP → V.
A collection of phrase structure rules of this sort constitutes a formal grammar for a particular
language.
Grammars (like regular expressions, FSAs, FSTs) are a formal device to specify languages
But they are more powerful because they can be used to define languages that cannot be
captured by FSAs.
Formally, a grammar can be specified by 4 parameters:
o Σ: a finite set of terminal symbols
o N: a finite set of non-terminal symbols
o S: a special symbol S ∈ N called the start symbol
o R: a finite set of rules or productions containing:
A sequence of terminal or non-terminal symbols
The symbol →
Another sequence of terminal or non-terminal symbols
Rules have the form: α → β, where α, β ∈ (N ∪ Σ)*
Context-Free Grammar (CFG):
o Σ: a finite set of terminal symbols
o N: a finite set of non-terminal symbols
o S: a special symbol S ∈ N called the start symbol
o R: a set of rules or productions of the form:
X → α, where X ∈ N, α ∈ (N ∪ Σ)*
o Conventions:
Non-terminal symbols are represented with uppercase letters
Terminal symbols are represented with lowercase letters
The left-hand side symbol of the first rule is the start symbol
Terminal symbols: we take these to be words
Non-terminal symbols:
o Phrase types (NP, VP, PP etc.)
o Word classes (Parts-Of-Speech): interface between words and phrases (N, V, P etc.)
Start symbol: S stands for ‘sentence’
Rules:
o Phrase structure rules: rules about phrases and their internal structure
o Lexicon: rules with a POS leading to a word
CF grammars are called context-free because a rule X → α says that X can always be expanded
to α, no matter where the X occurs.
We need some sort of matching between formal features of the constituents in a sentence.
Feature structures:
Replace atomic categories (NP-1p-sg) with feature structures
Two feature structures A and B unify (A ∪ B) if they can be merged into one consistent feature
structure C; otherwise unification fails.
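A minimal sketch of unification (flat feature structures as Python dicts; the representation is invented for illustration):

    def unify(a, b):
        """Merge two flat feature structures; None signals failure."""
        merged = dict(a)
        for feature, value in b.items():
            if feature in merged and merged[feature] != value:
                return None                  # inconsistent values: unification fails
            merged[feature] = value
        return merged

    subj = {"cat": "NP", "per": 1, "num": "sg"}
    verb_agr = {"per": 1, "num": "sg"}
    print(unify(subj, verb_agr))                # consistent: one merged structure
    print(unify({"num": "sg"}, {"num": "pl"}))  # None: sg/pl agreement clash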
Grammars in Prolog:
Definite Clause Grammars (DCGs)
DCGs allow us to enhance a grammar with features by adding extra arguments to the DCG
rules and exploiting Prolog’s matching to enforce agreement
Lecture 2b
Derivations:
The language defined by a grammar is the set of strings composed of terminal symbols that can
be derived from the grammar’s rules.
Each sequence of rules that produces a string of the language is called a derivation.
A string s is ambiguous with respect to a grammar G, if there is more than one possible
derivation that allows G to recognize or generate s.
o The first rule to be applied must begin with the start symbol
o To apply a rule, we ‘rewrite’ the left-hand side symbol as the right-hand side sequence
o The derivation finishes when we end up with terminal symbols
o The resulting string of terminal symbols is a string in the language defined by the
grammar
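A minimal sketch of derivation (toy grammar invented, using words from the lexicon fragment above):

    import random

    # Rules: LHS -> list of possible right-hand sides
    GRAMMAR = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "VP":  [["V"]],
        "Det": [["the"], ["a"]],
        "N":   [["dog"], ["cat"]],
        "V":   [["sleeps"]],
    }

    def derive(symbol="S"):
        """Rewrite non-terminals until only terminal symbols (words) remain."""
        if symbol not in GRAMMAR:                 # terminal: a word of the language
            return [symbol]
        expansion = random.choice(GRAMMAR[symbol])
        return [word for sym in expansion for word in derive(sym)]

    print(" ".join(derive()))   # e.g. "the dog sleeps"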
Can we find regular expressions that are equivalent to the grammars?
Right-Linear Grammars:
For any regular expression or FSA, we can design an equivalent grammar with the following
properties:
o Terminals are the input symbols
o Non-terminals are the states
o For every transition from state X to state Y reading symbol a, we have a production X → a Y
o For every accepting state X, we have a production X → ε
This kind of grammar is called a right-linear or regular grammar
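For example (an illustration, not from the lecture): an FSA for the regular expression ab* — a start state X with a transition reading a to an accepting state Y, which loops on b — yields the right-linear grammar with productions X → a Y, Y → b Y, Y → ε.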
Right-linear grammars are a subset of all possible grammars, and regular languages are a
subset of all possible formal languages
The languages definable by regular grammar are precisely the regular languages
o If the FSA contains a loop (reading y) between an initial stretch reading x and a final
stretch reading z, then regardless of how many times we traverse the loop, the resulting
string will be part of the language (all strings xyⁿz for n ≥ 0)
o The Pumping Lemma:
All these observations are summarized in the Pumping Lemma, so called
because substring y is said to be ‘pumped’
Pumping Lemma: Let L be an infinite regular language. Then the following
‘pumping’ condition applies:
There is a string xyz ∈ L such that y ≠ ε and the string xyⁿz also
belongs to L for any n ≥ 0.
If the pumping condition does not hold, then L is not regular
The Pumping Lemma can only be used to prove that a language is not regular:
if we can show that it is not possible to find a string in L for which the
pumping condition holds, then the language is not regular.
Showing that L satisfies the Pumping Lemma doesn’t prove that L is regular.
Consider L = {aⁿbⁿ | n ≥ 0}: no choice of xyz ∈ L with y ≠ ε can satisfy the
pumping condition (pumping y either unbalances the number of a's and b's or
mixes their order), so L is not regular.
Lecture 3a
Syntactic Ambiguity
A sentence is structurally or syntactically ambiguous with respect to a grammar if the
grammar can assign it more than one parse tree.
Although the most plausible meaning of the sentence is compatible with only one structure,
the grammar can assign it two structures (one ‘wrong’, one ‘right’).
Sometimes, more than one syntactic structure (and their respective associated interpretation)
make sense:
o The tourist saw the astronomer with the telescope
The astronomer was holding the telescope and the tourist saw him
The tourist saw the astronomer while looking through the telescope
We can account for some of the different meanings of a sentence by assigning more than one
possible internal structure to it.
Main types of syntactic ambiguity:
o Attachment ambiguity: one constituent can appear in more than one location in the
parse tree (can be ‘attached’ to more than one phrase):
The tourist saw the astronomer with the telescope
I shot an elephant in my pajamas
The waiter brought the meal (of the day/to the table)
We saw (the Eiffel Tower / the plane) flying to Paris
o Coordination ambiguity: uncertainty about the arguments of a coordinating
conjunction such as and or or:
Secure hardware and software
Secure [hardware] and [software]
[secure hardware] and [software]
A house with a balcony or a garage
A house with [a balcony] or [a garage]
[a house with a balcony] or [a garage]
Probabilistic Grammars:
Ambiguity is pervasive in natural language
Probabilistic grammars offer a way to resolve structural ambiguities
o Main idea: given a sentence, assign a probability to each possible tree and choose the
most probable one
o Compute the probability of a parse tree
o Compute the probability of a grammar rule
Probabilistic CFGs (PCFGs): grammars where each rule is augmented with a probability:
o Σ: a finite alphabet of terminal symbols
o N: a finite set of non-terminal symbols
o S: a special symbol S ∈ N called the start symbol
o R: a finite set of rules, each of the form A → β [p], where
A is a non-terminal symbol
β is any sequence of terminal or non-terminal symbols, including ε
p is a number between 0 and 1 expressing the probability that A will be
expanded to the sequence β, which we can write as P(A → β)
For any non-terminal A, the sum of the probabilities for all rules A → β must
be 1: Σβ P(A → β) = 1
P(A → β) is a conditional probability P(β | A): the probability of observing a β once we have
observed an A.
Example: with rules VP → V NP [0.7] and VP → V [0.3], the probabilities of the two possible
expansions of VP sum to 1.
These probabilities can provide a criterion for disambiguation: they give us a ranking over
possible parses for any sentence. We can simply choose the parse tree with the highest
probability.
We can compute the probabilities of the grammar rules with a treebank. The trees in the
treebank are considered the correct trees, the Gold Standard trees.
For each non-terminal A, we count how often each rule A → β occurs in the treebank, and
divide that by the total number of rules that expand A (the total number of occurrences
of A on the LHS in the treebank):
P(A → β) = Count(A → β) / Count(A)
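A minimal sketch of this estimation (the treebank counts below are invented):

    from collections import Counter

    # Occurrence counts of rules in the Gold Standard trees (assumed numbers)
    rule_counts = Counter({
        ("VP", ("V", "NP")): 30,
        ("VP", ("V",)):      10,
        ("NP", ("Det", "N")): 50,
    })

    # Count(A): total occurrences of A on the LHS in the treebank
    lhs_counts = Counter()
    for (lhs, rhs), n in rule_counts.items():
        lhs_counts[lhs] += n

    # P(A -> β) = Count(A -> β) / Count(A)
    for (lhs, rhs), n in sorted(rule_counts.items()):
        print(f"P({lhs} -> {' '.join(rhs)}) = {n / lhs_counts[lhs]:.2f}")
        # P(NP -> Det N) = 1.00, P(VP -> V NP) = 0.75, P(VP -> V) = 0.25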
Evaluation measures
o Precision: number of correct constituents in the parse tree created by the parser
divided by the total number of constituents in that tree
How many of the hypothesized constituents are correct?
o Recall: number of correct constituents in the parse tree created by the parser divided
by the number of constituents in the gold-standard parse tree
How many of the actual constituents were reproduced correctly?
o F-measure: a score that combines recall and precision as follows: F1 = (2PR)/(P+R)
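A worked example with invented numbers: suppose the parser's tree contains 8 constituents, 6 of which also occur in the gold-standard tree, and the gold-standard tree contains 10 constituents:

    correct, hypothesized, gold = 6, 8, 10   # assumed counts
    P = correct / hypothesized               # precision = 0.75
    R = correct / gold                       # recall = 0.60
    F1 = 2 * P * R / (P + R)                 # F-measure ≈ 0.67
    print(round(P, 2), round(R, 2), round(F1, 2))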
Lecture 3b
Lexicalized PCFGs:
Phrasal rules in PCFG are ‘insensitive’ to word-level information
Solution: add lexical information to phrasal rules to help resolve ambiguities
o Replaces rules like S → NP VP with lexicalized rules like S(ate) → NP(boy) VP(ate)
The search space defined by a grammar is a theoretical space consisting of partial trees
(where some node can still be expanded by a rule), which can be seen as intermediate steps
in the construction of complete trees.
We can follow one of two strategies to explore the space:
o Depth-first: we work vertically
o Breadth-first: we work horizontally
In addition, a parser navigates the search space defined by a grammar following two obvious
constraints:
o A complete tree for a sentence must begin with the start symbol S
o And must have as leaves the words in the sentence
These two constraints give rise to two search strategies:
o Top-down: the parser starts with S, assuming the input is a well-formed sentence
o Bottom-up: the parser starts with the words in the sentence and builds up structure
from there
Parsing Algorithms:
A parser is an algorithm that computes a structure for an input string given a grammar. All
parsers have two fundamental properties:
o Directionality: the sequence in which the structures are constructed
(top-down/bottom-up)
o Search strategy: the order in which the search space of possible analyses is explored
(depth-first/breadth-first)
Three basic parsing algorithms:
o Recursive descent top-down algorithm
A recursive descent parsing algorithm builds a parse tree using a top-down
approach with depth-first search:
Given a parsing goal, the parser tries to prove that the input is such a
constituent by building up structure from the top of the tree down to
the words
It does so by looking at the grammar rules left-to-right
It recursively expands its goals descending in a depth-first fashion
If at some point there is no match, the parser must back up and try a
different alternative
o Parser searches through the trees licensed by the grammar to
find the one that has the required sentence as leaves of the
tree
o Directionality = top-down: it starts from the start symbol of
the grammar and works its way down to the terminals
o Search strategy = depth-first: it expands a given non-terminal
as far as possible before proceeding to the next one.
Shortcomings
o Because it uses depth-first search, some types of recursion (e.g.
left-recursive rules such as NP → NP PP) may send it into an infinite loop
o Like all top-down parsers, it may waste a lot of time
considering words and structures that do not correspond to
the input sentence
Top-down parsers use a grammar to predict what the
input is before inspecting the input at all
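A minimal sketch of a recursive descent parser (grammar and tree representation invented for illustration; the grammar deliberately avoids left recursion, for the reason noted above):

    # Toy grammar: non-terminals map to alternative right-hand sides;
    # anything not in the table is a terminal (a word)
    GRAMMAR = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "VP":  [["V", "NP"], ["V"]],
        "Det": [["the"], ["a"]],
        "N":   [["dog"], ["cat"]],
        "V":   [["sees"], ["sleeps"]],
    }

    def parse(goal, words, pos):
        """Try to prove words[pos:] starts with a `goal` constituent.
        Yields (tree, next_position) for every analysis, depth-first."""
        if goal not in GRAMMAR:                       # terminal: match the word
            if pos < len(words) and words[pos] == goal:
                yield goal, pos + 1
            return
        for rhs in GRAMMAR[goal]:                     # rules tried left-to-right
            def expand(symbols, p):
                if not symbols:                       # whole RHS matched
                    yield [], p
                    return
                for subtree, p2 in parse(symbols[0], words, p):
                    for rest, p3 in expand(symbols[1:], p2):
                        yield [subtree] + rest, p3
            for children, p in expand(rhs, pos):      # backtracks on failure
                yield (goal, children), p

    sentence = "the dog sees a cat".split()
    for tree, end in parse("S", sentence, 0):
        if end == len(sentence):                      # leaves = the input words
            print(tree)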