F15 CS194 Lec 05 Natural Language
Lecture 5
Natural Language Processing
[Figure: word frequency vs. word rank, log-log axes]
N-grams
Because word order is lost, the sentence meaning is weakened. The
following sentence has quite a different meaning but the same BoW
vector:
Sentence: The mat sat on the cat
word ids: 1 14 5 3 1 12
BoW featurization:
Vector: 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
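The featurization above can be sketched in Python. The word-id mapping below is one assignment consistent with the slide's ids ("The mat sat on the cat" → 1 14 5 3 1 12); beyond that it is hypothetical:

```python
from collections import Counter

# Hypothetical word-id mapping, chosen to match the slide's id sequence
word_ids = {"the": 1, "mat": 14, "sat": 5, "on": 3, "cat": 12}

def bow_vector(sentence, vocab_size=14):
    """Count each word id; word order is discarded."""
    counts = Counter(word_ids[w] for w in sentence.lower().split())
    return [counts.get(i, 0) for i in range(1, vocab_size + 1)]

# Both sentences yield the identical vector, illustrating the lost word order:
v1 = bow_vector("The cat sat on the mat")
v2 = bow_vector("The mat sat on the cat")
```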
Unigrams have higher counts and so can detect weak influences,
while bigrams and trigrams capture stronger, more specific
influences.
e.g. “the white house” will generally have a very different influence
from the sum of the influences of “the”, “white”, and “house”.
N-grams as Features
Using n-grams together with BoW improves accuracy over BoW alone for
classification and other text problems. A typical performance curve (e.g.
classifier accuracy) looks like this:
[Figure: typical classifier accuracy as n-gram features are added]
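Extracting combined unigram-and-bigram features can be sketched as follows (the function name is hypothetical; unigram counts are the BoW part):

```python
from collections import Counter

def ngram_features(tokens, n_max=2):
    """Counts of all n-grams up to length n_max over a token list."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[tuple(tokens[i:i + n])] += 1
    return feats

feats = ngram_features("the white house press office".split())
```

Keeping each n-gram as a tuple key lets unigrams and bigrams coexist in one feature dictionary without string-joining ambiguity.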
The context is the set of words that occur near the word, i.e. at
displacements of …,-3,-2,-1,+1,+2,+3,… in each sentence where the
word occurs.
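Extracting such a context window can be sketched as (function name hypothetical):

```python
def context(tokens, index, window=3):
    """Words at displacements -window..-1, +1..+window around tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

toks = "the cat sat on the mat".split()
ctx = context(toks, 2)  # context of "sat"
```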
[Figure: word-vector space showing “Woman” and “Queen”]
Named Entity Recognition
• Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates
• States:
• Restaurant, Place, Company, GPE, Movie, “Outside”
• Usually use Begin/Inside state pairs, e.g. BeginRestaurant, InsideRestaurant.
• Why? So that a new entity can be distinguished from the continuation of the previous one when two entities of the same type are adjacent.
• Emissions: words
• Estimating these probabilities is harder.
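The Begin/Inside scheme can be sketched on the slide's example; the tag assignments below are hand-written for illustration, not output of a trained model:

```python
# Hypothetical Begin/Inside/Outside labeling of the example phrase
tokens = ["Chez", "Panisse", ",", "Berkeley", ",", "CA"]
tags   = ["BeginRestaurant", "InsideRestaurant", "Outside",
          "BeginPlace", "Outside", "BeginGPE"]

def spans(tokens, tags):
    """Group Begin/Inside runs back into labeled entity spans."""
    out, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("Begin"):
            if cur:
                out.append((label, " ".join(cur)))
            cur, label = [tok], tag[len("Begin"):]
        elif tag.startswith("Inside") and cur:
            cur.append(tok)
        else:
            if cur:
                out.append((label, " ".join(cur)))
            cur, label = [], None
    if cur:
        out.append((label, " ".join(cur)))
    return out
```

Because every entity opens with a Begin state, two same-type entities in a row decode into two spans rather than one, which is the point of the Begin/Inside split.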
• Both (hand-written) grammar methods and machine-learning
methods are used.
• Example: http://nlp.stanford.edu/software/CRF-NER.shtml
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node)
• S → NP VP
• S → NP VP PP
• NP → DT NN
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node) “the cat sat on the mat”
• S → NP VP
• S → NP VP PP (the cat) (sat) (on the mat)
• NP → DT NN (the cat), (the mat)
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
English grammars are context-free: the productions do not depend
on any words before or after the production.
(ROOT
(S
(NP (DT the) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat))))))
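The bracketed parse can be represented as nested tuples, and the productions it uses read off recursively. This is a minimal sketch; note that the tree attaches the PP inside the VP, i.e. it uses a rule VP → VBD PP in addition to those listed above:

```python
# The slide's parse as nested (label, children...) tuples
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "sat"),
               ("PP", ("IN", "on"),
                      ("NP", ("DT", "the"), ("NN", "mat")))))

def productions(t):
    """Yield (parent, children) pairs for each node of the tree."""
    label, *kids = t
    if len(kids) == 1 and isinstance(kids[0], str):
        yield (label, (kids[0],))          # preterminal -> word
        return
    yield (label, tuple(k[0] for k in kids))
    for k in kids:
        yield from productions(k)

prods = list(productions(tree))
```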
Grammars
S → NP VP
VP → VB NP SBAR
SBAR → IN S
Recursion in Grammars
“Nero played his lyre while Rome burned”.
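The recursion can be sketched directly: SBAR → IN S re-enters S, so each extra level of depth nests one more clause. This sketch simplifies the innermost VP to VB NP, so it only approximates the example sentence's structure:

```python
def clause(depth):
    """S -> NP VP; VP -> VB NP SBAR; SBAR -> IN S.
    Each unit of depth nests one more sentence via SBAR -> IN S."""
    s = ["NP", "VB", "NP"]
    if depth > 0:
        s += ["IN"] + clause(depth - 1)   # the recursive step
    return s

# depth 1 matches the shape of "Nero played his lyre while Rome burned"
two_clauses = clause(1)
```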
5-min Break
We’re in this room for lecture and lab from now on.
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
PCFGs
Complex sentences can be parsed in many ways, most of which
make no sense or are extremely improbable (like Groucho’s
example).
[Figure: two parse trees for the same sentence, with the NP and VP nodes attached differently]
CKY parsing
For an internal node x of height k, there are k possible pairs of
symbols (B, C) into which x can decompose:
[Figure: node x decomposing into children B and C]
Viterbi Parse
Given the probabilities of cells in the table, we can find the most
probable parse using a generalized Viterbi scan.
i.e. find the most likely symbol at the root, then recursively
recover the left and right children which generated it.
We can do this as part of the scoring process by maintaining a
set of backpointers.
[Figure: node x with backpointers to children B and C]
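A Viterbi-CKY pass with backpointers can be sketched for a toy PCFG in Chomsky normal form. The grammar and its probabilities below are hypothetical (chosen so the slide's example sentence parses), not the lecture's numbers:

```python
from collections import defaultdict

binary = {                        # (left, right) -> [(parent, prob)]
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VBD", "PP"): [("VP", 1.0)],
    ("IN", "NP"): [("PP", 1.0)],
}
lexical = {                       # word -> [(tag, prob)]
    "the": [("DT", 1.0)], "cat": [("NN", 0.5)], "mat": [("NN", 0.5)],
    "sat": [("VBD", 1.0)], "on": [("IN", 1.0)],
}

def viterbi_cky(words):
    n = len(words)
    best = defaultdict(dict)      # best[(i, j)][sym] = max prob over words[i:j]
    back = {}                     # (i, j, sym) -> (split, left_sym, right_sym)
    for i, w in enumerate(words):
        for sym, p in lexical[w]:
            best[(i, i + 1)][sym] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # try every split point
                for lsym, lp in best[(i, k)].items():
                    for rsym, rp in best[(k, j)].items():
                        for sym, p in binary.get((lsym, rsym), []):
                            prob = p * lp * rp
                            if prob > best[(i, j)].get(sym, 0.0):
                                best[(i, j)][sym] = prob
                                back[(i, j, sym)] = (k, lsym, rsym)
    return best, back

def build(words, back, i, j, sym):
    """Follow backpointers to recover the most probable tree."""
    if j - i == 1:
        return (sym, words[i])
    k, lsym, rsym = back[(i, j, sym)]
    return (sym, build(words, back, i, k, lsym), build(words, back, k, j, rsym))

words = "the cat sat on the mat".split()
best, back = viterbi_cky(words)
tree = build(words, back, 0, len(words), "S")
```

Storing one backpointer per (span, symbol) cell is exactly the "scoring plus backpointers" idea: the table fill finds the probabilities, and `build` reads the parse back out without re-searching.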
Systems
• NLTK: Python-based NLP system. Many modules, good
visualization tools, but not quite state-of-the-art performance.
• Examples:
• http://nlp.stanford.edu/software/lex-parser.shtml
• http://www.maltparser.org/
Parser Performance
• MaltParser is widely used for dependency parsing because of its
speed (around 10k words/sec).