
Introduction to Data Science

Lecture 5
Natural Language Processing

CS 194 Fall 2015


John Canny
with some slides from David Hall
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Natural Language Processing (NLP)
• In a recent survey (KDNuggets blog) of data scientists, 62%
reported working “mostly or entirely” with data about people.
Much of this data is text.

• In the suggested CS194-16 projects (a random sample of data


science projects around campus), nearly half involve natural
language text processing.

• NLP is a central part of mining large datasets.


Natural Language Processing
Some basic terms:
• Syntax: the allowable structures in the language: sentences,
phrases, affixes (-ing, -ed, -ment, etc.).
• Semantics: the meaning(s) of texts in the language.
• Part-of-Speech (POS): the category of a word (noun, verb,
preposition etc.).
• Bag-of-words (BoW): a featurization that uses a vector of word
counts (or binary) ignoring order.
• N-gram: for a fixed, small N (2-5 is common), an n-gram is a
consecutive sequence of words in a text.
Bag of words Featurization
Assuming we have a dictionary mapping words to a unique integer
id, a bag-of-words featurization of a sentence could look like this:
Sentence: The cat sat on the mat
word id’s: 1 12 5 3 1 14
The BoW featurization would be the vector:
Vector: 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
(nonzero counts at positions 1, 3, 5, 12, 14)
In practice this would be stored as a sparse vector of (id, count)s:
(1,2),(3,1),(5,1),(12,1),(14,1)
Note that the original word order is lost, replaced by the order of
id’s.
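As a concrete illustration, here is a minimal Python sketch of this featurization. The small word-to-id dictionary and the helper function are hypothetical; in practice the dictionary is built by scanning the corpus:

from collections import Counter

# Hypothetical word-to-id dictionary; normally built from the corpus.
word_ids = {"the": 1, "on": 3, "sat": 5, "cat": 12, "mat": 14}

def bow_sparse(sentence):
    # Return the bag-of-words featurization as a sorted list of (id, count) pairs.
    tokens = sentence.lower().split()
    counts = Counter(word_ids[t] for t in tokens if t in word_ids)
    return sorted(counts.items())

print(bow_sparse("The cat sat on the mat"))
# prints [(1, 2), (3, 1), (5, 1), (12, 1), (14, 1)]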
Bag of words Featurization
• BoW featurization is probably the most common
representation of text for machine learning.

• Most algorithms expect numerical vector inputs and BoW


provides this.

• These include kNN, Naïve Bayes, Logistic Regression, k-


Means clustering, topic models, collaborative filtering
(using text input).
Bag of words Featurization
• One challenge is the size of the representation. Since words
follow a power-law distribution, vocabulary size grows with
corpus size at almost a linear rate.

• Since rare words aren't observed often enough to inform the
model, it's common to discard infrequent words, e.g. those
occurring fewer than 5 times.

• One can also use the feature selection methods mentioned
last time: mutual information and chi-squared tests. But
these often have false positive problems and are less reliable
than frequency culling on power-law data.
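A minimal sketch of frequency culling, assuming the corpus is already tokenized (the function name and the threshold of 5 are illustrative):

from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    # Count all tokens, keep only those seen at least min_count times,
    # and map the surviving words to integer ids.
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = sorted(w for w, c in freq.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}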
Bag of words Featurization
• The feature selection methods mentioned last time: mutual
information and chi-squared tests, often have false positive
problems on power-law data.
• Frequency-based selection usually works better.
[Figure: log-log plot of word frequency vs. word rank. Each successive block of ranks has 10x more features but 10x fewer observations of each; each rare feature is a potential false positive, and the false-positive rate is difficult to control for rare features.]
N-grams
Because word order is lost, the sentence meaning is weakened.
This sentence has quite a different meaning but the same BoW
vector:
Sentence: The mat sat on the cat
word id’s: 1 14 5 3 1 12
BoW featurization:
Vector 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1

But word order is important, especially the order of nearby words.

N-grams capture this, by modeling tuples of consecutive words.


N-grams
Sentence: The cat sat on the mat
2-grams: the-cat, cat-sat, sat-on, on-the, the-mat
Notice how even these short n-grams “make sense” as linguistic
units. For the other sentence we would have different features:
Sentence: The mat sat on the cat
2-grams: the-mat, mat-sat, sat-on, on-the, the-cat
We can go still further and construct 3-grams:
Sentence: The cat sat on the mat
3-grams: the-cat-sat, cat-sat-on, sat-on-the, on-the-mat
Which capture still more of the meaning:
Sentence: The mat sat on the cat
3-grams: the-mat-sat, mat-sat-on, sat-on-the, on-the-cat
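Generating these features is straightforward; a small sketch (hyphen-joining the words is just one convenient way to name the features):

def ngrams(tokens, n):
    # Return the list of n-grams of a token list, joined with hyphens.
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # ['the-cat', 'cat-sat', 'sat-on', 'on-the', 'the-mat']
print(ngrams(tokens, 3))  # ['the-cat-sat', 'cat-sat-on', 'sat-on-the', 'on-the-mat']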
N-grams Features
Typically, it's advantageous to use multiple n-gram features in
machine-learning models for text, e.g.

unigrams + bigrams (2-grams) + trigrams (3-grams).

The unigrams have higher counts and are able to detect influences
that are weak, while bigrams and trigrams capture strong
influences that are more specific.

e.g. “the white house” will generally have very different influences
from the sum of influences of “the”, “white”, “house”.
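In practice the combined featurization is usually built with an off-the-shelf tool; for example, scikit-learn's CountVectorizer can produce unigram + bigram + trigram counts directly. The toy corpus below is illustrative; on real data you would also set a min_df threshold, as discussed earlier:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the mat sat on the cat"]  # toy documents

# ngram_range=(1, 3) gives unigram, bigram and trigram counts in one sparse matrix.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (2 documents, number of distinct 1-, 2- and 3-grams)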
N-grams as Features
Using n-grams + BoW improves accuracy over BoW alone for
classification and other text problems. A typical performance
ordering (e.g. classifier accuracy) looks like this:

3-grams < 2-grams < 1-grams < 1+2-grams < 1+2+3-grams

A few percent improvement is typical for text classification.


N-grams size
N-grams pose some challenges in feature set size.
If the original vocabulary size is |V|, the number of possible
2-grams is |V|², while for 3-grams it is |V|³.

Luckily, natural-language n-grams (including single words) have a
power-law frequency structure: most of the n-gram occurrences you
see come from a relatively small set of common n-grams. A dictionary
that contains the most common n-grams will therefore cover most of
the n-grams you see.
Power laws for N-grams
N-grams also follow a power law distribution:
N-grams size
Because of this you may see values like this:

• Unigram dictionary size: 40,000


• Bigram dictionary size: 100,000
• Trigram dictionary size: 300,000

With coverage of > 80% of the features occurring in the text.


N-gram Language Models
N-grams can be used to build statistical models of texts.

When this is done, they are called n-gram language models.

An n-gram language model associates a probability with each n-


gram, such that the sum over all n-grams (for fixed n) is 1.

You can then determine the overall likelihood of a particular
sentence:

“The cat sat on the mat”

is much more likely than

“The mat sat on the cat”
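A minimal sketch of a bigram (2-gram) language model, estimated by maximum likelihood over tokenized sentences. There is no smoothing, so unseen bigrams get probability 0; the sentence markers <s> and </s> are a common convention:

from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    # Estimate P(w2 | w1) from tokenized sentences by maximum likelihood.
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return {w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
            for w1, ctr in counts.items()}

def sentence_prob(model, sent):
    # Product of bigram probabilities; 0.0 if any bigram was never seen.
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= model.get(w1, {}).get(w2, 0.0)
    return p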
Skip-grams
We can also analyze the meaning of a particular word by looking at
the contexts in which it occurs.

The context is the set of words that occur near the word, i.e. at
displacements of …,-3,-2,-1,+1,+2,+3,… in each sentence where the
word occurs.

A skip-gram is a set of non-consecutive words (with specified
offsets) that occur in some sentence.

We can construct a BoSG (bag of skip-gram) representation for


each word from the skip-gram table.
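A sketch of extracting (word, context word, offset) triples, from which a bag-of-skip-gram representation per word can be accumulated (the window size is an arbitrary choice):

def context_pairs(tokens, window=3):
    # Yield (word, context_word, offset) triples for offsets -window..+window, skipping 0.
    for i, w in enumerate(tokens):
        for off in range(-window, window + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens):
                yield (w, tokens[j], off)

for triple in context_pairs("the cat sat on the mat".split(), window=2):
    print(triple)   # e.g. ('sat', 'cat', -1), ('sat', 'on', 1), ...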
Skip-grams
Then with a suitable embedding (DNN or linear projection) of the
skip-gram features, we find that word meaning has an algebraic
structure:
[Figure: word vectors for Man, Woman, King and Queen, with parallel offsets for gender and royalty.]

Man + (King – Man) + (Woman – Man) = Queen

Tomáš Mikolov et al. (2013). "Efficient Estimation of Word Representations in


Vector Space"
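The word2vec family of models learns such embeddings directly from skip-gram contexts. A short sketch using the gensim library (the toy corpus and all hyperparameters are placeholders; parameter names follow recent gensim versions):

from gensim.models import Word2Vec

# Toy corpus of tokenized sentences, for illustration only.
sentences = [["the", "king", "spoke", "to", "the", "queen"],
             ["the", "man", "spoke", "to", "the", "woman"]]

# sg=1 selects the skip-gram training objective.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# king - man + woman: on a large corpus the top result is "queen";
# on this toy corpus the answer is of course noise.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))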
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Parts of Speech
Thrax’s original list (c. 100 B.C.):
• Noun
• Verb
• Pronoun
• Preposition
• Adverb
• Conjunction
• Participle
• Article
Parts of Speech
Thrax’s original list (c. 100 B.C.):
• Noun (boat, plane, Obama)
• Verb (goes, spun, hunted)
• Pronoun (She, Her)
• Preposition (in, on)
• Adverb (quietly, then)
• Conjunction (and, but)
• Participle (eaten, running)
• Article (the, a)
Parts of Speech (Penn Treebank 2014)
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
POS tags
• Morphology: word forms like “liked,” “follows,” “poke” hint at the part of speech.

• Context: “can” can be a noun, a modal, or a verb:
• “trash can,” “can do,” “can it”

• Therefore taggers should look at the neighborhood of a word.


Constraint-Based Tagging
Based on a table of words with morphological and context
features:
POS taggers
• Taggers typically use HMMs (Hidden Markov Models) or
MaxEnt (Maximum Entropy) sequence models, with the latter
giving state-of-the-art performance.

• Accuracy on test data is around 97%.

• A representative is the Stanford POS tagger:


http://nlp.stanford.edu/software/tagger.shtml

• Tagging is fast: around 300k words/second is typical
(Speedread); the Stanford tagger is about 10x slower.
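For example, NLTK ships a pretrained tagger that produces Penn Treebank tags. A quick sketch (the downloads are one-time; exact tags can vary slightly across model versions):

import nltk
nltk.download("punkt")                        # tokenizer model (one-time)
nltk.download("averaged_perceptron_tagger")   # tagger model (one-time)

tokens = nltk.word_tokenize("The cat sat on the mat")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]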
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Named Entity Recognition

• People, places, (specific) things


• “Chez Panisse”: Restaurant
• “Berkeley, CA”: Place
• “Churchill-Brenneis Orchards”: Company(?)
• Not “Page” or “Medjool”: part of a non-rigid designator

Named Entity Recognition

• Why a sequence model?


• “Page” can be “Larry Page” or “Page mandarins” or
just a page.
• “Berkeley” can be a person.

Named Entity Recognition
• Chez Panisse, Berkeley, CA - A bowl of Churchill-
Brenneis Orchards Page mandarins and Medjool dates

• States:
• Restaurant, Place, Company, GPE, Movie, “Outside”
• Usually use BeginRestaurant, InsideRestaurant.
• Why?
• Emissions: words
• Estimating these probabilities is harder.

Named Entity Recognition
• Both (hand-written) grammar methods and machine-learning
methods are used.

• The learning methods use sequence models: HMMs or CRFs
(Conditional Random Fields), with the latter giving state-of-the-art
performance.

• State-of-the-art on standard test data = 93.4%, human


performance = 97%.

• Speed around 200k words/sec (Speedread).

• Example: http://nlp.stanford.edu/software/CRF-NER.shtml
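NLTK also provides a simple pretrained NE chunker that runs on top of its POS tagger. A quick sketch (entity labels and accuracy depend on the bundled model; a CRF tagger such as Stanford's will generally do better):

import nltk
# Requires the 'maxent_ne_chunker' and 'words' resources (via nltk.download) in addition to the tagger.
sentence = "Chez Panisse is a restaurant in Berkeley founded by Alice Waters"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees():
    if subtree.label() != "S":
        # Print each recognized entity with its predicted type.
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))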
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node)
• S → NP VP
• S → NP VP PP
• NP → DT NN
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node) “the cat sat on the mat”
• S → NP VP
• S → NP VP PP (the cat) (sat) (on the mat)
• NP → DT NN (the cat), (the mat)
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
English Grammars are context-free: the productions do not depend
on any words before or after the production.

The reconstruction of a sequence of grammar productions from a


sentence is called “parsing” the sentence.

It is most conveniently represented as a tree:


Parse Trees
“The cat sat on the mat”
Parse Trees
In bracket notation:

(ROOT
(S
(NP (DT the) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat))))))
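NLTK can represent such a grammar and recover this bracketing directly. A small sketch (the grammar follows the slides, with VP -> VBD PP so that it yields the tree shown above):

import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> DT NN
  VP  -> VBD PP
  PP  -> IN NP
  DT  -> 'the'
  NN  -> 'cat' | 'mat'
  VBD -> 'sat'
  IN  -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree)   # prints the bracketed parse shown above (minus the ROOT wrapper)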
Grammars

There are typically multiple ways to produce the same sentence.


Consider the statement by Groucho Marx:

“While I was in Africa, I shot an elephant in my pajamas”


“How he got into my pajamas, I don’t know”
Parse Trees
“… I shot an elephant in my pajamas” (what people hear first)
Parse Trees
Groucho’s version
Grammars
Recursion is common in grammar rules, e.g.
NP → NP RC
Because of this, sentences of arbitrary length are possible.
Recursion in Grammars
“Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”.
Grammars

It's also possible to have “sentences” inside other sentences…

S → NP VP
VP → VB NP SBAR
SBAR → IN S
Recursion in Grammars
“Nero played his lyre while Rome burned”.
5-min Break

Please talk to us if you haven’t finalized your group!

We will get you your data in the next day or two.

We’re in this room for lecture and lab from now on.
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
PCFGs
Complex sentences can be parsed in many ways, most of which
make no sense or are extremely improbable (like Groucho’s
example).

Probabilistic Context-Free Grammars (PCFGs) associate and learn


probabilities for each rule:
S → NP VP 0.3
S → NP VP PP 0.7

The parser then tries to find the most likely sequence of


productions that generate the given sentence. This adds more
realistic “world knowledge” and generally gives much better results.
Most state-of-the-art parsers these days use PCFGs.
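NLTK also supports PCFGs and exact most-likely-parse search. A toy sketch (the probabilities are made up; the rule probabilities for each left-hand side must sum to 1):

import nltk

pcfg = nltk.PCFG.fromstring("""
  S   -> NP VP      [1.0]
  NP  -> DT NN      [1.0]
  VP  -> VBD PP     [0.7]
  VP  -> VBD        [0.3]
  PP  -> IN NP      [1.0]
  DT  -> 'the'      [1.0]
  NN  -> 'cat' [0.5] | 'mat' [0.5]
  VBD -> 'sat'      [1.0]
  IN  -> 'on'       [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree.prob(), tree)   # probability of the best parse, then its bracketing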
CKY parsing
A CKY (Cocke-Younger-Kasami) table is a dynamic programming
parser (most parsers use some form of this):

[Figure: triangular CKY chart over “John hit the ball”, with lexical cells N(John), V(hit), D(the), N(ball) along the bottom and NP and VP in higher cells.]
CKY parsing
Not every internal node encodes a symbol (there are N-1
internal nodes in the tree, but O(N²) cells in the table).

[Figure: the same CKY chart, with most cells empty and only N(John), V(hit), D(the), N(ball), NP and VP filled in.]
CKY parsing
For an internal node x at height k, there are k possible pairs of
symbols (B, C) into which x can decompose:

[Figure: the cell for x combining a left child B and a right child C at one of the k possible split points.]
Viterbi Parse
Given the probabilities of cells in the table, we can find the most
probable parse using a generalized Viterbi scan.
i.e. find the most likely parse at the root, then return it, and the
left and right nodes which generated it.
We can do this as part of the scoring process by maintaining a
set of backpointers.
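A compact sketch of probabilistic CKY with backpointers for the Viterbi parse, over a made-up Chomsky Normal Form grammar (rules and probabilities are illustrative only):

# Toy CNF grammar: lexical rules and binary rules with made-up probabilities.
lexicon = {"John": [("NP", 1.0)], "hit": [("V", 1.0)],
           "the": [("D", 1.0)], "ball": [("N", 1.0)]}
binary_rules = {("D", "N"): [("NP", 1.0)],
                ("V", "NP"): [("VP", 1.0)],
                ("NP", "VP"): [("S", 1.0)]}

def cky(words):
    n = len(words)
    # chart[i][j] maps a symbol spanning words[i:j] to (probability, backpointer).
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for sym, p in lexicon.get(w, []):
            chart[i][i + 1][sym] = (p, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                       # split point
                for B, (pB, _) in chart[i][k].items():
                    for C, (pC, _) in chart[k][j].items():
                        for A, pR in binary_rules.get((B, C), []):
                            p = pR * pB * pC
                            if p > chart[i][j].get(A, (0.0, None))[0]:
                                chart[i][j][A] = (p, (k, B, C))  # keep the best split
    return chart

def backtrace(chart, sym, i, j):
    # Rebuild the most probable tree from the stored backpointers.
    _, back = chart[i][j][sym]
    if isinstance(back, str):                               # lexical cell
        return (sym, back)
    k, B, C = back
    return (sym, backtrace(chart, B, i, k), backtrace(chart, C, k, j))

words = "John hit the ball".split()
print(backtrace(cky(words), "S", 0, len(words)))
# ('S', ('NP', 'John'), ('VP', ('V', 'hit'), ('NP', ('D', 'the'), ('N', 'ball'))))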
Systems
• NLTK: Python-based NLP system. Many modules, good
visualization tools, but not quite state-of-the-art performance.

• Stanford Parser: Another comprehensive suite of tools (also POS


tagger), and state-of-the-art accuracy. Has the definitive
dependency module.

• Berkeley Parser: Slightly higher parsing accuracy (than Stanford)


but not as many modules.

• Note: high-quality dependency parsing is usually very slow, but


see: https://github.com/dlwh/puck
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Dependency Parsing
Dependency structures are rooted in existing words, i.e. there
are no non-terminal nodes, and a tree on n words has exactly n
nodes.

A dependency tree has about half as many total nodes as a


binary (Chomsky Normal form) constituency tree.
Dependency Parsing
Dependency parses may be non-binary, and structure type is
encoded in links rather than nodes:
Constituency vs. Dependency Grammars
Dependency parsing is faster, and dependency parses are more
compact descriptions. They are not simplifications of constituency
parses, although they can be derived from lexicalized constituency
grammars. Lexicalized grammars assign a head word to each
non-terminal node.
Constituency vs. Dependency Grammars
• Dependency and constituency parses capture attachment
information (e.g. sentiments directed at particular targets).
• Constituency parses are probably better for abstracting
semantic features, i.e. constituency units often correspond to
semantic units.
Constituency grammars have a much larger space to search and
take longer to parse, with time growing as O(N³) in sentence length.

However, if the goal is to find semantic groups, it's possible to
proceed bottom-up with (small) bounded N.
Dependencies
“The cat sat on the mat”
[Figure: the dependency tree (left) and the constituency parse tree (right) for this sentence; the dependency tree's node labels correspond to the constituency labels of the leaf nodes.]
Dependencies
From the dependency tree, we can obtain a “sketch” of the
sentence. i.e. by starting at the root we can look down one level to
get:
“cat sat on”
And then by looking for the object of the prepositional child, we get:
“cat sat on mat”
We can easily ignore determiners “a, the”.

And importantly, adjectival and adverbial modifiers generally


connect to their targets:
Dependencies
“Brave Merida prepared for a long, cold winter”
Dependencies
“Russell reveals himself here as a supremely gifted director of
actors”
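As an illustration of extracting such a “sketch” automatically, here is a short example using spaCy as the dependency parser. spaCy is not one of the tools mentioned above; the model name, and the exact dependency labels such as prep/pobj/det, depend on the installed model:

import spacy

nlp = spacy.load("en_core_web_sm")        # any English model with a parser will do
doc = nlp("The cat sat on the mat")

for tok in doc:
    print(tok.text, tok.dep_, "<-", tok.head.text)   # word, relation, head

# Crude sketch: keep the root verb, its non-determiner children,
# and the objects of any prepositional children.
root = next(sent.root for sent in doc.sents)
keep = {root}
for child in root.children:
    if child.dep_ != "det":
        keep.add(child)
        if child.dep_ == "prep":
            keep.update(c for c in child.children if c.dep_ == "pobj")
print(" ".join(t.text for t in sorted(keep, key=lambda t: t.i)))   # "cat sat on mat"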
Dependencies
• Stanford dependencies can be constructed from the output of a
lexicalized constituency parser (so you can in principle use other
parsers).

• The mapping is based on hand-written regular expressions.

• Dependency grammars have been widely used for sentiment


analysis and for semantic embeddings of sentences.

• Examples:
• http://nlp.stanford.edu/software/lex-parser.shtml
• http://www.maltparser.org/
Parser Performance
• MaltParser is widely used for dependency parsing because of its
speed (around 10k words/sec).

• High-quality constituency parsing on a fast machine usually runs
at < 2000 words/sec. Puck (with a GPU) raises this to around 8000
words/sec, similar to MaltParser.

• Parsers are very complicated pieces of software, and are quite


sensitive to tuning. Make sure your parser is tuned properly –
evaluate on sample text.

• Parsers are often trained on sanitized data (news articles), and


will not work as well on informal text like social media. If possible,
train on similar data (e.g. LDC’s English web treebank).
Summary
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
