
1

OVERVIEW OF STATISTICAL NATURAL LANGUAGE PROCESSING
Dr. Fei Song
School of Computer Science
University of Guelph

May 18, 2016


Outline
2

q What is Statistical Natural Language Processing (SNLP)?
q Language Models for Information Retrieval

q Text Classification and Sentiment Analysis

q Probabilistic Models (LDA, Bayesian HMM, and POSLDA) for language processing


q References
What is SNLP?
3

q Infer and rank linguistic structures from text based on statistical language modeling.
§ Probability and Statistics
§ Machine Learning Techniques
q Started in the late 1950s, but did not become popular until the early 1980s.
q Many applications: Information Retrieval, Information Extraction, Text Classification, Text Mining, and Biological Data Analysis.
Language Modeling
4

q A statistical language model requires estimates for the probabilities of word sequences:
P(w1,n) = P(w1, w2, …, wn)
q How do we assign probabilities to word sequences? By the chain rule (illustrated in the sketch below):
P(w1 w2 … wn) = P(w1) P(w2|w1) … P(wn|w1 w2 … wn-1)
e.g., Jack went to the {hospital, number, if, …}
q Left-context only?
§ The {big, pig} dog …
§ P(dog|the big) >> P(dog|the pig)
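A minimal sketch (not from the slides) of how a left-context bigram model assigns a probability to a word sequence via the chain rule; the one-sentence corpus and its counts are toy data:

# Toy bigram language model: P(w1..wn) ~= P(w1) * prod P(wi | wi-1).
# The corpus below is a made-up toy example, just to illustrate the computation.
from collections import Counter

corpus = "the big dog barked at the big cat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_word(w):
    return unigrams[w] / total

def p_next(w, prev):
    # Maximum-likelihood bigram estimate; returns 0 for unseen pairs.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sequence_prob(words):
    prob = p_word(words[0])
    for prev, w in zip(words, words[1:]):
        prob *= p_next(w, prev)
    return prob

print(sequence_prob("the big dog".split()))  # > 0: all bigrams were observed
print(sequence_prob("the pig dog".split()))  # 0.0: "pig" never appears in the corpus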
Noisy Channel Framework
5

q Through decoding, we want to find the most likely input for the given observation.

[Diagram: input I → noisy channel p(o|i) → observation O → decoder → Î]

Î = argmax_i p(i|o) = argmax_i p(i) p(o|i) / p(o) = argmax_i p(i) p(o|i)

§ Applications: machine translation, optical character recognition, speech recognition, spelling correction.
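A toy sketch of noisy-channel decoding applied to spelling correction; the candidate set and every probability below are made-up numbers, purely to illustrate the argmax:

# Noisy-channel decoding sketch for spelling correction.
# Both the prior p(i) and the channel model p(o|i) are invented values.
observed = "teh"

candidates = {
    # i: (p(i), p(observed | i))
    "the": (0.05,  0.01),   # common word, plausible transposition error
    "ten": (0.001, 0.002),  # rarer word, less likely edit
    "teh": (1e-7,  0.9),    # the observed string itself, but an implausible "word"
}

best = max(candidates, key=lambda i: candidates[i][0] * candidates[i][1])
print(best)  # "the": the prior and the channel model together favour it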
Language Models for IR
6

q N-gram models:
Unigram: P(w1,n) = P(w1) P(w2) … P(wn)
Bigram: P(w1,n) = P(w1) P(w2| w1) … P(wn|wn-1)
Trigram: P(w1,n) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|wn-2 wn-1)

q Documents as language samples:


P(t1, t2, …, tn | d) = ∏_{i=1}^{n} P(ti | d)
Language Models for IR
7

q Query as a generation process:
P(d | t1, t2, …, tm)
∝ P(d) P(t1, t2, …, tm | d) / P(t1, t2, …, tm)   (Bayes’ theorem)
∝ P(d) P(t1, t2, …, tm | d)   (the query probability is the same for every document)
∝ P(t1, t2, …, tm | d)   (uniform prior over documents)
= ∏_{i=1}^{m} P(ti | d)   (unigram independence of terms)
A Naïve Solution
8

q Maximum likelihood estimate:
P_mle(t | d) = tf_{t,d} / dl_d

tf_{t,d}: the raw frequency of term t in document d
dl_d: the total number of tokens in document d


Sparse Data Problem
9

q A document is often too small a sample: a single unseen query term zeroes out the whole product
P(ti | d) = 0 ⇒ ∏_{i=1}^{m} P(ti | d) = 0

q A document’s size is fixed, yet unseen terms are not all equally unlikely:
P(information, retrieval | d) > 0, keyword ∉ d, crocodile ∉ d,
but intuitively we still want P(keyword | d) >> P(crocodile | d).
Zipf's Law
10

q Given the frequency f of a word and its rank r in the list of words ordered by their frequencies:
f ∝ 1/r, or f × r = k for a constant k

[Figure: frequency vs. rank curve — a small number of common words, a reasonable number of medium-frequency words, and a large number of rare words]
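A quick empirical sketch for checking Zipf’s law on any text; corpus.txt is only a placeholder path:

# Rough empirical check of Zipf's law: frequency * rank should stay near a constant.
from collections import Counter

text = open("corpus.txt").read().lower().split()  # any plain-text file (path is illustrative)
freqs = Counter(text).most_common()

for rank, (word, f) in enumerate(freqs[:20], start=1):
    print(f"{rank:>4}  {word:<15} f={f:<8} f*r={f * rank}")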
Data Smoothing
11

q Laplace’s Law, where T is the total number of distinct terms (the vocabulary size):
P_LAP(t | d) = (tf_{t,d} + 1) / (dl_d + T)

q An extension of Laplace’s Law: Lidstone’s Law.
P_LID(t | d) = (tf_{t,d} + λ) / (dl_d + Tλ) = μ P_mle(t | d) + (1 − μ)/T
where μ = dl_d / (dl_d + Tλ)
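A minimal sketch of the two estimators above, with the counts passed in directly:

# Laplace and Lidstone smoothing of the document language model (toy counts).
def p_laplace(tf, dl, T):
    # tf: raw count of the term in d, dl: document length, T: vocabulary size
    return (tf + 1) / (dl + T)

def p_lidstone(tf, dl, T, lam=0.5):
    return (tf + lam) / (dl + T * lam)

print(p_laplace(0, 100, 10000))   # an unseen term still gets a small non-zero probability
print(p_lidstone(3, 100, 10000))  # a seen term, damped less aggressively than with Laplace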
Data Smoothing
12

q Smooth the document model with the collection model:
P_combined(t | d) = ω × P_document(t | d) + (1 − ω) × P_collection(t)

§ The combined probability is still properly normalized, with values between 0 and 1.
§ It further differentiates between missing terms such as “keyword” and “crocodile”.
§ The collection model can be made stable by adding more documents to the collection.
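A minimal sketch of query-likelihood retrieval with this document/collection mixture, assuming simple whitespace tokenization and a hand-picked weight ω = 0.7:

# Query-likelihood retrieval with the document model smoothed by the collection model:
# P_combined(t|d) = w * P_mle(t|d) + (1 - w) * P_collection(t).
from collections import Counter

docs = {
    "d1": "information retrieval with language models".split(),
    "d2": "the crocodile lives in the river".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts = Counter(collection)
coll_len = len(collection)

def score(query, doc_tokens, w=0.7):
    doc_counts = Counter(doc_tokens)
    dl = len(doc_tokens)
    prob = 1.0
    for t in query.split():
        p_doc = doc_counts[t] / dl            # P_mle(t | d)
        p_coll = coll_counts[t] / coll_len    # P_collection(t)
        prob *= w * p_doc + (1 - w) * p_coll  # smoothed estimate, non-zero for any collection term
    return prob

for name, tokens in docs.items():
    print(name, score("information retrieval", tokens))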
Text Classifications/Categorizations
13

q Common classification problems:

Problem                   Input               Categories
Tagging                   context of a word   tag for the word
Disambiguation            context of a word   sense for the word
PP attachment             sentence            parse trees
Author identification     document            author(s)
Language identification   document            language(s)
Text categorization       document            topic(s)

q Common classification methods: decision trees, maximum entropy modeling, neural networks, and clustering.
What is Sentiment Analysis?
14

“… after a week of using the camera, I am very unhappy with the camera. The LCD screen is too small and the picture quality is poor. This camera is junk.”
Subjective Words
15

q A consumer is unlikely to write: “This camera is great. It takes great pictures. The LCD screen is great. I love this camera”.

q But more likely to write: “This camera is great. It takes breathtaking pictures. The LCD screen is bright and clear. I love this camera”.

q More diverse usage of subjective words: infrequent within but frequent across documents.
Topic Models
16

q Topic modeling is a relatively new statistical approach to understanding the thematic structure in a collection of data
§ Uncovering hidden topics in a corpus of documents
§ Reducing dimensionality from words down to topics

q Topic models treat document creation as a random process: determine a topic proportion for the document, then select each word from the related topic distributions.
Discover Topics
17

Top words for four discovered topics (one topic per column on the original slide):

Topic 1: charles, prince, london, marriage, parker, camilla, bowles, wedding, british, thursday, king, royal, married, marry, wales, queen, diana, april, relationship, couple
Topic 2: study, found, drug, research, risk, drugs, researchers, dr, patients, disease, vioxx, health, increased, merck, text, brain, schizophrenia, studies, medical, effects
Topic 3: bush, protest, texas, bushs, iraq, president, cindy, war, ranch, crawford, sheehan, son, casey, killed, antiwar, california, george, mother, road, peace
Topic 4: surface, atmosphere, space, system, earth, probe, european, moon, huygens, titan, mission, friday, nasa, scientists, cassini, saturns, agency, data, titans, 14
Discover Hierarchies
18
Topic Use Changing Through Time
19

[Figure: topic frequency by year (1800–1900) for the topics “horses”, “ships”, and “vehicles”]
This is the generative model of LDA: documents exhibit multiple topics; each document is a random mixture of topics that are shared across the corpus, and each word is randomly drawn from one of those topics.

20

Example topics: {election, voter, vote, president, ballot}, {law, legal, lawyer, court, judge}, {city, state, florida, united, california}
*Harvard Law Review, Vol. 118, No. 7 (May, 2005), pp. 2314-2335 (Note).
However, all of this thematic information is hidden; we only observe the documents. Using probabilistic reasoning, we wish to infer the latent structure behind the words: the topic proportions, the per-word topic indices, and the topics themselves.

21

*Harvard Law Review, Vol. 118, No. 7 (May, 2005), pp. 2314-2335 (Note).
Bayesian Probability
22

q Bayes’ Theorem:
p(θ | x) = p(x | θ) p(θ) / p(x)
posterior ∝ likelihood × prior

q Subjective probability: model the prior with a given distribution

[Figure: Beta prior × linear likelihood → posterior]


Dirichlet Distribution
23

q Distribution over distributions:
P(θ | α) = (1 / B(α)) ∏_{i=1}^{K} θ_i^(α_i − 1)
where B(α) = ∏_{i=1}^{K} Γ(α_i) / Γ(∑_{i=1}^{K} α_i)
Latent Dirichlet Allocation (LDA)
24

§ Initially proposed by Blei, et al. (2003)

[Plate diagram: α → θ → z → w, with ϕ(k) ~ Dir(β) for each of the K topics, repeated over N words and M documents]

Generative Process:
1. ϕ(k) ~ Dir(β)
2. For each document d ∈ M:
   a. θd ~ Dir(α)
   b. For each word w ∈ d:
      i. z ~ Discrete(θd)
      ii. w ~ Discrete(ϕ(z))

Joint distribution:
p(w, z, θ, ϕ | α, β) = ∏_{k=1}^{K} p(ϕ_k | β) × ∏_{m=1}^{M} p(θ_m | α) × ∏_{m=1}^{M} ∏_{n=1}^{N_m} p(z_{m,n} | θ_m) p(w_{m,n} | ϕ_{z_{m,n}})
Inference
25

q We are interested in the posterior distributions for ϕ, z, and θ
q Computing these distributions exactly is intractable
q We therefore turn to approximate inference
techniques:
§ Gibbs sampling, variational inference, …
q Collapsed Gibbs sampling
§ The multinomial parameters are integrated out before
sampling
Gibbs Sampling
26

q A popular MCMC (Markov Chain Monte Carlo) method that samples from the conditional distributions of the posterior variables

q For the joint distribution p(x) = p(x1, x2, …, xm):
1. Randomly initialize each xi
2. For t = 1, 2, …, T:
   2.1. x1^(t+1) ~ p(x1 | x2^t, x3^t, …, xm^t)
   2.2. x2^(t+1) ~ p(x2 | x1^(t+1), x3^t, …, xm^t)
   …
   2.m. xm^(t+1) ~ p(xm | x1^(t+1), x2^(t+1), …, x(m-1)^(t+1))
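A minimal illustration of this loop for a target distribution of our own choosing (not from the slides): a standard bivariate Gaussian with correlation ρ, whose conditionals are Gaussian and known in closed form:

# Gibbs sampling for a bivariate Gaussian with correlation rho:
# x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
import numpy as np

rng = np.random.default_rng(1)
rho, T = 0.8, 10000
x1, x2 = 0.0, 0.0                       # 1. initialize
samples = []
for t in range(T):                      # 2. sweep the conditionals
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # 2.1  x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # 2.2  x2 ~ p(x2 | x1)
    samples.append((x1, x2))

samples = np.array(samples[1000:])      # drop burn-in
print(np.corrcoef(samples.T)[0, 1])     # should be close to rho = 0.8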
(Collapsed) Gibbs Sampling
27

q We integrate out the multinomial parameters φ and θ so that the Markov chain stabilizes more quickly and we have fewer variables to sample.

q Our sampling equation is given as follows:
p(z_i | z_{-i}, w) ∝ (n_{z_i}^{(d)} + α_{z_i}) / (n_·^{(d)} + α_·) × (n_w^{(z_i)} + β) / (n_·^{(z_i)} + Wβ)
where n_{z_i}^{(d)} is the number of tokens in document d assigned to topic z_i, n_w^{(z_i)} is the number of times word w is assigned to topic z_i, dots denote sums over the corresponding index, W is the vocabulary size, and all counts exclude the current token.

q GibbsLDA++: a free C/C++ implementation of LDA
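A compact sketch of collapsed Gibbs sampling for LDA that implements the count-ratio equation above on a toy corpus of word ids; this is only an illustration, not the GibbsLDA++ code:

# Collapsed Gibbs sampling for LDA:
# p(z_i = k | z_-i, w) is proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
# The per-document denominator is constant in k, so it can be dropped.
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.5, beta=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))          # topic counts per document
    n_kw = np.zeros((K, V))                  # word counts per topic
    n_k = np.zeros(K)                        # total counts per topic
    z = []                                   # current topic assignment of every token
    for d, doc in enumerate(docs):           # random initialization
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                   # remove the token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                   # add it back under the newly sampled topic
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_kw + beta                        # unnormalized smoothed topic-word weights

# Tiny toy corpus: documents are lists of word ids over a vocabulary of size 4.
topics = lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]], K=2, V=4)
print(topics)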


Syntax Models
28

q Hidden Markov Model (HMM): the probability distribution of the latent variable zi follows the Markov property and depends on the value of the previous latent variable zi-1

[Diagram: latent chain z1 → z2 → z3 → z4 → …, governed by transition matrix A; each zi emits an observation xi according to emission distributions ϕ]

q Each latent state z has a unique emission probability


§ This is a mixture model like LDA
q Useful for unsupervised POS tagging
§ Language exhibits a structure due to syntax rules
§ State-of-the-art: “Bayesian” HMM where transition rows and
emission probabilities are random variables drawn from
Dirichlet distributions [3]
Combining Topic and Syntax Models?
29

q Considering both axes of information can help us model text more precisely and can thus aid in prediction, processing, and ultimately many NLP tasks

q Example 1:
§ Our favourite city during the trip was _________.

§ How do we reason about what the missing word might be?

§ An HMM should be able to predict that it’s a noun

§ LDA might be able to predict that it’s a travel word*

§ A combined model could theoretically determine that it’s a noun about travel
Combining Topic and Syntax Models?
30

q Example 2:
§ Is the word “book” a noun or a verb?
• If we know that a “library” topic generated it, it’s much more
likely to be a noun
• If we know that an “airline” topic generated it, it’s more likely
to be a verb (“to book a flight”)

q Example 3:
§ We know that the word “seal” is a noun, what is its topic?
• More likely to be related to “marine mammals” than
“construction” (“to seal a crack”)
POSLDA (Part-Of-Speech LDA) Model
31

q A “multi-faceted” topic model where word w depends on both topic z and class c when c is a “semantic” class
§ wi ~ p(wi | ci, zi)

q When c is a “syntactic” class, the emitted word only depends on class c itself

q This model results in POS-specific topics and can automatically filter out “stop-words” that must be manually removed in LDA

[Plate diagram: per document (M plates), θ ~ Dir(α) generates topic assignments z1, z2, z3, …; a class chain c1 → c2 → c3 → … is drawn from transition rows π ~ Dir(γ) over S classes; each word w1, w2, w3, … is emitted from one of the T·S_SEM + S_SYN word distributions ϕ ~ Dir(β)]
POSLDA Generative Process
32

1. For each row πr ∈ π:
   a. Draw πr ~ Dirichlet(γ)
2. For each word distribution ϕn ∈ ϕ:
   a. Draw ϕn ~ Dirichlet(β)
3. For each document d ∈ D:
   a. Draw θd ~ Dirichlet(α)
   b. For each token i ∈ d:
      i. Draw ci ~ π(ci-1)
      ii. If ci ∈ CSYN:
         A. Draw wi ~ ϕSYN(ci)
      iii. Else (ci ∈ CSEM):
         A. Draw zi ~ θd
         B. Draw wi ~ ϕSEM(ci, zi)
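A minimal numpy sketch of this generative process; the toy sizes, the split of class ids into syntactic and semantic, the fixed start class, and all hyperparameter values are assumptions for illustration:

# Sketch of the POSLDA generative process (toy sizes; all hyperparameters arbitrary).
# Classes 0..S_SYN-1 are treated as syntactic, the rest as semantic; class 0 is the start class.
import numpy as np

rng = np.random.default_rng(0)
K, S_SYN, S_SEM, V, M, N = 3, 2, 2, 40, 4, 15
S = S_SYN + S_SEM
alpha, beta, gamma = 0.5, 0.1, 0.5

pi = rng.dirichlet(np.full(S, gamma), size=S)                  # 1. transition rows pi_r ~ Dir(gamma)
phi_syn = rng.dirichlet(np.full(V, beta), size=S_SYN)          # 2. word distributions for syntactic classes
phi_sem = rng.dirichlet(np.full(V, beta), size=(S_SEM, K))     #    and one per (semantic class, topic) pair

docs = []
for d in range(M):                                             # 3. for each document
    theta = rng.dirichlet(np.full(K, alpha))                   #    a. theta_d ~ Dir(alpha)
    c_prev, words = 0, []
    for i in range(N):                                         #    b. for each token
        c = rng.choice(S, p=pi[c_prev])                        #       i.   c_i ~ pi^(c_{i-1})
        if c < S_SYN:                                          #       ii.  syntactic class:
            w = rng.choice(V, p=phi_syn[c])                    #            w ~ phi_SYN(c)
        else:                                                  #       iii. semantic class:
            z = rng.choice(K, p=theta)                         #            z ~ theta_d
            w = rng.choice(V, p=phi_sem[c - S_SYN, z])         #            w ~ phi_SEM(c, z)
        words.append(w)
        c_prev = c
    docs.append(words)
print(docs[0])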
POSLDA Interpretability
33

q Learned word distributions from TREC AP corpus:


Generalized Probabilistic Model
34

q POSLDA reduces to LDA when the number of classes S = 1.

q POSLDA reduces to Bayesian HMM when the number of topics K = 1.

q POSLDA reduces to HMMLDA when the number of semantic classes Ssem = 1.
FS from Semantic Classes
35

q Research has shown that semantic classes such as adjectives, adverbs, and verbs are more useful for SA.
q Select representative words for a semantic class by picking its top-ranked words until their cumulative probability reaches a threshold θ (e.g., 75% or 90%), as in the sketch after this list.
q Merge all selected words into one set Wsem, and reduce it further by a DF-cutoff if needed.
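A small sketch of the selection step: rank the words of one semantic class by probability and keep them until the cumulative probability reaches the threshold; the word distribution below is hypothetical:

# Select representative words for a semantic class: take the top-ranked words
# until their cumulative probability reaches the threshold (e.g., 0.75).
def select_features(word_probs, threshold=0.75):
    """word_probs: dict mapping word -> probability under one semantic class."""
    selected, cumulative = [], 0.0
    for word, p in sorted(word_probs.items(), key=lambda x: -x[1]):
        selected.append(word)
        cumulative += p
        if cumulative >= threshold:
            break
    return selected

# Hypothetical class distribution (say, adjectives in camera reviews):
print(select_features({"great": 0.30, "sharp": 0.25, "blurry": 0.15,
                       "bright": 0.12, "cheap": 0.10, "heavy": 0.08}))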


FS from Semantic Classes with Tagging
36

q POSLDA is unsupervised, so its results do not usually match human-labeled answers.
q A tagging dictionary contains all the POS tags that can be used for the given words in a corpus.
q With a tagging dictionary, a word is only assigned to its permitted POS classes; if a word is not in the dictionary, it can participate in all POS classes, the same as in the unsupervised POSLDA process.
FS with Automatic Stopword Removal
37

q Similar to Wsem, we can also build Wsyn from the syntactic classes to extract topic-independent stopwords.
q Such a process is both automatic and corpus-specific, avoiding under- or over-removal of the related words.
q Although POSLDA can separate semantic and syntactic classes, removing stopwords explicitly helps reduce the noise in the dataset.
FS for Aspect-Based SA
38

q POSLDA associates each topic with its related semantic classes, such as “nouns about sports” and “verbs about travel”.

q By modeling topics as aspects, we can then select features from the corresponding semantic classes using the methods described earlier.

q To model aspects, we use manually prepared seed lists (possibly extended with a bootstrapping method), and pin them to the related aspects during the modeling process.
39

Questions?
References
40

q Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
q Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008 (online copy available on the web).
q Daniel Jurafsky and James H. Martin. Speech and Language Processing. Second Edition. Pearson Education, 2008.
References
41

q David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
q Sharon Goldwater and Tom Griffiths. A Fully Bayesian Approach to Unsupervised Part-of-Speech Tagging. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 744–751, Prague, Czech Republic, June 2007.
References
42

q Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In Proceedings of EMNLP, 2002.
References
43

q William M. Darling. Generalized Probabilistic Topic and Syntax Models for Natural Language Processing. Ph.D. Thesis, University of Guelph, 2012.
q Haochen Zhou and Fei Song. Aspect-Level Sentiment Analysis Based on a Generalized Probabilistic Topic and Syntax Model. In Proceedings of FLAIRS-28, 2015.
