LSA, pLSA, and LDA Acronyms, Oh My!
Slides by me,
Thomas Huffman,
Tom Landauer and Peter Foltz,
Melanie Martin,
Hsuan-Sheng Chiu,
Haiyan Qiao,
Jonathan Huang
Outline
Latent Semantic Analysis/Indexing (LSA/LSI)
Probabilistic LSA/LSI (pLSA or pLSI)
Why?
Construction
Aspect Model
EM
Tempered EM
Comparison with LSA
Latent Dirichlet Allocation (LDA)
Why?
Construction
Comparison with LSA/pLSA
LSA vs. LSI
But first:
What is the difference between LSI and LSA?
LSI refers to using this technique for indexing, or
information retrieval.
LSA refers to using it for everything else.
It's the same technique, just different
applications.
The Problem
Two problems that arise using the vector
space model:
synonymy: many ways to refer to the same object,
e.g. car and automobile
leads to poor recall
polysemy: most words have more than one
distinct meaning, e.g. model, python, chip
leads to poor precision
The Problem
Example: Vector Space Model
(from Lillian Lee)
Document 1: auto, engine, bonnet, tyres, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize
Synonymy: documents 1 and 2 will have a small cosine, but are related
Polysemy: documents 2 and 3 will have a large cosine, but are not truly related
The Setting
Corpus, a set of N documents
D = {d_1, …, d_N}
Vocabulary, a set of M words
W = {w_1, …, w_M}
A matrix of size N * M to represent the occurrence of words in
documents
Called the term-document matrix
Latent Semantic Indexing
Latent: present but not evident, hidden
Semantic: meaning
LSI finds the hidden meaning of terms
based on their occurrences in documents
Latent Semantic Space
LSI maps terms and documents to a latent
semantic space
Comparing terms in this space should make
synonymous terms look more similar
LSI Method
Singular Value Decomposition (SVD)
A(n×m) = U(n×n) S(n×m) V(m×m)ᵀ
Keep only the k largest singular values in S
A(n×m) ≈ U(n×k) S(k×k) V(k×m)ᵀ
Convert terms and documents to points in k-
dimensional space
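A minimal numpy sketch of the method just described, assuming a term-by-document count matrix A (terms as rows, documents as columns, as in the example that follows); the function name lsi and the choice of k are illustrative, not from the original slides.

```python
import numpy as np

def lsi(A, k):
    """Truncated SVD of a term-document matrix A (terms x documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]        # keep the k largest singular values
    term_coords = U_k * s_k       # each row: a term as a point in the k-dim latent space
    doc_coords = Vt_k.T * s_k     # each row: a document as a point in the k-dim latent space
    A_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-k approximation of A
    return term_coords, doc_coords, A_k
```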
A Small Example
Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
c1 c2 c3 c4 c5 m1 m2 m3 m4
human 1 0 0 1 0 0 0 0 0
interface 1 0 1 0 0 0 0 0 0
computer 1 1 0 0 0 0 0 0 0
user 0 1 1 0 1 0 0 0 0
system 0 1 1 2 0 0 0 0 0
response 0 1 0 0 1 0 0 0 0
time 0 1 0 0 1 0 0 0 0
EPS 0 0 1 1 0 0 0 0 0
survey 0 1 0 0 0 0 0 0 1
trees 0 0 0 0 0 1 1 1 0
graph 0 0 0 0 0 0 1 1 1
minors 0 0 0 0 0 0 0 1 1
A Small Example 2
r(human, user) = -.38     r(human, minors) = -.29
A Small Example 3
Singular Value Decomposition:
A = U S Vᵀ
Dimension Reduction (keep only the k largest singular values):
A ≈ Ã = Ũ S̃ Ṽᵀ
A Small Example 4
{U} =
0.22 -0.11 0.29 -0.41 -0.11 -0.34 0.52 -0.06 -0.41
0.20 -0.07 0.14 -0.55 0.28 0.50 -0.07 -0.01 -0.11
0.24 0.04 -0.16 -0.59 -0.11 -0.25 -0.30 0.06 0.49
0.40 0.06 -0.34 0.10 0.33 0.38 0.00 0.00 0.01
0.64 -0.17 0.36 0.33 -0.16 -0.21 -0.17 0.03 0.27
0.27 0.11 -0.43 0.07 0.08 -0.17 0.28 -0.02 -0.05
0.27 0.11 -0.43 0.07 0.08 -0.17 0.28 -0.02 -0.05
0.30 -0.14 0.33 0.19 0.11 0.27 0.03 -0.02 -0.17
0.21 0.27 -0.18 -0.03 -0.54 0.08 -0.47 -0.04 -0.58
0.01 0.49 0.23 0.03 0.59 -0.39 -0.29 0.25 -0.23
0.04 0.62 0.22 0.00 -0.07 0.11 0.16 -0.68 0.23
0.03 0.45 0.14 -0.01 -0.30 0.28 0.34 0.68 0.18
A Small Example 5
{S} =
3.34
2.54
2.35
1.64
1.50
1.31
0.85
0.56
0.36
A Small Example 6
{V} =
0.20 0.61 0.46 0.54 0.28 0.00 0.01 0.02 0.08
-0.06 0.17 -0.13 -0.23 0.11 0.19 0.44 0.62 0.53
0.11 -0.50 0.21 0.57 -0.51 0.10 0.19 0.25 0.08
-0.95 -0.03 0.04 0.27 0.15 0.02 0.02 0.01 -0.03
0.05 -0.21 0.38 -0.21 0.33 0.39 0.35 0.15 -0.60
-0.08 -0.26 0.72 -0.37 0.03 -0.30 -0.21 0.00 0.36
0.18 -0.43 -0.24 0.26 0.67 -0.34 -0.15 0.25 0.04
-0.01 0.05 0.01 -0.02 -0.06 0.45 -0.76 0.45 -0.07
-0.06 0.24 0.02 -0.08 -0.26 -0.62 0.02 0.52 -0.45
A Small Example 7
r(human, user) = .94     r(human, minors) = -.83
c1 c2 c3 c4 c5 m1 m2 m3 m4
human 0.16 0.40 0.38 0.47 0.18 -0.05 -0.12 -0.16 -0.09
interface 0.14 0.37 0.33 0.40 0.16 -0.03 -0.07 -0.10 -0.04
computer 0.15 0.51 0.36 0.41 0.24 0.02 0.06 0.09 0.12
user 0.26 0.84 0.61 0.70 0.39 0.03 0.08 0.12 0.19
system 0.45 1.23 1.05 1.27 0.56 -0.07 -0.15 -0.21 -0.05
response 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
time 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
EPS 0.22 0.55 0.51 0.63 0.24 -0.07 -0.14 -0.20 -0.11
survey 0.10 0.53 0.23 0.21 0.27 0.14 0.31 0.44 0.42
trees -0.06 0.23 -0.14 -0.27 0.14 0.24 0.55 0.77 0.66
graph -0.06 0.34 -0.15 -0.30 0.20 0.31 0.69 0.98 0.85
minors -0.04 0.25 -0.10 -0.21 0.15 0.22 0.50 0.71 0.62
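To make these numbers reproducible, here is a small numpy sketch (not from the original slides) that builds the term-document matrix from the titles, keeps the two largest singular values, and recomputes the term-term correlations; it should recover values close to the r(human, user) and r(human, minors) figures quoted above.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
# Columns: c1 c2 c3 c4 c5 m1 m2 m3 m4 (the term-document matrix shown earlier)
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]        # rank-2 reconstruction

def r(a, b):
    """Pearson correlation between two term rows of the reconstructed matrix."""
    return np.corrcoef(X2[terms.index(a)], X2[terms.index(b)])[0, 1]

print(r("human", "user"))    # roughly  .94 (vs. -.38 in the raw counts)
print(r("human", "minors"))  # roughly -.83 (vs. -.29 in the raw counts)
```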
LSA Titles example:
Correlations between titles in raw data

      c1     c2     c3     c4     c5     m1     m2     m3
c2  -0.19
c3   0.00   0.00
c4   0.00   0.00   0.47
c5  -0.33   0.58   0.00  -0.31
m1  -0.17  -0.30  -0.21  -0.16  -0.17
m2  -0.26  -0.45  -0.32  -0.24  -0.26   0.67
m3  -0.33  -0.58  -0.41  -0.31  -0.33   0.52   0.77
m4  -0.33  -0.19  -0.41  -0.31  -0.33  -0.17   0.26   0.56

Average correlations (raw data): within c titles 0.02, c vs. m titles -0.30, within m titles 0.44

Correlations in first-two dimension space

      c1     c2     c3     c4     c5     m1     m2     m3
c2   0.91
c3   1.00   0.91
c4   1.00   0.88   1.00
c5   0.85   0.99   0.85   0.81
m1  -0.85  -0.56  -0.85  -0.88  -0.45
m2  -0.85  -0.56  -0.85  -0.88  -0.44   1.00
m3  -0.85  -0.56  -0.85  -0.88  -0.44   1.00   1.00
m4  -0.81  -0.50  -0.81  -0.84  -0.37   1.00   1.00   1.00

Average correlations (two-dimensional space): within c titles 0.92, c vs. m titles -0.72, within m titles 1.00
Pros and Cons
LSI puts documents together even if they don't have common words, provided the docs share frequently co-occurring terms
Disadvantages:
Statistical foundation is missing
SVD (cont.)
SVD of the term-by-document matrix X:
X = T₀ S₀ D₀ᵀ
If the singular values in S₀ are ordered by size, we keep only the first k largest values and get a reduced model:
X̂ = T S Dᵀ
X̂ doesn't exactly match X, and it gets closer as more and more singular values are kept.
This is what we want. We don't want a perfect fit, since we think some of the 0s in X should be 1s and vice versa.
It reflects the major associative patterns in the data, and ignores the smaller, less important influences and noise.
The LDA joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w:
p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)
The probability of a corpus D of M documents:
p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d
Latent Dirichlet allocation (cont.)
There are three levels to the LDA representation:
α and β are corpus-level parameters
θ_d are document-level variables
z_dn and w_dn are word-level variables
(Plate diagram: the word-level variables sit inside a per-document plate, which sits inside the corpus plate.)
Latent Dirichlet allocation (cont.)
LDA and exchangeability
A finite set of random variables {z_1, …, z_N} is said to be exchangeable if the joint distribution is invariant to permutation (π is a permutation)
An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable
De Finetti's representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter
http://en.wikipedia.org/wiki/De_Finetti's_theorem
p(z_1, …, z_N) = p(z_{π(1)}, …, z_{π(N)})
Latent Dirichlet allocation (cont.)
In LDA, we assume that words are generated
by topics (by fixed conditional distributions)
and that those topics are infinitely
exchangeable within a document
p(w, z) = ∫ p(θ) ( ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n) ) dθ
Latent Dirichlet allocation (cont.)
A continuous mixture of unigrams
By marginalizing over the hidden topic variable z, we can
understand LDA as a two-level model
Generative process for a document w:
1. Choose θ ~ Dir(α)
2. For each of the N words w_n:
(a) Choose a word w_n from p(w_n | θ, β)
Marginal distribution of a document:
p(w | θ, β) = Σ_z p(w | z, β) p(z | θ)
p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} p(w_n | θ, β) ) dθ
Latent Dirichlet allocation (cont.)
The distribution on the (V-1)-simplex is
attained with only k+kV parameters.
Relationship with other latent variable models
Unigram model
Mixture of unigrams
Each document is generated by first choosing a topic z and then generating N words independently from a conditional multinomial
k-1 parameters
p(w) = ∏_{n=1}^{N} p(w_n)          (unigram model)
p(w) = Σ_z p(z) ∏_{n=1}^{N} p(w_n | z)          (mixture of unigrams)
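As a sketch of the two formulas above (illustrative, not from the slides): given a document as an array of word indices, the unigram likelihood uses a single word distribution, while the mixture of unigrams sums over one topic per document.

```python
import numpy as np

def unigram_loglik(doc, p_w):
    """log p(w) = sum_n log p(w_n); doc is an array of word indices, p_w a length-V distribution."""
    return np.sum(np.log(p_w[doc]))

def mixture_of_unigrams_loglik(doc, p_z, p_w_given_z):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z); p_w_given_z is a k x V matrix."""
    per_topic = np.log(p_z) + np.log(p_w_given_z[:, doc]).sum(axis=1)
    return np.logaddexp.reduce(per_topic)   # log-sum-exp over topics
```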
Relationship with other latent variable models
(cont.)
Probabilistic latent semantic indexing
Attempt to relax the simplifying assumption made in the
mixture of unigrams models
In a sense, it does capture the possibility that a document
may contain multiple topics
kV + kM parameters and linear growth in M
p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)
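A sketch of the pLSI parameterization behind the formula above (not from the slides; the sizes and random values are illustrative): p(z | d) is a separate k-vector per training document and p(w | z) is shared across documents, which is where the kV + kM parameter count comes from.

```python
import numpy as np

rng = np.random.default_rng(1)
M, k, V = 4, 2, 6                             # documents, topics, vocabulary (illustrative)
p_d = np.full(M, 1.0 / M)                     # p(d)
p_z_given_d = rng.dirichlet(np.ones(k), M)    # M x k document-specific mixing weights p(z | d)
p_w_given_z = rng.dirichlet(np.ones(V), k)    # k x V shared topic-word distributions p(w | z)

# p(d, w) = p(d) * sum_z p(w | z) p(z | d)
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z)   # M x V joint distribution over (d, w)
assert np.isclose(p_dw.sum(), 1.0)
```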
Relationship with other latent variable models
(cont.)
Problems of pLSI
There is no natural way to use it to assign probability to a
previously unseen document
The linear growth in parameters suggests that the model is
prone to overfitting and empirically, overfitting is indeed a
serious problem
LDA overcomes both of these problems by treating the
topic mixture weights as a k-parameter hidden random
variable
The k+kV parameters in a k-topic LDA model do not grow
with the size of the training corpus.
Relationship with other latent variable models
(cont.)
The unigram model finds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution.
The mixture of unigrams model posits that for each document, one of the k points on the word simplex is chosen randomly and all the words of the document are drawn from the corresponding distribution.
The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics.
LDA posits that each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter.
Inference and parameter estimation
The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

p(w | α, β) = ( Γ(Σ_i α_i) / ∏_i Γ(α_i) ) ∫ ( ∏_{i=1}^{k} θ_i^{α_i − 1} ) ( ∏_{n=1}^{N} Σ_{i=1}^{k} ∏_{j=1}^{V} (θ_i β_ij)^{w_n^j} ) dθ
Unfortunately, this distribution is intractable to compute in general: the normalizing function p(w | α, β) is intractable due to the coupling between θ and β in the summation over latent topics.
Inference and parameter estimation (cont.)
The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood.
Essentially, one considers a family of lower bounds, indexed by a set of variational parameters.
A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.
Inference and parameter estimation (cont.)
Drop some edges and the w nodes
q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^{N} q(z_n | φ_n)

For comparison, the true posterior is
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
Inference and parameter estimation (cont.)
Variational distribution: q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^{N} q(z_n | φ_n)
Lower bound on the log-likelihood:
log p(w | α, β) = log ∫ Σ_z p(θ, z, w | α, β) dθ
= log ∫ Σ_z [ p(θ, z, w | α, β) q(θ, z) / q(θ, z) ] dθ
≥ ∫ Σ_z q(θ, z) log p(θ, z, w | α, β) dθ − ∫ Σ_z q(θ, z) log q(θ, z) dθ
= E_q[ log p(θ, z, w | α, β) ] − E_q[ log q(θ, z) ]
KL between the variational posterior and the true posterior:
D( q(θ, z | γ, φ) || p(θ, z | w, α, β) ) = ∫ Σ_z q(θ, z | γ, φ) log [ q(θ, z | γ, φ) / p(θ, z | w, α, β) ] dθ
= E_q[ log q(θ, z | γ, φ) ] − E_q[ log p(θ, z, w | α, β) ] + log p(w | α, β)
Inference and parameter estimation (cont.)
Finding a tight lower bound on the log likelihood
Maximizing the lower bound with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability:

log p(w | α, β) = L(γ, φ; α, β) + D( q(θ, z | γ, φ) || p(θ, z | w, α, β) )

(γ*, φ*) = arg min_{(γ, φ)} D( q(θ, z | γ, φ) || p(θ, z | w, α, β) )
Inference and parameter estimation (cont.)
Expand the lower bound:
L(γ, φ; α, β) = E_q[ log p(θ, z, w | α, β) ] − E_q[ log q(θ, z) ]
= E_q[ log p(θ | α) ] + E_q[ log p(z | θ) ] + E_q[ log p(w | z, β) ] − E_q[ log q(θ) ] − E_q[ log q(z) ]
Inference and parameter estimation (cont.)
Then
L(γ, φ; α, β) =
log Γ(Σ_{j=1}^{k} α_j) − Σ_{i=1}^{k} log Γ(α_i) + Σ_{i=1}^{k} (α_i − 1)( Ψ(γ_i) − Ψ(Σ_{j=1}^{k} γ_j) )
+ Σ_{n=1}^{N} Σ_{i=1}^{k} φ_ni ( Ψ(γ_i) − Ψ(Σ_{j=1}^{k} γ_j) )
+ Σ_{n=1}^{N} Σ_{i=1}^{k} Σ_{j=1}^{V} φ_ni w_n^j log β_ij
− log Γ(Σ_{j=1}^{k} γ_j) + Σ_{i=1}^{k} log Γ(γ_i) − Σ_{i=1}^{k} (γ_i − 1)( Ψ(γ_i) − Ψ(Σ_{j=1}^{k} γ_j) )
− Σ_{n=1}^{N} Σ_{i=1}^{k} φ_ni log φ_ni
Inference and parameter estimation (cont.)
We can get variational parameters by adding
Lagrange multipliers and setting this derivative
to zero:
φ_ni ∝ β_{i,w_n} exp( Ψ(γ_i) − Ψ(Σ_{j=1}^{k} γ_j) )

γ_i = α_i + Σ_{n=1}^{N} φ_ni
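A numpy/scipy sketch of the coordinate-ascent updates above for a single document (not from the slides); the document is assumed to be an array of word indices, and the initialization and iteration count are arbitrary choices.

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(doc, alpha, beta, n_iter=50):
    """Coordinate ascent on (gamma, phi) for one document.

    doc:   length-N array of word indices
    alpha: length-k Dirichlet parameter
    beta:  k x V topic-word probabilities
    """
    N, k = len(doc), len(alpha)
    phi = np.full((N, k), 1.0 / k)     # phi[n, i]: variational posterior over topics for word n
    gamma = alpha + N / k              # gamma[i]: variational Dirichlet parameter
    for _ in range(n_iter):
        # phi_ni  ~  beta_{i, w_n} * exp(digamma(gamma_i) - digamma(sum_j gamma_j))
        phi = beta[:, doc].T * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```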
Inference and parameter estimation (cont.)
Parameter estimation
Maximize the log likelihood of the data:
ℓ(α, β) = Σ_{d=1}^{M} log p(w_d | α, β)
Variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β.
Variational EM procedure:
1. (E-step) For each document, find the optimizing values of the variational parameters {γ_d, φ_d}
2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β
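A sketch of the outer variational EM loop (not from the slides), reusing the variational_e_step sketch above; for brevity the M-step shown re-estimates β only, and α is held fixed rather than updated by Newton-Raphson as in the full procedure.

```python
import numpy as np

def variational_em(docs, k, V, alpha0=0.1, n_em_iter=20):
    """docs: list of arrays of word indices. Returns (alpha, beta)."""
    rng = np.random.default_rng(0)
    alpha = np.full(k, alpha0)                 # held fixed in this sketch
    beta = rng.dirichlet(np.ones(V), size=k)   # k x V initial topic-word distributions
    for _ in range(n_em_iter):
        counts = np.zeros((k, V)) + 1e-12      # expected topic-word counts
        for doc in docs:
            _, phi = variational_e_step(doc, alpha, beta)   # E-step (sketch above)
            for n, w in enumerate(doc):
                counts[:, w] += phi[n]         # accumulate sum_d sum_n phi_dni * w_dn^j
        beta = counts / counts.sum(axis=1, keepdims=True)   # M-step: renormalize rows of beta
    return alpha, beta
```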
Inference and parameter estimation (cont.)
Smoothed LDA model: treat β itself as random, placing an exchangeable Dirichlet prior on each of its rows (plate diagram)
Discussion
LDA is a flexible generative probabilistic model for collections of discrete data.
Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used for inference and parameter estimation within the LDA framework.
LDA is a simple model and is readily extended to
continuous data or other non-multinomial data.
Relation to Text Classification
and Information Retrieval
LSI for IR
Compute cosine similarity for document and
query vectors in semantic space
Helps combat synonymy
Helps combat polysemy in documents, but not
necessarily in queries
(which were not part of the SVD computation)
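A sketch of the retrieval step described above (illustrative, not from the slides): the query is folded into the latent space using the training-time U and S factors, then compared to the documents by cosine similarity.

```python
import numpy as np

def lsi_retrieval(A, k, query_term_indices):
    """Rank documents by cosine similarity to a query in a k-dimensional LSI space.

    A: term-document count matrix (terms x documents)
    query_term_indices: vocabulary indices of the query's terms
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    docs_k = Vt_k.T                          # documents as points in the latent space
    q = np.zeros(A.shape[0])
    q[query_term_indices] = 1.0              # bag-of-words query vector
    q_k = (q @ U_k) / s_k                    # fold the query into the latent space
    sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims)                 # document indices, best match first
```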
pLSA/LDA for IR
Several options
Compute cosine similarity between topic vectors
for documents
Use language model-based IR techniques
potentially very helpful for synonymy and
polysemy
LDA/pLSA for Text Classification
Topic models are easy to incorporate into text
classification:
1. Train a topic model using a big corpus
2. Decode the topic model (find best topic/cluster
for each word) on a training set
3. Train classifier using the topic/cluster as a
feature
4. On a test document, first decode the topic
model, then make a prediction with the classifier
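A sketch of steps 1-4 (not from the slides), reusing the variational sketches above to infer topic proportions and scikit-learn's LogisticRegression as the classifier; this variant uses per-document topic proportions as features rather than per-word cluster counts, and the corpus, label, and vocabulary variables are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def topic_features(docs, alpha, beta):
    """Represent each document by its inferred topic proportions (normalized gamma)."""
    feats = []
    for doc in docs:
        gamma, _ = variational_e_step(doc, alpha, beta)   # E-step sketch from earlier
        feats.append(gamma / gamma.sum())
    return np.array(feats)

# 1. train the topic model on a big corpus (placeholder variables)
# alpha, beta = variational_em(big_corpus, k=50, V=vocab_size)
# 2. infer topic features for the labeled training documents
# X_train = topic_features(train_docs, alpha, beta)
# 3. train a classifier on the topic features
# clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# 4. at test time, infer topics for the new documents, then predict
# y_pred = clf.predict(topic_features(test_docs, alpha, beta))
```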
Why use a topic model for
classification?
Topic models help handle polysemy and
synonymy
The count for a topic in a document can be much
more informative than the count of individual words
belonging to that topic.
Topic models help combat data sparsity
You can control the number of topics
At a reasonable choice for this number, you'll observe the topics many times in training data
(unlike individual words, which may be very sparse)
LSA for Text Classification
Trickier to do
One option is to use the reduced-dimension
document vectors for training
At test time, what to do?
Can recalculate the SVD (expensive)
Another option is to combine the reduced-
dimension term vectors for a given document to
produce a vector for the document
This is repeatable at test time (at least for words
that were seen during training)
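A sketch of that second option (not from the slides): average the reduced-dimension term vectors of a document's words, which can be repeated at test time for any word seen during training.

```python
import numpy as np

def doc_vector(doc_term_indices, term_coords):
    """Average the reduced-dimension term vectors of the words in a document.

    term_coords: terms x k matrix from the training-time SVD
                 (e.g. U_k * s_k from the earlier LSI sketch)
    doc_term_indices: vocabulary indices of the document's words (unseen words dropped upstream)
    """
    return term_coords[doc_term_indices].mean(axis=0)
```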