Natural Language Processing (Xu Ly Ngon Ngu Tu Nhien) - Regina Barzilay - Lecture 7: Lexical Semantics
Regina Barzilay
MIT
July, 2005
CuuDuongThanCong.com https://fb.com/tailieudientucntt
Today: Semantic Similarity
Motivation
Computing Semantic Similarity
Lexicons and Semantic Nets
• Semantic Nets
– Links between terms (IS-A, Part-Of)
WordNet
Synset Example
1. water, H2O – (binary compound that occurs at room temperature as a clear colorless odorless tasteless liquid; freezes into ice below 0 degrees centigrade and boils above 100 degrees centigrade; widely used as a solvent)
2. body of water, water – (the part of the earth's surface covered with water (such as a river or lake or ocean); "they invaded our territorial waters"; "they were sitting by the water's edge")
3. water system, water supply, water – (facility that provides a source of water; "the town debated the purification of the water supply"; "first you have to cut off the water")
4. water – (once thought to be one of four elements composing the universe (Empedocles))
5. urine, piss, pee, piddle, weewee, water – (liquid excretory product; "there was blood in his urine"; "the child had to make water")
6. water – (a fluid necessary for the life of most animals and plants; "he asked for a drink of water")
WordNet Relations
• Original core relations:
– Synonymy
– Polysemy
– Metonymy
– Hyponymy/Hypernymy
– Meronymy
– Antonymy
• New, useful additions for NLP:
– Glosses
– Links between derivationally and semantically related
noun/verb pairs
– Domain/topical terms
– Groups of similar verbs
Synonymy
Polysemy
Polysemy Information
Metonymy
Hyponymy/Hypernymy (IS-A)
A is a hypernym of B if B is a type of A
A is a hyponym of B if A is a type of B
Example:
• bigamy (having two spouses at the same time)
• open marriage (a marriage in which each partner is free to enter
into extraneous sexual relationships without guilt or jealousy
from the other)
• cuckoldom (the state of a husband whose wife has committed
adultery)
• polygamy (having more than one spouse at a time)
– polyandry (having more than one husband at a time)
– polygyny (having more than one wife at a time)
Meronymy
• Part-of relation
– part-of (beak, bird)
– part-of (bark, tree)
Antonymy
• Lexical opposites
– antonym (large, small)
– antonym (big, small)
– antonym (big, little)
Computing Semantic Similarity
Suppose you are given the following words. Your task is
to group them according to how similar they are:
apple
banana
grapefruit
grape
man
woman
baby
infant
Using WordNet to Determine Similarity
Hypernym chains (each arrow is one IS-A link):
apple → fruit → produce → ...
banana → fruit → produce → ...
man → male, male person → person, individual → organism → ...
woman → female, female person → person, individual → organism
Similarity by Path Length
Why use WordNet?
• Quality
– Developed and maintained by researchers
• Habit
– Many applications are currently using WordNet
• Available software
– SenseRelate (Pedersen et al.):
http://wn-similarity.sourceforge.com
Similarity by Path Length
Hypernym chains (each arrow is one IS-A link):
baby → child, kid → offspring, progeny → relative, relation → person, individual → organism → ...
man → male, male person → person, individual → organism → ...
woman → female, female person → person, individual → organism
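The path-length idea can be sketched on a toy graph. The dictionary below is copied by hand from the chains on this slide and is not WordNet itself; the function names and the 1/(1 + length) conversion from path length to similarity are assumptions made for illustration.

```python
# Toy IS-A graph copied from the slide; each word maps to its direct hypernym.
HYPERNYM = {
    "baby": "child",
    "child": "offspring",
    "offspring": "relative",
    "relative": "person",
    "man": "male person",
    "male person": "person",
    "woman": "female person",
    "female person": "person",
    "person": "organism",
}

def ancestors(word):
    """Map the word and each of its hypernyms to its distance from the word."""
    dist, node, d = {}, word, 0
    while node is not None:
        dist[node] = d
        node = HYPERNYM.get(node)
        d += 1
    return dist

def path_length(a, b):
    """Number of IS-A links on the shortest path from a to b via a shared ancestor."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    if not common:
        return None  # no shared ancestor in this toy graph
    return min(da[n] + db[n] for n in common)

def path_similarity(a, b):
    """One common choice (an assumption here): shorter path means higher similarity."""
    length = path_length(a, b)
    return None if length is None else 1.0 / (1.0 + length)

print(path_length("man", "woman"))  # 4
print(path_length("man", "baby"))   # 6
```

On this toy graph "man" ends up closer to "woman" than to "baby" because "baby" reaches "person" only through the longer offspring/relative chain.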
Why not use WordNet?
• Incomplete (technical terms may be absent)
Learning Similarity from Corpora
CAT
DOG
Key Parameters
Example 1: Clustering by Next Word
Vector Space Model
[Figure: man, woman, grape, orange, and apple plotted as points in a vector space]
Similarity Measure: Euclidean
Euclidean: $|\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
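A minimal sketch of the Euclidean measure, the square root of the summed squared coordinate differences. The function name is illustrative, and the vectors in the usage line are made-up numbers, not corpus counts.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Made-up 2-dimensional vectors, for illustration only.
print(euclidean([0, 0], [3, 4]))  # 5.0
```

Note that this is a distance: smaller values mean more similar words.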
Similarity Measure: Cosine
Computing Similarity: Cosine
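A minimal sketch of cosine similarity using its standard definition, the dot product divided by the product of the vector lengths. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat an empty context vector as unrelated
    return dot / (norm_x * norm_y)
```

Because the lengths are divided out, a vector and any positive multiple of it have cosine 1, so frequent and rare words with the same context profile are treated as identical.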
Term Weighting
$\text{tf} \times \text{idf} = \begin{cases} (1 + \log(tf_{i,j})) \log \frac{N}{df_i} & \text{if } tf_{i,j} \geq 1 \\ 0 & \text{otherwise} \end{cases}$
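A minimal sketch of this weighting: (1 + log tf) times log(N / df) when the term occurs at least once, and 0 otherwise. The function and parameter names are illustrative, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of a term with frequency tf in a document, document frequency df, and n_docs documents."""
    if tf < 1:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)
```

A term that appears in every document (df equal to N) gets weight 0 regardless of how often it occurs, which is the intended damping of uninformative context words.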
Cosine vs. Euclidean
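One standard way to relate the two measures, assuming both vectors have been normalized to unit length (an assumption, not something recoverable from this slide), is to expand the squared Euclidean distance:

```latex
\begin{aligned}
|\vec{x} - \vec{y}|^2 &= \sum_{i=1}^{n} (x_i - y_i)^2 \\
  &= \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} x_i y_i \\
  &= 2 \left( 1 - \cos(\vec{x}, \vec{y}) \right)
  \quad \text{when } |\vec{x}| = |\vec{y}| = 1
\end{aligned}
```

Under that assumption, ranking neighbors by increasing Euclidean distance and by decreasing cosine gives the same order; the two measures differ only when vector lengths vary.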
Mutual Information
Example
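$I(\text{pancake}; \text{syrup}) = \log \frac{P(\text{pancake}, \text{syrup})}{P(\text{pancake}) P(\text{syrup})}$
$I(W_i = \text{pancake}; W_{i+1} = \text{syrup}) = \log \frac{P(W_i = \text{pancake}, W_{i+1} = \text{syrup})}{P(W_i = \text{pancake}) P(W_{i+1} = \text{syrup})}$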
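A minimal sketch of the positional version, estimating each probability by relative frequency over adjacent word pairs. The function name, the base-2 logarithm, the toy sentence, and returning negative infinity for an unseen pair are all assumptions made for illustration.

```python
import math
from collections import Counter

def bigram_mutual_information(tokens, w1, w2):
    """log2 of P(W_i=w1, W_i+1=w2) / (P(W_i=w1) * P(W_i+1=w2)), estimated from bigram counts."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    joint = Counter(bigrams)[(w1, w2)]
    first = sum(1 for a, _ in bigrams if a == w1)    # w1 in first position
    second = sum(1 for _, b in bigrams if b == w2)   # w2 in second position
    if joint == 0:
        return float("-inf")  # pair never observed
    return math.log2((joint / n) / ((first / n) * (second / n)))

# Toy corpus, invented for this example.
tokens = "pancake syrup and pancake syrup or waffle butter".split()
score = bigram_mutual_information(tokens, "pancake", "syrup")
```

A positive score means the pair occurs more often than its parts would predict if they were independent.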
Example (cont.)
$I(\text{pancake}; \text{syrup}) = \log \frac{P(\text{pancake}, \text{syrup})}{P(\text{pancake}) P(\text{syrup})}$
Similarity for LM
$H(L) = -\frac{1}{N} \log P(w_1, \ldots, w_N)$
$\approx -\frac{1}{N-1} \sum_{i=2}^{N} \log P(w_i \mid w_{i-1})$
$\approx -\frac{1}{N-1} \sum_{w^1 w^2} \mathrm{Count}(w^1 w^2) \log P(w^2 \mid w^1)$
Similarity for LM
Cluster-based generalization:
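$H(L, \pi) \approx -\frac{1}{N-1} \sum_{w^1 w^2} \mathrm{Count}(w^1 w^2) \log P(c_2 \mid c_1) P(w^2 \mid c_2)$
$\approx H(w) - I(c_1, c_2)$
where $\pi$ assigns each word to a cluster, and $c_1$, $c_2$ are the clusters of $w^1$, $w^2$.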
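Since H(w) does not depend on the clustering, a better clustering under this objective is one with higher I(c1, c2), the average mutual information between the clusters of adjacent words. A minimal sketch of that quantity follows; the function name, the base-2 logarithm, and the toy word-to-cluster mapping in the test data are assumptions made for illustration.

```python
import math
from collections import Counter

def average_mutual_information(tokens, word_class):
    """Sum over adjacent class pairs of P(c1, c2) * log2(P(c1, c2) / (P(c1) * P(c2)))."""
    pairs = [(word_class[a], word_class[b]) for a, b in zip(tokens, tokens[1:])]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(c1 for c1, _ in pairs)    # class in first position
    right = Counter(c2 for _, c2 in pairs)   # class in second position
    return sum(
        (count / n) * math.log2((count * n) / (left[c1] * right[c2]))
        for (c1, c2), count in joint.items()
    )
```

Putting every word in a single cluster gives 0, because the cluster of the previous word then carries no information about the cluster of the next one.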
Average Mutual Information
Example: Syntax-Based Representation
Kullback-Leibler Distance (Relative Entropy)
• Definition: The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p
• $D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
• Properties:
– Non-negative
– D(p||q) = 0 iff p = q
– Not symmetric and doesn't satisfy triangle inequality
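A minimal sketch of D(p||q) for discrete distributions stored as dictionaries. The function name, the base-2 logarithm, and the two example distributions are assumptions made for illustration; returning infinity when q gives zero probability to an outcome that p allows follows the usual convention.

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum over x of p(x) * log2(p(x) / q(x))."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")
        total += px * math.log2(px / qx)
    return total

# Invented distributions over two outcomes.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
forward, backward = kl_divergence(p, q), kl_divergence(q, p)
```

Here `forward` and `backward` differ, which illustrates the asymmetry listed among the properties.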
Representation
• Representation
– Syntactic vs. Window-based
– Context granularity
– Alphabet size
– Counts vs. Probability
• Distance
– Vector-based vs. Probabilistic
– Weighted vs. Unweighted
Problems with Corpus-based Similarity
Not-so-semantic grouping
Method          Clinker
Direct Object   pollution, increase, failure
Next Word       addiction, medalist, inhalation, Arabia, growers
Adjective       full, increase
State-of-the-art Methods
http://www.cs.ualberta.ca/~lindek/demos/depsim.htm
Beyond Pairwise Similarity
• Clustering is "the art of finding groups in
data" (Kaufman and Rousseeuw)
• Clustering algorithms divide a data set into
homogeneous groups (clusters), based on their
similarity under the given representation.
Hierarchical Clustering
Agglomerative Clustering
     E     D     C     B
A    0.1   0.2   0.2   0.8
C    0.0   0.7
D    0.6
[Figure: items A, B, C, D, E]
Clustering Function
     E     D     C     B
A    0.1   0.2   0.2   0.8
D    0.6
[Figure: items A, B, C, D, E]
Single-Link Clustering
• Complexity $O(n^2)$
Complete-Link Clustering
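A minimal sketch contrasting the two linkage rules, using their standard definitions: single-link scores two clusters by their most similar pair of members, complete-link by their least similar pair. The function name, the `linkage` parameter, and the 1-D example points are assumptions made for illustration, and this naive loop is written for clarity rather than for the best known running time.

```python
from itertools import combinations

def agglomerative(items, sim, k, linkage="single"):
    """Repeatedly merge the two most similar clusters until k clusters remain."""
    # Single-link: best pair between clusters; complete-link: worst pair.
    pick = max if linkage == "single" else min
    clusters = [[x] for x in items]
    while len(clusters) > k:
        def cluster_sim(pair):
            a, b = pair
            return pick(sim(x, y) for x in clusters[a] for y in clusters[b])
        a, b = max(combinations(range(len(clusters)), 2), key=cluster_sim)
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Invented 1-D points; similarity is negative distance.
points = [0, 1, 2, 3.5, 5.2]
negative_distance = lambda x, y: -abs(x - y)
print(agglomerative(points, negative_distance, 2, "single"))    # [[0, 1, 2, 3.5], [5.2]]
print(agglomerative(points, negative_distance, 2, "complete"))  # [[0, 1], [2, 3.5, 5.2]]
```

On these points single-link chains the evenly spaced items into one long cluster, while complete-link prefers two tighter groups.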
K-Means Algorithm: Example
K-Means Algorithm
K-Means Algorithm: Hard EM
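A minimal sketch of the standard K-means loop, read as hard EM: each point is assigned entirely to its nearest centroid (a hard version of the E-step), then each centroid is recomputed as the mean of its points (the M-step). The function name, the fixed iteration cap, the caller-supplied initial centroids, and the example points are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Alternate hard assignment and centroid update; points and centroids are tuples."""
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point goes to exactly one cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        updated = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else centroids[j]
            for j, pts in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters

# Invented 2-D points with two obvious groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = kmeans(points, [(0, 0), (10, 0)])
print(centroids)  # [(0.0, 1.0), (10.0, 1.0)]
```

The result depends on the initial centroids; a different starting pair can converge to a different partition.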
Evaluating Clustering Methods
Comparing Clustering Methods
(Meila, 2002)
n     total # of points
n_k   # of points in cluster C_k
K     # of nonempty clusters
N11   # of pairs that are in the same cluster under both C and C'
N00   # of pairs that are in different clusters under both C and C'
N10   # of pairs that are in the same cluster under C but not under C'
N01   # of pairs that are in the same cluster under C' but not under C
Comparing by Counting Pairs
• Wallace criteria
$W_1(C, C') = \frac{N_{11}}{\sum_k n_k (n_k - 1)/2}$
$W_2(C, C') = \frac{N_{11}}{\sum_{k'} n'_{k'} (n'_{k'} - 1)/2}$
• Fowlkes-Mallows criterion
$F(C, C') = \sqrt{W_1(C, C') W_2(C, C')}$
Problems: ?
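A minimal sketch of the pair-counting criteria, with each clustering given as a list of cluster labels, one per point. The function names and the example labelings are assumptions made for illustration. The sketch uses the fact that the sum of n_k(n_k - 1)/2 over clusters of C is the number of pairs placed together by C, which equals N11 + N10 (and N11 + N01 for C').

```python
import math
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Return (N11, N00, N10, N01) over all unordered pairs of points."""
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n00, n10, n01

def wallace_and_fowlkes_mallows(labels_a, labels_b):
    """Return (W1, W2, F); assumes each clustering has at least one non-singleton cluster."""
    n11, _, n10, n01 = pair_counts(labels_a, labels_b)
    w1 = n11 / (n11 + n10)  # share of pairs joined in C that are also joined in C'
    w2 = n11 / (n11 + n01)  # share of pairs joined in C' that are also joined in C
    return w1, w2, math.sqrt(w1 * w2)

# Invented labelings of five points.
print(wallace_and_fowlkes_mallows([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # (0.5, 0.5, 0.5)
```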
Comparing Clustering by Set Matching
$L(C, C') = \frac{1}{K} \sum_k \max_{k'} \frac{2 m_{kk'}}{n_k + n'_{k'}}$
Problems: ?
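A minimal sketch of this set-matching score, with each clustering given as a list of cluster labels, one per point. It assumes that m_kk' denotes the number of points shared by cluster k of C and cluster k' of C', and that n'_k' is the size of cluster k' in C'; the function name and the example labelings are also illustrative.

```python
from collections import Counter

def set_matching(labels_a, labels_b):
    """Average, over clusters of C, of the best 2*m / (n_k + n'_k') match in C'."""
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    overlap = Counter(zip(labels_a, labels_b))  # m_kk': points shared by a cluster pair
    total = 0.0
    for k, n_k in size_a.items():
        total += max(2 * overlap[(k, k2)] / (n_k + n_k2) for k2, n_k2 in size_b.items())
    return total / len(size_a)

# Invented labelings of five points.
score = set_matching([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

Each cluster of C is scored only by its single best match in C', so the remaining, unmatched part of the cluster has no effect on the result.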
Comparing Clustering by Set Matching
$L(C, C') = \frac{1}{K} \sum_k \max_{k'} \frac{2 m_{kk'}}{n_k + n'_{k'}}$
[Figure: three clusterings C, C', and C'', each with clusters C1, C2, C3]
Summary