Lecture 7: Text Classification and Naive Bayes
Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Simone.Teufel@cl.cam.ac.uk
Lent 2014
IR System Components
Today: evaluation
[Figure: IR system components diagram, repeated across several slides]
Overview
Hierarchical clustering
HAC: Basic algorithm
A dendrogram
Term-document matrix to document-document matrix
Log frequency weighting and cosine normalisation give unit-length document vectors (the three-novel example: SaS = Sense and Sensibility, PaP = Pride and Prejudice, WH = Wuthering Heights):

            SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0.000   0.405
wuthering   0.000   0.000   0.588

Each entry P(d_i, d_j) of the document-document matrix is the (cosine) similarity of two documents; the matrix is symmetric, so one triangle suffices:

      SaS           PaP
SaS   P(SaS,SaS)
PaP   P(SaS,PaP)   P(PaP,PaP)
WH    P(SaS,WH)    P(PaP,WH)
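Since the columns of the cosine-normalised term-document matrix are unit vectors, the document-document matrix is just the product of the matrix's transpose with itself. A minimal sketch in Python, assuming the values in the table above:

import numpy as np

# Rows = terms, columns = SaS, PaP, WH (values from the slide)
td = np.array([[0.789, 0.832, 0.524],
               [0.515, 0.555, 0.465],
               [0.335, 0.000, 0.405],
               [0.000, 0.000, 0.588]])

# Columns are already unit length, so cosine similarity = dot product
sim = td.T @ td          # 3x3 document-document similarity matrix
print(np.round(sim, 2))  # e.g. sim(SaS, PaP) ≈ 0.94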
for i := 1 to n do
    ci := {xi}
C := {c1, ..., cn}
j := n + 1
while |C| > 1 do
    (cn1, cn2) := arg max_{(cu,cv) ∈ C×C, cu ≠ cv} sim(cu, cv)
    cj := cn1 ∪ cn2
    C := (C \ {cn1, cn2}) ∪ {cj}
    j := j + 1
end
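Below, a runnable sketch of this algorithm in Python; the function name and the cluster representation (frozensets of points) are illustrative choices, not from the slides. It performs the naive scan over all cluster pairs at each step.

import itertools

def hac(points, sim):
    """Naive hierarchical agglomerative clustering: repeatedly merge
    the two most similar clusters until one cluster remains."""
    clusters = [frozenset([p]) for p in points]      # ci := {xi}
    merges = []
    while len(clusters) > 1:                         # while |C| > 1
        c1, c2 = max(itertools.combinations(clusters, 2),
                     key=lambda pair: sim(*pair))    # best pair by sim
        merged = c1 | c2                             # cj := cn1 ∪ cn2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        merges.append((c1, c2, merged))
    return merges

# Example: single-link similarity = negative smallest pairwise distance
sim = lambda cu, cv: -min(abs(x - y) for x in cu for y in cv)
print(hac([0.0, 1.0, 2.5, 3.5], sim))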
Computational complexity of the basic algorithm
Each of the O(n) merge iterations scans all O(n²) cluster pairs for the best merge, so the basic algorithm is O(n³) overall.
Hierarchical clustering: similarity functions
The clustering obtained depends on how the similarity of two clusters is defined: single link, complete link and group average link are used in what follows.
Example: hierarchical clustering; similarity functions

[Figure: eight points, a b c d on the top row and e f g h on the bottom row; horizontal gaps 1, 1.5, 1; the rows are 2 apart]

Pairwise Euclidean distances:

      a       b       c       d       e     f     g
b     1
c     2.5     1.5
d     3.5     2.5     1
e     2       √5      √10.25  √16.25
f     √5      2       √6.25   √10.25  1
g     √10.25  √6.25   2       √5      2.5   1.5
h     √16.25  √10.25  √5      2       3.5   2.5   1
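The distances in this table can be reproduced from the point layout; a short Python sketch (the coordinates are inferred from the figure and the distances, so they are an assumption):

import numpy as np

# Assumed coordinates: a..d on the top row, e..h on the bottom row
pts = {'a': (0, 2), 'b': (1, 2), 'c': (2.5, 2), 'd': (3.5, 2),
       'e': (0, 0), 'f': (1, 0), 'g': (2.5, 0), 'h': (3.5, 0)}

for i, p in enumerate(sorted(pts)):
    for q in sorted(pts)[:i]:
        d = np.hypot(pts[p][0] - pts[q][0], pts[p][1] - pts[q][1])
        print(f"d({p},{q}) = {d:.3f}")   # e.g. d(c,e) = √10.25 ≈ 3.202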
Single Link is O(n²)

Pairwise distances (as above):

      a       b       c       d       e     f     g
b     1
c     2.5     1.5
d     3.5     2.5     1
e     2       √5      √10.25  √16.25
f     √5      2       √6.25   √10.25  1
g     √10.25  √6.25   2       √5      2.5   1.5
h     √16.25  √10.25  √5      2       3.5   2.5   1

The smallest distance is 1; merge a and b. Under single link, the distance from the merged cluster to any other cluster is the minimum of the two old distances, so the a and b rows/columns collapse into one:

      a+b     c       d       e     f     g
c     1.5
d     2.5     1
e     2       √10.25  √16.25
f     2       √6.25   √10.25  1
g     √6.25   2       √5      2.5   1.5
h     √10.25  √5      2       3.5   2.5   1

Each merge only takes the minimum of two existing entries, which is why single link clustering can be implemented in O(n²) time overall.
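The min-update itself is one line per remaining cluster; a sketch with an illustrative representation (distances keyed by unordered pairs of cluster labels):

def merge_single_link(dist, clusters, u, v):
    """After merging u and v, the single-link distance from the merged
    cluster to every other cluster w is min(d(u,w), d(v,w))."""
    new = u + v                              # e.g. 'a', 'b' -> 'ab'
    clusters.remove(u)
    clusters.remove(v)
    for w in clusters:
        dist[frozenset((new, w))] = min(dist.pop(frozenset((u, w))),
                                        dist.pop(frozenset((v, w))))
    del dist[frozenset((u, v))]
    clusters.append(new)

For complete link, the same update uses max instead of min.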
Single Link
Under single link, the similarity of two clusters is the similarity of their most similar members.
Clustering Result under Single Link
[Figure: single-link clustering of the eight points; the two clusters found are the rows {a, b, c, d} and {e, f, g, h}]
Complete Link
Under complete link, the similarity of two clusters is the similarity of their most dissimilar members, so the merged cluster's distance to any other cluster is the maximum of the two old distances. The distance matrix is the same as above; after merging a and b, for example, d(a+b, c) = max(2.5, 1.5) = 2.5.
Clustering result under complete link
[Figure: complete-link clustering of the eight points; the two clusters found are the columns {a, b, e, f} and {c, d, g, h}]
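Both results can be checked with SciPy's hierarchical clustering, using the coordinates assumed earlier:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

pts = np.array([[0, 2], [1, 2], [2.5, 2], [3.5, 2],     # a b c d
                [0, 0], [1, 0], [2.5, 0], [3.5, 0]])    # e f g h

for method in ('single', 'complete'):
    Z = linkage(pdist(pts), method=method)
    # Cut the dendrogram into two flat clusters
    print(method, fcluster(Z, t=2, criterion='maxclust'))

Cutting each dendrogram into two flat clusters should recover the rows for single link and the columns for complete link.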
Example: gene expression data
Rat CNS gene expression data (excerpt)
Rat CNS gene clustering - single link
[Dendrogram: single-link clustering of the rat CNS gene expression data; leaves are individual genes, horizontal axis 0-100]
Rat CNS gene clustering - complete link
[Dendrogram: complete-link clustering of the rat CNS gene expression data; leaves are individual genes, horizontal axis 0-100]
Rat CNS gene clustering - group average link
[Dendrogram: group-average-link clustering of the rat CNS gene expression data; leaves are individual genes, horizontal axis 0-100]
Flat or hierarchical clustering?
Overview
A text classification task: Email spam filtering
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
Topic classification
γ(d′) = China
Statistical/probabilistic classification methods
Examples of how search engines use classification
Formal definition of TC: Training
Given:
A document space X. Documents are represented in this space, typically some type of high-dimensional space.
A fixed set of classes C = {c1, c2, ..., cJ}. The classes are human-defined for the needs of an application (e.g., spam vs. non-spam).
A training set D of labelled documents, where each labelled document ⟨d, c⟩ ∈ X × C.
Using a learning method, we learn a classifier γ that maps documents to classes:
γ : X → C
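As a type-level sketch of this setup in Python (all names here are illustrative, not from the slides): a learning method maps the training set D to the classifier γ.

from typing import Callable, Hashable, Iterable, Tuple

Doc = str                           # a point in the document space X
Cls = Hashable                      # a class label from C
Classifier = Callable[[Doc], Cls]   # gamma : X -> C

def learn(D: Iterable[Tuple[Doc, Cls]]) -> Classifier:
    """Learning method: maps labelled documents <d, c> to a classifier."""
    ...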
Formal definition of TC: Application/Testing
Given: a document d ∈ X. Determine: γ(d) ∈ C, that is, the class most appropriate for d.
The Naive Bayes classifier
The Naive Bayes classifier is a probabilistic classifier: the probability of a document d being in class c is computed as
P(c|d) ∝ P(c) ∏_{1≤k≤n_d} P(t_k|c)
where n_d is the length of the document and P(t_k|c) is the conditional probability of term t_k occurring in a document of class c.
Maximum a posteriori class
Our goal is the best class for the document, the maximum a posteriori (MAP) class:
cmap = arg max_{c∈C} P(c|d) = arg max_{c∈C} P(c) ∏_{1≤k≤n_d} P(t_k|c)
Taking the log
Multiplying many small probabilities can cause floating point underflow; since log is monotonic, maximising the log of the scores selects the same class:
cmap = arg max_{c∈C} [log P(c) + ∑_{1≤k≤n_d} log P(t_k|c)]
Naive Bayes classifier
Classification rule:
cmap = arg max_{c∈C} [log P(c) + ∑_{1≤k≤n_d} log P(t_k|c)]
Simple interpretation:
Each conditional parameter log P(t_k|c) is a weight that indicates how good an indicator t_k is for c.
The prior log P(c) is a weight that indicates the relative frequency of c.
The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.
We select the class with the most evidence.
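A minimal sketch of this rule in Python (the names are illustrative; parameter estimation follows on the next slides):

import math

def nb_classify(doc_tokens, prior, condprob):
    """Return arg max over classes of log P(c) + sum_k log P(tk|c).
    Terms missing from the model's vocabulary are simply skipped."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(condprob[c][t])
                                        for t in doc_tokens
                                        if t in condprob[c])
    return max(prior, key=score)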
Parameter estimation take 1: Maximum likelihood
Prior:
P(c) = N_c / N
N_c: number of docs in class c; N: total number of docs
Conditional probabilities:
P(t|c) = T_ct / ∑_{t′∈V} T_ct′
T_ct is the number of tokens of t in training documents from class c (including multiple occurrences)
We have made a Naive Bayes independence assumption here (positional independence): P(t_k1|c) = P(t_k2|c), independent of the positions k1, k2.
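A sketch of these estimates in Python (illustrative names; docs is an iterable of (tokens, class) pairs):

from collections import Counter, defaultdict

def train_nb_ml(docs):
    """Maximum-likelihood estimates: P(c) = Nc/N and
    P(t|c) = Tct / sum over t' of Tct'."""
    n_docs = Counter()               # Nc per class
    tct = defaultdict(Counter)       # Tct per class and term
    for tokens, c in docs:
        n_docs[c] += 1
        tct[c].update(tokens)
    N = sum(n_docs.values())
    prior = {c: n_docs[c] / N for c in n_docs}
    condprob = {}
    for c, counts in tct.items():
        total = sum(counts.values())
        condprob[c] = {t: counts[t] / total for t in counts}
    return prior, condprob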
The problem with maximum likelihood estimates: Zeros
Suppose the term WTO never occurs in the training documents of class c = China. The maximum likelihood estimate is then:
P(WTO|China) = T_China,WTO / ∑_{t′∈V} T_China,t′ = 0 / ∑_{t′∈V} T_China,t′ = 0
The problem with maximum likelihood estimates: Zeros
A single zero is fatal: since the conditional probabilities are multiplied (their logs summed), any document containing WTO gets score 0 for class China (−∞ in log space), regardless of all other evidence.
To avoid zeros: Add-one smoothing
Before:
P(t|c) = T_ct / ∑_{t′∈V} T_ct′
Now: Add one to each count to avoid zeros:
P(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B)
where B = |V| is the number of terms in the vocabulary.
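The corresponding change to the estimator, as a sketch (tokens_c is the Counter T_ct for one class; B = |vocab|):

from collections import Counter

def condprob_add_one(tokens_c: Counter, vocab):
    """P(t|c) = (Tct + 1) / ((sum over t' of Tct') + B)."""
    total = sum(tokens_c[t] for t in vocab)
    B = len(vocab)
    return {t: (tokens_c[t] + 1) / (total + B) for t in vocab}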
Example
|text_c| = 8 (number of tokens in the class-c training documents)
|text_c̄| = 3 (the complement class c̄)
B = 6 (number of distinct terms in the vocabulary)
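These counts (8 class tokens, 3 complement tokens, 6 vocabulary terms) match the worked example in Manning et al., Introduction to Information Retrieval, Chapter 13. Assuming that training data (an assumption, since the documents themselves did not survive on this slide), the estimates and the classification can be reproduced:

from collections import Counter

# Assumed training and test data from the IIR Ch. 13 example
train = [("Chinese Beijing Chinese".split(), "China"),
         ("Chinese Chinese Shanghai".split(), "China"),
         ("Chinese Macao".split(), "China"),
         ("Tokyo Japan Chinese".split(), "not-China")]
test = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {t for tokens, _ in train for t in tokens}       # B = 6
for c in ("China", "not-China"):
    tct = Counter(t for tokens, cls in train if cls == c for t in tokens)
    total = sum(tct.values())                            # 8 and 3
    prior = sum(cls == c for _, cls in train) / len(train)
    score = prior
    for t in test:
        score *= (tct[t] + 1) / (total + len(vocab))     # add-one smoothing
    print(c, round(score, 5))

With these numbers P(Chinese|China) = (5+1)/(8+6) = 3/7, and the test document scores about 0.0003 for China versus about 0.0001 for not-China, so it is classified as China.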
Example: Parameter estimates
Conditional probabilities:
Example: Classification
Time complexity of Naive Bayes
Derivation of NB formula
Evaluation of text classification
Summary: clustering and classification
Reading