L13: Naive Bayes Classifier
Prasad
Standing queries (Sec. 13.0)
Text classification

Today: Naive Bayes text classification
  Multinomial model
  Bernoulli model
  Feature selection
Categorization/Classification (Sec. 13.1)

Given: a description of an instance d from some instance space X, and a fixed set of classes C = {c1, c2, ..., cJ}

Determine: the category of d, a class in C
Document Classification (Sec. 13.1)

Test data: "planning language proof intelligence"

Classes (grouped by area):
  (AI)          ML, Planning
  (Programming) Semantics, Garb.Coll.
  (HCI)         Multimedia, GUI

Training data (sample terms per class):
  ML:         learning, intelligence, algorithm, reinforcement, network, ...
  Planning:   planning, temporal, reasoning, plan, language, ...
  Semantics:  programming, semantics, language, proof, ...
  Garb.Coll.: garbage, collection, memory, optimization, region, ...
  Multimedia: ...
  GUI:        ...
Manual classification (Sec. 13.0)

Note:
  maintenance issues (author, etc.)
  hand-weighting of terms
Posterior odds (Sec. 9.1.2)

Total probability: p(b) = Σ_{x ∈ {a, ¬a}} p(b | x) p(x)

Odds: O(a) = p(a) / p(¬a) = p(a) / (1 − p(a))
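The probability-to-odds mapping above can be sketched numerically; the helper names below are mine, not from the slides.

```python
# Minimal sketch: converting between a probability p(a) and the
# odds O(a) = p(a) / (1 - p(a)), and back.

def odds(p: float) -> float:
    """Odds of an event with probability p (requires p < 1)."""
    return p / (1.0 - p)

def prob_from_odds(o: float) -> float:
    """Invert the mapping: p = O / (1 + O)."""
    return o / (1.0 + o)

even = odds(0.5)            # an event with probability 1/2 has odds 1
back = prob_from_odds(even)
```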
Bayesian Methods (Sec. 13.2)
Bayes' Rule

  P(C, D) = P(C | D) P(D) = P(D | C) P(C)

  P(C | D) = P(D | C) P(C) / P(D)
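Bayes' rule can be checked with a tiny numeric sketch; the two class names and all probabilities below are invented for illustration.

```python
# Bayes' rule for two classes:
#   P(C|D) = P(D|C) P(C) / P(D), with P(D) = sum_c P(D|c) P(c).

priors = {"spam": 0.3, "ham": 0.7}        # P(C), invented
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(D | C) for one observed D, invented

p_d = sum(likelihoods[c] * priors[c] for c in priors)              # P(D), total probability
posterior = {c: likelihoods[c] * priors[c] / p_d for c in priors}  # P(C | D)
```

Dividing by P(D) is what makes the posteriors sum to one over the classes.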
  cMAP = argmax_{cj ∈ C} P(cj | x1, x2, ..., xn)

       = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj) / P(x1, x2, ..., xn)

       = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)
P(cj) can be estimated from the frequency of classes in the training examples.

P(x1, x2, ..., xn | cj) has O(|X|^n · |C|) parameters, which could only be
estimated if a very, very large number of training examples was available.
Conditional independence example: flu

  X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache

  P(X1, ..., X5 | C) = P(X1 | C) P(X2 | C) ... P(X5 | C)

This model is appropriate for binary variables
Learning the model (Sec. 13.3)

(Figure: naive Bayes net, class C with children X1, ..., X6)

First attempt: maximum likelihood estimates, i.e., the frequencies in the data:

  P(cj) = N(C = cj) / N

  P(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
13.3
X1
runnynose
X2
sinus
X3
cough
X4
fever
X5
muscle-ache
P( X 1 , , X 5 | C ) P( X 1 | C ) P( X 2 | C ) P( X 5 | C )
N ( X 5 t , C nf )
P( X 5 t | C nf )
0
N (C nf )
Prasad
21
13.3
P ( xi | c j )
N ( X i xi , C c j ) 1
N (C c j ) k
# of values of Xi
Prasad
L13NaiveBayesClassify
22
A more general version (m-estimate):

  P(xi,k | cj) = (N(Xi = xi,k, C = cj) + m·pi,k) / (N(C = cj) + m)

where pi,k is a prior estimate (e.g., the overall fraction in the data where
Xi = xi,k) and m controls the extent of smoothing
Stochastic language models (Sec. 13.2.1)

Model M:
  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02

s = "the man likes the woman"
     0.2   0.01   0.02   0.2   0.01     (multiply)

P(s | M) = 0.00000008
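The multiply-the-unigrams computation on this slide can be reproduced in a short sketch:

```python
# Under a unigram model M, the probability of a sentence is the
# product of the per-word probabilities (numbers from the slide).

model_m = {"the": 0.2, "a": 0.1, "man": 0.01,
           "woman": 0.01, "said": 0.03, "likes": 0.02}

def sentence_prob(words, model):
    p = 1.0
    for w in words:
        p *= model[w]
    return p

s = "the man likes the woman".split()
p_s = sentence_prob(s, model_m)   # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-8
```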
Two models compared

            Model M1   Model M2
  the       0.2        0.2
  class     0.01       0.0001
  sayst     0.0001     0.03
  pleaseth  0.0001     0.02
  yon       0.0001     0.1
  maiden    0.0005     0.01
  woman     0.01       0.0001

s = "the class pleaseth yon maiden"

  P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
  P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01

P(s | M2) > P(s | M1)

Easy. Effective!
Other language models (Sec. 13.2.1)

Condition each word on its predecessors:

  P(s) = P(w1) P(w2 | w1) P(w3 | ...) ...
Naive Bayes learning algorithm

For each cj in C do
  docsj <- subset of documents for which the target class is cj
  P(cj) <- |docsj| / |total # documents|
  Textj <- single document formed by concatenating all documents in docsj
  n <- total number of word positions in Textj
  for each word xk in Vocabulary
    nk <- number of occurrences of xk in Textj
    P(xk | cj) <- (nk + 1) / (n + |Vocabulary|)
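The training loop above can be sketched in Python (multinomial model with add-one smoothing); the four-document corpus and the class labels "c"/"j" are invented for illustration.

```python
# Multinomial Naive Bayes training: class priors from document frequencies,
# conditionals from word counts in the concatenated class text, add-one smoothed.

from collections import Counter

docs = [  # (word list, class label) pairs; invented toy corpus
    (["chinese", "beijing", "chinese"], "c"),
    (["chinese", "chinese", "shanghai"], "c"),
    (["chinese", "macao"], "c"),
    (["tokyo", "japan", "chinese"], "j"),
]

vocab = sorted({w for words, _ in docs for w in words})
classes = sorted({label for _, label in docs})

prior = {}   # P(c_j)
cond = {}    # P(x_k | c_j)
for c in classes:
    class_docs = [words for words, label in docs if label == c]
    prior[c] = len(class_docs) / len(docs)
    text_j = [w for words in class_docs for w in words]   # concatenated Text_j
    n = len(text_j)                                       # word positions in Text_j
    counts = Counter(text_j)
    for w in vocab:
        cond[(w, c)] = (counts[w] + 1) / (n + len(vocab))
```

Note that for each class the smoothed conditionals sum to one over the vocabulary, as they should.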
Naive Bayes classification

positions <- all word positions in the current document

  cNB = argmax_{cj ∈ C} P(cj) Π_{i ∈ positions} P(xi | cj)

Why?
Exercise

Example: Classification
Underflow prevention: do the computation in log space

  cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
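The log-space decision rule can be sketched as follows; the priors and conditional probabilities are invented toy numbers.

```python
# Summing logs avoids the floating-point underflow that a long product
# of small probabilities would cause. Toy parameters, invented.

import math

toy_priors = {"pos": 0.5, "neg": 0.5}
toy_cond = {("good", "pos"): 0.1, ("good", "neg"): 0.001,
            ("bad", "pos"): 0.001, ("bad", "neg"): 0.1}

def nb_classify_log(positions, priors, cond):
    """argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    best_class, best_score = None, float("-inf")
    for c, p_c in priors.items():
        score = math.log(p_c) + sum(math.log(cond[(w, c)]) for w in positions)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Since log is monotonic, the argmax is unchanged; only the arithmetic is safer.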
Two Models

Second assumption (positional independence):

  P(Xi = w | c) = P(Xj = w | c)   for all positions i, j, word w, and class c
Parameter estimation

Bernoulli model:   P(Xw = t | cj) = fraction of documents of class cj in which word w appears

Multinomial model: P(Xi = w | cj) = fraction of times word w appears among all word positions in documents of class cj
Classification
Feature Selection
Feature selection: how? (Sec. 13.5)

Two ideas:
  Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (chi-square)
  Information theory: how much information does the value of one categorical variable give about the value of another? (mutual information)
χ² statistic (CHI) (Sec. 13.5)

χ² is interested in (fo − fe)²/fe summed over all table entries: is the
observed number what you'd expect given the marginals?

                   Class = auto     Class ≠ auto
  Term = jaguar    2    (0.25)      3    (4.75)
  Term ≠ jaguar    500  (502)       9500 (9498)

  observed: fo; expected fe in parentheses

  χ²(j, a) = Σ (O − E)² / E
           = (2 − .25)²/.25 + (3 − 4.75)²/4.75 + (500 − 502)²/502 + (9500 − 9498)²/9498
           ≈ 12.9   (p < .001)
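The cell-by-cell sum can be recomputed directly from the observed and expected counts in the table:

```python
# Chi-square from observed and expected cell counts:
# chi2 = sum over cells of (f_o - f_e)^2 / f_e.
# Counts are the jaguar/auto numbers from the slide.

observed = [2, 3, 500, 9500]
expected = [0.25, 4.75, 502, 9498]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # about 12.9
```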
χ² statistic (CHI) (Sec. 13.5.2)

There is a simpler formula for the 2×2 χ²:

  χ² = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

  A = #(t, c)    B = #(t, ¬c)    C = #(¬t, c)    D = #(¬t, ¬c)    N = A + B + C + D

Yields 12.9?

Value for complete independence of term and category?
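The 2×2 shortcut can be checked on the same jaguar/auto counts; the formula used is the standard 2×2 χ² identity.

```python
# 2x2 chi-square shortcut:
#   chi2 = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D))

A, B, C, D = 2, 3, 500, 9500   # A=#(t,c), B=#(t,~c), C=#(~t,c), D=#(~t,~c)
N = A + B + C + D

chi2_2x2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
```

Up to the approximation in the expected counts, this agrees with the cell-by-cell sum of about 12.9.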
Feature selection via mutual information (Sec. 13.5.1)

  I(w, c) = Σ_{ew ∈ {0,1}} Σ_{ec ∈ {0,1}} p(ew, ec) log [ p(ew, ec) / (p(ew) p(ec)) ]
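The double sum can be evaluated on a small invented joint distribution; natural log is used here (a different log base only rescales the value).

```python
# Mutual information I(w, c) over e_w (term present?) and e_c (in class?).
# The joint probabilities are invented for illustration.

import math

joint = {  # p(e_w, e_c)
    (0, 0): 0.70, (0, 1): 0.10,
    (1, 0): 0.05, (1, 1): 0.15,
}

p_w = {e: joint[(e, 0)] + joint[(e, 1)] for e in (0, 1)}   # marginal p(e_w)
p_c = {e: joint[(0, e)] + joint[(1, e)] for e in (0, 1)}   # marginal p(e_c)

mi = sum(p * math.log(p / (p_w[ew] * p_c[ec]))
         for (ew, ec), p in joint.items())
```

If term and class were independent, every ratio inside the log would be 1 and the sum would be 0.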
Feature Selection

Mutual information:
  Clear information-theoretic interpretation
  May select rare uninformative terms

Chi-square:
  Statistical foundation
  May select very slightly informative frequent terms that are not very useful for classification

Just use the commonest terms?
  No particular foundation
  In practice, this is often 90% as good
Results:

  Class      Extracted   Correct   Accuracy
  Student    180         130       72%
  Faculty    66          28        42%
  Person     246         194       79%
  Project    99          72        73%
  Course     28          25        89%
  Departmt   1           1         100%
Violation of NB Assumptions (Sec. 13.6)

  Conditional independence
  Positional independence

Examples?
Naive Bayes is not so naive

  A good dependable baseline for text classification (but not the best)!
  Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem
  Very fast: learning requires just one counting pass over the data; testing is linear in the number of attributes and in document-collection size