ML-Unit I - Naive Bayes


Machine Learning

Dr. Sunil Saumya


IIIT Dharwad
Naive Bayes
Bayes Theorem:

P(A | B) = P(B | A) * P(A) / P(B)

where:
● A and B are called events.
● P(A | B) is the probability of event A, given the event B is true (has occurred).
Event B is also termed as evidence.
● P(A) is the prior probability of A (i.e. the probability of the event
before the evidence is seen).
● P(B | A) is the probability of B given event A, i.e. the probability of
the evidence B given that event A has occurred.
Naive Bayes classification

● Let's take one-dimensional data to understand how Bayes theorem works.
● Check whether the statement "a student will fail if his efforts are
poor" is correct or not.
● Here, x = Poor, so find y = ?

Effort (x)   Result (y)
Poor         Fail
Average      Pass
Average      Pass
Good         Pass
Good         Pass
Poor         Fail
Poor         Fail
Poor         Pass
Poor         Fail
Average      Pass
Average      Fail
Naive Bayes classification

● For the given problem, the Bayes classifier model will be:

P(y | x) = P(x | y) * P(y) / P(x)

which for our query is:

P(Fail | Poor) = P(Poor | Fail) * P(Fail) / P(Poor)
Naive Bayes classification

P(Fail | Poor) = P(Poor | Fail) * P(Fail) / P(Poor)

P(Poor | Fail) = Number of students who failed with poor efforts /
Number of students who failed = 4/5 = 0.8

P(Fail) = Number of students who failed / Total students = 5/11 = 0.45

P(Poor) = Number of students with poor efforts / Total students = 5/11 = 0.45
Naive Bayes classification

P(Fail | Poor) = P(Poor | Fail) * P(Fail) / P(Poor)
P(Fail | Poor) = (0.8 * 0.45) / 0.45 = 0.8

P(Pass | Poor) = P(Poor | Pass) * P(Pass) / P(Poor)
P(Pass | Poor) = (1/6 * 6/11) / (5/11) = 0.2

Since P(Fail | Poor) > P(Pass | Poor), for a new student whose effort is
poor, the predicted result is Fail.
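The calculation above can be reproduced directly from the table. Below is a minimal Python sketch; the `data` list encodes the 11 rows of the table, and the `posterior` helper is an illustrative name, not from the slides.

```python
# Training data from the table: (effort, result) for 11 students.
data = [("Poor", "Fail"), ("Average", "Pass"), ("Average", "Pass"),
        ("Good", "Pass"), ("Good", "Pass"), ("Poor", "Fail"),
        ("Poor", "Fail"), ("Poor", "Pass"), ("Poor", "Fail"),
        ("Average", "Pass"), ("Average", "Fail")]

def posterior(effort, result):
    """Bayes theorem from counts: P(result | effort)."""
    n = len(data)
    n_result = sum(1 for x, y in data if y == result)   # e.g. 5 students failed
    n_effort = sum(1 for x, y in data if x == effort)   # e.g. 5 students with poor effort
    n_both = sum(1 for x, y in data if (x, y) == (effort, result))
    # P(effort | result) * P(result) / P(effort)
    return (n_both / n_result) * (n_result / n) / (n_effort / n)

print(round(posterior("Poor", "Fail"), 3))  # 0.8
print(round(posterior("Poor", "Pass"), 3))  # 0.2
```

Since 0.8 > 0.2, the classifier predicts Fail for a poor-effort student, matching the hand calculation.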
Naive Bayes Exercise
Consider the following training dataset, and classify a Red Domestic SUV.


SMS spam classification: Dataset
● Problem: how to do text classification using Naive Bayes?
○ We can't feed text directly to our classifier.
○ We extract features from the text and then feed them in as input.

Sentence                                         Label
Send your mobile number                          Ham (0)
Send your account number and mobile number       Spam (1)
Your mobile number selected as a winner          Spam (1)
Send your mobile                                 ??


SMS spam classification: Feature extraction

● We will extract TF-IDF (Term Frequency - Inverse Document Frequency)
features from the text.

Step 1: Prepare the vocabulary of unique words from the dataset.

Vocabulary: "Send", "your", "mobile", "number", "account", "and",
"selected", "as", "winner"

Let's calculate the term frequency of each unique word in the vocabulary.
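Step 1 can be sketched in Python as follows (illustrative code, not from the slides; note the slides' vocabulary drops the stop word "a" from the third sentence).

```python
# Build the vocabulary of unique words, in first-seen order.
sentences = ["Send your mobile number",
             "Send your account number and mobile number",
             "Your mobile number selected as a winner"]

vocab = []
for s in sentences:
    for w in s.lower().split():
        if w != "a" and w not in vocab:   # "a" is excluded, as on the slides
            vocab.append(w)

print(vocab)
# ['send', 'your', 'mobile', 'number', 'account', 'and', 'selected', 'as', 'winner']
```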


SMS spam classification: Feature extraction

Step 2: Calculate the TF-IDF features.

TF-IDF(x) = TF(x) * IDF(x), where x is a word in the vocabulary,
TF(x) = (count of x in the sentence) / (number of words in the sentence),
and IDF(x) = log(N / number of sentences containing x), with N = 3
sentences and log taken base 10.
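A minimal sketch of this computation (illustrative helper names; it uses base-10 logs, as the slides' value log(3/2) = 0.176 implies, and drops "a" from S3 so that sentence has 6 words, matching the slides' denominators):

```python
import math

# The three training sentences, lowercased, with "a" removed from S3.
docs = ["send your mobile number",
        "send your account number and mobile number",
        "your mobile number selected as winner"]

def tf_idf(word, doc):
    words = doc.split()
    tf = words.count(word) / len(words)                    # term frequency
    df = sum(1 for d in docs if word in d.split())         # document frequency
    idf = math.log10(len(docs) / df)                       # inverse document frequency
    return tf * idf

print(round(tf_idf("send", docs[0]), 3))  # 0.044
print(round(tf_idf("your", docs[0]), 3))  # 0.0  (IDF of "your" is log(3/3) = 0)
```

Exact values can differ slightly from the slides, which round TF before multiplying (e.g. "send" in S2 gives 0.025 here versus 0.024 on the slide).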
SMS spam classification: Feature extraction

Step 2: Calculate the TF-IDF feature for "send":
IDF(send) = log(3/2) = 0.176

TF(send in S1) = 1/4 = 0.25,  TF-IDF(send in S1) = 0.25 * 0.176 = 0.044
TF(send in S2) = 1/7 = 0.14,  TF-IDF(send in S2) = 0.14 * 0.176 = 0.024
TF(send in S3) = 0/6 = 0,     TF-IDF(send in S3) = 0 * 0.176 = 0

      Send   your  mobile number account and   selected as    winner
S1    0.044
S2    0.024
S3    0
SMS spam classification: Feature extraction

Step 2: TF-IDF for "your":
IDF(your) = log(3/3) = 0

TF(your in S1) = 1/4 = 0.25,  TF-IDF(your in S1) = 0.25 * 0 = 0
TF(your in S2) = 1/7 = 0.14,  TF-IDF(your in S2) = 0.14 * 0 = 0
TF(your in S3) = 1/6 = 0.16,  TF-IDF(your in S3) = 0.16 * 0 = 0

      Send   your  mobile number account and   selected as    winner
S1    0.044  0
S2    0.024  0
S3    0      0
SMS spam classification: Feature extraction

Step 2: TF-IDF for "mobile":
IDF(mobile) = log(3/3) = 0

TF(mobile in S1) = 1/4 = 0.25,  TF-IDF(mobile in S1) = 0.25 * 0 = 0
TF(mobile in S2) = 1/7 = 0.14,  TF-IDF(mobile in S2) = 0.14 * 0 = 0
TF(mobile in S3) = 1/6 = 0.16,  TF-IDF(mobile in S3) = 0.16 * 0 = 0

      Send   your  mobile number account and   selected as    winner
S1    0.044  0     0
S2    0.024  0     0
S3    0      0     0
SMS spam classification: Feature extraction

Step 2: TF-IDF for "number" and "account":
IDF(number) = log(3/3) = 0, so TF-IDF(number) = 0 in all three sentences.
IDF(account) = log(3/1) = 0.477

TF(account in S1) = 0/4 = 0,  TF-IDF(account in S1) = 0 * 0.477 = 0
TF(account in S2) = 0.16,     TF-IDF(account in S2) = 0.16 * 0.477 = 0.076
TF(account in S3) = 0/6 = 0,  TF-IDF(account in S3) = 0 * 0.477 = 0

      Send   your  mobile number account and   selected as    winner
S1    0.044  0     0      0      0
S2    0.024  0     0      0      0.076
S3    0      0     0      0      0
SMS spam classification: Feature extraction

Step 2: TF-IDF for "and", "selected", and "as":
IDF(and) = IDF(selected) = IDF(as) = log(3/1) = 0.477
(each of these words appears in exactly one sentence)

TF-IDF(and in S2) = 0.16 * 0.477 = 0.076
TF(selected in S3) = 1/6 = 0.16,  TF-IDF(selected in S3) = 0.16 * 0.477 = 0.076
TF(as in S3) = 1/6 = 0.16,        TF-IDF(as in S3) = 0.16 * 0.477 = 0.076
All other entries for these words are 0.
SMS spam classification: Feature extraction

Step 2: TF-IDF for "winner":
IDF(winner) = log(3/1) = 0.477

TF(winner in S1) = 0/4 = 0,     TF-IDF(winner in S1) = 0 * 0.477 = 0
TF(winner in S2) = 0/7 = 0,     TF-IDF(winner in S2) = 0 * 0.477 = 0
TF(winner in S3) = 1/6 = 0.16,  TF-IDF(winner in S3) = 0.16 * 0.477 = 0.076

      Send   your  mobile number account and   selected as    winner
S1    0.044  0     0      0      0       0     0        0     0
S2    0.024  0     0      0      0.076   0.076 0        0     0
S3    0      0     0      0      0       0     0.076    0.076 0.076


SMS spam classification: Feature extraction

The complete TF-IDF feature table, with class labels:

      Send   your  mobile number account and   selected as    winner  Class
S1    0.044  0     0      0      0       0     0        0     0       Ham
S2    0.024  0     0      0      0.076   0.076 0        0     0       Spam
S3    0      0     0      0      0       0     0.076    0.076 0.076   Spam


SMS spam classification: Classification

Step 3: Classification using Naive Bayes.

Documents with Ham outcome:
P(Ham) = 1/3 = 0.33

P(wk | Ham) = (nk + 1) / (n + |vocabulary|)
where n is the sum of all TF-IDF values in the Ham documents and nk is
the sum of the TF-IDF values of word k in the Ham documents (Laplace
smoothing, with |vocabulary| = 9).

P(send | Ham) = (0.044 + 1) / (0.044 + 9) = 0.115

SMS spam classification: Classification

Step 3: Classification using NB (Ham likelihoods).

Documents with Ham outcome: P(Ham) = 1/3 = 0.33
P(wk | Ham) = (nk + 1) / (n + |vocabulary|), with n = 0.044

P(send | Ham)     = (0.044 + 1) / (0.044 + 9) = 0.115
P(your | Ham)     = (0 + 1) / (0.044 + 9) = 0.111
P(mobile | Ham)   = (0 + 1) / (0.044 + 9) = 0.111
P(number | Ham)   = (0 + 1) / (0.044 + 9) = 0.111
P(account | Ham)  = (0 + 1) / (0.044 + 9) = 0.111
P(and | Ham)      = (0 + 1) / (0.044 + 9) = 0.111
P(selected | Ham) = (0 + 1) / (0.044 + 9) = 0.111
P(as | Ham)       = (0 + 1) / (0.044 + 9) = 0.111
P(winner | Ham)   = (0 + 1) / (0.044 + 9) = 0.111


SMS spam classification: Classification

Step 3: Classification using NB (Spam likelihoods).

Documents with Spam outcome: P(Spam) = 2/3 = 0.66
P(wk | Spam) = (nk + 1) / (n + |vocabulary|), with n = 0.404

P(send | Spam)     = (0.024 + 1) / (0.404 + 9) = 0.108
P(your | Spam)     = (0 + 1) / (0.404 + 9) = 0.106
P(mobile | Spam)   = (0 + 1) / (0.404 + 9) = 0.106
P(number | Spam)   = (0 + 1) / (0.404 + 9) = 0.106
P(account | Spam)  = (0.076 + 1) / (0.404 + 9) = 0.114
P(and | Spam)      = (0.076 + 1) / (0.404 + 9) = 0.114
P(selected | Spam) = (0.076 + 1) / (0.404 + 9) = 0.114
P(as | Spam)       = (0.076 + 1) / (0.404 + 9) = 0.114
P(winner | Spam)   = (0.076 + 1) / (0.404 + 9) = 0.114
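The smoothed likelihoods follow one pattern, so they can be computed with a single helper. A minimal sketch (the dictionaries below just list the non-zero TF-IDF values read off the feature table; the names are illustrative):

```python
# Laplace-smoothed likelihood, as on the slides:
# P(w | class) = (nk + 1) / (n + |V|), with |V| = 9 vocabulary words,
# nk = summed TF-IDF of w in the class, n = total TF-IDF mass of the class.
VOCAB_SIZE = 9

# Non-zero TF-IDF sums per class (S1 is Ham; S2 and S3 are Spam).
ham_tfidf = {"send": 0.044}
spam_tfidf = {"send": 0.024, "account": 0.076, "and": 0.076,
              "selected": 0.076, "as": 0.076, "winner": 0.076}

def likelihood(word, class_tfidf):
    n = sum(class_tfidf.values())          # 0.044 for Ham, 0.404 for Spam
    nk = class_tfidf.get(word, 0.0)
    return (nk + 1) / (n + VOCAB_SIZE)

print(round(likelihood("send", ham_tfidf), 3))      # 0.115
print(round(likelihood("account", spam_tfidf), 3))  # 0.114
```

The `+1` in the numerator and the `+VOCAB_SIZE` in the denominator keep any likelihood from being exactly zero, so an unseen word cannot zero out the whole product.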


SMS spam classification

● Test data: "Send your mobile"

P(Ham | send your mobile) ∝ P(Ham) * P(send | Ham) * P(your | Ham) * P(mobile | Ham)
The denominator P(send your mobile) is ignored because it is common to both classes.
P(Ham | send your mobile) ∝ 0.33 * 0.115 * 0.111 * 0.111 ≈ 0.00047

P(Spam | send your mobile) ∝ P(Spam) * P(send | Spam) * P(your | Spam) * P(mobile | Spam)
P(Spam | send your mobile) ∝ 0.66 * 0.108 * 0.106 * 0.106 ≈ 0.00080

Clearly,
P(Spam | send your mobile) ≈ 0.00080 > P(Ham | send your mobile) ≈ 0.00047,
so the test message is classified as Spam.
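Putting the pieces together, the final comparison can be sketched as follows. The priors and per-word likelihoods are the values computed above; the `score` helper is an illustrative name, not from the slides.

```python
# Class priors and the smoothed likelihoods for the test words.
priors = {"Ham": 1 / 3, "Spam": 2 / 3}
likelihoods = {
    "Ham":  {"send": 0.115, "your": 0.111, "mobile": 0.111},
    "Spam": {"send": 0.108, "your": 0.106, "mobile": 0.106},
}

def score(cls, words):
    # Posterior up to the shared denominator P(message), which is
    # dropped because it is common to both classes.
    s = priors[cls]
    for w in words:
        s *= likelihoods[cls][w]
    return s

msg = ["send", "your", "mobile"]
best = max(priors, key=lambda c: score(c, msg))
print(best)  # Spam
```

Multiplying many small probabilities underflows on longer messages; a common refinement is to sum log-probabilities instead of multiplying raw ones.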
