Lecture 4: Classification (Naïve Bayes)


Classification : Naïve Bayes

Adama Science and Technology University


School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2020)
Outline

 What is Naïve Bayes?
 Pros and Cons of Naïve Bayes
 Probability theory
 Conditional probability
 Naïve Bayes classification

Probability Theory : Naïve Bayes

 In both kNN and decision trees, we asked the classifiers to make hard decisions.
 We asked for a definite answer to the question.
 Asking the classifier for its best guess about the class, together with a probability for that guess, is often better.

 Probability theory forms the basis for many machine learning algorithms.

 Probability theory can help us classify things.

Probability Theory : Naïve Bayes

Classifying with Bayesian Decision Theory:


 Pros:
 Works with a small amount of data; handles multiple classes.
 Cons:
 Sensitive to how the input data is prepared.

 Works with:
 Nominal values

Probability Theory : Naïve Bayes

 Naïve Bayes is a subset of Bayesian Decision Theory.

 The decision tree wouldn't be very successful, and kNN would require a lot of calculations compared to the simple probability calculation.

 Conditional Probability:
 P(gray | bucket B) = P(gray and bucket B) / P(bucket B)

Probability Theory : Naïve Bayes

Figure 1: Seven stones in two buckets

Probability Theory : Naïve Bayes

 Conditional Probability:
 Calculating the probability of a gray stone, given that the unknown stone comes from bucket B:
 P(gray | bucket B) = 1/3
 P(gray | bucket A) = 2/4
 To formalize how to calculate the conditional probability, we can say:
 P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
 P(gray and bucket B) = 1/7 (one gray stone in bucket B, out of seven stones in total)
 P(bucket B) = 3/7 (three of the seven stones are in bucket B)

Probability Theory : Naïve Bayes

 Conditional Probability:
 P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
 P(gray | bucket B) = (1/7) / (3/7)
 P(gray | bucket B) = 1/3

 Another useful way to manipulate conditional probabilities is known as Bayes’ rule.


 If we have P(x | c) but want P(c | x), Bayes' rule tells us how to get it:
 P(c | x) = P(x | c) P(c) / P(x)
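For illustration, here is a minimal Python sketch of the stone calculation above. The stone list is reconstructed from the numbers on these slides (two gray and two black stones in bucket A, one gray and two black in bucket B), and the variable names are my own:

# Seven stones from Figure 1, listed as (color, bucket).
stones = [("gray", "A"), ("gray", "A"), ("black", "A"), ("black", "A"),
          ("gray", "B"), ("black", "B"), ("black", "B")]

total = len(stones)                                                           # 7 stones
p_gray_and_B = sum(1 for c, b in stones if c == "gray" and b == "B") / total  # 1/7
p_B = sum(1 for _, b in stones if b == "B") / total                           # 3/7

# Conditional probability: P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
p_gray_given_B = p_gray_and_B / p_B
print(p_gray_given_B)   # 0.333... = 1/3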

Classifying with Conditional Probabilities
 Bayesian decision theory tells us to compare two probabilities:
 If p1(x, y) > p2(x, y), then the class is 1.
 If p1(x, y) < p2(x, y), then the class is 2.

 But what we really need to compare are p(c1 | x, y) and p(c2 | x, y):


 Given a point identified as (x, y), what is the probability it came from class c1?
 What is the probability it came from class c2?

Classifying with Conditional Probabilities

 Posterior = (likelihood * prior) / evidence

 With these definitions, we can define the Bayesian classification rule:


 If P(c1 | x, y) > P(c2 | x, y), the class is c1.
 If P(c1 | x, y) < P(c2 | x, y), the class is c2.
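A minimal sketch of this decision rule, assuming the likelihood, prior, and evidence values are already available as numbers (the function and argument names are my own, and the sample values are purely illustrative):

def bayes_classify(likelihood_c1, prior_c1, likelihood_c2, prior_c2, evidence):
    # Posterior = (likelihood * prior) / evidence
    posterior_c1 = likelihood_c1 * prior_c1 / evidence
    posterior_c2 = likelihood_c2 * prior_c2 / evidence
    # If P(c1 | x, y) > P(c2 | x, y), the class is c1; otherwise c2.
    return "c1" if posterior_c1 > posterior_c2 else "c2"

print(bayes_classify(0.30, 0.5, 0.05, 0.5, 0.175))   # -> "c1"

Since the evidence P(x, y) is the same for both classes, it cancels in the comparison and can be dropped when only the winning class is needed.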

Uses of Naïve Bayes Classification

 Application of Naïve Bayes:


 Naïve Bayes text classification
 Spam filtering
 Hybrid recommender system (Collaborative and Content based filtering)
 Online application

 Bayesian reasoning is applied to decision making and to inferential statistics that deal with probabilistic inference.
 It uses knowledge of prior events to predict future events.

Example One

Figure 2: Example training data


Example One

 X = (age = youth, income = medium, student = yes, credit_rating = fair)
 Will a person described by tuple X buy a computer?

 Maximum a Posteriori (MAP) Hypothesis:

 P(Ci | X) = P(X | Ci) P(Ci) / P(X)
 Maximize P(Ci | X); equivalently, maximize P(X | Ci) P(Ci), since P(X) is constant.

Example One

 P(C1 = yes) = P(buys_computer = yes) = 9/14 = 0.643
 P(C2 = no) = P(buys_computer = no) = 5/14 = 0.357
 P(age = youth | buys_computer = yes) = 2/9 = 0.222
 P(age = youth | buys_computer = no) = 3/5 = 0.600
 P(income = medium | buys_computer = yes) = 4/9 = 0.444
 P(income = medium | buys_computer = no) = 2/5 = 0.400
 P(student = yes | buys_computer = yes) = 6/9 = 0.667
 P(student = yes | buys_computer = no) = 1/5 = 0.200
 P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
 P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Example One

 P(X | buys_computer = yes) = P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes)

 P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044

 P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

Example One

 Find the class Ci that maximizes P(X | Ci) × P(Ci):

 P(X | buys_computer = yes) × P(buys_computer = yes)
 = 0.044 × 0.643
 = 0.028
 P(X | buys_computer = no) × P(buys_computer = no)
 = 0.019 × 0.357
 = 0.007

 Prediction: the person described by tuple X buys a computer.
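A minimal Python sketch that reproduces this calculation from the conditional probabilities listed above (the training table itself is in Figure 2 and is not re-entered here; the dictionary layout is my own):

# Priors and class-conditional probabilities taken from the slides above.
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit_rating=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit_rating=fair": 2/5},
}

X = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]

# Score each class with P(X | Ci) * P(Ci); P(X) is constant and can be ignored.
scores = {}
for c in priors:
    score = priors[c]
    for attr in X:
        score *= cond[c][attr]
    scores[c] = score

print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))   # 'yes' -> tuple X buys a computer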

Example Two

 Consider a set of documents, each of which is related either to Sports (S) or to Informatics (I).

 Given a training set of 11 documents, we would like to estimate a Naïve Bayes classifier, using the Bernoulli document model, to classify unlabelled documents as S or I.

 We define a vocabulary of eight words:

Example Two

 Types of Naïve Bayes:

Example Two

Figure 3: Vocabulary of eight words

Example Two

 Thus each document is represented as an 8-dimensional binary vector.


 The training data is presented below as a matrix for each class, in which each row represents an 8-dimensional document vector.

Example Two

 Classify the following into Sports or Informatics using a Naïve Bayes classifier:
 b1 = (1, 0, 0, 1, 1, 1, 0, 1): S or I?
 b2 = (0, 1, 1, 0, 1, 0, 1, 0): S or I?

Example Two

 The total number of documents in the training set is N = 11, with NS = 6 and NI = 5.

 We can estimate the prior probabilities from the training data as:
 P(S) = 6/11
 P(I) = 5/11

Example Two

 The word counts in the training data are:

Example Two

 We can estimate the word likelihood using:
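The estimator from the original slide is not reproduced here. For the Bernoulli document model, the usual relative-frequency estimate, which is consistent with the worked numbers below, is:

P(wt | C) = nC(wt) / NC

where nC(wt) is the number of class-C training documents that contain word wt, and NC is the number of training documents in class C (6 for S and 5 for I).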

Example Two

 The word likelihood for class I:

Example Two

 To compute the posterior probabilities of the two test vectors and hence classify them:
 b1 = (1, 0, 0, 1, 1, 1, 0, 1)
 P(S | b1) ∝ P(b1 | S) × P(S)
 = (1/2 × 5/6 × 2/3 × 1/2 × 1/2 × 2/3 × 1/3 × 2/3) × (6/11)
 = 5/891 ≈ 5.6 × 10⁻³
 P(I | b1) ∝ P(b1 | I) × P(I)
 = (1/5 × 2/5 × 2/5 × 1/5 × 1/5 × 1/5 × 2/5 × 1/5) × (5/11)
 = 8/859375 ≈ 9.3 × 10⁻⁶
 Classify this document as S.

Example Two

 To compute the posterior probabilities of the two test vectors and hence classify them:
 b2 = (0, 1, 1, 0, 1, 0, 1, 0)
 P(S | b2) ∝ P(b2 | S) × P(S)
 = (1/2 × 1/6 × 1/3 × 1/2 × 1/2 × 1/3 × 2/3 × 1/3) × (6/11)
 = 12/42768 ≈ 2.8 × 10⁻⁴
 P(I | b2) ∝ P(b2 | I) × P(I)
 = (4/5 × 3/5 × 3/5 × 4/5 × 1/5 × 4/5 × 3/5 × 4/5) × (5/11)
 = 34560/4296875 ≈ 8.0 × 10⁻³
 Classify this document as I.
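A minimal Python sketch of the Bernoulli-model calculation for both test vectors. The per-word likelihoods are reconstructed from the factors shown above (the original count tables are not reproduced here), so treat them as an illustration under that assumption:

# Priors from the training set: 6 Sports and 5 Informatics documents out of 11.
priors = {"S": 6/11, "I": 5/11}

# P(wt | class) for the eight vocabulary words, reconstructed from the worked factors above.
likelihood = {
    "S": [1/2, 1/6, 1/3, 1/2, 1/2, 2/3, 2/3, 2/3],
    "I": [1/5, 3/5, 3/5, 1/5, 1/5, 1/5, 3/5, 1/5],
}

def bernoulli_score(doc, cls):
    # P(class) times, for each word, P(w | class) if present and (1 - P(w | class)) if absent.
    score = priors[cls]
    for present, p in zip(doc, likelihood[cls]):
        score *= p if present else (1 - p)
    return score

for doc in [(1, 0, 0, 1, 1, 1, 0, 1), (0, 1, 1, 0, 1, 0, 1, 0)]:
    scores = {cls: bernoulli_score(doc, cls) for cls in priors}
    print(doc, scores, "->", max(scores, key=scores.get))
# b1 -> S (5.6e-3 vs 9.3e-6); b2 -> I (2.8e-4 vs 8.0e-3)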

Naïve Bayes: Syntax

 Import the class containing the classification method:
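The code from the original slide is not reproduced above; a typical scikit-learn version is sketched below. GaussianNB is one choice (MultinomialNB and BernoulliNB are the other common variants), and the tiny training arrays are placeholder data of my own:

from sklearn.naive_bayes import GaussianNB

X_train = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]   # toy feature vectors
y_train = ["yes", "no", "no", "yes"]                          # toy class labels
X_test = [[0.5, 0.9]]

model = GaussianNB()             # create the classifier
model.fit(X_train, y_train)      # estimate priors and per-class likelihoods from the training data
print(model.predict(X_test))     # predict the most probable class for the test sample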

Summary

 Using probabilities can sometimes be more effective than using hard rules for classification.
 Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values.

 You can reduce the need for a lot of data by assuming conditional independence among the features in your data.

 The assumption we make is that the probability of one word doesn’t depend on any other words in the document.
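Written out for features x1, ..., xn (here, the words of a document), the conditional independence assumption says:

P(x1, x2, ..., xn | c) = P(x1 | c) × P(x2 | c) × ... × P(xn | c)

which is exactly what Examples One and Two exploit when they multiply the per-feature probabilities.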

Summary

 Despite its often-violated independence assumption, naïve Bayes is effective at classification.

 Underflow is one problem that can be addressed by using the logarithm of probabilities in your calculations.
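A minimal sketch of the log-probability fix, reusing the numbers from Example One; comparing sums of logarithms avoids multiplying many small probabilities together:

import math

# log P(Ci) + sum of log P(attribute | Ci), using the Example One probabilities.
log_score_yes = math.log(9/14) + sum(math.log(p) for p in [2/9, 4/9, 6/9, 6/9])
log_score_no = math.log(5/14) + sum(math.log(p) for p in [3/5, 2/5, 1/5, 2/5])

# The ordering of the log-scores matches the ordering of the original products.
print("yes" if log_score_yes > log_score_no else "no")   # -> "yes"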

Question & Answer

Thank You !!!

Assignment Three
 Predict the outcome for the following instance: x' = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)

