Naive Bayes
Model Selection
Overfitting and Underfitting
Learning Theory: Bias/Variance Trade-off
Bayesian Models: Bayesian Concept Learning, Bayesian Decision Theory
Naive Bayes Theorem
Zero Probability & Laplacian Correction
Presentation by:
Ms. Shweta Redkar
Assistant Professor,
Manipal University Jaipur
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the
existence of classes or clusters in the data (a short contrast sketch follows below)
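As an illustration (not part of the original slides), a minimal Python sketch contrasting the two settings, assuming scikit-learn is installed; the toy data and labels are invented:

# Supervised learning (classification): labels accompany the training data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[170, 65], [160, 50], [180, 80], [155, 45]]   # toy feature vectors
y = ["adult", "teen", "adult", "teen"]             # class labels (the supervision)

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[175, 70]]))                    # new data classified based on the training set

# Unsupervised learning (clustering): no labels, look for structure in the data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                  # discovered cluster assignments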
Prediction Problems: Classification vs.
Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data by constructing a model from the training set and the values (class
labels) of a classifying attribute, and uses the model to classify new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of each test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• The test set is independent of the training set (otherwise overfitting occurs)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called a validation set (see the sketch below)
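As a minimal illustration of the two steps (not part of the original slides; it assumes scikit-learn, and the dataset and classifier are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set,
# then, if acceptable, use the model to classify new data
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))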
Process (1): Model Construction

[Figure: a classification algorithm is applied to the training data to construct a classifier;
the classifier is then checked against testing data and used to label unseen data,
e.g., (Jeff, Professor, 4) → Tenured?]

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has
performance comparable to decision tree and selected neural network classifiers
Bayes’ Theorem: Basics
• Total Probability Theorem: P(B) = Σ_{i=1..M} P(B|Ai)·P(Ai),
  where A1, …, AM are mutually exclusive events that together cover the sample space
Prediction Based on Bayes’ Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X),
  follows Bayes’ theorem:
      P(H|X) = P(X|H)·P(H) / P(X)
• Informally: posterior = likelihood × prior / evidence
  (a small numeric illustration follows below)
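For instance (illustrative numbers only, not from the slides):

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h = 0.6           # prior P(H),        e.g., a customer buys a computer
p_x_given_h = 0.3   # likelihood P(X|H), e.g., the customer is aged <= 30 given they buy
p_x = 0.4           # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.45 -> posterior P(H|X)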
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal
P(Ci|X)
• This can be derived from Bayes’ theorem:
      P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
• Since P(X) is constant for all classes, only P(X|Ci)·P(Ci) needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
      P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)
• This greatly reduces the computation cost: only the class distribution needs to be counted
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided
by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
  distribution with mean μ and standard deviation σ:
      g(x, μ, σ) = (1 / (√(2π)·σ)) · exp( −(x − μ)² / (2σ²) )
  and P(xk|Ci) = g(xk, μ_Ci, σ_Ci), where μ_Ci and σ_Ci are estimated from the
  tuples of class Ci (a short sketch follows below)
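As an illustration (not part of the original slides), a minimal Python sketch of both estimates; the attribute names and toy data are invented:

import math

# Toy training tuples: (age_group, income_in_thousands, class) -- invented for illustration
data = [("young", 25.0, "yes"), ("young", 32.0, "no"),
        ("senior", 40.0, "yes"), ("senior", 28.0, "no"),
        ("young", 45.0, "yes")]

ci = [row for row in data if row[2] == "yes"]            # tuples of class Ci in D

# Categorical attribute: P(age_group = "young" | Ci) = count / |Ci,D|
p_cat = sum(1 for row in ci if row[0] == "young") / len(ci)

# Continuous attribute: estimate mu and sigma from the Ci tuples, then evaluate g(x, mu, sigma)
mu = sum(row[1] for row in ci) / len(ci)
sigma = math.sqrt(sum((row[1] - mu) ** 2 for row in ci) / len(ci))
g = lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(p_cat, g(30.0))   # the two P(xk|Ci) estimates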
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”)  = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”)  = 3/5 = 0.600
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”)  = 2/5 = 0.400
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”)  = 1/5 = 0.200
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”)  = 2/5 = 0.400
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):        P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
                  P(X | buys_computer = “no”)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
  P(X|Ci)·P(Ci):  P(X | buys_computer = “yes”) · P(buys_computer = “yes”) = 0.028
                  P(X | buys_computer = “no”)  · P(buys_computer = “no”)  = 0.007
• Therefore, X belongs to class “buys_computer = yes”
  (the sketch below reproduces this computation by counting)
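The same result can be reproduced with a short counting sketch in Python (illustrative; not part of the original slides):

# Training tuples: (age, income, student, credit_rating, buys_computer)
D = [("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
     ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
     (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
     ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
     ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
     ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
     ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no")]

X = ("<=30", "medium", "yes", "fair")            # tuple to classify

for c in ("yes", "no"):
    rows = [r for r in D if r[4] == c]
    prior = len(rows) / len(D)                   # P(Ci)
    likelihood = 1.0
    for k, xk in enumerate(X):                   # product of P(xk|Ci)
        likelihood *= sum(1 for r in rows if r[k] == xk) / len(rows)
    print(c, round(prior * likelihood, 3))       # yes -> 0.028, no -> 0.007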
• Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., in hospitals, a patient’s profile (age, family history, etc.), symptoms (fever,
cough, etc.), and disease (lung cancer, diabetes, etc.) are interdependent
• Dependencies among these cannot be modeled by the Naïve Bayes classifier
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)
• Used both for classification and regression
• Applications:
• handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests
SVM—Support Vector Machines
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a
higher dimension
• With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data
from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors)
SVM—General Philosophy
When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple with associated class
label yi.
There are infinitely many lines (hyperplanes) that separate the two classes, but we want to find the
best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane
(MMH).
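A side note (standard SVM geometry, not stated explicitly on this slide): writing the separating hyperplane as W·X + b = 0, with weight vector W and bias b, the two margin hyperplanes passing through the support vectors are W·X + b = 1 and W·X + b = −1, and the width of the margin that the SVM maximizes is 2 / ||W||.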
Maximal Marginal Hyperplane
SVM—Linearly Separable
w = weight vector, b = bias (a scalar)
• The optimal separating hyperplane is the linear classifier that separates the two classes of
training samples with the largest marginal width.
• The samples that lie along or within the margins of the decision boundary (by the
Kuhn–Tucker theorem) are called support vectors.
• The kernel function can be computed efficiently and solves the problem of mapping the
samples to a potentially high-dimensional feature space.
What are some of the kernel functions that could be used?
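Commonly used kernels include the linear, polynomial, RBF (Gaussian), and sigmoid kernels. A minimal sketch, assuming scikit-learn (the dataset is a stand-in):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Nonlinearly separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Each kernel implicitly maps the data to a higher-dimensional feature space
# in which a linear separating hyperplane is sought
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, clf.score(X, y), len(clf.support_))   # training accuracy, number of support vectors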
SVM Tools
◼SVM-light: http://svmlight.joachims.org/
◼LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
◼Gist: http://bioinformatics.ubc.ca/gist/
◼Weka
SVM Related Links
◼ SVM Website
◼ http://www.kernel-machines.org/
◼ Representative implementations