
Topics:

Model Selection
Overfitting, Underfitting
Learning theory: Bias/Variance trade-off
Bayesian Models: Bayesian concept learning, Bayesian Decision Theory
Naïve Bayes classifier
Zero Probability & Laplacian Correction
Presentation by:
Ms. Shweta Redkar
Assistant Professor,
Manipal University Jaipur
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data are unknown
• Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in the
data
2
Prediction Problems: Classification vs.
Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute and
uses it in classifying new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category a page belongs to

3
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
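As a rough illustration of this two-step process, here is a minimal Python sketch (assuming scikit-learn is available; the dataset is synthetic and not from these slides): the model is constructed on a training set and its accuracy is estimated on an independent test set.

# Sketch of the two-step process: (1) model construction, (2) model usage.
# scikit-learn and the synthetic dataset are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Keep the test set independent of the training set (otherwise overfitting)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: model construction on the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- compare known test labels with the classified results
y_pred = model.predict(X_test)
print("Accuracy rate:", accuracy_score(y_test, y_pred))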
4
Process (1): Model Construction

Training Data  →  Classification Algorithms  →  Classifier (Model)

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned model (classification rule):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
6
Process (2): Using the Model in Prediction

Classifier applied to Testing Data (to estimate accuracy) and then to Unseen Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4)  →  Tenured? yes (rank = ‘professor’)
7
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers

8
Bayes’ Theorem: Basics
• Total Probability Theorem: P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)

• Bayes’ Theorem: P(H | X) = P(X | H) P(H) / P(X)
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the
hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
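As a quick numeric illustration of these quantities (the numbers below are invented for illustration only, not taken from the slides):

# Bayes' theorem with made-up numbers: P(H|X) = P(X|H) P(H) / P(X),
# where P(X) comes from the total probability theorem.
p_h = 0.3              # prior P(H): the customer buys a computer
p_x_given_h = 0.5      # likelihood P(X|H): profile X observed, given the customer buys
p_x_given_not_h = 0.2  # P(X|~H): profile X observed, given the customer does not buy

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # evidence P(X)
p_h_given_x = p_x_given_h * p_h / p_x                   # posterior P(H|X)
print(round(p_h_given_x, 3))                            # ~0.517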

9
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H | X) = P(X | H) P(H) / P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost

10
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal
P(Ci|X)
• This can be derived from Bayes’ theorem
P(Ci | X) = P(X | Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized,
i.e., P(Ci | X) ∝ P(X | Ci) P(Ci)
11
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)
• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided
by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x−μ)² / (2σ²))

and P(xk | Ci) = g(xk, μ_Ci, σ_Ci)
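A minimal sketch of the Gaussian estimate for a continuous attribute (the mean and standard deviation below are made up; in practice they are estimated from the tuples of class Ci):

import math

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) = (1 / (sqrt(2*pi)*sigma)) * exp(-(x - mu)^2 / (2*sigma^2))
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# e.g., age is continuous; suppose class Ci has mean age 38 and std. dev. 12 (illustrative values)
print(gaussian(x=35, mu=38, sigma=12))   # P(age = 35 | Ci)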
12
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data:

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
13
Naïve Bayes Classifier: An Example
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
Therefore, X belongs to class “buys_computer = yes”
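The same result can be reproduced with a short script that counts directly over the 14 training tuples above (a sketch of the hand computation, not a library implementation):

# Reproduces the hand computation above by counting over the 14 training tuples.
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")   # the tuple to classify

scores = {}
for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)                    # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):                    # P(X|Ci) = prod_k P(x_k|Ci)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    scores[c] = likelihood * prior                   # P(X|Ci) * P(Ci)

print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))   # -> 'yes'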
Example 2

• Classify the new instance into one of the two classes, i.e., Yes or No:

1. Calculate the prior probabilities
2. Calculate the conditional probabilities of the individual attributes, e.g.,
   the probability that Outlook is Sunny given Yes, and
   the probability that Outlook is Sunny given No
3. Identify/predict the class for the given condition
4. Normalise the probabilities
• Demo

• Naïve Bayes has three types of classifiers:
1. Bernoulli Naïve Bayes
2. Multinomial Naïve Bayes
3. Gaussian Naïve Bayes
• Count Vectorizer method (see the sketch below)
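A brief sketch of the count-vectorizer approach combined with Multinomial Naïve Bayes (using scikit-learn's CountVectorizer and MultinomialNB; the tiny text dataset is invented for illustration):

# Count Vectorizer + Multinomial Naive Bayes on a toy text dataset (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free offer buy now", "meeting schedule for monday",
        "buy cheap offer", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()               # turns each document into word-count features
X_counts = vectorizer.fit_transform(docs)

clf = MultinomialNB().fit(X_counts, labels)  # Multinomial NB suits count features;
                                             # GaussianNB suits continuous features,
                                             # BernoulliNB suits binary features
print(clf.predict(vectorizer.transform(["cheap offer now"])))   # expected: ['spam']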
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases

• Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., in hospitals, a patient’s profile (age, family history, etc.),
symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.) are interdependent
• Dependencies among these cannot be modeled by the Naïve Bayes classifier

• How to deal with these dependencies? Bayesian Belief Networks


Zero Probability & Laplacian Correction

• What if I encounter probability values of zero?


Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero.
Otherwise, the predicted prob. will be zero
P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci)
• Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990 tuples),
and income = high (10 tuples)
• Use the Laplacian correction (or Laplace estimator)
• Add 1 to each count, so the denominator becomes 1000 + 3 = 1003:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” probability estimates are close to their “uncorrected” counterparts (see the sketch below)
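A small sketch of the correction applied to the income counts above:

# Laplacian correction: add 1 to each count; the denominator grows by the
# number of distinct attribute values (3 here), giving 1000 + 3 = 1003.
counts = {"low": 0, "medium": 990, "high": 10}

total = sum(counts.values()) + len(counts)            # 1000 + 3 = 1003
corrected = {v: (c + 1) / total for v, c in counts.items()}
print(corrected)    # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}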

22
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)
• Used both for classification and regression
• Applications:
• handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests

23
SVM—Support Vector Machines
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a
higher dimension
• With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data
from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors)

24
SVM—General Philosophy

[Figure: two separating hyperplanes, one with a small margin and one with a large margin;
the training tuples lying on the margin boundaries are the support vectors.]

25
When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi its associated class
label
There are infinite lines (hyperplanes) separating the two classes but we want to find the best one
(the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane
(MMH)
26
Maximal Marginal Hyperplane

27
SVM—Linearly Separable

◼ A separating hyperplane can be written as


W · X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin)
are support vectors
◼ This becomes a constrained (convex) quadratic optimization problem: Quadratic
objective function and linear constraints → Quadratic Programming (QP) → Lagrangian
multipliers
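A minimal sketch of a linear SVM on a toy linearly separable dataset (using scikit-learn's SVC, which handles the quadratic optimization internally; the data points are made up for illustration):

# Linear SVM: finds the maximum marginal hyperplane W.X + b = 0.
import numpy as np
from sklearn.svm import SVC

# Toy 2-D linearly separable data (illustrative only)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("weight vector W:", clf.coef_)             # [w1, w2]
print("bias b:", clf.intercept_)
print("support vectors:", clf.support_vectors_)  # tuples that fall on H1 or H2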
28
Once I’ve got a trained support vector machine, how do I use it to classify test
(i.e.,new) tuples?

• Based on the Lagrangian formulation, the MMH can be rewritten as the decision boundary

f(X) = Σ_i αi yi <Xi, X> + b

where the sum is over the support vectors Xi, with class labels yi and Lagrange multipliers αi

• The complexity of the learned classifier is characterized by the number of support


vectors rather than the dimensionality of the data. Hence, SVMs tend to be less
prone to overfitting than some other methods.
• The support vectors are the essential or critical training tuples—they lie closest
to the decision boundary (MMH).
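To make this concrete, the sketch below (reusing the toy data from the previous sketch, and scikit-learn's dual_coef_, support_vectors_, and intercept_ attributes) recomputes the decision value for a new tuple from the support vectors, the coefficients αi·yi, and b alone, and checks it against the library's own decision function.

# Classify a new tuple using only the support vectors, alpha_i * y_i, and b.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])   # same toy data as before
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([4.5, 4.5])
# f(x) = sum_i (alpha_i * y_i) <X_i, x> + b, summed over the support vectors only
f = np.dot(clf.dual_coef_[0], clf.support_vectors_ @ x_new) + clf.intercept_[0]

print("decision value:", f)                                 # its sign gives the class
print(np.isclose(f, clf.decision_function([x_new])[0]))     # matches the library's value -> True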
What if the Data Are Linearly Inseparable?

• Let x be the vector denoted by [xj], where j = 1, …, m.

• Data set: {(xi, yi)}, i = 1, …, n, yi ∈ {−1, 1}
• To construct the SVM, all training samples are mapped to feature space by
a non-linear function ф(xi).
• A separating hyperplane in the feature space is expressed as
f(x) = <w, ф(x)> + b = Σ_{j=1}^{m} wj фj(x) + b   (1)

w= weight vector
b = bias
• Optimal separating hyperplane is defined as a linear classifier which can
separate the two classes of training samples with the largest marginal
width.

• The solution α = [αi] is obtained by maximizing the following function:

W(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj <ф(xi), ф(xj)>   (2)

subject to
Σ_{i=1}^{n} yi αi = 0, i = 1, …, n   (3)

where 0 ≤ αi ≤ C × Cfactor (for positive samples) and
0 ≤ αi ≤ C (for negative samples)
C = regularization parameter (controls the trade-off between training error and margin width);
the larger the value of C, the larger the penalty assigned to errors.
• In the optimization problem (discussed on the previous slide), only the terms
with αi > 0 remain.

• The samples that lie along or within the margins of the decision boundary
(by the Kuhn-Tucker theorem) are called support vectors.

• The weight vector in (1) can be expressed in terms of the xi and αi of the
optimization function (2):

w = Σ_{i=1}^{n} αi yi ф(xi)   (4)

where αi ≥ 0.

• Then the separating hyperplane in (1) becomes

f(x) = Σ_{i=1}^{n} αi yi <ф(xi), ф(x)> + b   (5)
• To avoid computing the inner product <ф(xi), ф(x)> in the high-dimensional
space during the optimization of (2), a kernel function that satisfies
Mercer’s condition is introduced:

K(xi, xj) = <ф(xi), ф(xj)>   (6)

• The kernel function can be computed efficiently and avoids explicitly mapping
the samples to a potentially high-dimensional feature space.
What are some of the kernel functions that could be used?
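Commonly used kernels include the polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel. The sketch below (toy XOR-like data and a gamma value chosen only for illustration) shows the RBF kernel K(xi, xj) = exp(−γ‖xi − xj‖²) standing in for the inner product <ф(xi), ф(xj)> of (6), so the mapping ф is never computed explicitly.

# Gaussian RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

# Toy data that is not linearly separable in the input space (XOR-like)
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print(clf.predict(X))              # reproduces the training labels [-1 -1  1  1]
print(rbf_kernel(X[0], X[2]))      # a single kernel evaluation: exp(-0.5) ~ 0.607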
35
SVM Tools
◼SVM-light: http://svmlight.joachims.org/
◼LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
◼Gist: http://bioinformatics.ubc.ca/gist/
◼Weka

36
SVM Related Links
◼ SVM Website

◼ http://www.kernel-machines.org/

◼ Representative implementations

◼ LIBSVM: an efficient implementation of SVM, with multi-class classification,
nu-SVM, one-class SVM, and various interfaces for Java, Python, etc.

◼ SVM-light: simpler, but performance is not better than LIBSVM; supports
only binary classification and only the C language

◼ SVM-torch: another recent implementation also written in C.

37
