Module 4.1-DM-1
Supervised vs. Unsupervised Learning
Supervised learning (classification): the training data are accompanied by class labels, and new data are classified based on the training set.
Unsupervised learning (clustering): the class labels of the training data are unknown; the aim is to find natural groupings in the data.
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: for classifying future or unknown objects
Previously unseen records should be assigned a class as accurately as possible.
Estimate accuracy of the model
The known label of each test sample is compared with the model's classification result.
A test set is used to determine the accuracy of the model. Usually the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it.
Note: if the test set is also used to select models, it is called a validation set.
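As a concrete illustration (a sketch written for these notes, not from the slides), the following Python snippet runs this train/test procedure with scikit-learn; the Iris data and the decision tree model are arbitrary illustrative choices.

```python
# Minimal sketch of the train/test accuracy-estimation procedure.
# Assumes scikit-learn is installed; dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as a test set; the model never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction
y_pred = model.predict(X_test)                           # model usage

# Compare the known test labels with the classified results.
print("estimated accuracy:", accuracy_score(y_test, y_pred))
```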
Process (1): Model Construction
[Figure: the Training Data are fed to a Classification Algorithm, which produces a Classifier (the model).]

Process (2): Using the Model in Prediction
[Figure: the Classifier is applied first to the Testing Data below, then to Unseen Data, e.g., (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
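A classifier here might be as simple as a single rule. The rule below is a hypothetical example of such a model, sketched in Python; note that it misclassifies Merlisa, which illustrates why accuracy is estimated on a labeled test set before the model is used on unseen data.

```python
# Hypothetical classifier: a single rule of the kind a rule learner
# might induce from training data (the exact rule is an assumption).
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 1: estimate accuracy on the labeled testing data.
correct = sum(tenured(r, y) == label for _, r, y, label in testing_data)
print(f"accuracy: {correct}/{len(testing_data)}")   # 3/4: Merlisa is missed

# Step 2: classify a previously unseen record.
print("Jeff ->", tenured("Professor", 4))           # -> yes
```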
Classification Techniques
Decision Tree-based Methods
Naïve Bayes and Bayesian Belief Networks
Rule-based Methods
Support Vector Machines
Memory-based Reasoning
Neural Networks
Chapter 8. Classification: Basic Concepts
Decision Tree Induction: An Example
Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no

Resulting tree:
age?
  <=30  → student?
            no  → no
            yes → yes
  31…40 → yes
  >40   → credit_rating?
            excellent → no
            fair      → yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
Attributes are categorical (if continuous-valued, they are discretized in advance).
Examples are partitioned recursively based on selected attributes.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), as in the sketch below.
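The following is a minimal sketch of this top-down recursive induction (written for these notes, not part of the slides). It assumes a best_attribute helper that scores attributes, e.g., by the information gain defined in the worked example that follows; records are dicts mapping attribute names to values, with the class label stored under "label".

```python
from collections import Counter

def majority_label(rows):
    """Most common class label among the rows."""
    return Counter(r["label"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, best_attribute):
    """Top-down, recursive, divide-and-conquer tree induction.

    best_attribute(rows, attributes) is assumed to return the
    attribute with the highest information gain.
    """
    labels = {r["label"] for r in rows}
    if len(labels) == 1:               # pure node: stop partitioning
        return labels.pop()
    if not attributes:                 # no attributes left: majority vote
        return majority_label(rows)

    attr = best_attribute(rows, attributes)     # greedy choice
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {r[attr] for r in rows}:       # partition recursively
        subset = [r for r in rows if r[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, best_attribute)
    return tree
```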
The table above presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database. In this example, each attribute is discrete-valued. The class label attribute, buys_computer, has two distinct values (namely, yes and no); therefore, there are two distinct classes (i.e., m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. (|D|, the absolute value of D, denotes the number of tuples in D.)

A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. First, the expected information needed to classify a tuple in D is

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.

Next, we need to compute the expected information requirement for each attribute. Let's start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age:
For the age category "youth" (<=30), there are two yes tuples and three no tuples.
For the category "middle aged" (31…40), there are four yes tuples and zero no tuples.
For the category "senior" (>40), there are three yes tuples and two no tuples.

The expected information needed to classify a tuple in D if the tuples are partitioned according to age is

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 bits,

so Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits.

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute's values. The tuples are then partitioned accordingly, as shown in the resulting tree above.
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
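To make the arithmetic concrete, here is a short Python check of these numbers (a sketch written for these notes, not part of the slides); the dataset is the buys_computer table above.

```python
from math import log2
from collections import Counter

# (age, income, student, credit_rating, buys_computer) -- table above
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def info(rows):
    """Expected information (entropy) of the class label in rows."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    """Information gain of attribute i: Info(D) - Info_attr(D)."""
    split = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        split += len(subset) / len(rows) * info(subset)
    return info(rows) - split

for i, a in enumerate(attrs):
    print(f"Gain({a}) = {gain(data, i):.3f}")
# Prints 0.246, 0.029, 0.151, 0.048 -- age wins, as computed above.
```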
Overfitting
Overfitting occurs when a machine learning model tries to cover all the data points (or more than the required data points) in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance.
Underfitting
Underfitting occurs when a machine learning model is not able to capture the underlying trend of the data, so it performs poorly even on the training data. An underfitted model has high bias and low variance.
Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data:
Too many branches, some of which may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Tree pruning counters this: prepruning halts tree construction early, while postpruning removes branches from a fully grown tree. The sketch below illustrates the effect of limiting tree depth.
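As an illustration (an assumption-laden sketch, not from the slides), the following compares an unrestricted scikit-learn decision tree with a depth-limited one; on noisy data the deeper tree typically fits the training set almost perfectly but generalizes worse.

```python
# Sketch: effect of limiting tree depth (a simple form of prepruning).
# Uses scikit-learn; the synthetic noisy dataset is an arbitrary choice.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)   # flip_y injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):   # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
# The fully grown tree scores ~1.00 on training data but tends to score
# lower on test data than the depth-limited tree: the overfitting
# pattern described above.
```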
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, the naïve Bayesian
classifier, has performance comparable with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
Total probability theorem: P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)
Bayes’ theorem: P(H | X) = P(X | H) P(H) / P(X)
Here X is a data sample whose class label is unknown (e.g., a customer with medium income), and H is a hypothesis such as "X will buy a computer."
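A quick numeric sanity check in Python (the likelihood values below are illustrative assumptions made for these notes, not figures from the slides):

```python
# H = "customer buys a computer", X = "customer has medium income".
p_h = 9 / 14           # prior P(H), e.g., from the AllElectronics table
p_x_given_h = 0.4      # assumed likelihood P(X | H)
p_x_given_not_h = 0.3  # assumed likelihood P(X | not H)

# Total probability theorem: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.3f}")   # posterior belief after seeing X
```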
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)

This greatly reduces the computation cost: only the class distribution needs to be counted.
If A_k is categorical, P(x_k | Ci) is the number of tuples in Ci having value x_k for A_k, divided by |Ci,D| (the number of tuples of Ci in D).
If A_k is continuous-valued, P(x_k | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ,

g(x, μ, σ) = (1 / (√(2π) σ)) e^( -(x - μ)² / (2σ²) ),

and P(x_k | Ci) = g(x_k, μ_Ci, σ_Ci).
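For the continuous case, a minimal sketch of the Gaussian density (the mean and standard deviation below are hypothetical values chosen for illustration):

```python
from math import sqrt, pi, exp

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous attributes."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Assumed example: within class Ci, attribute "age" has mean 38 and
# standard deviation 12 (hypothetical values, not from the slides).
print(f"P(age=35 | Ci) ~ {gaussian(35, 38, 12):.4f}")
```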
Naïve Bayes Classifier: Training Dataset
Classes:
C1: buys_computer = "yes"
C2: buys_computer = "no"
Training data: the AllElectronics table above (age, income, student, credit_rating, buys_computer).
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
       P(buys_computer = "no") = 5/14 = 0.357

Compute P(X|Ci) for each class:
P(age <= 30 | buys_computer = "yes") = 2/9 = 0.222
P(age <= 30 | buys_computer = "no") = 3/5 = 0.600
P(income = medium | buys_computer = "yes") = 4/9 = 0.444
P(income = medium | buys_computer = "no") = 2/5 = 0.400
P(student = yes | buys_computer = "yes") = 6/9 = 0.667
P(student = yes | buys_computer = "no") = 1/5 = 0.200
P(credit_rating = fair | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = fair | buys_computer = "no") = 2/5 = 0.400

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | Ci) × P(Ci): for "yes": 0.044 × 0.643 = 0.028; for "no": 0.019 × 0.357 = 0.007
Therefore, X belongs to class buys_computer = "yes".
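The following Python sketch (written for these notes) reproduces these numbers directly from the training table:

```python
from collections import Counter

# AllElectronics training tuples: (age, income, student, credit_rating, label)
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")    # tuple to classify

priors = Counter(r[-1] for r in data)    # class counts: yes=9, no=5
scores = {}
for c, n_c in priors.items():
    score = n_c / len(data)              # P(Ci)
    for k, value in enumerate(X):        # multiply P(xk | Ci) per attribute
        n_match = sum(1 for r in data if r[-1] == c and r[k] == value)
        score *= n_match / n_c
    scores[c] = score

print(scores)                                            # yes ~0.028, no ~0.007
print("predicted class:", max(scores, key=scores.get))   # -> yes
```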
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables, and such
dependencies cannot be modeled by a Naïve Bayes Classifier.
How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: to classify a test record, compute its distance to the training records and identify its k nearest neighbors.]
Euclidean distance between points p and q:

d(p, q) = √( Σ_i (p_i − q_i)² )

To classify, take a majority vote among the class labels of the k nearest neighbors; optionally, weigh each vote according to distance (e.g., w = 1/d²).
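A minimal k-NN sketch in Python (written for these notes; the tiny 2-D dataset and its labels are made up for illustration):

```python
from math import dist            # Euclidean distance (Python 3.8+)
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to query.

    train: list of (point, label) pairs; point is a tuple of numbers.
    """
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data (illustration only).
train = [((1.0, 1.2), "duck"), ((0.9, 1.0), "duck"),
         ((3.0, 3.1), "goose"), ((3.2, 2.9), "goose")]
print(knn_predict(train, (1.1, 1.1)))   # -> "duck"
```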
Nearest Neighbor Classification…
Scaling issues:
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. For example, in two dimensions,

d = √( (X1 − Y1)² + (X2 − Y2)² ),

so an attribute with a much larger numeric range (e.g., income in dollars versus age in years) dominates the sum unless the attributes are normalized.
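A small numeric illustration of this effect (all values are made up for these notes):

```python
# Without scaling, income (range in the thousands) dominates the
# distance; after min-max scaling, both attributes contribute.
from math import dist

points = {"A": (25, 30_000), "B": (60, 31_000)}   # (age, income)
query = (27, 31_000)

print({k: round(dist(v, query), 1) for k, v in points.items()})
# B looks closer: income alone decides, despite a 33-year age gap.

def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

def scaled(p):   # scale age to [0,1] over 20..70, income over 20k..100k
    return (minmax(p[0], 20, 70), minmax(p[1], 20_000, 100_000))

print({k: round(dist(scaled(v), scaled(query)), 3) for k, v in points.items()})
# After scaling, A (the similar-aged point) is the nearer neighbor.
```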
Evaluation of Machine Learning Model
True Positive (TP): Instances that are actually positive (belong to the positive class) and are correctly identified as positive by the model.
Here's an example to illustrate this and the following concepts. Suppose you have a binary classification model that predicts whether an email is spam (positive class) or not spam (negative class). Now consider the following scenarios:
True Positive (TP):
True Class: The email is actually spam.
Predicted Class: The model correctly predicts that the email is spam.
Example: An email containing typical characteristics of spam (e.g., certain keywords, links, or patterns) is correctly identified as spam by the model.
True Negative (TN): Instances that are actually negative (belong to the negative class) and are correctly identified as negative by the model.
True Class: The email is actually not spam (negative class).
Predicted Class: The model correctly predicts that the email is not spam.
Example: A regular, non-spam email is correctly passed through as not spam by the model.
False Positive (FP): Instances that are actually negative (belong to the negative class) but are incorrectly identified as positive by the model.
True Class: The email is actually not spam (negative class).
Predicted Class: The model incorrectly predicts that the email is spam.
Example: A regular, non-spam email is mistakenly identified as spam by the model due to certain features or patterns in the email that the model misinterprets.
False Negative (FN): Instances that are actually positive (belong to the positive class) but are incorrectly identified as negative by the model.
True Class: The email is actually spam (positive class).
Predicted Class: The model incorrectly predicts that the email is not spam.
Example: A spam email slips through because the model fails to recognize its spam characteristics and labels it as not spam.
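To connect these counts to evaluation metrics, here is a small Python sketch (the label vectors are made up for illustration; 1 = spam, 0 = not spam):

```python
y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # actual classes (illustrative)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # model predictions (illustrative)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # spam caught
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # ham passed
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # ham flagged
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # spam missed

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy  = {(tp + tn) / len(y_true):.2f}")
print(f"precision = {tp / (tp + fp):.2f}")  # of flagged, how many are spam
print(f"recall    = {tp / (tp + fn):.2f}")  # of spam, how many were caught
```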