
CIS 517: Data Mining and Warehousing
Week 6 & 7: Classification (Chapter 8)

Instructor:
Email: maalnasser@iau.edu.sa
Lecture Outline

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Rule-based classification
 Classification accuracy
 Summary
Classification vs. Prediction

 Classification:
  predicts categorical class labels
  constructs a model from the training set and the values (class labels) of a classifying attribute, then uses it to classify new data
 Prediction:
  models continuous-valued functions, i.e., predicts unknown or missing values
Classification: Definition

 Given a collection of records (the training set)
  Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

A learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction) to assign class labels to the unlabeled records.
Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
  The known label of each test sample is compared with the classified result from the model
  The accuracy rate is the percentage of test set samples that are correctly classified by the model
  The test set must be independent of the training set, otherwise over-fitting will occur
Classification Process (1): Model Construction

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm is run over the training data to produce the classifier (model), e.g.:

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  3      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4). Tenured?
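As a minimal sketch (not from the slides), the learned rule can be applied to an unseen tuple directly; the function name and argument layout are illustrative assumptions:

```python
# Sketch: apply the learned rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
# Function name and argument layout are illustrative, not from the slides.
def predict_tenured(name: str, rank: str, years: int) -> str:
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"

print(predict_tenured("Jeff", "Professor", 4))  # -> 'yes' (satisfies the rank test)
```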
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
 Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues (1): Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data
Issues (2): Evaluating Classification Methods

 Predictive accuracy
 Time to construct the model
 Time to use the model
 Handling noise and missing values
 Decision tree size
 Compactness of classification rules
Classification: Measure the Quality

 Usually the Accuracy measure is used:

   Accuracy = (number of correctly classified test records) / (total number of test records)
Classification Techniques

 Decision Tree based Methods
 Rule-based Methods
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines (SVM)
Classification by Decision Tree Induction

 Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class distribution
 Decision tree generation consists of two phases
 Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
 Tree pruning
 Identify and remove branches that reflect noise or outliers
 Use of decision tree: Classifying an unknown sample
 Test the attribute values of the sample against the decision tree
Example: Training Dataset
Output: A Decision Tree for
“buys_computer”
Algorithm for Decision
Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
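As a minimal Python sketch of this basic algorithm, assuming categorical attributes stored as a list of dicts with a "class" key and a pluggable attribute selection measure (these layout choices are illustrative, not from the slides):

```python
from collections import Counter

def build_tree(examples, attributes, select_attribute):
    """Top-down recursive divide-and-conquer construction (sketch)."""
    classes = [ex["class"] for ex in examples]
    # Stop: all samples at this node belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop: no remaining attributes -> majority voting labels the leaf.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Select the test attribute by a heuristic measure, e.g. information gain.
    best = select_attribute(examples, attributes)
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the examples on each observed value of the chosen attribute
    # (iterating observed values only, so no branch is left without samples).
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree["branches"][value] = build_tree(subset, remaining, select_attribute)
    return tree
```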
Example of a Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree; splitting attributes in order):
- Refund = Yes -> NO
- Refund = No -> test MarSt:
   - MarSt = Married -> NO
   - MarSt = Single or Divorced -> test TaxInc:
      - TaxInc < 80K -> NO
      - TaxInc > 80K -> YES
Another Example of a Decision Tree

Built from the same training data, but splitting on MarSt first:
- MarSt = Married -> NO
- MarSt = Single or Divorced -> test Refund:
   - Refund = Yes -> NO
   - Refund = No -> test TaxInc:
      - TaxInc < 80K -> NO
      - TaxInc > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The same two-step flow as before, specialized to trees: a tree induction algorithm learns a decision tree model from the training set (Tids 1–10 above), and deduction applies that tree to the test set (Tids 11–15) to assign the unknown class labels.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
1. The root tests Refund; the record has Refund = No, so follow the No branch to the MarSt node.
2. MarSt tests Marital Status; the record has Married, so follow the Married branch, which ends in a leaf.
3. The leaf predicts NO, so assign Cheat = “No”.
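Written out as code, the fitted example tree is just nested conditionals; this sketch (not from the slides) reproduces the classification of the record above:

```python
# The example tree as nested conditionals (illustrative sketch, not from the slides).
def predict_cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: fall through to the income test.
    return "Yes" if taxable_income > 80_000 else "No"

print(predict_cheat("No", "Married", 80_000))  # -> 'No'
```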
Decision Tree Induction

 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5 (J48 on WEKA)
 SLIQ, SPRINT
Decision Tree Induction

 Greedy strategy (heuristic method)
  Split the records based on an attribute test that optimizes a certain criterion.

 Issues
  Determine how to split the records
   How to specify the attribute test condition?
   How to determine the best split?
  Determine when to stop splitting
Greedy Algorithm: Generate a Decision Tree
 Input:
 Data partition, D, which is a set of training tuples and
their associated class labels;
 attribute list, the set of candidate attributes;
 Attribute selection method, a procedure to determine the
splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting
attribute and, possibly, either a split-point or splitting
subset.
 Output: A decision tree
Greedy Algorithm: Generate a Decision Tree

 Method: apply the attribute selection method to the current partition, split on the chosen attribute, and recurse on each resulting partition. The central question at every node:

How to determine the best split?
Example 2:

 New data: Age = 30, Student = Yes, Income = 4000. Buy_computer = ?
Attribute Selection Measures
 Heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of
class-labeled training tuples into individual
classes.
 Three popular attribute selection measures:
1. Information gain.
2. Gain ratio.
3. Gini index.
Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain.
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

   Info(D) = - sum_{i=1..m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

   Info_A(D) = sum_{j=1..v} (|D_j| / |D|) * Info(D_j)

 Information gained by branching on attribute A:

   Gain(A) = Info(D) - Info_A(D)
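These measures translate directly into code. A minimal sketch, assuming categorical attributes and the same list-of-dicts layout as the induction sketch above:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    info_d = entropy([ex["class"] for ex in examples])
    info_a = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex["class"] for ex in examples if ex[attribute] == value]
        info_a += (len(subset) / len(examples)) * entropy(subset)
    return info_d - info_a
```

A selection function for the earlier build_tree sketch can then simply pick the attribute maximizing information_gain.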
Attribute Selection Measure:
Information Gain (ID3)
Example: Induction of a decision tree using information gain.
The table presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.
Example: Induction of a decision tree using information gain

1. Compute the expected information needed to classify a tuple in D. The class label attribute, buys_computer, has two distinct values (yes, no), so there are m = 2 distinct classes. With 9 “yes” and 5 “no” tuples:

   Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits

2. Compute the expected information requirement for each attribute. Let’s start with the attribute age; weighting each age partition by its size gives Info_age(D) = 0.694 bits.
Example: Induction of a decision tree using information gain

3. The gain in information from such a partitioning would be:

   Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits

 Similarly, Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
 Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
 Node N is labeled with age, and branches are grown for each of the attribute’s values.
Example: Induction of a decision tree using information gain
Rule-Based Classification
Using IF-THEN Rules for Classification
Example: Rule accuracy and coverage

 Let’s go back to our data in the table: class-labeled tuples from the AllElectronics customer database. Our task is to predict whether a customer will buy a computer.
 Consider rule R1 (IF age = youth AND student = yes THEN buys_computer = yes), which covers 2 of the 14 tuples.
 It can correctly classify both tuples.
  Coverage(R1) = 2/14 = 14.28%
  Accuracy(R1) = 2/2 = 100%
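A small sketch computing these two measures for an arbitrary rule; the rule-as-predicate representation is an assumption, not from the slides:

```python
# Sketch: coverage and accuracy of an IF-THEN rule over a labeled dataset.
def rule_quality(examples, condition, predicted_class):
    covered = [ex for ex in examples if condition(ex)]  # tuples the rule covers
    correct = sum(ex["class"] == predicted_class for ex in covered)
    coverage = len(covered) / len(examples)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
r1 = lambda ex: ex["age"] == "youth" and ex["student"] == "yes"
# On the 14-tuple dataset: coverage = 2/14 = 14.28%, accuracy = 2/2 = 100%.
```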
Rule Extraction from a Decision Tree
Naïve Bayes Classification
Bayes Classification
 A statistical classifier: performs probabilistic prediction, i.e., predicts
class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior
knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured
Bayes’ Theorem: Basics

 Bayes’ Theorem:

   P(H | X) = P(X | H) P(H) / P(X)

 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 P(H) (prior probability): the initial probability
  E.g., X will buy a computer, regardless of age, income, …
Naïve Bayes Classifier

Example of Naïve Bayes Classifier: Training Dataset
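A compact sketch of a categorical naïve Bayes classifier in the spirit of this example; the data layout is the same illustrative list-of-dicts assumption as before, and no smoothing is applied:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, attributes):
    """Estimate the prior P(Ci) and the conditionals P(xk | Ci) by counting."""
    class_counts = Counter(ex["class"] for ex in examples)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    for ex in examples:
        for a in attributes:
            cond_counts[(a, ex["class"])][ex[a]] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, attributes, priors, cond_counts, class_counts):
    """Return the class maximizing P(C) * prod_k P(x_k | C) (conditional independence)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for a in attributes:
            # Unseen attribute values get probability 0 here (no Laplace correction).
            p *= cond_counts[(a, c)][x[a]] / class_counts[c]
        scores[c] = p
    return max(scores, key=scores.get)
```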
Model Evaluation and Selection
 Evaluation metrics: How can we measure accuracy?
Other metrics to consider?
 What if we have more than one classifier and want to
choose the “best” one? This is referred to as model
selection.
 Use validation test set of class-labeled tuples instead
of training set when assessing accuracy.
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
Metrics for Evaluating Classifier Performance: Confusion Matrix

Confusion Matrix:
Actual class \ Predicted class   YES                    NO
YES                              True Positives (TP)    False Negatives (FN)
NO                               False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

 Given m classes, an entry CMi,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P   yes   no
yes     TP    FN   | P
no      FP    TN   | N
        P’    N’   | All

 Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
   Accuracy = (TP + TN) / All
 Error rate: 1 - accuracy, or
   Error rate = (FP + FN) / All

 Class Imbalance Problem:
  One class may be rare, e.g. fraud, or HIV-positive
  Significant majority of the negative class and minority of the positive class
  Sensitivity: True Positive recognition rate
    Sensitivity = TP / P
  Specificity: True Negative recognition rate
    Specificity = TN / N
Classifier Evaluation Metrics: Precision and Recall, and F-measures

 Precision: exactness. What % of tuples that the classifier labeled as positive are actually positive?

   Precision = TP / (TP + FP)

 Recall: completeness. What % of positive tuples did the classifier label as positive?

   Recall = TP / (TP + FN)

 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and recall:

   F1 = 2 * Precision * Recall / (Precision + Recall)
Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

 Precision = 90/230 = 39.13%
 Recall = 90/300 = 30.00%
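These values follow mechanically from the four confusion-matrix cells; a sketch reproducing the example above (variable names are illustrative):

```python
# Sketch: derive the evaluation metrics from the cancer confusion matrix above.
TP, FN = 90, 210    # actual yes: predicted yes / predicted no
FP, TN = 140, 9560  # actual no:  predicted yes / predicted no
P, N = TP + FN, FP + TN
ALL = P + N

accuracy    = (TP + TN) / ALL  # 0.9650
error_rate  = (FP + FN) / ALL  # 0.0350
sensitivity = TP / P           # 0.3000 (= recall)
specificity = TN / N           # 0.9856
precision   = TP / (TP + FP)   # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.3396
```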
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods

 Holdout method
  The given data is randomly partitioned into two independent sets
   Training set (e.g., 2/3) for model construction
   Test set (e.g., 1/3) for accuracy estimation
  Random subsampling: a variation of holdout in which the holdout method is repeated several times and the accuracies are averaged
 Cross-validation (k-fold, where k = 10 is most popular)
  Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
  At the i-th iteration, use Di as the test set and the remaining subsets as the training set
  Leave-one-out: k folds where k = number of tuples, for small-sized data
  Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
Cross-Validation Method
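A minimal sketch of k-fold cross-validation under these definitions; the train_fn/predict_fn interface is an illustrative assumption:

```python
import random

def k_fold_cross_validation(examples, k, train_fn, predict_fn, seed=0):
    """Estimate classifier accuracy by k-fold cross-validation (sketch)."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k mutually exclusive, near-equal subsets
    correct = total = 0
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        correct += sum(predict_fn(model, ex) == ex["class"] for ex in test)
        total += len(test)
    return correct / total  # overall accuracy across all k test folds
```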
Summary

 Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
 Classification is probably one of the most widely used data mining techniques, with many extensions
 Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic
 Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
