Week 6 & 7: Classification
(Chapter 8)
Instructor:
Email: maalnasser@iau.edu.sa
Lecture Outline
Classification:
predicts categorical class labels
Prediction:
models continuous-valued functions, i.e., predicts
unknown or missing values
Classification: Definition
Given a collection of records (the training set)
Each record contains a set of attributes; one of the
attributes is the class.
Find a model for the class attribute as a function of the
values of the other attributes.
Goal: previously unseen records should be assigned a
class as accurately as possible.
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with the training set used to
build the model and the test set used to validate it.
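The train/test methodology above can be sketched in a few lines; the toy records and the majority-class baseline below are illustrative inventions, not from the slides:

```python
# Toy illustration of the train/test methodology: fit a trivial
# majority-class baseline on the training set, then estimate its
# accuracy on the held-out test set. All data here is made up.
from collections import Counter

records = [
    # (attribute values..., class label)
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
]

train, test = records[:4], records[4:]   # simple holdout split

# "Model": always predict the most frequent class in the training set.
majority = Counter(r[-1] for r in train).most_common(1)[0][0]
correct = sum(1 for r in test if r[-1] == majority)
accuracy = correct / len(test)
print(majority, accuracy)
```

Any real classifier slots into the same loop: build from `train`, score on `test`.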
Illustrating Classification Task

Training Set (a learning algorithm induces a model from these labeled records):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

Test Set (the induced model is then applied to these unlabeled records):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
Classification—A Two-Step Process

1. Model construction: a classification algorithm learns a classifier from the training data.
2. Model usage: the classifier is applied to testing data and then to unseen data.

Example training data (predicting TENURED):

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  3      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen tuple: (Jeff, Professor, 4) — Tenured?
Classification Process (1):
Model Construction
Classification Process (2):
Use the Model in Prediction
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues (1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues (2):
Evaluating Classification Methods
Predictive accuracy
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Example: Training Dataset
Output: A Decision Tree for
“buys_computer”
Algorithm for Decision
Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
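The basic algorithm above can be sketched as follows, assuming categorical attributes and information gain as the selection measure; the tiny Refund/Marital data set is made up for illustration:

```python
# Sketch of the basic greedy algorithm: top-down, recursive,
# divide-and-conquer induction over categorical attributes,
# selecting the split attribute by information gain.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy before the split minus weighted entropy after it."""
    n = len(labels)
    after = 0.0
    for v in set(r[attr] for r in rows):
        sub = [labels[i] for i, r in enumerate(rows) if r[attr] == v]
        after += len(sub) / n * entropy(sub)
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:      # stop: all samples in the same class
        return labels[0]
    if not attrs:                  # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    # (The "no samples left" stop cannot arise here, because branches are
    # only created for attribute values that actually occur in the data.)
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}}
    for v in set(r[best] for r in rows):   # partition on the best attribute
        sub = [i for i, r in enumerate(rows) if r[best] == v]
        node["branches"][v] = build_tree(
            [rows[i] for i in sub], [labels[i] for i in sub],
            [a for a in attrs if a != best])
    return node

# Made-up training set: the class is "Yes" only when Refund=No, Marital=Single.
rows = [{"Refund": "Yes", "Marital": "Single"},
        {"Refund": "No",  "Marital": "Married"},
        {"Refund": "No",  "Marital": "Single"},
        {"Refund": "Yes", "Marital": "Married"}]
labels = ["No", "No", "Yes", "No"]
tree = build_tree(rows, labels, ["Refund", "Marital"])
print(tree)
```

This sketch omits pruning and the discretization of continuous attributes, which the slides note are handled separately.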
Example of a Decision Tree

Training data schema (attribute types: categorical, categorical, continuous, class label):

Tid  Refund  Marital Status  Taxable Income  Cheat

Refund, Marital Status, and Taxable Income are the candidate splitting attributes.
The induced decision tree model is then applied to the test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
15   No       Large    67K      ?
Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Decision tree:

Refund = Yes → NO
Refund = No → test MarSt:
    MarSt = Single or Divorced → test TaxInc: < 80K → NO, > 80K → YES
    MarSt = Married → NO

Start from the root of the tree: Refund = No, so follow the No branch to MarSt.
Marital Status = Married, so the record reaches the NO leaf. Assign Cheat to "No".
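The traversal above can be written out directly; the function below hard-codes the Refund → MarSt → TaxInc splits from the figure (the figure labels the income branches "< 80K" and "> 80K", so placing exactly 80K on the YES side is my assumption):

```python
# The decision tree from the slides, hard-coded as a function.
# Splits follow the figure: Refund -> MarSt -> TaxInc.
def classify(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                    # Refund = Yes -> NO leaf
    if marital_status == "Married":
        return "No"                    # Married -> NO leaf
    # Single or Divorced: test Taxable Income
    # (exactly 80K is ambiguous in the figure; treated as >= 80K here)
    return "No" if taxable_income < 80 else "Yes"

# Test record from the slide: Refund = No, Married, 80K
print(classify("No", "Married", 80))   # reaches the Married -> NO leaf
```

Because the record takes the Married branch, the TaxInc ambiguity never matters here: the model assigns Cheat = "No".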
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5 (J48 on WEKA)
SLIQ, SPRINT
Decision Tree Induction
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Greedy Algorithm: Generate a Decision Tree
Input:
Data partition, D, which is a set of training tuples and
their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the
splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting
attribute and, possibly, either a split-point or splitting
subset.
Output: A decision tree
Greedy Algorithm: Generate a Decision Tree
Method:
How to determine the best split?
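The worked split-comparison examples from these slides are not preserved here. As a sketch, one common impurity measure for ranking candidate splits is the Gini index: a split is better when its weighted impurity is lower. The label counts below are made up:

```python
# Comparing candidate splits with the Gini index, a common impurity
# measure: Gini(t) = 1 - sum_i p_i^2 over the class proportions p_i.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    """Weighted Gini index of a candidate split (list of label lists)."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)

labels = ["No"] * 7 + ["Yes"] * 3          # parent node: impurity ~0.42
# Candidate A separates the classes fairly well; candidate B does not.
split_a = [["No"] * 6 + ["Yes"], ["No", "Yes", "Yes"]]
split_b = [["No"] * 4 + ["Yes"] * 2, ["No"] * 3 + ["Yes"]]
print(gini(labels))
print(gini_split(split_a), gini_split(split_b))
```

Here split A's weighted impurity is lower than split B's, so a greedy learner would prefer A; information gain ranks splits analogously, using entropy instead of Gini.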
Example 2:
Bayes’ Theorem: P(H | X) = P(X | H) P(H) / P(X),
where H is a hypothesis (e.g., that a tuple belongs to class C) and X is the observed data tuple.
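A quick numerical check of the theorem; the probabilities below are invented for illustration, not taken from the chapter:

```python
# Numerical illustration of Bayes' theorem:
#   P(H | X) = P(X | H) * P(H) / P(X)
# All probabilities here are made-up example values.
p_h = 0.3            # prior:      P(H), e.g. P(buys_computer = yes)
p_x_given_h = 0.6    # likelihood: P(X | H)
p_x = 0.4            # evidence:   P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior, ~0.45
print(p_h_given_x)
```

The naive Bayesian classifier built on this theorem picks the class H that maximizes P(X | H) P(H), since P(X) is the same for all classes.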
Confusion Matrix:

Actual class \ Predicted class   YES                   NO
YES                              True Positives (TP)   False Negatives (FN)
NO                               False Positives (FP)  True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures

Precision (exactness): Precision = TP / (TP + FP)
Recall (completeness): Recall = TP / (TP + FN)
F-measure (F1): the harmonic mean of precision and recall,
F1 = 2 × Precision × Recall / (Precision + Recall)

Classifier Evaluation Metrics: Example
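With the TP/FN/FP/TN counts from a confusion matrix, the metrics follow directly; the counts below are illustrative, not from the slides:

```python
# Precision, recall, and F1 computed from confusion-matrix counts.
# The counts are made-up example values.
tp, fn, fp, tn = 90, 210, 140, 9560

precision = tp / (tp + fp)                 # exactness
recall = tp / (tp + fn)                    # completeness (sensitivity)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```

Note how accuracy looks high here only because the negative class dominates; precision and recall expose the weak performance on the positive class.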
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets
D1, …, Dk, each of approximately equal size
At the i-th iteration, use Di as the test set and the remaining
subsets together as the training set
Leave-one-out: k folds where k = # of tuples, for small sized data
*Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
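The k-fold procedure above can be sketched as follows; forming folds by index striping after a shuffle is one simple choice, not prescribed by the slides:

```python
# Sketch of k-fold cross-validation: randomly partition the data into
# k mutually exclusive folds of roughly equal size; at iteration i,
# fold i is the test set and the other k-1 folds form the training set.
import random

def k_fold_splits(data, k, seed=0):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)       # random partition
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    assert set(train) | set(test) == set(range(10))  # every record used
    assert not set(train) & set(test)                # mutually exclusive
```

Leave-one-out is the special case k = len(data); stratified variants additionally shuffle within each class so every fold preserves the class distribution.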
Cross-Validation Method
Summary