
Data Mining: Concepts and Techniques

Classification

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Classification
• Classification
– predicts categorical (discrete or nominal) class labels
– constructs a model from a training set whose tuples carry class labels in a classifying attribute, then uses that model to classify new data


Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model
• The known label of each test sample is compared with the classification produced by the model
• The accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set must be independent of the training set; otherwise overfitting will occur
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a minimal code sketch of both steps follows below)
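To make the two steps concrete, here is a minimal sketch using scikit-learn (assumed to be installed) and its bundled iris data as a stand-in training set; this illustrates the workflow only and is not part of the original slides.

```python
# A minimal sketch of the two-step process, assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set,
# then classify new tuples whose labels are unknown.
y_pred = model.predict(X_test)
print("accuracy on test set:", accuracy_score(y_test, y_pred))
```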
Process (1): Model Construction
The training data is fed to a classification algorithm, which produces the classifier (model).

Training data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Learned model (classification rule):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
The classifier (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') is applied to the testing data and then to unseen data.

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
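The learned rule can be applied by hand. The sketch below hard-codes the rule from the slide and runs it over the testing data and the unseen tuple; the function name predict_tenured is an illustrative choice, not something from the slides.

```python
# The rule learned from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data from the slide (known labels), used to estimate accuracy.
test_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_data)
print("accuracy:", correct / len(test_data))     # 3 of 4 correct -> 0.75

# Unseen tuple: (Jeff, Professor, 4)
print("Jeff tenured?", predict_tenured("Professor", 4))   # 'yes'
```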
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluating Classification Methods
• Accuracy
– classifier accuracy: how well the model predicts the class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency on disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules


Decision Tree Induction: An Example

Training data set: Buys_computer
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31...40 high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31...40 low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31...40 medium  no       excellent      yes
31...40 high    yes      fair           yes
>40     medium  no       excellent      no

Example tuple to classify: [age > 40, income = high, student = yes, credit_rating = fair]

Resulting tree:
age?
  <=30    → student?        (no → no, yes → yes)
  31..40  → yes
  >40     → credit rating?  (excellent → no, fair → yes)
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no samples left
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
(A compact code sketch of this recursion follows below.)
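The sketch below is one possible rendering of this greedy recursion. It assumes each example is a dict with a 'label' key and that the attribute-selection heuristic (e.g., information gain) is passed in as a callable; both representations are illustrative choices, not part of the slides.

```python
# A compact sketch of the greedy, top-down, divide-and-conquer algorithm.
from collections import Counter

def build_tree(examples, attributes, select_attribute):
    labels = [ex["label"] for ex in examples]
    if len(set(labels)) == 1:            # all samples at the node share one class
        return labels[0]
    if not attributes:                   # no attributes left -> majority voting
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(examples, attributes)   # heuristic choice (e.g., info gain)
    rest = [a for a in attributes if a != best]
    children = {}
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        children[value] = build_tree(subset, rest, select_attribute)
    return (best, children)              # internal node: (test attribute, branches)
```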
Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

• Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

• Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
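These three formulas translate almost directly into code. The sketch below is one possible rendering, assuming each tuple in D is represented as a dict with a 'label' key (an illustrative representation).

```python
# Info(D), Info_A(D), and Gain(A) coded directly from the formulas above.
from collections import Counter
from math import log2

def info(D):                       # expected information (entropy) of D
    n = len(D)
    return -sum((c / n) * log2(c / n)
                for c in Counter(t["label"] for t in D).values())

def info_A(D, A):                  # information after splitting D on attribute A
    n = len(D)
    partitions = {}
    for t in D:
        partitions.setdefault(t[A], []).append(t)
    return sum(len(Dj) / n * info(Dj) for Dj in partitions.values())

def gain(D, A):
    return info(D) - info_A(D, A)
```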
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

  Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

  Gain(age) = Info(D) - Info_{age}(D) = 0.246

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31...40  4    0    0
>40      3    2    0.971

Similarly, on the same training data:
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
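For a self-contained check, the following script recomputes the four gains on the 14-tuple training data; the tiny differences from the slide (0.247 vs. 0.246, 0.152 vs. 0.151) are only rounding.

```python
# Recomputing the information gains for the buys_computer training data.
from collections import Counter
from math import log2

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),   ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),   (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),  (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in rows]
    split = {}
    for r in rows:
        split.setdefault(r[col], []).append(r[-1])
    after = sum(len(part) / len(rows) * entropy(part) for part in split.values())
    return entropy(labels) - after

for i, name in enumerate(attrs):
    print(name, round(gain(i), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
```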
Decision tree (after the first split)
Since age has the highest information gain, it becomes the root test, and the training data is partitioned into three branches:

age?
  <=30
  31..40  → yes (every tuple in this partition buys a computer)
  >40
Output: A Decision Tree for "buys_computer"

age?
  <=30    → student?        (no → no, yes → yes)
  31..40  → yes
  >40     → credit rating?  (excellent → no, fair → yes)


Practice

Training data:
student  Gender  Rating     Class label
Yes      M       Excellent  Y
Yes      M       Fair       Y
Yes      F       Fair       Y
Yes      F       Good       N
No       M       Fair       N
No       F       Excellent  N

Lookup table of log2(numerator/denominator):
denominator \ numerator   1      2      3      4      5      6
2                        -1      0
3                        -1.60  -0.58   0
4                        -2     -1     -0.42   0
5                        -2.32  -1.32  -0.74  -0.32   0
6                        -2.58  -1.6   -1     -0.58  -0.26   0
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values

Student id  Gender  Rating     Class label
1           M       Excellent  Y
2           M       Fair       N
3           F       Fair       Y
4           F       Good       Y
5           M       Fair       N
6           F       Excellent  N

Splitting on Student id produces one single-tuple (pure) partition per id (1 → Y, 2 → N, 3 → Y, 4 → Y, 5 → N, 6 → N), so its information gain is maximal even though the attribute is useless for classifying new students.
Gain Ratio for Attribute Selection (C4.5)
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain)
– The gain ratio takes into account the number and size of the child nodes into which an attribute splits the data set:

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

– GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. income splits D into High (4 tuples), Medium (6), and Low (4), giving SplitInfo_income(D) = 1.557, so
  gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
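The SplitInfo and gain-ratio numbers for income can be reproduced with a few lines; the 4/6/4 partition sizes come from the table above.

```python
# SplitInfo_income(D) and GainRatio(income), following the formulas above.
from math import log2

sizes = {"high": 4, "medium": 6, "low": 4}
n = sum(sizes.values())
split_info = -sum(s / n * log2(s / n) for s in sizes.values())
print(round(split_info, 3))             # about 1.557
print(round(0.029 / split_info, 3))     # gain_ratio(income) about 0.019
```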
Exercise

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

Using the buys_computer training data (14 tuples, shown earlier):
• Compute the gain ratio of age
• Compute the gain ratio of student

Given:
  Gain(age) = Info(D) - Info_{age}(D) = 0.246
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

Lookup table:
x        3/14   4/14   5/14   6/14   7/14
log2(x)  -2.22  -1.81  -1.49  -1.22  -1
Gain Ratio for Attribute Selection (C4.5)
• Unfortunately, in some situations the gain ratio modification overcompensates and can lead to preferring an attribute just because its intrinsic information is much lower than for the other attributes.
– GainRatio(A) = Gain(A) / SplitInfo_A(D), where

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)

  A1: High 99, Low 1   → SplitInfo = 0.0763
  A2: High 50, Low 50  → SplitInfo = 1

• A standard fix is to choose the attribute that maximizes the gain ratio, provided that the information gain for that attribute is at least as great as the average information gain over all the attributes examined.
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index (impurity) gini(D) is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in D

student  Gender  Rating     Class label
Yes      M       Excellent  Y
Yes      M       Fair       Y
Yes      F       Fair       Y
Yes      F       Good       N
No       M       Fair       N
No       F       Excellent  N

  gini(D) = 1 - (0.5^2 + 0.5^2) = 0.5
Gini Index (CART, IBM IntelligentMiner)
• With the same definition, a data set in which every example belongs to one class is pure and has a gini index of 0:

student  Gender  Rating     Class label
Yes      M       Excellent  Y
Yes      M       Fair       Y
Yes      F       Fair       Y
Yes      F       Good       Y
No       M       Fair       Y
No       F       Excellent  Y

  gini(D) = 1 - (1^2 + 0^2) = 0
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

• Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

• The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute)
Computation of Gini Index
• D (the buys_computer training data) has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

• Splitting on student:

student  p_i  n_i  gini
yes      6    1    1 - (6/7)^2 - (1/7)^2 = 0.246
no       3    4    1 - (3/7)^2 - (4/7)^2 = 0.490

  gini_student(D) = (7/14) × 0.246 + (7/14) × 0.490 = 0.368
  Δgini(student) = 0.459 - 0.368 = 0.091
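The same numbers can be reproduced with a short script; the tiny discrepancies (0.367 vs. 0.368, 0.092 vs. 0.091) come from rounding on the slide.

```python
# Gini index computation for the binary split on student.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

whole = ["yes"] * 9 + ["no"] * 5        # 9 buys_computer = yes, 5 = no
d_yes = ["yes"] * 6 + ["no"] * 1        # student = yes: 6 yes, 1 no
d_no  = ["yes"] * 3 + ["no"] * 4        # student = no : 3 yes, 4 no

gini_split = (7 / 14) * gini(d_yes) + (7 / 14) * gini(d_no)
print(round(gini(whole), 3))            # 0.459
print(round(gini_split, 3))             # 0.367 (slide rounds to 0.368)
print(round(gini(whole) - gini_split, 3))   # reduction in impurity, about 0.092
```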
Computation of Gini Index
• Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  gini_{income \in \{low, medium\}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2)

• Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest Gini index.
Candidate binary splits on income and their Gini index values:

  {low, high} vs. {medium}:   0.458
  {medium, high} vs. {low}:   0.450
  {low, medium} vs. {high}:   lowest of the three, so the node tests income? with branches {low, medium} and {high}

  \Delta gini(A) = gini(D) - gini_A(D)
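As a check on these values, the sketch below enumerates all three two-way groupings of income and computes each weighted Gini index from the per-value class counts read off the training data; the 0.443 it reports for {low, medium} vs. {high} is computed, not stated on the slide.

```python
# Weighted Gini index of every binary grouping of income on buys_computer.
from itertools import combinations

# Class counts (yes, no) per income value, taken from the training data.
counts = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}
total = sum(y + n for y, n in counts.values())

def gini(yes, no):
    n = yes + no
    return 1 - (yes / n) ** 2 - (no / n) ** 2

def weighted_gini(values):
    # Contribution of the partition that holds the given income values.
    y = sum(counts[v][0] for v in values)
    n = sum(counts[v][1] for v in values)
    return (y + n) / total * gini(y, n)

for group in combinations(counts, 2):        # one side of the binary split
    rest = [v for v in counts if v not in group]
    score = weighted_gini(group) + weighted_gini(rest)
    print(sorted(group), "vs", sorted(rest), round(score, 3))
# ['low', 'medium'] vs ['high']   0.443   <- lowest, the chosen split
# ['high', 'low']   vs ['medium'] 0.458
# ['high', 'medium'] vs ['low']   0.450
```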


Exercise
The Gini index considers a binary split for each attribute. Find the best binary split on attribute Income.

Income  Age    Credit  Class label
l       20-29  good    T
l       30-39  good    F
l       40-49  good    F
m       20-29  fair    F
m       30-39  fair    F
m       40-49  fair    T
h       40-49  good    F
h       20-29  good    T
h       30-39  fair    T
h       40-49  fair    T
Comparing Attribute Selection Measures
• The three measures, in general, return good results, but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller than the others
– Gini index:
• biased towards multivalued attributes
• tends to favor tests that result in equal-sized partitions with purity in both partitions
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples

[Figure: "Overfitting", a decision-boundary illustration annotated with 3 errors]
Overfitting and Tree Pruning
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the "best pruned tree"
Pre-Pruning (example)
Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold.

[Figure: a node whose split is halted; its region contains 3 blue dots and 1 orange dot]

Post-Pruning (example)
[Figure only]
Overfitting and Tree Pruning (recap)
• Prepruning: halt tree construction early; it is difficult to choose an appropriate threshold
• Postpruning: remove branches from a fully grown tree and use data held out from training to pick the best pruned tree (a scikit-learn sketch follows below)
• Video walkthrough: https://youtu.be/u4kbPtiVVB8
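One concrete way to carry out postpruning is scikit-learn's minimal cost-complexity pruning. The sketch below (using a bundled data set purely for illustration) grows a full tree, generates a sequence of progressively pruned trees, and keeps the one that scores best on held-out data; this is one particular postpruning technique, not the only one described above.

```python
# Postpruning via minimal cost-complexity pruning, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the full tree, then get the pruning schedule (one alpha per pruned tree).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# A sequence of progressively pruned trees; keep the one that does best
# on the held-out validation set.
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
          for a in alphas]
best = max(pruned, key=lambda t: t.score(X_val, y_val))
print("best pruned tree:", best.tree_.node_count, "nodes,",
      "validation accuracy", round(best.score(X_val, y_val), 3))
```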
Classification in Large Databases
• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
• Why is decision tree induction popular?
– relatively fast learning speed (compared with other classification methods)
– convertible to simple and easy-to-understand classification rules
– classification accuracy comparable with other methods
Entropy
• Categories: a, b

  Entropy(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)

– S1 = {a, a, a, a, a, a}:
  Entropy(S1) = -(1 \cdot \log_2 1 + 0 \cdot \log_2 0) = 0
– S2 = {a, a, a, b, b, b}:
  Entropy(S2) = -(0.5 \cdot \log_2 0.5 + 0.5 \cdot \log_2 0.5) = 1
Example
Age  att1  att2  Class label
5    ...   ...   N
10   ...   ...   N
20   ...   ...   Y
38   ...   ...   Y
52   ...   ...   N

Split T1: separate the first tuple (Age = 5) from the remaining four, i.e., split at the midpoint 7.5 of 5 and 10:
  E1 = -(1 \cdot \log_2 1 + 0 \cdot \log_2 0) = 0
  E2 = -(0.5 \cdot \log_2 0.5 + 0.5 \cdot \log_2 0.5) = 1
  I(S, T1) = (1/5) \cdot 0 + (4/5) \cdot 1 = 0.8
Example
Age  att1  att2  Class label
5    ...   ...   N
10   ...   ...   N
20   ...   ...   Y
38   ...   ...   Y
52   ...   ...   N

Split T2: separate the first two tuples (Age = 5, 10) from the remaining three, i.e., split at the midpoint 15 of 10 and 20:
  E1 = -(1 \cdot \log_2 1 + 0 \cdot \log_2 0) = 0
  E2 = -(\frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3}) = 0.92
  I(S, T2) = (2/5) \cdot 0 + (3/5) \cdot 0.92 = 0.552
Computing Information-Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values is considered as a possible split point
• (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
– The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
(A code sketch of this procedure, applied to the Age example above, follows below.)
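Applied to the Age data from the two worked examples, the procedure looks like the sketch below; the 0.551 it reports is the slide's 0.552 up to rounding.

```python
# Best split point for the continuous attribute Age: sort the values, try
# each midpoint of adjacent values, keep the one with the minimum expected
# information requirement.
from collections import Counter
from math import log2

data = [(5, "N"), (10, "N"), (20, "Y"), (38, "Y"), (52, "N")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

data.sort()
best = None
for (a, _), (b, _) in zip(data, data[1:]):
    split = (a + b) / 2                    # midpoint of adjacent values
    left = [lab for v, lab in data if v <= split]
    right = [lab for v, lab in data if v > split]
    expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
    if best is None or expected < best[1]:
        best = (split, expected)

print(best)   # (15.0, 0.551...): the split between 10 and 20, as in the example
```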
