[Figure: a classification algorithm builds a model from the training data; the model is then applied to testing data and, finally, to unseen data such as the tuple (Jeff, Professor, 4), asking "Tenured?"]

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Supervised vs. Unsupervised Learning
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no samples left
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
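A minimal Python sketch of this greedy, top-down induction (an illustrative reading of the algorithm above, not the textbook's code; the dict-based tree representation and helper names are my own):

import math
from collections import Counter

def entropy(labels):
    # expected information (entropy) of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D) for a categorical attribute
    n = len(labels)
    info_a = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a

def build_tree(rows, labels, attrs):
    # top-down recursive divide-and-conquer induction
    if len(set(labels)) == 1:                    # all samples belong to one class
        return labels[0]
    if not attrs:                                # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(r[best] for r in rows):     # partition on the selected attribute
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return node

On the buys_computer table shown later in these slides, each row would be a dict such as {"age": "<=30", "income": "high", "student": "no", "credit_rating": "fair"}; the sketch then selects age at the root, in line with the worked example.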
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Information needed (after using A to split D into v partitions) to classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Information gained by branching on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
Attribute Selection: Information Gain
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Training data D (class-labeled tuples, class buys_computer):

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
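As a worked check against this table (9 "yes" and 5 "no" tuples overall; the three age groups contain 2/3, 4/0 and 3/2 yes/no tuples respectively; values rounded to three decimals):
$$Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
$$Info_{age}(D) = \frac{5}{14}\,I(2,3) + \frac{4}{14}\,I(4,0) + \frac{5}{14}\,I(3,2) = 0.694$$
$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
where I(a, b) denotes the expected information of a partition holding a "yes" and b "no" tuples.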
Output: A Decision Tree for "buys_computer"

[Figure: the induced tree; the root tests age? with branches <=30, 31…40 and >40, and leaves labeled yes/no]
Reference values of log2(numerator / denominator) for hand calculation:

denominator \ numerator   1       2       3       4       5       6
2                         -1.00    0.00
3                         -1.58   -0.58    0.00
4                         -2.00   -1.00   -0.42    0.00
5                         -2.32   -1.32   -0.74   -0.32    0.00
6                         -2.58   -1.58   -1.00   -0.58   -0.26    0.00
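The table can be regenerated with a couple of lines of Python (the formatting choices are mine):

import math

# print log2(numerator/denominator) for each row of the lookup table above
for den in range(2, 7):
    row = " ".join(f"{math.log2(num / den):6.2f}" for num in range(1, den + 1))
    print(f"denominator {den}: {row}")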
Gain Ratio for Attribute Selection (C4.5)

[Figure: splitting on a student id attribute; each of the values 1…6 gets its own branch holding a single tuple (leaf labels Y/N)]
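Why such a split looks artificially good: every branch holds a single tuple and is therefore pure, so
$$Info_{student\_id}(D) = \sum_{j=1}^{6} \frac{1}{6} \times 0 = 0 \quad\Rightarrow\quad Gain(student\_id) = Info(D),$$
the largest gain any attribute can achieve, even though the split is useless for prediction.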
Gain Ratio for Attribute Selection (C4.5)
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain)
– The gain ratio is derived by taking into account the number and size of the children nodes into which an attribute splits the dataset:
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
– GainRatio(A) = Gain(A) / SplitInfo(A)
• Ex.: income splits the 14 tuples into partitions of size 4 (High), 6 (Medium) and 4 (Low)
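A quick numeric check of SplitInfo for this income split (a sketch; the helper name is mine):

import math

def split_info(sizes):
    # SplitInfo_A(D) for a split of D into partitions of the given sizes
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes)

print(split_info([4, 6, 4]))   # ~1.557 for the High/Medium/Low split above

GainRatio(income) is then Gain(income) divided by this value.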
Gain Ratio for Attribute Selection (C4.5)
• Unfortunately, in some situations the gain ratio modification overcompensates and can lead to preferring an attribute just because its intrinsic information is much lower than for the other attributes.
– GainRatio(A) = Gain(A) / SplitInfo(A)
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
  A1: High 99, Low 1    SplitInfo = 0.0763
  A2: High 50, Low 50   SplitInfo = 1
Student

student   # yes (pi)   # no (ni)   gini
yes       6            1           1 - (6/7)² - (1/7)² = 0.246
no        3            4           1 - (3/7)² - (4/7)² = 0.490

gini_student(D) = (7/14) × 0.246 + (7/14) × 0.490 = 0.368
Δgini(student) = 0.459 - 0.368 = 0.091
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
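For a quick numeric check of this value:

print(1 - (9 / 14) ** 2 - (5 / 14) ** 2)   # 0.4591..., the gini(D) above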
• Suppose the attribute income partitions D into D1 (income in {low, medium}, 10 tuples of the training data above) and D2 (income in {high}, 4 tuples):
$$gini_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\,Gini(D_1) + \frac{4}{14}\,Gini(D_2)$$
• Gini index of the other binary splits on income:
  {(low, high), medium}: 0.458
  {(medium, high), low}: 0.450
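The three candidate binary splits on income can be checked from the per-value class counts in the training table (high: 2 yes / 2 no, medium: 4 yes / 2 no, low: 3 yes / 1 no); the helpers below are an illustrative sketch, not CART itself:

def gini(counts):
    # Gini impurity of a node with the given class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_gini(*groups):
    # Gini index of a binary split, given the (yes, no) counts of each side
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

print(split_gini((7, 3), (2, 2)))   # {low, medium} vs {high}
print(split_gini((5, 3), (4, 2)))   # {low, high} vs {medium}, ~0.458
print(split_gini((6, 4), (3, 1)))   # {medium, high} vs {low}, ~0.450

The split with the lowest value is the best binary split on income under the Gini criterion.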
l 20-29 good T
l 30-39 good F
l 40-49 good F
m 20-29 fair F
m 30-39 fair F
m 40-49 fair T
h 40-49 good F
h 20-29 good T
h 30-39 fair T
h 40-49 fair T
Comparing Attribute Selection Measures
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
Overfitting
[Figure: overfitting example (3 errors)]
Overfitting and Tree Pruning
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the "best pruned tree"
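Both ideas map onto common library controls; a hedged sketch using scikit-learn on synthetic data (parameter values are arbitrary, chosen only for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Prepruning: stop splitting early via thresholds (depth, impurity decrease)
pre = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01).fit(X_train, y_train)

# Postpruning: grow fully, then pick among progressively pruned trees on held-out data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_val, y_val))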
Pre-Pruning (example)
Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
[Figure: a node containing 3 blue dots and 1 orange dot is left unsplit]
Post-Pruning (example)
Overfitting and Tree Pruning
– https://youtu.be/u4kbPtiVVB8
Classification in Large Databases
Entropy
• Categories: a, b
$$Entropy(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
– S1 = {a, a, a, a, a, a}
– S2 = {a, a, a, b, b, b}
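Plugging the two sets into the formula (with 0 · log2(0) taken as 0, as in the examples that follow):
$$Entropy(S_1) = -(1 \log_2 1 + 0 \log_2 0) = 0$$
$$Entropy(S_2) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$$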
Example

Age   att1   att2   Class label
5     …      …      N
10    …      …      N
20    …      …      Y
38    …      …      Y
52    …      …      N

Candidate split T1 separates {5} from {10, 20, 38, 52}:
$$E_1 = -(1 \log_2 1 + 0 \log_2 0) = 0$$
$$E_2 = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$$
$$I(S, T_1) = \frac{1}{5} \times 0 + \frac{4}{5} \times 1 = 0.8$$
Example (same Age data as above)

Candidate split T2 separates {5, 10} from {20, 38, 52}:
$$E_1 = -(1 \log_2 1 + 0 \log_2 0) = 0$$
$$E_2 = -\left(\frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3}\right) = 0.92$$
$$I(S, T_2) = \frac{2}{5} \times 0 + \frac{3}{5} \times 0.92 = 0.552$$
Computing Information-Gain for Continuous-Valued
Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
• $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
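A small sketch of this split-point search, applied to the Age data from the earlier example (function and variable names are mine, not the textbook's):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # try the midpoint between each pair of adjacent sorted values and keep the
    # split point with the minimum expected information requirement
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= mid]
        right = [lab for v, lab in pairs if v > mid]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or info < best[1]:
            best = (mid, info)
    return best

print(best_split_point([5, 10, 20, 38, 52], ["N", "N", "Y", "Y", "N"]))  # (15.0, ~0.55): split T2 above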