Learning by Asking Questions: Decision Trees
Piyush Rai
Machine Learning (CS771A)
Aug 5, 2016
A Classification Problem
Indoor or Outdoor?
Decision Tree
Defined by a hierarchy of rules (in the form of a tree)
Rules form the internal nodes of the tree (topmost internal node = root)
Each rule (internal node) tests the value of some feature
Note: The tree need not be a binary tree
(Labeled) training data is used to construct the Decision Tree (DT) [1]
The DT can then be used to predict label y of a test example x
[1] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees.
The DT can be used to predict the region (blue/green) of a new test point
By testing the features of the test point
In the order defined by the DT (first x2 and then x1); a minimal code sketch follows
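As a minimal sketch of how such a tree acts at prediction time (the thresholds and region labels below are invented for illustration; the slide's figure defines the real ones), the rule hierarchy is just nested conditionals:

    # A two-level decision tree as nested rules: test x2 at the root, then x1.
    # Thresholds (0.5) and leaf labels are hypothetical, not from the figure.
    def predict_region(x1: float, x2: float) -> str:
        if x2 >= 0.5:       # root rule: test feature x2 first
            return "green"
        if x1 >= 0.5:       # second rule: test feature x1
            return "green"
        return "blue"

    print(predict_region(x1=0.2, x2=0.8))  # -> green
    print(predict_region(x1=0.2, x2=0.2))  # -> blue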
Question: Why does it make more sense to test the feature outlook first?
Answer: Of all the 4 features, it's the most informative [2]
We will focus on classification and use Entropy/Information Gain as a measure of feature informativeness (other measures can also be used)
[2] Analogy: playing the game of 20 Questions (ask the most useful questions first)
Entropy
Entropy is a measure of randomness/uncertainty of a set
Assume our data is a set S of examples having a total of C distinct classes
Suppose p_c is the probability that a random element of S has class c ∈ {1, . . . , C}
(basically, the fraction of elements of S belonging to class c)
The entropy of S is defined as

    H(S) = -\sum_{c=1}^{C} p_c \log_2 p_c
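As a small Python sketch of this formula (the function name and list-of-labels input are our own choices, not from the slides):

    import math
    from collections import Counter

    def entropy(labels):
        """H(S) = -sum_c p_c log2 p_c, with labels given as a plain list."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    # The 14-day Playing Tennis set has 9 positive and 5 negative examples:
    print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))  # -> 0.94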
Information Gain
Assume each element of S has a set of features. E.g., in the Playing Tennis
example, each example has 4 features {Outlook, Temp., Humidity, Wind}
Suppose S_v denotes the subset of S for which a feature F has value v
E.g., if feature F = Outlook has value v = Sunny, then S_v = days 1, 2, 8, 9, 11
The information gain of feature F on set S is

    IG(S, F) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)
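Continuing the sketch, information gain just groups the labels by the feature's value and reuses the entropy() helper from above (names again ours):

    from collections import defaultdict

    def information_gain(examples, labels, feature):
        """IG(S, F): examples is a list of dicts mapping feature names to values."""
        groups = defaultdict(list)
        for x, y in zip(examples, labels):
            groups[x[feature]].append(y)
        n = len(labels)
        return entropy(labels) - sum(
            (len(g) / n) * entropy(g) for g in groups.values()
        )

    # E.g., with examples like {"Outlook": "Sunny", "Wind": "Weak", ...} and
    # labels like "yes"/"no", information_gain(examples, labels, "Outlook")
    # scores how informative Outlook is.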
Split 2 has higher IG and gives purer classes (hence preferred)
Pic credit: Decision Forests: A Unified Framework by Criminisi et al.
Example: information gain of the feature Wind on the full Playing Tennis set S:

    IG(S, Wind) = H(S) - \frac{|S_{weak}|}{|S|} H(S_{weak}) - \frac{|S_{strong}|}{|S|} H(S_{strong})
                = 0.94 - (8/14) \times 0.811 - (6/14) \times 1
                = 0.048
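The arithmetic on this slide can be checked directly (values copied from above):

    h_s = 0.94                    # entropy of the full 14-example set
    ig_wind = h_s - (8 / 14) * 0.811 - (6 / 14) * 1.0
    print(round(ig_wind, 3))      # -> 0.048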
For this node (S = [2+, 3−]), the IG for the feature Temperature:

    IG(S, Temperature) = H(S) - \sum_{v \in \{hot, mild, cool\}} \frac{|S_v|}{|S|} H(S_v)
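Assuming this node is the Outlook = Sunny branch of the usual Playing Tennis table (days 1, 2, 8, 9, 11, as above), the numbers work out as: H(S) = H(2/5, 3/5) ≈ 0.971; Temperature splits these five days into hot = [0+, 2−] (H = 0), mild = [1+, 1−] (H = 1), and cool = [1+, 0−] (H = 0), so IG(S, Temperature) ≈ 0.971 − (2/5)·0 − (2/5)·1 − (1/5)·0 ≈ 0.571.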
Overfitting Illustration
Desired: a DT that is not too big in size, yet fits the training data reasonably well
There are mainly two approaches (a sketch of both follows):
Prune while building the tree (stopping early)
Prune after building the tree (post-pruning)
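As a sketch of both options with scikit-learn (not claimed to be the course's tooling, just one common implementation): early stopping corresponds to construction-time limits such as max_depth or min_samples_leaf, while post-pruning is available as cost-complexity pruning via ccp_alpha.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Prune while building: stop early via depth / leaf-size limits.
    early = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

    # Prune after building: cost-complexity post-pruning, strength set by alpha.
    post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

    print(early.get_depth(), post.get_depth())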
Next Class:
Learning as Optimization