Decision Trees - Lecture 11
David Sontag
New York University
• 40 data points
• Goal: predict MPG
• Need to find: f : X → Y
• Discrete data (for now)
[Figure: table of input attributes X and output label Y (MPG) for the 40 cars]
From the UCI repository (thanks to Ross Quinlan)
Hypotheses: decision trees f : X → Y
• Each internal node tests an attribute x_i
• Each branch assigns an attribute value x_i = v
• Each leaf assigns a class y
• To classify input x: traverse the tree from root to leaf, output the labeled y
[Figure: example tree. The root tests Cylinders (values 3, 4, 5, 6, 8); the 6-cylinder branch tests Maker (america, asia, europe), the 8-cylinder branch tests Horsepower (low, med, high); each leaf is labeled good or bad.]
Human interpretable!
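To make the traversal concrete, here is a minimal Python sketch (mine, not the lecture's code) of one way to represent such a tree and classify an input by walking from root to leaf; the nested-dict representation is purely an assumption for illustration, and the attribute names and leaf labels follow the example tree in the figure above.

```python
# Minimal decision-tree sketch (assumed representation, not from the slides).
# An internal node is a dict that tests one attribute; each branch maps an
# attribute value to a subtree; a leaf is just the predicted class label.
mpg_tree = {
    "attribute": "cylinders",
    "branches": {
        3: "good",
        4: "bad",
        5: "bad",
        6: {"attribute": "maker",
            "branches": {"america": "bad", "asia": "good", "europe": "good"}},
        8: {"attribute": "horsepower",
            "branches": {"low": "bad", "med": "good", "high": "bad"}},
    },
}

def classify(tree, x):
    """Traverse from root to leaf and return the label stored at the leaf."""
    node = tree
    while isinstance(node, dict):          # still at an internal node
        value = x[node["attribute"]]       # this input's value for the tested attribute
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf: the predicted class

print(classify(mpg_tree, {"cylinders": 6, "maker": "asia"}))  # -> 'good'
```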
Hypothesis space
• How many possible hypotheses?
• What functions can be represented? Decision trees can represent any function of the input attributes!
[Figure: the Cylinders/Maker/Horsepower tree again, alongside a truth table over binary attributes]
• Let's first look at how to split nodes, then consider how to find the best tree
What is the Simplest Tree?
• Predict mpg = bad
Recursive Step
[Figure: partition the data by the chosen attribute and build a subtree from each partition: the records in which cylinders = 4, cylinders = 5, cylinders = 6, and cylinders = 8.]
Second level of tree
[Figure: the recursion applied one level down the tree.]
[Figure: two decision trees over binary attributes A, B, C that fit the same +/- labeled examples, one noticeably smaller than the other.]
• Which tree do we prefer?
Learning decision trees is hard!!!
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
  – Start from empty decision tree
  – Split on next best attribute (feature)
  – Recurse
Splitting: choosing a good attribute
Would we prefer to split on X1 or X2?

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

Split on X1:  X1=t → Y=t: 4, Y=f: 0;  X1=f → Y=t: 1, Y=f: 3
Split on X2:  X2=t → Y=t: 3, Y=f: 1;  X2=f → Y=t: 2, Y=f: 2

Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
Measuring uncertainty
• Good split if we are more certain about classification after split
  – Deterministic good (all true or all false)
  – Uniform distribution bad
  – What about distributions in between?
Entropy
Entropy of a random variable Y:
  H(Y) = -Σ_y P(Y = y) log2 P(Y = y)

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under most efficient code)

• “High Entropy”
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable
• “Low Entropy”
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable
Entropy Example
[Figure: entropy of a binary variable as a function of the probability of heads; 0 at probabilities 0 and 1, maximal at 1/2.]

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F

P(Y=t) = 5/6
P(Y=f) = 1/6

H(Y) = -5/6 log2 5/6 - 1/6 log2 1/6 = 0.65
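As a quick check of the arithmetic above, a small Python sketch (illustrative, not part of the lecture) that computes H(Y) from a list of labels:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(Y=y) log2 P(Y=y)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The six labels from the example: five 't', one 'f'.
y = ['t', 't', 't', 't', 't', 'f']
print(round(entropy(y), 2))   # -> 0.65
```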
Conditional Entropy
Conditional entropy H(Y|X) of a random variable Y conditioned on a random variable X:
  H(Y|X) = Σ_x P(X = x) H(Y | X = x)
         = -Σ_x P(X = x) Σ_y P(Y = y | X = x) log2 P(Y = y | X = x)

Example: X1

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F

P(X1=t) = 4/6: Y=t: 4, Y=f: 0
P(X1=f) = 2/6: Y=t: 1, Y=f: 1

H(Y|X1) = -4/6 (1 log2 1 + 0 log2 0) - 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
        = 2/6 ≈ 0.33
Information gain
• Decrease in entropy (uncertainty) after splitting:
  IG(X) = H(Y) - H(Y|X)

In our running example:
  IG(X1) = H(Y) - H(Y|X1) = 0.65 - 0.33 = 0.32

IG(X1) > 0, so we prefer the split!
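Continuing that sketch (again illustrative code, not from the slides), conditional entropy and information gain for the running six-record example can be computed as follows; the numbers match the 0.33 and 0.32 above.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_v P(X=v) H(Y | X=v)."""
    n = len(ys)
    return sum(
        (len([y for x, y in zip(xs, ys) if x == v]) / n)
        * entropy([y for x, y in zip(xs, ys) if x == v])
        for v in set(xs)
    )

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X)."""
    return entropy(ys) - conditional_entropy(xs, ys)

# The six records of the running example.
x1 = ['t', 't', 't', 't', 'f', 'f']
y  = ['t', 't', 't', 't', 't', 'f']

print(round(conditional_entropy(x1, y), 2))   # -> 0.33
print(round(information_gain(x1, y), 2))      # -> 0.32
```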
Learning decision trees
• Start from empty decision tree
• Split on next best attribute (feature)
  – Use, for example, information gain to select attribute: arg max_i IG(X_i) = arg max_i H(Y) - H(Y | X_i)
• Recurse
Suppose we want to predict MPG
[Figure: the tree grown on the MPG data, with two annotations:]
• Base Case One: don't split a node if all matching records have the same output value
• Base Case Two: don't split a node if data points are identical on remaining attributes (no attributes can distinguish them)
Base Cases: An idea
• Base Case One: If all records in current data subset have the same output then don't recurse
• Base Case Two: If all records have exactly the same set of input attributes then don't recurse
y = a XOR b
Is it OK to omit Base Case 3 (a proposed rule: don't recurse if no attribute has positive information gain)? Yes: for y = a XOR b, neither a nor b alone gives any information gain, so Base Case 3 would stop immediately, yet splitting anyway yields a tree that classifies every example correctly.
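A small illustration (mine, not from the slides) of why the XOR example matters here: splitting on either input alone gives zero information gain, yet a two-level tree classifies the data perfectly, so a rule that refuses to recurse when every gain is zero would give up too early.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    n = len(ys)
    h_cond = sum(
        (len([y for x, y in zip(xs, ys) if x == v]) / n)
        * entropy([y for x, y in zip(xs, ys) if x == v])
        for v in set(xs)
    )
    return entropy(ys) - h_cond

# Truth table for y = a XOR b.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [ai ^ bi for ai, bi in zip(a, b)]

print(information_gain(a, y))   # -> 0.0: a alone says nothing about y
print(information_gain(b, y))   # -> 0.0: b alone says nothing about y
# Splitting on a anyway, then on b within each branch, gives four pure leaves.
```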
Summary: Building Decision Trees

BuildTree(DataSet, Output)
• If all output values are the same in DataSet, return a leaf node that says “predict this unique output”
• If all input values are the same, return a leaf node that says “predict the majority output”
• Else find attribute X with highest Info Gain
• Suppose X has n_X distinct values (i.e. X has arity n_X)
  – Create a non-leaf node with n_X children
  – The i'th child should be built by calling BuildTree(DS_i, Output), where DS_i contains the records in DataSet where X = i'th value of X
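The pseudocode above translates almost line for line into Python. Below is a minimal illustrative sketch (the record-as-dict representation and helper names are my own assumptions, not the lecture's code): records are dicts, Output names the label key, and information gain picks the split attribute.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, output):
    """IG of splitting `records` on `attr`, with respect to the `output` label."""
    ys = [r[output] for r in records]
    n = len(records)
    h_cond = 0.0
    for v in set(r[attr] for r in records):
        subset = [r[output] for r in records if r[attr] == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(ys) - h_cond

def build_tree(records, attributes, output):
    ys = [r[output] for r in records]
    # Base Case One: all output values identical -> leaf predicting that output.
    if len(set(ys)) == 1:
        return ys[0]
    # Base Case Two: no attributes left, or all inputs identical -> majority leaf.
    if not attributes or all(
        all(r[a] == records[0][a] for r in records) for a in attributes
    ):
        return Counter(ys).most_common(1)[0][0]
    # Otherwise split on the attribute X with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(records, a, output))
    children = {}
    for v in set(r[best] for r in records):        # one child per distinct value of X
        subset = [r for r in records if r[best] == v]
        children[v] = build_tree(subset, [a for a in attributes if a != best], output)
    return {"attribute": best, "branches": children}
```

Called as, e.g., build_tree(records, ["cylinders", "maker", "horsepower"], "mpg"), this returns nested dicts in the same format used in the traversal sketch earlier.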
MPG Test set error
[Figure: test-set error of the learned tree on the MPG data]

What about real-valued attributes?
• Infinite number of possible split values!!!
• Finite dataset, only finite number of relevant splits!
“One branch for each numeric value” idea:
[Figure: a threshold split instead, e.g. < 70 → bad, ≥ 70 → good]
• Requires small change
• Allow repeated splits on same variable
• How does this compare to “branch on each value” approach?
The set of possible thresholds
• Binary tree, split on attribute X
  – One branch: X < t
  – Other branch: X ≥ t
Optimal splits for continuous attributes
• Search through possible values of t
  – Seems hard!!! Infinitely many possible split points c to define node test Xj > c?
• No! Only a finite number of t's are important:
  – Sort data according to Xj into {x1, ..., xm}
  – Consider split points of the form xi + (xi+1 - xi)/2
  – Moving the split point along the empty space between two observed values has no effect on information gain or empirical loss, so just use the midpoint
  – Moreover, only splits between examples from different classes can be optimal for information gain or empirical loss reduction!
[Figures from Stuart Russell: a sorted number line of observed Xj values with candidate thresholds c1, c2 at midpoints between examples of different classes]
Picking the best threshold
• Suppose X is real valued with threshold t
• Want IG(Y | X:t), the information gain for Y when testing if X is greater than or less than t
• Define:
  – H(Y | X:t) = p(X < t) H(Y | X < t) + p(X ≥ t) H(Y | X ≥ t)
  – IG(Y | X:t) = H(Y) - H(Y | X:t)
  – IG*(Y | X) = max_t IG(Y | X:t)
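An illustrative sketch (not from the slides) of IG*(Y|X) for a real-valued attribute: sort the examples by X, take midpoints between consecutive values as candidate thresholds, skip midpoints where the class does not change (they cannot be optimal), and keep the threshold with the highest IG(Y|X:t). The toy horsepower numbers below are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (t*, IG*(Y|X)), maximizing IG(Y|X:t) over candidate midpoints."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    h_y = entropy(ys)
    best_t, best_ig = None, -1.0
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or x1 == x2:          # only class changes can be optimal splits
            continue
        t = x1 + (x2 - x1) / 2.0          # midpoint split point
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        # H(Y|X:t) = p(X < t) H(Y|X < t) + p(X >= t) H(Y|X >= t)
        h_cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        ig = h_y - h_cond                 # IG(Y|X:t) = H(Y) - H(Y|X:t)
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

hp = [60, 65, 68, 72, 90, 110]            # hypothetical horsepower values
y  = ['good', 'good', 'good', 'bad', 'bad', 'bad']
print(best_threshold(hp, y))              # -> (70.0, 1.0): threshold 70 is perfect
```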
• Decision trees are one of the most popular ML tools
  – Easy to understand, implement, and use
  – Computationally cheap (to solve heuristically)
• Information gain to select attributes (ID3, C4.5, ...)
• Presented for classification, can be used for regression and density estimation too
• Decision trees will overfit!!!
  – Must use tricks to find “simple trees”, e.g.,
    • Fixed depth / early stopping
    • Pruning
    • Hypothesis testing
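In practice these tricks are exposed as knobs in standard libraries. A minimal sketch using scikit-learn (assuming scikit-learn is installed; the synthetic data is made up for illustration): max_depth gives fixed-depth early stopping and ccp_alpha gives cost-complexity pruning.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                  # synthetic features
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)     # noisy labels

# Unrestricted tree: tends to fit the noise in the training data.
full = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Early stopping via a fixed depth, and cost-complexity pruning via ccp_alpha.
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02).fit(X, y)

print(full.get_depth(), shallow.get_depth(), pruned.get_depth())
```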