
Decision Tree learning: ID3

• Decision tree representation


– Objective
• Classify an instance by sorting it down the tree from the
root to some leaf node; the leaf provides the class of the
instance
– Each internal node tests one attribute of the instance, and each branch
descending from that node corresponds to one possible value of that
attribute
– Testing
• Start at the root node, test the attribute specified by this
node, follow the branch corresponding to the attribute's value,
then repeat the process at the next node
– Example
• A decision tree for classifying Saturday mornings as suitable
or not for playing tennis, based on the weather attributes
Decision trees represent a disjunction of conjunctions of constraints on the
attribute values of instances: each path from the root to a leaf labelled
"Yes" is one conjunction of attribute tests, and the tree as a whole is the
disjunction of these paths.
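As an illustration (assuming the standard PlayTennis tree from Mitchell's
textbook, which these slides appear to follow), the "Yes" classification is
represented by:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)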
• The characteristics of problems best suited for DT learning
– Instances are represented by attribute–value pairs
• Each attribute takes a small number of disjoint possible values (e.g., Hot,
Mild, Cold) → discrete
• Real-valued attributes such as temperature → continuous
– The target function has discrete output values
• Binary, multi-class and real-valued outputs
– Disjunctive descriptions may be required
– The training data may contain errors
• DTs are robust to both kinds of error: errors in the class labels and
errors in the attribute values
– The training data may contain missing attribute values
• Basics of DT learning - Iterative Dichotomiser 3 - ID3
• ID3 algorithm
– ID3 algorithm begins with the original set S as the root node.
– On each iteration, the algorithm runs through every unused
attribute of the set S and calculates the entropy H(S, A)
(or the information gain IG(S, A)) obtained by splitting on that attribute.
– It then selects the attribute with the smallest resulting entropy (or
largest information gain).
• The set S is then split by the selected attribute (e.g. age is less than 50,
age is between 50 and 100, age is greater than 100) to produce subsets
of the data.
– The algorithm then recurses on each subset, considering
only attributes never selected before.
• Recursion on a subset stops when
– All the elements in the subset belong to the same class
– The instances do not all belong to the same class, but there is no
attribute left to select (the leaf is labelled with the majority class)
– There are no examples in the subset
• Steps in ID3
– Calculate the entropy of every attribute using the data set S
– Split the set S into subsets using the attribute for which the
resulting entropy (after splitting) is minimum (or, equivalently,
information gain is maximum)
– Make a decision tree node containing that attribute
– Recurse on each subset using the remaining attributes.
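A minimal Python sketch of these steps (illustrative only, not code from the
slides; the record format – a list of dicts keyed by attribute name plus the
target – and the helper names are assumptions for the example):

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum(p * log2 p) over the classes.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr, target):
    # Expected reduction in entropy of the target after splitting on attr.
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def id3(rows, attributes, target):
    # Returns a class label (leaf) or a nested dict {attr: {value: subtree}}.
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples have the same class
        return labels[0]
    if not attributes:                   # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset,
                                [a for a in attributes if a != best],
                                target)
    return tree

With the 14-row PlayGolf table encoded this way, id3(rows, ['Outlook',
'Temperature', 'Humidity', 'Windy'], 'PlayGolf') places Outlook at the root,
matching the hand calculation in the slides that follow.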
• Root - Significance
– Root node
• Which attribute should be tested at the root of the DT?
– Decided based on the information gain or entropy
– Significance
» The best attribute is the one that classifies the training
examples best
» It is the one test applied to every instance of the
dataset
• At the root node
– Information gain is largest (equivalently, the weighted entropy
after the split is smallest)
• Entropy measures the impurity (lack of homogeneity) of a set
of instances
• Information gain measures the expected reduction in
entropy from splitting on an attribute
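For reference, the two measures are defined by the standard formulas, using
the same log2 notation as the worked examples later:

Entropy(S) = - Σ_i p_i · log2(p_i), where p_i is the proportion of class i in S
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v),
             where S_v is the subset of S for which attribute A has value v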
(Figure: entropy curve. Entropy is 0 when all members are negative or when
all members are positive; it is maximal (1) at an even split for binary or
boolean classification, and can exceed 1 for multi-class classification.)
DT by example
Entropy using a frequency table of the independent attributes against a
single dependent attribute (target variable here: PlayGolf)
Entropy using the frequency table of a single attribute (the target alone):
PlayGolf has 9 Yes and 5 No out of 14 instances, so
E(PlayGolf) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy using the frequency table of two attributes (the target against one
predictor):
PlayGolf on Outlook
PlayGolf on Temperature
PlayGolf on Humidity
PlayGolf on Windy
Worked example for PlayGolf on Outlook:

E(PlayGolf, Outlook = Sunny)    = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
E(PlayGolf, Outlook = Overcast) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0.0
E(PlayGolf, Outlook = Rainy)    = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
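Weighting these by the number of instances per Outlook value (5 Sunny,
4 Overcast, 5 Rainy out of 14) gives the weighted entropy and the gain for
Outlook used below:

E(PlayGolf, Outlook) = (5/14)·0.971 + (4/14)·0.0 + (5/14)·0.971 = 0.693
Gain(PlayGolf, Outlook) = 0.940 - 0.693 = 0.247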
Information gain calculation
Calculation for Humidity, Temperature
Gain(PlayGolf, Humidity) = E(PlayGolf) - E(PlayGolf, Humidity)
= 0.940 - P(High)·E(3,4) - P(Normal)·E(6,1)
= 0.940 - (7/14)·E(3,4) - (7/14)·E(6,1)
where E(3,4) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
and   E(6,1) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.592
= 0.940 - 0.5·0.985 - 0.5·0.592 = 0.152

Gain(PlayGolf, Temperature) = E(PlayGolf) - E(PlayGolf, Temperature)
= 0.940 - P(Hot)·E(2,2) - P(Mild)·E(4,2) - P(Cold)·E(3,1)
= 0.940 - (4/14)·E(2,2) - (6/14)·E(4,2) - (4/14)·E(3,1)
where E(2,2) = 1.0,
E(4,2) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.918,
E(3,1) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.811
= 0.940 - (4/14)·1.0 - (6/14)·0.918 - (4/14)·0.811 = 0.029
Calculation for Windy proceeds in the same way.
Which attribute is the best classifier?
The WINNER is Outlook, and it is chosen as the root node.
Choose the attribute with the largest information gain as the
decision node, divide the dataset by its branches, and
repeat the same process on every branch.
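As a cross-check, the short Python snippet below recomputes all four gains
from the per-value class counts; the (Yes, No) counts are an assumption taken
from the standard 14-row PlayGolf table that these slides are based on.

import math

def entropy(counts):
    # Entropy from a list of class counts, e.g. E(9, 5).
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (Yes, No) counts of PlayGolf for each value of each attribute
# -- assumed from the standard 14-row PlayGolf table.
splits = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rainy
    "Temperature": [(2, 2), (4, 2), (3, 1)],   # Hot, Mild, Cold
    "Humidity":    [(3, 4), (6, 1)],           # High, Normal
    "Windy":       [(6, 2), (3, 3)],           # False, True
}

base = entropy((9, 5))                         # E(PlayGolf) = 0.940
for attr, counts in splits.items():
    total = sum(sum(c) for c in counts)
    remainder = sum(sum(c) / total * entropy(c) for c in counts)
    print(attr, round(base - remainder, 3))
# Prints: Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048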
Choose the other nodes below the root node:
iterate the same procedure to find the decision node (the "root") of each
sub-tree.
For the Outlook = Sunny branch, splitting on Humidity (taking 0·log2(0) = 0):

P(High)·E(0,3) + P(Normal)·E(2,0)
= (3/5)·(-(0/3) log2(0/3) - (3/3) log2(3/3)) + (2/5)·(-(2/2) log2(2/2) - (0/2) log2(0/2))
= 0

Refer to slide no 20
Branch with entropy of ‘0’ is a leaf node
Branch with entropy more than ‘0’ needs
further splitting
Try this: build the decision tree for the dataset below.

Age     Income   Student   Credit_rating   Buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?
  <=30:   student?
            no  → no
            yes → yes
  31…40:  yes
  >40:    credit rating?
            excellent → no
            fair      → yes
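Counting from the table above (age <=30: 2 yes / 3 no; 31…40: 4 yes / 0 no;
>40: 3 yes / 2 no; 9 yes and 5 no overall), the same procedure confirms age
as the root attribute:

Gain(Buys_computer, Age) = 0.940 - [ (5/14)·0.971 + (4/14)·0.0 + (5/14)·0.971 ]
                         = 0.940 - 0.694 ≈ 0.246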
• HYPOTHESIS SPACE SEARCH IN DECISION TREE
LEARNING
• ID3 can be characterized as searching a space of
hypotheses for one that fits the training examples
– The hypothesis space searched by ID3 is the set of
possible decision trees
• a simple-to-complex, hill-climbing search through this
hypothesis space
– beginning with the empty tree, then considering progressively more
elaborate hypotheses in search of a decision tree that correctly
classifies the training data
• ID3's capabilities and limitations can be seen by considering its search
space and search strategy
– ID3's hypothesis space of all decision trees is a complete space of
finite discrete-valued functions, relative to the available
attributes
• Advantage – avoids the risk of searching an incomplete hypothesis space
that might not contain the target function
– ID3 maintains only a single current hypothesis as it searches
through the space of decision trees
• unlike the CANDIDATE-ELIMINATION algorithm, which maintains the set of all
hypotheses consistent with the available training examples
– as a result, ID3 cannot determine how many alternative decision trees are
consistent with the available training data
– ID3 does not backtrack, which can lead to a locally optimal solution.
– ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis
• INDUCTIVE BIAS IN DECISION TREE LEARNING
– inductive bias is the set of assumptions that, together
with the training data, deductively justify the
classifications assigned by the learner to future
instances
• selects in favor of shorter trees over longer ones, and
• selects trees that place the attributes with highest
information gain closest to the root.
• A closer approximation to the inductive bias of ID3: Shorter
trees are preferred over longer trees. Trees that place high
information gain attributes close to the root are preferred
over those that do not.
• Inductive bias: ID3 versus CANDIDATE-ELIMINATION
– ID3 → Preference bias
• searches a complete hypothesis space
– searches incompletely through this space, from simple to complex
hypotheses, stopping at the first hypothesis consistent with the data
and not searching beyond it
– Inductive bias: a consequence of the ordering of hypotheses by its
search strategy; its hypothesis space introduces no additional bias
» the inductive bias of ID3 follows from its search strategy, i.e., a
preference for certain hypotheses over others (e.g., for shorter
hypotheses)
– CANDIDATE-ELIMINATION → Restriction bias
• searches an incomplete hypothesis space (i.e., one that can
express only a subset of the potentially teachable concepts)
– searches this space completely, finding every hypothesis consistent with
the training data
– Inductive bias: a consequence of the expressive power of its hypothesis
representation; its search strategy introduces no additional bias
» it stems solely from the representation of the hypothesis space
• Why Prefer Short Hypotheses?
– Occam's razor
• Prefer the simplest hypothesis that fits the data
– Claim
• there are far fewer short hypotheses than long ones, so a short
hypothesis that fits the training data is unlikely to do so by
coincidence
– Critics
• Not always true
• The learner's internal representation determines the size of a
hypothesis
– Two learners can arrive at different hypotheses, both justifying their
contradictory conclusions by Occam's razor
• i.e., Occam's razor can sanction two different hypotheses from
the same training examples
• ISSUES IN DECISION TREE LEARNING
– Avoid Overfitting of data
• Definition: Given a hypothesis space H, a hypothesis h ∈ H is
said to overfit the training data if there exists some alternative
hypothesis h' ∈ H, such that h has smaller error than h' over
the training examples, but h' has a smaller error than h over
the entire distribution of instances.
• Reasons for overfitting
– when there is noise in the data, or when the number of training
examples is too small to produce a representative sample of the true
target function
• Methods to avoid overfitting
– Stop growing the tree early, before it perfectly fits the training data
– First allow the tree to overfit, then post-prune it (more successful in
practice)
Criteria used to determine the correct final tree size:
• Assess the utility of post-pruning nodes using a separate set of
validation examples, distinct from the training set
• Estimate whether expanding or pruning a particular node is likely to
produce an improvement beyond the training set
• Apply an explicit criterion (e.g., a complexity measure) to decide when
to stop growing the tree
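A minimal sketch of the second method (grow the tree fully, then post-prune
against a separate validation set), reusing the dict-based tree format of the
earlier ID3 sketch; the function names and data layout are illustrative
assumptions, not the slides' own code.

from collections import Counter

def predict(tree, row):
    # Walk a dict-based tree {attr: {value: subtree_or_label}} down to a label.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr])
    return tree

def accuracy(tree, rows, target):
    return sum(predict(tree, r) == r[target] for r in rows) / len(rows)

def prune(tree, train_rows, val_rows, target):
    # Reduced-error post-pruning: after pruning the children, replace this
    # subtree by the majority class of the training rows reaching it whenever
    # that does not reduce accuracy on the held-out validation rows.
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    attr = next(iter(tree))
    for value, subtree in list(tree[attr].items()):
        tree[attr][value] = prune(
            subtree,
            [r for r in train_rows if r[attr] == value],
            [r for r in val_rows if r[attr] == value],
            target)
    if not val_rows:                    # no validation evidence: keep subtree
        return tree
    majority = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    if accuracy(majority, val_rows, target) >= accuracy(tree, val_rows, target):
        return majority                               # collapse to a leaf
    return tree

Typical usage (with the hypothetical row format from the ID3 sketch):
tree = prune(id3(train_rows, attrs, 'PlayGolf'), train_rows, val_rows, 'PlayGolf')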
