
INTRODUCTION TO AI AND MACHINE LEARNING
UNDERSTANDING DECISION TREES
• What is a decision tree
• How a decision tree works
• Applications of decision trees
WHAT IS A DECISION TREE

▪ An inductive learning task
▪ Uses particular facts to make more generalized conclusions
▪ A predictive model based on a branching series of Boolean tests
▪ A structure of nodes (shown as boxes) and edges (shown as arrows),
built from a dataset
▪ Each node is either a decision node (makes a decision) or a leaf node
(represents an outcome)
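This node-and-edge structure can be sketched in a few lines of Python (an illustrative sketch; the class names, feature names and the tiny example tree are our own, not from the slides):

```python
# Minimal sketch of a decision-tree structure: each node is either a
# decision node (tests one feature) or a leaf node (holds an outcome).

class Leaf:
    """Leaf node: represents an outcome."""
    def __init__(self, outcome):
        self.outcome = outcome

class Decision:
    """Decision node: tests a feature; edges map feature values to child nodes."""
    def __init__(self, feature, children):
        self.feature = feature
        self.children = children  # e.g. {"Y": <node>, "N": <node>}

def predict(node, instance):
    """Follow the edges from the root until a leaf is reached."""
    while isinstance(node, Decision):
        node = node.children[instance[node.feature]]
    return node.outcome

# Hypothetical one-split tree: Risk is Low if Vaxxed, otherwise High.
tree = Decision("Vaxxed", {"Y": Leaf("Low"), "N": Leaf("High")})
print(predict(tree, {"Vaxxed": "Y"}))  # Low
```

Classifying an instance is just a walk from the root, taking the edge that matches the instance's value at each decision node.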

INTRO TO AI & ML | DIP IN BORDER SECURITY | SINGAPORE POLYTECHNIC


WHAT IS A DECISION TREE
PREDICT COVID RISK
▪ We use fictitious COVID-19 data, shown on the left, as an
over-simplified analysis example

▪ We want to predict whether the risk level of catching COVID
(for an individual) is High or Low (the target)



WHAT IS A DECISION TREE
PREDICT COVID RISK

(Diagram: the example tree, with its root node, intermediate nodes and leaf nodes labelled)



WHAT IS A DECISION TREE
PREDICT COVID RISK

▪ The decision tree is generated from the data collected
▪ We use the term instances for the rows of the data collected
▪ We use the term attributes for the columns of the data collected
▪ In data mining software, an attribute (a characteristic of the data) is
called a feature, as it is a label of that characteristic
▪ Not all attributes collected are used in each path of the decision tree
▪ e.g. Vaccinated (Vaxxed) alone reduces the risk to Low
▪ Some attributes may not appear in the decision tree at all
▪ e.g. Community spread (Community) is not used in this decision tree
HOW A DECISION TREE WORKS
ITERATIVE DICHOTOMISER 3 (ID3)

▪ Iterative (repeated) Dichotomiser (divider) 3, an algorithm
invented by Ross Quinlan in 1975

▪ Top-down greedy approach
▪ Starts from the top
▪ In each iteration, selects the best attribute at that moment to
create a node



HOW A DECISION TREE WORKS
ID3 PROCESS

Steps for ID3

▪ Choose the best feature to split the remaining data points
(instances) and make that feature a decision node
▪ Repeat the process recursively for each child
▪ Stop when:
▪ All the instances have the same target feature value
▪ No more attributes remain (all have been used)
▪ No more instances are available
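The steps above can be sketched as a toy Python implementation (illustrative only; the dataset, feature names and `gain` scoring are our own assumptions, and the "no more instances" case never arises here because each branch value comes from the remaining rows):

```python
import math
from collections import Counter

def entropy(labels):
    """Disorder of a list of target values: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, target, feature):
    """Reduction in entropy achieved by splitting the rows on one feature."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

def id3(rows, target, features):
    """Recursive ID3: pick the best feature, split, repeat for each child."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:        # stop: all instances share one target value
        return labels[0]
    if not features:                 # stop: all attributes used -> majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, target, f))
    rest = [f for f in features if f != best]
    return (best, {v: id3([r for r in rows if r[best] == v], target, rest)
                   for v in {r[best] for r in rows}})

# Hypothetical instances: Vaxxed separates Risk perfectly, so ID3 picks it first.
rows = [
    {"Vaxxed": "Y", "Above12": "Y", "Risk": "Low"},
    {"Vaxxed": "Y", "Above12": "N", "Risk": "Low"},
    {"Vaxxed": "N", "Above12": "Y", "Risk": "High"},
    {"Vaxxed": "N", "Above12": "N", "Risk": "High"},
]
print(id3(rows, "Risk", ["Vaxxed", "Above12"]))  # root splits on 'Vaxxed'
```

Each recursive call works on a smaller subset of the instances and a smaller list of features, so one of the three stopping conditions is always reached.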



HOW A DECISION TREE WORKS
ID3 PROCESS

How does ID3 select the best feature?

Short answer – select the feature with the highest Information Gain (Gain)

Longer answer –
Entropy is the measure of disorder in the target feature of the dataset.
Information Gain is the calculated reduction in entropy.
Hence the selected feature produces the most ordered grouping of the
target feature (e.g. the biggest homogeneous group of instances formed).
HOW A DECISION TREE WORKS
COVID RISK ANALYSIS
▪ Using only the Above 12 feature as the root node

▪ We get:
▪ 4 of 4 (100%) [see box] where Above 12 is N AND Risk is Low
▪ A mixed set for Risk when Above 12 is Y



HOW A DECISION TREE WORKS
COVID RISK ANALYSIS
▪ Using only the Vaxxed feature as the root node

▪ We get:
▪ 5 of 5 (100%) [see box] where Vaxxed is Y AND Risk is Low
▪ A mixed set for Risk when Vaxxed is N



HOW A DECISION TREE WORKS
COVID RISK ANALYSIS
▪ Using only the Community feature as the root node

▪ We get:
▪ No homogeneous split (no red box)
▪ 3 of 4 for (N and Low)
▪ 6 of 7 for (Y and High)



HOW A DECISION TREE WORKS
COVID RISK ANALYSIS
▪ The best feature for the root node is Vaxxed, which generates 5
instances in a homogeneous partition – the highest Gain

▪ This is a simplified comparison, as we have ignored the mixed sets.
However, they do contribute to the calculation.



HOW A DECISION TREE WORKS
ENTROPY AND GAIN
▪ In data mining software (e.g. Orange3), the decision tree can be
built as an induced binary tree

▪ In binary classification, every split has only 2 possible classes
(or outcomes) (e.g. Yes / No, True / False, 1 / 0)

▪ We also assume the target feature has 2 possible classes, with an
equal number of each class
▪ e.g. 5 × True AND 5 × False in a dataset of 10 instances

Source: https://towardsdatascience.com/decision-trees-for-classification-id3-algorithm-explained-89df76e72df1


HOW A DECISION TREE WORKS
ENTROPY AND GAIN
▪ Entropy, the measure of disorder in the target feature of the dataset
(Risk in this case), is calculated as:

Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n

n = number of classes in the target feature (e.g. High / Low, n = 2)
pᵢ = probability of class i, i.e. the ratio of the "number of rows with
class i in the target feature" to the "total number of rows" in the dataset
(e.g. 5/7 for Low Risk among the Above 12 = Y instances in our example)
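The formula can be checked with a short Python sketch (illustrative; the label lists are our own examples):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes i of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A 50/50 split of the target is maximally disordered for n = 2 classes:
print(entropy(["High", "Low"]))  # 1.0

# The slide's example: 5 Low and 2 High among the Above 12 = Y instances,
# so p(Low) = 5/7 and p(High) = 2/7.
print(round(entropy(["Low"] * 5 + ["High"] * 2), 3))  # 0.863
```

A homogeneous set (all one class) has entropy 0, which is why the 5-of-5 Vaxxed partition is as ordered as a partition can be.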



HOW A DECISION TREE WORKS
ENTROPY AND GAIN
▪ Information Gain (IG) for a feature A is calculated as:

IG(S, A) = Entropy(S) - ∑ ((|Sᵥ| / |S|) * Entropy(Sᵥ))

▪ Sᵥ = the set of rows in S for which feature A has value v
▪ |S| = number of rows in S
▪ |Sᵥ| = number of rows in Sᵥ

Since we are using an application, the calculation is for your
information only.
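For reference, the calculation can be sketched in Python (the mini dataset is illustrative, not the slide data):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, target, feature):
    """IG(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv)."""
    total = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:   # each Sv is one partition of S
        sv = [r[target] for r in rows if r[feature] == value]
        remainder += (len(sv) / len(rows)) * entropy(sv)
    return total - remainder

# Illustrative rows: Vaxxed splits Risk into two homogeneous partitions,
# so its gain equals the full entropy of the target (the maximum possible).
rows = [
    {"Vaxxed": "Y", "Risk": "Low"},
    {"Vaxxed": "Y", "Risk": "Low"},
    {"Vaxxed": "N", "Risk": "High"},
    {"Vaxxed": "N", "Risk": "High"},
]
print(information_gain(rows, "Risk", "Vaxxed"))  # 1.0
```

ID3 computes this gain for every remaining feature and splits on the one with the highest value.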
IMPROVING THE DECISION TREE

Pruning is a technique used to reduce the number of features used in
the tree, to:
▪ Prevent overfitting
▪ Reduce computation time

2 types of pruning:
▪ Pre-pruning (forward pruning)
▪ Post-pruning (backward pruning)
IMPROVING THE DECISION TREE

The decision tree could over-learn all the data, including its errors.
e.g. A leaf node with only 1 instance (see bottom) could be considered
overfitting.



IMPROVING THE DECISION TREE
PRE-PRUNING
Pre-pruning
▪ Used during the building process to stop adding features
▪ Typically based on the reduction in information gain

Consideration – problematic because:
▪ Features that individually contribute little to the decision may have
a significant impact when combined
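A minimal pre-pruning sketch in Python (illustrative; the `min_gain` threshold and the dataset are our own assumptions, not the slide's tool or data):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, target, feature):
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[feature] for r in rows}:
        sv = [r[target] for r in rows if r[feature] == value]
        after += (len(sv) / len(rows)) * entropy(sv)
    return before - after

def build(rows, target, features, min_gain=0.1):
    """Grow a tree, but pre-prune: stop adding features once the best
    available information gain falls below min_gain."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: gain(rows, target, f))
    if gain(rows, target, best) < min_gain:   # pre-pruning condition
        return majority
    rest = [f for f in features if f != best]
    return (best, {v: build([r for r in rows if r[best] == v], target, rest, min_gain)
                   for v in {r[best] for r in rows}})

# Community's gain here is only ~0.08, so with min_gain=0.1 the split is
# pruned away and a majority-class leaf is returned instead.
rows = [
    {"Community": "Y", "Risk": "High"}, {"Community": "Y", "Risk": "High"},
    {"Community": "Y", "Risk": "Low"},  {"Community": "N", "Risk": "High"},
    {"Community": "N", "Risk": "Low"},  {"Community": "N", "Risk": "Low"},
]
print(build(rows, "Risk", ["Community"], min_gain=0.1))  # a single leaf label
print(build(rows, "Risk", ["Community"], min_gain=0.0))  # a full decision node
```

This also illustrates the caveat above: a threshold on one feature's gain can discard a feature that would have mattered in combination with a later split.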
IMPROVING THE DECISION TREE
POST-PRUNING
Post-pruning
▪ More commonly used; pruning occurs after the building process
(once the full decision tree is built)

2 techniques, using algorithms based on either reduced error or
reduced cost complexity:
▪ Subtree replacement
▪ Subtree raising
IMPROVING THE DECISION TREE
POST-PRUNING | SUBTREE REPLACEMENT

Replacing a subtree with a leaf node generalizes the decision tree
but could reduce accuracy.

Used for very large trees.



IMPROVING THE DECISION TREE
POST-PRUNING | SUBTREE RAISING

Very time-consuming to verify, as it is not based on data but is
likely guided by domain knowledge.



IMPROVING THE DECISION TREE
ERROR PROPAGATION

▪ Decision trees work by a series of decisions. If the decision at
one feature is wrong:

▪ Subsequent decisions will be wrong

▪ The path taken will be affected, which could also be wrong



NOTES ON DECISION TREES
ADVANTAGES / DISADVANTAGES

▪ Advantages
▪ Easy to interpret
▪ Easy to prepare
▪ Less data cleaning required

▪ Disadvantages
▪ Unstable nature
▪ Less effective at predicting the outcome of a continuous variable



APPLICATIONS OF DECISION TREES

▪ Decision trees can be used for:
▪ Categorical variables
▪ Distinct categories with no in-between
▪ Continuous variables
▪ A range of values such as age or weight

▪ Suitable for handling non-linear datasets effectively in real-life
situations



END OF LESSON 5
