
RESEARCH & PROJECT SUBMISSIONS

Program:
Course Code: CSE 323
Course Name: PROGRAMMING
WITH DATA STRUCTURES

Examination Committee
Prof Hossam Fahmy
Dr. Islam El-Maddah

Ain Shams University


Faculty of Engineering
Spring Semester – 2020




Student Personal Information for Group Work


Student Names: Student Codes:
‫إبراهيم حسن إبراهيم‬ 1600008
‫عمر عبد العزيز مرجان عبد العزيز‬ 1600885
‫عمر عبد الباسط عبد المقصود عبد الفتاح‬ 1600888

Plagiarism Statement
I certify that this assignment / report is my own work, based on my personal study and/or research, and that I have acknowledged all material and sources used in its preparation, whether they are books, articles, reports, lecture notes, or any other kind of document, electronic or personal communication. I also certify that this assignment / report has not previously been submitted for assessment in another course. I certify that I have not copied in part or in whole, or otherwise plagiarized, the work of other students and/or persons.

Signature/Student Name: Date: 30-5-2020

Submission Contents
01: Background
02: Implementation Details
03: Complexity of Operations
04: References


01
Background


1. Tree & Decision Tree


1.1. What is a tree?
A tree is a hierarchical data structure that is used to store data that is hierarchical in nature. It is one of the most powerful data structures from the standpoint of operation complexity. A tree differs from linear data structures such as arrays or linked lists in that it does not store data sequentially; instead, it stores the data in the form of nodes connected by edges. Each node holds two different things: a value and pointer(s). The value is simply the object you want to store in the node, and a pointer is a memory reference to the object (the next node) that has a hierarchical relation with the original object in the
parent node.
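As a minimal illustration of this idea (not part of the project code), a tree node can be represented in Python as a small class that holds a value and a list of references to its children:

```python
class SimpleTreeNode:
    """A node that stores a value and pointers (references) to its child nodes."""
    def __init__(self, value):
        self.value = value      # the object stored in this node
        self.children = []      # references (pointers) to the child nodes

    def add_child(self, child):
        self.children.append(child)

# Building a two-level tree: a root node "A" with two children "B" and "C".
root = SimpleTreeNode("A")
root.add_child(SimpleTreeNode("B"))
root.add_child(SimpleTreeNode("C"))
```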

Figure 1. Simple Tree

1.1.1. Simple tree notation


Root: The node that references the whole tree. It does not have a parent.
Edge: A connection between two nodes, represented by a line.
Leaf: A node with no children.
Parent: The predecessor of a node.
Children: All successors of a node.
Height (of a node): Number of edges on the longest path from the node to a leaf.
Height (of the tree): Number of edges on the longest path from the root to a leaf.
Depth (of a node): Number of edges from the root node to the specified node.
Path: A sequence of successive edges from a source node to a destination node.


1.2. What is a decision tree?


Decision tree analysis is a general, predictive modeling tool with applications spanning a number of different areas. In general, decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions. It is one of the most widely used and practical methods for supervised learning. Decision trees are a non-parametric supervised learning method used for both classification and regression. The objective is to create a model that predicts the value of a target variable by learning simple decision rules derived from the data features.
The decision rules are mostly in the form of if-then-else statements. The deeper the tree, the more complex the rules and the better the fit of the model.
A decision tree is a tree-like graph in which nodes represent the points where we select a feature and ask a question, edges represent the answers to those questions, and leaves represent the actual output or class label. Decision trees are used for non-linear decision making with simple linear decision surfaces.
Decision trees classify examples by sorting them down the tree from the root to some leaf node, with the leaf node providing the classification of the given example. Each node in the tree acts as a test case for some feature, and each edge descending from that node represents one of the possible answers to the test case. This process is recursive in nature and is repeated for each subtree rooted at the new nodes. Figure 2 shows a simple decision tree representing the cases in which some people go out for a walk; a small code sketch of the same idea follows below.
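Since the decision rules reduce to nested if-then-else statements, a tree such as the one in Figure 2 could be expressed directly in code. The feature names below (weather, humidity) are illustrative assumptions only and do not necessarily match the features in the figure:

```python
def go_for_a_walk(weather: str, humidity: str) -> bool:
    """Toy decision tree written as nested if-then-else rules (illustrative only)."""
    if weather == "rainy":
        return False            # leaf: class label "do not go out"
    else:
        if humidity == "high":
            return False        # leaf: class label "do not go out"
        else:
            return True         # leaf: class label "go out for a walk"

print(go_for_a_walk("sunny", "normal"))   # True
```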
1.2.1. Obvious DT issue
When splitting a predictor having q possible unordered values, there are 2^(q−1) − 1 possible partitions of the q values into two groups, and the computations become prohibitive for large q. However, with a 0–1 outcome, this computation simplifies. We order the predictor classes according to the proportion falling in outcome class 1, and then split this predictor as if it were an ordered predictor. This result also holds for a quantitative outcome and square error loss: the categories are ordered by increasing mean of the outcome. Although intuitive, the proofs of these assertions are not trivial. For multicategory outcomes, no such simplifications are possible. The partitioning algorithm tends to favor categorical predictors with many levels q; the number of partitions grows exponentially in q, and the more choices we have, the more likely we can find a good one for the data at hand. This can lead to severe overfitting if q is large, and such variables should be avoided. A small sketch of the ordering trick is given below.
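The following is a brief, hypothetical illustration of the ordering trick (not part of the project code, which assumes binary features): the q levels are sorted by the proportion of class-1 outcomes, after which only the q − 1 ordered cut points need to be scanned instead of all 2^(q−1) − 1 partitions.

```python
import numpy as np

def order_levels_by_outcome(categories, y):
    """Order the levels of a categorical predictor by the proportion of
    outcome class 1 within each level; candidate binary splits are then
    just the prefixes of this ordering."""
    levels = np.unique(categories)
    p1 = np.array([y[categories == lv].mean() for lv in levels])
    return levels[np.argsort(p1)]

cats = np.array(["red", "blue", "red", "green", "blue", "green"])
y = np.array([1, 0, 1, 1, 0, 0])
print(order_levels_by_outcome(cats, y))   # ['blue' 'green' 'red']
```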

Figure 2. Simple DT


02
Implementation
Details


1. Overall implementation
Our implementation is divided into two parts: the decision tree algorithm itself and a GUI that makes it easy to use the program without being involved in the Python script.
1.1. The decision tree implementation
1.1.1. Presumptions
In our implementation of the decision tree algorithm, we assumed that the input variables take only the two values 0 and 1. We also assumed that we are doing binary classification (i.e. the output labels take the two values 0 and 1).
1.1.2. Information gain and entropy
The implementation of the decision tree is based on splitting the data to achieve the highest possible information gain. So, what is information gain? It is a statistical property that measures how well a given attribute separates the training examples according to their target classification [1]. The higher the information gain, the more separable the data is into groups that are readily distinguishable from each other. To calculate the information gain, we must define a statistical quantity from information theory: entropy. Entropy is a measure of the degree of impurity in a group of examples. For our case of binary classification, it has a simple mathematical form:

\[ \mathrm{Entropy}(set) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-} \tag{1} \]

Where p+ is the proportion of the data belonging to the positive class and p− is the proportion of the data belonging to the negative class.
The negative sign of the logarithm is there so that the impurity of the set is described by a positive number. Figure 3 visualizes this property.
Now we can calculate the information gain as the entropy of the parent minus the weighted average of the entropies of the children:

\[ \mathrm{I.G.} = \mathrm{entropy}(parent) - p_{left}\cdot\mathrm{entropy}(left) - p_{right}\cdot\mathrm{entropy}(right) \tag{2} \]

Where: parent is the table available before splitting, left is the table resulting from the split that satisfies the left node condition, right is the table resulting from the split that satisfies the right node condition, p_left is the ratio of the labels that belong to the left child to the total number of labels, and p_right is the ratio of the labels that belong to the right child to the total number of labels.

Figure 3. Entropy vs p+ curve

Figure 4 visualizes the splitting operation for an OR function. Notice that the entropy of the left table is 0 (no impurity), while the entropy of the right table is 1 (p− = 0.5).
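As a worked check, assuming the standard four-row OR truth table (one label of 0 and three labels of 1, with two rows on each side of the split), equations (1) and (2) give:

\[ \mathrm{entropy}(parent) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811 \]

\[ \mathrm{I.G.} = 0.811 - 0.5\cdot 0 - 0.5\cdot 1 \approx 0.311 \]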

Figure 4. Splitting the dataset on X1


1.1.3. Coding the decision tree


1.1.3.1. Overall view
The decision tree code is divided into three main parts: the entropy function, the
tree node class which is called “TreeNode”, and a wrapper class around the tree
node class which is called “DecisionTreeClassifier”.
1.1.3.2. “Entropy” function
This function is easy to implement: all we need is the NumPy (a Python library that provides facilities for mathematical operations) function that calculates log2, multiplied by the proportion of each class in the total set. To improve the performance a little, we hard-coded two special cases: the entropy equals its minimum of 0 when there is no impurity in the set (only one class in the labels), and it equals its maximum of 1 when the labels contain two classes in equal proportions (the proportion of one of the two classes equals 0.5). Have a look at the code here:
https://github.com/omar-ashinawy/DS-DT/blob/master/Entropy.py.
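A minimal sketch of such an entropy function is shown below. It is an illustrative reconstruction based on the description above, not the exact code in Entropy.py:

```python
import numpy as np

def entropy(labels):
    """Entropy of a set of binary labels (0/1), per equation (1)."""
    p_plus = np.mean(labels == 1)
    # hard-coded special cases described above
    if p_plus == 0 or p_plus == 1:   # only one class present: no impurity
        return 0.0
    if p_plus == 0.5:                # two equally frequent classes: maximum impurity
        return 1.0
    p_minus = 1.0 - p_plus
    return -p_plus * np.log2(p_plus) - p_minus * np.log2(p_minus)

print(entropy(np.array([0, 1, 1, 1])))   # about 0.811
```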
1.1.3.3. Tree node class
The tree node class has five functions: “train”, “information_gain”, “split_column”, “predict_element” and “predict”. The class also has some attributes: “depth” (depth of the current node), “maxDepth” (depth of the furthest leaf node, which is specified by the user), “colmn” (yes, “colmn”; no typo here: the best column, i.e. the training feature or attribute, achieved so far), “left” (the left child), “right” (the right child) and “prediction” (stores the prediction for the given training example). Have a look at the code here:
https://github.com/omar-ashinawy/DS-DT/blob/master/TreeNode.py.
1.1.3.3.1. The “train” function
The function takes the input features (as a matrix) and the corresponding labels and changes the class attributes to build the decision tree.
This function is recursive in nature, which means that we look for base cases and build upon them until the whole function is implemented. The objective of the train function is to build the whole decision tree, i.e. to keep splitting the original dataset until we have leaf nodes with pure sets, so that we can make a decision for new data. This is done by finding the best split, the one that maximizes the information gain, at each splitting step. Figure 5 visualizes the complete decision tree resulting from the train function when it is trained on an OR function. In a nutshell, the train function calls the “information_gain” and “split_column” functions and uses them to choose the best column to split on.

Figure 5. The complete tree after training on an OR function

The base cases are:
1. The labels have only one class.
2. The labels have only one element.
3. The specified maximum depth is reached.
4. All splits give tables with maximum entropy (impurity), i.e. the maximum information gain over all splits equals zero.
In the general case (any case other than the four above), the train function simply calls itself recursively on the right child and the left child until a base case is reached.

So when do we reach a leaf node? This happens when a base case is met while following the splits that maximize the information gain; at such a node we save the prediction in the “prediction” attribute. A condensed sketch of the whole routine is given below.
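The following is a condensed sketch of how such a recursive training routine might look, reusing the entropy function sketched in section 1.1.3.2. It follows the description above under the binary-feature assumption of section 1.1.1 and folds the split search into the train method itself, so it is not the exact code in TreeNode.py:

```python
import numpy as np

class TreeNodeSketch:
    """Simplified stand-in for the project's TreeNode class (illustrative only)."""
    def __init__(self, depth=0, max_depth=None):
        self.depth, self.max_depth = depth, max_depth
        self.column = None                  # feature column chosen at this node
        self.left = self.right = None
        self.prediction = None

    def train(self, X, Y):
        # Base cases 1-3: one class, one example, or maximum depth reached.
        if len(Y) == 1 or len(np.unique(Y)) == 1 or \
           (self.max_depth is not None and self.depth >= self.max_depth):
            self.prediction = int(round(float(np.mean(Y))))
            return
        # Choose the column whose 0/1 split maximizes the information gain (eq. 2).
        gains = []
        for col in range(X.shape[1]):
            mask = X[:, col] == 0
            if mask.all() or (~mask).all():
                gains.append(0.0)           # this column does not separate anything
                continue
            gains.append(entropy(Y)
                         - mask.mean() * entropy(Y[mask])
                         - (~mask).mean() * entropy(Y[~mask]))
        best = int(np.argmax(gains))
        if gains[best] == 0:                # base case 4: no split improves purity
            self.prediction = int(round(float(np.mean(Y))))
            return
        self.column = best
        mask = X[:, best] == 0
        self.left = TreeNodeSketch(self.depth + 1, self.max_depth)
        self.right = TreeNodeSketch(self.depth + 1, self.max_depth)
        self.left.train(X[mask], Y[mask])
        self.right.train(X[~mask], Y[~mask])

# Training on the OR function truth table used in Figure 5.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 1])
root = TreeNodeSketch(max_depth=3)
root.train(X, Y)
```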
1.1.3.3.2. The “information_gain” function
This function takes as inputs a column that represents the training examples on one feature, a column that represents the labels, and the index at which the table will be split. The goal of the function is to calculate the information gain of the table after splitting. Here, as we did in the entropy function, we made use of the base case in which there is only one class in the labels or one of the children has no elements; obviously, in this base case the information gain equals zero. In the general case, we use equation (2).
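A sketch of such a function, reusing the entropy helper sketched earlier, might look as follows. The argument handling (in particular the role of the feature column once the rows are ordered) is an assumption and may differ from the repository code:

```python
import numpy as np

def information_gain(x_column, labels, split_index):
    """Information gain (eq. 2) of splitting `labels` at `split_index`,
    assuming the rows are already ordered consistently with `x_column`."""
    left, right = labels[:split_index], labels[split_index:]
    # base case: an empty child or already-pure labels give zero gain
    if len(left) == 0 or len(right) == 0 or len(np.unique(labels)) == 1:
        return 0.0
    p_left = len(left) / len(labels)
    p_right = 1.0 - p_left
    return entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
```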
1.1.3.3.3. The “split_column” function
The inputs to this function are the matrix of all training examples over all attributes, the labels, and the index of the column on which we want to split.
The function finds the best split index and returns this index together with the maximum information gain obtained from that split. This is done by first sorting the values of the feature column, so that all zeros are next to each other and all ones are next to each other, to make splitting possible. The function then finds the boundaries at which the labels change from zero to one and runs through these boundaries to find the index that gives the best split.
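A possible sketch of this search, built on the information_gain sketch above (again an illustration, not the repository code):

```python
import numpy as np

def split_column(X, labels, col):
    """Find the best split index on column `col` and the gain it achieves."""
    order = np.argsort(X[:, col])                 # put all 0s before all 1s
    x_sorted, y_sorted = X[order, col], labels[order]
    # boundaries where the (sorted) labels change value
    boundaries = np.nonzero(y_sorted[:-1] != y_sorted[1:])[0] + 1
    best_split, max_gain = None, 0.0
    for b in boundaries:
        gain = information_gain(x_sorted, y_sorted, b)
        if gain > max_gain:
            best_split, max_gain = b, gain
    return best_split, max_gain
```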
1.1.3.3.4. The “predict_element” function
This function aims at predicting the label of one training example. It is recursive in nature, so we look for a base case from which to start giving a prediction and build upon it. The base case is reached when there is no column stored in the “column” attribute or no split index stored in the “split” attribute. In this base case the prediction equals the value stored in the “prediction” attribute (this attribute stores the value of the labels, the mean of the label values when there is only one class or only one label, or the value of the left and right nodes' predictions after finding the best split). In the general case, the function tries to find the nearest location of the entered training example by comparing the index of the best column to split the data on with the index stored in the “split” attribute. If this column's index is less than the “split” attribute, then we turn “left”, and vice versa. We continue traversing recursively until we reach a leaf node (a node that has a NULL left or right child).
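Continuing the TreeNodeSketch class from earlier, a recursive descent of this kind could be sketched as follows (the sketch compares the 0/1 feature value directly rather than a stored split index, an assumption made to keep it short):

```python
    def predict_element(self, x):
        """Predict the label of a single example by walking down the tree."""
        if self.column is None:          # base case: this node is a leaf
            return self.prediction
        if x[self.column] == 0:          # answer "0" leads to the left child
            return self.left.predict_element(x)
        return self.right.predict_element(x)
```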
1.1.3.3.5. The “predict” function
The predict function simply calls the “predict_element” function for each training example and returns an array of predictions (one prediction per example).
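In the sketch, this is a one-line wrapper added to the same class:

```python
    def predict(self, X):
        """Predict a label for every row of the feature matrix X."""
        return np.array([self.predict_element(row) for row in X])
```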
1.1.3.4. The “DecisionTreeClassifier” class
1.1.3.4.1. The “fit” function
We called it “fit” so that it has the same name as the scikit-learn decision tree model, which makes it convenient to use. It simply creates a root as a “TreeNode” object and calls the train function on the root to build the whole decision tree.
1.1.3.4.2. The “predict” function
Calls the predict function of the “TreeNode” class on an instance (the root).
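A wrapper along these lines (illustrative only, not the repository code in Dt.py) ties the sketches above together:

```python
class DecisionTreeClassifierSketch:
    """Thin wrapper around TreeNodeSketch, mirroring scikit-learn's fit/predict API."""
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.root = None

    def fit(self, X, Y):
        self.root = TreeNodeSketch(depth=0, max_depth=self.max_depth)
        self.root.train(X, Y)

    def predict(self, X):
        return self.root.predict(X)
```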
1.1.3.4.3. The “accuracy” function
Calculates a chosen accuracy metric of the model on the given dataset, such as the usual accuracy, the F1 score, or the Matthews correlation coefficient (a more representative metric that is suitable for classification). Here are some definitions to make the idea clear.

True positives (TP): # of examples of the positive class that are predicted correctly.
False positives (FP): # of examples predicted as positive that actually belong to the negative class.
True negatives (TN): # of examples of the negative class that are predicted correctly.
False negatives (FN): # of examples predicted as negative that actually belong to the positive class.
Precision: TP / (TP + FP), i.e. true positives over all predicted positives.
Recall: TP / (TP + FN).
F1 score: Harmonic mean of precision and recall.

\[ \mathrm{Accuracy} = \frac{\#\ \text{of true predictions}}{\text{total}\ \#\ \text{of labels}} \tag{3} \]

\[ F_1\ \mathrm{score} = \frac{2}{\mathrm{precision}^{-1} + \mathrm{recall}^{-1}} \tag{4} \]

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{5} \]
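For reference, these three metrics could be computed from the confusion-matrix counts as follows. This is a sketch, not the project's accuracy function, and it assumes all four counts are non-zero:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy (eq. 3), F1 score (eq. 4) and MCC (eq. 5) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, f1, mcc

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```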

1.2. GUI implementation

Figure 6. GUI to DT algorithm

This GUI was created to make it easy for the user to use the functionality of the decision tree algorithm without being involved in the Python script. Some implementation details and a guide to running the GUI are presented in this video:
https://www.youtube.com/watch?v=g9s85DdLTNw.
The whole code of the project can also be found here:
https://github.com/omar-ashinawy/DS-DT/blob/master/Dt.py
2. It Works!
In this section, we demonstrate that the project gives a reasonable accuracy, using the usual accuracy defined by eq. (3) and the F1 score defined by eq. (4), even when the tree is not pruned.


We also tested the classifier on the given test file and saved the test results in:
https://drive.google.com/open?id=1YYTMkutEiLqC6yr-Zqy5V_bfN4dPvTz8


03
Complexity of
operations


1. Creating the decision tree


Let us consider a special case (two variables and two labels) that has the worst-case entropy, i.e. the highest impurity. As an example, take the simple XOR function shown in Figure 7. The “train” function (the function that creates the decision tree) calls itself recursively and calls two other functions: “split_column” and “information_gain”. For the recursive calls, and for one training example, the worst-case complexity is O(n log2 n), where n is the number of features. Doing this for all training examples requires O(m · n log2 n), where m is the number of training examples. For the “split_column” function, the main work is sorting the column to be split on, which requires O(m log2 m). Calculating the proportions of each class in the “entropy” and “information_gain” functions obviously requires going through all training examples, so it is O(m).
To conclude, the time complexity (T.C.) can be approximated by:

\[ \mathrm{T.C.} = O(m) + O(m \cdot n \log_2 n) + O(n \log_2 n) \tag{6} \]
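Taking the terms of equation (6) at face value, the middle term dominates the other two whenever n ≥ 2, so the training cost can be summarized as:

\[ \mathrm{T.C.} = O(m \cdot n \log_2 n) \]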

Figure 7. A simple XOR function that addresses the worst case


04
References


1. Gupta, Shubham, HackerEarth website, https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/ml-decision-tree/tutorial/.
2. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” Springer (2013).
3. Sani, Habiba M.; Lei, Ci; Neagu, Daniel, “Computational Complexity Analysis of Decision Tree Algorithms,” Springer Nature Switzerland (2018).
4. Wikipedia, “F1 Score,” https://en.wikipedia.org/wiki/F1_score.
5. Wikipedia, “Confusion Matrix,” https://en.wikipedia.org/wiki/Confusion_matrix.
6. Wikipedia, “Matthews Correlation Coefficient,” https://en.wikipedia.org/wiki/Matthews_correlation_coefficient.

