
Decision Tree

What is Classification?

Classification is the process of dividing a dataset into
different categories or groups by adding labels.
Example: spam or non-spam
It adds each individual data point to a particular labeled group on
the basis of some condition.
Analyzing data → based on some conditions → divided
into different groups.
Classification
Fraud detection (is a transaction genuine or fraudulent?)
Notification alerts about whether a credit card transaction is genuine or not

Classify anything [e.g., houses, cars, fruits]



Types of Classification Algorithms
Decision Tree (DT)
Support Vector Machine (SVM)
K-Nearest Neighbors (KNN)
Random Forest
Naïve Bayes



Decision Tree Algorithm
A graphical representation of all the possible solutions to a
decision
Decisions are based on some conditions
Decisions made can be easily explained



What is a Decision Tree?
A graphical representation of all the possible solutions to a
decision based on certain conditions [e.g., a dataset, an
intelligent computerized assistant]



Cont.
Easy to understand
Easy to interpret

[Figure: Dataset → Decision Tree]



Which question to ask and when?

Which attribute among them should we pick first?
Ans.: Determine the attribute that best classifies the training data.



Which question to ask and when?

How do we choose the best attribute?

OR

How does a tree decide where to split?



Terminology
Gini Index: The measure of impurity (or purity) used in
building a DT.
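
As a standard formula (an addition, not written on the slide): for a node whose classes occur with proportions p_i,

Gini = 1 - \sum_i p_i^2

It is 0 for a pure node and largest (0.5 in the binary case) when the classes are evenly mixed.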



Terminology
Information Gain: Constructing a DT is all about finding the
attribute that returns the highest information gain. The
information gain is the decrease in entropy after a
dataset is split on the basis of an attribute.
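
In standard notation (an addition, not taken from the slide): splitting a set S on an attribute A with values v gives

IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} (|S_v| / |S|) \cdot Entropy(S_v)

where S_v is the subset of S for which A takes the value v.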



Terminology
Reduction in variance: The split with the lower variance is
selected as the criterion to split the population.
Variance measures how spread out the data are; the purer
(less impure) the data, the lower the variance.

Chi-Square: Used to find the statistical significance of the
differences between sub-nodes and the parent node.



Terminology
Entropy: Entropy is a measurable property that is most
commonly associated with a state of randomness (disorder).

Computing entropy is the first step in solving a DT problem.
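
In the standard form used for decision trees (an addition, not written on the slide), for a set S whose classes occur with proportions p_i:

Entropy(S) = - \sum_i p_i \log_2 p_i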



Entropy is a measure of impurity.

If P(s) = 0.5 → Entropy = 1
If P(s) = 1 or 0 (the data all belong to one class, i.e., the set is pure) → Entropy = 0
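
As a quick check against the entropy formula above: at P(s) = 0.5, Entropy = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1, while at P(s) = 1 (or 0), Entropy = -(1 · log2 1) = 0.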


Information Gain
Measures the reduction in entropy
Decides which attribute should be selected as the decision
node.



Compute the Entropy
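
The worked figure for this slide is not reproduced here; the sketch below shows one way the entropy and information-gain computations could be carried out in plain Python (the tiny weather-style dataset is illustrative, not the slide's):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    # Reduction in entropy obtained by splitting on one attribute (a column index).
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Illustrative data: each row holds one attribute (Outlook); labels say whether to play.
outlook = [["Sunny"], ["Sunny"], ["Overcast"], ["Rain"], ["Rain"]]
play = ["No", "No", "Yes", "Yes", "Yes"]
print(entropy(play))                       # about 0.971
print(information_gain(outlook, play, 0))  # about 0.971, since Outlook separates the classes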

Which node to select as a root node?



Which node to select further?



What the complete DT looks like



What should I do to play? → Pruning
Cutting down nodes in order to reach an optimal solution
Reducing the complexity
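
Pruning is not demonstrated on the slide; as a minimal sketch, scikit-learn's DecisionTreeClassifier can limit growth up front with max_depth (pre-pruning) or collapse weak branches afterwards with cost-complexity pruning via ccp_alpha (the dataset and parameter values here are only illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop the tree from growing past a fixed depth.
shallow = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Post-pruning: grow the tree, then collapse weak branches via cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(shallow.get_depth(), pruned.get_depth())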



Decision trees
To classify an example:
1. Start at the root
2. Perform the test at the current node
3. Follow the edge corresponding to the outcome
4. Go to step 2 unless the node is a leaf
5. Predict the outcome associated with the leaf
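
A minimal sketch of this procedure, assuming (hypothetically; this is not the slide's representation) that the tree is stored as nested dicts of the form {attribute: {value: subtree}} with plain labels at the leaves:

def predict(tree, example):
    # Walk the tree until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))  # the test at this node
        tree = branches[example[attribute]]             # follow the edge for this outcome
    return tree

# Illustrative tree for weather-style data.
tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes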



Advantages & Disadvantages
Adv:
Are simple to understand and interpret. People are able to
understand decision tree models after a brief explanation.
Help determine worst, best and expected values for different
scenarios.
Can be combined with other decision techniques.
Disadv:
They are unstable, meaning that a small change in the data can
lead to a large change in the structure of the optimal decision tree.
They are often relatively inaccurate. Many other predictors
perform better with similar data.



K-NN Algorithm
It is one of the simplest Machine Learning algorithms based
on the Supervised Learning technique.
It is a non-parametric algorithm, which means it does not
make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead it stores
the dataset and, at the time of classification, performs an
action on the dataset.



How KNN works
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new point to the
training data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
Step-4: Among these K neighbors, count the number of data
points in each category.
Step-5: Assign the new data point to the category for which the
number of neighbors is maximum.
Step-6: Our model is ready (see the sketch below).
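
A minimal pure-Python sketch of these steps (the toy 2-D points below are illustrative):

from collections import Counter
from math import sqrt

def knn_predict(train_points, train_labels, query, k=3):
    # Classify `query` by majority vote among its k nearest training points.
    distances = [(sqrt(sum((a - b) ** 2 for a, b in zip(p, query))), label)
                 for p, label in zip(train_points, train_labels)]
    nearest = sorted(distances)[:k]                 # the K nearest neighbors
    votes = Counter(label for _, label in nearest)  # count the categories among them
    return votes.most_common(1)[0][0]

# Toy data: category A clustered near (1, 1), category B near (5, 5).
points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(1.5, 1.5), k=3))  # A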





KNN ctd…

As we can see, the 3 nearest neighbors are from category A; hence this new data point must
belong to category A.



How to select the value of K in the K-NN Algorithm
There is no particular way to determine the best value for
"K", so we need to try some values to find the best one.
The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy
and lead to the effects of outliers in the model.
Large values for K are good, but too large a value may blur
the boundary between categories.
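
One common way to "try some values" is cross-validation; a minimal sketch with scikit-learn (the dataset and K range are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K by 5-fold cross-validated accuracy and keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])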



Advantages and disadvantages of KNN Algorithm:
Adv:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadv:
The value of K always needs to be determined, which may be
complex at times.
The computation cost is high because of calculating the
distance between the data points for all the training
samples.



THANK YOU
