
Decision Tree

What is Classification?

Classification is the process of dividing a dataset into
different categories or groups by adding labels.
Example: spam or non-spam
It adds each individual data point to a particular labeled group on
the basis of some condition.
Analyzing data → based on some conditions → divided
into different groups.
Classification
Fraud detection (is a transaction genuine or fraudulent?)
Notification alerts about whether a credit card transaction is genuine or not

Classify anything [e.g., houses, cars, fruits]



Types of Classification Algorithms
Decision Tree (DT)
Support Vector Machine (SVM)
K-Nearest Neighbors (KNN)
Random Forest
Naïve Bayes



Decision Tree Algorithm
A graphical representation of all the possible solutions to a
decision
Decisions are based on some conditions
Decisions made can be easily explained



What is a Decision Tree?
A graphical representation of all the possible solutions to a
decision based on certain conditions [e.g., a dataset, an
intelligent computerized assistant]



Cont.
Easy to understand
Easy to interpret

[Figure: Dataset → Decision Tree]



Which question to ask and when?

Which attribute among them should we pick first?
Ans.: Determine the attribute that best classifies the training data.



Which question to ask and when?

How do we choose the best attribute?

OR

How does a tree decide where to split?



Terminology
Gini Index: The measure of impurity (or purity) used in
building a DT.
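
As a standard formula (an addition, not written on the slide): for a node whose classes occur with proportions p_i,

Gini = 1 - \sum_i p_i^2

It is 0 for a pure node and largest (0.5 in the binary case) when the classes are evenly mixed.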



Terminology
Information Gain: Constructing a DT is all about finding the
attribute that returns the highest information gain. The
information gain is the decrease in entropy after a
dataset is split on the basis of an attribute.
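
In standard notation (an addition, not taken from the slide): splitting a set S on an attribute A with values v gives

IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} (|S_v| / |S|) \cdot Entropy(S_v)

where S_v is the subset of S for which A takes the value v.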



Terminology
Reduction in variance: The split with the lower variance is
selected as the criterion to split the population.
Variance measures how spread out the data are; the purer
(less impure) the data, the lower the variance.

Chi-Square: Used to find the statistical significance of the
differences between sub-nodes and the parent node.



Terminology
Entropy: Entropy is a measurable property that is most
commonly associated with a state of randomness (disorder).

Computing entropy is the first step in solving a DT problem.
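
In the standard form used for decision trees (an addition, not written on the slide), for a set S whose classes occur with proportions p_i:

Entropy(S) = - \sum_i p_i \log_2 p_i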



Entropy is a measure of impurity.

If P(s) = 0.5 → Entropy = 1
If P(s) = 1 or 0 (the data all belong to one class, i.e., the set is pure) → Entropy = 0
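
As a quick check against the entropy formula above: at P(s) = 0.5, Entropy = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1, while at P(s) = 1 (or 0), Entropy = -(1 · log2 1) = 0.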


Information Gain
Measures the reduction in entropy
Decides which attribute should be selected as the decision
node.



Compute the Entropy
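
The worked figure for this slide is not reproduced here; the sketch below shows one way the entropy and information-gain computations could be carried out in plain Python (the tiny weather-style dataset is illustrative, not the slide's):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    # Reduction in entropy obtained by splitting on one attribute (a column index).
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Illustrative data: each row holds one attribute (Outlook); labels say whether to play.
outlook = [["Sunny"], ["Sunny"], ["Overcast"], ["Rain"], ["Rain"]]
play = ["No", "No", "Yes", "Yes", "Yes"]
print(entropy(play))                       # about 0.971
print(information_gain(outlook, play, 0))  # about 0.971, since Outlook separates the classes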

Which node to select as a root node?



Which node to select further?



What the complete DT looks like



What should I do to play? → Pruning
Cutting down nodes in order to reach an optimal solution
Reducing the complexity
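
Pruning is not demonstrated on the slide; as a minimal sketch, scikit-learn's DecisionTreeClassifier can limit growth up front with max_depth (pre-pruning) or collapse weak branches afterwards with cost-complexity pruning via ccp_alpha (the dataset and parameter values here are only illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop the tree from growing past a fixed depth.
shallow = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Post-pruning: grow the tree, then collapse weak branches via cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(shallow.get_depth(), pruned.get_depth())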



Decision trees
To classify an example:
1. Start at the root
2. Perform the test at the current node
3. Follow the edge corresponding to the outcome
4. Go to step 2 unless the node is a leaf
5. Predict the outcome associated with the leaf
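
A minimal sketch of this procedure, assuming (hypothetically; this is not the slide's representation) that the tree is stored as nested dicts of the form {attribute: {value: subtree}} with plain labels at the leaves:

def predict(tree, example):
    # Walk the tree until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))  # the test at this node
        tree = branches[example[attribute]]             # follow the edge for this outcome
    return tree

# Illustrative tree for weather-style data.
tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes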



Advantages & Disadvantages
Adv:
Are simple to understand and interpret. People are able to
understand decision tree models after a brief explanation.
Help determine worst, best and expected values for different
scenarios.
Can be combined with other decision techniques.
Disadv:
They are unstable, meaning that a small change in the data can
lead to a large change in the structure of the optimal decision tree.
They are often relatively inaccurate. Many other predictors
perform better with similar data.



K-NN Algorithm
It is one of the simplest Machine Learning algorithms based
on the Supervised Learning technique.
It is a non-parametric algorithm, which means it does not
make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead it stores
the dataset and, at the time of classification, performs an
action on the dataset.



How KNN works
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new point to the
training data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
Step-4: Among these K neighbors, count the number of data
points in each category.
Step-5: Assign the new data point to the category for which the
number of neighbors is maximum.
Step-6: Our model is ready (see the sketch below).
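
A minimal pure-Python sketch of these steps (the toy 2-D points below are illustrative):

from collections import Counter
from math import sqrt

def knn_predict(train_points, train_labels, query, k=3):
    # Classify `query` by majority vote among its k nearest training points.
    distances = [(sqrt(sum((a - b) ** 2 for a, b in zip(p, query))), label)
                 for p, label in zip(train_points, train_labels)]
    nearest = sorted(distances)[:k]                 # the K nearest neighbors
    votes = Counter(label for _, label in nearest)  # count the categories among them
    return votes.most_common(1)[0][0]

# Toy data: category A clustered near (1, 1), category B near (5, 5).
points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(1.5, 1.5), k=3))  # A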





KNN ctd…

As we can see, the 3 nearest neighbors are from category A; hence this new data point must
belong to category A.



How to select the value of K in the K-NN Algorithm
There is no particular way to determine the best value for
"K", so we need to try some values to find the best one.
The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy
and lead to the effects of outliers in the model.
Large values for K are good, but too large a value may blur
the boundary between categories.
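
One common way to "try some values" is cross-validation; a minimal sketch with scikit-learn (the dataset and K range are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K by 5-fold cross-validated accuracy and keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])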



Advantages and disadvantages of KNN Algorithm:
Adv:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadv:
The value of K always needs to be determined, which may be
complex at times.
The computation cost is high because of calculating the
distance between the data points for all the training
samples.



THANK YOU
