
Supervised Learning

By Ahlem Marzouk
Definition

Supervised Learning is a machine learning approach in which algorithms are trained on labelled datasets to learn the relationship between input data and the desired outcomes.

The training prepares the model to make correct predictions or classifications on new data.
Data

The labelled dataset is split into a training set (typically up to 80% of the data) and a testing set. Each part consists of inputs and outputs: X Train / Y Train for training and X Test / Y Test for evaluation.

[Workflow diagram: the algorithm is trained on X Train and Y Train; after training, it produces Y Prediction from X Test, which is compared with Y Test using an evaluation metric.]
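As a minimal sketch of this split, the snippet below uses scikit-learn's train_test_split on a small placeholder dataset (the arrays X and y are made up, not data from the slides):

```python
# Minimal sketch: splitting a labelled dataset into training and testing sets.
# X (inputs) and y (labels) are placeholder arrays, not data from the slides.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0, 1] * 5)           # 10 labels

# Keep 80% of the samples for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```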
Supervised Learning

Classification: KNN, SVM, Decision Tree.
Regression: Linear Regression.
Classification
Classification is the process of classifying input data into distinct classes.
It entails training a model using a labeled dataset to understand the link between inputs and outputs, allowing it to make predictions about new, unknown data.

This process is used to detect specific types of new observations, such as determining if an email is spam, whether traffic behavior is normal or abnormal, whether we have an attack or normal activity, and so on.
Evaluation Metrics
❑ To evaluate the performance of our model, we compare predicted results with actual labels.
❑ Commonly used measures for classification are accuracy, precision and recall.
❑ A confusion matrix provides a simple summary of the prediction results in a classification problem: correct and incorrect predictions are counted and broken down by class.
Evaluation Metrics
Accuracy: the percentage of instances correctly predicted by the model.

Recall: out of all the actual positive instances, how many did we predict correctly.

Precision: out of all the instances predicted as positive, how many are actually positive.
Example:
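A small worked sketch with made-up labels, showing how these metrics relate to the confusion matrix (the y_true and y_pred lists are illustrative, not from the slides):

```python
# Worked sketch with made-up labels: comparing predictions to actual classes.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (1 = positive class)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))    # correct / total = 8/10
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 4/5
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN) = 4/5
```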
KNN [K-Nearest Neighbor]

K-Nearest Neighbor is a very simple, versatile algorithm and one of the most widely used in machine learning.

It is used for classification and regression problems and is based on a feature-similarity approach.
KNN [K-Nearest Neighbor]
Concept

In KNN, classification is based on similarity to other cases. This similarity is evaluated using the distance between two cases.

Points that are close to each other are called "neighbors".
KNN [K-Nearest Neighbor]

How does it work? (The steps are illustrated in the sketch below.)

1- Choose the value of K.
2- Calculate the distance between the new case and every case in the dataset.
3- Find the K observations in the dataset that are closest to the new case.
4- Predict the class of the new case by a majority vote among these K neighbors.
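A minimal from-scratch sketch of these four steps on made-up 2-D points (the training data, the new case and K = 3 are all illustrative choices):

```python
# Minimal from-scratch sketch of the four KNN steps, on made-up 2-D points.
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
new_case = np.array([2, 2])

k = 3                                                       # 1- choose K
dists = np.linalg.norm(X_train - new_case, axis=1)          # 2- distance to every case
nearest = np.argsort(dists)[:k]                             # 3- the K closest observations
predicted = Counter(y_train[nearest]).most_common(1)[0][0]  # 4- majority vote
print(predicted)  # "A"
```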
KNN [K-Nearest Neighbor]

Choice of the distance:

Euclidean distance.
Manhattan distance.
Minkowski distance.
Jaccard distance.
Hamming distance.

The choice of the distance depends on the type of data being used. For quantitative data of the same type, the Euclidean distance is a good choice. The Manhattan distance is a good measure to use when the data (input variables) are not of the same type (e.g. age, sex, length, weight, etc.).
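A short sketch of how some of these distances can be computed; the vectors a, b, u and v are made up, and Minkowski is shown with an arbitrary p = 3:

```python
# Sketch of a few distance measures on made-up feature vectors.
import numpy as np

a = np.array([1.0, 3.0, 5.0])
b = np.array([2.0, 1.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(a - b))                  # sum of absolute differences
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)  # generalization (here p = 3)

u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
hamming = np.mean(u != v)                          # fraction of positions that differ

print(euclidean, manhattan, minkowski, hamming)
```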
KNN [K-Nearest Neighbor]

Choice of the number of neighbors K:

Low K:
Can lead to overfitting, where the model memorizes the training data too closely and doesn't generalize well to unseen data.
This results in high variance and poor performance on new data.
High variance: the model is very sensitive to individual data points.
KNN [K-Nearest Neighbor]

Choice of the number of neighbors K:

High K:
Can lead to underfitting, where the model ignores local patterns and makes overly general predictions.
This results in high bias and lower variance.
Low variance: the model is barely sensitive to individual data points.
High bias: it fails to capture the underlying patterns in the data.
KNN [K-Nearest Neighbor]
Choice of the number of neighbors K:
To avoid classification ties, it is advised that K be an odd integer, and cross-validation techniques can assist you in determining the best K for your dataset.
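A sketch of this tuning with scikit-learn's GridSearchCV; the Iris toy dataset and the candidate values of K are only there to make the example runnable:

```python
# Sketch: using cross-validation to pick K, as suggested above.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}   # odd values to avoid ties
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```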
KNN [K-Nearest Neighbor]
Advantages of KNN:

❑ Flexibility for non-linear data.
❑ Robust to outliers.
❑ Lazy learning: KNN does not perform an intensive training phase. This can be advantageous in scenarios where training time is a critical consideration.
KNN [K-Nearest Neighbor]
Disadvantages of KNN:
❑ KNN does not build a model to make a prediction. The cost is that it must keep all observations in memory in order to make its predictions, so the size of the training set has to be chosen carefully.

❑ The choice of the distance measure and of the number of neighbors K may not be obvious. We need to try several combinations and tune the algorithm to get a satisfactory result.

❑ Unbalanced data between classes can create problems.
SVM [Support Vector Machine]

SVM is an algorithm that can be used for classification or regression purposes. It aims to find a separator, called a hyperplane, to classify data points.

There are two types of SVM: linear and non-linear.
Linear SVM

Goal: find the optimal hyperplane with the maximum distance between the classes. It is called the maximum-margin hyperplane.
Linear SVM
If the number of attributes is equal to 2, the hyperplane is a line.
If the number of attributes equals 3, the hyperplane is a plane.
If the number of attributes is n ≥ 4, the hyperplane is a space of dimension n - 1 (difficult to visualize).
Linear SVM
Core principle of the algorithm:
In the SVM algorithm we define:
o The points of each class that are closest to the separator, called support vectors.
o The distance between these vectors and the hyperplane, called the margin.
The objective of SVM is to find a separator that maximizes this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Linear SVM
Mathematical Approach [the two-dimensional linear case]:

Let's consider two vectors u = (u1, u2) and v = (v1, v2).

To get an idea of the degree of relationship between two vectors, we use their scalar product:

u . v = u1*v1 + u2*v2
Linear SVM
Mathematical Approach [the two-dimensional linear case]:

The hyperplane is a line with equation y = a*x + b.

We consider the vectors w = (a, -1) and X = (x, y); then we get the equation we will use to define the two classes:

w . X + b = a*x - y + b
Linear SVM
Mathematical Approach [the two-dimensional linear case]:

The next step is to decide whether a new point belongs to one of the two classes, by substituting the point's coordinates into the equation and looking at the sign of the result.

A point above or on the hyperplane will be classed +1, and a point below the hyperplane will be classed -1.
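A minimal sketch of this decision rule; the weight vector, bias and test points are made-up values chosen only to illustrate the sign test:

```python
# Sketch of the decision rule above: classify a point by the sign of w.X + b.
# The weights and points are made up for illustration.
import numpy as np

w = np.array([0.5, -1.0])   # w = (a, -1) with a = 0.5
b = 1.0

def classify(point):
    # +1 if the point is on or above the hyperplane, -1 if it is below.
    return 1 if np.dot(w, point) + b >= 0 else -1

print(classify(np.array([2.0, 1.5])))   # 0.5*2 - 1.5 + 1 = 0.5  -> +1
print(classify(np.array([2.0, 3.0])))   # 0.5*2 - 3.0 + 1 = -1.0 -> -1
```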
Non-Linear SVM
In this case, it is impossible to draw a single line or hyperplane to classify the points correctly.

We therefore try to map this lower-dimensional space into a higher-dimensional space, for example using quadratic functions, to find a decision boundary that clearly divides the data points.
Non-Linear SVM

The functions that help us do this are called kernels, and the choice of kernel is itself treated as a hyperparameter of the model.

Examples of kernels: polynomial kernel, sigmoid kernel, RBF kernel, etc.
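A sketch of a non-linear SVM with an RBF kernel; the concentric-circles toy dataset and the hyperparameter values (C, gamma) are illustrative choices, not taken from the slides:

```python
# Sketch: fitting a non-linear SVM with an RBF kernel on a toy, non-linearly
# separable dataset (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

model = SVC(kernel="rbf", C=1.0, gamma="scale")   # kernel choice is a hyperparameter
model.fit(X, y)
print(model.score(X, y))   # training accuracy; should be close to 1.0 here
```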
Decision Tree [DT]
A decision tree is a non-parametric supervised learning algorithm, used for both classification and regression tasks.

This tree contains nodes that can be divided into three categories:

o Root node: the node through which the tree is entered.

o Internal nodes (or decision nodes): nodes that have descendants (or children), which are themselves nodes.

o Terminal nodes (or leaf nodes): nodes that have no descendants.
Decision Tree [DT]

[Diagram: a root node whose Yes/No branches lead to inner (decision) nodes and end (leaf) nodes.]
Decision Tree [DT]
Example
Does the applicant have a permanent job?
o No: no credit.
o Yes: does the candidate have a house?
  o Yes: yes for the credit.
  o No: next question? Continue the discussion.
Decision Tree [DT]
The purpose of decision trees is to clean up the data: the algorithm employs a "divide and conquer" technique.

It starts with all the data and then asks a series of questions (splits) to categorize the data points. Each time a new node is added to the tree, we increase the purity of the groups as much as possible. This procedure continues until most data points are categorized and we have a clear answer.
Decision Tree [DT]
The most popular splitting criteria for decision tree models are information gain and Gini impurity. They help to evaluate the quality of each test condition and how well it will be able to classify samples into a class.

They both have the same goal, splitting the data into more homogeneous subsets, but they use different approaches to measure impurity.

The algorithm uses one of them as its splitting criterion.
Decision Tree [DT]
Information gain is the reduction in entropy after splitting on a certain characteristic. It prefers splits that produce more balanced distributions of class labels in the resulting subsets.
Entropy: evaluates the impurity in the dataset. If the entropy decreases, the data is becoming more and more pure.

Gini impurity: measures the impurity, or the probability of misclassification, in a dataset used in decision tree methods.

A Gini impurity of zero means that all the items in the dataset belong to the same class, whereas a Gini impurity close to its maximum (which approaches one as the number of classes grows) means that the items are uniformly distributed across the categories.
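A small sketch of the two impurity measures, using the standard formulas (entropy = -sum p_i log2 p_i, Gini = 1 - sum p_i^2); the class counts passed in are illustrative:

```python
# Sketch of the two impurity measures for a node with given class counts.
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                      # ignore empty classes (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(entropy([4, 4]), gini([4, 4]))   # maximally impure two-class node: 1.0, 0.5
print(entropy([4, 0]), gini([4, 0]))   # pure node: 0.0, 0.0
```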
Decision Tree [DT]
Limitations:
1. Unstable: Small changes in data might result in significantly distinct tree structures,
affecting predictions.
2. Overfitting: Decision trees can become very complicated, fitting the training data
effectively but failing to generalize to new data.
3. Biased towards features with lots of splits: Features with a higher number of
category levels can influence decisions more than other pertinent features.

Decision Tree [DT]
Advantages:
1. Easy interpretation for simple trees.
2. Robust against outliers and missing data: Decision trees don't require a lot of data
preparation to handle outliers and missing values.
3. Nonparametric.
4. Flexible: They can deal with category and numerical data, which allows them
to adapt to different kinds of problems.
5. Fast Training.

Decision Tree [DT]
Example

The set S contains 8 examples.

Root node: [4 Yes, 4 No], 50% Yes and 50% No. The split is based on Age.

o Age <= 30, with 3 examples: [0 Yes, 3 No], 100% No (terminal node).
o Age 31-40, with 2 examples: [2 Yes, 0 No], 100% Yes (terminal node).
o Age > 40, with 3 examples: [2 Yes, 1 No], 66.7% Yes and 33.3% No (internal node). The split is based on the Credit rating.
  o Fair credit rating, with 2 examples: [2 Yes, 0 No], 100% Yes (terminal node).
  o Excellent credit rating, with 1 example: [0 Yes, 1 No], 100% No (terminal node).
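A sketch that reproduces the structure of this example with scikit-learn. The exact records behind the slide are not given, so the 8 rows below are reconstructed from the class counts and should be read as an assumption; the integer encoding of the categorical attributes is also a choice made for illustration.

```python
# Sketch reproducing the structure of the example above with scikit-learn.
# Encoding (assumed): age group 0 = "<=30", 1 = "31-40", 2 = ">40";
#                     credit rating 0 = "fair", 1 = "excellent".
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [0, 0],        # age <= 30  -> all "No"
     [1, 0], [1, 1],                # age 31-40  -> all "Yes"
     [2, 0], [2, 0], [2, 1]]        # age > 40   -> fair: "Yes", excellent: "No"
y = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age_group", "credit_rating"]))
```

Note that scikit-learn uses binary threshold splits, so the three-way Age split from the slide appears as two successive binary tests in the printed tree.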
