
Supervised Learning

By Ahlem Marzouk
Definition

Supervised Learning is a machine learning approach in which algorithms are trained on labelled datasets to learn the relationship between input data and the desired outcomes.

The training prepares the model to make correct predictions or classifications on new data.
Data

The labelled dataset is split into a training set (typically up to 80% of the data) and a testing set. Each part consists of inputs and outputs: X Train / Y Train for training and X Test / Y Test for evaluation.

[Workflow diagram: the algorithm is trained on X Train and Y Train; after training, it produces Y Prediction from X Test, which is compared with Y Test using an evaluation metric.]
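As a minimal sketch of this split, the snippet below uses scikit-learn's train_test_split on a small placeholder dataset (the arrays X and y are made up, not data from the slides):

```python
# Minimal sketch: splitting a labelled dataset into training and testing sets.
# X (inputs) and y (labels) are placeholder arrays, not data from the slides.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0, 1] * 5)           # 10 labels

# Keep 80% of the samples for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```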
Supervised Learning

Classification: KNN, SVM, Decision Tree.
Regression: Linear Regression.
Classification
Classification is the process of classifying input data into distinct classes.
It entails training a model using a labeled dataset to understand the link between inputs and outputs, allowing it to make predictions about new, unknown data.

This process is used to detect specific types of new observations, such as determining if an email is spam, whether traffic behavior is normal or abnormal, whether we have an attack or normal activity, and so on.
Evaluation Metrics
❑ To evaluate the performance of our model, we compare predicted results with actual labels.
❑ Commonly used measures for classification are accuracy, precision and recall.
❑ A confusion matrix provides a simple summary of the prediction results in a classification problem: correct and incorrect predictions are counted and broken down by class.
Evaluation Metrics
Accuracy: the percentage of instances correctly predicted by the model.

Recall: out of all the actual positive instances, how many did we predict correctly.

Precision: out of all the instances predicted as positive, how many are actually positive.
Example:
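A small worked sketch with made-up labels, showing how these metrics relate to the confusion matrix (the y_true and y_pred lists are illustrative, not from the slides):

```python
# Worked sketch with made-up labels: comparing predictions to actual classes.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (1 = positive class)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))    # correct / total = 8/10
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 4/5
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN) = 4/5
```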
KNN [K-Nearest Neighbor]

K-Nearest Neighbor is a very simple, versatile algorithm and one of the most widely used in machine learning.

It is used for classification and regression problems and is based on a feature-similarity approach.
KNN [K-Nearest Neighbor]
Concept

In KNN, classification is based on similarity to other cases. This similarity is evaluated using the distance between two cases.

Points that are close to each other are called "neighbors".
KNN [K-Nearest Neighbor]

How does it work? (The steps are illustrated in the sketch below.)

1- Choose the value of K.
2- Calculate the distance between the new case and every case in the dataset.
3- Find the K observations in the dataset that are closest to the new case.
4- Predict the class of the new case by a majority vote among these K neighbors.
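A minimal from-scratch sketch of these four steps on made-up 2-D points (the training data, the new case and K = 3 are all illustrative choices):

```python
# Minimal from-scratch sketch of the four KNN steps, on made-up 2-D points.
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
new_case = np.array([2, 2])

k = 3                                                       # 1- choose K
dists = np.linalg.norm(X_train - new_case, axis=1)          # 2- distance to every case
nearest = np.argsort(dists)[:k]                             # 3- the K closest observations
predicted = Counter(y_train[nearest]).most_common(1)[0][0]  # 4- majority vote
print(predicted)  # "A"
```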
KNN [K-Nearest Neighbor]

Choice of the distance:

Euclidean distance.
Manhattan distance.
Minkowski distance.
Jaccard distance.
Hamming distance.

The choice of the distance depends on the type of data being used. For quantitative data of the same type, the Euclidean distance is a good choice. The Manhattan distance is a good measure to use when the data (input variables) are not of the same type (e.g. age, sex, length, weight, etc.).
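A short sketch of how some of these distances can be computed; the vectors a, b, u and v are made up, and Minkowski is shown with an arbitrary p = 3:

```python
# Sketch of a few distance measures on made-up feature vectors.
import numpy as np

a = np.array([1.0, 3.0, 5.0])
b = np.array([2.0, 1.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(a - b))                  # sum of absolute differences
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)  # generalization (here p = 3)

u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
hamming = np.mean(u != v)                          # fraction of positions that differ

print(euclidean, manhattan, minkowski, hamming)
```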
KNN [K-Nearest Neighbor]

Choice of the number of neighbors K:

Low K:
Can lead to overfitting, where the model memorizes the training data too closely and doesn't generalize well to unseen data.
This results in high variance and poor performance on new data.
High variance: the model is very sensitive to individual data points.
KNN [K-Nearest Neighbor]

Choice of the number of neighbors K:

High K:
Can lead to underfitting, where the model ignores local patterns and makes overly general predictions.
This results in high bias and lower variance.
Low variance: the model is barely sensitive to individual data points.
High bias: it fails to capture the underlying patterns in the data.
KNN [K-Nearest Neighbor]
Choice of the number of neighbors K:
To avoid classification ties, it is advised that K be an odd integer, and cross-validation techniques can assist you in determining the best K for your dataset.
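A sketch of this tuning with scikit-learn's GridSearchCV; the Iris toy dataset and the candidate values of K are only there to make the example runnable:

```python
# Sketch: using cross-validation to pick K, as suggested above.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}   # odd values to avoid ties
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```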
KNN [K-Nearest Neighbor]
Advantages of KNN:

❑ Flexibility for non-linear data.
❑ Robust to outliers.
❑ Lazy learning: KNN does not perform an intensive training phase. This can be advantageous in scenarios where training time is a critical consideration.
KNN [K-Nearest Neighbor]
Disadvantages of KNN:
❑ KNN does not build a model to make a prediction. The cost is that it must keep all observations in memory in order to make its predictions, so the size of the training set has to be chosen carefully.

❑ The choice of the distance measure and of the number of neighbors K may not be obvious. We need to try several combinations and tune the algorithm to get a satisfactory result.

❑ Unbalanced data between classes can create problems.
SVM [Support Vector Machine]

SVM is an algorithm that can be used for classification or regression purposes. It aims to find a separator, called a hyperplane, to classify data points.

There are two types of SVM: linear and non-linear.
Linear SVM

Goal: find the optimal hyperplane with the maximum distance between the classes. It is called the maximum-margin hyperplane.
Linear SVM
If the number of attributes is equal to 2, the hyperplane is a line.
If the number of attributes equals 3, the hyperplane is a plane.
If the number of attributes is n ≥ 4, the hyperplane is a space of dimension n - 1 (difficult to visualize).
Linear SVM
Core principle of the algorithm:
In the SVM algorithm we define:
o The points of each class that are closest to the separator, called support vectors.
o The distance between these vectors and the hyperplane, called the margin.
The objective of SVM is to find a separator that maximizes this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Linear SVM
Mathematical Approach [the two-dimensional linear case]:

Let's consider two vectors u = (u1, u2) and v = (v1, v2).

To get an idea of the degree of relationship between two vectors, we use their scalar product:

u . v = u1*v1 + u2*v2
Linear SVM
Mathematical Approach [the two-dimensional linear case]:

The hyperplane is a line with equation y = a*x + b.

We consider the vectors w = (a, -1) and X = (x, y); then we get the equation we will use to define the two classes:

w . X + b = a*x - y + b
Linear SVM
Mathematical Approach [the two-dimensional linear case]:

The next step is to decide whether a new point belongs to one of the two classes, by substituting the point's coordinates into the equation and looking at the sign of the result.

A point above or on the hyperplane will be classed +1, and a point below the hyperplane will be classed -1.
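A minimal sketch of this decision rule; the weight vector, bias and test points are made-up values chosen only to illustrate the sign test:

```python
# Sketch of the decision rule above: classify a point by the sign of w.X + b.
# The weights and points are made up for illustration.
import numpy as np

w = np.array([0.5, -1.0])   # w = (a, -1) with a = 0.5
b = 1.0

def classify(point):
    # +1 if the point is on or above the hyperplane, -1 if it is below.
    return 1 if np.dot(w, point) + b >= 0 else -1

print(classify(np.array([2.0, 1.5])))   # 0.5*2 - 1.5 + 1 = 0.5  -> +1
print(classify(np.array([2.0, 3.0])))   # 0.5*2 - 3.0 + 1 = -1.0 -> -1
```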
Non-Linear SVM
In this case, it is impossible to draw a single line or hyperplane to classify the points correctly.

We therefore try to map this lower-dimensional space into a higher-dimensional space, for example using quadratic functions, to find a decision boundary that clearly divides the data points.
Non-Linear SVM

The functions that help us do this are called kernels, and the choice of kernel is itself treated as a hyperparameter of the model.

Examples of kernels: polynomial kernel, sigmoid kernel, RBF kernel, etc.
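A sketch of a non-linear SVM with an RBF kernel; the concentric-circles toy dataset and the hyperparameter values (C, gamma) are illustrative choices, not taken from the slides:

```python
# Sketch: fitting a non-linear SVM with an RBF kernel on a toy, non-linearly
# separable dataset (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

model = SVC(kernel="rbf", C=1.0, gamma="scale")   # kernel choice is a hyperparameter
model.fit(X, y)
print(model.score(X, y))   # training accuracy; should be close to 1.0 here
```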
Decision Tree [DT]
A decision tree is a non-parametric supervised learning algorithm, used for both classification and regression tasks.

This tree contains nodes that can be divided into three categories:

o Root node: the node through which the tree is entered.

o Internal nodes (or decision nodes): nodes that have descendants (or children), which are themselves nodes.

o Terminal nodes (or leaf nodes): nodes that have no descendants.
Decision Tree [DT]

[Diagram: a root node whose Yes/No branches lead to inner (decision) nodes and end (leaf) nodes.]
Decision Tree [DT]
Example
Does the applicant have a permanent job?
o No: no credit.
o Yes: does the candidate have a house?
  o Yes: yes for the credit.
  o No: next question? Continue the discussion.
Decision Tree [DT]
The purpose of decision trees is to clean up the data: the algorithm employs a "divide and conquer" technique.

It starts with all the data and then asks a series of questions (splits) to categorize the data points. Each time a new node is added to the tree, we increase the purity of the groups as much as possible. This procedure continues until most data points are categorized and we have a clear answer.
Decision Tree [DT]
The most popular splitting criteria for decision tree models are information gain and Gini impurity. They help to evaluate the quality of each test condition and how well it will be able to classify samples into a class.

They both have the same goal, splitting the data into more homogeneous subsets, but they use different approaches to measure impurity.

The algorithm uses one of them as its splitting criterion.
Decision Tree [DT]
Information gain is the reduction in entropy after splitting on a certain characteristic. It prefers splits that produce more balanced distributions of class labels in the resulting subsets.
Entropy: evaluates the impurity in the dataset. If the entropy decreases, the data is becoming more and more pure.

Gini impurity: measures the impurity, or the probability of misclassification, in a dataset used in decision tree methods.

A Gini impurity of zero means that all the items in the dataset belong to the same class, whereas a Gini impurity close to its maximum (which approaches one as the number of classes grows) means that the items are uniformly distributed across the categories.
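A small sketch of the two impurity measures, using the standard formulas (entropy = -sum p_i log2 p_i, Gini = 1 - sum p_i^2); the class counts passed in are illustrative:

```python
# Sketch of the two impurity measures for a node with given class counts.
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                      # ignore empty classes (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.array(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(entropy([4, 4]), gini([4, 4]))   # maximally impure two-class node: 1.0, 0.5
print(entropy([4, 0]), gini([4, 0]))   # pure node: 0.0, 0.0
```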
Decision Tree [DT]
Limitations:
1. Unstable: Small changes in data might result in significantly distinct tree structures,
affecting predictions.
2. Overfitting: Decision trees can become very complicated, fitting the training data
effectively but failing to generalize to new data.
3. Biased towards features with lots of splits: Features with a higher number of
category levels can influence decisions more than other pertinent features.

Decision Tree [DT]
Advantages:
1. Easy interpretation for simple trees.
2. Robust against outliers and missing data: Decision trees don't require a lot of data
preparation to handle outliers and missing values.
3. Nonparametric.
4. Flexible: They can deal with category and numerical data, which allows them
to adapt to different kinds of problems.
5. Fast Training.

Decision Tree [DT]
Example

The set S contains 8 examples.

Root node: [4 Yes, 4 No], 50% Yes and 50% No. The split is based on Age.

o Age <= 30, with 3 examples: [0 Yes, 3 No], 100% No (terminal node).
o Age 31-40, with 2 examples: [2 Yes, 0 No], 100% Yes (terminal node).
o Age > 40, with 3 examples: [2 Yes, 1 No], 66.7% Yes and 33.3% No (internal node). The split is based on the Credit rating.
  o Fair credit rating, with 2 examples: [2 Yes, 0 No], 100% Yes (terminal node).
  o Excellent credit rating, with 1 example: [0 Yes, 1 No], 100% No (terminal node).
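A sketch that reproduces the structure of this example with scikit-learn. The exact records behind the slide are not given, so the 8 rows below are reconstructed from the class counts and should be read as an assumption; the integer encoding of the categorical attributes is also a choice made for illustration.

```python
# Sketch reproducing the structure of the example above with scikit-learn.
# Encoding (assumed): age group 0 = "<=30", 1 = "31-40", 2 = ">40";
#                     credit rating 0 = "fair", 1 = "excellent".
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [0, 0],        # age <= 30  -> all "No"
     [1, 0], [1, 1],                # age 31-40  -> all "Yes"
     [2, 0], [2, 0], [2, 1]]        # age > 40   -> fair: "Yes", excellent: "No"
y = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age_group", "credit_rating"]))
```

Note that scikit-learn uses binary threshold splits, so the three-way Age split from the slide appears as two successive binary tests in the printed tree.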
