Machine Learning Algorithms - Unit 3


Machine Learning Algorithms

14/05/2023 1
Content
• Regression
• Linear Regression
• Logistic Regression
• K-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
• Random forest
• Gradient boosting
• Clustering
• K-Means Clustering
• Fuzzy C means Clustering
• Hierarchical Clustering

14/05/2023 2
Regression
A regression problem is the problem of determining a relation between one or more independent
variables and an output variable which is a real continuous variable, given a set of observed values of
the set of independent variables and the corresponding values of the output variable.

Regression is a supervised learning problem where there is an input x an output y and the task is to
learn the mapping from the input to the output.

In Regression, we plot a graph between the variables that best fits the given data points; using this
plot, the machine learning model can make predictions about the data.

In simple words, "Regression shows a line or curve that passes through the data points on the target-
predictor graph in such a way that the vertical distance between the data points and the regression
line is minimum."

The distance between datapoints and line tells whether a model has captured a strong relationship or
not.
14/05/2023 3
Why do we use Regression Analysis?

Regression analysis helps in the prediction of a continuous variable.

There are various real-world scenarios where we need future predictions, such as weather
conditions, sales, or marketing trends, and for such cases we need a technique that can make
accurate predictions.

Regression analysis is such a technique: a statistical method used in machine learning
and data science.

Below are some other reasons for using Regression analysis:


• Regression estimates the relationship between the target and the independent variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing the regression, we can confidently determine the most important factor, the least
important factor, and how each factor is affecting the other factors.

14/05/2023 4
Regression Analysis in Machine learning
Example: Suppose there is a marketing company A that runs various advertisements every year and gets
sales from them. The list below shows the advertisement spend of the company in the last 5 years and the
corresponding sales.

Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the
prediction for the sales in that year.

To solve such prediction problems in machine learning, we need regression analysis.

14/05/2023 5
Terminologies Related to the Regression Analysis
o Dependent Variable: The main factor in Regression analysis which we want to predict or understand is
called the dependent variable. It is also called target variable.

o Independent Variable: The factors which affect the dependent variables or which are used to predict the
values of the dependent variables are called independent variable, also called as a predictor.

o Outliers: Outlier is an observation which contains either very low value or very high value in comparison to
other observed values. An outlier may hamper the result, so it should be avoided.

o Multicollinearity: If the independent variables are highly correlated with each other, then such a
condition is called Multicollinearity. It should not be present in the dataset, because it creates
problems while ranking the most affecting variable.

o Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test
dataset, then such problem is called Overfitting. And if our algorithm does not perform well even with
training dataset, then such problem is called underfitting.
14/05/2023 6
Different Regression models
There are various types of regressions which are used in data science and machine learning.

Each type has its own importance on different scenarios, but at the core, all the regression methods analyze
the effect of the independent variable on dependent variables.

Some important types of regression are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
14/05/2023 7
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning algorithms.

It is a statistical method that is used for predictive analysis.

Linear regression makes predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc.

Linear regression algorithm shows a linear relationship between a dependent (y) and one or more
independent variables, hence called as linear regression.

Since linear regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.

14/05/2023 8
Linear Regression in Machine Learning
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

14/05/2023 9
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:

Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear regression:


If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

14/05/2023 10
Hypothesis function for Linear Regression 

While training the model we are given :

x: input training data (univariate – one input variable(parameter))


y: labels to data (supervised learning)

When training the model – it fits the best line to predict the value of y for a given value of x. The
model gets the best regression fit line by finding the best θ1 and θ2 values.

θ1: intercept
θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally using our
model for prediction, it will predict the value of y for the input value of x.
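With these parameters, the hypothesis (the predicted value) is:

ŷ = h(x) = θ1 + θ2·x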
14/05/2023 11
Cost Function (J)
By achieving the best-fit regression line, the model aims to predict the y value such that the error
difference between the predicted value and the true value is minimum. So, it is very important to update
the θ1 and θ2 values to reach the best values that minimize the error between the predicted y value
(pred) and the true y value (y).

Cost function(J) of Linear Regression is the Root Mean Squared Error (RMSE) between predicted y
value (pred) and true y value (y).
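Written out with this notation (m training examples):

J(θ1, θ2) = √( (1/m) · Σ_{i=1}^{m} ( pred_i − y_i )² )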

14/05/2023 12
14/05/2023 13
Gradient Descent
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.

A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.

It is done by a random selection of values of coefficient and then iteratively update the values to reach the
minimum cost function.

We graph cost function as a function of parameter estimates i.e. parameter range of our hypothesis function
and the cost resulting from selecting a particular set of parameters.

We move downward towards pits in the graph, to find the minimum value.

The way to do this is by taking the derivative of the cost function, as explained in the previous figure.

Gradient Descent step downs the cost function in the direction of the steepest descent.

Size of each step is determined by parameter α known as Learning Rate. 


14/05/2023 14
In the Gradient Descent algorithm, one can infer two points:

1. If the slope is +ve: θj = θj – (+ve value). Hence the value of θj decreases.
2. If the slope is -ve: θj = θj – (-ve value). Hence the value of θj increases.

The choice of the correct learning rate is very important, as it ensures that Gradient Descent converges in a
reasonable time:

If we choose α to be very small, Gradient Descent will take small steps to reach the local minima and
will take a longer time to reach the minima.

If we choose α to be very large, Gradient Descent can overshoot the minimum. It may fail to converge
or even diverge.

14/05/2023 15
Example
Suppose we are given a dataset of work experience vs salary for a company, and the task is to predict
the salary of an employee based on his/her work experience.

14/05/2023 16
Example
Iteration 1 – In the start, θ0 and θ1 values are randomly chosen.
Let us suppose, θ0 = 0 and θ1 = 0.

•Predicted values after iteration 1 with Linear regression hypothesis.

Cost Function – Error

14/05/2023 17
Gradient Descent – Updating θ0 value
Here, j = 0

Gradient Descent – Updating θ1 value


Here, j = 1
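
Written out, using the squared-error form of the cost with learning rate α and m training examples, the two updates are:

θ0 := θ0 − (α/m) · Σ_{i=1}^{m} ( h(x_i) − y_i )          (j = 0)
θ1 := θ1 − (α/m) · Σ_{i=1}^{m} ( h(x_i) − y_i ) · x_i    (j = 1)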

14/05/2023 18
Example
Iteration 2 – θ0 = 0.005 and θ1 = 0.02657
• Predicted values after iteration 1 with Linear regression hypothesis

Now, similar to iteration no. 1 performed above we will again calculate Cost function and
update θj values using Gradient Descent.

We will keep on iterating until the cost function does not reduce any further. At that point, the model
achieves the best θ values. Using these θ values in the model hypothesis will give the best
prediction results.
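
The loop below is a minimal NumPy sketch of this procedure for simple linear regression; the experience/salary numbers are made up for illustration and are not the table from the slide.

```python
import numpy as np

# Illustrative data (years of experience vs. salary); not the slide's actual table
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.9, 6.1, 8.0, 9.9])

theta0, theta1 = 0.0, 0.0   # intercept and coefficient, initialised to 0 as in iteration 1
alpha = 0.01                # learning rate
m = len(x)

for _ in range(5000):
    pred = theta0 + theta1 * x          # hypothesis h(x)
    error = pred - y
    # gradients of the squared-error cost with respect to each parameter
    grad0 = error.sum() / m
    grad1 = (error * x).sum() / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)        # best-fit intercept and slope
print(theta0 + theta1 * 6)   # prediction for a new input x = 6
```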
14/05/2023 19
Normal Equation in Linear Regression
Normal Equation is an analytical approach to Linear Regression with a Least Square Cost Function.
We can directly find out the value of θ without using Gradient Descent.
Following this approach is an effective and time-saving option when we are working with a dataset with
a small number of features.
The Normal Equation is as follows:
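θ = (XᵀX)⁻¹ Xᵀ Y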

In the above equation,


θ : hypothesis parameters that define it the best.
X : Input feature value of each instance.
Y : Output value of each instance.

14/05/2023 20
Math Behind the equation –
Given the hypothesis function
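h_θ(x) = θ_0·x_0 + θ_1·x_1 + θ_2·x_2 + ⋯ + θ_n·x_n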

where,
n : the no. of features in the data set.
x0 : 1 (for vector multiplication).
Notice that this is the dot product between θ and the x values. So, for convenience in solving, we can write it as:
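h_θ(x) = θᵀx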

The motive in Linear Regression is to minimize the cost function :
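J(θ) = (1/2m) · Σ_{i=1}^{m} ( h_θ(x_i) − y_i )²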

where,
xi : the input value of the ith training example.
m : no. of training instances
n : no. of data-set features
yi : the expected result of ith instance
14/05/2023 21
Let us represent the cost function in vector form.

We have ignored 1/2m here as it will not make any difference in the working. It was used for
mathematical convenience while calculating gradient descent, but it is no longer needed here.

xij : value of the jth feature in the ith training example.


14/05/2023 22
But each residual value is squared. We cannot simply square the above expression, as the square of a
vector/matrix is not equal to the square of each of its values. So, to get the squared value, we multiply the
vector/matrix by its transpose. The final equation derived is
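(Xθ − y)ᵀ (Xθ − y)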

Therefore, the cost function is
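J(θ) = (Xθ − y)ᵀ (Xθ − y)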

So, now getting the value of θ using derivative
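∂J(θ)/∂θ = 2·Xᵀ(Xθ − y) = 0   ⇒   XᵀXθ = Xᵀy   ⇒   θ = (XᵀX)⁻¹ Xᵀ y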

14/05/2023 23
So, this is the finally derived Normal Equation with θ giving the minimum cost value.

14/05/2023 24
Difference between Gradient Descent and Normal Equation.

14/05/2023 25
Batch Gradient Descent
Batch Gradient Descent: Batch Gradient Descent involves calculations over the full training set at
each step as a result of which it is very slow on very large training data. Thus, it becomes very
computationally expensive to do Batch GD.

However, this is great for convex or relatively smooth error manifolds. Also, Batch GD scales well
with the number of features.

14/05/2023 26
Stochastic Gradient Descent
Stochastic Gradient Descent: SGD tries to solve the main problem in Batch Gradient Descent, which is the use of the
whole training data to calculate the gradients at each step.

SGD is stochastic in nature, i.e., it picks a "random" instance of the training data at each step and then computes the
gradient, making it much faster since there is far less data to manipulate at a single time, unlike Batch GD.

There is a downside to the stochastic nature of SGD: once it reaches close to the minimum value, it does not settle
down and instead bounces around, which gives us a good value for the model parameters but not the optimal one. This
can be solved by reducing the learning rate at each step, which reduces the bouncing, and SGD might settle down at the
global minimum after some time.

14/05/2023 27
Difference between Batch Gradient Descent and Stochastic Gradient Descent

14/05/2023 28
Multiple Linear Regression (MLR)

It is the extension of simple linear regression that predicts a response using two or more features.
Mathematically, we can explain it as follows.
Consider a dataset having n observations, p features (i.e. independent variables) and y as one response (i.e.
dependent variable). The regression line for p features can be calculated as follows −
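h(x_i) = b_0 + b_1·x_i1 + b_2·x_i2 + ⋯ + b_p·x_ip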

Here, h(xi) is the predicted response value and b0,b1,b2…,bp are the regression coefficients.
Multiple Linear Regression models always include the errors in the data, known as residual error, which
changes the calculation as follows −
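y_i = b_0 + b_1·x_i1 + b_2·x_i2 + ⋯ + b_p·x_ip + e_i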

We can also write the above equation as follows −
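y = Xb + e   (in matrix form, where X includes a leading column of ones)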

14/05/2023 29
Logistic Regression
Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary or
discrete format such as 0 or 1.

Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True
or False, Spam or not spam, etc.

Logistic Regression is quite similar to Linear Regression, except in how they are used:
Linear Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.

It is a predictive analysis algorithm which works on the concept of probability.

The sigmoid function is used to model the data in logistic regression. The function can be
represented as:
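g(z) = 1 / (1 + e^(−z))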

14/05/2023 30
Logistic Function (Sigmoid Function)
•The sigmoid function is a mathematical function used to
map the predicted values to probabilities.

•It maps any real value into another value within a range of
0 and 1.

•The value of the logistic regression must be between 0 and 1, and cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic
function.

•In logistic regression, we use the concept of a threshold value, which decides between 0 and 1:
values above the threshold value tend towards 1, and values below the threshold value tend
towards 0.
14/05/2023 31
Binary Logistic Regression model

The simplest form of logistic regression is binary or binomial logistic regression in which the target or
dependent variable can have only 2 possible types either 1 or 0.

It allows us to model a relationship between multiple predictor variables and a binary/binomial target
variable.

In case of logistic regression, the linear function is basically used as an input to another function such
as 𝑔 in the following relation −
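h_θ(x) = g(θᵀx),   with 0 ≤ h_θ(x) ≤ 1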

Here, 𝑔 is the logistic or sigmoid function which can be given as follows −
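g(z) = 1 / (1 + e^(−z)),   where z = θᵀx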

14/05/2023 32
Binary Logistic Regression model

The sigmoid curve can be represented with the help of the following graph. We can see that the values on
the y-axis lie between 0 and 1, and the curve crosses the axis at 0.5.

The classes can be divided into positive or negative. The output comes under the probability of positive
class if it lies between 0 and 1. For our implementation, we are interpreting the output of hypothesis
function as positive if it is ≥0.5, otherwise negative.

14/05/2023 33
Binary Logistic Regression model
We also need to define a loss function to measure how well the algorithm performs using the weights
on functions, represented by theta as follows :
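J(θ) = −(1/m) · Σ_{i=1}^{m} [ y_i · log(h_θ(x_i)) + (1 − y_i) · log(1 − h_θ(x_i)) ]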

Now, after defining the loss function our prime goal is to minimize the loss function. It can be done with
the help of fitting the weights which means by increasing or decreasing the weights.

With the help of derivatives of the loss function w.r.t each weight, we would be able to know what
parameters should have high weight and what should have smaller weight.

The following gradient descent equation tells us how loss would change if we modified the parameters −
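θ_j := θ_j − (α/m) · Σ_{i=1}^{m} ( h_θ(x_i) − y_i ) · x_ij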

14/05/2023 34
Generally, logistic regression means binary logistic regression having binary target variables, but
there can be two more categories of target variables that can be predicted by it.

Logistic regression becomes a classification technique only when a decision threshold is brought into
the picture. The setting of the threshold value is a very important aspect of Logistic regression and is
dependent on the classification problem itself.

Based on the number of categories, Logistic regression can be classified as:

1.Binomial: target variable can have only 2 possible types: “0” or “1” which may represent “win” vs
“loss”, “pass” vs “fail”, “dead” vs “alive”, etc.

2.Multinomial: target variable can have 3 or more possible types which are not ordered (i.e. the types
have no quantitative significance), like "disease A" vs "disease B" vs "disease C".

3.Ordinal: it deals with target variables with ordered categories. For example, a test score can be
categorized as: "very poor", "poor", "good", "very good". Here, each category can be given a score
like 0, 1, 2, 3.
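
As a brief, hedged illustration of binomial logistic regression in practice, here is a scikit-learn sketch on a built-in binary dataset (the dataset choice and settings are illustrative, not part of the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy binary classification problem (malignant vs. benign)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # larger max_iter helps convergence here
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]   # P(class = 1), the sigmoid output
labels = (probs >= 0.5).astype(int)       # apply the 0.5 threshold described above
print(clf.score(X_test, y_test))
```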
14/05/2023 35
Optimization techniques for Gradient Descent

Gradient Descent is an iterative Optimization algorithm, used to find the minimum value for a function.

The general idea is to initialize the parameters to random values, and then take small steps in the direction of
the “slope” at each iteration.

Gradient descent is highly used in supervised learning to minimize the error function and find the optimal
values for the parameters.

14/05/2023 36
Momentum method: This method is used to accelerate the gradient descent algorithm by taking into
consideration the exponentially weighted average of the gradients. Using averages makes the algorithm
converge towards the minima in a faster way, as the gradients towards the uncommon directions are
canceled out. The pseudocode for momentum method is given below.
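
One common formulation, for weights W with gradient dW, is:

on each iteration:
    V := β·V + (1 − β)·dW
    W := W − α·V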

V and dW are analogous to velocity and acceleration respectively. α is the learning rate, and β is
normally kept at 0.9.

14/05/2023 37
RMSprop: RMSprop was proposed by the University of Toronto's Geoffrey Hinton. The intuition is to
apply an exponentially weighted average method to the second moment of the gradients (dW²). The
pseudocode for this is as follows:
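
One common formulation is:

on each iteration:
    S := β·S + (1 − β)·dW²
    W := W − α · dW / (√S + ε)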

14/05/2023 38
Adam Optimization: Adam optimization algorithm incorporates the momentum method and
RMSprop, along with bias correction. The pseudocode for this approach is as follows,
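
One common formulation, combining the two running averages with bias correction at iteration t, is:

on each iteration t:
    V := β₁·V + (1 − β₁)·dW          S := β₂·S + (1 − β₂)·dW²
    V̂ := V / (1 − β₁^t)              Ŝ := S / (1 − β₂^t)
    W := W − α · V̂ / (√Ŝ + ε)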

Kingma and Ba, the proposers of Adam, recommended


the following values for the hyperparameters.
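α = 0.001, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.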

14/05/2023 39
K-Nearest Neighbor (KNN) Algorithm
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
• K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new
case into the category that is most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on similarity. This
means that when new data appears, it can be easily classified into a well-suited category by using the K-NN
algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies
that data into the category that is most similar to the new data.
14/05/2023 40
K-Nearest Neighbor(KNN) Algorithm
• Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want
to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it
works on a similarity measure.
• Our KNN model will find the similar features of the new data set to the cats and dogs images and
based on the most similar features it will put it in either cat or dog category.

14/05/2023 41
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1;
in which of these categories will this data point lie?
To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point. Consider the below diagram:

14/05/2023 42
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these k neighbors, count the number of the data points in each category.
• Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
• Step-6: Our model is ready
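
A minimal scikit-learn sketch of these steps, using the built-in Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1-2: choose K and use Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                # "training" only stores the data (lazy learner)

# Steps 3-5: each test point gets the majority class among its 5 nearest neighbours
print(knn.predict(X_test[:5]))
print(knn.score(X_test, y_test))
```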

14/05/2023 43
How does K-NN work?
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:

Firstly, we will choose the number of neighbors, so we will choose k=5.

Next, we will calculate the Euclidean distance between the data points.

14/05/2023 44
How does K-NN work?
By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image.

As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to
category A.
How to select the value of K in the K-NN Algorithm?
• There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
• A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the
model.
• Large values for K are good, but a very large value may include points from other categories and increase the computation.
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
• Always needs to determine the value of K, which may be complex at times.
• The computation cost is high because of calculating the distance between the data points for all the
training samples.

14/05/2023 46
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to each other and dissimilar to the data points in other
groups.
It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.

14/05/2023 47
Clustering
We can see the different fruits are divided into several groups with similar properties.

14/05/2023 48
Why Clustering ?
• Clustering is very much important as it determines the intrinsic grouping among the unlabeled
data present.

• There are no universal criteria for a good clustering; it depends on the user and on what criteria
satisfy their need.

• For instance, we could be interested in finding representatives for homogeneous groups (data
reduction), in finding “natural clusters” and describe their unknown properties (“natural” data
types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data
objects (outlier detection).

• A clustering algorithm must make some assumptions about what constitutes the similarity of points, and
each assumption makes for different, equally valid clusters.

14/05/2023 49
Clustering
• The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
• Apart from these general usages, it is used by Amazon in its recommendation system to
provide recommendations based on a user's past product searches.
• Netflix also uses this technique to recommend movies and web series to its users based on their
watch history.

14/05/2023 50
Types of Clustering

Hard Clustering:
• In hard clustering, each data point is clustered or grouped to any
one cluster.
• For each data point, it may either completely belong to a cluster
or not.
• K-Means Clustering is a hard clustering algorithm. It clusters
data points into k-clusters.

14/05/2023 51
Types of Clustering
Soft Clustering:
• In soft clustering, instead of putting each data point into a single cluster, a probability of that
point belonging to each cluster is assigned.
• In soft clustering or fuzzy clustering, each data point can belong to multiple clusters along with its
probability score or likelihood.
• One of the widely used soft clustering algorithms is the Fuzzy C-means clustering (FCM)
Algorithm.

14/05/2023 52
K-Means Algorithm
• K-Means algorithm is a centroid-based clustering technique.
• This technique partitions the dataset into k different clusters.
• Each cluster in the k-means clustering algorithm is represented by a centroid point.
• What is a centroid point?
The centroid point is the point that represents its cluster. It is the average of all the points in the
cluster; it changes at each step and is computed by:
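v_j = ( 1 / |C_j| ) · Σ_{x_i ∈ C_j} x_i   (the mean of the points currently assigned to cluster j)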

The idea of the K-Means algorithm is to find k centroid points, and every point in the dataset will
belong to whichever of the k sets has the minimum Euclidean distance to its centroid.

14/05/2023 53
K-Means Algorithm
1. Specify the desired number of clusters K: let us choose k=2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: let's assign three points to cluster 1, shown using
red color, and two points to cluster 2, shown using grey color.

14/05/2023 54
K-Means Algorithm
3. Compute cluster centroids: the centroid of the data points in the red cluster is shown using a red
cross, and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid: note that only the data point at the bottom was
assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we
re-assign that data point to the grey cluster.

14/05/2023 55
K-Means Algorithm
5. Re-compute cluster centroids : Now, re-computing the centroids for both the clusters.

6. Repeat steps 4 and 5 until no improvements are possible: similarly, we'll repeat the 4th and 5th
steps until we reach the global optimum.

When there is no further switching of data points between the two clusters for two successive repeats,
it marks the termination of the algorithm, if a stopping criterion is not explicitly mentioned.
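
A small scikit-learn sketch of the same procedure; the five 2-D points are illustrative stand-ins for the walk-through above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Five illustrative 2-D points, mirroring the k=2 walk-through above
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # iterates: assign points, recompute centroids, repeat

print(labels)                        # cluster index of each point
print(kmeans.cluster_centers_)       # final centroids
```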
Fuzzy C-Means Clustering
• Fuzzy C-Means clustering is a soft clustering approach, where each data point is assigned a
likelihood or probability score to belong to that cluster.
• This algorithm works by assigning membership to each data point corresponding to each cluster
center on the basis of distance between the cluster center and the data point.
• The nearer the data is to a cluster center, the greater is its membership towards that particular
cluster center.
• Clearly, the summation of the memberships of each data point should be equal to one.
• After each iteration membership and cluster centers are updated according to the formula:
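µ_ij = 1 / Σ_{k=1}^{c} ( d_ij / d_ik )^( 2/(m−1) )

v_j = ( Σ_{i=1}^{n} (µ_ij)^m · x_i ) / ( Σ_{i=1}^{n} (µ_ij)^m ),   j = 1, …, c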
where,
'n' is the number of data points.
'vj' represents the jth cluster center.
'm' is the fuzziness index, m ∈ [1, ∞].
'c' represents the number of cluster center.
'µij' represents the membership of the ith data point to the jth cluster center.
'dij' represents the Euclidean distance between the ith data point and the jth cluster center.
Fuzzy C-Means Clustering
• Main objective of fuzzy c-means algorithm is to minimize:
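J(U, V) = Σ_{i=1}^{n} Σ_{j=1}^{c} (µ_ij)^m · ||x_i − v_j||²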

where,
'||xi – vj||' is the Euclidean distance between ith data and jth cluster
center.

14/05/2023 58
Fuzzy C-Means Clustering
Advantages
1) Gives the best result for overlapped data sets and is comparatively better than the k-means algorithm.
2) Unlike k-means, where a data point must exclusively belong to one cluster center, here a data point is
assigned a membership to each cluster center, as a result of which a data point may belong to more than
one cluster center.
Disadvantages
1) Apriori specification of the number of clusters.
2) With a lower value of β we get a better result, but at the expense of a larger number of iterations.
3) Euclidean distance measures can unequally weight underlying factors.

14/05/2023 59
Algorithmic steps for Fuzzy c-means clustering
Let X = {x1, x2, x3 ..., xn} be the set of data points and V = {v1, v2, v3 ..., vc} be the set of centers.
1) Randomly select ‘c’ cluster centers.
2) Calculate the fuzzy membership 'µij' using:

3) Compute the fuzzy centers 'vj' using:

14/05/2023 60
Algorithmic steps for Fuzzy c-means clustering

4) Repeat step 2) and 3) until the minimum 'J' value is achieved or ||U(k+1) - U(k)|| < β.
where,
‘k’ is the iteration step.
‘β’ is the termination criterion between [0, 1].
‘U = (µij)n*c’ is the fuzzy membership matrix.
‘J’ is the objective function.

14/05/2023 61
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
This algorithm starts with all the data points assigned to a cluster of their own.
Then the two nearest clusters are merged into the same cluster. In the end, this algorithm terminates
when there is only a single cluster left.

14/05/2023 62
Hierarchical Clustering
At the bottom, we start with 25 data points, each assigned to
separate clusters.

Two closest clusters are then merged till we have just one
cluster at the top.

The height in the dendrogram at which two clusters are merged represents the distance between the two
clusters in the data space.

The decision of the no. of clusters that can best depict different
groups can be chosen by observing the dendrogram.

The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a
horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the above example, the best choice of the no. of clusters will be 4, as the red horizontal line in
the dendrogram below covers the maximum vertical distance AB.
14/05/2023 64
Hierarchical Clustering
Two important things that you should know about hierarchical clustering are:

•This algorithm has been implemented above using the bottom-up approach. It is also possible to follow a
top-down approach, starting with all data points assigned to the same cluster and recursively performing
splits till each data point is assigned a separate cluster.

•The decision of merging two clusters is taken on the basis of closeness of these clusters. There are
multiple metrics for deciding the closeness of two clusters :

• Euclidean distance: ‖a − b‖₂ = √( Σᵢ (aᵢ − bᵢ)² )
• Squared Euclidean distance: ‖a − b‖₂² = Σᵢ (aᵢ − bᵢ)²
• Manhattan distance: ‖a − b‖₁ = Σᵢ |aᵢ − bᵢ|
• Maximum distance: ‖a − b‖∞ = maxᵢ |aᵢ − bᵢ|
• Mahalanobis distance: √( (a − b)ᵀ S⁻¹ (a − b) ), where S is the covariance matrix
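
A short SciPy sketch of bottom-up (agglomerative) clustering; the points and the choice of Ward linkage are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points
X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 5]], dtype=float)

# Bottom-up clustering using Euclidean distances between merged clusters
Z = linkage(X, method='ward')   # Z encodes the dendrogram merges and their heights

# Cut the dendrogram so that 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```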
14/05/2023 65
Difference between K Means and Hierarchical clustering

• Hierarchical clustering can't handle big data well, but K Means clustering can. This is because the
time complexity of K Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e.
O(n²).
• In K Means clustering, since we start with random choice of clusters, the results produced by
running the algorithm multiple times might differ. While results are reproducible in Hierarchical
clustering.
• K Means is found to work well when the shape of the clusters is hyper spherical (like circle in 2D,
sphere in 3D).
• K Means clustering requires prior knowledge of K, i.e. the no. of clusters you want to divide your data
into. But you can stop at whatever number of clusters you find appropriate in hierarchical clustering by
interpreting the dendrogram.

14/05/2023 66
Naive Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors.

In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, all
of these properties independently contribute to the probability that this fruit is an apple and that is
why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets.

Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification
methods.
14/05/2023 67
Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c).
Look at the equation below:
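P(c|x) = ( P(x|c) · P(c) ) / P(x)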

Above,
•P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
•P(c) is the prior probability of class.
•P(x|c) is the likelihood which is the probability of predictor given class.
•P(x) is the prior probability of predictor.
 
14/05/2023 68
How Naive Bayes algorithm works?
Step 1: Convert the data set into a frequency table
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of prediction.
14/05/2023 69
How Naive Bayes algorithm works?
Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using above discussed method of posterior probability.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.

Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes.

This algorithm is mostly used in text classification and with problems having multiple classes.
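
The "Sunny" calculation above can be reproduced with a few lines of plain Python (the numbers are taken directly from the example):

```python
# Reproducing the posterior-probability calculation for P(Yes | Sunny)
p_sunny_given_yes = 3 / 9     # P(Sunny | Yes)
p_yes = 9 / 14                # P(Yes)
p_sunny = 5 / 14              # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> players are likely to play when it is sunny
```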

14/05/2023 70
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis.
14/05/2023 71
Types of Naïve Bayes Model
There are three types of Naive Bayes Model, which are given below:

•Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.

•Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular document
belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.

•Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present or not in a
document. This model is also famous for document classification tasks.

14/05/2023 72
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine.

14/05/2023 73
Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which means if there
are 2 features, then the hyperplane will be a straight line. And if there are 3 features, then the
hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance
between the hyperplane and the nearest data points.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the
hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called
support vectors.
14/05/2023 74
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:

14/05/2023 75
Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be
classified into two classes by using a single straight line, then such data is termed as linearly
separable data, and classifier is used called as Linear SVM classifier.

Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if a
dataset cannot be classified by using a straight line, then such data is termed as non-linear data
and classifier used is called as Non-linear SVM classifier.

14/05/2023 76
Linear SVM: The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.

We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider
the below image:

As it is a 2-d space, by just using a straight line we can easily separate these two classes. But there
can be multiple lines that can separate these classes.
14/05/2023 77
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane.

The SVM algorithm finds the closest points of the lines from both the classes. These points are called
support vectors.

The distance between the vectors and the hyperplane is called the margin. And the goal of SVM is to
maximize this margin.

The hyperplane with maximum margin is called the optimal hyperplane.

14/05/2023 78
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated
as:

z = x² + y²

By adding the third dimension, the sample space will become as the below image:

14/05/2023 79
Now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, it looks like a plane parallel to the x-axis.

If we convert it into 2-d space with z=1, then it will become as below:

Hence we get a circumference of radius 1 in the case of non-linear data.

14/05/2023 80
SVM Kernels
In practice, SVM algorithm is implemented with kernel that transforms an input data space into
the required form.

SVM uses a technique called the kernel trick in which kernel takes a low dimensional input space
and transforms it into a higher dimensional space.

In simple words, kernel converts non-separable problems into separable problems by adding more
dimensions to it. It makes SVM more powerful, flexible and accurate. The following are some of
the types of kernels used by SVM.

Linear Kernel
It can be used as a dot product between any two observations. The formula of linear kernel is as
below
K(x,xi)=sum(x∗xi)
From the above formula, we can see that the product between two vectors say 𝑥 & 𝑥𝑖 is the sum of
the multiplication of each pair of input values.
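
A brief scikit-learn sketch contrasting a linear kernel with the RBF kernel on a toy non-linear dataset (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)   # single straight-line boundary
rbf_svm = SVC(kernel='rbf', C=1.0).fit(X, y)         # kernel trick: implicit higher-dimensional space

print(linear_svm.score(X, y))           # typically lower on this data
print(rbf_svm.score(X, y))              # typically higher
print(rbf_svm.support_vectors_.shape)   # the points that define the margin
```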
14/05/2023 81
Pros and Cons associated with SVM
Pros:
• It works really well with a clear margin of separation
• It is effective in high dimensional spaces.
• It is effective in cases where the number of dimensions is greater than the number of samples.
• It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.

Cons:
• It doesn’t perform well when we have large data set because the required training time is higher
• It also doesn’t perform very well, when the data set has more noise i.e. target classes are
overlapping
• SVM doesn’t directly provide probability estimates, these are calculated using an expensive five-
fold cross-validation. It is included in the related SVC method of Python scikit-learn library.

14/05/2023 82
Decision Tree
•Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems.

•It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.

•In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are
used to make any decision and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.

•The decisions or the test are performed on the basis of features of the given dataset.

•It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.

•It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
14/05/2023 83
Decision Tree

14/05/2023 84
Decision Tree Terminologies
•Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.

•Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

•Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

•Branch/Sub Tree: A tree formed by splitting the tree.

•Pruning: Pruning is the process of removing the unwanted branches from the tree.

•Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

14/05/2023 85
How does the Decision Tree algorithm Work?
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in step -3.
Continue this process until a stage is reached where you cannot further classify the nodes, and call the final
node a leaf node.

14/05/2023 86
Example: Suppose there is a candidate who has a job
offer and wants to decide whether he should accept the
offer or Not.

So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM).

The root node splits further into the next decision node
(distance from the office) and one leaf node based on the
corresponding labels.

The next decision node further gets split into one decision
node (Cab facility) and one leaf node.

Finally, the decision node splits into two leaf nodes


14/05/2023 87
Example: Suppose we have a sample of 14 patient data set and we have to predict which drug to
suggest to the patient A or B.

14/05/2023 88
Let's say we pick cholesterol as the first attribute to split the data.

Let's suppose our new patient has high cholesterol; by the above split of our data we cannot say
whether Drug B or Drug A will be suitable for the patient.

Also, if the patient's cholesterol is normal, we still do not have the information to determine whether
Drug A or Drug B is suitable for the patient.

14/05/2023 89
Let us take another attribute, Age. As we can see, age has three categories in it: young, middle age
and senior. Let's try to split on it.

14/05/2023 90
Assumptions that we make while using the
Decision tree:
– In the beginning, we consider the whole training set as the root.

-Feature values are preferred to be categorical; if the values are continuous, they are converted to
discrete values before building the model.

-Based on attribute values records are distributed recursively.

-We use a statistical method for ordering attributes as a root node or the internal node.

14/05/2023 91
Mathematics behind Decision tree algorithm:

Before going to the Information Gain first we have to understand entropy.

Entropy: Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples.
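For a set S containing positive and negative examples in proportions p₊ and p₋, it is computed as:

E(S) = −p₊·log₂(p₊) − p₋·log₂(p₋)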

Purpose of Entropy:
Entropy controls how a Decision Tree decides to split the data. It affects how a Decision Tree draws
its boundaries.
"Entropy values range from 0 to 1"; the lower the value of entropy, the more trustworthy (purer) the split.

14/05/2023 92
Suppose we have features F1, F2 and F3, and we selected the F1 feature as our root node. F1 contains 9
yes labels and 5 no labels in it; after splitting on F1 we get F2, which has 6 yes / 2 no, and F3, which
has 3 yes / 3 no.
Now let us try to calculate the entropy of F2 by using the entropy formula.
Putting the values in the formula:

Here, 6 is the number of yes labels (taken as positive, since we are calculating a probability), divided
by 8, the total number of rows present in F2.
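Plugging in 6 yes and 2 no out of 8 samples:

E(F2) = −(6/8)·log₂(6/8) − (2/8)·log₂(2/8) ≈ 0.811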
Similarly, if we compute the entropy for F3 we will get 1 bit, the worst case for an attribute, since it
contains 50% yes and 50% no.
This splitting will go on until we get a pure subset.
14/05/2023 93
What is a Pure subset:
The pure subset is a situation where we will get either all yes or all no in this case.

We have performed this for one node. What if, after splitting F2, we also require some other attribute to
reach the leaf node? We then also have to take the entropy of those values and add them up; for that
summation of all the entropy values we have the concept of information gain.
Information Gain: Information gain is used to decide which feature to split on at each step in building
the tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose
the split that results in the purest daughter nodes. A commonly used measure of purity is called
information.

For each node of the tree, the information value measures how much information a feature gives us
about the class. The split with the highest information gain will be taken as the first split and the process
will continue until all children nodes are pure, or until the information gain is 0.

14/05/2023 94
The algorithm calculates the information gain for each split and the split which is giving the highest value
of information gain is selected.

We can say that in Information gain we are going to compute the average of all the entropy-based on the
specific split.

Sv = the total sample after the split; for example, F2 has 6 yes and 2 no, i.e. 8 samples.

S = the total sample; for F1, S = 9 + 5 = 14.
Now calculating the Information Gain:
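In general, Gain(S, F) = E(S) − Σ_v ( |S_v| / |S| ) · E(S_v), where the sum runs over the subsets produced by the split. Using the numbers above (F1 with 9 yes / 5 no splitting into F2 with 6 yes / 2 no and F3 with 3 yes / 3 no):

E(F1) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940
Gain = 0.940 − (8/14)·0.811 − (6/14)·1.0 ≈ 0.048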

Like this, the algorithm will perform this for n number of splits, and the information gain for whichever
split is higher it is going to take it in order to construct the decision tree.

The higher the value of information gain of the split the higher the chance of it getting selected for the
particular split.
14/05/2023 95
Gini Impurity:
Gini Impurity is a measurement used to build Decision Trees to determine how the features of a data set
should split nodes to form the tree. More precisely, the Gini Impurity of a data set is a number between 0
and 0.5 which indicates the likelihood of new, random data being misclassified if it were given a random
class label according to the class distribution in the data set.
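For class proportions p_i, it is computed as:

Gini = 1 − Σ_i p_i²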
 
Entropy vs Gini Impurity
The maximum value for entropy is 1 whereas the maximum value for Gini impurity is 0.5.
As the Gini Impurity does not involve any logarithmic function in its calculation, it takes less computational
time compared to entropy.
 

14/05/2023 96
Advantages of the Decision Tree
•It is simple to understand as it follows the same process which a human follows while making any
decision in real life.
•It can be very useful for solving decision-related problems.
•It helps to think about all the possible outcomes for a problem.
•There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


•The decision tree contains lots of layers, which makes it complex.
•It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
•For more class labels, the computational complexity of the decision tree may increase.

14/05/2023 97
Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.

It can be used for both Classification and Regression problems in ML. It is based on the concept
of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." 

Instead of relying on one decision tree, the random forest takes the prediction from each tree and based on
the majority votes of predictions, and it predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

14/05/2023 98
Random Forest
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
• There should be some actual values in the feature variable of the dataset so that the classifier can
predict accurate results rather than a guessed result.
• The predictions from each tree must have very low correlations

Why use Random Forest?


Below are some points that explain why we should use the Random Forest algorithm:
• It takes less training time as compared to other algorithms.
• It predicts output with high accuracy, even for the large dataset it runs efficiently.
• It can also maintain accuracy when a large proportion of data is missing.

14/05/2023 99
How does Random Forest algorithm work?

Random Forest works in two-phase first is to create the random forest by combining N decision
tree, and second is to make predictions for each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.


Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes
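
A minimal scikit-learn sketch of these steps (the dataset and settings are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 trees, each trained on a bootstrap sample, with random feature selection at each split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(X_train, y_train)

print(forest.predict(X_test[:5]))   # majority vote across the 100 trees
print(forest.score(X_test, y_test))
```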

14/05/2023 100
So let’s understand how the algorithm works. From a given data set, multiple bootstrap samples are
created and the number of bootstrap samples would depend on the number of models we want to train.
Suppose I want to build 10 models here then I’ll create 10 bootstrap samples-

Now on each of these bootstrap samples, we build a decision tree model. So here, as you can see, we
have 10 decision tree models, each built on a different subset of the data.

Now each of these individual decision trees would generate a prediction, which is then finally
combined to get the final output or the final prediction-
14/05/2023 101
So effectively we’re combining multiple trees to get the final output. And hence it’s called the forest.

But why is it called Random Forest? You might say it's because we use random bootstrap samples.
Well, that's partially correct: along with random sampling of data points or rows, the random forest also
performs random sampling of features.

Note that the random sampling of rows is done at the tree level, so every tree will get a different subset of
data points. Feature sampling is done at the node level, or the split level, and not at the tree level.
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree.

During the training phase, each decision tree produces a prediction result, and when a new data point
occurs, then based on the majority of results, the Random Forest classifier predicts the final decision.
Consider the below image:

14/05/2023 103
Applications of Random Forest
There are mainly four sectors where Random forest mostly used:
1.Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2.Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3.Land Use: We can identify the areas of similar land use by this algorithm.
4.Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


•Although random forest can be used for both classification and regression tasks, it is not as well suited for
regression tasks.

14/05/2023 104
Bagging vs Boosting in Machine Learning
Ensemble learning helps improve machine learning results by combining several models.

This approach allows the production of better predictive performance compared to a single model.

Basic idea is to learn a set of classifiers (experts) and to allow them to vote. Bagging and Boosting are
two types of Ensemble Learning.

These two decrease the variance of a single estimate as they combine several estimates from different
models. So the result may be a model with higher stability.

Let’s understand these two terms in a glimpse.

Bagging: It is a homogeneous weak learners' model in which the learners are trained independently of each
other in parallel and then combined to determine the model average.

Boosting: It is also a homogeneous weak learners’ model but works differently from Bagging. In this
model, learners learn sequentially and adaptively to improve model predictions of a learning algorithm.
14/05/2023 105
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to
improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression.

It decreases the variance and helps to avoid overfitting.

It is usually applied to decision tree methods.

Bagging is a special case of the model averaging approach. 

Example of Bagging

The Random Forest model uses Bagging, where decision tree models with higher variance are present.
It makes random feature selection to grow trees. Several random trees make a Random Forest.

14/05/2023 106
Implementation Steps of Bagging
• Step 1: Multiple subsets of equal size are created from the original data set by selecting observations with replacement.
• Step 2: A base model is created on each of these subsets.
• Step 3: Each model is learned in parallel from its own training set, independently of the others.
• Step 4: The final prediction is determined by combining the predictions from all the models (a code sketch follows this list).
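
A minimal scikit-learn sketch of these steps (the dataset and parameter values are illustrative assumptions; the default base model of BaggingClassifier is a decision tree):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-3: n_estimators bootstrap subsets, one base model per subset,
# each trained independently (n_jobs=-1 trains them in parallel).
bag = BaggingClassifier(n_estimators=50, bootstrap=True,
                        n_jobs=-1, random_state=0)
bag.fit(X_train, y_train)

# Step 4: the predictions of all base models are combined by majority vote.
print("Test accuracy:", bag.score(X_test, y_test))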

14/05/2023 107
Boosting is an ensemble modeling technique that attempts to build a strong classifier from the number of
weak classifiers.

It is done by building a model by using weak models in series.

Firstly, a model is built from the training data. Then a second model is built which tries to correct the errors made by the first model.

This procedure is continued, and models are added, until either the complete training data set is predicted correctly or the maximum number of models has been added.

AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that
combines multiple “weak classifiers” into a single “strong classifier”.

14/05/2023 108
Implementation Steps of Boosting
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points.
4. If the required results have been obtained, go to step 5; otherwise, go to step 2.
5. End (an AdaBoost code sketch follows this list).
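
A minimal scikit-learn sketch of AdaBoost, which follows this re-weighting scheme (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost trains weak classifiers (shallow decision trees by default) sequentially,
# increasing the weight of misclassified points before each new round, and combines
# them into a single strong classifier.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))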

14/05/2023 109
Gradient Boosting is a popular boosting algorithm. In gradient boosting, each predictor corrects its
predecessor’s error.

In contrast to AdaBoost, the weights of the training instances are not adjusted; instead, each predictor is trained using the residual errors of its predecessor as labels.

When the base learner is CART (Classification and Regression Trees), this technique is called Gradient Boosted Trees.

14/05/2023 110
The ensemble consists of N trees.

Tree1 is trained using the feature matrix X and the labels y. The predictions labelled y1(hat) are used to
determine the training set residual errors r1.

Tree2 is then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The predicted
results r1(hat) are then used to determine the residual r2.

The process is repeated until all the N trees forming the ensemble are trained.

Shrinkage refers to the fact that the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1.

There is a trade-off between eta and the number of estimators: decreasing the learning rate must be compensated by increasing the number of estimators in order to reach a given level of model performance.

Once all the trees are trained, predictions can be made:

y(pred) = y1(hat) + eta * r1(hat) + eta * r2(hat) + ... + eta * rN(hat)
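
A from-scratch sketch of this residual-fitting loop, using regression trees as base learners (all names and parameter values are illustrative assumptions, not the slides' own implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosted_trees(X, y, n_trees=100, eta=0.1, max_depth=3):
    # Tree1 is fit on the original labels y; its prediction y1(hat) is the starting point.
    first = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
    pred = first.predict(X)
    trees = []
    for _ in range(n_trees - 1):
        residual = y - pred                      # residual errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred = pred + eta * tree.predict(X)      # shrink each correction by eta
    return first, trees

def predict_gradient_boosted_trees(first, trees, X, eta=0.1):
    # y(pred) = y1(hat) + eta * r1(hat) + ... + eta * rN(hat)
    pred = first.predict(X)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred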


14/05/2023 111
Similarities Between Bagging and Boosting
Bagging and Boosting, both commonly used methods, share the fundamental similarity of being ensemble methods. The main similarities between them are listed below.

1. Both are ensemble methods that obtain N learners from 1 base learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or by taking the majority of them, i.e. majority voting).
4. Both are good at reducing variance and provide higher stability.

14/05/2023 112
Differences Between Bagging and Boosting

14/05/2023 113
Generalisation
How well a model trained on the training set predicts the right output for new instances is called
generalization.

Generalization refers to how well the concepts learned by a machine learning model apply to specific
examples not seen by the model when it was learning.

The goal of a good machine learning model is to generalize well from the training data to any data
from the problem domain.

This allows us to make predictions in the future on data the model has never seen. Overfitting and
underfitting are the two biggest causes for poor performance of machine learning algorithms.

The model with the best generalisation should be selected; this is said to be the case when both of these problems are avoided.

14/05/2023 114
Generalisation
• Underfitting: the production of a machine learning model that is not complex enough to accurately capture the relationships between a dataset's features and a target variable.

• Overfitting: the production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
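
A minimal sketch illustrating both failure modes with polynomial regression (the dataset and polynomial degrees are illustrative assumptions): a degree-1 model underfits a curved relationship, while a very high-degree model overfits the noise and scores poorly on unseen data.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):          # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "test R^2:", round(model.score(X_test, y_test), 3))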

14/05/2023 115
Generalisation

14/05/2023 116
Confusion matrix
A confusion matrix is used to describe the performance of a classification model (or “classifier”) on a set
of test data for which the true values are known.

A confusion matrix is a table that categorizes predictions according to whether they match the actual
value.

14/05/2023 117
Two-class datasets
For a two-class dataset, a confusion matrix is a table with two rows and two columns that reports the
number of false positives, false negatives, true positives, and true negatives.

Assume that a classifier is applied to a two-class test dataset for which the true values are known. Let
TP denote the number of true positives in the predicted values, TN the number of true negatives, etc.
Then the confusion matrix of the predicted values can be represented as follows:

                    Predicted positive    Predicted negative
Actual positive            TP                    FN
Actual negative            FP                    TN
14/05/2023 118
Multiclass datasets
Confusion matrices can be constructed for multiclass datasets also.

Example If a classification system has been trained to distinguish between cats, dogs and rabbits, a
confusion matrix will summarize the results of testing the algorithm for further inspection.

Assuming a sample of 27 animals - 8 cats, 6 dogs, and 13 rabbits, the resulting confusion matrix could
look like the table below:

This confusion matrix shows that, for example, of the 8 actual cats, the system predicted that three were
dogs, and of the six dogs, it predicted that one was a rabbit and two were cats.
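
A confusion matrix like this can be computed directly; a minimal scikit-learn sketch with made-up predictions (the label lists below are illustrative and do not reproduce the counts from the example above):

from sklearn.metrics import confusion_matrix

actual    = ["cat", "cat", "dog", "dog", "rabbit", "rabbit", "rabbit"]
predicted = ["cat", "dog", "dog", "cat", "rabbit", "rabbit", "dog"]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(actual, predicted, labels=["cat", "dog", "rabbit"]))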

14/05/2023 119
Precision and Recall
Let a binary classifier classify a collection of test data.

TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives

The precision P is defined as P = TP / (TP + FP).

The recall R is defined as R = TP / (TP + FN).

14/05/2023 120
Other measures of performance
Using the data in the confusion matrix of a two-class classifier, several measures of performance have been defined. A few of them, such as accuracy, specificity and the F1 score, are computed in the sketch below.
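
A minimal sketch computing these measures from raw confusion-matrix counts (the counts themselves are illustrative assumptions):

# Illustrative counts from a two-class confusion matrix
TP, TN, FP, FN = 40, 45, 5, 10

precision   = TP / (TP + FP)
recall      = TP / (TP + FN)                  # also called sensitivity or TPR
specificity = TN / (TN + FP)
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * precision * recall / (precision + recall)

print(precision, recall, specificity, accuracy, f1)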

14/05/2023 121
Receiver Operating Characteristic (ROC)
The acronym ROC stands for Receiver Operating Characteristic, a terminology coming from signal
detection theory. They are now increasingly used in machine learning and data mining research.

Let a binary classifier classify a collection of test data. Let, as before,

TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives

Now we introduce the following terminology:

True Positive Rate: TPR = TP / (TP + FN)
False Positive Rate: FPR = FP / (FP + TN)

14/05/2023 122
ROC space
We plot the values of FPR along the horizontal axis (that is, the x-axis) and the values of TPR along the vertical axis (that is, the y-axis) in a plane.

For each classifier, there is a unique point in this plane with coordinates (FPR, TPR). The ROC space is the part of the plane whose points correspond to (FPR, TPR).

Each prediction result, or each instance of a confusion matrix, represents one point in the ROC space.

The position of the point (FPR, TPR) in the ROC space gives an indication of the performance of the classifier.

14/05/2023 123
ROC curve
In the case of certain classification algorithms, the classifier may depend on a parameter. Different values of the parameter give different classifiers, and these in turn give different values of TPR and FPR. The ROC curve is the curve obtained by plotting in the ROC space the points (FPR, TPR) obtained by assigning all possible values to the parameter in the classifier.

The closer the ROC curve is to the top left corner (0, 1) of the ROC space, the better the accuracy of the classifier. Among the three classifiers A, B and C with ROC curves as shown in the figure, classifier C is closest to the top left corner of the ROC space. Hence, among the three, it gives the best accuracy in predictions.

Area under the ROC curve: The measure of the area under the ROC curve is denoted by the acronym AUC. The value of AUC is a measure of the performance of a classifier. For the perfect classifier, AUC = 1.0.
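
A minimal scikit-learn sketch of computing an ROC curve and its AUC by sweeping the decision threshold of a probabilistic classifier (the dataset and model are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # probability of the positive class

# Each threshold gives one (FPR, TPR) point; together these points form the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))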
14/05/2023 124
