
Machine Learning – Classification
Dr. Satish Kumar
Associate Professor,
Department of Robotics & Automation
Symbiosis Institute of Technology, Pune
What is Machine Learning?
Applications of ML
• Search Engines
• Smart Phones
• Email Programs
• Video Games
• Adaptive Telescopes
• Autonomous Cars
• Robots
• Banks
• Disease Diagnosis
Machine Learning Terminology
• Algorithm: A set of rules and statistical techniques used to learn patterns from data and draw significant information from it.
• Model: A model is trained by using a Machine Learning Algorithm.
• Predictor Variable: A feature (or features) of the data that can be used to predict the output.
• Response Variable: The feature or output variable that needs to be predicted by using the predictor variable(s).
• Training Data: The training data helps the model to identify key trends and patterns essential to predict the output.
• Validation Data: Used to evaluate the model and fine-tune the model hyperparameters.
• Testing Data: Used to evaluate the model once the model is completely trained.
Various Datasets for ML Model
• Train | Validation | Test

[Figure: ML workflow – Data Collection → Data Preparation → Choosing the Machine Learning Algorithm → Training the Model → Evaluating the Model → Prediction]
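One way to carve out the three datasets is with scikit-learn's train_test_split. This is a minimal sketch, not from the slides; the synthetic data and the 60/20/20 proportions are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First hold out 20% as the test set, then 25% of the remainder as validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```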
Classification
Introduction To Classification
• Refers to a predictive modelling problem where a class label is predicted for a given input example
• Examples
  • Given an email, classify it as spam or not-spam.
  • Given a handwritten character, classify it as one of the known characters.
• Classification requires a training dataset
• Label encoding converts class labels into numeric form
• Common performance evaluation metrics
  • Accuracy, ROC curve, …
Classification Problem
• Given a set of n attributes (ordinal or categorical), a set of m classes, and a set of labeled training data
  • (Ai, Bi), …, (Aj, Bj) – labeled dataset
  • A = (V1, V2, …, Vn) – attributes
  • B ∈ {C1, C2, …, Cm} – classes
• Determine a classification rule that predicts the class of any input data from the values of its attributes
Classification Problems in Real Life
• Email spam
• Handwritten digit recognition
• Image segmentation
• Disease diagnosis
• Face recognition
• Fraud detection
Types of Classification
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification
Binary Classification
• Two class labels
• It requires only one classifier model
• A confusion matrix is enough to understand the results
• Examples: disease detection, churn prediction, email spam detection
• Popular algorithms: logistic regression, KNN, decision trees, SVM, Naïve Bayes
Multi-Class Classification
• More than two class labels
• Examples: face recognition, optical character recognition, plant species classification
• Modelled as a discrete probability distribution over a categorical outcome, e.g. k in {1, 2, 3, …, K}
• Popular algorithms: KNN, decision trees, Naïve Bayes, random forest, gradient boosting
• Binary classification algorithms (e.g. logistic regression, SVM) can be used for multi-class classification via:
  • One-vs-Rest
  • One-vs-One
One-vs-Rest: combining multiple binary classifiers for multi-class classification
• Train one binary classifier per class i:
  h_θ^(i)(x) = P(y = i | x; θ),  i = 1, 2, 3
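A brief illustration (assumed setup, not from the slides) of one-vs-rest with scikit-learn, wrapping a binary logistic regression so it handles the 3-class Iris problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)            # 3 classes, so 3 binary classifiers are trained
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))                  # 3: one classifier h^(i) per class
print(ovr.predict(X[:5]))                    # class with the highest per-class score wins
```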
Multi-Label Classification
• One or more class labels may be predicted for each input sample
• Examples: photo classification (a photo may contain several objects)
• Popular algorithms: specialized versions of standard classification algorithms such as multi-label decision trees, multi-label random forests, multi-label gradient boosting
Imbalanced Classification
• Number of samples in each class is unequally distributed
• Examples: fraud detection, outlier detection, etc.
• Undersampling/oversampling can be used to change the composition of samples
• Popular algorithms: cost-sensitive logistic regression, cost-sensitive SVM, etc.
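As a sketch of the cost-sensitive idea (an assumed setup, not from the slides), scikit-learn's class_weight option re-weights the minority class in logistic regression (SVC accepts the same argument):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# class_weight='balanced' penalises mistakes on the rare class more heavily.
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
print(clf.score(X, y))
```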
Classification Algorithms
• Types according to their learning mechanism
• Eager learners
• Eager Learners develop a classification model based on a training dataset before receiving a test dataset.
• Opposite to Lazy learners, Eager Learner takes more time in learning, and less time in prediction.
• Examples: Logistic Regression, Decision Trees, Support Vector Classifier, Artificial Neural Networks,
etc.
• Lazy learners
• A lazy learner first stores the training dataset and waits until it receives the test dataset.
• In the lazy learner's case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.
• Examples: K-Nearest Neighbours (KNN), Case Based Learning (CBL) algorithm, etc.
Linear Classifiers

Why Linear Classifiers
• The underlying machine learning concepts are the same
• The theory (statistics and optimization) is much better understood
• Linear classifiers are still widely used (and very effective when data is scarce)
• Linear classifiers are a component of neural networks

Linear Classifiers and Neural Networks
• Linear classifiers are a component of neural networks
Linear Classifier
• A classification algorithm (classifier) that makes a classification decision based on a linear predictor function combining a set of weights with the feature vector
• y = f(Σᵢ wᵢ xᵢ)
Linear Classifiers
• Perceptron
• Logistic regression
  • Binomial classification
  • Multiclass classification
• Naïve Bayes
• Support vector machine
Perceptron
Perceptron
• Inputs are feature values
• Each feature has a weight
• The weighted sum is the activation
• If the activation is
  • Positive – output 1
  • Negative – output -1
Perceptron…(Contd.)
Learning Weights in Perceptron
• How to learn weights in a perceptron (a runnable sketch follows below):
  • Randomly initialize the weights
  • Until convergence:
    • Calculate f(w, X1) for the first sample X1
    • Suppose the actual label of the sample is Y1 (which can be 1 or 0); calculate Y1 - f(w, X1) as the error e
    • Update the weight array as W = W + e·X1
    • Move to the next sample
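A minimal NumPy sketch of this update rule. It is illustrative only: the fixed epoch count stands in for "until convergence", and the tiny AND-style dataset is an assumed example.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Perceptron learning rule with labels y in {0, 1}."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])          # append a constant feature as the bias
    w = np.random.default_rng(0).normal(size=X.shape[1])  # random initialisation
    for _ in range(epochs):                                # stand-in for "until convergence"
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w > 0 else 0                  # step activation f(w, X)
            w = w + (yi - pred) * xi                       # W = W + e * X, where e = Y - f(w, X)
    return w

# Tiny linearly separable example: class 1 only when both inputs are 1.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))
```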
Logistic Regression
• Logistic regression is one of the most popular Machine Learning algorithms; it comes under the Supervised Learning technique.
• It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc.
• Assumptions for Logistic Regression:
  • The dependent variable must be categorical in nature.
  • The independent variables should not have multi-collinearity.
Logistic Function (Sigmoid Function)

The sigmoid function is a mathematical function used to map the predicted values to probabilities.

It maps any real value to another value within the range of 0 to 1.

The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve. The S-form curve is called the sigmoid function or the logistic function.

In logistic regression, we use the concept of a threshold value, which decides between class 0 and class 1: values above the threshold tend toward 1, and values below the threshold tend toward 0.
Logistic Regression Equation:
• The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:
• We know the equation of the straight line can be written as:
  y = β0 + β1·x1 + β2·x2 + … + βn·xn
• Where β0, β1, β2, … etc. are the regression coefficients.
• So, the regression vector can be written as β = (β0, β1, …, βn)
• Then, in a more compact form, y = β^T x
Logistic Regression Equation:
• Now, if we try to apply linear regression to the above problem, we are likely to get continuous values.
• To get values between 0 and 1, the equation is modified as follows:
  h(x) = g(z), where z = β^T x_i
• Therefore, g(z) = 1 / (1 + e^(-z))
• Finally, g(z) is called the logistic function or the sigmoid function.

Sigmoid function range (a plot of g(z)):
• g(z) tends to 1 as z tends to +∞
• g(z) tends to 0 as z tends to -∞
• g(z) is always bounded between 0 and 1
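A two-line NumPy check of the sigmoid and its limits (illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z -> ~0, z = 0 -> 0.5, large positive z -> ~1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # approx [4.5e-05, 0.5, 0.99995]
```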
Type of Logistic Regression

• On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial:
• In binomial Logistic regression, there can be only two possible types of the dependent variables,
• Ex- 0 or 1, Pass or Fail, etc.
• Multinomial:
• In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent
variable,
• Ex- "cat", "dogs", or "sheep"
• Ordinal:
• In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables,
• Ex - "low", "Medium", or "High".
Evaluation Metrics for Classification

Evaluation Metrics
• Confusion matrix
• Accuracy
• Log-loss
• AUC-ROC
Confusion Matrix
• Confusion Matrix is a performance
measurement for machine learning
classification.
• It is a table with 4 different combinations
of predicted and actual values.
• It is extremely useful for measuring
Recall, Precision, Specificity, Accuracy,
and most importantly AUC-ROC curves.

Source: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
Understanding Confusion Matrix
• True Positive (TP)
  • Interpretation: You predicted positive and it's true.
  • You predicted that a fruit is an Apple, and it is an Apple.
• True Negative (TN)
  • Interpretation: You predicted negative and it's true.
  • You predicted that a fruit is not an Apple, and it is not an Apple.
• False Positive (FP) – (Type-I error)
  • Interpretation: You predicted positive and it's false.
  • You predicted that a fruit is an Apple, but it is not an Apple.
• False Negative (FN) – (Type-II error)
  • Interpretation: You predicted negative and it's false.
  • You predicted that a fruit is not an Apple, but it is an Apple.
Understanding Confusion Matrix

• Predicted values as Positive and Negative, and
• Actual values as True and False.
Classification report using Confusion Matrix
• Precision
• Recall
• F measure
• Accuracy
Recall (Sensitivity or True Positive Rate)
• Out of the total actual positives, what percentage are predicted positive. It is the same as the TPR (true positive rate), also called the recall value.
• The number of true positives divided by the total number of actual positives:
  Recall = TP / (TP + FN)
Precision
• Out of all the predicted positives, what percentage is truly positive.
• The number of true positives divided by the number of predicted positives:
  Precision = TP / (TP + FP)
F-measure (F1 score)
• It is the harmonic mean of precision and recall.
• It takes both false positives and false negatives into account:
  F1 = 2 · Precision · Recall / (Precision + Recall)
Accuracy
• Ratio of the number of correct predictions to the total number of predictions:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example
Calculate precision, recall and
accuracy for the shown prediction
table, having threshold value 0.6
Solution
TP = 2, TN = 2, FP = 1, FN = 2
Recall = TP / (TP + FN) = 2 / (2 + 2) = 2/4 = 1/2
Precision = TP / (TP + FP) = 2 / (2 + 1) = 2/3
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (2 + 2) / (2 + 2 + 1 + 2) = 4/7
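The same numbers can be checked in a few lines of Python (the TP/TN/FP/FN counts come from the solution above):

```python
TP, TN, FP, FN = 2, 2, 1, 2

recall    = TP / (TP + FN)                    # 2/4  = 0.5
precision = TP / (TP + FP)                    # 2/3  ≈ 0.667
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 4/7  ≈ 0.571
f1        = 2 * precision * recall / (precision + recall)

print(recall, precision, accuracy, round(f1, 3))
```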
AUC-ROC curve
AUC - ROC Curve

The AUC - ROC curve is a performance measurement for classification problems at various threshold settings.

ROC is a probability curve, and AUC represents the degree or measure of separability.

It tells how much the model is capable of distinguishing between classes.

Source: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Receiver Operator Characteristic (ROC) Curve:

This curve is an evaluation metric for binary classification problems.

It is a probability curve that plots the True Positive Rate against the False Positive Rate at various threshold values.

The ROC curve basically separates the ‘signal’ from the ‘noise’.

Source: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
Area Under the Curve (AUC)
• It is a measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.
• The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
• The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.
AUC-ROC curve

The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis.

Source: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
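A small scikit-learn sketch (assumed setup with synthetic data) that computes the points of the ROC curve and the AUC from predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Probabilities for the positive class; roc_curve sweeps many thresholds over them.
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # ROC points: FPR on x-axis, TPR on y-axis
print("AUC:", roc_auc_score(y_te, probs))
```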
AUC Curve (AUC = 1)
• When AUC = 1, the classifier is able to perfectly distinguish between all the Positive and the Negative class points.
• If AUC had been 0, then the classifier would be predicting all Negatives as Positives, and all Positives as Negatives.

Source: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
AUC Curve (0.5 < AUC < 1)
• When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values.
• The classifier is able to detect more True Positives and True Negatives than False Negatives and False Positives.

Source: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
AUC Curve (AUC = 0.5)
• When AUC = 0.5, the classifier is not able to distinguish between Positive and Negative class points.
• This means the classifier is predicting either a random class or a constant class for all the data points.

Source: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
Log Loss (Logistic Loss or Cross Entropy Loss)
• Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in the case of binary classification).
• The more the predicted probability diverges from the actual value, the higher the log-loss value.

Source: https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a
How is the log-loss value calculated?
  Log loss = -(1/N) · Σᵢ [ yᵢ·ln(pᵢ) + (1 - yᵢ)·ln(1 - pᵢ) ]
Where,
• 'i' is the given observation/record,
• 'y' is the actual/true value,
• 'p' is the prediction probability, and
• 'ln' refers to the natural logarithm (logarithmic value using base e) of a number.
Log-loss value
• Example (tables shown on the slide): predicted probabilities and corrected probabilities

Log-loss value
Solution:
• The logarithms of probabilities are negative; to compensate for this negative value, use a negative average of the values
• Log loss or Binary cross-entropy = 0.214


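A direct NumPy translation of the formula. The labels and probabilities below are illustrative, not the slide's table, so the result differs from 0.214.

```python
import numpy as np

def log_loss(y_true, p):
    """Binary cross-entropy: negative average of y*ln(p) + (1-y)*ln(1-p)."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(log_loss([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6]))   # ≈ 0.27
```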
Classification Algorithms
Classification Algorithms

• K-nearest Neighbor (KNN)


• Naïve Bayes Classifier (NBC)
• Support Vector Machine (SVM)
• Decision Tree Classifier (DTC)
K-Nearest Neighbor
Algorithm
K-Nearest Neighbor Algorithm

The K-NN algorithm stores all the available data and classifies a new data point based on similarity.

K-nearest neighbor, or the K-NN algorithm, basically creates an imaginary boundary to classify the data.

When a new data point comes in, the algorithm assigns it to the class on the nearest side of that boundary.
K-Nearest Neighbor Classifier
How does KNN Work?
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new point to the training points.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step-6: The KNN model is ready.

Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
How does KNN Work?

For classification: A class label assigned to the majority of K Nearest Neighbors from the training
dataset is considered as a predicted class for the new data point.

For regression: Mean or median of continuous values assigned to K Nearest Neighbors from
training dataset is a predicted continuous value for our new data point
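A hedged scikit-learn sketch of KNN classification (K = 5 and the Iris dataset are illustrative choices, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5; Euclidean (Minkowski p=2) distance by default
knn.fit(X_tr, y_tr)                         # "training" only stores the data (lazy learner)
print(knn.score(X_te, y_te))                # accuracy via majority voting among the 5 neighbours
```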
Choosing the Right Value of K
• It's very important to have the right K value when analyzing the dataset, to avoid overfitting and underfitting of the dataset.
• There is no particular way to determine the best value for "K".
• Suggestions to choose K:
  • Use error curves.
  • The K value should be odd when considering binary (two-class) classification.
  • Domain knowledge is very useful in choosing the K value.
Support Vector Machine

Support Vector Machine
• Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms.
• It can be used for Classification as well as Regression problems.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future.
• This best decision boundary is called a hyperplane.

Source: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
Hyper-plane
• It is a plane that linearly divides the n-dimensional data points into two components.
• In the case of 2D, the hyperplane is a line; in the case of 3D, it is a plane.
• It is also called an n-dimensional line.
• The figure shows a blue line (hyperplane) linearly separating the data points into two components.
Dimension of the hyperplane
• The dimension of the hyperplane depends upon the number of features.
• If the number of input features is two, then the hyperplane is just a line.
• If the number of input features is three, then the hyperplane becomes a 2-D plane.
• It becomes difficult to imagine when the number of features exceeds three.
Selecting the best
hyper-plane
• One reasonable choice as the best
hyperplane is the one that represents
the largest separation or margin
between the two classes.
• So we choose the hyperplane whose
distance from it to the nearest data
point on each side is maximized.
• If such a hyperplane exists it is known
as the maximum-margin
hyperplane/hard margin.
• So from the figure, we choose L2.

Source: https://www.geeksforgeeks.org/support-vector-machine-algorithm/
Selecting the best hyper-plane
• In the figure, there is one blue ball within the boundary of the red balls.
• So how does SVM classify the data?
• The blue ball within the boundary of the red ones is an outlier of the blue balls.
• The SVM algorithm has the characteristic of ignoring the outlier and finds the best hyperplane that maximizes the margin.
• SVM is robust to outliers.
Margin and Support Vectors
• The solid black line in the figure is the optimal hyperplane.
• The two dotted lines are hyperplanes passing through the data points nearest to the optimal hyperplane.
• The distance between these hyperplanes and the optimal hyperplane is known as the margin, and the closest data points are known as support vectors.
• The margin is an area which does not contain any data points.
Margin and Support Vectors
• When choosing the optimal hyperplane, we choose, among the set of candidate hyperplanes, the one with the highest distance from the closest data points.
• If the optimal hyperplane is very close to the data points, the margin will be very small; it will fit the training data well, but when unseen data comes it will fail to generalize well.
• So our goal is to maximize the margin so that our classifier is able to generalize well for unseen instances.
• So, in SVM our goal is to choose an optimal hyperplane which maximizes the margin.
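A small illustrative sketch with scikit-learn's SVC (the linear kernel, C value, and synthetic data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel='linear', C=1.0)            # C trades margin width against training errors
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
print("number of support vectors:", svm.support_vectors_.shape[0])
```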
Decision Tree
Classification
Decision Tree
Classification
• Decision Tree is a Supervised learning
technique that can be used for both classification
and Regression problems, but mostly it is
preferred for solving Classification problems.
• There are two nodes, which are the Decision
Node and Leaf Node.
• Decision nodes are used to make any decision
and have multiple branches,
• Whereas Leaf nodes are the output of those
decisions and do not contain any further
branches.

Source: https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
Attribute Selection Measures
• Gini Index (Gini Impurity)
• Information Gain
• Chi-square
• Reduction in Variance


Gini Index
• Calculate the Gini impurity for the sub-nodes, using the formula Gini = 1 - Σ pᵢ² (where pᵢ is the proportion of class i in the node)
• Calculate the Gini for the split using the weighted Gini score of each node of that split
• Select the feature with the least Gini impurity for the split
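A tiny helper (illustrative, not from the slides) that computes the Gini impurity of a node from its class labels:

```python
import numpy as np

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the class proportions p_i in the node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))   # 0.5  (maximally impure for two classes)
print(gini_impurity([0, 0, 0, 0]))   # 0.0  (pure node)
```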
Information Gain
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split and calculate the weighted average of all sub-nodes available in the split.
3. Calculate the information gain as:
   Information Gain = Entropy(parent) - weighted average entropy(sub-nodes)
   and choose the node with the highest information gain for splitting.
Chi-Square
1. For each split, individually calculate the Chi-Square value of each child node by taking the sum of Chi-Square values for each class in a node.
2. Calculate the Chi-Square value of each split as the sum of Chi-Square values for all the child nodes.
3. Select the split with the higher Chi-Square value.
• It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
Reduction in Variance
1. For each split, individually calculate the variance of each child node.
2. Calculate the variance of each split as the weighted average variance of the child nodes.
3. Select the split with the lowest variance.
DT Summary

Avoiding Overfitting in Decision Tree (see the sketch after this list)
• Setting constraints on tree size
  • Minimum samples for a node split
  • Minimum samples for a leaf node
  • Maximum depth of the tree (vertical depth)
  • Maximum number of leaf nodes
  • Maximum features to consider for a split
• Tree pruning
  • Cost Complexity Pruning
  • Reduced Error Pruning
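How these constraints typically appear in code, sketched with scikit-learn's DecisionTreeClassifier; the particular values and the Iris data are arbitrary assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,             # maximum depth of the tree (vertical depth)
    min_samples_split=10,    # minimum samples for a node split
    min_samples_leaf=5,      # minimum samples for a leaf node
    max_leaf_nodes=20,       # maximum number of leaf nodes
    max_features="sqrt",     # maximum features to consider for a split
    ccp_alpha=0.01,          # cost-complexity pruning strength
)
tree.fit(X, y)
print("tree depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```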
Random Forest Classifier
Random forest

Random forest is a Supervised Machine Learning Algorithm that is used widely in


Classification and Regression problems.

It builds decision trees on different samples and takes their majority vote for classification
and average in case of regression.

Based on the concept of ensemble learning.

Random Forest is a classifier that contains a number of decision trees on various subsets of
the given dataset

The greater number of trees in the forest leads to higher accuracy and prevents the problem
of overfitting
Working of Random Forest Classifier
• Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model's prediction.
Working of Random Forest Classifier
Two phases
• Create the random forest
• Make predictions from each tree

Steps
• Select random K data points from the training set.
• Build the decision trees associated with the selected data points (subsets).
• Choose the number N of decision trees that you want to build.
• Repeat Steps 1 & 2.
• For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
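A minimal scikit-learn sketch of the above; N = 100 trees and the Iris data are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 decision trees
rf.fit(X_tr, y_tr)                       # each tree is built on a random bootstrap subset
print(rf.score(X_te, y_te))              # prediction = majority vote of the trees
print(rf.feature_importances_)           # relative feature importance
```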
Advantages & Disadvantages

Advantages
• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.
• Random forests can also handle missing values.
• We can get the relative feature importance.

Disadvantages
• Random forest is slow in generating predictions.
• The model is difficult to interpret.
Random Forests vs Decision Trees
• A random forest is a set of multiple decision trees.
• Deep decision trees may suffer from overfitting, but a random forest prevents overfitting by creating trees on random subsets.
• Decision trees are computationally faster.
• Random forests are difficult to interpret, while a decision tree is easily interpretable and can be converted to rules.
Ensemble Methods
(Bagging and Boosting)

Ensemble Method
• Aggregate predictions from a group of predictors, which may be classifiers or regressors.
• Hypothesis: when weak models are correctly combined, more accurate and/or robust models can be obtained.
Ensemble Methods
• Bagging
  • It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting.
  • For example, Random Forest.
• Boosting
  • It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy.
  • For example, AdaBoost, XGBoost.
Bagging and
Boosting

Source: https://www.analyticsvidhya.com/blog/2021/06/understanding-randomforest
Bagging
• Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest.
• Bagging chooses a random sample from the data set.
• Hence each model is generated from the samples (Bootstrap Samples) provided by the Original Data with replacement, known as row sampling.
• This step of row sampling with replacement is called bootstrap.
• Now each model is trained independently, which generates results.
• The final output is based on majority voting after combining the results of all models.
• This step, which involves combining all the results and generating output based on majority voting, is known as aggregation.
Advantages and Disadvantages of Bagging

Advantages

• Decreases the variance without increasing bias.


• Works so well because of diversity in the training data since the sampling is
done by bootstrapping.
• It can save computational time and still can increase the accuracy of the model.
• Works well with small datasets as well.

Disadvantages

• Interpretability gets lost


• Loss of important information
Bagging

Source: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest
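A hedged sketch of bagging with scikit-learn; the decision-tree base learners, 50 estimators, and synthetic data are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner trained independently on each bootstrap sample
    n_estimators=50,
    bootstrap=True,             # row sampling with replacement (the bootstrap step)
    random_state=0,
)
bag.fit(X, y)                   # final output: majority voting over the 50 trees (aggregation)
print(bag.score(X, y))
```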
Boosting
• Converts weak learners into strong learners.
• Trains weak learners sequentially, each trying to correct its predecessor.
• The data set is weighted.
• Mainly focused on reducing bias.
Boosting Algorithm
• Step 1: Initialize the dataset and assign equal weight to each of the data points.
• Step 2: Provide this as input to the model and identify the wrongly classified data points.
• Step 3: Increase the weight of the wrongly classified data points.
• Step 4: If the required results have been obtained, go to Step 5; else go to Step 2.
• Step 5: End.
Source: https://www.geeksforgeeks.org/xgboost/
Examples of Boosting Techniques
• Ada Boost (Adaptive Boosting)
• Gradient Boosting
• XG Boost (Xtreme Gradient Boosting)
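Hedged scikit-learn sketches of the first two techniques (XGBoost lives in the separate xgboost package and is not shown; the synthetic data and 100 estimators are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost: re-weights wrongly classified points between sequential weak learners.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Gradient Boosting: each new tree fits the errors of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("AdaBoost:", ada.score(X_te, y_te), " Gradient Boosting:", gbm.score(X_te, y_te))
```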
Advantages and Disadvantages of Boosting

Advantages
• Boosting is a resilient method that curbs over-fitting easily.
• Versatile: can be applied to a wide variety of problems.

Disadvantages
• Sensitive to outliers.
• It is almost impossible to scale up.
Similarities between Bagging and Boosting

Both are ensemble methods to get N learners from 1 learner.

Both generate several training data sets by random sampling.

Both make the final decision by averaging the N learners (or taking the majority of
them i.e Majority Voting).

Both are good at reducing variance and provide higher stability


Differences between Bagging and Boosting
