EDAB Module 5 Singular Value Decomposition (SVD)


Singular Value Decomposition (SVD)

• Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into three other matrices.
• SVD is a powerful mathematical tool widely used in data science and machine learning, primarily for dimensionality reduction, information extraction, and noise reduction.
• It has some interesting algebraic properties and conveys important geometrical and theoretical insights about linear transformations.
• It also has some important applications in data science.
• The SVD of a matrix A can be written as:

A = U D Vᵀ

• where U and V are orthogonal matrices, and D is a diagonal matrix with non-negative entries called singular values. The columns of U and V are called the left and right singular vectors, respectively.
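A minimal sketch of computing this factorization with NumPy, using an arbitrary example matrix:

import numpy as np

# A small example matrix (any real m x n matrix works)
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Thin SVD: A = U @ D @ Vt, where D is diagonal with the singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(s)

print("Singular values:", s)
# Reconstruct A from its factors to verify the decomposition
A_reconstructed = U @ D @ Vt
print("Max reconstruction error:", np.max(np.abs(A - A_reconstructed)))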
Applications of SVD:
• Matrix Approximation: SVD can be used to approximate a
matrix by using only the top k singular values and vectors,
leading to a lower-rank approximation.
• Principal Component Analysis (PCA): SVD is closely
related to PCA. The principal components of a dataset can be
identified using the singular vectors and values of its
covariance matrix.
• Image Compression: In image processing, SVD can be used
for image compression. By retaining only the most significant
singular values, an image can be represented with lower
storage requirements.
• Recommendation Systems: SVD is applied in collaborative filtering methods for recommendation systems, where it helps in uncovering latent features in user-item interaction matrices.
• Signal Processing: SVD is used in signal processing to analyze and filter signals.
• Data Cleaning and Denoising: SVD can be used for data imputation and cleaning by identifying and removing noise in datasets.
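As a brief illustration of matrix approximation (the idea behind SVD-based image compression), the sketch below keeps only the top k singular values of an arbitrary random matrix:

import numpy as np

def rank_k_approximation(A, k):
    # Approximate A using only its top-k singular values and vectors
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example: a random 100 x 80 matrix approximated by a rank-10 matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))
A_k = rank_k_approximation(A, k=10)

# Storage needed for the rank-10 factors vs. the full matrix
full_storage = A.size
compressed_storage = 10 * (A.shape[0] + A.shape[1] + 1)
print("Approximation error (Frobenius norm):", np.linalg.norm(A - A_k))
print(f"Stored values: {compressed_storage} vs {full_storage}")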
Principal Component Analysis (PCA)
• The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901.
• Principal Component Analysis (PCA) is a dimensionality reduction
technique commonly used in machine learning and data analysis.
• Its goal is to transform the original features of a dataset into a new set
of uncorrelated variables, called principal components, which are
linear combinations of the original features.
• These principal components are ordered by the amount of variance
they capture, with the first component capturing the most variance.
• Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation that converts a set of correlated variables to a set of uncorrelated variables.

• PCA is the most widely used tool in exploratory data analysis and in machine learning for
predictive models.

• Moreover, Principal Component Analysis (PCA) is an unsupervised learning technique used to examine the interrelations among a set of variables.

• It is also known as a general factor analysis where regression determines a line of best fit.

• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of
a dataset while preserving the most important patterns or relationships between the
variables without any prior knowledge of the target variables.
• Principal Component Analysis (PCA) is used to reduce the dimensionality of a
data set by finding a new set of variables, smaller than the original set of
variables, retaining most of the sample’s information, and useful for the
regression and classification of data.

1. Principal Component Analysis (PCA) is a technique for dimensionality reduction that identifies a set of orthogonal axes, called principal components, that capture the maximum variance in the data. The principal components are linear combinations of the original variables in the dataset and are ordered in decreasing order of importance. The total variance captured by all the principal components is equal to the total variance in the original dataset.

2. The first principal component captures the most variation in the data, the second principal component captures the maximum variance that is orthogonal to the first principal component, and so on.

3. Principal Component Analysis can be used for a variety of purposes, including data visualization, feature selection, and data compression.

4. In data visualization, PCA can be used to plot high-dimensional data in two or three dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the most important variables in a dataset. In data compression, PCA can be used to reduce the size of a dataset without losing important information.

5. In Principal Component Analysis, it is assumed that the information is carried in the variance of the features; that is, the higher the variation in a feature, the more information that feature carries.
PCA has several applications, including:
• Dimensionality Reduction: It helps reduce the number of features in a dataset while retaining most of the information.
• Noise Reduction: By focusing on the principal components with the highest variance, PCA can help filter out noise in the data.
• Visualization: It aids in visualizing high-dimensional data by projecting it onto a lower-dimensional space.
• Data Compression: PCA can be used for compressing data while preserving its essential characteristics.
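A minimal scikit-learn sketch of PCA for dimensionality reduction, assuming the Iris dataset as example data and two retained components:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a small example dataset and standardize the features
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto the first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_pca.shape)  # (150, 2)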
Classification

• Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes.
• In machine learning, classification is a type of supervised learning technique
where an algorithm is trained on a labeled dataset to predict the class or
category of new, unseen data.
• The main objective of classification is to build a model that can accurately
assign a label or category to a new observation based on its features.
• For example, a classification model might be trained on a dataset of images
labeled as either dogs or cats and then used to predict the class of new, unseen
images of dogs or cats based on their features such as color, texture, and
shape.
Types of Classification
Classification is of two types:

• Binary Classification: In binary classification, the goal is to classify the input into one of two classes or categories. Example – on the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.

• Multiclass Classification: In multiclass classification, the goal is to classify the input into one of several classes or categories. Example – on the basis of data about different species of flowers, we have to determine which species our observation belongs to.
Types of classification algorithms
There are various types of classifiers. Some of them are :

I) Linear Classifiers: Linear models create a linear decision boundary between classes. They are simple and computationally efficient. Some of the linear classification models are as follows:
 Logistic Regression
 Support Vector Machines having kernel = ‘linear’
 Single-layer Perceptron
 Stochastic Gradient Descent (SGD) Classifier
II) Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows:
 K-Nearest Neighbours
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Ensemble learning classifiers:
 Random Forests,
 AdaBoost,
 Bagging Classifier,
 Voting Classifier,
 ExtraTrees Classifier
 Multi-layer Artificial Neural Networks
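A small sketch contrasting a linear classifier (logistic regression) with a non-linear one (k-nearest neighbours) on a toy dataset with a curved class boundary; the dataset and parameter choices are illustrative assumptions:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# A toy dataset whose two classes are separated by a curved boundary
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Linear classifier: logistic regression (linear decision boundary)
linear_clf = LogisticRegression().fit(X_train, y_train)
# Non-linear classifier: k-nearest neighbours
nonlinear_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Logistic regression accuracy:", linear_clf.score(X_test, y_test))
print("KNN accuracy:", nonlinear_clf.score(X_test, y_test))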
How does classification work?

• The basic idea behind classification is to train a model on a labeled dataset, where the input data is
associated with their corresponding output labels, to learn the patterns and relationships between
the input data and output labels. Once the model is trained, it can be used to predict the output
labels for new unseen data.
• The classification process typically involves the following steps:

1. Understanding the problem: Before getting started with classification, it is important to understand the problem you are trying to solve. What are the class labels you are trying to predict? What is the relationship between the input data and the class labels?

 Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7
independent variables, called features. This means, there can be only two possible outcomes:
 The patient has the disease, which means “True”.
 The patient has no disease, which means “False”.
 This is a binary classification problem.
2. Data preparation: Once you have a good understanding of the problem, the next step is to prepare your data. This includes collecting and preprocessing the data and splitting it into training, validation, and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that can be used by the classification algorithm.

 X: The independent features, in the form of an N×M matrix, where N is the number of observations and M is the number of features.

 y: An N-vector containing the target class label for each of the N observations.

3. Feature Extraction: The relevant features or attributes are extracted from the data that can be used to differentiate between the different classes.

 Suppose our input X has 7 independent features, but only 5 of them influence the label or target values and the remaining 2 are negligibly correlated or uncorrelated; then we will use only those 5 features for model training.
4. Model Selection: There are many different models that can be used for classification, including logistic regression, decision trees, support vector machines (SVM), or neural networks. It is important to select a model that is appropriate for your problem, taking into account the size and complexity of your data and the computational resources you have available.

5. Model Training: Once you have selected a model, the next step is to train it on your training data. This involves adjusting the parameters of the model to minimize the error between the predicted class labels and the actual class labels for the training data.
6. Model Evaluation: After training the model, it is important to evaluate its performance on a validation set. This will give you a good idea of how well the model is likely to perform on new, unseen data.

 Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-ROC curve are the quality
metrics used for measuring the performance of the model.

7. Fine-tuning the model: If the model’s performance is not satisfactory, you can fine-tune it by adjusting the parameters or trying a different model.
8. Deploying the model: Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on new data, so it can be used for real-world problems.
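A condensed sketch of these steps with scikit-learn, assuming the built-in breast cancer dataset and logistic regression as the chosen model:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Steps 1-2: understand the problem (binary labels) and prepare the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 3: simple feature preprocessing (standardization)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Steps 4-5: select and train a model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 6: evaluate on held-out data
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Steps 7-8: fine-tune hyperparameters if needed, then use the model on new data
print("Prediction for the first test sample:", model.predict(X_test[:1]))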
Applications of Classification Algorithm

• Classification algorithms are widely used in many real-world applications across various domains, including:
 Email spam filtering
 Credit risk assessment
 Medical diagnosis
 Image classification
 Sentiment analysis.
 Fraud detection
 Quality control
 Recommendation systems
Classification Error Rate

In machine learning, the misclassification rate is a metric that tells us the percentage of observations that were incorrectly predicted by a classification model.
• It is calculated as:

Misclassification Rate = # incorrect predictions / # total predictions

• The value for the misclassification rate can range from 0 to 1, where:
 0 represents a model that had zero incorrect predictions.
 1 represents a model whose predictions were all incorrect.
• The lower the value of the misclassification rate, the better a classification model is able to predict the outcomes of the response variable.
• The following example shows how to calculate the misclassification rate for a logistic regression model in practice.
Example: Calculating Misclassification Rate for a Logistic Regression Model
Suppose we use a logistic regression model to predict whether or not 400 different college
basketball players get drafted into the NBA.
The following confusion matrix summarizes the predictions made by the model:

Here is how to calculate the misclassification rate for the model:


•Misclassification Rate = # incorrect predictions / # total predictions
•Misclassification Rate = (false positive + false negative) / (total predictions)
•Misclassification Rate = (70 + 40) / (400)
•Misclassification Rate = 0.275
• The misclassification rate for this model is 0.275 or 27.5%.
• This means the model incorrectly predicted the outcome for 27.5% of the players.
• The opposite of misclassification rate would be accuracy,
which is calculated as:
• Accuracy = 1 – Misclassification rate
 Accuracy = 1 – 0.275
 Accuracy = 0.725

• This means the model correctly predicted the outcome for 72.5% of the players.
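The same calculation in Python, using the counts from this example (70 false positives, 40 false negatives, 400 total predictions):

# Counts taken from the example above
false_positives = 70
false_negatives = 40
total_predictions = 400

misclassification_rate = (false_positives + false_negatives) / total_predictions
accuracy = 1 - misclassification_rate

print(f"Misclassification rate: {misclassification_rate:.3f}")  # 0.275
print(f"Accuracy: {accuracy:.3f}")                              # 0.725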

Bayes Classification Rule

What is Naive Bayes classifiers?

• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is
not a single algorithm but a family of algorithms where all of them share a common principle, i.e.
every pair of features being classified is independent of each other. To start with, let us consider a
dataset.
• Why it is called Naive Bayes?

• The “Naive” part of the name indicates the simplifying assumption made by the Naïve Bayes classifier. The classifier assumes that the features used to describe an observation are conditionally independent, given the class label. The “Bayes” part of the name refers to Reverend Thomas Bayes, an 18th-century statistician and theologian who formulated Bayes’ theorem.
• Assumption of Naive Bayes
• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome:
1. Feature independence: The features of the data are conditionally independent of each other, given the class label.
2. Continuous features are normally distributed: If a feature is continuous, then it is assumed to be normally distributed within each class.
3. Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a multinomial distribution within each class.
4. Features are equally important: All features are assumed to contribute equally to the prediction of the class label.
5. No missing data: The data should not contain any missing values.
• Bayes’ Theorem
• Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events and P(B) ≠ 0.


 Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed as the evidence.
 P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).
 P(B) is the marginal probability: the probability of the evidence.
 P(A|B) is the posteriori probability of A, i.e. the probability of the event after the evidence is seen.
 P(B|A) is the likelihood probability, i.e. the likelihood that the hypothesis will come true based on the evidence.
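A small numeric sketch of Bayes' theorem, using hypothetical probabilities for a disease-testing scenario (the numbers are illustrative assumptions):

# Hypothetical values for illustration
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): likelihood of a positive test if diseased
p_pos_given_healthy = 0.05  # false-positive rate, P(B | not A)

# P(B): marginal probability of a positive test (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"Posterior P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.161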
Types of Naive Bayes Model
• There are three types of Naive Bayes Model:
• Gaussian Naive Bayes classifier

• In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
Multinomial Naive Bayes

Feature vectors represent the frequencies with which certain events have been generated by a
multinomial distribution. This is the event model typically used for document classification.
Bernoulli Naive Bayes

In the multivariate Bernoulli event model, features are independent Booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. the frequency of a word in the document).
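A brief scikit-learn sketch of the three variants; the Iris data and the randomly generated count/binary features are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features -> Gaussian Naive Bayes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("GaussianNB accuracy:", GaussianNB().fit(X_train, y_train).score(X_test, y_test))

# Count features (e.g. word counts in documents) -> Multinomial Naive Bayes
rng = np.random.default_rng(0)
X_counts = rng.integers(0, 5, size=(200, 10))
y_binary = rng.integers(0, 2, size=200)
print("Fitted:", MultinomialNB().fit(X_counts, y_binary))

# Binary features (word occurs / does not occur) -> Bernoulli Naive Bayes
X_bin = (X_counts > 0).astype(int)
print("Fitted:", BernoulliNB().fit(X_bin, y_binary))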

Advantages of Naive Bayes Classifier
 Easy to implement and computationally efficient.
 Effective in cases with a large number of features.
 Performs well even with limited training data.

Disadvantages of Naive Bayes Classifier
 Assumes that features are independent, which may not always hold in real-world data.
 Can be influenced by irrelevant attributes.
 May assign zero probability to unseen events, leading to poor generalization.

Applications of Naive Bayes Classifier
 Spam Email Filtering: Classifies emails as spam or non-spam based on features.
 Text Classification: Used in sentiment analysis, document categorization, and topic classification.
 Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.
 Credit Scoring: Evaluates the creditworthiness of individuals for loan approval.
 Weather Prediction: Classifies weather conditions based on various factors.


Linear Methods for Classification

• Linear methods for classification are techniques that use linear functions to separate different classes of data. They are
based on the idea of finding a decision boundary that minimizes the classification error or maximizes the likelihood of
the data. Some examples of linear methods for classification are:

 Logistic regression: This method models the probability of a binary response variable as a logistic function of a
linear combination of predictor variables. It estimates the parameters of the linear function by maximizing the
likelihood of the observed data.

 Linear discriminant analysis (LDA): This method assumes that the data from each class follows a multivariate
normal distribution with a common covariance matrix, and derives a linear function that best discriminates
between the classes. It estimates the parameters of the normal distributions by using the sample means and
covariance matrix of the data.

 Support vector machines (SVMs): This method finds a linear function that maximizes the margin between the
classes, where the margin is defined as the distance from the decision boundary to the closest data points. It uses a
technique called kernel trick to transform the data into a higher-dimensional space where a linear separation is
possible.
 Perceptron: This method is a simple algorithm that iteratively updates the parameters of a linear
function based on the prediction errors of the data points. It converges to a solution if the data is linearly
separable, but may not find the optimal solution.

 Stochastic gradient descent (SGD): This method is a general optimization technique that iteratively
updates the parameters of a linear function by moving in the direction of the negative gradient of a loss
function. It can be applied to various linear methods for classification, such as logistic regression and
SVMs.

• These are some of the most common linear methods for classification, but there are also other variants and
extensions of these methods. Linear methods for classification are useful because they are simple, fast,
and interpretable, but they may not perform well if the data is not linearly separable or has complex
nonlinear patterns. In that case, nonlinear methods for classification may be more suitable.
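A brief scikit-learn sketch of two of these linear methods, the perceptron and an SGD-trained logistic regression, on an illustrative synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron, SGDClassifier

# A synthetic dataset for illustration
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Perceptron: iteratively updates a linear decision boundary from its errors
perceptron = Perceptron(max_iter=1000).fit(X_train, y_train)

# SGD with log loss: logistic regression fitted by stochastic gradient descent
sgd_logreg = SGDClassifier(loss="log_loss", max_iter=1000).fit(X_train, y_train)

print("Perceptron accuracy:", perceptron.score(X_test, y_test))
print("SGD logistic regression accuracy:", sgd_logreg.score(X_test, y_test))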
Logistic Regression

• Logistic regression is a supervised machine learning algorithm mainly used for binary classification. It uses a logistic function, also known as a sigmoid function, that takes the independent variables as input and produces a probability value between 0 and 1. For example, if we have two classes, Class 0 and Class 1, and the value of the logistic function for an input is greater than 0.5 (the threshold value), then it belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems. The difference between linear regression and logistic regression is that linear regression output is a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class or not.
• Understanding Logistic Regression
It is used for predicting the categorical dependent variable using a given set of independent variables.

 Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a
categorical or discrete value.

 It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it gives
the probabilistic values which lie between 0 and 1.
 Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

 In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).

 The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.

 Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.

 Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification.
• Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

 It maps any real value into another value within a range of 0 and 1. The value
of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the “S” form.
 The S-form curve is called the Sigmoid function or the logistic function.

 In logistic regression, we use the concept of a threshold value, which decides between the probability of 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
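A minimal sketch of the sigmoid function and a 0.5 threshold:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)
print(probabilities)            # values squashed between 0 and 1

# Apply the threshold to turn probabilities into class labels
predicted_class = (probabilities >= 0.5).astype(int)
print(predicted_class)          # [0 0 1 1 1]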
Terminologies involved in Logistic Regression
• Here are some common terms involved in logistic regression:

 Independent variables: The input characteristics or predictor factors applied to the dependent variable’s predictions.
 Dependent variable: The target variable in a logistic regression model, which we are trying to predict.

 Logistic function: The formula used to represent how the independent and dependent variables relate to one another. The
logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the
dependent variable being 1 or 0.

 Odds: The ratio of something occurring to something not occurring. It is different from probability, as probability is the ratio of something occurring to everything that could possibly occur.

 Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression, the log
odds of the dependent variable are modeled as a linear combination of the independent variables and the intercept.
• Coefficient: The logistic regression model’s estimated parameters, which show how the independent and dependent variables relate to one another.

 Intercept: A constant term in the logistic regression model, which represents the log odds when all independent
variables are equal to zero.
 Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model,
which maximizes the likelihood of observing the data given the model.
• Assumptions for Logistic Regression
• The assumptions for Logistic regression are as follows:

 Independent observations: Each observation is independent of the others, meaning there is no correlation or dependence between observations.

 Binary dependent variables: It takes the assumption that the dependent variable must be binary or dichotomous, meaning it can take only two
values. For more than two categories SoftMax functions are used.

 Linearity relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the
dependent variable should be linear.
 No outliers: There should be no outliers in the dataset.
 Large sample size: The sample size is sufficiently large
• Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:

1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0
or 1, Pass or Fail, etc.

2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.

3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as
“low”, “Medium”, or “High”.
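A short scikit-learn sketch of binomial and multinomial logistic regression, assuming the Iris dataset as example data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Binomial logistic regression: keep only two classes (0 and 1)
mask = y < 2
binom = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
print("Binomial classes:", binom.classes_)
print("Class probabilities for one flower:", binom.predict_proba(X[mask][:1]))

# Multinomial logistic regression: all three (unordered) iris species
multi = LogisticRegression(max_iter=1000).fit(X, y)
print("Multinomial classes:", multi.classes_)
print("Predicted species for one flower:", multi.predict(X[:1]))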
Sr. No | Linear Regression | Logistic Regression
1 | Linear regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
2 | Linear regression is used for solving regression problems. | It is used for solving classification problems.
3 | In this we predict the value of continuous variables. | In this we predict the values of categorical variables.
4 | In this we find the best fit line. | In this we find the S-curve.
5 | The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
6 | The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc.
7 | It requires a linear relationship between dependent and independent variables. | It does not require a linear relationship.
8 | There may be collinearity between the independent variables. | There should not be collinearity between the independent variables.
Binary classification
• Binary classification is a type of supervised learning algorithm in machine learning that categorizes new
observations into one of two classes. It is a fundamental concept in machine learning and serves as the
cornerstone for many predictive modeling tasks.

• In binary classification, the goal is to predict whether an observation belongs to one of two classes. For example,
in a medical diagnosis, a binary classifier for a specific disease could take a patient’s symptoms as input features
and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis are:

 True Positive (TP): The patient is diseased and the model predicts “diseased”.
 True Negative (TN): The patient is healthy and the model predicts “healthy”.
 False Positive (FP): The patient is healthy but the model predicts “diseased”.
 False Negative (FN): The patient is diseased but the model predicts “healthy”.

• We can evaluate a binary classifier based on these four parameters. After obtaining their counts, we can compute the accuracy of the binary classifier as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

In machine learning, many methods utilize binary classification. The most common are:
•Support Vector Machines
•Naive Bayes
•Nearest Neighbour
•Decision Trees
•Logistic Regression
•Neural Networks
The following is an example of a binary classification application, where the two possible classes for each observation are 0 and 1:

In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input features and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis are positive and negative.
Evaluation of binary classifiers
If the model successfully predicts the patients as positive, this case is called True Positive (TP). If the
model successfully predicts patients as negative, this is called True Negative (TN). The binary classifier
may misdiagnose some patients as well. If a diseased patient is classified as healthy by a negative test
result, this error is called False Negative (FN). Similarly, if a healthy patient is classified as diseased by a positive test result, this error is called False Positive (FP).
We can evaluate a binary classifier based on the following parameters:
•True Positive (TP): The patient is diseased and the model predicts "diseased".
•False Positive (FP): The patient is healthy but the model predicts "diseased".
•True Negative (TN): The patient is healthy and the model predicts "healthy".
•False Negative (FN): The patient is diseased but the model predicts "healthy".

After obtaining these values, we can compute the accuracy score of the binary classifier as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

These parameters can be summarized in a confusion matrix.
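A small sketch of these evaluation quantities with scikit-learn, using hypothetical labels and predictions (1 = diseased, 0 = healthy):

from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels ordered 0, 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Accuracy:", accuracy, "==", accuracy_score(y_true, y_pred))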


MultiClass Classification Using K-Nearest Neighbours
• The K-Nearest Neighbors (KNN) algorithm is a robust and intuitive machine learning method employed to tackle classification and regression problems. By capitalizing on the concept of similarity, KNN predicts the label or value of a new data point by considering its K closest neighbours in the training dataset. In this section, we will learn about this supervised learning algorithm, the k-Nearest Neighbours, highlighting its user-friendly nature.
• What is the K-Nearest Neighbors Algorithm?

• K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It
belongs to the supervised learning domain and finds intense application in pattern recognition, data mining,
and intrusion detection.

• It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.
• As an example, consider a table of data points containing two features. Now, given another set of data points (also called testing data), allocate these points to a group by analyzing the training set. Note that the unclassified points are marked as ‘White’.

Intuition Behind KNN Algorithm
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point close to a cluster of points classified as ‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the second point (5.5, 4.5) should be classified as ‘Red’.
Why do we need a KNN algorithm?

• The K-NN algorithm is a versatile and widely used machine learning algorithm, valued primarily for its simplicity and ease of implementation. It does not require any assumptions about the underlying data distribution. It can also handle both numerical and categorical data, making it a flexible choice for various types of datasets in classification and regression tasks. It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset. K-NN is less sensitive to outliers compared to other algorithms.

• The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance
metric, such as Euclidean distance. The class or value of the data point is then determined by the majority
vote or average of the K neighbors. This approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.
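A minimal scikit-learn sketch of KNN classification, assuming the Iris dataset and K = 5:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Multiclass data: three iris species described by four numeric features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# K = 5 neighbours with Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
print("Predicted class of the first test sample:", knn.predict(X_test[:1]))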
• Applications of the KNN Algorithm

 Data Preprocessing – While dealing with any Machine Learning problem, we first perform the EDA part, in which we may find that the data contains missing values; multiple imputation methods are available for this. One such method is the KNN Imputer, which is quite effective and generally used for sophisticated imputation methodologies.

 Pattern Recognition – KNN algorithms work very well for pattern recognition; if you train a KNN algorithm using the MNIST dataset and then perform the evaluation process, you will find that the accuracy is quite high.

 Recommendation Engines – The main task performed by a KNN algorithm is to assign a new query point to a pre-existing group that has been created using a huge corpus of datasets. This is exactly what is required in recommender systems: assign each user to a particular group and then provide them recommendations based on that group’s preferences.
• Advantages of the KNN Algorithm
 Easy to implement as the complexity of the algorithm is not that high.

 Adapts Easily – Because the KNN algorithm stores all the data in memory, whenever a new example or data point is added, the algorithm adjusts itself to that new example, which then contributes to future predictions as well.
 Few Hyperparameters – The only parameters required in the training of a KNN algorithm are the value of k and the choice of the distance metric.
• Disadvantages of the KNN Algorithm

 Does not scale – The KNN algorithm is also considered a lazy algorithm, which means it requires a lot of computing power as well as data storage. This makes the algorithm both time-consuming and resource-intensive.

 Curse of Dimensionality – There is a term known as the peaking phenomenon: the KNN algorithm is affected by the curse of dimensionality, which means the algorithm has a hard time classifying the data points properly when the dimensionality is too high.

 Prone to Overfitting – Because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Hence, feature selection as well as dimensionality reduction techniques are generally applied to deal with this problem.
Linear Discriminant Analysis

• Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for classification tasks in machine learning.
It is a technique used to find a linear combination of features that best separates the classes in a dataset. LDA works
by projecting the data onto a lower-dimensional space that maximizes the separation between the classes. It does this
by finding a set of linear discriminants that maximize the ratio of between-class variance to within-class variance. In
other words, it finds the directions in the feature space that best separate the different classes of data.

• LDA assumes that the data has a Gaussian distribution and that the covariance matrices of the different classes are
equal. It also assumes that the data is linearly separable, meaning that a linear decision boundary can accurately classify
the different classes.

• Linear Discriminant Analysis, also called Normal Discriminant Analysis or Discriminant Function Analysis, is a dimensionality reduction technique that is commonly used for supervised classification problems. It is used for modelling differences in groups, i.e. separating two or more classes. It is used to project features from a higher-dimensional space into a lower-dimensional space.
Example
• Suppose we have two sets of data points belonging to two different classes that we want to classify.
As shown in the given 2D graph, when the data points are plotted on the 2D plane, there’s no straight
line that can separate the two classes of the data points completely. Hence, in this case, LDA
(Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D graph in order to
maximize the separability between the two classes.
• Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data
onto a new axis in a way to maximize the separation of the two categories and hence, reducing the 2D
graph into a 1D graph.
• Two criteria are used by LDA to create a new axis:
1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class.

In the above graph, it can be seen that a new axis (in red) is
generated and plotted in the 2D graph such that it maximizes the
distance between the means of the two classes and minimizes the
variation within each class. In simple terms, this newly generated
axis increases the separation between the data points of the two
classes. After generating this new axis using the above-mentioned
criteria, all the data points of the classes are plotted on this new
axis and are shown in the figure given below
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
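A minimal scikit-learn sketch of LDA used both as a classifier and for supervised dimensionality reduction, assuming the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Fit LDA and project the 4-D data onto 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)

print("Projected shape:", X_projected.shape)   # (150, 2)
print("Training accuracy:", lda.score(X, y))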
Optimal Classification - Naive Bayes Classifiers

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other. To start with, let us consider a dataset.
One of the most simple and effective classification algorithms, the Naïve Bayes classifier aids in the rapid development of machine learning models with rapid prediction capabilities.
Why it is called Naive Bayes?
The “Naive” part of the name indicates the simplifying assumption made
by the Naïve Bayes classifier. The classifier assumes that the features
used to describe an observation are conditionally independent, given the
class label. The “Bayes” part of the name refers to Reverend Thomas
Bayes, an 18th-century statistician and theologian who formulated Bayes’
theorem.

Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit (“Yes”) or unfit (“No”) for playing golf. Here is a tabular representation of our dataset.
outlook temperature humidity windy play golf
0 Rainy Hot High False No
1 Rainy Hot High True No
2 Overcast Hot High False Yes
3 Sunny Mild High False Yes
4 Sunny Cool Normal False Yes
5 Sunny Cool Normal True No
6 Overcast Cool Normal True Yes
7 Rainy Mild High False No
8 Rainy Cool Normal False Yes
9 Sunny Mild Normal False Yes
 The dataset is divided into two parts, namely, feature matrix and the response vector

 Feature matrix contains all the vectors(rows) of dataset in which each vector consists of the value of
dependent features. In above dataset, features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
 Response vector contains the value of class variable (prediction or output) for each row of feature
matrix. In above dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome:

 Feature independence: The features of the data are conditionally independent of each other, given the class
label.
 Continuous features are normally distributed: If a feature is continuous, then it is assumed to be normally
distributed within each class.

 Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a
multinomial distribution within each class.

 Features are equally important: All features are assumed to contribute equally to the prediction of the class
label.
 No missing data: The data should not contain any missing values.
Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events and P(B) ≠ 0


 Basically, we are trying to find probability of event A, given the event B is true. Event B is also termed as
evidence.
 P(A) is the priori of A (the prior probability, i.e. Probability of event before evidence is seen). The
evidence is an attribute value of an unknown instance (here, it is event B).
 P(B) is Marginal Probability: Probability of Evidence.
 P(A|B) is the posteriori probability of A, i.e. the probability of the event after the evidence is seen.
 P(B|A) is Likelihood probability i.e the likelihood that a hypothesis will come true based on the evidence.
Types of Naive Bayes Model
There are three types of Naive Bayes Model:
Gaussian Naive Bayes classifier
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
Frequency and likelihood table for the “outlook” feature:

Outlook  | Yes | No | P(Outlook | Yes) | P(Outlook | No)
Sunny    | 3   | 2  | 3/9              | 2/5
Rainy    | 4   | 0  | 4/9              | 0/5
Overcast | 2   | 3  | 2/9              | 3/5
Total    | 9   | 5  | 100%             | 100%
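A small pandas sketch that builds such frequency and likelihood tables from the 10-row golf dataset shown earlier; since the slide's table appears to be based on a larger version of the dataset, the resulting counts will differ:

import pandas as pd

# The 10-row golf dataset shown above (outlook feature and class label only)
data = {
    "outlook": ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny",
                "Sunny", "Overcast", "Rainy", "Rainy", "Sunny"],
    "play_golf": ["No", "No", "Yes", "Yes", "Yes",
                  "No", "Yes", "No", "Yes", "Yes"],
}
df = pd.DataFrame(data)

# Frequency table: counts of each outlook value per class
freq = pd.crosstab(df["outlook"], df["play_golf"])
print(freq)

# Likelihood table: P(outlook | play_golf), i.e. column-wise relative frequencies
likelihood = pd.crosstab(df["outlook"], df["play_golf"], normalize="columns")
print(likelihood)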


Multinomial Naive Bayes
• Feature vectors represent the frequencies with which certain events have been generated by a multinomial
distribution. This is the event model typically used for document classification.
• Bernoulli Naive Bayes

• In the multivariate Bernoulli event model, features are independent Booleans (binary variables) describing
inputs. Like the multinomial model, this model is popular for document classification tasks, where binary
term occurrence (i.e. a word occurs in a document or not) features are used rather than term frequencies (i.e.
frequency of a word in the document).
• Advantages of Naive Bayes Classifier
 Easy to implement and computationally efficient.
 Effective in cases with a large number of features.
• Performs well even with limited training data.
Disadvantages of Naive Bayes Classifier
• Assumes that features are independent, which may not always hold in real-world data.
• Can be influenced by irrelevant attributes.
• May assign zero probability to unseen events, leading to poor generalization.
• Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-spam
based on features.
Text Classification: Used in sentiment analysis, document
categorization, and topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a disease
based on symptoms.
Credit Scoring: Evaluates creditworthiness of individuals for loan
approval.
Weather Prediction: Classifies weather conditions based on various
factors.
