EDAB Module 5 Singular Value Decomposition (SVD)
$A = U D V^{T}$
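The factorization above can be spelled out with explicit shapes. The block below is a sketch under the usual convention, assuming A is a real m × n matrix. The symbols m, n, r, σ_i, N and the centered data matrix X are introduced here for illustration and do not appear in the slides.

```latex
A = U D V^{T}, \qquad
A \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times m},\;
D \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n}

U^{T} U = I_m, \qquad V^{T} V = I_n, \qquad
D = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0, \quad r = \min(m, n)

% Link to PCA: for a column-centered data matrix X with N rows and SVD X = U D V^T,
\frac{1}{N-1} X^{T} X = V \, \frac{D^{T} D}{N-1} \, V^{T}
```

Read this way, the columns of V give the principal directions of X, and σ_i² / (N − 1) is the variance of the data along the i-th direction.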
• PCA is the most widely used tool in exploratory data analysis and in machine learning for
predictive models.
• It can be viewed as a general form of factor analysis: where regression determines a single line of best fit, PCA determines several orthogonal lines of best fit to the data.
• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of
a dataset while preserving the most important patterns or relationships between the
variables without any prior knowledge of the target variables.
• Principal Component Analysis (PCA) is used to reduce the dimensionality of a
data set by finding a new set of variables, smaller than the original set of
variables, retaining most of the sample’s information, and useful for the
regression and classification of data.
1. Principal Component Analysis (PCA) is a technique for dimensionality reduction that identifies a set of orthogonal axes, called principal components,
that capture the maximum variance in the data. The principal components are linear combinations of the original variables in the dataset and are ordered
in decreasing order of importance. The total variance captured by all the principal
components is equal to the total variance in the original dataset.
2. The first principal component captures the most variation
in the data, the second principal component
captures the maximum variance that is orthogonal to the
first principal component, and so on.
3. Principal Component Analysis can be used for a variety of
purposes, including data visualization, feature selection,
and data compression.
4. In data visualization, PCA can be used to plot high-dimensional data
in two or three dimensions, making it easier to interpret. In feature
selection, PCA can be used to identify the most important variables in
a dataset. In data compression, PCA can be used to reduce the size of
a dataset without losing important information.
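The properties listed above (orthogonal components, ordering by variance, and total variance being preserved) can be checked on a small case. The sketch below is not a general PCA implementation: it assumes 2-D data only, so that the eigenvalues of the 2 × 2 covariance matrix have a closed form, and the sample points are hypothetical values made up for illustration.

```python
import math

def pca_2d(points):
    """PCA for 2-D points via the closed-form eigen-decomposition
    of the 2x2 sample covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix of the centered data.
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    eigenvalues = (mid + radius, mid - radius)
    # Angle of the first principal axis; the second axis is perpendicular.
    theta = 0.5 * math.atan2(2 * b, a - c)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    total_variance = a + c
    return eigenvalues, (pc1, pc2), total_variance

# Hypothetical sample data (not taken from the slides).
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
eigenvalues, (pc1, pc2), total_variance = pca_2d(points)
explained_ratio = [ev / total_variance for ev in eigenvalues]
```

Here `eigenvalues` holds the variance captured by each component, and `explained_ratio` is the share of the total variance each one accounts for.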
• The basic idea behind classification is to train a model on a labeled dataset, where the input data is
associated with their corresponding output labels, to learn the patterns and relationships between
the input data and output labels. Once the model is trained, it can be used to predict the output
labels for new unseen data.
• The classification process typically involves the following steps:
1. Understanding the problem: Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7
independent variables, called features. This means there can be only two possible outcomes:
The patient has the disease, which means “True”.
The patient has no disease, which means “False”.
This is a binary classification problem.
2. Data preparation: Once you have a good understanding of the problem, the next step is to prepare
your data. This includes collecting and preprocessing the data and splitting it into training, validation,
and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that can be
used by the classification algorithm.
X: The matrix of independent features, of size N × M, where N is the number of observations and
M is the number of features.
3. Feature Extraction: The relevant features or attributes that can be used to differentiate between
the different classes are extracted from the data.
Suppose our input X has 7 independent features, of which only 5 influence the label or
target values, while the remaining 2 are negligibly correlated or not correlated with it. Then we will use
only these 5 features for model training.
4. Model Selection: There are many different models that can be used for classification,
including logistic regression, decision trees, support vector machines (SVM), or neural
networks. It is important to select a model that is appropriate for your problem, taking into account the
size and complexity of your data, and the computational resources you have available.
5. Model Training: Once you have selected a model, the next step is to train it on your training
data. This involves adjusting the parameters of the model to minimize the error between the
predicted class labels and the actual class labels for the training data.
6. Model Evaluation: After training the model, it is important to evaluate its
performance on a validation set. This will give you a good idea of how well the model is likely to
perform on new, unseen data.
Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-ROC curve are the quality
metrics used for measuring the performance of the model.
7. Fine-tuning the model: If the model’s performance is not satisfactory, you can fine-tune it by adjusting the
parameters, or trying a different model.
8. Deploying the model: Finally, once we are satisfied with the performance of the model, we can deploy it
to make predictions on new data, so that it can be used for real-world problems.
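The data preparation step mentions splitting the data into training, validation, and test sets. A minimal sketch of such a split is given below. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not something the slides prescribe.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows once, then cut them into three disjoint sets."""
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # seeded shuffle for reproducibility
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Ten hypothetical observations, identified only by their index.
rows = list(range(10))
train, val, test = train_val_test_split(rows)
```

The model is fitted on `train`, tuned against `val`, and scored once on `test`, so the final estimate comes from data the model has never seen.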
Applications of Classification Algorithm
• The lower the value for the misclassification rate, the better a classification
model is able to predict the outcomes of the response variable.
• This means the model correctly predicted the outcome for 72.5% of the players.
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is
not a single algorithm but a family of algorithms where all of them share a common principle, i.e.
every pair of features being classified is independent of each other. To start with, let us consider a
dataset.
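• Why is it called Naive Bayes?
The “naive” part of the name refers to the simplifying assumption that the features are independent of each other given
the class; the “Bayes” part refers to its basis in Bayes’ Theorem.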
• Bayes’ Theorem
• Bayes’ Theorem finds the probability of an event occurring given the probability of another event that
has already occurred. Bayes’ theorem is stated mathematically as the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
where A and B are events and P(B) ≠ 0.
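A short numeric illustration of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), may help. It reuses the disease-prediction setting from earlier in the module, but every number below (prevalence, test sensitivity, false positive rate) is a hypothetical value chosen for the example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem.

    prior               = P(disease)
    likelihood          = P(positive | disease)
    false_positive_rate = P(positive | no disease)
    """
    # P(positive) by the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false positives.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
# p = 0.009 / 0.0585, roughly 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior probability of the disease is small.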
• In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed
according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution.
When plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values.
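The Gaussian assumption can be made concrete with the normal density function. The sketch below shows how a per-class mean and standard deviation might be estimated for one feature and then used as a likelihood; the function names and the sample values are hypothetical.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of one feature."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, math.sqrt(var)

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Hypothetical values of one continuous feature within a single class.
feature_values = [4.8, 5.1, 5.0, 5.3, 4.9]
mean, std = fit_gaussian(feature_values)
likelihood = gaussian_pdf(5.2, mean, std)   # P(x = 5.2 | class) as a density
```

In a Gaussian Naive Bayes classifier, one such density is evaluated per feature and per class, and the results are multiplied together with the class prior.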
Multinomial Naive Bayes
Feature vectors represent the frequencies with which certain events have been generated by a
multinomial distribution. This is the event model typically used for document classification.
Bernoulli Naive Bayes
In the multivariate Bernoulli event model, features are independent booleans (binary variables)
describing inputs. Like the multinomial model, this model is popular for document classification
tasks, where binary term occurrence (i.e. a word occurs in a document or not) features are used
rather than term frequencies (i.e. frequency of a word in the document).
• Linear methods for classification are techniques that use linear functions to separate different classes of data. They are
based on the idea of finding a decision boundary that minimizes the classification error or maximizes the likelihood of
the data. Some examples of linear methods for classification are:
Logistic regression: This method models the probability of a binary response variable as a logistic function of a
linear combination of predictor variables. It estimates the parameters of the linear function by maximizing the
likelihood of the observed data.
Linear discriminant analysis (LDA): This method assumes that the data from each class follows a multivariate
normal distribution with a common covariance matrix, and derives a linear function that best discriminates
between the classes. It estimates the parameters of the normal distributions by using the sample means and
covariance matrix of the data.
Support vector machines (SVMs): This method finds a linear function that maximizes the margin between the
classes, where the margin is defined as the distance from the decision boundary to the closest data points. When the
classes are not linearly separable in the original space, it can use a technique called the kernel trick to work
implicitly in a higher-dimensional space where a linear separation is possible.
Perceptron: This method is a simple algorithm that iteratively updates the parameters of a linear
function based on the prediction errors of the data points. It converges to a solution if the data is linearly
separable, but may not find the optimal solution.
Stochastic gradient descent (SGD): This method is a general optimization technique that iteratively
updates the parameters of a linear function by moving in the direction of the negative gradient of a loss
function. It can be applied to various linear methods for classification, such as logistic regression and
SVMs.
• These are some of the most common linear methods for classification, but there are also other variants and
extensions of these methods. Linear methods for classification are useful because they are simple, fast,
and interpretable, but they may not perform well if the data is not linearly separable or has complex
nonlinear patterns. In that case, nonlinear methods for classification may be more suitable.
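Of the methods listed above, the perceptron is the shortest to write out. The following is a minimal sketch, assuming labels coded as -1 and +1 and a small, linearly separable toy dataset invented for the example; the function names and the learning rate are illustrative choices.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule. Labels must be -1 or +1.
    Returns the learned (weights, bias)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                    # a full pass with no errors: stop
            break
    return w, b

def predict(w, b, x):
    """Side of the linear decision boundary that x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable toy data.
samples = [(2, 3), (3, 3), (4, 5), (-1, -2), (-2, -1), (-3, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(samples, labels)
```

The learned boundary is the line w·x + b = 0. On data that is not linearly separable, the loop simply stops after `epochs` passes without any guarantee of a good separator.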
Logistic Regression
• Logistic regression is a supervised machine learning algorithm mainly used for binary classification.
It uses a logistic function, also known as a sigmoid function, that takes a linear combination of the independent variables as input and
produces a probability value between 0 and 1. For example, with two classes, Class 0 and Class 1, if the value
of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to
Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for
classification problems. The difference between linear regression and logistic regression is that linear regression
outputs a continuous value that can be anything, while logistic regression predicts the probability that an
instance belongs to a given class.
• Understanding Logistic Regression
It is used for predicting the categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a
categorical or discrete value.
It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it gives
the probabilistic values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an “S”-shaped logistic
function, whose output is bounded between the two limiting values 0 and 1.
The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification.
• Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within the range of 0 to 1. The output
of logistic regression must lie between 0 and 1 and cannot go beyond
these limits, so it forms an “S”-shaped curve.
The S-shaped curve is called the sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides
whether a probability is mapped to 0 or 1: values above the threshold are
assigned to 1, and values below the threshold are assigned to 0.
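The mapping and the threshold rule described above can be sketched in a few lines. The standard logistic function is σ(z) = 1 / (1 + e^(-z)). The branch on the sign of z is only a numerical-stability detail, and the function names and default threshold of 0.5 are illustrative choices.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form to avoid overflow in exp(-z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(z, threshold=0.5):
    """Assign class 1 when the predicted probability exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

probabilities = [sigmoid(z) for z in (-4, -1, 0, 1, 4)]
```

The values in `probabilities` increase monotonically and are symmetric around 0.5, which is the S shape the text refers to.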
Terminologies involved in Logistic Regression
• Here are some common terms involved in logistic regression:
Independent variables: The input characteristics or predictor factors applied to the dependent variable’s predictions.
Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
Logistic function: The formula used to represent how the independent and dependent variables relate to one another. The
logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the
dependent variable being 1 or 0.
Odds: The ratio of the probability of something occurring to the probability of it not occurring. It is different from probability, as probability is
the ratio of something occurring to everything that could possibly occur.
Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression, the log
odds of the dependent variable are modeled as a linear combination of the independent variables and the intercept.
Coefficient: The logistic regression model’s estimated parameters, which show how the independent and dependent variables relate to one
another.
Intercept: A constant term in the logistic regression model, which represents the log odds when all independent
variables are equal to zero.
Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model,
which maximizes the likelihood of observing the data given the model.
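The terms above fit together in a single set of relations. The block below is a sketch in standard notation; the symbols p (probability that the dependent variable is 1), x_1 … x_n (independent variables), β_0 (intercept) and β_1 … β_n (coefficients) are introduced here for illustration.

```latex
\text{odds} = \frac{p}{1 - p}, \qquad
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```

For example, a probability of p = 0.8 corresponds to odds of 0.8 / 0.2 = 4 and log-odds of ln 4 ≈ 1.386.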
• Assumptions for Logistic Regression
• The assumptions for Logistic regression are as follows:
Independent observations: Each observation is independent of the others, meaning that the outcome of one observation does not influence another.
Binary dependent variables: The dependent variable is assumed to be binary or dichotomous, meaning it can take only two
values. For more than two categories, the softmax function is used.
Linear relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the
dependent variable should be linear.
No outliers: There should be no outliers in the dataset.
Large sample size: The sample size is sufficiently large.
• Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0
or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent
variable, such as “cat”, “dog”, or “sheep”.
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as
“Low”, “Medium”, or “High”.
Linear Regression vs. Logistic Regression
Sr. No. 1
Linear Regression: Linear regression is used to predict the continuous dependent variable using a given set of independent variables.
Logistic Regression: Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
• True Positive (TP): The patient is diseased and the model predicts "diseased"
• False Positive (FP): The patient is healthy but the model predicts "diseased"
• True Negative (TN): The patient is healthy and the model predicts "healthy"
• False Negative (FN): The patient is diseased and the model predicts "healthy"
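The four counts defined above are enough to compute the accuracy, misclassification rate, precision, and recall mentioned earlier in the module. The sketch below uses the standard definitions; the counts passed in are hypothetical and are not the figures behind any result quoted in the slides.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,                # share of correct predictions
        "misclassification_rate": (fp + fn) / total,  # share of wrong predictions
        "precision": tp / (tp + fp),                  # predicted "diseased" that truly are
        "recall": tp / (tp + fn),                     # diseased patients that were found
    }

# Hypothetical counts for 100 patients.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Accuracy and misclassification rate always sum to 1, so a lower misclassification rate is the same statement as a higher accuracy.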
• Using the concept of similarity, KNN predicts the label or value of a new data point by considering its K closest
neighbours in the training dataset. This section covers the supervised learning algorithm k-Nearest Neighbours (KNN),
highlighting its user-friendly nature.
• What is the K-Nearest Neighbors Algorithm?
• K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It
belongs to the supervised learning domain and finds extensive application in pattern recognition, data mining,
and intrusion detection.
• It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any
underlying assumptions about the distribution of data (as opposed to other algorithms such as GMM, which
assume a Gaussian distribution of the given data). We are given some prior data (also called training data),
which classifies coordinates into groups identified by an attribute.
• As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points to a group by analyzing the
training set. Note that the unclassified points are marked as ‘White’.
Intuition Behind KNN Algorithm
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified
point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point
close to a cluster of points classified as ‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the second point (5.5,
4.5) should be classified as ‘Red’.
Why do we need a KNN algorithm?
• The K-NN algorithm is a versatile and widely used machine learning algorithm, valued primarily for its
simplicity and ease of implementation. It does not require any assumptions about the underlying data
distribution. It can also handle both numerical and categorical data (given a suitable distance metric), making it a flexible choice for various
types of datasets in classification and regression tasks. It is a non-parametric method that makes
predictions based on the similarity of data points in a given dataset. With a sufficiently large K, K-NN can be less sensitive to outliers
than some other algorithms, because a single unusual point is outvoted by its neighbours.
• The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance
metric, such as Euclidean distance. The class or value of the data point is then determined by the majority
vote or average of the K neighbors. This approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.
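The procedure described above (measure distances, take the K nearest points, vote) can be sketched directly. The training points below are hypothetical; only the two query points, (2.5, 7) and (5.5, 4.5), and the ‘Red’/‘Green’ labels come from the text.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points,
    using Euclidean distance."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data forming two loose clusters.
train_points = [(1, 6), (2, 7), (3, 8), (5, 4), (6, 5), (6, 3)]
train_labels = ["Green", "Green", "Green", "Red", "Red", "Red"]

first = knn_predict(train_points, train_labels, (2.5, 7))     # near the Green cluster
second = knn_predict(train_points, train_labels, (5.5, 4.5))  # near the Red cluster
```

For regression, the vote would be replaced by the average of the neighbours' values. Note that every prediction scans the whole training set, which is why KNN becomes expensive on large datasets.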
• Applications of the KNN Algorithm
Data Preprocessing – While dealing with any Machine Learning problem, we first perform
EDA; if the data contains missing values, multiple imputation methods are
available. One such method is the KNN Imputer, which is quite
effective and generally used for sophisticated imputation.
Pattern Recognition – KNN algorithms work very well for pattern recognition: if you train a KNN algorithm
on the MNIST dataset and then evaluate it, you will find
that the accuracy is quite high.
Recommendation Engines – The main task performed by a KNN algorithm is to assign a
new query point to a pre-existing group that has been created using a huge corpus of data.
This is exactly what is required in recommender systems: assign each user to a particular
group and then provide recommendations based on that group’s preferences.
• Advantages of the KNN Algorithm
Easy to implement, as the complexity of the algorithm is not that high.
Adapts Easily – The KNN algorithm stores all the data in memory, so whenever a new
example or data point is added, the algorithm adjusts itself to that example, which then contributes to
future predictions as well.
Few Hyperparameters – The only parameters required when training a KNN algorithm are the value of k and
the choice of distance metric.
• Disadvantages of the KNN Algorithm
Does not scale – The KNN algorithm is considered a lazy algorithm: it defers all computation to prediction time
and must store the whole training set, so it takes a lot of computing power as well as data storage. This makes the algorithm both
time-consuming and resource-exhausting.
Curse of Dimensionality – Owing to what is known as the peaking phenomenon, the KNN algorithm is
affected by the curse of dimensionality, which means the algorithm has a hard time classifying data points
properly when the dimensionality is too high.
Prone to Overfitting – Because the algorithm is affected by the curse of dimensionality, it is prone to
overfitting as well. Hence feature selection and dimensionality reduction techniques are generally applied to deal with
this problem.
Linear Discriminant Analysis
• Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for classification tasks in machine learning.
It is a technique used to find a linear combination of features that best separates the classes in a dataset. LDA works
by projecting the data onto a lower-dimensional space that maximizes the separation between the classes. It does this
by finding a set of linear discriminants that maximize the ratio of between-class variance to within-class variance. In
other words, it finds the directions in the feature space that best separate the different classes of data.
• LDA assumes that the data has a Gaussian distribution and that the covariance matrices of the different classes are
equal. It also assumes that the data is linearly separable, meaning that a linear decision boundary can accurately classify
the different classes.
• LDA is used for separating two or more classes. It projects the features from a higher-dimensional space into a lower-dimensional space.
Example
• Suppose we have two sets of data points belonging to two different classes that we want to classify.
As shown in the given 2D graph, when the data points are plotted on the 2D plane, there’s no straight
line that can separate the two classes of the data points completely. Hence, in this case, LDA
(Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D graph in order to
maximize the separability between the two classes.
• Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data
onto a new axis in a way to maximize the separation of the two categories and hence, reducing the 2D
graph into a 1D graph.
• Two criteria are used by LDA to create a new axis:
1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class.
In the above graph, it can be seen that a new axis (in red) is
generated and plotted in the 2D graph such that it maximizes the
distance between the means of the two classes and minimizes the
variation within each class. In simple terms, this newly generated
axis increases the separation between the data points of the two
classes. After generating this new axis using the above-mentioned
criteria, all the data points of the classes are plotted on this new
axis and are shown in the figure given below
But Linear Discriminant Analysis fails when the means
of the distributions are shared, as it becomes
impossible for LDA to find a new axis that makes
both the classes linearly separable. In such cases, we
use non-linear discriminant analysis.
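The two criteria used to build the new axis (maximize the distance between class means, minimize the variation within each class) lead to Fisher's classical solution, in which the axis direction is proportional to S_W⁻¹(m_b − m_a). Here S_W is the within-class scatter matrix and m_a, m_b are the class means. The sketch below assumes two classes in 2-D so that the 2 × 2 inverse can be written by hand; the data points and function names are hypothetical.

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    return sxx, sxy, syy

def fisher_direction(class_a, class_b):
    """Direction w proportional to S_W^{-1} (m_b - m_a) for two 2-D classes."""
    ma, mb = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, ma)
    bxx, bxy, byy = scatter(class_b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy   # within-class scatter S_W
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]             # difference of class means
    # Multiply the mean difference by the explicit inverse of the 2x2 matrix S_W.
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

def project(points, w):
    """Project 2-D points onto the 1-D axis defined by w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical two-class data.
class_a = [(1, 2), (2, 3), (3, 3), (2, 1)]
class_b = [(6, 5), (7, 8), (8, 6), (7, 6)]
w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
```

After projection, each point is a single number on the new axis. If the two class means coincide, the mean difference is zero and w collapses to the zero vector, which is the failure case described above.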
Optimal Classification - Naive Bayes Classifiers
The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of
the features. In the above dataset, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
The response vector contains the value of the class variable (prediction or output) for each row of the feature
matrix. In the above dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. In detail:
Feature independence: The features of the data are conditionally independent of each other, given the class
label.
Continuous features are normally distributed: If a feature is continuous, then it is assumed to be normally
distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a
multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the prediction of the class
label.
No missing data: The data should not contain any missing values.
• Advantages of Naive Bayes Classifier
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
• Performs well even with limited training data.
Disadvantages of Naive Bayes Classifier
• Assumes that features are independent, which may not always hold in real-world data.
• Can be influenced by irrelevant attributes.
• May assign zero probability to unseen events, leading to poor generalization.
• Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-spam
based on features.
Text Classification: Used in sentiment analysis, document
categorization, and topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a disease
based on symptoms.
Credit Scoring: Evaluates creditworthiness of individuals for loan
approval.
Weather Prediction: Classifies weather conditions based on various
factors.