ML Unit 2
INTRODUCTION
Machine learning is programming computers to optimize a performance criterion using example data or
past experience. We have a model defined up to some parameters, and learning is the execution of a computer
program to optimize the parameters of the model using the training data or past experience. The model may
be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both. The
field of study known as machine learning is concerned with the question of how to construct computer
programs that automatically improve with experience.
• In simple words, a discriminative model makes predictions on unseen data based on conditional probability
and can be used either for classification or regression problem statements.
• On the contrary, a generative model focuses on the distribution of a dataset to return a probability for a
given example. They are related to known effects of causal direction, classification vs. inference learning, and
observational vs. feedback learning.
d. Outliers: Generative models are more affected by outliers than discriminative models.
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or
more independent variables (x), hence the name linear regression. Because the relationship is linear,
the model describes how the value of the dependent variable changes with the value of the independent
variable. The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Types of Linear Regression
Linear regression can be further divided into two types of algorithm:
● Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
● Multiple Linear regression: If more than one independent variable is used to predict the value of
a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
A straight line showing the relationship between the dependent and independent variables is called a
regression line. A regression line can show two types of relationship:
Positive Linear Relationship: If the dependent variable (Y-axis) increases as the independent
variable (X-axis) increases, the relationship is termed a positive linear relationship.
Negative Linear Relationship: If the dependent variable (Y-axis) decreases as the independent
variable (X-axis) increases, the relationship is called a negative linear relationship.
Least Squares
In statistics, when we have data in the form of data points that can be represented on a
cartesian plane by taking one of the variables as the independent variable represented as the x-
coordinate and the other one as the dependent variable represented as the y-coordinate, it is
called scatter data. This data might not be useful in making interpretations or predicting the
values of the dependent variable for the independent variable where it is initially unknown. So,
we try to get an equation of a line that fits best to the given data points with the help of
the Least Squares Method.
The method uses averages of the data points and some formulae discussed as follows to find
the slope and intercept of the line of best fit. This line can be then used to make further
interpretations about the data and to predict the unknown values. The Least Squares Method
provides accurate results only if the scatter data is evenly distributed and does not contain
outliers.
What is the Least Squares Method?
The Least Squares Method is used to derive a generalized linear equation between two
variables, one of which is independent and the other dependent on the former. The value of the
independent variable is represented as the x-coordinate and that of the dependent variable is
represented as the y-coordinate in a 2D cartesian coordinate system. Initially, known values are
marked on a plot. The plot obtained at this point is called a scatter plot. Then, we try to
represent all the marked points as a straight line or a linear equation. The equation of such a
line is obtained with the help of the least squares method. This is done to get the value of the
dependent variable for an independent variable for which the value was initially unknown. This
helps us to fill in the missing points in a data table or forecast the data. The method is
discussed in detail as follows.
The least-squares method can be defined as a statistical method that is used to find the equation
of the line of best fit related to the given data. This method is called so as it aims at reducing
the sum of squares of deviations as much as possible. The line obtained from such a method is
called a regression line.
The formula used in the least squares method and the steps used in deriving the line of best fit
from this method are discussed as follows:
• Step 1: Denote the independent variable values as xi and the dependent ones as yi.
• Step 2: Calculate the average values of xi and yi as X and Y.
• Step 3: Presume the equation of the line of best fit as y = mx + c, where m is the slope of the
line and c represents the intercept of the line on the Y-axis.
• Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi)(Y – yi)] / [Σ (X – xi)^2]
• Step 5: The intercept c is calculated from the following formula:
c = Y – mX
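The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name least_squares_fit and the sample data are assumptions made for the example, and the data is chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def least_squares_fit(xs, ys):
    """Return (m, c) for the line y = m*x + c using the formulas from Steps 2 to 5."""
    n = len(xs)
    x_mean = sum(xs) / n  # X in the steps above
    y_mean = sum(ys) / n  # Y in the steps above
    numerator = sum((x_mean - x) * (y_mean - y) for x, y in zip(xs, ys))
    denominator = sum((x_mean - x) ** 2 for x in xs)
    m = numerator / denominator  # slope (Step 4)
    c = y_mean - m * x_mean      # intercept (Step 5)
    return m, c


# Hypothetical data points lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = least_squares_fit(xs, ys)
print(m, c)
```

With real scatter data the points will not lie on the line, and m and c then describe the line that minimizes the sum of squared vertical deviations.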
Under-fitting / Overfitting
Underfitting in Machine Learning
A statistical model or a machine learning algorithm is said to underfit when the model is
too simple to capture the complexity of the data. It represents the inability of the model to learn the
training data effectively, which results in poor performance on both the training and the testing data. In
simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen
examples. It mainly happens when we use a very simple model with overly simplified assumptions. To
address underfitting, we need to use more complex models, enhanced feature representation, and less
regularization.
Overfitting in Machine Learning
A statistical model is said to be overfitted when it fits the training data closely but does not make
accurate predictions on testing data. When a model is trained too closely on its training data, it starts
learning from the noise and inaccurate data entries in the data set, and its results on test data show
high variance. The model then does not categorize new data correctly, because it has memorized too many
details and too much noise. Overfitting is more likely with non-parametric and non-linear methods,
because these types of machine learning algorithms have more freedom in building the model based on the
dataset and can therefore build unrealistic models. Ways to avoid overfitting include using a linear
algorithm if we have linear data, or limiting parameters such as the maximal depth if we are using
decision trees.
Cross-Validation
Cross validation is a technique used in machine learning to evaluate the performance of a model
on unseen data. It involves dividing the available data into multiple folds or subsets, using one of
these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the model’s
performance. Cross validation is an important step in the machine learning process and helps to
ensure that the model selected for deployment is robust and generalizes well to new data.
The main purpose of cross validation is to detect and help prevent overfitting, which occurs when a
model fits the training data too closely and performs poorly on new, unseen data. By evaluating the
model on multiple validation sets, cross validation provides a more realistic estimate of the
model’s generalization performance, i.e., its ability to perform well on new, unseen data.
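The fold rotation described above can be sketched as follows. This is an illustrative sketch only: the helper names k_fold_indices and cross_validate, and the fit_and_score callback, are hypothetical and not taken from any particular library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train, validation) over all folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```

In practice the data is usually shuffled before splitting; the sketch keeps the original order so the folds are easy to inspect.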
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, holdout validation, and stratified cross-validation. The choice of
technique depends on the size and nature of the data, as well as the specific requirements of the
modeling problem.
1. Holdout Validation
In Holdout Validation, we train on 50% of the given dataset and use the remaining 50%
for testing. It’s a simple and quick way to evaluate a model. The major drawback of
this method is that, because we train on only 50% of the dataset, the
remaining 50% of the data may contain important information that the model never sees during
training, which can lead to higher bias.
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-validation process
maintains the same class distribution as the entire dataset. This is particularly important when
dealing with imbalanced datasets, where certain classes may be underrepresented. In this
method,
1. The dataset is divided into k folds while maintaining the proportion of classes in each fold.
2. During each iteration, one fold is used for testing, and the remaining folds are used for
training.
3. The process is repeated k times, with each fold serving as the test set exactly once.
Stratified Cross-Validation is essential when dealing with classification problems where
maintaining the balance of class distribution is crucial for the model to generalize well to unseen
data.
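One simple way to keep class proportions equal across folds is to deal the samples of each class round-robin into the folds. The sketch below assumes this strategy; the function name stratified_k_fold and the spam/ham labels are made up for illustration, and real libraries may assign samples differently.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Return k lists of sample indices with roughly equal class proportions."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        # Deal the samples of this class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced dataset: 3 "spam" and 9 "ham" examples
labels = ["spam"] * 3 + ["ham"] * 9
folds = stratified_k_fold(labels, 3)
for fold in folds:
    print([labels[i] for i in fold])
```

Each fold then serves once as the test set while the other folds are combined for training.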
Lasso Regression
Lasso regression stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty
term to the cost function. This term is the sum of the absolute values of the coefficients. As the
coefficients move away from 0, this term grows, which causes the model to shrink the
coefficients in order to reduce the loss. The difference between ridge and lasso regression is that lasso
tends to shrink some coefficients exactly to zero, whereas ridge shrinks coefficients towards zero but
does not, in general, set them exactly to zero.
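To make the two penalties concrete, the sketch below computes a lasso-style and a ridge-style cost for given predictions and coefficients. The function names, the use of mean squared error as the base loss, and the regularization strength lam are assumptions for illustration; libraries differ in how they scale these terms.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def lasso_cost(y_true, y_pred, coefficients, lam):
    """Base loss plus lam times the sum of absolute coefficient values (L1 penalty)."""
    return mean_squared_error(y_true, y_pred) + lam * sum(abs(w) for w in coefficients)


def ridge_cost(y_true, y_pred, coefficients, lam):
    """Base loss plus lam times the sum of squared coefficient values (L2 penalty)."""
    return mean_squared_error(y_true, y_pred) + lam * sum(w ** 2 for w in coefficients)


# Hypothetical perfect predictions, so only the penalty term contributes
print(lasso_cost([1, 2], [1, 2], [3, -4], lam=0.5))
print(ridge_cost([1, 2], [1, 2], [3, -4], lam=0.5))
```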
Classification
The Classification algorithm is a Supervised Learning technique that is used to identify the category
of new observations on the basis of training data. In Classification, a program learns from the given
dataset or observations and then classifies new observations into a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels,
or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning technique,
it takes labeled input data, which means each input comes with the corresponding output.
Logistic regression is another supervised learning algorithm, which is used to solve classification
problems.
The logistic regression algorithm works with a categorical target variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
Logistic regression is a type of regression, but it differs from the linear regression algorithm in
how it is used.
Logistic regression uses the sigmoid function, also called the logistic function, to map any real-valued
input to a value between 0 and 1. This sigmoid function is used to model the data in logistic regression.
The function can be represented as:
f(x) = 1 / (1 + e^(-x))
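In its standard form, the sigmoid function is 1 / (1 + e^(-z)). A minimal Python sketch follows; the helper predict_class and the 0.5 threshold are illustrative assumptions, not something the text prescribes.

```python
import math


def sigmoid(z):
    """Map a real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(z, threshold=0.5):
    """Turn the sigmoid output into a 0/1 class label (threshold is an assumption)."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), predict_class(z))
```

Here z stands for the linear combination of the input features; very large negative z would overflow math.exp in this simple form, which a production implementation would guard against.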
Gradient Descent is an iterative optimization algorithm that tries to find the optimum value
(Minimum/Maximum) of an objective function. It is one of the most used optimization
techniques in machine learning projects for updating the parameters of a model in order to
minimize a cost function.
The main aim of gradient descent is to find the model parameters that minimize the cost function,
which in turn should give good accuracy on training as well as testing datasets. The gradient is
a vector that points in the direction of the steepest increase of the function at a specific
point. Moving in the opposite direction of the gradient allows the algorithm to gradually
descend towards lower values of the function, eventually reaching a minimum of
the function.
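The idea of repeatedly stepping against the gradient can be sketched for a one-dimensional function. The function name, the learning rate of 0.1, the step count, and the example objective f(x) = (x - 3)^2 are all assumptions chosen to keep the illustration small.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly move against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * gradient(x)  # step in the direction of steepest descent
    return x


# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

The result approaches 3, the minimizer of the example objective. A learning rate that is too large can make the iterates diverge, while one that is too small makes convergence slow.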
Steps Required in Gradient Descent Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider
the below diagram in which there are two different categories that are classified using a decision
boundary or hyperplane:
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset, which means
if there are 2 features (as shown in image), then the hyperplane will be a straight line. And if there
are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.
Kernel Methods
The kernel method is a mathematical technique that is used in machine learning for analyzing
data. This method uses a kernel function that maps data from one space to another space.
It is generally used in Support Vector Machines (SVMs) where the algorithms classify data
by finding the hyperplane that separates the data points of different classes.
The most important benefit of Kernel Method is that it can work with non-linearly separable
data, and it works with multiple Kernel functions - depending on the type of data.
Because the linear classifier can solve a very limited class of problems, the kernel trick is
employed to empower the linear classifier, enabling the SVM to solve a larger class of
problems.
Support vector machines use various kinds of kernel methods in machine learning. Here are a few
of them:
1. Linear Kernel
It is used when the data is linearly separable.
K(x1, x2) = x1 . x2
2. Polynomial Kernel
It is used when the data is not linearly separable.
K(x1, x2) = (x1 . x2 + 1)^d
3. Gaussian Kernel
The Gaussian kernel is an example of a radial basis function kernel. It can be represented with this
equation:
k(xi, xj) = exp(-γ ||xi - xj||^2)
4. Exponential Kernel
Similar to RBF kernel, but it decays much more quickly.
k(x, y) = exp(-||x - y||_2^2)
5. Laplacian Kernel
Similar to RBF Kernel, it has a sharper peak and faster decay.
k(x, y) = exp(- ||x - y||)
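Several of the kernels listed above can be written directly as small Python functions. This sketch covers the linear, polynomial, Gaussian, and Laplacian forms; the function names and the default values for d and gamma are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_norm_of_difference(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linear_kernel(x1, x2):
    return dot(x1, x2)


def polynomial_kernel(x1, x2, d=2):
    return (dot(x1, x2) + 1) ** d


def gaussian_kernel(xi, xj, gamma=1.0):
    return math.exp(-gamma * euclidean_norm_of_difference(xi, xj) ** 2)


def laplacian_kernel(x, y):
    return math.exp(-euclidean_norm_of_difference(x, y))


a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_kernel(a, b), laplacian_kernel(a, b))
```

Each function returns a similarity score for a pair of points; an SVM uses such scores in place of explicit coordinates in the mapped space.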
Machine Learning systems categorized as instance-based learning are systems
that learn the training examples by heart and then generalize to new instances based on some
similarity measure. It is called instance-based because it builds the hypotheses from the training
instances. It is also known as memory-based learning or lazy learning (because processing is delayed
until a new instance must be classified). The time complexity of this algorithm depends
upon the size of the training data. Each time a new query is encountered, the previously
stored data is examined, and a target function value is assigned to the new instance.
The worst-case time complexity of this algorithm is O(n), where n is the number of training
instances. For example, if we were to create a spam filter with an instance-based learning
algorithm, instead of just flagging emails that are already marked as spam emails, our spam filter
would be programmed to also flag emails that are very similar to them. This requires a measure of
resemblance between two emails. A similarity measure between two emails could be the same
sender or the repetitive use of the same keywords or something else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can adapt easily to new data that is collected as we go.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query involves starting
the identification of a local model from scratch.
K-Nearest Neighbors
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and is widely applied in pattern
recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make
any underlying assumptions about the distribution of data (as opposed to other algorithms such
as GMM, which assume a Gaussian distribution of the given data). We are given some prior data
(also called training data), which classifies coordinates into groups identified by an attribute.
Euclidean Distance
This is nothing but the cartesian distance between the two points which are in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line that
joins the two points which are into consideration. This metric helps us calculate the net
displacement done between the two states of an object.
Manhattan Distance
Manhattan Distance metric is generally used when we are interested in the total distance traveled
by the object instead of the displacement. This metric is calculated by summing the absolute
difference between the coordinates of the points in n-dimensions.
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan distance, are special cases of
the Minkowski distance.
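The relationship between the three metrics can be checked with a short sketch: the Minkowski distance with p = 1 gives the Manhattan distance, and with p = 2 the Euclidean distance. The function names are chosen for this example only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points of equal dimension."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


origin, point = (0, 0), (3, 4)
print(euclidean_distance(origin, point))     # straight-line distance
print(manhattan_distance(origin, point))     # sum of absolute coordinate differences
print(minkowski_distance(origin, point, 2))  # matches the Euclidean distance
print(minkowski_distance(origin, point, 1))  # matches the Manhattan distance
```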
Workings of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where
it predicts the label or value of a new data point by considering the labels or values of its
K nearest neighbors in the training dataset.
To make predictions, the algorithm calculates the distance between each new data point
in the test dataset and all the data points in the training dataset. The Euclidean distance is
a commonly used distance metric in K-NN, but other distance metrics, such as Manhattan
distance or Minkowski distance, can also be used depending on the problem and data.
Once the distances between the new data point and all the data points in the training
dataset are calculated, the algorithm proceeds to find the K nearest neighbors based on
these distances. The specific method for selecting the nearest neighbors can vary, but a
common approach is to sort the distances in ascending order and choose the K data points
with the shortest distances.
After identifying the K nearest neighbors, the algorithm makes predictions based on the
labels or values associated with these neighbors. For classification tasks, the majority
class among the K neighbors is assigned as the predicted label for the new data point. For
regression tasks, the average or weighted average of the values of the K neighbors is
assigned as the predicted value.
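The classification case described above (compute distances, sort, take the K closest, vote) can be sketched as follows. The function name knn_predict and the toy data are assumptions; the sketch uses Euclidean distance via math.dist, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training label with its distance to the query point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance in ascending order and keep the k closest
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-cluster training set
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

For regression, the vote in the last step would be replaced by an average of the neighbors' values.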
The process of forming a decision tree involves recursively partitioning the data based on the
values of different attributes. The algorithm selects the best attribute to split the data at each
internal node, based on certain criteria such as information gain or Gini impurity. This splitting
process continues until a stopping criterion is met, such as reaching a maximum depth or having
a minimum number of instances in a leaf node.
ID3
ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea
of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through
the given sets to test each attribute at every tree node. In order to select the attribute that is most
useful for classifying a given set, we introduce a metric: information gain.
To find an optimal way to classify a learning set, what we need to do is to minimize the questions
asked (i.e. minimize the depth of the tree). Thus, we need some function which can measure
which questions provide the most balanced splitting. The information gain metric is such a
function.
(Tom M. Mitchell,1997,p55)
First, let's assume, without loss of generality, that the resulting decision tree classifies instances
into two categories, which we'll call P (positive) and N (negative).
Given a set S containing these positive and negative targets, the entropy of S relative to this boolean
classification is:
Entropy(S) = - P(positive) log2 P(positive) - P(negative) log2 P(negative)
For example, if S is (0.5+, 0.5-) then Entropy(S) is 1, if S is (0.67+, 0.33-) then Entropy(S) is 0.92,
and if S is (1+, 0-) then Entropy(S) is 0. Note that the more uniform the probability distribution, the
greater its information.
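The example values can be reproduced with a few lines of Python. The function name is an assumption, and the sketch treats 0 log2 0 as 0, which is the usual convention. The 0.67 in the example is taken to mean two thirds, since that is the proportion that rounds to an entropy of 0.92.

```python
import math


def binary_entropy(p_positive):
    """Entropy of a two-class set given the proportion of positive examples."""
    total = 0.0
    for p in (p_positive, 1 - p_positive):
        if p > 0:  # treat 0 * log2(0) as 0
            total -= p * math.log2(p)
    return total


print(binary_entropy(0.5))              # evenly mixed set
print(round(binary_entropy(2 / 3), 2))  # two thirds positive
print(binary_entropy(1.0))              # pure set
```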
You may notice that entropy is a measure of the impurity in a collection of training examples. But how
is it related to the optimisation of our decision making in classifying the instances? What follows
will answer this question.
(Tom M. Mitchell,1997,p57)
As we mentioned before, to minimize the decision tree depth, when we traverse the tree path we
need to select the optimal attribute for splitting the tree node, from which it follows that the
attribute with the most entropy reduction is the best choice.
We define information gain as the expected reduction of entropy related to a specified attribute when
splitting a decision tree node.
The information gain, Gain(S,A), of an attribute A is
Gain(S,A) = Entropy(S) - Σ_{v in Values(A)} (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of possible values of A and Sv is the subset of S for which A has value v.
We can use this notion of gain to rank attributes and to build decision trees where at each node is
located the attribute with greatest gain among the attributes not yet considered in the path from the
root.
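Ranking attributes by gain can be sketched as below, using the standard definition of information gain: the entropy of the set minus the size-weighted average entropy of the subsets produced by splitting on the attribute. The function names, the dictionary-based row format, and the toy "outlook"/"windy" data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the whole set minus the weighted entropy after splitting on `attribute`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy_of(s) for s in subsets.values())
    return entropy_of(labels) - weighted


# Hypothetical data: "outlook" separates the classes perfectly, "windy" not at all
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "rain", "windy": False},
]
labels = ["P", "P", "N", "N"]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

ID3 would place the attribute with the greatest gain at the current node and repeat the ranking on each resulting subset.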
1. To create small decision trees so that records can be identified after only a few decision tree
splits.
CART
CART is a predictive algorithm used in Machine learning, and it explains how the target
variable’s values can be predicted based on other variables. It is a decision tree where each fork is
a split on a predictor variable and each leaf node holds a prediction for the target variable.
The term CART serves as a generic term for the following categories of decision trees:
• Classification Trees: The tree is used to determine which “class” the target variable is most
likely to fall into when it is categorical.
• Regression trees: These are used to predict a continuous variable’s value.
In the decision tree, nodes are split into sub-nodes based on a threshold value of an attribute.
The root node is taken as the training set and is split into two by considering the best attribute
and threshold value. Further, the subsets are also split using the same logic. This continues until
the last pure sub-set is found in the tree or the maximum number of leaves allowed in the
growing tree is reached.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both
classification and regression tasks. It is a supervised learning algorithm that learns from labelled
data to predict unseen data.
• Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target
variable.
• Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion. The lower the Gini
impurity, the more pure the subset is. For regression tasks, CART uses residual reduction as the
splitting criterion. The greater the residual reduction, the better the fit of the model to the data.
• Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes that
contribute little to the model accuracy. Cost complexity pruning and information gain pruning
are two popular pruning techniques. Cost complexity pruning involves calculating the cost of
each node and removing nodes that have a negative cost. Information gain pruning involves
calculating the information gain of each node and removing nodes that have a low information
gain.
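The Gini impurity used as the classification splitting criterion can be computed as one minus the sum of squared class proportions. The sketch below assumes that standard definition; the function names and the example labels are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means the subset is pure."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of the two subsets produced by a candidate split."""
    total = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / total * gini_impurity(left_labels)
        + len(right_labels) / total * gini_impurity(right_labels)
    )


print(gini_impurity(["a", "a", "b", "b"]))  # evenly mixed subset
print(split_gini(["a", "a"], ["b", "b"]))   # split that separates the classes
```

A greedy splitter would evaluate split_gini for every candidate split and keep the one with the lowest value.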
How does the CART algorithm work?
The CART algorithm works via the following process:
• The best-split point of each input is obtained.
• Based on the best-split points of each input in Step 1, the new “best” split point is identified.
• Split the chosen input according to the “best” split point.
• Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
Ensemble Methods
Ensemble methods are a machine learning technique that combines several base models in
order to produce one optimal predictive model. To better understand this definition, let's take a step
back into the ultimate goal of machine learning and model building.
Decision Trees can also solve quantitative problems with the same format. In the Tree to the left, we
want to know whether or not to invest in a commercial real estate property. Is it an office building? A
Warehouse? An Apartment building? Good economic conditions? Poor Economic Conditions? How much will an
investment return? These questions are answered and solved using this decision tree.
When making Decision Trees, there are several factors we must take into consideration: On what features
do we make our decisions? What is the threshold for classifying each question into a yes or no answer?
In the first Decision Tree, what if we wanted to ask ourselves if we had friends to play with or not?
If we have friends, we will play every time. If not, we might continue to ask ourselves questions
about the weather. By adding an additional question, we hope to better define the Yes and No
classes.
This is where Ensemble Methods come in handy! Rather than just relying on one Decision
Tree and hoping we made the right decision at each split, Ensemble Methods allow us to take a
sample of Decision Trees into account, calculate which features to use or questions to ask at each
split, and make a final predictor based on the aggregated results of the sampled Decision Trees.
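Aggregating the results of several trees can be as simple as a majority vote. In the sketch below the three "trees" are hypothetical hand-written rules standing in for trained decision trees, and all names are assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the individual models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees, each asking one question
def tree_weather(sample):
    return "play" if sample["sunny"] else "stay"


def tree_friends(sample):
    return "play" if sample["friends"] else "stay"


def tree_wind(sample):
    return "stay" if sample["windy"] else "play"


def ensemble_predict(trees, sample):
    return majority_vote([tree(sample) for tree in trees])


trees = [tree_weather, tree_friends, tree_wind]
sample = {"sunny": True, "friends": False, "windy": False}
print(ensemble_predict(trees, sample))
```

For regression, the vote would typically be replaced by an average of the individual predictions.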
Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.
A greater number of trees in the forest generally leads to higher accuracy and reduces
the problem of overfitting.
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data point
occurs, then based on the majority of results, the Random Forest classifier predicts the final decision.
Consider the below image:
Evaluation of Classification Algorithms
Once our model is completed, it is necessary to evaluate its performance, whether it is a Classification
or Regression model. So for evaluating a Classification model, we have the following ways:
1. Log Loss or Cross-Entropy Loss:
o It is used for evaluating the performance of a classifier whose output is a probability value between
0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:
-(y log(p) + (1 - y) log(1 - p))
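The binary cross-entropy, -(y log(p) + (1 - y) log(1 - p)) averaged over all examples, can be sketched as below. The function name and the clipping constant eps, which avoids taking log(0), are assumptions for this illustration.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels with confident and less confident predictions
print(log_loss([1, 0], [0.9, 0.1]))
print(log_loss([1, 0], [0.6, 0.4]))
```

The first call gives the lower loss, since its predicted probabilities are closer to the actual labels.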
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
o It is also known as the error matrix.
o The matrix consists of prediction results in a summarized form, giving the total number of correct
predictions and incorrect predictions. The matrix looks like the table below:
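For a binary classifier, the entries of a confusion matrix are the counts of true positives, false positives, true negatives, and false negatives. A minimal sketch of counting them follows; the function name, the dictionary layout, and the sample labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical actual and predicted labels
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts, accuracy)
```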
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area Under
the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis and
FPR(False Positive Rate) on X-axis.
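The points of a ROC curve can be obtained by sweeping a threshold over the predicted scores and computing FPR and TPR at each setting. The sketch below assumes binary labels (1 for positive, 0 for negative); the function name, the sample scores, and the chosen thresholds are made up for the example.

```python
def roc_points(y_true, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = []
    for threshold in thresholds:
        # An example is predicted positive when its score reaches the threshold
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.0, 0.5, 1.0])
print(points)
```

Plotting TPR against FPR for many thresholds gives the ROC curve, and the area under that curve is the AUC.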