ML Unit 2


UNIT II SUPERVISED LEARNING

Introduction - Discriminative and Generative Models - Linear Regression - Least Squares - Under-fitting / Overfitting - Cross-Validation - Lasso Regression - Classification - Logistic Regression - Gradient Linear Models - Support Vector Machines - Kernel Methods - Instance based Methods - K-Nearest Neighbors - Tree based Methods - Decision Trees - ID3 - CART - Ensemble Methods - Random Forest - Evaluation of Classification Algorithms

INTRODUCTION
Machine learning is programming computers to optimize a performance criterion using example data or
past experience. We have a model defined up to some parameters, and learning is the execution of a computer
program to optimize the parameters of the model using the training data or past experience. The model may
be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both. The
field of study known as machine learning is concerned with the question of how to construct computer
programs that automatically improve with experience.

Discriminative and Generative Models


Machine learning models can be classified into two types: Discriminative and Generative.

• In simple words, a discriminative model makes predictions on unseen data based on conditional probability
and can be used either for classification or regression problem statements.

• On the contrary, a generative model focuses on the distribution of a dataset to return a probability for a given example. The two approaches relate to known distinctions of causal direction, classification vs. inference learning, and observational vs. feedback learning.

The Approach of Generative Models


In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior
probability P(Y) and the likelihood P(X|Y) with the help of the training data and use
Bayes' theorem to calculate the posterior probability P(Y|X):

P(Y|X) = [P(X|Y) · P(Y)] / P(X)

The Approach of Discriminative Models


In the case of discriminative models, to find the probability, they directly assume some functional
form for P(Y|X) and then estimate the parameters of P(Y|X) with the help of the training data.
Difference between Discriminative and Generative Models
Let’s see some of the differences between the Discriminative and Generative Models.
a. Core Idea: Discriminative models draw boundaries in the data space, while generative models try
to model how data is placed throughout the space. A generative model explains how the data was
generated, while a discriminative model focuses on predicting the labels of the data.

b. Mathematical Intuition: In mathematical terms, discriminative machine learning trains a model by
learning parameters that maximize the conditional probability P(Y|X). On the other hand, a generative
model learns parameters by maximizing the joint probability P(X, Y).

c. Applications: Discriminative models recognize existing data, i.e., discriminative modeling
identifies tags and sorts data, and can be used to classify data, while generative modeling produces
something new. Since these models use different approaches to machine learning, both are suited to
specific tasks: generative models are useful for unsupervised learning tasks, whereas discriminative
models are useful for supervised learning tasks. GANs (Generative Adversarial Networks) can be
thought of as a competition between the generator, which is a component of the generative model,
and the discriminator, so they are essentially a contest of generative vs. discriminative model.

d. Outliers: Outliers have a greater impact on generative models than on discriminative models.

e. Computational Cost: Discriminative models are computationally cheaper than generative models.
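
As an illustrative sketch (not from the original notes), the snippet below contrasts a generative classifier, Gaussian Naive Bayes, which models P(X|Y) and P(Y), with a discriminative one, logistic regression, which models P(Y|X) directly. The synthetic dataset and all variable names are assumptions made only for this example; scikit-learn is assumed to be available.

# Sketch: generative (GaussianNB) vs. discriminative (LogisticRegression) classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(X|Y) and P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X) directly

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

generative = GaussianNB().fit(X_train, y_train)
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("GaussianNB accuracy:         ", generative.score(X_test, y_test))
print("LogisticRegression accuracy: ", discriminative.score(X_test, y_test))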

Linear Regression

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence it is called linear regression. Since linear regression shows a linear
relationship, it finds how the value of the dependent variable changes according to the value of the
independent variable. The linear regression model provides a sloped straight line representing the
relationship between the variables.
Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

● Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.

● Multiple Linear regression: If more than one independent variable is used to predict the value of
a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Linear Regression Line:

A linear line showing the relationship between the dependent and independent variables is called a
regression line. A regression line can show two types of relationship:

Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is called a negative linear relationship.
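
As a brief, hedged sketch (the numbers and variable names below are illustrative assumptions, not from the notes), both simple and multiple linear regression can be fit with scikit-learn's LinearRegression:

# Sketch: simple (one feature) and multiple (several features) linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one independent variable.
X_simple = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
simple = LinearRegression().fit(X_simple, y)
print("slope:", simple.coef_, "intercept:", simple.intercept_)

# Multiple linear regression: more than one independent variable.
X_multi = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
multi = LinearRegression().fit(X_multi, y)
print("coefficients:", multi.coef_, "intercept:", multi.intercept_)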

Least Squares

In statistics, when we have data in the form of data points that can be represented on a
cartesian plane by taking one of the variables as the independent variable represented as the x-
coordinate and the other one as the dependent variable represented as the y-coordinate, it is
called scatter data. This data might not be useful in making interpretations or predicting the
values of the dependent variable for the independent variable where it is initially unknown. So,
we try to get an equation of a line that fits best to the given data points with the help of
the Least Square Method.
The method uses averages of the data points and some formulae discussed as follows to find
the slope and intercept of the line of best fit. This line can be then used to make further
interpretations about the data and to predict the unknown values. The Least Squares Method
provides accurate results only if the scatter data is evenly distributed and does not contain
outliers.
What is Least Square Method?
The Least Squares Method is used to derive a generalized linear equation between two
variables, one of which is independent and the other dependent on the former. The value of the
independent variable is represented as the x-coordinate and that of the dependent variable is
represented as the y-coordinate in a 2D cartesian coordinate system. Initially, known values are
marked on a plot. The plot obtained at this point is called a scatter plot. Then, we try to
represent all the marked points as a straight line or a linear equation. The equation of such a
line is obtained with the help of the least squares method. This is done to get the value of the
dependent variable for an independent variable for which the value was initially unknown. This
helps us to fill in the missing points in a data table or forecast the data. The method is
discussed in detail as follows.

Least Square Method Definition

The least-squares method can be defined as a statistical method that is used to find the equation
of the line of best fit related to the given data. This method is called so as it aims at reducing
the sum of squares of deviations as much as possible. The line obtained from such a method is
called a regression line.

Formula of Least Square Method

The formula used in the least squares method and the steps used in deriving the line of best fit
from this method are discussed as follows:
• Step 1: Denote the independent variable values as xi and the dependent ones as yi.
• Step 2: Calculate the average values of xi and yi as X and Y.
• Step 3: Presume the equation of the line of best fit as y = mx + c, where m is the slope of the
line and c represents the intercept of the line on the Y-axis.
• Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi) × (Y – yi)] / Σ (X – xi)²
• Step 5: The intercept c is calculated from the following formula:
c = Y – mX
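
The steps above can be sketched directly in NumPy; the small x and y arrays below are illustrative assumptions:

# Sketch: slope m and intercept c of the line of best fit via the least squares formulas.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # independent variable values x_i
y = np.array([2, 4, 5, 4, 5], dtype=float)   # dependent variable values y_i

X_mean, Y_mean = x.mean(), y.mean()          # Step 2: averages X and Y
m = np.sum((X_mean - x) * (Y_mean - y)) / np.sum((X_mean - x) ** 2)   # Step 4
c = Y_mean - m * X_mean                       # Step 5
print(f"y = {m:.3f}x + {c:.3f}")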

Under-fitting / Overfitting
Underfitting in Machine Learning
A statistical model or a machine learning algorithm is said to underfit when the model is too simple
to capture the complexities of the data. It represents the inability of the model to learn the training
data effectively, resulting in poor performance on both the training and testing data. In simple terms,
an underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens
when we use a very simple model with overly simplified assumptions. To address underfitting, we need
to use more complex models with enhanced feature representation and less regularization.
Overfitting in Machine Learning
A statistical model is said to be overfitted when the model does not make accurate predictions on
testing data. When a model gets trained on too much data, it starts learning from the noise and
inaccurate data entries in the data set, and testing on test data then results in high variance. The
model does not categorize the data correctly because of too many details and noise. The causes of
overfitting are the non-parametric and non-linear methods, because these types of machine learning
algorithms have more freedom in building the model based on the dataset and can therefore build
unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear
data, or to use parameters such as the maximal depth if we are using decision trees.
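
As an illustrative sketch (assuming scikit-learn and a synthetic sine dataset, none of which appear in the notes), fitting polynomials of increasing degree shows underfitting at low degree and overfitting at high degree, visible as a growing gap between training and test error:

# Sketch: underfitting vs. overfitting as polynomial degree increases.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error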
Cross-Validation

Cross validation is a technique used in machine learning to evaluate the performance of a model
on unseen data. It involves dividing the available data into multiple folds or subsets, using one of
these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the model’s
performance. Cross validation is an important step in the machine learning process and helps to
ensure that the model selected for deployment is robust and generalizes well to new data.

The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By evaluating the
model on multiple validation sets, cross validation provides a more realistic estimate of the
model’s generalization performance, i.e., its ability to perform well on new, unseen data.

Types of Cross-Validation
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, holdout validation, and stratified cross-validation. The choice of
technique depends on the size and nature of the data, as well as the specific requirements of the
modeling problem.

1. Holdout Validation
In holdout validation, we perform training on 50% of the given dataset and the remaining 50% is used
for testing. It's a simple and quick way to evaluate a model. The major drawback of this method is
that, since we train on only 50% of the dataset, the remaining 50% of the data may contain important
information that we leave out while training our model, i.e., higher bias.

2. LOOCV (Leave One Out Cross Validation)


In this method, we perform training on the whole dataset but leave out only one data point of the
available dataset, and then iterate this for each data point. In LOOCV, the model is trained
on n − 1 samples and tested on the one omitted sample, repeating this process for each data
point in the dataset. It has some advantages as well as disadvantages.
An advantage of using this method is that we make use of all data points, and hence it has low
bias.
The major drawback of this method is that it leads to higher variation in the testing model, as
we are testing against a single data point. If the data point is an outlier, it can lead to higher variation.
Another drawback is that it takes a lot of execution time, as it iterates over 'the number of data
points' times.

3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-validation process
maintains the same class distribution as the entire dataset. This is particularly important when
dealing with imbalanced datasets, where certain classes may be underrepresented. In this
method,
1. The dataset is divided into k folds while maintaining the proportion of classes in each fold.
2. During each iteration, one-fold is used for testing, and the remaining folds are used for
training.
3. The process is repeated k times, with each fold serving as the test set exactly once.
Stratified Cross-Validation is essential when dealing with classification problems where
maintaining the balance of class distribution is crucial for the model to generalize well to unseen
data.

4. K-Fold Cross Validation


In K-fold cross validation, we split the dataset into k subsets (known as folds), then we perform
training on k − 1 of the subsets and leave one subset for the evaluation of the trained model. In
this method, we iterate k times, with a different subset reserved for testing each time.
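
A minimal sketch of k-fold, stratified k-fold and LOOCV with scikit-learn (the iris dataset and the logistic regression model are assumptions chosen only for illustration):

# Sketch: k-fold, stratified k-fold and leave-one-out cross validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)             # plain k-fold
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class proportions per fold

print("k-fold mean accuracy:     ", cross_val_score(model, X, y, cv=kfold).mean())
print("stratified mean accuracy: ", cross_val_score(model, X, y, cv=skfold).mean())
print("LOOCV mean accuracy:      ", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())  # one fold per data point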

Lasso Regression

Lasso regression stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty
term to the cost function. This term is the absolute sum of the coefficients. As the value of the
coefficients increases from 0, this term penalizes the model, causing it to decrease the value of the
coefficients in order to reduce loss. The difference between ridge and lasso regression is that lasso
tends to drive coefficients to exactly zero, whereas ridge never sets the value of a coefficient to
absolute zero.
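
A small illustrative sketch (assuming scikit-learn and synthetic data) showing that Lasso drives several coefficients to exactly zero while Ridge does not:

# Sketch: Lasso shrinks some coefficients exactly to zero (L1 penalty), Ridge does not.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha controls the strength of the penalty
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # several exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none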

Limitation of Lasso Regression:


• Lasso sometimes struggles with some types of data. If the number of predictors (p) is greater
than the number of observations (n), Lasso will pick at most n predictors as non-zero, even if all
predictors are relevant (or may be used in the test set).
• If there are two or more highly collinear variables, then lasso regression selects one of them
randomly, which is not good for the interpretation of the data.

Classification

The Classification algorithm is a Supervised Learning technique that is used to identify the category
of new observations on the basis of training data. In classification, a program learns from the given
dataset or observations and then classifies new observations into a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels
or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning technique,
hence it takes labeled input data, which means it contains input with the corresponding output.

In a classification algorithm, a discrete output function (y) is mapped to the input variable (x).


Logistic Regression

Logistic regression is another supervised learning algorithm which is used to solve the classification
problems.

In classification problems, we have dependent variables in a binary or discrete format, such as 0 or 1.

Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True
or False, Spam or not spam, etc.

It is a predictive analysis algorithm which works on the concept of probability.

Logistic regression is a type of regression, but it is different from the linear regression algorithm
in terms of how it is used.

Logistic regression uses the sigmoid function or logistic function, from which its cost function is
derived. This sigmoid function is used to model the data in logistic regression. The function can be
represented as:

f(x) = 1 / (1 + e^(-x))

f(x) = output between the 0 and 1 value.

x = input to the function

e = base of the natural logarithm.
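
A tiny sketch of the sigmoid function in NumPy (the sample inputs are arbitrary):

# Sketch: the sigmoid (logistic) function squashes any real input into (0, 1).
import numpy as np

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)); the output is interpreted as a probability."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx. [0.0067, 0.5, 0.9933]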

Gradient Linear Models

Gradient Descent is an iterative optimization algorithm that tries to find the optimum value
(Minimum/Maximum) of an objective function. It is one of the most used optimization
techniques in machine learning projects for updating the parameters of a model in order to
minimize a cost function.
The main aim of gradient descent is to find the best parameters of a model that give the
highest accuracy on training as well as testing datasets. In gradient descent, the gradient is
a vector that points in the direction of the steepest increase of the function at a specific
point. Moving in the opposite direction of the gradient allows the algorithm to gradually
descend towards lower values of the function, eventually reaching the minimum of the function.
Steps Required in Gradient Descent Algorithm

• Step 1: First initialize the parameters of the model randomly.
• Step 2: Compute the gradient of the cost function with respect to each parameter. This involves
taking the partial derivative of the cost function with respect to each parameter.
• Step 3: Update the parameters of the model by taking steps in the opposite direction of the
gradient. Here we choose a hyperparameter called the learning rate, denoted by alpha, which
decides the step size of each update.
• Step 4: Repeat steps 2 and 3 iteratively to get the best parameters for the defined model, as
sketched below.
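
A minimal sketch of these steps for simple linear regression with a mean-squared-error cost; the data, learning rate and iteration count are illustrative assumptions:

# Sketch: batch gradient descent for simple linear regression (minimizing MSE).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])           # underlying relationship: y = 2x + 1

m, c = 0.0, 0.0                                     # Step 1: initialize parameters
alpha, n, iterations = 0.01, len(x), 5000           # learning rate, sample size, iterations

for _ in range(iterations):
    y_pred = m * x + c
    dm = (-2.0 / n) * np.sum(x * (y - y_pred))      # Step 2: partial derivative w.r.t. m
    dc = (-2.0 / n) * np.sum(y - y_pred)            #         partial derivative w.r.t. c
    m -= alpha * dm                                 # Step 3: move against the gradient
    c -= alpha * dc
print(f"m = {m:.3f}, c = {c:.3f}")                  # Step 4: after many iterations, m -> 2, c -> 1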

Support Vector Machines

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine. The two
different categories are separated by a decision boundary or hyperplane.
Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that
if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which means that
if there are 2 features, then the hyperplane will be a straight line, and if there are 3 features,
then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of either class.
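
As a hedged illustration (the moons dataset and parameter values are assumptions made for the example), a linear SVM and an RBF-kernel SVM can be compared with scikit-learn's SVC:

# Sketch: linear vs. non-linear (RBF kernel) SVM classifiers.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # not linearly separable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_tr, y_tr)

print("Linear SVM accuracy:", linear_svm.score(X_te, y_te))
print("RBF SVM accuracy:   ", rbf_svm.score(X_te, y_te))
print("Support vectors per class:", rbf_svm.n_support_)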

Kernel Methods

The kernel method is a mathematical technique used in machine learning for analyzing data.
This method uses a kernel function that maps data from one space to another space.
It is generally used in Support Vector Machines (SVMs), where the algorithm classifies data
by finding the hyperplane that separates the data points of different classes.
The most important benefit of the kernel method is that it can work with non-linearly separable
data, and it works with multiple kernel functions, depending on the type of data.
Because a linear classifier can solve only a very limited class of problems, the kernel trick is
employed to empower the linear classifier, enabling the SVM to solve a larger class of
problems.
Support vector machines use various kinds of kernel functions. Here are a few of them (a short
NumPy sketch of some of these follows the list):

1. Linear Kernel
It is used when the data is linearly separable.
K(x1, x2) = x1 . x2

2. Polynomial Kernel
It is used when the data is not linearly separable.
K(x1, x2) = (x1 . x2 + 1)^d

3. Gaussian Kernel
The Gaussian kernel is an example of a radial basis function kernel. It can be represented with this
equation:
K(xi, xj) = exp(-γ ||xi - xj||^2)

4. Exponential Kernel
Similar to the RBF kernel, but it decays much more quickly.
K(x, y) = exp(-||x - y|| / (2σ^2))

5. Laplacian Kernel
Similar to the RBF kernel, but it has a sharper peak and faster decay.
K(x, y) = exp(-||x - y||)

6. Hyperbolic or the Sigmoid Kernel


It is used for non-linear classification problems. It transforms the input data into a
higher-dimensional space using the sigmoid kernel.
K(x, y) = tanh(x^T y + c)

7. Anova radial basis kernel


It is a multiple-input kernel function that can be used for feature selection.
K(x, y) = Σ_{k=1..n} exp(-(x_k - y_k)^2)^d

8. Radial-basis function kernel


It maps the input data to an infinite-dimensional space.
K(x, y) = exp(-γ ||x - y||^2)
9. Wavelet kernel
It is a non-stationary kernel function that can be used for time-series analysis.
K(x, y) = ∑φ(i,j) Ψ(x(i),y(j))

10. Spectral kernel


This function is based on the eigenvalues and eigenvectors of a similarity matrix.
K(x, y) = ∑λi φi(x) φi(y)

11. Mahalanobis kernel


This function takes into account the covariance structure of the data.
K(x, y) = exp(-1/2 (x - y)^T S^-1 (x - y))
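
A short NumPy sketch of a few of the kernels listed above (the sample vectors and the parameter values d, gamma and c are illustrative assumptions):

# Sketch: a few kernel functions written directly in NumPy.
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                                   # K(x, y) = x . y

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1) ** d                        # K(x, y) = (x . y + 1)^d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))          # K(x, y) = exp(-γ ||x - y||^2)

def sigmoid_kernel(x, y, c=1.0):
    return np.tanh(np.dot(x, y) + c)                      # K(x, y) = tanh(x^T y + c)

x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))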

Instance based Methods

The machine learning systems that are categorized as instance-based learning are systems
that learn the training examples by heart and then generalize to new instances based on some
similarity measure. It is called instance-based because it builds the hypotheses from the training
instances. It is also known as memory-based learning or lazy learning (because processing is
delayed until a new instance must be classified). The time complexity of this algorithm depends
upon the size of the training data. Each time a new query is encountered, the previously
stored data is examined and a target function value is assigned to the new instance.
The worst-case time complexity of this algorithm is O(n), where n is the number of training
instances. For example, if we were to create a spam filter with an instance-based learning
algorithm, instead of just flagging emails that are already marked as spam emails, our spam filter
would be programmed to also flag emails that are very similar to them. This requires a measure of
resemblance between two emails. A similarity measure between two emails could be the same
sender, the repetitive use of the same keywords, or something else.

Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can easily adapt to new data, which is collected as we go.

Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves
identifying a local model from scratch.
K-Nearest Neighbors
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make
any underlying assumptions about the distribution of the data (as opposed to other algorithms such
as GMM, which assume a Gaussian distribution of the given data). We are given some prior data
(also called training data), which classifies coordinates into groups identified by an attribute.

Intuition Behind KNN Algorithm


If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given
an unclassified point, we can assign it to a group by observing what group its nearest neighbors
belong to. This means a point close to a cluster of points classified as ‘Red’ has a higher
probability of getting classified as ‘Red’.

Distance Metrics Used in KNN Algorithm


As we know, the KNN algorithm helps us identify the nearest points or groups for a
query point. But to determine the closest groups or the nearest points for a query point, we need
some metric. For this purpose, we use the distance metrics below:

Euclidean Distance
This is nothing but the Cartesian distance between two points which lie in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line that
joins the two points under consideration. This metric helps us calculate the net
displacement between the two states of an object.

Manhattan Distance
Manhattan Distance metric is generally used when we are interested in the total distance traveled
by the object instead of the displacement. This metric is calculated by summing the absolute
difference between the coordinates of the points in n-dimensions.

Minkowski Distance
We can say that the Euclidean, as well as the Manhattan distance, are special cases of
the Minkowski distance.
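
A small sketch of the three distance metrics in NumPy (the two sample points are arbitrary):

# Sketch: Euclidean, Manhattan and Minkowski distances between two points.
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))           # straight-line distance -> 5.0
manhattan = np.sum(np.abs(a - b))                   # sum of absolute coordinate differences -> 7.0
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)   # order p = 3; p = 2 gives Euclidean, p = 1 gives Manhattan

print(euclidean, manhattan, minkowski)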
Workings of KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where
it predicts the label or value of a new data point by considering the labels or values of its
K nearest neighbors in the training dataset.
To make predictions, the algorithm calculates the distance between each new data point
in the test dataset and all the data points in the training dataset. The Euclidean distance is
a commonly used distance metric in K-NN, but other distance metrics, such as Manhattan
distance or Minkowski distance, can also be used depending on the problem and data.
Once the distances between the new data point and all the data points in the training
dataset are calculated, the algorithm proceeds to find the K nearest neighbors based on
these distances. The specific method for selecting the nearest neighbors can vary, but a
common approach is to sort the distances in ascending order and choose the K data points
with the shortest distances.
After identifying the K nearest neighbors, the algorithm makes predictions based on the
labels or values associated with these neighbors. For classification tasks, the majority
class among the K neighbors is assigned as the predicted label for the new data point. For
regression tasks, the average or weighted average of the values of the K neighbors is
assigned as the predicted value.
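
A minimal sketch of these steps (the tiny training set, the labels "Red"/"Blue" and the helper name knn_predict are illustrative assumptions):

# Sketch: KNN classification by Euclidean distance and majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                            # indices of the K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]          # majority class among the neighbors

X_train = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [6, 7]], dtype=float)
y_train = np.array(["Red", "Red", "Red", "Blue", "Blue", "Blue"])
print(knn_predict(X_train, y_train, np.array([2.0, 1.0])))         # -> "Red"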

Tree based Methods


In Supervised Machine Learning, the data is labeled to tell the machine what patterns it should look
for. Examples include a model built on the loan dataset from a bank with the loan default (yes or no)
as the target/ label or a model built on the patient health dataset from a hospital with the diabetes
positive (yes or no) as the target variable. In Unsupervised Machine Learning, the data is not labeled
and so the machine does not know what patterns to look for but it can still find some patterns.
Cybersecurity is one space where unsupervised ML can be applied. Reinforcement learning is an
algorithm that learns by trial and error. It has an objective and it tries out a lot of different things and
is rewarded or penalized depending on whether its behaviors help or hinder it from reaching the
objective.
Decision Trees
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where each
internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node
represents the final decision or prediction. The decision tree algorithm falls under the category
of supervised learning. Decision trees can be used to solve both regression and classification problems.
Decision Tree Terminologies
There are specialized terms associated with decision trees that denote various components and
facets of the tree structure and the decision-making procedure:
• Root Node: A decision tree’s root node, which represents the original choice or feature from
which the tree branches, is the highest node.
• Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by the values
of particular attributes. There are branches on these nodes that go to other nodes.
• Leaf Nodes (Terminal Nodes): The ends of the branches, where choices or forecasts are decided
upon. There are no further branches on leaf nodes.
• Branches (Edges): Links between nodes that show how decisions are made in response to
particular circumstances.
• Splitting: The process of dividing a node into two or more sub-nodes based on a decision
criterion. It involves selecting a feature and a threshold to create subsets of data.
• Parent Node: A node that is split into child nodes. The original node from which a split
originates.
• Child Node: Nodes created as a result of a split from a parent node.
• Decision Criterion: The rule or condition used to determine how the data should be split at a
decision node. It involves comparing feature values against a threshold.
• Pruning: The process of removing branches or nodes from a decision tree to improve its
generalization and prevent overfitting.

The process of forming a decision tree involves recursively partitioning the data based on the
values of different attributes. The algorithm selects the best attribute to split the data at each
internal node, based on certain criteria such as information gain or Gini impurity. This splitting
process continues until a stopping criterion is met, such as reaching a maximum depth or having
a minimum number of instances in a leaf node.
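
As an illustrative sketch (the iris dataset and the max_depth value are assumptions), a decision tree can be trained and its learned splits printed with scikit-learn:

# Sketch: a small decision tree classifier using the Gini impurity criterion.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))   # the learned splits
print("Training accuracy:", tree.score(iris.data, iris.target))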
ID3

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea
of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through
the given sets to test each attribute at every tree node. In order to select the attribute that is most
useful for classifying a given set, we introduce a metric: information gain.

To find an optimal way to classify a learning set, what we need to do is to minimize the questions
asked (i.e., minimize the depth of the tree). Thus, we need some function which can measure
which questions provide the most balanced splitting. The information gain metric is such a
function.

4.1 Entropy --- measuring homogeneity of a learning set

(Tom M. Mitchell,1997,p55)

In order to define information gain precisely, we need to discuss entropy first.

First, let's assume, without loss of generality, that the resulting decision tree classifies instances
into two categories; we'll call them P (positive) and N (negative).

Given a set S, containing these positive and negative targets, the entropy of S related to this boolean
classification is:

Entropy(S) = -P(positive) log2 P(positive) - P(negative) log2 P(negative)

P(positive): proportion of positive examples in S

P(negative): proportion of negative examples in S

For example, if S is (0.5+, 0.5-) then Entropy(S) is 1; if S is (0.67+, 0.33-) then Entropy(S) is 0.92;
and if S is (1+, 0-) then Entropy(S) is 0. Note that the more uniform the probability distribution,
the greater its entropy.

You may notice that entropy is a measure of the impurity in a collection of training examples. But how
is it related to optimizing our decision making when classifying the instances? The following section
answers this question.

4.2 Information Gain--- measuring the expected reduction in Entropy

(Tom M. Mitchell,1997,p57)

As we mentioned before, to minimize the decision tree depth, when we traverse the tree path we
need to select the optimal attribute for splitting the tree node; it follows that the attribute with
the greatest entropy reduction is the best choice.

We define information gain as the expected reduction of entropy related to a specified attribute when
splitting a decision tree node.
The information gain, Gain(S, A), of an attribute A is:

Gain(S, A) = Entropy(S) - Σ_{v=1..n} (|Sv| / |S|) · Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.

We can use this notion of gain to rank attributes and to build decision trees where at each node is
located the attribute with greatest gain among the attributes not yet considered in the path from the
root.

The intention of this ordering is:

1. To create small decision trees, so that records can be identified after only a few decision tree
splits.

2. To match a hoped-for minimality of the process of decision making.
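
A small sketch of entropy and information gain in NumPy (the tiny label and attribute arrays are illustrative assumptions):

# Sketch: entropy and information gain as used by ID3.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))                            # -Σ p log2 p

def information_gain(labels, attribute_values):
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)  # subtract weighted child entropy
    return gain

labels = np.array(["P", "P", "P", "N", "N", "N"])              # 50/50 split -> entropy 1.0
attr   = np.array(["a", "a", "a", "b", "b", "b"])              # this attribute perfectly separates the classes
print(entropy(labels), information_gain(labels, attr))         # 1.0, 1.0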

CART

CART is a predictive algorithm used in machine learning that explains how the target
variable's values can be predicted based on other values. It is a decision tree where each fork is
split on a predictor variable and each node has a prediction for the target variable at the end.
The term CART serves as a generic term for the following categories of decision trees:

• Classification Trees: The tree is used to determine which "class" the target variable is most
likely to fall into when the target is categorical.
• Regression Trees: These are used to predict a continuous variable's value.
In the decision tree, nodes are split into sub-nodes based on a threshold value of an attribute.
The root node is taken as the training set and is split into two by considering the best attribute
and threshold value. Further, the subsets are also split using the same logic. This continues until
the last pure subset is found in the tree or the maximum number of leaves possible in that
growing tree is reached.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both
classification and regression tasks. It is a supervised learning algorithm that learns from labelled
data to predict unseen data.
• Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target
variable.
• Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion; the lower the Gini
impurity, the purer the subset is (see the sketch after this section). For regression tasks, CART
uses the reduction in residual error (e.g., the residual sum of squares) as the splitting criterion;
the lower the remaining residual error, the better the fit of the model to the data.
• Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes that
contribute little to the model accuracy. Cost complexity pruning and information gain pruning
are two popular pruning techniques. Cost complexity pruning involves calculating the cost of
each node and removing nodes that have a negative cost. Information gain pruning involves
calculating the information gain of each node and removing nodes that have a low information
gain.
How does the CART algorithm work?
The CART algorithm works via the following process:
• The best-split point of each input is obtained.
• Based on the best-split points of each input in Step 1, the new “best” split point is identified.
• Split the chosen input according to the “best” split point.
• Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
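
A minimal sketch of the Gini impurity measure mentioned above (the label arrays are arbitrary examples):

# Sketch: Gini impurity, the splitting criterion CART uses for classification.
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                 # Gini = 1 - Σ p_k^2

print(gini_impurity(np.array([0, 0, 1, 1])))    # 0.5  (maximally mixed, two classes)
print(gini_impurity(np.array([0, 0, 0, 0])))    # 0.0  (pure node)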

Ensemble Methods
Ensemble methods are a machine learning technique that combines several base models in
order to produce one optimal predictive model. To better understand this definition, let's take a step
back into the ultimate goal of machine learning and model building.

Decision trees can also solve quantitative problems with the same format. For example, in a tree about
whether or not to invest in a commercial real estate property, we might ask: Is it an office building?
A warehouse? An apartment building? Are the economic conditions good or poor? How much will an
investment return? These questions are answered and solved using the decision tree.

When making decision trees, there are several factors we must take into consideration: On what features
do we make our decisions? What is the threshold for classifying each question into a yes or no answer?
In the first decision tree, what if we wanted to ask ourselves whether we had friends to play with or not?
If we have friends, we will play every time. If not, we might continue to ask ourselves questions
about the weather. By adding an additional question, we hope to define the Yes and No classes more
precisely.

This is where ensemble methods come in handy. Rather than just relying on one decision
tree and hoping we made the right decision at each split, ensemble methods allow us to take a
sample of decision trees into account, calculate which features to use or questions to ask at each
split, and make a final prediction based on the aggregated results of the sampled decision trees.
Random Forest

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.

As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one decision
tree, the random forest takes the prediction from each tree and, based on the majority
vote of predictions, predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.
Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to
the Random Forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data point
occurs, then, based on the majority of results, the Random Forest classifier predicts the final decision.
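
A brief sketch of a random forest in scikit-learn (the iris dataset and the n_estimators value are assumptions chosen for illustration):

# Sketch: a random forest combining many decision trees by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("Test accuracy:", forest.score(X_te, y_te))
print("Feature importances:", forest.feature_importances_)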
Evaluation of Classification Algorithms

Once our model is completed, it is necessary to evaluate its performance, whether it is a classification
or regression model. For evaluating a classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability value between
0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:

-(y log(p) + (1 - y) log(1 - p))

where y = actual output and p = predicted output.

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, showing the total number of correct and incorrect
predictions broken down by class (true positives, false positives, false negatives and true negatives).
3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area Under
the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on the Y-axis and
FPR (False Positive Rate) is on the X-axis (see the sketch below).
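
A short sketch of these three evaluation measures using scikit-learn's metrics module (the toy labels and probabilities are illustrative assumptions):

# Sketch: log loss, confusion matrix and ROC AUC for a binary classifier.
from sklearn.metrics import log_loss, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                      # actual labels
y_prob = [0.1, 0.4, 0.8, 0.9, 0.6, 0.3]          # predicted probabilities of class 1
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # thresholded predictions

print("Log loss:        ", log_loss(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC AUC:         ", roc_auc_score(y_true, y_prob))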
