ML Unit-2 Material Add-On


Alpaydin Chapter 2, Mitchell Chapter 7

• Alpaydın slides are in turquoise.
  – Ethem Alpaydın, Introduction to Machine Learning 2e, copyright: The MIT Press, 2010.
  – alpaydin@boun.edu.tr
  – http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
• All other slides are based on Mitchell.

Learning a Class from Examples

• Class C of a “family car”
  – Prediction: Is car x a family car?
  – Knowledge extraction: What do people expect from a family car?
• Output:
  – Positive (+) and negative (–) examples
• Input representation:
  – x1: price, x2: engine power

Training set X

$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$

Class C:
$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$

$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$

$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
Hypothesis class H

$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$

Empirical error of h on the training set X:
$E(h \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \mathbf{1}\big(h(\mathbf{x}^t) \ne r^t\big)$

S, G, and the Version Space
• most specific hypothesis, S
• most general hypothesis, G
• Any h ∈ H between S and G is consistent with the training set, and all such consistent hypotheses together make up the version space (Mitchell, 1997).
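To make the rectangle hypothesis and the empirical error E(h|X) concrete, here is a minimal R sketch on a small made-up training set (the price/engine-power values and the rectangle bounds p1, p2, e1, e2 are illustrative, not from the slides):

# Toy training set: price (x1, in 1000s), engine power (x2), r = 1 for a family car
X <- data.frame(price = c(18, 22, 35, 60, 25, 12),
                power = c(110, 130, 200, 300, 150, 90),
                r     = c(1,   1,   0,   0,   1,   0))

# A rectangle hypothesis h: predicts 1 iff p1 <= price <= p2 AND e1 <= power <= e2
h <- function(price, power, p1 = 15, p2 = 30, e1 = 100, e2 = 180) {
  as.integer(price >= p1 & price <= p2 & power >= e1 & power <= e2)
}

# Empirical error E(h|X) = (1/N) * sum over t of 1(h(x^t) != r^t)
pred <- h(X$price, X$power)
mean(pred != X$r)   # fraction of training examples the hypothesis misclassifies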

Computational Learning Theory (from Mitchell, Chapter 7)

• Theoretical characterization of the difficulties and capabilities of learning algorithms.
• Questions:
  – Conditions for successful/unsuccessful learning
  – Conditions of success for particular algorithms
• Two frameworks:
  – Probably Approximately Correct (PAC) framework: classes of hypotheses that can be learned; complexity of the hypothesis space and bounds on training set size.
  – Mistake bound framework: number of training errors made before the correct hypothesis is determined.

Computational Learning Theory

What general laws constrain inductive learning? We seek theory to relate:
• Probability of successful learning
• Number of training examples
• Complexity of hypothesis space
• Accuracy to which the target concept is approximated
• Manner in which training examples are presented
Specific Questions

• Sample complexity: How many training examples are needed for a learner to converge?
• Computational complexity: How much computational effort is needed for a learner to converge?
• Mistake bound: How many training examples will the learner misclassify before converging?

Issues: When do we say learning was successful? How are the inputs acquired?

Sample Complexity

How many training examples are sufficient to learn the target concept?
1. If the learner proposes instances, as queries to the teacher:
   • learner proposes instance x, teacher provides c(x)
2. If the teacher (who knows c) provides training examples:
   • teacher provides a sequence of examples of the form ⟨x, c(x)⟩
3. If some random process (e.g., nature) proposes instances:
   • instance x is generated randomly, teacher provides c(x)

True Error of a Hypothesis

[Figure: instance space X, with the regions where c and h disagree shaded.]

Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

$\text{error}_{\mathcal{D}}(h) \equiv \Pr_{x \in \mathcal{D}}\big[c(x) \ne h(x)\big]$

Two Notions of Error

Training error of hypothesis h with respect to target concept c:
• How often h(x) ≠ c(x) over the training instances x ∈ D

True error of hypothesis h with respect to c:
• How often h(x) ≠ c(x) over future instances drawn at random

Our concern:
• Can we bound the true error of h given the training error of h?
• First consider the case when the training error of h is zero (i.e., h ∈ VS_{H,D})
Exhausting the Version Space

[Figure: hypothesis space H with the version space VS_{H,D} inside; each hypothesis is labeled with its training error r and its true error (r = training error, error = true error).]

Definition: The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has (true) error less than ε with respect to c and D:

$(\forall h \in VS_{H,D})\ \text{error}_{\mathcal{D}}(h) < \epsilon$

How many examples will ε-exhaust the VS?

Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than

$|H| e^{-\epsilon m}$

This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε. If we want this probability to be below δ,

$|H| e^{-\epsilon m} \le \delta,$

then

$m \ge \frac{1}{\epsilon}\big(\ln |H| + \ln(1/\delta)\big)$
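As a quick numeric check of the bound above, the R sketch below (a worked example, not from the slides; the choice |H| = 3^n is the standard illustration for conjunctions of n Boolean literals, where each variable appears positively, negatively, or not at all) computes the smallest sufficient m:

# m >= (1/eps) * (ln|H| + ln(1/delta))  -- smallest sufficient sample size
sample_complexity <- function(H_size, eps, delta) {
  ceiling((log(H_size) + log(1 / delta)) / eps)
}

# Conjunctions of up to n = 10 Boolean literals: |H| = 3^10
sample_complexity(H_size = 3^10, eps = 0.1, delta = 0.05)   # about 140 examples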

Proof of ε-Exhausting Theorem

Theorem: The probability of VS_{H,D} not being ε-exhausted is ≤ |H| e^{-εm}.

Proof:
• Let h_i ∈ H (i = 1, ..., k) be the hypotheses whose true error with respect to c is greater than ε (so k ≤ |H|).
• We fail to ε-exhaust the version space iff at least one such h_i is consistent with all m training instances (even though its true error is greater than ε).
• The probability that a single hypothesis with error > ε is consistent with one random example is at most (1 − ε).
• The probability that it is consistent with all m examples is at most (1 − ε)^m.
• The probability that at least one of the k hypotheses with error > ε is consistent with all m examples is at most k(1 − ε)^m.
• Since k ≤ |H|, and (1 − ε) ≤ e^{-ε} for 0 ≤ ε ≤ 1:

$k(1-\epsilon)^m \le |H|(1-\epsilon)^m \le |H| e^{-\epsilon m}$

PAC Learning

Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
Agnostic Learning

So far, we assumed that c ∈ H. What if that is not the case?
Agnostic learning setting: don't assume c ∈ H.

• What do we want then?
  – The hypothesis h that makes the fewest errors on the training data.
• What is the sample complexity in this case?

$m \ge \frac{1}{2\epsilon^2}\big(\ln |H| + \ln(1/\delta)\big)$

derived from Hoeffding bounds:

$\Pr\big[\text{error}_{\mathcal{D}}(h) > \text{error}_{\text{train}}(h) + \epsilon\big] \le e^{-2m\epsilon^2}$

(A small numeric comparison of this bound with the realizable-case bound is sketched after the shattering definitions below.)

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
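Following up on the comparison promised above, this small R check (illustrative numbers only; |H| = 3^10 simply reuses the conjunction example) shows how much the weaker agnostic assumption costs in sample size:

H_size <- 3^10; eps <- 0.1; delta <- 0.05
common <- log(H_size) + log(1 / delta)

ceiling(common / eps)            # realizable case (c in H): about 140 examples
ceiling(common / (2 * eps^2))    # agnostic case: about 700 examples, a factor of 1/(2*eps) = 5 more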

Three Instances Shattered

[Figure: instance space X with three instances; each closed contour indicates one dichotomy.] What kind of hypothesis space H can shatter the three instances?

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

Note that |H| can be infinite, while VC(H) is finite!
VC Dim. of Linear Decision Surfaces

[Figure: point sets (a) and (b) in the plane.]

• When H is the set of lines and S a set of points in the plane, VC(H) = 3.
• (a) can be shattered, but (b) cannot (its three points are collinear). However, it is enough that at least one subset of size 3 can be shattered.
• No set of size 4 can be shattered, for any placement of the points (think of an XOR-like labeling).

VC Dimension: Another Example

S = {3.1, 5.7}, and the hypothesis space consists of intervals a < x < b.
• Dichotomies of S: both points positive, neither, only 3.1, or only 5.7.
• Are there intervals that realize all of these dichotomies? (Yes, so VC(H) ≥ 2.)
• What about S = {x0, x1, x2} for arbitrary xi?
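To answer the last question concretely, the brute-force R check below (my own sketch, with three arbitrary points x0 < x1 < x2) enumerates all dichotomies of three points and tests whether some interval a < x < b realizes each one; the labeling "x0 and x2 positive, x1 negative" has no such interval, so intervals cannot shatter any 3 points and VC(H) = 2:

S <- c(1.0, 2.0, 3.0)   # any three distinct points behave the same way

# Candidate interval endpoints: a grid with values between and around every point
grid <- seq(min(S) - 1, max(S) + 1, by = 0.25)

# Is a given labeling (TRUE = positive) realizable by some interval a < x < b?
realizable <- function(labels) {
  for (a in grid) for (b in grid) {
    if (a < b && all((S > a & S < b) == labels)) return(TRUE)
  }
  FALSE
}

# Enumerate all 2^3 dichotomies and check each one
dichotomies <- expand.grid(rep(list(c(FALSE, TRUE)), 3))
ok <- apply(dichotomies, 1, realizable)
cbind(dichotomies, realizable = ok)
# The labeling (TRUE, FALSE, TRUE) is not realizable, so the 3 points are not shattered.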

Sample Complexity from VC Dimension

How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 − δ)?

$m \ge \frac{1}{\epsilon}\Big(4 \log_2(2/\delta) + 8\, VC(H) \log_2(13/\epsilon)\Big)$

VC(H) is directly related to the sample complexity:
• A more expressive H needs more samples.
• More samples are needed for an H with more tunable parameters.

Mistake Bounds

So far: how many examples are needed to learn?
What about: how many mistakes before convergence?

• This is an interesting question because some learning systems may need to start operating while still learning.

Let's consider a setting similar to PAC learning:
• Instances are drawn at random from X according to distribution D.
• The learner must classify each instance before receiving the correct classification from the teacher.
• Can we bound the number of mistakes the learner makes before converging?
Optimal Mistake Bounds

Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C, and all possible training sequences):

$M_A(C) \equiv \max_{c \in C} M_A(c)$

Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):

$Opt(C) \equiv \min_{A \in \text{learning algorithms}} M_A(C)$

Mistake Bounds and VC Dimension

Littlestone (1987) showed:

$VC(C) \le Opt(C) \le M_{\text{Halving}}(C) \le \log_2(|C|)$
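The Halving algorithm appearing in the bound above predicts by majority vote over the current version space and eliminates every inconsistent concept after each label is revealed; since every mistake at least halves the version space, it makes at most log2|C| mistakes. Here is a minimal R simulation on a made-up finite concept class of threshold functions (all names and values are illustrative, not from the slides):

# Instances are x in 1..16; each concept is a threshold k: c_k(x) = 1 iff x >= k.
X <- 1:16
thresholds <- 1:17                       # 17 concepts, log2(17) ~ 4.09
predict_c <- function(k, x) as.integer(x >= k)

target_k <- 9                            # the unknown target concept
VS <- thresholds                         # version space: concepts still consistent
mistakes <- 0

set.seed(1)
for (x in sample(X)) {                   # instances arrive in random order
  votes  <- sapply(VS, predict_c, x = x)
  y_hat  <- as.integer(mean(votes) >= 0.5)   # majority vote of the version space
  y_true <- predict_c(target_k, x)           # teacher reveals the true label
  if (y_hat != y_true) mistakes <- mistakes + 1
  VS <- VS[sapply(VS, predict_c, x = x) == y_true]  # drop inconsistent concepts
}

c(mistakes = mistakes, log2_C = log2(length(thresholds)))  # mistakes <= log2|C|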

Multiple Classes, Ci, i = 1, ..., K

$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}$

$r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$

Train K hypotheses $h_i(\mathbf{x})$, i = 1, ..., K:

$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$

Noise and Model Complexity

Use the simpler model because it is:
⚫ Simpler to use (lower computational complexity)
⚫ Easier to train (lower space complexity)
⚫ Easier to explain (more interpretable)
⚫ Generalizes better (lower variance; Occam's razor)
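As a small illustration of the K-indicator encoding r_i^t above, here is an R sketch with made-up class labels (the labels and the one-hypothesis-per-class scheme are illustrative, not from the slides):

# Encode a K-class label vector as K binary indicator columns (one per class):
# r_i^t = 1 if example t belongs to class C_i, 0 otherwise.
labels <- factor(c("sedan", "suv", "sports", "sedan", "suv"))
R <- sapply(levels(labels), function(cls) as.integer(labels == cls))
rownames(R) <- paste0("x", seq_along(labels))
R
# Each column of R is the target for one binary hypothesis h_i(x),
# i.e. one classifier per class in a one-vs-rest training scheme.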
Regression

$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \quad r^t \in \mathbb{R}, \quad r^t = f(x^t) + \varepsilon$

Linear model: $g(x) = w_1 x + w_0$
Quadratic model: $g(x) = w_2 x^2 + w_1 x + w_0$

Empirical error:
$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big(r^t - g(x^t)\big)^2$

$E(w_1, w_0 \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big(r^t - (w_1 x^t + w_0)\big)^2$

Model Selection & Generalization

⚫ Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution
⚫ The need for inductive bias: assumptions about H
⚫ Generalization: how well a model performs on new data
⚫ Overfitting: H more complex than C or f
⚫ Underfitting: H less complex than C or f
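The sketch below fits the linear model g(x) = w1 x + w0 by minimizing the squared error E(w1, w0 | X) above, using base R on a small synthetic data set (all numbers are illustrative):

set.seed(42)
N <- 30
x <- runif(N, 0, 10)
r <- 2.5 * x + 1 + rnorm(N, sd = 2)   # r^t = f(x^t) + noise, with f linear

fit <- lm(r ~ x)                      # least squares minimizes (1/N) * sum (r^t - (w1*x^t + w0))^2
coef(fit)                             # estimates of w0 (intercept) and w1 (slope)
mean(residuals(fit)^2)                # empirical error E(w1, w0 | X)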

Triple Trade-Off

⚫ There is a trade-off between three factors (Dietterich, 2003):
  1. Complexity of H, c(H)
  2. Training set size, N
  3. Generalization error, E, on new data
⚫ As N increases, E decreases
⚫ As c(H) increases, E first decreases and then increases

Cross-Validation

⚫ To estimate generalization error, we need data unseen during training. We split the data as:
  ⚫ Training set (50%)
  ⚫ Validation set (25%)
  ⚫ Test (publication) set (25%)
⚫ Resampling when there is little data
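A minimal R sketch of the 50/25/25 split described above (the data frame and its size are stand-ins for illustration):

set.seed(7)
df <- data.frame(x = rnorm(200), y = rnorm(200))   # stand-in data set

n <- nrow(df)
idx <- sample(n)                                   # shuffle once
train <- df[idx[1:(0.50 * n)], ]                   # 50% training set
valid <- df[idx[(0.50 * n + 1):(0.75 * n)], ]      # 25% validation set (model selection)
test  <- df[idx[(0.75 * n + 1):n], ]               # 25% test ("publication") set

sapply(list(train = train, valid = valid, test = test), nrow)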
Dimensions of a Supervised Learner

1. Model: $g(\mathbf{x} \mid \theta)$

2. Loss function: $E(\theta \mid \mathcal{X}) = \sum_t L\big(r^t, g(\mathbf{x}^t \mid \theta)\big)$

3. Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$

Binary Classification

Introduction

Given a collection of objects, suppose we have the task of classifying them into two groups based on some feature(s). For example, given some pens and pencils of different types and makes, we can easily separate them into two classes, namely pens and pencils. This seemingly trivial task, if we pause to think about it, involves a lot of underlying computation. The questions to ask are: what is our basis for classification, and how might we incorporate that principle to allow for efficient classification by 'intelligent algorithms'?

Machine Learning

In simple terms, instead of having hard-coded algorithms to do specific tasks, flexibility is introduced by having an algorithm "learn". This is usually called the 'programming by example' paradigm. That is, instead of giving the machine an algorithm on which to base its classifying principles, we let it come up with its own 'algorithm' from examples that we provide. This means we would give it, say, a hundred correctly labelled instances each of different kinds of pens and pencils, whose features it can study (this forms the Training Dataset, the 'examples' from above). The machine then comes up with its own prediction rule, based on which a new (unlabelled) example is classified as a pen or a pencil.


Ref: Theoretical Machine Learning - Rob Schapire, Princeton Course Archive

Machine learning algorithms mainly come in two kinds: supervised and unsupervised. In supervised learning, the examples are provided along with their correct labels. In unsupervised learning, no labels are given with the input; the algorithm divides the input dataset into subsets and clusters them based on its own similarity computation, producing what it considers a reasonable set of classes.

Binary Classification

Binary Classification would generally fall into the domain of Supervised


Learning since the training dataset is labelled. And as the name suggests it
is simply a special case in which there are only two classes.
Some typical examples include:

Credit Card Fraudulent Transaction detection


Medical Diagnosis
Spam Detection

Now there are various paradigms that are used for learning binary classifiers
which include:

Decision Trees
Neural Networks
Bayesian Classification
Support Vector Machines

Support Vector Machines



Support vector machines (SVMs) are a set of supervised learning methods that learn from the dataset and are used for classification. Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible. New examples are then mapped into the same space and classified according to which side of the gap they fall on.
More formally, it constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can then be used for classification as described above. If we view a data point as an n-dimensional vector, we want to separate the points by an (n-1)-dimensional hyperplane. Of the many hyperplanes that might classify the data, one reasonable choice is the one for which the distance from the hyperplane to the nearest data point on each side is maximized. This is what is called a maximum-margin hyperplane.

Ref: wikimedia
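To make this concrete, here is a small R sketch using the e1071 package's svm() with a linear kernel on synthetic two-class data (the data and parameter choices are illustrative, not from the article):

# install.packages("e1071")  # if not already installed
library(e1071)

set.seed(3)
n <- 40
# Two roughly separable clouds of points in 2-D
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 3), ncol = 2))
y <- factor(rep(c("neg", "pos"), each = n / 2))

fit <- svm(x, y, kernel = "linear")   # maximum-margin linear classifier
table(predicted = predict(fit, x), actual = y)

# Classify new points by which side of the separating hyperplane they fall on
predict(fit, rbind(c(-1, 0), c(4, 3)))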

Example: Sentiment Classification

We worked on a project on Sentiment Classification, which I hope will better illuminate the problem of Binary Classification. Sentiment Classification involves classifying a review (say, an IMDB movie review) by whether it expresses a positive or a negative sentiment of the reviewer. A dataset of two thousand reviews was considered, consisting of one thousand reviews annotated positive and one thousand annotated negative. The task is then to classify a novel review based on its sentiment.

Using various data mining software, each review in the dataset was converted into a document vector, akin to a 'bag of words' paradigm (since only the presence or absence of a word was considered, not its context). The words of the documents were stemmed, and a dataset vocabulary was constructed consisting of all of the unique words appearing in the dataset. Each data point was thus a vector, where each dimension corresponded to a unique word. The occurrence of words was encoded either as binary or using TF-IDF. An SVM classifier was then trained on the dataset, mapping the vectors and obtaining a hyperplane for separation. A good accuracy of about ~83% was obtained using this classifier!

Conclusion

We hope this illustration highlights the wide array of problems that can be tackled using machine learning algorithms. We would also like to stress the importance of learning in the context of AI: we would not consider a system truly intelligent if it were incapable of learning, since learning seems to be at the core of intelligence.

References
Theoretical Machine Learning - Rob Schapire, Princeton Course
Archive
Wikipedia
CS674 : Machine Learning and Knowledge Discovery Course Project!

Linear vs. Non-Linear Classification
Shabeg Singh Gill
Last Updated: May 13, 2022

Introduction
We will be studying Linear Classification as well as Non-Linear Classification.

Linear Classification refers to categorizing a set of data points to a discrete class based on a linear combination of its
explanatory variables. On the other hand, Non-Linear Classification refers to separating those instances that are not
linearly separable.

Linear Classification

→ Linear Classification refers to categorizing a set of data points into a discrete class based on a linear
combination of its explanatory variables.

→ Some of the classifiers that use linear functions to separate classes are Linear Discriminant Classifier, Naive
Bayes, Logistic Regression, Perceptron, SVM (linear kernel).
→ In the figure above, we have two classes, namely 'O' and '+.' To differentiate between the two classes, an arbitrary line
is drawn, ensuring that both the classes are on distinct sides.

→ Since we can tell one class apart from the other, these classes are called ‘linearly-separable.’

→ However, an infinite number of lines can be drawn to distinguish the two classes.

→ The exact location of this plane/hyperplane depends on the type of the linear classifier.

Linear Discriminant Classifier

→ It is a dimensionality reduction technique in the domain of Supervised Machine Learning.

→ It is crucial in modeling differences between two groups, i.e., classes.

→ It helps project features from a high-dimensional space into a lower-dimensional space.

→ Technique - Linear Discriminant Analysis (LDA) is used, which reduces the 2D graph into a 1D graph by creating a new axis. This helps to maximize the distance between the two classes for differentiation.

→ In the above graph, we notice that a new axis is created, which maximizes the distance between the mean of the two
classes.

→ As a result, variation within each class is also minimized.

→ However, the problem with LDA is that it would fail in case the means of both the classes are the same. This would
mean that we would not be able to generate a new axis for differentiating the two.
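As a quick illustration, here is a sketch using the MASS package's lda() on the built-in iris data, restricted to two classes to match the two-group setting discussed above (the example is mine, not the article's):

library(MASS)

# Two-class subset of iris so the projection is onto a single discriminant axis
two <- droplevels(subset(iris, Species != "setosa"))

fit <- lda(Species ~ ., data = two)   # finds the axis maximizing between-class separation
proj <- predict(fit)$x                # 1-D projection (LD1) of each observation

# Class means on the new axis are pushed apart while within-class spread stays small
tapply(proj[, 1], two$Species, mean)
table(predicted = predict(fit)$class, actual = two$Species)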

Naive Bayes

→ It is based on the Bayes Theorem and lies in the domain of Supervised Machine Learning.
→ Every feature is considered equal and independent of the others during Classification.

→ Naive Bayes estimates the probability of an event given some evidence, using conditional probability (Bayes' theorem): P(A|B) = P(B|A) P(A) / P(B), where

A: event 1

B: event 2

P(A|B): probability of A being true given that B is true - the posterior probability

P(B|A): probability of B being true given that A is true - the likelihood

P(A): probability of A being true - the prior

P(B): probability of B being true - the marginal (normalization term)

However, in the case of the Naive Bayes classifier, we are concerned only with the maximum posterior
probability, so we ignore the denominator, i.e., the marginal likelihood. Argmax does not depend on the
normalization term.

→ The Naive Bayes classifier is based on two essential assumptions:-

(i) Conditional Independence - All features are independent of each other. This implies that one feature does
not affect the performance of the other. This is the sole reason behind the ‘Naive’ in ‘Naive Bayes.’

(ii) Feature Importance - All features are equally important. It is essential to know all the features to make good
predictions and get the most accurate results.

→ Naive Bayes is classified into three main types: Multinomial Naive Bayes, Bernoulli Naive Bayes, and
Gaussian Bayes.
Logistic Regression

→ It is a very popular supervised machine learning algorithm.

→ The target variable can take only discrete values for a given set of features.

→ The model builds a regression model to predict the probability that a given data entry belongs to a particular class.

→ Similar to linear regression, logistic regression uses a linear function and, in addition, makes use of the 'sigmoid'
function.

→ Logistic regression can be further classified into three categories:

Binomial - the target variable assumes only two values. Example: '0' or '1'.
Multinomial - the target variable assumes three or more unordered values. Example: 'Class A', 'Class B', and 'Class C'.
Ordinal - the target variable assumes ordered values. Example: 'Very Good', 'Good', 'Average', 'Poor', 'Very Poor'.
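A tiny base-R illustration of the sigmoid function mentioned a few lines above, which turns a linear score w0 + w1*x into a probability between 0 and 1 (the coefficients here are made up):

sigmoid <- function(z) 1 / (1 + exp(-z))

w0 <- -4; w1 <- 2               # illustrative coefficients
x  <- seq(0, 4, by = 0.5)
p  <- sigmoid(w0 + w1 * x)      # predicted probability of the positive class
round(cbind(x, p), 3)           # p crosses 0.5 where w0 + w1*x = 0, i.e. at x = 2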
Support Vector Machine (linear kernel)

→ It is a straightforward supervised machine learning algorithm used for regression/classification.

→ This model finds a hyper-plane that creates a boundary between the various data types.

→ It can be used for binary Classification as well as multinomial classification problems.

→ A binary classifier can be created for each class to perform multi-class Classification.

→ In the case of SVM, the classifier with the highest score is chosen as the output of the SVM.

→ SVM works very well with linearly separable data but can work for non-linearly separable data as well.

Non-Linear Classification
→ Non-Linear Classification refers to categorizing those instances that are not linearly separable.

→ Some of the classifiers that use non-linear functions to separate classes are Quadratic Discriminant Classifier, Multi-
Layer Perceptron (MLP), Decision Trees, Random Forest, and K-Nearest Neighbours (KNN).

→ In the figure above, we have two classes, namely 'O' and 'X.' To differentiate between the two classes, it is impossible
to draw an arbitrary straight line to ensure that both the classes are on distinct sides.

→ We notice that even if we draw a straight line, there would be points of the first class present between the data points of the second class.

→ In such cases, piece-wise linear or non-linear classification boundaries are required to distinguish the two classes.

Quadratic Discriminant Classifier

→ This technique is similar to LDA(Linear Discriminant Analysis) discussed above.


→ The only difference is that here, we do not assume that the mean and covariance of all classes are the same.

→ We get the quadratic discriminant function as the following:-

→ Now, let us visualize the decision boundaries of both LDA and QDA on the iris dataset. This would give us a clear
picture of the difference between the two.
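Since the article's figure is not reproduced here, the sketch below (my own, using MASS::lda and MASS::qda on the first two iris features) is one way to compare the two classifiers; plotting the grid predictions would show the linear vs. quadratic boundaries:

library(MASS)

d <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]
fit_lda <- lda(Species ~ ., data = d)   # shared covariance -> linear boundaries
fit_qda <- qda(Species ~ ., data = d)   # per-class covariance -> quadratic boundaries

# Predict over a grid of the two features; disagreements sit near the boundaries
grid <- expand.grid(Sepal.Length = seq(4, 8, by = 0.1),
                    Sepal.Width  = seq(2, 4.5, by = 0.1))
p_lda <- predict(fit_lda, grid)$class
p_qda <- predict(fit_qda, grid)$class
mean(p_lda != p_qda)   # fraction of the grid where the two models disagree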

Multi-Layer Perceptron (MLP)

→ This is nothing but a collection of fully connected dense layers. These help transform any given input
dimension into the desired dimension.
→ It is simply a feed-forward neural network.

→ An MLP consists of one input layer (one node for each input), one output layer (one node for each output), and one or more hidden layers (each with one or more nodes).

→ In the above diagram, we notice three inputs, resulting in 3 nodes belonging to each input.

→ There is one hidden layer consisting of 3 nodes.

→ There is an output layer consisting of 2 nodes, indicating two outputs.

→ Overall, the nodes belonging to the input layer forward their outputs to the nodes present in the hidden layer. Once
this is done, the hidden layer processes the information passed on to it and then further passes it on to the output layer.

Decision Tree

→ It is considered to be one of the most valuable and robust models.


→ Instances are classified by sorting them down from the root to some leaf node.

→ An instance is classified by starting at the tree's root node, testing the attribute specified by this node, then moving
down the tree branch corresponding to the attribute's value, as shown in the above figure.

→ The process is repeated based on each derived subset in a recursive partitioning manner.

→ For a better understanding, see the diagram below.

→ The above decision tree helps determine whether the person is fit or not.

→ Similarly, Random Forest, a collection of Decision Trees, is a non-linear classifier too.

K-Nearest Neighbours

→ KNN is a supervised machine learning algorithm used for classification problems. Since it is a supervised machine learning algorithm, it uses labeled data to make predictions.

→ KNN analyzes the 'k' nearest data points and then classifies the new data based on them.
→ In detail, to label a new point, the KNN algorithm analyzes the 'k' nearest neighbors, i.e., the 'k' nearest data points to the new point. It chooses the label of the new point as the one to which the majority of the 'k' nearest neighbors belong.

→ It is essential to choose an appropriate value of 'k' to avoid overfitting our model.

→ For better understanding, have a look at the diagram below.

Comparison
Now, we will briefly sum up all that we’ve learned and try to compare and contrast Linear Classification and Non-
Linear Classification.

S.No | Linear Classification | Non-Linear Classification
1. | Linear Classification refers to categorizing a set of data points into a discrete class based on a linear combination of its explanatory variables. | Non-Linear Classification refers to categorizing those instances that are not linearly separable.
2. | It is possible to classify data with a straight line. | It is not easy to classify data with a straight line.
3. | Data is classified with the help of a hyperplane. | Kernels are used to transform non-separable data into separable data.

1. What is the difference between Linear Classification and Non-Linear Classification?


The main difference is that in the case of Linear Classification, data is classified using a hyperplane. In contrast,
kernels are used to organize data in the Non-Linear Classification case.

2. Name a few linear classifiers.


Some of the popular linear classifiers are:
i) Naive Bayes
ii) Logistic Regression
iii) Support Vector Machine (linear kernel)

3. What are the most popular non-linear classifiers?


Some of the popular non-linear classifiers are:
i) Multi-Layer Perceptron (MLP)
ii) Decision Tree
iii) Random Forests
iv) K-Nearest Neighbors

Key Takeaways
Congratulations on making it this far. This blog discussed a fundamental overview of Linear Classification and Non-
Linear Classification and the differences between the two!!

We learned about Linear Classifiers such as Linear Discriminant Classifier, Naive Bayes, Logistic Regression and
Support Vector Machines, and Non-Linear Classifiers such as Quadratic Discriminant classifiers, Multi- Layer
Perceptron, Decision Trees, Random Forests, and K-Nearest Neighbours.


Multiclass Classification - Explained in Machine Learning
By Great Learning Team - Aug 13, 2020

1. What is Multiclass Classification?
2. Which classifiers do we use in multiclass classification? When do we use them?
3. What is Entropy?
4. What is Gini Index?
5. Confusion Matrix in Multi-class Classification
6. Multiclass Vs Multi-label

Contributed by: Ayushi Jain


LinkedIn Profile: https://www.linkedin.com/in/ayushi-jain-541047131/

We have heard about classification and regression techniques in machine learning.


We know that these two techniques work on different algorithms for discrete and
continuous data respectively. In this article, we will learn more about classification.
If we dig deeper into classification, we deal with two types of target variables,
binary class, and multi-class target variables.

Binary, as the name suggests, has two categories in the dependent column.
Multiclass refers to columns with more than two categories in it.


You will get answers to all the questions that might cross your mind while reading

this article, such as: 

1. What is multiclass classification?

2.Which classifiers do we use in multiclass classification?

3.How and when do we use these classifiers?

4.Is multiclass and multi-label classification similar?

What is multiclass classification?


Classification means categorizing data and forming groups based on the
similarities. In a dataset, the independent variables or features play a vital role
in classifying our data. When we talk about multiclass classification, we have
more than two classes in our dependent or target variable, as can be seen in
Fig.1:

The above picture is taken from the Iris dataset which depicts that the target
variable has three categories i.e., Virginica, setosa, and Versicolor, which are three
species of Iris plant. We might use this dataset later, as an example of a
conceptual understanding of multiclass classification.

Which classifiers do we use in multiclass classification? When do we use them?
We use many algorithms such as Naïve Bayes, Decision trees, SVM, Random forest
classifier, KNN, and logistic regression for classification. But we might learn about
only a few of them here because our motive is to understand multiclass
classification. So, using a few algorithms we will try to cover almost all the relevant
concepts related to multiclass classification.

Naive Bayes

Naive Bayes is a parametric algorithm, which means it requires a fixed set of parameters or assumptions to simplify the machine's learning process. In parametric algorithms, the number of parameters used is independent of the size of the training data.

Naïve Bayes Assumption:

It assumes that features of a dataset are completely independent of each other.


But this is generally not true, which is why we also call it a 'naïve' algorithm.

It is a classification model based on conditional probability and uses Bayes theorem


to predict the class of unknown datasets. This model is mostly used for large
datasets as it is easy to build and is fast for both training and making predictions.
Moreover, without hyperparameter tuning, it can give you better results as
compared to other algorithms.

Naïve Bayes can also be an extremely good text classifier, as it performs well on problems such as the spam/ham dataset.

Bayes' theorem is stated as: P(A|B) = P(B|A) P(A) / P(B)

By P (A|B), we are trying to find the probability of event A given that event B is
true. It is also known as posterior probability.

Event B is known as evidence.

P (A) is called priori of A which means it is probability of event before evidence is


seen.

P (B|A) is known as conditional probability or likelihood.

Note: Naïve Bayes is a linear classifier, which might not be suitable for classes that are not linearly separable in a dataset. Let us look at the figure below:

As can be seen in Fig.2b, Classifiers such as KNN can be used for non-linear
classification instead of Naïve Bayes classifier.


KNN (K-nearest neighbours)



KNN is a supervised machine learning algorithm that can be used to solve both classification and regression problems. It is one of the simplest yet most powerful algorithms. It does not learn a discriminative function from the training data but memorizes the training data instead; for this very reason, it is also known as a lazy algorithm.

How does it work?

The K-nearest neighbor algorithm forms a majority vote between the K most similar instances, using a distance metric between two data points to define similarity. The most popular choice is the Euclidean distance, which is written as:

K in KNN is the hyperparameter that we can choose to get the best possible fit for the dataset. If we keep the smallest value for K, i.e., K = 1, then the model will show low bias but high variance, because our model will be overfitted in this case. Whereas a larger value for K, let's say K = 10, will surely smoothen our decision boundary, which means low variance but high bias. So we always go for a trade-off between bias and variance, known as the bias-variance trade-off.

Let us understand more about it by looking at its advantages and disadvantages:

Advantages-

KNN makes no assumptions about the distribution of classes i.e. it is a non-


parametric classifier

It is one of the methods that can be widely used in multiclass classification

It does not get impacted by outliers

This classifier is easy to use and implement

Disadvantages-

K value is difficult to find as it must work well with test data also, not only with
the training data

It is a lazy algorithm as it does not make any models

It is computationally expensive because it measures the distance to each data point

Decision Trees

As the name suggests, a decision tree is a tree-like structure of decisions made based on some conditional statements. This is one of the most used supervised learning methods in classification problems because of its high accuracy, stability, and easy interpretation. Decision trees can map linear as well as non-linear relationships well.
Let us look at the figure below, Fig. 3, where we have used the adult census income dataset with two independent variables and one dependent variable. Our target or dependent variable is income, which has binary classes, i.e., <=50K or >50K.

Fig 3: Decision Tree- Binary Classifier

We can see that the algorithm works based on some conditions, such as Age <50
and Hours>=40, to further split into two buckets for reaching towards homogeneity.
Similarly, we can move ahead for multiclass classification problem datasets, such
as Iris data.

Now a question arises in our mind. How should we decide which column to take first
and what is the threshold for splitting? For splitting a node and deciding threshold for
splitting, we use entropy or Gini index as measures of impurity of a node. We aim to
maximize the purity or homogeneity on each split, as we saw in Fig.2.

What is Entropy?
Entropy or Shannon entropy is the measure of uncertainty, which has a similar
sense as in thermodynamics. By entropy, we talk about a lack of information. To
understand better, let us suppose we have a bag full of red and green balls.

Scenario1: 5 red balls and 5 green balls.

https://www.mygreatlearning.com/blog/multiclass-classification-explained/
6/9
8/9/22, 9:40 PM What is Multiclass Classification in Machine Learning | Great Learning

If you are asked to take one ball out of it, then what is the probability that the ball will be green?

Here we all know there is a 50% chance that the ball we pick will be green.

Scenario2: 1 red and 9 green balls


Here the chances of red ball are minimum and we are certain enough that the
ball we pick will be green because of its 9/10 probability.

Scenario3: 0 red and 10 green balls


In this case, we are very certain that the ball we pick is of green colour.

In the second and third scenario, there is high certainty of green ball in our first
pick or we can say there is less entropy. But in the first scenario there is high
uncertainty or high entropy.

Entropy ∝ Uncertainty

Formula for entropy:

Entropy = − Σ p(i) log2 p(i)

where p(i) is the probability of an element/class 'i' in the data.

After finding the entropy, we find the Information Gain of a split, which is written as:

Information Gain = Entropy(parent) − Σ (n_k / n) × Entropy(child_k)

What is Gini Index?


Gini is another useful metric to decide splitting in decision trees.

Gini Index formula:

Gini Index = 1 − Σ p(i)²

where p(i) is the probability of an element/class 'i' in the data.
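A short base-R sketch (mine, reusing the ball scenarios above) that computes both impurity measures:

entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))   # -sum p_i * log2(p_i)
gini    <- function(p) 1 - sum(p^2)                          # 1 - sum p_i^2

scenarios <- list(s1 = c(red = 5, green = 5) / 10,
                  s2 = c(red = 1, green = 9) / 10,
                  s3 = c(red = 0, green = 10) / 10)

sapply(scenarios, function(p) c(entropy = entropy(p), gini = gini(p)))
# Scenario 1 (maximum uncertainty) has the highest entropy/Gini; scenario 3 has zero.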


We have always seen logistic regression as a supervised classification algorithm used in binary classification problems. But here, we will learn how we can extend this algorithm for classifying multiclass data. In binary classification, we have 0 or 1 as our classes, and the threshold for a balanced binary classification dataset is generally 0.5. Whereas in multiclass, there can be 3 balanced classes, for which we require 2 threshold values, which can be 0.33 and 0.66. But a question arises: by what method do we calculate the thresholds and approach multiclass classification? So let's first see the general formula that we use for the logistic regression curve:

P = 1 / (1 + e^-(b0 + b1X))

where P is the probability of the event occurring, and the above equation derives from:

ln(P / (1 − P)) = b0 + b1X

There are two ways to approach this kind of problem. They are explained below:

One vs. Rest (OvR) - Here, one class is considered positive and all the rest are taken as negative, and then we generate n classifiers. Suppose there are 3 classes in a dataset; in this approach, we train 3 classifiers by taking one class at a time as positive and the other two classes as negative. Each classifier then predicts the probability of its particular class, and the class with the highest probability is the answer.

One vs. One (OvO) - In this approach, n * (n − 1) / 2 binary classifier models are generated, one for each pair of classes. Each classifier predicts one class label; once we input test data, the class which has been predicted the most is chosen as the answer.
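A compact R sketch of the One-vs-Rest idea, my own illustration on the built-in iris data using one binomial glm per class (not code from the article):

d <- iris
classes <- levels(d$Species)

# Train one glm per class: current class = 1, the rest = 0
# (setosa is perfectly separable, so glm may warn about fitted probabilities of 0 or 1)
models <- lapply(classes, function(cls)
  glm(I(Species == cls) ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
      data = d, family = binomial))

# Each model scores the probability of "its" class; pick the highest per row
probs <- sapply(models, predict, newdata = d, type = "response")
pred  <- classes[max.col(probs)]
table(predicted = pred, actual = d$Species)

# One-vs-One would instead train choose(3, 2) = 3 pairwise classifiers and vote.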

Confusion Matrix in Multi-class Classification


A confusion matrix is a table which is used in every classification problem to describe the performance of a model on test data.

As with the confusion matrix in binary classification, in multiclass classification we can also find precision and recall.

Let's take an example to get a better idea about the confusion matrix in multiclass classification, using the Iris dataset which we have already seen above in this article.

Finding precision and recall from above Table.1:

Precision for Virginica class is the number of correctly predicted virginica species out
of all the predicted virginica species, which is 4/7 = 57.1%. This means that only 4/7 of
the species that our predictor classifies as Virginica are actually virginica. Similarly,
we can find for other species i.e. for Setosa and Versicolor, precision is 20% and 62.5%
respectively.

Whereas, Recall for Virginica class is the number of correctly predicted virginica
species out of actual virginica species, which is 50%. This means that our classifier
classified half of the virginica species as virginica. Similarly, we can find for other
species i.e. for Setosa and Versicolor, recall is 20% and 71.4% respectively.
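Since Table 1 is not reproduced here, the R sketch below uses a hypothetical 3x3 confusion matrix (not the article's counts) just to show how per-class precision and recall are computed from such a table:

# Rows = actual class, columns = predicted class (hypothetical counts, for illustration only)
cm <- matrix(c(10,  2,  3,
                1, 12,  2,
                4,  3,  9),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("setosa", "versicolor", "virginica"),
                             predicted = c("setosa", "versicolor", "virginica")))

precision <- diag(cm) / colSums(cm)   # correct predictions / all predictions of that class
recall    <- diag(cm) / rowSums(cm)   # correct predictions / all actual members of that class
round(rbind(precision, recall), 3)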

Multiclass Vs Multi-label
People often get confused between multiclass and multi-label classification. But
these two terms are very different and cannot be used interchangeably. We have
already understood what multiclass is all about. Let’s discuss in brief how multi-
label is different from multiclass.

Multi-label refers to a data point that may belong to more than one class. For
example, you wish to watch a movie with your friends but you have a different
choice of genres that you all enjoy. Some of your friends like comedy and others
are more into action and thrill. Therefore, you search for a movie that fulfills both
the requirements and here, your movie is supposed to have multiple labels.
Whereas, in multiclass or binary classification, your data point can belong to only a
single class. Some more examples of multi-label datasets could be protein classification in the human body, or music categorization according to genre. It is also one of the concepts highly used in photo classification.

I hope this article has provided you with some fair conceptual knowledge. Don’t
stop here, remember that there are many more ways to classify your data. All that
is important is how you polish your basics to create and implement more
algorithms. Let us conclude by looking at what Professor Pedro Domingos said-

“Machine learning will not single-handedly determine the future, any more than
any other technology; it’s what we decide to do with it that counts, and now you
have the tools to decide.”

Introduction to Classification Algorithms

Last updated on Nov 25, 2020

Upasana
Research Analyst, Tech Enthusiast, Currently working on Azure IoT & Data Science...

The idea of Classification Algorithms is pretty simple. You predict the target class by analyzing the training dataset. This is one of the most essential concepts, if not the most essential, that you study when you learn Data Science.

This blog discusses the following concepts:

What is Classification?
Classification vs Clustering Algorithms
Basic Terminology in Classification Algorithms
Applications of Classification Algorithms
Types of Classification Algorithms
Logistic Regression
Decision Tree
Naive Bayes Classifier
K Nearest Neighbour
SVM

What is Classification?

We use the training dataset to get better boundary conditions which could be used to determine each target class. Once the boundary conditions are determined, the next task is to predict the target class. The whole process is known as classification.

Target class examples:

Analysis of the customer data to predict whether he will buy computer accessories (Target class: Yes or No)
Classifying fruits from features like color, taste, size, weight (Target classes: Apple, Orange, Cherry, Banana)
Gender classification from hair length (Target classes: Male or Female)

Let's understand the concept of classification algorithms with gender classification using hair length (by no means am I trying to stereotype by gender, this is only an example). To classify gender (target class) using hair length as the feature parameter, we could train a model using any classification algorithm to come up with some set of boundary conditions which can be used to differentiate the male and female genders using hair length as the training feature. In the gender classification case, the boundary condition could be the proper hair length value. Suppose the differentiating boundary hair length value is 15.0 cm; then we can say that if hair length is less than 15.0 cm, the gender could be male, or else female.

Classification Algorithms vs Clustering Algorithms

In clustering, the idea is not to predict the target class as in classification; rather, it is about grouping similar kinds of things by considering the most satisfied condition: all the items in the same group should be similar, and items in two different groups should not be similar.

Group item examples:

While grouping similar language type documents (same-language documents are one group)
While categorizing the news articles (same news category (Sport) articles are one group)

Let's understand the concept with clustering genders based on the hair length example. To determine gender, different similarity measures could be used to categorize male and female genders. This could be done by finding the similarity between two hair lengths and grouping them, continuing until all the hair lengths are properly grouped into two categories.

Basic Terminology in Classification Algorithms

Classifier: An algorithm that maps the input data to a specific category.
Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.
Feature: A feature is an individual measurable property of a phenomenon being observed.
Binary Classification: Classification task with two possible outcomes. Eg: Gender classification (Male / Female)
Multi-class classification: Classification with more than two classes. In multi-class classification, each sample is assigned one and only one target label. Eg: An animal can be a cat or dog but not both at the same time.
Multi-label classification: Classification task where each sample is mapped to a set of target labels (more than one class). Eg: A news article can be about sports, a person, and a location at the same time.

Applications of Classification Algorithms

Email spam classification
Bank customer loan repayment willingness prediction
Cancer tumor cell identification
Sentiment analysis
Drug classification
Facial key point detection
Pedestrian detection in automotive car driving

Types of Classification Algorithms

Classification Algorithms could be broadly classified as the following:

Linear Classifiers

Logistic regression
Naive Bayes classifier
Fisher's linear discriminant
Support vector machines
Least squares support vector machines
Quadratic classifiers
Kernel estimation
k-nearest neighbor
Decision trees
Random forests
Neural networks
Learning vector quantization

Examples of a few popular Classification Algorithms are given below.

Logistic Regression

As confusing as the name might be, you can rest assured: Logistic Regression is a classification and not a regression algorithm. It estimates discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). Simply put, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts probabilities, the values obtained always lie between 0 and 1.

Let us try and understand this through another example.

Let's say there's a sum on your math test. It can only have 2 outcomes, right? Either you solve it or you don't (and let's not assume points for method here). Now imagine that you are being given a wide range of sums in an attempt to understand which chapters you have understood well. The outcome of this study would be something like this: if you are given a trigonometry-based problem, you are 70% likely to solve it. On the other hand, if it is an arithmetic problem, the probability of you getting an answer is only 30%. This is what Logistic Regression provides you.

If I had to do the math, I would model the log odds of the outcome as a linear combination of the predictor variables:


odds = p / (1 - p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

In the equation given above, p is the probability of the presence of the characteristic of interest. It chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).

Now, a lot of you might wonder, why take a log? For the sake of simplicity, let's just say that this is one of the best mathematical ways to replicate a step function. I could go way more in-depth on this, but that would defeat the purpose of this blog.

R-Code:
# x_train, y_train and x_test are assumed to be pre-existing data frames
x <- cbind(x_train, y_train)
# Train the model using the training set and check the fit
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
# Predict output (type = 'response' gives probabilities)
predicted <- predict(logistic, x_test, type = 'response')

There are many different steps that could be tried in order to improve the model:

include interaction terms
remove features
regularization techniques
use a non-linear model

Decision Tree

What the decision tree algorithm does is, it splits the population into two or more homogeneous sets based on the most significant attributes, making the groups as distinct as possible.


In the image above, you can see that the population is classified into four different groups based on multiple attributes, to identify 'if they will play or not'.

R-Code:
library(rpart)
# x_train, y_train and x_test are assumed to be pre-existing data frames
x <- cbind(x_train, y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

Naive Bayes Classifier

This is a classification technique based on an assumption of independence between predictors, or what's known as Bayes' theorem. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes Classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

Building a Bayesian model is simple and particularly functional in the case of enormous data sets. Along with simplicity, Naive Bayes is known to outperform even sophisticated classification methods.

Here, by Bayes' theorem, P(c|x) = P(x|c) P(c) / P(x), where:

P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.

Example: Let's work through an example to understand this better. So, here I have a training data set of weather, namely sunny, overcast and rainy, and a corresponding binary variable 'Play'. Now, we need to classify whether players will play or not based on the weather condition. Let's follow the below steps to perform it.

Step 1: Convert the data set to a frequency table.

Step 2: Create a likelihood table by finding the relevant probabilities (for example, the probability of each weather condition and the overall probability of playing).
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny).

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
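A two-line R check of this arithmetic (the "No" branch is derived from the same quoted counts: 3 of the 5 sunny days are "Yes", so 2 are "No", and 14 − 9 = 5 days are "No" overall), included only to show that the two posteriors are compared:

# Posterior for "play" given sunny weather, using the counts quoted above
p_yes <- (3/9) * (9/14) / (5/14)   # P(Sunny|Yes) * P(Yes) / P(Sunny) = 0.60
p_no  <- (2/5) * (5/14) / (5/14)   # P(Sunny|No)  * P(No)  / P(Sunny) = 0.40
c(yes = p_yes, no = p_no)          # the two posteriors sum to 1; "Yes" wins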

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

R-Code:

library(e1071)
# x_train, y_train and x_test are assumed to be pre-existing data frames
x <- cbind(x_train, y_train)
# Fit the model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)


K Nearest Neighbour (KNN)

K nearest neighbors is a simple algorithm used for both classification and regression problems. It basically stores all available cases and classifies new cases by a majority vote of its k neighbors. The case is assigned to the class most common amongst its K nearest neighbors, as measured by a distance function.

While the three former distance functions are used for continuous variables, the Hamming distance function is used for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.

You can understand KNN easily by taking an example from our real lives. If you have a crush on a girl/boy in class, of whom you have no information, you might want to talk to their friends and social circles to gain access to their information!

R-Code:
library(class)   # the knn() function lives in the 'class' package
# x_train, y_train and x_test are assumed to be pre-existing data sets;
# knn() takes the training features, test features, training labels and k directly
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
Things to consider before selecting KNN:

Variables should be normalized, else higher-range variables can bias the distance calculation.


Work more on the pre-processing stage before applying KNN, e.g. outlier and noise removal. A short Python sketch combining these steps is given below.
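Since the R snippet above assumes pre-existing x_train and x_test objects, here is a rough Python equivalent as a sketch only; the data set (make_classification) is synthetic and purely illustrative, and the scaling step reflects the normalization advice above.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the x_train / x_test objects assumed by the R snippet
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale the features first so that higher-range variables do not dominate the distance
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(x_train, y_train)
print(knn.score(x_test, y_test))   # accuracy on the held-out test set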

Support Vector Machine (SVM)

In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.



For example, if we only had two features like height and hair length of an individual, we'd first plot these two variables in two-dimensional space, where each point has two coordinates (these coordinates are known as support vectors).


Now, we will find some line that splits the data between the two differently classified groups of data. This will be the line such that the distance from the closest point in each of the two groups is as far from the line as possible.


In the example shown above, the line which splits the data into two differently classified groups is the blue line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on which side of the line a test observation lands, that is the class we assign to the new data.

R-Code:

library(e1071)

# Fitting model
x <- cbind(x_train, y_train)
fit <- svm(y_train ~ ., data = x)
summary(fit)

# Predict Output
predicted <- predict(fit, x_test)
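For readers working in Python rather than R, a comparable sketch with scikit-learn's SVC is shown below; again the data set is synthetic and only stands in for the training and test objects assumed above, and the linear kernel is an arbitrary choice that mirrors the separating-line picture.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the training set used in the R code
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

clf = SVC(kernel="linear")        # linear kernel: the classifier is a separating hyperplane
clf.fit(x_train, y_train)
predicted = clf.predict(x_test)
print((predicted == y_test).mean())   # test accuracy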

So, with this, we come to the end of this Classification Algorithms blog. Try out the simple R codes on your systems now, and you'll no longer call yourselves newbies in this concept.

Decision Tree: Classification and Regression Algorithms (CART, ID3)

Decision Tree Explained in Detail

CART: Classification and Regression Tree

It is used for generating both classification trees and regression trees.

It uses the Gini index as the metric/cost function to evaluate splits during feature selection.

It produces binary splits at every node.

It uses least squares as the metric to select features in the case of a regression tree.

Calculate the Gini index for sub-nodes as one minus the sum of the squared class probabilities; for a two-class node, Gini = 1 - (p² + q²), where p and q are the probabilities of success and failure.

The feature for which the weighted Gini impurity is least is selected as the root node to split on.

Let's start with the weather data set, which is quite famous for explaining decision tree algorithms. The target is to predict Play or not (Yes or No) based on the weather conditions.

From the data: outlook, temperature, humidity and wind are the features.

So, let's start building the tree.

Outlook is a nominal feature; it can take three values: sunny, overcast and rain. Let's summarize the final decision for the outlook feature.


The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one.

Gini(outlook = sunny) = 1 - (2/5)² - (3/5)² = 1 - 0.16 - 0.36 = 0.48

Gini(outlook = overcast) = 1 - (4/4)² - (0/4)² = 1 - 1 - 0 = 0

Gini(outlook = rainfall) = 1 - (3/5)² - (2/5)² = 1 - 0.36 - 0.16 = 0.48

Now, we will calculate the weighted sum of the Gini index for the outlook feature:

Gini(outlook) = (5/14)*0.48 + (4/14)*0 + (5/14)*0.48 = 0.342
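This weighted Gini computation is easy to script. The sketch below uses the class counts for outlook from the calculations above (2 Yes / 3 No for sunny, 4 Yes / 0 No for overcast, 3 Yes / 2 No for rain) and reproduces the value above up to rounding.

def gini(counts):
    # Gini impurity 1 - sum(p_i^2) for a list of class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# (Yes, No) counts for each value of the outlook feature in the weather data
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
n = sum(sum(c) for c in outlook.values())            # 14 observations

weighted = sum(sum(c) / n * gini(c) for c in outlook.values())
print(round(weighted, 3))                            # -> 0.343 (the 0.342 above, up to rounding)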

Similarly, temperature is also a nominal feature; it can take three values: hot, cool and mild. Let's summarize the final decision for the temperature feature.

Gini(temperature = hot) = 1 - (2/4)² - (2/4)² = 0.5

Gini(temperature = cool) = 1 - (3/4)² - (1/4)² = 0.375

Gini(temperature = mild) = 1 - (4/6)² - (2/6)² = 0.445

Now, the weighted sum of the Gini index for the temperature feature can be calculated as:

Gini(temperature) = (4/14)*0.5 + (4/14)*0.375 + (6/14)*0.445 = 0.439

Similarly for Humidity


Humidity is a binary feature; it can take two values, high and normal.

Gini(humidity = high) = 1 - (3/7)² - (4/7)² = 0.489

Gini(humidity = normal) = 1 - (6/7)² - (1/7)² = 0.244

Now, the weighted sum of the Gini index for the humidity feature can be calculated as:

Gini(humidity) = (7/14)*0.489 + (7/14)*0.244 = 0.367

Similarly for Wind

Wind is a binary feature; it can take two values, weak and strong.

Gini(wind = weak) = 1 - (6/8)² - (2/8)² = 0.375

Gini(wind = strong) = 1 - (3/6)² - (3/6)² = 0.5

Now, the weighted sum of the Gini index for the wind feature can be calculated as:

Gini(wind) = (8/14)*0.375 + (6/14)*0.5 = 0.428

Decision for the root node

So, the final Gini index values of all the features are: Gini(outlook) = 0.342, Gini(temperature) = 0.439, Gini(humidity) = 0.367 and Gini(wind) = 0.428.


From this comparison, you can see that the Gini index for the outlook feature is the lowest, so outlook becomes the root node.

Now, let's focus on the sub-data for the sunny outlook value. We need to find the Gini index for the temperature, humidity and wind features on this subset.


Gini index for temperature on sunny outlook

Gini(outlook = sunny & temperature = hot) = 1 - (0/2)² - (2/2)² = 0

Gini(outlook = sunny & temperature = cool) = 1 - (1/1)² - (0/1)² = 0

Gini(outlook = sunny & temperature = mild) = 1 - (1/2)² - (1/2)² = 0.5

Now, the weighted sum of the Gini index for temperature on the sunny outlook subset can be calculated as:

Gini(outlook = sunny & temperature) = (2/5)*0 + (1/5)*0 + (2/5)*0.5 = 0.2


Gini index for humidity on sunny outlook



Gini(outlook=sunny & humidity=high) = 1-(0/3)²-(3/3)² = 0

Gini(outlook=sunny & humidity=normal) = 1-(2/2)²-(0/2)² = 0

Now, the weighted sum of Gini index for humidity on sunny outlook features can be
calculated as,

Gini(outlook = sunny & humidity) = (3/5) *0 + (2/5) *0=0

Gini Index for wind on sunny outlook

Gini(outlook=sunny & wind=weak) = 1-(1/3)²-(2/3)² = 0.44

Gini(outlook=sunny & wind=strong) = 1-(1/2)²-(1/2)² = 0.5

Now, the weighted sum of Gini index for wind on sunny outlook features can be
calculated as,

Gini(outlook = sunny and wind) = (3/5) *0.44 + (2/5) *0.5=0.266+0.2= 0.466

Decision on sunny outlook factor


We have calculated the Gini index of all the features when the outlook is sunny. You can see that humidity has the lowest value, so the next node under the sunny branch will be humidity.

Now, let's focus on the sub-data for the overcast outlook value, and calculate all remaining splits in the same manner.


So the final decision tree will look like the tree shown above.

ID3 (Iterative Dichotomiser 3)

ID3 uses entropy and information gain as metrics to form a better decision tree. The attribute with the highest information gain is used as the root node, and a similar approach is followed after that.

Entropy varies from 0 to 1: it is 0 if all the data belong to a single class, and 1 (for a binary problem) if the class distribution is equal. For class probabilities p_i, Entropy = -Σ p_i * log2(p_i). In this way, entropy gives a measure of impurity in the data set.

Steps to decide which attribute to split:

1. Compute the entropy for the dataset

2. For every attribute:

2.1 Calculate entropy for all categorical values.

2.2 Take average information entropy for the attribute.

2.3 Calculate the information gain for the current attribute.


3. Pick the attribute with the highest information gain.

4. Repeat until we get the desired tree.

We will take the same weather data set that we used for explaining the CART algorithm above.

Number of observations = 14

Number of observations having Decision 'Yes' = 9


Probability of 'Yes', p(Yes) = 9/14

Number of observations having Decision 'No' = 5

Probability of 'No', p(No) = 5/14

Entropy(Decision) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.94

We have four attributes: outlook, temperature, humidity, and wind.

Information Gain on Sunny outlook factor

Number of instances with the sunny outlook value = 5

P(Decision = 'Yes' | outlook = sunny) = 2/5

P(Decision = 'No' | outlook = sunny) = 3/5

Entropy(Decision | outlook = sunny) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.97

Entropy(Decision | outlook = overcast) = -(4/4)*log2(4/4) = 0

Entropy(Decision | outlook = rainfall) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.97

Gain(Decision, outlook) = Entropy(Decision) - Σ_v P(outlook = v) * Entropy(Decision | outlook = v), where v ranges over sunny, overcast and rainfall.

Gain(Decision, outlook) = 0.94 - (5/14)*0.97 - (4/14)*0 - (5/14)*0.97 = 0.247

Summary of information gain for all the attributes

Gain(Decision, outlook) = 0.247

Gain(Decision, wind) = 0.048

Gain(Decision, temperature) = 0.029

Gain(Decision, humidity) = 0.151

So, outlook has the highest information gain, and it is selected as the first node (root node).
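The entropy and information-gain arithmetic above can be checked with a short script. The sketch below reuses the (Yes, No) counts per outlook value from the data set and reproduces Gain(Decision, outlook) ≈ 0.247.

from math import log2

def entropy(counts):
    # Shannon entropy -sum(p_i * log2(p_i)) for a list of class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts for each value of the outlook feature
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
n = sum(sum(c) for c in outlook.values())            # 14 observations
parent = entropy((9, 5))                             # Entropy(Decision) ~ 0.94

gain = parent - sum(sum(c) / n * entropy(c) for c in outlook.values())
print(round(gain, 3))                                # -> 0.247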

Information Gain on Temperature under Sunny outlook factor


Entropy(sunny | temp = hot) = -(0/2)*log2(0/2) - (2/2)*log2(2/2) = 0

Entropy(sunny | temp = cool) = -(1/1)*log2(1/1) - (0/1)*log2(0/1) = 0

Entropy(sunny | temp = mild) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1

(terms of the form 0*log2(0) are taken to be 0)

Gain(sunny, temp) = 0.97 - (2/5)*0 - (1/5)*0 - (2/5)*1 = 0.57

Summary of information gain for all attributes under the sunny outlook factor

Gain(sunny, temp) = 0.57

Gain(sunny, humidity) = 0.97

Gain(sunny, wind) =0.019

We can see that the information gain of the humidity attribute is higher than the others, so it is the next node under the sunny outlook branch.


Information gain on the humidity factor

Under the sunny outlook, humidity takes two values, normal and high.

From both tables, you can infer that whenever humidity is high the decision is 'No', and whenever humidity is normal the decision is 'Yes'.


Now, let's focus on the remaining sub-data and calculate the rest of the splits in the same manner.


C4.5:

It is an extension of the ID3 algorithm, and improves on ID3 in that it deals with both continuous and discrete values. It is also used for classification purposes.

The algorithm can create more generalized models, including models over continuous data, and it can handle missing data.

Decision rules are found based on the entropy and information gain ratio of each feature. At each level of the decision tree, the feature with the maximum gain ratio becomes the decision rule.
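C4.5's gain ratio divides the information gain by the split information, i.e. the entropy of the attribute's own value distribution. A minimal sketch, applied to the outlook attribute of the same weather data set, is shown below.

from math import log2

def entropy(counts):
    # Shannon entropy of a list of counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts for each value of the outlook attribute
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
n = sum(sum(c) for c in outlook.values())

info_gain = entropy((9, 5)) - sum(sum(c) / n * entropy(c) for c in outlook.values())
split_info = entropy([sum(c) for c in outlook.values()])   # entropy of the 5/4/5 value split
print(round(info_gain / split_info, 3))                    # gain ratio used by C4.5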

Classification in machine learning and statistics is a supervised learning approach in which the computer program learns from the data given to it. In this article, we will learn about classification in machine learning in detail. The following topics are covered in this blog:

What is Classification in Machine Learning?


Classification Terminologies In Machine Learning
Classification Algorithms
Logistic Regression
Naive Bayes
Stochastic Gradient Descent
K-Nearest Neighbors
Decision Tree
Random Forest
Artificial Neural Network
Support Vector Machine
Classifier Evaluation
Algorithm Selection
Use Case- MNIST Digit Classification

What is Classification In Machine Learning


Classification is a process of categorizing a given set of data into classes; it can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as target, label or categories.

Classification predictive modeling is the task of approximating a mapping function from input variables to discrete output variables. The main goal is to identify which class or category the new data will fall into.

Let us try to understand this with a simple example.

Heart disease detection can be identified as a classification problem; it is a binary classification, since there can be only two classes: has heart disease or does not have heart disease. The classifier, in this case, needs training data to understand how the given input variables are related to the class. And once the classifier is trained accurately, it can be used to detect whether heart disease is present or not for a particular patient.

Since classification is a type of supervised learning, the targets are also provided with the input data. Let us get familiar with the classification terminologies in machine learning.

Classification Terminologies In Machine Learning


Classifier – It is an algorithm that is used to map the input data to a specific category.

Classification Model – The model predicts or draws a conclusion from the input data given for training; it will predict the class or category for new data.

Feature – A feature is an individual measurable property of the phenomenon being observed.

Binary Classification – It is a type of classification with two outcomes, for eg – either true or false.

Multi-Class Classification – Classification with more than two classes; in multi-class classification each sample is assigned to one and only one target label.

Multi-label Classification – This is a type of classification where each sample is assigned to a set of labels or targets.

Initialize – It is to assign the classifier to be used for the model.

Train the Classifier – Each classifier in sci-kit learn uses the fit(X, y) method to fit the model for training the train X and train label y.

Predict the Target – For an unlabeled observation X, the predict(X) method returns predicted label y.

Evaluate – This basically means the evaluation of the model, i.e. classification report, accuracy score, etc. (a minimal scikit-learn sketch of these steps follows below).
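To make the Initialize / Train / Predict / Evaluate terminology concrete, here is a minimal scikit-learn sketch on a built-in data set; the choice of KNeighborsClassifier is arbitrary and purely illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier()        # Initialize: choose the classifier
clf.fit(X_train, y_train)           # Train: fit(X, y)
y_pred = clf.predict(X_test)        # Predict: predict(X) returns predicted labels

print(accuracy_score(y_test, y_pred))           # Evaluate: accuracy score
print(classification_report(y_test, y_pred))    # Evaluate: classification report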
Types Of Learners In Classification

Lazy Learners – Lazy learners simply store the training data and wait until testing data appears. The classification is then done using the most closely related data in the stored training data.
Logistic Regression

Advantages and Disadvantages

Logistic regression is specifically meant for classification; it is useful in understanding how a set of independent variables affects the outcome of the dependent variable.

The main disadvantage of the logistic regression algorithm is that it only works when the predicted variable is binary, and it assumes that the predictors are independent of each other.

Use Cases

Identifying risk factors for diseases

Word classification

Weather Prediction

Voting Applications

Learn more about logistic regression with python here.

Naive Bayes Classifier


It is a classification algorithm based on Bayes' theorem, which makes an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Even if the features depend on each other, all of these properties contribute to the probability independently. The Naive Bayes model is easy to build and is particularly useful for very large data sets. Even with this simplistic approach, Naive Bayes is known to perform well compared with many other classification methods in machine learning. Bayes' theorem states that P(A | B) = P(B | A) * P(A) / P(B).

Advantages and Disadvantages

The Naive Bayes classifier requires a small amount of training data to estimate the necessary parameters to get results, and it is extremely fast compared to more sophisticated methods. The only disadvantage is that it is known to be a bad probability estimator.

Use Cases

Disease Predictions

Document Classification

Spam Filters

Sentiment Analysis

Know more about the Naive Bayes Classifier here.

Stochastic Gradient Descent

It is a very effective and simple approach to fitting linear models. Stochastic Gradient Descent is particularly useful when the sample data set is large. It supports different loss functions and penalties for classification, updating the parameters, such as weights in neural networks or coefficients in linear regression, one sample (or mini-batch) at a time.
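In scikit-learn this approach is exposed as SGDClassifier, which fits a linear model with a chosen loss function and penalty. The sketch below is illustrative only, again on a synthetic data set.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "hinge" gives a linear SVM-style loss; other losses and penalties are available
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))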

K-Nearest Neighbor

It is a lazy learning algorithm that stores all instances corresponding to the training data in n-dimensional space. It is called lazy because it does not focus on constructing a general internal model; instead, it works by storing instances of the training data.

Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is supervised and takes a bunch of labeled points; to classify a new point, it looks at the labeled points closest to that new point, also known as its nearest neighbors. It has those neighbors vote, so whichever label most of the neighbors have becomes the label for the new point. The "k" is the number of neighbors it checks.


Advantages And Disadvantages

This algorithm is quite simple in its implementation and is robust to noisy training data. Even if the training data is large, it is quite efficient. The main disadvantages are the need to choose the value of K and the computation cost, which is pretty high compared to other algorithms.

Use Cases

Industrial applications to look for similar tasks in comparison to others

Handwriting detection applications

Image recognition

Video recognition

Stock analysis

Know more about K Nearest Neighbor Algorithm here

Decision Tree

The decision tree algorithm builds the classification model in the form of a tree structure. It utilizes if-then rules which are equally exhaustive and mutually exclusive in classification.

A decision tree gives the advantage of being simple to understand and visualize, and it requires very little data preparation as well. The disadvantage is that it can create complex trees that may not categorize efficiently. Decision trees can be quite unstable, because even a small change in the data can change the whole structure of the tree.

Use Cases

Data exploration

Pattern Recognition

Option pricing in finances

Identifying disease and risk threats

Know more about decision tree algorithm here

Random Forest

Random decision trees, or random forests, are an ensemble learning method for classification, regression, etc. A random forest operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

A random forest is a meta-estimator that fits a number of trees on various sub-samples of the data set and then averages them to improve the predictive accuracy of the model. The sub-sample size is always the same as the original input size, but the samples are drawn with replacement.
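A hedged scikit-learn sketch of this idea (many trees on bootstrap samples, predictions combined by voting) is shown below, on a synthetic data set used purely for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a bootstrap sample; the forest predicts by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))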

Advantages and Disadvantages

The advantage of the random forest is that it is more accurate than a single decision tree, due to the reduction in over-fitting. The only disadvantages are that it is complex in implementation and gets pretty slow in real-time prediction.

Use Cases

Industrial applications such as finding if a loan applicant is high-risk or low-risk

For Predicting the failure of mechanical parts in automobile engines

Predicting social media share scores

Performance scores

Know more about the Random Forest algorithm here.

Artificial Neural Networks

A neural network consists of neurons that are arranged in layers; they take some input vector and convert it into an output. The process involves each layer applying a function, often a non-linear function, to its input and then passing the output to the next layer.
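The layer-by-layer idea can be illustrated with a tiny NumPy forward pass: each layer applies a linear transformation followed by a non-linear function and hands the result to the next layer. The weights below are random, so this is only a sketch of the mechanics, not a trained network.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input vector with 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # layer 1: 4 -> 8
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)    # layer 2: 8 -> 2 (two classes)

h = np.maximum(0, W1 @ x + b1)                   # non-linear activation (ReLU)
logits = W2 @ h + b2
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over the two classes
print(probs)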
Support Vector Machine

The support vector machine is a classifier that represents the training data as points in space separated into categories by a gap that is as wide as possible. New points are then added to the space by predicting which category they fall into and which side of the gap they belong to.

Advantages and Disadvantages

It uses a subset of training points in the decision function, which makes it memory efficient, and it is highly effective in high-dimensional spaces. The only disadvantage of the support vector machine is that the algorithm does not directly provide probability estimates.

Use cases

Business applications for comparing the performance of a stock over a period of time

Investment suggestions

Classification of applications requiring accuracy and efficiency

Learn more about support vector machine in python here

Classifier Evaluation
The most important part after the completion of any classifier is the evaluation, to check its accuracy and efficiency. There are a lot of ways in which we can evaluate a classifier; let us take a look at the methods listed below.

Holdout Method

This is the most common method to evaluate a classifier. In this method, the given data set is divided into two parts, a test set and a train set, typically 20% and 80% respectively. The train set is used to train the model, and the unseen test set is used to test its predictive power.
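A minimal holdout split with scikit-learn might look like the sketch below; the 80/20 ratio mirrors the description above, and the built-in iris data set is only a placeholder.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% of the data for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)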
Accuracy

Accuracy is the ratio of correctly predicted observations to the total number of observations.

True Positive: the number of correct predictions that the occurrence is positive.

True Negative: the number of correct predictions that the occurrence is negative.

F1-Score

It is the harmonic mean (a weighted average) of precision and recall.

Precision And Recall


Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. They are basically used as measures of relevance.
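Accuracy, precision, recall and the F1-score are all available in sklearn.metrics; the sketch below computes them for a small set of hard-coded labels purely for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted labels

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall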

ROC Curve

The receiver operating characteristic (ROC) curve is used for visual comparison of classification models; it shows the relationship between the true positive rate and the false positive rate. The area under the ROC curve is a measure of the accuracy of the model.
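With predicted scores rather than hard labels, the ROC curve and its area can be obtained as sketched below; the scores are hard-coded for illustration only.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # e.g. predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)                          # points of the ROC curve
print(roc_auc_score(y_true, y_scores))   # area under the curve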

Algorithm Selection

Apart from the above approach, we can follow a standard set of steps to pick the best algorithm for the model: load and split the data, train several candidate classifiers, evaluate each one, and keep the most accurate. The use case below applies this workflow to MNIST digit classification.

Use Case: MNIST Digit Classification

Loading the data set:
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays, which the indexing and reshape below expect
mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist)

Output:

Exploring The Dataset

import matplotlib
import matplotlib.pyplot as plt

X, y = mnist['data'], mnist['target']
random_digit = X[4800]
random_digit_image = random_digit.reshape(28, 28)
plt.imshow(random_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")

Output:

Splitting the Data

We are using the first 6000 entries as the training data; the full data set has 70,000 entries (you can check this from the shape of X and y). We have taken 6000 entries as the training set and 1000 entries as the test set.

x_train, x_test = X[:6000], X[6000:7000]
y_train, y_test = y[:6000], y[6000:7000]

Shuffling The Data

To avoid unwanted ordering effects, we shuffle the training data using a NumPy permutation; this generally improves the behaviour of the model during training.

import numpy as np

shuffle_index = np.random.permutation(6000)
x_train, y_train = x_train[shuffle_index], y_train[shuffle_index]

Creating A Digit Predictor Using Logistic Regression

y_train = y_train.astype(np.int8)
y_test = y_test.astype(np.int8)
y_train_2 = (y_train == 2)
y_test_2 = (y_test == 2)
print(y_test_2)
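The snippet above only prepares the binary "is this digit a 2?" labels; a logistic regression classifier still has to be fitted on them before the cross-validation below can be run. A plausible reconstruction is sketched here; the exact hyper-parameters are not shown in the excerpt, so tol=0.1 and max_iter=1000 are assumptions chosen only to keep training quick.

from sklearn.linear_model import LogisticRegression

# Fit a binary "is this digit a 2?" classifier on the shuffled training data
# (tol and max_iter are assumed values, not taken from the original article)
clf = LogisticRegression(tol=0.1, max_iter=1000)
clf.fit(x_train, y_train_2)

example = clf.predict([random_digit])   # predicts whether the sample digit is a 2
print(example)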
Cross-Validation

from sklearn.model_selection import cross_val_score


a = cross_val_score(clf, x_train, y_train_2, cv=3, scoring="accuracy")
a.mean()

Output:


Creating A Predictor Using Support Vector Machine

from sklearn import svm

cls = svm.SVC()
cls.fit(x_train, y_train_2)
cls.predict([random_digit])

Output:

Cross-Validation

a = cross_val_score(cls, x_train, y_train_2, cv = 3, scoring="accuracy")


a.mean()

Output:

In the above example, we were able to make a digit predictor. Since we were predicting whether the digit was a 2 for all the entries in the data, this is a binary classification task, and the cross-validation shows much better accuracy with the logistic regression classifier than with the support vector machine classifier.

This brings us to the end of this article, where we have learned about classification in machine learning. I hope you are clear with all that has been shared with you.
