ML Unit-2 Material Add-On
Training set: X = {x^t, r^t}, t = 1, ..., N
Class C: (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
r = 1 if x is positive; r = 0 if x is negative
Input representation: x = (x1, x2)
Hypothesis class H
h(x) = 1 if h says x is positive; h(x) = 0 if h says x is negative
S is the most specific hypothesis; G is the most general hypothesis.
Any h ∈ H between S and G is consistent with the training set, and together they make up the version space (Mitchell, 1997).
Error of h on X: E(h | X) = (1/N) Σ_{t=1..N} 1(h(x^t) ≠ r^t)
Probably Approximately Correct (PAC) Learning
The PAC framework characterizes classes of target concepts that can be reliably learned, in terms of the accuracy ε to which the target concept is approximated and the confidence 1 − δ of success.

Specific Questions
• Sample complexity: How many training examples are needed for a learner to converge?
• Computational complexity: How much computational effort is needed for a learner to converge?
• Mistake bound: How many training examples will the learner misclassify before converging?
Issues: When do we say the learner was successful? How are the inputs acquired?

Sample Complexity
How many training examples are sufficient to learn the target concept?
1. If the learner proposes instances as queries to the teacher: the learner proposes instance x, and the teacher provides c(x).
2. If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form ⟨x, c(x)⟩.
3. If some random process (e.g., nature) proposes instances.
Exhausting the Version Space
[Figure: hypothesis space H with the version space VS_{H,D}; each hypothesis is annotated with its training error r and its true error, e.g. r = 0, error = .1 (r = training error, error = true error).]

Definition: The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than ε with respect to c and D:
(∀h ∈ VS_{H,D}) error_D(h) < ε

How many examples will ε-exhaust the VS?
Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than
|H| e^(−εm)
This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε. If we want this probability to be below δ,
|H| e^(−εm) ≤ δ
then
m ≥ (1/ε) (ln |H| + ln(1/δ))
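To get a feel for this bound, you can plug in numbers. Below is a minimal sketch (plain Python; the hypothesis-class size is a made-up illustration) that computes the smallest m satisfying the bound.

import math

def pac_sample_bound(h_size, epsilon, delta):
    # Smallest m with m >= (1/eps) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**16 hypotheses, 5% error tolerance, 95% confidence.
print(pac_sample_bound(2**16, epsilon=0.05, delta=0.05))  # -> 282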
Agnostic Learning
So far, we assumed that c ∈ H. What if that is not the case? In the agnostic learning setting, we don't assume c ∈ H.

Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets. A set of instances S is shattered by hypothesis space H if and only if, for every dichotomy of S, there exists some hypothesis in H consistent with that dichotomy.
VC Dimension of Linear Decision Surfaces / VC Dimension: Another Example
[Figure-based slides; the figures are not reproduced here.]

Sample Complexity from the VC Dimension
How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 − δ)?
• A more expressive H needs more samples.
• More samples are needed for an H with more tunable parameters.

Mistake Bounds
So far we asked how many examples are needed to learn. What about how many mistakes the learner makes before converging?
• Instances are drawn at random from X according to distribution D.
• The learner must classify each instance before receiving the correct classification from the teacher.
Optimal Mistake Bounds / Mistake Bounds and VC Dimension
Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C and all possible training sequences), and let Opt(C) be the minimum of M_A(C) over all possible learning algorithms A. Littlestone (1987) showed:
VC(C) ≤ Opt(C) ≤ log2(|C|)
Multiple Classes
For K classes, train one hypothesis per class:
h_i(x^t) = 1 if x^t ∈ C_i; h_i(x^t) = 0 if x^t ∈ C_j, j ≠ i
Regression
X = {x^t, r^t}, t = 1, ..., N, with r^t ∈ ℝ and r^t = f(x^t) + ε
Linear model: g(x) = w1·x + w0
Quadratic model: g(x) = w2·x² + w1·x + w0
Empirical error: E(g | X) = (1/N) Σ_{t=1..N} [r^t − g(x^t)]²
For the linear model: E(w1, w0 | X) = (1/N) Σ_{t=1..N} [r^t − (w1·x^t + w0)]²

Model Selection & Generalization
⚫ Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution.
⚫ Hence the need for inductive bias: assumptions about H.
⚫ Generalization: how well a model performs on new data.
⚫ Overfitting: H more complex than C or f.
⚫ Underfitting: H less complex than C or f.
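Returning to the regression model above: as a concrete illustration of minimizing E(w1, w0 | X), the sketch below fits w1 and w0 in closed form with NumPy's least-squares routine; the toy data is made up for the example.

import numpy as np

# Toy training set: r^t = f(x^t) + noise, with f(x) = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Design matrix [x, 1]; least squares minimizes sum (r - (w1*x + w0))^2.
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, r, rcond=None)

g = lambda x_new: w1 * x_new + w0      # fitted model g(x)
E = np.mean((r - g(x)) ** 2)           # empirical error E(g | X)
print(w1, w0, E)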
Triple Trade-Off
⚫ There is a trade-off between three factors
(Dietterich, 2003):
1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
As N increases, E decreases.
As c(H) increases, E first decreases and then increases.
Cross-Validation
⚫ To estimate generalization error, we need data
unseen during training. We split the data as
⚫ Training set (50%)
⚫ Validation set (25%)
⚫ Test (publication) set (25%)
⚫ Resampling when there is little data
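A minimal sketch of the 50/25/25 split described above, using scikit-learn's train_test_split twice (the arrays X and y here are placeholders for any dataset):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)  # placeholder data

# Carve off 50% for training, then split the remainder evenly into
# a validation set (25%) and a test ("publication") set (25%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 50 25 25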
Dimensions of a Supervised Learner
1. Model: g(x | θ)
2. Loss function: E(θ | X) = Σ_t L(r^t, g(x^t | θ))
3. Optimization procedure: θ* = arg min_θ E(θ | X)
Binary Classification
Introduction
Given a collection of objects, say we have the task of classifying the objects into two groups based on some feature(s). For example, given some pens and pencils of different types and makes, we can easily separate them into two classes, namely pens and pencils. This seemingly trivial task, if we pause for a moment, involves a lot of underlying computation. The questions to ask are: what is our basis for classification, and how might we incorporate that principle to allow for efficient classification by 'intelligent algorithms'?
Now there are various paradigms that are used for learning binary classifiers
which include:
Decision Trees
Neural Networks
Bayesian Classification
Support Vector Machines
An example task from a course project: classifying reviews by sentiment. This worked as follows: using various data-mining software, each review in the dataset was converted into a document vector, akin to a 'bag of words' paradigm (since only the presence or absence of a word was considered, not the context). The words of the documents were stemmed, and a dataset vector was constructed consisting of all of the unique words appearing in the dataset. Each data point was thus a vector where each dimension corresponded to a unique word. The occurrence of words was encoded either as binary or using TF-IDF. An SVM classifier was then trained on the dataset, mapping the vectors and obtaining a hyperplane for separation. An accuracy of about 83% was obtained using this classifier!
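The pipeline described above maps onto a few lines of scikit-learn. The sketch below is a hedged reconstruction, not the original project code: the toy reviews stand in for the real dataset, and stemming is omitted (TfidfVectorizer only lowercases and tokenizes).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

reviews = ["a wonderful, moving film", "dull plot and terrible acting",
           "moving performances, wonderful script", "terrible, dull and boring"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

# Each review becomes a vector with one dimension per unique word (TF-IDF weights).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# A linear SVM learns a separating hyperplane in that word space.
clf = LinearSVC().fit(X, labels)
print(clf.predict(vectorizer.transform(["wonderful acting"])))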
Conclusion
We hope this illustration highlights the wide array of problems that can be tackled using machine learning algorithms. We would also like to stress the importance of learning in the context of AI: we would not consider a system to be truly intelligent if it were incapable of learning, since learning seems to be at the core of intelligence!
References
Theoretical Machine Learning, Rob Schapire, Princeton course archive
Wikipedia
CS674: Machine Learning and Knowledge Discovery course project
Linear vs. Non-Linear Classification
Shabeg Singh Gill
Last Updated: May 13, 2022
Introduction
We will be studying Linear Classification as well as Non-Linear Classification.
Linear Classification refers to categorizing a set of data points to a discrete class based on a linear combination of its
explanatory variables. On the other hand, Non-Linear Classification refers to separating those instances that are not
linearly separable.
Linear Classification
→ Linear Classification refers to categorizing a set of data points into a discrete class based on a linear
combination of its explanatory variables.
→ Some of the classifiers that use linear functions to separate classes are Linear Discriminant Classifier, Naive
Bayes, Logistic Regression, Perceptron, SVM (linear kernel).
→ In the figure above, we have two classes, namely 'O' and '+.' To differentiate between the two classes, an arbitrary line
is drawn, ensuring that both the classes are on distinct sides.
→ Since we can tell one class apart from the other, these classes are called ‘linearly-separable.’
→ However, an infinite number of lines can be drawn to distinguish the two classes.
→ The exact location of this plane/hyperplane depends on the type of the linear classifier.
Linear Discriminant Classifier (LDA)
→ In the above graph, we notice that a new axis is created which maximizes the distance between the means of the two classes.
→ However, the problem with LDA is that it would fail if the means of both classes are the same; we would then not be able to generate a new axis for differentiating the two.
Naive Bayes
→ It is based on the Bayes Theorem and lies in the domain of Supervised Machine Learning.
→ Every feature is considered equal and independent of the others during Classification.
→ Naive Bayes computes the likelihood of an event given prior evidence, i.e., the conditional probability:
P(A|B) = P(B|A) · P(A) / P(B), where A is event 1 (the class) and B is event 2 (the observed features).
However, in the case of the Naive Bayes classifier, we are concerned only with the maximum posterior
probability, so we ignore the denominator, i.e., the marginal likelihood. Argmax does not depend on the
normalization term.
Naive Bayes rests on two assumptions:
(i) Conditional Independence - All features are independent of each other. This implies that one feature does not affect the performance of another. This is the sole reason behind the 'Naive' in 'Naive Bayes.'
(ii) Feature Importance - All features are equally important. It is essential to know all the features to make good predictions and get the most accurate results.
→ Naive Bayes comes in three main variants: Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes.
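As a small illustration of the Gaussian variant on continuous features (a hedged sketch using scikit-learn and its built-in Iris data):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Gaussian Naive Bayes: each feature is modeled as an independent normal
# distribution per class; prediction picks the maximum posterior.
model = GaussianNB().fit(X, y)
print(model.predict(X[:3]))        # predicted classes
print(model.predict_proba(X[:3]))  # posterior probability per class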
Logistic Regression
→ The target variable can take only discrete values for a given set of features.
→ The model builds a regression model to predict the probability of a given data entry.
→ Similar to linear regression, logistic regression uses a linear function and, in addition, makes use of the 'sigmoid'
function.
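A small sketch of the sigmoid at work (plain Python; the coefficients are made up for illustration):

import math

def sigmoid(z):
    # Squashes the linear score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear function z = w1*x + w0, with w1 = 1.2 and w0 = -3.0.
for x in (0.0, 2.5, 5.0):
    print(x, sigmoid(1.2 * x - 3.0))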
Binomial - the target variable can take only two values. Example: '0' or '1'.
Multinomial - the target variable can take three or more unordered values. Example: 'Class A,' 'Class B,' and 'Class C.'
Ordinal - the target variable takes ordered values. Example: 'Very Good', 'Good', 'Average', 'Poor', 'Very Poor'.
Support Vector Machine (linear kernel)
→ This model finds a hyper-plane that creates a boundary between the various data types.
→ A binary classifier can be created for each class to perform multi-class classification.
→ In the case of SVM, the classifier with the highest score is chosen as the output of the SVM.
→ SVM works very well with linearly separable data but can work for non-linearly separable data as well.
Non-Linear Classification
→ Non-Linear Classification refers to categorizing those instances that are not linearly separable.
→ Some of the classifiers that use non-linear functions to separate classes are Quadratic Discriminant Classifier, Multi-
Layer Perceptron (MLP), Decision Trees, Random Forest, and K-Nearest Neighbours (KNN).
→ In the figure above, we have two classes, namely 'O' and 'X.' To differentiate between the two classes, it is impossible
to draw an arbitrary straight line to ensure that both the classes are on distinct sides.
→ We notice that even if we draw a straight line, there would be points of the first-class present between the data
points of the second class.
→ In such cases, piece-wise linear or non-linear classification boundaries are required to distinguish the two classes.
→ Now, let us visualize the decision boundaries of both LDA and QDA on the iris dataset. This would give us a clear
picture of the difference between the two.
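Since the figure is not reproduced here, a minimal sketch for generating the comparison yourself with scikit-learn's standard LDA and QDA estimators:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
X2 = X[:, :2]  # keep two features so the decision boundary is plottable

# LDA draws linear boundaries; QDA allows quadratic (curved) ones.
for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X2, y)
    print(type(model).__name__, "training accuracy:", model.score(X2, y))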
Multi-Layer Perceptron (MLP)
→ An MLP is a collection of fully connected dense layers. These help transform any given input dimension into the desired dimension.
→ It is, simply put, a neural network.
→ An MLP consists of one input layer (one node per input), one output layer (one node per output), and one or more hidden layers (each with one or more nodes).
→ In the above diagram, we notice three inputs, resulting in 3 nodes belonging to each input.
→ Overall, the nodes belonging to the input layer forward their outputs to the nodes present in the hidden layer. Once
this is done, the hidden layer processes the information passed on to it and then further passes it on to the output layer.
Decision Tree
→ An instance is classified by starting at the tree's root node, testing the attribute specified by this node, then moving
down the tree branch corresponding to the attribute's value, as shown in the above figure.
→ The process is repeated based on each derived subset in a recursive partitioning manner.
→ The above decision tree helps determine whether the person is fit or not.
K-Nearest Neighbours
→ KNN is a supervised machine learning algorithm used for classification problems. Since it is supervised, it uses labeled data to make predictions.
→ KNN analyzes the 'k' nearest data points and then classifies the new data based on the same.
→ In detail, to label a new point, the KNN algorithm analyzes the ‘k’ nearest neighbors or ‘k’ nearest data points to the
new point. It chooses the label of the new point as the one to which the majority of the ‘k’ nearest neighbors belong to.
→ It is essential to choose an appropriate value of 'K' to avoid overfitting our model.
Comparison
Now, we will briefly sum up all that we’ve learned and try to compare and contrast Linear Classification and Non-
Linear Classification.
1. Linear Classification: it is possible to classify the data with a straight line. Non-Linear Classification: it is not easy to classify the data with a straight line.
Key Takeaways
Congratulations on making it this far. This blog discussed a fundamental overview of Linear Classification and Non-Linear Classification and the differences between the two!
We learned about linear classifiers such as the Linear Discriminant Classifier, Naive Bayes, Logistic Regression, and Support Vector Machines, and non-linear classifiers such as the Quadratic Discriminant Classifier, Multi-Layer Perceptron, Decision Trees, Random Forests, and K-Nearest Neighbours.
What is Multiclass Classification in Machine Learning (Great Learning)
Binary, as the name suggests, has two categories in the dependent column.
Multiclass refers to columns with more than two categories in it.
The picture above is taken from the Iris dataset and depicts that the target variable has three categories, i.e., Virginica, Setosa, and Versicolor, three species of the Iris plant. We might use this dataset later as an example for a conceptual understanding of multiclass classification.
Naive Bayes
Naïve Bayes can also be an extremely good text classifier, performing well on datasets such as the spam/ham dataset.
By P(A|B), we are trying to find the probability of event A given that event B is true. It is also known as the posterior probability.
Note: Naïve Bayes is a linear classifier, which might not be suitable when the classes in a dataset are not linearly separable. Let us look at the figure below:
As can be seen in Fig. 2b, classifiers such as KNN can be used for non-linear classification instead of the Naïve Bayes classifier.
How does it work?
The K-nearest neighbor algorithm forms a majority vote between the K most similar instances, using a distance metric between two data points to define similarity. The most popular choice is the Euclidean distance, written as:
d(x, y) = √( Σ_i (x_i − y_i)² )
K in KNN is a hyperparameter that we choose to get the best possible fit for the dataset. If we keep the smallest value for K, i.e., K=1, the model will show low bias but high variance, because it will be overfitted. Whereas a larger value for K, let's suppose K=10, will surely smoothen our decision boundary, which means low variance but high bias. So we always go for a trade-off between bias and variance, known as the bias-variance trade-off.
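A quick way to see this trade-off in practice (a hedged sketch; the dataset and split are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# K=1 tends to overfit (low bias, high variance); a larger K smooths the
# decision boundary (higher bias, lower variance).
for k in (1, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print("K =", k, "train:", knn.score(X_tr, y_tr), "test:", knn.score(X_te, y_te))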
Disadvantages:
The K value is difficult to find, as it must work well with the test data too, not only with the training data.
Decision Trees
As the name suggests, the decision tree is a tree-like structure of decisions made based on some conditional statements. This is one of the most used supervised learning algorithms, thanks to its simplicity and easy interpretation. Decision trees can map linear as well as non-linear relationships in a good way.
Let us look at Fig. 3 below, where we have used the adult census income dataset with two independent variables and one dependent variable. Our target or dependent variable is income, which has binary classes, i.e., <=50K or >50K.
We can see that the algorithm works based on some conditions, such as Age <50
and Hours>=40, to further split into two buckets for reaching towards homogeneity.
Similarly, we can move ahead for multiclass classification problem datasets, such
as Iris data.
Now a question arises: how should we decide which column to take first, and what is the threshold for splitting? For splitting a node and deciding the threshold, we use entropy or the Gini index as measures of the impurity of a node. We aim to maximize the purity or homogeneity on each split, as we saw in Fig. 2.
What is Entropy?
Entropy or Shannon entropy is the measure of uncertainty, which has a similar
sense as in thermodynamics. By entropy, we talk about a lack of information. To
understand better, let us suppose we have a bag full of red and green balls.
If you are asked to take one ball out of it, what is the probability that the ball will be green?
Here we all know there is a 50% chance that the ball we pick will be green. In the second and third scenarios shown in the figure, there is high certainty of picking a green ball first, i.e., low entropy. But in the first scenario there is high uncertainty, or high entropy.
Entropy ∝ Uncertainty
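The bag-of-balls intuition corresponds to Shannon's formula H = −Σ p_i · log2(p_i); a short sketch in plain Python:

import math

def entropy(probs):
    # Shannon entropy in bits; 0 * log(0) is treated as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 50/50 red-green bag -> 1.0 (maximum uncertainty)
print(entropy([0.9, 0.1]))  # mostly green bag    -> ~0.47
print(entropy([1.0]))       # all green           -> 0.0 (no uncertainty)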
Logistic Regression
Logistic regression is, by default, used for binary classification problems. But here, we will learn how we can extend this algorithm for classifying multiclass data. In binary classification we have 0 or 1 as our classes, and the threshold for a balanced binary classification dataset is generally 0.5. Whereas, in multiclass, there can be 3 balanced classes, for which we require 2 threshold values, which can be 0.33 and 0.66. But a question arises: by what method do we calculate the thresholds and approach multiclass classification? Let's first see the general formula that we use for the logistic regression curve:
P = 1 / (1 + e^−(b0 + b1·x))
where P is the probability of the event occurring; the above equation derives from the log-odds form:
ln(P / (1 − P)) = b0 + b1·x
There are two ways to approach this kind of problem; they are explained below, with a code sketch after the two approaches:
One vs. Rest (OvR)– Here, one class is considered as positive, and rest all are taken
as negatives, and then we generate n-classifiers. Let us suppose there are 3
classes in a dataset, therefore in this approach, it trains 3-classifiers by taking one
class at a time as positive and rest two classes as negative. Now, each classifier
predicts the probability of a particular class and the class with the highest
probability is the answer.
One vs. One (OvO)– In this approach, n ∗ (n − 1)⁄2 binary classifier models are
generated. Here each classifier predicts one class label. Once we input test data to
the classifier, the class which has been predicted the most is chosen as the
answer.
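Both strategies are available as scikit-learn wrappers; a minimal sketch on Iris (3 classes, so OvR trains 3 classifiers and OvO trains 3·2/2 = 3):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # one classifier per class vs. the rest
print(len(ovo.estimators_))  # one classifier per pair of classes
print(ovr.predict(X[:2]), ovo.predict(X[:2]))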
Confusion Matrix
A confusion matrix is a table that is used to describe the performance of a model on test data.
Let's take an example to get a better idea of the confusion matrix in multiclass classification, using the Iris dataset which we have already seen above.
Precision for the Virginica class is the number of correctly predicted Virginica species out of all predicted Virginica species, which is 4/7 = 57.1%. This means that only 4/7 of the species that our predictor classifies as Virginica are actually Virginica. Similarly, for the other species, Setosa and Versicolor, precision is 20% and 62.5% respectively.
Whereas recall for the Virginica class is the number of correctly predicted Virginica species out of the actual Virginica species, which is 50%. This means that our classifier classified half of the Virginica species as Virginica. Similarly, for Setosa and Versicolor, recall is 20% and 71.4% respectively.
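These per-class numbers come straight out of the confusion matrix; a hedged sketch with made-up label vectors (not the exact counts from the article's figure):

from sklearn.metrics import classification_report

# Hypothetical true vs. predicted species labels.
y_true = ["virginica"] * 8 + ["setosa"] * 5 + ["versicolor"] * 7
y_pred = (["virginica"] * 4 + ["setosa"] * 2 + ["versicolor"] * 2    # true virginica
          + ["setosa"] * 1 + ["virginica"] * 2 + ["versicolor"] * 2  # true setosa
          + ["versicolor"] * 5 + ["virginica"] * 1 + ["setosa"] * 1) # true versicolor
print(classification_report(y_true, y_pred))  # precision/recall per class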
Multiclass Vs Multi-label
People often get confused between multiclass and multi-label classification. But
these two terms are very different and cannot be used interchangeably. We have
already understood what multiclass is all about. Let’s discuss in brief how multi-
label is different from multiclass.
Multi-label refers to a data point that may belong to more than one class. For
example, you wish to watch a movie with your friends but you have a different
choice of genres that you all enjoy. Some of your friends like comedy and others
are more into action and thrill. Therefore, you search for a movie that fulfills both
the requirements and here, your movie is supposed to have multiple labels.
Whereas, in multiclass or binary classification, your data point can belong to only a
single class. Some more examples of multi-label datasets could be protein classification in the human body, or music categorization according to genres.
I hope this article has provided you with some fair conceptual knowledge. Don’t
stop here, remember that there are many more ways to classify your data. All that
is important is how you polish your basics to create and implement more
algorithms. Let us conclude by looking at what Professor Pedro Domingos said:
“Machine learning will not single-handedly determine the future, any more than
any other technology; it’s what we decide to do with it that counts, and now you
have the tools to decide.”
Classification Algorithms | Types of Classification Algorithms (Edureka)
Upasana, Research Analyst
The idea of classification algorithms is pretty simple. You predict the target class by analyzing the training dataset. This is one of the most, if not the most, essential concepts you study when you learn data science.
Analysis of the customer data to predict whether he will buy computer accessories (Target class: Yes or No)
Classifying fruits from features like color, taste, size, weight (Target classes: Apple, Orange, Cherry, Banana)
Gender classification from hair length (Target classes: Male or Female)
Let's understand the concept of classification algorithms with gender classification using hair length (by no means am I trying to stereotype by gender; this is only an example). To classify gender (target class) using hair length as the feature parameter, we could train a model using any classification algorithm to come up with some set of boundary conditions which can be used to differentiate male and female genders using hair length as the training feature. In the gender classification case, the boundary condition could be the proper hair length value. Suppose the differentiating boundary hair length value is 15.0 cm; then we can say that if hair length is less than 15.0 cm, the gender could be male, or else female.
While grouping similar language-type documents (same-language documents are one group).
While categorizing news articles (same news category (e.g., sport) articles are one group).
Let's understand the concept with the example of clustering genders based on hair length. To determine gender, different similarity measures could be used to categorize male and female genders, by finding the similarity between two hair lengths and grouping them, continuing until all the hair lengths are properly grouped into two categories.
Classifier: An algorithm that maps the input data to a specific category.
Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for new data.
Feature: A feature is an individual measurable property of a phenomenon being observed.
Binary classification: A classification task with two possible outcomes. E.g., gender classification (male/female).
Multi-class classification: Classification with more than two classes. In multi-class classification, each sample is assigned one and only one target label. E.g., an animal can be a cat or a dog but not both at the same time.
Multi-label classification: A classification task where each sample is mapped to a set of target labels (more than one class). E.g., a news article can be about sports, a person, and a location at the same time.
Logistic Regression
As confusing as the name might be, you can rest assured: Logistic Regression is a classification and not a regression algorithm. It estimates discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). Simply put, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. The values obtained always lie within 0 and 1, since it predicts a probability.
Let's say there's a problem on your math test. It can only have 2 outcomes, right? Either you solve it or you don't (and let's not assume points for method here). Now imagine that you are being given a wide range of problems in an attempt to understand which chapters you have understood well. The outcome of this study would be something like this: if you are given a trigonometry-based problem, you are 70% likely to solve it. On the other hand, if it is an arithmetic problem, the probability of you getting an answer is only 30%. This is what Logistic Regression provides you.
If I had to do the math, I would model the log odds of the outcome as a linear combination of the predictor variables:
odds = p / (1 − p) = probability of event occurrence / probability of event non-occurrence
ln(odds) = ln(p / (1 − p))
logit(p) = ln(p / (1 − p)) = b0 + b1·X1 + b2·X2 + b3·X3 + ... + bk·Xk
In the equation given above, p is the probability of the presence of the characteristic of interest. The model chooses parameters that maximize the likelihood of observing the sample values, rather than minimizing the sum of squared errors (as in ordinary regression).
Now, a lot of you might wonder: why take a log? For the sake of simplicity, let's just say that this is one of the best mathematical ways to replicate a step function. I could go way more in-depth with this, but that would defeat the purpose of this blog.
R-Code:
x <- cbind(x_train, y_train)
# Train the model using the training set and check the score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
# Predict output (type = "response" returns probabilities rather than log-odds)
predicted <- predict(logistic, x_test, type = "response")
There are many different steps that could be tried in order to improve the model, for example using a non-linear model.
Decision Tree
What this algorithm does is split the population into two or more homogeneous sets based on the most significant attributes.
In the image above, you can see that the population is classified into four different groups based on multiple attributes, to identify whether 'they will play or not'.
R-Code:
library(rpart)
x <- cbind(x_train, y_train)
# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
Naive Bayes
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.
Building a Bayesian model is simple and particularly functional in the case of enormous data sets. Along with simplicity, Naive Bayes is known to outperform sophisticated classification methods as well.
The classifier is based on Bayes' theorem: P(c | x) = P(x | c) · P(c) / P(x), where c is the class and x is the observed feature(s).
Example: Let's work through an example to understand this better. Here we have a training data set of weather, namely sunny, overcast, and rainy, and a corresponding binary variable 'Play'. Now, we need to classify whether players will play or not based on the weather condition. Let's follow the steps below to perform it.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) · P(Yes) / P(Sunny). Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probabilities of different classes based on various attributes. This algorithm is mostly used in text classification and for problems having multiple classes.
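The same arithmetic in a few lines of Python, using the counts from the worked example above:

# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9   # 3 of the 9 "Yes" days are sunny
p_yes = 9 / 14              # 9 of the 14 days are "Yes"
p_sunny = 5 / 14            # 5 of the 14 days are sunny

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6 -> players are likely to play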
R-Code:
library(e1071)
x <- cbind(x_train, y_train)
# Fitting the model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
K-Nearest Neighbours (KNN)
K-nearest neighbours is a simple algorithm used for both classification and regression problems. It stores all available cases and classifies new cases by a majority vote of its k neighbours. The case assigned to a class is the one most common amongst its k nearest neighbours, measured by a distance function.
While the first three distance functions (for continuous variables, e.g. Euclidean distance) are used for continuous variables, the Hamming distance function is used for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbour. At times, choosing K turns out to be a challenge while performing kNN modeling.
You can understand KNN easily by taking an example from our real lives. If you have a crush on a girl/boy in class, of whom you have no information, you might want to talk to their friends and social circles to gain access to their information!
R-Code:
library(class)
# knn() from the 'class' package takes the training set, test set, and
# training labels directly (there is no formula interface), and it
# returns the predicted labels itself, so no separate predict() is needed.
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
Things to consider before selecting KNN:
Variables should be normalized, or else higher-range variables can bias the algorithm.
Work on the pre-processing stage more before going for kNN, e.g., outlier and noise removal.
Support Vector Machine (SVM)
In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features, like the height and hair length of an individual, we'd first plot these two variables in two-dimensional space, where each point has two coordinates (these coordinates are known as support vectors).
Now, we will find some line that splits the data between the two differently classified groups of data. This will be the line such that the distances from the closest point in each of the two groups are farthest away from it.
In the example shown above, the line which splits the data into two differently classified groups is the blue line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data.
R-Code:
library(e1071)
x <- cbind(x_train, y_train)
# Fitting the model
fit <- svm(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
So, with this, we come to the end of this classification algorithms blog. Try out the simple R code on your own systems now, and you'll no longer call yourselves newbies in this concept.
Decision Tree: Classification and Regression Algorithm (CART, ID3), by Pradeep Dhote (Medium)
Gini index: based on the sum of squares of the probabilities of success and failure (p² + q²); the Gini impurity is 1 − (p² + q²). The node (feature) for which the Gini impurity is least is selected as the root node to split.
Let's start with the weather dataset, which is quite famous for explaining the decision tree algorithm, where the target is to predict play or not (Yes or No) based on the weather condition.
From the data: outlook, temperature, humidity, and wind are the features.
Now, we will calculate the weighted sum of the Gini index for the outlook feature.
Similarly, temperature is also a nominal feature; it can take three values: hot, cold, and mild. Let's summarize the final decision of the temperature feature.
Now, the weighted sum of the Gini index for the temperature feature can be calculated as,
Humidity is a binary-class feature; it can take two values, high and normal.
Now, the weighted sum of the Gini index for the humidity feature can be calculated as,
Wind is a binary-class feature; it can take two values, weak and strong.
Now, the weighted sum of the Gini index for the wind feature can be calculated as,
From the table, you can see that the Gini index for the outlook feature is the lowest. So we get our root node.
Now, let's focus on the sub-data for the sunny outlook feature. We need to find the Gini index for the temperature, humidity, and wind features respectively.
Now, the weighted sum of Gini index for temperature on sunny outlook features can be
calculated as,
Now, the weighted sum of Gini index for humidity on sunny outlook features can be
calculated as,
Now, the weighted sum of Gini index for wind on sunny outlook features can be
calculated as,
We have calculated the Gini index of all the features for when the outlook is sunny. You can infer that humidity has the lowest value, so the next node will be humidity.
Now, let's focus on the sub-data for the overcast outlook feature, and continue calculating in the same manner until the whole dataset is split.
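A sketch of the weighted Gini computation used throughout this walk-through (plain Python; the counts are the standard weather-data splits for outlook: sunny 2 Yes / 3 No, overcast 4 / 0, rain 3 / 2):

def gini(counts):
    # Gini impurity of one node: 1 - sum(p_i^2).
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(splits):
    # Weighted sum of the children's Gini impurities for one candidate split.
    total = sum(sum(node) for node in splits)
    return sum(sum(node) / total * gini(node) for node in splits)

print(round(weighted_gini([(2, 3), (4, 0), (3, 2)]), 3))  # outlook -> ~0.343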
ID3 (entropy and information gain):
Entropy varies from 0 to 1: 0 if all the data belong to a single class, and 1 if the class distribution is equal.
1. Calculate the entropy of the dataset.
2. Calculate the information gain for each attribute.
3. Pick the attribute with the highest information gain.
4. Repeat until we get the desired tree.
We will take the same weather dataset we used for explaining the CART algorithm above. Number of observations = 14.
Gain(Decision, temperature) = 0.151
So, outlook has the highest information gain, so it is selected as the first node / root node.
We can see that the information gain of the 'Humidity' attribute is higher than the others, so it is the next node under the sunny outlook branch.
From both tables, you can infer that whenever humidity is high, the decision is 'No'.
Now, let's focus on the other sub-data features, and continue calculating in the same manner until the whole dataset is split.
C4.5:
C4.5 is an extension of the ID3 algorithm, and better than ID3 in that it deals with both continuous and discrete values. It is also used for classification purposes. The algorithm can create more generalized models, including continuous data, and can handle missing data.
Decision rules are found based on the entropy and information-gain-ratio pair of each feature. At each level of the decision tree, the feature having the maximum gain ratio becomes the decision rule.
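C4.5's gain ratio divides the information gain by the split's own entropy ("split info"), penalizing attributes with many values; a minimal sketch (plain Python; the counts reuse the weather data's outlook split):

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    # child_counts holds the class counts per branch, e.g. [(2, 3), (4, 0), (3, 2)].
    n = sum(parent_counts)
    gain = entropy(parent_counts) - sum(
        sum(ch) / n * entropy(ch) for ch in child_counts)
    split_info = entropy([sum(ch) for ch in child_counts])
    return gain / split_info

print(round(gain_ratio((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))  # ~0.156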
Classification in Machine Learning
Classification in machine learning and statistics is a supervised learning approach in which the computer program learns from the data given to it and uses this learning to classify new observations. In this article, we will learn about classification in machine learning in detail.
Classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which class/category the new data will fall into.
Heart disease detection can be framed as a classification problem; this is a binary classification, since there can be only two classes: having heart disease or not having heart disease. The classifier, in this case, needs training data to understand how the given input variables are related to the class. And once the classifier is trained accurately, it can be used to detect whether heart disease is there or not for a particular patient.
Since classification is a type of supervised learning, the targets are also provided with the input data. Let us get familiar with the classification terminology.
Classification Model – The model predicts or draws a conclusion from the input data given for training; it will predict the class or category for the data.
Binary Classification – It is a type of classification with two outcomes, for eg – either true or false.
Multi-Class Classification – The classification with more than two classes, in multi-class classification each sample is assigned to one and o
Multi-label Classification – This is a type of classification where each sample is assigned to a set of labels or targets.
Train the Classifier – Each classifier in sci-kit learn uses the fit(X, y) method to fit the model for training the train X and train label y.
Predict the Target – For an unlabeled observation X, the predict(X) method returns predicted label y.
Evaluate – This basically means the evaluation of the model i.e classification report, accuracy score, etc.
Types Of Learners In Classification
Lazy Learners – Lazy learners simply store the training data and wait until testing data appears. Classification is then done using the most related data in the stored training data.
Logistic Regression: Advantages and Disadvantages
Logistic regression is specifically meant for classification; it is useful in understanding how a set of independent variables affects the outcome of the dependent variable.
The main disadvantage of the logistic regression algorithm is that it only works when the predicted variable is binary; it also assumes that the predictors are independent of each other.
Use Cases
Word classification
Weather Prediction
Voting Applications
Naive Bayes
Even if the features depend on each other, all of these properties contribute to the probability independently. The Naive Bayes model is easy to build and particularly useful for comparatively large data sets. Even with a simplistic approach, Naive Bayes is known to outperform most classification methods in machine learning. It is based on Bayes' theorem.
The Naive Bayes classifier requires a small amount of training data to estimate the necessary parameters to get the results. Naive Bayes classifiers are extremely fast compared to more sophisticated methods. The only disadvantage is that they are known to be bad estimators.
Use Cases
Disease Predictions
Document Classification
Spam Filters
Sentiment Analysis
Stochastic Gradient Descent
Stochastic gradient descent is a very effective and simple approach to fitting linear models. It is particularly useful when the sample data is large in number, and it supports different loss functions and penalties for classification.
Use case: updating parameters such as the weights in neural networks or the coefficients in linear regression.
K-Nearest Neighbor
K-nearest neighbor is a lazy learning algorithm that stores all instances of the training data in n-dimensional space. It is "lazy" in the sense that it does not focus on constructing a general internal model; instead, it works by storing instances of the training data.
Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is supervised and takes a bunch of labeled points, using them to learn how to label other points. To label a new point, it looks at the labeled points closest to that new point, also known as its nearest neighbors. It has those neighbors vote, so whichever label most of the neighbors have becomes the label for the new point. The "k" is the number of neighbors it checks.
This algorithm is quite simple in its implementation and is robust to noisy training data. Even if the training data is large, it is quite efficient. The disadvantages are that the value of K must be determined, and the computation cost is pretty high compared to other algorithms.
Use Cases
Image recognition
Video recognition
Stock analysis
Decision Tree
The decision tree algorithm builds the classification model in the form of a tree structure. It utilizes if-then rules which are equally exhaustive and mutually exclusive in classification.
A decision tree has the advantage of being simple to understand and visualize, and it requires very little data preparation as well. The disadvantage is that it can create complex trees that may not categorize efficiently. Decision trees can be quite unstable, because even a simplistic change in the data can hinder the whole structure of the tree.
Use Cases
Data exploration
Pattern Recognition
Random Forest
Random decision trees, or random forest, are an ensemble learning method for classification, regression, etc. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
A random forest is a meta-estimator that fits a number of trees on various subsamples of the data set and then uses an average to improve the predictive accuracy of the model. The subsample size is always the same as that of the original input, but the samples are often drawn with replacement.
The advantage of the random forest is that it is more accurate than decision trees, due to the reduction in overfitting. The only disadvantage is that it is complex in implementation and gets pretty slow in real-time prediction.
Use Cases
Performance scores
Artificial Neural Networks
A neural network consists of neurons arranged in layers; they take some input vector and convert it into an output. The process involves each neuron taking the input, applying a function (often a non-linear function) to it, and then passing the output to the next layer.
Support Vector Machine
The support vector machine is a classifier that represents the training data as points in space, separated into categories by a gap as wide as possible. New points are then added to the space and classified by predicting which category they fall into and which space they will belong to.
It uses a subset of the training points in the decision function, which makes it memory-efficient, and it is highly effective in high-dimensional spaces. The disadvantage is that the algorithm does not directly provide probability estimates.
Classifier Evaluation
The most important part after the completion of any classifier is the evaluation, to check its accuracy and efficiency. There are a lot of ways in which a classifier can be evaluated; some of these methods are listed below.
Holdout Method
This is the most common method to evaluate a classifier. In this method, the given data set is divided into two parts, a test and a train set, typically 20% and 80% respectively. The train set is used to train the model, and the unseen test set is used to test its predictive power.
Accuracy
Accuracy is the ratio of correctly predicted observations to the total number of observations.
F1-Score
The F1-score is the harmonic mean of precision (the fraction of predicted positives that are truly positive) and recall (the fraction of true positives that are predicted positive).
ROC Curve
The receiver operating characteristics (ROC) curve is used for visual comparison of classification models; it shows the relationship between the true positive rate and the false positive rate. The area under the ROC curve is a measure of the accuracy of the model.
Algorithm Selection
Apart from the above approach, we can follow the steps below to select the best algorithm for the model.
from sklearn.datasets import fetch_openml

# as_frame=False returns plain NumPy arrays, which the reshaping below assumes
mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist)
Output:
import matplotlib.pyplot as plt

X, y = mnist['data'], mnist['target']
random_digit = X[4800]
random_digit_image = random_digit.reshape(28, 28)
# "binary" is the grayscale colormap matplotlib.cm.binary
plt.imshow(random_digit_image, cmap="binary", interpolation="nearest")
plt.show()
Output:
We are using the first 6000 entries as the training data; the dataset is as large as 70000 entries. You can check this using the shape of X and y. So we have taken 6000 entries as the training set and 1000 entries as the test set.
To avoid unwanted errors, we shuffle the data using a numpy permutation. It basically improves the efficiency of the model.
import numpy as np

# Slice off the train/test sets described above (this split step was
# implied by the text): 6000 training entries, 1000 test entries.
x_train, x_test = X[:6000], X[6000:7000]
y_train, y_test = y[:6000], y[6000:7000]

shuffle_index = np.random.permutation(6000)
x_train, y_train = x_train[shuffle_index], y_train[shuffle_index]
y_train = y_train.astype(np.int8)
y_test = y_test.astype(np.int8)
y_train_2 = (y_train == 2)   # binary target: is the digit a 2?
y_test_2 = (y_test == 2)
print(y_test_2)
Cross-Validation
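The cross-validation code itself did not survive extraction; below is a hedged reconstruction of what such a comparison could look like, reusing the x_train and y_train_2 variables built above (LogisticRegression and LinearSVC stand in for the two classifiers the text compares):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Compare two binary "is it a 2?" classifiers with 3-fold cross-validation.
log_clf = LogisticRegression(max_iter=1000)
svm_clf = LinearSVC()

print("logistic:", cross_val_score(log_clf, x_train, y_train_2, cv=3, scoring="accuracy"))
print("svm:", cross_val_score(svm_clf, x_train, y_train_2, cv=3, scoring="accuracy"))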
In the above example, we were able to make a digit predictor. Since we were predicting whether the digit was a 2 out of all the entries in the data, we got a binary ('2' vs. not-'2') classifier; the comparison shows much better accuracy with the logistic regression classifier than with the support vector machine classifier.
This brings us to the end of this article, where we have learned about classification in machine learning. I hope you are clear with all that has been shared here.