Chapter 5 - Machine Learning

Machine learning
Top 10 Machine Learning Algorithms For Beginners: Supervised, Unsupervised
Learning and More
Many different types of machine learning algorithms have been designed to
help solve complex real-world problems. ML algorithms are automated and
self-modifying, so they continue to improve over time. Before we delve into
the top 10 machine learning algorithms you should know, let's take a look at
the different types of machine learning algorithms and how they are
classified.
Machine learning algorithms are classified into four types:
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
Each of these four types of ML algorithms is further divided into more
specific categories.
What Are The 10 Popular Machine Learning Algorithms?
Below is the list of Top 10 commonly used Machine Learning (ML) Algorithms:
• Linear regression
• Logistic regression
• Decision tree
• SVM algorithm
• Naive Bayes algorithm
• KNN algorithm
• K-means
• Random forest algorithm
• Dimensionality reduction algorithms
• Gradient boosting algorithm and AdaBoosting algorithm
How Learning These Vital Algorithms Can Enhance Your Skills in Machine
Learning
The three most popular types of machine learning algorithms are supervised
learning, unsupervised learning, and reinforcement learning. The 10 common
machine learning algorithms in this list are mostly supervised or
unsupervised techniques:
List of Popular Machine Learning Algorithms
1. Linear Regression
To understand how linear regression works, imagine how you would arrange
random logs of wood in increasing order of their weight. There is a catch,
however: you cannot weigh each log. You have to guess its weight just by
looking at the height and girth of the log (visual analysis) and arrange the
logs using a combination of these visible parameters. This is what linear
regression in machine learning is like.
In this process, a relationship is established between independent and
dependent variables by fitting them to a line. This line is known as the
regression line and is represented by the linear equation Y = a * X + b.
In this equation:
• Y – Dependent variable
• a – Slope
• X – Independent variable
• b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared
vertical distances between the data points and the regression line.
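As a rough illustration of how a and b can be derived, the sketch below uses the standard closed-form least-squares solution for a single independent variable. The function name fit_line and the sample numbers are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Return (a, b) for Y = a * X + b, minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: log girth (X) and log weight (Y).
girths = [1.0, 2.0, 3.0, 4.0]
weights = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(girths, weights)
print(f"Y = {a:.2f} * X + {b:.2f}")
```

With more than one independent variable the same idea applies, but the solution is usually written in matrix form or found iteratively.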
2. Logistic Regression
Logistic Regression is used to estimate discrete values (usually binary values
like 0/1) from a set of independent variables. It helps predict the probability of
an event by fitting data to a logit function. It is also called logit regression.
The methods listed below are often used to help improve logistic regression
models:
• include interaction terms
• eliminate features
• use regularization techniques
• use a non-linear model
3. Decision Tree
The Decision Tree algorithm is one of the most popular machine learning
algorithms in use today. It is a supervised learning algorithm that is used for
classification problems. It works well in classifying both categorical and
continuous dependent variables. This algorithm divides the population into
two or more homogeneous sets based on the most significant attributes/
independent variables.
4. SVM (Support Vector Machine) Algorithm
The SVM algorithm is a classification method in which you plot raw data as
points in an n-dimensional space (where n is the number of features you
have). The value of each feature is then tied to a particular coordinate,
making it easy to classify the data. Lines (or, in more than two dimensions,
hyperplanes) that act as classifiers can then be used to split the data into
groups.
5. Naive Bayes Algorithm
A Naive Bayes classifier assumes that the presence of a particular feature in a
class is unrelated to the presence of any other feature.
Even if these features are related to each other, a Naive Bayes classifier would
consider all of these properties independently when calculating the probability
of a particular outcome.
A Naive Bayesian model is easy to build and useful for massive datasets. It's
simple and can sometimes outperform even highly sophisticated classification
methods.
6. KNN (K- Nearest Neighbors) Algorithm
This algorithm can be applied to both classification and regression problems.
In practice, within the data science industry, it is more widely used to solve
classification problems. It's a simple algorithm that stores all available cases
and classifies any new case by taking a majority vote of its k nearest
neighbors. The case is then assigned to the class with which it has the most
in common. A distance function performs this measurement.
KNN can be easily understood by comparing it to real life. For example, if you
want information about a person, it makes sense to talk to his or her friends
and colleagues!
Things to consider before selecting the K Nearest Neighbours algorithm:
• KNN is computationally expensive
• Variables should be normalized, or else higher range variables can
bias the algorithm
• Data still needs to be pre-processed.
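The majority vote described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance as the distance function; the name knn_predict and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    # Measure the distance from the query to every stored case (the expensive part).
    distances = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    # Count the labels of the k closest cases and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalized training data with two classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
```

Every prediction scans the whole training set, which is why the first consideration in the list above is computational cost.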
7. K-Means
K-means is an unsupervised learning algorithm that solves clustering problems.
Data sets are partitioned into a particular number of clusters (let's call that
number K) in such a way that the data points within a cluster are homogeneous,
and heterogeneous with respect to the data in other clusters.
How K-means forms clusters:
• The K-means algorithm picks K points, called centroids, one for
each cluster.
• Each data point forms a cluster with the closest centroid, giving K
clusters.
• It now creates new centroids based on the existing cluster members.
• With these new centroids, the closest centroid for each data point is
determined again. This process is repeated until the centroids do not
change.
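The four steps above can be sketched as follows. This is a simplified illustration, assuming Euclidean distance and initial centroids drawn at random from the data; the function name k_means and its parameters max_iters and seed are choices made for this example.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups; return (centroids, clusters)."""
    rng = random.Random(seed)
    # Step 1: pick k starting centroids from the data.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(data, k=2)
print(centroids)
```

The result can depend on the starting centroids, so practical implementations often run the algorithm several times with different starts.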
8. Random Forest Algorithm
A collection of decision trees is called a Random Forest. To classify a new
object based on its attributes, each tree gives a classification, and we say
the tree “votes” for that class. The forest chooses the classification having
the most votes (over all the trees in the forest).
Each tree is planted and grown as follows:
• If the number of cases in the training set is N, then a sample of N
cases is taken at random, with replacement. This sample will be the
training set for growing the tree.
• If there are M input variables, a number m<<M is specified such that
at each node, m variables are selected at random out of the M, and
the best split on these m is used to split the node. The value of m is
held constant during this process.
• Each tree is grown to the largest extent possible. There is no
pruning.
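The two sources of randomness and the final vote can be sketched without building actual trees. The helper names below are hypothetical, and the sketch assumes the N cases are drawn with replacement (a bootstrap sample).

```python
import random
from collections import Counter


def bootstrap_sample(cases, rng):
    """Draw N cases at random, with replacement, from a training set of size N."""
    n = len(cases)
    return [cases[rng.randrange(n)] for _ in range(n)]


def random_feature_subset(num_features, m, rng):
    """Pick m distinct feature indices out of M to consider at one node."""
    return rng.sample(range(num_features), m)


def majority_vote(tree_predictions):
    """Return the class that received the most votes across the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
training_set = list(range(10))                      # hypothetical case ids
print(bootstrap_sample(training_set, rng))          # some ids repeat, some are missing
print(random_feature_subset(num_features=8, m=3, rng=rng))
print(majority_vote(["spam", "ham", "spam"]))
```

A full implementation would grow one unpruned decision tree per bootstrap sample, calling random_feature_subset at every node.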
9. Dimensionality Reduction Algorithms
In today's world, vast amounts of data are being stored and analyzed by
corporations, government agencies, and research organizations. As a data
scientist, you know that this raw data contains a lot of information - the
challenge is to identify significant patterns and variables.
Dimensionality reduction algorithms like Decision Tree, Factor Analysis,
Missing Value Ratio, and Random Forest can help you find relevant details.
10. Gradient Boosting Algorithm and AdaBoosting Algorithm
Gradient Boosting and AdaBoost are boosting algorithms used when massive
loads of data have to be handled to make predictions with high accuracy.
Boosting is an ensemble learning technique that combines the predictive power
of several base estimators to improve robustness.
In short, it combines multiple weak or average predictors to build a strong
predictor. These boosting algorithms often perform well in data science
competitions like Kaggle, AV Hackathon, and CrowdAnalytix, and they are among
the most preferred machine learning algorithms today. Use them, with Python
or R code, to achieve accurate outcomes.
Techniques we can perform through machine learning:
Among the different types of ML tasks, a crucial distinction is drawn between
supervised and unsupervised learning:
• Supervised machine learning is when the program is “trained” on
a predefined set of “training examples,” which then facilitate its
ability to reach an accurate conclusion when given new data. This
deals with labelled data.
• Unsupervised machine learning is when the program is given a
bunch of data and must find patterns and relationships therein.
This deals with unlabelled data.
For information purposes only (optional)
Supervised Machine Learning
In the majority of supervised learning applications, the ultimate goal is to
develop a finely tuned predictor function h(x) (sometimes called the
“hypothesis”). “Learning” consists of using sophisticated mathematical
algorithms to optimize this function so that, given input data x about a certain
domain (say, square footage of a house), it will accurately predict some
interesting value h(x) (say, market price for said house).
In practice, x almost always represents multiple data points. So, for example, a
housing price predictor might consider not only square footage (x1) but also
number of bedrooms (x2), number of bathrooms (x3), number of floors (x4),
year built (x5), ZIP code (x6), and so forth. Determining which inputs to use is
an important part of ML design. However, for the sake of explanation, it is
easiest to assume a single input value.
Let’s say our simple predictor has this form:
$$h(x) = \theta_0 + \theta_1 x$$
where $\theta_0$ and $\theta_1$ are constants. Our goal is to find the perfect values of
$\theta_0$ and $\theta_1$ to make our predictor work as well as possible.
Optimizing the predictor h(x) is done using training examples. For each
training example, we have an input value x_train , for which a corresponding
output, y , is known in advance. For each example, we find the difference
between the known, correct value y , and our predicted value h(x_train) .
With enough training examples, these differences give us a useful way to
measure the “wrongness” of h(x) . We can then tweak h(x) by tweaking the
values of $\theta_0$ and $\theta_1$ to make it “less wrong”. This process is repeated until the
system has converged on the best values for $\theta_0$ and $\theta_1$. In this way, the
predictor becomes trained, and is ready to do some real-world predicting.
Machine Learning Examples
We’re using simple problems for the sake of illustration, but the reason ML
exists is because, in the real world, problems are much more complex. On this
flat screen, we can present a picture of, at most, a three-dimensional dataset,
but ML problems often deal with data with millions of dimensions and very
complex predictor functions. ML solves problems that cannot be solved by
numerical means alone.
With that in mind, let’s look at another simple example. Say we have the
following training data, wherein company employees have rated their
satisfaction on a scale of 1 to 100:
First, notice that the data is a little noisy. That is, while we can see that there is
a pattern to it (i.e., employee satisfaction tends to go up as salary goes up), it
does not all fit neatly on a straight line. This will always be the case with real-
world data (and we absolutely want to train our machine using real-world
data). How can we train a machine to perfectly predict an employee’s level of
satisfaction? The answer, of course, is that we can’t. The goal of ML is never to
make “perfect” guesses because ML deals in domains where there is no such
thing. The goal is to make guesses that are good enough to be useful.
It is somewhat reminiscent of the famous statement by George E. P. Box, the
British mathematician and professor of statistics: “All models are wrong, but
some are useful.”
Machine learning builds heavily on statistics. For example, when we train our
machine to learn, we have to give it a statistically significant random sample as
training data. If the training set is not random, we run the risk of the machine
learning patterns that aren’t actually there. And if the training set is too small
(see the law of large numbers), we won’t learn enough and may even reach
inaccurate conclusions. For example, attempting to predict companywide
satisfaction patterns based on data from upper management alone would
likely be error-prone.
With this understanding, let’s give our machine the data we’ve been given
above and have it learn it. First we have to initialize our predictor h(x) with
some reasonable values of $\theta_0$ and $\theta_1$. Now, when placed over our training set,
our predictor looks like this:
If we ask this predictor for the satisfaction of an employee making $60,000, it
would predict a rating of 27:
It’s obvious that this is a terrible guess and that this machine doesn’t know
very much.
Now let’s give this predictor all the salaries from our training set, and note the
differences between the resulting predicted satisfaction ratings and the actual
satisfaction ratings of the corresponding employees. If we perform a little
mathematical wizardry (which I will describe later in the article), we can
calculate, with very high certainty, that values of 13.12 for $\theta_0$ and 0.61 for
$\theta_1$ are going to give us a better predictor.
And if we repeat this process, say 1,500 times, our predictor will end up
looking like this:
At this point, if we repeat the process, we will find that $\theta_0$ and $\theta_1$ will no longer
change by any appreciable amount, and thus we see that the system has
converged. If we haven’t made any mistakes, this means we’ve found the
optimal predictor. Accordingly, if we now ask the machine again for the
satisfaction rating of the employee who makes $60,000, it will predict a rating
of ~60.
Machine Learning Regression: A Note on Complexity
The above example is technically a simple problem of univariate linear
regression, which in reality can be solved by deriving a simple normal equation
and skipping this “tuning” process altogether. However, consider a predictor
that looks like this:
This function takes input in four dimensions and has a variety of polynomial
terms. Deriving a normal equation for this function is a significant challenge.
Many modern machine learning problems take thousands or even millions of
dimensions of data to build predictions using hundreds of coefficients.
Predicting how an organism’s genome will be expressed or what the climate
will be like in 50 years are examples of such complex problems.
Fortunately, the iterative approach taken by ML systems is much more resilient
in the face of such complexity. Instead of using brute force, a machine learning
system “feels” its way to the answer. For big problems, this works much better.
While this doesn’t mean that ML can solve all arbitrarily complex problems—it
can’t—it does make for an incredibly flexible and powerful tool.
Gradient Descent: Minimizing “Wrongness”
Let’s take a closer look at how this iterative process works. In the above
example, how do we make sure $\theta_0$ and $\theta_1$ are getting better with each step, not
worse? The answer lies in our “measurement of wrongness”, along with a little
calculus. (This is the “mathematical wizardry” mentioned previously.)
The wrongness measure is known as the cost function (aka loss function), $J(\theta)$.
The input $\theta$ represents all of the coefficients we are using in our predictor. In
our case, $\theta$ is really the pair $\theta_0$ and $\theta_1$. $J(\theta_0, \theta_1)$ gives us a mathematical
measurement of how wrong our predictor is when it uses the given
values of $\theta_0$ and $\theta_1$.
The choice of the cost function is another important piece of an ML program.
In different contexts, being “wrong” can mean very different things. In our
employee satisfaction example, the well-established standard is the linear least
squares function:
With least squares, the penalty for a bad guess goes up quadratically with the
difference between the guess and the correct answer, so it acts as a very
“strict” measurement of wrongness. The cost function computes an average
penalty across all the training examples.
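Written out, one common form of the least-squares cost is the following. The symbol $m$ for the number of training examples, the index $i$, and the conventional factor of $\frac{1}{2}$ (which only simplifies the derivative) are assumptions of this sketch rather than notation taken from the text above.

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_{train,i}) - y_i \right)^2
```

Each term squares the difference between a prediction and the known value, and dividing by the number of examples turns the total into an average penalty.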
Now we see that our goal is to find $\theta_0$ and $\theta_1$ for our predictor h(x) such that
our cost function $J(\theta_0, \theta_1)$ is as small as possible. We call on the power of
calculus to accomplish this.
Consider the following plot of a cost function for some particular machine
learning problem:
Here we can see the cost associated with different values of $\theta_0$ and $\theta_1$. We can
see the graph has a slight bowl to its shape. The bottom of the bowl represents
the lowest cost our predictor can give us based on the given training data. The
goal is to “roll down the hill” and find $\theta_0$ and $\theta_1$ corresponding to this point.
This is where calculus comes into this machine learning tutorial. For the sake
of keeping this explanation manageable, I won’t write out the equations here,
but essentially what we do is take the gradient of $J(\theta_0, \theta_1)$, which is the pair of
derivatives of $J(\theta_0, \theta_1)$ (one over $\theta_0$ and one over $\theta_1$). The gradient will be
different for every different value of $\theta_0$ and $\theta_1$, and defines the “slope of the
hill” and, in particular, “which way is down” for these particular $\theta$s. For
example, when we plug our current values of $\theta$ into the gradient, it may tell us
that adding a little to $\theta_0$ and subtracting a little from $\theta_1$ will take us in the
direction of the cost function-valley floor. Therefore, we add a little to $\theta_0$,
subtract a little from $\theta_1$, and voilà! We have completed one round of our
learning algorithm. Our updated predictor, $h(x) = \theta_0 + \theta_1 x$, will return better
predictions than before. Our machine is now a little bit smarter.
This process of alternating between calculating the current gradient and
updating the $\theta$s from the results is known as gradient descent.
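For readers who would like to see the loop in code, here is a minimal sketch. It assumes the cost is the average squared error with a conventional factor of one half, so that the two partial derivatives are the mean error and the mean of error times x. The name gradient_descent, the learning_rate parameter, and the data are illustrative assumptions.

```python
def gradient_descent(xs, ys, theta0=0.0, theta1=0.0, learning_rate=0.1, steps=1500):
    """Fit h(x) = theta0 + theta1 * x by repeatedly stepping against the gradient."""
    m = len(xs)
    for _ in range(steps):
        # Difference between each prediction and the known value.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Move a little in the "downhill" direction.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys, steps=5000)
print(round(theta0, 3), round(theta1, 3))
```

The learning rate controls how large each step is: too small and convergence is slow, too large and the updates can overshoot the bottom of the bowl. Inputs on a large scale (such as raw salaries) usually need rescaling or a much smaller learning rate.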
That covers the basic theory underlying the majority of supervised machine
learning systems. But the basic concepts can be applied in a variety of ways,
depending on the problem at hand.
Classification Problems in Machine Learning
Under supervised ML, two major subcategories are:
• Regression machine learning systems – Systems where the value
being predicted falls somewhere on a continuous spectrum. These
systems help us with questions of “How much?” or “How many?”
• Classification machine learning systems – Systems where we
seek a yes-or-no prediction, such as “Is this tumor cancerous?”,
“Does this cookie meet our quality standards?”, and so on.
As it turns out, the underlying machine learning theory is more or less the
same. The major differences are the design of the predictor h(x) and the
design of the cost function $J(\theta)$.
Our examples so far have focused on regression problems, so now let’s take a
look at a classification example.
Here are the results of a cookie quality testing study, where the training
examples have all been labeled as either “good cookie” ( y = 1 ) in blue or “bad
cookie” ( y = 0 ) in red.
In classification, a regression predictor is not very useful. What we usually want
is a predictor that makes a guess somewhere between 0 and 1. In a cookie
quality classifier, a prediction of 1 would represent a very confident guess that
the cookie is perfect and utterly mouthwatering. A prediction of 0 represents
high confidence that the cookie is an embarrassment to the cookie industry.
Values falling within this range represent less confidence, so we might design
our system such that a prediction of 0.6 means “Man, that’s a tough call, but
I’m gonna go with yes, you can sell that cookie,” while a value exactly in the
middle, at 0.5, might represent complete uncertainty. This isn’t always how
confidence is distributed in a classifier but it’s a very common design and
works for the purposes of our illustration.
It turns out there’s a nice function that captures this behavior well. It’s called
the sigmoid function, g(z) , and it looks something like this:
$$g(z) = \frac{1}{1 + e^{-z}}$$
z is some representation of our inputs and coefficients, such as:
$$z = \theta_0 + \theta_1 x$$
so that our predictor becomes:
$$h(x) = g(\theta_0 + \theta_1 x)$$
Notice that the sigmoid function transforms our output into the range
between 0 and 1.
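A small sketch of the sigmoid and the resulting predictor follows. It assumes a single input x with two coefficients, and the function names and the cookie-score interpretation in the usage lines are illustrative only. The two-branch form simply avoids numeric overflow for very negative z.

```python
import math


def sigmoid(z):
    """Map any real number z into the range between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is very negative.
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, theta0, theta1):
    """Classification predictor h(x) = g(theta0 + theta1 * x)."""
    return sigmoid(theta0 + theta1 * x)


# Hypothetical coefficients: higher x pushes the prediction towards 1.
print(predict(2.0, theta0=-1.0, theta1=1.5))   # above 0.5, leaning "yes"
print(predict(0.0, theta0=-1.0, theta1=1.5))   # below 0.5, leaning "no"
```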
The logic behind the design of the cost function is also different in
classification. Again we ask “What does it mean for a guess to be wrong?” and
this time a very good rule of thumb is that if the correct guess was 0 and we
guessed 1, then we were completely wrong—and vice-versa. Since you can’t
be more wrong than completely wrong, the penalty in this case is enormous.
Alternatively, if the correct guess was 0 and we guessed 0, our cost function
should not add any cost for each time this happens. If the guess was right, but
we weren’t completely confident (e.g., y = 1 , but h(x) = 0.8 ), this should
come with a small cost, and if our guess was wrong but we weren’t completely
confident (e.g., y = 1 but h(x) = 0.3 ), this should come with some significant
cost but not as much as if we were completely wrong.
This behavior is captured by the log function, such that:
Again, the cost function gives us the average cost over all of our training
examples.
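One standard way to capture this behavior is the log loss (cross-entropy): the penalty is -log(h(x)) when y = 1 and -log(1 - h(x)) when y = 0. The sketch below assumes that form; the function name log_loss and the small clipping constant eps, which keeps the logarithm finite, are choices made for this example.

```python
import math


def log_loss(ys, predictions, eps=1e-12):
    """Average log-based penalty over all training examples."""
    total = 0.0
    for y, h in zip(ys, predictions):
        # Clip so that a completely wrong guess gives a huge but finite cost.
        h = min(max(h, eps), 1.0 - eps)
        total += -math.log(h) if y == 1 else -math.log(1.0 - h)
    return total / len(ys)


print(log_loss([1], [0.8]))   # right but not fully confident: small cost
print(log_loss([1], [0.3]))   # wrong but not fully confident: larger cost
print(log_loss([1], [0.0]))   # completely wrong: enormous cost
```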
So here we’ve described how the predictor h(x) and the cost function $J(\theta)$
differ between regression and classification, but gradient descent still works
fine.
A classification predictor can be visualized by drawing the boundary line; i.e.,
the barrier where the prediction changes from a “yes” (a prediction greater
than 0.5) to a “no” (a prediction less than 0.5). With a well-designed system,
our cookie data can generate a classification boundary that looks like this:
Now that’s a machine that knows a thing or two about cookies!
An Introduction to Neural Networks
No discussion of Machine Learning would be complete without at least
mentioning neural networks. Not only do neural networks offer an extremely
powerful tool to solve very tough problems, they also offer fascinating hints at
the workings of our own brains and intriguing possibilities for one day creating
truly intelligent machines.
Neural networks are well suited to machine learning models where the
number of inputs is gigantic. The computational cost of handling such a
problem is just too overwhelming for the types of systems we’ve discussed. As
it turns out, however, neural networks can be effectively tuned using
techniques that are strikingly similar to gradient descent in principle.
A thorough discussion of neural networks is beyond the scope of this tutorial,
but I recommend reading a dedicated introduction to the subject.
Unsupervised Machine Learning
Unsupervised machine learning is typically tasked with finding relationships
within data. There are no training examples used in this process. Instead, the
system is given a set of data and tasked with finding patterns and correlations
therein. A good example is identifying close-knit groups of friends in social
network data.
The machine learning algorithms used to do this are very different from those
used for supervised learning, and the topic merits its own discussion. However,
for something to chew on in the meantime, take a look at clustering
algorithms such as k-means, and also look into dimensionality
reduction systems such as principal component analysis.
Putting Theory Into Practice
We’ve covered much of the basic theory underlying the field of machine
learning but, of course, we have only scratched the surface.
Keep in mind that to really apply the theories contained in this introduction to
real-life machine learning examples, a much deeper understanding of these
topics is necessary. There are many subtleties and pitfalls in ML and many
ways to be led astray by what appears to be a perfectly well-tuned thinking
machine. Almost every part of the basic theory can be played with and altered
endlessly, and the results are often fascinating. Many grow into whole new
fields of study that are better suited to particular problems.
What is a Confusion Matrix in Machine Learning?
In machine learning, classification is used to split data into categories. How do
we know if our classification model performs well, after cleaning and
pre-processing the data and training our model? That is where a confusion matrix
comes into the picture. “A confusion matrix is used to measure the
performance of a classifier in depth”.
What Are Confusion Matrices, and Why Do We Need Them?
Classification models have multiple categorical outputs. Most error measures
will calculate the total error in our model, but we cannot find individual
instances of errors in our model. The model might misclassify some categories
more than others, but we cannot see this using a standard accuracy measure.
On imbalanced data, a model can even score a high accuracy when it is not
predicting the minority classes at all. This is where confusion matrices are
useful.
A confusion matrix presents a table layout of the different outcomes of the
prediction and results of a classification problem and helps visualize its
outcomes.
Hierarchical Clustering
Hierarchical clustering is an unsupervised learning method for clustering data
points. The algorithm builds clusters by measuring the dissimilarities between
data. Unsupervised learning means that the model is not trained on labelled
examples, so we do not need a "target" variable. This method can be used on
any data to visualize and interpret the relationship between individual data
points.
Here we will use hierarchical clustering to group data points and visualize the
clusters using both a dendrogram and scatter plot.
How does it work?
We will use Agglomerative Clustering, a type of hierarchical clustering that
follows a bottom-up approach. We begin by treating each data point as its own
cluster. Then, we join clusters together that have the shortest distance
between them to create larger clusters. This step is repeated until one large
cluster is formed containing all of the data points.
Hierarchical clustering requires us to decide on both a distance and a linkage
method. We will use Euclidean distance and the Ward linkage method, which
attempts to minimize the variance within clusters.
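The bottom-up procedure can be sketched in plain Python as below. It is a deliberately naive illustration that assumes Ward linkage is scored as the increase in within-cluster sum of squared Euclidean distances caused by a merge; the function names and sample points are hypothetical, and a real analysis would normally rely on a library implementation.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    centroid_a = [sum(coord) / len(a) for coord in zip(*a)]
    centroid_b = [sum(coord) / len(b) for coord in zip(*b)]
    sq_dist = sum((u - v) ** 2 for u, v in zip(centroid_a, centroid_b))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def agglomerative(points, n_clusters=1):
    """Merge clusters bottom-up until only n_clusters remain."""
    clusters = [[p] for p in points]        # every point starts as its own cluster
    merges = []                             # merge history, usable for a dendrogram
    while len(clusters) > n_clusters:
        # Find the pair of clusters that is cheapest to merge.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, merges = agglomerative(data, n_clusters=2)
print(clusters)
```

Stopping at n_clusters=1 reproduces the full hierarchy described above, with the merges list recording the order in which clusters were joined.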
Confusion Matrix for Machine Learning
Everything You Should Know about the Confusion Matrix:
Have you been in a situation where you expected your machine learning model
to perform really well but it sputtered out a poor accuracy? You’ve done all the
hard work – so where did the classification model go wrong? How can you
correct this?
There are plenty of ways to gauge the performance of your classification model
but none have stood the test of time like the confusion matrix. It helps us
evaluate how our model performed, where it went wrong and offers us
guidance to correct our path.
In this article, we will explore what a confusion matrix is in machine learning
and how it gives a holistic view of the performance of your model. And despite
its name, you will realize that a confusion matrix is a pretty simple yet
powerful concept. So let’s unravel the mystery around the confusion matrix!
Here’s what we’ll cover:
1. What is a Confusion Matrix?
a. True Positive
b. True Negative
c. False Positive – Type 1 Error
d. False Negative – Type 2 Error
2. Why Do We Need a Confusion Matrix?
3. Precision vs. Recall
4. F1-Score
5. Confusion Matrix using scikit-learn in Python
6. Confusion Matrix for Multi-class Classification
1. What is a Confusion Matrix?
The million-dollar question – what, after all, is a confusion matrix?
A confusion matrix is an N x N matrix used for evaluating the performance of a
classification model, where N is the number of target classes. The matrix
compares the actual target values with those predicted by the machine
learning model. This gives us a holistic view of how well our classification
model is performing and what kinds of errors it is making.
For a binary classification problem, we would have a 2 x 2 matrix with 4
values, laid out as follows:
                      Actual Positive    Actual Negative
Predicted Positive    TP                 FP
Predicted Negative    FN                 TN
Let’s decipher the matrix:
• The target variable has two values: Positive or Negative
• The columns represent the actual values of the target variable
• The rows represent the predicted values of the target variable
But wait – what’s TP, FP, FN and TN here? That’s the crucial part of a confusion
matrix. Let’s understand each term below.
Understanding True Positive, True Negative, False Positive and False Negative
in a Confusion Matrix
a. True Positive (TP)
• The predicted value matches the actual value
• The actual value was positive and the model predicted a positive value
b. True Negative (TN)
• The predicted value matches the actual value
• The actual value was negative and the model predicted a negative value
c. False Positive (FP) – Type 1 error
• The predicted value does not match the actual value
• The actual value was negative but the model predicted a positive value
• Also known as the Type 1 error
d. False Negative (FN) – Type 2 error
• The predicted value does not match the actual value
• The actual value was positive but the model predicted a negative value
• Also known as the Type 2 error
Let me give you an example to better understand this. Suppose we had a
classification dataset with 1000 data points. We fit a classifier on it and get the
below confusion matrix:
The different values of the Confusion matrix would be as follows:
• True Positive (TP) = 560; meaning 560 positive class data points were
correctly classified by the model
• True Negative (TN) = 330; meaning 330 negative class data points were
correctly classified by the model
• False Positive (FP) = 60; meaning 60 negative class data points were
incorrectly classified as belonging to the positive class by the model
• False Negative (FN) = 50; meaning 50 positive class data points were
incorrectly classified as belonging to the negative class by the model
This turned out to be a pretty decent classifier for our dataset considering the
relatively larger number of true positive and true negative values.
Remember the Type 1 and Type 2 errors. Interviewers love to ask the difference
between these two.
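To make the four terms concrete, the sketch below counts them directly from a list of actual labels and a list of predicted labels. The function name confusion_counts and the tiny label lists are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative (Type 1 error)
        elif a == positive:
            fn += 1          # predicted negative, actually positive (Type 2 error)
        else:
            tn += 1          # predicted negative, actually negative
    return tp, tn, fp, fn


actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```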
2. Why Do We Need a Confusion Matrix?
Before we answer this question, let’s think about a hypothetical classification
problem.
Let’s say you want to predict how many people are infected with a contagious
virus before they show symptoms, and isolate them from the healthy
population (ringing any bells yet?). The two values for our target
variable would be: Sick and Not Sick.
Now, you must be wondering – why do we need a confusion matrix when we
have our all-weather friend – Accuracy? Well, let’s see where accuracy falters.
Our dataset is an example of an imbalanced dataset. There are 960 data points
for the negative class and 40 data points for the positive class. This is how
we calculate the accuracy:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Let’s see how our model performed. The total outcome values are:
TP = 30, TN = 930, FP = 30, FN = 10
So, the accuracy for our model turns out to be:
$$\text{Accuracy} = \frac{30 + 930}{30 + 930 + 30 + 10} = 0.96$$
96%! Not bad!
But it is giving the wrong idea about the result. Think about it.
Our model is saying “I can predict sick people 96% of the time”. However, it is
doing the opposite. It is predicting the people who will not get sick with 96%
accuracy while the sick are spreading the virus!
Do you think this is a correct metric for our model given the seriousness of the
issue? Shouldn’t we be measuring how many positive cases we can predict
correctly to arrest the spread of the contagious virus? Or maybe, out of the
cases predicted as positive, how many are actually positive, to check the
reliability of our model?
This is where we come across the dual concept of Precision and Recall.
3. Precision vs. Recall
Precision tells us how many of the cases predicted as positive actually turned
out to be positive.
Here’s how to calculate Precision:
$$\text{Precision} = \frac{TP}{TP + FP}$$
This would determine whether our model is reliable or not.
Recall tells us how many of the actual positive cases we were able to predict
correctly with our model.
And here’s how we can calculate Recall:
$$\text{Recall} = \frac{TP}{TP + FN}$$
We can easily calculate Precision and Recall for our model by plugging the
values into the above equations:
$$\text{Precision} = \frac{30}{30 + 30} = 0.5, \qquad \text{Recall} = \frac{30}{30 + 10} = 0.75$$
50% of the cases predicted as positive turned out to be positive cases,
whereas 75% of the actual positives were successfully predicted by our model.
Awesome!
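The same arithmetic can be checked in a few lines. The function names are illustrative; the counts are the ones from the example above (TP = 30, TN = 930, FP = 30, FN = 10).

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


tp, tn, fp, fn = 30, 930, 30, 10
print(accuracy(tp, tn, fp, fn))   # 0.96
print(precision(tp, fp))          # 0.5
print(recall(tp, fn))             # 0.75
```

The accuracy looks excellent while precision and recall expose how the positive class is actually handled, which is the point of the example.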
Precision is a useful metric in cases where a False Positive is a higher concern
than a False Negative.
Precision is important in music or video recommendation systems,
e-commerce websites, etc. Wrong results could lead to customer churn and be
harmful to the business.
Recall is a useful metric in cases where a False Negative is a higher concern
than a False Positive.
Recall is important in medical cases where it doesn’t matter whether we raise
a false alarm but the actual positive cases should not go undetected!
In our example, Recall would be a better metric because we don’t want to
accidentally discharge an infected person and let them mix with the healthy
population thereby spreading the contagious virus. Now you can understand
why accuracy was a bad metric for our model.
But there will be cases where there is no clear distinction between whether
Precision is more important or Recall. What should we do in those cases? We
combine them!
4. F1-Score
In practice, when we try to increase the precision of our model, the recall
often goes down, and vice-versa. The F1-score captures both the trends in a
single value:
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
F1-score is the harmonic mean of Precision and Recall, and so it gives a
combined idea about these two metrics. For a given average of the two, it is
highest when Precision is equal to Recall, and it reaches its maximum of 1 only
when both are 1.
But there is a catch here. The interpretability of the F1-score is poor. This
means that we don’t know what our classifier is maximizing – precision or
recall? So, we use it in combination with other evaluation metrics, which
together give us a complete picture of the result.
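A short sketch may make both points concrete. It assumes the standard harmonic-mean definition of the F1-score; the helper name f1_from and the extra (0.9, 0.1) pair are illustrative choices, not values from the text.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Running example: precision = 50%, recall = 75%.
f1_model = f1_from(0.50, 0.75)

# Swapping the two values gives the same score, so the F1-score alone does not
# reveal which of the two the classifier favours.
f1_swapped = f1_from(0.75, 0.50)

# Illustrative pair (an assumption): a large gap between the two is punished
# much harder than by a plain arithmetic mean, which would be 0.5 here.
f1_lopsided = f1_from(0.90, 0.10)

print(f"{f1_model:.2f} {f1_swapped:.2f} {f1_lopsided:.2f}")
```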
5. Confusion Matrix using scikit-learn in Python
You know the theory – now let’s put it into practice. You can create a
confusion matrix with the Scikit-learn (sklearn) library in Python.
Sklearn has two great functions: confusion_matrix() and classification_report().
• Sklearn confusion_matrix() returns the values of the Confusion matrix.
The output is, however, slightly different from what we have studied so
far. It takes the rows as Actual values and the columns as Predicted
values. The rest of the concept remains the same.
• Sklearn classification_report() outputs precision, recall and f1-score for
each target class. In addition to this, it also has some extra values: micro
avg, macro avg, and weighted avg.
Micro average is the precision/recall/f1-score calculated globally, by pooling
the true positives, false positives and false negatives of all the classes.
Macro average is the unweighted mean of the per-class precision/recall/f1-score.
Weighted average is the mean of the per-class precision/recall/f1-score,
weighted by the number of actual samples (the support) in each class.
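The two functions can be tried on a toy example. This is a minimal sketch that assumes scikit-learn is installed; the ten labels below are invented for illustration (1 = Sick, 0 = Not Sick) and do not come from the chapter's dataset.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for illustration only: 1 = Sick, 0 = Not Sick.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes, both sorted by
# label (0 first), so the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Human-readable report with per-class and averaged scores.
names = ["Not Sick", "Sick"]
print(classification_report(y_true, y_pred, target_names=names))

# The same report as a nested dict, convenient for programmatic checks.
report = classification_report(
    y_true, y_pred, target_names=names, output_dict=True
)
macro_precision = report["macro avg"]["precision"]        # plain mean of classes
weighted_precision = report["weighted avg"]["precision"]  # weighted by support
```

Here the per-class precisions are 5/6 (Not Sick) and 3/4 (Sick), so the macro and weighted averages differ slightly because the classes have different support.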
Confusion Matrix for Multi-Class Classification
How would a confusion matrix work for a multi-class classification problem?
Well, don’t scratch your head! We will have a look at that here.
Let’s draw a confusion matrix for a multiclass problem where we have to
predict whether a person loves Facebook, Instagram or Snapchat. The
confusion matrix would be a 3 x 3 matrix like this:
The true positive, true negative, false positive and false negative for each class
would be calculated by adding the cell values as follows:
That’s it! You are ready to decipher any N x N confusion matrix!
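The cell-adding rule can be sketched in plain Python. The 3 x 3 counts below are invented for illustration, and the sketch assumes the scikit-learn layout (rows are actual classes, columns are predicted classes); per_class_counts is a hypothetical helper name.

```python
labels = ["Facebook", "Instagram", "Snapchat"]

# Invented counts: matrix[i][j] = people who actually love labels[i]
# and were predicted to love labels[j].
matrix = [
    [7, 2, 1],
    [1, 8, 1],
    [2, 3, 5],
]


def per_class_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class index k, treated one-vs-rest."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                          # diagonal cell
    fn = sum(matrix[k]) - tp                   # rest of row k: actual k, predicted other
    fp = sum(row[k] for row in matrix) - tp    # rest of column k: predicted k, actual other
    tn = total - tp - fn - fp                  # every remaining cell
    return tp, fp, fn, tn


for k, name in enumerate(labels):
    tp, fp, fn, tn = per_class_counts(matrix, k)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

The same function works unchanged for any N x N matrix, since it only relies on row sums, column sums and the diagonal.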
Finding Confusion Matrix With Python Code:

We'll build a logistic regression model using a heart attack dataset to predict if a patient is at risk of a
heart attack. Depicted below is the dataset that we'll be using for this demonstration.
Figure 9: Heart Attack Dataset
Let’s import the necessary libraries to create our model.
Figure 10: Importing Confusion Matrix in python
We can import the confusion matrix function from sklearn.metrics. Let’s split
our dataset into the input features and target output dataset.
Figure 11: Splitting data into variables and target dataset
As we can see, our data contains a wide range of values: some are single
digits, and some have three digits. To keep features with large values from
dominating the model, we will scale our data and reduce it to a small range of
values using the Standard Scaler.
Figure 12: Scaling down our dataset
Now, let's split our dataset into two: one to train our model and another to
test our model. To do this, we use train_test_split imported from
sklearn.model_selection.
Using a Logistic Regression Model, we will perform Classification on our train
data and predict our test data to check the accuracy.
Confusion Matrix for Machine Learning
To find the accuracy of our classifier and the other metrics, we can import
accuracy_score and classification_report from sklearn.metrics as well.
Figure 14: Accuracy of classifier
The accuracy_score function gives us the accuracy of our classifier.
Figure 15: Confusion Matrix for data
Using the predicted values (pred) and our actual values (y_test), we can create
a confusion matrix with the confusion_matrix function.
Then, using the ravel() method of the array returned by confusion_matrix, we
can get the True Negative, False Positive, False Negative, and True Positive
values, in that order.
Figure 16: Extracting matrix value
Figure 17: Confusion Matrix Metrics
Finally, using the classification report, we can find the values of various metrics
of our confusion matrix.
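Since the figures with the original code are not reproduced here, the walk-through can be approximated as follows. This is a sketch under stated assumptions, not the original code: the heart attack dataset is replaced by synthetic data from make_classification, and the split sizes and random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the heart attack dataset
# (X = input features, y = binary target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out a quarter of the rows for testing (arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features to zero mean and unit variance, fitted on the training rows.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the classifier and predict the held-out rows.
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)
tn, fp, fn, tp = cm.ravel()  # binary case: row-major order of [[TN, FP], [FN, TP]]

print("accuracy:", accuracy)
print(cm)
print(classification_report(y_test, pred))
```

The text scales the whole dataset before splitting it; this sketch fits the scaler on the training split only, so that no statistics from the test rows influence training. Either order runs with the same function calls.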
