Campbell A. Python Machine Learning For Beginners. All You Need To Know... 2022
Have you thought about a career in data science? It's where the money is right now,
and it's only going to become more widespread as the world evolves. Machine
learning is a big part of data science, and for those that already have experience in
programming, it's the next logical step.
Machine learning is a subsection of AI, or Artificial Intelligence, and computer
science, using data and algorithms to imitate human thinking and learning. Through
constant learning, machine learning gradually improves its accuracy, eventually
providing the optimal results for the problem it has been assigned to.
It is one of the most important parts of data science and, as big data continues to
expand, so too will the need for machine learning and AI.
Here's what you will learn in this quick guide to machine learning with Python for
beginners:
What machine learning is
Why Python is the best computer programming language for machine
learning
The different types of machine learning
How linear regression works
The different types of classification
How to use SVMs (Support Vector Machines) with Scikit-Learn
How Decision Trees work with Classification
How K-Nearest Neighbors works
How to find patterns in data with unsupervised learning algorithms
You will also find plenty of code examples to help you understand how everything
works.
If you are ready to take your programming further, scroll up, click Buy Now, and
find out why machine learning is the next logical step.
Python Machine Learning for
Beginners
All You Need to Know about Machine
Learning with Python
© Copyright 2022 - All rights reserved. Alex Campbell.
The contents of this book may not be reproduced, duplicated, or transmitted without direct written
permission from the author.
Under no circumstances will any legal responsibility or blame be held against the publisher for any
reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any part of the content within this book
without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. No warranties of any kind are expressed or implied. Readers acknowledge that the
author is not engaging in the rendering of legal, financial, medical, or professional advice. Please
consult a licensed professional before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of the information contained
within this document, including, but not limited to, errors, omissions, or inaccuracies.
Table of Contents
Introduction
Prerequisites
Chapter 1: What Is Machine Learning?
How It Works
Machine Learning Features
Why We Need Machine Learning
The History of Machine Learning
Machine Learning Today
Chapter 2: The Different Types of Machine Learning
Which Algorithms to Use
Chapter 3: Linear Regression
What Is Regression?
Linear Regression
Implementing Linear Regression in Python
Chapter 4: The Different Types of Classification
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Chapter 5: Support Vector Machines with Scikit-Learn
How Does SVM Work?
Classifier Building in Scikit-learn
Chapter 6: Using Decision Trees
Decision Tree Algorithm
Attribute Selection Measure
Building a Decision Tree Classifier
Visualizing Decision Trees
Optimizing Decision Tree Performance
Chapter 7: K-Nearest Neighbors
Implementing KNN Algorithm With Scikit-Learn
Chapter 8: Finding Patterns in Data
The Difference between Supervised and Unsupervised Learning
Preparing the Data
Clustering
Conclusion
References
Introduction
Machine Learning (ML) and Artificial Intelligence (AI) are not just the latest
buzzwords. They are now the most important part of the world we live in
and far more important and useful than science fiction would have you
believe. Without them, we simply couldn't process the huge amounts of data
we produce, at least not effectively or efficiently. Without them, more people
would be stuck doing repetitive, mundane jobs instead of putting their skills
to better use. Without them, companies couldn't make good business
decisions or draw up effective strategies and solutions.
While the human brain can process large amounts of data, it can only absorb
so much data at any one time. AI doesn't have these limitations, and it is far
more accurate, free from human error. However, it isn't the easiest
technology to develop and requires the right programming language. That
language is Python, for several reasons:
1. It has a wide range of libraries, modules that include pre-written code
for certain functions or actions. This means developers don't have to
start from scratch every time. Some of the best Python libraries are:
Scikit-Learn - handles regression, clustering, classification, and
other ML algorithms
Pandas – for higher-level analysis and data structures, help with
data merging and filtering, and gathering data from external
sources
Keras – a deep learning library for prototyping and calculations
TensorFlow – another deep learning library that helps set up,
train, and use artificial neural networks with vast datasets
Matplotlib – creates visualizations, like histograms, 2D plots,
charts, etc.
2. It is easy to learn with intuitive syntax, which means it can easily be
used for ML with little effort.
3. It is a flexible language, offering a choice of scripting or object-oriented
programming (OOP), with no requirement to recompile source code, which
allows changes to be quickly implemented. It can also be combined with
other languages where the need arises.
4. It isn't dependent on any one platform and can be used on more than 20
platforms. It also isn't too difficult to move code from one platform to another.
5. It is simple to read, so all developers can understand anyone's code and
change it if they need to.
6. It offers a great choice of visualization tools so that data can be
presented in a human-readable format.
7. It also has one of the largest communities of any programming
language, with developers and others waiting to help and provide
resources.
Prerequisites
This book is aimed at those who already have programming experience with
Python. If you are completely new to programming, you really need to go and
learn the basics at the very least before attempting any of the coding and
examples in this book.
If you are experienced and are ready to take your knowledge up a notch, let's
dive in and learn all about machine learning using Python.
Chapter 1: What Is Machine Learning?
The real world is full of humans whose brains have a vast capacity to learn
from their experiences and machines or computers that work from human
instruction. One question that has long been asked is, "can a computer learn
from past data or experiences the same way humans do?" That question is
answered with machine learning.
One of the fastest-growing technologies, machine learning is all about
teaching computers to learn from past data. It does this by using algorithms
that build mathematical models and use historical information or data to
make predictions.
You already use machine learning in your everyday life, most likely without
even knowing it. It is used for spam filtering in your inbox, image
recognition, auto-tagging in Facebook, speech recognition, recommender
systems in Netflix, Amazon, etc., and so much more.
Machine learning is a subset of Artificial Intelligence, and the term was first
coined in 1959 by Arthur Samuel. It can be defined as enabling machines to
learn automatically from data, use past experiences to improve their
performance, and make predictions without needing to be explicitly
programmed.
We use samples of historical data, called training data, to teach machine
learning algorithms how to build mathematical models that make decisions or
predictions. Machine learning combines statistics and computer science to
build predictive models, using or constructing algorithms that learn from
the data. In general, the more good-quality data we provide, the better the
machine learning model's performance.
How It Works
When we provide a machine learning system with sufficient historical data, it
builds a prediction model. When we give it new data, it will use that model to
predict the output for that data. The accuracy of that output depends heavily
on the amount and quality of the data we provide. In general, the more
relevant data we give, the more accurate the output will be.
Let's say we are dealing with a complex problem, and we need some
predictions. Rather than writing code from scratch, we can use pre-built,
generic algorithms. We give the data to those algorithms, and they use that
data to build the logic and provide the predicted outputs. In short, machine
learning changes how we think about problem-solving.
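As a minimal sketch of this train-then-predict cycle, the snippet below fits a pre-built algorithm on a small set of "historical" data and then asks it about new data. The dataset is invented purely for illustration, and the choice of a decision tree from scikit-learn is an assumption; any generic algorithm would follow the same pattern.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical data: one feature per sample and a known label.
X_train = [[0], [1], [2], [3], [10], [11], [12], [13]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# A pre-built, generic algorithm builds the "logic" from the data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# New, unseen data is passed to the fitted model to get predicted outputs.
X_new = [[1.5], [11.5]]
predictions = model.predict(X_new)
print(predictions)
```

Notice that no rule such as "small values belong to class 0" was written by hand; the model derived it from the examples.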
Machine Learning Features
Machine learning offers plenty of features:
It uses data to find patterns in datasets
It learns from past data and uses it to improve automatically
The technology is purely data-driven
Machine learning can be seen as similar to data mining in that both
deal with vast amounts of data
Why We Need Machine Learning
Machine learning is fast becoming a requirement for everyday life, and the
need for it increases as each day passes. Why do we need it so badly? For a
start, it can take the place of humans. That isn't a bad thing – some jobs are
incredibly mundane and time-consuming, and allowing machine learning to
take over means freeing up time that is better spent elsewhere. Conversely,
some jobs are far too complex for humans to do – we do have some
limitations and don't have a way to manually access the large amounts of data
needed for these jobs. That's where computer systems, specifically machine
learning, come into the picture.
We can give this vast amount of data to machine learning algorithms. They
explore that data, build their models, and predict the output. But the amount
of data we give these models isn't the only thing that affects their
performance. It also comes down to the cost function the model is trained to
minimize. Used well, machine learning can save us significant money and time.
We can also understand just how important machine learning is by looking at
its use cases. Some prominent uses are cyber fraud detection, self-driving
cars, Facebook friend suggestions, facial and speech recognition, and spam
email filtering. Plus, major companies like Amazon and Netflix use it to
analyze user preferences and provide product recommendations.
To recap the importance of machine learning, it can:
Analyze and learn from ever-increasing amounts of data
Solve problems too complex for humans
Make decision making more efficient in many sectors
Find patterns hidden in the data and extract information from it
The History of Machine Learning
Until 40 or 50 years ago, machine learning was the stuff of science fiction.
Today, it is a prominent part of our lives, making things much easier for us,
from self-driving cars to product recommendations and virtual assistants
(think Siri, Alexa, Cortana, etc.). However, while machine learning is still
relatively new, the idea behind it has been around for many years. Here are
some of the more important milestones in its history:
1834
The father of the computer, Charles Babbage, came up with the idea of a
device that could easily be programmed with punch cards. The device was
never built, but modern computers rely on its logical structure.
1936
Alan Turing devised the theory that machines can learn a set of instructions
and execute them.
1945
This year saw the completion of ENIAC, a manually programmed machine and
the first general-purpose electronic computer. This led to EDSAC (1949) and
EDVAC (1951), among other stored-program computers, being invented.
1943 - 1950
1943 saw the first modeling of a human neural network with an electrical
circuit. By 1950, scientists had begun applying this idea to analyze how
human neurons potentially worked.
In 1950, Alan Turing also published a seminal paper on artificial intelligence.
His paper was called "Computing Machinery and Intelligence," and it asked an
important question: can machines think?
1952
The pioneer of machine learning, Arthur Samuel, developed a program to
help an IBM computer play checkers. The more it played, the better it got.
1959
Arthur Samuel coined the term "machine learning."
In the same year, a neural network was applied to a real-world problem for
the first time, using adaptive filters to remove echoes from phone lines.
1974 – 1980
This was a tough era for ML and AI researchers, which became known as the
"AI Winter." Machine translation had failed to live up to expectations, and
interest in AI began to wane, which led governments to reduce research
funding.
1985
Charles Rosenberg and Terry Sejnowski invented NETtalk, a neural network
that could teach itself to pronounce 20,000 words correctly in just seven days.
1997
The Deep Blue intelligent computer from IBM beat Garry Kasparov, the
reigning world chess champion, at his own game, becoming the first computer
to defeat a reigning world champion in a match.
2006
A computer scientist called Geoffrey Hinton renamed neural net research,
calling it 'deep learning.' Today it is one of the top-trending technologies.
2012
Google developed a deep neural network that could recognize images of cats
and humans from videos on YouTube.
2014
A chatbot called Eugene Goostman was reported to have passed the Turing
test, becoming the first chatbot to convince human judges on the panel that it
was human, not a machine. It convinced 33% of the judges.
In the same year, Facebook created its own deep neural network called
DeepFace, claiming it had the same precision as humans in recognizing
specific people.
2016
A computer program called AlphaGo beat Lee Sedol, the second-best player
in the world, in a game of Go. The following year, it would go on to beat Ke
Jie, the world's number one player.
2017
An intelligent system was built by Alphabet's Jigsaw team to tackle online
trolling. By reading millions and millions of comments from different
sites, it learned how to recognize trolling so that it could be stopped.
Machine Learning Today
Machine learning has come a long way, and it continues to advance thanks to
research. Modern ML models are now used to predict diseases, forecast the
weather, analyze the stock market, and much more. In the next
chapter, we will delve into the different types of machine learning we can use
today.
Chapter 2: The Different Types of Machine Learning
Like many things, there is more than one way to train a machine learning
algorithm, and each way comes with its own set of pros and cons. To
understand those pros and cons, we first need to look at the type of data they
use. There are two types of data in machine learning – labeled and unlabeled.
Labeled Data – contains both the input and the output parameters in a
machine-readable form. However, a significant amount of human labor
is required to label that data to start with.
Unlabeled Data – contains the inputs only, with no output labels, which
means human labor is not required to prepare it, but the solutions are
more complex.
Machine learning algorithms are separated into four different types:
Supervised Learning
In supervised learning, the algorithm learns by example. The algorithm is
given a known dataset, which contains the inputs and the desired outputs. It is
down to the algorithm to find a method for getting from those inputs to
those outputs. The operator already knows the right answer to the
problem, but the algorithm will identify specific patterns in the data. Then it
will learn from its observations and use that to make predictions. If the
prediction is wrong, the operator corrects it, and the process is repeated until
the algorithm has achieved the highest possible level of accuracy and
performance.
Supervised learning tasks include:
Classification – in these tasks, the machine learning models draw
conclusions from observed values and select the best categories for
new data. For example, a program that determines whether an email is
spam or not must look at existing data and learn how to filter the
emails.
Regression – in these tasks, the models must understand and estimate
relationships between variables. Regression analysis focuses on one
dependent variable and a series of other changing variables, making
regression one of the best tools for forecasting and prediction.
Forecasting – in these tasks, predictions are made about the future
based on present and past data. This is commonly used in trend
analysis.
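The correct-and-repeat loop described for supervised learning can be sketched with the classic perceptron update rule. This is only an illustration: the text does not name a specific algorithm, and the tiny dataset (the logical AND of two inputs) is an assumption chosen because it is known to be learnable this way.

```python
import numpy as np

# Labeled examples: inputs and the desired outputs (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

weights = np.zeros(2)
bias = 0.0

for epoch in range(100):
    mistakes = 0
    for inputs, target in zip(X, y):
        prediction = int(weights @ inputs + bias > 0)
        error = target - prediction
        if error != 0:
            # The prediction was wrong, so nudge the model toward the answer.
            weights += error * inputs
            bias += error
            mistakes += 1
    if mistakes == 0:
        # A full pass with no corrections means the examples are learned.
        break

predictions = (X @ weights + bias > 0).astype(int)
print(predictions)
```

In practice, the "operator" is usually a loss function rather than a person, but the pattern is the same: predict, compare with the known answer, adjust, and repeat.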
Semi-Supervised Learning
Semi-supervised learning differs from supervised learning in that it uses
both labeled and unlabeled data. The labeled data has tags that allow the
algorithm to understand the data, while the unlabeled data doesn't have any
such tags. Using a combination of labeled and unlabeled data means that
the algorithms learn how to put labels on unlabeled data.
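As a rough sketch of how labels can spread from labeled to unlabeled points, the example below uses scikit-learn's LabelPropagation, where unlabeled samples are marked with -1. The one-dimensional data and the choice of this particular estimator are assumptions made for illustration; the text does not prescribe a specific semi-supervised method.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two well-separated groups of points (hypothetical data).
X = np.array([[0.0], [0.2], [0.4], [0.6],
              [5.0], [5.2], [5.4], [5.6]])

# Only one point per group is labeled; -1 marks an unlabeled sample.
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])

model = LabelPropagation()
model.fit(X, y)

# transduction_ holds the label assigned to every training sample,
# including the ones that started out unlabeled.
print(model.transduction_)
```

Only two of the eight points carried a label at the start; the rest were labeled by their proximity to a labeled neighbor.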
Unsupervised Learning
In unsupervised learning, the algorithm examines the data looking for
patterns with no human instruction and no key to learn from. Instead, it
analyzes the data given to it and determines relationships and correlations.
Unsupervised learning leaves the machine to interpret vast data and
determine how to deal with it by organizing the data to describe its structure.
This could be clustering or arranging it in another way that makes it easier to
read. The more data an unsupervised learning algorithm accesses, the better
its decision-making gets.
Unsupervised learning tasks include:
Clustering – sets of data are grouped by similarity, based on
pre-defined criteria. This is useful when data needs to be segmented into
multiple groups and analysis performed on each one to find the
patterns.
Dimension Reduction – this reduces how many variables need to be
considered to find the required information.
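Both unsupervised tasks can be sketched in a few lines with scikit-learn. The data below is made up, and K-Means and PCA are used only as representative examples of clustering and dimension reduction respectively.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical unlabeled data: eight samples with three features each.
X = np.array([
    [1.0, 1.1, 0.9], [1.2, 0.9, 1.0], [0.8, 1.0, 1.1], [1.1, 1.2, 1.0],
    [8.0, 8.1, 7.9], [8.2, 7.9, 8.0], [7.8, 8.0, 8.1], [8.1, 8.2, 8.0],
])

# Clustering: group the samples by similarity, with no labels provided.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)

# Dimension reduction: describe each sample with one variable instead of three.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

The cluster numbers themselves are arbitrary; what matters is which samples end up grouped together.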
Reinforcement Learning
Reinforcement learning revolves around controlled learning processes where
a specific set of actions is provided to the algorithm, with the parameters and
the required outputs. Because the rules are pre-defined, the algorithm can
explore the possibilities and options, monitoring each result and evaluating
them to determine the optimal one. Reinforcement learning is all about trial
and error. Past experiences are studied, and the algorithm continually adapts
its approach until the best result is achieved.
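The trial-and-error idea can be illustrated with a very small example: an agent choosing between three actions with unknown payoffs. This is a sketch of an epsilon-greedy strategy on a made-up problem; the win rates, the value of epsilon, and the number of steps are all assumptions, and real reinforcement learning problems are usually far richer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden from the agent: the chance that each action pays a reward of 1.
true_win_rates = np.array([0.2, 0.5, 0.8])
n_actions = len(true_win_rates)

estimates = np.zeros(n_actions)        # the agent's current belief per action
counts = np.zeros(n_actions, dtype=int)
epsilon = 0.1                          # how often to explore at random

for step in range(2000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))   # explore: try something
    else:
        action = int(np.argmax(estimates))      # exploit: use what was learned
    reward = float(rng.random() < true_win_rates[action])
    counts[action] += 1
    # Running average: adapt the estimate using the result just observed.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(best_action, estimates.round(2))
```

The agent is never told which action is best. It works that out by monitoring the result of each attempt and adjusting its estimates.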
Which Algorithms to Use
Choosing the right algorithm depends on a few factors, such as:
Size of data
Quality of data
Diversity of data
The answers required to derive useful insights from the data
Algorithm accuracy
Training time
The required parameters
Data points
This is not an exhaustive list, and choosing the right one is a combination of
specification, business need, time available, and experimentation. Even the
best data scientists in the world cannot tell you the best algorithm to use right
off the bat. It requires experimentation, but below, you can find a list of the
most popular machine learning algorithms:
Naïve Bayes Classifier (Supervised learning, classification) – based on
Bayes' Theorem, this algorithm treats every feature as independent of
the others. It uses probability to help us predict categories or classes
based on a provided feature set. It may be a simple algorithm, but it
works very well and tends to be used a lot because it can outperform
more sophisticated algorithms on some tasks.
K-Means Clustering (Unsupervised learning, clustering) – this
algorithm places unlabeled data into categories. It searches the data and
finds groups, representing the number of groups by a variable K. It
iteratively assigns data points to a K group based on the provided
features.
Support Vector Machine (Supervised learning, classification) – these
algorithms are used in regression and classification analysis. The
algorithm is given a set of training data, with each example belonging to
one of two categories. The algorithm builds a model that can take new data
and assign it to one of these categories.
Linear Regression (Supervised learning, regression) – this is
regression at its most basic level, allowing us to understand existing
relationships between continuous variables.
Logistic Regression (Supervised learning, classification) – this type of
regression estimates an event’s probability based on previous data. It
covers binary dependent variables, where there can only be two values,
1 and 0, to represent the outcomes.
Artificial Neural Networks (Supervised, unsupervised, and reinforcement learning) – ANNs are
made up of units in layers. Each layer connects to those on either side
of it. The inspiration for these comes from the brain and other
biological systems and how they process information. Essentially, they
are processing elements, all interconnected and working together to
solve a problem.
Decision Trees (Supervised learning, classification and regression) –
decision trees are flow charts with a tree structure. A branching method
is used to illustrate all possible outcomes of a decision, with each node
representing a test on a variable. Each branch is that test’s outcome.
Random Forests (Supervised learning, classification and regression) –
these come under the ensemble learning methods, where several
models are combined to get better results for regression and
classification tasks, among others. Each individual model is
weak on its own but, combined with others, it can give excellent
results. Each model in the forest is a decision tree with an input at the
top. The algorithm traverses the tree, segmenting the data into ever
smaller sets based on certain variables.
K-Nearest Neighbors (Supervised learning) – this algorithm is used to
estimate the likelihood of a data point belonging to one group or
another. It examines the data points surrounding a single point to see
what group it is in. For example, a point is on a grid, and KNN wants
to determine whether it belongs to group A or group B. It looks at the
nearest data points to see which group most of the points are in.
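Because the right algorithm usually has to be found by experiment, a common starting point is to try several candidates on the same data and compare their scores. The sketch below does this with cross-validation on the iris dataset bundled with scikit-learn; the dataset, the particular candidates, and the default settings are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A few of the supervised algorithms listed above, with mostly default settings.
candidates = {
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking will differ from one dataset to the next, which is exactly why experimentation is needed.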
As you can see, choosing the right algorithm is quite involved, and to help
you out, we will go into more detail in the coming chapters for some of these
algorithms, starting with linear regression.
Chapter 3: Linear Regression
Linear regression is one of the basic techniques anyone new to machine
learning and statistical techniques should study before moving on to more
complex methods. First, let’s take a look at what regression is.
What Is Regression?
Regression is a technique that looks for relationships between variables. For
example, you could look at details of employees in a specific company and
determine the relationship between their salary and things like education,
experience, age, where they live, etc. These are known as features.
Each employee’s data represents an observation in this type of regression
problem. There is a presumption that the features are classed as independent
while the salary is dependent on the features. In the same way, you could
examine house prices to determine a mathematical dependence based on their
features, such as the number of bedrooms, living area, how close they are to
the city center, etc.
Typically, regression analysis considers some phenomenon of interest and has
several observations, each with at least two features. If we follow
an assumption that at least one feature is dependent on the other features, we
try to find some kind of relationship between them. In other words, we need a
function that maps some features or variables to others sufficiently well.
Dependent features – these are the dependent variables, otherwise
known as the outputs or the responses
Independent features – these are the independent variables, otherwise
known as the inputs or the predictors
Typically, a regression problem has one dependent variable that is
continuous and unbounded. However, the inputs may be discrete,
continuous, or they may even be categorical data, like brand, nationality,
gender, etc. It is common practice to use $y$ to denote the outputs and $x$
for the inputs. For two or more independent variables, the vector
$\mathbf{x} = (x_1, \ldots, x_r)$ can be used, where $r$ denotes the number of inputs.
When Is Regression Needed?
Regression is usually used to solve problems asking whether and how something
influences something else, or when trying to find the relationship
between multiple variables. For example, regression can be used to work out
if gender or experience affects salaries and to what extent.
It is also used to forecast responses using new predictors. For example, given
the time of day, external temperature, and the number of people in a
household, you could try to predict a household’s electricity consumption for
the next hour.
Many different fields make use of regression, including computer science,
economics, social sciences, etc. Every day, it becomes more important as more
and more data becomes available and we become more aware of how to use
the data.
Linear Regression
One of the most widely used techniques in regression, and possibly the most
important, is linear regression. It is also one of the easiest regression
methods to use, and its results are easy to interpret.
Problem Formulation
Let's say we want to implement linear regression of a dependent variable $y$
on $\mathbf{x} = (x_1, \ldots, x_r)$, a set of independent variables where $r$
denotes the number of predictors. We assume the relationship between $y$
and $\mathbf{x}$ is linear:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \varepsilon$$
That is the regression equation. The regression coefficients are
$\beta_0, \beta_1, \ldots, \beta_r$, while the random error is $\varepsilon$.
Linear regression calculates the estimators of the regression coefficients, or
simply the predicted weights, which are denoted by $b_0, b_1, \ldots, b_r$. These
weights define $f(\mathbf{x}) = b_0 + b_1 x_1 + \cdots + b_r x_r$, which is the
estimated regression function that should be able to capture the
dependencies between the outputs and inputs well.
Each observation $i = 1, \ldots, n$ has a predicted or estimated response
$f(\mathbf{x}_i)$, which should be as near as possible to $y_i$, the actual
corresponding response. The differences $y_i - f(\mathbf{x}_i)$ for all the
observations $i = 1, \ldots, n$ are known as residuals.
Regression is all about working out the best predicted weights, those that
correspond to the smallest residuals.
So how do you get the best weights? The SSR (sum of squared residuals) for
all the observations $i = 1, \ldots, n$ must be minimized:
$$\mathrm{SSR} = \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i) \right)^2$$
This is known as the method of ordinary least squares.
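To make the idea concrete, here is a minimal sketch of ordinary least squares done directly with NumPy, without scikit-learn. It uses the same six data points that the implementation section of this chapter works with. The use of np.linalg.lstsq and the variable names are choices made for this illustration only.

```python
import numpy as np

# The small dataset used later in this chapter.
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Design matrix: a column of ones (for b0) next to the input column (for b1).
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the weights that minimize the SSR.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = weights

# Residuals are the gaps between actual and predicted responses.
residuals = y - (b0 + b1 * x)
ssr = np.sum(residuals ** 2)
print(b0, b1, ssr)
```

Any other choice of weights gives a larger sum of squared residuals for this data, which is what makes these weights the best fit in the least-squares sense.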
Regression Performance
The actual responses $y_i$, $i = 1, \ldots, n$, vary partly because of their
dependence on the predictors $\mathbf{x}_i$. However, we also consider the output's
inherent variance.
The coefficient of determination, $R^2$, indicates how much of the variation in
$y$ can be explained by its dependence on $\mathbf{x}$ using the specific regression
model. The larger $R^2$ is, the better the fit, and the better the model can
explain how the output varies with different inputs.
$R^2 = 1$ corresponds to SSR = 0. This tells you that you have the best fit
because predicted and actual response values fit perfectly.
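The text does not spell out the formula for the coefficient of determination. The sketch below assumes the usual definition, one minus the ratio of the sum of squared residuals to the total sum of squares around the mean of y, and applies it to the dataset used later in this chapter. The line itself is fitted with np.polyfit, which is simply a convenient choice here.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Fit a straight line; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ssr = np.sum((y - y_hat) ** 2)        # variation the model fails to explain
sst = np.sum((y - y.mean()) ** 2)     # total variation in the output
r_squared = 1 - ssr / sst
print(r_squared)
```

If the predictions matched the actual responses exactly, ssr would be zero and r_squared would be exactly 1.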
Implementing Linear Regression in Python
Now you know what linear regression is all about, let's look at how to
implement it in Python. It really is nothing more difficult than importing
the right libraries and using their classes and functions.
Linear Regression Packages
A fundamental package is NumPy, a scientific package that lets you perform
high-performance operations on arrays, whether single or multi-dimensional.
It is open-source and also offers plenty of useful mathematical routines.
Scikit-Learn is another useful machine learning package built on NumPy and
other packages. Scikit-Learn gives you what you need to preprocess the data,
reduce the dimensionality, implement the regression, clustering or
classification, and much more. It is also an open-source package.
Simple Linear Regression
Let's dive in with simple linear regression. When you implement any linear
regression, you need to follow five steps:
1. Import the correct packages and classes
2. Provide the model with data and do the required transformations
3. Build a regression model, fitting it with existing data
4. Check the model fitting results so you know if you have the right one
5. Apply the model to get your predictions.
Step 1: Import the correct packages and classes
First, you need to import the NumPy package and the LinearRegression class
from sklearn.linear_model:
import numpy as np
from sklearn.linear_model import LinearRegression
That gives you everything you need to implement the linear regression.
NumPy's fundamental data type is numpy.ndarray, which is the array type.
For the remainder of this chapter, we will use 'array' to refer to all instances
of numpy.ndarray .
We use the sklearn.linear_model.LinearRegression class to do both linear and
polynomial regression, making the predictions accordingly.
Step 2: Provide the data
Next, we need to define the data we are working with. The inputs
(regressors, x) and the outputs (response, y) must be arrays or
similar. This is the easiest way to provide the data required for the regression:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
You should now have two arrays: the input x and the output y. The x array
must be two-dimensional, which means having one column and however many rows
are required, so we must call .reshape() on x. The argument (-1, 1) tells
.reshape() to use one column and as many rows as the data requires.
x and y now look like this:
>>> print(x)
Output:
[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]
>>> print(y)
Output:
[ 5 20 14 32 22 38]
x has two dimensions, so x.shape is (6, 1), while y has one dimension, so
y.shape is (6,).
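The short check below shows why the reshape step matters. It is a sketch that compares the shapes before and after reshaping and then tries to fit a model on the flat, one-dimensional input; the variable names x_flat and fit_failed are introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_flat = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

# -1 means "work out the number of rows", 1 means "one column".
x = x_flat.reshape((-1, 1))
print(x_flat.shape, x.shape)

# scikit-learn expects a two-dimensional input, so a flat array is rejected.
try:
    LinearRegression().fit(x_flat, y)
    fit_failed = False
except ValueError:
    fit_failed = True
print(fit_failed)
```

The output y can stay one-dimensional; only the input needs the extra dimension.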
Step 3: Build your model and fit it
Next, we need to build our linear regression model and use the existing data
to fit it. First, an instance of the LinearRegression class needs to be created to
represent our model:
model = LinearRegression()
The variable called model is created as an instance of LinearRegression,
which can take several parameters, all optional:
fit_intercept – a Boolean which has a default of True. It determines
whether the intercept $b_0$ should be calculated (True) or considered
equal to zero (False)
normalize – a Boolean with a default of False. It determines whether
the input variables should be normalized (True) or not (False). Note that
this parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so
recent versions no longer accept it
copy_X – a Boolean with a default of True. It determines whether the
input variables should be copied (True) or not (False)
n_jobs – an integer or None, which is the default. It represents how
many jobs are used in parallel computation. None indicates one job,
while -1 indicates all processors used.
Our example will use the defaults for all the parameters.
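To see what one of these parameters does, the sketch below fits the chapter's data twice: once with the defaults and once with fit_intercept=False, which forces the line through the origin. The variable names are chosen just for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Default settings: the intercept is estimated from the data.
default_model = LinearRegression().fit(x, y)

# fit_intercept=False: the intercept is fixed at zero.
no_intercept = LinearRegression(fit_intercept=False).fit(x, y)

print(default_model.intercept_, default_model.coef_)
print(no_intercept.intercept_, no_intercept.coef_)
```

With the intercept fixed at zero, the slope has to change to compensate, so the two models give different predictions for the same input.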
Now we need to use our model. First, .fit() needs to be called:
model.fit(x, y)
Calling .fit() calculates the best values for the weights $b_0$ and $b_1$,
using x and y (the existing input and output) as the arguments. Simply put,
.fit() fits our model. It returns self, which is the variable model itself,
which is why the last two statements can be replaced with
one:
model = LinearRegression().fit(x, y)
This is just a shorter version of the other two statements, and it does exactly
the same thing.
Step 4: Get the results
Once the model has been fitted, the results can be obtained to tell you if the
model works well. We call .score() on model to get R², which is the
coefficient of determination:
>>>
>>> r_sq = model.score(x, y)
>>> print('coefficient of determination:', r_sq)
Output:
coefficient of determination: 0.715875613747954
When you apply .score(), the predictor x and the response y are the
arguments, and the return value is R².
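To make the meaning of R² more concrete, the following is a minimal sketch that recomputes it by hand as 1 - SS_res / SS_tot and compares it with the value returned by .score(). It assumes the same x and y arrays used in this chapter; the names ss_res, ss_tot, and r_sq_manual are illustrative and not part of Scikit-Learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as used in this chapter
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_sq_manual = 1 - ss_res / ss_tot

print('manual R^2:', r_sq_manual)
print('model.score:', model.score(x, y))
```

Both printed values should agree, since .score() uses this same definition for regression models.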
The model’s attributes are .intercept_, representing the intercept b₀,
and .coef_, representing the coefficient (slope) b₁:
>>>
>>> print('intercept:', model.intercept_)
Output:
intercept: 5.633333333333329
>>> print('slope:', model.coef_)
Output:
slope: [0.54]
The code shows you how to get b₀ and b₁. Note that .intercept_ is a
scalar and .coef_ is an array.
The value of b₀ = 5.63 shows that when x is zero, the model will predict
a response of 5.63, while b₁ = 0.54 indicates that, when x increases by
one, the predicted response will rise by 0.54.
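As a cross-check on these two numbers, here is a short sketch of the closed-form least-squares solution for a single input variable, written with NumPy only. The variable names s_xy, s_xx, b0, and b1 are illustrative assumptions; the formulas are the standard ones for simple linear regression.

```python
import numpy as np

# Same data as used in this chapter, as one-dimensional arrays
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Slope: covariance of x and y divided by the variance of x
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
b1 = s_xy / s_xx

# Intercept: the fitted line passes through the point of means
b0 = y.mean() - b1 * x.mean()

print('intercept:', b0)
print('slope:', b1)
```

The results should match model.intercept_ and model.coef_ up to floating-point rounding.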
Also, note that y may be provided as a two-dimensional array, and a similar
result would be obtained. It might look like this:
>>>
>>> new_model = LinearRegression().fit(x, y.reshape((-1, 1)))
>>> print('intercept:', new_model.intercept_)
intercept: [5.63333333]
>>> print('slope:', new_model.coef_)
Output:
slope: [[0.54]]
You can see that this example is much like the last one, but, here,
.intercept_ is a one-dimensional array containing the single element b₀,
while .coef_ is a two-dimensional array containing the single element b₁.
Step 5: Predict the response
When you are satisfied with your model, you use existing or new data to
make predictions. To get the predicted response, you use .predict():
>>>
>>> y_pred = model.predict(x)
>>> print('predicted response:', y_pred, sep='\n')
Output:
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333
35.33333333]
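When you apply .predict(), the regressor is passed as the argument, and you
get the predicted response that corresponds to it. A nearly identical way to
predict the response is to compute it directly from the fitted parameters: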
>>>
>>> y_pred = model.intercept_ + model.coef_ * x
>>> print('predicted response:', y_pred, sep='\n')
Output:
predicted response:
[[ 8.33333333]
[13.73333333]
[19.13333333]
[24.53333333]
[29.93333333]
[35.33333333]]
In this example, each element of x is multiplied by model.coef_, and
model.intercept_ is added to the product.
Here, the only difference in the output from the last example is in the
dimensions. In the first example, the predicted response was
one-dimensional, and, in this one, it is two-dimensional.
If the number of dimensions of x was reduced to one, both examples would
give us the same result. To do this, replace x with one of the following when
you multiply it by model.coef_:
x.reshape(-1)
x.flatten(), or
x.ravel().
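The effect on the output shape can be sketched with NumPy alone. The example below assumes the intercept and slope values reported earlier and stores them in plain variables (intercept and coef are stand-ins for model.intercept_ and model.coef_), so it runs without fitting a model.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))  # shape (6, 1)
intercept = 5.633333333333329  # value reported for model.intercept_
coef = np.array([0.54])        # same shape as model.coef_, i.e. (1,)

# Broadcasting a (6, 1) array with a (1,) array keeps the shape (6, 1)
y_2d = intercept + coef * x

# Flattening x first gives a one-dimensional result of shape (6,)
y_1d = intercept + coef * x.ravel()

print(y_2d.shape, y_1d.shape)
```

The values are identical in both cases; only the array layout differs.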
In practice, we typically use regression for forecasting, which means fitted
models can be used to calculate outputs based on new inputs.
>>>
>>> x_new = np.arange(5).reshape((-1, 1))
>>> print(x_new)
[[0]
[1]
[2]
[3]
[4]]
>>> y_new = model.predict(x_new)
>>> print(y_new)
Output:
[5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]
In this example, we applied .predict() to x_new (a new regressor), and the
result is y_new. The input array is generated using arange() from NumPy,
which returns the elements from 0 (inclusive) to 5 (exclusive): 0, 1, 2, 3, 4.
Let’s now look at the different classification types.
Chapter 4: The Different Types of Classification
Classification is a common type of machine learning task in which a model
assigns a class label to each input example, determining whether the example
is of one type or another. Perhaps the most common example of this is filtering
spam emails, where an email is classified as spam or not spam. Throughout
your journey, you will come across plenty of classification challenges, and
different types of models suit different challenges.
Classification Predictive Modeling
Typically, classification refers to problems where the predicted result is a
type of class label obtained from the provided data. Some of the more
popular types of challenges include:
Spam email classification – determining if an email is spam or not
Handwriting classification – determining if a handwritten character is a
known one or not
User behavior classification – determining if recent behavior indicates
churn or not
All classification models need a training dataset containing plenty of input
and output examples. The model uses this dataset to train itself. The data
should cover the scenarios the model is expected to encounter, and each label
must be represented by enough examples for the model to learn from.
Often, the class labels are provided as string values, which means they must
be encoded as integers before modeling, for example, 0 to represent spam and
1 to represent not spam.
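One way to carry out this encoding is with Scikit-Learn's LabelEncoder, sketched below on a small, made-up list of labels. Note that LabelEncoder assigns the integers in sorted order of the label strings, so the resulting mapping can differ from whichever convention you had in mind.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels for five emails
labels = ["spam", "not spam", "spam", "not spam", "not spam"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes
print(list(encoder.classes_))
print(list(encoded))

# The mapping can be reversed when you need the original strings again
decoded = list(encoder.inverse_transform(encoded))
```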
The only reliable way to determine the best model for a problem is to
experiment and work out which algorithm and configuration provide the best
performance. In predictive modeling, the algorithms are compared based on
their results. One of the most widely used metrics for evaluating a model’s
performance on class label predictions is classification accuracy, the
fraction of predictions that are correct. It may not be the best metric for
every problem, but it is certainly a good place to start in most
classification tasks.
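Classification accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand and with Scikit-Learn's accuracy_score() on two small, made-up label arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for eight examples
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Six of the eight predictions are correct here, so both values are 0.75.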
Rather than a class label, some tasks require predicting the probability of
class membership for each input. In cases like this, one of the most helpful
tools for evaluating the predicted probabilities is the ROC curve. In your
machine learning journey, you will probably come across four types of
classification tasks:
Binary classification
Multi-label classification
Multi-class classification
Imbalanced classification
Let’s look into each one, with code examples to show you how they work.
Binary Classification
Binary classification covers tasks that can provide an output of one of two
class labels. Typically, one class label is the normal state, while the other is
the abnormal state. We can understand this better with the following
examples:
Detecting spam emails – normal state = not spam, while abnormal
state = spam
Churn prediction – normal state = not churned, while abnormal
state = churn
Conversion prediction – normal state = purchased an item, while
abnormal state = didn’t purchase an item
Cancer detection – normal state = no cancer detected, while
abnormal state = cancer detected
The notation typically followed is that the normal state is assigned 0, while
the abnormal state is assigned 1. A binary classification task is commonly
modeled by predicting a Bernoulli probability distribution for each example.
The Bernoulli distribution is a discrete distribution covering the case where
an event has one of two outcomes, 0 or 1. Simply put, the model predicts the
probability that an example belongs to class 1, the abnormal state.
The commonly used binary classification algorithms are:
K-Nearest Neighbors
Logistic Regression
Support Vector Machine
Decision Trees
Naive Bayes
Some of these algorithms are designed specifically for binary classification
and do not have native support for any more than two class types. This
includes Logistic Regression and Support Vector Machines.
To show you how binary classification works, we will create a dataset and
inspect it. We will generate a synthetic binary classification dataset using
the make_blobs() function from Scikit-Learn. In our example, we have a
dataset containing 5,000 examples, each belonging to one of the two
classes, with two input features:
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define the dataset
X, y = make_blobs(n_samples=5000, centers=2, random_state=1)
# summarize the dataset shape
print(X.shape, y.shape)
# summarize the class distribution
counter = Counter(y)
print(counter)
# show the first 10 examples
for i in range(10):
    print(X[i], y[i])
# plot the examples, one color per class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Output:
(5000, 2) (5000,)
Counter({1: 2500, 0: 2500})
[-11.5739555 -3.2062213] 1
[0.05752883 3.60221288] 0
[-1.03619773 3.97153319] 0
[-8.22983437 -3.54309524] 1
[-10.49210036 -4.70600004] 1
[-10.74348914 -5.9057007 ] 1
[-3.20386867 4.51629714] 0
[-1.98063705 4.9672959 ] 0
[-8.61268072 -3.6579652 ] 1
[-10.54840697 -2.91203705] 1
In this example, a dataset is created with 5,000 samples divided into two
parts – input X and output y. The class distribution shows that each instance
belongs to either class 0 or class 1, and each class holds exactly 50% of
the instances (2,500 each).
The first 10 examples are displayed with numeric input values and a target
value of an integer representing class membership.
The input variables are then shown on a scatter plot, with color-coded points
based on the class values.
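To take this one step further, the following is a minimal sketch of training a binary classifier on such a dataset. It assumes Logistic Regression as the model and a 70/30 train/test split; both are illustrative choices rather than the only reasonable ones.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the example above
X, y = make_blobs(n_samples=5000, centers=2, random_state=1)

# Hold out 30% of the data to evaluate the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# predict_proba returns the estimated probability of class 0 and class 1
probabilities = clf.predict_proba(X_test[:3])
print(accuracy)
print(probabilities)
```

Because the two blobs are well separated, the accuracy on the held-out data should be very high.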
Multi-Class Classification
As the name indicates, these problems are not limited to two class labels;
instead, each example is assigned one of more than two classes. Some of the
most common multi-class classification types are:
Facial classification
Plant species classification
Optical character classification
There is no notion of a normal or abnormal outcome; instead, each example is
classified as belonging to one of a range of known classes. The number of
class labels may be very large, such as in a facial recognition system, where
an image may be predicted as belonging to one of thousands of faces.
You could also treat a challenge whereby the next word in a sequence
needs to be predicted as a multi-class classification problem. In a scenario
like this, every word in the vocabulary is a possible class, and the number of
classes could run into the millions.
A categorical (Multinoulli) distribution is typically used for these types of
models, whereas the Bernoulli distribution is used for binary classification.
In a categorical distribution, an event can have one of several outcomes, and
the model predicts the probability of an input belonging to each of the class
labels.
The following algorithms are commonly used for multi-class classification:
K-Nearest Neighbors
Naive Bayes
Decision trees
Gradient Boosting
Random Forest
The binary classification algorithms can also be used for multi-class
classification through one of two strategies: one vs. rest (one class vs. all
the others) or one vs. one (one model for each pair of classes).
One vs. Rest – one binary model is fit for each class, separating that
class from all the others
One vs. One – one binary model is fit for each pair of classes
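The two strategies can be sketched with Scikit-Learn's OneVsRestClassifier and OneVsOneClassifier wrappers. The example assumes Logistic Regression as the underlying binary model and a four-class dataset from make_blobs(); with four classes, one vs. rest trains 4 models and one vs. one trains 6 (one per pair).

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Four-class synthetic dataset
X, y = make_blobs(n_samples=1000, centers=4, random_state=1)

# One vs. Rest: one binary model per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# One vs. One: one binary model per pair of classes
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)

n_ovr_models = len(ovr.estimators_)
n_ovo_models = len(ovo.estimators_)
print(n_ovr_models, n_ovo_models)
```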
As in binary classification, we will use the make_blobs() function:
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define the dataset
X, y = make_blobs(n_samples=1000, centers=4, random_state=1)
# summarize the dataset shape
print(X.shape, y.shape)
# summarize the class distribution
counter = Counter(y)
print(counter)
# show the first 10 examples
for i in range(10):
    print(X[i], y[i])
# plot the examples, one color per class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Output:
(1000, 2) (1000,)
Counter({1: 250, 2: 250, 0: 250, 3: 250})
[-10.45765533 -3.30899488] 1
[-5.90962043 -7.80717036] 2
[-1.00497975 4.35530142] 0
[-6.63784922 -4.52085249] 3
[-6.3466658 -8.89940182] 2
[-4.67047183 -3.35527602] 3
[-5.62742066 -1.70195987] 3
[-6.91064247 -2.83731201] 3
[-1.76490462 5.03668554] 0
[-8.70416288 -4.39234621] 1
Here, it’s clear that we have more than two classes (four in this case, with
250 examples each), and every example belongs to exactly one of them.
Multi-Label Classification
Multi-label classification covers classification tasks where two or more class
labels may be assigned to each example. One example would be the
classification of photos, where one image may contain multiple objects, such
as fruit, an animal, a person, etc. The biggest difference from the previous
tasks lies in the fact that these models can predict more than one label for
the same example.
Standard multi-class and binary classification models cannot be used directly
for multi-label classification. Instead, specialized versions of the
algorithms are needed that support predicting several labels at once, making
this more challenging than a simple Yes or No decision.
Some of the algorithms commonly used in multi-label classification are:
Multi-label Random Forests
Multi-label Decision trees
Multi-label Gradient Boosting
You could also use a different approach, in which a separate classification
algorithm is used to predict each label. In our example, the multi-label
classification dataset is generated using the
make_multilabel_classification() function from Scikit-Learn. The code creates
a dataset with 1000 samples, 3 input features, and 4 class labels:
from sklearn.datasets import make_multilabel_classification
# define the dataset
X, y = make_multilabel_classification(n_samples=1000, n_features=3,
                                      n_classes=4, n_labels=4, random_state=1)
# summarize the dataset shape
print(X.shape, y.shape)
# show the first 10 examples
for i in range(10):
    print(X[i], y[i])
Output:
(1000, 3) (1000, 4)
[ 8. 11. 13.] [1 1 0 1]
[ 5. 15. 21.] [1 1 0 1]
[15. 30. 14.] [1 0 0 0]
[ 3. 15. 40.] [0 1 0 0]
[ 7. 22. 14.] [1 0 0 1]
[12. 28. 15.] [1 0 0 0]
[ 7. 30. 24.] [1 1 0 1]
[15. 30. 14.] [1 1 1 1]
[10. 23. 21.] [1 1 1 1]
[10. 19. 16.] [1 1 0 1]
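The idea of using a separate classifier for each label can be sketched with Scikit-Learn's MultiOutputClassifier, which fits one copy of a base model per label column. The choice of Logistic Regression as the base model is an assumption made for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Same multi-label dataset as in the example above
X, y = make_multilabel_classification(n_samples=1000, n_features=3,
                                      n_classes=4, n_labels=4, random_state=1)

# One independent binary classifier is trained for each of the 4 labels
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Each prediction is a row of 4 values, one 0/1 decision per label
pred = clf.predict(X[:5])
print(pred)
```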
Imbalanced Classification
Imbalanced classification refers to tasks with an uneven distribution of
examples across the classes. Typically, these are binary classification tasks
where a large percentage of the training set belongs to the normal class, and
only a small fraction belongs to the abnormal class.
This type of classification is commonly used in:
Fraud detection
Medical diagnosis
Outlier detection
Special techniques can be used to change the composition of the training
dataset. You can choose between over-sampling the minority class or
under-sampling the majority class. SMOTE over-sampling and random
under-sampling are two of the best-known examples.
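As an illustration, here is a minimal sketch of random over-sampling using resample() from sklearn.utils. It is a simpler stand-in for SMOTE (which is provided by the separate imbalanced-learn package and creates synthetic points rather than duplicates). The variable names X_min, X_maj, X_bal, and y_bal are illustrative.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic dataset with roughly 99% of examples in class 0
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           weights=[0.99, 0.01], random_state=1)

X_maj, X_min = X[y == 0], X[y == 1]

# Draw minority rows with replacement until both classes are the same size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print(Counter(y), Counter(y_bal))
```

In practice, resampling should be applied to the training split only, so that the test data keeps the original class distribution.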
When you are fitting the model on the training dataset, you can also use
special modeling algorithms to focus more on the smaller class. This includes
using cost-sensitive algorithms, such as:
Cost-Sensitive Logistic Regression
Cost-Sensitive Decision Trees
Cost-Sensitive Support Vector Machines
Once the model is chosen, we need to assess and score it. Because accuracy
can be misleading on imbalanced data, we can do that using the Precision,
Recall, or F-Measure score. We need a dataset developed for the problem, and
we’ll generate a synthetic, imbalanced binary classification dataset
containing 1000 samples:
from numpy import where
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define the dataset, with about 99% of examples in class 0
X, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0, n_classes=2,
                           n_clusters_per_class=1, weights=[0.99, 0.01],
                           random_state=1)
# summarize the dataset shape
print(X.shape, y.shape)
# summarize the class distribution
counter = Counter(y)
print(counter)
# show the first 10 examples
for i in range(10):
    print(X[i], y[i])
# plot the examples, one color per class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Output:
(1000, 2) (1000,)
Counter({0: 983, 1: 17})
[0.86924745 1.18613612] 0
[1.55110839 1.81032905] 0
[1.29361936 1.01094607] 0
[1.11988947 1.63251786] 0
[1.04235568 1.12152929] 0
[1.18114858 0.92397607] 0
[1.1365562 1.17652556] 0
[0.46291729 0.72924998] 0
[0.18315826 1.07141766] 0
[0.32411648 0.53515376] 0
Here, the label distribution can be seen, along with a serious class imbalance,
with 17 examples belonging to class 1 and the remaining 983 belonging to
class 0. As expected, the majority of examples belong to class 0.
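The following sketch shows one way a cost-sensitive model could be fit and scored on such a dataset. It assumes Logistic Regression with class_weight="balanced" as the cost-sensitive algorithm and a stratified 50/50 split; these are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           weights=[0.99, 0.01], random_state=1)

# stratify=y keeps the rare class present in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

# class_weight="balanced" penalizes mistakes on the rare class more heavily
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Precision, recall, and F-measure for the minority class (label 1)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="binary", zero_division=0)
print(precision, recall, f1)
```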
In the next chapter, we’ll look at SVMs or Support Vector Machines using
Scikit-Learn.
Chapter 5: Support Vector Machines with Scikit-Learn
This chapter will take an in-depth look at a popular algorithm used in
supervised machine learning - Support Vector Machines.
SVM often achieves high accuracy compared with logistic regression, decision
trees, and other similar classifiers, although the best choice depends on the
data. Perhaps one of its best-known features is the kernel trick, which lets
it handle non-linear input spaces. It is used in several applications,
including intrusion detection, facial recognition, the classification of
emails, web pages, news articles, and genes, and handwriting recognition.
SVM is one of the most exciting algorithms, and its concepts are simple. The
classifier uses the hyperplane with the biggest margin to separate the data
points, and then uses that optimal hyperplane to classify new data points.
Support Vector Machines
Typically, a Support Vector Machine is considered a classification
approach, but it can also be employed in regression problems. It handles
multiple categorical and continuous variables with ease.
The SVM constructs a hyperplane in multidimensional space to keep the
different classes separate. An optimal hyperplane is generated iteratively,
minimizing the classification error. The primary idea behind the SVM is to
find the MMH (Maximum Marginal Hyperplane) that best splits the dataset
into classes.
Support Vectors – These are the data points nearest to the hyperplane.
They are used to calculate the margin, so they define the separating line
more precisely and are the points most relevant to the classifier's
construction.
Hyperplane – This is a decision plane separating a set of objects with
different class memberships
Margin – This is the gap between the two lines drawn through the nearest
points of each class. It is calculated as the perpendicular distance
between the separating line and the closest points, or support vectors.
A larger margin between the classes is considered good, while a smaller
margin is considered bad.
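These three terms can be inspected on a fitted model. The sketch below assumes a linear-kernel SVC trained on a small two-class dataset from make_blobs(); for a linear SVM, the width of the margin is 2 / ||w||, where w is the weight vector stored in coef_.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small, well-separated two-class dataset
X, y = make_blobs(n_samples=100, centers=2, random_state=1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Support vectors: the training points closest to the hyperplane
support_vectors = clf.support_vectors_

# Hyperplane: w . x + b = 0, with w in coef_ and b in intercept_
w = clf.coef_[0]
b = clf.intercept_[0]

# Margin width between the two boundary lines
margin_width = 2 / np.linalg.norm(w)

print(len(support_vectors), margin_width)
```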
How Does SVM Work?
The SVM's primary objective is to split the dataset as cleanly as possible.
The distance between the nearest points of the two classes is called the
margin, and the idea is to choose the hyperplane with the biggest possible
margin between the dataset's support vectors.
Non-Linear and Inseparable Planes
A linear hyperplane cannot be used to solve all problems. In situations
where it cannot, a kernel trick is used to transform the input space into a
higher-dimensional space. For example, the data points can be plotted on the
x-axis and a new z-axis, where z is derived from the original coordinates
(such as z = x² + y²), allowing you to use linear separation to segregate the
points.
SVM Kernels
In practice, a kernel is used to implement the SVM algorithm, transforming
the input data space into the required form. This is done using a technique
known as the 'kernel trick', where the low-dimensional input space is turned
into a higher-dimensional space. In simple terms, dimensions are added to a
non-separable problem to convert it into a separable one. This is
incredibly useful in problems revolving around non-linear separation, and it
helps construct more accurate classifiers.
Linear Kernel
A linear kernel is simply the dot product of two given observations. The
product of two vectors is the sum of the products of each pair of input
values.
K(x, xi) = sum(x * xi)
Polynomial Kernel
This is a more generalized form of the linear kernel, and it can distinguish
curved or non-linear input spaces.
K(x, xi) = (1 + sum(x * xi))^d
In this, d indicates the degree of the polynomial. d = 1 is much the same as
the linear transformation. The degree must be specified manually in the
learning algorithm.
Radial Basis Function Kernel
This is one of the most popular kernel functions in SVM classification and
can map the input space into an infinite-dimensional space.
K(x, xi) = exp(-gamma * sum((x - xi)^2))
In this, gamma is a parameter that must be greater than 0; values between 0
and 1 are typical. Higher gamma values fit the training set more closely and
can result in over-fitting. A value such as gamma = 0.1 is often a reasonable
starting point, and, as with the polynomial degree, the gamma value is
specified manually in the learning algorithm.
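To make the three kernels concrete, here is a minimal NumPy sketch of each one. The function names are illustrative, the polynomial kernel uses the standard form (1 + sum(x * xi))^d, and the RBF kernel uses the squared distance between the two vectors.

```python
import numpy as np

def linear_kernel(x, xi):
    # Dot product: sum of the products of each pair of input values
    return np.sum(x * xi)

def polynomial_kernel(x, xi, d=2):
    # d is the degree of the polynomial
    return (1 + np.sum(x * xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    # Similarity decays with the squared distance between x and xi
    return np.exp(-gamma * np.sum((x - xi) ** 2))

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

k_lin = linear_kernel(a, b)      # 1*3 + 2*4
k_poly = polynomial_kernel(a, b) # (1 + 11) squared
k_rbf = rbf_kernel(a, b)         # exp(-0.1 * 8)
print(k_lin, k_poly, k_rbf)
```

Note that the RBF kernel returns 1 when both vectors are identical and approaches 0 as they move apart.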
Classifier Building in Scikit-learn
Now you know the theory behind SVMs, it's time to look at how to
implement one in Python using Scikit-Learn.
We'll be using the well-known breast cancer dataset, a popular binary
classification problem. Its features are computed from digitized images of
FNA (fine needle aspirates) of breast masses and describe the characteristics
of the cell nuclei in the images.
There are 30 features in the dataset:
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
radius error
texture error
perimeter error
area error
smoothness error
compactness error
concavity error
concave points error
symmetry error
fractal dimension error
worst radius
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension
It also has a target, which is the type of cancer. There are two types –
malignant and benign. We want to build a model to classify the cancer types.
The dataset can be loaded from the Scikit-Learn library or downloaded from
the UCI Machine Learning Repository.
Step One - Loading Data
First, we need to load the dataset – we'll get it from the Scikit-Learn library:
#Import scikit-learn dataset library
from sklearn import datasets
#Load dataset
cancer = datasets.load_breast_cancer()
Step Two - Exploring Data
Once the dataset is loaded, we can look at it to see more information about it.
We'll look at the 30 features and the target names:
# print the names of the 30 features
print("Features: ", cancer.feature_names)
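As a rough sketch of how the remaining steps might look, the example below explores the dataset, splits it, trains a linear-kernel SVM, and measures its accuracy. The 70/30 split, the random_state value, and the use of StandardScaler before the classifier are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = datasets.load_breast_cancer()

# Explore the data: 569 samples, 30 features, two target classes
print(cancer.data.shape)
print(cancer.target_names)

# Keep 30% of the samples aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)

# Scale the features, then fit an SVM with a linear kernel
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy:", accuracy)
```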
In this equation, |Dj|/|D| acts as the jth partition's weight, while v
indicates the number of discrete values of attribute A.
We can define the gain ratio as:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the highest gain ratio is picked as the splitting attribute.
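The gain ratio can be sketched in a few lines of NumPy. The example assumes the usual definitions: information gain is the entropy of the labels minus the weighted entropy of the partitions, and the split information is the entropy of the partition sizes. The function names and the tiny dataset are illustrative.

```python
import numpy as np

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the classes present in labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    attribute = np.asarray(attribute)
    labels = np.asarray(labels)
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()  # |Dj| / |D| for each partition
    info_a = sum(w * entropy(labels[attribute == v])
                 for v, w in zip(values, weights))
    gain = entropy(labels) - info_a
    split_info = -np.sum(weights * np.log2(weights))
    # Assumes the attribute takes at least two values (split_info > 0)
    return gain / split_info

outlook = ["sunny", "sunny", "rain", "rain"]
play = [0, 0, 1, 1]
ratio = gain_ratio(outlook, play)
print(ratio)
```

Here the attribute separates the two classes perfectly, so the gain ratio is 1.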
Gini Index
CART (Classification and Regression Tree) is a decision tree algorithm that
uses Gini to create the split points:
Gini(D) = 1 - sum(Pi^2)
In this equation, Pi indicates the probability of a tuple from D belonging to
class Ci.
The Gini Index takes each attribute's binary split into account. A weighted
sum of each partition's impurity can be computed; where a binary split on
attribute A results in data D being split into D1 and D2, then D's Gini Index
is:
Gini_A(D) = (|D1|/|D|) * Gini(D1) + (|D2|/|D|) * Gini(D2)
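The Gini calculation for a binary split can be sketched as follows, assuming the standard definitions: the impurity of a node is 1 minus the sum of squared class probabilities, and the impurity of a split is the size-weighted sum of the impurities of D1 and D2. The function names are illustrative.

```python
import numpy as np

def gini(labels):
    # 1 - sum(Pi^2), where Pi is the share of class Ci in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(d1, d2):
    # Weighted sum of the impurities of the two partitions
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

mixed_node = [0, 0, 1, 1]
pure_node = [1, 1, 1]
split_value = gini_split([0, 0, 0], [1, 1, 0])

print(gini(mixed_node), gini(pure_node), split_value)
```

A pure node has an impurity of 0, while an evenly mixed two-class node has the maximum impurity of 0.5.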
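# Imports needed for exporting and rendering the tree
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
# clf is the fitted classifier and feature_cols the list of feature
# names from the earlier steps
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())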
In the resulting chart, each internal node has a decision rule used to
split the data. Gini, also referred to as the Gini ratio, measures the node's
impurity. A node is said to be pure when all the records in it share the same
class; such nodes are leaf nodes.
The resulting tree is not pruned, which makes it difficult to explain or
understand, so we need to prune it and optimize it.
Optimizing Decision Tree Performance
Scikit-Learn has traditionally supported only pre-pruning for optimizing the
decision tree classifier (newer releases add cost-complexity pruning through
the ccp_alpha parameter). A tree's maximum depth may be used as one of the
control variables for pre-pruning. In the example below, a decision tree is
built on the same data using max_depth=3.
The pre-pruning parameters are:
criterion – optional (default="gini"). Chooses the attribute selection
measure. This parameter allows us to use different attribute selection
measures. The supported criteria are 'gini' for the Gini Index and
'entropy' for information gain.
splitter – a string, optional (default="best"). Chooses the split
strategy. The supported strategies are 'best' to choose the best split
and 'random' to choose the best random split.
max_depth – int or None, optional (default=None). The maximum
depth of the tree. If this is None, the nodes are expanded until all
leaves are pure or contain fewer than min_samples_split samples. The
higher the maximum depth value, the more chance of overfitting, while
lower values risk underfitting.
Other than using these parameters, we can also use entropy or other attribute
selection measures:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
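The effect of max_depth can be seen by comparing an unrestricted tree with a pre-pruned one. The sketch below uses the breast cancer dataset bundled with Scikit-Learn as a stand-in for the data used in this chapter; the split and random_state values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Unrestricted tree: nodes are expanded until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Pre-pruned tree: growth stops after three levels of splits
pruned_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                     random_state=1).fit(X_train, y_train)

print(full_tree.get_depth(), full_tree.score(X_test, y_test))
print(pruned_tree.get_depth(), pruned_tree.score(X_test, y_test))
```

The pruned tree is much smaller and easier to read, and its test accuracy is often comparable to that of the full tree.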
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Step Five - Training and Predictions
Training the KNN algorithm to make predictions is straightforward,
particularly when Scikit-Learn is used:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
First, the KNeighborsClassifier class is imported from the
sklearn.neighbors module. Second, the class is initialized with a single
parameter, n_neighbors, which is K's value. K doesn't have an ideal value; it
is chosen by testing and evaluating several candidates. However, 5 is a
commonly used value and is the default in Scikit-Learn.
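One simple way to compare candidate values of K is to train a model for each value and record its error rate. The sketch below assumes the iris dataset as a stand-in and tries K from 1 to 20; the names errors and best_k are illustrative. Choosing K on the test set gives an optimistic estimate, so cross-validation on the training data is usually preferable in practice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Scale the features using statistics from the training data only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

k_values = list(range(1, 21))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Share of test examples that were misclassified for this K
    errors.append(np.mean(knn.predict(X_test) != y_test))

best_k = k_values[int(np.argmin(errors))]
print(best_k, min(errors))
```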
Now we need to use our test data to make predictions:
y_pred = classifier.predict(X_test)
Step Six - Evaluating the Algorithm
A few metrics commonly used in algorithm evaluation are confusion matrix,
precision, recall, and F1 score. We can use two methods from sklearn.metrics
to calculate the metrics – confusion_matrix and classification_report.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
[[11  0  0]
 [ 0 13  0]
 [ 0  1  6]]
precision recall f1-score support
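To show how the report relates to the confusion matrix, the sketch below derives precision, recall, and F1 score per class directly from the matrix printed above. It relies on Scikit-Learn's convention that rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix from the output above
cm = np.array([[11, 0, 0],
               [0, 13, 0],
               [0, 1, 6]])

true_positives = np.diag(cm)

# Precision: correct predictions divided by all predictions of that class
precision = true_positives / cm.sum(axis=0)

# Recall: correct predictions divided by all true members of that class
recall = true_positives / cm.sum(axis=1)

f1 = 2 * precision * recall / (precision + recall)

print(precision)
print(recall)
print(f1)
```

The single misclassified example lowers the precision of the second class to 13/14 and the recall of the third class to 6/7.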
# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Features
print(iris_df.feature_names)
# Targets
print(iris_df.target)
# Target Names
print(iris_df.target_names)
Output:
label = {0: 'red', 1: 'blue', 2: 'green'}
# Dataset Slicing
x_axis = iris_df.data[:, 0] # Sepal Length
y_axis = iris_df.data[:, 2] # Petal Length
# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
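# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans
# Loading dataset
iris_df = datasets.load_iris()
# Declaring Model
model = KMeans(n_clusters=3)
# Fitting Model
model.fit(iris_df.data)
# Predicting a single input (illustrative measurements, not from the dataset)
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])
# Predicting on the entire dataset
all_predictions = model.predict(iris_df.data)
# Printing Predictions
print(predicted_label)
print(all_predictions)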
Output:
[0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2
2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2]
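The cluster numbers produced by K-Means are arbitrary, so they cannot be compared with the species labels one-to-one. One way to measure how well the clusters line up with the known species is the adjusted Rand index, sketched below. The n_init and random_state values are illustrative choices that make the run repeatable.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris_df = datasets.load_iris()

# Fixing random_state makes the cluster assignment repeatable
model = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = model.fit_predict(iris_df.data)

# 1.0 means the clusters match the species exactly; values near 0 mean
# the agreement is no better than chance
ari = adjusted_rand_score(iris_df.target, cluster_labels)

print(ari)
print(model.cluster_centers_.shape)
```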
Hierarchical Clustering
The hierarchical algorithm builds a hierarchy of clusters, as the name implies.
At the start, each data point is assigned to its own cluster. Then the two
nearest clusters are merged into one cluster, and so on until only one
cluster is left.
Here's an example using a grain dataset.
# Importing Modules
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
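import pandas as pd
# Reading the DataFrame ('seeds.csv' is a placeholder name for your own
# copy of the grain dataset)
seeds_df = pd.read_csv('seeds.csv')
# Remove the grain species from the DataFrame, save for later
varieties = list(seeds_df.pop('grain_variety'))
# Extract the measurements as a NumPy array
samples = seeds_df.values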
"""
Perform hierarchical clustering on samples using the
linkage() function with the method='complete' keyword argument.
Assign the result to mergings.
"""
mergings = linkage(samples, method='complete')
"""
Plot a dendrogram using the dendrogram() function on mergings,
specifying the keyword arguments labels=varieties, leaf_rotation=90,
and leaf_font_size=6.
"""
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
           )
plt.show()
The result will be shown as a dendrogram plot.
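Besides plotting the dendrogram, you can cut the hierarchy to obtain flat cluster labels with SciPy's fcluster() function. The sketch below uses a small synthetic dataset with three well-separated groups as a hypothetical stand-in for the grain measurements.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Three synthetic groups of 10 points each, centered at 0, 5, and 10
samples = np.vstack([rng.normal(loc=center, scale=0.3, size=(10, 2))
                     for center in (0.0, 5.0, 10.0)])

# Build the hierarchy with complete linkage, as in the example above
mergings = linkage(samples, method='complete')

# Cut the hierarchy so that at most three flat clusters remain
labels = fcluster(mergings, t=3, criterion='maxclust')

print(labels)
```

Each point receives a cluster number starting from 1, and the points of each synthetic group end up sharing the same number.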
K-Means vs. Hierarchical Clustering
There are a few differences worth mentioning:
K-Means can handle big data efficiently, while hierarchical clustering
cannot. This is because K-Means has linear time complexity, i.e., O(n),
while hierarchical clustering has quadratic time complexity, i.e., O(n²)
K-Means starts from a random choice of cluster centers, so the
results may differ when the algorithm is run several times. In
hierarchical clustering, the results are reproducible
K-Means works well with hyperspherical cluster shapes, like a sphere
in 3D or a circle in 2D
K-Means does not cope well with noisy data, whereas hierarchical
clustering can be applied to noisy data directly.
T-SNE Clustering
One of the best unsupervised learning algorithms for visualization is t-SNE,
otherwise known as t-distributed stochastic neighbor embedding. This
algorithm is used for mapping higher-dimensional space into two-dimensional
or three-dimensional space so it can be visualized. More specifically, each
high-dimensional object is modeled by a two-dimensional or three-dimensional
point in such a way that similar objects are modeled by nearby points, while
dissimilar objects are modeled by distant points with high probability.
Here's t-SNE implemented on the iris dataset:
# Importing Modules
from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Defining Model
model = TSNE(learning_rate=100)
# Fitting Model
transformed = model.fit_transform(iris_df.data)
# Plotting 2d t-SNE
x_axis = transformed[:, 0]
y_axis = transformed[:, 1]
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
# Importing Modules
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
# Load Dataset
iris = load_iris()
# Declaring Model
dbscan = DBSCAN()
# Fitting
dbscan.fit(iris.data)
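After fitting, the cluster assignments are available in the labels_ attribute, where the label -1 marks points that DBSCAN treats as noise. The sketch below counts the clusters and noise points using the default parameters; the variable names n_clusters and n_noise are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris()
dbscan = DBSCAN().fit(iris.data)

# One label per sample; -1 means the point was classified as noise
labels = dbscan.labels_

# Count the clusters, leaving out the noise label if it is present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)
```

Unlike K-Means, the number of clusters is not set in advance; it follows from the eps and min_samples parameters.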