
Python

Machine Learning
The Crash Course for Beginners to Programming and
Deep Learning, Artificial Intelligence, Neural
Networks and Data Science.
Scikit Learn, Tensorflow, Pandas and Numpy.
Python
Machine Learning
Introduction
Chapter 1: Understanding Machine Learning
Chapter 2: What is Supervised Machine Learning?
Chapter 3: What is Unsupervised Machine Learning?
Chapter 4: Setting Up the Environment with Python
Chapter 5: Data Preprocessing with Machine Learning
Chapter 6: Linear Regression with Python
Chapter 7: Decision Trees to Handle Your Regression Problems
Chapter 8: Random Forest Algorithms That Work with Python
Chapter 9: Support Vector Regression Problems
Chapter 10: Naïve Bayes Problems and Python Machine Learning
Chapter 11: K-Nearest Neighbors for Classification Problems
Chapter 12: Data Wrangling Visualization and Aggregation
Chapter 13: Django and Web Application
Chapter 14: A Look at the Neural Networks
Conclusion
© Copyright 2019 by Django Smith - All rights reserved.
The following Book is reproduced below with the goal of providing
information that is as accurate and reliable as possible. Regardless,
purchasing this Book can be seen as consent to the fact that both the
publisher and the author of this book are in no way experts on the topics
discussed within and that any recommendations or suggestions that are
made herein are for entertainment purposes only. Professionals should be
consulted as needed prior to undertaking any of the action endorsed herein.
This declaration is deemed fair and valid by both the American Bar
Association and the Committee of Publishers Association and is legally
binding throughout the United States.
Furthermore, the transmission, duplication, or reproduction of any of the
following work including specific information will be considered an illegal
act irrespective of if it is done electronically or in print. This extends to
creating a secondary or tertiary copy of the work or a recorded copy and is
only allowed with the express written consent from the Publisher. All additional rights reserved.
The information in the following pages is broadly considered a truthful and
accurate account of facts and as such, any inattention, use, or misuse of the
information in question by the reader will render any resulting actions
solely under their purview. There are no scenarios in which the publisher or
the original author of this work can be in any fashion deemed liable for any hardship or damages that may befall them after acting on the information described herein.
Additionally, the information in the following pages is intended only for
informational purposes and should thus be thought of as universal. As
befitting its nature, it is presented without assurance regarding its prolonged
validity or interim quality. Trademarks that are mentioned are done without
written consent and can in no way be considered an endorsement from the
trademark holder.
Introduction
Congratulations on purchasing Python Machine Learning, and thank you for doing so.

The following chapters will discuss everything that you need to know in
order to get started with Python machine learning. Machine learning is a
growing field, one that is taking over the world of technology and helping
us to do and learn things like never before. And when you add in some of
the cool things that you are able to do with the Python language along with
it, you will find that this kind of learning is easy to work with as well.

The beginning of this guidebook is going to spend some time looking at machine learning and what it all entails. We will look at the basics of machine learning as well as some of the most common types of machine learning, including supervised and unsupervised machine learning. Once we have a good idea of how we can work with machine learning, it is time to move on to a bit of work with Python. We will explore how to set up our Python environment to work with machine learning, before moving on to some different things that you can do with Python machine learning, including data preprocessing, linear regression, and more!

 
Once we have the ideas of machine learning and Python all set up, it is time
to learn a lot of the other things that you will be able to do when it comes to
machine learning. We will look at how to work with decision trees and
random forests, support vector regressions, Naïve Bayes problems and
KNN classification problems. To end this guidebook, we will look at data
wrangling visualization, web applications, and accelerated data analysis that
can be done with the help of the Python language.

There are so many things that you are able to do when it comes to machine
learning, and this is a field that is going to grow and grow over time. Make
sure to check out this guidebook to help you get started learning how to
work with Python machine learning!             

There are plenty of books on this subject on the market, so thanks again for choosing this one! Every effort was made to ensure it is full of as much useful information as possible. Please enjoy!
Chapter 1: Understanding Machine Learning
You will find that the Python language is going to be able to help us out a
lot when it comes to working with machine learning. But before we really
get into all of this, we need to take a look at machine learning and some of
the different parts that come with it. The field of machine learning is
growing like crazy, and being able to use it in the proper manner, and
understand all of the parts that go along with it can make a world of
difference.

Machine learning is going to be a part of the data science world, a part that
is going to deal with any technology that can be trained how to learn based
on the input that it is given. With a traditional program on the computer, the program is only going to do what is in the code. It is never going to look at the input that the user gives it, and it is not able to make decisions on its own. The program is going to just repeat what is in the code.

When we take a look at machine learning though, things are going to be a bit different. With this, you will find that the program is going to be able to learn. There are several ways that it is able to do this, including looking at the patterns that it can find in the data, trial and error, and even learning from all of the data or the input that it gets from the user.

The idea that is going to be found with machine learning is that it will
ensure that the program is able to learn how to read data, and then from that
information, it is going to be able to make some of its own decisions. There
are many times when you want to have the program be able to react based on how the user is interacting with it, without a programmer having to guess all of the options that are going to be presented.

For example, when we are working with speech recognition, we can see
how this process would be almost impossible if we used traditional coding
methods. But we can use some of the features that come with machine learning in order to teach the code to understand what the other person is saying.
The first definition of machine learning was coined in 1959 by Arthur Samuel. He defined machine learning in this manner: “Field of study that enables computers to learn without being explicitly programmed.”

This is one of the neat things about machine learning. The machine is able
to figure out patterns out of a large amount of data, even if the programmer
didn’t specifically tell it how to behave. This can be helpful in uses such as
speech recognition, search engines, and for companies who need to search
through large amounts of data to find patterns and make decisions about
how to act in the future.

How does machine learning work?

With that definition of machine learning in place, the next thing that we
need to take a look at is how we can get started with this process. Before we
look at all of the details that come with machine learning, it is important to
first look at the way that the human brain is going to work, and then we can
compare that to machine learning to get a better outlook on how all of this
should work.

For example, most of us know that when we see a plate that has been heated
in the microwave, or we see that there is something that we need to take out
of the oven, we know that we should not use our bare hands to touch this.
But when we think about it, how do we know that we shouldn’t touch the
hot stuff?

There are going to be two parts that come with this one. We either learned
this from experience when at some time in our past we had touched one of
these plates and knew that it hurt. Or it is possible that someone else taught
us that touching the hot plates was a bad idea and we need to avoid that. In
either case, we are going to have some kind of experience in our past that
teaches us not to touch that hot plate. We used the information from our
past to help us make some of our future decisions.

You will find that machine learning is going to work in a similar manner. At
the beginning of the program, the computer is going to have no knowledge.
It is possible that the programmer is going to come and put in some of this
information, but they will start out with no knowledge and may not know
how they are supposed to act.

To make the machine have the ability to learn, information needs to be passed over to these machines. Then, going off of this new information, the machines are able to go through patterns using a variety of techniques to help them out. Over time, the machines are going to be able to identify patterns out of all the data that they have been able to collect in order to make some good decisions.

Training data is often going to be given to the machine ahead of time. This will ensure that you are able to get the program to learn in the manner that it should. The algorithms that you put into this are going to help train the models along the way.

When will I use machine learning?

While it is possible that you haven’t been able to work with machine
learning much in the past, there are a lot of different examples of when this
kind of learning is going to be useful, and it is likely that you have used it
often in the past. Machine learning is always trying to add to its own
teaching set, and it is able to learn more than we have ever imagined, and it
is likely that this is going to grow over time. Some of the different ways that
we are able to use machine learning and the different applications that you
are able to use with it will include:

Data security. Malware is a large problem in our world of technology, and it seems like more attacks are happening on a regular basis. Kaspersky Labs has been able to find up to 325,000 new malware files each day, and that was back in 2014; it is likely that the numbers have gone up in that time frame.

Even with all of these attacks, and with them being a bit different from each
other, the company Deep Instinct tells us that each piece of malware will
have at least some code that is similar to what was there before. Only up to
ten percent of the code in a new malware piece is going to be new, which
isn’t going to be all that much. This means that you could work with
machine learning in order to figure out how to accurately recognize all of
the different types of malware and prevent them from taking over your
computer, without having to try and keep up with the hackers all of the
time.

Machine learning is also going to be able to help out with personal security.
If you have ever gone to some kind of big event, then you have spent a
good amount of time, before you even get into that event, going through a
security line. Machine learning is going to be an asset with this because it
ensures that false alarms are not as likely, and it can also help to catch a few
of the things that someone in person could miss. We could see this kind of
technology become a big deal in the airports, stadiums, and concerts that
you want to work with.

Financial trading can benefit from machine learning. There are a lot of
people who have found that it could greatly benefit them if they were able
to use learning to predict what the stock market is going to do in the future.
It is possible that some of the algorithms that come with machine learning
are able to do this with a good deal of accuracy, which makes them very
valuable with you earning money in this field. It is believed that these
algorithms if they are done in the proper manner, will be able to make some
really smart predictions, and can help you earn more money overall.

Healthcare is next on the list. Machine learning is helpful here because it is able to process the information it gets more effectively, while also spotting more patterns along the way. There has been a study that checked out how computer-assisted diagnosis was able to help spot issues compared to just having the doctor alone. This was done by having the machine learning algorithm review early mammography scans of women who then developed breast cancer at a later time.

When computers were brought into the mix, the algorithm was able to spot 52 percent of the cancers a year or more before these women were
officially given the diagnosis. In addition, machine learning was able to
understand more about the disease risk factors and as it looked at these
scans and learned more along the way, the accuracy could get higher.
Fraud detection is another area that can benefit when it comes to using
some machine learning along the way. Machine learning is getting really
good at spotting some of the potential fraud that may be happening in many
different fields, and many banks and other companies are starting to use it in order to help them do better at protecting themselves and their customers.

An example of this is PayPal. This company is working with machine learning in order to catch individuals who may be engaging in money laundering. The company actually has a number of tools that it is able to use to compare millions of different transactions, go through all of them, and then figure out the difference between the legitimate transactions and the ones that may be fraudulent and need to be stopped.
Sites that work with recommendations can benefit from machine learning as well. Many sites, including those like Netflix and Amazon, will use the recommendation feature. This helps them to look through your activity and then compare it with that of other users in the hopes of figuring out what you are most likely to binge-watch or purchase next. The recommendations, as they gather more information from users, are always getting better, so it won't take long before they match up with your needs, even if you are a new user.

An online search on Google, Yahoo, and Bing can use machine learning on
a regular basis. One of the most popular methods of machine learning is
with search engines. Each time that you put in a search, the program is
going to watch how you are responding to the results it sends. If you click
the top result and you stay on that page for a bit, it will assume that the
program got the right results and the search was successful. If you happen
to click on the second page, or you go and type in a new search and you
don’t look at any results, the program is going to learn from that mistake to
do better next time.

Alexa and other voice recognition software: When you first purchase one of
these products, you may run into a lot of issues. It may have trouble
recognizing some of your speech patterns and there may be some mistakes.
Over time, the program is going to learn from its mistakes and there will be
a lot more accuracy in the results you get from the voice recognition
program.
As you can see with the examples above, there are a lot of ways that you are
able to work with machine learning. And it is believed that what we see
now is just a part of the beginning. As we see more technologies benefit
from this over time, and the field starts to grow, it is likely that we will see
some more adaptations that come with this machine learning, and we will
be able to add more things to this list as well.

Why is machine learning seen as so important?


The ultimate goal of artificial intelligence is to make a machine work in a
similar manner as humans. However, the initial work that has been done
with this shows that we are not able to make a machine as intelligent as
humans. This is because humans are constantly learning in an environment
that is evolving all of the time, while this isn’t really possible for a machine.
Therefore, the best way to make a machine intelligent is to make them learn
on their own. What this means is that machine learning is basically the
discipline of science that works to teach machines how to learn on their
own from data.

The idea that comes with machine learning is that instead of going through
and coding all of that logic and data into the program, the data is going to
be fed into the machine, and then the machine is able to learn from this
data, simply by finding the patterns that are there. What is interesting here
is that these machine learning techniques can be faster than humans at
finding these patterns.

The techniques that are used in machine learning have been around for
some time. But because, until recently, there has been a lack of hardware
that is high performance enough, these techniques were not used to help
solve problems in the real world. But now, we have a lot of the complex
hardware that is needed for machine learning, and because of the huge
amount of data readily available, these techniques are coming back and
have been used successfully to help develop machines that are intelligent.
As you will see as we go through this guidebook a bit more, there are
actually three different types of machine learning that you are able to work
with. They are all going to do some pretty amazing things when it comes to
coding, especially when you decide to work with Python coding in them.
But they do the learning in a different manner.
In this guidebook, we are going to focus mainly on the two main types of
machine learning known as supervised machine learning and unsupervised
machine learning. There is also reinforcement machine learning which is
where the computer is going to learn more based on the idea of trial and
error along the way.
Chapter 2: What is Supervised Machine Learning?
In the previous chapter, we spent some time taking a look at machine
learning and what it is all about. Now it is time to look at some of the
different types of machine learning that are out there. Both the supervised
learning and the unsupervised learning are going to be important when it
comes to working with machine learning, but as we look through this chapter and the next one, we are going to see that they are going to take different angles when it comes to solving coding problems.

The first of the machine learning algorithm types that we are going to need
to talk about is the idea of supervised machine learning. Supervised
machine learning is going to be the type of learning that your code can do
when you choose an algorithm that will help it to learn the right response,
based on the inputs that the user gives to it. So, as the user works with this
program a bit more, and provides different inputs or data into it, the
program will be able to learn from all of the information it is given.

There are a number of options and methods that the program is able to rely on when it comes to supervised machine learning. The first method is for the programmer to give the system targeted responses to help it learn. Or you can provide some examples to the program to help it as well. This training could include a variety of values or strings of labels to help you get the program to learn the way that you would like it to. Then you can release it to the world and it is able to learn along the way.

This is actually a pretty simple process that you are able to work with, and
we are going to spend some time in this guidebook looking at how to make
this happen for our needs. With supervised learning, you will basically take
some time to show the computer various examples of what you would like
it to learn, showing it answers that are considered correct so that it can learn
from this.

A good way to look at this is with a teacher in a classroom. Any time that a
teacher would like to show a new topic to their group of students, they
could choose to use the option of showing their students examples of the
specific situation they are learning. The students would be able to learn
from those examples, figuring out what the correct answer is. Of course, the teacher isn't going to need to go through every example of this, but it at least gives the students a guide that they are able to go from.

At different times, when the students are tested on these examples, or they
see them in another setting, they should have the guidance on how to
respond to them. If they see that there is something that comes up that
doesn’t really match the examples their teacher showed them, then the
students know how they are able to respond to this as well.
Any machine that you use that has this kind of learning algorithm
programmed on it is going to be able to react in a manner similar to how the
students did before. You will start out with this process of showing the
program a variety of different examples and telling it how you would like
the program to act based on these examples. Then, later on, when the program is on its own and sees an example that is similar to what it was shown, it knows how it is going to be able to react. If it sees something that doesn't match up with the examples that it was shown, then it is likely able to react properly as well.

When it comes to working with supervised machine learning, there are actually a few different learning algorithms that are going to fit in here. This gives you a lot of power when it is time to work with this kind of learning method. You have to learn a bit more about how these work (and we will discuss them in more detail as we go through this guidebook), so that you know which one you are going to pull out and use for your needs. Some of the most common types of supervised machine learning that we are going to explore in this guidebook will include:

1. Regression algorithms
2. KNN
3. Decision trees
4. Random forests
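
To make the idea of "showing the program examples with the right answers" a little more concrete, here is a minimal sketch using the Scikit Learn library that this book relies on later. The tiny data set, the feature values, and the choice of a K-nearest neighbors classifier are all just illustrative assumptions, not part of the examples used later in this guidebook:

from sklearn.neighbors import KNeighborsClassifier

# Each example is [age, BMI], and each label says whether that person is diabetic (1) or not (0).
# These numbers are invented purely for illustration.
examples = [[25, 22.0], [47, 31.5], [52, 33.1], [33, 24.8], [60, 35.2], [29, 21.4]]
labels = [0, 1, 1, 0, 1, 0]

# "Training" is simply showing the algorithm the examples together with the correct answers.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(examples, labels)

# Later, the model can answer for an example it has never seen before.
print(model.predict([[45, 30.0]]))

The point here is not the specific algorithm; it is that the correct answers (the labels) are supplied up front, which is exactly what makes this kind of learning "supervised."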

There are a lot of times when you are working in this field of machine
learning when supervised options are going to be used. This is a common
type of machine learning, and it has a lot of value to help you write the
codes that you would like. We will take some more time as we progress
through this guidebook to see how this kind of learning is going to work,
based on the different algorithms that we spend some time on.

With that said, having a good idea of how supervised machine learning is supposed to work is going to make a big difference as well. You can see a lot of scenarios where you would like to have this come together in some of your code. And when you add it together with some of your data and have it sort through that data for you, there are no limits on what you are able to do.
Chapter 3: What is Unsupervised Machine
Learning?
Now that we have had a chance to talk about the benefits and the types of
supervised machine learning that you are able to work with when you get
into this part of the data science field, it is time to move on to the second
type of machine learning that you are most likely to use in your work.

There are going to be many times when you will need to pull out some of
the algorithms that are needed with supervised machine learning. And then
there are going to be times when this kind of learning is not quite right for
your needs, and it is time to bring out the algorithms that go with
unsupervised machine learning.

As a review, remember that when you are working with the supervised
machine learning, you are going to show the computer some examples of
how it should behave, and then the computer is going to learn how you
would like it to respond to different scenarios. There are a lot of different
types of programs that are going to need to use this, and learning how to use it can make a big difference in the results that you get.

But you can already think of a few times when this is not going to work out
that well for you. Maybe you are thinking about all of the thousands or
more examples that you would have to show the computer in order to get
this kind of method to work. Sure, you could take the time to do this, but it
can take a lot of time and get really tedious in the process. Plus, there are some kinds of programs where just showing it examples is not going to work well, or it won't be that practical for the program that you need to write.

When you reach into some of these issues, it is likely that you need to look
for a different type of algorithm, and this could be unsupervised machine
learning. This is the next method of machine learning that you are going to
look at. Unsupervised machine learning is going to be the type that you
would use to allow the program to learn on its own, rather than showing it
all of the examples and teaching it. If supervised learning is like learning in
the classroom, then unsupervised learning is going to be independent
learning.
The program is going to be able to learn on its own based on any
information that the user is going to give to it when you use unsupervised
machine learning. It is possible that it isn’t going to give the best answer
that you would like, and it is going to make mistakes on occasion. But if
you set up this kind of algorithm in the proper manner, it is going to be able
to learn from these mistakes. Basically, when you bring out the algorithms
that you would use with this kind of learning, it basically means that the
program is going to be able to figure out and analyze any of the data
patterns that it sees and make good predictions on that information based on
the input that the user decides to give to it.

Just like what we saw when we were working with supervised learning,
there are a few options that the coder is able to use with algorithms. No
matter which of the algorithms you need for the coding that you do, it is
still possible for that algorithm to take the data and then do some
restructuring of it so that the data falls into classes.

When this restructuring is done, it is easier for the coder, or anyone else, to look through all of the data and determine what is there and what is the most important. You are going to enjoy working with this kind of machine learning because it is going to be set up so that the computer is going to be able to do the work of learning for you, rather than you writing out all of the instructions and having to do the work of teaching the computer yourself.

One example of how this kind of machine learning can be used is when a company would like to read through a ton of data in order to make predictions about the information that it sees. It can also be a way that you can get your search engine to work while still providing you with results that are as accurate and valuable as possible.

When you are ready to work with unsupervised machine learning, you will
get the benefit of working with a few different algorithms to help you get it
all done. The most common algorithms that you need to learn any time you want to work with this kind of learning include:

1. Clustering algorithms
2. Neural networks
3. Markov algorithm
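
As a quick illustration of the difference, here is a minimal clustering sketch with Scikit Learn; note that no labels are handed to the algorithm at all, and the points below are invented just for this example:

import numpy as np
from sklearn.cluster import KMeans

# Invented two-dimensional points that happen to fall into two loose groups.
points = np.array([[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],
                   [8.0, 8.3], [7.9, 8.1], [8.2, 7.8]])

# Ask the algorithm to split the data into two clusters on its own.
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(points)

# Each point gets assigned to one of the clusters the algorithm discovered by itself.
print(kmeans.labels_)
print(kmeans.cluster_centers_)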

What is reinforcement learning?

Before we move on to adding some Python into this and mixing up some of the different things that we are able to do with both, we need to take a moment
to learn about the third type of machine learning, known as reinforcement
machine learning.

When we are looking at reinforcement machine learning, you may notice that it is going to be a bit different compared to some of the others that we have talked about. It shows a bit of similarity when it is compared to unsupervised machine learning, but the algorithms are a bit different, and we are working with the idea of trial and error rather than teaching it.

Whenever you want to work with reinforcement machine learning, you are using a method that relies more on trial and error than the other two. This method can be similar to working with a small child. When the child does
something that you do not approve of, you will tell them that they did it
wrong or you will ask them to stop. If they do an action that you approve
of, you can do another action to tell them that you approve, such as praising
them or giving them positive reinforcement.

Through a lot of trial and error to see how you respond to them, the child is
going to learn what you see as acceptable behavior or not. With the right
type of reinforcement each time, the child will strive to do what you want
from them each time. This is similar to how the reinforcement machine
learning works. The program will learn, based on trial and error, how you
want it to behave in each situation.

To keep it simple, this is what reinforcement machine learning is going to be like. It works on the idea of trial and error and it requires that the application uses an algorithm that helps it to make decisions. It is a good one to go with any time that you are working with an algorithm that should make these decisions without many mistakes and with a good outcome. Of course, it is going to take some time for your program to learn what it should do. But you can add this into the specific code that you are writing so that your computer program learns how you want it to behave.
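
To make the trial-and-error idea a little more concrete, here is a toy sketch in plain Python. It is not a full reinforcement learning algorithm, just an illustration of the reward-driven loop; the two actions and their hidden success rates are completely made up:

import random

reward_chance = {"action_a": 0.2, "action_b": 0.8}   # hidden from the learner
value_estimate = {"action_a": 0.0, "action_b": 0.0}  # what the learner believes so far
counts = {"action_a": 0, "action_b": 0}

for step in range(1000):
    # Mostly pick the action that has worked best so far, but sometimes try something else.
    if random.random() < 0.1:
        action = random.choice(list(value_estimate))
    else:
        action = max(value_estimate, key=value_estimate.get)

    # The environment responds with a reward of 1 on success and 0 otherwise.
    reward = 1 if random.random() < reward_chance[action] else 0

    # Update the running average for the chosen action, learning from the trial and its outcome.
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)   # after enough trials, action_b looks clearly better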
Chapter 4: Setting Up the Environment with
Python
Now that we know a bit more about machine learning, it is time to take a bit
of time to look at the Python code. There are a lot of reasons why you will
love working with the Python code. It is easy to use, easy to learn, has a lot
of great frameworks and libraries to work with (and we will discuss at least
a few of these as we go through this guidebook), and is still powerful
enough to make machine learning easy for you.

While it is possible to work with other coding languages to help you get the results that you want, most people prefer to work with Python due to all of the benefits that we have discussed. Before we take a look at how to set
up the Python environment so you are able to use it properly, let’s take a
look at a few of the different parts that come with the Python language, so
you understand how a few of these codes can work for you.

The parts you should know about the Python code

First, we need to take a look at the important keywords in the Python language. Like what you will find in other coding languages, there is a list of keywords in Python that are meant to tell your text editor what to do. These keywords are reserved and you should only use them for their intended purposes if you want to be able to avoid issues with your code writing. They are basically the commands that will tell your compiler how to behave, and they remain reserved so that you can execute the code without a lot of issues in the process.
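
If you want to see exactly which words are reserved in the version of Python you have installed, the standard library ships with a keyword module that lists them; a quick way to print them is:

import keyword

print(keyword.kwlist)        # every reserved word, such as if, for, def, and return
print(len(keyword.kwlist))   # how many reserved words your Python version has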

Variables are important because they will save up spots on your computer’s
memory in order to hold onto parts of your code. When you go through the
process of creating a new variable, you are making sure that you are
reserving some space on your computer for this. In some cases, such as
when you are working with data types, your interpreter will do the work of
deciding where this information should be stored because that speeds up the
process.

When it comes to working with variables, your job will be to make sure that
the right variables are lining up with the right values. This will ensure that
the right parts show up in your code at the right time. The good news is that
you are able to give the variable the value that you would like, but do check
that it actually works inside of your code. When you are ready to assign a
new value to a variable, you can use the equal sign to make this happen.
Let's look at a good example of how this would work:

#!/usr/bin/python

counter = 100          # An integer assignment
miles = 1000.0         # A floating point number
name = "John"          # A string

print(counter)
print(miles)
print(name)

With the above example, you will get the results to show up with the value that you placed in each variable. So, when this runs, it will show you 100 for the counter, 1000.0 for the miles, and John as the result of the name.

Next are the Python comments. These are helpful for leaving a little note in your code and can make a difference in how others are able to look through the code and know what the different parts are supposed to do. Working with comments is relatively easy in Python. You simply need to add the # sign in front of any comment you would like to write. The compiler will know to avoid that part of the code and will skip over it, without any interruption in the program.

One thing to note is how many comments you write. While you can technically write out as many of these as you would like or as you think the code needs, try to keep only the ones that are absolutely necessary. You do not want to write in so many comments that it is hard to read the rest of the code. Just write in comments when they are needed, not all of the time.
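
Here is a tiny example of what this looks like in practice; the values are arbitrary:

# This whole line is a comment, so the interpreter skips it entirely.
total = 5 + 3  # A comment can also sit at the end of a line of code.
print(total)   # Prints 8; the comments have no effect on the program.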

Python statements are a simple part of the code that can make a big
difference, so we are going to take some time to explore them real quick
here. Statements are going to be the part that the compiler is going to
execute for you. You can write out any kind of statement that you want, but
make sure they are in the right place, and that you are not using any of the
keywords with them either, or the compiler will get confused.

And the next thing that you need to take a look at here is the functions. 
Functions can be another part of your language that you need to learn about.
This is basically a part of the code that can be reused and it can help to
finish off one of your actions in the code. Basically, these functions are
often really effective at writing out your code without having a lot of
wasted space in the code. There are a lot of functions that you can use in
Python, and this can be a great benefit to the programmer.

When you are working on your code, you will need to make sure that you
are defining the function. Once you have been able to define the function
and it is considered finalized, it is time to execute it to work in the code.
You have the choice to call it up from the Python prompt or you can call it
up from another function. Below is an example of how you are able to do
this:

#!/usr/bin/python

# Function definition is here
def printme(text):
    "This prints a passed string into this function"
    print(text)
    return

# Now you can call the printme function
printme("I'm first call to user defined function!")
printme("Again second call to the same function")

When this is placed into your compiler, you will be able to see the two
statements that we wrote out inside of the code come up like a message.
This is just one example of how you would call up a function and you can
always change up the statements that are inside the code and figure out how
you want them to execute later.
These are just a few of the basics that come with the Python code. We will
take a closer look at doing these a bit more as we move through this
guidebook, but these will help you get the basics of the Python language
and how you can use it for your advantage in machine learning.

Getting that environment set up


Now that we have had a chance to look at machine learning, some of the ways that you can benefit from machine learning, and
some of the different types of machine learning that you are able to work
with, it is time to introduce some Python into this. Python is a great coding
language that you are able to work with, no matter what your skill level is
when coding. And when it is combined with some of the ideas that come
with machine learning, you are going to be able to see even better results in
the long run as well.

That is why we are going to spend some time looking at how you can set up
your own environment when working with the Python code. This will help
you to make sure that Python is set up on your computer in the proper
manner, and will make it easier to work with some of the codes that we will
talk about later on.

You will find along the way that the Python code is going to be a really easy
one to learn compared to some of the others that are out there, and it is often
one that is recommended for beginners to learn because it is simple. But
this isn’t meant to fool you! Just because you see that it is simple to work
with doesn’t mean that you won’t be able to find the strength and the power
that you need with this one. There are a lot of different parts that you can
learn about the code, but first, we are going to make sure that the environment for Python is set up in the right way to help us work with machine learning.

So, to help us get this done, we need to go to www.python.org and download the version of Python that we want to work with. Then you need to make sure that the right IDE is present for the files you are working with. This is going to be the environment that has to be there and will ensure that you are able to write out the codes that you want to work with. The IDE is also going to include the Python installation itself, the debugging tools you need, and the editors.
For this specific section of machine learning, we are going to focus on the Anaconda IDE. This is an easy IDE to install and it is going to have some of the development tools that you need. It also comes with its own command line utility, which is going to be great for installing any of the third-party software that you need with it. And when you work with this IDE, you won't have to worry about doing a separate installation of the Python environment on its own.

Now we are on to the part of downloading this IDE. There are going to be
some steps that you will need to complete to make this happen. We are
going to keep these steps as simple as possible, and we are going to look at
what we need to do to install this Anaconda IDE for a Windows computer.
But you will find that the steps that come with installing this on a Mac
computer or a Linux computer are going to be similar to this as well. Some
of the steps that you need to use in order to help you download this kind of
IDE to your computer include:

1. To start, go to https://www.anaconda.com/download/
2. From here, you will be sent to the homepage. You will need to
select Python 3.6 because it is the newest version. Click on
Download to get the executable file. This takes a few minutes to
download but will depend on how fast your internet is.
3. Once the executable file is downloaded, you can go over to its
download folder and run the executable. When you run this file,
you should see the installation wizard come up. Click on the
“next” button.
4. Then the License Agreement dialogue box is going to appear.
Take a minute to read this before clicking the “I Agree” button.
5. From your “Select Installation Type” box, check the “Just Me”
radio button and then “next”.
6. You will want to choose which installation directory you want to
use before moving on. You should make sure that you have
about 3 GB of free space on the installation directory.
7. Now you will be at the “Advanced Installation Options”
dialogue box. You will want to select “Register Anaconda
as my default Python 3.6” and then click on Install.
8. And then your program will go through a few more steps and the
IDE will be installed on your computer.

As you can see with all of this, setting up the Python environment that you
would like to work with is going to be simple. You just need to go through
these steps to get the Anaconda IDE set up properly, and then you are able
to use it for all of the codes that we will discuss in this guidebook, along
with some of the other codes that you will want to write along the way.

Remember that there are some other options that you can work with when it
is time to pick out an IDE that you would like to work with. If you are
doing some other work than machine learning, or you like some of the
features and more that comes with another IDE, you are able to download
these IDEs to make it work with them as well. But we are going to spend
time working with the Anaconda IDE because it is going to have all of the
features that we need to get the machine learning algorithms working the
way that you would like.
Chapter 5: Data Preprocessing with Machine
Learning
The first thing that we are going to take a look at here to help us get started
with some of the things that we can do with machine learning is data
preprocessing. You need to make sure that any data you are working with is in the right format before it is able to work with machine learning at all.
Converting data to the right format so that you can do some work with
machine learning is going to be known as data preprocessing.

Now, it is going to depend on the type of data you would like to work with,
but there are going to be a few steps that need to happen in order to help
you see the preprocessing work well. These steps are important to work
with because they will ensure the data comes in the right form to be used.
The steps that you should focus on for this part will include:

Getting the data set
Importing your libraries
Importing the dataset
Handling any of the missing values
Handling the categorical data
Dividing the data into training sets and test sets
Scaling the data

Let’s take a closer look at each of these and how they can work together to
prepare your data the proper way.

Getting the data set

To start, all of the data sets that we are going to use in this guidebook can be
found at the following link to make things easier:

https://drive.google.com/file/d/1TB0tMuLvuL0Ad1dzHRyxBX9cseOhUs_
4/view

When you get to this link, you can download the “rar” file and then copy the “Datasets” folder to your D drive. All of the algorithms that we will use in this book will access the datasets from the “D:/Datasets” folder. The dataset that you need for this chapter to help with learning more about preprocessing is known as patients.csv.

Now, we need to take a bit of time to look over this set of data. When we do
that, we start to notice that there is a lot of information about the age,
gender, and the BMI of the patients at hand. This set of data is also going to
come with an extra column that will show us whether or not the patients we are looking over are diabetic. You should also notice that the age and
the BMI columns are going to be numerical, while the other two columns
are going to be categorical in nature instead.

Another distinction that we need to make here is going to be between the dependent and the independent variables. When we are looking these over, the variable whose value is being predicted comes in as the dependent variable, and the variables that are used to make these predictions will become our independent variables. When we look at this kind of example, the diabetic column is going to be seen as the dependent variable because it will rely, at least a bit, on the other columns there. Then we will notice that the other columns are going to be our independent options.

Importing the libraries that you need

Now that you have some of the data that you need to get started on this, you need to work on importing the libraries. Python automatically comes with
some prebuilt libraries that are able to perform a variety of tasks. For the
purpose of this book, we are going to use the Scikit Learn Library. To keep
it simple, we are going to install three libraries that are the most essential
for helping us do machine learning. These include numpy,
matplotlib.pyplot, and pandas.

First off is the numpy library. This is a good one to download because it can
help you get some of those mathematical functions that are more complex
done. Since machine learning does spend time in the mathematical field, it
is worth your time to make sure that this particular library is in place.

The next library that you will want to take a look at is going to be the
matplotlib.pyplot library. This is another good one that can be used often
because it helps with any of the charts that you will want to work with.
Often, when you are looking through some of the data that is needed to
work with machine learning, it is helpful to have this library present. It at
least lets you have a good idea of the data points and where they fall on the
charts.

And we need to take some time to download the Pandas library. This one is
important as well, and you will find that it is really easy to use. Many times
a programmer is going to be able to use it in order to import and view any
of the sets of data that they want to use with machine learning.

If you want to import these libraries, you will either need to come up with a
new notebook for Python in Jupyter, or you are able to use Spyder in order
to open up a brand new file. The codes that we are going to focus on in this part of the guidebook are going to be usable with Spyder. To help you import these libraries, you need to work with the code below:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the data that you need

Once you have been able to download the three libraries that we have just
talked about, it is now time to look at the steps that are needed in order to
import the set of data that you would like to use in your chosen application.
This is also going to give us a better look at why the pandas library is so
important to use in this part of the process.

The data set that we are going to use will show up in the Comma Separated Values, or CSV, format. The pandas library has inside of it the read_csv function, which is going to take the path to the CSV-formatted set of data as a parameter and then will load up the set of data into a pandas data frame, which is basically going to be an object that stores the set of data in the form of rows and columns. To help us get this to happen, we need to use the code below:

 
patient_data = pd.read_csv("D:/Datasets/patients.csv")
The script above is going to help you load up the patients.csv data set from the Datasets folder where you saved it. If you are using the Jupyter notebook, this is even easier to do. You would just use the following script to help you see what the data looks like:

patient_data.head()

But, if you are working with the Spyder program, you would go over to
your Variable explorer and then double click on patient_data variable from
the list of variables that show up. Once you click on the right variable, you
will be able to see all the details of this dataset.

When we reach this point in the process, you should then be able to see that the data frame that comes with pandas is going to look very similar to a matrix with a zero-based index. Once you are able to load all of this up, the next step is to divide the set of data into a matrix of features and a vector of dependent variables. The feature set is going to hold all of the variables that you have that are considered independent.

For example, the feature matrix for patients.csv data set is going to include
the information that you place for age, BMI, and gender of the patient. In
addition, the size that you see with your feature matrix is going to be equal
to the number of independent variables by the number of records. For this
example, our matrix is going to end up being three by twelve. This is
because we have three independent variables and twelve patients with
records we are going to look at.

To help us get started with this one, we need to first go through and create the feature matrix. You are able to provide it with any name that you would like, but the traditional method of doing this is going to be with the capital X. To
help us have a better chance at reading the code, we will give it the name of
features, and then use the script below to make it work the way that we
want.

features = patient_data.iloc[:, 0:3].values

With the script that we used above, the iloc function of your data frame is used to help select all the rows as well as the first three columns from the patient_data data frame. This iloc function is going to take on two
parameters. The first is the range of rows to select, and then the second part
is going to be the range of columns you want the program to select.
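
As a quick sanity check, you can print the shape of the resulting array. Note that NumPy reports it as rows by columns, so for the twelve-patient data set described above you would expect to see (12, 3):

print(features.shape)   # expected (12, 3): twelve patient records, three independent variables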

If you would like to create your own label vector from here, you would use
the following script to get it done:

labels = patient_data.iloc[:, 3].values

Handling any missing values that show up

If you stop now and look at the object that is labeled patient_data, you will notice that there is a record that is missing at index 4 for the BMI column on that patient. To help us be able to handle it when values are missing, the easiest approach would be to find a way to remove the record that is missing a value. However, this record could also have some crucial
information in it that we need to focus on, so you will sometimes want to
keep it there.

Another option that you can use to help deal with some of these missing
values is to add in some kind of value or character that will be able to
replace the value that is missing. Often we find that the best choice to
replace that value with is either the mean or the median of all the other
values that are found in that column. To help you handle the values that are
missing when they come up, you just need to use the Imputer class that is found in the sklearn.preprocessing library. The script that you will need to make this happen is below:

from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
imputer = imputer.fit(features[:, 1:2])
features[:, 1:2] = imputer.transform(features[:, 1:2])

With the script that we wrote above, the first line is in charge of importing the Imputer class from the right library. We then went on to create an object of the Imputer class. This is going to take on three parameters: missing_values, strategy, and axis. In terms of the missing_values parameter, we are specifying the value that we want to be replaced. In our data set, the missing value is being shown by “NaN”. The strategy parameter is going to specify the type of strategy that we want to use in order to fill in this missing value. You can choose from mean, median, and most_frequent. And then the axis parameter is going to denote which axis we want to impute here.

Handling any of our categorical data

Right now we know that the algorithms that we may use with machine
learning are going to be based on concepts that are mathematical in many
cases, and to work with these mathematics, we need to be able to work with
numbers. Because of this kind of issue, it is going to be easier and more convenient for us to take the values that are categorical and work on converting them to be numerical. When we look at the example that we are
doing here, we see that there are two values that are seen as categorical and
those are the gender and diabetic options. How are we able to turn Male and
Female, and Yes and No into numerical values to make things easier?

The good news is that within the sklearn.preprocessing library there is the LabelEncoder class, which is going to take your categorical column and then give you the right numerical output to make sense out of it. The script that you can use for this one includes:

from sklearn.preprocessing import LabelEncoder

labelencoder_features = LabelEncoder()
features[:, 2] = labelencoder_features.fit_transform(features[:, 2])

Just like what we did with the Imputer class, the LabelEncoder class is going to have a fit_transform method, which is just a combination of the
transform and the fit methods. The class is going to be able to take the
categorical column that you have as the input and then will return the right
numeric values to help you out.

In addition, you can always take the labels vector and then convert it to a
set of numeric values as follows:

labels = labelencoder_features.fit_transform(labels)

Your training sets vs. your testing sets


The next thing that we need to take a look at is the training sets and the test sets. These are going to be two very important things when it comes to the work that you are doing in machine learning. Being able to separate out the values that you have in a smart manner is going to make the biggest difference overall.

Earlier on, when we first introduced the idea of machine learning, we discussed that some of the models that we are looking at are going to be trained on one set of data, and then tested on a different one. This splitting up of the test and the training set is going to be done to make sure that any algorithm that you use for machine learning doesn't end up overfitting and trying too hard, messing up the data that you get. When we talk about the
idea of overfitting, we are looking at the tendency that machine learning is
going to do well with the results it gives on the training data, but then it
ends up giving poor results when you get to the test data.

A good model that you can rely on when you work in machine learning is
one that is able to give you some good results with both of these, with the
training data and the test data. With this, we are able to say that the model we picked has correctly learned all of the underlying assumptions from our set of data, and that we are able to accurately use it to make the decisions that we need out of any new set of data that we decide to use.

To get a better idea of how this is going to work, let's look at the script that we use below. This script is going to help us divide up the data into a 75 percent train size, and the rest is going to be the test size.

from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=0)

When you execute the script above, you are going to see that the train_features variable contains 9 of the feature records (because this is 75 percent of 12), while train_labels contains the labels that correspond to them. Meanwhile, test_features is going to have the remaining three records and test_labels will have the corresponding labels.
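
If you want to confirm the split came out as described, you can print the shapes of the four arrays; with the twelve-record patients data set the expected output would be:

print(train_features.shape, test_features.shape)   # expected: (9, 3) (3, 3)
print(train_labels.shape, test_labels.shape)       # expected: (9,) (3,)
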
Scaling the data we have

Now we are to the final thing that we need to look at when we get to this
part of the process. We need to look at the steps that are needed in order to
scale any of the data we have when we place it into our machine learning
algorithm. It is important to know all about scaling because some sets of data are going to show a big difference between the

For example, if we decided to add in a new column for the number of red
blood cells of patients in this, then it is likely that the numbers are going to
be hundreds of thousands (unless the patient is really sick), while it is not
likely the age column would even get to 100. Many of the models that you can use with machine learning are going to use what is known as the Euclidean distance to measure the distance between the points of data you are looking at.
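
To see why this matters, here is a small hand-made illustration (the two patients and their numbers are invented): before scaling, the red blood cell column completely drowns out the age column in the Euclidean distance, and after standardizing each column the two features contribute on a comparable scale.

import numpy as np

# Two invented patients: [age, red blood cell count]
patient_a = np.array([25.0, 450000.0])
patient_b = np.array([70.0, 455000.0])

# The raw Euclidean distance is dominated almost entirely by the cell-count column.
print(np.linalg.norm(patient_a - patient_b))   # roughly 5000; the 45-year age gap barely registers

# Standardize each column (subtract its mean, divide by its standard deviation),
# which is what StandardScaler does below, and both columns now matter equally.
data = np.vstack([patient_a, patient_b])
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.linalg.norm(scaled[0] - scaled[1]))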

The good news is that the sklearn.preprocessing library is going to contain the class known as StandardScaler that you can use in order to implement the standardization of features. Like the other preprocessing classes, it will also contain the fit_transform method that we talked about before, and it will take a data set that you provide it as the input, and then output a scaled data set.
The following script will make this happen for you.

from sklearn.preprocessing import StandardScaler


feature_scaler = StandardScaler()
train_features = feature_scaler.fit_transform(train_features)
test_features = feature_scaler.transform(test_features)

One thing to note is that there isn't really a need for you to scale labels on any of your classification problems. For regression problems, we will take a look at how to scale labels in the regression section.

And that is all there is to it. You need to go through these steps to make
sure that you are preprocessing the information and that it is ready for the
work ahead. Your algorithms are going to be more accurate and will work out
much better when you add these preprocessing steps to the mix, because the
data will be prepared in the manner that it should be.
Chapter 6: Linear Regression with Python
Linear regression when we just have one variable

The first part of linear regression that we are going to focus on is when we
just have one variable. This is going to make things a bit easier to work
with and will ensure that we can get some of the basics down before we try
some of the things that are a bit harder. We are going to focus on problems
that have just one independent and one dependent variable on them.

To help us get started with this one, we are going to use the car_price.csv
data set so that we can learn what the price of a car is going to be. We will
have the price of the car be our dependent variable, and the year of the car
is going to be the independent variable. You are able to find this
information in the Datasets folder that we talked about before. To help us
make a good prediction on the price of the cars, we will use the Scikit Learn
library from Python to get the right algorithm for linear regression. When we
have all of this set up, we need to follow the steps below.

Importing the right libraries

First, we need to make sure that we have the right libraries to get this going.
The codes that you need to get the libraries for this section include:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

You can implement this script in the Jupyter notebook. The final line needs
to be there if you are using the Jupyter notebook, but if you are using
Spyder, you can remove it because Spyder will handle this part without your
help.

Importing the Dataset

Once the libraries have been imported using the code above, the next step is
going to be importing the data set that you want to use for this training
algorithm. We are going to work with the “car_price.csv” dataset. You can
execute the following script to get the data set in the right place:

car_data = pd.read_csv(r'D:\Datasets\car_price.csv')

Analyzing the data

Before you use the data for training, it is always best practice to analyze
it for any scaling issues or missing values. First, we need to take a look at
the data. The head function returns the first five rows of the data set. You
can use the following script to make this work:

car_data.head()

In addition, the describe function can be used to return all of the
statistical details of the dataset:

car_data.describe()

Finally, let's take a look to see whether the linear regression algorithm is
actually going to be suitable for this kind of task. We are going to take the
data points and plot them on a graph. This will help us to see if there is a
relationship between the year and the price. To check this, use the following
script:

plt.scatter(car_data['Year'], car_data['Price'])
plt.title("Year vs Price")
plt.xlabel("Year")
plt.ylabel("Price")
plt.show()

The script above builds a scatter plot with the Matplotlib library. This is
useful because the scatter plot puts the year on the x-axis and the price on
the y-axis. From the output figure, we can see that as the year increases,
the price of the car goes up as well. This shows the linear relationship that
is present between the year and the price, and it is a good sign that this
kind of algorithm can be used to solve the problem.

Going back to data pre-processing

Remember that in the last chapter we looked at the steps you need to follow
to do some data preprocessing. This is done in order to divide the data into
features and labels and then into the test and training sets that we need.
Now we need to use that information and actually carry out these two tasks.
To divide the data into features and labels, you will need to use the script
below to get it started:

features = car_data.iloc[:, 0:1].values
labels = car_data.iloc[:, 1].values

Since we only have two columns here, the 0th column is going to contain the
feature set and the first column is going to contain the label. We will then
divide the data so that 20 percent goes to the test set and 80 percent to the
training set. Use the following script to get this done:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=0)

From this point, we are able to go back and look at the set of data again.
When we do, it is easy to see that there is not a huge difference between the
values of the years and the values of the prices; both of them are in the
thousands. What this means is that it is not really necessary for you to do
any scaling, because you can just use the data as it is. That saves you some
time and effort in the long run.

How to train the algorithm and get it to make some predictions

Now it is time to do a bit of training with the algorithm and make sure that
it is able to make the right predictions for you. This is where the
LinearRegression class is going to be helpful, because it takes the training
features and labels as input and fits your model to them. This is simple to
do, and you just need to work with the script below to get started:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(train_features, train_labels)

Using the same example of the car prices and the years from before, we are
going to look and see what the coefficient is for only the independent
variable. We need to use the following script to help us do that:

print(lin_reg.coef_)

The result of this process is going to be 204.815. This shows that for each
unit change in the year, the car price is going to increase by 204.815 (at
least in this example).

Once you have taken the time to train this model, the final step is to
predict on new instances. The predict method of this class is used for that:
it takes the test features you choose as input, and then it predicts the
output that corresponds to them best. The script that you are able to use to
make this happen will be the following:

predictions = lin_reg.predict(test_features)

When you use this script, you will find that it gives us a good prediction of
what we are going to see in the future. Basically, we are able to guess how
much a car will be worth based on the year it is produced, going off the
information that we have right now. There are things that can change in the
future, and the price does depend on the features that come with the car. But
this is a good way to look at the cars, get an average of what they cost each
year, and estimate how much they will cost in the future.

So, let's see how this would work. We now want to use this linear regression
to figure out how much a car is going to cost us in the year 2025. Maybe you
would like to save up for a vehicle and you want to estimate how much it is
going to cost you by the time you save that money. You would be able to take
the information that we have, add in the new year that you want the
prediction based on, and then figure out an average value for a new car in
that year.

Of course, remember that this is not going to be 100 percent accurate.
Inflation could change prices, the manufacturer may change some things up,
and more. Sometimes the price is going to be lower, and sometimes higher. But
it at least gives you a good way to predict the price of the vehicle you want
and how much it is going to cost you in the future.

This chapter spent some time looking at an example of how the linear
regression algorithm is going to work if you are just working with one
dependent and one independent variable. You can take this out and add in
more variables if you want, using the same kinds of ideas that we discussed
in this chapter as well.
Chapter 7: Decision Trees to Handle Your
Regression Problems
In the last chapter, we spent some time looking at how you would be able to
use linear regression to get some machine learning problems done. And
now that we have a better understanding of this, it is time to take a look at
decision trees and see how we are able to use these for our benefit.

You will find that when you bring up the idea of machine learning, you will
not have to look far in order to hear about decision trees. These decision
trees are going to be a very important part of the machine learning process.
Each feature in the set of data that you are working with will be treated just
like a new node in the tree. And with each node, a decision has to be made
to determine which path is the best one for you.

Each time that you use this decision tree, you are going to be in a different
kind of situation, which means you need to weigh the decisions at each node
and then decide which path is the right one for you. The process keeps going
from there, helping you to make new decisions, until you reach a leaf node
and arrive at the final decision, the one you will choose to go with.

This may seem like a lot of steps, and something that is a bit complicated to
work with at first, but you may be surprised to find out that we have
actually worked with decision trees, and some of the ideas that come with
them, for our entire lives. For example, if you have ever tried to get a loan
from a bank, it is likely they used a form of decision tree to figure out
whether or not to give you that loan.

When the bank uses this kind of decision tree, they are going to take a look
at a ton of data that they can gather on the customer to help them make the
decision. The information they may look at includes their age, gender, job
history, credit history, their salary, and more. As the bank is looking through
this information, they will be able to use it to figure out whether they want
to give the customer the loan, or if they think the customer is too much of a
risk for this.
Of course, each bank is going to be different. Some banks will turn down an
application quickly, while others are more lenient and happy to keep working
with customers. But no matter which bank you go with, they are going to sit
down and define the criteria they would like the customer to meet before they
give out the loan. These criteria become the set of rules the bank uses to
decide who is going to get a loan. Some examples of the criteria that could
be used, as shown in the sketch after this list, include:

1. The bank may decide that if the applicant is at least 25 years old,
and under 60, then they are able to move to the next step of the
decision tree. If the applicant is younger or older than these two
ages, then the loan is rejected.
2. If the applicant has been able to meet with the first criteria, then
the bank is going to check and see whether that applicant has a
salary at all. If they do have a salary and it is steady, then the
bank would move on to step three. If the applicant is jobless or
doesn’t make an income at all, then the bank is going to reject
that application.
3. If the applicant is male and has a salary, then they are going to
move on to the fourth step. If the applicant is female and has a
salary, then the bank would go to the fifth step.
4. If the applicant has a salary that is more than $35,000 a year,
then they are able to get the loan. If their income is less than this,
then they will not be accepted.
5. If the applicant during this step earns more than $45,000 a year,
they will be able to get the loan. But if their income is less than
this amount, then the loan is rejected.
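
To make these criteria a little more concrete, here is a minimal sketch of the same rules written as plain Python conditionals; the function name, field names, and thresholds simply mirror the list above and do not come from any real bank:

def loan_decision(age, has_steady_salary, gender, salary):
    # Step 1: applicant must be between 25 and 60 years old.
    if age < 25 or age > 60:
        return "rejected"
    # Step 2: applicant must have a steady salary.
    if not has_steady_salary:
        return "rejected"
    # Steps 3 to 5: different salary thresholds depending on the branch.
    if gender == "male":
        return "approved" if salary > 35000 else "rejected"
    else:
        return "approved" if salary > 45000 else "rejected"

print(loan_decision(30, True, "female", 50000))   # approved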

This is a basic way to look at the decision tree. With loan applications, and
with many other uses of decision trees, there are going to be a lot more
steps and complexity involved. There are even times when we will need to
bring in some kind of statistical technique, such as entropy, to help us
create the nodes and handle the impurity of the classes in the labeled data.

To help keep this process as simple as possible, we want to split on the
features that have the minimum amount of entropy, and that feature is set up
as the root node. This gives anyone who is using the decision tree, including
the bank from before, a starting point they can work with to help them pick
out the right applicants to give the loan to.

Are there benefits to using a decision tree?

There are a lot of times when you will want to work with these decision
trees. These are a good option because they are simple and help you to see
the steps that are needed in order to make a decision. Some of the many
benefits that you are going to notice when you decide to work with decision
tree algorithms include:

1. You will be able to use these decision trees to help out with a
few different problems. These are going to work well for
regression and classification tasks.
2. You can use these decision trees to help with the linear data and
the non-linear data that you would like to classify.
3. When you look at some of the other algorithms that you are able
to use with machine learning, you will find that decision trees are
going to be fast to train.

Implementing your own decision tree

Now that we know a bit more about these decision trees and why they are such
a great thing to use for your own decision making in data science, it is time
to learn how to work with Python in order to make one of your own decision
trees. To work with this, we are going to use an example of trying to predict
petrol consumption (measured in millions) in the United States based on a few
different features.

The features that we choose to use are going to include the ratio of the
population with a driver's license, the per capita income, the tax on the
petrol (in cents), and how many miles of paved highway there are. Let's take
some time to look through the various steps that are needed to help us build
this kind of decision tree.

The first step that we need to take a look at is to import both the libraries
and the data set that is needed. The libraries that you need to make this
happen includes:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

When you are ready to import the right data set to use here, you need to use
the following command:

petrol_data = pd.read_csv(r'D:\Datasets\petrol_data.csv')

From this point, we need to look at the data a bit and see what shows up
there. To look at the data in the proper manner, you just need to use the
code “petrol_data.head()” to get started. When you type this in, a chart with
the numbers for the categories we want will show up on your screen.

We also need to stop for a moment here and make sure that we use data
preprocessing in the proper manner. This helps us to get all of the data
organized the way that we want. To make this happen for this set of data, we
need to use the following code to get started:

features = petrol_data.iloc[:, 0:4].values
labels = petrol_data.iloc[:, 4].values

Then you can take this information and divide it up so that eighty percent
goes to training and then the other twenty percent goes to a test set. Use the
following script to get this to happen.

from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=0)

In this example, we need to take some time to do a bit of data scaling. We
have miles of highway, petrol tax in cents, and the consumption of petrol in
millions, and the values in these columns sit on very different scales.
Because of these differences, we need to scale this information so that the
features compare well to one another. The steps that you need to use in order
to scale the data are the following:

from sklearn.preprocessing import StandardScaler

feature_scaler = StandardScaler()

# Fit the scaler on the training features and apply the same scaling
# to the test features.
train_features = feature_scaler.fit_transform(train_features)
test_features = feature_scaler.transform(test_features)

Training the algorithm

Now that the features are scaled, it is time to train the algorithm that we
are going to use. Because the petrol consumption we are predicting is a
continuous value, we will work with the decision tree regressor from the
sklearn.tree library. The following script will make sure that the right
features and labels are passed on to the decision tree:

from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor()
dt_reg.fit(train_features, train_labels)

And the final thing that we need to do in order to get started with this part
of the plan is to make predictions. We have all of the information that we
need at this point, and it is time to go through and use the prediction
method. With the algorithms and codes that we have been using so far, you
are going to be able to make some predictions based on the data that you
have, helping you to get the best information to make decisions. The code
that you need to use for this will include:

predictions = dt_reg.predict(test_features)
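
If you want a rough sense of how close these predictions come to the true test labels, one simple option is the mean absolute error from scikit-learn's metrics module; this is just a quick sanity check, assuming the predictions and test_labels variables from the scripts above:

from sklearn.metrics import mean_absolute_error

# Compare the predicted consumption values with the held-out test labels.
print(mean_absolute_error(test_labels, predictions))
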
At this point, a decision tree, with the help of Python, is going to be created
for you. This is going to ensure that you will be able to see which prediction
is right for you based on the information that you put in. Businesses and
other companies are often going to use this to help them pick the
information and the choices that are right for them. This method is faster
and more accurate than what a single person is able to do, which is why
they are so valuable.
Chapter 8: Random Forest Algorithms That
Work with Python
Now that we have gotten a good idea of what it is like to make our own
decision trees, it is time to move on to the next idea of random forests. A
single decision tree is sometimes going to have a bias based on the type of
data that you feed into it. A better approach, to make sure that this bias
doesn't lead you to the wrong decisions, is to average a lot of different
predictions to give you the ultimate prediction at the end. And this is what
we are able to work with when we talk about random forests.

Random forest algorithms are helpful because we are able to combine a few
different decision trees to make our own forest. This is how we end up with a
random forest. Similar to the decision trees from before (except with more
than one of them), these random forests can be used to predict a continuous
value, which is regression, as well as discrete values, which is
classification. With that in mind, let's stop here and look a bit more at
random forests and how we are able to make them work.

The algorithms that we are going to use to make our own random forests
are going to be able to help us with a few different things. Some of the steps
that we are able to perform when it comes to working with random forests
will include the following:

1. These random forests are going to help us with choosing the K
random data points from your set of data.
2. The random forest is going to help you when it is time to create a
decision tree regression or another kind of classification
algorithm that can be based on the k data points you have.
3. You can use these random forests to help you figure out how
many trees need to show up in the group, and then you can go
through the first few steps on every tree to help you here.
4. If you are working on a problem that has regression in it, each of
the trees will be able to go through and predict a continuous
value. The output that you get in the end is going to be
calculated when you take the values and find the means of all the
predictions that the trees come up with.
5. If you are working with a problem that is classification, each tree
is going to be responsible when it is time to predict a value that
is more discrete.
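
As a rough illustration of these steps, here is a minimal sketch that fits a random forest for a regression problem with Scikit Learn; it assumes training and test splits like the ones we built in the last chapter, and the number of trees is just an example:

from sklearn.ensemble import RandomForestRegressor

# Build a forest of 100 trees and average their individual predictions.
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
rf_reg.fit(train_features, train_labels)
predictions = rf_reg.predict(test_features)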

There are a number of benefits that come with working on these random
forest algorithms. But what are these benefits? The different benefits that
you are able to see when a company or a programmer decides to work with
the random forest algorithm include:

1. Compared to a lot of the other options available in machine
learning, the random forest algorithm is going to be stable, and
it can scale well. Since there are so many different trees in this
kind of forest, adding in or taking away a data point or a set of
data is only going to impact a few trees, rather than all of them.
This ensures that there is enough stability in it that you can
trust the results.
2. The random forest algorithm is going to work out really well
with your data, no matter if you are working with values that are
numerical or categorical in nature.
3. When you bring out these random forest algorithms, you do not
need to work with scaling as much as we did in the past. This is
due to the fact that this algorithm isn’t really going to concern
itself on the distance between your points of data, so the scaling
is not as important.

Of course, even though there are a number of benefits that come with
random forest algorithms, there are a few negatives, which is why you may
not choose to use them all of the time. Some of the drawbacks of working
with one of these algorithms when you want to do machine learning will
include:

1. These random forest algorithms are going to be more complex than
other algorithms to work with. You can have as many of
these trees in your random forest as you would like, potentially
thousands or more. When all of these trees are involved with a
prediction, it is sometimes hard to look back and figure out
exactly how you came to your final prediction, which can really
take away from the trust you have in the model.
2. You will find that because these random forests are going to be
so complex, you are going to have to sacrifice some time to
make them work. If there is some reason that you are short on
time with your data, then you do not want to work with the
random forests. They take up some time to accomplish, and the
more trees that you try to add into the algorithm, the more time it
is going to take.

As you can see, the random forest algorithm has a lot of benefits and can
really help you to make some good predictions. You will be able to combine a
few different predictions together and average them out to figure out the one
that is best for you. There are some drawbacks, but if you have a lot of
information to work with and you want to come up with the best prediction out
of many different ones, then the random forest algorithm is a good option for
you.
Chapter 9: Support Vector Regression Problems
Another topic that you are able to work with when it comes to Python machine
learning is SVR, or Support Vector Regression. This is one of the types of
support vector machine (SVM) algorithms that you can work with, and it works
well whether you are trying to do a linear regression or a non-linear
regression. This type of algorithm has been around since the 1960s, and it is
one of the most famous of all the algorithms that you are able to use in
machine learning. Before neural networks became as common and well-used as
they are today, SVM was among the most accurate algorithms that you were able
to use when you wanted to work in machine learning.

In this chapter, we will take a look at this kind of algorithm and the intuition
that you are able to find with it. We will also take a look at how this kind of
algorithm is going to work. First, though, we have to take a look at the
theory that comes with the support vector machine algorithm.

The theory that comes with SVM

When we are looking at a typical linear regression in a two-dimensional
space, the job is to find a line or curve that fits as many of the data
points as possible. This sounds good in general, but remember that in the
real world there are often multiple decision boundaries that could be used to
classify the data points. And if you decide to add in some data points later
on, the decision boundary that you chose will determine how these new data
points get classified.

The whole job of this kind of algorithm is to help you find a decision
boundary that classifies your data well. You want the accuracy to be as good
as possible, with a minimal amount of misclassification in the process. The
way the SVM algorithm does this is by maximizing the distance between the
boundary and the closest data points from each of the classes in the set of
data.
When you use this kind of algorithm, you will find that these boundaries are
discovered with the help of support vectors, which is how the algorithm gets
its name. When looking at the charts, the support vectors are the lines that
pass through the closest data points of the two classes you want to work
with. The job here is to maximize the distance that occurs between these two
vectors.

The next thing that you are going to look for is a line that will run parallel
to both of these support vectors. You want to try and draw it right in the
middle of the two vectors to get the best results. This is going to be the
decision boundary so you want it to be as accurate as possible. And it is
going to give you a good idea of how your information is split up so you
can make some better decisions in the process.

This kind of algorithm is going to be used any time you end up with a ton of
data that doesn't have a single line that passes through it very well, so you
work with two support vectors to help you separate the data. You will use
these vectors to make sure that you can look at the data, separating it out
into two sets and sometimes more. You can then take a look at the categories,
find out which one works best with the information that you need, and check
how the two groups are similar and how they are different. You can then
separate this information out and make important business decisions from it.

Support Vector Regression is going to help you to make some important data
decisions. There are times when your information is not going to fit into one
group or near one line, and this kind of approach is going to make a big
difference in how you separate out the information that you need. This can
help you to really explore the data that you have, and it will help you to
make some of the comparisons and more for your own business.
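
As a small illustration, here is a minimal sketch of fitting a support vector regression model with Scikit Learn; the variable names assume a scaled train/test split like the ones from earlier chapters, and the RBF kernel is only one possible choice:

from sklearn.svm import SVR

# An RBF kernel lets the model handle non-linear relationships.
svr_reg = SVR(kernel='rbf')
svr_reg.fit(train_features, train_labels)
predictions = svr_reg.predict(test_features)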
Chapter 10: Naïve Bayes Problems and Python
Machine Learning
Now we need to move on to working with the Naïve Bayes machine
learning algorithm. We have spent some time so far in this guidebook
looking at the different regression algorithms that you are able to work
with. For this algorithm, we are going to take a look at some of the
classification problems that you are able to work with. And the first of these
algorithms that we are going to take a look at is the Naïve Bayes.

Naïve Bayes is going to be a type of supervised machine learning, like we
talked about ahead of time, and it gets its basis from the Bayes Theorem.
This algorithm is built on the idea of feature independence, which states
that the features found in a data set are assumed to have no relationship
with one another.

For example, you may find that a fruit is going to be a banana if it is five
inches long or more, has a diameter of at least a centimeter, and is yellow
in color. But when you are working with the Naïve Bayes algorithm, you are
not going to have any concerns about how these features depend on each other.
The fruit will be declared a banana based on the independent contribution of
each of these features. This is why it is called a naïve algorithm.

When it comes to working in machine learning, this kind of algorithm is going
to be seen as one of the simplest machine learning algorithms, but there is a
lot of power that comes with it. And that is why we are going to take a look
at it a bit in this chapter.

First, we need to take a look at some of the advantages that you are able to
get when it comes to using the Naïve Bayes algorithm. This will help us
learn more about this algorithm. There are a lot of different benefits that
come with this learning algorithm, and some of the ones you should
consider include:

 
1. The Naïve Bayes algorithm is really simple to work with, and
you can train employees on how to use it quickly. It isn’t going
to have a huge amount of math that goes with it, and you won’t
have to deal with backpropagation or any error correction to deal
with it either. This makes it easy to use in a lot of different data
that you want to look through.
2. You will find that the Naïve Bayes algorithm is one that can
perform the best, especially compared to a lot of the other
algorithms that you can choose to work with when it comes to
categorical data. If you are using this for some numeric features,
then this is going to be an algorithm that is going to assume a
distribution that is normal.
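
To show what this can look like in practice, here is a minimal sketch that uses Scikit Learn's GaussianNB class, which makes exactly that normal-distribution assumption for numeric features; it assumes you already have a train/test split where the labels are discrete classes:

from sklearn.naive_bayes import GaussianNB

# GaussianNB assumes each numeric feature follows a normal distribution
# within each class.
nb_model = GaussianNB()
nb_model.fit(train_features, train_labels)
predictions = nb_model.predict(test_features)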

While many companies like working with the Naïve Bayes algorithm because it
is such a simple one to use and easy to understand, there are a few
disadvantages to this kind of algorithm, and there are times when it is
better to work with something different. It works in many situations and can
make things easier for you, but there are times when another approach will
serve you better.

To start, when you are working with data from the real world, you will find
that the features you use are often dependent on the other features that are
there. The assumption of independence that comes with this algorithm can
therefore lead to poor predictions on data whose features are not actually
independent. You have to take this into consideration when you decide to work
with it.

Another negative can come up when you work with categorical features. There
could be a value in the test set that is not found in your training set. When
this happens, the Naïve Bayes algorithm will automatically assign a
probability of zero to that instance. What this means is that the programmer,
or whoever handles this, is going to need to go through and do some
cross-validation of the results to make sure the algorithm works properly.

Now that we understand some of the benefits and negatives that come with
this algorithm, it is time to take a look at all of the different ways that you
are able to use this algorithm. Sure, there are going to be times when you
may need to check the results and when you may not get the results that you
would like. But it does work well and can provide you with the answers that
you need. A few of the different ways that you are going to be able to
benefit when you use this kind of algorithm will include:

1. The Naïve Bayes algorithm is often going to be used when it comes
to problems that are multi-class and ones that involve text
classification. This may include things like spam filtering in
emails and sentiment analysis.
2. This is an algorithm that can be used in many cases to help out
with any collaborative filtering algorithms. These are going to be
used when the programmer would like to make a
recommendation system that is based on machine learning.
3. Programmers will often find that Naïve Bayes is going to be
faster compared to some of the other algorithms out there while
giving comparable results. This makes it a lot easier to use this
algorithm in some applications that are done in real time.

There are many times when you are going to want to work with the Naïve Bayes
algorithm. It has a lot of power behind it, without taking as much time and
effort as some of the others. While it only works with certain classification
problems, and it requires you to do a little bit of work to check the
results, it is still one of the best supervised machine learning methods that
you are able to use.
Chapter 11: K-Nearest Neighbors for
Classification Problems
Another type of machine learning algorithm that you are going to enjoy
using, based on the data that you want to work with and what you want to
get out of the data, is the K-nearest neighbors, or the KNN, algorithm.
Many programmers like to work with this kind of method when they have
some data from the real world, and this data doesn’t seem to follow a trend.
While we all wish the data that we had would follow a nice trend, this is
just not something that tends to happen all that much. This machine
learning algorithm is going to help you to deal with this kind of data.

The main idea behind the KNN algorithm is simple. This algorithm works by
finding the distance between your new data points and the older data points
that you already added into it. Then the algorithm takes this a bit further
and ranks all of your data points in ascending order based on how far they
are from your testing point. The KNN algorithm takes all of this information,
chooses the top K nearest data points, and then assigns the new data point to
the class that has the most of those K data points.
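
Here is a minimal sketch of that idea using Scikit Learn's KNeighborsClassifier; the value of K and the variable names are placeholders, and it assumes a labeled train/test split like the earlier examples:

from sklearn.neighbors import KNeighborsClassifier

# K = 5: each new point is assigned the majority class of its
# five nearest training points.
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(train_features, train_labels)
predictions = knn_model.predict(test_features)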

You may get into this kind of algorithm and hear some programmers call it a
lazy algorithm rather than an eager one. What this means is that this
learning algorithm does not build a model from the training data ahead of
time to come up with its results. This algorithm does not really have a
training phase, and if there is one added in for extra accuracy, it is small.

There are benefits to this, of course. First off, it means that the training
phase with this set of data is going to be fast. And since there is a lack of
generalization that occurs, it means that the KNN algorithm is able to use
all of the data that you add in for training. During the testing phase, you
are going to use more of the data points that you have, which increases the
accuracy.

Another reason that the KNN algorithm is going to be so important for sorting
out your data is that it bases itself on the idea of similarity. This means
that it is focused on how close the features of a new sample are to those of
the training data. And the similarity that shows up with the data is going to
help you to classify it a bit better.

As you can guess with this one, the KNN algorithm is going to be used
when you have a problem that needs classification. Any object or data point
that you add into the mix is going to be classified using the majority vote of
all the data points that are near it. The end goal that comes with this one is
that all of the points should be classified and assigned to a point that is as
similar to it as possible. That data point needs to be similar to all of its
nearest neighbors as well, otherwise, this doesn’t work the way that you
want.

You may find as you work with the KNN algorithm that there are certain
times where you can bring it up with some regression problems. However,
this is not that common and many times it is going to be less accurate than
what you see when you use it with the classification problems. For the most
part, the KNN algorithm is going to be used in problems of classification.

How am I able to use the KNN algorithm?

As you can guess from the discussion we just had, there are a lot of
different ways that the KNN algorithm can come into play. Some of the
different applications for this algorithm are going to include:

Credit ratings: Many times the KNN algorithm will be used to
help with credit ratings. First, you will collect financial
characteristics and then will compare that information with
people who have similar financial features in the database. Just
by the nature that comes with a credit rating, those who have
similar details financially would end up getting the same kind of
credit rating. This means that you could take a database that is
already in existence and use it to help predict the credit rating of
a new customer, without having to go through and perform all
the calculations again.
Many times when a bank is considering giving out a loan to
someone, they may use a KNN algorithm. They may want to ask:
should they give this individual a loan? Is it likely that the
individual is going to default on the loan? Does the person share
characteristics with others who have defaulted on their loans, or
are they closer to those who haven't? The KNN algorithm can help
you compare the information you have against information that is
in the database and then answer these questions.
The KNN algorithm can even be used in political science. You
can take a potential voter and class them as either “will not vote”
or “will vote” based on their characteristics and how they stack
up against others who have or haven’t voted in the past. It is
even possible to look at the information you have about the
person and about others in your database to make a prediction on
which party they will vote for.
Before we move on to some of the other things that you can do with Python
machine learning, we need to take a look at some of the benefits and some of
the drawbacks of working with this kind of algorithm. To start are the
benefits. You will find that there are many different benefits that come with
using this kind of algorithm in your data science. Some of the best benefits
will include:

1. When you use the KNN machine learning algorithm, there is not
going to be an assumption about the data. What this means is
that this kind of algorithm is going to work well when you bring
in nonlinear data to look over.
2. You will find that the KNN algorithm is going to be simple to
use, while also helping you to explain how things work to others,
especially when those others don’t have any knowledge of
machine learning.
3. The KNN algorithm is going to be quite accurate. There are other
supervised machine learning models that you can work with, but
for the level of effort this one takes, you will find that its
accuracy will work for your needs.
4. You can bring out the KNN algorithm to help out with a wide
variety of problems that you need to handle. This is a good
algorithm to use when you have a regression problem and with
classification problems. You may need to change things up a bit,
but you can make it work for both kinds of problems.

With those in mind, it is also important to take a look at some of the
drawbacks that come with the KNN algorithm. It doesn't matter how great an
algorithm appears, there are always going to be a few drawbacks that come
with it, and the KNN algorithm is no different. Some of the reasons why this
algorithm may not be the best for every machine learning problem that you are
trying to work with will include:

1. When it comes to computation, the KNN algorithm is going to get
pretty expensive. The reason for this is that the KNN algorithm
functions by storing all of your training data.
2. To make this kind of algorithm work, you have to make sure that
your computer or system has a lot of memory to store all the
information. Depending on the kind of system that you are
working with, this can sometimes be hard to handle.
3. Any training data that you add to the algorithm is going to be
stored. This is going to take up a lot of space on your computer.
If you don’t have enough space on the computer, then this
algorithm is not going to work all that well.
4. When it comes to the stage for making predictions in this
process, you may notice that the KNN algorithm is not going to
be as fast as some of the other learning algorithms that you are
able to pick.
5. If you add in any features that are irrelevant to the mixture, then
the KNN algorithm is going to be really sensitive to this. It is
also sensitive when you try to scale the data. This is going to
make things hard if you have data that needs to be updated on a
regular basis.

There are a lot of things to love about the KNN algorithm, but there are also
times when it is not going to be the best option for you. It is worth
learning how this algorithm works and when you are able to use it to get the
best results possible. There are times when this is going to work well, and
times when you need to avoid it, but it is definitely a machine learning
algorithm that has a time and place, and it is one that you should take the
time to learn more about.
Chapter 12: Data Wrangling Visualization and
Aggregation
We are now going to take some time to look at the idea of data wrangling.
This is going to be the process where we are going to transform and then
map data from one raw form into another format. The reason that we are
going to want to do this is to make the data more appropriate and valuable
for a lot of other purposes. Often the purpose is going to be to analyze the
data that we have on hand.

There are a lot of ways that you are able to do this. You may want to do data
visualization or data aggregation, spend the time training your own
statistical model, or pursue other options as well. Data wrangling is a
process that follows a set of general steps, usually beginning with
extracting the data in its raw form from the data source. Then you can
wrangle the raw data with a variety of algorithms, such as sorting, or you
can parse the data into some predefined structures. The process ends with
depositing the resulting content into storage to use later on.
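
As a very small example of that general flow, here is a minimal sketch with Pandas that extracts a raw file, wrangles it with a simple sort, and deposits the result for later use; the file names and the column name are placeholders:

import pandas as pd

# Extract the raw data from its source file.
raw_data = pd.read_csv('raw_data.csv')

# Wrangle it with a simple algorithm, in this case sorting by one column.
sorted_data = raw_data.sort_values(by='value')

# Deposit the result into storage for later analysis.
sorted_data.to_csv('wrangled_data.csv', index=False)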

This kind of data transformation is going to be applied to some entities that
are distinct, like data values, columns, rows, and fields, within the same
set of data. And some of the steps that you are going to use to make this
happen could include parsing, joining, extraction, cleansing, consolidating,
augmenting, and even filtering. The steps that you use are going to depend on
what outputs are needed later on.

The recipients who get this information are often going to be a data scientist
or data architect who will take the time to look through the data a bit
further. These are often hired by business owners or users who will read
through the reports made. They often want this spelled out for them so that
they can use data they have to make informed decisions for the future of
their own business.

Depending on how much data is coming in and how it needs to be formatted,
this process was traditionally performed in a manual manner. It could be done
through a spreadsheet, for example. Or sometimes it is done with a
hand-written script with the help of the R, SQL, or Python languages.
Statistical data analysis and data mining are often used to make this a bit
faster as well.

Now that we have a little look at how this works and what it means, it is
time to take a closer look at some of the different things that you are able to
do with your data, and how you are able to use all of this to get the best
results out of any data you add into a machine learning algorithm.

Data Cleansing

The first thing that we are going to take a look at is the idea of data
cleansing. This is going to be a process that a data scientist is able to go
through in order to detect and either fix or deletes any records that are
corrupt or inaccurate from their set of data. Inaccurate, incorrect, and even
irrelevant parts of data can mess with the results that you get, so getting
them under control and removing them or fixing them can make a big
difference. Often this data cleansing is going to be performed interactively
with some of the tools that are found with data wrangling.
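
Here is a minimal sketch of what a simple cleansing pass can look like with Pandas; the file name and column name are placeholders, and the choices of dropping duplicates and filling missing values are only examples of the kinds of fixes you might apply:

import pandas as pd

data = pd.read_csv('survey_data.csv')

# Remove exact duplicate records.
data = data.drop_duplicates()

# Drop rows where a required field is missing.
data = data.dropna(subset=['age'])

# Fill remaining missing numeric values with the column median.
data = data.fillna(data.median(numeric_only=True))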

After cleaning it up, the set of data should end up being consistent with the
other sets of data that you have in the system. The inconsistencies that are
detected or removed may have been caused by user errors, corruption in the
storage that you used, or different definitions in the data dictionary. Data
cleansing is a bit different from data validation though, so keep these
separate.

For data cleansing to work, there need to be a few parts that come together
to help us figure out how the data works and whether it is acting the way
that it should. Data that is considered high-quality needs to be able to pass
a few criteria first. Some of these include:

1. Validity: This is the degree to which the data conforms to the
pre-defined rules or constraints of the business. Some examples of
how this can work will include the following:
a. Data type constraints. This is when the values that fall
in a column need to come from a certain type of data,
such as a date or a numerical value.
b. Range constraints: This means that the dates or the
numbers in the column need to fall inside a certain
range. There will be a minimum or maximum value
that is allowed.
c. Mandatory constraints. This one is going to focus on
the idea that none of the columns are allowed to be
empty in the set of data.
d. Unique constraints: This is where a field or even a
combination of fields, needs to be unique across a set
of data. You may put this in place to make sure that no
two people on the form put in the same social security
number.
e. Set membership constraints: This one tells the program
that the answer needs to come from a discrete set of
values or codes, such as a gender field limited to a
fixed list.
f. Regular expression patterns: There are times when a
field needs to be validated against a text pattern. For
example, you may require a phone number to match a
certain format.
2. Accuracy: Accuracy is hard to achieve through data cleansing alone
because it requires access to an external data source that
contains the true values. Accuracy can be reached in some
cleansing contexts, notably with customer contact data, when an
external database can be used to match up geographical locations
and zip codes, for example.
3. Completeness: The degree to which all required measures are
known. Of course, this one is really hard to work with when you
are doing a data cleansing method.
4. Consistency: This is the degree to which a set of measures is
equivalent across systems. An inconsistency occurs when two items
in your data set contradict each other.
5. Uniformity: This is the degree to which a set of data measures is
specified using the same units of measure in all systems. For
example, if you pool together data from different locations, it is
possible that the measurements are recorded in different units.

There are a few issues that can come up when it is time to work with data
cleansing. The first issue is correcting errors while losing information in
the process. This is one of the biggest issues, especially when you want to
correct values by removing entries that are not valid and any duplicates that
come up. In many situations, the available information about these entries is
going to be limited, and it will not be enough to help you make the right
corrections or changes.

What ends up happening is that the primary solution to this issue is to
delete these entries. The deletion of data is not always a good thing,
though, because it means that you are going to lose information, information
that could be important to what you are trying to do, and it can skew the
results and the predictions that you end up making.

The next issue that you need to take a look at here is maintaining the
cleansed data. Cleaning your data can be expensive and it is going to take up
a lot of time. So, after you have gone through and cleaned the data, and you
have double-checked that the data is free of errors, you want to try and
avoid having to do it all over again. You need to make sure that you only
repeat this when you know the values have changed. Overall, this means that
the lineage of the cleansing has to be kept, which requires a lot of data
collection and management techniques. Some companies and businesses just
don't have the time or resources to do this.

The data cleansing framework itself can also cause some issues. There are a
lot of situations where it is not possible to derive a complete data-cleansing
graph in advance to guide how the process should behave. This makes the
cleansing process iterative, involving a lot of exploration and interaction.
It requires a framework in the form of a collection of methods for error
detection and elimination, along with a lot of auditing of your information.
This takes a lot of time and energy to make happen.

Data editing
The next thing that we need to take a look at when it comes to the data that
we can use is the idea of data editing. This is the process of reviewing, and
then adjusting where necessary, collected survey data. The purpose of doing
this is to control the quality of the data that you decide to collect. You
are able to edit the data in a manual manner, you can use Python to make this
happen, or sometimes it helps to do a combination of both of these to see the
best results.

There are actually a few different methods of editing that you are able to
work with. The first one is interactive editing. This term covers most of the
modern methods that are manual but computer-assisted. Most of the tools that
are used will allow you to check the specified edits both during and after
data entry. And you can go through and correct any of the data right away if
you find it necessary. There are a few approaches that you are able to use
here in order to correct some of the data that is wrong, and these include:

1. Recontact the person who put in the information.


2. Compare the data from that user to the data they put in the
previous year.
3. Compare the data of the user to the data that other similar
respondents have put in at some point in the past.
4. Use the subject matter knowledge of the human editor to fill in
the information based on their own knowledge along the way.

You will find that when it comes to editing, this interactive method is going
to be the standard to use. It is used for both continuous and categorical data
if you have them. Many editors like to use this because it has the ability to
reduce the time frame that is needed to complete the cyclical process of
adjusting after the review.

The next kind of editing that you are able to use is going to be selective
editing. This is going to just be an umbrella term for a few methods you can
use to identify any outliers and influential errors that come up. This
technique is going to aim to work with interactive editing on a chosen
subset of the records. This is going to help you to get more information
when you are short on time and can help you get so much more out of the
editing process.
Now there are going to be two streams that you can work with here. There is
the critical stream, which contains the records that are most likely to
contain major errors. These are the ones that need to be edited using the
more traditional interactive manner we talked about before. Then there is the
non-critical stream. These records may still have errors, but those errors
are less likely to influence anything.

There are two methods of macro editing that you are able to work with. The
aggregation method is followed by most statistical agencies before
publication. You take some time to verify whether the figures that are going
to be published even seem possible. This is done by comparing quantities in
the publication tables with the same quantities in earlier publications. If
an unusual value is seen, a micro-editing procedure is applied to the
individual records and the fields that seem to contribute to the suspicious
quantity.

Then there is the distribution method. This one uses the distribution of the
variables to help. It compares all of the values that are there with that
distribution. Records that contain values that look uncommon based on the
distribution are then checked out a bit more, and edited too if needed.

And finally, there is going to be a process that is known as automatic
editing. This is editing that is done entirely on a computer, without any
intervention by a human. For it to work, prior knowledge about the values of
a single variable, or a combination of variables, has to be formulated as a
set of edit rules which specify or constrain the admissible values.

Data scraping

In most cases, the transfer of data between one program to another is going
to be done with data structures that are suited for automated processing.
This is going to be done by a computer rather than by a person. This
interchange in formats and protocols is going to be structured in a rigid
manner, which makes it easier for us to know that it is going to work out
well.
This means that the key element that will make data scraping a bit different
from regular parsing is that the output that is being scraped is going to be
intended for display to an end user, rather than as an input to be used by
another program. This means that this information is not going to be
documented, and it is not going to be structured for convenient parsing.

When we take a look at data scraping, we are going to have to ignore binary
data, display formatting, superfluous commentary, redundant labels, and any
other information that is seen as irrelevant. And in some cases, the
information is going to be taken out if it is seen as hindering the process
of automation.

Data scraping is a process that is going to be done either to interface with
a legacy system or to interface with a third-party system that does not have
a convenient API. In the latter scenario, the operator of the system is going
to see screen scraping as unwanted, mostly because it causes a big increase
in the load on the system. It also cuts out the revenue that can be gained
from advertising, and there is a lot of loss in the amount of control that
you have over the content that comes in.

Data scraping is something that is seen as an ad-hoc, inelegant technique.
What this means is that it is seen as a last resort, usually only brought out
when no other mechanism to interchange the data is usable. Aside from the
overhead for processing and programming being so high, output displays
intended for humans to look at are going to change often. Humans are able to
cope with this and figure it out, but a program may end up reporting what
looks like nonsense, and it won't be able to check the validity of what it
sees.

There are a few technical variants that you are able to use here. The first one
is screen scraping. This is a technique that captures data from old terminal
screens and tries to change it over to something a bit more modern. Some of the
more modern forms of screen scraping involve a program capturing the bitmap
data found on the screen and running it through an OCR engine, or, with GUI
applications, querying the graphical controls so that a sequence of screens can
be captured automatically and converted into a database that you are able to
work with.

Then we can move on to web scraping. When we look at a web page, we will notice
that it is written in a text-based markup language such as HTML, which means it
carries a lot of data in text form. However, since humans are the ones who go
through these web pages and use them, the pages are not created for automated
use. This is why a few tool kits have been designed to help with scraping web
content as well.

Several companies have come up with systems for web scraping that try to copy
the human processing that occurs when someone views a webpage. The hope here is
that these systems will be able to automatically extract the useful information
out of the page, making it easier for others to use.
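
As a rough illustration, here is a minimal web scraping sketch using the requests and BeautifulSoup libraries; the URL is a placeholder, and you should only scrape pages you are allowed to access this way.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute a page you are permitted to scrape.
url = "https://example.com"

# Fetch the HTML that was written for human readers...
response = requests.get(url, timeout=10)
response.raise_for_status()

# ...and parse it so a program can pull out the useful pieces.
soup = BeautifulSoup(response.text, "html.parser")

# For example, collect the text and target of every link on the page.
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))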

In addition, large websites often use defensive algorithms in the hope of
protecting their data from web scrapers. They do this by placing a limit on how
many requests an IP address or an IP network is able to send. This is an
ongoing battle: people who scrape want to pull out the most valuable content,
while website owners don’t want bots and bad actors getting onto the website
and causing issues.

And finally, there is the idea of report mining. This is the extraction of data
from human-readable computer reports. Conventional data extraction needs a
connection to a working source system, a suitable connectivity standard or API,
and often some complex querying. By using the standard reporting options of
that system, and making sure that the output is directed to a spool file rather
than a printer, it is possible to generate static reports in a form you can
work with.

The reason that this method is used is that it helps avoid heavy CPU usage
during regular business hours, supports quick and even customized reports, and
can minimize license costs for ERP customers. While web scraping and data
scraping involve interacting with dynamic output, report mining is the process
of extracting data from files in formats that people can read. You can generate
these reports from almost any system you find, and all you need to do is
intercept the data feed on its way to the printer.
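
As a small sketch of the idea, the example below parses a made-up fixed-width report, the kind you might capture in a spool file, back into structured data with pandas; the report layout and column names are invented for illustration.

import io
import pandas as pd

# Hypothetical spool-file output: a report formatted for printing,
# with fixed-width columns rather than a machine-readable structure.
report = """\
ORDER  CUSTOMER      AMOUNT
10001  ACME CORP      150.00
10002  GLOBEX INC      89.50
"""

# read_fwf infers the fixed-width columns, turning the printable report
# back into a table that downstream code can work with.
data = pd.read_fwf(io.StringIO(report))
print(data)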

As you can see, there are a lot of different things that you are able to do
with your system when it comes to using the data and visualizing it.
Working with some of the techniques that we talked about in this chapter,
along with a few others, you will be able to really see the difference in how
well your system is going to work and what kind of data you are able to pull
up and use for your needs.
Chapter 13: Django and Web Application
As we discussed a bit earlier on in this guidebook, machine learning is slowly
making its way into many different sectors, fields, and businesses. In fact,
pretty much all business segments have started to use machine learning at some
point or another. Machine learning has taken on a role in our daily lives, and
it is slowly but surely starting to represent something pretty well-known and
concrete in the minds of the public.

However, just because it is fairly recent and is just starting to grow, don’t
be fooled into thinking that it is a cutting-edge option only for businesses
that have a ton of money to keep up with the latest and greatest in technology.
In fact, machine learning and some of the techniques that come with it have
been around for some time, and you will find that ignoring machine learning can
be a bad thing for your business.

Think of it this way: do you see it as a waste of your employees’ time to work
on a task that could be automated and done for them? Is it a waste of your
valuable data when it just sits in a database, waiting for someone to have the
time and resources to get to it? Or do you feel that it is a waste of the final
product when it is delivered without reaching the full capacity of what it
should be able to do? If you answered yes to any, but most likely all, of these
questions, then machine learning is something you need to spend some time on.

But here comes the next question: how are you able to implement machine
learning in your business, whether as a tool for your engineers or as a service
for your clients? How do you streamline it without having experts in the field
handle it all of the time? That is what we are going to explore in this
chapter!

Working with an API in machine learning

One of the practical methods for doing all of the things we discussed above
with machine learning is to bring out an API. Offering a machine learning
solution with the help of an API allows a person to focus more on the results,
while still ensuring that the developers you have working on machine learning
keep full control over maintaining the model that comes with it.

Using an API has proven itself to be an efficient way of delivering machine
learning. This is true even for companies that have large AI divisions and
apply it in an extensive manner, such as what we see with Google and Amazon.

Now we need to go back to the idea of some of the advances that come with
machine learning. Another factor we need to consider in why this method has
been adopted so well is the exceptional libraries that come with Python. Python
is one of the best languages to use with machine learning because of all that
you are able to do with it, and you will find the libraries are going to be
even better. In particular, libraries like TensorFlow, pandas, and scikit-learn
make it so much easier for your developers to come up with high-quality machine
learning solutions.

You can easily see why the Python language is the most used programming
language in data science and machine learning. Nothing else even comes close!
The language is not only simple enough for a beginner to learn and use, but it
also has the power and functionality to handle some of the more complex tasks
that you are going to do in machine learning.

One thing that we need to take a look at here is the idea of Django from
Python. This is a web framework that is open source and available with Python
(it has been fully written in Python, so you know you will be able to handle it
with some of the things we have discussed in this guidebook). Django is really
easy to get started with, it has the stability that you need, and it already
integrates with the libraries that come with Python, so you won’t have to worry
about that.

The Django web framework was first developed because several developers of a
newspaper’s online site were tired of the type of framework they were using at
the time. They wanted to work with Python and use it to build up the backend
that the portal needed. Out of this need, Django was developed.
Since it was first developed, Django has seen a lot of success, mainly from its
ease of use. You will find that over time, it has grown to become one of the
most widely used web frameworks for businesses and websites. In fact, some
major websites like Disqus and Instagram are already using this framework to
keep themselves up and running.

Plus, this framework is going to be able to help you out with a lot of the
major issues and questions that you run into when it is time to implement
machine learning and the Python language online. Adding the Django REST
Framework and a web server of your choice (Gunicorn is a great option if you
are looking for one), it is possible for you to get a Python-based API up and
running in no time, allowing you to deliver the machine learning solution that
you need.

More about Django

Let’s take a closer look at some of the things you should know about Django and
how to use it. Django is a framework that was written in Python. A framework is
just a collection of modules that are brought together in a manner that makes
development easier for those who are using it. These modules are bundled
together, and this allows you to create a website or another application from
an existing base, rather than having to start from nothing each time.

This is how many websites, even those developed at home by one person, are able
to include more advanced features and functionality, including contact forms,
file upload support, management and admin panels, comment boxes, and more.
Whether you want to create a web application for yourself or for a business,
you will find that using a framework like Django can give you a high-quality
product without you having to be an expert programmer on your own. By using the
Django framework, the components that you need are in there already; you just
have to take the time to configure them in the proper manner to match your
site.

According to its official site, Django is basically a high-level Python web
framework that encourages the rapid development of a project, along with a
design that is pragmatic and clean. This framework, even though it is open
source, was built up by developers experienced in Python, and it takes care of
a lot of the hassle that comes with web development. This allows a lot of
people to work with this framework, even if they have little web development
experience to go from.

When we use a framework like Django or some of the other ones out there, it
makes life easier. The framework takes care of the web development part so that
you are able to focus on writing your app, without having to reinvent the wheel
and start from the beginning each time. Plus, Django is open source and free,
making it a great choice both for those who want to create their own website
and for those who need to do it for a company or business.

Django is going to be a great option to use because it offers us a big
collection of modules that you are able to use with any of the web-based
projects you want to focus on. Mainly though, the Django framework was designed
to save developers wasted time and to make life easier.

Another thing that you may find interesting when it comes to Django is that it
was created with the idea that front-end developers would be the ones most
likely to use it. The template language is designed to feel comfortable and
easy to learn for anyone who works with HTML, including front-end developers
and designers. But it is also easily extensible and flexible, allowing
developers to augment the template language as needed.

If you do plan to do some work with machine learning and Python, especially if
you are trying to do this with web design and web applications, it is a good
idea to bring in the Django framework to help you out. There are a lot of
different ways that you can use it, and it is going to make your life easier.

How to get started with Django

Now that we understand a bit more about Django and what it is all about, it is
time to move on to a few of the things that you are able to do with this
framework. Django adheres to what is known as the MVT architectural pattern, or
model view template. After you have taken the time to install it and you have
all of the necessary information and files in place for it, you will need to
start a project. The command that is needed to do this is:

 
django-admin startproject mysite

When you run this command, it creates most of the configuration and structure
that you need in that folder to get the project up and running. Once you have
given that command some time to finish, it is time to move into the project
folder and create the app. Any time that you want to do this, work with the
following command:

python manage.py startapp myapp

At this point, you have created the app that is going to run the machine
learning API that you need. Next, take some time to edit the app’s models.py,
that is, if you plan to use a database to go along with this project. And don’t
forget to take some time to create what are known as the API views to make this
all work together well.

For the most part, you will find that your machine learning model is going to
be accessed from these API views when you need it. This means you just need to
add the code to the app’s views.py file. One of the methods that you can use to
make the integration of the model and the server happen is that, when you are
done building and then validating the model to make sure it works, you save the
trained model binaries with pickle. From there, you can add all of this code
into a package and then import it into the view.

In addition to this, there are times when you may find the model is too large,
or too slow to load for other reasons. If this is true of your model, then it
is a good idea to load it as a global variable in this file. This ensures that
the file is only going to load one time. Otherwise, it would load every time
you call up the view; when the file is large, this slows things down and can
even cause things to get stuck. But with the steps that we just described, we
are basically asking it to load up just once, whenever you start up the server,
rather than each time you call the view.
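
Here is a minimal sketch of what such a views.py could look like with the Django REST Framework; the file path, app name, and request fields are hypothetical, and the model is assumed to be a scikit-learn style object that was saved with pickle.

# myapp/views.py -- a minimal sketch; names and paths are placeholders.
import pickle

from rest_framework.response import Response
from rest_framework.views import APIView

# Loaded once at module level, when the server starts,
# rather than on every call to the view.
with open("myapp/model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictView(APIView):
    def post(self, request):
        # Expect a JSON body like {"features": [1.0, 2.0, 3.0]}.
        features = request.data["features"]
        prediction = model.predict([features])
        return Response({"prediction": float(prediction[0])})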

As you can see here, there are a lot of great features that you are able to use
when it comes to Django. It is a fast and reliable way for you to take some of
your machine learning models and get them to work for you. And it works in
Python, which can make things even easier for you to work with. Taking the time
to get it downloaded and making sure you set it up with the right Python
libraries will ensure that it is ready to handle any of the web-based models
that you would like to build for your project.
Chapter 14: A Look at the Neural Networks
The final thing that we are going to take a look at in this guidebook is the
idea of neural networks and how they work with Python and with machine
learning. Neural networks are one of the frameworks that you can use in machine
learning, one that tries to mimic the way that the natural biological neural
networks in humans operate.

The human mind is pretty amazing. It is able to look at a lot of different
things around us and identify patterns with a very high degree of accuracy. Any
time that you go on a drive and see a cow, for example, you will recognize it
as a cow. The same applies to any animal or other thing that you are going to
see when you are out on the road. The reason that we are able to do this is
that we have learned over a period of time how these items look, and what
differentiates one item from another.

The goal of an artificial neural network is to mimic, as much as possible, the
way that the human brain is able to work. It takes a complex architecture to
make this happen. But when it is in place, it allows the system to do a lot of
amazing things that it would not be able to do in other situations. Let’s take
a look at how these neural networks work, and why they are so useful for our
work with machine learning.

A biological neuron is made up of the body of a cell, with a few extensions
that come from it. The majority of these extensions are in the form of branches
known as dendrites. There is also one long, branching process, and this part is
known as the axon. The transmission of signals begins at a region of this axon
known as the axon hillock.

The neuron has a boundary that we call the cell membrane. A potential
difference exists between the outside and the inside of this cell membrane, and
that difference is called the membrane potential. If the input is big enough,
an action potential is generated. This action potential then travels all the
way down the axon, heading away from the body of the cell.
Each neuron is connected up with other neurons, on down the line, with the help
of synapses. The information heads out of the neuron through the axon and is
then passed across the synapses to the neuron that receives the message. Note
that a neuron only fires once its input has gone above a specified threshold.
The signals in this process are important because they are received by the
other neurons in the chain; the neurons use these signals to communicate with
one another.

When it comes to the synapses, they can either be excitatory or inhibitory.
When a spike or a signal arrives at one of the excitatory synapses, it pushes
the receiving neuron toward firing. If the signal is inhibitory, it pushes the
neuron away from firing.

The synapses and the cell body work together to calculate the difference
between the excitatory inputs and the inhibitory inputs. If that difference is
big enough, the neuron fires and the message is passed on down the line.
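
An artificial neuron captures this idea with weighted inputs and a threshold. Here is a minimal sketch, assuming a simple step activation; the weights and threshold are made up for illustration.

# A minimal artificial neuron with a step (threshold) activation.
def neuron(inputs, weights, threshold):
    # Excitatory connections get positive weights, inhibitory ones negative;
    # the cell body sums up the weighted signals it receives.
    total = sum(x * w for x, w in zip(inputs, weights))
    # The neuron only fires when the combined input exceeds the threshold.
    return 1 if total > threshold else 0

# Two excitatory inputs and one inhibitory input.
print(neuron([1, 1, 1], [0.6, 0.7, -0.4], threshold=0.5))  # -> 1 (fires)
print(neuron([1, 0, 1], [0.6, 0.7, -0.4], threshold=0.5))  # -> 0 (stays quiet)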

These types of networks are used a lot because they are great at learning and
analyzing patterns by looking at them in several different layers. Each layer
that the data goes through spends its time checking whether there is a pattern
inside the image. If the neural network does find a pattern, it activates the
process that lets the next layer start. This process continues on and on until
all the layers in that algorithm have run and the program is able to predict
what is in the image.

Now, there are several things that can happen from this point. If the algorithm
went through all the layers and was able to use that information to make an
accurate prediction, the connections between the neurons involved become
stronger. This results in a good association between the patterns and the
object, and the system will be more efficient at doing this the next time you
use the program.

This may seem a bit complicated, so let’s take a look at how these neural
networks work together. Let’s say that you are trying to create a program that
can take a picture as input and then recognize that there is a car in that
picture. It will be able to do this based on the features of the car, including
its color, the number on the license plate, and even more.

When you are working with some of the conventional coding methods that are
available, this process can be really difficult to do. You will find that the
neural network approach makes this a much easier system to work with.

For the algorithm to work, you would need to provide the system with an image
of the car. The neural network would then look over the picture. It would start
with the first layer, which would pick out the outside edges of the car. Then
it would go through a number of other layers that help the neural network
understand whether there are any unique characteristics present in the picture
that indicate it is a car. If the program is good at doing the job, it will get
better at finding some of the smallest details of the car, including things
like its windows and even wheel patterns.

Now, when you are working on this, you may notice that there could potentially
be a ton of layers involved. The more details and the more layers you decide to
use, the more accurate the prediction of the neural network is going to be.
When there are more layers, there are more chances for the algorithm to learn
along the way. From then on, the algorithm gets better at making predictions
and is able to make them with more accuracy, and faster.

This is a good algorithm to use when you want to have a program recognize
pictures, or even with some of the facial recognition software that is out
there. With these, there is no possible way that you can input all of the
information that could be needed by hand. So working with neural networks,
where the program is able to learn things along the way, can really make the
difference in whether the program is going to work or not.

Backpropagation

For you to train a neural network to do a certain task, the weights of each
unit need to be adjusted. This ensures that there is a reduction in the amount
of error between the target output and the actual output. What this means is
that the derivatives of the error with respect to the weights need to be
computed by the network. In other words, the network has to be able to monitor
how the error changes as each weight is decreased or increased. The
backpropagation algorithm is the one that you will use to help you figure this
part out.

If the network units you are working with are linear, then this is an easy
enough algorithm to understand. With linear units, the algorithm can get the
error derivative of the weights by determining the rate at which the error
changes as the unit’s activity level is changed.

When we look at the output units, the derivative of the error is obtained by
figuring out the difference between the target output and the actual output. To
find the rate of change of the error for a hidden unit in a layer, all the
weights between that hidden unit and the output units it is connected to have
to be considered first.

To help us with this part, we then multiply those weights by the error
derivatives of the units they connect to, and then add the products together.
The answer that you get here is equal to the rate of change of the error for
your hidden unit.

Once you have this error change rate, you can then go through and calculate the
error change rate for all of the other layers that you want to work with. The
calculation is done from one layer to the next, in the direction opposite to
the one the signals travel through the network.
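
To make the idea concrete, here is a minimal backpropagation sketch with NumPy for a tiny network with one hidden layer, trained on the classic XOR example; the layer sizes, learning rate, and number of iterations are arbitrary choices for illustration, not a recipe.

import numpy as np

np.random.seed(0)

# Toy data: the XOR problem, which needs at least one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and biases for one hidden layer of four units.
W1, b1 = np.random.randn(2, 4), np.zeros(4)
W2, b2 = np.random.randn(4, 1), np.zeros(1)
lr = 0.5

for _ in range(10000):
    # Forward pass: signals travel from the inputs toward the output.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: the error derivative is computed at the output first,
    # then passed back through the weights to get each hidden unit's share.
    output_delta = (output - y) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)

    # Adjust every weight in the direction that reduces the error.
    W2 -= lr * hidden.T @ output_delta
    b2 -= lr * output_delta.sum(axis=0)
    W1 -= lr * X.T @ hidden_delta
    b1 -= lr * hidden_delta.sum(axis=0)

print(np.round(output, 2))  # should end up close to [0, 1, 1, 0]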
Conclusion
Thank you for making it through to the end of Python Machine Learning. Let’s
hope it was informative and able to provide you with all of the tools you need
to achieve your goals, whatever they may be.

The next step is to get started with some of the coding and techniques that we
have spent time talking about in this guidebook. There are so many great things
that you are able to do when it comes to Python machine learning, and this
industry and field is just starting to be explored. Being able to look through
some of the different topics that we brought up, and learning how to do some of
the work, can make a difference in how well you will be able to use this field
for your own purposes.

This guidebook took some time to look at a lot of the different aspects of
machine learning that you are able to handle with the help of the Python
language. Whether you want to work with supervised learning, unsupervised
learning, or even reinforcement learning, you will find that this guidebook has
options that will work for you. We broke apart a lot of different problems and
then used the Python coding language to help us get the work done the right
way.

When you are ready to learn more about Python machine learning and how
it is going to work well for you, make sure to check out this guidebook and
get started today!

Finally, if you found this book useful in any way, a review on Amazon is
always appreciated!
 
